Interrupt and data synchronization
It’s the application logic’s role to make sure that interrupts are synchronized with the data packets, and to handle error conditions.
On page 84 of Xilinx’ pg023, under “Transmit Transaction Ordering”, it says:
“The client logic requires ordering to be maintained between a request TLP and MSI/MSI-X TLP signaled through the MSI Message interface. In this case, the client logic must wait for the sequence number of the requester request to appear on the pcie_rq_seq_num[3:0] output before signaling MSI or MSI-X on the MSI Message interface.”
In a typical application, where the interrupt tells the host that a buffer has been filled with DMA, this means that if a bit in cfg_interrupt_msi_int is asserted to issue an interrupt as soon as the last data packet has been pushed into the RQ AXI stream, the interrupt may arrive at the processor before the data packet.
Another issue, also discussed on page 84, is that even if a completion to a request is sent through the CC AXI stream after a data packet is sent on the RQ AXI stream, the two may arrive at the processor in reverse order.
This may cause a problem when DMA writes to a buffer are synchronized with a BAR-mapped status register. If the processor checks through this status register whether the DMA buffer is ready for reading, an inconsistency may result: the completion containing the go-ahead to read from the DMA buffer may arrive before the last data packet written to that buffer.
The solution is mentioned in the citation above: each packet on the RQ interface can be assigned a 4-bit sequence number, which the application logic presents on s_axis_rq_tuser[27:24] on the first beat of the packet. The PCIe Block later issues this number on pcie_rq_seq_num[3:0] for one clock cycle (with pcie_rq_seq_num_vld high on that cycle), indicating that the packet has passed the point where it could be reordered. Only then should an interrupt be transmitted and/or a relevant status register be updated.
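The bookkeeping this requires can be modeled behaviorally, as a sketch (Python rather than HDL, since the point is the logic, not the syntax). The class and method names are hypothetical. The sketch assumes that the Block reports sequence numbers in the order the packets were submitted, and it caps the number of outstanding packets at 16 so the 4-bit tags stay unambiguous; both are assumptions of this sketch, not statements from pg023.

```python
from collections import deque


class SeqNumTracker:
    """Behavioral model (not HDL) of logic that holds back an interrupt until
    the last RQ packet of a buffer has been reported on pcie_rq_seq_num.

    Assumptions: reports arrive in submission order, and at most 16 packets
    are outstanding, so 4-bit tags never collide.
    """

    def __init__(self):
        self.next_seq = 0
        self.outstanding = deque()  # (seq, closes_buffer), submission order

    def send_packet(self, closes_buffer):
        """Return the tag to drive on s_axis_rq_tuser[27:24] on the first beat."""
        if len(self.outstanding) >= 16:
            raise RuntimeError("too many outstanding packets for 4-bit tags")
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % 16
        self.outstanding.append((seq, closes_buffer))
        return seq

    def clock(self, seq_num_vld, seq_num):
        """Evaluate one clock cycle of pcie_rq_seq_num_vld / pcie_rq_seq_num.

        Returns True when the packet just reported was the last one of a
        buffer, i.e. when the interrupt (or status register update) may go.
        """
        if not seq_num_vld:
            return False
        expected, closes_buffer = self.outstanding.popleft()
        if seq_num != expected:
            raise RuntimeError(f"expected seq {expected}, got {seq_num}")
        return closes_buffer
```

For example, after sending three packets where only the third one completes a buffer, clock() returns True only on the cycle where tag 2 is reported.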
This may appear to be a flaw in the Gen3 Block’s design, but most designs need synchronization logic at the application level anyhow, because packets must be stored in RAM to maintain their continuous transmission, as mentioned above. The sequence number mechanism actually helps to accomplish end-to-end synchronization.
Interrupt handling
On top of the synchronization logic, which is necessary to prevent the interrupt from slipping in before the data it refers to, a small piece of logic is needed to maintain the state of the interrupt transmission.
The discussion here is narrowed down to MSI interrupts only.
The interrupt is triggered with cfg_interrupt_msi_int, which is a 32-bit vector. If no MSI vectoring is used, bits [31:1] should be kept zero, and only bit 0 is used to trigger an interrupt. It’s also advisable to keep cfg_interrupt_msi_select zero, so that the interrupt is issued on behalf of PF0.
Each time an interrupt has been requested with cfg_interrupt_msi_int (asserting one of the bits for one clock cycle), the application logic must wait for either cfg_interrupt_msi_sent or cfg_interrupt_msi_fail to be asserted by the Block (for one clock cycle). Only then is the application logic allowed to assert a new interrupt.
The odd part is cfg_interrupt_msi_fail: If asserted, the interrupt transmission failed, and the application logic should re-issue the interrupt request. But how could an MSI fail? It’s just a plain write to some address. It’s a posted TLP. What could possibly go wrong? Well, that doesn’t really matter. It can.
The Gen3 block also requires the application logic to supply a 64-bit vector of pending interrupts (cfg_interrupt_msi_pending_status) which is presented to the host as the MSI Pending Bits Register (32 bits for each PF).
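A cycle-level behavioral model of this handshake might look like the following sketch. It is hypothetical Python, not HDL, and the class name is made up. It assumes a single vector (bit 0) on PF0, refrains from issuing a new request on the same cycle that cfg_interrupt_msi_sent or cfg_interrupt_msi_fail arrives, and drives bit 0 of cfg_interrupt_msi_pending_status while an interrupt is requested or in flight. That last choice is an assumption about the pending-bit semantics, which should be checked against pg023.

```python
class MsiRequester:
    """Behavioral model (not HDL) of the MSI request handshake.

    Assumptions: only bit 0 of cfg_interrupt_msi_int is used (no vectoring,
    PF0 only), and pending status means "requested or in flight".
    """

    def __init__(self):
        self.requested = False  # an interrupt is owed to the host
        self.in_flight = False  # a pulse was issued, awaiting sent/fail

    def request(self):
        """Called by the rest of the application, e.g. after the seq-num gate."""
        self.requested = True

    def clock(self, msi_sent=False, msi_fail=False):
        """Evaluate one clock cycle.

        Returns (cfg_interrupt_msi_int, bit 0 of cfg_interrupt_msi_pending_status).
        """
        msi_int = 0
        if self.in_flight:
            if msi_sent:
                self.in_flight = False
            elif msi_fail:
                self.in_flight = False
                self.requested = True  # re-issue the failed interrupt
        elif self.requested:
            msi_int = 1  # pulse bit 0 for this single cycle
            self.requested = False
            self.in_flight = True
        pending = 1 if (self.requested or self.in_flight) else 0
        return msi_int, pending
```

In this sketch, requests that arrive while an MSI is in flight are coalesced into one follow-up MSI. That is a design choice which assumes the host's interrupt handler checks all buffers anyway.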
The endpoint’s bus / device / function ID
This is a difference that probably requires no work to handle.
Previous PCIe cores required the application logic to know the endpoint’s ID on the bus, in order to fill in the Requester ID field in request TLPs (in particular DMA read requests). This information was available as separate wires from the core (cfg_bus_number, cfg_device_number and cfg_function_number). These wires have vanished from the Gen3 Block. They should not be confused with cfg_ds_bus_number, cfg_ds_device_number and cfg_ds_function_number, which are inputs to the core and relate to downstream ports.
Those really curious to know the bus ID may use the interface for reading from the configuration space directly. But this isn’t necessary.
The point is that this information isn’t needed anymore, as the Gen3 block fills in the relevant fields in the TLPs automatically when the Requester / Completer ID Enable bit is cleared to zero in the packet’s descriptor. The application logic can still supply the ID manually if necessary, but in most cases the automatic mechanism is the easy way.
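For the cases where the ID is supplied manually, the 16-bit Requester / Completer ID packs the bus number into bits [15:8], the device number into bits [7:3] and the function number into bits [2:0], as defined by the PCIe specification. A small sketch with hypothetical helper names:

```python
def pack_id(bus, device, function):
    """Pack a PCIe Requester/Completer ID: bus[15:8], device[7:3], function[2:0]."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def unpack_id(rid):
    """Split a 16-bit ID back into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7
```

For instance, bus 1, device 0, function 0 packs to 0x0100.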
Message interface
Those who process message TLPs directly in the application logic will have to adapt it to the dedicated message interfaces for transmission and reception in the Gen3 Block. This isn’t relevant to most PCIe applications, which merely work with data related packets anyhow.
DWORD aligned vs. Address aligned mode
The Gen3 block supports a new alignment format.
- DWORD aligned mode = data aligned mode (AXISTEN_IF_xx_ALIGNMENT_MODE = “FALSE”) : The payload appears immediately after the 4 DW header. If the data is chopped into DWs in a stream, there is no gap. The old-school approach.
- Address aligned mode (AXISTEN_IF_xx_ALIGNMENT_MODE = “TRUE”) : The payload appears possibly with a gap, to avoid the need to shift the data, in an application where the data wires go directly to memory arrays with byte enables.
In address aligned mode, when creating a write or read request on the RQ interface, the addr_offset[2:0] field that accompanies the descriptor informs the Integrated Block how the data is aligned in the payload part, or how to align the data in the completion once it arrives, respectively.
In DWORD aligned mode, addr_offset[2:0] should be held zero.
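As an illustration, the following sketch computes a value for addr_offset[2:0] from a byte address. The helper name is hypothetical. It assumes the field holds the DW position of the first payload byte within the data path, i.e. address bits [4:2] on a 256-bit interface and correspondingly fewer bits on narrower ones; pg023 should be consulted for the exact per-width rules.

```python
def addr_offset(addr, aligned_mode, bus_width_bits=256):
    """Hypothetical helper returning a value for addr_offset[2:0].

    Assumption: in address aligned mode the field is the DW lane of the first
    payload byte on the data path (address bits [4:2] for 256 bits).
    In DWORD aligned mode it is held at zero, as the text says.
    """
    if not aligned_mode:
        return 0
    lanes = bus_width_bits // 32  # DWs per beat: 2, 4 or 8
    return (addr // 4) % lanes
```

With a 256-bit interface, a request to address 0x1008 would thus get addr_offset = 2 in address aligned mode, and 0 in DWORD aligned mode.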
Completion credit data
For reasons mentioned in a different tutorial, it's often necessary for the application logic to know how much buffer space the PCIe block allocates for arriving completion TLPs. Since an endpoint must announce infinite credits for completion headers and data, it's up to the application to make sure it doesn't issue read requests whose completions would overflow the PCIe block's buffers.
This information is given in the data sheet, but it depends on the PCIe block's configuration. In earlier versions it therefore made sense to obtain it by setting the fc_sel input to zero and reading out the allowed credit spending from the fc_cplh and fc_cpld outputs before any read request was issued.
The Gen3 Block has similar pins, only prefixed with cfg_*. Unfortunately, the cfg_fc_cplh and cfg_fc_cpld signals return zero for all choices of cfg_fc_sel that correspond to reception of packets. This is consistent with the announcement of infinite credits, but useless for this purpose. In pg023's table 2-16, the entry for cfg_fc_sel has a note stating that infinite credits are signaled as zeros.
Appendix B of the same guide lists the credits explicitly. It's safe to assume that 32 header credits and 496 data credits are available, but these values have to be hardcoded in the application logic.
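Since one data credit is 16 bytes (4 DWs), 496 data credits correspond to 7936 bytes of completion payload. The sketch below shows one way to spend the hardcoded credits: it estimates the worst-case cost of each read request and holds the request back if the remaining credits don't cover it. It assumes a Read Completion Boundary (RCB) of 64 bytes, with the completer splitting its completions at every RCB boundary, which is the most credit-hungry case. The names are hypothetical, and when exactly credits are released (here, after the application has consumed all completions of a request) is a design decision of the application.

```python
RCB_BYTES = 64      # assumption: 64-byte Read Completion Boundary
CPLH_CREDITS = 32   # header credits, hardcoded as stated in the text
CPLD_CREDITS = 496  # data credits; 1 credit = 16 bytes (4 DWs)


def completion_cost(addr, length):
    """Worst-case (header credits, data credits) for one read request,
    assuming one completion TLP per RCB-aligned chunk."""
    hdrs = data = 0
    start, end = addr, addr + length
    while start < end:
        chunk_end = min(end, (start // RCB_BYTES + 1) * RCB_BYTES)
        dws = -(-chunk_end // 4) - start // 4  # DWs touched by this chunk
        hdrs += 1
        data += -(-dws // 4)  # round up to whole 16-byte credits
        start = chunk_end
    return hdrs, data


class CplCreditPool:
    """Tracks completion buffer credits that the application may still spend."""

    def __init__(self, hdr=CPLH_CREDITS, data=CPLD_CREDITS):
        self.hdr = hdr
        self.data = data

    def try_issue(self, addr, length):
        """Reserve credits for a read; return the cost, or None to hold it back."""
        h, d = completion_cost(addr, length)
        if h > self.hdr or d > self.data:
            return None
        self.hdr -= h
        self.data -= d
        return h, d

    def release(self, cost):
        """Return credits once all completions of the request were consumed."""
        h, d = cost
        self.hdr += h
        self.data += d
```

Under these assumptions, the 32 header credits run out long before the data credits do: four aligned 512-byte reads already use all 32 headers (2048 bytes in flight) but only 128 of the 496 data credits.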
Comments and corrections are warmly welcomed in the Xillybus forum. Anonymous posting is possible.