AXI4-Stream Interface
The single pair of AXI-stream interfaces has been replaced with two pairs. This makes a lot of sense, in particular in the typical peripheral application: There is one piece of logic supporting the host’s reads and writes to the peripheral’s register space (“BAR registers”) and a completely different piece of logic transmitting data using DMA. In the register scenario, the application logic is the “slave” on the PCIe bus (let’s call it “Completer”) and in the data transmission scenario, it’s “master” (or “Requester”).
So why should the application logic mux packets between register handling and its data traffic? The Gen3 Block does that instead, separating the traffic into two sets of bidirectional AXI-Stream interfaces. So in a typical peripheral application, it looks like this:
For “register logic”:
- Completer Request (CQ): Packets from bus master to us, asking us to read or write
- Completer Completion (CC): Our packets completing requests issued by the bus master (most likely data for read requests).
For data transfer logic:
- Requester Request (RQ): Our packets as master on the bus (read and write requests for host’s memory)
- Requester Completion (RC): Completions arriving from the bus to our requests (data read from host’s memory)
Except for this split-in-two (and the fact that it’s not TLPs on the wires), the AXI-Stream interface is the same, in particular the breakdown of beats into individual DWs (little Endian style).
The _tuser signals have a different meaning, as these always change across different versions of PCIe blocks.
Also, the _tkeep signals (appearing as _tstrb in some versions, and possibly in the old user guide) have been narrowed down to one bit per DW, instead of the previous one bit per byte (which was a redundant representation, with four bits forced to be the same as they represent the same DW).
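For those migrating logic that produced the old one-bit-per-byte mask, the narrowing can be sketched as below. This is only a software model in Python, not anything taken from Xilinx' documentation: the function name and the choice to reject masks whose four bits differ are assumptions made for illustration.

```python
def tkeep_bytes_to_dwords(tkeep_bytes, num_dwords):
    """Collapse a one-bit-per-byte keep mask into a one-bit-per-DW mask.

    Hypothetical helper: bits [4*i+3:4*i] of the old mask describe DW i.
    """
    tkeep_dw = 0
    for dw in range(num_dwords):
        nibble = (tkeep_bytes >> (4 * dw)) & 0xF
        # In the old format, the four bits of a DW were forced to be equal
        if nibble not in (0x0, 0xF):
            raise ValueError("DW %d: keep bits are not all equal" % dw)
        if nibble == 0xF:
            tkeep_dw |= 1 << dw
    return tkeep_dw
```

For example, an old 32-bit mask of 0x00000FFF (three valid DWs out of eight) becomes 0b00000111.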
Little Endian format
The PCIe bus specification applies big Endian byte ordering to address and data that is packed into the TLPs. The Integrated Block for Gen3 hides this, and talks with the application logic in little Endian format all the way through. Simply put, Xilinx’ core swaps the byte ordering.
This puts the interface with the Gen3 Block in line with other IP cores offered by Xilinx, but those with existing TLP-minded application logic must unswap the byte ordering. The byte-enable signals remain the same.
For example, the first byte of any DW appears in bits [31:24] in the TLP, but on the AXI stream interface, it’s in bits [7:0] (assuming that the first DW occupies bits [31:0] of the wider data word).
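The swap itself is trivial, and a short Python model may help when checking simulation dumps against TLP-style expectations. The helper names below are made up for this sketch, and the 8-DW default merely reflects a 256-bit data word.

```python
def swap_dw(dw):
    """Reverse the byte order of one 32-bit DW (TLP order <-> AXI-Stream order)."""
    return int.from_bytes(dw.to_bytes(4, "big"), "little")


def swap_beat(beat, num_dwords=8):
    """Swap the bytes inside each DW of a wider data word.

    The DWs keep their positions: DW i stays at bits [32*i+31:32*i].
    """
    out = 0
    for i in range(num_dwords):
        dw = (beat >> (32 * i)) & 0xFFFFFFFF
        out |= swap_dw(dw) << (32 * i)
    return out
```

So a DW that reads 0x11223344 in the TLP (first byte 0x11 at bits [31:24]) shows up as 0x44332211 on the AXI-Stream interface, with 0x11 at bits [7:0].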
A no-TLP interface
In any interface to a vendor-supplied PCIe core known so far (Altera devices included), the application logic built and parsed the TLPs it exchanged with the core. No more. The Gen3 Block has its own format. It’s still header DWs first, data following, but the bit fields are set up completely differently. Leaving migration issues alone, the new descriptor format eliminates several pitfalls in designing the application logic, but those having something already working the old way have to reopen their design.
Many fields from the TLP appear in the Gen3-dedicated descriptor unchanged. Others are changed slightly, such as the Length field. The Byte Enables are conveyed through the _tuser signals of the AXI Stream interface, and not in the descriptor. For the lazy ones, the additional byte enables of the current transfer beat are generated by the Gen3 Block: There are 32 bits of byte enables for each beat of data (max 256 bits = 32 bytes) on m_axis_cq_tuser[39:8] and m_axis_rc_tuser[31:0]. Keep in mind that e.g. bit 0 of the byte enable corresponds to bits [7:0] of the data word (little Endian). The "classic" byte enables are given through the _tuser interface as well.
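To illustrate how the per-beat byte enables map onto the data word, here is a small Python model. It assumes only what is said above (byte enables on m_axis_cq_tuser[39:8], bit i covering data bits [8*i+7:8*i]); the function names are invented for the sketch.

```python
def cq_byte_en(tuser):
    """Extract the 32 per-byte enables of the current beat, m_axis_cq_tuser[39:8]."""
    return (tuser >> 8) & 0xFFFFFFFF


def enabled_bytes(tdata, byte_en, num_bytes=32):
    """Return {byte_index: byte_value} for every byte whose enable bit is set.

    Byte i occupies bits [8*i+7:8*i] of tdata (little Endian).
    """
    return {
        i: (tdata >> (8 * i)) & 0xFF
        for i in range(num_bytes)
        if (byte_en >> i) & 1
    }
```

With tdata = 0xAABBCCDD and byte enables 0b0101, bytes 0 and 2 are valid, carrying 0xDD and 0xBB respectively.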
The application logic is freed from supplying the Requester ID field when issuing requests (read requests, in particular). The Gen3 Block fills these values correctly in the TLP automatically. The Block can also allocate request tags from a maintained tag pool, and remember some context information for read requests. Those preferring to take control of these issues may do so:
- Assert the Requester ID Enable field in the descriptor to specify the requester ID explicitly.
- Choose External Tag Management (AXISTEN_IF_ENABLE_CLIENT_TAG = “TRUE”) for the old-school tag behavior (that is, the application picks the tags).
One clear advantage of the no-TLP interface is that the descriptor’s length is always 4 DWs for a request, and 3 DWs for a completion, regardless of the 32 or 64 bit addressing used on the PCIe bus itself. And unlike TLPs, request descriptor packets start with the address (always given in 64 bits), possibly giving the address decoding logic a head start.
In the attempt to give an idea of where old-school TLP-based logic should get the information from, here’s an incomplete dissection of a request descriptor. By all means, refer to the product guide for the full picture.
For a Completer Memory or I/O Request, reading or writing to a register (figure 3-6 in the user guide):
- The address: Always given in the first two DWs of the packet, as a 64-bit word, regardless of the originating TLP’s header length (3 or 4 DWs). Note that the DWs are given in little Endian format, lower 32 bits first. This makes sense when the 64-bit word is viewed as a plain bit vector. Bits [1:0] of this vector should be ignored and treated as zero, as they contain the PCIe-3.0 specific Address Type (AT) field, which relates to address translation.
- The First_BE (Byte Enables) are determined from m_axis_cq_tuser[3:0] during the first beat of each packet (only).
- The Last_BE (Byte Enables) are determined from m_axis_cq_tuser[7:4] during the first beat of each packet (only).
- The Length field is copied from bits [10:0] of the third (base+2) DW. Note that this field is one bit wider than the TLP’s Length field, so 1024 DWs is represented as is, and there’s no need to translate 0 to 1024.
- The Fmt and Type fields in the TLP are partly deducible from the Request Type field, bits [14:11] of the third (base+2) DW. There is no immediate translation formula between the values of Fmt and Type and the Request Type field. Table 3-4 in the user guide lists the possible values of the latter.
- The Requester / Completer ID is at bits [31:16] of the descriptor’s third (base+2) DW. Not to be confused with the TLP’s header, where they appear on the same bits, but one DW earlier.
- The Requester / Completer tag is at bits [7:0] of the fourth (base+3) DW. Note that unlike the TLP format, it’s the same DW regardless of whether it’s a request or a completion.
- TC and Attr are at bits [27:25] and [30:28] respectively, of the fourth (base+3) DW.
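The field positions listed above can be summed up in a short Python model of a Completer Request descriptor parser. It covers only the fields mentioned in this list, with the bit positions as stated here. The class and function names are invented for the sketch, and the product guide remains the authority on the full layout.

```python
from typing import NamedTuple


class CqDescriptor(NamedTuple):
    address: int       # 64-bit address, bits [1:0] forced to zero
    length_dw: int     # payload length in DWs, 11 bits (1024 is given as is)
    req_type: int      # Request Type, see Table 3-4 in the user guide
    requester_id: int
    tag: int
    tc: int
    attr: int


def parse_cq_descriptor(dws):
    """Parse the four descriptor DWs, given as 32-bit integers, DW 0 first."""
    dw0, dw1, dw2, dw3 = dws[:4]
    return CqDescriptor(
        address=((dw1 << 32) | dw0) & ~0x3,  # lower 32 bits arrive first
        length_dw=dw2 & 0x7FF,               # bits [10:0] of DW 2
        req_type=(dw2 >> 11) & 0xF,          # bits [14:11] of DW 2
        requester_id=(dw2 >> 16) & 0xFFFF,   # bits [31:16] of DW 2
        tag=dw3 & 0xFF,                      # bits [7:0] of DW 3
        tc=(dw3 >> 25) & 0x7,                # bits [27:25] of DW 3
        attr=(dw3 >> 28) & 0x7,              # bits [30:28] of DW 3
    )


def first_last_be(tuser_first_beat):
    """Return (First_BE, Last_BE) from m_axis_cq_tuser on a packet's first beat."""
    return tuser_first_beat & 0xF, (tuser_first_beat >> 4) & 0xF
```

Note that the byte enables are not part of the descriptor, so a hardware implementation would have to latch m_axis_cq_tuser[7:0] on the first beat along with the descriptor DWs.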
Continuous packet transmission
When submitting a packet (descriptor + possible payload) to the Gen3 Integrated Block, the application logic is not allowed to stall in the middle. Formally put, the s_axis_rq_tvalid and s_axis_cc_tvalid signals must be held asserted along the whole packet (see pages 109 and 126 in Xilinx’ PG023). In many cases, this means that the application logic must store the full packet in a RAM before asserting the *_tvalid signal, to allow its uninterrupted transmission.
By the way, the Gen3 Block guarantees to hold its *_tvalid signal asserted along packets conveyed to the application logic, but this is hardly an advantage.
*_tready signals may be deasserted at any moment by either side. Hence there is no similar requirement on receiving packets, and the application logic may stall the reception of packets: m_axis_cq_tready and m_axis_rc_tready may stall the data flow, even in the middle of a packet.
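When verifying application logic against the no-stall rule, a simple check on a simulation trace can catch violations early. The following Python sketch assumes that the trace was sampled once per clock as (tvalid, tready, tlast) tuples, with tlast marking the packet’s last beat as usual in AXI-Stream. The function name and the trace format are assumptions made for this example only.

```python
def check_tvalid_continuous(trace):
    """Return the index of the first cycle where tvalid drops mid-packet.

    trace: iterable of (tvalid, tready, tlast) tuples, one per clock cycle.
    Returns None if tvalid is held throughout every packet.
    """
    in_packet = False
    for cycle, (tvalid, tready, tlast) in enumerate(trace):
        if in_packet and not tvalid:
            return cycle  # the sender stalled in the middle of a packet
        if tvalid and tready:
            # Beat accepted: the packet ends only on its last beat
            in_packet = not tlast
        elif tvalid:
            # Beat offered but not accepted yet: the packet is under way
            in_packet = True
    return None
```

A deasserted tready in the middle of a packet is not flagged, as the receiving side is allowed to stall. Only a drop of tvalid between a packet’s first beat and the acceptance of its last beat is reported.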
Comments and corrections are warmly welcomed in the Xillybus forum. Anonymous posting is possible.