Down to the TLP: How PCI Express devices talk (Part II)

Published: 13 November 2012

Data Link Layer Packets

Aside from wrapping TLPs with its header (2 bytes) and adding a CRC at the end (LCRC actually, 4 bytes), the Data Link layer runs packets of its own for maintaining reliable transmission. These special packets are Data Link Layer Packets (DLLPs). Here’s a brief list of them:

Ack DLLP for acknowledging successfully received TLPs.
Nak DLLP for indicating that a TLP arrived corrupted, and that a retransmit is due. Note that there’s also a timeout mechanism in case nothing that looks like a TLP arrives.
Flow Control DLLPs: InitFC1, InitFC2 and UpdateFC, used to announce credits, as described below.
Power Management DLLPs.

Flow control

As mentioned before, the data link layer has a Flow Control (FC) mechanism, which makes sure that a TLP is transmitted only when the link partner has enough buffer space to accept it.

I used the term “link partner” and not “destination” deliberately. For example, when a peripheral is connected to the Root Complex through a switch, it runs its flow control mechanism against the switch and not the final destination. In other words, once the TLP is transmitted from the peripheral, it’s still subject to the flow control mechanism between the switch and the Root Complex. If there are more switches on the way, each leg has its own flow control.

The mechanism is not the simplest, and its description in the spec will give you goosebumps. So I’ll try to put it fairly clearly.

The flow control mechanism runs independent accounting for 6 (six!) distinct buffer consumers:

Posted Requests TLP’s headers
Posted Requests TLP’s data
Non-Posted Requests TLP’s headers
Non-Posted Requests TLP’s data
Completion TLP’s headers
Completion TLP’s data

These are the six credit types.

The accounting is done in flow control units, which correspond to 4 DWs of traffic (16 bytes), always rounded up to the nearest integer. Since headers are always 3 or 4 DWs in length, every TLP transmitted consumes one unit from the respective header credit. When data is transmitted, the number of consumed units is the number of data DWs in the TLP, divided by four, rounded upwards. So we can imagine data buckets at the receiver of 16 bytes each, on which we are not allowed to mix data from different TLPs. Each bucket is a flow control unit.
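The arithmetic above can be put as a minimal Python sketch. The names (Credits, credits_for_tlp) are made up for illustration and don't come from the spec; the only facts used are the ones stated above: one header unit per TLP, and one data unit per 4 DWs of payload, rounded up.

```python
from dataclasses import dataclass

FC_UNIT_DWS = 4  # one flow control unit covers 4 DWs (16 bytes)


@dataclass(frozen=True)
class Credits:
    header: int  # units taken from the header credit type
    data: int    # units taken from the data credit type


def credits_for_tlp(payload_dws: int) -> Credits:
    """Flow control units consumed by one TLP with the given payload."""
    if payload_dws < 0:
        raise ValueError("payload length can't be negative")
    # Ceiling division: a partly filled 16-byte bucket still counts as one.
    data_units = -(-payload_dws // FC_UNIT_DWS)
    # Headers are 3 or 4 DWs, so every TLP costs exactly one header unit.
    return Credits(header=1, data=data_units)
```

So a read request (no payload) costs one header unit and nothing else, while a write with a 33 DW payload costs one header unit and 9 data units, the last bucket being almost empty.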

Now let’s imagine that there’s a doorkeeper at the transmitter, which counts the total number of flow control units consumed since the link establishment, separately for each credit type. This is six numbers to keep track of. This doorkeeper also has the information about the maximum number each of these credit types is allowed to reach. If a certain TLP for transmission would make any of these counted units exceed its limit, it’s not allowed through. Another TLP may be transmitted instead (subject to reordering rules) or the doorkeeper simply waits for the limit to rise.

This is the way the flow control works. When the link is established, both sides exchange their initial limits. As each receiver processes incoming packets, it updates the limits for its link partner, so it can use the buffer space released. UpdateFC DLLPs are sent periodically to announce the new credit limits.

Well, I overlooked a small detail: Since we’re counting the total number of units since the link started, there’s always a potential for overflow. The PCIe standard allocates a certain number of bits for each credit type counter and its limit (8 bits for header credits, 12 bits for data credits), knowing that they will overflow pretty soon. This overflow is worked around by making the comparison between each counter and its limit with straightforward modulo arithmetic. So given some restrictions on not setting the limit too high above the counter, the flow control mechanism implements the doorkeeper described above.
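Here is a rough Python sketch of the doorkeeper for a single credit type, wrap-around included. The class and function names are mine. The comparison follows my reading of the spec's transmit gating rule (the difference between limit and counter, taken modulo 2^bits, must not exceed half the counter's range), so take it as an illustration and not as a reference implementation. Infinite credits are not handled.

```python
HEADER_BITS = 8   # width of header credit counters
DATA_BITS = 12    # width of data credit counters


def may_transmit(consumed: int, limit: int, needed: int, bits: int) -> bool:
    """Modulo comparison: is (consumed + needed) still within the limit?"""
    modulus = 1 << bits
    cumulative = (consumed + needed) % modulus
    # A difference in the lower half of the range means "limit not
    # exceeded"; the upper half means the counter ran past the limit.
    return (limit - cumulative) % modulus <= modulus // 2


class CreditGate:
    """Doorkeeper for one credit type (hypothetical helper)."""

    def __init__(self, initial_limit: int, bits: int):
        self.bits = bits
        self.modulus = 1 << bits
        self.consumed = 0                      # units used since link up
        self.limit = initial_limit % self.modulus

    def try_send(self, needed: int) -> bool:
        if not may_transmit(self.consumed, self.limit, needed, self.bits):
            return False                       # TLP has to wait
        self.consumed = (self.consumed + needed) % self.modulus
        return True

    def update_limit(self, new_limit: int) -> None:
        """Called when an UpdateFC announces a new limit."""
        self.limit = new_limit % self.modulus
```

A real transmitter would hold six of these per Virtual Channel, and let a TLP through only if both the header gate and the data gate of its type agree.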

Bus entities are allowed to announce an infinite credit limit for any or all of the six credit types, meaning that flow control for that specific credit type is disabled. As a matter of fact, endpoints (as opposed to switches and the Root Complex) must advertise an infinite credit for completion headers and data. In other words, an endpoint can’t refuse to accept a completion TLP based upon flow control. So the Requester of a non-posted transaction must take responsibility for being able to accept the completion by verifying that it has enough buffer space when making the request. This also applies to root complexes not allowing peer-to-peer transactions.

Virtual channels

In part I of this guide, I marked the TC fields in the example TLPs green, saying that those fields are almost always zero. TC stands for Traffic Class and is an identifier used to create Virtual Channels. These Virtual Channels are merely separate sets of data buffers having separate flow control credits and counters. So by choosing a TC other than zero (and setting up the bus entities accordingly) one can have TLPs being subject to independent flow control systems, preventing TLPs belonging to one channel from blocking the traffic of TLPs belonging to another.

The mapping from TC’s to Virtual Channels is done by software for each bus entity. Anyhow, the real-life PCIe elements I’ve seen so far support only one Virtual Channel, VC0, and hence only TC0 is used, which is the minimum required by spec. So unless some special application requires this, TC will remain zero in all TLPs, and this whole issue can be disregarded.

Packet reordering

One of the issues that comes to mind in a packet network is to what extent TLPs may arrive in an order different from how they were sent. The Internet Protocol (IP, as in TCP/IP) for example, allows any packet reshuffling on the way. The PCIe specification allows a certain extent of TLP reordering, and in fact in some cases reordering is mandatory to avoid deadlocks.

Fortunately, compatibility with legacy PCI was taken into account on this issue as well, unless the “relaxed ordering” bit is set in the TLP, which it rarely is. This is one of the bits in the Attr field, marked green in the TLP examples in part I of this guide. So all in all, one can trust that things will work as if we were talking with a good old bus. Those of us who write to a few registers, and then trigger an event by writing to another one, can go on doing it. I turn off the BAR’s Prefetch bit to be on the safe side, even though there’s nothing to imply that it has anything to do with writes.

The spec defines reordering rules in full detail, but it’s not easy to get the bottom line. So I’ll mention a few results of those rules. Everything said here assumes that the relaxed ordering bit is cleared in all transactions. I’m also ignoring I/O space completely (why use it?):

Posted writes and MSI’s arrive in the order they were sent. Now, all memory writes are posted, and MSIs are in fact (posted) memory writes. So we know for sure that memory writes are executed in order, and that if we issued an MSI after filling a buffer (writes…) it will arrive after the buffer was actually written to.
A read request will never arrive before a write request or MSI sent before it. As a matter of fact, performing a Read Request is a safe way to wait for a write to complete.
Write requests may very well come before read requests sent before them. This mechanism prevents deadlock in certain exotic scenarios. Don’t write to a certain memory area while waiting for the read completion to come in.
Read completions for a certain request (i.e. with the same Tag and Requester ID) arrive in the order they were sent (so they arrive in order with rising addresses). Read completions of different requests may be reordered (but who cares).

Other than that, anything can change order of arrival, including read requests which may be reordered among themselves and with read completions.

To relieve any paranoia about an interrupt message arriving before the write operations that preceded it, section 2.2.7 in the spec spells it out:

The Request format used for MSI/MSI-X transactions is identical to the Memory Write Request format defined above, and MSI/MSI-X Requests are indistinguishable from memory writes with regard to ordering, Flow Control, and data integrity.
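For those who prefer a lookup over prose, the ordering guarantees listed in this section can be condensed into a small Python function. This is only a sketch of my summary above (relaxed ordering cleared, I/O space ignored), not of the spec's full ordering table, and the names are invented. A False result means "this summary makes no promise", not "these will be reordered".

```python
from enum import Enum


class Kind(Enum):
    POSTED_WRITE = "posted write or MSI"
    READ_REQUEST = "read request"
    READ_COMPLETION = "read completion"


def order_guaranteed(first: Kind, second: Kind,
                     same_request: bool = False) -> bool:
    """True if `second`, sent after `first`, is promised not to overtake it.

    `same_request` only matters for two read completions: it says whether
    both belong to the same request (same Tag and Requester ID).
    """
    # Nothing listed here overtakes an earlier posted write: not another
    # write, not an MSI, and not a read request.
    if first is Kind.POSTED_WRITE and second in (Kind.POSTED_WRITE,
                                                 Kind.READ_REQUEST):
        return True
    # Completions of one request stay in order among themselves.
    if first is Kind.READ_COMPLETION and second is Kind.READ_COMPLETION:
        return same_request
    # Everything else (e.g. a write sent after a read request) may be
    # reordered as far as this summary goes.
    return False
```

For example, order_guaranteed(Kind.POSTED_WRITE, Kind.READ_REQUEST) is True, which is exactly why a read is a safe way to wait for a write, while order_guaranteed(Kind.READ_REQUEST, Kind.POSTED_WRITE) is False.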

Zero-length read request

As just mentioned, reading from a bus entity after writing to it, is a safe way to wait for the write operation to finish for real. But why read anything, if we’re not interested in the data? So they made up a zero-length request, which reads nothing. All four Byte Enables are assigned zeroes, meaning nothing is read. As for the completion, section 2.2.5 in the spec says:

If a Read Request of 1 DW specifies that no bytes are enabled to be read (1st DW BE[3:0] field = 0000b), the corresponding Completion must specify a Length of 1 DW, and include a data payload of 1 DW

So we have one DW of rubbish data in the completion. That’s fair enough.
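To make this concrete, here is a Python sketch that packs the three header DWs of such a zero-length read. The field positions follow my understanding of the 3 DW Memory Read Request header with 32-bit addressing (Fmt and Type zero, TC and Attr zero), and the function name and the values in the usage note are arbitrary, so check against the spec before relying on it.

```python
import struct


def zero_length_read_header(address: int, requester_id: int, tag: int) -> bytes:
    """Pack a 3 DW Memory Read Request header that reads no bytes."""
    if address & 0x3 or address >> 32:
        raise ValueError("need a DW-aligned 32-bit address")
    if not 0 <= requester_id <= 0xFFFF or not 0 <= tag <= 0xFF:
        raise ValueError("requester_id is 16 bits, tag is 8 bits")

    length_dws = 1       # Length field: one DW
    first_be = 0b0000    # 1st DW BE all zero: no byte is actually read
    last_be = 0b0000     # Last DW BE is zero for a single-DW request

    # DW0: Fmt = 000 (3 DW, no data), Type = 00000 (memory read),
    #      TC = 0, Attr = 0, Length in the low 10 bits.
    dw0 = length_dws
    # DW1: Requester ID, Tag, Last DW BE, 1st DW BE.
    dw1 = (requester_id << 16) | (tag << 8) | (last_be << 4) | first_be
    # DW2: the address, whose two lowest bits are always zero.
    dw2 = address
    return struct.pack(">III", dw0, dw1, dw2)
```

Calling zero_length_read_header(0xFDAFF040, 0x0000, 0x0C) gives 12 bytes of header, and the completion coming back carries one DW that should simply be thrown away.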

Payload sizes and boundaries

Every TLP carrying data must limit the number of payload data DWs to Max_Payload_Size, which is a number allocated during configuration (typically 128 bytes). This number applies only to payloads, and not to the Length field itself: Memory Read Requests are not restricted in length by Max_Payload_Size (per spec 2.2.2), but are restricted by Max_Read_Request_Size (per spec 2.2.7).

So a Memory Read Request may ask for more data than is allowed in one TLP, and hence multiple TLP completions are inevitable.

Regardless of the Max_Payload_Size restrictions, completions of (memory) read requests may be split into several completion TLPs. The cuts must be at addresses aligned to RCB bytes (Read Completion Boundary, 128 bytes, for the Root Complex possibly 64) per spec 2.3.1.1. If the Request doesn’t cross such an alignment boundary, only a single Completion TLP is allowed. Multiple Memory Read Completions for a single Read Request must return data in increasing address order (which will be kept by the switching network).
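As an illustration, the following Python sketch shows one legal way a completer might cut the data for a read request, working in bytes for simplicity. The function name is mine, and a real completer is free to cut less often, as long as each piece fits in Max_Payload_Size and every cut falls on an RCB-aligned address.

```python
def split_completion(address: int, length: int,
                     rcb: int = 128, max_payload: int = 128):
    """Return (address, length) pieces for the completions of one read."""
    if length <= 0:
        raise ValueError("length must be positive")
    if max_payload < rcb:
        raise ValueError("this sketch assumes max_payload >= rcb")
    end = address + length
    pieces = []
    while address < end:
        # Furthest RCB-aligned address reachable within one payload.
        cut = (address + max_payload) // rcb * rcb
        piece_end = min(end, cut)
        pieces.append((address, piece_end - address))
        address = piece_end       # pieces go out in rising address order
    return pieces
```

A 256-byte read starting at 0x1010 comes back as three completions of 0x70, 0x80 and 0x10 bytes, while a 32-byte read at the same address stays within one RCB block and gets a single completion.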

And a last remark, citing the spec 2.2.7: Requests must not specify an Address/Length combination which causes a Memory Space access to cross a 4-KB boundary.
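On the requester's side, these rules boil down to chopping a large transfer into requests before sending them. The Python sketch below does this for memory reads, in bytes. The function names are invented, and the 512-byte Max_Read_Request_Size is just an assumed value for the example; the real one is set during configuration.

```python
PAGE = 4096  # no request may cross a 4 KB boundary


def crosses_4k(address: int, length: int) -> bool:
    """True if the byte range [address, address + length) spans a 4 KB boundary."""
    return length > 0 and address // PAGE != (address + length - 1) // PAGE


def split_read_requests(address: int, length: int,
                        max_read_request: int = 512):
    """Return (address, length) read requests covering the whole range."""
    end = address + length
    requests = []
    while address < end:
        next_boundary = (address // PAGE + 1) * PAGE
        # Stop at whichever comes first: the end of the transfer, the
        # size limit, or the next 4 KB boundary.
        request_end = min(end, address + max_read_request, next_boundary)
        requests.append((address, request_end - address))
        address = request_end
    return requests
```

Reading 0x300 bytes from 0xF00, for example, turns into a 0x100-byte request up to the 4 KB boundary, followed by a 0x200-byte request starting at 0x1000.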

That’s it. I hope reading through the PCI Express specification will be easier now. There’s still a lot to read…
