Introduction

FPGA designs that interact with a host through PCIe are becoming increasingly popular, and for good reasons: efficiency, reliability, and a clever, scalable industry standard all make PCI Express a wise choice.

Vendors of FPGA devices usually provide a Transaction Layer front-end IP core to use with application logic. The basics of this layer are outlined in a separate tutorial.

Proper DMA-based communication from the FPGA to the host requires some awareness of the specification’s details, but it’s otherwise fairly straightforward, in the sense that packets are formed, dispatched and assured to arrive in the order they were sent. The other direction, host to FPGA, is somewhat trickier, since the FPGA’s active part is merely to issue a read request. The completions, which contain the requested data, arrive at a time and in a format that depend on the host and bus fabric. The reception logic must be prepared to react to different scenarios, so FPGA logic that worked with one host may fail with another, unless it is properly designed.

The discussion below looks at different possibilities for implementing host to FPGA data transmission, and attempts to underscore the main considerations.

As a side note, Xillybus’ IP core implements the single or multiple requests in flight methods described below, depending on the bandwidth requirements.

Cheating: No DMA

The simplest way is having the driver software write the data to the hardware in a loop, with something like memcpy(). This is indeed very inefficient, but given the fast PCIe hardware available, it’s possible to reach satisfactory results with this plain method. For example, in a worst-case scenario, each DW (4 bytes) of payload requires 16 bytes of TLP headers (64-bit addressing) + 6 bytes of data link layer overhead, so there are 26 bytes transmitted on the data link layer for each 4 bytes of payload.

On a Gen1 link with 8 lanes, the raw link layer runs at 2.5 Gbps x (8/10) x 8 = 16 Gbps = 2 GB/s (the 8/10 factor accounts for the 8b/10b line encoding). The payload rate is hence limited to 2000 MB/s x 4 / 26 = 308 MB/s. In reality, the rate is somewhat lower because of housekeeping data link layer packets, so 280-300 MB/s is a more realistic figure.

In some cases, the processor may send more than 4 bytes in each TLP, in particular if a 64-bit write is supported by its instruction set. This significantly improves the throughput. Not all hosts support Gen2 links, but where one does, the potential throughput is doubled. So all in all, impressive data rates are possible using this primitive method.
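The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: the function name and its parameters are made up for illustration, and it ignores housekeeping traffic, so it yields an upper bound rather than a measured figure.

```python
def pio_payload_mb_per_s(lanes, payload_bytes=4, header_bytes=16,
                         dll_overhead_bytes=6, gbps_per_lane=2.5):
    """Upper bound on the payload rate (MB/s) of back-to-back write TLPs."""
    # Raw byte rate after 8b/10b encoding (applies to Gen1 and Gen2 links)
    raw_bytes_per_s = gbps_per_lane * 1e9 * (8 / 10) * lanes / 8
    tlp_bytes = payload_bytes + header_bytes + dll_overhead_bytes
    return raw_bytes_per_s * payload_bytes / tlp_bytes / 1e6


# Gen1 with 8 lanes, one DW per TLP: roughly 308 MB/s
print(round(pio_payload_mb_per_s(lanes=8), 1))
# 64-bit writes on the same link
print(round(pio_payload_mb_per_s(lanes=8, payload_bytes=8), 1))
# Gen2 (5 Gbps per lane), one DW per TLP
print(round(pio_payload_mb_per_s(lanes=8, gbps_per_lane=5.0), 1))
```

Setting payload_bytes to 8 models the 64-bit write case, and gbps_per_lane=5.0 models a Gen2 link.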

The obvious downside of this non-DMA method is that the CPU is busy writing. This may be less of a concern when the processor has multiple cores, in particular with hyperthreading support. For example, with a quad-core processor supporting hyperthreading, the data copy routine occupies one of the 8 virtual cores.

And still, the host’s hardware wasn’t designed to support high bandwidth traffic being carried out this way. Even though the bridge between the processor’s internal bus and PCIe should take the load, other peripherals may suffer from a significant performance degradation due to the flood of requests.

Complicated bus scenarios, which the processor wasn’t designed to cope with, may cause a significant performance hit on the entire system. For example, if the Ethernet hardware driver attempts to access its registers, the TLPs requesting these operations may be significantly delayed, causing a slowdown of apparently unrelated tasks.

DMA read requests: Matching the completions

As the PCIe specification requires, in order to transmit data, the FPGA sends a read request TLP to the host, stating (among other things) the start address, the number of DWs to send, and a request identifier (“tag”). The host responds with one or more Completion TLPs, which contain the read data as their payload. The “tag” field of these completion packets matches the one in the request, so the FPGA can tell which request they belong to.

The PCIe spec defines several rules for the request and its completions, which are best learned from the spec itself. A couple of these rules are:

  • The request TLP is limited in the number of bytes it can ask the host to read. This limit is the lower of the limits declared by the device and the host in their configuration registers. Typically, it ends up being 512 bytes.
  • The host is allowed to divide its response into an arbitrary number of completion packets, as long as it cuts the address ranges on 64-byte aligned boundaries (a.k.a. Read Completion Boundary, RCB, which can also be 128 bytes in rare cases).
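To illustrate the second rule, the sketch below counts how many completion TLPs a single read request may produce when the host cuts the response at every RCB-aligned address. The function name is hypothetical, and the model assumes that splitting at every boundary is the finest division the host will make.

```python
def max_completions(start_addr, length, rcb=64):
    """Worst-case number of completion TLPs for one read request,
    assuming the completer cuts at every RCB-aligned address."""
    first_chunk = start_addr // rcb
    last_chunk = (start_addr + length - 1) // rcb
    return last_chunk - first_chunk + 1


print(max_completions(0x1000, 512))  # aligned request: 8
print(max_completions(0x1004, 512))  # unaligned request crosses one more boundary: 9
```

Note that a request that doesn't start on an RCB boundary may yield one completion more than an aligned request of the same length.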

Since several completion TLPs may arrive for a single read request, the obvious question is whether they arrive in any particular order. The answer is that completions having the same “tag” field (hence responding to the same request) always arrive in rising address order. Completions from different requests have no ordering rules. In other words, if the FPGA issues a read request with the tag set to zero, and then another read request with the tag set to one, it’s possible that a completion TLP for the second request (with tag set to 1) will arrive before the last completion TLP has arrived for the first one (with tag 0). In fact, it may arrive before any of the completions marked with tag 0.

The above assumes that the “relaxed ordering” bit is cleared for all TLPs involved, which is the common case. If this bit is set, some of these ordering guarantees may no longer hold.
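The tag matching described in this section can be modelled in software as a small table of outstanding requests. The sketch below is an illustration only, not an actual FPGA implementation: the class and method names are invented, and a completion is reduced to its tag and payload length. It relies on the rule that completions with the same tag arrive in rising address order, so a per-tag pointer is enough to tell where each payload belongs.

```python
class TagTracker:
    """Tracks outstanding read requests by tag (illustrative model)."""

    def __init__(self):
        self.outstanding = {}  # tag -> [next address, bytes remaining]

    def request(self, tag, address, length):
        if tag in self.outstanding:
            raise ValueError("tag is already in use")
        self.outstanding[tag] = [address, length]

    def completion(self, tag, payload_len):
        """Return the address that this completion's payload belongs to."""
        entry = self.outstanding[tag]
        if payload_len > entry[1]:
            raise ValueError("completion exceeds the requested length")
        address = entry[0]
        entry[0] += payload_len
        entry[1] -= payload_len
        if entry[1] == 0:
            del self.outstanding[tag]  # request fully completed, tag is free
        return address


tracker = TagTracker()
tracker.request(tag=0, address=0x1000, length=128)
tracker.request(tag=1, address=0x2000, length=64)
# A completion for tag 1 may legally arrive before any for tag 0
print(hex(tracker.completion(tag=1, payload_len=64)))
print(hex(tracker.completion(tag=0, payload_len=64)))
print(hex(tracker.completion(tag=0, payload_len=64)))
```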

Single request in flight

In an FPGA implementation, it’s quite convenient to know that the packets arrive in rising address order. The incoming data can easily be stored in a FIFO this way. Or if the data is stored in a RAM, a pointer can be set to the start address, and then incremented as the completion packets arrive. But to allow for this simple implementation, each read request must be sent only when the previous one’s completions have all arrived. This means that no completion data is flowing during the time gap until a new request is transmitted, has arrived at the host and is processed. This, in turn, can reduce the bandwidth performance to below 50%.
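The cost of the time gap can be estimated with a very simple model: the link carries data for as long as the completions take to arrive, and then sits idle for the turnaround time until the next request's first completion arrives. The function below is a back-of-the-envelope sketch. Its name is made up, and the turnaround times in the example are hypothetical placeholders rather than measured values.

```python
def single_request_utilization(request_bytes, link_bytes_per_ns, turnaround_ns):
    """Fraction of time that completion data flows with one request in flight."""
    transfer_ns = request_bytes / link_bytes_per_ns
    return transfer_ns / (transfer_ns + turnaround_ns)


# 512-byte requests on a link delivering 0.25 bytes/ns (250 MB/s),
# with hypothetical turnaround times
for turnaround_ns in (500, 2048, 5000):
    print(turnaround_ns, round(single_request_utilization(512, 0.25, turnaround_ns), 2))
```

In this model, the utilization drops to 50% when the turnaround time equals the time it takes to transfer one request's data, and below that for faster links or slower hosts.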

Multiple read requests in flight

To utilize the bandwidth fully, multiple read requests must be issued, so that the host always has at least one read request handy when it finishes completing the others. The FPGA can’t rely on packets arriving in rising address order anymore, and must store the incoming data in RAM before presenting it to the logic consuming the data.

There is also a slight complication in telling when data is ready for submission to the logic that uses it. In the single-request case, the full data arrival could be told by counting the number of bytes or words arrived, or by a similar method. With completions from multiple read requests arriving, a slightly smarter method should be adopted, depending on the data transmission pattern.
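One possible approach, shown here as a software model rather than as HDL, is to write each completion's payload into RAM at the position its request dictates, and to expose to the consuming logic only the part of the buffer that is contiguous from the start. The sketch assumes that requests are issued in order and cover consecutive buffer ranges starting at offset zero. All names are invented for this example.

```python
class ReorderBuffer:
    """Stores completions of several requests in flight (illustrative model)."""

    def __init__(self, size):
        self.ram = bytearray(size)
        self.requests = {}  # tag -> [start offset, length, bytes received]
        self.order = []     # tags, in the order the requests were issued

    def issue(self, tag, offset, length):
        self.requests[tag] = [offset, length, 0]
        self.order.append(tag)

    def completion(self, tag, payload):
        entry = self.requests[tag]
        position = entry[0] + entry[2]  # same-tag completions arrive in order
        self.ram[position:position + len(payload)] = payload
        entry[2] += len(payload)

    def ready_bytes(self):
        """Number of bytes, counted from offset 0, that are safe to consume."""
        total = 0
        for tag in self.order:
            _, length, received = self.requests[tag]
            total += received
            if received < length:
                break  # everything beyond this request may have holes
        return total


buf = ReorderBuffer(256)
buf.issue(tag=0, offset=0, length=128)
buf.issue(tag=1, offset=128, length=128)
buf.completion(1, bytes([0x11] * 64))  # second request answers first
print(buf.ready_bytes())               # 0: nothing contiguous yet
buf.completion(0, bytes([0xAA] * 64))
buf.completion(0, bytes([0xAA] * 64))
print(buf.ready_bytes())               # 192: request 0 plus the start of request 1
```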

In tests, completion packets tend to arrive in increasing address order even when there is more than one read request in flight. The host typically handles the read requests in the same order they arrived, even though read requests may be reordered by the switching fabric and the root complex itself. Apparently, there is no reordering of the completion packets either, even though this is allowed too when the completions are related to different requests.

The fact that the common hardware doesn’t tend to take advantage of its liberties in real-life tests is actually bad news, since the FPGA design is never tested to withstand the conditions it may meet in the future. Forcing TLP reordering requires dedicated (and probably very expensive) testing equipment, and it’s not clear whether equipment that really stress tests this issue exists. So the first time the FPGA faces completion TLPs arriving in non-incremental order may be far away from the development lab, possibly as a result of extended switching fabric and/or the host being heavily loaded.
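To make the set of legal arrival orders concrete, the sketch below interleaves the completions of several requests at random while keeping each request's own completions in their original order. It is merely an illustration of what the rules permit, possibly useful as a stimulus in a simulation model. It is not a substitute for the hardware testing discussed above, and the function name is hypothetical.

```python
import random


def shuffle_completions(per_request, seed=None):
    """Randomly interleave completions of different requests, preserving
    the order of completions that belong to the same request."""
    rng = random.Random(seed)
    queues = [list(q) for q in per_request if q]
    result = []
    while queues:
        queue = rng.choice(queues)
        result.append(queue.pop(0))
        if not queue:
            queues.remove(queue)
    return result


# Each completion is represented as (tag, index within its request)
request0 = [(0, 0), (0, 1), (0, 2)]
request1 = [(1, 0), (1, 1)]
print(shuffle_completions([request0, request1], seed=1))
```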

Completion credits

Any endpoint on the PCIe bus must announce infinite credits for completions. In other words, you can’t send a read request and then refuse to receive the completion by virtue of the link layer’s flow control. It’s the responsibility of the issuer of the read request to make sure that there is enough room to store the incoming completion packets.

The existing PCIe IP cores by Xilinx and Altera have different ways of informing the application logic about the amount of space allocated for incoming completion packets, but none of these tells how much is left for future read requests at a given moment. It’s up to the application logic to keep track of how many resources are left, and to decide whether it’s safe to send another read request.

These PCIe cores announce the allocated memory space for completions in terms of header and data credits. Some make this data available to the application logic through special wires, and some specify this information offline, so the figures need to be hardcoded in the HDL. Either way, a typical implementation needs to calculate the maximal number of credits a read request’s response may consume, and check it against a calculated number of credits left. If there are enough credits, the request is sent, and the response’s worst-case consumption is deducted from the credits left. These credits are returned to the “credits left” estimation as the completion packets arrive.
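The bookkeeping described above can be sketched as a software model. The names are invented for this example, and several assumptions are made: a data credit is taken as 16 bytes, the worst case is a split at every 64-byte boundary, and the "last" flag stands in for however the logic detects a request's final completion. How an actual PCIe core accounts for its buffer space should be checked against its documentation.

```python
def worst_case_credits(start_addr, length, rcb=64):
    """(header, data) credits if the completer cuts at every RCB boundary."""
    headers = data = 0
    addr, end = start_addr, start_addr + length
    while addr < end:
        chunk_end = min((addr // rcb + 1) * rcb, end)
        headers += 1
        data += -(-(chunk_end - addr) // 16)  # 16 bytes per credit, rounded up
        addr = chunk_end
    return headers, data


class CreditBook:
    def __init__(self, header_credits, data_credits):
        self.header_left = header_credits
        self.data_left = data_credits
        self.reserved = {}  # tag -> [header, data] credits still held

    def try_request(self, tag, start_addr, length):
        """Reserve the worst case; return False if the request must wait."""
        headers, data = worst_case_credits(start_addr, length)
        if headers > self.header_left or data > self.data_left:
            return False
        self.header_left -= headers
        self.data_left -= data
        self.reserved[tag] = [headers, data]
        return True

    def completion(self, tag, payload_len, last):
        """Return credits to the estimation as a completion arrives."""
        held = self.reserved[tag]
        if last:
            headers, data = held  # release whatever is still held
            del self.reserved[tag]
        else:
            headers, data = 1, -(-payload_len // 16)
            held[0] -= headers
            held[1] -= data
        self.header_left += headers
        self.data_left += data


book = CreditBook(header_credits=28, data_credits=112)
print([book.try_request(tag, tag * 512, 512) for tag in range(4)])
```

Releasing all remaining credits on the last completion covers the case where the host responds with fewer, larger packets than the worst case that was reserved.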

Obviously, this is a nontrivial piece of logic, which can be simplified in some cases. For example, if all requests are limited to 512 bytes, and always start on 64-byte boundaries, the host will respond with 8 packets containing 64 bytes of payload each, in the worst case (which is in fact a common response). Each of these packets consumes memory at the receiving side, which is equivalent to one header credit and 4 data credits, so the request’s completion may consume memory of up to 8 header credits and 32 data credits' worth.

Suppose that the PCIe core allocates the equivalent of 28 header credits and 112 data credits for completions (which are possible values for Altera’s Cyclone IV). For this example case, the header credits limit us to 3 requests in flight (3 x 8 < 28 < 4 x 8), and so do the data credits (3 x 32 < 112 < 4 x 32). This reduces the bookkeeping to just knowing how many uncompleted requests are out. However, this works only because of the previous assumptions on the requests’ format, and the performance may turn out less than optimal because the credits aren’t fully used.
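The calculation in this example boils down to two integer divisions. The short sketch below repeats it, with a hypothetical function name and with the per-request figures (8 header credits and 32 data credits) taken from the assumptions stated above.

```python
def max_requests_in_flight(header_credits, data_credits,
                           headers_per_request=8, data_per_request=32):
    """Number of worst-case requests that fit in the allocated credits."""
    return min(header_credits // headers_per_request,
               data_credits // data_per_request)


print(max_requests_in_flight(28, 112))  # 3
```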

Some PCIe IP core vendors have a completely different mechanism for incoming TLPs, so the discussion in this section applies only to Xilinx and Altera PCIe blocks, and to those that have a similar interface. A specific note about that follows.

Note for Lattice users

Unlike Xilinx and Altera, Lattice's supplied PCIe IP core (for LatticeECP3, LatticeECP2M and LatticeSCM families) doesn't store the incoming packets in its own memory buffers, but passes them immediately to the user logic as they arrive. The user logic has no way to refuse the reception of these packets. If an intermediate buffer is required for storing incoming packets, the user logic must manage it. Accordingly, the user of the core defines the announced credits in the GUI for generating the IP core.

The discussion in "completion credits" is still relevant in a way: The user logic shouldn't request reads if it can't take the completions. The major difference is that the PCIe IP core is indifferent to this issue, as the risk of packet overflow relates to the user logic, not the PCIe IP core.

Bandwidth efficiency of completions

The completer’s freedom to divide the requested data into several TLPs takes the bandwidth efficiency issue out of our control. It’s common that the completions are segmented into the smallest possible packets allowed, 64 bytes of payload per TLP. This isn’t as bad as it may sound, since the maximal payload is often set to 128 bytes anyhow.

To get an idea of the efficiency, let’s assume 4 DW headers and add the 6 bytes of data link layer overhead, so there are 22 bytes added to each TLP.

When the payload is 64 bytes, the data link efficiency is 64 / (22 + 64) = 74.4%. For a 128-byte payload, it would have been 128 / (22 + 128) = 85.3%.

Let’s take an example of a Gen1 link with a single lane (x1), with the common 64-byte completions. The upper limit for the host-to-FPGA data rate is hence 2.5 Gbps x (8/10) x 74.4% = 1.49 Gbps = 186 MB/s. The measured results may be lower, since housekeeping data link layer packets are not taken into account here, and neither are TLPs created by writes issued by the host (e.g. to FPGA registers).
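The efficiency figures and the resulting data rate can be reproduced as follows. This is a sketch with invented function names, using the 22 bytes of per-TLP overhead assumed above. It gives an upper bound only, for the reasons just mentioned.

```python
def completion_efficiency(payload_bytes, header_bytes=16, dll_overhead_bytes=6):
    """Fraction of the link's bytes that carry completion payload."""
    return payload_bytes / (payload_bytes + header_bytes + dll_overhead_bytes)


def read_rate_mb_per_s(payload_bytes, lanes=1, gbps_per_lane=2.5):
    """Upper bound on the host-to-FPGA data rate in MB/s."""
    raw_bytes_per_s = gbps_per_lane * 1e9 * (8 / 10) * lanes / 8
    return raw_bytes_per_s * completion_efficiency(payload_bytes) / 1e6


print(round(100 * completion_efficiency(64), 1))   # 74.4
print(round(100 * completion_efficiency(128), 1))  # 85.3
print(round(read_rate_mb_per_s(64)))               # 186
```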

Conclusion

Communicating data from the host to the FPGA involves a significant loss of control of the data flow, making it impossible to predict the actual performance, as it depends on the host’s responsiveness to read requests. Also, commonly available hardware tends to respond in relatively restricted patterns, making it difficult to test the logic design against legal scenarios that may appear in the future.

As a result, a logic design that guarantees proper functionality, beyond the specific host hardware it’s tested on, requires a certain degree of sophistication which may seem redundant at first glance. Continuous testing on a wide variety of platforms is beneficial, as it exposes the logic design to the response patterns that appear as new hardware for PCIe hosts is introduced to the market.