Introduction
The achievable bandwidth and latency are often a concern when designing a mixed host / FPGA system, and their relation to the underlying infrastructure is not always obvious.
This page is intended to clarify what can be expected, and to explain the limits and the reasons behind them. A separate page explains how to achieve the best possible performance.
This page discusses bandwidth only; latency issues are elaborated on a separate page.
Note that the maximal bandwidths for different demo bundles are listed on the download page for PCIe.
The upper limit
This is a short theoretical outline of the maximal bandwidths that can be achieved on PCIe and AXI links.
PCIe: A single Gen1 PCIe lane has a 2.5 GT/s physical bandwidth, which is effectively reduced to a 2.0 Gb/s bandwidth due to the 8B/10B encoding. Dividing by eight, we obtain the raw rate in bytes, 250 MB/s.
Real-life tests show that the actual data payload bandwidth is ~200 MB/s (in each direction), mostly due to the overhead of the data link layer and the TLP headers.
This has been observed to grow proportionally with the lane width and physical rate. In other words, a 4x Gen1 link can be expected to give up to 800 MB/s of real data payload. Likewise, each Gen2 lane (which runs at a 5.0 GT/s physical bandwidth) contributes 400 MB/s to the payload data rate, each Gen3 lane (8.0 GT/s) contributes 800 MB/s, and each Gen4 lane (16.0 GT/s) contributes 1600 MB/s. So for example, a Virtex-7 device with 8x Gen3 lanes can be expected to reach 6.4 GB/s, assuming that the infrastructure (bus, switches and peer) can take that. Again, in each direction.
AXI: The AMBA bus has separate signals for address and data, so in theory, there is no overhead. The bus’ clock is driven by the application logic, so its frequency is chosen within an allowed range. Taking a relatively high 150 MHz clock and a 64-bit wide data bus, we have 9.6 Gb/s = 1.2 GB/s in each direction.
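To make the arithmetic above concrete, here is a small C sketch that computes these figures. It is for illustration only: the per-lane numbers are the practical payload rates quoted in this section (roughly 80% of the raw rate), not guaranteed figures, and the Gen3 x8 link and 150 MHz AXI clock are chosen merely as examples.

/* Back-of-the-envelope bandwidth estimates, based on the figures above.
 * Illustration only: the per-lane payload rates are the practical
 * numbers quoted in the text, not guaranteed figures. */

#include <stdio.h>

int main(void)
{
  /* Practical payload rate per lane, in MB/s, for Gen1..Gen4 */
  const int payload_per_lane[] = { 200, 400, 800, 1600 };

  int gen = 3;    /* PCIe generation (1-4), chosen as an example */
  int lanes = 8;  /* link width, chosen as an example */

  printf("PCIe Gen%d x%d: ~%d MB/s payload in each direction\n",
         gen, lanes, lanes * payload_per_lane[gen - 1]);

  /* AXI: data width (in bytes) times the bus clock, no encoding overhead */
  double axi_clock_hz = 150e6;  /* example clock from the text */
  int axi_width_bytes = 8;      /* 64-bit data bus */

  printf("AXI 64-bit @ 150 MHz: %.1f GB/s in each direction\n",
         axi_clock_hz * axi_width_bytes / 1e9);

  return 0;
}

Compiling and running this prints ~6400 MB/s for the Gen3 x8 link and 1.2 GB/s for the AXI example, matching the figures above.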
Realistic expectations
Recent FPGAs allow impressive speeds with PCIe / AXI links, which are attractive when the FPGA uses the host’s memory as its own storage for intermediate data, or for peer-to-peer communication on the bus (e.g. with another FPGA or GPU device), with or without using the host’s memory for buffering the data.
However, Xillybus is often used in applications where the software consumes or produces the data directly: for example, data acquisition or playback applications, where a computer program reads data from a Xillybus pipe and writes it to the disk (or vice versa), or an application that processes and presents the captured data. The FPGA could also add processing power, accepting data from the host for crunching and returning the results. Controlling the FPGA and reading back status information are also common use cases.
What all of these scenarios have in common is that a computer program is involved in the data path. As it turns out in real-life applications, having software handle rates higher than 150-200 MB/s is a challenge, even with today’s computers: if this software does anything more than storing the data in RAM (or fetching data from RAM), multi-threading and other techniques for parallel execution are most likely required.
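To illustrate what having a program in the data path means, below is a bare-bones, single-threaded acquisition loop in C, reading from a Xillybus pipe and writing the data to a file. This is only a sketch: the device file name follows the demo bundles' naming and should be adjusted to the actual IP core, and a real application would need proper error handling, and quite possibly multi-threading, to sustain high rates.

/* Minimal single-threaded acquisition loop: read from a Xillybus pipe
 * and write the data to a file. The device file name is taken from the
 * demo bundles' naming; adjust it to match the actual IP core. */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
  char buf[65536];
  int fd = open("/dev/xillybus_read_32", O_RDONLY);
  int outfd = open("capture.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

  if (fd < 0 || outfd < 0) {
    perror("open");
    exit(1);
  }

  while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));

    if (n <= 0)
      break; /* EOF or error */

    /* Everything done here is on the data path: the write() below,
       plus any processing added later, must keep up with the FPGA's
       data rate, or data backs up in the buffers. */
    ssize_t sent = 0;
    while (sent < n) {
      ssize_t w = write(outfd, buf + sent, n - sent);
      if (w <= 0) {
        perror("write");
        exit(1);
      }
      sent += w;
    }
  }

  close(fd);
  close(outfd);
  return 0;
}

Even this trivial loop moves every byte through userspace buffers and the filesystem; any processing added in the middle only tightens the software bottleneck.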
Unfortunately, the raw PCIe / AXI bandwidths published by hardware vendors may lead to exaggerated expectations of the overall system’s bandwidth, causing disappointment during the integration phase, or requiring a significant effort to parallelize the software and/or the disk storage. It's therefore important to assess the overall system's data handling capacity, rather than looking only at the link between the FPGA and the host.
Xillybus’ actual data rate limit derives primarily from the clock that governs the IP core (bus_clk) and the width of its data processing path. In order to keep logic consumption low, baseline Xillybus IP cores employ a 32-bit wide data path (PCIe’s natural word length). For example, when the clock is 125 MHz (as in all PCIe x4 cores), the theoretical maximum is 4 bytes x 125 MHz = 500 MB/s, but PCIe overhead reduces the actual rate to ~400 MB/s.
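The same calculation, written out as a tiny C sketch (the ~80% efficiency factor is an assumption taken from the figures earlier on this page, not a specification):

/* Rough ceiling of a baseline (32-bit) Xillybus IP core, assuming the
 * ~80% PCIe payload efficiency observed earlier. Illustration only. */

#include <stdio.h>

int main(void)
{
  double bus_clk_hz = 125e6; /* bus_clk of the PCIe x4 cores */
  int width_bytes = 4;       /* 32-bit internal data path */

  double theoretical = bus_clk_hz * width_bytes;  /* 500 MB/s */
  double practical = theoretical * 0.8;           /* ~400 MB/s */

  printf("Theoretical: %.0f MB/s, practical: ~%.0f MB/s\n",
         theoretical / 1e6, practical / 1e6);
  return 0;
}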
For more bandwidth-demanding applications, revision B, XL and XXL IP cores have been released for several FPGA targets. Among other changes, these revisions widen the internal data processing path to 64, 128 and 256 bits, respectively. As they consume more FPGA logic resources, these IP cores are recommended only when the baseline Xillybus IP cores are expected to be the bottleneck, all other factors considered.