The achieved bandwidth and latency are often a concern when designing a mixed host / FPGA system, and their relation with the underlying infrastructure is not always obvious.
This page is intended to clarify what can be expected, and to explain the limits and the reasons behind them. A separate page explains how to achieve the best possible performance.
This page discusses bandwidth only. Latency issues are elaborated on a separate page.
Note that the maximal bandwidths for different demo bundles are listed on the download page for PCIe.
The upper limit
This is a short theoretical outline of the maximal bandwidths that can be achieved on PCIe and AXI links.
PCIe: A single Gen1 PCIe lane has a 2.5 GT/s physical bandwidth, which is effectively reduced to a 2.0 Gb/s bandwidth due to the 8B/10B encoding. Dividing by eight, we obtain the raw rate in bytes, 250 MB/s.
Real-life tests show that the actual data payload bandwidth is ~200 MB/s (in each direction), mostly due to overhead of the data link layer and TLP packet headers.
This figure has been observed to grow proportionally with the number of lanes and the physical rate. In other words, a 4x Gen1 link can be expected to give up to 800 MB/s of real data payload. Likewise, each Gen2 lane (which runs at a 5.0 GT/s physical bandwidth) contributes 400 MB/s to the payload data rate, and each Gen3 lane (8.0 GT/s) contributes 640 MB/s. So a Virtex-7 device with 8x Gen3 lanes can be expected to reach 5.12 GB/s, assuming that the infrastructure (bus, switches and peer) can sustain that rate. Again, this is in each direction.
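The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are made up for this example, and the 200 MB/s per Gen1 lane figure is the observed value quoted above, not something derived from the PCIe specification.

```python
# Per-lane physical rates in GT/s, as quoted in the text.
GEN_RATE_GT_S = {1: 2.5, 2: 5.0, 3: 8.0}

# Observed payload rate of a single Gen1 lane in MB/s (figure from the text).
GEN1_LANE_PAYLOAD_MB_S = 200.0


def raw_gen1_lane_mb_s():
    """Raw byte rate of one Gen1 lane: 2.5 GT/s, 8 data bits per 10 line bits."""
    bits_per_second = 2.5e9 * 8 / 10      # 2.0 Gb/s after 8B/10B encoding
    return bits_per_second / 8 / 1e6      # bits -> bytes -> MB/s


def payload_estimate_mb_s(gen, lanes):
    """Scale the Gen1 per-lane payload figure by physical rate and lane count."""
    scale = GEN_RATE_GT_S[gen] / GEN_RATE_GT_S[1]
    return GEN1_LANE_PAYLOAD_MB_S * scale * lanes


print(raw_gen1_lane_mb_s())          # 250.0
print(payload_estimate_mb_s(1, 4))   # 800.0
print(payload_estimate_mb_s(3, 8))   # 5120.0
```

Note that this simply applies the proportional scaling described above. It is an estimate of the upper limit, not a measurement.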
AXI: The AMBA bus has separate signals for address and data, so in theory, there is no overhead. The bus’ clock is driven by application logic, so its frequency is chosen within an allowed range. Taking a relatively high 150 MHz clock and a 64-bit wide data bus, we have 9.6 Gb/s = 1.2 GB/s in each direction.
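The AXI figure is a plain multiplication of clock frequency by bus width, which can be written as the following sketch. The function name is hypothetical, and the calculation assumes one data word per clock cycle with no overhead, as in the theoretical case described above.

```python
def axi_peak_mb_s(clock_hz, data_width_bits):
    """Theoretical peak rate in MB/s: one data word transferred per clock cycle."""
    bits_per_second = clock_hz * data_width_bits
    return bits_per_second / 8 / 1e6


# 150 MHz clock, 64-bit data bus (the example from the text)
print(axi_peak_mb_s(150_000_000, 64))   # 1200.0, i.e. 1.2 GB/s
```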
So why doesn’t Xillybus meet these figures?
The short answer: Xillybus goes as high as there is demand for.
And now to the longer answer.
Recent FPGAs allow impressive speeds (outlined above) with PCIe / AXI links, which are attractive when the FPGA uses the host’s memory as its own storage for intermediate data, or for peer-to-peer communication on the bus (e.g. with another FPGA or GPU device), with or without using the host’s memory for buffering the data.
Xillybus is, however, primarily intended for applications in which the software consumes or produces the data directly. Examples are data acquisition or playback applications, where a computer program reads data from a Xillybus pipe and writes it to the disk (or vice versa), or an application that processes and presents the captured data. The FPGA could also add processing power, accepting data for crunching from the host and returning the results. Controlling the FPGA and getting back status are also eligible use cases.
What all of these scenarios have in common is that a computer program is involved in the data path. As it turns out in real-life applications, having software work at rates higher than 150-200 MB/s is a challenge, even with today’s computers.
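To get a feeling for what a program in the data path actually sustains, the read rate can be measured directly. The following is a minimal sketch, not part of Xillybus: the function name is hypothetical, and an in-memory stream stands in for the real data source (which would typically be a file object opened in binary mode).

```python
import io
import time


def measure_read_rate(stream, total_bytes, chunk_size=1 << 20):
    """Read up to total_bytes from a binary stream; return (bytes_read, MB/s)."""
    remaining = total_bytes
    bytes_read = 0
    start = time.perf_counter()
    while remaining > 0:
        data = stream.read(min(chunk_size, remaining))
        if not data:              # end of stream reached early
            break
        bytes_read += len(data)
        remaining -= len(data)
    elapsed = time.perf_counter() - start
    rate = bytes_read / elapsed / 1e6 if elapsed > 0 else float("inf")
    return bytes_read, rate


# Stand-in for a real data source: 8 MiB of zeros held in memory.
source = io.BytesIO(bytes(8 * 1024 * 1024))
count, rate = measure_read_rate(source, 8 * 1024 * 1024)
print(f"read {count} bytes at {rate:.0f} MB/s")
```

A memory-backed stream will of course report a far higher rate than a real application would. The point of such a test is to add the actual processing or disk writes inside the loop and see how far the rate drops.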
Unfortunately, the raw PCIe / AXI bandwidths published by hardware vendors may lead to exaggerated expectations of the overall system’s bandwidth. The result is either disappointment during the integration phase, or a significant effort to parallelize software and/or disk storage. Given the complexity of systems that can handle data beyond Xillybus’ capacity, the advantages that Xillybus offers, in terms of a simple and clean design, fade in view of the trickery necessary to keep the data flowing fast enough.
Xillybus’ actual data rate limit is derived primarily from the clock that governs the IP core (bus_clk) and the width of its data processing path, which is 32 bits (PCIe’s natural word length). For example, when the clock is 125 MHz (as in all PCIe x4 cores), the theoretical maximum is 4 x 125 = 500 MB/s, but PCIe overhead reduces the actual rate to ~400 MB/s.
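The same calculation can be written as a short sketch. The function names are hypothetical, and the 0.8 efficiency factor is merely the ratio between the ~400 MB/s and 500 MB/s figures quoted above, not a value taken from any specification.

```python
# Ratio of actual to theoretical rate, inferred from the figures in the text
# (~400 MB/s actual vs. 500 MB/s theoretical). An approximation only.
PCIE_EFFICIENCY = 0.8


def core_theoretical_mb_s(bus_clk_mhz, width_bits=32):
    """Theoretical maximum: one word of width_bits per bus_clk cycle."""
    return bus_clk_mhz * width_bits / 8


def core_expected_mb_s(bus_clk_mhz, width_bits=32):
    """Theoretical maximum reduced by the approximate PCIe overhead."""
    return core_theoretical_mb_s(bus_clk_mhz, width_bits) * PCIE_EFFICIENCY


print(core_theoretical_mb_s(125))   # 500.0
print(core_expected_mb_s(125))      # approximately 400
```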
Revision B and XL IP cores have been released for several FPGA targets in order to allow for higher bandwidths as well. This has been done, among other things, by widening the internal data processing path to 64 and 128 bits. This makes these cores consume slightly more logic, but allows them to reach twice and four times as much bandwidth, respectively.