General

Revision A (baseline) Xillybus IP cores are available for download since 2010, and have been utilized in projects by hundreds of users worldwide.

Introduced in 2015, Revision B and XL Xillybus IP cores lift the bandwidth limits of the Revision A cores, in order to meet high-end application's increasing demand for data. The new revisions (B and XL) offer a superset of features compared with the revision A, but are functionally equivalent when defined with the same attributes (with some possible performance improvements).

The previous bandwidth limitations were partly due to the 32-bit internal data bus employed in Revision A cores, as well as the width of data bus connecting to the FPGA's PCIe block. The buses of Revision B and XL are wider (as detailed below), which allows the core to transport data faster, at the cost of a slight increase in logic consumption.

When possible, it's highly recommended to test the application logic with a revision A core first, and apply a revision B/XL core once the design is known to work well. This prevents possible bugs in the application logic to be mistakenly attributed to the revision B/XL (which is rarely the cause of problems).

Revision B and XL cores are available for Vivado only, even though it's possible to implement the revision B core binary on ISE too. This will however leave the ECRC issue mentioned below unsolved due to this bug (or apply the patch mentioned in the linked page fixes it -- upgrading the PCIe block causes a port incompatibility).

Revisions B vs. XL

There are two main differences between revision B and revision XL:

  • A revision B core binary is a drop-in replacement for a revision A. In other words, once a revision B core has been downloaded from the IP Core Factory, it may replace a revision A core with no other change of the logic design. Revision XL cores, on the other hand, are based upon a slightly different demo bundle, and can't be dropped into a revision-A/B-based design.
  • Revision B's bandwidth limit is typically twice as fast as its revision A counterpart, while revision XL is typically four times as fast.

The following table summarizes the three currently available revisions and their merits.

RevisionRelated demo bundleBandwidth multiplierMaximal bandwidthInternal data widthWidth of PCIe block busAllowed user interface data widths
Revision ABaselinex 1 (baseline)800 MB/s3264(*)8, 16, 32
Revision BBaselinex 21700 MB/s64648, 16, 32, 64, 128, 256
Revision XLXL bundle(**)x 43500 MB/s1281288, 16, 32, 64, 128, 256

(*) On Xilinx series-7 FPGAs. For other FPGAs, the connection with the PCIe block is 32 or 64 bits wide.

(**) Access to the XL bundle is given at request.

Obtaining a revision B / XL IP core

Revision B / XL IP cores are currently in the late Beta stage, and access to these is therefore granted only at request by email. There is however no particular requirement to obtain this access; this limitation is merely a means be in contact with early adopters.

These cores are generated in the IP Core Factory by invoking the core summary of an existing revision A (baseline) core, and creating a replica of it, targeting a revision B/XL core (see example image below).

Replicate as revision B / XL screenshot

This option is not enabled by default. Rather, users who are interested in using revision B / XL IP cores should apply by email for having their IP Core Factory account set to enable replication to revision B / XL. In the request, the email address registered at the IP Core Factory should be stated, so the correct account can be found.

Using 64, 128 and 256 bits wide interfaces

While revision A cores allow application data widths of 8, 16 and 32 bits, revisions B and XL also allow 64, 128 and 256 bit wide interfaces. The main motivation is to make it possible to utilize the full bandwidth capacity with a single stream. However, as with revision A cores,  it's also possible to divide the bandwidth using several streams (possibly 32 bits wide), so that the aggregate bandwidth utilizes the limit (usually with a 5-10% degradation). Data widths should be chosen as convenient.

The wider interfaces are allowed regardless of whether they help to increase the bandwidth capability for a given configuration. For example, the user application logic may connect with Xillybus' revision B / XL IP core with a 256 bit data bus even if the maximal theoretical data rate through these 256 bits exceeds the PCIe bus' capacity. These data widths are unrelated to the interface signals with the PCIe block.

When using these wider data interfaces, it's important to observe that since the natural data element of the PCIe bus is a 32-bits, some software safeguards that prevent erroneous use of streams do not apply when the data width goes above 32 bits. For example, the seek positions for a seekable 64-bit wide stream must be a multiple of 8 to achieve any meaningful result. The software will however only enforce that it's a multiple of 4.

In conclusion, when the data width is above 32 bits, the application software is more responsible for performing I/O with a full word's granularity. There are no new granularity rules for wider streams, but unlike 32 and 16 bit streams, the driver will not necessarily enforce them.

Attaining maximal bandwidth

Transporting data at high data rates makes the CPU the bottleneck more often than most people believe. It's highly recommended to monitor the CPU consumption of the data handling process as it runs, and verify that it's has a considerable percentage of idle time.

For high performance data transport, the optimal data buffers size used by read() and/or write() calls usually range between 128kB and 1 MB. This affects the operating system's overhead as it processes system calls. When the buffers are smaller, this overhead may become very significant.

Effects related to context switches may become significant too, when the buffers are large, and when multiple streams are used to utilize a total aggregate bandwidth. The buffers' sizes compared with the CPU's cache size can play a role as well.

It's therefore suggested to experiment with these buffers' size to reach the optimal performance on a given system. The DMA buffers' size and count usually have little or no significance, as long as their total storage exceeds 10 ms worth of data at the expected data rate.

ECRC trimming requirement

Xilinx' PCIe block must be configured to trim the ECRC if such is present on incoming TLPs. As ECRC is rarely used in practical systems, failing to meet this requirement is likely to result in a system that works perfectly well on almost all hosts anyhow, and fail completely on the fraction of hosts that employ ECRC. Baseline demo bundles with versions 1.2d and later, as well as all XL demo bundles have this issue configured properly.

Narrow streams

On revision A (baseline) cores, the use of streams with 8 or 16 bits of data width wastes effective data bandwidth with a factor of 4 and 2 respectively. This is not the case with revision B / XL cores: Thanks to a change in the internal data handling, several 8 and 16 bit streams can be used to utilize the full bandwidth together (with a typical degradation of 5-10% due to CPU overhead).

Hence revision B cores may be the preferred choice for applications requiring several narrow streams, even when the total bandwidth is below the relevant revision A core's limit.