4 IP cores of revisions B, XL and XXL
4.1 General
Up to this point, this document has related to the baseline revision of IP cores (revision A), which has been available since 2010. Revisions B and XL were introduced in 2015, adapting to the data bandwidth needs of Xillybus’ users. These cores are gradually replacing revision A.
Revision XXL was introduced in 2019.
The new revisions (B, XL and XXL) offer a superset of features compared with revision A, but are functionally equivalent when defined with the same attributes (with some possible performance improvements).
The most notable differences are:
-
Increased data bandwidth: For any FPGA, IP cores of revisions B, XL, and XXL allow for an aggregate bandwidth of approximately twice, four times, and eight times the bandwidth of revision A, respectively.
Please refer to section 5 of Getting started with Xillybus on a Linux host or Getting started with Xillybus on a Windows host for information on how to attain the bandwidth capabilities of Xillybus IP cores.
-
User interface data widths of 64, 128, and 256 bits are allowed in addition to the already existing options of 8, 16 and 32 bits. These widths are allowed regardless of the width of the data paths between Xillybus’ IP core and the PCIe block in use.
-
The logic design is faster (it is easier to meet the timing constraints), with about 1 ns less delay on the slowest timing path.
-
The consumption of logic is lower in most common cases (see section 4.4).
-
The bandwidth of the PCIe block is utilized efficiently, regardless of the data width of the signals between the IP core and the application logic. This is in contrast with revision A, where streams with 8-bit and 16-bit words are less efficient.
-
On AMD platforms, revisions B, XL and XXL are available only for use with Vivado.
4.2 Working with revision B/XL/XXL
An IP core of revision B/XL/XXL is created in the IP Core Factory by replicating an IP core of revision A.
This is done by clicking on “replicate as revision B / XL / XXL core” as in this screenshot from the IP Core Factory:
Downgrading back to revision A is not possible.
Upgrading to B/XL/XXL is enabled only for users who have requested access to these advanced IP cores. Such requests are made by plain e-mail, using the contact information that is advertised on the website.
There is no particular requirement for obtaining this access; the purpose of this request is merely to maintain closer contact with high-end users.
IP cores of revision B are drop-in replacements for revision A. Hence, the baseline demo bundle for the desired FPGA should be used as a starting point. As this demo bundle arrives with an IP core of revision A, those who desire to work with revision B should configure and download a revision B core from the IP Core Factory.
Revision XL and XXL, on the other hand, require a dedicated demo bundle to work with. These demo bundles should be requested through e-mail.
4.3 Width of data word
While IP cores of revision A allow application data widths of only 8, 16 and 32 bits, revisions B/XL/XXL also allow interfaces that are 64, 128 or 256 bits wide. The main motivation is to make it possible to utilize the full bandwidth capacity with a single stream.
Nevertheless, it’s also possible to divide the bandwidth using several streams (possibly 8, 16 or 32 bits wide), so that the aggregate bandwidth utilizes the full bandwidth capability (possibly with a 5-10% degradation).
Data widths should be chosen to work naturally with the application logic.
The wider word interfaces are allowed regardless of whether they help to increase the bandwidth capability. For example, a word width of 256 bits is allowed on IP cores of revision B, even though 64 bits is enough to utilize the IP core’s bandwidth. These data widths are unrelated to the interface signals with the PCIe block.
When using a word width above 32 bits, it’s important to note that the natural data element of the PCIe bus is 32 bits wide. Hence some safeguards in the driver, which prevent erroneous use of streams, do not apply when the data width is above 32 bits. For example, any call to read() or write() on a stream with a word width of 64 bits must have a length that is a multiple of 8. Likewise, the positions requested by seek operations on a 64-bit wide stream must be multiples of 8 to achieve any meaningful result. The software will, however, only enforce that these are multiples of 4.
In conclusion, when the data width is above 32 bits, the application software is more responsible for performing I/O that is aligned with the word’s width. The rules for word alignment are the same for all word widths, but unlike streams with word widths of 32 and 16 bits, the driver will not necessarily enforce these rules.
4.4 Logic resource consumption
IP cores of revisions B/XL/XXL are optimized for speed and for a slightly lower baseline consumption of logic, at the cost of a steeper increase in logic consumption as the number of streams grows.
In order to quantify the use of logic resources, cores for Kintex-7 with an increasing number of streams were generated. The cores underwent synthesis, and the logic elements were counted. As in section 3.3, the streams were divided evenly: 50% upstreams and 50% downstreams.
The following three charts show the consumption of logic, comparing IP cores of revision A, B and XL with equal settings. All tested streams were 32 bits wide.
Comparing the number of registers and LUTs, revision B outperforms revision A when the number of streams is low, but loses this advantage as the number of streams increases.
Revisions XL and XXL consume more logic than both other revisions in all scenarios.
The chart for block RAMs shows that revisions B and XL both consume twice as many block RAMs as revision A.
The suggested conclusion is that revision B should almost always be preferred over revision A. Even when the comparison of logic consumption favors revision A, the improved timing of revision B outweighs this difference. This is true in most practical scenarios, where the IP core’s logic consumption is negligible compared with the FPGA’s capacity.
Revisions XL and XXL, on the other hand, should be chosen only for applications that require their bandwidth capacity, as they consume more logic and also make it more difficult to meet timing constraints.
4.5 Tuning for optimal bandwidth of stream from host to FPGA
Because IP cores of revisions B, XL and XXL allow for higher data rates, they are more sensitive to proper tuning of the PCIe block’s parameters.
The PCIe blocks in the demo bundles are already set up for optimal performance. However, it might be necessary to make a slight adjustment if the PCIe bus (which is part of the host’s hardware) relays packets with a latency that is longer than normal.
Among FPGAs by AMD, this applies only to IP cores of revisions B or XL, used with Kintex-7 or Virtex-7 with a PCIe block that is limited to Gen2. Revision A doesn’t reach the data rates for which any improvement will be noticed.
In this limited set of cases, it may be required to make adjustments to the parameters of the PCIe block in order to achieve the intended bandwidth on streams from the host to the FPGA.
This may be required because the data flows in the host to FPGA direction by virtue of requests for DMA transfers, which are issued by the FPGA. The host fulfills these requests by sending data. The delay between the requests and the data transmissions that fulfill them (completions) depends on the host’s responsiveness to such requests, and varies from one PCIe bus to another.
In order to make an effective use of the bandwidth that the PCIe bus offers, several DMA requests are issued in parallel by the FPGA, thus ensuring that the host always has a request to handle. There is however a limitation on the number of active requests, which is imposed by the PCIe protocol’s flow control. The limited resource, which is allocated by the flow control mechanism, is called completion credits, and is configured for a PCIe endpoint. Generally speaking, more of these means a larger number of active requests are allowed, and also more resources are required by the FPGA to implement the PCIe block.
The FPGA to host direction is much less affected, if at all, by the allocation of credits, as the FPGA sends the data along with the DMA requests. Hence there is little chance of an improvement from modifying the credit settings, as they have little influence in this direction.
The PCIe block in the demo bundles is configured for optimal bandwidth utilization on common desktop computers, with a processor based upon the x86 architecture. Although such cases are quite uncommon, it may be necessary to alter the configuration in order to attain the advertised bandwidth in the host to FPGA direction.
An improvement may be attained by increasing the number of completion credits (both header and data). This is done in Vivado or Quartus by invoking the configuration of the PCIe block IP and modifying its parameters. For example, for a Kintex-7 configured in Vivado, this is done by selecting the “Core Capabilities” tab and setting the “BRAM Configuration Options”. The “Perf level” is already set to the highest possible, so there’s no room for improvement there. However, enabling “Buffering Optimized for Bus Mastering Application” increases the completion credits at the expense of other types of credits. This may improve the bandwidth performance in the host to FPGA direction, without an adverse effect on the opposite direction.
