2 Defining custom IP cores
2.1 Overview
The IP Core Factory is a wizard-like web application for defining a custom IP core from scratch, or using the configuration of the demo bundle’s core as a starting point.
For the vast majority of purposes, it’s recommended to rely on the IP Core Factory to set the attributes of each stream, by keeping the “Autoset internals” option enabled. It’s a very common mistake to turn this option off in order to tweak the stream’s parameters, which almost always leads to worse performance.
In particular, if the IP core fails to meet the expected data rate performance, there’s a good chance that the problem is elsewhere. In this case, it’s recommended to refer to one of these two guides, which discuss how to achieve the IP core’s full performance:
-
Section 5 in Getting started with Xillybus on a Linux host
-
Section 5 of Getting started with Xillybus on a Windows host
Another common mistake is to turn “Autoset internals” off in order to adjust the size of the DMA buffers so it matches with the size of the data packets that are intended for transmission. This is discussed in section 2.7.
These are a few additional points worth emphasizing when using this tool:
-
The FPGA family for which the IP core is intended must be selected correctly, since the IP core is delivered as a netlist.
-
It’s important to set each device file’s “use” attribute to the description that matches the intended purpose. This ensures that the stream’s attributes are set up correctly.
-
For XillyUSB IP cores, the “Expected bandwidth” attribute should be set accurately to the maximal bandwidth that the stream will request, as the data rate is limited to that value. For other variants (PCIe and AXI), this attribute only affects performance tuning. Realistic numbers should be applied, rather than attempting to obtain better results by exaggerating the requirements. Such exaggeration may degrade the performance of other streams that actually need certain limited resources.
The rest of this section discusses some of the device files’ attributes.
2.2 The device file’s name
Each stream is assigned a name, which is used as the name of the device file that is created on the host.
The names always take the form xillybus_*, e.g. xillybus_mystream. For XillyUSB, the name is like xillyusb_NN_*, where NN is an index – typically two zeros when only one XillyUSB device is connected to the host.
On a Linux system, the stream is opened as a plain file, e.g. /dev/xillybus_mystream. On Windows, the same stream appears as \\.\xillybus_mystream.
A device file can represent two streams in opposite directions; this is simply two streams that happen to share the device file’s name. These two streams can be opened separately, one in each direction, or together by opening the file for read-write. This feature should generally be avoided to prevent confusion, but it is useful when the device file is passed to software that expects a bidirectional pipe.
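As an illustrative sketch, opening a stream’s device file on a Linux host is done with the plain open() system call. The helper function below and the stream name in the usage example are only illustrations, not part of any Xillybus API:

```c
#include <fcntl.h>
#include <stdio.h>

/* Open a stream's device file with the given flags (O_RDONLY,
   O_WRONLY, or O_RDWR for a bidirectional device file).
   Returns the file descriptor, or -1 on error. */
int open_stream(const char *path, int flags)
{
    int fd = open(path, flags);

    if (fd < 0)
        perror(path); /* Report why the open failed */

    return fd;
}
```

A hypothetical usage would be open_stream("/dev/xillybus_mystream", O_RDONLY), with O_RDWR in place of O_RDONLY for a device file that represents two streams in opposite directions.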
2.3 Data width
The data width is the number of bits of the word that is fetched from or written to the FIFOs in the FPGA. The allowed choices are 32, 16 or 8 bits. Wider data widths are allowed with Xillybus IP cores of revision B/XL/XXL (these are discussed in section 4), as well as with XillyUSB.
When high bandwidth performance is required on a stream, and when the IP core’s revision is A for PCIe or any IP core for AXI, the data width must be set to 32 bits: There’s a significant performance degradation for 16 and 8-bit data width, leading to inefficient use of the underlying transport (e.g. the PCIe bus transport).
The reason is that the words are transported through Xillybus’ internal data paths at the rate of the bus clock. As a result, transporting an 8-bit word takes the same time slot as a 32-bit word, making it effectively four times slower.
This also impacts other streams competing for the underlying transport at a given time, since the data paths become occupied with slower data elements.
Later revisions of the IP core, as well as XillyUSB, have a different internal data path structure, and hence don’t have this limitation.
Regardless, it’s good practice to perform I/O operations in the host application with a granularity that matches the data width, e.g. call the functions read() and write() with data lengths that are a multiple of 4, if the data width is 32 bits.
A poor choice of data width may lead to undesired behavior. For example, if a link from the host to the FPGA is 32 bits wide, writing 3 bytes of data at the host will make the driver wait indefinitely for the fourth byte before sending anything to the FPGA.
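The granularity advice above can be captured in a small helper that trims an I/O length down to whole words before calling write(). The function name is illustrative, not part of any Xillybus API:

```c
#include <stddef.h>

/* Round a byte count down to a multiple of the stream's data width,
   so that read()/write() lengths cover whole words. For example, a
   3-byte write on a 32-bit stream would otherwise leave the driver
   waiting indefinitely for the fourth byte. */
size_t align_to_width(size_t nbytes, size_t width_bytes)
{
    return nbytes - (nbytes % width_bytes);
}
```

With a 32-bit data width, align_to_width(n, 4) yields the largest length not exceeding n that consists of complete words; any remainder should be carried over to the next function call.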
2.4 Use
The “Use” attribute helps the tool that produces the IP core to give each stream the properties that are most suitable for the intended application.
It is important to set “Use” to the option that best matches the stream’s purpose in order to obtain the best possible performance.
This is a brief description for each option:
-
Frame grabbing / video playback: Select this if the stream is intended for video data applications.
-
Data acquisition / playback: Select this if you intend to connect the stream to a DAC/ADC or other device that continuously produces or consumes data.
-
Data exchange with coprocessor: Select this if the stream is used for hardware acceleration (i.e. when the FPGA is used to carry out tasks instead of the CPU for the purpose of improving performance). This choice is suitable when the stream needs a high data rate, but the data flow is allowed to occasionally stop briefly.
-
Bridge to external hardware: This option is suitable when the FPGA controls external hardware with the help of the stream. For example, if the data in the stream contains the firmware for another component.
-
Data for in-silicon logic verification: Choose this option if the stream is used to transport application data to or from logic for the purpose of verifying this logic’s proper functionality.
-
Command and status: Select this if the stream is intended for sending commands to the FPGA or collecting information about the FPGA’s status.
-
Short message transport: This option is suitable if the stream contains short segments of information, possibly for the purpose of sending messages.
-
Address / data interface: Select this if you want to be able to use lseek() with the stream. When this option is chosen, an address output is added to the interface on the FPGA side.
This option is available in three variations, each variation offering a different number of address wires: 5, 16 or 32.
The topic of seekable streams is explained further in Xillybus FPGA designer’s guide.
-
General purpose: This option should be selected if none of the above fits your application.
When “Autoset internals” is used, the tool determines whether the stream is synchronous or asynchronous, depending on which of the options above was chosen. The stream is synchronous if one of these options is chosen: Command and status, Short message transport, Address / data interface or Bridge to external hardware. The stream is asynchronous for all other options.
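The “Address / data interface” option above can be illustrated with a sketch that combines lseek() with an ordinary read(). This assumes the lseek() offset is counted in bytes, and the helper’s name is hypothetical; lseek() and read() are the actual interface:

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Read len bytes starting at a given byte address of a seekable
   stream. Returns 0 on success, -1 on error. */
int read_at(int fd, off_t addr, void *buf, size_t len)
{
    /* Set the address for the subsequent access */
    if (lseek(fd, addr, SEEK_SET) == (off_t) -1)
        return -1;

    /* Fetch the data from that address */
    return (read(fd, buf, len) == (ssize_t) len) ? 0 : -1;
}
```

A write_at() counterpart would be symmetric, with write() instead of read().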
2.5 Synchronous or asynchronous stream
This attribute is set automatically when the “Autoset internals” option is selected, based upon the selection of the “use” setting.
In most cases, asynchronous streams are suitable for a continuous data flow, and synchronous streams are suitable for commands, control data and obtaining status information.
For synchronous streams, all I/O (including the data flow in the FPGA) takes place only between the invocation and return of function calls to read() or write(). This gives full control of what happens when, but leaves the data transport resources unused while the CPU is doing other things. It’s recommended to read the elaboration on this subject in section 2 of the two Getting started guides listed in section 2.1 above.
Working with synchronous streams makes the software programming more intuitive, but has a negative impact on bandwidth utilization. With asynchronous streams, it’s possible to maintain a continuous data flow, even when the operating system takes the CPU away from processes that run in user space for certain periods of time.
To summarize this subject, these are the guiding questions:
-
For downstreams (host to FPGA): Is it OK that a write() operation returns before the data has reached the FPGA?
-
For upstreams (FPGA to host): Is it OK that the Xillybus IP core begins fetching data from the user application logic in the FPGA before a read() operation in the host requests it?
If the answer to the respective question is no, a synchronous stream is needed. Otherwise, the asynchronous option is usually the preferred choice, along with the understanding that there is less control of the data flow, and that it’s slightly less intuitive.
2.6 Buffering time
Xillybus maintains an illusion of a continuous stream of data between the FPGA and the host. The existence of DMA buffers is transparent to the user application logic in the FPGA as well as the application software on the host. They are of interest only to control the efficiency of the data flow and its ability to remain continuous, in particular at high data rates.
Applications like data acquisition and data playback require a continuous flow of data at the FPGA, or data is lost. To maintain this flow, the user space application needs to make function calls to read() or write() frequently enough to prevent the DMA buffers from becoming full or empty (respectively) due to the FPGA’s activity.
There is however a problem with ensuring that these function calls are made frequently enough: Common operating systems, such as Linux and Windows, may deprive any user-space application of the CPU for theoretically arbitrary periods of time. The FPGA keeps filling or draining the driver’s buffers regardless. The DMA buffers must therefore be large enough to maintain a continuous data flow despite such momentary deprivations of the CPU.
For the sake of the discussion here, buffering time is the amount of time that it takes for a stream to change from the state where all DMA buffers are empty, to the state where all DMA buffers are full, when the data fills the buffers at the rate for which the stream is intended (and they are not drained during that time).
When setting up a Xillybus stream with “Autoset internals” enabled (which is recommended), a selection box titled “Buffering” appears in the web application. This is where the desired buffering time is selected.
For an asynchronous stream that needs to retain its continuity, the selected time should reflect the expected maximal time that the CPU can be taken away from the user space application.
Choosing “Maximum” tells the algorithm that allocates buffers to attempt allocating as much RAM as possible, with just a little consideration for the other streams.
Given a desired buffering time t and an expected bandwidth W, the algorithm will attempt to allocate a total amount of RAM, M, for the driver’s DMA buffers, based upon this formula:
M = t × W
The actual buffer sizes are however always a power of 2 (2^N). It may also turn out impossible to allocate enough memory to meet the desired buffering time.
It is therefore important to look up the allocated buffer size in the IP core’s README file, and verify that it’s acceptable to work with. Setting the buffer size manually (i.e. turning off “Autoset internals”) may be necessary to force a distribution of RAM among the streams that is more suitable for the intended application.
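The calculation above can be sketched as follows. This is an illustrative back-of-the-envelope helper, not the actual allocation algorithm, which also weighs the requirements of the other streams:

```c
#include <stddef.h>

/* Total DMA RAM target per the formula M = t * W:
   t in seconds, W in bytes per second. */
size_t total_dma_bytes(double t_seconds, double bytes_per_second)
{
    return (size_t) (t_seconds * bytes_per_second);
}

/* The actual DMA buffer sizes are always 2^N bytes, so round a
   requested size up to the next power of 2. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;

    return p;
}
```

For example, a buffering time of 10 ms at 100 MBytes/s yields a target of 1 MByte, which is then distributed among power-of-2-sized buffers.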
2.7 Size of DMA buffers
It’s recommended to let the tools set up the DMA buffers’ parameters automatically by enabling “Autoset internals” in the web application (see section 2.6 above). In some scenarios, the automatic setting may be unsuitable for the application, in which case it’s possible to set the size and number of the DMA buffers manually.
For asynchronous streams, the buffers’ parameters have a significant impact, which is discussed in the section named “Continuous I/O at high rate” in the two Getting started guides listed in section 2.1 above.
There is no need whatsoever to adapt the size of the DMA buffers to the size of the intended function calls to read() and write(). As explained in these two guides, the size of the DMA buffers is irrelevant and transparent to function calls to read() and write(). In particular, a function call to read() returns immediately if enough data has reached the IP core in the FPGA (regardless of the DMA buffer’s fill level). This is thanks to a mechanism between the FPGA and the host that allows the FPGA to submit a partly filled DMA buffer. This mechanism is used when it helps to immediately complete a function call to read().
Likewise, data from the host to the FPGA can be assured to reach the FPGA immediately by virtue of an explicit request.
It’s a common mistake to make a connection between the size of the DMA buffers and the pattern of the intended data exchange. With Xillybus, there is no need for that, which is once again why “Autoset internals” is the preferred choice for setting the DMA buffers’ size.
For the sake of continuity, more RAM is better, as the total amount of space in the DMA buffers keeps the flow of data continuous even when the application is deprived of the CPU. Making a correct decision involves other factors, which are detailed in the programming guides referenced above.
However, when the total size of the DMA buffers is excessively large, there’s a risk of a buffering delay, which results from the ability to store a large amount of data: When one side fills the buffers faster than the other side empties them, data may arrive at the other end after a significant amount of time. This can be controlled with the technique mentioned in Xillybus FPGA designer’s guide, in the section named “Monitoring the amount of buffered data”.
For XillyUSB IP cores, there is one buffer for each stream, which functions as a large FIFO that is managed by the driver. Other IP cores (PCIe and AXI) maintain several DMA buffers for each stream, so both their size and number are defined. The effective size of the DMA buffers is hence the size of each DMA buffer multiplied by their number.
Accordingly, if “Autoset internals” is turned off for IP cores that are based upon PCIe / AXI, there is a need to specify the number of DMA buffers and the size of each. The following considerations should be made:
-
The size of each DMA buffer has a significance of its own in streams from the host to the FPGA: The data is sent to the FPGA when these buffers are full (unless a flush is explicitly requested by the software, or the stream is idle for 10 milliseconds). The size of each DMA buffer therefore has an impact on the typical latency of the flowing data.
-
For slow streams (less than 10 MBytes/s), the recommended number of DMA buffers is 4. When higher bandwidths are required, the number of buffers is chosen to achieve a suitable overall DMA buffer allocation. The suitable number of DMA buffers for high-bandwidth streams is between 16 and 64, provided this allows each buffer to be 128 kBytes or less.
-
The total allocation of DMA buffers for all streams together shouldn’t exceed 512 MBytes, unless an enhanced driver is used on the host. Otherwise, the operating system may refuse to allocate more than this, leading to a failure in the driver’s initialization.
-
Each time a buffer is filled, a hardware interrupt is sent to the host. Given the expected data rate, the rate of interrupts should be calculated and kept at a level that the processor can reasonably handle (no more than a few thousand per second).
-
The size of each DMA buffer should not exceed 128 kBytes when the desired total size can be reached by increasing the number of buffers instead.
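As an illustration of the interrupt rate consideration above, the expected rate is approximately the stream’s bandwidth divided by the size of one DMA buffer:

```c
/* An interrupt is raised each time a DMA buffer is filled, so at a
   sustained data rate the interrupt rate is roughly the bandwidth
   divided by the size of one buffer. */
double interrupt_rate(double bytes_per_second, double buf_bytes)
{
    return bytes_per_second / buf_bytes;
}
```

For example, 200 MBytes/s with 128 kByte buffers yields roughly 1500 interrupts per second, which is well within the reasonable range.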
The issue of the driver’s DMA buffers is less significant for synchronous streams. For such streams, the rule of thumb is that the total RAM allocated for buffers on behalf of a stream should be in the order of magnitude of the data lengths of the intended function calls to read() and write(). As already said above, there is no need to adapt the size of the buffers to these function calls, but there is rarely any point in wasting kernel RAM by making them larger than that.
2.8 DMA acceleration
With IP cores that are based upon PCIe, streams from the host to the FPGA may require acceleration of the DMA data transfers.
To accomplish data exchange in this direction, the PCIe bus protocol states that the FPGA should issue requests for data from the host and wait for the data to arrive. An inherent delay occurs as the request travels on the bus, is queued and handled by the host, and the data travels back. This turnaround time gap causes some degradation in the bus’ efficiency, sometimes reducing the bandwidth of a single stream to as low as 40%.
To work around this issue, multiple data requests are sent, so that the host always has a request in its queue during continuous transmissions. Since data from different requests may arrive in random order, it must be stored in RAM buffers on the FPGA to present an ordered flow of data to the application logic.
Each buffer in the FPGA is used to store a segment of requested data. The currently possible settings for DMA acceleration are:
-
None. No data is stored on the FPGA. Each request for data is sent only when all data has arrived from the previous one.
-
4 segments of 512 bytes each. 2048 bytes of block RAM is allocated on the FPGA. Up to four data requests can be active at any given moment.
-
8 segments of 512 bytes each. 4096 bytes of block RAM is allocated on the FPGA. Up to eight data requests can be active at any given moment.
-
Revision B and later IP cores have an option of 16 segments of 512 bytes each as well.
The turnaround time between a request and the data arrival depends on the host’s hardware. The actual bandwidth performance may therefore vary.
When using “Autoset internals” in the IP Core Factory, the automatic allocation of acceleration resources is based upon measured results on typical PC hardware, and may need manual refinement in rare cases.
