Getting started with Xillybus on a Linux host

5 Guidelines for high bandwidth performance

The users of Xillybus’ IP cores often perform data bandwidth tests in order to ensure that the advertised data transfer rates are indeed met. Achieving these goals requires avoiding bottlenecks that may slow down the data flow considerably.

This section is a collection of guidelines based on the most common mistakes. Following these guidelines should result in bandwidth measurements that are equal to or slightly better than the published figures.

It is of course important to follow these guidelines in the implementation of the project that is based on Xillybus, so that this project utilizes the IP core’s full capabilities.

Often the problem is that the host doesn’t process the data quickly enough. Measuring the data rate incorrectly is the most common reason for complaints about not being able to attain the published number. The recommended method is to use the Linux “dd” command, as shown in section 5.3 below.

The information in this section is relatively advanced for a “Getting Started” guide. This discussion also makes references to advanced topics that are explained in other documents. These guidelines are nevertheless given in this guide because many users carry out performance tests at the early stages of getting acquainted with the IP core.

5.1 Don’t loopback

In the demo bundle (inside the FPGA) there is a loopback between the two pairs of streams. This makes the “Hello, world” test possible (see section 3), but this is bad for testing performance.

The problem is that the Xillybus IP core fills the FIFO inside the FPGA very quickly with data transfer bursts. Because this FIFO becomes full, the data flow stops momentarily.

The loopback is implemented with this FIFO, so both sides of this FIFO are connected to the IP core. In response to the existence of data in the FIFO, the IP core fetches this data from the FIFO and sends it back to the host. This too happens very quickly, so the FIFO becomes empty. Once again, the data flow stops momentarily.

As a result of these momentary pauses in the data flow, the measured data transfer rate is lower than expected. This happens because the FIFO is too shallow, and because the IP core is responsible for both filling and emptying the FIFO.

In a real-life scenario there is no loopback. Rather, there is application logic on the FIFO’s other side. Let’s consider the usage scenario that attains the maximal data transfer rate: In this scenario, the application logic consumes the data from the FIFO as quickly as the IP core fills this FIFO. The FIFO is therefore never full.

Likewise for the opposite direction: The application logic fills the FIFO as quickly as the IP core consumes data. The FIFO is therefore never empty.

From a functional point of view, there’s no problem that the FIFO occasionally becomes full or empty. This merely causes the data flow to stall momentarily. Everything works correctly, just not at the maximal speed.

The demo bundle is easily modified for the purpose of a performance test: For example, in order to test /dev/xillybus_read_32, disconnect user_r_read_32_empty from the FIFO inside the FPGA. Instead, connect this signal to a constant zero. As a result, the IP core will think that the FIFO is never empty. Hence the data transfers are performed at the maximal speed.

This means that the IP core will occasionally read from an empty FIFO. As a result, the data that arrives at the host will not always be valid (due to underflow). But for a speed test, this doesn’t matter. If the content of the data is important, a possible solution is for the application logic to fill the FIFO as quickly as possible (for example, with the output of a counter).

Likewise for testing /dev/xillybus_write_32: Disconnect user_w_write_32_full from the FIFO, and connect this signal to a constant zero. The IP core will think that the FIFO is never full, so the data transfer is performed at maximal speed. The data that is sent to the FIFO will be partially lost due to overflow.

Note that disconnecting the loopback allows testing each direction separately. However, this is also the correct way to test both directions simultaneously.

5.2 Don’t involve the disk or other storage

Disks, solid-state drives and other kinds of computer storage are often the reason why bandwidth expectations aren’t met. It is a common mistake to overestimate the storage medium’s speed.

The operating system’s cache mechanism adds to the confusion: When data is written to the disk, the physical storage medium is not always involved. Rather, the data is written to RAM instead. Only later is this data written to the disk itself. It’s also possible that a read operation from the disk doesn’t involve the physical medium. This happens when the same data has already been read recently.

The cache can be very large on modern computers. Several gigabytes of data can therefore flow before the disk’s real speed limitation becomes visible. This often leads users into thinking that something is wrong with Xillybus’ data transport, because they see no other explanation for this sudden change in the data transfer rate.

With solid-state drives (flash), there is an additional source of confusion, in particular during long and continuous write operations: In the low-level implementation of a flash drive, unused segments (blocks) of memory must be erased as a preparation for writing to the flash. This is because data can be written only to a block of flash memory that has been erased.

As a starting point, a flash drive usually has a lot of blocks that are already erased. This makes the write operation fast: There is a lot of space to write the data to. However, when there are no more erased blocks, the flash drive is forced to erase blocks and possibly perform defragmentation of the data. This can lead to a significant slowdown that has no apparent explanation.

For these reasons, testing Xillybus’ bandwidth should never involve any storage medium. Even if the storage medium appears to be fast enough during a short test, this can be misleading.

It’s a common mistake to estimate performance by measuring the time it takes to copy data from a Xillybus device file into a large file on the disk. Even though this operation is correct functionally, measuring performance this way can turn out completely wrong.

If the storage is intended as a part of an application (e.g. data acquisition), it’s recommended to test this storage medium thoroughly: An extensive, long-term test on the storage medium should be made to verify that it meets its expectations. A short benchmark test can be extremely misleading.

5.3 Read and write large portions

Each function call to read() and to write() results in a system call to the operating system. A lot of CPU cycles are therefore required for carrying out these function calls. It’s hence important that the size of the buffer is large enough, so that fewer system calls are carried out. This is true for bandwidth tests as well as for a high-performance application.

Usually, 128 kB is a good size for the buffer of each function call. This means that each such function call is limited to a maximum of 128 kB. However, these function calls are allowed to transfer less data.

It’s important to note that the example programs that were mentioned in section 4.3 (streamread and streamwrite) are not suitable for measuring performance: The buffer size in these programs is 128 bytes (not kB). This simplifies the examples, but makes the programs too slow for a performance test.
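The effect of the buffer size on the number of system calls can be estimated with simple arithmetic. The following Python sketch is only an illustration: It assumes that every call fills its buffer completely, which is the best case, and the function name is made up for this example.

```python
def read_calls_needed(total_bytes, buffer_size):
    """Minimum number of read() calls for moving total_bytes,
    assuming that every call fills its buffer completely."""
    return -(-total_bytes // buffer_size)  # ceiling division


GIB = 1024 ** 3

# 128 bytes per call, as in the example programs
print(read_calls_needed(GIB, 128))         # 8388608

# 128 kB per call, as recommended above
print(read_calls_needed(GIB, 128 * 1024))  # 8192
```

In this estimate, moving the same amount of data with the small buffer takes 1024 times as many system calls.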

The following shell commands can be used for a speed check (replace the /dev/xillybus_* names as required):

dd if=/dev/zero of=/dev/xillybus_sink bs=128k
dd if=/dev/xillybus_source of=/dev/null bs=128k

These commands run until they are stopped with CTRL-C. Add “count=” in order to carry out the tests for a fixed amount of data.
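For those who prefer to measure from within a program, here is a minimal Python sketch of what the second “dd” command does with “count=” added. It is not a replacement for “dd”: The interpreter adds overhead of its own, so the result may be lower than what “dd” reports. The function name and its default values are assumptions made for this example.

```python
import os
import time


def measure_read_rate(path, block_size=128 * 1024, count=1000):
    """Read up to count blocks from path and discard the data.
    Returns (bytes_read, seconds)."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    start = time.monotonic()
    try:
        for _ in range(count):
            # Each call may return less than block_size bytes
            chunk = os.read(fd, block_size)
            if not chunk:  # End of file
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total, time.monotonic() - start
```

A call like measure_read_rate("/dev/xillybus_source") returns the number of bytes and the elapsed time. Dividing the former by the latter gives the data rate in bytes per second. Note that the data is discarded, so no storage is involved.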

5.4 Pay attention to the CPU consumption

In applications with a high data rate, the computer program is often the bottleneck, and not necessarily the data transport.

It’s a common mistake to overestimate the CPU’s capabilities. Contrary to common belief, when the data rate is above 100-200 MB/s, even the fastest CPUs struggle to do anything meaningful with the data. The performance can be improved with multi-threading, but it may come as a surprise that this should be necessary.

Sometimes an inadequate size of the buffers (as mentioned above) can lead to excessive CPU consumption as well.

It’s therefore important to keep an eye on the CPU consumption. A utility program like “top” can be used for this purpose. However, the output of this program (as well as similar alternatives) can be misleading on computers with multiple processor cores (i.e. practically all computers nowadays). For example, if there are four processor cores, what does 25% CPU mean? Is it a low CPU consumption, or is it 100% on a specific thread? If “top” is used, that depends on the version of the program.
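The ambiguity in the example above comes down to one multiplication. This Python sketch assumes that the displayed percentage is relative to the capacity of all cores together. Whether this is the case depends on the tool, so this is an interpretation aid and not a description of how “top” works.

```python
def share_of_one_core(whole_machine_percent, cores):
    """Convert a percentage of the whole machine's CPU capacity
    into a percentage of a single core."""
    return whole_machine_percent * cores


# 25% of a four-core machine may be one thread that
# keeps its core busy all the time
print(share_of_one_core(25, 4))  # 100
```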

Another thing to note is how the processing time of system calls is measured and displayed: If the operating system’s overhead slows down the data flow, how is this measured?

A simple way to examine this is using the “time” utility. For example,

$ time dd if=/dev/zero of=/dev/null bs=128k count=100k
102400+0 records in
102400+0 records out
13421772800 bytes (13 GB) copied, 1.07802 s, 12.5 GB/s

real    0m1.080s
user    0m0.005s
sys     0m1.074s

The output of “time” at the bottom indicates that it took 1.080 seconds for “dd” to complete. Out of this time, the processor spent 5 ms executing the user space program, and 1.074 seconds executing system calls. So in this specific example, the processor was busy with system calls almost all the time. This is not a surprise, because “dd” does no processing of the data here.
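The same three figures can be obtained from within a program. The following Python sketch copies data in blocks of 128 kB and uses os.times() to measure the “user” and “sys” times of the copy loop only. The function name and its parameters are assumptions made for this example. Also note that the Python interpreter itself consumes user space time, so the “user” figure is expected to be higher than with “dd”.

```python
import os
import time


def copy_with_cpu_times(src, dst, block_size=128 * 1024, count=1000):
    """Copy up to count blocks from src to dst. Returns a dict with
    the number of bytes and the real, user and sys times in seconds."""
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY)
    copied = 0
    cpu_start = os.times()
    wall_start = time.monotonic()
    try:
        for _ in range(count):
            chunk = os.read(fd_in, block_size)
            if not chunk:  # End of file
                break
            view = memoryview(chunk)
            while view:  # write() may accept only a part of the data
                written = os.write(fd_out, view)
                view = view[written:]
            copied += len(chunk)
    finally:
        os.close(fd_in)
        os.close(fd_out)
    cpu_end = os.times()
    return {
        "bytes": copied,
        "real": time.monotonic() - wall_start,
        "user": cpu_end.user - cpu_start.user,
        "sys": cpu_end.system - cpu_start.system,
    }
```

For example, copy_with_cpu_times("/dev/zero", "/dev/null") roughly corresponds to the “dd” command above, with less data. If “sys” is close to “real”, the system calls are where the time goes.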

5.5 Don’t make reads and writes mutually dependent

When communication in both directions is required, it’s a common mistake to write a computer program with only one thread. This program usually has one loop, which does the reading as well as the writing: For each iteration, data is written towards the FPGA, and then data is read in the opposite direction.

Sometimes there is no problem with a program like this, for example if the two streams are functionally independent. However, the intention behind a program like this is often that the FPGA should perform coprocessing. This programming style is based upon the misconception that the program should send a portion of data for processing, and then read back the results. Hence the iteration constitutes the processing of each portion of data.

Not only is this method inefficient, but the program often gets stuck. Section 6.6 of the Xillybus host application programming guide for Linux elaborates on this topic, and suggests a more adequate programming technique.
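One possible way to make the two directions independent is to give each direction a thread of its own. The Python sketch below shows this arrangement only as an illustration. It is not necessarily the technique that the guide mentioned above recommends. An operating system pipe is used here as a stand-in for the pair of Xillybus device files, so that the example runs without an FPGA. All function names are made up for this example. With real device files, the condition for ending the reader’s loop depends on the application.

```python
import os
import threading


def writer(fd, payload, block_size=128 * 1024):
    """Write all of payload to fd, then close fd."""
    view = memoryview(payload)
    try:
        while view:
            written = os.write(fd, view[:block_size])
            view = view[written:]
    finally:
        os.close(fd)


def reader(fd, chunks, block_size=128 * 1024):
    """Read from fd until end of file, collecting the data in chunks."""
    try:
        while True:
            chunk = os.read(fd, block_size)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(fd)


def run_both_directions(write_fd, read_fd, payload):
    """Run the writer and the reader concurrently, each in its own thread."""
    chunks = []
    threads = [
        threading.Thread(target=writer, args=(write_fd, payload)),
        threading.Thread(target=reader, args=(read_fd, chunks)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return b"".join(chunks)


# Stand-in for the FPGA: a pipe that returns whatever is written to it
read_fd, write_fd = os.pipe()
result = run_both_directions(write_fd, read_fd, bytes(1_000_000))
print(len(result))  # 1000000
```

If a single thread attempted to write all the data before reading anything back, the write would block as soon as the pipe is full, and the program would never reach the read. This is the same kind of deadlock that a single loop can run into with an FPGA.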

5.6 Know the limits of the host’s RAM

This is relevant mostly to embedded systems and/or when using a revision XL / XXL IP core: There is a limited data bandwidth between the motherboard (or embedded processor) and the DDR RAM. This limitation is rarely noticed in usual usage of the computer. But for very demanding applications with Xillybus, this limit can be the bottleneck.

Keep in mind that each transfer of data from the FPGA to a user space program requires two operations on the RAM: The first operation is when the FPGA writes the data into a DMA buffer. The second operation is when the driver copies this data into a buffer that is accessible by the user space program. For similar reasons, two operations on the RAM are required when the data is transferred in the opposite direction as well.

The separation between DMA buffers and user space buffers is required by the operating system. All I/O that uses read() and write() (or similar function calls) must be carried out in this way.

For example, a test of an XL IP core is expected to result in 3.5 GB/s in each direction, i.e. 7 GB/s in total. However, the RAM is accessed twice as much. Hence the RAM’s bandwidth requirement is 14 GB/s. Not all motherboards have this capability. Also keep in mind that the host uses the RAM for other tasks at the same time.

With revision XXL, even a simple test in one direction might exceed the RAM’s bandwidth capability, for the same reason.
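The calculation in the XL example above can be written as a short Python sketch. The function name and its parameters are assumptions made for this example. The factor of two for the RAM accesses follows from the explanation above (the DMA buffer and then the user space buffer).

```python
def ram_bandwidth_needed(rate_per_direction, directions=2,
                         ram_accesses_per_transfer=2):
    """RAM bandwidth that is required for a given data rate, in the
    same unit as rate_per_direction. Each transfer accesses the RAM
    twice: once for the DMA buffer, once for the user space buffer."""
    return rate_per_direction * directions * ram_accesses_per_transfer


# Revision XL, 3.5 GB/s in both directions simultaneously
print(ram_bandwidth_needed(3.5))                # 14.0

# The same data rate in one direction only
print(ram_bandwidth_needed(3.5, directions=1))  # 7.0
```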

5.7 DMA buffers that are large enough

This is rarely an issue, but still worth mentioning: If too little RAM is allocated on the host for DMA buffers, this may slow down the data transport. The reason is that the host is forced to divide the data stream into small segments. This causes a waste of CPU cycles.

All demo bundles have enough DMA memory for performance testing. This is also true for IP cores that are generated correctly at the IP Core Factory, i.e. when “Autoset Internals” is enabled and “Expected BW” reflects the required data bandwidth. “Buffering” should be set to 10 ms, even though any option is most likely fine.

Generally speaking, this is enough for a bandwidth test: At least four DMA buffers that have a total amount of RAM that corresponds to the data transfer during 10 ms. The required data transfer rate must be taken into account, of course.
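This rule of thumb can be turned into numbers as follows. The Python sketch below only illustrates the arithmetic: The function name is made up for this example, and the sizes that the IP Core Factory actually allocates may be different.

```python
def dma_memory_for(rate_bytes_per_s, buffering_ms=10, buffers=4):
    """Total DMA memory for buffering_ms worth of data, and the size
    of each buffer if this memory is divided into equal buffers."""
    total = rate_bytes_per_s * buffering_ms // 1000
    per_buffer = -(-total // buffers)  # ceiling division
    return total, per_buffer


# A stream of 3.5 GB/s, 10 ms of buffering, four DMA buffers
total, per_buffer = dma_memory_for(3_500_000_000)
print(total, per_buffer)  # 35000000 8750000
```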

5.8 Use the correct width for the data word

Quite obviously, the application logic can transfer only one word of data to the IP core for each clock cycle inside the FPGA. Hence there is a limit on the data transfer rate because of the data word’s width and bus_clk’s frequency.

On top of that, there is a limitation that is related to IP cores with the default revision (revision A IP cores): When the word width is 8 bits or 16 bits, the PCIe’s capabilities are not used as efficiently as when the word width is 32 bits. Applications and tests that require high performance should therefore use 32 bits only. This does not apply to revision B IP cores and later revisions.

The word width can be up to 256 bits starting with revision B. The word should be at least as wide as the PCIe block’s width. Hence for a data bandwidth test, these data word widths are required:

Default revision (Revision A): 32 bits.
Revision B: At least 64 bits.
Revision XL: At least 128 bits.
Revision XXL: 256 bits.

If the data word is wider than required above (when possible), slightly better results are usually achieved. The reason is an improvement of the data transfer between the application logic and the IP core.
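The limit that is imposed by the data word’s width and bus_clk’s frequency (as mentioned at the beginning of this section) is a simple product. In the Python sketch below, the frequency of 250 MHz is a hypothetical value that was chosen for the sake of the example. The actual frequency of bus_clk depends on the bundle that is used.

```python
def max_rate_bytes_per_s(word_width_bits, bus_clk_hz):
    """Upper limit on the data rate when one word is transferred
    on each cycle of bus_clk."""
    return word_width_bits // 8 * bus_clk_hz


# Hypothetical bus_clk frequency, for illustration only
BUS_CLK_HZ = 250_000_000

print(max_rate_bytes_per_s(32, BUS_CLK_HZ))   # 1000000000
print(max_rate_bytes_per_s(128, BUS_CLK_HZ))  # 4000000000
```

This is only the limit between the application logic and the IP core. The data rate that is attained in practice is also limited by the PCIe link and the other issues in this section.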

5.9 Slowdown due to cache synchronization

This issue does not apply to computers that are based upon CPUs that belong to the x86 family (32 and 64 bits). Those who use Xillybus with the AXI bus of a Zynq processor (e.g. with Xillinux) can also ignore this topic.

However, several embedded processors require an explicit synchronization of the cache when DMA buffers are used. This slows down the data transfer with the CPU’s peripherals considerably.

This problem is not specific to Xillybus: Similar behavior is observed with all I/O that is based upon DMA, e.g. Ethernet, USB and other peripherals.

A slowdown because of the cache can be revealed by looking at the CPU consumption. If the CPU spends an unreasonable amount of time in the system call state (the “sys” row in the output of the “time” utility), this may indicate that the cache is the problem. This happens because the CPU is spending a lot of time performing cache synchronization.

However, it’s important to first rule out the possibility of small buffers (as mentioned in sections 5.3 and 5.7 above).

This problem never happens with the x86 family because these CPUs have a coherent cache. Hence no cache synchronization is required. The same applies to Xillinux, because the IP core is connected to the CPU through the ACP port.

But when a Zynq processor uses Xillybus with the PCIe bus, this problem occurs. Several other embedded processors are also affected, in particular ARM processors.

5.10 Tuning of parameters

The parameters of the PCIe block in the demo bundles are chosen in order to support the advertised data transfer rate. The performance is tested on a typical computer with a CPU that belongs to the x86 family.

Also, the IP cores that are generated in the IP Core Factory usually don’t need any fine-tuning: When “Autoset Internals” is enabled, the streams are likely to have the optimal balance between performance and utilization of the FPGA’s resources. The requested data transfer rate is hence ensured for each stream.

It is therefore almost always pointless to attempt fine-tuning the parameters of the PCIe block or the IP core. With the default revision of IP cores (revision A) such tuning is always pointless. If such tuning improves performance, it’s very likely that the problem is a flaw in the application logic or in the user application software. In this situation, there is much more to gain by correcting this flaw.

However, in rare scenarios that require exceptional performance, it might be necessary to tune the PCIe block’s parameters slightly in order to attain the requested data rates. This is relevant in particular for streams from the host to the FPGA. Section 4.5 of The guide to defining a custom Xillybus IP core discusses how to perform this tuning.

Note that even when this fine-tuning is beneficial, it’s not the Xillybus IP core’s parameters that are modified. Only the PCIe block is adjusted. It’s a common mistake to attempt improving the data transfer rate by tuning the IP core’s parameters. Rather, the problem is almost always one of the issues that have been mentioned above in this chapter.