Guidelines for high bandwidth applications and tests

Note: The content of this page is now included in the following documents:

Getting started with Xillybus on a Linux host (as chapter 5).
Getting started with Xillybus on a Windows host (as chapter 5).

Accordingly, this web page is due for removal.

Introduction

The maximal bandwidth Xillybus' IP cores offer for each target is published on the download page. Achieving these figures in a real application or performance test requires attention to some details, or lower rates may be observed.

This page is a collection of guidelines, which is based upon common mistakes. Following them should result in bandwidth measurements equal to or slightly higher than those published.

By far, the most common reason for complaints about not meeting the expected bandwidth is measuring incorrectly. The recommended method is using the Linux "dd" command as shown in the examples below. This utility is also available as a compiled executable in the driver bundle for Windows.

1. Don't loopback

The logic in the demo bundles implements a loopback of the data between pairs of streams going in opposite directions. While this might be helpful for initial tests, it's a bad setting for any performance test. A significant performance hit is expected as a result of the FPGA's FIFO being almost empty or almost full most of the time, causing the Xillybus IP core to perform less than optimally. In a real-life maximal rate scenario, the application logic is expected to fill the FPGA's FIFO faster than the Xillybus IP core drains it, so the FIFO is never expected to be empty. And vice versa: If the data goes in the other direction, the application logic is expected to empty the FIFO faster than the IP core fills it, so it's never full.

Of course it's fine, from a functional point of view, that the FIFO gets empty in the first case, and full in the second. This causes the Xillybus stream to stall momentarily. All works fine, just not at the maximal rate.

If a performance test is desired based upon the demo bundle, it can be modified quite easily. For example, for testing the 32-bit interface, disconnect the user_r_read_32_empty and user_w_write_32_full signals from the (fifo_32) FIFO, and tie them to constant zero. This will cause the FIFO to possibly overflow and underflow, leading to erroneous data, but the data rates will be optimal. If it's desired to work with valid data in such a performance test, any arrangement that makes sure that these signals are never asserted is fine.

Note that breaking the loopback allows testing each stream individually, but it's also the correct way to test both directions simultaneously, for the reason mentioned above.

2. Don't involve the disk or other storage

Disks, solid-state drives and other kinds of computer storage are often the reason why bandwidth expectations aren't met. The operating system's caching mechanism adds to the confusion, as it allows a short-range burst of fast data transport, which doesn't involve the underlying physical media. As the cache can be very large on modern computers, several GBs of data can flow before the actual media's speed limit kicks in. This can lead users into thinking that something went wrong with the Xillybus stream, as nothing apparent happened otherwise.

Solid-state (flash) drives have an additional speed anomaly, in particular during long, continuous writes: Writing to a flash drive involves erasing unused blocks. A flash drive usually has a pool of erased blocks at any time, which makes the write operation faster while these blocks are filled. Once they're exhausted, a significant slowdown may be experienced as the flash drive is forced to erase blocks and possibly reorganize its data.

For these reasons, testing Xillybus' bandwidth should never involve any storage media, even if a quick check makes it appear like the media is fast enough. A common mistake is to attempt copying a large file into a Xillybus stream, and measure the time it takes. Even though this operation is correct functionally, the performance measurement can turn out completely off.

If the storage is intended to be part of an application (e.g. data capture), it's recommended to run extensive, long-term tests on the media to verify that it meets expectations. Short benchmark tests can be fatally misleading.

3. Read and write large chunks

Each read() and write() operation produces a system call on the operating system, which takes its toll in CPU cycles. It's therefore important to use a buffer size that is large enough for a bandwidth test, or a high-performance application for that matter.

A common good buffer size is around 128 kB for each read() and write(). It doesn't mean that the actual amount of data for each read() or write() call is 128 kB, but that it may be that much.

It's important to note that the streamread and streamwrite sample utilities (part of the starter toolkit for Xillybus) are not good for measuring performance, since the buffer size in these is 128 bytes (not kB). This simplifies the utility example, but makes it too slow for a performance test.
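The effect of the buffer size can be observed without any Xillybus hardware. The following is a minimal sketch, not a measurement of Xillybus itself: It moves 64 MiB from /dev/zero to /dev/null twice, once with 128-byte blocks and once with 128 kB blocks, so the only difference is the number of system calls. The function names are made up for this illustration.

```shell
# Same amount of data (64 MiB) in both cases; only the block size differs.
# No device is involved, so the difference is system call overhead alone.

# 524288 read()/write() pairs of 128 bytes each
small_blocks() {
    dd if=/dev/zero of=/dev/null bs=128 count=512k
}

# 512 read()/write() pairs of 128 KiB each
large_blocks() {
    dd if=/dev/zero of=/dev/null bs=128k count=512
}

small_blocks
large_blocks
```

dd reports the elapsed time and the data rate for each run. The exact figures depend on the machine, but the run with small blocks is expected to be considerably slower.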

On Linux machines, the following shell commands can be used for a quick speed check (replace the /dev/xillybus_* names as required):

dd if=/dev/zero of=/dev/xillybus_sink bs=128k
dd if=/dev/xillybus_source of=/dev/null bs=128k

These run until stopped with CTRL-C. Add a "count=" parameter to run the tests for a fixed amount of data.
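For example, a fixed-size write test could be wrapped as follows. This is a sketch only: write_test is a name made up for this illustration, and /dev/xillybus_sink stands for whichever host-to-FPGA stream is tested.

```shell
# write_test DEVICE BLOCKS
# Sends BLOCKS blocks of 128 KiB each to DEVICE and lets dd report the rate.
write_test() {
    dd if=/dev/zero of="$1" bs=128k count="$2"
}

# Example (hypothetical device name): 8192 blocks x 128 KiB = 1 GiB
# write_test /dev/xillybus_sink 8k
```

In the opposite direction, note that dd counts a partial read as one block, so a read from a stream with "count=" may end after less data than expected. With GNU dd, adding iflag=fullblock makes "count=" refer to full blocks.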

4. Note the CPU consumption

The CPU power is often overrated in fast applications. Contrary to common belief, even the fastest CPUs available struggle to do anything meaningful with data going faster than 100-200 MB/s (per thread). The computer program is often the bottleneck in intensive applications, and not necessarily the data transport. Sometimes an inadequate buffer size (as mentioned above) can lead to excessive CPU consumption as well.

It's therefore important to keep an eye on the CPU consumption, using e.g. "top" on Linux machines or the Task Manager on Windows. It's nevertheless important to interpret these programs' output properly, in particular on multi-core machines: Does 25% CPU on a quad-core computer mean a low CPU consumption, or is it 100% on a specific thread? If "top" is used, that depends on the version of the program.

Another thing to note is how the system calls' processing time is measured and displayed: If the operating system's overhead slows things down, will it appear in the CPU percentage of the given process?

One simple way to tell this on Linux machines is using the "time" utility. For example,

$ time dd if=/dev/zero of=/dev/null bs=128k count=100k
102400+0 records in
102400+0 records out
13421772800 bytes (13 GB) copied, 1.07802 s, 12.5 GB/s

real	0m1.080s
user	0m0.005s
sys	0m1.074s

The output of "time" at the bottom indicates that out of the 1.080 wall clock seconds this operation took, the processor spent 5 ms in the user space program, and 1.074 seconds handling system calls. Summing this up, it's clear that the processor was busy all the time, so the processor was the bottleneck. This is quite expected, as no real I/O was performed in this example.
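The same conclusion can be reached by calculating the share of the wall clock time that the processor was busy. The helper below is a small sketch for this arithmetic (cpu_share is a name made up for this illustration), fed with the three figures from the output of "time" above.

```shell
# cpu_share REAL USER SYS   (all in seconds, as printed by "time")
# Prints the percentage of the wall clock time spent on the CPU.
cpu_share() {
    awk -v r="$1" -v u="$2" -v s="$3" 'BEGIN { printf "%.1f%%\n", 100 * (u + s) / r }'
}

cpu_share 1.080 0.005 1.074    # prints 99.9%
```

A result close to 100% means that the processor is the bottleneck. A considerably lower figure means that the process spent time waiting for I/O.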

5. Don't make reads and writes mutually dependent

For applications requiring communication in both directions, a common mistake is to write a single-threaded computer program which consists of one main loop. In each iteration of the loop, a chunk of data is written towards the FPGA, and then a chunk is read from it.

If the two streams are functionally independent, this might be fine. However, quite often a program like this is written for coprocessing applications, based upon the misconception that the program should send a chunk for processing, and then read back the results, so each iteration completes the processing of a certain amount of data.

Not only is this method inefficient, it may also lead to the execution getting stuck (depending on how the program was written). Section 6.6 of the Xillybus host application programming guide for Linux or Windows elaborates more on this topic, and suggests a more adequate coding technique.
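For a bandwidth test, one simple way to keep the two directions independent is to run each of them as a separate process. The following is a sketch under that assumption, and not necessarily the technique described in the programming guide. The function name and the device names in the example are placeholders.

```shell
# both_directions SINK SOURCE BLOCKS
# Writes to SINK and reads from SOURCE at the same time, as two
# independent dd processes, each handling BLOCKS blocks of 128 KiB.
both_directions() {
    dd if=/dev/zero of="$1" bs=128k count="$3" &   # writer, in the background
    writer=$!
    dd if="$2" of=/dev/null bs=128k count="$3"     # reader, runs concurrently
    wait "$writer"                                 # return when the writer is done too
}

# Example (hypothetical device names):
# both_directions /dev/xillybus_sink /dev/xillybus_source 8k
```

Since neither process waits for the other, a stall in one direction doesn't hold back the other direction.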

6. Know the host's RAM bandwidth limit

This applies mostly to embedded systems and/or when using revision XL / XXL IP cores: Each motherboard (or embedded system) has a limited bandwidth to its external RAM. On very demanding applications, this can turn out to be the bottleneck.

Keep in mind that each chunk of data going from the FPGA to a user space program requires two RAM operations: The first is when the data goes from the FPGA and is written into a DMA buffer. The second is when the data is copied into a buffer that is accessible by the user space program. For similar reasons, two RAM operations are required as well when the data goes in the opposite direction.

The separation between DMA buffers and user space buffers is an operating system requirement for all I/O that uses read() and write() (or similar calls), both in Linux and Windows.

So if a revision XL IP core is tested in both directions simultaneously, expecting about 3.5 GB/s in each direction, this demands four times this bandwidth from the RAM, that is 14 GB/s. Not all motherboards have this capability, and also keep in mind that the host uses the RAM for other things at the same time.
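The calculation above can be repeated for other rates with a small helper. This is merely a sketch of the arithmetic (ram_demand is a name made up for this illustration), assuming two RAM operations for each piece of data, as explained above.

```shell
# ram_demand RATE_GBPS DIRECTIONS
# Two RAM operations per transferred byte: the DMA access and the copy
# between the DMA buffer and the user space buffer.
ram_demand() {
    awk -v rate="$1" -v dirs="$2" 'BEGIN { printf "%.1f GB/s\n", rate * dirs * 2 }'
}

ram_demand 3.5 2    # both directions at 3.5 GB/s each: prints 14.0 GB/s
ram_demand 3.5 1    # one direction only: prints 7.0 GB/s
```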

With revision XXL even a simple test in one direction might exceed the RAM's bandwidth capability, for the same reason.

7. Make the DMA buffers large enough

This is rarely an issue, but still worth mentioning: If the RAM space that is allocated on the host for DMA buffers is too small, it may slow down the data transport, as the host is forced to divide the data stream into small chunks, hence wasting CPU cycles.

All demo bundles have enough DMA memory for performance testing, and the same goes for cores generated at the IP Core Factory with the desired "Expected BW" set to the actual expectation and "Autoset Internals" enabled. "Buffering" should be set to 10 ms, even though any option is most likely fine.

Generally speaking, four DMA buffers with a total RAM space corresponding to 10 ms' worth of data at the target rate are enough.
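As a rough illustration of this rule of thumb, the sketch below calculates the total DMA space and the size of each of four buffers for a given rate. The helper's name and the 800 MB/s figure are assumptions made for this example only.

```shell
# dma_buffers RATE_MBPS
# Prints the RAM space holding 10 ms of data at RATE_MBPS (in MB/s),
# and the size of each buffer when this space is divided into four.
dma_buffers() {
    awk -v rate="$1" 'BEGIN {
        total = rate * 1e6 / 100    # bytes moved in 10 ms (1/100 of a second)
        printf "total: %d bytes, per buffer: %d bytes\n", total, total / 4
    }'
}

dma_buffers 800    # hypothetical target rate of 800 MB/s
```

This is only a way to judge the order of magnitude. When "Autoset Internals" is enabled, the IP Core Factory sets the actual buffer sizes.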

8. Use the right data width

Quite obviously, a stream can't possibly transport data faster than bus_clk multiplied by the stream's data width at the FPGA. On revision A cores, 32-bit streams should be used, also because 8- and 16-bit streams consume more PCIe bandwidth than they actually use.
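This upper limit is easily calculated. In the sketch below, stream_limit is a name made up for this illustration, and the 100 MHz figure for bus_clk is an assumption for the sake of the example. The actual frequency depends on the IP core and the target.

```shell
# stream_limit WIDTH_BITS BUS_CLK_MHZ
# One word of WIDTH_BITS per bus_clk cycle at most, hence the limit
# in MB/s is the width in bytes multiplied by the clock in MHz.
stream_limit() {
    awk -v w="$1" -v f="$2" 'BEGIN { printf "%d MB/s\n", w / 8 * f }'
}

stream_limit 32 100    # prints 400 MB/s
stream_limit 8 100     # prints 100 MB/s
```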

Revision B and XL IP cores allow picking a data width which is wider than the datapath with the PCIe block. Testing a revision B core, for example, would naturally require a 64-bit stream, as the PCIe datapath is 64 bits wide. However, choosing a wider stream might yield slightly better results, as the user logic gets faster access to the data. The difference should be negligible.

Revision XL cores should be tested with 128-bit data width streams (or 256 bits, which shouldn't make a significant difference).

9. Cache slowdown

This applies to embedded systems only, since x86-derived processors (32 and 64 bit) use a coherent cache. Neither does this issue apply to Xillybus using the AXI bus of a Zynq processor (e.g. with Xillinux), as it's attached to the coherent port (ACP), but it does apply when a Zynq processor uses Xillybus over the PCIe bus.

On several embedded processors, in particular when the PCIe bus is used on an ARM processor, the need to synchronize the cache when accessing the DMA buffers slows down the transport considerably. This is not a Xillybus issue, as it applies to any DMA-based I/O on the processor, e.g. Ethernet, USB ports and other possible PCIe devices.

A cache slowdown is spotted by an unreasonable CPU consumption in the system call state ("sys" row output of the "time" utility), despite using large buffers (as mentioned in section 3 above). This is a result of CPU cycles wasted on calling the cache-synchronizing opcode of the processor.

This problem doesn't exist on x86/x64 architectures, as they require no cache synchronization.

10. Tuning of parameters

The PCIe blocks in the demo bundles available for download are tuned for handling the target bandwidth on a mainstream x86-based processor. Likewise, streams that are generated in the IP Core Factory with "Autoset Internals" usually provide the optimal balance between performance and FPGA resource utilization, and ensure the bandwidth defined for each stream.

In rare cases, and for revision B and XL cores only, it might be necessary to tune the PCIe block's parameters slightly further in order to attain the target data rates, in particular on host-to-FPGA streams. This is discussed in section 4.4 of The guide to defining a custom Xillybus IP core. Even though this direction is often the most appealing when the measured bandwidth is below expectations, it is by far the least likely direction for solving the issue.

Summary

Several guidelines were given for making the most of Xillybus IP cores' bandwidth. The most notable ones were to open the loopback in the demo bundle's sample code, to avoid involving the disk (or a similar device), and to use large buffers in read() and write() calls. Keeping an eye on the CPU consumption is also a good idea when attempting to reach high performance, whether for testing or in a real application.