Introduction
The achieved bandwidth and latency are often a concern when designing a mixed host / FPGA system, and their relation with the underlying infrastructure is not always obvious.
This page is intended to clarify what can be expected, explain the limits and their reasons, and give directions on how to achieve the best possible performance.
This page deals with issues related to latency. Bandwidth is discussed on another page.
Latency in general
Xillybus was designed to never induce excessive latency, attempting to deliver data as soon as possible, in either direction. There are however a few things to observe when the latency is measured in microseconds.
Tests on hardware reveal that the typical latency is in the range of 10-50μs from a write() command until the arrival at the FIFO on the FPGA, and vice versa. This is of course hardware-dependent.
There are two main sources of excessive latencies, both of which can be eliminated:
- Improper programming practices, causing unwanted delays
- CPU starvation by the operating system
These issues are discussed below. A brief discussion on controlling the latency, which may result from data waiting in the buffers, can be found at the end of this page.
Xillybus’ data communication takes place using a number of DMA buffers, each having a fixed length (both are configurable parameters). A handshake mechanism between the FPGA and the low-level driver on the host passes data using these buffers, which may be partially filled. It may be necessary for the application software to give the driver hints on when to push or pull a partially filled DMA buffer in order to reduce latency. This is discussed next.
Low-latency read()
The standard API for read() calls states that the function call’s third argument is the (maximal) number of bytes to be read from the file descriptor. Xillybus is designed to return as fast as possible if the read() call can be completely fulfilled, regardless of whether a DMA buffer has been filled.
By convention, read() may also return with fewer bytes than requested in the third argument. Its return value indicates how many bytes were actually read. So in theory, read() could return immediately, even if there wasn’t enough data to fully complete the request. This behavior would however cause an unnecessary CPU load when a continuous stream of data is transferred: In the extreme case, read() could return on every single byte that arrives, even though it’s obviously more efficient to wait a bit for more data.
The compromise is that if read() can’t fulfill the entire request immediately, it sleeps for up to 10 ms, or until the requested number of bytes has arrived. This makes read() calls behave in an intuitive manner (a bit like a TCP/IP stream) without degrading CPU performance in data acquisition applications.
For example, if the IP core is configured to have four DMA buffers of 1 MByte each, and 8 bytes have been pushed into the FIFO on the FPGA, those eight bytes will arrive at a DMA buffer within microseconds (the Xillybus IP core will sense that the FIFO isn’t empty, and initiate a data transfer into a vacant DMA buffer as soon as possible).
If read() is now called with a request for 8 bytes, this function call will return in a matter of microseconds: Even though the DMA buffer could hold 1 MB of data, it’s handed over with only 8 bytes in it, in order to complete the read() call.
If, on the other hand, the read() requests 12 bytes, the read() call will sleep, and will return as soon as 4 more bytes are fed into the FIFO on the FPGA. If that doesn’t happen within 10 ms, read() returns with the number of bytes that did arrive (8 in this example).
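To illustrate this behavior, here is a minimal C sketch (not taken from the official demo code) of a blocking read loop that tolerates partial returns: read() is simply called again for the remaining bytes. The device file name is an example only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

/* Read exactly len bytes from fd, looping on partial returns:
   each read() may return with fewer bytes than requested (for
   example, after the 10 ms timeout), so the remainder is
   requested again. */
static int read_all(int fd, unsigned char *buf, int len) {
  int done = 0;

  while (done < len) {
    int rc = read(fd, buf + done, len - done);

    if (rc < 0) {
      if (errno == EINTR)
        continue;
      perror("read() failed");
      return -1;
    }

    if (rc == 0)
      return done; /* EOF: the stream was closed on the other side */

    done += rc;
  }

  return done;
}

int main(void) {
  unsigned char buf[12];
  int fd = open("/dev/xillybus_read_32", O_RDONLY); /* example name */

  if (fd < 0) {
    perror("Failed to open device file");
    exit(1);
  }

  if (read_all(fd, buf, sizeof buf) == (int) sizeof buf)
    printf("Got all %d bytes\n", (int) sizeof buf);

  close(fd);
  return 0;
}
```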
Alternatively, if the stream is flagged non-blocking (opened with the O_NONBLOCK flag), read() always returns immediately, either with as much data as is available or with an EAGAIN status. This is another way to avoid the possible 10 ms delay, but it requires proper non-blocking I/O programming skills.
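For comparison, here is a sketch of the non-blocking variant, again with an example device file name. When no data is available, read() returns -1 with errno set to EAGAIN, and it’s up to the program to decide what to do in the meantime (the usleep() call below is merely a placeholder for that decision).

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

int main(void) {
  unsigned char buf[1024];
  int fd = open("/dev/xillybus_read_32", O_RDONLY | O_NONBLOCK); /* example name */

  if (fd < 0) {
    perror("Failed to open device file");
    exit(1);
  }

  while (1) {
    int rc = read(fd, buf, sizeof buf);

    if (rc < 0) {
      if (errno == EAGAIN) {
        /* No data available right now. A real program would do
           something useful here, or wait for the file descriptor
           with poll() or select(). */
        usleep(100);
        continue;
      }
      perror("read() failed");
      exit(1);
    }

    if (rc == 0)
      break; /* EOF */

    /* Process the rc bytes that arrived in buf */
  }

  close(fd);
  return 0;
}
```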
Low-latency write()
write() calls behave differently regarding latency, depending on whether the stream is synchronous or asynchronous (see Section 2 of the Programming Guide for Linux or Windows for more about synchronous vs. asynchronous streams). A latency measured in microseconds can be achieved in both types of streams, but asynchronous streams require some attention to this issue, as explained next.
An asynchronous stream is intended to transport a continuous flow of data efficiently. A write() call merely stores data in the DMA buffer and returns immediately (or sleeps if there’s insufficient room left in the DMA buffers). The tricky issue is what happens if there’s a partially filled DMA buffer. On one hand, starting a transfer to the FPGA immediately after a write() call could lead to a very poor utilization of DMA buffers that are significantly larger than the data chunks in the write() calls. On the other hand, waiting for a DMA buffer to fill before sending it may cause long latencies and counter-intuitive behavior.
To tackle this, asynchronous streams to the FPGA have a flushing mechanism, which forces the immediate transmission of the data in a partially filled DMA buffer. This flushing mechanism can be triggered manually by the software, by issuing a write() call with zero bytes for transmission. Section 3.4 in either of the two programming guides explains how to do this correctly.
In addition, if an asynchronous stream to the FPGA remains idle for 10 ms, its DMA buffer is automatically flushed. This prevents the stream from appearing to be stuck.
It’s important to note that flushing after each and every write() call is probably pointless. Rather, a flush should be initiated when a certain portion of data has been written, and holding it back in the buffers would just waste time.
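For example, here is a minimal C sketch of writing a chunk of data to an asynchronous stream and then requesting a flush with a zero-length write() call. The device file name is an example only, and the error handling of the zero-length write is simplified; Section 3.4 of the programming guide covers it fully.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

/* Write exactly len bytes to fd, looping on partial writes. */
static int write_all(int fd, const unsigned char *buf, int len) {
  int sent = 0;

  while (sent < len) {
    int rc = write(fd, buf + sent, len - sent);

    if (rc < 0) {
      if (errno == EINTR)
        continue;
      perror("write() failed");
      return -1;
    }
    sent += rc;
  }
  return sent;
}

int main(void) {
  unsigned char msg[64];
  int fd = open("/dev/xillybus_write_32", O_WRONLY); /* example name */

  if (fd < 0) {
    perror("Failed to open device file");
    exit(1);
  }

  memset(msg, 0x5a, sizeof msg); /* some payload to send */

  if (write_all(fd, msg, sizeof msg) < 0)
    exit(1);

  /* A zero-length write() asks the driver to flush the partially
     filled DMA buffer right away, instead of waiting for the 10 ms
     timeout (see Section 3.4 of the programming guide for the
     complete error handling of this call). */
  if (write(fd, msg, 0) < 0)
    perror("Flushing the stream failed");

  close(fd);
  return 0;
}
```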
Synchronous streams present no latency issue, since each write() call involves an automatic flush. This is necessary to ensure that write() doesn’t return before the data has arrived at the FPGA, which is the essence of synchronous streams. On the other hand, this makes them slower and inadequate for continuous streams.
Impact of the operating system
Latency is often measured between two points in time. For example, consider the case where the FPGA sends some data to the host, the host calculates its reaction to this data, and sends data back to the FPGA, based upon what it received. The requirement for low latency is measured on the FPGA: How long it had to wait for the response, or more precisely, what maximal waiting time can be guaranteed.
The host’s response is created by a user-space program running on the host. In the example scenario, this program sleeps until data arrives from the FPGA (i.e. its read() call blocks); the process then wakes up, does its calculations, writes the response, and attempts to read the next portion of data, which makes it sleep again.
In order to achieve a guaranteed maximal latency, there must be a guarantee on how quickly the process gets the CPU after the data has arrived (and the process changes its state to “running”), and how much, if at all, the process is preempted by the OS while running, until it finishes its data processing cycle. Unfortunately, neither Linux nor Windows is really designed to give such guarantees. It may therefore prove difficult to ensure latencies at the level of microseconds. In particular, irregular conditions may wake up processes and kernel threads with a high priority, which may steal precious CPU time from the application’s real-time process.
The key to achieving true real-time performance is often to remove unnecessary components. For example, a network interface may seem harmless, but if it’s suddenly bombarded with packets, possibly not related to the specific host, the interrupts and tasklets that handle these incoming packets may delay the real-time application. The trick is to spot the possible distractions and eliminate them.
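On Linux, one common measure (not a complete solution by itself, but it addresses the scheduling concerns above) is to run the latency-sensitive process under a real-time scheduling policy and lock its memory, so that ordinary processes and page faults don’t get in the way. A minimal sketch, assuming root privileges; the priority value is arbitrary:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/mman.h>

int main(void) {
  struct sched_param param;

  /* Ask for the SCHED_FIFO real-time policy, so that ordinary
     processes can't preempt this one. The priority value (1-99)
     is arbitrary here. */
  param.sched_priority = 80;

  if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
    perror("sched_setscheduler() failed (are you root?)");
    exit(1);
  }

  /* Lock all current and future pages in RAM, so that a page
     fault doesn't add an unpredictable delay. */
  if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
    perror("mlockall() failed");
    exit(1);
  }

  /* ... the latency-sensitive read() / write() loop goes here ... */

  return 0;
}
```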
Buffering latency
Another, somewhat unrelated aspect of latency is buffering latency: data that is delayed because other data in the buffer(s) is handled first. For example, consider a continuous stream of data from the FPGA to the host. The computer program reads a chunk of data from the Xillybus file descriptor, processes it, then reads another chunk, and so on. While the host processes the previous chunk of data, new data accumulates in the DMA buffers.
In some applications, it’s important for the program to know how old the data it reads is. In particular, if the processing of the data is slower than the data generated by the FPGA, it may be necessary to catch up, possibly by skipping a chunk of incoming data.
I/O techniques involving non-blocking reads or multithreading can be used to always exhaust the input stream, which is a safe way to know that the last byte read is completely “fresh”.
There’s a simpler way to tackle this issue: It’s quite easy to implement a counter in the FPGA, which indicates the number of words that have been written to the stream of data (more precisely, to the FIFO in the FPGA belonging to this stream). This counter is held at zero while the related Xillybus device file isn’t open, and is otherwise incremented on every clock cycle for which the FIFO’s “write enable” is asserted.
The value of this counter is easily read from the host by connecting it to the data wires of a dedicated synchronous Xillybus stream (with a 32 bit word width for convenience). This allows the host to read 4 bytes from the counter’s stream, and get the exact current position of the continuous data stream. Based upon this value, and knowing how much data has been read from the continuous stream’s file descriptor, the program knows exactly how much data is kept in the DMA buffer(s), and can make a decision on skipping frames of data as necessary.
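As an illustration, here is a C sketch of a host program that reads the continuous data stream and occasionally polls such a counter stream to deduce the backlog. The device file names and the 32-bit word width of the data stream are assumptions made for this example only, and so is the threshold for reacting to a large backlog.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

int main(void) {
  /* Both device file names are made up for this example. The data
     stream is assumed to carry 32-bit words, hence the factor of 4. */
  int data_fd = open("/dev/xillybus_data_read_32", O_RDONLY);
  int counter_fd = open("/dev/xillybus_counter_32", O_RDONLY);
  unsigned char buf[4096];
  uint32_t words_written;
  uint64_t bytes_read = 0;

  if (data_fd < 0 || counter_fd < 0) {
    perror("Failed to open device file");
    exit(1);
  }

  while (1) {
    int rc = read(data_fd, buf, sizeof buf);

    if (rc <= 0)
      break;

    bytes_read += rc;

    /* Process the rc bytes in buf ... */

    /* A 4-byte read on the synchronous counter stream returns the
       number of words written so far to the data stream's FIFO on
       the FPGA. The difference between that and what has been read
       from the file descriptor is the data still waiting in the
       FIFO and the DMA buffers. */
    if (read(counter_fd, &words_written, sizeof words_written)
        == (int) sizeof words_written) {
      uint64_t backlog = (uint64_t) words_written * 4 - bytes_read;

      if (backlog > 1000000) {
        /* Lagging too far behind: skip data, reduce the processing
           load, or whatever the application calls for. */
      }
    }
  }

  close(data_fd);
  close(counter_fd);
  return 0;
}
```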
The same logic can be used on host-to-FPGA streams, based upon a counter that informs the host how many words have been fetched by the FPGA. By comparing this with the amount of data that has been written to the data stream, the program knows how much data is still waiting in the buffer(s).
This simple technique allows an accurate management of the buffer latencies. Note that the Xillybus stream that transports the data is separate from the one used for the counter, so they are independent: The counter's value arrives virtually immediately, regardless of how much data is waiting in the data transport stream.