2 Synchronous streams vs. asynchronous streams

2.1 Overview

Each Xillybus stream has a flag, which determines whether it behaves synchronously or asynchronously. This flag’s value is fixed in the FPGA’s logic.

When a stream is marked asynchronous, it’s allowed to communicate data between the FPGA and the host’s kernel driver without the user space software’s involvement, as long as the respective device file is open.

Asynchronous streams have better performance, in particular when the data flow is continuous. Synchronous streams are easier to handle, and are the preferred choice when tight synchronization is needed between the actions of the user space application and what happens in the FPGA.

In custom IP cores that are generated in the IP Core Factory, the selection between making each stream synchronous or asynchronous is automatically based upon the information about the stream’s intended use, as declared by the tool’s user when “autoset internals” is enabled. If the autoset option is turned off, the user makes this choice explicitly.

Either way, the “readme” file, included in the bundle that is downloaded from the IP Core Factory, specifies the synchronous or asynchronous flag for each stream (among other attributes).

In all demo bundles, the streams related to xillybus_read_* and xillybus_write_* are asynchronous. xillybus_mem_8 is seekable and therefore synchronous. When XillyUSB is used, the same applies to the respective xillyusb_* files.

2.2 Motivation for asynchronous streams

Multitasking operating systems such as Linux and Microsoft Windows are based upon CPU time sharing: Processes get time slices of the CPU, with some scheduling algorithm deciding which process gets the CPU at any given moment.

Even though it’s possible to set priorities for processes, there is no guarantee that a process will run continuously, or that the periods of preemption have a limited duration, not even on a multiprocessor computer. The underlying assumption of the operating system is that any process can accept any period of CPU starvation. Real-time oriented applications (e.g. sound applications and video players) have no definite solution to this problem. Instead, they rely on the typical de-facto behavior of the operating system, and make up for the preemption periods with I/O buffering.

Asynchronous streams tackle this issue by allowing data to flow continuously while the application is either preempted or busy with other tasks. The exact significance of this for streams in either direction is discussed next.

2.3 Streams from FPGA to host

In the upstream direction (FPGA to host), if the stream is asynchronous, the IP core in the FPGA attempts to fill the host driver’s buffers whenever possible. That is, when the file is open, data is available and there is free space in those buffers.

On the other hand, if the stream is synchronous, the IP core fetches data from the user application logic (usually from a FIFO) only when the user application software on the host has a pending request to read that data from the file descriptor. In other words, when the user application software is in the middle of a read() function call.

Synchronous streams should be avoided in high-bandwidth applications, mainly for these two reasons:

  • The data flow is interrupted while the application is preempted or doing something else, so the physical channel remains unutilized during certain time periods. In most cases, this leads to a significant bandwidth performance reduction.

  • An overflow may occur on the FIFO in the FPGA during these time gaps. For example, if its fill rate is 100 MB/sec, a typical FPGA FIFO with 2 kByte goes from empty to full in around 0.02 ms. Practically, this means that any preemption of the user space program can potentially cause the overflow of the FIFO in the FPGA.
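The timing figure above is worth making explicit. The following helper (a name invented for this illustration; the figures of 2 kBytes and 100 MB/sec are the example numbers from the text, not properties of any specific core) computes the time for a FIFO to go from empty to full at a constant rate:

```c
/* Time, in milliseconds, for a FIFO to go from empty to full
   (or, symmetrically, from full to empty) at a constant rate. */
double fifo_transit_ms(double fifo_bytes, double bytes_per_sec) {
  return fifo_bytes / bytes_per_sec * 1e3;
}
```

For example, fifo_transit_ms(2048, 100e6) yields about 0.02 ms, in line with the figure quoted above.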

Despite these drawbacks, synchronous streams are useful when the time at which the data was collected at the FPGA is important. In particular, memory-like interfaces require a synchronous stream.

Data that has been received by the Xillybus IP core on the FPGA from the application logic is available for reading immediately by the host’s user space application, regardless of whether the stream is synchronous or asynchronous.
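As a minimal sketch, reading from an FPGA-to-host stream follows the usual Unix pattern, regardless of whether the stream is synchronous or asynchronous; only the driver’s behind-the-scenes behavior differs. The helper’s name, drain_stream(), is invented for this illustration:

```c
#include <stdio.h>
#include <unistd.h>

/* Read from an already-opened stream file descriptor until EOF,
   handing each chunk to consume() (which may be NULL). Returns the
   total number of bytes read, or -1 on error. */
long drain_stream(int fd, void (*consume)(const unsigned char *, long)) {
  unsigned char buf[4096];
  long total = 0;

  for (;;) {
    /* read() may return fewer bytes than requested. With an
       asynchronous stream, the data is often already waiting in the
       driver's buffers, filled without the application's involvement. */
    ssize_t rc = read(fd, buf, sizeof(buf));

    if (rc < 0) {
      perror("read");
      return -1;
    }
    if (rc == 0) /* EOF, i.e. the FPGA side has closed the stream */
      return total;

    if (consume)
      consume(buf, (long)rc);
    total += (long)rc;
  }
}
```

A typical call opens one of the demo bundle’s device files, e.g. fd = open("/dev/xillybus_read_32", O_RDONLY), and then calls drain_stream(fd, my_consumer).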

2.4 Streams from host to FPGA

In the downstream direction (host to FPGA), a stream being asynchronous means that the host application’s write() function calls will return immediately most of the time. More precisely, the calls to functions writing to the device file will return immediately if the data can be stored entirely in the driver’s buffers. The data is then transmitted to the FPGA at the rate requested by the user application logic at the FPGA, with no involvement of the host’s application software.
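The behavior just described still obliges the application to handle partial writes, since write() may accept only part of the buffer when the driver’s buffers are nearly full. A common sketch (write_all() is a helper name invented for this illustration) is:

```c
#include <unistd.h>

/* Write the whole buffer to fd, handling partial writes. On an
   asynchronous stream, each write() normally returns as soon as the
   data has been copied into the driver's buffers, and blocks only
   when those buffers are full. Returns 0 on success, -1 on error. */
int write_all(int fd, const unsigned char *buf, size_t len) {
  size_t done = 0;

  while (done < len) {
    ssize_t rc = write(fd, buf + done, len - done);
    if (rc < 0)
      return -1;
    done += (size_t)rc;
  }
  return 0;
}
```

Note that write_all() returning successfully says nothing about the data having reached the FPGA when the stream is asynchronous; it only means the driver has taken responsibility for the data.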

There is a slight difference between XillyUSB and the other Xillybus IP cores regarding how soon the data of an asynchronous stream is sent to the FPGA.

For the IP cores that are based upon PCIe or AXI, data is sent to the FPGA only when one of these happens:

  • The current DMA buffer is full (there are several buffers for each stream).

  • A flush is requested explicitly on the device file by the application software (see paragraph 3.4).

  • The file descriptor is being closed.

  • A timer expires, forcing an automatic flush if nothing has been written to the stream for a specific amount of time (typically 10 ms).
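Paragraph 3.4 is the authoritative description of the explicit flush mechanism. As a hedged sketch only, assuming the driver treats a zero-length write() as a flush request (an assumption in this example, to be verified against paragraph 3.4), the second item in the list could look like this:

```c
#include <unistd.h>

/* Request an explicit flush on a stream's file descriptor.
   Assumption for this sketch: a zero-length write() is interpreted
   by the driver as a flush request (see paragraph 3.4).
   Returns 0 on success, -1 on error. */
int request_flush(int fd) {
  return (write(fd, NULL, 0) == 0) ? 0 : -1;
}
```

Such a call makes sense after writing a short data segment that the FPGA should receive without waiting for the timer-driven automatic flush.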

With a XillyUSB stream, the data is sent virtually immediately. More precisely, the driver attempts to queue USB transfers of a fixed size (typically 64 kB), but a smaller transfer is queued if there is data for transmission, and there’s no other transfer queued for the related stream. Therefore, for each stream, there is never more than one queued transfer with less than the fixed size, but there is always at least one transfer in progress as long as there is data for transmission. This results in an efficient use of USB transfers as well as quick response to short data segments.

All in all, asynchronous streams on all IP cores (XillyUSB and other Xillybus IP cores) behave roughly the same, with XillyUSB having a quicker response time on short segments of data (no 10 ms delay).

On the other hand, if the stream is synchronous, a call to the low-level function writing to the device file will not return until all data has reached the user application’s logic in the FPGA. In a typical application, this means that when the function call to write() returns, it indicates that the data has arrived to the FIFO that is connected to the IP core in the FPGA.

IMPORTANT:
Higher-level I/O functions, such as fwrite(), involve a buffer layer created by the library functions. Hence fwrite() and similar functions may return before the data has arrived at the FPGA, even for synchronous streams.
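When stdio functions are used anyhow, fflush() forces the library’s buffer into the underlying write() call, restoring the synchronous stream’s guarantee. A minimal sketch (send_command() is a helper name invented for this illustration):

```c
#include <stdio.h>

/* Send a command over a synchronous stream through a stdio FILE.
   fwrite() alone may leave the bytes in the C library's buffer, so
   fflush() is called to force the underlying write(); on a
   synchronous stream, that write() doesn't return before the data
   has reached the FPGA. Returns 0 on success, -1 on error. */
int send_command(FILE *dev, const unsigned char *cmd, size_t len) {
  if (fwrite(cmd, 1, len, dev) != len)
    return -1;
  if (fflush(dev) != 0)
    return -1;
  /* For a synchronous stream, the command has now arrived at the
     application logic in the FPGA. */
  return 0;
}
```

Alternatively, the buffering layer can be avoided altogether by using the low-level open()/write() calls directly.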

Synchronous streams should be avoided in high-bandwidth applications, mainly for these two reasons:

  • The data flow is interrupted while the application is preempted or doing something else, so the physical channel remains unutilized during certain time periods. In most cases, this leads to a significant bandwidth performance hit.

  • An underflow may occur on the FIFO in the FPGA during these time gaps. For example, if its drain rate is 100 MB/sec, a typical FPGA FIFO with 2 kByte goes from full to empty in around 0.02 ms. Practically, this means that any preemption of the user space program can potentially cause the underflow of the FIFO in the FPGA.

Despite these drawbacks, synchronous streams are useful when it’s important for the application to know that the data has arrived to the FPGA. This is the case when the stream is used to transmit commands that must be executed before some other operation takes place (e.g. configuration of the hardware).

2.5 Uncertainty vs. latency

A common mistake is to require a low latency on asynchronous streams for the sake of synchronization between data. For example, if the application is a modem, there is usually a natural need to synchronize between received and transmitted samples.

This often leads to a misconceived design, based upon the notion that the uncertainty in the synchronization is necessarily smaller than the total latency. To keep the uncertainty low, the latency, and hence the buffers, are made as small as possible, leading to an overall system with difficult real-time requirements.

With Xillybus, the synchronization is easily made perfect (at the level of a single sample), as explained in paragraph 6.2. The limitation on latency is therefore derived from the need to respond quickly to arriving data, if there is such a need.

For example, with a modem, the maximal latency has an impact on how quickly the application’s data source responds to data that is sent to it. In a camera application, the host may program the camera to adjust the shutter speed to compensate for changing light conditions. Data that arrives with a larger latency slows down this control loop.

These are the real considerations that need to be taken into account, and still, they are usually significantly less stringent than those derived from the misunderstanding of confusing uncertainty with latency.