4 Continuous I/O at high rate

4.1 The basics

There are four practices that are nearly essential to achieve a high-rate continuous data flow between the host and the FPGA:

  • Using asynchronous streams

  • Making sure the driver’s buffers are large enough to compensate for time gaps between the I/O operations of the user space application.

  • Having the user space application read data from the device file as soon as there is data available, or write data to it as soon as there is space available in the buffers.

  • Never closing and reopening the device files while the FPGA keeps inserting or draining data.

XillyUSB presents additional challenges with maintaining a continuous flow of data, as explained on this web page.

Monitoring how much data is held in the driver’s buffers at any given time is discussed in Xillybus FPGA designer’s guide, in the section named “Monitoring the amount of buffered data”.

The first item in the list above, of using asynchronous streams, is discussed in section 2. The second and third are discussed in the remainder of this section.

To understand the fourth item, recall that the advantage of asynchronous streams is that data runs between the FPGA and host without the intervention of the user space application. This flow is stopped when the file is closed.

Specifically for a stream from the host to the FPGA, closing the file forces a flush of all data in the buffers, and the file is closed only after that is finished (or after one second). As a result, there is a time gap with no data flow from the moment that the file is closed, to when the file is open again (and data is written to the file descriptor).

As for streams from the FPGA, closing the file leads to loss of any data in the pipe that goes from the application logic in the FPGA, to the user space application in the host (i.e. the FPGA’s FIFO and driver’s buffers). The only way to avoid this loss is draining all data from this pipe before closing the file. Once again, there’s a time gap with no data flowing, between closing the file and opening it again.

A common mistake is to use the EOF capability to mark data chunks (e.g. complete video frames), thereby forcing the host to close and reopen the device file at known boundaries. However, this significantly increases the risk of an overflow at the FPGA’s FIFO.

It’s important to keep in mind that the operating system may take the CPU away from a user space application at any given moment (preemption), so time gaps of several milliseconds, and sometimes tens of milliseconds, can occur between subsequent function calls in a program.

4.2 Large driver’s buffers

One of the greatest challenges in transferring data at a high rate between the FPGA and host is to maintain a continuous flow. In applications involving data acquisition and playback, an overflow or shortage of data renders the system nonfunctional. To avoid this, the driver allocates large RAM buffers on the host for its own use. These buffers compensate for the gaps in time, during which the application isn’t available to handle data transfers.

Xillybus allows allocation of huge driver’s buffers, but this memory must be allocated from the pool of the operating system’s kernel RAM. On some systems (32-bit systems in particular) the addressing space of such memory is limited to 1 GB by the Linux operating system, even if the total RAM available is significantly larger. In systems with RAM of less than 1 GB (embedded Linux in particular), all memory may be used for driver’s buffers.

Much larger buffers can be allocated on 64-bit systems when using an enhanced host driver, as discussed on this page:

https://xillybus.com/doc/huge-dma-buffers/

Except with XillyUSB, the driver’s buffers are allocated when the Xillybus driver is loaded (typically early in the boot process) and are freed only when the driver is unloaded from the kernel (usually during system shutdown). When the buffers are huge, this usually means that a significant part of the kernel’s RAM pool is occupied by them. This is a fairly reasonable setting, since the application using these buffers is likely to be the main purpose of the machine it’s running on.

A potential problem with huge buffers is that they occupy contiguous segments of physical RAM. This is in contrast with a buffer allocated in a user space program, which is contiguous in virtual address space, but can be spread all over the physical memory, or even not occupy any physical RAM at all.

The pool of available memory becomes fragmented as the operating system runs. This is why the Xillybus driver allocates its buffers as soon as possible, and retains them even when not actively used. Attempting to unload the driver and reload it at a later stage may fail for the same reason.

XillyUSB has a different approach to memory allocation, which is more tolerant to physical memory fragmentation. This is one of the reasons that its driver allocates RAM for its buffers when a device file is opened, and releases it when the file is closed.

Precautions should however be taken to avoid a shortage of kernel RAM. The IP Core Factory’s automatic memory allocation (“autoset internals”) algorithm is designed not to consume more than 50% of the relevant memory pool, e.g. 512 MB on a PC, based upon the assumption that a modern PC has more than 1 GB of RAM installed. It’s probably safe to go as high as 75%, which can be done by setting the buffer sizes manually.

Overallocation of buffers may lead to system instability. In particular, the operating system is likely to kill processes apparently randomly, whenever it fails to allocate RAM from the kernel pool.

4.3 RAM buffers in user space

For applications that require buffers larger than 512 MB on a 32-bit machine, it’s recommended to do some of the buffering in user space RAM. On 64-bit machines, this option is rarely relevant, except when the desired buffer size is very large and not a power of 2 (2^N). For example, supplying a buffer of 62 GB for a stream is not possible with Xillybus’ DMA buffers, but can be achieved with user space RAM.

It may seem counterintuitive that the problem of I/O continuity can be solved by allocating a huge buffer in the user space application. Indeed, this solution doesn’t help when the operating system starves the application of CPU time. But if the operating system’s scheduler is fairly well designed and the priorities are set right, a user space application will get its CPU slice often enough, even on a computer that is under heavy load.

It’s important to pay attention to the first fill of the buffer: Modern operating systems don’t allocate any physical RAM when a user space application requests memory. Instead, they just set up the memory page tables to reflect the allocation. Actual physical memory is allocated only when the application attempts to use it.

This is a brilliant method for saving resources, but it can have a disastrous impact on a data acquisition application: Consider what happens when data begins to rush in from a data source. The application writes the data to the buffer it just allocated, but each time a new memory page is accessed, the operating system needs to supply a new physical memory page. If there happens to be free physical RAM, or if there is a quick way to release physical memory (e.g. disk buffers that are already in sync with the disk), this memory juggling can go unnoticed. But in the absence of immediate sources of physical RAM, disk operations may have to take place (swapping RAM to disk or flushing disk buffers), which can halt the application for too long.

The really bad news is that the ability to take the initial load of data depends on the overall system’s state. Hence a program that usually works may suddenly fail, because some other program just did something data intensive on the same computer.

The natural solution is memory locking: mlock() tells the operating system that a certain chunk of (virtual) memory must be held in physical RAM. This forces immediate allocation of physical memory, so if disk operations are needed to complete this, the function call may take some time to return.

The operating system is reluctant to lock large chunks of RAM, as this impacts its overall performance. In most cases, there’s a need to raise some limit in the shell or set up configuration files.

4.4 Overview of the fifo.c demo application

Among the demo applications, which can be downloaded for Linux and Windows, there’s one called “fifo.c”. It’s an example of how to implement a RAM FIFO using two threads, which has been tested on 32-bit and 64-bit platforms.

For more about the demo applications, see Getting started with Xillybus on a Linux host.

Note that unlike everywhere else in the documentation, the word “FIFO” in this section refers to a RAM buffer on the host, and not the FIFO in the FPGA.

The purpose of this program is to test fast streams for which a huge RAM FIFO is necessary. In other words, if you need a buffer smaller than, say, 16 GB, odds are that you don’t need this program.

It can also be used as a basis for modification and adoption in custom applications. It’s designed with no mutexes, so no thread ever goes to sleep just because another thread holds a lock. Sleeping (blocking) does occur, of course, when the FIFO’s state requires it (e.g. a read is requested from an empty FIFO).

This implementation without mutexes requires careful use of the API functions, as they’re not reentrant. This is however no problem with one thread for reading, and one thread for writing.

To run it for data acquisition from a device file into a disk file with a buffer of 128 MB, type something like:

$ ./fifo 134217728 /dev/xillybus_async > dumpfile

If no file name is given as the second argument, the program reads from standard input.

There’s probably a need to lift the limit on locked memory, using ’ulimit -l’ at the shell prompt, with root privileges (possibly use “su - your-username” as root to drop back to a regular user, and retain the updated limit). For a permanent change of the limit, refer to your Linux distribution’s docs.

The program creates three threads:

  • read_thread() reads from standard input (or the file given in the command line) and writes the data into the FIFO

  • write_thread() reads from the FIFO and writes to standard output

  • status_thread() periodically prints a status line to standard error

The third thread has no functional significance, and can be eliminated. It’s also possible to have one of the read/write functionalities running in the main thread. For example, in a data acquisition application, it may be natural to launch only read_thread() to move data from the file descriptor to the FIFO, but consume the data from the FIFO in the thread of the main application.

4.5 fifo.c modification notes

If you want to modify the program, here are a few things to keep in mind:

  • The fifo_* functions are not reentrant. It’s safe to use them when each thread uses a set of functions that no other thread uses (which is a natural use).

  • The function fifo_init() can take time to return, and should be called before an asynchronous Xillybus device file is opened.

  • The thread that reads and the thread that writes in the application always attempt the maximal number of bytes allowed in their I/O requests. This can be problematic in some cases, e.g. when the I/O source is /dev/zero and the destination is /dev/null: Both will complete the entire request in one attempt, so the FIFO will swing from completely empty to completely full and back again. In such cases, it’s more sensible to limit the number of bytes requested in calls to the I/O functions.

4.6 RAM FIFO functions

Instead of modifying the fifo.c example, it’s possible to adopt a group of functions from its source code.

A section of FIFO API functions is clearly distinct in the fifo.c file. These functions can be used in custom applications, by following the example and according to the functions’ description below.

IMPORTANT:
Even though the fifo_* functions are intended for use in a multi-threaded environment, these functions are not reentrant. This means that one thread should call the functions related to reading from the FIFO, and another thread should do the writes, so that each thread calls its separate set of functions.

Except for an initializer, destroyer and a thread join helper, the API has four functions for reading and writing, two for each direction. None of these functions actually access the data in the FIFO; they merely maintain the FIFO’s state and supply the information necessary to perform reads, writes, memory copies etc.

The intended execution procedure is as follows: The thread that reads from the FIFO calls the function fifo_request_drain(), which returns information about how many bytes can be read, and a pointer from which data can be read. If the FIFO is empty, the thread will sleep until data arrives.

The user application then makes whatever use it needs of the data pointed to. After consuming some or all of the data (writing it to a file, copying it, running some algorithm etc.), it calls fifo_drained() to inform the FIFO API how many bytes were actually consumed. The API releases the relevant portion of memory in the FIFO. If the thread that writes was sleeping because the FIFO was full, it is woken up.

Note that the thread that reads doesn’t ask for a specific number of bytes. Rather, fifo_request_drain() tells the application how many bytes can be consumed, and the application reports back how many it chose to consume in fifo_drained().

As for the opposite direction, a similar approach is taken: The thread that writes calls fifo_request_write(). This function returns the number of bytes that can be written to the FIFO, or sleeps if the FIFO is full. The user application writes as many bytes as it needs (but not more than fifo_request_write() allowed) to the address it got from fifo_request_write(), and then reports how many bytes were written by calling fifo_wrote().

We’ll now go through each of these functions in detail.

4.6.1 fifo_init()

fifo_init(struct xillyfifo *fifo, unsigned int size) – This function initializes the FIFO’s information structure and allocates memory for the FIFO as well. It also attempts to lock the FIFO’s virtual memory to physical RAM, making it ready for immediate fast writing and preventing it from being swapped to disk.

fifo_init() allocates memory for a buffer of size bytes. size can be any integer (i.e. it doesn’t have to be a power of 2, 2^N), but a multiple of the system’s int size is recommended.

Note that this function can take several seconds to return: The request for a large portion of physical RAM may force the operating system to swap other processes’ RAM pages to disk, or force disk cache flushing. In both cases, fifo_init() may have to wait for a lot of data to be written to disk before returning.

The function returns zero on success, nonzero otherwise.

4.6.2 fifo_destroy()

fifo_destroy(struct xillyfifo *fifo) – Frees the FIFO’s memory after unlocking it, and releases thread synchronization resources. This function should be called when the main program exits, because even though the thread synchronization resources are released automatically in current implementations of Linux, their API doesn’t guarantee this.

This function is of void type (hence returns nothing).

4.6.3 fifo_request_drain()

fifo_request_drain(struct xillyfifo *fifo, struct xillyinfo *info) – Supplies a pointer to read data from the FIFO as info->addr, and informs how many bytes can be read, beginning from that pointer, in info->bytes.

The info structure must not be the same one that is used for function calls to fifo_request_write(). Each thread should maintain a local variable of its own for this structure.

IMPORTANT:
The number of bytes returned does not indicate how much data is left for reading in the FIFO: It may also reflect the number of bytes left until the end of the FIFO’s memory buffer. Hence a significantly lower number is possible when the pointer comes close to the end of the buffer.

The function also sets fifo->position to indicate the FIFO’s current read position as a value between 0 and size-1, where size is the value that was given to fifo_init(). A nonzero fifo->slept indicates that the FIFO was empty upon invocation.

The function returns the number of bytes allowed for read (same as info->taken). But if the function fifo_done() has been called, and the FIFO is empty, fifo_request_drain() returns zero.

4.6.4 fifo_drained()

fifo_drained(struct xillyfifo *fifo, unsigned int req_bytes) – This function changes the FIFO’s state to reflect the consumption of req_bytes bytes. If fifo_request_write() was sleeping because the FIFO was full, it will be woken up.

IMPORTANT:
There is no sanity check on req_bytes. It’s the user application’s responsibility to make sure that req_bytes is not larger than info->bytes returned by the last function call to fifo_request_drain().

This function is of void type (hence returns nothing).

4.6.5 fifo_request_write()

fifo_request_write(struct xillyfifo *fifo, struct xillyinfo *info) – Supplies a pointer to write data to the FIFO as info->addr, and informs how many bytes can be written, beginning from that pointer, in info->bytes.

The info structure must not be the same one that is used for function calls to fifo_request_drain(). Each thread should maintain a local variable of its own for this structure.

IMPORTANT:
The number of bytes returned does not indicate how much data is left for writing in the FIFO: It may also reflect the number of bytes left until the end of the FIFO’s memory buffer. Hence a significantly lower number is possible when the pointer comes close to the end of the buffer.

The function also sets fifo->position to indicate the FIFO’s current write position as a value between 0 and size-1, where size is the value that was given to fifo_init(). A nonzero fifo->slept indicates that the FIFO was full upon invocation.

The function returns the number of bytes allowed for write (same as info->taken). But if the function fifo_done() has been called, fifo_request_write() returns zero, even if the FIFO is not full (there is no point writing data into a FIFO that will never be read).

4.6.6 fifo_wrote()

fifo_wrote(struct xillyfifo *fifo, unsigned int req_bytes) – This function changes the FIFO’s state to reflect the insertion of req_bytes bytes. If fifo_request_drain() was sleeping because the FIFO was empty, it will be woken up.

IMPORTANT:
There is no sanity check on req_bytes. It’s the user application’s responsibility to make sure that req_bytes is not larger than info->bytes returned by the last function call to fifo_request_write().

This function is of void type (hence returns nothing).

4.6.7 fifo_done()

fifo_done(struct xillyfifo *fifo) – This function is optional for use, and helps the application quit gracefully if either of the threads (reading or writing) has finished. It merely sets a flag in the FIFO’s structure and wakes up both threads if they were sleeping. By doing so, fifo_request_drain() will return zero rather than sleep if the FIFO is empty, and fifo_request_write() will return zero regardless.

This way, the callers of these functions know that the FIFO has no more use, and may act as necessary, which is most likely to stop the execution of the thread.

Call this function when the data source feeding the pipe has ended (e.g. EOF reached) or when the data consumer is no longer receptive (e.g. a broken pipe).

This function is of void type (hence returns nothing).

4.6.8 The FIFO_BACKOFF define variable

Sometimes it’s not desirable to let the FIFO fill up to the last byte. Even though there is no apparent reason to avoid that, it may be desirable to maintain a small gap between where data is written and where it’s read from.

For example, FIFO_BACKOFF can be set to 8, so the last byte written to the FIFO never shares a 64-bit word with the first valid byte for read. This is a rather far-fetched precaution, but comes at the low price of 8 bytes of memory.

There is no need for this feature when working with Xillybus or XillyUSB.