Overview
An enhanced driver for the Xillybus IP core is available for Linux and Windows, intended for applications that require very large DMA buffers. When running on 64-bit systems, it allows DMA buffers as large as 128 GiB per stream, as opposed to the existing limit of 512 MiB in total on Linux systems, or 768 MiB on Windows, above which the driver may not initialize properly. The enhanced driver for Linux also raises the size limit of each individual DMA buffer from 4 MiB to 128 MiB (there is no similar limitation on Windows).
Note that the common, non-enhanced drivers run properly on 64-bit and 32-bit systems alike, and are perfectly fine for applications not requiring huge DMA buffers.
In the IP Core Factory, streams that are defined with “Autoset internals” enabled will never exceed the limits of the regular, non-enhanced driver: No individual DMA buffer may exceed 4 MiB, and the total DMA buffer space won’t exceed 576 MiB. Hence “Autoset internals” must be turned off in order to benefit from the extra capabilities of the enhanced driver, which works only on 64-bit systems.
When using the enhanced driver, the total possible amount of DMA buffer space depends on how much the OS has fragmented the physical RAM during its normal activity. However, the following ballpark figures apply to a freshly booted system:
- On 64-bit Linux, all but a few GiB of the RAM can be used for DMA.
- On 64-bit Windows, approximately 75% of the RAM can be used for DMA, possibly with a hard limit of 128 GiB or 384 GiB, depending on the OS’ version.
The enhanced drivers are available upon request (please drop an email to Xillybus’ support). However, they only work with IP cores that were generated in the IP Core Factory after July 1, 2021. Earlier IP cores, including those in several demo bundles, are rejected by an enhanced driver with a kernel / system log message reading “The Xillybus IP core version in the FPGA is obsolete for use with this driver. Please generate a new one at the IP Core Factory” (or similar).
If needed, fix this by replicating the IP core in the IP Core Factory, generating it, and working with the new one. For a demo bundle, use the default IP core configuration in the Factory.
The allocation of huge DMA buffers is an unusual scenario, as common computer peripherals rarely require that. Even though there are no known problems with memory allocation of this sort, it should be kept in mind that this usage scenario pushes the limits of the operating system, and may therefore reveal hidden bugs.
Background
Xillybus’ IP core was originally released in 2011 with a maximal bandwidth of 200 MB/s, at a time when most computers ran in 32-bit mode. The size of the buffers allocated for DMA traffic was therefore much more limited, in particular due to the operating systems’ way of organizing the virtual memory map.
Over the years, Xillybus’ possible data rates increased by more than an order of magnitude, which brought a demand for deeper DMA buffers in some applications, in particular data acquisition and playback. Also, 64-bit operating systems became the common choice, opening the possibility of allocating DMA buffer memory far beyond the 512 MiB limit.
Xillybus’ driver in the Linux kernel is designed to work properly on a variety of platforms and processor architectures, including those with patchy Linux support and possibly problematic handling of the PCIe interface. Therefore, this driver limits itself to 32-bit addressable physical memory for DMA when possible, even when it runs on a 64-bit processor. This holds true for the driver included in the Linux kernel itself, as well as for the drivers available for direct download at the website (for 32- and 64-bit Linux and Windows), in order to retain consistent behavior.
When the enhanced driver is required
For work with IP cores that require more than 512 MiB of DMA buffers, the enhanced driver for the respective OS is required. These drivers allow the DMA buffers to be allocated anywhere in physical memory, and hence don’t limit themselves to 32-bit addressable memory. This lifts the limit on the total DMA memory considerably.
The enhanced driver is highly recommended in any scenario where more than 512 MiB of RAM is allocated for DMA, even if the regular (non-enhanced) drivers work properly on a specific machine: Typically, both Linux and Windows allow for about 2 GiB of DMA buffers with the non-enhanced driver when running on a freshly booted 64-bit machine. However, the behavior of the regular driver depends to a large extent on how the computer organizes its memory: As explained on this page (see the section named “Physical RAM allocation below and above 4 GB on x86_64 machines”), the BIOS selects how much of the hardware RAM is mapped into physical addresses within the 32-bit range, and how much goes to the upper memory region. The exact amounts vary from one computer to another.
Because the non-enhanced driver limits itself to 32-bit addresses, an IP core that requires more than 512 MiB of DMA buffers may therefore work on one computer and fail on another, depending on how much RAM each computer’s BIOS has mapped into the 32-bit region.
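For example, on an x86_64 Linux machine, one way to get a rough picture of how the RAM is laid out in physical address space (and hence how much of it falls below the 4 GB boundary) is to list the “System RAM” ranges. As root:

# grep "System RAM" /proc/iomem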
Non-x86 64-bit processor architectures may behave in a more consistent way, as there’s more control over how the system is set up. It’s nevertheless possible that different BSP or Linux versions change the environment in a way that affects the physical memory map.
The 4 MiB limit on each DMA buffer
On Linux, it’s typically impossible to directly allocate buffers larger than 4 MiB each. This is because the Linux kernel limits the maximal chunk to 2^10 pages of 4096 bytes (i.e. 2^(MAX_ORDER-1), MAX_ORDER being hardcoded to 11 in include/linux/mmzone.h, also see __alloc_pages() in mm/page_alloc.c). This fact is also reflected in the number of columns in /proc/buddyinfo.
Windows has no known hard limit of that sort.
The IP Core Factory allows buffers larger than 4 MiB only for revision B IP cores and above.
The total DMA buffer space for a given Xillybus stream is the number of DMA buffers multiplied by the number of bytes of each. Since each stream is limited to 1024 DMA buffers in the IP Core Factory, Linux’ 4 MiB-per-buffer limit results in a limit of the total DMA space for each stream: 4 GiB. The enhanced driver for Linux works around this by allocating buffers larger than 4 MiB with a separate mechanism: It loops on requesting contiguous chunks of 4 MiB, until these chunks form a contiguous segment of the required buffer size. This works on freshly booted systems, because Linux’ memory allocator happens to have a lot of 4 MiB chunks residing one after the other in physical address space.
However this method involves requesting more memory than is actually used, since those 4 MiB chunks that aren’t part of a larger contiguous segment are just set aside, waiting for their match during the process. These unused chunks are of course freed immediately after the driver has finished its attempt to allocate memory for DMA buffers, whether successfully or not.
If the enhanced driver fails to allocate a DMA buffer larger than 4 MiB, it’s only because no 4 MiB chunks were left in the system. This doesn’t mean that the system has no free RAM at all, as there is typically a lot of RAM left in smaller memory fragments. This shortage of 4 MiB chunks is also short-lived, as the driver immediately frees all allocated memory in the event of such a failure.
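As an illustration of this approach (a simplified sketch only, not the enhanced driver’s actual code; the function name grab_contiguous and its arguments are made up for this example), an allocation loop of this kind could look roughly as follows in Linux kernel code:

#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sort.h>

#define CHUNK_ORDER 10                        /* 2^10 pages of 4096 bytes = 4 MiB */
#define CHUNK_SIZE (PAGE_SIZE << CHUNK_ORDER)

struct chunk {
	unsigned long virt;  /* Kernel virtual address, from __get_free_pages() */
	phys_addr_t phys;    /* The corresponding physical address */
};

static int cmp_chunk(const void *a, const void *b)
{
	const struct chunk *x = a, *y = b;

	return (x->phys > y->phys) - (x->phys < y->phys);
}

/* Attempt to obtain @size bytes of physically contiguous memory by collecting
   4 MiB chunks, and then looking for chunks that happen to be adjacent in
   physical address space. Returns the buffer's kernel virtual address, or
   NULL on failure. */
static void *grab_contiguous(size_t size, int max_chunks)
{
	int wanted = DIV_ROUND_UP(size, CHUNK_SIZE);
	struct chunk *c;
	void *result = NULL;
	int i, n = 0, run = 0, start = 0;

	c = kcalloc(max_chunks, sizeof(*c), GFP_KERNEL);
	if (!c)
		return NULL;

	/* Collect 4 MiB chunks until there are none left (or max_chunks is hit) */
	while (n < max_chunks) {
		unsigned long p = __get_free_pages(GFP_KERNEL, CHUNK_ORDER);

		if (!p)
			break;

		c[n].virt = p;
		c[n].phys = virt_to_phys((void *)p);
		n++;
	}

	/* Sort by physical address and look for @wanted adjacent chunks */
	sort(c, n, sizeof(*c), cmp_chunk, NULL);

	for (i = 0; i < n; i++) {
		if (i && c[i].phys == c[i - 1].phys + CHUNK_SIZE) {
			run++;
		} else {
			run = 1;
			start = i;
		}

		if (run == wanted) {
			/* Physically contiguous memory obtained this way is also
			   contiguous in the kernel's direct mapping, so the first
			   chunk's virtual address is the buffer's address. */
			result = (void *)c[start].virt;
			break;
		}
	}

	/* Free every chunk that isn't part of the chosen run (or all of them,
	   if no suitable run was found) */
	for (i = 0; i < n; i++)
		if (!result || i < start || i >= start + wanted)
			free_pages(c[i].virt, CHUNK_ORDER);

	kfree(c);
	return result;
}

Needless to say, a real driver also has to map this memory for DMA, handle cache coherency and eventually release the chunks that were kept, all of which is omitted here for brevity.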
The size limit of each DMA buffer with the enhanced driver is 128 MiB, which is imposed by the IP core itself. With a maximum of 1024 DMA buffers per stream, the maximal total DMA RAM for a stream is 128 GiB.
Linux: Things that can help compaction
Generally speaking, the larger each of these DMA buffers is, the higher the probability for the memory allocation to fail, since each buffer is allocated as a contiguous segment of physical memory. As the computer runs, the physical memory becomes fragmented, making it increasingly difficult for the operating system to find large segments.
If the driver fails to allocate DMA buffers, fragmentation of the physical RAM may be the reason, in particular if it has succeeded before. The best solution is to reboot the computer, preferably with the driver installed in the system, so that it’s loaded automatically early during boot.
But if that is undesired, the following may help:
DMA memory is allocated from pages as listed in /proc/buddyinfo, which is also reflected in the MemFree line of /proc/meminfo. As memory is consumed for various needs, it becomes unavailable for immediate DMA allocation, and __get_free_pages() fails.
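For example, the rightmost column of each row in /proc/buddyinfo is the number of free 4 MiB chunks in that memory zone (on systems with 4096-byte pages), so these files can be watched before and after trying the commands below. As a regular user:

$ cat /proc/buddyinfo
$ grep MemFree /proc/meminfo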
To drop the disk cache (copies of disk sectors kept in RAM): As root,
# echo 3 > /proc/sys/vm/drop_caches
This is non-destructive; see Documentation/admin-guide/sysctl/vm.rst in the kernel source tree.
Also, with memory compaction enabled in the kernel (CONFIG_COMPACTION is “y”), as root:
# echo 1 > /proc/sys/vm/compact_memory
This may reduce memory fragmentation somewhat.