Published: 18 November 2020

Introduction

Even though PCIe and USB 3.x rely on the same type of physical-layer bitstream (produced by a Multi-Gigabit Transceiver, MGT), there are fundamental differences between the two, which have a significant influence on how payload data is handled. This page is intended in particular for those who are used to working with Xillybus’ IP core for PCIe or AXI, and are considering the XillyUSB variant.

There are two main factors to consider: the quality of currently available hardware, and the inherent differences between the data transport protocols as well as the roles that the bus controllers play.

Hardware quality

There are many data-intensive computer peripherals, with zero tolerance for reliability issues, that rely on PCIe: graphics cards, NICs, SATA controllers and many more. After several years in use, PCIe has turned into a rock-solid interface, and those who have experience with Xillybus’ IP core for PCIe also know this first hand.

USB 3.x, on the other hand, is less established: Support for it is fairly widespread, but it isn’t always guaranteed to work properly on Linux machines. USB 3.x host controllers from some less reputable vendors are buggy, with workarounds implemented only in their closed-source drivers for Windows. These host controllers may barely work with Linux, yet perform flawlessly with Windows. However, as hardware vendors release newer versions of their products, there is an ongoing improvement.

Since PCIe is an extension of the processor’s internal memory map, dependability is a must. A PCIe device typically has access to the entire memory space by virtue of DMA. Therefore, flawed PCIe hardware can easily crash the computer, no matter which operating system it runs. The host and the peripheral must communicate perfectly with each other, or the user has every reason to throw that PCIe device away.

By contrast, USB devices are generally not expected to be as reliable. Occasional failures are often fixed by unplugging and re-plugging the device, and if a device has a flawed USB 3.x interface, the user is likely to work around it by plugging it into a USB 2.0-level port, hence enforcing the lower protocol level (usually without being aware of why that worked). The same workaround applies to a flawed USB 3.x port.

And then there’s the USB cable, which may vary in characteristics and quality, as opposed to the carefully designed copper traces on the motherboard. The sfp2usb kit arrives with a proper USB 3.x cable for this reason.

The importance of the host USB controller

As mentioned above, a PCIe device is a peripheral on the processor’s memory map, typically capable of initiating bus operations by virtue of DMA. This allows a data-intensive peripheral to exchange data with low and fairly predictable latency: The software must indeed allocate buffers for these data transfers in a timely manner, but the peripheral initiates the transfers at will.

USB devices, on the other hand, are external by all means. The bonus is hotplugging as a basic feature, but the interaction with a USB device is fundamentally different: Unlike a PCIe device, it can’t initiate communication, but only respond to data transfer requests from the host. The only way a device can control when data is transmitted is by temporarily refusing a data transmission that the host initiates. It may then inform the host that it’s ready for a data transmission again, but even then, it’s not allowed to transfer data: It’s the host that may or may not initiate a data transmission again.

The host USB controller, which is an independent peripheral on the processor’s bus, has full control over the data exchange between the host and the USB device. Even though it’s the software on the host (user-space and kernel drivers alike) that requests data transfers, there is no software control whatsoever over when a transfer is scheduled, in particular for Bulk endpoints. This is because the USB protocol is by far too intensive to be handled by the processor: Attempting to handle each USB port’s low-level protocol events in software would swamp the processor with interrupts only a few microseconds apart.
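
As an illustration of how little say the software has, here is a minimal user-space sketch of requesting a Bulk IN transfer with libusb. The VID/PID and endpoint address are placeholders rather than XillyUSB’s actual values, and this is not how Xillybus’ driver accesses the device; the point is only that the API hands a request over to the host controller and waits, with no control over when or how the transfer is actually carried out:

    /* Minimal sketch: a Bulk IN request with libusb-1.0. The VID/PID and
       endpoint address are placeholders. Note that the only timing-related
       parameter is a timeout; the host controller alone decides when the
       transfer takes place. */
    #include <stdio.h>
    #include <libusb-1.0/libusb.h>

    int main(void)
    {
        unsigned char buf[16384];
        int transferred = 0;
        libusb_context *ctx;

        if (libusb_init(&ctx) < 0)
            return 1;

        /* Hypothetical device: replace with the real VID/PID */
        libusb_device_handle *dev =
            libusb_open_device_with_vid_pid(ctx, 0x1234, 0x5678);

        if (dev && libusb_claim_interface(dev, 0) == 0) {
            /* Request up to 16 KiB from Bulk IN endpoint 0x81. The call
               returns once the controller has completed the transfer or
               the 1000 ms timeout has expired. */
            int rc = libusb_bulk_transfer(dev, 0x81, buf, sizeof(buf),
                                          &transferred, 1000);
            printf("rc=%d, received %d bytes\n", rc, transferred);
            libusb_release_interface(dev, 0);
        }

        if (dev)
            libusb_close(dev);
        libusb_exit(ctx);
        return 0;
    }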

This lack of software control holds true for all USB revisions, USB 3.x included: It’s the USB host controller that decides which endpoint to serve at any given time. If several endpoints are eligible for data transfers, it may choose to schedule a long burst for one of them and starve the others. Alternatively, it may schedule shorter bursts for each. Both ways conform to the specification, and the difference in raw bandwidth utilization is negligible. Some USB controllers do it one way, and others do it the other.

In fact, as a matter of poor design, some USB controllers don’t even utilize the raw bandwidth to full capacity, simply by not making full use of the features offered by the USB specification. Each USB controller has its own way of allocating the physical channel’s bandwidth among the endpoints that are eligible for data transmission. Pauses in the data transmission for no apparent reason are common, as shown on this page, which displays test results with pauses of up to 2 ms. Events of this sort don’t contradict the specification, so even a certified USB host controller may behave this way.

For this reason, the host’s USB controller can make a significant difference in the exchange of payload data. This is quite unfortunate, but it is manageable. This page shows how to identify the USB controller used with a XillyUSB device.

Comparing this behavior with PCIe is somewhat slippery, because its spec doesn’t guarantee a fulfillment time limit either. More precisely, the PCIe specification requires that the completion timeout for a DMA read operation by a device is set to between 50 μs and 50 ms, with no less than 10 ms as the recommended value. The guarantee given by the PCIe protocol is hence practically useless. In practice, however, PCIe infrastructure performs with no data flow intervention and negligible latency. In other words, a PCIe bus operation can be assumed to complete practically immediately.

XillyUSB’s approach

The XillyUSB IP core was designed with latency and bandwidth performance in mind, and with awareness of the diversity among USB controllers.

The important point about the USB host controller is that it has complete control over scheduling data transfers with the USB port’s link partner. Software running on the processor (including kernel software) can only supply buffers and request data transfers, but it has no control over, nor knowledge of, when those transfers will take place.

Even though Isochronous and Interrupt USB endpoints offer a guarantee of a certain amount of bandwidth within a periodic slot of 125 μs (a microframe, USB 2.0 and later), both have crucial drawbacks: Isochronous endpoints have no retransmission mechanism in case of bit errors, and Interrupt endpoints allow only a very low bandwidth.
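
To put rough numbers on the Interrupt endpoint limitation, here is a back-of-the-envelope calculation. It assumes, based on the commonly cited SuperSpeed figures rather than on XillyUSB’s documentation, that an Interrupt endpoint may move at most 3 packets of 1024 bytes per 125 μs service interval:

    /* Assumed ceiling for a SuperSpeed Interrupt endpoint: at most
       3 packets of 1024 bytes per 125 us service interval. */
    #include <stdio.h>

    int main(void)
    {
        const double interval_s = 125e-6;      /* service interval, seconds */
        const double max_payload = 3 * 1024.0; /* bytes per interval        */

        /* Prints roughly 24.6 MB/s, before protocol overhead -- far below
           the ~400 MB/s that a Bulk endpoint can reach on USB 3.0. */
        printf("Interrupt endpoint ceiling: %.1f MB/s\n",
               max_payload / interval_s / 1e6);
        return 0;
    }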

The approach taken for XillyUSB is to rely on Bulk endpoints (why is explained on this page), and to handle streams to and from the host completely differently:

  • A single Bulk IN endpoint is used for all streams and XillyUSB-specific messages towards the host (upstreams).
  • A Bulk OUT endpoint is allocated separately for each stream from the host (downstream), plus an extra Bulk OUT endpoint for XillyUSB-specific messages.

A single Bulk IN endpoint for all communication towards the host is possible because each stream towards the host is flow controlled separately by the XillyUSB driver. As a result, there is no need to flow control the stream that combines them. From the host’s point of view, the Bulk IN endpoint is always ready to deliver whatever data the FPGA has to send. The host’s USB controller is therefore expected to initiate data transfers virtually immediately after the FPGA signals that it has data ready.

In other words, the USB controller is left with no real choice: It can either serve the single Bulk IN endpoint or hold the upstream link idle. By leaving the controller with so little room for choice, the differences between controllers are mitigated.

This solution is possible because there is no problem allocating RAM buffers on the computer that are large enough to contain several milliseconds of data for each stream. With these large buffers, it’s possible to flow control each individual stream from the software driver, using XillyUSB messages to the FPGA, without risking degraded performance.
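
On the host side, none of this machinery is visible to the application: An upstream is consumed with plain file I/O, and the driver’s buffering and flow control do the rest. The following is a minimal sketch of such a reader; the device file name is a placeholder, not necessarily the name that the driver actually creates:

    /* Minimal sketch of a host application consuming an upstream
       (FPGA-to-host) XillyUSB stream. The device file name is a
       placeholder; use the name listed for the actual demo bundle. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[65536]; /* read in large chunks to reduce syscall overhead */
        int fd = open("/dev/xillyusb_read_32", O_RDONLY); /* placeholder */

        if (fd < 0) {
            perror("open");
            return 1;
        }

        while (1) {
            ssize_t n = read(fd, buf, sizeof(buf));

            if (n < 0) {
                perror("read");
                break;
            }
            if (n == 0) /* EOF: the stream was closed at the FPGA side */
                break;

            /* Consume the n bytes in buf here. As long as this loop keeps
               up on average, the driver's buffers absorb the jitter. */
        }

        close(fd);
        return 0;
    }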

Unfortunately, this method isn’t possible in the opposite direction, from the host: The flow control must be done between the FPGA and the USB host controller with microsecond granularity, or huge FPGA buffers would be required (huge in FPGA terms, acceptable on a computer). The software’s response is too slow for this.

Data acquisition and playback applications

As there is one Bulk OUT endpoint for each stream from the host to the FPGA, it’s up to the USB host controller to schedule the traffic in this direction, each controller with its own policy for dividing the raw bandwidth among the candidates. If a continuous data flow and/or low latency is required, it’s most likely helpful to ensure that only one stream from the host to the FPGA has data to send. This way, the USB host controller is once again left with little freedom of choice.

The FIFOs that are instantiated in the sample FPGA design in the demo bundles are sized to hold approximately 150 μs worth of data at the highest possible data rate, in an attempt to ensure a sustained data flow where needed. As shown on this page, they may turn out to be larger than required, but possibly also too small, as this depends on the host USB controller’s behavior.
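
For a sense of the quantities involved, here is a small calculation, assuming an effective payload rate of 400 MB/s (an illustrative figure, not one taken from XillyUSB’s specifications):

    /* Rough FIFO sizing at an assumed effective rate of 400 MB/s: how
       much data accumulates during a pause of a given length. */
    #include <stdio.h>

    int main(void)
    {
        const double rate = 400e6; /* assumed payload rate, bytes/second */

        /* ~60 KB: the order of magnitude of the demo bundles' FIFOs */
        printf("150 us worth: %.0f KB\n", rate * 150e-6 / 1e3);

        /* ~800 KB: what a 2 ms pause, as observed with some controllers,
           requires -- typically calling for external memory on an FPGA */
        printf("2 ms worth:   %.0f KB\n", rate * 2e-3 / 1e3);
        return 0;
    }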

For a design that needs to work with just any USB 3.x controller, the FIFO needs to be able to hold several milliseconds’ worth of data (at the data rate actually used by the related stream). It may be necessary to resort to external memory for this purpose. A boilerplate Verilog module, deepfifo, may help implement this.