Published: 18 November 2020

Introduction

Even though PCIe and USB 3.x rely on the same type of physical bitstream (a Multi-Gigabit Transceiver, MGT), there are fundamental differences between the two, which significantly influence how payload data is handled. This page is intended in particular for those who are used to working with Xillybus’ IP core for PCIe or AXI, and are considering the XillyUSB variant.

There are two main factors to consider: The quality of currently available hardware, and the inherent differences between the data transport protocols, including the roles that the bus controllers play.

Hardware quality

There are many data-intensive computer peripherals, with zero tolerance for reliability issues, that rely on PCIe: Graphics cards, NICs, SATA controllers and many more. After several years in use, PCIe has turned into a rock-solid interface, and those who have experience with Xillybus’ IP core for PCIe also know this first hand.

USB 3.x, on the other hand, is today where PCIe stood 10 years ago: It is fairly widely supported, but not always guaranteed to work properly. However, as hardware vendors release newer versions of their products, there is an ongoing improvement.

Since PCIe is an extension of the processor’s internal memory map, dependability is a must. A PCIe device typically has access to the entire memory space by virtue of DMA. Therefore, flawed PCIe hardware can easily crash the computer, no matter which operating system it runs. Both the host and the peripheral must communicate perfectly with each other, or the user has every reason to throw that PCIe device away.

By contrast, USB devices are generally not expected to be as reliable. Occasional failures are often fixed by unplugging and replugging the device, and if a device has a flawed USB 3.x interface, the user is likely to work around it by plugging it into a USB 2.0-level port, thereby enforcing the lower protocol level, not necessarily consciously. The same workaround applies to a flawed USB 3.x port.

In addition, a flawed USB device doesn’t generally threaten the computer’s stability, except through bugs in software: All it can do is send or accept data as scheduled by the host. It has no direct access to the computer’s resources, and is therefore limited in the harm it can cause.

All in all, there is more room for problematic USB 3.x hardware, as users usually find a way around problems with USB 3.x, without being aware that the solution was a fallback to USB 2.0.

On top of all this, there’s the USB cable, which may vary in characteristics and quality, as opposed to the carefully designed copper traces on the motherboard.

The importance of the host USB controller

As mentioned above, a PCIe device is a peripheral on the processor’s memory map, typically with the capability of initiating bus operations by virtue of DMA. This allows a data-intensive peripheral to exchange data with a low and fairly predictable latency: Granted, the software must allocate buffers for these data transfers in a timely manner, but the peripheral initiates the data transfers at will.

USB devices, on the other hand, are external in every sense. The bonus is hotplugging as a basic feature, but the interaction with a USB device is fundamentally different: Unlike a PCIe device, it can’t initiate communication, but only respond to data transfer requests from the host. The only way a device can control when data is transmitted is by temporarily refusing a data transfer that is initiated by the host. It may then inform the host that it’s ready again for a data transfer, but even then, it’s not allowed to transfer data: It’s the host that decides whether and when to initiate a data transfer again.
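
As a rough illustration of this role split, here is a minimal C sketch that simulates the sequence just described, borrowing the USB 3.x spec’s NRDY (“not ready”) and ERDY (“endpoint ready”) notions. It’s a toy simulation of who initiates what, not driver or firmware code:

    /* Toy simulation of the USB 3.x Bulk role split: the host initiates,
       the device may only defer (NRDY) and announce readiness (ERDY). */
    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
      bool device_has_data = false;
      int attempt = 0;

      for (;;) {
        /* Only the host may initiate a transfer: */
        printf("host: requests data (attempt %d)\n", ++attempt);

        if (device_has_data) {
          printf("device: sends data\n");
          break;
        }

        /* The device can only defer the transfer... */
        printf("device: NRDY (transfer deferred)\n");

        device_has_data = true; /* data becomes available later */

        /* ...and announce readiness. Even then, it must wait for the
           host to initiate the transfer again: */
        printf("device: ERDY (ready, waiting for the host)\n");
      }

      return 0;
    }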

The host USB controller, which is an independent peripheral on the processor’s bus, has full control over the data exchange between the host and the USB device. Even though it’s the software on the host (user-space and kernel drivers alike) that requests data transfers, there is no software control whatsoever over when a transfer is scheduled, in particular for Bulk endpoints. This is because the USB protocol is far too intensive to be handled by the processor. Attempting to handle each USB port’s low-level protocol events in software would swamp the processor with interrupts only a few microseconds apart.
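
To make this division of labor concrete, consider this user-space sketch in C, using libusb: The program merely queues a Bulk IN request and waits for its completion. When the actual bus transactions take place is entirely up to the host controller. The vendor/product IDs and the endpoint address below are placeholders:

    /* Sketch: request a Bulk IN transfer with libusb. This call only hands
       the request to the USB stack; the host controller alone decides when
       the bus transactions actually occur. */
    #include <libusb-1.0/libusb.h>
    #include <stdio.h>

    int main(void)
    {
      libusb_context *ctx;
      libusb_device_handle *h;
      unsigned char buf[1024];
      int transferred;

      if (libusb_init(&ctx))
        return 1;

      h = libusb_open_device_with_vid_pid(ctx, 0x1234, 0x5678); /* placeholders */
      if (!h) {
        libusb_exit(ctx);
        return 1;
      }

      if (libusb_claim_interface(h, 0)) {
        libusb_close(h);
        libusb_exit(ctx);
        return 1;
      }

      /* Blocks until the transfer completes or the timeout expires: */
      if (!libusb_bulk_transfer(h, 0x81 /* EP 1 IN */, buf, sizeof(buf),
                                &transferred, 1000 /* ms */))
        printf("received %d bytes\n", transferred);

      libusb_release_interface(h, 0);
      libusb_close(h);
      libusb_exit(ctx);
      return 0;
    }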

This holds true for all USB revisions, USB 3.x included: It’s the USB host controller that decides which endpoint to serve at any given time. If there are several endpoints eligible for data transfers, it may choose to schedule a long burst for one of them and starve the others. Alternatively, it may schedule shorter bursts for each. Both ways conform to the specification, and the difference in raw bandwidth utilization is negligible. Some USB controllers do it one way, and others do it the other.

In fact, as a matter of poor design, some USB controllers don’t even utilize the raw bandwidth to full capacity, simply because they don’t take full advantage of the features offered by the USB specification. It also happens that the link remains idle for several microseconds, even when the protocol allows transmission and there are buffers ready for it. Once again, it’s poor design that adds unnecessary latency and possibly reduces the usable bandwidth. Flaws of this sort don’t contradict the specification, so even a certified USB host controller may behave this way.

For this reason, the host’s USB controller can make a significant difference to the exchange of payload data. This is quite unfortunate, but manageable, in particular by avoiding low-end and/or old hardware. This page shows how to identify the USB controller used with a XillyUSB device.
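
On Linux, for example, the host controllers appear as root hubs, so a small C program can list them from sysfs. This is a sketch assuming the usual /sys/bus/usb/devices layout; the same information can be obtained with tools like lsusb:

    /* Sketch: list USB host controllers (root hubs) on Linux via sysfs.
       Assumes the usual /sys/bus/usb/devices layout, where root hubs
       appear as usb1, usb2, etc. */
    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>

    static void print_attr(const char *dev, const char *attr)
    {
      char path[512], buf[256];
      FILE *f;

      snprintf(path, sizeof(path), "/sys/bus/usb/devices/%s/%s", dev, attr);
      f = fopen(path, "r");
      if (!f)
        return;
      if (fgets(buf, sizeof(buf), f))
        printf("  %s: %s", attr, buf); /* buf retains its trailing newline */
      fclose(f);
    }

    int main(void)
    {
      DIR *d = opendir("/sys/bus/usb/devices");
      struct dirent *e;

      if (!d) {
        perror("opendir");
        return 1;
      }

      while ((e = readdir(d)))
        if (!strncmp(e->d_name, "usb", 3)) { /* root hubs only */
          printf("%s:\n", e->d_name);
          print_attr(e->d_name, "manufacturer"); /* e.g. the xhci-hcd driver */
          print_attr(e->d_name, "product");      /* e.g. "xHCI Host Controller" */
          print_attr(e->d_name, "speed");        /* link speed in Mb/s */
        }

      closedir(d);
      return 0;
    }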

Comparing this behavior with PCIe is somewhat slippery, because the PCIe spec doesn’t guarantee a fulfillment time either. More precisely, the PCIe specification requires that the timeout for a DMA read operation by a device is set to between 50 μs and 50 ms, with no less than 10 ms as the recommended value. The guarantee given by the PCIe protocol is hence practically useless. Practically speaking, however, PCIe infrastructure performs with no data flow intervention and negligible latency. In other words, a PCIe bus operation can be assumed to complete practically immediately, or the typical designer will consider the bus infrastructure dysfunctional.

XillyUSB’s approach

The XillyUSB IP core was designed with latency and bandwidth performance in mind, with awareness of the diversity among USB controllers.

The important point about the USB host controller is that it has complete control over scheduling data transfers with the USB port’s link partner. Software running on the processor (including kernel software) can only supply buffers and request data transfers, but it has no control over, nor knowledge of, when those transfers will take place.

Even though Isochronous and Interrupt USB endpoints offer a guarantee of a certain amount of bandwidth within a periodic slot of 125 μs (a microframe, USB 2.0 and later), these have crucial drawbacks: Isochronous endpoints support no retransmission mechanism in case of bit errors, and Interrupt endpoints allow only very low bandwidth.

Because 125 μs is far longer than the typically observed latency caused by a USB 3.x controller, the approach taken for XillyUSB is to rely on Bulk endpoints, and to handle streams to and from the host completely differently:

  • A single Bulk IN endpoint is used for all streams and XillyUSB-specific messages towards the host (upstreams).
  • A Bulk OUT endpoint is allocated separately for each stream from the host (downstream), plus an extra Bulk OUT endpoint for XillyUSB-specific messages.

A single Bulk IN endpoint for all communication towards the host is possible because each stream towards the host is flow controlled separately by the XillyUSB driver. As a result, there is no need to flow control the stream that combines them. From the host’s point of view, the Bulk IN endpoint is always ready for data that the FPGA has to send. The host’s USB controller is therefore expected to initiate data transfers virtually immediately after the FPGA signals that it has data ready.

In other words, the USB controller is left with no real choice: It can either serve the single Bulk IN endpoint or hold the upstream link idle. By leaving the controller this little room for choice, the differences between controllers are mitigated.

This solution is possible because there is no problem allocating RAM buffers that are large enough to contain several milliseconds of data for each stream. With these large buffers, it’s possible to flow control the data for each individual stream by virtue of the software driver, using XillyUSB messages to the FPGA, without risking degraded performance.
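
To get a feeling for the numbers, here is a back-of-envelope sketch; the 400 MB/s payload rate is an assumption for illustration, not a figure from the XillyUSB documentation:

    /* Back-of-envelope: RAM needed to buffer 10 ms of a stream running at
       an assumed 400 MB/s. Trivial by a computer's standards. */
    #include <stdio.h>

    int main(void)
    {
      const double rate = 400e6; /* assumed payload rate, bytes/s */
      const double span = 10e-3; /* 10 ms worth of data */

      printf("Buffer size: %.1f MB per stream\n", rate * span / 1e6); /* 4.0 MB */
      return 0;
    }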

Unfortunately, this method isn’t possible in the opposite direction, from the host: The flow control must be done between the FPGA and the USB host controller with microsecond granularity, or huge FPGA buffers would be required (huge in FPGA terms, acceptable on a computer). The software’s response is too slow for this.

As there is one Bulk OUT endpoint for each stream from host to FPGA, it’s up to the USB host controller to schedule the traffic in this direction, with each controller applying its own policy for dividing the raw bandwidth among the candidates. If a continuous data flow and/or low latency is required, it’s most likely helpful to ensure that only one stream from the host to the FPGA has data to send. This way, the USB host controller is once again left with little freedom of choice.
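
For instance, a host application feeding a single downstream boils down to plain file I/O, as Xillybus streams are exposed as device files. The minimal C sketch below makes this concrete; the device file name is hypothetical, as the actual name depends on the demo bundle and the IP core’s configuration:

    /* Sketch: feed one downstream (host to FPGA) stream through its device
       file. The file name below is hypothetical; check your demo bundle
       for the actual one. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
      unsigned char buf[65536];
      int fd = open("/dev/xillyusb_write_32", O_WRONLY); /* hypothetical name */

      if (fd < 0) {
        perror("open");
        return 1;
      }

      for (unsigned int i = 0; i < sizeof(buf); i++)
        buf[i] = i & 0xff; /* something to send */

      /* write() may consume only part of the buffer, so loop: */
      for (size_t done = 0; done < sizeof(buf); ) {
        ssize_t rc = write(fd, buf + done, sizeof(buf) - done);

        if (rc <= 0) {
          perror("write");
          break;
        }
        done += rc;
      }

      close(fd);
      return 0;
    }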

The FIFOs that are instantiated in the sample FPGA design in the demo bundles are set up with sizes that correspond to approximately 150 μs worth of data at the highest possible data rate, in an attempt to ensure a sustained data flow where needed. They may turn out to be larger than required, but possibly also too small, as this depends on the host USB controller’s behavior.
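
As a worked example, assume (for illustration only) a payload rate of 400 MB/s and a 32-bit wide FIFO; neither is necessarily the demo bundles’ exact setting:

    /* Back-of-envelope FIFO sizing for 150 us at an assumed 400 MB/s,
       with an assumed 32-bit FIFO width. */
    #include <stdio.h>

    int main(void)
    {
      const double rate = 400e6;  /* assumed payload rate, bytes/s */
      const double span = 150e-6; /* time the FIFO must bridge, seconds */
      const int width = 4;        /* assumed FIFO word width, bytes */

      double bytes = rate * span; /* 60000 bytes */

      printf("FIFO: at least %.0f bytes, i.e. %.0f words of 32 bits\n",
             bytes, bytes / width); /* 60000 bytes = 15000 words */
      return 0;
    }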

The idea behind this figure is that a microframe, which is the basic element for data transfer scheduling, is 125 μs long. Hence it’s reasonable to expect that no data link is starved for longer than that. Even though there’s no way to guarantee the behavior of any USB controller, it’s plausible to consider a controller ineligible for use if it starves an endpoint for longer than 125 μs.