Published: 18 November 2020

Introduction

The XillyUSB IP core provides two methods for detecting problems with the physical layer of the USB connection: LEDs on the board, which give an immediate indication of problems, and event counters, which are stored on the FPGA and can be displayed with a utility running on the computer to which the examined USB link is connected.

These two methods are first described briefly in a “how-to” manner, which suits those who have a perfectly working link and just want a quick verification. This is followed by a more in-depth explanation of what is detected and how.

Board LEDs gpio_led[3:0]

The XillyUSB IP core provides 8 outputs for LEDs as gpio_led[7:0]. As some boards don’t have 8 GPIO LEDs, it’s sufficient to display only the first four. The descriptions below relate to when the LED is on (logic ‘1’).

  • gpio_led[0]: Heartbeat, 1 Hz, 50% duty cycle. On some boards, it may remain lit for about 20 seconds after FPGA configuration. This is normal, as explained below.
  • gpio_led[1]: Data is transmitted towards the host.
  • gpio_led[2]: Data is transmitted from the host.
  • gpio_led[3]: Flashes when an error is detected in either direction, or is held at '1' when the link is down.

gpio_led[1] and gpio_led[2] flash briefly when the host enumerates the device -- this is also considered data exchange.

Error indications on gpio_led[0] and gpio_led[3]

gpio_led[3] is off (logic ‘0’) when the USB link with the host is set up without any errors.

It is on (logic ‘1’) during LFPS polling, which is the initial stage for setting up a USB 3.0 link, before and mutually exclusive with having the 5 Gb/s link up. If this LED remains steadily lit when connecting to a USB port, odds are that the port doesn’t support USB 3.0 or that the wrong GTX is used.

If it flashes in response to connecting to a USB port, there’s a problem with the low-level handshake for setting up the link. The most likely cause is a low-quality USB port or cable, or physical damage to some hardware component.

If it flashes occasionally while the USB device is up and running, it indicates errors on the physical link that were either corrected by the link protocol, or caused a visible error (most of the time it’s the former).

gpio_led[0] and gpio_led[3] may indicate certain conditions immediately after configuration of the FPGA, as listed below:

  • When gpio_led[0] is steadily lit, the external chip that produces the reference clock isn’t locked. On boards with Si5324 / Si5328, these devices may take about 20 seconds to achieve lock, which causes this delay. This applies to a range of Xilinx development boards.
  • When gpio_led[0] flashes briefly every 500 ms, the MGT reference clock is present and locked, but the MGT’s frontend module (named *_frontend.v) didn’t release its frontend_rst signal, indicating that it failed to complete its wakeup procedure. On Ultrascale devices, this is possibly because of a failed CPLL calibration.
  • When gpio_led[3] flashes rapidly (12 Hz), the I2C setting of the external reference clock generator chip has failed (lack of ACK from I2C slave).

Some or all of these three conditions may be impossible on some boards, as they don’t have the components that cause them.
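The gpio_led[0] patterns listed above differ in period and duty cycle, so they can be told apart from a single measured blink cycle. The following is a minimal sketch, assuming the LED signal has been captured somehow (for example with a logic analyzer) and reduced to an on-time and a period in seconds. The function name and the tolerance thresholds are illustrative assumptions and are not part of XillyUSB.

```python
def classify_led0(on_time_s, period_s):
    """Guess the meaning of gpio_led[0] from one measured blink cycle.

    period_s is None when the LED stayed lit for the whole observation.
    The tolerances below are arbitrary, illustrative choices.
    """
    if period_s is None:
        return "steadily lit: external reference clock chip not locked"
    duty = on_time_s / period_s
    if abs(period_s - 0.5) < 0.1 and duty < 0.25:
        return "brief flash every 500 ms: frontend_rst not released"
    if abs(period_s - 1.0) < 0.2 and 0.4 <= duty <= 0.6:
        return "heartbeat: normal operation"
    return "unrecognized pattern"
```

Keep in mind that a steadily lit LED is normal for about 20 seconds on boards with Si5324 / Si5328, so the first result should only be treated as a fault if it persists.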

Board LEDs gpio_led[7:4]

The remaining four LEDs provide additional information:

  • gpio_led[4]: Problem with link to host (bad packet reported or Recovery requested by link partner).
  • gpio_led[5]: Problem with link from host (bad packet received or the FPGA requested Recovery).
  • gpio_led[6]: Endpoint reset: This LED is on when the device is not enumerated and/or when a SET_CONFIGURATION or SET_INTERFACE USB protocol command arrives.
  • gpio_led[7]: The device is in one of the low-power states (U1, U2 or U3).

The term “Recovery” is explained in the in-depth part below.

Note that gpio_led[3] shows link errors in either direction, so if either gpio_led[4] or gpio_led[5] is on, gpio_led[3] is turned on along with it. The purpose of gpio_led[5:4] is to tell in which direction the problem is.
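For those who capture the LED outputs as an 8-bit value (for example by routing gpio_led[7:0] to a logic analyzer), the meaning of the upper four bits can be looked up mechanically. This is only an illustrative sketch: the function name and the idea of a sampled snapshot are assumptions, and the texts are shortened versions of the descriptions above.

```python
# Bit position in gpio_led[7:0] -> short description (from the list above)
LED_MEANING = {
    4: "problem with link to host",
    5: "problem with link from host",
    6: "endpoint reset / device not enumerated",
    7: "low-power state (U1, U2 or U3)",
}

def decode_upper_leds(leds):
    """Return the descriptions of the bits set in gpio_led[7:4]."""
    return [text for bit, text in LED_MEANING.items() if leds & (1 << bit)]
```

For example, decode_upper_leds(0x18) reports only the problem with the link to the host: bit 3 is set along with bit 4, as expected from the note above, but bit 3 is not decoded by this function.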

The showdiagnostics utility

As a complement to the LEDs outlined above, the XillyUSB IP core maintains several event counters which are incremented in some error-related situations. The showdiagnostics utility allows fetching the values of these counters as well as some status data from the FPGA. Under proper operation, no error-related counter should increment. If any of them does, it’s an indication of a low-level link problem.

By default, the utility periodically opens /dev/xillyusb_00_diagnostics (Linux) or \\.\xillyusb_00_diagnostics (Windows), fetches the current values of these counters and displays the results on the console, in a way similar to the UNIX “top” and “watch” utilities. This extra device file is connected to the IP core’s internal logic, and is added automatically to all IP cores generated in the IP core factory. Likewise, it’s included in cores of the demo bundles.

For execution on Windows, download the diagnostic utility for Windows at the download page, and extract the showdiagnostics.exe file, which is a console program. Then run it from a Command Prompt window, or double-click the file. The source code for this utility is in the same zip as the executable, for those preferring to compile it themselves.

If the examined device has a non-zero index, pass this index as an argument. For example, to access \\.\xillyusb_02_diagnostics instead, open a Command Prompt window, navigate to the relevant directory, and type

> showdiagnostics 2

Likewise for Linux, extract the showdiagnostics.pl Perl script from the xillyusb.tar.gz bundle, where it’s found under utils/. From the directory it was uncompressed into, type

$ ./showdiagnostics.pl

for execution. As above, to access /dev/xillyusb_02_diagnostics instead, type

$ ./showdiagnostics.pl 2
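The relation between the index argument and the device file name can be sketched as follows. This is not taken from the utility’s source code: the function name is hypothetical, and the two-digit zero-padded index is an assumption inferred from the names xillyusb_00_diagnostics and xillyusb_02_diagnostics shown above.

```python
import sys

def diagnostics_path(index=0, platform=sys.platform):
    """Build the diagnostics device file name for a given device index.

    Assumes a two-digit, zero-padded index, as in the names shown above.
    """
    name = "xillyusb_%02d_diagnostics" % index
    if platform.startswith("win"):
        return "\\\\.\\" + name   # i.e. \\.\xillyusb_NN_diagnostics
    return "/dev/" + name
```

With index 2, this yields /dev/xillyusb_02_diagnostics on Linux and \\.\xillyusb_02_diagnostics on Windows.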

Assessing the link quality

The output on the screen typically looks as follows:

XillyUSB event counts - Tue Nov 10 14:20:59 2020

Errors on FPGA to host link:
===========================
Bad packets received by link partner: 0
Recovery requests by link partner: 0

Errors on host to FPGA link:
===========================
Bad packet received by device: 0
Errors detected during link idle: 0
Recovery requests by device: 0

Power management:
================
XillyUSB device's power policy:
REFUSE to low power state

PORT_U2_TIMEOUT: 10.24 ms

Low power transitions made: 0
Low power transitions refused by device: 490397

As the titles imply, the first two sections display counters that remain at zero for a proper link. Each section shows the error counters that are related to one of the two directions.
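For automated checks, the text printed by the utility can be parsed and the five error counters tested against zero. The sketch below assumes that the output format is exactly as in the sample above (one “label: integer” pair per line). The function names are hypothetical, and capturing the utility’s output into a string is left to the caller.

```python
import re

# Labels of the error counters, copied from the sample output above
ERROR_COUNTERS = (
    "Bad packets received by link partner",
    "Recovery requests by link partner",
    "Bad packet received by device",
    "Errors detected during link idle",
    "Recovery requests by device",
)

def parse_counters(text):
    """Collect every 'label: integer' line of the report into a dict."""
    counters = {}
    for line in text.splitlines():
        m = re.fullmatch(r"\s*(.+?):\s*(\d+)\s*", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def nonzero_errors(counters):
    """Return only the error counters that are not zero."""
    return {k: counters[k] for k in ERROR_COUNTERS if counters.get(k, 0)}
```

An empty dictionary from nonzero_errors() corresponds to a clean link. Lines such as the date header and PORT_U2_TIMEOUT don’t end with a plain integer after the colon, so they are skipped by the pattern.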

If the upstream link (towards the host) isn’t active with data, there’s little chance to detect a poor link, because the link partner complains only about errors found in traffic. It’s therefore suggested to run this diagnostic check with data flowing towards the host. It may be application data, or just zeros. For example, with an out-of-the-box demo bundle (with loopback FIFOs), these two commands can be used:

$ dd if=/dev/zero of=/dev/xillyusb_00_write_32 bs=64k &
$ dd if=/dev/xillyusb_00_read_32 of=/dev/null bs=64k &

"dd" is a standard utility on Linux machines, but it is available for Windows as well, among others in the Xillybus package for Windows, which can be fetched from the download page. However, the Windows equivalent of /dev/null is NUL, and Windows has no equivalent of /dev/zero. Hence an alternative to /dev/zero is required. This could be any program that writes any data to the device file, or dd just reading from a plain file, e.g.

> dd if=the-large-file.dat of=\\.\xillyusb_00_write_32 bs=64k

from one Command Prompt window, and

> dd if=\\.\xillyusb_00_read_32 of=NUL bs=64k

from another. Alternatively, the Linux-like commands can be used on Windows by virtue of Cygwin, or a rewritten dd utility for Windows, which mimics /dev/zero internally.
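Another option for generating the traffic, on either operating system, is a short script that plays the role of the two dd commands. The following is a minimal sketch with hypothetical function names. It assumes that the device files can be opened like ordinary files, as dd does, and it uses the same 64 kB block size as above.

```python
def write_zeros(path, total_bytes, block_size=64 * 1024):
    """Write total_bytes of zeros to path, block_size bytes at a time."""
    block = bytes(block_size)          # a block of zeros
    written = 0
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            written += f.write(chunk)  # may be a partial write
    return written

def drain(path, total_bytes, block_size=64 * 1024):
    """Read and discard up to total_bytes from path; return bytes read."""
    count = 0
    with open(path, "rb", buffering=0) as f:
        while count < total_bytes:
            data = f.read(min(block_size, total_bytes - count))
            if not data:               # end of file
                break
            count += len(data)
    return count
```

As with the dd commands, the writer and the reader must run at the same time, for example in two separate Command Prompt windows, with r"\\.\xillyusb_00_write_32" and r"\\.\xillyusb_00_read_32" as the respective paths on Windows.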

Note that all counted events (including LPM events mentioned below) are related to the port that is connected to the FPGA directly (the link partner) and not necessarily the host. Typically, the FPGA is connected directly to the host, making it the link partner, but if the FPGA is connected to the host through a hub, it’s the communication with the hub’s port that is monitored. In the latter case, there is no way to examine the link segment(s) between the hub and host.

Also note that it may take about one second for the physical link to stabilize, during which the link performance is degraded and errors are counted. This doesn’t necessarily indicate a problem. It’s therefore recommended to launch this utility slightly after connecting a new device (which is what most of us do naturally anyhow).
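Since the errors counted during the first second remain in the counters, it can be more informative to look at how much each counter has grown between two readings than at its absolute value. A minimal sketch, assuming the counters of each reading were collected into a dictionary keyed by the counter’s label (the function name is hypothetical):

```python
def counter_deltas(baseline, current):
    """Per-counter increase since the baseline reading.

    A counter missing from the baseline is treated as having been zero.
    """
    return {k: current[k] - baseline.get(k, 0) for k in current}
```

If the baseline is taken a few seconds after connecting the device, a result with all zeros means that no new errors occurred since then, regardless of what was counted while the link stabilized.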

Power management (LPM) section

The third section displays parameters and counters related to USB 3.0 power management. This part is irrelevant for default IP cores, as they are set to refuse requests from the link partner to power down the link, to the extent allowed by the spec. This is indicated by the part saying “XillyUSB device’s power policy: REFUSE to low power state”. Practically speaking, this means that the 5 Gb/s link remains up unless the host goes into suspend mode (U3 power state).

No XillyUSB IP core initiates a low-power state transition, no matter its configuration.

The PORT_U2_TIMEOUT parameter is optionally set by the host during enumeration, and is explained in detail on this page. Practically, if anything other than “disabled” is displayed for this parameter, it means that the link partner is programmed to issue low-power requests when the link is idle for a certain period of time, which is typically much shorter (possibly 50 μs) than PORT_U2_TIMEOUT. The latter is the timeout for entering U2, which is the deeper power state. There’s normally a transition to the shallower U1 power state first, and only then down to U2. PORT_U2_TIMEOUT is shown only because it usually tells whether the link partner is configured to enter low power states.

Once again, this isn’t relevant, because the device refuses these states anyhow. This is indicated by the “Low power transitions made” counter staying at zero, and “Low power transitions refused by device” possibly counting up (if the link partner indeed requests low power states).
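As a side note on the 10.24 ms figure in the sample output: the sketch below assumes the encoding of PORT_U2_TIMEOUT as understood from the USB 3.0 specification, i.e. an 8-bit value in units of 256 μs, where 0x00 and 0xFF are special values that don’t stand for a finite timeout. How showdiagnostics itself performs this conversion isn’t covered here, and the function name is hypothetical.

```python
def u2_timeout_us(raw):
    """Convert an 8-bit PORT_U2_TIMEOUT value to microseconds.

    Assumes 256 us units. Returns None for 0x00 and 0xFF, which are
    taken as special values rather than finite timeouts.
    """
    if not 0 <= raw <= 0xFF:
        raise ValueError("PORT_U2_TIMEOUT is an 8-bit value")
    if raw in (0x00, 0xFF):
        return None
    return raw * 256
```

Under this assumption, a raw value of 40 corresponds to 10240 μs, which matches the 10.24 ms in the sample output.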

Bit errors: A closer look

In broad terms, the USB 3.0 specification requires two mechanisms for coping with bit errors on the physical channel: Retransmission and Recovery.

All communication on the link takes place in packets, each protected with some sort of CRC. If a packet arrives with a mismatching CRC, its receiver discards it, and requests a retransmit from the link partner. However, there’s a group of packets (Link Commands) which can’t be retransmitted on request. Hence if such a packet is corrupted by a bit error, the USB link protocol needs to reinitialize by virtue of a Recovery handshake. The low-level link partner controllers do this on their own, with the processor having no practical way to tell when and how often this happens.

Link command packets are relatively short (8 bytes), and are therefore relatively unlikely to be struck by bit errors. It’s therefore common to see more bad packet retransmits than Recovery handshakes on a link that has random bit errors.
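The influence of the packet length can be illustrated with a simple model: if bit errors are independent and occur with probability p per bit, a packet of n bits is hit with probability 1 - (1 - p)^n. The sketch below is an illustration under these assumptions, not a measurement. It counts 10 line bits per byte because of the 8b/10b coding of the 5 Gb/s link, and the function name is hypothetical.

```python
import math

def packet_error_probability(n_bytes, bit_error_rate, bits_per_byte=10):
    """Probability that at least one bit of a packet is corrupted.

    Assumes independent bit errors. expm1/log1p keep the result
    accurate for very small bit error rates.
    """
    n_bits = n_bytes * bits_per_byte
    return -math.expm1(n_bits * math.log1p(-bit_error_rate))
```

For a small bit error rate, this probability is nearly proportional to the packet’s length. In this model, 1024 bytes of payload data are hence roughly 128 times more likely to be hit than an 8-byte Link Command, which is consistent with retransmits being more common than Recovery handshakes.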

The Recovery handshake is a quick and well-defined procedure, and should in theory not cause any problems. It typically takes about 1-2 μs to complete, even though it may take as long as 19 ms without violating the spec.

However, since entering this Recovery handshake abruptly halts all other communication on the link, it tends to expose bugs in the host’s USB controller, in particular under busy traffic and when Recovery handshakes occur frequently. It’s therefore not rare that a link that often goes into Recovery experiences the equivalent of a physical disconnection, or some other visible indication of error. For example, on very rare occasions some host controllers have been observed mixing up payload data, which causes data errors in the application-level streams. When this happens, it’s most likely not a problem with the Recovery handshake itself, but a bug it triggered. Once again, the Recovery invocation by itself never causes any log message or something of that sort on the computer.

As a side note, the USB protocol requires entry into Recovery for a whole range of reasons, as listed on this page. But except for buggy behavior by the link partner, the underlying cause for transition from a working link into Recovery is always a bit error on a Link Command packet.

The link error counters in detail

The FPGA detects errors on the upstream link (towards the host) when the link partner requests a retransmit of a bad packet (i.e. one that didn’t pass a CRC check), or when it requests a reinitialization of the low-level protocol (with a Recovery procedure per the USB 3.0 spec). When such a request arrives, the respective counter is incremented.

These counters should remain zero on a properly functional link.

The FPGA detects errors on the downstream link (from the host) in three possible ways:

  1. A packet arriving with mismatching CRC.
  2. Detecting a situation that requires transition into Recovery (which usually implies that an inbound packet previously arrived with error).
  3. Detecting errors while the link partner doesn’t transmit data (during link idle).

The first two ways are the same as with the upstream link, only that the error cause is sensed directly, rather than detecting its response from the link partner.

The third method relies on the fact that while the link is idle, the data should be all zeros (after descrambling). Hence any non-zero data between packets is considered a bit error, and is counted by the IP core. Detecting errors of this sort is not required by the USB spec, which is why the link partner can’t be expected to perform this check, let alone report it.

Notes:

  • A large error count doesn’t necessarily indicate serious problems. Errors may come in bursts. It’s also possible that a certain mishap increments more than one counter.
  • The link idle error counter relies on the link partner not transmitting data when it’s not expected to. If the link partner has protocol bugs, or if there’s a momentary protocol confusion due to a previously corrupted packet, error-free data may be counted as link idle errors.
  • The link idle counter counts error events, not bit errors. In particular, it’s deactivated for about 1024 bytes after detecting an error event, to prevent huge counts due to the protocol confusion case just mentioned.
  • Because the physical link involves adaptive mechanisms for improving the signal quality (equalizers in particular), it happens that hardware that normally works perfectly (i.e. error-free), occasionally performs poorly with a lot of errors, but then goes back to working perfectly just by disconnecting and connecting the USB plug.
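The difference between counting error events and counting bit errors, as mentioned in the notes above, can be demonstrated with a small software model. This is only a sketch of the described behavior, not the IP core’s actual logic: it takes the bytes received during link idle after descrambling, counts an event on a non-zero byte, and then ignores the following bytes. The exact holdoff of 1024 bytes is an assumption (the text above says “about 1024”), and the function name is hypothetical.

```python
def count_idle_error_events(idle_bytes, holdoff=1024):
    """Count error events in descrambled link idle data.

    Any non-zero byte is an error event. After an event, the next
    `holdoff` bytes are ignored, so a burst is counted only once.
    """
    events = 0
    skip = 0
    for b in idle_bytes:
        if skip:
            skip -= 1          # still within the holdoff period
        elif b != 0:
            events += 1
            skip = holdoff
    return events
```

With this model, 500 consecutive corrupted bytes are counted as a single event, whereas two isolated corrupted bytes that are more than 1024 bytes apart are counted as two.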