Introduction
Developing an FPGA design typically involves loading a bitstream to the FPGA repeatedly for testing different versions. When this FPGA is connected to a host through Xillybus or XillyUSB, it's often desirable to avoid rebooting the host for the sake of hot reconfiguration of the FPGA.
Even though hotplugging is fully supported by the specifications of both PCIe and USB from their very beginning, there's a significant difference in the de-facto support by commercial computer hardware. Generally speaking, it's not unusual that such hardware violates the underlying specifications blatantly when the relevant usage scenario is rarely used. This holds true even for hardware from the most reputable vendors.
This is an important notion when it comes to hotplugging: While USB devices are commonly connected and disconnected from the computer, it's rare that a PCI or PCIe device is removed or plugged into a working desktop computer.
Hence it's necessary to make a strong distinction between XillyUSB and PCIe-based Xillybus:
- If the FPGA is connected to the host through a USB 3.0 interface using XillyUSB, it's perfectly fine to reconfigure the FPGA in any situation. This includes when the FPGA is connected to the host, and data transmission is in progress. Doing so will result in the equivalent of unplugging the USB physically from the host and returning it, something that both hardware and driver software are designed to handle gracefully.
- If the FPGA is connected to the host through PCIe with Xillybus, reconfiguration of the FPGA while the host is powered on may cause the operating system to crash immediately, because the computer's hardware lacks support for PCIe hotplugging. The majority of computers will however not respond as harshly: Rather, the removal and return of the PCIe interface is ignored by the operating system, and attempting to perform I/O with Xillybus after the FPGA has been reconfigured results in a plain, user-space error indicating that the I/O operation failed. The operating system remains stable nevertheless.
As there's no problem with XillyUSB to begin with, the rest of this discussion relates to Xillybus over PCIe.
Note that this page discusses the full configuration of the FPGA. For Partial Reconfiguration, see this other page.
PCIe hot reconfiguration is only for the lab
It's important to stress that reloading the FPGA while the host is up and running is suitable as a lazy hack in the lab, but is not a good practice in a product for end users.
The correct way to change the FPGA's behavior in a running system is by virtue of Partial Reconfiguration (PR), i.e. by loading a configuration bitstream that affects only some regions of the logic fabric, leaving the logic that implements the PCIe block and Xillybus IP core intact (among other logic elements). This ensures that the PCIe link's continuity is not broken by this reconfiguration, and hence the FPGA hardware can be used with host hardware that doesn't play well with PCIe hotplugging.
The Partial Reconfiguration bitstream itself can be transported to the FPGA by virtue of a Xillybus stream, which simplifies the implementation of this feature.
Alternatively, many FPGA families have a dedicated feature for Partial Reconfiguration over PCIe, e.g. Xilinx' Tandem PCIe configuration and Intel FPGA's Configuration via Protocol (CvP).
Sequence for hot reconfiguration
The correct sequence for hot reconfiguration of an FPGA that is active as a PCIe device on the bus is:
- Deregister the PCIe device from the operating system.
- Reconfigure the FPGA (load the new bitstream).
- Issue a scan of the PCIe bus to redetect the FPGA.
The importance of the first stage is different for Windows and Linux, as elaborated next.
On Microsoft Windows
The first and third steps in the sequence above are both carried out with the Windows Device Manager.
In order to deregister the device, open the Windows Device Manager, and find the entry saying "Xillybus driver for generic FPGA interface". Right-click this entry, and select "Uninstall". This removes the device from the operating system after unloading the driver properly.
If there's a check box for deleting the driver software as well, don't check it.
After the FPGA has been reconfigured, click "Action" on the Device Manager's menu bar, and select "Scan for hardware changes". This causes the operating system to redetect the FPGA as a PCIe device, and launch Xillybus' driver on its behalf.
The importance of deregistering the device on Windows is that if the FPGA is reconfigured before the driver is detached from it, the driver will fail to get a confirmation from the FPGA that it won't perform any further DMA operations. This is indicated by these two messages in the Windows event log:
Failed to quiesce the device on exit. Quitting while leaving a mess.
Practically, this causes a memory leak, as the driver refrains from releasing resources, most notably the RAM buffers that were allocated for DMA transfer. As these RAM buffers are taken from a relatively limited pool of kernel memory, skipping the deregistration repeatedly can lead to a shortage of kernel memory, in particular if the DMA buffers are large. Among others, this may prevent the driver from initializing the next time.
So don't skip this stage, even though it will probably appear to be fine to do so, at least for a few rounds.
The rest of this page focuses on Linux.
On Linux
The first and third steps in the sequence above are implemented as commands to the Linux kernel through sysfs. Even though both steps can be done with a one-liner at shell prompt, the first step -- the deregistration -- is best done with the script that is listed below, in order to ensure the safe removal of the device.
So execute the script listed below before reconfiguration with something like
# ./safe-remove.sh
and issue this command at shell prompt after the reconfiguration:
# echo 1 > /sys/bus/pci/rescan
Both operations require root privileges.
Note: Don't try modprobe for the Xillybus driver. It will fail to work unless rescanning has taken place (the FPGA won't respond) and it is unnecessary after the rescan (the module is loaded automatically).
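The three steps can be chained in a small wrapper, so that the FPGA is never reconfigured after a failed deregistration. This is only a sketch: hot_reconfigure and its arguments are names made up for illustration, and the second argument stands for whatever command loads the bitstream in your setup (this page doesn't prescribe one).

```bash
#!/bin/bash
# Sketch only: hot_reconfigure is a hypothetical helper, not part of Xillybus.
#   $1: command that deregisters the device (e.g. ./safe-remove.sh)
#   $2: command that loads the new bitstream (placeholder, tool-dependent)
#   $3: the rescan file, normally /sys/bus/pci/rescan
hot_reconfigure() {
  local remove_cmd=$1 program_cmd=$2 rescan_file=$3

  # Step 1: stop here if the device could not be removed safely.
  if ! "$remove_cmd" ; then
    echo "Deregistration failed. Not reconfiguring the FPGA." >&2
    return 1
  fi

  # Step 2: load the bitstream. If this fails, the FPGA's state is
  # unknown, so the rescan is skipped.
  if ! "$program_cmd" ; then
    echo "Loading the bitstream failed. Not rescanning." >&2
    return 1
  fi

  # Step 3: ask the kernel to redetect the FPGA.
  echo 1 > "$rescan_file"
}
```

As root, this would be called with something like
# hot_reconfigure ./safe-remove.sh ./load-bitstream.sh /sys/bus/pci/rescan
where load-bitstream.sh is a hypothetical script that runs your programming tool.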
The script
This is the said script to safely deregister the PCIe device:
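#!/bin/bash
# Safely deregister the Xillybus PCIe device before reconfiguring the FPGA.

driver=xillybus_pcie
DIR=/sys/bus/pci/drivers/$driver

if [ ! -d "$DIR" ] ; then
  echo "Driver appears to be unloaded. Maybe the device is already removed?"
  exit 1
fi

# Collect the sysfs paths of the devices bound to the driver while it's
# still loaded: These entries disappear along with the driver.
devices=$(readlink -e "$DIR"/* | grep ^/sys/devices)

# rmmod fails if there are open device files. Nothing is removed then.
if ! rmmod $driver ; then
  echo "Failed to remove the driver. Doing nothing."
  echo "There are probably open device files. Maybe try lsof | grep xillybus"
  exit 1
fi

if [ -d "$DIR" ] ; then
  echo "The driver is still loaded, despite rmmod returning success. Weird."
  exit 1
fi

# With the driver unloaded, removing the devices is safe.
for dev in $devices ; do
  echo "Removing $dev"
  if ! echo 1 > "$dev/remove" ; then
    echo "Failed to write to $dev/remove"
    exit 1
  fi
done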
A closer look at using the script
When used correctly, the script itself outputs
Removing /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
and the kernel log emits something like
xillybus_pcie 0000:01:00.0: Removed 5 device files.
The "0000:01:00.0" part varies, depending on the device's PCI bus address.
It can be verified that the PCI device has indeed been removed by using lspci: The device should not appear in the output of the command after its removal.
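The same check can be scripted by looking at sysfs rather than reading lspci's output. A minimal sketch, assuming that the kernel lists each PCI device as an entry under /sys/bus/pci/devices, named after its address. The function name and the optional second argument (an alternative sysfs root, for trying it out) are made up for illustration.

```bash
#!/bin/bash
# Succeeds if a PCI device with the given address (e.g. 0000:01:00.0)
# is currently registered with the kernel.
# $2 is an optional sysfs root, defaulting to /sys.
pci_device_present() {
  local addr=$1 sysfs=${2:-/sys}
  [ -e "$sysfs/bus/pci/devices/$addr" ]
}
```

For example:
# pci_device_present 0000:01:00.0 || echo "The device has been removed"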
As mentioned above, the FPGA should be reconfigured only after it has been deregistered. If the FPGA has been reconfigured before running the script, the kernel log says something like
xillybus_pcie 0000:01:00.0: Removed 5 device files.
xillybus_pcie 0000:01:00.0: Failed to quiesce the device on exit.
This has little practical significance, if any, as the driver cleans up after itself regardless (unlike the driver for Windows).
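To find out after the fact whether the order was wrong, the kernel log can be searched for this message. A minimal sketch (the function name is made up; it reads log text from standard input, so it works with the output of dmesg as well as with a saved log):

```bash
#!/bin/bash
# Succeeds if the log text on standard input contains the driver's
# "Failed to quiesce" message.
quiesce_failed() {
  grep -q 'xillybus_pcie .*Failed to quiesce the device on exit'
}
```

For example:
# dmesg | quiesce_failed && echo "The FPGA was reconfigured before deregistration"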
Also as mentioned above, after reconfiguring the FPGA, it's detected again with
# echo 1 > /sys/bus/pci/rescan
upon which the kernel log typically emits something like this:
pci 0000:01:00.0: [10ee:ebeb] type 00 class 0xff0000
pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0000007f 64bit]
pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 512)
pcieport 0000:00:01.0: ASPM: current common clock configuration is inconsistent, reconfiguring
pci 0000:01:00.0: BAR 0: assigned [mem 0xdf100000-0xdf10007f 64bit]
xillybus_pcie 0000:01:00.0: enabling device (0000 -> 0002)
xillybus_pcie 0000:01:00.0: can't disable ASPM; OS doesn't have ASPM control
xillybus_pcie 0000:01:00.0: Created 5 device files.
This is basically the PCI bus driver enumerating the device, a task that the BIOS has usually taken care of by the time the kernel starts. This is why the log contains detailed information on the device's bringup.
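After the rescan, a script may want to wait until the driver has created its device files before attempting I/O. A minimal sketch, assuming that the device files appear in /dev with names beginning with xillybus_ (check the names on your system). The function name, the timeout and the optional directory argument are made up for illustration.

```bash
#!/bin/bash
# Waits up to $1 seconds for at least one device file to show up.
# $2 is the directory to look in (default /dev).
wait_for_xillybus_files() {
  local timeout=$1 dir=${2:-/dev} f i
  for (( i = 0; i <= timeout; i++ )) ; do
    for f in "$dir"/xillybus_* ; do
      # An unmatched glob stays as the literal pattern, hence the -e test.
      if [ -e "$f" ] ; then
        return 0
      fi
    done
    # Don't sleep after the last attempt.
    if (( i < timeout )) ; then
      sleep 1
    fi
  done
  return 1
}
```

For example:
# wait_for_xillybus_files 5 || echo "No device files after 5 seconds"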
Why the script is needed
In principle, the script could have been replaced with this command or similar:
# echo 1 > "/sys/bus/pci/drivers/xillybus_pcie/0000:01:00.0/remove"
However this operation calls the device driver's "remove" method immediately, which causes it to detach itself from the device right away. For this to operate correctly, the device driver must support hot unplugging.
This is however not the case with Xillybus' driver for PCIe: Because it's not possible to rely on PCIe hotplugging support by hardware anyway, it was written with the traditional API, which ensures that the "remove" method is called only when the driver is unloaded. That, in turn, can happen only when the driver's reference count goes down to zero, or more specifically, when there are no device files open on its behalf.
Supporting hotplugging involves a significant complication of the device driver's code, making it bug-prone, as it requires proper handling of a variety of corner-case race conditions. Since hotplugging support merely covers a test lab scenario, and given the possibility to use the script above, the reliability of the driver took priority.
Because the driver relies on this reference count protection, it will cause a kernel segmentation fault (oops) if the device is removed with the sysfs command shown above while it has a device file open. This is because it will later attempt to access memory regions that have been freed during the removal, as it processes file operations (for example, closing the file).
The script attempts to remove the driver with an rmmod command before issuing the sysfs command for the device's removal. The purpose of the rmmod command is to fail if the driver's reference count isn't zero. If that happens, the script refrains from removing the device.
Hence the script doesn't remove the device unless it's safe to do so. Actually, the fact that the driver is removed first, ensures that nothing bad can happen as the device goes away.
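Whether the rmmod is likely to succeed can also be checked in advance by reading the module's reference count. A minimal sketch, assuming that the kernel exposes it as /sys/module/xillybus_pcie/refcnt (this file exists only on kernels built with support for unloading modules) and that open device files are counted against this module. The function name and the optional sysfs root argument are made up for illustration.

```bash
#!/bin/bash
# Prints the reference count of a kernel module. Fails if it can't be read.
# $1: module name, $2: optional sysfs root (default /sys).
module_refcount() {
  local module=$1 sysfs=${2:-/sys}
  local file="$sysfs/module/$module/refcnt"
  [ -r "$file" ] || return 1
  cat "$file"
}
```

For example:
# count=$(module_refcount xillybus_pcie) && [ "$count" -eq 0 ] && echo "No open device files"
This is informative only. The script's rmmod remains the actual safeguard, since a device file can be opened between such a check and the removal.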
In this context, it's worth mentioning that the same oops will occur when the driver is unbound with something like
echo "0000:01:00.0" > /sys/bus/pci/drivers/xillybus_pcie/unbind
while there's a device file open, for the same reason.
Summary
It's safe to reconfigure an FPGA while it's connected to a host through Xillybus or XillyUSB, assuming that the host has no inherent problem with the device going away suddenly from its bus.
With XillyUSB, the natural hotpluggable nature of the USB bus allows this without any concerns.
With Xillybus over PCIe, there's a possibility that the computer hardware will not handle the hotplug events properly, leading to a crash. For this reason, reconfiguration of the FPGA in an end-user product scenario should be done with Partial Reconfiguration instead.
In a test / engineering lab scenario, it's possible to safely hot-reconfigure the FPGA. The best practices for doing so with Windows and Linux have been outlined.