Published: 22 July 2021

Introduction

The IOMMU (Input–Output Memory Management Unit) is a feature that is commonly present in 64-bit x86 processors as well as other architectures. Linux’ support for the IOMMU has evolved through a relatively disorganized development process, with several obscurities along the way. This is quite remarkable given that it’s part of the kernel’s memory management — a central role in the kernel’s functionality. The available information regarding this feature is likewise confusing, partly as a result of changes that have taken place both in hardware capabilities and in the prevailing view of how virtualization should be done.

This page is a mix of topics that are related to this subject. It focuses on the x86 platforms (Intel and AMD), and is written in the context of the hardware and the Linux kernel as of 2021 (v5.13, give or take). Keep in mind that things are constantly changing.

A good general explanation of IOMMUs can be found in the introduction of this OLS ‘06 paper.

Motivation for IOMMU and SWIOTLB

IOMMUs were originally promoted along with 64-bit x86 processors in order to ensure proper operation of pre-PCI and PCI / PCIe devices that support DMA with 32-bit addresses only. Since a host might allocate DMA buffers at physical addresses that are beyond the 32-bit range (regardless of the buffer's virtual address), a solution was necessary to ensure that these PCI devices would still work.

The IOMMU solves this problem by creating a separate virtual memory map for each PCI device, based upon its bus ID. When a DMA request arrives from the device to the processor, the IOMMU looks up its address translation table, and resolves the physical address that is used to access physical memory. Just like processor MMUs maintain separate translation tables for each context, the IOMMU maintains a translation table for each device (with the possible exception of sharing translation tables between devices, just like MMUs share address spaces between threads).

This address translation allows the device to issue DMA requests on addresses within the 32-bit range even when the physical address is beyond this range. This was a big deal when IOMMUs were introduced, however the benefit of utilizing this feature has diminished since, and is limited to relatively special cases, in particular for virtualization with direct access to PCIe peripherals, as discussed below.

The main reason IOMMUs are rarely necessary for their original purpose (solving the 32-bit address issue) is that the vast majority of peripherals support 64-bit DMA access nowadays. Those that don’t, are very likely to get buffers in the 32-bit range, as the Linux kernel driver API allows allocating memory in the 32-bit region explicitly, and there are good chances that such memory will be available (see “Availability of 32-bit physical memory” below).

However if the buffer is allocated by a user-space program for direct (zero-copy) access by the device, it may fall outside the 32-bit range, in which case there is no way to pass it to a 32-bit device. Practically speaking, this scenario is relevant only for hardware acceleration with GPUs, and these support 64-bit DMA — those that don't are most likely too old to be worth using anyhow.

On top of all this, there’s SWIOTLB, which is a software solution for the 32/64-bit DMA problem for systems not supporting IOMMU (or when it’s disabled), which is explained further below.

Enable IOMMU or not?

The short answer is that unless you know exactly what you need the IOMMU for, you should probably turn it off.

The IOMMU has an impact on bandwidth performance, which depends on several factors. The main culprit is commonly considered to be the mapping and unmapping of DMA buffers (these terms are explained below). This performance hit can be avoided to a large extent by reusing mapped buffers rather than unmapping them after each use, however traditionally it has been encouraged to unmap buffers as soon as possible, as memory used to be expensive and the mapping / unmapping cheap.

Different IOMMUs store the address translation tables in different ways, but it boils down to maintaining data structures in RAM (typically page tables) that are looked up as needed to handle DMA requests from devices. Just like MMUs, IOMMUs also cache parts of these tables (typically with IOTLBs) in order to avoid several reads from physical RAM for each DMA request that arrives.

Hence IOTLB cache misses also contribute to performance degradation. This becomes more apparent when the driver doesn’t map and unmap the buffers often, and with access patterns that are unfavorable to caching. In particular, if the DMA buffers’ total size is beyond the IOMMU’s capability to cache, even a plain ring buffer pattern may render the IOTLB cache useless.

For applications that require strict low latency (in terms of tens of microseconds) for DMA accesses, the IOMMU’s impact may be beyond acceptable. From an average bandwidth point of view, it’s perfectly fine for the IOMMU to halt DMA transactions while it reloads its translation tables for whatever reason (or just a cache miss), however this prerogative is disastrous when a maximal latency needs to be assured. Such halt may, for example, cause a buffer overflow in data acquisition devices running at several GByte/s.

For most desktop applications, there is usually no perceived difference between having the IOMMU active or not. Making a conscious decision about it is required only for higher-end usages.

How to turn off IOMMU

First, what not to do: Don’t set iommu=off. This does indeed turn off the IOMMU, but it also partly turns off SWIOTLB: It prevents it from initializing, but not from attempts to use this feature. SWIOTLB is discussed below, but to make a long story short, it’s a feature that is unlikely to do any good. But if it’s not initialized, and some driver later attempts to allocate a DMA buffer uncleverly, a kernel panic occurs instead of what would normally end as a plain I/O error, or maybe a kernel oops at most. That is, a complete computer freeze. This is a bug in the kernel that seems to have slipped through because of the rather obscure scenario required to trigger it.

Therefore, rather use intel_iommu=off and/or amd_iommu=off to disable the IOMMU, depending on the processor architecture (or use both, it doesn’t hurt).

Those compiling their own kernel may want to turn off IOMMU in the kernel’s configuration (and there’s a chance that the kernel was compiled with IOMMU features turned off to begin with). Distribution kernels usually come with IOMMU enabled, that is CONFIG_IOMMU_SUPPORT=y for general IOMMU support, and primarily CONFIG_AMD_IOMMU=y and CONFIG_INTEL_IOMMU=y for the respective processor families’ IOMMU support. There are several other kernel configuration flags related to each of the IOMMU options. For old AMD processors, there’s CONFIG_GART_IOMMU, supporting the first generation of address translation mechanisms, GART.

CONFIG_SWIOTLB is forced as “y” if CONFIG_X86_64 is selected. In other words, it’s enabled for all 64-bit x86 architectures, and there’s nothing to do about it.

To turn off IOMMU, turn CONFIG_IOMMU_SUPPORT to “n”, which turns off support for Intel and AMD IOMMUs by virtue of dependencies. Alternatively, turn off IOMMU for a specific architecture. The problem mentioned above with setting iommu=off won’t occur with CONFIG_IOMMU_SUPPORT set to “n”: That problem is specifically related to the kernel command parameter, and not to CONFIG_IOMMU_SUPPORT.

For those interested in the gory details on the iommu=off issue: Adding this kernel parameter makes iommu_setup() (defined in arch/x86/kernel/pci-dma.c) set the somewhat global variable no_iommu to 1. Later on, pci_swiotlb_detect_4gb() (defined in arch/x86/kernel/pci-swiotlb.c) sets the swiotlb variable to 1 if and only if the machine has an address span beyond 4 GB and no_iommu is 0, or alternatively, if Secure Memory Encryption (SME) is active. In the same file, pci_swiotlb_init() calls swiotlb_init() only if swiotlb is non-zero. So no_iommu set to non-zero is very likely to prevent the call to swiotlb_init().
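
To illustrate, the relevant logic amounts to something like this (a paraphrased sketch of pci_swiotlb_detect_4gb(), not a verbatim copy of the kernel’s code):

int __init pci_swiotlb_detect_4gb(void)
{
	/* Don't initialize SWIOTLB if iommu=off has set no_iommu */
	if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
		swiotlb = 1;

	/* Bounce buffers are also needed when memory encryption is active */
	if (sme_active())
		swiotlb = 1;

	return swiotlb;
}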

But I’m running virtual machines!

Given the IOMMUs’ address translation features, they became a crucial component for implementing virtualization that allows the guest operating system to access a DMA-capable device directly. For example, running Windows 10 as a guest system on a Linux machine, so that the Windows OS accesses one of the NICs directly with its original driver, requires an IOMMU. This is because the driver, which is unaware of the fact that it runs inside a virtual machine, programs the NIC directly to issue DMA requests on the bus, but the addresses that the driver considers to be physical are fake. Hence the IOMMU is required to translate those fake physical addresses into those assigned for the guest machine.

The use of IOMMUs for virtualization led to an expansion of their capabilities, in particular in the field of isolation: Without an IOMMU, any PCI / PCIe device may read from or write to any physical address of the host. A faulty device may therefore turn the computer unstable by randomly writing data to random positions in the RAM. Since the IOMMU keeps track of each device’s allocated DMA ranges by virtue of translation tables, it may also block any attempt to access an address that isn’t mapped for it. Just like a regular MMU, such access creates a page fault. Using IOMMUs can therefore improve the system’s stability. That said, it’s quite unlikely that commercial hardware would cause this kind of damage regardless. The IOMMU’s isolation capability is hence primarily useful for ensuring that data doesn’t leak between a guest OS and other guests or the host OS.

The takeaway from this short discussion on IOMMUs and virtualization is that even though IOMMUs are commonly referred to as virtualization technology, they are in fact not required in most common uses of virtual machines, in particular on plain desktop computers for running multiple operating systems on one machine. For example, the common solution for enabling network access to the guest is to install a driver that is aware of the virtualization host (i.e. paravirtualization). That driver sets up a fake network card on the guest machine, which gives the guest machine access to the network based upon the host’s network view. This setting is often desired, as it doesn’t require extra hardware, and the setup of the guest’s network is simple: It sees what the host sees.

Using IOMMUs for virtualization is hence required only when the guest needs a direct access to a PCI / PCIe device. For example, when a non-standard PCI / PCIe device needs to be accessed by the guest. Another case could be a GPU that is used for acceleration by the guest, and hence the direct access is necessary for performance reasons.

Is IOMMU working on my machine?

If you have IOMMU on, there's a chance you really want it to do something. But is it really doing it? Is it ensuring that PCI / PCIe devices make DMA reads and writes only where they're allowed to?

For example, if these lines appear in the kernel log,

iommu: Default domain type: Translated 
iommu: DMA domain TLB invalidation policy: strict mode

and even

DMAR: IOMMU enabled

it may appear like the IOMMU is alive and kicking, but by themselves, these logs don't indicate that there's any DMA protection. For example, if VT-d is disabled in the BIOS settings, these log lines may appear, but with no IOMMU in effect.

On the other hand, if log lines like these appear (for Intel's IOMMU),

DMAR: Host address width 39
DMAR: DRHD base: 0x000000fed90000 flags: 0x0
DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 7e3ff0505e
DMAR: DRHD base: 0x000000fed91000 flags: 0x1
DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
DMAR: RMRR base: 0x0000008792f000 end: 0x0000008794efff
DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1
DMAR-IR: Enabled IRQ remapping in xapic mode

it's much more encouraging. If these are present, odds are that the IOMMU is actually doing its work. Even more convincing are log messages like:

pci 0000:01:00.0: Adding to iommu group 1

But the only way to be sure is to make the IOMMU catch a faulty DMA access. The trick is to edit a device driver of a PCI / PCIe device that relies on DMA, so it hides the DMA buffers from the IOMMU. By doing so, the device works as usual, but since the IOMMU is not informed about some or all DMA buffers, it prevents access to them. This causes the device to malfunction, of course, so choose a device that the computer works fine without.

To do this, look for DMA mapping calls in the driver's source code, such as

addr = dma_map_single(dev, ptr, size, direction);

if (dma_mapping_error(dev, addr)) {
  [ ... ]
  return -ENODEV;
}

and replace them with something like

addr = virt_to_phys(ptr);

As explained below in "Linux’ DMA API highlights", dma_map_single() tells the IOMMU that the memory segment is a DMA buffer belonging to a specific device. By replacing this call with virt_to_phys(), the physical address is obtained (which is what dma_map_single() returns anyhow when no IOMMU is in the way), but without informing the IOMMU.

It's recommended to add a printk() or so next to this edit, to be sure that the manipulated code was executed (as opposed to editing unreached code, loading the wrong driver etc.).
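
For example, the edited spot may end up looking something like this ("mydriver" standing in for whatever hypothetical driver is being manipulated):

addr = virt_to_phys(ptr); /* was: dma_map_single(dev, ptr, size, direction) */
printk(KERN_INFO "mydriver: bypassed dma_map_single(), bus addr = %pad\n",
       &addr);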

Then, when attempting to use the driver, at least one message like this should appear in the kernel log (once again, with Intel's IOMMU):

DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Write NO_PASID] Request device [01:00.0] fault addr 0x87fd8000 [fault reason 0x05] PTE Write access is not set

Note that this message tells us both the bus address of the device and the memory address of the failed access.

If, on the other hand, everything works as usual despite this edit, odds are that the IOMMU is not active. Or the related DMA buffer isn't used at all (for whatever hardware-related reason).

Availability of 32-bit physical memory

As already said above, the need (or not) for an IOMMU or the SWIOTLB feature depends to a large extent on the availability of memory in the 32-bit region, called the DMA32 zone, vs. the amount of memory that drivers require from this region. How the Linux kernel manages its memory therefore plays an important role. Quite obviously, this issue is subject to continuous tuning of the kernel’s memory manager.

An anecdotal experiment was run on a Linux v5.13 machine on x86_64 with 6 GiB physical RAM and the IOMMU deactivated (and no disk swap). After booting the system without a graphical desktop, /proc/buddyinfo was examined for the amount of memory in each zone: DMA had ~15.9 MiB of free memory, DMA32 had slightly above 2 GiB, and the “Normal” zone (all the rest) had about 3.5 GiB. The amount of non-free memory was hence about 500 MiB.

It’s interesting by itself that only 2 GiB were allocated to the DMA32 zone. This is beyond the kernel’s control however, as explained below (“Physical RAM allocation below and above 4 GB on x86_64 machines”). And indeed, attempting to load a driver that requires more than 2 GiB of memory with the __GFP_DMA32 flag resulted in an allocation fault after allocating 2 GiB. The highest physical address was 0x82000000, so the 2 GiB limit was definitely not an exact 31-bit boundary.

In a subsequent test, the driver attempted to allocate 5 GiB of memory, which it was successful with. The driver requested physically contiguous RAM segments of 4 MiB or 1 MiB with __get_free_pages(), without any restriction on the zone (i.e. without a __GFP_DMA32 flag or the like). On each call to this function, the RAM segment was allocated from the “Normal” zone as long as it was possible, i.e. as long as a contiguous memory segment was available in the zone’s pool. When none were left, segments were allocated from the DMA32 zone. So the allocation algorithm clearly preferred the “Normal” zone over the DMA32 zone.
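
For reference, the allocation pattern in this experiment boils down to calls like these (a simplified sketch, not the test driver’s actual code):

#include <linux/gfp.h>
#include <linux/mm.h>

static void zone_allocation_demo(void)
{
	/* No zone restriction: the "Normal" zone is preferred, DMA32 is the fallback */
	unsigned long addr = __get_free_pages(GFP_KERNEL, get_order(4 << 20));

	/* Explicitly restricted to the DMA32 zone */
	unsigned long addr32 = __get_free_pages(GFP_KERNEL | __GFP_DMA32,
						get_order(4 << 20));

	pr_info("got %lx (any zone), %lx (DMA32 zone)\n", addr, addr32);

	free_pages(addr, get_order(4 << 20));
	free_pages(addr32, get_order(4 << 20));
}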

This means that on a 64-bit system with plenty of RAM, odds are that there will be DMA32 RAM for drivers that need it, since the memory allocation algorithm saves memory in this zone for drivers that explicitly require it. Or, alternatively, when DMA32 RAM runs out, odds are that kernel memory is low in general.

Physical RAM allocation below and above 4 GB on x86_64 machines

This section discusses how the RAM that is plugged into a computer is divided between the “Normal” and “DMA32” regions.

All descendants of the x86 family, including 64-bit machines, are expected to be backward compatible with the very first 8086 computers. This means, among other peculiarities, that the first 1 MB still follows the legacy layout, containing the initial 640 kB of DOS memory, and that several segments in the memory map are reserved for predefined uses. Throughout its history, the physical memory model has been patched time and time again to allow for new capabilities.

A Linux machine’s view of the physical memory map can be obtained by reading /proc/iomem as root (a regular user may get all-zero addresses), e.g.

# less /proc/iomem

Address ranges labeled “System RAM” are used for the memory pool. For addresses below 4GB, there are typically several such ranges, split by other segments that are assigned to some I/O functionality or are just reserved.

The processor’s memory controller is designed to give these special memory regions priority over RAM: When a read / write command is issued, the physical address is first checked against those memory regions. If none matches, and there is physical RAM mapped to the address, the RAM is accessed (with caching as applicable). This means that some amount of physical RAM is never used, however the proportion of this wasted RAM is small.

On machines with a BIOS (i.e. almost all x86-based computers), the physical RAM allocations made by the BIOS are listed at the very beginning of the kernel log after each boot (try the “dmesg” command), with lines starting with “BIOS-e820”.

This patchy arrangement with reserved regions was fine until computers with more than 3 GB of RAM became available. The traditional memory map reserves the region between 0xc0000000 and 0xffffffff, i.e. the upper 1 GB, for PCI and other peripheral access. Any RAM mapped on this region is hence lost. This is why 32-bit operating systems running on a PC could never access more than 3 GB of physical RAM.

This was resolved as 64-bit processors emerged, having a physical address space larger than 32 bits. This allows dividing the available physical RAM, so that some is accessed in the 32-bit range, and some above it. This results in two memory regions for RAM in the physical address space map:

  • Address 0 to Top of Low Memory < 2^32.
  • Address 2^32 (0x100000000) to Top of High Memory.

The total size of these two regions is the same as the amount of RAM hardware that is installed on the machine, unless some technical obstacle prevents using it fully (which is unexpected if the motherboard’s installation instructions have been followed).

Any address range in physical space that falls outside these regions is considered memory mapped I/O (MMIO).

How much memory is given to each of these two regions is determined by the BIOS (or some other initial boot software), as explained next. Regardless, putting all RAM at a range above 4 GB, and hence avoiding all collisions with legacy memory regions, was surely a tempting idea, however it was out of the question, as existing bootloaders and operating systems run in real address mode (at least initially) and expect 32-bit addresses.

The BIOS’ role in getting the RAM working has been crucial since way before 64-bit processors were introduced: It queries each DIMM’s EEPROM chip for information on its size, supported clock frequencies and other timing information. Based upon this information, it sets up some hardware registers (belonging to the processor or its companion chip) to configure how these memory modules are accessed. This involves the memory bus frequency, the low-level timing parameters, and may also include how memory chunks are interleaved among the DIMMs.

Given that the BIOS has the information on the installed memory modules, it’s also responsible for deciding how much RAM is allocated in the 32-bit range.

There are two hardware registers for this purpose:

  • Top of Low Memory Address, TOLM (Intel) / TOP_MEM (AMD): The end of the 32-bit-mappable address range that is used to access physical RAM. In other words, if a memory access satisfies 0 <= Address < TOLM (with TOLM <= 2^32), it’s a physical RAM access, unless it’s overridden by a reserved region.
  • Top of High Memory Address, TOHM (Intel) / TOM2 (AMD): The end of the physical address range, starting at 0x100000000 (4 GB), that is used to access physical RAM. In other words, if a memory access satisfies 2^32 <= Address <= TOHM, it’s a physical RAM access.

The exact use of these registers, as well as the granularity of RAM address ranges that they offer, vary from one processor to another, however the method is the same.
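
As a rough worked example, consider the 6 GiB machine from the experiment above: if the BIOS set TOLM at about 2 GiB, then roughly 2 GiB of RAM was mapped below 4 GB (hence the ~2 GiB DMA32 zone), and the remaining ~4 GiB was mapped from 0x100000000 up to about 0x200000000, which would then be TOHM.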

Accurate definitions of these registers for specific processors can be found in e.g. Intel Xeon Scalable Processors Family Datasheet, Vol. 2 as well as AMD’s BIOS and Kernel Developer’s Guide (BKDG).

Linux’ DMA API highlights

For the purpose of discussing SWIOTLB, and also for a better general understanding, this is an incomplete highlight summary of the Linux DMA API. It’s surely not a substitute for reading the kernel’s own documentation on this matter, which is found in the kernel tree’s Documentation/core-api/ directory, dma-api.rst and dma-api-howto.rst in particular.

Coherent (“consistent”) DMA is not discussed here, as it’s a low-performance kind of DMA, limited to a dedicated, typically small, memory region on embedded processors — hence IOMMU is rarely relevant along with this DMA type. Not to mention that all DMA on x86-based processors is coherent anyhow, so there’s no need for a specific region on these.

The basic Linux API of “streaming DMA” requires the device driver to allocate a chunk of memory for the DMA buffer with one of the common kernel memory allocation functions, i.e. kmalloc(), __get_free_pages() or some other equivalent function. The driver may request that the memory is taken from a specific region of physical memory (a “zone”), in particular by supplying the __GFP_DMA or __GFP_DMA32 flags in the call to the memory allocation function, which limit the physical memory addresses to a lower range: __GFP_DMA to 24 bits (16 MiB) and __GFP_DMA32 to 32 bits (4 GiB). Practically, the former is required for ancient hardware only, while the latter may still be necessary with some devices (and possibly with poorly designed / supported PCIe infrastructure), in particular in the absence of an IOMMU.

One way or another, the memory allocation functions return an address in virtual address space. The device needs to be informed about the buffer’s address for use in DMA operations. Before IOMMUs, this was the physical address, i.e. the address that appears on the electrical wires of the DDR memories. With the possibility of an IOMMU fiddling with the addresses, the device may need to use another address to request reads and writes on the DMA buffer.

Linux’ API simplifies this issue by virtue of the dma_map_single() function: It takes the virtual address as an argument (among others) and returns the address to provide to the device. Whether this is the physical address or some other address that makes the IOMMU happy (if such exists) is none of the driver’s business.

dma_map_single()’s other arguments are the size of the buffer, a pointer to a struct that identifies the device, as well as the direction of the data transfer to take place. In other words, the DMA framework knows all there is to know about the intended DMA access.

An important aspect of calling dma_map_single() is that the buffer is “owned” by the device after this call: The host is not allowed to access the related memory region until the subsequent call to dma_unmap_single(), which takes place when the driver knows that the device has finished its access to that buffer. As its name implies, the latter function reverses the former, telling the kernel’s API that the memory region is not used for DMA anymore.

The mapping and unmapping of DMA buffers, and the idea that the buffer is owned exclusively either by the host or the device, may or may not have practical implications: The cache may be invalidated or flushed to make the buffer visible correctly to either side as necessary, and the IOMMU / SWIOTLB frameworks, if utilized, may be updated to perform memory address translations correctly. On an x86 / x86_64 machine with IOMMU and SWIOTLB unused, dma_map_single() just translates the virtual address to physical and dma_unmap_single() does nothing — no cache synchronization is required on these architectures.

The DMA API also provides two functions, dma_sync_single_for_cpu() and dma_sync_single_for_device() for changing the ownership to the host or device (respectively) without unmapping the buffer. These are used for repeated use of the same buffer, and merely ensure that caching and IOMMU / SWIOTLB play along well.
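
To tie this together, here’s a minimal sketch of that reuse pattern, assuming a hypothetical driver that receives data from its device into a buffer buf of BUF_SIZE bytes (both defined elsewhere in the driver):

#include <linux/dma-mapping.h>

static int rx_rounds(struct device *dev, void *buf, int rounds)
{
	dma_addr_t bus_addr;
	int i;

	/* Map once; the device owns the buffer from this point on */
	bus_addr = dma_map_single(dev, buf, BUF_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, bus_addr))
		return -ENOMEM;

	for (i = 0; i < rounds; i++) {
		/* Hand bus_addr to the device, start the transfer and wait
		   for its completion (device-specific, omitted), then: */

		/* Take ownership back, so the CPU may read the data in buf */
		dma_sync_single_for_cpu(dev, bus_addr, BUF_SIZE,
					DMA_FROM_DEVICE);

		/* ... process the data ... */

		/* Return ownership to the device for the next round */
		dma_sync_single_for_device(dev, bus_addr, BUF_SIZE,
					   DMA_FROM_DEVICE);
	}

	dma_unmap_single(dev, bus_addr, BUF_SIZE, DMA_FROM_DEVICE);
	return 0;
}

Note that on x86 these sync calls don’t do much in terms of cache maintenance, but SWIOTLB (if it kicks in) relies on them for copying data to and from the bounce buffer, as explained in the next section.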

There are several other functions in the DMA API, in particular for handling scatter-gather DMA, but the point of this summary was to highlight the fact that DMA buffers are explicitly declared as such, and that the API keeps track of which side — the host or the device — owns them at any time. And also, that the actual physical RAM address is of no interest: The device driver just gets a bus address to convey to its device for accessing the buffer, and the DMA API is responsible for making it work. The bus address may or may not be the physical RAM address.

Hence a properly written Linux driver doesn’t need to be IOMMU-aware: The IOMMU’s address map is updated transparently on calls to dma_map_single() and dma_unmap_single() (and similar functions), so whenever the driver declares a memory region as a DMA buffer for a device, the IOMMU is informed accordingly. The address returned by dma_map_single() is the one that is mapped by the IOMMU for the related physical address, so it remains correct to supply the device with this address for use in DMA accesses. It’s also worth reiterating that dma_map_single() takes a pointer to a device struct as its first argument, which is how the IOMMU knows which bus ID to relate the DMA mapping to.

As a side note, there's a parallel set of functions, prefixed with pci_* instead of dma_*, which were originally intended for use with PCI / PCIe devices. These have however long since been deprecated.

SWIOTLB

The SWIOTLB feature (which would have been better called SWIOMMU) was introduced to solve the 64/32-bit DMA problem for platforms without an IOMMU. The trick is to maintain a fixed chunk of physical RAM in the 32-bit physical address range (called “the aperture”), out of reach for any other use. This sacrifices 4 MiB on earlier kernels, and typically 64 MiB today.

Whenever a dma_map_single() (or alike) function call would fail because of 32/64-bit memory range issues, a buffer of the requested size is allocated from this chunk, and the function returns the physical address of this newly allocated buffer instead (the “bounce buffer”).

Recall that one of the arguments to dma_map_single() is the direction of the DMA access. Hence if the direction is towards the device, SWIOTLB makes a plain memcpy() from the original buffer to the newly assigned one before returning from the mapping function. Likewise, if the direction is from the device, the data in the newly assigned buffer is copied to the original buffer when the buffer is unmapped or its ownership is returned to the host by virtue of dma_sync_single_for_cpu() or alike.
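
Conceptually (and this is only a sketch, by no means the kernel’s actual code), the mapping side of this behaves along these lines, with alloc_from_aperture() being a made-up stand-in for SWIOTLB’s internal allocator:

static dma_addr_t bounce_map(struct device *dev, void *orig, size_t size,
			     enum dma_data_direction dir)
{
	/* Grab a bounce buffer from the reserved 32-bit chunk */
	void *bounce = alloc_from_aperture(size);

	/* If data goes towards the device, copy it into the bounce buffer now */
	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
		memcpy(bounce, orig, size);

	/* The device is handed the bounce buffer's address; data arriving from
	   the device is copied back to orig on unmap or sync-for-cpu */
	return phys_to_dma(dev, virt_to_phys(bounce));
}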

So in principle, the SWIOTLB is a bit of a loan shark: It takes control of some precious RAM in a valuable zone, and lends chunks of it to callers of memory mapping functions that can’t be fulfilled because of memory zone issues. The price for this loan is that the data is copied by the CPU once for each DMA transaction.

As the vast majority of PCI / PCIe devices support 64-bit DMA, SWIOTLB rarely comes into action, and when it does, it’s usually because of a bug or poor system configuration. Even worse, it may cause kernel panics in some unexpected situations (that is, a complete system freeze).

The initialization of the SWIOTLB feature is announced with a kernel message like this:

PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
software IO TLB [mem 0x4c000000-0x50000000] (64MB) mapped at [00000000e014a0cb-0000000060d63d97]

There is no way to disable SWIOTLB in the kernel configuration. Since kernel v4.10, the swiotlb=noforce kernel parameter turns off SWIOTLB and any attempt to use it along with DMA mappings, and the memory allocation is reduced to a minimum (and to no allocation at all from kernel v5.13). However, according to the kernel’s documentation, this option is intended for debugging only.

It may be tempting to rescue these 64 MiB with iommu=off, which turns off SWIOTLB as well, however this disables only its initialization, and not attempts to use the SWIOTLB framework. Hence if a DMA mapping request can’t be fulfilled because the device accepts 32-bit addresses only, a kernel panic is issued by swiotlb_tbl_map_single() (as of v5.13):

Kernel panic - not syncing: Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer

This occurs because the SWIOTLB framework wasn’t initialized because of iommu=off, and hence had no chance to allocate the aperture RAM. It’s quite questionable why the response is a kernel panic rather than failing the DMA mapping (which any properly written driver should handle gracefully).

For those wanting to save those 64 MiB anyhow, the swiotlb kernel parameter should be set to reduce memory consumption. Oddly enough, if SWIOTLB runs out of aperture RAM, it may just fail the DMA mapping with a “swiotlb buffer is full” message in the kernel log, in some scenarios and with some kernel versions, rather than a kernel panic.

Kernel code: Use SWIOTLB?

This is a short walkthrough of the kernel code that evaluates whether SWIOTLB should be used for a DMA mapping. The DMA related functions have been moved around a bit over time, so what is said below reflects kernel v5.12.

The function that performs DMA mapping is this one, in kernel/dma/direct.h:

static inline dma_addr_t dma_direct_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir, unsigned long attrs)
{
	phys_addr_t phys = page_to_phys(page) + offset;
	dma_addr_t dma_addr = phys_to_dma(dev, phys);

	if (unlikely(swiotlb_force == SWIOTLB_FORCE))
		return swiotlb_map(dev, phys, size, dir, attrs);

	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
		if (swiotlb_force != SWIOTLB_NO_FORCE)
			return swiotlb_map(dev, phys, size, dir, attrs);

		dev_WARN_ONCE(dev, 1,
			     "DMA addr %pad+%zu overflow (mask %llx, bus limit %llx).\n",
			     &dma_addr, size, *dev->dma_mask, dev->bus_dma_limit);
		return DMA_MAPPING_ERROR;
	}

	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
		arch_sync_dma_for_device(phys, size, dir);
	return dma_addr;
}

The call to swiotlb_map() inside the if (unlikely(!dma_capable(...))) clause is where the mapping function decides to detour to swiotlb_map() (defined in kernel/dma/swiotlb.c) because the physical address isn’t allowed by the device. Note that this call is made even if the IOMMU has been turned off (which leads to the kernel panic mentioned above), but not if the swiotlb=noforce parameter assignment has been made (setting swiotlb_force to SWIOTLB_NO_FORCE).

The physical address’ eligibility is checked by dma_capable(), defined in include/linux/dma-direct.h:

static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size,
		bool is_ram)
{
	dma_addr_t end = addr + size - 1;

	if (addr == DMA_MAPPING_ERROR)
		return false;
	if (is_ram && !IS_ENABLED(CONFIG_ARCH_DMA_ADDR_T_64BIT) &&
	    min(addr, end) < phys_to_dma(dev, PFN_PHYS(min_low_pfn)))
		return false;

	return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
}

@is_ram is set to true in the call from dma_direct_map_page(), but CONFIG_ARCH_DMA_ADDR_T_64BIT is enabled on a 64-bit target architecture, so the !IS_ENABLED() term makes the second if clause always false. Hence the return value is determined by the expression

end <= min_not_zero(*dev->dma_mask, dev->bus_dma_limit)

In an anecdotal test, bus_dma_limit was zero, and hence min_not_zero() always returned *dev->dma_mask. The latter is set to 2^32-1 when the DMA mask is set to 32 bits by the driver, and 2^64-1 when it’s set to 64 bits. So this is how the DMA mask influences whether swiotlb_map() is used or not.
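
For completeness, this is roughly how a driver sets that DMA mask in the first place — a minimal sketch, assuming a PCI device pdev in the driver’s probe function:

/* Try 64-bit DMA first, and fall back to 32 bits; in the latter case
   SWIOTLB (or an IOMMU) may have to step in for buffers above 4 GB */
if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) &&
    dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
	return -EIO; /* No usable DMA configuration */

dma_set_mask_and_coherent() returns zero when the mask is accepted, and it sets both the streaming and the coherent mask in one go.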