btrfs(5)

DESCRIPTION

This document describes topics related to BTRFS that are not specific to the tools. Currently covers:

  1. mount options

  2. filesystem features

  3. checksum algorithms

  4. compression

  5. sysfs interface

  6. filesystem exclusive operations

  7. filesystem limits

  8. bootloader support

  9. file attributes

  10. zoned mode

  11. control device

  12. filesystems with multiple block group profiles

  13. seeding device

  14. RAID56 status and recommended practices

  15. storage model, hardware considerations

MOUNT OPTIONS

BTRFS SPECIFIC MOUNT OPTIONS

This section describes mount options specific to BTRFS. For the generic mount options please refer to mount(8) manual page. The options are sorted alphabetically (discarding the no prefix).

Note

Most mount options apply to the whole filesystem and only options in the first mounted subvolume will take effect. This is due to lack of implementation and may change in the future. This means that (for example) you can’t set per-subvolume nodatacow, nodatasum, or compress using mount options. This should eventually be fixed, but it has proved to be difficult to implement correctly within the Linux VFS framework.

Mount options are processed in order, only the last occurrence of an option takes effect and may disable other options due to constraints (see e.g. nodatacow and compress). The output of mount command shows which options have been applied.

acl, noacl

(default: on)

Enable/disable support for POSIX Access Control Lists (ACLs). See the acl(5) manual page for more information about ACLs.

The support for ACL is build-time configurable (BTRFS_FS_POSIX_ACL) and mount fails if acl is requested but the feature is not compiled in.

autodefrag, noautodefrag

(since: 3.0, default: off)

Enable automatic file defragmentation. When enabled, small random writes into files (in a range of tens of kilobytes, currently it’s 64KiB) are detected and queued up for the defragmentation process. May not be well suited for large database workloads.

The read latency may increase due to reading the adjacent blocks that make up the range for defragmentation, successive write will merge the blocks in the new location.

Warning

Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or ≥ 3.13.4 will break up the reflinks of COW data (for example files copied with cp --reflink, snapshots or de-duplicated data). This may cause considerable increase of space usage depending on the broken up reflinks.

barrier, nobarrier

(default: on)

Ensure that all IO write operations make it through the device cache and are stored permanently when the filesystem is at its consistency checkpoint. This typically means that a flush command is sent to the device that will synchronize all pending data and ordinary metadata blocks, then writes the superblock and issues another flush.

The write flushes incur a slight hit and also prevent the IO block scheduler from reordering requests in a more effective way. Disabling barriers gets rid of that penalty but will most certainly lead to a corrupted filesystem in case of a crash or power loss. The ordinary metadata blocks could be yet unwritten at the time the new superblock is stored permanently, expecting that the block pointers to metadata were stored permanently before.

On a device with a volatile battery-backed write-back cache, the nobarrier option will not lead to filesystem corruption as the pending blocks are supposed to make it to the permanent storage.

check_int, check_int_data, check_int_print_mask=<value>

(since: 3.0, default: off)

These debugging options control the behavior of the integrity checking module (the BTRFS_FS_CHECK_INTEGRITY config option required). The main goal is to verify that all blocks from a given transaction period are properly linked.

check_int enables the integrity checker module, which examines all block write requests to ensure on-disk consistency, at a large memory and CPU cost.

check_int_data includes extent data in the integrity checks, and implies the check_int option.

check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values as defined in fs/btrfs/check-integrity.c, to control the integrity checker module behavior.

See comments at the top of fs/btrfs/check-integrity.c for more information.

clear_cache

Force clearing and rebuilding of the free space cache if something has gone wrong.

For free space cache v1, this only clears (and, unless nospace_cache is used, rebuilds) the free space cache for block groups that are modified while the filesystem is mounted with that option. To actually clear an entire free space cache v1, see btrfs check --clear-space-cache v1.

For free space cache v2, this clears the entire free space cache. To do so without mounting the filesystem, see btrfs check --clear-space-cache v2.

See also: space_cache.

commit=<seconds>

(since: 3.12, default: 30)

Set the interval of periodic transaction commit when data are synchronized to permanent storage. Higher interval values lead to larger amount of unwritten data, which has obvious consequences when the system crashes. The upper bound is not forced, but a warning is printed if it’s more than 300 seconds (5 minutes). Use with care.
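
For example, to raise the commit interval to two minutes (an illustrative sketch; /dev/sdx and /mnt are placeholders):

# mount -o commit=120 /dev/sdx /mnt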

compress, compress=<type[:level]>, compress-force, compress-force=<type[:level]>

(default: off, level support since: 5.1)

Control BTRFS file data compression. Type may be specified as zlib, lzo, zstd or no (for no compression, used for remounting). If no type is specified, zlib is used. If compress-force is specified, then compression will always be attempted, but the data may end up uncompressed if the compression would make them larger.

Both zlib and zstd (since version 5.1) expose the compression level as a tunable knob with higher levels trading speed and memory (zstd) for higher compression ratios. This can be set by appending a colon and the desired level. ZLIB accepts the range [1, 9] and ZSTD accepts [1, 15]. If no level is set, both currently use a default level of 3. The value 0 is an alias for the default level.
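
For example, to mount with zstd at a higher compression level (an illustrative sketch; /dev/sdx and /mnt are placeholders):

# mount -o compress=zstd:9 /dev/sdx /mnt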

Otherwise some simple heuristics are applied to detect an incompressible file. If the first blocks written to a file are not compressible, the whole file is permanently marked to skip compression. As this is too simple, the compress-force is a workaround that will compress most of the files at the cost of some wasted CPU cycles on failed attempts. Since kernel 4.15, a set of heuristic algorithms have been improved by using frequency sampling, repeated pattern detection and Shannon entropy calculation to avoid that.

Note

If compression is enabled, nodatacow and nodatasum are disabled.

datacow, nodatacow

(default: on)

Enable data copy-on-write for newly created files. Nodatacow implies nodatasum, and disables compression. All files created under nodatacow are also set the NOCOW file attribute (see chattr(1)).

Note

If nodatacow or nodatasum are enabled, compression is disabled.

Updates in-place improve performance for workloads that do frequent overwrites, at the cost of potential partial writes, in case the write is interrupted (system crash, device failure).

datasum, nodatasum

(default: on)

Enable data checksumming for newly created files. Datasum implies datacow, i.e. the normal mode of operation. All files created under nodatasum inherit the “no checksums” property, however there’s no corresponding file attribute (see chattr(1)).

Note

If nodatacow or nodatasum are enabled, compression is disabled.

There is a slight performance gain when checksums are turned off, as the corresponding metadata blocks holding the checksums do not need to be updated. The cost of checksumming the blocks in memory is much lower than the IO, and modern CPUs feature hardware support for the checksumming algorithm.

degraded

(default: off)

Allow mounts with fewer devices than the RAID profile constraints require. A read-write mount (or remount) may fail when there are too many devices missing, for example if a stripe member is completely missing from RAID0.

Since 4.14, the constraint checks have been improved and are verified on the chunk level, not at the device level. This allows degraded mounts of filesystems with mixed RAID profiles for data and metadata, even if the device number constraints would not be satisfied for some of the profiles.

Example: metadata -- raid1, data -- single, devices -- /dev/sda, /dev/sdb

Suppose the data are completely stored on sda, then missing sdb will not prevent the mount, even if 1 missing device would normally prevent (any) single profile to mount. In case some of the data chunks are stored on sdb, then the constraint of single/data is not satisfied and the filesystem cannot be mounted.
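
In that scenario, a degraded read-write mount could look like this (an illustrative sketch; the device path and /mnt follow the example above):

# mount -o degraded /dev/sda /mnt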

device=<devicepath>

Specify a path to a device that will be scanned for BTRFS filesystem during mount. This is usually done automatically by a device manager (like udev) or using the btrfs device scan command (e.g. run from the initial ramdisk). In cases where this is not possible the device mount option can help.
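
For example, a two-device filesystem could be mounted by naming the other member explicitly (an illustrative sketch; the device paths and /mnt are placeholders):

# mount -o device=/dev/sdb /dev/sda /mnt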

Note

Booting e.g. a RAID1 system may fail even if all filesystem’s device paths are provided as the actual device nodes may not be discovered by the system at that point.

discard, discard=sync, discard=async, nodiscard

(default: async when devices support it since 6.2, async support since: 5.6)

Enable discarding of freed file blocks. This is useful for SSD devices, thinly provisioned LUNs, or virtual machine images; however, every storage layer must support discard for it to work.

In the synchronous mode (sync or without option value), lack of asynchronous queued TRIM on the backing device can severely degrade performance, because a synchronous TRIM operation will be attempted instead. Queued TRIM requires newer than SATA revision 3.1 chipsets and devices.

The asynchronous mode (async) gathers extents in larger chunks before sending them to the devices for TRIM. The overhead and performance impact should be negligible compared to the previous mode and it’s supposed to be the preferred mode if needed.

If it is not necessary to immediately discard freed blocks, then the fstrim tool can be used to discard all free blocks in a batch. Scheduling a TRIM during a period of low system activity will prevent latent interference with the performance of other operations. Also, a device may ignore the TRIM command if the range is too small, so running a batch discard has a greater probability of actually discarding the blocks.
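
For example, a batch discard of a mounted filesystem (an illustrative sketch; /mnt is a placeholder):

# fstrim -v /mnt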

enospc_debug, noenospc_debug

(default: off)

Enable verbose output for some ENOSPC conditions. It’s safe to use but can be noisy if the system reaches near-full state.

fatal_errors=<action>

(since: 3.4, default: bug)

Action to take when encountering a fatal error.

bug

BUG() on a fatal error, the system will stay in the crashed state and may be still partially usable, but reboot is required for full operation

panic

panic() on a fatal error, depending on other system configuration, this may be followed by a reboot. Please refer to the documentation of kernel boot parameters, e.g. panic, oops or crashkernel.

flushoncommit, noflushoncommit

(default: off)

This option forces any data dirtied by a write in a prior transaction to commit as part of the current commit, effectively a full filesystem sync.

This makes the committed state a fully consistent view of the file system from the application’s perspective (i.e. it includes all completed file system operations). This was previously the behavior only when a snapshot was created.

When off, the filesystem is consistent but buffered writes may last more than one transaction commit.

fragment=<type>

(depends on compile-time option CONFIG_BTRFS_DEBUG, since: 4.4, default: off)

A debugging helper to intentionally fragment given type of block groups. The type can be data, metadata or all. This mount option should not be used outside of debugging environments and is not recognized if the kernel config option CONFIG_BTRFS_DEBUG is not enabled.

nologreplay

(default: off, even read-only)

The tree-log contains pending updates to the filesystem until the full commit. The log is replayed on next mount, this can be disabled by this option. See also treelog. Note that nologreplay is the same as norecovery.

Warning

Currently, the tree log is replayed even with a read-only mount! To disable that behaviour, mount also with nologreplay.

max_inline=<bytes>

(default: min(2048, page size) )

Specify the maximum amount of space that can be inlined in a metadata b-tree leaf. The value is specified in bytes, optionally with a K suffix (case insensitive). In practice, this value is limited by the filesystem block size (named sectorsize at mkfs time) and memory page size of the system. In case of sectorsize limit, there’s some space unavailable due to b-tree leaf headers. For example, with a 4KiB sectorsize, the maximum size of inline data is about 3900 bytes.

Inlining can be completely turned off by specifying 0. This will increase data block slack if file sizes are much smaller than block size but will reduce metadata consumption in return.

Note

The default value has changed to 2048 in kernel 4.6.

metadata_ratio=<value>

(default: 0, internal logic)

Specifies that 1 metadata chunk should be allocated after every value data chunks. Default behaviour depends on internal logic, some percent of unused metadata space is attempted to be maintained but is not always possible if there’s not enough space left for chunk allocation. The option could be useful to override the internal logic in favor of the metadata allocation if the expected workload is supposed to be metadata intense (snapshots, reflinks, xattrs, inlined files).

norecovery

(since: 4.5, default: off)

Do not attempt any data recovery at mount time. This will disable logreplay and avoids other write operations. Note that this option is the same as nologreplay.

Note

The opposite option recovery used to have different meaning but was changed for consistency with other filesystems, where norecovery is used for skipping log replay. BTRFS does the same and in general will try to avoid any write operations.

rescan_uuid_tree

(since: 3.12, default: off)

Force check and rebuild procedure of the UUID tree. This should not normally be needed.

rescue

(since: 5.9)

Modes allowing mount with damaged filesystem structures.

  • usebackuproot (since: 5.9, replaces standalone option usebackuproot)

  • nologreplay (since: 5.9, replaces standalone option nologreplay)

  • ignorebadroots, ibadroots (since: 5.11)

  • ignoredatacsums, idatacsums (since: 5.11)

  • all (since: 5.9)

skip_balance

(since: 3.3, default: off)

Skip automatic resume of an interrupted balance operation. The operation can later be resumed with btrfs balance resume, or the paused state can be removed with btrfs balance cancel. The default behaviour is to resume an interrupted balance immediately after a volume is mounted.
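
For example, to mount without resuming a previously interrupted balance and then cancel it (an illustrative sketch; /dev/sdx and /mnt are placeholders):

# mount -o skip_balance /dev/sdx /mnt
# btrfs balance cancel /mnt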

space_cache, space_cache=<version>, nospace_cache

(nospace_cache since: 3.2, space_cache=v1 and space_cache=v2 since 4.5, default: space_cache=v2)

Options to control the free space cache. The free space cache greatly improves performance when reading block group free space into memory. However, managing the space cache consumes some resources, including a small amount of disk space.

There are two implementations of the free space cache. The original one, referred to as v1, used to be a safe default but has been superseded by v2. The v1 space cache can be disabled at mount time with nospace_cache without clearing.

On very large filesystems (many terabytes) and certain workloads, the performance of the v1 space cache may degrade drastically. The v2 implementation, which adds a new b-tree called the free space tree, addresses this issue. Once enabled, the v2 space cache will always be used and cannot be disabled unless it is cleared. Use clear_cache,space_cache=v1 or clear_cache,nospace_cache to do so. If v2 is enabled, the v1 space cache will be cleared (at the first mount) and kernels without v2 support will only be able to mount the filesystem in read-only mode. On an unmounted filesystem the caches (both versions) can be cleared by “btrfs check --clear-space-cache”.
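
For example, a one-time mount that clears the v2 cache and switches back to v1 (an illustrative sketch; /dev/sdx and /mnt are placeholders):

# mount -o clear_cache,space_cache=v1 /dev/sdx /mnt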

The btrfs-check(8) and mkfs.btrfs(8) commands have full v2 free space cache support since v4.19.

If a version is not explicitly specified, the default implementation will be chosen, which is v2.

ssd, ssd_spread, nossd, nossd_spread

(default: SSD autodetected)

Options to control SSD allocation schemes. By default, BTRFS will enable or disable SSD optimizations depending on status of a device with respect to rotational or non-rotational type. This is determined by the contents of /sys/block/DEV/queue/rotational. If it is 0, the ssd option is turned on. The option nossd will disable the autodetection.

The optimizations make use of the absence of the seek penalty that’s inherent for the rotational devices. The blocks can be typically written faster and are not offloaded to separate threads.

Note

Since 4.14, the block layout optimizations have been dropped. This used to help with first generations of SSD devices. Their FTL (flash translation layer) was not effective and the optimization was supposed to improve the wear by better aligning blocks. This is no longer true with modern SSD devices and the optimization had no real benefit. Furthermore it caused increased fragmentation. The layout tuning has been kept intact for the option ssd_spread.

The ssd_spread mount option attempts to allocate into bigger and aligned chunks of unused space, and may perform better on low-end SSDs. ssd_spread implies ssd, enabling all other SSD heuristics as well. The option nossd will disable all SSD options while nossd_spread only disables ssd_spread.

subvol=<path>

Mount subvolume from path rather than the toplevel subvolume. The path is always treated as relative to the toplevel subvolume. This mount option overrides the default subvolume set for the given filesystem.

subvolid=<subvolid>

Mount subvolume specified by a subvolid number rather than the toplevel subvolume. You can use btrfs subvolume list or btrfs subvolume show to see subvolume ID numbers. This mount option overrides the default subvolume set for the given filesystem.

Note

If both subvolid and subvol are specified, they must point at the same subvolume, otherwise the mount will fail.
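
For example, mounting a subvolume either by path or by id (an illustrative sketch; the subvolume name home, id 256, /dev/sdx and /mnt are placeholders):

# mount -o subvol=home /dev/sdx /mnt
# mount -o subvolid=256 /dev/sdx /mnt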

thread_pool=<number>

(default: min(NRCPUS + 2, 8) )

The number of worker threads to start. NRCPUS is number of on-line CPUs detected at the time of mount. Small number leads to less parallelism in processing data and metadata, higher numbers could lead to a performance hit due to increased locking contention, process scheduling, cache-line bouncing or costly data transfers between local CPU memories.

treelog, notreelog

(default: on)

Enable the tree logging used for fsync and O_SYNC writes. The tree log stores changes without the need of a full filesystem sync. The log operations are flushed at sync and transaction commit. If the system crashes between two such syncs, the pending tree log operations are replayed during mount.

Warning

Currently, the tree log is replayed even with a read-only mount! To disable that behaviour, also mount with nologreplay.

The tree log could contain new files/directories, these would not exist on a mounted filesystem if the log is not replayed.

usebackuproot

(since: 4.6, default: off)

Enable autorecovery attempts if a bad tree root is found at mount time. Currently this scans a backup list of several previous tree roots and tries to use the first readable. This can be used with read-only mounts as well.

Note

This option has replaced recovery.

user_subvol_rm_allowed

(default: off)

Allow subvolumes to be deleted by their respective owner. Otherwise, only the root user can do that.

Note

Historically, any user could create a snapshot even if he was not owner of the source subvolume, the subvolume deletion has been restricted for that reason. The subvolume creation has been restricted but this mount option is still required. This is a usability issue. Since 4.18, the rmdir(2) syscall can delete an empty subvolume just like an ordinary directory. Whether this is possible can be detected at runtime, see rmdir_subvol feature in FILESYSTEM FEATURES.

DEPRECATED MOUNT OPTIONS

List of mount options that have been removed, kept for backward compatibility.

recovery

(since: 3.2, default: off, deprecated since: 4.5)

Note

This option has been replaced by usebackuproot and should not be used but will work on 4.5+ kernels.

inode_cache, noinode_cache

(removed in: 5.11, since: 3.0, default: off)

Note

The functionality has been removed in 5.11, any stale data created by previous use of the inode_cache option can be removed by btrfs rescue clear-ino-cache.

NOTES ON GENERIC MOUNT OPTIONS

Some of the general mount options from mount(8) affect BTRFS and are worth mentioning.

noatime

under read intensive work-loads, specifying noatime significantly improves performance because no new access time information needs to be written. Without this option, the default is relatime, which only reduces the number of inode atime updates in comparison to the traditional strictatime. The worst case for atime updates under relatime occurs when many files are read whose atime is older than 24 h and which are freshly snapshotted. In that case the atime is updated and COW happens - for each file - in bulk. See also https://lwn.net/Articles/499293/ - Atime and btrfs: a bad combination? (LWN, 2012-05-31).

Note that noatime may break applications that rely on atime updates like the venerable Mutt (unless you use maildir mailboxes).
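
For example, an /etc/fstab entry with noatime (an illustrative sketch; the device and mount point are placeholders):

/dev/sdx   /mnt   btrfs   noatime   0 0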

FILESYSTEM FEATURES

The basic set of filesystem features gets extended over time. The backward compatibility is maintained and the features are optional, need to be explicitly asked for so accidental use will not create incompatibilities.

There are several classes and the respective tools to manage the features:

at mkfs time only

This is namely for core structures, like the b-tree nodesize or checksum algorithm, see mkfs.btrfs(8) for more details.

after mkfs, on an unmounted filesystem

Features that may optimize internal structures or add new structures to support new functionality, see btrfstune(8). The command btrfs inspect-internal dump-super /dev/sdx will dump a superblock, you can map the value of incompat_flags to the features listed below.

after mkfs, on a mounted filesystem

The features of a filesystem (with a given UUID) are listed in /sys/fs/btrfs/UUID/features/, one file per feature. The status is stored inside the file. The value 1 is for enabled and active, while 0 means the feature was enabled at mount time but turned off afterwards.

Whether a particular feature can be turned on a mounted filesystem can be found in the directory /sys/fs/btrfs/features/, one file per feature. The value 1 means the feature can be enabled.
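
For example, listing the kernel-supported features and reading the status of one of them on a mounted filesystem (an illustrative sketch; UUID and the chosen feature file are placeholders):

# ls /sys/fs/btrfs/features/
# cat /sys/fs/btrfs/UUID/features/free_space_tree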

List of features (see also mkfs.btrfs(8) section FILESYSTEM FEATURES):

big_metadata

(since: 3.4)

the filesystem uses nodesize for metadata blocks, this can be bigger than the page size

block_group_tree

(since: 6.1)

block group item representation using a dedicated b-tree, this can greatly reduce mount time for large filesystems

compress_lzo

(since: 2.6.38)

the lzo compression has been used on the filesystem, either as a mount option or via btrfs filesystem defrag.

compress_zstd

(since: 4.14)

the zstd compression has been used on the filesystem, either as a mount option or via btrfs filesystem defrag.

default_subvol

(since: 2.6.34)

the default subvolume has been set on the filesystem

extended_iref

(since: 3.7)

increased hardlink limit per file in a directory to 65536, older kernels supported a varying number of hardlinks depending on the sum of all file name sizes that can be stored into one metadata block

free_space_tree

(since: 4.5)

free space representation using a dedicated b-tree, successor of v1 space cache

metadata_uuid

(since: 5.0)

the main filesystem UUID is the metadata_uuid, which stores the new UUID only in the superblock while all metadata blocks still have the UUID set at mkfs time, see btrfstune(8) for more

mixed_backref

(since: 2.6.31)

the last major disk format change, improved backreferences, now default

mixed_groups

(since: 2.6.37)

mixed data and metadata block groups, i.e. the data and metadata are not separated and occupy the same block groups, this mode is suitable for small volumes as there are no constraints how the remaining space should be used (compared to the split mode, where empty metadata space cannot be used for data and vice versa)

on the other hand, the final layout is quite unpredictable and possibly highly fragmented, which means worse performance

no_holes

(since: 3.14)

improved representation of file extents where holes are not explicitly stored as an extent, saves a few percent of metadata if sparse files are used

raid1c34

(since: 5.5)

extended RAID1 mode with copies on 3 or 4 devices respectively

raid_stripe_tree

(since: 6.7)

a separate tree for tracking file extents on RAID profiles

RAID56

(since: 3.9)

the filesystem contains or contained a RAID56 profile of block groups

rmdir_subvol

(since: 4.18)

indicate that rmdir(2) syscall can delete an empty subvolume just like an ordinary directory. Note that this feature only depends on the kernel version.

skinny_metadata

(since: 3.10)

reduced-size metadata for extent references, saves a few percent of metadata

send_stream_version

(since: 5.10)

number of the highest supported send stream version

simple_quota

(since: 6.7)

simplified quota accounting

supported_checksums

(since: 5.5)

list of checksum algorithms supported by the kernel module, the respective modules or built-in implementing the algorithms need to be present to mount the filesystem, see section CHECKSUM ALGORITHMS.

supported_sectorsizes

(since: 5.13)

list of values that are accepted as sector sizes (mkfs.btrfs --sectorsize) by the running kernel

supported_rescue_options

(since: 5.11)

list of values for the mount option rescue that are supported by the running kernel, see btrfs(5)

zoned

(since: 5.12)

zoned mode is allocation/write friendly to host-managed zoned devices, allocation space is partitioned into fixed-size zones that must be updated sequentially, see section ZONED MODE

SWAPFILE SUPPORT

A swapfile, when active, is a file-backed swap area. It is supported since kernel 5.0. Use swapon(8) to activate it, until then (respectively again after deactivating it with swapoff(8)) it’s just a normal file (with NODATACOW set), for which the special restrictions for active swapfiles don’t apply.

There are some limitations of the implementation in BTRFS and Linux swap subsystem:

  • filesystem - must be only single device

  • filesystem - must have only single data profile

  • subvolume - cannot be snapshotted if it contains any active swapfiles

  • swapfile - must be preallocated (i.e. no holes)

  • swapfile - must be NODATACOW (i.e. also NODATASUM, no compression)

The limitations come namely from the COW-based design and mapping layer of blocks that allows the advanced features like relocation and multi-device filesystems. However, the swap subsystem expects simpler mapping and no background changes of the file block location once they’ve been assigned to swap.

With active swapfiles, the following whole-filesystem operations will skip swapfile extents or may fail:

  • balance - block groups with extents of any active swapfiles are skipped and reported, the rest will be processed normally

  • resize grow - unaffected

  • resize shrink - works as long as the extents of any active swapfiles are outside of the shrunk range

  • device add - if the new devices do not interfere with any already active swapfiles this operation will work, though no new swapfile can be activated afterwards

  • device delete - if the device has been added as above, it can be also deleted

  • device replace - ditto

When there are no active swapfiles and a whole-filesystem exclusive operation is running (e.g. balance, device delete, shrink), the swapfiles cannot be temporarily activated. The operation must finish first.

To create and activate a swapfile run the following commands:

# truncate -s 0 swapfile
# chattr +C swapfile
# fallocate -l 2G swapfile
# chmod 0600 swapfile
# mkswap swapfile
# swapon swapfile

Since version 6.1 it’s possible to create the swapfile in a single command (except the activation):

# btrfs filesystem mkswapfile --size 2G swapfile
# swapon swapfile

Please note that the UUID returned by the mkswap utility identifies the swap “filesystem” and because it’s stored in a file, it’s not generally visible and usable as an identifier unlike if it was on a block device.

Once activated the file will appear in /proc/swaps:

# cat /proc/swaps
Filename          Type          Size           Used      Priority
/path/swapfile    file          2097152        0         -2

The swapfile can be created as one-time operation or, once properly created, activated on each boot by the swapon -a command (usually started by the service manager). Add the following entry to /etc/fstab, assuming the filesystem that provides the /path has been already mounted at this point. Additional mount options relevant for the swapfile can be set too (like priority, not the BTRFS mount options).

/path/swapfile        none        swap        defaults      0 0

From now on the subvolume with the active swapfile cannot be snapshotted until the swapfile is deactivated again by swapoff. Then the swapfile is a regular file and the subvolume can be snapshotted again, though this would prevent another activation of any swapfile that has been snapshotted. New swapfiles (not snapshotted) can be created and activated.

Otherwise, an inactive swapfile does not affect the containing subvolume. Activation creates a temporary in-memory status and prevents some file operations, but is not stored permanently.

Hibernation

A swapfile can be used for hibernation but it’s not straightforward. Before hibernation a resume offset must be written to file /sys/power/resume_offset or the kernel command line parameter resume_offset must be set.

The value is the physical offset on the device. Note that this is not the same value that filefrag prints as physical offset!

Btrfs filesystem uses mapping between logical and physical addresses but here the physical can still map to one or more device-specific physical block addresses. It’s the device-specific physical offset that is suitable as resume offset.

Since version 6.1 there’s a command btrfs inspect-internal map-swapfile that will print the device physical offset and the adjusted value for /sys/power/resume_offset. Note that the value is divided by page size, i.e. it’s not the offset itself.

# btrfs filesystem mkswapfile swapfile
# btrfs inspect-internal map-swapfile swapfile
Physical start: 811511726080
Resume offset:     198122980

For scripting and convenience the option -r will print just the offset:

# btrfs inspect-internal map-swapfile -r swapfile
198122980

The command map-swapfile also verifies all the requirements, i.e. no holes, single device, etc.

Troubleshooting

If the swapfile activation fails please verify that you followed all the steps above or check the system log (e.g. dmesg or journalctl) for more information.

Notably, the swapon utility exits with a message that does not say what failed:

# swapon /path/swapfile
swapon: /path/swapfile: swapon failed: Invalid argument

The specific reason is likely to be printed to the system log by the btrfs module:

# journalctl -t kernel | grep swapfile
kernel: BTRFS warning (device sda): swapfile must have single data profile

CHECKSUM ALGORITHMS

Data and metadata are checksummed by default, the checksum is calculated before write and verified after reading the blocks from devices. The whole metadata block has a checksum stored inline in the b-tree node header, each data block has a detached checksum stored in the checksum tree.

There are several checksum algorithms supported. The default and backward compatible is crc32c. Since kernel 5.5 there are three more with different characteristics and trade-offs regarding speed and strength. The following list may help you to decide which one to select.
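
The algorithm can only be selected at mkfs time, for example (an illustrative sketch; /dev/sdx is a placeholder):

# mkfs.btrfs --csum xxhash /dev/sdx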

CRC32C (32bit digest)

default, best backward compatibility, very fast, modern CPUs have instruction-level support, not collision-resistant but still good error detection capabilities

XXHASH (64bit digest)

can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing instruction pipelining, good collision resistance and error detection

SHA256 (256bit digest)

a cryptographic-strength hash, relatively slow but with possible CPU instruction acceleration or specialized hardware cards, FIPS certified and in wide use

BLAKE2b (256bit digest)

a cryptographic-strength hash, relatively fast with possible CPU acceleration using SIMD extensions, not standardized but based on BLAKE which was a SHA3 finalist, in wide use, the algorithm used is BLAKE2b-256 that’s optimized for 64bit platforms

The digest size affects overall size of data block checksums stored in the filesystem. The metadata blocks have a fixed area up to 256 bits (32 bytes), so there’s no increase. Each data block has a separate checksum stored, with additional overhead of the b-tree leaves.

Approximate relative performance of the algorithms, measured against CRC32C using implementations on a 11th gen 3.6GHz Intel CPU:

Digest    Cycles/4KiB   Ratio   Implementation
CRC32C    470           1.00    CPU instruction, PCL combination
XXHASH    870           1.9     reference impl.
SHA256    7600          16      libgcrypt
SHA256    8500          18      openssl
SHA256    8700          18      botan
SHA256    32000         68      builtin, CPU instruction
SHA256    37000         78      libsodium
SHA256    78000         166     builtin, reference impl.
BLAKE2b   10000         21      builtin/AVX2
BLAKE2b   10900         23      libgcrypt
BLAKE2b   13500         29      builtin/SSE41
BLAKE2b   13700         29      libsodium
BLAKE2b   14100         30      openssl
BLAKE2b   14500         31      kcapi
BLAKE2b   14500         34      builtin, reference impl.

Many kernels are configured with SHA256 as built-in and not as a module. The accelerated versions are however provided by the modules and must be loaded explicitly (modprobe sha256) before mounting the filesystem to make use of them. You can check in /sys/fs/btrfs/FSID/checksum which one is used. If you see sha256-generic, then you may want to unmount and mount the filesystem again, changing that on a mounted filesystem is not possible. Check the file /proc/crypto, when the implementation is built-in, you’d find

name         : sha256
driver       : sha256-generic
module       : kernel
priority     : 100
...

while accelerated implementation is e.g.

name         : sha256
driver       : sha256-avx2
module       : sha256_ssse3
priority     : 170
...

COMPRESSION

Btrfs supports transparent file compression. There are three algorithms available: ZLIB, LZO and ZSTD (since v4.14), with various levels. The compression happens on the level of file extents and the algorithm is selected by file property, mount option or by a defrag command. You can have a single btrfs mount point that has some files that are uncompressed, some that are compressed with LZO, some with ZLIB, for instance (though you may not want it that way, it is supported).

Once the compression is set, all newly written data will be compressed, i.e. existing data are untouched. Data are split into smaller chunks (128KiB) before compression to make random rewrites possible without a high performance hit. Due to the increased number of extents the metadata consumption is higher. The chunks are compressed in parallel.

The algorithms can be characterized as follows regarding the speed/ratio trade-offs:

ZLIB
  • slower, higher compression ratio

  • levels: 1 to 9, mapped directly, default level is 3

  • good backward compatibility

LZO
  • faster compression and decompression than ZLIB, worse compression ratio, designed to be fast

  • no levels

  • good backward compatibility

ZSTD
  • compression comparable to ZLIB with higher compression/decompression speeds and different ratio

  • levels: 1 to 15, mapped directly (higher levels are not available)

  • since 4.14, levels since 5.1

The differences depend on the actual data set and cannot be expressed by a single number or recommendation. Higher levels consume more CPU time and may not bring a significant improvement, lower levels are close to real time.

How to enable compression

Typically the compression can be enabled on the whole filesystem, specified for the mount point. Note that the compression mount options are shared among all mounts of the same filesystem, either bind mounts or subvolume mounts. Please refer to btrfs(5) section MOUNT OPTIONS.

$ mount -o compress=zstd /dev/sdx /mnt

This will enable the zstd algorithm on the default level (which is 3). The level can be specified manually too like zstd:3. Higher levels compress better at the cost of time. This in turn may cause increased write latency, low levels are suitable for real-time compression and on reasonably fast CPU don’t cause noticeable performance drops.

$ btrfs filesystem defrag -czstd file

The command above will start defragmentation of the whole file and apply the compression, regardless of the mount option. (Note: specifying level is not yet implemented). The compression algorithm is not persistent and applies only to the defragmentation command, for any other writes other compression settings apply.

Persistent settings on a per-file basis can be set in two ways:

$ chattr +c file
$ btrfs property set file compression zstd

The first command is using legacy interface of file attributes inherited from ext2 filesystem and is not flexible, so by default the zlib compression is set. The other command sets a property on the file with the given algorithm. (Note: setting level that way is not yet implemented.)

Compression levels

The level support of ZLIB has been added in v4.14, LZO does not support levels (the kernel implementation provides only one), ZSTD level support has been added in v5.1.

There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the mount option to the algorithm defined level. The default is level 3, which provides the reasonably good compression ratio and is still reasonably fast. The difference in compression gain of levels 7, 8 and 9 is comparable but the higher levels take longer.

The ZSTD support includes levels 1 to 15, a subset of full range of what ZSTD provides. Levels 1-3 are real-time, 4-8 slower with improved compression and 9-15 try even harder though the resulting size may not be significantly improved.

Level 0 always maps to the default. The compression level does not affect compatibility.

Incompressible data

Files with already compressed data or with data that won’t compress well with the CPU and memory constraints of the kernel implementations are using a simple decision logic. If the first portion of data being compressed is not smaller than the original, the compression of the file is disabled -- unless the filesystem is mounted with compress-force. In that case compression will always be attempted on the file only to be later discarded. This is not optimal and subject to optimizations and further development.

If a file is identified as incompressible, a flag is set (NOCOMPRESS) and it’s sticky. On that file compression won’t be performed unless forced. The flag can be also set by chattr +m (since e2fsprogs 1.46.2) or by properties with value no or none. Empty value will reset it to the default that’s currently applicable on the mounted filesystem.
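
For example, compression can be explicitly disabled for a single file either via the attribute or the property interface (an illustrative sketch; file is a placeholder name):

$ chattr +m file
$ btrfs property set file compression none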

There are two ways to detect incompressible data:

  • actual compression attempt - data are compressed, if the result is not smaller, it’s discarded, so this depends on the algorithm and level

  • pre-compression heuristics - a quick statistical evaluation on the data is performed and based on the result either compression is performed or skipped, the NOCOMPRESS bit is not set just by the heuristic, only if the compression algorithm does not make an improvement

$ lsattr file
---------------------m file

Using forced compression is not recommended, the heuristics are supposed to decide that and compression algorithms internally detect incompressible data too.

Pre-compression heuristics

The heuristics aim to do a few quick statistical tests on the compressed data in order to avoid probably costly compression that would turn out to be inefficient. Compression algorithms could have internal detection of incompressible data too but this leads to more overhead as the compression is done in another thread and has to write the data anyway. The heuristic is read-only and can utilize cached memory.

The tests performed are based on the following: data sampling, long repeated pattern detection, byte frequency, Shannon entropy.

Compatibility

Compression is done using the COW mechanism so it’s incompatible with nodatacow. Direct IO read works on compressed files but will fall back to buffered writes and leads to no compression even if force compression is set. Currently nodatasum and compression don’t work together.

The compression algorithms have been added over time so the version compatibility should be also considered, together with other tools that may access the compressed data like bootloaders.

SYSFS INTERFACE

Btrfs has a sysfs interface to provide extra knobs.

The top level path is /sys/fs/btrfs/, and the main directory layout is the following:

Relative Path                   Description                            Version
features/                       All supported features                 3.14+
<UUID>/                         Mounted fs UUID                        3.14+
<UUID>/allocation/              Space allocation info                  3.14+
<UUID>/features/                Features of the filesystem             3.14+
<UUID>/devices/<DEVID>/         Symlink to each block device sysfs     5.6+
<UUID>/devinfo/<DEVID>/         Btrfs specific info for each device    5.6+
<UUID>/qgroups/                 Global qgroup info                     5.9+
<UUID>/qgroups/<LEVEL>_<ID>/    Info for each qgroup                   5.9+
<UUID>/discard/                 Discard stats and tunables             6.1+

For /sys/fs/btrfs/features/ directory, each file means a supported feature for the current kernel.
对于 /sys/fs/btrfs/features/ 目录,每个文件代表当前内核支持的一个特性。

For /sys/fs/btrfs/<UUID>/features/ directory, each file means an enabled feature for the mounted filesystem.
对于 /sys/fs/btrfs/<UUID>/features/ 目录,每个文件表示挂载文件系统的一个已启用功能。

The features share the same names as in section FILESYSTEM FEATURES.
这些功能在 FILESYSTEM FEATURES 部分中具有相同的名称。
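For example, to compare what the running kernel supports with what is enabled on a mounted filesystem (<UUID> stands for the actual filesystem UUID):

$ ls /sys/fs/btrfs/features/
$ ls /sys/fs/btrfs/<UUID>/features/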

Files in /sys/fs/btrfs/<UUID>/ directory are:
/sys/fs/btrfs/<UUID>/ 目录中的文件为:

bg_reclaim_threshold 回收阈值

(RW, since: 5.19) (RW, 自 5.19 版本起)

Used space percentage of total device space to start automatic block group reclaim. Mostly for zoned devices.
用于启动自动块组索赔的总设备空间的已用空间百分比。主要用于分区设备。
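For example, to start reclaiming block groups once 75% of the device space is used (75 is only an illustrative value):

# echo 75 > /sys/fs/btrfs/<UUID>/bg_reclaim_threshold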

checksum 校验和

(RO, since: 5.5) (只读, 自 5.5 版本起)

The checksum used for the mounted filesystem. This includes both the checksum type (see section CHECKSUM ALGORITHMS) and the implemented driver (mostly shows if it’s hardware accelerated).
用于挂载文件系统的校验和。这包括校验和类型(请参阅校验和算法部分)和实现的驱动程序(主要显示是否硬件加速)。

clone_alignment 克隆对齐

(RO, since: 3.16) (只读, 自 3.16 版本起)

The bytes alignment for clone and dedupe ioctls.
克隆和去重 ioctl 的字节对齐。

commit_stats 提交统计

(RW, since: 6.0) (RW, 自 6.0 起)

The performance statistics for btrfs transaction commit. Mostly for debug purposes.
btrfs 事务提交的性能统计。主要用于调试目的。

Writing into this file will reset the maximum commit duration to the input value.
将写入此文件将最大提交持续时间重置为输入值。
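A short sketch of reading the statistics and then resetting the tracked maximum commit duration (the value 0 is only illustrative):

# cat /sys/fs/btrfs/<UUID>/commit_stats
# echo 0 > /sys/fs/btrfs/<UUID>/commit_stats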

exclusive_operation 独占操作

(RO, since: 5.10) (RO, 自 5.10 版本起)

Shows the running exclusive operation. Check section FILESYSTEM EXCLUSIVE OPERATIONS for details.
显示正在运行的独占操作。有关详细信息,请查看文件系统独占操作部分。

generation 生成

(RO, since: 5.11) (只读, 自 5.11 版本起)

Show the generation of the mounted filesystem.
显示已挂载文件系统的生成。

label 标签

(RW, since: 3.14) (RW, 自 3.14 起)

Show the current label of the mounted filesystem.
显示已挂载文件系统的当前标签。

metadata_uuid 元数据 UUID

(RO, since: 5.0) (只读, 自 5.0 版本起)

Shows the metadata uuid of the mounted filesystem. Check metadata_uuid feature for more details.
显示已挂载文件系统的元数据 UUID。有关更多详细信息,请查看元数据 UUID 功能。

nodesize 节点大小

(RO, since: 3.14) (只读, 自 3.14 版本起)

Show the nodesize of the mounted filesystem.
显示已挂载文件系统的节点大小。

quota_override 配额覆盖

(RW, since: 4.13) (自 4.13 版本起可读写)

Shows the current quota override status. 0 means no quota override. 1 means quota override, quota can ignore the existing limit settings.
显示当前配额覆盖状态。0 表示没有配额覆盖。1 表示配额覆盖,配额可以忽略现有的限制设置。

read_policy 读取策略

(RW, since: 5.11) (读写, 自 5.11 版本起)

Shows the current balance policy for reads. Currently only “pid” (balance using pid value) is supported.
显示当前的读取平衡策略。目前仅支持“pid”(使用 pid 值进行平衡)。

sectorsize 扇区大小

(RO, since: 3.14) (只读, 自版本 3.14 起)

Shows the sectorsize of the mounted filesystem.
显示已挂载文件系统的扇区大小。

Files and directories in /sys/fs/btrfs/<UUID>/allocation directory are:
/sys/fs/btrfs/<UUID>/allocations 目录中的文件和目录为:

global_rsv_reserved

(RO, since: 3.14) (只读, 自版本 3.14 起)

The used bytes of the global reservation.
全局保留的已使用字节。

global_rsv_size

(RO, since: 3.14) (只读, 自版本 3.14 起)

The total size of the global reservation.
全球预订总大小。

data/, metadata/ and system/ directories
data/、metadata/ 和 system/ 目录

(RO, since: 5.14) (只读, 自 5.14 版本起)

Space info accounting for the 3 chunk types. Mostly for debug purposes.
占用 3 种块类型的空间信息。主要用于调试目的。

Files in /sys/fs/btrfs/<UUID>/allocation/{data,metadata,system} directories are:
/sys/fs/btrfs/<UUID>/allocations/data,metadata,system 目录中的文件为:

bg_reclaim_threshold

(RW, since: 5.19) (RW, 自 5.19 起)

Reclaimable space percentage of block group’s size (excluding permanently unusable space) to reclaim the block group. Can be used on regular or zoned devices.
可回收空间百分比,用于回收块组的大小(不包括永久不可用空间)。可用于常规设备或分区设备。

chunk_size 块大小

(RW, since: 6.0) (RW, 自 6.0 起)

Shows the chunk size. Can be changed for data and metadata. Cannot be set for zoned devices.
显示块大小。可为数据和元数据更改。无法为分区设备设置。

Files in /sys/fs/btrfs/<UUID>/devinfo/<DEVID> directory are:
/sys/fs/btrfs/<UUID>/devinfo/<DEVID> 目录中的文件为:

error_stats: 错误统计:

(RO, since: 5.14) (只读, 自 5.14 版本起)

Shows all the history error numbers of the device.
显示设备的所有历史错误数量。

fsid:

(RO, since: 5.17) (只读, 自 5.17 版本起)

Shows the fsid which the device belongs to. It can be different than the <UUID> if it’s a seed device.
显示设备所属的 fsid。如果是种子设备,则可能与 <UUID> 不同。

in_fs_metadata

(RO, since: 5.6) (只读, 自 5.6 版本起)

Shows whether we have found the device. Should always be 1; if it turns to 0, the <DEVID> directory would get removed automatically.
显示我们是否找到了设备。应始终为 1,因为如果变为 0,<DEVID> 目录将自动删除。

missing 缺失

(RO, since: 5.6) (只读, 自 5.6 版本起)

Shows whether the device is missing.
显示设备是否丢失。

replace_target 替换目标

(RO, since: 5.6) (只读, 自 5.6 版本起)

Shows whether the device is the replace target. If no dev-replace is running, this value should be 0.
显示设备是否为替换目标。如果没有进行 dev-replace 操作,则该值应为 0。

scrub_speed_max 最大擦除速度

(RW, since: 5.14) (读写, 自 5.14 版本起)

Shows the scrub speed limit for this device. The unit is Bytes/s. 0 means no limit.
显示此设备的擦除速度限制。单位为字节/秒。0 表示无限制。
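For example, to limit scrub on the device with DEVID 1 to roughly 100MiB/s and then remove the limit again (values are in bytes per second and only illustrative):

# echo 104857600 > /sys/fs/btrfs/<UUID>/devinfo/1/scrub_speed_max
# echo 0 > /sys/fs/btrfs/<UUID>/devinfo/1/scrub_speed_max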

writeable 可写

(RO, since: 5.6) (只读,自 5.6 版本起)

Show if the device is writeable.
显示设备是否可写。

Files in /sys/fs/btrfs/<UUID>/qgroups/ directory are:
/sys/fs/btrfs/<UUID>/qgroups/ 目录中的文件为:

enabled 已启用

(RO, since: 6.1) (只读, 自 6.1 起)

Shows if qgroup is enabled. Also, if qgroup is disabled, the qgroups directory would be removed automatically.
显示 qgroup 是否已启用。此外,如果 qgroup 已禁用,则 qgroups 目录将被自动删除。

inconsistent 不一致的

(RO, since: 6.1) (只读,自 6.1 版本起)

Shows if the qgroup numbers are inconsistent. If 1, it’s recommended to do a qgroup rescan.
显示 qgroup 数字是否不一致。如果为 1,则建议执行 qgroup rescan。

drop_subtree_threshold

(RW, since: 6.1) (可读写, 自版本 6.1 起)

Shows the subtree drop threshold to automatically mark qgroup inconsistent.
显示子树下降阈值,以自动标记 qgroup 不一致。

When dropping large subvolumes with qgroup enabled, there would be a huge load for qgroup accounting. If we have a subtree whose level is larger than or equal to this value, we will not trigger qgroup accounting at all, but mark qgroup inconsistent to avoid the huge workload.
在启用 qgroup 的情况下删除大子卷时,qgroup 记账会有很大的负载。如果我们有一个子树,其级别大于或等于此值,则我们将根本不会触发 qgroup 记账,而是标记 qgroup 不一致以避免巨大的工作量。

Default value is 8, where no subtree drop can trigger qgroup accounting.
默认值为 8,其中没有子树下降会触发 qgroup。

Lower value can reduce qgroup workload, at the cost of extra qgroup rescan to re-calculate the numbers.
降低值可以减少 qgroup 的工作量,但会增加额外的 qgroup 重新扫描以重新计算数字。
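A sketch of checking the default and lowering the threshold (the value 4 is only illustrative):

# cat /sys/fs/btrfs/<UUID>/qgroups/drop_subtree_threshold
8
# echo 4 > /sys/fs/btrfs/<UUID>/qgroups/drop_subtree_threshold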

Files in /sys/fs/btrfs/<UUID>/qgroups/<LEVEL>_<ID>/ directory are:
/sys/fs/btrfs/<UUID>/<LEVEL>_<ID>/ 目录中的文件为:

exclusive 独占

(RO, since: 5.9) (只读,自 5.9 版本起)

Shows the exclusively owned bytes of the qgroup.
显示 qgroup 独占的字节。

limit_flags 限制标志

(RO, since: 5.9) (只读, 自 5.9 版本起)

Shows the numeric value of the limit flags. If 0, no limit is implied.
显示限制标志的数值。如果为 0,则表示没有限制。

max_exclusive 最大独占

(RO, since: 5.9) (只读, 自 5.9 版本起)

Shows the limits on exclusively owned bytes.
显示独占字节的限制。

max_referenced

(RO, since: 5.9) (只读, 自 5.9 版本起)

Shows the limits on referenced bytes.
显示引用字节的限制。

referenced 引用的

(RO, since: 5.9) (只读, 自 5.9 版本起)

Shows the referenced bytes of the qgroup.
显示 qgroup 的引用字节。

rsv_data

(RO, since: 5.9) (RO, 自 5.9 版本起)

Shows the reserved bytes for data.
显示数据的保留字节。

rsv_meta_pertrans 每个事务元数据的保留字节

(RO, since: 5.9) (只读, 自 5.9 版本起)

Shows the reserved bytes for per transaction metadata.
显示每个事务元数据的保留字节。

rsv_meta_prealloc 预分配元数据的保留字节数

(RO, since: 5.9) (只读, 自 5.9 版本起)

Shows the reserved bytes for preallocated metadata.
显示预分配元数据的保留字节数

Files in /sys/fs/btrfs/<UUID>/discard/ directory are:
/sys/fs/btrfs/<UUID>/discard/ 目录中的文件为:

discardable_bytes 可丢弃字节

(RO, since: 6.1) (只读, 自 6.1 版本起)

Shows amount of bytes that can be discarded in the async discard and nodiscard mode.
显示可以在异步丢弃和不丢弃模式中丢弃的字节数量。

discardable_extents 可丢弃的范围

(RO, since: 6.1) (只读, 自 6.1 版本起)

Shows number of extents to be discarded in the async discard and nodiscard mode.
显示在异步丢弃和不丢弃模式中要丢弃的范围数量。

discard_bitmap_bytes 丢弃位图字节

(RO, since: 6.1) (只读, 自 6.1 版本起)

Shows amount of discarded bytes from data tracked as bitmaps.
显示作为位图跟踪的数据中丢弃的字节数。

discard_extent_bytes

(RO, since: 6.1) (只读, 自 6.1 版本起)

Shows amount of discarded bytes from data tracked as extents.
显示作为位图跟踪的数据中丢弃的范围的数量。

discard_bytes_saved

(RO, since: 6.1) (只读, 自 6.1 版本起)

Shows the amount of bytes that were reallocated without being discarded.
显示重新分配但未丢弃的字节数量。

kbps_limit kbps 限制

(RW, since: 6.1) (可读写, 自版本 6.1 起)

Tunable limit of kilobytes per second issued as discard IO in the async discard mode.
在异步丢弃模式中作为丢弃 IO 的每秒千字节的可调限制。

iops_limit

(RW, since: 6.1) (读写, 自 6.1 版本起)

Tunable limit of number of discard IO operations to be issued in the async discard mode.
异步丢弃模式中要发出的丢弃 IO 操作数量的可调限制。

max_discard_size

(RW, since: 6.1) (读写, 自 6.1 版本起)

Tunable limit for size of one IO discard request.
一个 IO 丢弃请求大小的可调限制。
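A sketch of inspecting and tuning the async discard behaviour (the values are only illustrative):

# cat /sys/fs/btrfs/<UUID>/discard/discardable_bytes
# echo 100 > /sys/fs/btrfs/<UUID>/discard/iops_limit
# echo 100000 > /sys/fs/btrfs/<UUID>/discard/kbps_limit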

FILESYSTEM EXCLUSIVE OPERATIONS
文件系统独占操作

There are several operations that affect the whole filesystem and cannot be run in parallel. Attempt to start one while another is running will fail (see exceptions below).
有几个影响整个文件系统且不能并行运行的操作。在另一个操作运行时尝试启动一个操作将失败(请参见下面的异常情况)。

Since kernel 5.10 the currently running operation can be obtained from /sys/fs/btrfs/<UUID>/exclusive_operation with the following values and operations:
从内核 5.10 开始,当前运行的操作可以从 /sys/fs/UUID/exclusive_operation 中获取,具有以下值和操作:

  • balance 平衡

  • balance paused (since 5.17)
    平衡暂停(自 5.17 起)

  • device add 设备添加

  • device delete 设备删除

  • device replace 设备替换

  • resize 调整大小

  • swapfile activate 启用交换文件

  • none 

Enqueuing is supported for several btrfs subcommands so they can be started at once and then serialized.
对于几个 btrfs 子命令支持排队,以便它们可以一次启动,然后进行序列化。

There's an exception: a paused balance allows a device add operation to start, as the two don't really collide, and this can be used to add more space for the balance to finish.
当暂停余额允许启动设备添加操作时,会出现异常,因为它们实际上并不冲突,这可以用来为余额提供更多空间以完成操作。
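For example, while a balance is running (the output is illustrative):

$ cat /sys/fs/btrfs/<UUID>/exclusive_operation
balance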

FILESYSTEM LIMITS 文件系统限制

maximum file name length 文件名长度的最大限制

255

This limit is imposed by Linux VFS, the structures of BTRFS could store larger file names.
这个限制是由 Linux VFS 强加的,BTRFS 的结构可以存储更长的文件名。

maximum symlink target length
符号链接目标的最大长度

depends on the nodesize value, for 4KiB it’s 3949 bytes, for larger nodesize it’s 4095 due to the system limit PATH_MAX
取决于节点大小的值,对于 4KiB,它是 3949 字节,对于更大的节点大小,由于系统限制 PATH_MAX,它是 4095。

The symlink target may not be a valid path, i.e. the path name components can exceed the limits (NAME_MAX), there’s no content validation at symlink(3) creation.
符号链接目标可能不是有效路径,即路径名组件可能超过限制(NAME_MAX),在创建符号链接(3)时没有内容验证。

maximum number of inodes 最大索引节点数

2^64 but depends on the available metadata space as the inodes are created dynamically
2 64 但取决于可用的元数据空间,因为索引节点是动态创建的。

Each subvolume is an independent namespace of inodes and thus their numbers, so the limit is per subvolume, not for the whole filesystem.
每个子卷是独立的索引节点命名空间,因此它们的数量是独立的,因此限制是针对每个子卷而不是整个文件系统。

inode numbers 索引节点编号

minimum number: 256 (for subvolumes), regular files and directories: 257, maximum number: (2^64 - 256)
最小数量: 256(对于子卷),常规文件和目录: 257,最大数量:(2 64 - 256)

The inode numbers that can be assigned to user created files are from the whole 64bit space except first 256 and last 256 in that range that are reserved for internal b-tree identifiers.
可分配给用户创建文件的索引节点号来自整个 64 位空间,除了该范围中保留用于内部 B 树标识符的第一个 256 和最后一个 256。

maximum file length 文件长度最大值

inherent limit of BTRFS is 2^64 (16 EiB) but the practical limit of Linux VFS is 2^63 (8 EiB)
BTRFS 的固有限制是 2 64 (16 EiB),但 Linux VFS 的实际限制是 2 63 (8 EiB)

maximum number of subvolumes
子卷的最大数量

the subvolume ids can go up to 2^48 but the number of actual subvolumes depends on the available metadata space
子卷的 ID 可以达到 2 48 ,但实际子卷的数量取决于可用的元数据空间

The space consumed by all subvolume metadata, which includes bookkeeping of shared extents, can be large (MiB, GiB). The range is not the full 64bit range because qgroups use the upper 16 bits for other purposes.
所有子卷元数据消耗的空间包括共享范围的簿记,可能会很大(MiB,GiB)。该范围不是完整的 64 位范围,因为 qgroups 使用上 16 位用于其他目的。

maximum number of hardlinks of a file in a directory
目录中文件的硬链接的最大数量

65536 when the extref feature is turned on during mkfs (default), roughly 100 otherwise and depends on file name length that fits into one metadata node
在 mkfs 过程中打开 extref 功能时为 65536(默认值),否则大约为 100,取决于适合放入一个元数据节点的文件名长度

minimum filesystem size 文件系统最小大小

the minimal size of each device depends on the mixed-bg feature, without that (the default) it's about 109MiB, with mixed-bg it's 16MiB
每个设备的最小大小取决于 mixed-bg 功能,如果没有(默认情况下)约为 109MiB,有 mixed-bg 则为 16MiB

BOOTLOADER SUPPORT 引导加载程序支持

GRUB2 (https://www.gnu.org/software/grub) has the most advanced support of booting from BTRFS with respect to features.
GRUB2(https://www.gnu.org/software/grub)在从 BTRFS 引导方面具有最先进的功能支持。

U-Boot (https://www.denx.de/wiki/U-Boot/) has decent support for booting but not all BTRFS features are implemented, check the documentation.
U-Boot(https://www.denx.de/wiki/U-Boot/)具有良好的引导支持,但并非所有 BTRFS 功能都已实现,请查阅文档。

In general, the first 1MiB on each device is unused with the exception of primary superblock that is on the offset 64KiB and spans 4KiB. The rest can be freely used by bootloaders or for other system information. Note that booting from a filesystem on zoned device is not supported.
一般情况下,每个设备上的前 1MiB 未使用,除了位于偏移 64KiB 且跨越 4KiB 的主超级块之外。其余部分可以自由地供引导加载程序或其他系统信息使用。请注意,不支持从分区设备上的文件系统引导。

FILE ATTRIBUTES 文件属性

The btrfs filesystem supports setting file attributes or flags. Note there are old and new interfaces, with confusing names. The following list should clarify that:
btrfs 文件系统支持设置文件属性或标志。请注意,存在旧接口和新接口,名称可能会令人困惑。以下列表应该澄清这一点:

  • attributes: chattr(1) or lsattr(1) utilities (the ioctls are FS_IOC_GETFLAGS and FS_IOC_SETFLAGS), due to the ioctl names the attributes are also called flags
    属性:chattr(1)或 lsattr(1)实用程序(ioctl 为 FS_IOC_GETFLAGS 和 FS_IOC_SETFLAGS),由于 ioctl 的名称,这些属性也被称为标志

  • xflags: to distinguish from the previous, it’s extended flags, with tunable bits similar to the attributes but extensible and new bits will be added in the future (the ioctls are FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR but they are not related to extended attributes that are also called xattrs), there’s no standard tool to change the bits, there’s support in xfs_io(8) as command xfs_io -c chattr
    xflags:用于与之前的区分,它是扩展标志,具有可调整的位,类似于属性,但是可扩展的,并且将来会添加新的位(ioctl 是 FS_IOC_FSGETXATTR 和 FS_IOC_FSSETXATTR,但它们与也称为 xattrs 的扩展属性无关),没有标准工具来更改位,xfs_io(8) 中有支持,命令为 xfs_io -c chattr
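A minimal sketch of the xflags interface using xfs_io(8), assuming a hypothetical file named file:

$ xfs_io -c "lsattr" file
$ xfs_io -c "chattr +i" file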

Attributes 属性

a

append only, new writes are always written at the end of the file
仅追加,新写入始终写入文件末尾

A

no atime updates 不更新访问时间

c

compress data, all data written after this attribute is set will be compressed. Please note that compression is also affected by the mount options or the parent directory attributes.
压缩数据,此属性设置后写入的所有数据将被压缩。请注意,压缩也受挂载选项或父目录属性的影响。

When set on a directory, all newly created files will inherit this attribute. This attribute cannot be set with ‘m’ at the same time.
当在目录上设置时,所有新创建的文件将继承此属性。此属性不能与‘m’同时设置。

C

no copy-on-write, file data modifications are done in-place
不进行写时复制,文件数据修改直接在原地进行

When set on a directory, all newly created files will inherit this attribute.
当设置在目录上时,所有新创建的文件都将继承此属性。

Note 注意

Due to implementation limitations, this flag can be set/unset only on empty files.
由于实现限制,此标志只能在空文件上设置/取消设置。

d

no dump, makes sense with 3rd party tools like dump(8), on BTRFS the attribute can be set/unset but no other special handling is done
不转储,与 dump(8) 等第三方工具一起使用时是有意义的,在 BTRFS 上,属性可以设置/取消,但不进行其他特殊处理

D

synchronous directory updates, for more details search open(2) for O_SYNC and O_DSYNC
同步目录更新,有关更多详细信息,请搜索 open(2) 查找 O_SYNC 和 O_DSYNC

i

immutable, no file data and metadata changes allowed even to the root user as long as this attribute is set (obviously the exception is unsetting the attribute)
不可变,即使对于 root 用户,只要设置了此属性,也不允许对文件数据和元数据进行更改(显然例外是取消属性)

m

no compression, permanently turn off compression on the given file. Any compression mount options will not affect this file. (chattr(1) support added in 1.46.2)
不压缩,在给定文件上永久关闭压缩。任何压缩挂载选项都不会影响此文件。(在 1.46.2 中添加了 chattr(1)支持)

When set on a directory, all newly created files will inherit this attribute. This attribute cannot be set with c at the same time.
当设置在目录上时,所有新创建的文件将继承此属性。此属性不能与 c 同时设置。

S

synchronous updates, for more details search open(2) for O_SYNC and O_DSYNC
同步更新,更多细节请搜索 open(2)中的 O_SYNC 和 O_DSYNC。

No other attributes are supported. For the complete list please refer to the chattr(1) manual page.
不支持其他属性。有关完整列表,请参阅 chattr(1) 手册页。

XFLAGS

The letters assigned to the xflag bits overlap with the attribute letters; this list refers to what xfs_io(8) provides:
位与属性具有重叠的字母分配,此列表指的是 xfs_io(8) 提供的内容:

i

immutable, same as the attribute
不可变,与属性相同

a

append only, same as the attribute
仅追加,与属性相同

s

synchronous updates, same as the attribute S
同步更新,与属性 S 相同

A

no atime updates, same as the attribute
不更新访问时间,与属性相同

d

no dump, same as the attribute
不转储,与属性相同

ZONED MODE 分区模式

Since version 5.12 btrfs supports so called zoned mode. This is a special on-disk format and allocation/write strategy that’s friendly to zoned devices. In short, a device is partitioned into fixed-size zones and each zone can be updated by append-only manner, or reset. As btrfs has no fixed data structures, except the super blocks, the zoned mode only requires block placement that follows the device constraints. You can learn about the whole architecture at https://zonedstorage.io .
从版本 5.12 开始,btrfs 支持所谓的分区模式。这是一种特殊的磁盘格式和分配/写入策略,对分区设备友好。简而言之,设备被分成固定大小的区域,每个区域可以通过追加方式更新,或者重置。由于 btrfs 没有固定的数据结构,除了超级块,分区模式只需要遵循设备约束的块放置。您可以在 https://zonedstorage.io 了解整个架构。

The devices are also called SMR/ZBC/ZNS, in host-managed mode. Note that there are devices that appear as non-zoned but actually are zoned; these are drive-managed and using zoned mode won't help.
这些设备也被称为 SMR/ZBC/ZNS,在主机管理模式下。请注意,有些设备看起来不是分区的,但实际上是,这是驱动程序管理的,并且使用分区模式不会有帮助。

The zone size depends on the device, typical sizes are 256MiB or 1GiB. In general it must be a power of two. Emulated zoned devices like null_blk allow to set various zone sizes.
分区大小取决于设备,典型大小为 256MiB 或 1GiB。一般来说,它必须是 2 的幂。像 null_blk 这样的模拟分区设备允许设置各种分区大小。
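The zone model and zone size of a block device can be checked through sysfs or blkzone(8); a sketch assuming a hypothetical host-managed device /dev/sda with 256MiB zones (chunk_sectors is reported in 512-byte sectors):

$ cat /sys/block/sda/queue/zoned
host-managed
$ cat /sys/block/sda/queue/chunk_sectors
524288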

Requirements, limitations
要求,限制

  • all devices must have the same zone size
    所有设备必须具有相同的分区大小

  • maximum zone size is 8GiB
    最大区域大小为 8GiB

  • minimum zone size is 4MiB
    最小区域大小为 4MiB

  • mixing zoned and non-zoned devices is possible, the zone writes are emulated, but this is namely for testing
    可以混合使用分区和非分区设备,分区写入会被模拟,但这主要用于测试

  • the super block is handled in a special way and is at different locations than on a non-zoned filesystem:
    超级块以特殊方式处理,并位于分区文件系统上的不同位置:

    • primary: 0B (and the next two zones)
      主要:0B(以及接下来的两个区域)

    • secondary: 512GiB (and the next two zones)
      次要:512GiB(以及接下来的两个区域)

    • tertiary: 4TiB (4096GiB, and the next two zones)
      三级:4TiB(4096GiB,以及接下来的两个区域)

Incompatible features 不兼容的特性

The main constraint of the zoned devices is lack of in-place update of the data. This is inherently incompatible with some features:
分区设备的主要约束是数据的就地更新不足。这与某些特性本质上不兼容:

  • NODATACOW - overwrite in-place, cannot create such files
    NODATACOW - 覆盖原地,无法创建此类文件

  • fallocate - preallocating space for in-place first write
    fallocate - 为原地第一次写入预分配空间

  • mixed-bg - unordered writes to data and metadata, fixing that means using separate data and metadata block groups
    mixed-bg - 无序写入数据和元数据,修复这个问题意味着使用单独的数据和元数据块组

  • booting - the zone at offset 0 contains superblock, resetting the zone would destroy the bootloader data
    启动 - 偏移量为 0 的区域包含超级块,重置该区域将销毁引导加载程序数据

Initial support lacks some features but they’re planned:
初始支持缺少一些功能,但它们已计划:

  • only single (data, metadata) and DUP (metadata) profile is supported
    仅支持单个(数据,元数据)和 DUP(元数据)配置文件

  • fstrim - due to dependency on free space cache v1
    fstrim - 由于依赖于空闲空间缓存 v1

Super block 超级块

As said above, super block is handled in a special way. In order to be crash safe, at least one zone in a known location must contain a valid superblock. This is implemented as a ring buffer in two consecutive zones, starting from known offsets 0B, 512GiB and 4TiB.
如上所述,超级块以特殊方式处理。为了保证崩溃安全,至少一个已知位置的区域必须包含一个有效的超级块。这是通过在两个连续区域中实现的环形缓冲区来实现的,从已知偏移 0B、512GiB 和 4TiB 开始。

The values are different than on non-zoned devices. Each new super block is appended to the end of the zone, once it’s filled, the zone is reset and writes continue to the next one. Looking up the latest super block needs to read offsets of both zones and determine the last written version.
这些值与非分区设备上的值不同。每个新的超级块都附加到区域的末尾,一旦填满,该区域将被重置,并且写入将继续到下一个区域。查找最新的超级块需要读取两个区域的偏移量,并确定最后写入的版本。

The amount of space reserved for super block depends on the zone size. The secondary and tertiary copies are at distant offsets as the capacity of the devices is expected to be large, tens of terabytes. Maximum zone size supported is 8GiB, which would mean that e.g. offset 0-16GiB would be reserved just for the super block on a hypothetical device of that zone size. This is wasteful but required to guarantee crash safety.
为超级块保留的空间量取决于区域大小。由于设备的容量预计会很大,达到数十 TB,因此次要和三次副本位于远距离的偏移量上。支持的最大区域大小为 8GiB,这意味着例如在假设的该区域大小的设备上,0-16GiB 的偏移量将仅用于超级块。这是浪费的,但为了保证崩溃安全性是必需的。

Devices 设备

Real hardware 真实硬件

The WD Ultrastar series 600 advertises HM-SMR, i.e. the host-managed zoned mode. There are two more: DM (device managed, no zoned information exported to the system), HA (host aware, can be used as a regular disk but zoned writes improve performance). There are not many devices available at the moment; the information about the exact zoned mode is hard to find, so check data sheets or community sources gathering information from real devices.
WD Ultrastar 系列 600 宣传 HM-SMR,即主机管理的分区模式。还有两种:DA(设备管理,不向系统导出分区信息),HA(主机感知,可用作常规磁盘,但分区写入可提高性能)。目前可用的设备不多,关于确切分区模式的信息很难找到,查看数据表或从真实设备中收集信息的社区来源。

Note: zoned mode won’t work with DM-SMR disks.
注意:分区模式不适用于 DM-SMR 磁盘。

  • Ultrastar® DC ZN540 NVMe ZNS SSD (product brief)
    Ultrastar® DC ZN540 NVMe ZNS 固态硬盘(产品简介)

Emulated: null_blk 模拟:null_blk 

The driver null_blk provides memory backed device and is suitable for testing. There are some quirks setting up the devices. The module must be loaded with nr_devices=0 or the numbering of device nodes will be offset. The configfs must be mounted at /sys/kernel/config and the administration of the null_blk devices is done in /sys/kernel/config/nullb. The device nodes are named like /dev/nullb0 and are numbered sequentially. NOTE: the device name may be different than the named directory in sysfs!
驱动程序 null_blk 提供基于内存的设备,适用于测试。设置设备时有一些怪癖。必须使用 nr_devices=0 加载模块,否则设备节点的编号将会偏移。必须将 configfs 挂载到 /sys/kernel/config,对 null_blk 设备的管理在 /sys/kernel/config/nullb 中进行。设备节点的命名类似于 /dev/nullb0 并按顺序编号。注意:设备名称可能与 sysfs 中命名的目录不同!

Setup: 设置:

modprobe configfs
modprobe null_blk nr_devices=0

Create a device mydev, assuming no other previously created devices, size is 2048MiB, zone size 256MiB. There are more tunable parameters, this is a minimal example taking defaults:
创建一个设备 mydev,假设之前没有创建过其他设备,大小为 2048MiB,区域大小为 256MiB。还有更多可调参数,这是一个以默认值为例的最小示例:

cd /sys/kernel/config/nullb/
mkdir mydev
cd mydev
echo 2048 > size
echo 1 > zoned
echo 1 > memory_backed
echo 256 > zone_size
echo 1 > power

This will create a device /dev/nullb0 and the value of file index will match the ending number of the device node.
这将创建一个设备 /dev/nullb0 ,文件索引的值将与设备节点的结束数字匹配。

Remove the device: 移除设备:

rmdir /sys/kernel/config/nullb/mydev

Then continue with mkfs.btrfs /dev/nullb0, the zoned mode is auto-detected.
然后继续执行 mkfs.btrfs /dev/nullb0,分区模式将被自动检测。

For convenience, there’s a script wrapping the basic null_blk management operations https://github.com/kdave/nullb.git, the above commands become:
为了方便起见,有一个脚本包装了基本的 null_blk 管理操作 https://github.com/kdave/nullb.git,上述命令变为:

nullb setup
nullb create -s 2g -z 256
mkfs.btrfs /dev/nullb0
...
nullb rm nullb0

Emulated: TCMU runner 模拟:TCMU 运行器

TCMU is a framework to emulate SCSI devices in userspace, providing various backends for the storage, with zoned support as well. A file-backed zoned device can provide more options for larger storage and zone size. Please follow the instructions at https://zonedstorage.io/projects/tcmu-runner/ .
TCMU 是一个在用户空间模拟 SCSI 设备的框架,提供各种后端存储支持,同时也支持分区。基于文件的分区设备可以为更大的存储和分区大小提供更多选项。请按照 https://zonedstorage.io/projects/tcmu-runner/ 上的说明操作。

Compatibility, incompatibility
兼容性,不兼容性

  • the feature sets an incompat bit and requires new kernel to access the filesystem (for both read and write)
    该功能设置了一个不兼容位,并需要新的内核来访问文件系统(包括读和写)

  • superblock needs to be handled in a special way, there are still 3 copies but at different offsets (0, 512GiB, 4TiB) and the 2 consecutive zones are a ring buffer of the superblocks, finding the latest one needs reading it from the write pointer or do a full scan of the zones
    超级块需要以特殊方式处理,仍然有 3 个副本,但位于不同的偏移量(0、512GiB、4TiB),而 2 个连续的区域是超级块的环形缓冲区,找到最新的超级块需要从写指针读取它或对区域进行完整扫描

  • mixing zoned and non zoned devices is possible (zones are emulated) but is recommended only for testing
    混合分区和非分区设备是可能的(分区被模拟),但建议仅用于测试

  • mixing zoned devices with different zone sizes is not possible
    无法将具有不同区域大小的区域设备混合使用

  • zone sizes must be power of two, zone sizes of real devices are e.g. 256MiB or 1GiB, larger size is expected, maximum zone size supported by btrfs is 8GiB
    区域大小必须是 2 的幂,实际设备的区域大小为 256MiB 或 1GiB,预期更大的尺寸,btrfs 支持的最大区域大小为 8GiB

Status, stability, reporting bugs
状态、稳定性、报告错误 

The zoned mode has been released in 5.12 and there are still some rough edges and corner cases one can hit during testing. Please report bugs to https://github.com/naota/linux/issues/ .
分区模式已在 5.12 中发布,测试过程中可能会遇到一些问题和特殊情况。请将错误报告给 https://github.com/naota/linux/issues/ 。

CONTROL DEVICE 控制设备

There’s a character special device /dev/btrfs-control with major and minor numbers 10 and 234 (the device can be found under the misc category).
有一个字符特殊设备 /dev/btrfs-control ,主次编号分别为 10 和 234(该设备可以在杂项类别下找到)。

$ ls -l /dev/btrfs-control
crw------- 1 root root 10, 234 Jan  1 12:00 /dev/btrfs-control

The device accepts some ioctl calls that can perform following actions on the filesystem module:
该设备接受一些 ioctl 调用,可以对文件系统模块执行以下操作:

  • scan devices for btrfs filesystem (i.e. to let multi-device filesystems mount automatically) and register them with the kernel module
    扫描设备以查找 btrfs 文件系统(即让多设备文件系统自动挂载)并将它们注册到内核模块

  • similar to scan, but also wait until the device scanning process is finished for a given filesystem
    类似于扫描,但还会等待给定文件系统的设备扫描过程完成

  • get the supported features (can be also found under /sys/fs/btrfs/features)
    获取支持的功能(也可以在 /sys/fs/btrfs/features 下找到)

The device is created when btrfs is initialized, either as a module or a built-in functionality and makes sense only in connection with that. Running e.g. mkfs without the module loaded will not register the device and will probably warn about that.
设备是在初始化 btrfs 时创建的,可以作为模块或内置功能,并且只有在与其连接时才有意义。例如,在未加载模块的情况下运行 mkfs 将不会注册设备,并且可能会发出警告。

In rare cases when the module is loaded but the device is not present (most likely accidentally deleted), it’s possible to recreate it by
在极少数情况下,当模块已加载但设备不存在(很可能是意外删除时),可以通过重新创建来解决。

# mknod --mode=600 /dev/btrfs-control c 10 234

or (since 5.11) by a convenience command
或者(自 5.11 起)通过一个便利命令。

# btrfs rescue create-control-device

The control device is not strictly required but the device scanning will not work and a workaround would need to be used to mount a multi-device filesystem. The mount option device can trigger the device scanning during mount, see also btrfs device scan.
控制设备并非严格要求,但设备扫描将无法工作,需要使用一种变通方法来挂载多设备文件系统。挂载选项设备可以在挂载期间触发设备扫描,另请参阅 btrfs 设备扫描。
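A sketch of mounting a multi-device filesystem with the device mount option instead of relying on scanning (the device names are illustrative):

# mount -o device=/dev/sdb,device=/dev/sdc /dev/sda /mnt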

FILESYSTEM WITH MULTIPLE PROFILES
具有多个配置文件的文件系统

It is possible that a btrfs filesystem contains multiple block group profiles of the same type. This could happen when a profile conversion using balance filters is interrupted (see btrfs-balance(8)). Some btrfs commands perform a test to detect this kind of condition and print a warning like this:
有可能 btrfs 文件系统包含同一类型的多个块组配置文件。当使用平衡过滤器进行配置文件转换时中断(参见 btrfs-balance(8))可能会发生这种情况。一些 btrfs 命令执行测试以检测这种情况并打印警告,如下所示:

WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
WARNING:   Data: single, raid1
WARNING:   Metadata: single, raid1

The corresponding output of btrfs filesystem df might look like:
btrfs 文件系统 df 的相应输出可能如下所示:

WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
WARNING:   Data: single, raid1
WARNING:   Metadata: single, raid1
Data, RAID1: total=832.00MiB, used=0.00B
Data, single: total=1.63GiB, used=0.00B
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=8.00MiB, used=112.00KiB
Metadata, RAID1: total=64.00MiB, used=32.00KiB
GlobalReserve, single: total=16.25MiB, used=0.00B

There’s more than one line for type Data and Metadata, while the profiles are single and RAID1.
对于 Data 和 Metadata 类型,有多行输出,而 profiles 是单个和 RAID1。

This state of the filesystem is OK but most likely needs the user/administrator to take action and finish the interrupted task. This cannot easily be done automatically; also, only the user knows the expected final profiles.
文件系统的状态是 OK 的,但很可能需要用户/管理员采取行动并完成中断的任务。这不能轻松地自动完成,用户也知道预期的最终 profiles。

In the example above, the filesystem started as a single device and single block group profile. Then another device was added, followed by balance with convert=raid1 but for some reason hasn’t finished. Restarting the balance with convert=raid1 will continue and end up with filesystem with all block group profiles RAID1.
在上面的示例中,文件系统最初作为单个设备和单个块组配置启动。然后添加了另一个设备,然后使用 convert=raid1 进行平衡,但由于某种原因尚未完成。重新使用 convert=raid1 进行平衡将继续,并最终以所有块组配置为 RAID1 的文件系统结束。

Note 注意

If you're familiar with balance filters, you can use convert=raid1,profiles=single,soft, which will take only the unconverted single profiles and convert them to raid1. This may speed up the conversion as it would not try to rewrite the already converted raid1 block groups.
如果您熟悉平衡过滤器,可以使用 convert=raid1,profiles=single,soft,这将仅获取未转换的单个配置文件并将其转换为 raid1。这可能会加快转换速度,因为它不会尝试重写已转换为 raid1 的配置文件。
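A sketch of restarting the interrupted conversion described above with these filters, so already converted block groups are skipped (the mount point is illustrative):

# btrfs balance start -dconvert=raid1,profiles=single,soft -mconvert=raid1,profiles=single,soft /mnt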

Having just one profile is desired as this also clearly defines the profile of newly allocated block groups, otherwise this depends on internal allocation policy. When there are multiple profiles present, the order of selection is RAID56, RAID10, RAID1, RAID0 as long as the device number constraints are satisfied.
仅具有一个配置文件是理想的,因为这也清楚地定义了新分配的块组的配置文件,否则这取决于内部分配策略。当存在多个配置文件时,选择顺序为 RAID56、RAID10、RAID1、RAID0,只要满足设备编号约束。

Commands that print the warning were chosen so that they're brought to user attention when the filesystem state is being changed in that regard. These are: device add, device delete, balance cancel, balance pause. Commands that report space usage: filesystem df, device usage. The command filesystem usage provides a line in the overall summary:
选择打印警告的命令,以便在更改文件系统状态时引起用户注意。这些命令包括:添加设备、删除设备、取消平衡、暂停平衡。报告空间使用情况的命令有:文件系统 df、设备使用情况。命令文件系统使用提供了整体摘要中的一行:

Multiple profiles:                 yes (data, metadata)

SEEDING DEVICE 种子设备 

The COW mechanism and multiple devices under one hood enable an interesting concept, called a seeding device: extending a read-only filesystem on a device with another device that captures all writes. For example imagine an immutable golden image of an operating system enhanced with another device that allows to use the data from the golden image and normal operation. This idea originated on CD-ROMs with base OS and allowing to use them for live systems, but this became obsolete. There are technologies providing similar functionality, like unionmount, overlayfs or qcow2 image snapshot.
COW 机制和一个外壳下的多个设备实现了一个有趣的概念,称为种子设备:通过在一个设备上扩展只读文件系统,并使用另一个设备捕获所有写入。例如,想象一个不可变的操作系统黄金镜像,再加上另一个设备,允许使用黄金镜像的数据和正常操作。这个想法起源于带有基本操作系统的 CD-ROM,并允许将它们用于实时系统,但这已经过时了。有一些提供类似功能的技术,如 unionmount、overlayfs 或 qcow2 镜像快照。

The seeding device starts as a normal filesystem; once the contents are ready, btrfstune -S 1 is used to flag it as a seeding device. Mounting such a device will not allow any writes, except adding a new device by btrfs device add. Then the filesystem can be remounted as read-write.
种子设备开始时作为一个普通文件系统,一旦内容准备好,使用 btrfstune -S 1 将其标记为种子设备。挂载这样的设备将不允许任何写入,除非通过 btrfs device add 添加新设备。然后文件系统可以重新挂载为读写。

Given that the filesystem on the seeding device is always recognized as read-only, it can be used to seed multiple filesystems from one device at the same time. The UUID that is normally attached to a device is automatically changed to a random UUID on each mount.
鉴于播种设备上的文件系统始终被识别为只读,因此可以使用它同时从一个设备播种多个文件系统。通常附加到设备上的 UUID 在每次挂载时会自动更改为随机 UUID。

Once the seeding device is mounted, it needs the writable device. After adding it, unmounting and mounting with umount /path; mount /dev/writable /path, or remounting read-write with mount -o remount,rw /path, makes the filesystem at /path ready for use.
一旦播种设备被挂载,它需要可写设备。添加后,使用 umount /path; mount /dev/writable /path 或使用 remount -o remount,rw 重新挂载为读写模式,使得位于 /path 的文件系统准备就绪。

Note 注意

There is a known bug with using remount to make the mount writeable: remount will leave the filesystem in a state where it is unable to clean deleted snapshots, so it will leak space until it is unmounted and mounted properly.
使用 remount 使挂载点可写存在已知 bug:remount 会使文件系统处于无法清理已删除快照的状态,因此会泄漏空间,直到正确卸载和挂载为止。

Furthermore, deleting the seeding device from the filesystem can turn it into a normal filesystem, provided that the writable device can also contain all the data from the seeding device.
此外,从文件系统中删除种子设备可以将其转换为普通文件系统,前提是可写设备还可以包含种子设备中的所有数据。

The seeding device flag can be cleared again by btrfstune -f -S 0, e.g. allowing to update with newer data but please note that this will invalidate all existing filesystems that use this particular seeding device. This works for some use cases, not for others, and the forcing flag to the command is mandatory to avoid accidental mistakes.
可以通过 btrfstune -f -S 0 再次清除种子设备标志,例如,允许使用更新的数据,但请注意,这将使使用此特定种子设备的所有现有文件系统无效。这适用于某些用例,但不适用于其他用例,并且命令的强制标志是必需的,以避免意外错误。

Example how to create and use one seeding device:
创建和使用一个种子设备的示例:

# mkfs.btrfs /dev/sda
# mount /dev/sda /mnt/mnt1
... fill mnt1 with data
# umount /mnt/mnt1

# btrfstune -S 1 /dev/sda

# mount /dev/sda /mnt/mnt1
# btrfs device add /dev/sdb /mnt/mnt1
# umount /mnt/mnt1
# mount /dev/sdb /mnt/mnt1
... /mnt/mnt1 is now writable

Now /mnt/mnt1 can be used normally. The device /dev/sda can be mounted again with another writable device:
现在 /mnt/mnt1 可以正常使用。设备 /dev/sda 可以再次与另一个可写设备挂载:

# mount /dev/sda /mnt/mnt2
# btrfs device add /dev/sdc /mnt/mnt2
# umount /mnt/mnt2
# mount /dev/sdc /mnt/mnt2
... /mnt/mnt2 is now writable

The writable device (/dev/sdb) can be decoupled from the seeding device and used independently:
可以将可写设备(文件:/dev/sdb)与种子设备分离,独立使用:

# btrfs device delete /dev/sda /mnt/mnt1

As the contents originated in the seeding device, it’s possible to turn /dev/sdb to a seeding device again and repeat the whole process.
由于内容源自种子设备,可以将 /dev/sdb 再次转换为种子设备,重复整个过程。

A few things to note:
一些需要注意的事项:

  • it’s recommended to use only single device for the seeding device, it works for multiple devices but the single profile must be used in order to make the seeding device deletion work
    建议只使用单个设备作为种子设备,虽然多个设备也可以使用,但必须使用单个配置文件才能使种子设备删除功能正常工作

  • block group profiles single and dup support the use cases above
    区块组配置文件 single 和 dup 支持上述用例

  • the label is copied from the seeding device and can be changed by btrfs filesystem label
    标签是从种子设备复制的,并可以通过 btrfs 文件系统标签进行更改

  • each new mount of the seeding device gets a new random UUID
    每次挂载种子设备时都会获得一个新的随机 UUID

  • umount /path; mount /dev/writable /path can be replaced with mount -o remount,rw /path but it won’t reclaim space of deleted subvolumes until the seeding device is mounted read-write again before making it seeding again
    umount /path; mount /dev/writable /path 可以替换为 mount -o remount,rw /path,但在将种子设备再次挂载为读写模式之前,它不会回收已删除子卷的空间,然后再次将其设置为种子设备

Chained seeding devices
链式播种设备

Though it's not recommended and is rather an obscure and untested use case, chaining seeding devices is possible. In the first example, the writable device /dev/sdb can be turned into another seeding device, depending on the unchanged seeding device /dev/sda. Then, using /dev/sdb as the primary seeding device, it can be extended with another writable device, say /dev/sdd, and it continues as before as a simple tree structure on devices.
虽然不建议这样做,而且这是一个相当模糊且未经测试的用例,但是可以链接播种设备。在第一个示例中,可写设备 /dev/sdb 可以再次转到另一个播种设备,取决于未更改的播种设备 /dev/sda 。然后,使用 /dev/sdb 作为主要播种设备,可以使用另一个可写设备 /dev/sdd 进行扩展,然后继续作为设备上的简单树结构。

# mkfs.btrfs /dev/sda
# mount /dev/sda /mnt/mnt1
... fill mnt1 with data
# umount /mnt/mnt1

# btrfstune -S 1 /dev/sda

# mount /dev/sda /mnt/mnt1
# btrfs device add /dev/sdb /mnt/mnt1
# mount -o remount,rw /mnt/mnt1
... /mnt/mnt1 is now writable
# umount /mnt/mnt1

# btrfstune -S 1 /dev/sdb

# mount /dev/sdb /mnt/mnt1
# btrfs device add /dev/sdc /mnt/mnt1
# mount -o remount,rw /mnt/mnt1
... /mnt/mnt1 is now writable
# umount /mnt/mnt1

As a result we have:
因此,我们有:

  • sda is a single seeding device, with its initial contents
    sda 是一个单个的播种设备,具有其初始内容

  • sdb is a seeding device but requires sda, the contents are from the time when sdb is made seeding, i.e. contents of sda with any later changes
    sdb 是一个播种设备,但需要 sda,其内容来自于 sdb 成为播种设备时的时间,即 sda 的内容以及任何后续更改

  • sdc last writable, can be made a seeding one the same way as was sdb, preserving its contents and depending on sda and sdb
    sdc 最后可写,可以像 sdb 一样制作成播种设备,保留其内容并依赖于 sda 和 sdb

As long as the seeding devices are unmodified and available, they can be used to start another branch.
只要播种设备未经修改且可用,它们就可以用来启动另一个分支。

STORAGE MODEL, HARDWARE CONSIDERATIONS
存储模型,硬件考虑

Storage model 存储模型

A storage model is a model that captures key physical aspects of data structure in a data store. A filesystem is the logical structure organizing data on top of the storage device.
存储模型是捕捉数据存储中数据结构的关键物理方面的模型。文件系统是在存储设备顶部组织数据的逻辑结构。

The filesystem assumes several features or limitations of the storage device and utilizes them or applies measures to guarantee reliability. BTRFS in particular is based on a COW (copy on write) mode of writing, i.e. not updating data in place but rather writing a new copy to a different location and then atomically switching the pointers.
文件系统假定存储设备具有几个特性或限制,并利用它们或采取措施来保证可靠性。特别是 BTRFS 基于写时复制(COW)模式,即不在原地更新数据,而是将新副本写入不同位置,然后原子地切换指针。

In an ideal world, the device does what it promises. The filesystem assumes that this may not be true so additional mechanisms are applied to either detect misbehaving hardware or get valid data by other means. The devices may (and do) apply their own detection and repair mechanisms but we won’t assume any.
在理想世界中,设备会实现其承诺。文件系统假定这可能不是真实的,因此会应用额外的机制来检测设备的异常行为或通过其他方式获取有效数据。设备可能(也确实)应用其自己的检测和修复机制,但我们不会假设任何情况。

The following assumptions about storage devices are considered (sorted by importance, numbers are for further reference):
考虑了关于存储设备的以下假设(按重要性排序,数字供进一步参考):

  1. atomicity of reads and writes of blocks/sectors (the smallest unit of data the device presents to the upper layers)
    读取和写入块/扇区(设备向上层呈现的最小数据单元)的原子性

  2. there’s a flush command that instructs the device to forcibly order writes before and after the command; alternatively there’s a barrier command that facilitates the ordering but may not flush the data
    有一个刷新命令,指示设备在命令之前和之后强制排序写入;或者有一个屏障命令,促进排序但可能不刷新数据

  3. data sent to write to a given device offset will be written without further changes to the data and to the offset
    发送到给定设备偏移量的写入数据将被写入,而不会对数据和偏移量进行进一步更改

  4. writes can be reordered by the device, unless explicitly serialized by the flush command
    写入可以被设备重新排序,除非通过刷新命令明确序列化

  5. reads and writes can be freely reordered and interleaved
    读取和写入可以自由重新排序和交错

The consistency model of BTRFS builds on these assumptions. The logical data updates are grouped, into a generation, written on the device, serialized by the flush command and then the super block is written ending the generation. All logical links among metadata comprising a consistent view of the data may not cross the generation boundary.
BTRFS 的一致性模型建立在这些假设之上。逻辑数据更新被分组成一代,写入设备,通过刷新命令序列化,然后写入超级块结束该代。所有元数据之间的逻辑链接组成了数据的一致视图,可能不会跨越代边界。

When things go wrong
当事情出错时 

No or partial atomicity of block reads/writes (1)
块读/写的不完全原子性 (1)

  • Problem: a partial block contents is written (torn write), e.g. due to a power glitch or other electronics failure during the read/write
    问题: 写入部分块内容 (断裂写入),例如由于读/写期间的电源故障或其他电子故障

  • Detection: checksum mismatch on read
    检测:读取时校验和不匹配

  • Repair: use another copy or rebuild from multiple blocks using some encoding scheme
    修复:使用另一个副本或使用某种编码方案从多个块重新构建

The flush command does not flush (2)
刷新命令不刷新(2)

This is perhaps the most serious problem and impossible to mitigate by filesystem without limitations and design restrictions. What could happen in the worst case is that writes from one generation bleed to another one, while still letting the filesystem consider the generations isolated. Crash at any point would leave data on the device in an inconsistent state without any hint what exactly got written, what is missing and leading to stale metadata link information.
这可能是最严重的问题,文件系统无法通过限制和设计约束来减轻。在最糟糕的情况下可能发生的是,一个世代的写入会泄漏到另一个世代,同时让文件系统认为这些世代是隔离的。在任何时候崩溃都会使设备上的数据处于不一致状态,而没有任何线索表明到底写入了什么,缺少了什么,导致了陈旧的元数据链接信息。

Devices usually honor the flush command, but for performance reasons may do internal caching, where the flushed data are not yet persistently stored. A power failure could lead to a similar scenario as above, although it’s less likely that later writes would be written before the cached ones. This is beyond what a filesystem can take into account. Devices or controllers are usually equipped with batteries or capacitors to write the cache contents even after power is cut. (Battery backed write cache)
设备通常会遵守刷新命令,但出于性能原因可能会进行内部缓存,刷新的数据尚未持久存储。断电可能导致与上述类似的情况,尽管后续写入在缓存写入之前的可能性较小。这超出了文件系统所能考虑的范围。设备或控制器通常配备电池或电容器,以便在断电后仍能写入缓存内容。(带电池备份写缓存)

Data get silently changed on write (3)
数据在写入时会悄悄地发生变化(3)

Such a thing should not happen frequently, but still can happen spuriously due to the complex internal workings of devices or physical effects of the storage media itself.
这种事情不应该经常发生,但由于设备的复杂内部工作或存储介质本身的物理效应,仍然可能偶尔发生。

  • Problem: while the data are written atomically, the contents get changed
    问题:虽然数据是以原子方式写入的,但内容发生了变化

  • Detection: checksum mismatch on read
    检测:读取时校验和不匹配

  • Repair: use another copy or rebuild from multiple blocks using some encoding scheme
    修复:使用另一个副本或使用某种编码方案从多个块重新构建

Data get silently written to another offset (3)
数据悄悄地写入另一个偏移量(3)

This would be another serious problem as the filesystem has no information when it happens. For that reason the measures have to be done ahead of time. This problem is also commonly called ghost write.
当发生这种情况时,这将是另一个严重问题,因为文件系统在其发生时没有任何信息。因此,必须提前采取措施。这个问题通常也被称为幽灵写入。

The metadata blocks have the checksum embedded in the blocks, so a correct atomic write would not corrupt the checksum. It's likely that after reading such a block the data inside would not be consistent with the rest. To rule that out, there's an embedded block number in the metadata block. It's the logical block number because this is what the logical structure expects and verifies.
元数据块中嵌入了校验和,因此正确的原子写入不会损坏校验和。很可能在读取这样的块之后,内部数据与其余部分不一致。为了排除这种可能性,在元数据块中嵌入了块编号。这是逻辑块编号,因为这是逻辑结构所期望并验证的内容。

The following is based on information publicly available, user feedback, community discussions or bug report analyses. It’s not complete and further research is encouraged when in doubt.
以下内容基于公开信息、用户反馈、社区讨论或错误报告分析。这并不完整,当有疑问时鼓励进一步研究。

Main memory 主存储器 

The data structures and raw data blocks are temporarily stored in computer memory before they get written to the device. It is critical that memory is reliable because even simple bit flips can have vast consequences and lead to damaged structures, not only in the filesystem but in the whole operating system.
在将数据结构和原始数据块写入设备之前,它们会临时存储在计算机内存中。内存的可靠性至关重要,因为即使是简单的位翻转也可能导致严重后果,并导致结构受损,不仅仅是在文件系统中,还包括整个操作系统。

Based on experience in the community, memory bit flips are more common than one would think. When it happens, it’s reported by the tree-checker or by a checksum mismatch after reading blocks. There are some very obvious instances of bit flips that happen, e.g. in an ordered sequence of keys in metadata blocks. We can easily infer from the other data what values get damaged and how. However, fixing that is not straightforward and would require cross-referencing data from the entire filesystem to see the scope.
根据社区的经验,内存位翻转比人们想象的要常见。当发生这种情况时,树检查器或在读取块后发生校验和不匹配会报告。有一些非常明显的位翻转实例会发生,例如在元数据块中键的有序序列中。我们可以轻松地从其他数据推断出哪些值受损以及受损程度。然而,修复这个问题并不简单,需要跨引用整个文件系统的数据来查看范围。

If available, ECC memory should lower the chances of bit flips, but this type of memory is not available in all cases. A memory test should be performed in case there’s a visible bit flip pattern, though this may not detect a faulty memory module because the actual load of the system could be the factor making the problems appear. In recent years attacks on how the memory modules operate have been demonstrated (rowhammer) achieving specific bits to be flipped. While these were targeted, this shows that a series of reads or writes can affect unrelated parts of memory.
如果可用,ECC 内存应该降低位翻转的机会,但并非所有情况下都有这种类型的内存。如果存在可见的位翻转模式,应进行内存测试,尽管这可能无法检测到有问题的内存模块,因为系统的实际负载可能是导致问题出现的因素。近年来,对内存模块操作的攻击已经得到证明(rowhammer),实现特定位的翻转。尽管这些是有针对性的,但这表明一系列读取或写入可能会影响内存的不相关部分。

Block group profiles with redundancy (like RAID1) will not protect against memory errors as the blocks are first stored in memory before they are written to the devices from the same source.
具有冗余的块组配置文件(如 RAID1)不会保护内存错误,因为块首先存储在内存中,然后从相同来源写入设备。

A filesystem mounted read-only will not affect the underlying block device in almost 100% of cases (with highly unlikely exceptions). One exception is a tree-log that needs to be replayed during mount (and before the read-only mount takes place); working memory is needed for that and can be affected by bit flips. There's a theoretical case where a bit flip changes the filesystem status from read-only to read-write.
挂载为只读的文件系统几乎不会影响底层块设备(几乎 100%,极少数例外)。例外情况是在挂载期间需要重放的树日志(在只读挂载之前进行),这需要工作内存,而这可能会受到位翻转的影响。存在一种理论情况,即位翻转将文件系统状态从只读更改为读写。

What to do: 该怎么办:

  • run memtest, note that sometimes memory errors happen only when the system is under heavy load that the default memtest cannot trigger
    运行内存测试,注意有时候内存错误只会在系统承受重负载时发生,而默认的内存测试无法触发这种情况

  • memory errors may appear as the filesystem going read-only due to the "pre-write" check, which verifies metadata before it gets written and fails when basic consistency checks don't hold
    内存错误可能会导致文件系统变为只读,这是由于“预写”检查引起的,该检查在写入之前验证元数据,但未通过一些基本一致性检查

  • newly built systems should be tested before being put to production use, ideally start an IO/CPU load that will be run on such a system later; namely systems that will utilize overclocking or special performance features
    新构建的系统在投入生产使用之前应该进行测试,最好启动一个会在该系统上运行的 IO/CPU 负载测试;特别是那些将利用超频或特殊性能功能的系统

Direct memory access (DMA)
直接内存访问(DMA)

Another class of errors is related to DMA (direct memory access) performed by device drivers. While this could be considered a software error, the data transfers that happen without CPU assistance may accidentally corrupt other pages. Storage devices utilize DMA for performance reasons, the filesystem structures and data pages are passed back and forth, making errors possible in case page life time is not properly tracked.
另一类错误与设备驱动程序执行的 DMA(直接内存访问)有关。虽然这可能被视为软件错误,但在没有 CPU 协助的情况下发生的数据传输可能会意外损坏其他页面。存储设备利用 DMA 来提高性能,文件系统结构和数据页面来回传递,如果页面寿命没有正确跟踪,可能会出现错误。

There are lots of quirks (device-specific workarounds) in Linux kernel drivers (regarding not only DMA) that are added when found. The quirks may avoid specific errors or disable some features to avoid worse problems.
在 Linux 内核驱动程序中存在许多怪癖(特定设备的解决方法),当发现时会添加到其中(不仅限于 DMA)。这些怪癖可能避免特定错误或禁用某些功能以避免更严重的问题。

What to do: 该怎么办:

  • use up-to-date kernel (recent releases or maintained long term support versions)
    使用最新的内核(最新发布版或长期维护版本)

  • as this may be caused by faulty drivers, keep the systems up-to-date
    由于可能是由于驱动程序故障引起的,请保持系统保持最新状态

Rotational disks (HDD) 旋转磁盘(HDD)

Rotational HDDs typically fail at the level of individual sectors or small clusters. Read failures are caught on the levels below the filesystem and are returned to the user as EIO - Input/output error. Reading the blocks repeatedly may return the data eventually, but this is better done by specialized tools and the filesystem takes the result of the lower layers. Rewriting the sectors may trigger internal remapping but this inevitably leads to data loss.
旋转式硬盘驱动器(HDD)通常在单个扇区或小簇的级别上发生故障。读取失败会在文件系统下的级别上被捕获,并作为 EIO - 输入/输出错误返回给用户。重复读取块可能最终会返回数据,但最好由专门工具完成,文件系统会接收下层的结果。重写扇区可能会触发内部重映射,但这不可避免地会导致数据丢失。

Disk firmware is technically software but from the filesystem perspective is part of the hardware. IO requests are processed, and caching or various other optimizations are performed, which may lead to bugs under high load or unexpected physical conditions or unsupported use cases.
硬盘固件在技术上是软件,但从文件系统的角度来看,它是硬件的一部分。IO 请求会被处理,并执行缓存或各种其他优化,这可能会在高负载或意外物理条件或不支持的用例下导致错误。

Disks are connected by cables with two ends, both of which can cause problems when not attached properly. Data transfers are protected by checksums and the lower layers try hard to transfer the data correctly or not at all. The errors from badly-connecting cables may manifest as large amount of failed read or write requests, or as short error bursts depending on physical conditions.
硬盘通过两端连接的电缆连接,当连接不正确时,两端都可能导致问题。数据传输受校验和保护,较低的层次会尽力正确传输数据,或者根本不传输。由于连接不良的电缆可能导致大量失败的读取或写入请求,或者根据物理条件而定,表现为短暂的错误突发。

What to do: 该怎么办:

  • check smartctl for potential issues
    检查 smartctl 是否存在潜在问题
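For example, to print the health attributes and start an extended self-test (the device name is illustrative):

# smartctl -a /dev/sda
# smartctl -t long /dev/sda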

Solid state drives (SSD)
固态硬盘(SSD)

The mechanism of information storage is different from HDDs and this affects the failure mode as well. The data are stored in cells grouped in large blocks with limited number of resets and other write constraints. The firmware tries to avoid unnecessary resets and performs optimizations to maximize the storage media lifetime. The known techniques are deduplication (blocks with same fingerprint/hash are mapped to same physical block), compression or internal remapping and garbage collection of used memory cells. Due to the additional processing there are measures to verify the data e.g. by ECC codes.
信息存储机制与 HDD 不同,这也影响了故障模式。数据存储在以大块为单位的单元中,具有有限数量的重置和其他写入约束。固件试图避免不必要的重置,并执行优化以最大化存储介质的寿命。已知的技术包括去重(具有相同指纹/哈希的块被映射到同一物理块)、压缩或内部重映射以及已使用内存单元的垃圾收集。由于额外的处理,有措施来验证数据,例如通过 ECC 码。

The observations of failing SSDs show that the whole electronics fail at once or affect a lot of data (e.g. stored on one chip). Recovering such data may need specialized equipment and reading data repeatedly does not help as it's possible with HDDs.
失败的固态硬盘的观察表明,整个电子设备一次性失效或影响大量数据(例如存储在一个芯片上的数据)。恢复这些数据可能需要专门的设备,反复读取数据并不能像 HDD 那样有助于恢复。

There are several technologies of the memory cells with different characteristics and price. The lifetime is directly affected by the type and frequency of data written. Writing “too much” distinct data (e.g. encrypted) may render the internal deduplication ineffective and lead to a lot of rewrites and increased wear of the memory cells.
内存单元有几种不同特性和价格的技术。寿命直接受到写入数据类型和频率的影响。写入“过多”不同的数据(例如加密数据)可能使内部去重失效,并导致大量重写和增加内存单元的磨损。

There are several technologies and manufacturers so it’s hard to describe them but there are some that exhibit similar behaviour:
有几种技术和制造商,因此很难描述它们,但有一些表现出类似的行为:

  • expensive SSD will use more durable memory cells and is optimized for reliability and high load
    昂贵的固态硬盘将使用更耐用的内存单元,并针对可靠性和高负载进行了优化。

  • cheap SSD is projected for a lower load (“desktop user”) and is optimized for cost, it may employ the optimizations and/or extended error reporting partially or not at all
    廉价的固态硬盘适用于较低负载(“桌面用户”),并且经过成本优化,可能部分或完全采用优化和/或扩展的错误报告

It’s not possible to reliably determine the expected lifetime of an SSD due to lack of information about how it works or due to lack of reliable stats provided by the device.
由于缺乏关于其工作原理的信息或由于设备提供的可靠统计数据不足,无法可靠地确定固态硬盘的预期寿命。

Metadata writes tend to be the biggest component of lifetime writes to a SSD, so there is some value in reducing them. Depending on the device class (high end/low end) the features like DUP block group profiles may affect the reliability in both ways:
元数据写入往往是固态硬盘寿命写入的最大组成部分,因此减少它们具有一定的价值。根据设备类别(高端/低端),像 DUP 块组配置文件这样的功能可能会以两种方式影响可靠性。

  • high end are typically more reliable and using single for data and metadata could be suitable to reduce device wear
    高端产品通常更可靠,并且单独用于数据和元数据可能适合减少设备磨损

  • low end could lack ability to identify errors so an additional redundancy at the filesystem level (checksums, DUP) could help
    低端产品可能缺乏识别错误的能力,因此在文件系统级别增加额外的冗余(校验和、DUP)可能有所帮助

Only users who consume 50 to 100% of the SSD’s actual lifetime writes need to be concerned by the write amplification of btrfs DUP metadata. Most users will be far below 50% of the actual lifetime, or will write the drive to death and discover how many writes 100% of the actual lifetime was. SSD firmware often adds its own write multipliers that can be arbitrary and unpredictable and dependent on application behavior, and these will typically have far greater effect on SSD lifespan than DUP metadata. It’s more or less impossible to predict when a SSD will run out of lifetime writes to within a factor of two, so it’s hard to justify wear reduction as a benefit.
只有消耗 SSD 实际寿命写入量的 50 到 100% 的用户需要关注 btrfs DUP 元数据的写放大。大多数用户的写入量远低于实际寿命的 50%,或者会写入到设备死亡并发现 100% 实际寿命的写入量是多少。SSD 固件通常会添加自己的写入倍增器,这些倍增器可能是任意的、不可预测的,并且取决于应用程序的行为,这些倍增器通常对 SSD 寿命的影响要远远大于 DUP 元数据。几乎不可能准确预测 SSD 何时耗尽寿命写入量,因此很难将减少磨损作为一个好处来证明。

What to do: 该怎么办:

  • run smartctl or self-tests to look for potential issues
    运行 smartctl 或自检以查找潜在问题

  • keep the firmware up-to-date
    保持固件最新

NVM express, non-volatile memory (NVMe)
NVM Express,非易失性存储器(NVMe)

NVMe is a type of persistent memory usually connected over a system bus (PCIe) or similar interface and the speeds are an order of magnitude faster than SSD. It is also a non-rotating type of storage, and is not typically connected by a cable. It’s not a SCSI type device either but rather a complete specification for logical device interface.
NVMe 是一种通常通过系统总线(PCIe)或类似接口连接的持久性存储器类型,速度比固态硬盘快一个数量级。它也是一种非旋转式存储,通常不通过电缆连接。它也不是 SCSI 类型设备,而是逻辑设备接口的完整规范。

In a way the errors could be compared to a combination of SSD class and regular memory. Errors may exhibit as random bit flips or IO failures. There are tools to access the internal log (nvme log and nvme-cli) for a more detailed analysis.
在某种程度上,错误可以被比作 SSD 类和常规内存的组合。错误可能表现为随机位翻转或 IO 故障。有工具可以访问内部日志(nvme 日志和 nvme-cli)进行更详细的分析。
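A short sketch using nvme-cli (the device name is illustrative):

# nvme smart-log /dev/nvme0
# nvme error-log /dev/nvme0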

There are separate error detection and correction steps performed e.g. on the bus level and in most cases never making it to the filesystem level. Once this happens it could mean there's some systematic error like overheating or a bad physical connection of the device. You may want to run self-tests (using smartctl).
有单独的错误检测和纠正步骤,例如在总线级别上执行,在大多数情况下从未进入文件系统级别。一旦发生这种情况,可能意味着存在一些系统性错误,如过热或设备的不良物理连接。您可能希望运行自检(使用 smartctl)。

Drive firmware 驱动器固件 

Firmware is technically still software but embedded into the hardware. As all software has bugs, so does firmware. Storage devices can update the firmware and fix known bugs. In some cases it's possible to avoid certain bugs by quirks (device-specific workarounds) in the Linux kernel.
固件在技术上仍然是软件,但嵌入到硬件中。正如所有软件都有漏洞一样,固件也有漏洞。存储设备可以更新固件并修复已知的漏洞。在某些情况下,可以通过 Linux 内核中的技巧(设备特定的解决方法)来避免某些漏洞。

A faulty firmware can cause wide range of corruptions from small and localized to large affecting lots of data. Self-repair capabilities may not be sufficient.
故障的固件可能导致从小范围和局部的损坏到影响大量数据的广泛损坏。自我修复能力可能不足。

What to do: 该怎么办:

  • check for firmware updates in case there are known problems, note that updating firmware can be risky on itself
    检查固件更新以防存在已知问题,请注意更新固件本身可能存在风险

  • use up-to-date kernel (recent releases or maintained long term support versions)
    使用最新的内核(最新发布版或长期维护版本)

SD flash cards SD 闪存卡 

There are a lot of devices with low power consumption and thus using storage media based on low power consumption too, typically flash memory stored on a chip enclosed in a detachable card package. An improperly inserted card may be damaged by electrical spikes when the device is turned on or off. The chips storing data in turn may be damaged permanently. All types of flash memory have a limited number of rewrites, so the data are internally translated by FTL (flash translation layer). This is implemented in firmware (technically a software) and prone to bugs that manifest as hardware errors.
有许多低功耗设备,因此使用基于低功耗的存储介质,通常是存储在可拆卸卡包中的芯片上的闪存存储器。当设备开启或关闭时,不正确插入的卡可能会受到电压峰值的损坏。存储数据的芯片可能会永久受损。所有类型的闪存存储器都有有限的重写次数,因此数据通过 FTL(闪存转换层)进行内部转换。这是在固件中实现的(技术上是软件),容易出现表现为硬件错误的错误。

Adding redundancy like using DUP profiles for both data and metadata can help in some cases but a full backup might be the best option once problems appear and replacing the card could be required as well.
增加冗余,例如同时为数据和元数据使用 DUP 配置文件,在某些情况下可能有所帮助,但一旦出现问题,完整备份可能是最佳选择,同时可能需要更换卡。

Hardware as the main source of filesystem corruptions
硬件是文件系统损坏的主要原因。

If you use unreliable hardware and don’t know about that, don’t blame the filesystem when it tells you.
如果你使用不可靠的硬件并且对此一无所知,当文件系统告诉你时不要责怪它。

SEE ALSO 参见 

acl(5), btrfs(8), chattr(1), fstrim(8), ioctl(2), mkfs.btrfs(8), mount(8), swapon(8)