Zoned mode

Since Linux kernel version 5.12, btrfs supports a so-called zoned mode. This is a special on-disk format and allocation/write strategy that is friendly to zoned devices. In short, a device is partitioned into fixed-size zones, and each zone can only be updated in an append-only manner or reset. As btrfs has no fixed data structures except the super blocks, zoned mode only requires block placement that follows the device constraints. You can learn about the whole architecture at https://zonedstorage.io .

These devices are also referred to as SMR, ZBC or ZNS devices operating in host-managed mode. Note that some devices appear as non-zoned but are zoned internally; these are drive-managed, and using zoned mode won't help with them.
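
To tell which kind of device is present, the kernel's view of the zoned model can be read from sysfs. The following is a minimal sketch, not part of the btrfs tooling: the helper names `zoned_model` and `is_host_managed` are made up for illustration, and the optional second argument (an alternative sysfs root) exists only so the functions can be tried against a scratch directory. The `queue/zoned` attribute reports `none`, `host-aware` or `host-managed`; a drive-managed disk shows up as `none`.

```shell
#!/bin/sh
# Print the zoned model the kernel reports for a block device.
# $1: device name without /dev/, e.g. sda or nullb0
# $2: sysfs root, defaults to /sys (hypothetical knob for dry runs)
zoned_model() {
    attr="${2:-/sys}/block/$1/queue/zoned"
    if [ -r "$attr" ]; then
        cat "$attr"
    else
        echo "unknown"
        return 1
    fi
}

# Succeeds only if the device is reported as host-managed.
is_host_managed() {
    [ "$(zoned_model "$1" "${2:-/sys}")" = "host-managed" ]
}
```

For example, `is_host_managed nullb0 && echo "zoned mode applies"` prints the message only for a host-managed device.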

The zone size depends on the device; typical sizes are 256MiB or 1GiB. In general it must be a power of two. Emulated zoned devices like null_blk allow setting various zone sizes.

Requirements, limitations

  • all devices must have the same zone size

  • maximum zone size is 8GiB

  • minimum zone size is 4MiB

  • mixing zoned and non-zoned devices is possible, the zone writes are emulated, but this is mainly for testing

  • the super block is handled in a special way and is at different locations than on a non-zoned filesystem:

    • primary: 0B (and the next two zones)

    • secondary: 512GiB (and the next two zones)

    • tertiary: 4TiB (4096GiB, and the next two zones)
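
The size limits above can be checked before creating a filesystem. This is a sketch under stated assumptions: `zone_size_ok` and `zone_size_bytes` are hypothetical helper names, the limits are the 4MiB and 8GiB values from the list, and the zone size is taken from the `queue/chunk_sectors` sysfs attribute, which the kernel reports in 512-byte sectors for zoned devices.

```shell
#!/bin/sh
# Zone size limits from the list above, in bytes.
ZONE_MIN=$((4 * 1024 * 1024))           # 4MiB
ZONE_MAX=$((8 * 1024 * 1024 * 1024))    # 8GiB

# Succeeds if $1 (bytes) is a power of two within [ZONE_MIN, ZONE_MAX].
zone_size_ok() {
    size=$1
    [ "$size" -ge "$ZONE_MIN" ] || return 1
    [ "$size" -le "$ZONE_MAX" ] || return 1
    # A power of two has exactly one bit set, so size & (size - 1) is 0.
    [ $((size & (size - 1))) -eq 0 ]
}

# Print the zone size of device $1 in bytes.
# $2: sysfs root, defaults to /sys (hypothetical knob for dry runs)
zone_size_bytes() {
    sectors=$(cat "${2:-/sys}/block/$1/queue/chunk_sectors") || return 1
    echo $((sectors * 512))
}
```

A possible use is `zone_size_ok "$(zone_size_bytes nullb0)" || echo "unsupported zone size"`. Running the same check on every device of a multi-device filesystem and comparing the values covers the "same zone size" requirement as well.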

Incompatible features

The main constraint of zoned devices is the lack of in-place updates of data. This is inherently incompatible with some features:

  • NODATACOW - overwrites data in place, such files cannot be created

  • fallocate - preallocates space for an in-place first write

  • mixed-bg - unordered writes to data and metadata; fixing that means using separate data and metadata block groups

  • booting - the zone at offset 0 contains the superblock, resetting the zone would destroy the bootloader data

Initial support lacks some features, but they are planned:

  • only the single (data, metadata) and DUP (metadata) profiles are supported

  • fstrim - due to its dependency on free space cache v1

Super block

As noted above, the super block is handled in a special way. To be crash safe, at least one zone in a known location must contain a valid superblock. This is implemented as a ring buffer in two consecutive zones, starting from the known offsets 0B, 512GiB and 4TiB.

These offsets differ from those on non-zoned devices. Each new super block is appended to the end of the zone; once it is filled, the zone is reset and writes continue in the next one. Looking up the latest super block requires reading the offsets of both zones and determining the last written version.

The amount of space reserved for the super block depends on the zone size. The secondary and tertiary copies are at distant offsets, as the capacity of the devices is expected to be large, tens of terabytes. The maximum supported zone size is 8GiB, which would mean that e.g. offsets 0-16GiB would be reserved just for the super block on a hypothetical device of that zone size. This is wasteful but required to guarantee crash safety.
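
The reserved ranges follow directly from the zone size. The sketch below only illustrates the arithmetic described in this section and is not taken from the btrfs sources: `sb_ring_range` is a hypothetical helper, and it assumes that each copy occupies exactly two consecutive zones starting at the known offset (which is zone-aligned, because the zone size is a power of two no larger than 8GiB).

```shell
#!/bin/sh
GIB=$((1024 * 1024 * 1024))

# Print "start end" in bytes (end exclusive) of the two-zone ring
# buffer holding one super block copy.
# $1: copy index (0 primary, 1 secondary, 2 tertiary)
# $2: zone size in bytes
sb_ring_range() {
    case $1 in
        0) start=0 ;;                  # 0B
        1) start=$((512 * GIB)) ;;     # 512GiB
        2) start=$((4096 * GIB)) ;;    # 4TiB
        *) return 1 ;;
    esac
    # Assumption: two consecutive zones per copy, as described above.
    echo "$start $((start + 2 * $2))"
}
```

With an 8GiB zone size, `sb_ring_range 0 $((8 * GIB))` yields the 0-16GiB range mentioned in the text.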

Devices

Real hardware

The WD Ultrastar series 600 advertises HM-SMR, i.e. the host-managed zoned mode. There are two more modes: DM (device-managed, no zoned information is exported to the system) and HA (host-aware, can be used as a regular disk but zoned writes improve performance). Not many devices are available at the moment, and information about the exact zoned mode is hard to find; check data sheets or community sources that gather information from real devices.

Note: zoned mode won’t work with DM-SMR disks.

  • Ultrastar® DC ZN540 NVMe ZNS SSD (product brief)

Emulated: null_blk

The null_blk driver provides a memory-backed device and is suitable for testing. There are some quirks when setting up the devices. The module must be loaded with nr_devices=0, or the numbering of device nodes will be offset. configfs must be mounted at /sys/kernel/config, and the null_blk devices are administered in /sys/kernel/config/nullb. The device nodes are named like /dev/nullb0 and are numbered sequentially. NOTE: the device name may differ from the name of its directory in configfs!

Setup:

modprobe configfs
modprobe null_blk nr_devices=0

Create a device mydev, assuming no other devices were created previously, with a size of 2048MiB and a zone size of 256MiB. There are more tunable parameters; this is a minimal example that uses the defaults:

cd /sys/kernel/config/nullb/
mkdir mydev
cd mydev
echo 2048 > size
echo 1 > zoned
echo 1 > memory_backed
echo 256 > zone_size
echo 1 > power

This will create a device /dev/nullb0, and the value of the file index will match the ending number of the device node.
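
The steps above can be wrapped in a small function that also resolves the device node from the index file, which avoids guessing the number when several devices exist. This is only a sketch: `nullb_create_zoned` is a hypothetical name, and the fourth argument (an alternative configfs directory) is an assumption added so the function can be exercised without the module loaded.

```shell
#!/bin/sh
# Create a zoned, memory-backed null_blk device through configfs.
# $1: directory name, $2: size in MiB, $3: zone size in MiB
# $4: null_blk configfs directory (default /sys/kernel/config/nullb)
nullb_create_zoned() {
    name=$1
    size=$2
    zone=$3
    dir="${4:-/sys/kernel/config/nullb}/$name"
    mkdir "$dir" || return 1
    echo "$size" > "$dir/size" &&
    echo 1 > "$dir/zoned" &&
    echo 1 > "$dir/memory_backed" &&
    echo "$zone" > "$dir/zone_size" &&
    echo 1 > "$dir/power" || return 1
    # The node number comes from the index file, not from the name.
    if [ -r "$dir/index" ]; then
        echo "/dev/nullb$(cat "$dir/index")"
    fi
}
```

Run as root after loading the module, `nullb_create_zoned mydev 2048 256` performs the same writes as the manual steps and prints the resulting device node.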

Remove the device:

rmdir /sys/kernel/config/nullb/mydev

While the device exists (that is, before removing it), continue with mkfs.btrfs /dev/nullb0; the zoned mode is auto-detected.

For convenience, there is a script wrapping the basic null_blk management operations at https://github.com/kdave/nullb.git; with it, the above commands become:

nullb setup
nullb create -s 2g -z 256
mkfs.btrfs /dev/nullb0
...
nullb rm nullb0

Emulated: TCMU runner

TCMU is a framework to emulate SCSI devices in userspace, providing various backends for the storage, with zoned support as well. A file-backed zoned device can provide more options for larger storage and zone size. Please follow the instructions at https://zonedstorage.io/projects/tcmu-runner/ .

Compatibility, incompatibility

  • the feature sets an incompat bit and requires a new kernel to access the filesystem (for both read and write)

  • the superblock needs to be handled in a special way: there are still 3 copies, but at different offsets (0, 512GiB, 4TiB), and the 2 consecutive zones form a ring buffer of superblocks; finding the latest one requires reading it at the write pointer or doing a full scan of the zones

  • mixing zoned and non-zoned devices is possible (zones are emulated) but is recommended only for testing

  • mixing zoned devices with different zone sizes is not possible

  • zone sizes must be a power of two; zone sizes of real devices are e.g. 256MiB or 1GiB, larger sizes are expected, and the maximum zone size supported by btrfs is 8GiB
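
Whether a filesystem carries the zoned incompat bit can be checked from the super block flags. The sketch below rests on two assumptions that should be verified against the installed kernel headers and btrfs-progs: that the zoned flag is bit 12 of `incompat_flags`, and that `btrfs inspect-internal dump-super` prints a line starting with `incompat_flags` followed by the hex value. The function names are hypothetical.

```shell
#!/bin/sh
# Assumption: the zoned incompat flag is bit 12 (value 0x1000).
ZONED_BIT=12

# Succeeds if the hex or decimal flags value in $1 has the zoned bit set.
has_zoned_flag() {
    [ $(( ($1 >> ZONED_BIT) & 1 )) -eq 1 ]
}

# Print the incompat_flags value of the filesystem on device $1.
# Assumption: dump-super output has an "incompat_flags <hex>" line.
incompat_flags_of() {
    btrfs inspect-internal dump-super "$1" |
        awk '$1 == "incompat_flags" { print $2; exit }'
}
```

For example, `has_zoned_flag "$(incompat_flags_of /dev/nullb0)" && echo zoned` reports a filesystem created in zoned mode.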

Status, stability, reporting bugs

The zoned mode was released in 5.12, and there are still some rough edges and corner cases one can hit during testing. Please report bugs to https://github.com/naota/linux/issues/ .

References