Zoned mode
Since version 5.12 btrfs supports the so-called zoned mode. This is a special
on-disk format and allocation/write strategy that’s friendly to zoned devices.
In short, a device is partitioned into fixed-size zones and each zone can only
be updated by appending, or be reset. As btrfs has no fixed data structures,
except the super blocks, the zoned mode only requires block placement that
follows the device constraints. You can learn about the whole architecture at
https://zonedstorage.io .
Zoned devices are also referred to as SMR/ZBC/ZNS devices operating in
host-managed mode. Note that some devices appear as non-zoned but are zoned
internally; these are drive-managed and using zoned mode won’t help.
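To see which of these cases applies to a given disk, the zoned model reported by the kernel can be read from sysfs. The following is a minimal sketch, assuming the block layer attribute `queue/zoned` is available; the function names and the optional sysfs root argument are made up for this example.

```sh
# Print the zoned model the kernel reports for a block device.
# $1: device name without /dev/, e.g. sda or nullb0
# $2: sysfs root, defaults to /sys (hypothetical override for experiments)
zoned_model() {
    zm_file="${2:-/sys}/block/$1/queue/zoned"
    if [ -r "$zm_file" ]; then
        cat "$zm_file"
    else
        echo "unknown"
    fi
}

# Turn the reported model into a short human-readable hint.
describe_zoned() {
    case "$(zoned_model "$1" "$2")" in
        host-managed) echo "$1: host-managed, zoned mode applies" ;;
        host-aware)   echo "$1: host-aware, usable as a regular disk" ;;
        none)         echo "$1: not zoned as seen by the host (regular or drive-managed)" ;;
        *)            echo "$1: zoned model unknown" ;;
    esac
}
```

For example, `describe_zoned nullb0` inspects `/sys/block/nullb0/queue/zoned`. A drive-managed disk reports `none` here, which is why it cannot be told apart from a regular disk this way.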
The zone size depends on the device; typical sizes are 256MiB or 1GiB. In
general it must be a power of two. Emulated zoned devices like null_blk allow
setting various zone sizes.
Requirements, limitations
- all devices must have the same zone size
- maximum zone size is 8GiB
- minimum zone size is 4MiB
- mixing zoned and non-zoned devices is possible, the zone writes are emulated, but this is mainly for testing
- the super block is handled in a special way and is at different locations than on a non-zoned filesystem:
  - primary: 0B (and the next two zones)
  - secondary: 512GiB (and the next two zones)
  - tertiary: 4TiB (4096GiB, and the next two zones)
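The size limits listed above can be expressed as a small check. This is only an illustrative sketch, not the validation done by btrfs itself; the function name `zone_size_ok` is made up and the zone size is assumed to be given in MiB.

```sh
# Check a zone size given in MiB against the documented limits:
# a power of two, at least 4 MiB and at most 8 GiB (8192 MiB).
zone_size_ok() {
    zs=$1
    [ "$zs" -ge 4 ] && [ "$zs" -le 8192 ] || return 1
    # A power of two has exactly one bit set, so zs & (zs - 1) is zero.
    [ $(( zs & (zs - 1) )) -eq 0 ]
}
```

Typical sizes such as 256 or 1024 pass, while 300 (not a power of two) or 16384 (above the maximum) are rejected.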
Incompatible features
The main constraint of zoned devices is the lack of in-place data updates.
This is inherently incompatible with some features:
- NODATACOW - overwrite in-place, cannot create such files
- fallocate - preallocating space for in-place first write
- mixed-bg - unordered writes to data and metadata, fixing that means using separate data and metadata block groups
- booting - the zone at offset 0 contains superblock, resetting the zone would destroy the bootloader data
Initial support lacks some features but they’re planned:
- only single (data, metadata) and DUP (metadata) profiles are supported
- fstrim - due to dependency on free space cache v1
Super block
As noted above, the super block is handled in a special way. In order to be
crash safe, at least one zone in a known location must contain a valid
superblock. This is implemented as a ring buffer in two consecutive zones,
starting from known offsets 0B, 512GiB and 4TiB.
The values are different than on non-zoned devices. Each new super block is
appended to the end of the zone; once it’s filled, the zone is reset and writes
continue to the next one. Looking up the latest super block requires reading
the offsets of both zones and determining the last written version.
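The lookup step can be illustrated with a toy model. This is not the kernel's implementation, only a sketch under simplifying assumptions: each zone is described by a hypothetical count of super blocks already written to it and by the generation of the last one, and the newest copy is the one with the higher generation.

```sh
# Pick the most recent super block from the two log zones of one copy.
# Arguments: wp0 gen0 wp1 gen1
#   wpN  - number of super blocks already written to zone N (0 = empty)
#   genN - generation of the last super block in zone N (ignored if empty)
# Prints "<zone> <slot>" of the newest super block, or "none".
latest_sb() {
    if [ "$1" -eq 0 ] && [ "$3" -eq 0 ]; then
        echo "none"
    elif [ "$3" -eq 0 ] || { [ "$1" -gt 0 ] && [ "$2" -gt "$4" ]; }; then
        # Zone 1 is empty, or zone 0 holds the newer generation.
        echo "0 $(( $1 - 1 ))"
    else
        echo "1 $(( $3 - 1 ))"
    fi
}
```

For instance, `latest_sb 4 10 2 12` prints `1 1`: both zones contain data, and the last entry of zone 1 has the higher generation.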
The amount of space reserved for super block depends on the zone size. The
secondary and tertiary copies are at distant offsets as the capacity of the
devices is expected to be large, tens of terabytes. Maximum zone size supported
is 8GiB, which would mean that e.g. offset 0-16GiB would be reserved just for
the super block on a hypothetical device of that zone size. This is wasteful
but required to guarantee crash safety.
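The arithmetic behind the reserved space can be sketched as follows. The function name is made up, and it assumes that each super block copy occupies two consecutive zones starting at its known offset, as described for the ring buffer; all values are in MiB.

```sh
# Print "start end" (in MiB) of the area reserved for one super block copy.
# $1: copy number (0 primary, 1 secondary, 2 tertiary)
# $2: zone size in MiB
# Assumption: each copy uses two consecutive zones from its known offset.
sb_reserved_mib() {
    case "$1" in
        0) sb_start=0 ;;
        1) sb_start=$(( 512 * 1024 )) ;;    # 512 GiB expressed in MiB
        2) sb_start=$(( 4096 * 1024 )) ;;   # 4 TiB expressed in MiB
        *) return 1 ;;
    esac
    echo "$sb_start $(( sb_start + 2 * $2 ))"
}
```

With the maximum zone size, `sb_reserved_mib 0 8192` prints `0 16384`, i.e. the 0-16GiB range mentioned above. With 256MiB zones the primary copy reserves only the first 512MiB.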
Devices
Real hardware
The WD Ultrastar series 600 advertises HM-SMR, i.e. the host-managed zoned
mode. There are two more: DM (device managed, no zoned information exported to
the system) and HA (host aware, can be used as a regular disk but zoned writes
improve performance). There are not many devices available at the moment and
the information about the exact zoned mode is hard to find. Check data sheets
or community sources gathering information from real devices.
Note: zoned mode won’t work with DM-SMR disks.
Ultrastar® DC ZN540 NVMe ZNS SSD (product brief)
Emulated: null_blk
The driver null_blk provides a memory-backed device and is suitable for
testing. There are some quirks when setting up the devices. The module must be
loaded with nr_devices=0 or the numbering of device nodes will be offset. The
configfs must be mounted at /sys/kernel/config and the administration of
the null_blk devices is done in /sys/kernel/config/nullb. The device nodes
are named like /dev/nullb0 and are numbered sequentially. NOTE: the device
name may be different than the named directory in sysfs!
Setup:
modprobe configfs
modprobe null_blk nr_devices=0
Create a device mydev with a size of 2048MiB and a zone size of 256MiB,
assuming no other devices have been created before. There are more tunable
parameters; this is a minimal example taking defaults:
cd /sys/kernel/config/nullb/
mkdir mydev
cd mydev
echo 2048 > size
echo 1 > zoned
echo 1 > memory_backed
echo 256 > zone_size
echo 1 > power
This will create a device /dev/nullb0 and the value of file index will
match the ending number of the device node.
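The configfs steps and the index lookup can be wrapped into two small helpers. This is a sketch rather than a replacement for a proper management tool: the function names are made up, and the optional last argument (the configfs directory to operate on) exists only so the helpers can be tried against a scratch directory.

```sh
# Create a zoned, memory-backed null_blk device through configfs.
# $1: directory name, $2: size in MiB, $3: zone size in MiB
# $4: configfs nullb directory, defaults to /sys/kernel/config/nullb
nullb_create() {
    nb_dir="${4:-/sys/kernel/config/nullb}/$1"
    mkdir "$nb_dir" || return 1
    echo "$2" > "$nb_dir/size"
    echo 1 > "$nb_dir/zoned"
    echo 1 > "$nb_dir/memory_backed"
    echo "$3" > "$nb_dir/zone_size"
    # Power is written last, after all other attributes are configured.
    echo 1 > "$nb_dir/power"
}

# Print the device node that belongs to a configfs directory, using the
# index file rather than guessing from the directory name.
nullb_devnode() {
    echo "/dev/nullb$(cat "${2:-/sys/kernel/config/nullb}/$1/index")"
}
```

After `nullb_create mydev 2048 256`, calling `nullb_devnode mydev` prints the matching device node, which avoids relying on the directory name.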
Remove the device:
rmdir /sys/kernel/config/nullb/mydev
Then continue with mkfs.btrfs /dev/nullb0; the zoned mode is auto-detected.
For convenience, there’s a script wrapping the basic null_blk management
operations at https://github.com/kdave/nullb.git. With it, the above commands
become:
nullb setup
nullb create -s 2g -z 256
mkfs.btrfs /dev/nullb0
...
nullb rm nullb0
Emulated: TCMU runner
TCMU is a framework to emulate SCSI devices in userspace, providing various
backends for the storage, with zoned support as well. A file-backed zoned
device can provide more options for larger storage and zone size. Please follow
the instructions at https://zonedstorage.io/projects/tcmu-runner/ .
Compatibility, incompatibility
- the feature sets an incompat bit and requires a new kernel to access the filesystem (for both read and write)
- the superblock needs to be handled in a special way: there are still 3 copies but at different offsets (0, 512GiB, 4TiB) and the 2 consecutive zones are a ring buffer of the superblocks; finding the latest one requires reading it from the write pointer or doing a full scan of the zones
- mixing zoned and non-zoned devices is possible (zones are emulated) but is recommended only for testing
- mixing zoned devices with different zone sizes is not possible
- zone sizes must be a power of two; zone sizes of real devices are e.g. 256MiB or 1GiB, larger sizes are expected, and the maximum zone size supported by btrfs is 8GiB
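Before combining several zoned devices, their zone sizes can be compared. This sketch assumes the sysfs attribute `queue/chunk_sectors`, which reports the zone size in 512-byte sectors for zoned block devices; the function names and the explicit sysfs root argument are made up for this example.

```sh
# Print the zone size of a device in MiB.
# $1: device name without /dev/, $2: sysfs root (defaults to /sys)
zone_size_mib() {
    zs_sectors=$(cat "${2:-/sys}/block/$1/queue/chunk_sectors") || return 1
    echo $(( zs_sectors * 512 / 1024 / 1024 ))
}

# Succeed only if all listed devices report the same zone size.
# $1: sysfs root, remaining arguments: device names
same_zone_size() {
    szs_root=$1
    shift
    szs_first=""
    for szs_dev in "$@"; do
        szs_cur=$(zone_size_mib "$szs_dev" "$szs_root") || return 1
        if [ -z "$szs_first" ]; then
            szs_first=$szs_cur
        elif [ "$szs_cur" != "$szs_first" ]; then
            return 1
        fi
    done
}
```

A call like `same_zone_size /sys nullb0 nullb1` returns success only if both devices report an identical zone size.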
Status, stability, reporting bugs
The zoned mode has been released in 5.12 and there are still some rough edges
and corner cases one can hit during testing. Please report bugs to
https://github.com/naota/linux/issues/ .
References
- https://zonedstorage.io/projects/libzbc/ -- libzbc is a library and set of tools to directly manipulate devices with ZBC/ZAC support
- https://zonedstorage.io/projects/libzbd/ -- libzbd uses the kernel provided zoned block device interface based on the ioctl() system calls
- https://hddscan.com/blog/2020/hdd-wd-smr.html -- some details about exact device types
- https://lwn.net/Articles/853308/ -- Btrfs on zoned block devices
- https://www.usenix.org/conference/vault20/presentation/bjorling -- Zone Append: A New Way of Writing to Zoned Storage