Quota groups

The concept of quota has a long-standing tradition in the Unix world. Ever since computers began allowing multiple users to work simultaneously on one filesystem, there has been a need to prevent any one user from using up the entire space. Every user should get a fair share of the available resources.

In the case of files, the solution is quite straightforward. Each file has an owner recorded along with it, and it has a size. Traditional quota just restricts the total size of all files that are owned by a user. The concept is quite flexible: if a user hits his quota limit, the administrator can raise it on the fly.

On the other hand, the traditional approach offers only a poor solution for restricting directories. At installation time, the hard disk can be partitioned so that every directory (e.g. /usr, /var, …) that needs a limit gets its own partition. The obvious problem is that those limits cannot be changed without a reinstallation. The btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to partitions, as every subvolume looks like its own filesystem. With subvolume quota, it is now possible to restrict each subvolume like a partition while keeping the flexibility of quota. The space for each subvolume can be expanded or restricted on the fly.

As subvolumes are the basis for snapshots, interesting questions arise as to how to account for used space in the presence of snapshots. If a file is shared between a subvolume and a snapshot, to whom should it be accounted? The creator? Both? What if the file gets modified in the snapshot: should only these changes be accounted to it? But wait, both the snapshot and the subvolume belong to the same user home. I just want to limit the total space used by both! But somebody else might not want to charge the snapshots to the users.

Btrfs subvolume quota solves these problems by introducing groups of subvolumes and letting the user put limits on them. It is even possible to have groups of groups. In the following, we refer to them as qgroups.

Each qgroup primarily tracks two numbers: the amount of total referenced space and the amount of exclusively referenced space.

referenced
    space is the amount of data that can be reached from any of the subvolumes contained in the qgroup, while

exclusive
    is the amount of data where all references to this data can be reached from within this qgroup.

Subvolume quota groups

The basic notion of the Subvolume Quota feature is the quota group, or qgroup for short. Qgroups are written as level/id, e.g. the qgroup 3/2 is a qgroup of level 3. For level 0, the leading 0/ can be omitted. Qgroups of level 0 are created automatically when a subvolume or snapshot is created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5 is the qgroup for the root subvolume. For the btrfs qgroup command, the path to the subvolume can also be used instead of 0/ID. For all higher levels, the ID can be chosen freely.
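
The level/id notation can be sketched in a few lines of Python. The helper name parse_qgroupid is hypothetical and not part of any btrfs tool; it only illustrates the rule that a bare ID is read as level 0.

```python
def parse_qgroupid(text):
    """Split 'level/id' into (level, id); a bare 'id' means level 0."""
    if "/" in text:
        level, qgroup_id = text.split("/", 1)
    else:
        level, qgroup_id = "0", text
    return int(level), int(qgroup_id)


print(parse_qgroupid("3/2"))  # (3, 2)
print(parse_qgroupid("5"))    # (0, 5), the qgroup of the root subvolume
```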

Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy of qgroups. Figure 1 shows an example qgroup tree.

                          +---+
                          |2/1|
                          +---+
                         /     \
                   +---+/       \+---+
                   |1/1|         |1/2|
                   +---+         +---+
                  /     \       /     \
            +---+/       \+---+/       \+---+
qgroups     |0/1|         |0/2|         |0/3|
            +-+-+         +---+         +---+
              |          /     \       /     \
              |         /       \     /       \
              |        /         \   /         \
extents       1       2            3            4

Figure 1: Sample qgroup hierarchy

At the bottom, some extents are depicted showing which qgroups reference which extents. It is important to understand the notion of referenced vs exclusive. In the example, qgroup 0/2 references extents 2 and 3, while 1/2 references extents 2-4 and 2/1 references all extents.

On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2, while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents are exclusive to 2/1.

So exclusive does not mean there is no other way to reach the extent, but it does mean that if you delete all subvolumes contained in a qgroup, the extent will get deleted.

The exclusive count of a qgroup conveys a useful piece of information: how much space will be freed if all subvolumes of the qgroup are deleted.
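
The referenced and exclusive sets for Figure 1 can be reproduced with a small Python sketch. The dictionaries below are transcribed from the figure; the function names are illustrative only and do not correspond to anything in btrfs itself.

```python
# Direct members of each higher-level qgroup, copied from Figure 1.
members = {
    "2/1": ["1/1", "1/2"],
    "1/1": ["0/1", "0/2"],
    "1/2": ["0/2", "0/3"],
}
# Level-0 qgroups (subvolumes) that reference each extent in Figure 1.
extent_refs = {1: {"0/1"}, 2: {"0/2"}, 3: {"0/2", "0/3"}, 4: {"0/3"}}


def subvolumes(qgroup):
    """Return every level-0 qgroup contained in `qgroup`."""
    if qgroup not in members:
        return {qgroup}
    found = set()
    for child in members[qgroup]:
        found |= subvolumes(child)
    return found


def referenced(qgroup):
    """Extents reachable from at least one contained subvolume."""
    inside = subvolumes(qgroup)
    return {e for e, refs in extent_refs.items() if refs & inside}


def exclusive(qgroup):
    """Extents whose references all come from contained subvolumes."""
    inside = subvolumes(qgroup)
    return {e for e, refs in extent_refs.items() if refs <= inside}


for q in ["0/1", "0/2", "0/3", "1/1", "1/2", "2/1"]:
    print(q, "referenced", sorted(referenced(q)),
          "exclusive", sorted(exclusive(q)))
```

Under this definition extent 2 is exclusive to both 1/1 and 1/2, because its only referencing subvolume, 0/2, is a member of both.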

All data extents are accounted this way. Metadata that belongs to a specific subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent allocation information are not accounted.

In turn, the referenced count of a qgroup can be limited. All writes beyond this limit will lead to a ‘Quota Exceeded’ error.

Inheritance

Things get a bit more complicated when new subvolumes or snapshots are created. The case of (empty) subvolumes is still quite easy. If a subvolume should be part of a qgroup, it has to be added to the qgroup at creation time. To add it at a later time, it would be necessary to at least rescan the full subvolume for a proper accounting.

Creation of a snapshot is the hard case. Obviously, the snapshot will reference exactly the same amount of space as its source, and both source and destination now have an exclusive count of 0 (the filesystem nodesize to be precise, as the roots of the trees are not shared). But what about qgroups of higher levels? If the qgroup contains both the source and the destination, nothing changes. If the qgroup contains only the source, it might lose some exclusive.

But how much? The tempting answer is to subtract all of the source’s exclusive space from the qgroup, but that is wrong, or at least not enough. There could have been an extent that is referenced from the source and from another subvolume in that qgroup. This extent would have been exclusive to the qgroup, but not to the source subvolume. With the creation of the snapshot, the qgroup would also lose this extent from its exclusive set.

So how can this problem be solved? At the instant the snapshot gets created, we already have to know the correct exclusive count. We need a second qgroup that contains the same subvolumes as the first qgroup, except the subvolume we want to snapshot. The moment we create the snapshot, the exclusive count from the second qgroup needs to be copied to the first qgroup, as it represents the correct value. The second qgroup is called a tracking qgroup. It is only there in case a snapshot is needed.
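
The argument can be checked on a toy example. The following Python sketch is only an illustration under stated assumptions: the subvolume names S and T, the extent names and the sizes are made up, and the nodesize-sized tree roots are ignored.

```python
def exclusive_bytes(group, refs, size):
    """Sum the extents whose references all lie inside `group`."""
    return sum(size[e] for e, holders in refs.items() if holders <= group)


# Hypothetical extents: 'a' only in S, 'b' shared by S and T, 'c' only in T.
size = {"a": 10, "b": 20, "c": 30}
refs = {"a": {"S"}, "b": {"S", "T"}, "c": {"T"}}

group = {"S", "T"}        # the first qgroup
tracking = group - {"S"}  # tracking qgroup: same members except the source

before = exclusive_bytes(group, refs, size)
naive = before - exclusive_bytes({"S"}, refs, size)
predicted = exclusive_bytes(tracking, refs, size)

# Snapshot S: every extent referenced by S is now also referenced by the
# snapshot, which is not a member of the first qgroup.
after_refs = {
    e: (holders | {"S-snap"}) if "S" in holders else set(holders)
    for e, holders in refs.items()
}
actual = exclusive_bytes(group, after_refs, size)

print("before:", before)        # 60
print("naive guess:", naive)    # 50, still counts the shared extent 'b'
print("tracking:", predicted)   # 30
print("actual:", actual)        # 30
```

The naive subtraction overestimates because extent b was exclusive to the qgroup but not to the source, while the tracking qgroup already holds the value the first qgroup needs after the snapshot.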

Use cases

The use cases below are not meant to be exhaustive. You can find your own way to integrate qgroups.

Single-user machine

Replacement for partitions. The simplest use case is to use qgroups as a simple replacement for partitions. Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as subvolumes. As each subvolume gets its own qgroup automatically, they can simply be restricted. No hierarchy is needed for that.

Track usage of snapshots. When a snapshot is taken, a qgroup for it will automatically be created with the correct values. Referenced will show how much is in it, possibly shared with other subvolumes. Exclusive will be the amount of space that gets freed when the subvolume is deleted.

Multi-user machine

Restricting homes. When you have several users on a machine, with home directories probably under /home, you might want to restrict /home as a whole, while restricting every user to an individual limit as well. This is easily accomplished by creating a qgroup for /home, e.g. 1/1, and assigning all user subvolumes to it. Restricting this qgroup will limit /home, while every user subvolume can get its own (lower) limit.
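
How a per-user limit and a limit on /home interact can be sketched as follows. This is a simplified model, not the kernel's implementation: the subvolume IDs 256 and 257, the limits and the function names are assumptions for illustration, and every write is treated as new, unshared data.

```python
class QuotaExceeded(Exception):
    """Raised when a write would push a qgroup past its limit."""


# Hypothetical layout: two user subvolumes assigned to 1/1 (for /home).
parents = {"0/256": ["1/1"], "0/257": ["1/1"]}
limit = {"1/1": 100, "0/256": 60, "0/257": 60}
referenced = {"1/1": 0, "0/256": 0, "0/257": 0}


def ancestors(qgroup):
    """Return `qgroup` plus every qgroup above it in the hierarchy."""
    chain, todo = [], [qgroup]
    while todo:
        current = todo.pop()
        if current not in chain:
            chain.append(current)
            todo.extend(parents.get(current, []))
    return chain


def write(qgroup, nbytes):
    """Charge nbytes to the qgroup and its ancestors, or refuse."""
    chain = ancestors(qgroup)
    for q in chain:
        if q in limit and referenced[q] + nbytes > limit[q]:
            raise QuotaExceeded(q)
    for q in chain:
        referenced[q] += nbytes


write("0/256", 50)
write("0/257", 50)
try:
    write("0/257", 5)  # within the user's own limit, but /home is full
except QuotaExceeded as err:
    print("rejected by", err)  # rejected by 1/1
```

The check runs over the whole chain before anything is charged, so a rejected write leaves all counters unchanged.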

Accounting snapshots to the user. Let’s say the user is allowed to create snapshots via some mechanism. It would only be fair to account space used by the snapshots to the user. This does not mean the user doubles his usage as soon as he takes a snapshot. Of course, files that are present in his home and the snapshot should only be accounted once. This can be accomplished by creating a qgroup for each user, say 1/UID. The user home and all snapshots are assigned to this qgroup. Limiting it will extend the limit to all snapshots, counting files only once. To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the previous example is needed, with all user qgroups assigned to it.

Do not account snapshots. On the other hand, when the snapshots get created automatically, the user has no chance to control them, so the space used by them should not be accounted to him. This is already the case when creating snapshots in the example from the previous section.

Snapshots for backup purposes. This scenario is a mixture of the previous two. The user can create snapshots, but some snapshots for backup purposes are being created by the system. The user’s snapshots should be accounted to the user, not the system. The solution is similar to the one from the section Accounting snapshots to the user, except that system snapshots are not assigned to the user’s qgroup.
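
The effect of assigning the home and the user's own snapshots, but not the system snapshots, to one per-user qgroup can be illustrated with a short Python sketch. All subvolume names, extent names and sizes here are made up for the example.

```python
# Hypothetical extents and the subvolumes that reference them.
size = {"doc": 100, "old_doc": 40, "backup_only": 70}
refs = {
    "home": {"doc"},
    "user_snap": {"doc", "old_doc"},
    "system_snap": {"doc", "backup_only"},
}

# The per-user qgroup (1/UID in the text) holds the home and the user's
# own snapshot; the system snapshot is deliberately left out.
user_qgroup = {"home", "user_snap"}


def referenced_bytes(group):
    """Total size of extents reachable from any subvolume in `group`."""
    reachable = set()
    for subvol in group:
        reachable |= refs[subvol]
    return sum(size[e] for e in reachable)


per_subvolume_sum = sum(referenced_bytes({s}) for s in user_qgroup)
user_usage = referenced_bytes(user_qgroup)

print("sum of individual counts:", per_subvolume_sum)  # 240
print("charged to the user:", user_usage)              # 140
```

The shared extent is counted once in the user's qgroup, and the extent that survives only in the system snapshot is not charged to the user at all.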

Simple quotas (squota)

As detailed in this document, qgroups can handle many complex extent sharing and unsharing scenarios while maintaining an accurate count of exclusive and shared usage. However, this flexibility comes at a cost: many of the computations are global, in the sense that we must count up the number of trees referring to an extent after its references change. This can slow down transaction commits and lead to unacceptable latencies, especially in cases where snapshots scale up.

To work around this limitation of qgroups, btrfs also supports a second set of quota semantics: simple quotas or squotas. Squotas fully share the qgroups API and hierarchical model, but do not track shared vs. exclusive usage. Instead, they account each extent to the subvolume that first allocated it. With a bit of new bookkeeping, this allows all accounting decisions to be local to the allocation or freeing operation that deals with the extents themselves, and fully avoids the complex and costly back-reference resolutions.

Example

To illustrate the difference between squotas and qgroups, consider the following basic example assuming a nodesize of 16KiB.

  1. create subvolume 256

  2. rack up 1GiB of data and metadata usage in 256

  3. snapshot 256, creating subvolume 257

  4. COW 512MiB of the data and metadata in 257

  5. delete everything in 256

At each step, qgroups would have the following accounting:

  1. 0/256: 16KiB excl 0 shared

  2. 0/256: 1GiB excl 0 shared

  3. 0/256: 0 excl 1GiB shared; 0/257: 0 excl 1GiB shared

  4. 0/256: 512MiB excl 512MiB shared; 0/257: 512MiB excl 512MiB shared

  5. 0/256: 16KiB excl 0 shared; 0/257: 1GiB excl 0 shared

Whereas under squotas, the accounting would look like:

  1. 0/256: 16KiB excl 16KiB shared

  2. 0/256: 1GiB excl 1GiB shared

  3. 0/256: 1GiB excl 1GiB shared; 0/257: 16KiB excl 16KiB shared

  4. 0/256: 1GiB excl 1GiB shared; 0/257: 512MiB excl 512MiB shared

  5. 0/256: 512MiB excl 512MiB shared; 0/257: 512MiB excl 512MiB shared

Note that since the original snapshotted 512MiB are still referenced by 257, they cannot be freed from 256, even after 256 is emptied, or even deleted.
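
Steps 2 to 5 of the example can be replayed with a small Python model of both accounting schemes. This is a rough sketch under simplifying assumptions: the data is modelled as three hypothetical 512MiB extents A, B and C, sizes are in MiB, and the 16KiB root node of each subvolume is ignored, so the sketch shows 0 where the tables above show 16KiB.

```python
size = {"A": 512, "B": 512, "C": 512}   # MiB
owner = {"A": 256, "B": 256, "C": 257}  # subvolume that first allocated it


def qgroup_view(refs):
    """Full qgroups: (exclusive, shared) MiB per subvolume."""
    view = {}
    for subvol, extents in refs.items():
        excl = sum(
            size[e] for e in extents
            if all(e not in other for s, other in refs.items() if s != subvol)
        )
        view[subvol] = (excl, sum(size[e] for e in extents) - excl)
    return view


def squota_view(refs):
    """Simple quotas: each live extent is charged to its first allocator."""
    live = set().union(*refs.values())
    view = {subvol: 0 for subvol in refs}
    for e in live:
        view[owner[e]] += size[e]
    return view


steps = {
    2: {256: {"A", "B"}},                   # 1GiB written in 256
    3: {256: {"A", "B"}, 257: {"A", "B"}},  # snapshot 256 as 257
    4: {256: {"A", "B"}, 257: {"A", "C"}},  # 257 COWs B into C
    5: {256: set(), 257: {"A", "C"}},       # 256 emptied
}

for step, refs in steps.items():
    print(step, "qgroups:", qgroup_view(refs), "squotas:", squota_view(refs))
```

In step 5 the squota model still charges 512MiB to 256, because extent A was first allocated there and is kept alive by 257.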

Summary

If you want some of the power and flexibility of quotas for tracking and limiting subvolume usage, but want to avoid the performance penalty of accurately tracking extent ownership life cycles, then squotas can be a useful option.

Furthermore, squotas are targeted at use cases where the original extent is immutable, like image snapshotting for container startup, in which case we avoid the awkward scenarios where a subvolume is empty or deleted but still has significant extents accounted to it. However, as long as you are aware of the accounting semantics, squotas can also handle mutable original extents.