Troubleshooting pages

System messages printed to the log (dmesg, syslog, journal) have limited space for description, so they may need further explanation of what happened and what needs to be done.

Error: parent transid verify error

Reason: result of a failed internal consistency check of the filesystem’s metadata. Type: permanent

[ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8

The b-tree nodes are linked together: a block pointer in the parent node contains the target block offset and the generation that last changed this block. When the target block is read, its block address and generation are verified against the values stored in the pointer. This check is done on all tree levels.

The number in failed on 30736384 is the logical block number, wanted 10 is the expected generation number stored in the parent node, and found 8 is the generation found in the target block. The difference between the two generations can give a hint about when the problem could have happened, in terms of transaction commits.
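The fields described above can be pulled out of a log line with standard tools. The following is a minimal sketch, assuming the message format shown above (other kernel versions may word the message differently); the helper name parse_transid is hypothetical.

```shell
# Hypothetical helper: print "<logical> <wanted> <found>" for a
# "parent transid verify failed" log line, or nothing if it does not match.
parse_transid() {
    # $1: one log line, e.g. copied from dmesg
    printf '%s\n' "$1" | sed -n -E \
        's/.*parent transid verify failed on ([0-9]+) wanted ([0-9]+) found ([0-9]+).*/\1 \2 \3/p'
}

# Sample line taken from the message quoted above.
line='[ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8'

read -r logical wanted found <<EOF
$(parse_transid "$line")
EOF

# The target block is (wanted - found) transaction commits behind its parent.
echo "logical=$logical wanted=$wanted found=$found behind=$((wanted - found))"
# prints: logical=30736384 wanted=10 found=8 behind=2
```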

Once the mismatched generations are stored on the device, the error is permanent and cannot be easily recovered from, because information has been lost. The recovery tool btrfs restore is able to ignore the errors and attempt to restore the data, but due to the inconsistency in the metadata the restored data need to be verified by the user.

The root cause of the error cannot be easily determined. Possible reasons are:

  • logical bug: filesystem structures haven’t been properly updated and stored

  • misdirected write: the underlying storage does not store the data at the expected address and overwrites some other block

  • storage device (hardware or emulated) does not properly flush and persist data between transactions, so they get mixed up

  • lost write without proper error handling: writing the block succeeded as seen from the filesystem layer, but a problem on the lower layers was not propagated upwards

Error: No space left on device (ENOSPC)

Type: transient

Space handling on a COW filesystem is tricky, especially in combination with delayed allocation, dynamic chunk allocation and parallel data updates. There are several reasons why ENOSPC might get reported, and there is no single cause or solution. The space reservation algorithms try to assign the space fairly, fall back to heuristics, or block writes until enough data are persisted, which may make old copies available.

The most obvious way to exhaust space is to write to a file until the data chunks are full:

$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda        4.0G  3.6M  2.0G   1% /mnt/

$ cat /dev/zero > file
cat: write error: No space left on device

$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        4.0G  2.0G     0 100% /mnt/data250

$ btrfs fi df .
Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B

The data chunks have been exhausted, so there’s really no space left to write data to. The metadata chunks still have free space, but it can’t be used for data.
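A quick way to spot this situation is to look for chunk types whose total and used values are the same. The following is a rough sketch only: it compares the rounded strings printed by btrfs fi df, it does not account for unallocated device space, and the helper name full_chunk_types is hypothetical.

```shell
# Hypothetical helper: read `btrfs filesystem df` output on stdin and print
# the block group types whose allocated chunks are reported as fully used.
full_chunk_types() {
    sed -n -E 's/^([A-Za-z]+), [^:]+: total=([^,]+), used=(.+)$/\1 \2 \3/p' |
    while read -r type total used; do
        # Heuristic: identical rounded strings mean "no room left in chunks".
        if [ "$total" = "$used" ]; then
            echo "$type"
        fi
    done
}

# Sample input copied from the output above.
sample='Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B'

printf '%s\n' "$sample" | full_chunk_types
# prints: Data
```

On a live system the same helper could be fed directly, e.g. btrfs filesystem df /mnt | full_chunk_types, where /mnt stands for the mount point.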

Metadata space got exhausted

When metadata space is exhausted, the filesystem cannot track new data extents and cannot create inline files, reflinks or xattrs. Deletion still works.

Balance does not have enough workspace

Relocation of block groups requires temporary work space, i.e. an area on the device that is available to the filesystem but not occupied by any existing block group. Before balance starts, a check is performed to verify that the requested action is possible. If it is not, ENOSPC is returned.

Error: unable to start balance with target metadata profile

unable to start balance with target metadata profile 32

This means that a conversion from profile RAID1 to dup has been attempted with btrfs-progs earlier than version 4.7. Update btrfs-progs and you’ll be able to do the conversion.
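Whether the installed tools are older than 4.7 can be checked from the version string. The following is a minimal sketch, assuming sort supports the -V (version sort) option; the helper name version_lt is hypothetical, and a sample string stands in for the first line printed by btrfs --version.

```shell
# Hypothetical helper: succeed when version $1 is strictly older than $2.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# On a real system: sample=$(btrfs --version | head -n 1)
sample='btrfs-progs v4.6.1'

# Strip everything up to and including the last "v" to get the number.
ver=${sample##*v}

if version_lt "$ver" 4.7; then
    echo "btrfs-progs $ver is older than 4.7, update before converting RAID1 to dup"
fi
```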

Error: balance will reduce metadata integrity

The full message in the system log:

balance will reduce metadata integrity, use force if you want this

This means that the conversion will remove a degree of metadata redundancy, for example when going from profile RAID1 or dup to single. The force option -f must be passed to btrfs balance start.
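As an illustration, the sketch below only prints the command line for such a conversion instead of running it. The helper name balance_convert_cmd and the mount point /mnt are assumptions made for the example.

```shell
# Hypothetical helper: print (not run) a balance command that converts the
# metadata profile. $1: target profile, $2: mount point, $3: "force" adds -f.
balance_convert_cmd() {
    if [ "${3:-}" = "force" ]; then
        echo "btrfs balance start -f -mconvert=$1 $2"
    else
        echo "btrfs balance start -mconvert=$1 $2"
    fi
}

# Going to single reduces redundancy, so the force option is included.
balance_convert_cmd single /mnt force
# prints: btrfs balance start -f -mconvert=single /mnt
```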

How to clean old super block

The preferred way is to use the wipefs utility that is part of the util-linux package. Running the command with only the device argument does not destroy any data, it just lists the detected filesystem signatures:

# wipefs /dev/sda
offset               type
----------------------------------------------------------------
0x10040              btrfs   [filesystem]
                     UUID:  7760469b-1704-487e-9b96-7d7a57d218a5

Remove the filesystem signature at a given offset or wipe all recognized signatures on the device:

# wipefs -o 0x10040 /dev/sda
8 bytes [5f 42 48 52 66 53 5f 4d] erased at offset 0x10040 (btrfs)

# wipefs -a /dev/sda
8 bytes [5f 42 48 52 66 53 5f 4d] erased at offset 0x10040 (btrfs)

Note

The process is reversible: if the 8 bytes are written back, the device is recognized again. See below.

Note

wipefs clears only the first super block. If available, the second and third copies can be used to resurrect the filesystem.

Stale signature on device

A related problem concerns partitioned and unpartitioned devices. For example: btrfs was created on /dev/sda a long time ago and, after some changes, the filesystem moved to /dev/sda1, leaving a stale signature on the whole device.

Use wipefs -o 0x10040 on the whole device (i.e. with the offset of the btrfs signature); this won’t touch the partition table.

Manual deletion of super block signature

There are three super blocks: the first one is located at 64KiB, the second one at 64MiB, the third one at 256GiB. The 8-byte signature is stored 64 bytes into each super block. The following lines reset the signature on all three copies:

# dd if=/dev/zero bs=1 count=8 of=/dev/sda seek=$((64*1024+64))
# dd if=/dev/zero bs=1 count=8 of=/dev/sda seek=$((64*1024*1024+64))
# dd if=/dev/zero bs=1 count=8 of=/dev/sda seek=$((256*1024*1024*1024+64))

If you want to restore the super block signatures:

# echo "_BHRfS_M" | dd bs=1 count=8 of=/dev/sda seek=$((64*1024+64))
# echo "_BHRfS_M" | dd bs=1 count=8 of=/dev/sda seek=$((64*1024*1024+64))
# echo "_BHRfS_M" | dd bs=1 count=8 of=/dev/sda seek=$((256*1024*1024*1024+64))
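The offsets and the write/erase cycle can be tried safely on a scratch file instead of a real device. The following is a minimal sketch, assuming GNU coreutils (truncate, mktemp, od); the helper name sig_roundtrip is hypothetical, and the 128KiB image only covers the first super block.

```shell
# Signature offsets: super block offset plus 64 bytes, printed in hex.
for sb in $((64*1024)) $((64*1024*1024)) $((256*1024*1024*1024)); do
    printf '0x%x\n' $((sb + 64))
done
# prints 0x10040, 0x4000040 and 0x4000000040, one per line

# Hypothetical helper: write, read back, erase and re-read the signature
# of the first super block on a temporary image file.
sig_roundtrip() {
    img=$(mktemp) || return 1
    truncate -s 128K "$img"
    off=$((64*1024 + 64))

    # conv=notrunc keeps the rest of the regular file intact.
    printf '_BHRfS_M' | dd of="$img" bs=1 seek="$off" count=8 conv=notrunc 2>/dev/null
    restored=$(dd if="$img" bs=1 skip="$off" count=8 2>/dev/null)

    # Zero the same 8 bytes and dump them as hex.
    dd if=/dev/zero of="$img" bs=1 seek="$off" count=8 conv=notrunc 2>/dev/null
    wiped=$(dd if="$img" bs=1 skip="$off" count=8 2>/dev/null | od -An -tx1 | tr -d ' \n')

    rm -f "$img"
    echo "restored=$restored wiped=$wiped"
}

sig_roundtrip
# prints: restored=_BHRfS_M wiped=0000000000000000
```

The first printed offset, 0x10040, matches the offset reported by wipefs earlier on this page.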

Generic errors, errno

Note that there is an established text message for each error, though the errors are used in a broader sense (e.g. the message mentions a file but the error can be relevant for another structure). The title of each section uses a nonstandard meaning that is perhaps more suitable for a filesystem.

ENOENT (No such entry)

A common error, “no such entry”. In general it may mean that some structure hasn’t been found, e.g. an entry in some in-memory tree. This becomes a critical problem when the entry is expected to exist because of the consistency of the structures.

ENOMEM (Not enough memory)

Memory allocation error. In many cases the error is recoverable and the operation can be restarted after it’s reported to userspace. In critical contexts, like when a transaction needs to be committed, the error is not recoverable and leads to flipping the filesystem to read-only. Such cases are rare under normal conditions. Memory can be artificially limited, e.g. by cgroups, which may trigger the condition. This is useful for testing, but any real workload should have resources scaled accordingly.

EINVAL (Invalid argument)

This is typically returned from ioctl when a parameter is invalid, e.g. a value outside the expected range, an unrecognized bit flag, or a combination of input parameters that does not make sense. Errors are typically recoverable.

EUCLEAN (Filesystem corrupted)

The text of the message, “Structure needs cleaning”, is confusing; in reality this error is used to describe a severe corruption condition. The reason for the corruption is unknown at this point, but some constraint or condition has been violated and the filesystem driver can’t do much. In practice such errors can be observed on fuzzed images, faulty hardware or bad interaction with other parts of the operating system.

EIO (Input/output error)

“Input/output error”, typically returned as an error from a device that was unable to read data or finish a write. Checksum errors also lead to EIO: there isn’t an established error for checksum validation errors, although some filesystems use EBADMSG for that.

EEXIST (Object already exists)

ENOSPC (No space left)

EOPNOTSUPP (Operation not supported)

TODO

Transient

  • ENOSPC

  • operation cannot be done

Possibly both

  • checksum errors from changes made to the medium behind the filesystem’s back

  • transient because of direct io

  • stored from faulty data in memory