XFS Self Describing Metadata

1 Introduction

The largest scalability problem facing XFS is not one of algorithmic scalability, but of verification of the filesystem structure. Scalability of the structures and indexes on disk and the algorithms for iterating them are adequate for supporting PB scale filesystems with billions of inodes, however it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all other metadata structures need to be discovered by walking the filesystem structure in different ways. While this is already done by userspace tools for validating and repairing the structure, there are limits to what they can verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of scripting to analyse the structure of a 100TB filesystem when trying to determine the root cause of a corruption problem, but it is still mainly a manual task of verifying that things like single bit errors or misplaced writes weren't the ultimate cause of a corruption event. It may take a few hours to a few days to perform such forensic analysis, so at this scale root cause analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata to analyse and so that analysis blows out towards weeks/months of forensic work. Most of the analysis work is slow and tedious, so as the amount of analysis goes up, the more likely that the cause will be lost in the noise. Hence the primary concern for supporting PB scale filesystems is minimising the time and effort required for basic forensic analysis of the filesystem structure.

2 Self Describing Metadata

One of the problems with the current metadata format is that apart from the magic number in the metadata block, we have no other way of identifying what it is supposed to be. We can’t even identify if it is the right place. Put simply, you can’t look at a single metadata block in isolation and say “yes, it is supposed to be there and the contents are valid”.

Hence most of the time spent on forensic analysis is spent doing basic verification of metadata values, looking for values that are in range (and hence not detected by automated verification checks) but are not correct. Finding and understanding how things like cross-linked block lists (e.g. sibling pointers in a btree ending up with loops in them) came about is the key to understanding what went wrong, but it is impossible to tell after the fact what order the blocks were linked into each other or written to disk.

Hence we need to record more information into the metadata to allow us to quickly determine if the metadata is intact and can be ignored for the purpose of analysis. We can’t protect against every possible type of error, but we can ensure that common types of errors are easily detectable. Hence the concept of self describing metadata.

The first, fundamental requirement of self describing metadata is that the metadata object contains some form of unique identifier in a well known location. This allows us to identify the expected contents of the block and hence parse and verify the metadata object. If we can't independently identify the type of metadata in the object, then the metadata doesn't describe itself very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the AGFL, remote symlinks and remote attribute blocks do not contain identifying magic numbers. Hence we can change the on-disk format of all these objects to add more identifying information and detect this simply by changing the magic numbers in the metadata objects. That is, if it has the current magic number, the metadata isn’t self identifying. If it contains a new magic number, it is self identifying and we can do much more expansive automated verification of the metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall integrity checking. We cannot trust the metadata if we cannot verify that it has not been changed as a result of external influences. Hence we need some form of integrity check, and this is done by adding CRC32c validation to the metadata block. If we can verify the block contains the metadata it was intended to contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and hence a 32 bit CRC is more than sufficient to detect multi-bit errors in metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is fast. So while CRC32c is not the strongest of possible integrity checks that could be used, it is more than sufficient for our needs and has relatively little overhead. Adding support for larger integrity fields and/or algorithms does not really provide any extra value over CRC32c, but it does add a lot of complexity and so there is no provision for changing the integrity checking mechanism.

Self describing metadata needs to contain enough information so that the metadata block can be verified as being in the correct place without needing to look at any other metadata. This means it needs to contain location information. Just adding a block number to the metadata is not sufficient to protect against mis-directed writes - a write might be misdirected to the wrong LUN and so be written to the “correct block” of the wrong filesystem. Hence location information must contain a filesystem identifier as well as a block number.

Another key information point in forensic analysis is knowing who the metadata block belongs to. We already know the type, the location, that it is valid and/or corrupted, and how long ago it was last modified. Knowing the owner of the block is important as it allows us to find other related metadata to determine the scope of the corruption. For example, if we have an extent btree object, we don't know what inode it belongs to and hence have to walk the entire filesystem to find the owner of the block. Worse, the corruption could mean that no owner can be found (i.e. it's an orphan block), and so without an owner field in the metadata we have no idea of the scope of the corruption. If we have an owner field in the metadata object, we can immediately do top down validation to determine the scope of the problem.

Different types of metadata have different owner identifiers. For example, directory, attribute and extent tree blocks are all owned by an inode, while freespace btree blocks are owned by an allocation group. Hence the size and contents of the owner field are determined by the type of metadata object we are looking at. The owner information can also identify misplaced writes (e.g. freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was written to the filesystem. One of the key information points when doing forensic analysis is how recently the block was modified. Correlation of a set of corrupted metadata blocks based on modification times is important as it can indicate whether the corruptions are related, whether there's been multiple corruption events that lead to the eventual failure, and even whether there are corruptions present that the run-time verification is not detecting.

For example, we can determine whether a metadata object is supposed to be free space, or whether it should still be allocated and referenced by its owner, by comparing when the free space btree block that covers the block was last written with when the metadata object itself was last written. If the free space block is more recent than the object and the object's owner, then there is a very good chance that the block should have been removed from the owner.

To provide this “written timestamp”, each metadata block gets the Log Sequence Number (LSN) of the most recent transaction it was modified on written into it. This number will always increase over the life of the filesystem, and the only thing that resets it is running xfs_repair on the filesystem. Further, by use of the LSN we can tell if the corrupted metadata all belonged to the same log checkpoint and hence have some idea of how much modification occurred between the first and last instance of corrupt metadata on disk and, further, how much modification occurred between the corruption being written and when it was detected.

3 Runtime Validation

Validation of self-describing metadata takes place at runtime in two places:

  • immediately after a successful read from disk
  • immediately prior to write IO submission

The verification is completely stateless - it is done independently of the modification process, and seeks only to check that the metadata is what it says it is and that the metadata fields are within bounds and internally consistent. As such, we cannot catch all types of corruption that can occur within a block as there may be certain limitations that operational state enforces on the metadata, or there may be corruption of interblock relationships (e.g. corrupted sibling pointer lists). Hence we still need stateful checking in the main code body, but in general most of the per-field validation is handled by the verifiers.

For read verification, the caller needs to specify the expected type of metadata that it should see, and the IO completion process verifies that the metadata object matches what was expected. If the verification process fails, then it marks the object being read as EFSCORRUPTED. The caller needs to catch this error (same as for IO errors), and if it needs to take special action due to a verification error it can do so by catching the EFSCORRUPTED error value. If we need more discrimination of error type at higher levels, we can define new error numbers for different errors as necessary.

The first step in read verification is checking the magic number and determining whether CRC validating is necessary. If it is, the CRC32c is calculated and compared against the value stored in the object itself. Once this is validated, further checks are made against the location information, followed by extensive object specific metadata validation. If any of these checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object is extensively verified and if it is OK we then update the LSN from the last modification made to the object. After this, we calculate the CRC and insert it into the object. Once this is done the write IO is allowed to continue. If any error occurs during this process, the buffer is again marked with a EFSCORRUPTED error for the higher layers to catch.

4 Structures

A typical on-disk structure needs to contain the following information:

struct xfs_ondisk_hdr {
        __be32  magic;              /* magic number */
        __be32  crc;                /* CRC, not logged */
        uuid_t  uuid;               /* filesystem identifier */
        __be64  owner;              /* parent object */
        __be64  blkno;              /* location on disk */
        __be64  lsn;                /* last modification in log, not logged */
};

Depending on the metadata, this information may be part of a header structure separate to the metadata contents, or may be distributed through an existing structure. The latter occurs with metadata that already contains some of this information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same level of information is generally provided. For example:

  • short btree blocks have a 32 bit owner (ag number) and a 32 bit block number for location. The two of these combined provide the same information as @owner and @blkno in the above structure, but using 8 bytes less space on disk.
  • directory/attribute node blocks have a 16 bit magic number, and the header that contains the magic number has other information in it as well. Hence the additional metadata headers change the overall format of the metadata.

A typical buffer read verifier is structured as follows:

#define XFS_FOO_CRC_OFF             offsetof(struct xfs_ondisk_hdr, crc)

static void
xfs_foo_read_verify(
        struct xfs_buf      *bp)
{
        struct xfs_mount    *mp = bp->b_mount;

        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
            !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
                                        XFS_FOO_CRC_OFF)) ||
            !xfs_foo_verify(bp)) {
                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                xfs_buf_ioerror(bp, EFSCORRUPTED);
        }
}

The code ensures that the CRC is only checked if the filesystem has CRCs enabled, by checking the superblock for the feature bit, and then if the CRC verifies OK (or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on whether the magic number can be used to determine the format of the block. In the case it can’t, the code is structured as follows:

static bool
xfs_foo_verify(
        struct xfs_buf              *bp)
{
        struct xfs_mount    *mp = bp->b_mount;
        struct xfs_ondisk_hdr       *hdr = bp->b_addr;

        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                return false;

        if (!xfs_sb_version_hascrc(&mp->m_sb)) {
                if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                        return false;
                if (bp->b_bn != be64_to_cpu(hdr->blkno))
                        return false;
                if (hdr->owner == 0)
                        return false;
        }

        /* object specific verification checks here */

        return true;
}

If there are different magic numbers for the different formats, the verifier will look like:

static bool
xfs_foo_verify(
        struct xfs_buf              *bp)
{
        struct xfs_mount    *mp = bp->b_mount;
        struct xfs_ondisk_hdr       *hdr = bp->b_addr;

        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
                if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                        return false;
                if (bp->b_bn != be64_to_cpu(hdr->blkno))
                        return false;
                if (hdr->owner == 0)
                        return false;
        } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                return false;

        /* object specific verification checks here */

        return true;
}

Write verifiers are very similar to the read verifiers; they just do things in the opposite order to the read verifiers. A typical write verifier:

static void
xfs_foo_write_verify(
        struct xfs_buf      *bp)
{
        struct xfs_mount    *mp = bp->b_mount;
        struct xfs_buf_log_item     *bip = bp->b_fspriv;

        if (!xfs_foo_verify(bp)) {
                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                xfs_buf_ioerror(bp, EFSCORRUPTED);
                return;
        }

        if (!xfs_sb_version_hascrc(&mp->m_sb))
                return;

        if (bip) {
                struct xfs_ondisk_hdr       *hdr = bp->b_addr;
                hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
        }
        xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
}

This will verify the internal structure of the metadata before we go any further, detecting corruptions that have occurred as the metadata has been modified in memory. If the metadata verifies OK, and CRCs are enabled, we then update the LSN field (when it was last modified) and calculate the CRC on the metadata. Once this is done, we can issue the IO.

5 Inodes and Dquots

Inodes and dquots are special snowflakes. They have per-object CRC and self-identifiers, but they are packed so that there are multiple objects per buffer. Hence we do not use per-buffer verifiers to do the work of per-object verification and CRC calculations. The per-buffer verifiers simply perform basic identification of the buffer - that they contain inodes or dquots, and that there are magic numbers in all the expected spots. All further CRC and verification checks are done when each inode is read from or written back to the buffer.

The structure of the verifiers and the identifiers checks is very similar to the buffer code described above. The only difference is where they are called. For example, inode read verification is done in xfs_inode_from_disk() when the inode is first read out of the buffer and the struct xfs_inode is instantiated. The inode is already extensively verified during writeback in xfs_iflush_int, so the only addition here is to add the LSN and CRC to the inode as it is copied back into the buffer.

XXX: inode unlinked list modification doesn’t recalculate the inode CRC! None of the unlinked list modifications check or update CRCs, neither during unlink nor log recovery. So, it’s gone unnoticed until now. This won’t matter immediately - repair will probably complain about it - but it needs to be fixed.
