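ext4 Data Structures and Algorithms
1 About this Book
This document attempts to describe the on-disk format for ext4 filesystems. The same general ideas should apply to ext2/3 filesystems as well, though they do not support all the features that ext4 supports, and the fields will be shorter.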
NOTE: This is a work in progress, based on notes that the author (djwong) made while picking apart a filesystem by hand. The data structure definitions should be current as of Linux 4.18 and e2fsprogs-1.44. All comments and corrections are welcome, since there is undoubtedly plenty of lore that might not be reflected in freshly created demonstration filesystems.
1.1 License
This book is licensed under the terms of the GNU General Public License, v2.
1.2 Terminology
ext4 divides a storage device into an array of logical blocks both to reduce bookkeeping overhead and to increase throughput by forcing larger transfer sizes. Generally, the block size will be 4KiB (the same size as pages on x86 and the block layer’s default block size), though the actual size is calculated as 2 ^ (10 + sb.s_log_block_size) bytes. Throughout this document, disk locations are given in terms of these logical blocks, not raw LBAs, and not 1024-byte blocks. For the sake of convenience, the logical block size will be referred to as $block_size throughout the rest of the document.
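As a quick illustration of the block size formula, here is a minimal Python sketch. The function name is made up for this example; only the s_log_block_size field comes from the superblock.

```python
def block_size_bytes(s_log_block_size: int) -> int:
    """Return the logical block size, 2 ^ (10 + s_log_block_size) bytes."""
    return 1 << (10 + s_log_block_size)


# Values 0, 1, 2 and 6 correspond to 1KiB, 2KiB, 4KiB and 64KiB blocks.
for log_size in (0, 1, 2, 6):
    print(log_size, block_size_bytes(log_size))
```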
When referenced in preformatted text blocks, sb refers to fields in the super block, and inode refers to fields in an inode table entry.
1.3 Other References
Also see https://www.nongnu.org/ext2-doc/ for quite a collection of information about ext2/3. Here’s another old reference: http://wiki.osdev.org/Ext2
2 High Level Design
An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file’s blocks within the same group, thereby reducing seek times. The size of a block group is specified in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group.
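The block group arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and rounding the group count up (so that a trailing partial group is counted) is an assumption of this sketch rather than something stated above.

```python
def block_group_geometry(device_bytes: int, block_size: int):
    """Return (blocks_per_group, group_bytes, group_count) for the default layout."""
    blocks_per_group = 8 * block_size            # default: 8 * block_size_in_bytes
    group_bytes = blocks_per_group * block_size
    # Assumption: a partial group at the end of the device still counts.
    group_count = -(-device_bytes // group_bytes)
    return blocks_per_group, group_bytes, group_count


# A 1TiB device with 4KiB blocks: 32,768 blocks per 128MiB group.
print(block_group_geometry(1 << 40, 4096))
```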
All fields in ext4 are written to disk in little-endian order. HOWEVER, all fields in jbd2 (the journal) are written to disk in big-endian order.
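To make the byte-order difference concrete, the sketch below decodes the same four bytes both ways with Python's struct module. The helper names are invented for this example.

```python
import struct


def read_le32(raw: bytes, offset: int = 0) -> int:
    """Decode a 32-bit field the way ext4 stores it (little-endian)."""
    return struct.unpack_from("<I", raw, offset)[0]


def read_be32(raw: bytes, offset: int = 0) -> int:
    """Decode a 32-bit field the way jbd2 stores it (big-endian)."""
    return struct.unpack_from(">I", raw, offset)[0]


raw = bytes([0x53, 0xEF, 0x00, 0x00])
print(hex(read_le32(raw)), hex(read_be32(raw)))
```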
2.1 Blocks
ext4 allocates storage space in units of “blocks”. A block is a group of sectors between 1KiB and 64KiB, and the number of sectors must be an integral power of 2. Blocks are in turn grouped into larger units called block groups. Block size is specified at mkfs time and typically is 4KiB. You may experience mounting problems if block size is greater than page size (i.e. 64KiB blocks on an i386 which only has 4KiB memory pages). By default a filesystem can contain 2^32 blocks; if the ‘64bit’ feature is enabled, then a filesystem can have 2^64 blocks. The location of structures is stored in terms of the block number the structure lives in and not the absolute offset on disk.
For 32-bit filesystems, limits are as follows:
Item | 1KiB | 2KiB | 4KiB | 64KiB |
---|---|---|---|---|
Blocks | 2^32 | 2^32 | 2^32 | 2^32 |
Inodes | 2^32 | 2^32 | 2^32 | 2^32 |
File System Size | 4TiB | 8TiB | 16TiB | 256TiB |
Blocks Per Block Group | 8,192 | 16,384 | 32,768 | 524,288 |
Inodes Per Block Group | 8,192 | 16,384 | 32,768 | 524,288 |
Block Group Size | 8MiB | 32MiB | 128MiB | 32GiB |
Blocks Per File, Extents | 2^32 | 2^32 | 2^32 | 2^32 |
Blocks Per File, Block Maps | 16,843,020 | 134,480,396 | 1,074,791,436 | 4,398,314,962,956 (really 2^32 due to field size limitations) |
File Size, Extents | 4TiB | 8TiB | 16TiB | 256TiB |
File Size, Block Maps | 16GiB | 256GiB | 4TiB | 256TiB |
For 64-bit filesystems, limits are as follows:
Item | 1KiB | 2KiB | 4KiB | 64KiB |
---|---|---|---|---|
Blocks | 2^64 | 2^64 | 2^64 | 2^64 |
Inodes | 2^32 | 2^32 | 2^32 | 2^32 |
File System Size | 16ZiB | 32ZiB | 64ZiB | 1YiB |
Blocks Per Block Group | 8,192 | 16,384 | 32,768 | 524,288 |
Inodes Per Block Group | 8,192 | 16,384 | 32,768 | 524,288 |
Block Group Size | 8MiB | 32MiB | 128MiB | 32GiB |
Blocks Per File, Extents | 2^32 | 2^32 | 2^32 | 2^32 |
Blocks Per File, Block Maps | 16,843,020 | 134,480,396 | 1,074,791,436 | 4,398,314,962,956 (really 2^32 due to field size limitations) |
File Size, Extents | 4TiB | 8TiB | 16TiB | 256TiB |
File Size, Block Maps | 16GiB | 256GiB | 4TiB | 256TiB |
Note: Files not using extents (i.e. files using block maps) must be placed within the first 2^32 blocks of a filesystem. Files with extents must be placed within the first 2^48 blocks of a filesystem. It’s not clear what happens with larger filesystems.
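Several rows of the two tables above can be reproduced with a little arithmetic. The sketch below is only an illustration. The block-map formula (12 direct pointers plus single, double and triple indirect blocks holding 4-byte block numbers) is the classic ext2/3 scheme and is not spelled out in this section, so treat it as an assumption; the function names are invented.

```python
def fs_size_limit(block_size: int, block_bits: int = 32) -> int:
    """Maximum filesystem size in bytes: 2^block_bits blocks of block_size bytes."""
    return (1 << block_bits) * block_size


def block_map_blocks(block_size: int) -> int:
    """Blocks addressable through a block map (assumed 12 direct + 3 indirect levels)."""
    per_block = block_size // 4       # 4-byte block numbers per indirect block
    return 12 + per_block + per_block ** 2 + per_block ** 3


for size in (1024, 2048, 4096, 65536):
    print(size, fs_size_limit(size) >> 40, "TiB", block_map_blocks(size))
```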
2.2 Layout
The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):
Group 0 Padding | ext4 Super Block | Group Descriptors | Reserved GDT Blocks | Data Block Bitmap | inode Bitmap | inode Table | Data Blocks |
---|---|---|---|---|---|---|---|
1024 bytes | 1 block | many blocks | many blocks | 1 block | 1 block | many blocks | many more blocks |
For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if for some reason the block size = 1024, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
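The rule for where the primary superblock lands can be written as a one-line division. This is a small sketch with a made-up function name; the 1024-byte offset is the one given in the paragraph above.

```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of the device


def superblock_location(block_size: int):
    """Return (block number, byte offset within that block) of the primary superblock."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


print(superblock_location(1024))  # 1KiB blocks: superblock fills block 1
print(superblock_location(4096))  # 4KiB blocks: superblock sits inside block 0
```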
The ext4 driver primarily works with the superblock and the group descriptors that are found in block group 0. Redundant copies of the superblock and group descriptors are written to some of the block groups across the disk in case the beginning of the disk gets trashed, though not all block groups necessarily host a redundant copy (see following paragraph for more details). If the group does not have a redundant copy, the block group begins with the data block bitmap. Note also that when the filesystem is freshly formatted, mkfs will allocate “reserve GDT block” space after the block group descriptors and before the start of the block bitmaps to allow for future expansion of the filesystem. By default, a filesystem is allowed to increase in size by a factor of 1024x over the original filesystem size.
The location of the inode table is given by grp.bg_inode_table_*. It is a contiguous range of blocks large enough to contain sb.s_inodes_per_group * sb.s_inode_size bytes.
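For a feel of how large the inode table is, the following sketch converts the byte count into blocks. The function name and the sample numbers are illustrative only.

```python
def inode_table_blocks(inodes_per_group: int, inode_size: int, block_size: int) -> int:
    """Blocks needed to hold s_inodes_per_group * s_inode_size bytes, rounded up."""
    table_bytes = inodes_per_group * inode_size
    return -(-table_bytes // block_size)


# Example: 8,192 inodes of 256 bytes each on a 4KiB-block filesystem.
print(inode_table_blocks(8192, 256, 4096))
```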
As for the ordering of items in a block group, it is generally established that the super block and the group descriptor table, if present, will be at the beginning of the block group. The bitmaps and the inode table can be anywhere, and it is quite possible for the bitmaps to come after the inode table, or for both to be in different groups (flex_bg). Leftover space is used for file data blocks, indirect block maps, extent tree blocks, and extended attributes.
2.3 Flexible Block Groups
Starting in ext4, there is a new feature called flexible block groups (flex_bg). In a flex_bg, several block groups are tied together as one logical block group; the bitmap spaces and the inode table space in the first block group of the flex_bg are expanded to include the bitmaps and inode tables of all other block groups in the flex_bg. For example, if the flex_bg size is 4, then group 0 will contain (in order) the superblock, group descriptors, data block bitmaps for groups 0-3, inode bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining space in group 0 is for file data. The effect of this is to group the block group metadata close together for faster loading, and to enable large files to be contiguous on disk. Backup copies of the superblock and group descriptors are always at the beginning of block groups, even if flex_bg is enabled. The number of block groups that make up a flex_bg is given by 2 ^ sb.s_log_groups_per_flex.
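A rough sketch of the flex_bg arithmetic follows. It assumes that flex groups are consecutive, aligned runs of block groups starting at group 0, which matches the "groups 0-3" example; the function names are hypothetical.

```python
def groups_per_flex(s_log_groups_per_flex: int) -> int:
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(group: int, s_log_groups_per_flex: int) -> int:
    """Index of the flex_bg that a block group belongs to (assumed aligned runs)."""
    return group >> s_log_groups_per_flex


def first_group_in_flex(group: int, s_log_groups_per_flex: int) -> int:
    """Block group expected to hold the packed bitmaps and inode tables."""
    return flex_group_of(group, s_log_groups_per_flex) << s_log_groups_per_flex


print(groups_per_flex(2), flex_group_of(6, 2), first_group_in_flex(6, 2))
```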
2.4 Meta Block Groups
Without the option META_BG, for safety concerns, all block group descriptor copies are kept in the first block group. Given the default 128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4 can have at most 2^27/64 = 2^21 block groups. This limits the entire filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB.
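The limit quoted above is simple enough to check mechanically. In this sketch (function names invented), the whole descriptor table is assumed to fit inside one block group, as the paragraph describes.

```python
def max_groups_without_meta_bg(group_bytes: int, desc_size: int) -> int:
    """Descriptors that fit in one block group when all of them live in group 0."""
    return group_bytes // desc_size


def max_fs_bytes_without_meta_bg(group_bytes: int, desc_size: int) -> int:
    """Resulting upper bound on the filesystem size in bytes."""
    return max_groups_without_meta_bg(group_bytes, desc_size) * group_bytes


print(max_groups_without_meta_bg(1 << 27, 64))
print(max_fs_bytes_without_meta_bg(1 << 27, 64) >> 40, "TiB")
```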
The solution to this problem is to use the metablock group feature (META_BG), which is already in ext3 for all 2.6 releases. With the META_BG feature, ext4 filesystems are partitioned into many metablock groups. Each metablock group is a cluster of block groups whose group descriptor structures can be stored in a single disk block. For ext4 filesystems with 4 KB block size, a single metablock group partition includes 64 block groups, or 8 GiB of disk space. The metablock group feature moves the location of the group descriptors from the congested first block group of the whole filesystem into the first group of each metablock group itself. The backups are in the second and last group of each metablock group. This increases the 2^21 maximum block groups limit to the hard limit 2^32, allowing support for a 512PiB filesystem.
The change in the filesystem format replaces the current scheme where the superblock is followed by a variable-length set of block group descriptors. Instead, the superblock and a single block group descriptor block are placed at the beginning of the first, second, and last block groups in a meta-block group. A meta-block group is a collection of block groups which can be described by a single block group descriptor block. Since the size of the block group descriptor structure is 32 bytes, a meta-block group contains 32 block groups for filesystems with a 1KB block size, and 128 block groups for filesystems with a 4KB blocksize. Filesystems can either be created using this new block group descriptor layout, or existing filesystems can be resized on-line, and the field s_first_meta_bg in the superblock will indicate the first block group using this new layout.
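The meta-block group sizes mentioned above, and the three groups that carry a descriptor block, can be sketched as follows. The function names are hypothetical, and the sketch ignores the case where the last meta-block group of a filesystem is only partially populated.

```python
def groups_per_meta_bg(block_size: int, desc_size: int) -> int:
    """Block groups described by one descriptor block."""
    return block_size // desc_size


def meta_bg_descriptor_groups(meta_bg_index: int, block_size: int, desc_size: int):
    """Return the first, second and last block group of a meta-block group."""
    count = groups_per_meta_bg(block_size, desc_size)
    first = meta_bg_index * count
    return first, first + 1, first + count - 1


print(groups_per_meta_bg(1024, 32), groups_per_meta_bg(4096, 32), groups_per_meta_bg(4096, 64))
print(meta_bg_descriptor_groups(1, 4096, 64))
```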
Please see an important note about BLOCK_UNINIT in the section about block and inode bitmaps.
2.5 Lazy Block Group Initialization
A new feature for ext4 is a set of three block group descriptor flags that enable mkfs to skip initializing other parts of the block group metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean that the inode and block bitmaps for that group can be calculated and therefore the on-disk bitmap blocks are not initialized. This is generally the case for an empty block group or a block group containing only fixed-location block group metadata. The INODE_ZEROED flag means that the inode table has been initialized; mkfs will unset this flag and rely on the kernel to initialize the inode tables in the background.
By not writing zeroes to the bitmaps and inode table, mkfs time is reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM, but the dumpe2fs output prints this as “uninit_bg”. They are the same thing.
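The three flags can be modelled as a small bit set. The numeric values below (0x1, 0x2, 0x4) are not given in this section; they are what the author of this sketch recalls from the kernel headers, so verify them before relying on them. The class and function names are invented.

```python
from enum import IntFlag


class BgFlags(IntFlag):
    # Assumed values; check against the kernel's ext4 headers.
    INODE_UNINIT = 0x1   # inode bitmap can be calculated, not read from disk
    BLOCK_UNINIT = 0x2   # block bitmap can be calculated, not read from disk
    INODE_ZEROED = 0x4   # inode table has been initialized


def describe_bg_flags(bg_flags: int) -> dict:
    """Summarize which pieces of group metadata are valid on disk."""
    flags = BgFlags(bg_flags)
    return {
        "inode_bitmap_on_disk": BgFlags.INODE_UNINIT not in flags,
        "block_bitmap_on_disk": BgFlags.BLOCK_UNINIT not in flags,
        "inode_table_zeroed": BgFlags.INODE_ZEROED in flags,
    }


print(describe_bg_flags(0x3))
```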
2.6 Special inodes
ext4 reserves some inodes for special features, as follows:
inode Number | Purpose |
---|---|
0 | Doesn’t exist; there is no inode 0. |
1 | List of defective blocks. |
2 | Root directory. |
3 | User quota. |
4 | Group quota. |
5 | Boot loader. |
6 | Undelete directory. |
7 | Reserved group descriptors inode. (“resize inode”) |
8 | Journal inode. |
9 | The “exclude” inode, for snapshots(?) |
10 | Replica inode, used for some non-upstream feature? |
11 | Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock. |
Note that there are also some inodes allocated from non-reserved inode numbers for other filesystem features which are not referenced from the standard directory hierarchy. These are generally referenced from the superblock. They are:
Superblock field | Description |
---|---|
s_lpf_ino | Inode number of lost+found directory. |
s_prj_quota_inum | Inode number of quota file tracking project quotas |
s_orphan_file_inum | Inode number of file tracking orphan inodes. |
2.7 Block and Inode Allocation Policy
ext4 recognizes (better than ext3, anyway) that data locality is generally a desirable quality of a filesystem. On a spinning disk, keeping related blocks near each other reduces the amount of movement that the head actuator and disk must perform to access a data block, thus speeding up disk IO. On an SSD there of course are no moving parts, but locality can increase the size of each transfer request while reducing the total number of requests. This locality may also have the effect of concentrating writes on a single erase block, which can speed up file rewrites significantly. Therefore, it is useful to reduce fragmentation whenever possible.
The first tool that ext4 uses to combat fragmentation is the multi-block allocator. When a file is first created, the block allocator speculatively allocates 8KiB of disk space to the file on the assumption that the space will get written soon. When the file is closed, the unused speculative allocations are of course freed, but if the speculation is correct (typically the case for full writes of small files) then the file data gets written out in a single multi-block extent. A second related trick that ext4 uses is delayed allocation. Under this scheme, when a file needs more blocks to absorb file writes, the filesystem defers deciding the exact placement on the disk until all the dirty buffers are being written out to disk. By not committing to a particular placement until it’s absolutely necessary (the commit timeout is hit, or sync() is called, or the kernel runs out of memory), the hope is that the filesystem can make better location decisions.
The third trick that ext4 (and ext3) uses is that it tries to keep a file’s data blocks in the same block group as its inode. This cuts down on the seek penalty when the filesystem first has to read a file’s inode to learn where the file’s data blocks live and then seek over to the file’s data blocks to begin I/O operations.
The fourth trick is that all the inodes in a directory are placed in the same block group as the directory, when feasible. The working assumption here is that all the files in a directory might be related, therefore it is useful to try to keep them all together.
The fifth trick is that the disk volume is cut up into 128MB block groups; these mini-containers are used as outlined above to try to maintain data locality. However, there is a deliberate quirk – when a directory is created in the root directory, the inode allocator scans the block groups and puts that directory into the least heavily loaded block group that it can find. This encourages directories to spread out over a disk; as the top-level directory/file blobs fill up one block group, the allocators simply move on to the next block group. Allegedly this scheme evens out the loading on the block groups, though the author suspects that the directories which are so unlucky as to land towards the end of a spinning drive get a raw deal performance-wise.
Of course if all of these mechanisms fail, one can always use e4defrag to defragment files.
2.8 Checksums
Starting in early 2012, metadata checksums were added to all major ext4 and jbd2 data structures. The associated feature flag is metadata_csum. The desired checksum algorithm is indicated in the superblock, though as of October 2012 the only supported algorithm is crc32c. Some data structures did not have space to fit a full 32-bit checksum, so only the lower 16 bits are stored. Enabling the 64bit feature increases the data structure size so that full 32-bit checksums can be stored for many data structures. However, existing 32-bit filesystems cannot be extended to enable 64bit mode, at least not without the experimental resize2fs patches to do so.
Existing filesystems can have checksumming added by running tune2fs -O metadata_csum against the underlying device. If tune2fs encounters directory blocks that lack sufficient empty space to add a checksum, it will request that you run e2fsck -D to have the directories rebuilt with checksums. This has the added benefit of removing slack space from the directory files and rebalancing the htree indexes. If you ignore this step, your directories will not be protected by a checksum!
The following table describes the data elements that go into each type of checksum. The checksum function is whatever the superblock describes (crc32c as of October 2013) unless noted otherwise.
Metadata | Length | Ingredients |
---|---|---|
Superblock | __le32 | The entire superblock up to the checksum field. The UUID lives inside the superblock. |
MMP | __le32 | UUID + the entire MMP block up to the checksum field. |
Extended Attributes | __le32 | UUID + the entire extended attribute block. The checksum field is set to zero. |
Directory Entries | __le32 | UUID + inode number + inode generation + the directory block up to the fake entry enclosing the checksum field. |
HTREE Nodes | __le32 | UUID + inode number + inode generation + all valid extents + HTREE tail. The checksum field is set to zero. |
Extents | __le32 | UUID + inode number + inode generation + the entire extent block up to the checksum field. |
Bitmaps | __le32 or __le16 | UUID + the entire bitmap. Checksums are stored in the group descriptor, and truncated if the group descriptor size is 32 bytes (i.e. ^64bit) |
Inodes | __le32 | UUID + inode number + inode generation + the entire inode. The checksum field is set to zero. Each inode has its own checksum. |
Group Descriptors | __le16 | If metadata_csum, then UUID + group number + the entire descriptor; else if gdt_csum, then crc16(UUID + group number + the entire descriptor). In all cases, only the lower 16 bits are stored. |
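To show how the "ingredients" in the table combine, here is a bit-at-a-time crc32c in Python that is fed the pieces in order. It is a sketch only: the seed of 0xFFFFFFFF, whether a final inversion is applied, and the sample UUID and inode values are assumptions of this example, not facts taken from the table. Real code would use an optimized crc32c implementation.

```python
import struct

CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c_update(crc: int, data: bytes) -> int:
    """Fold data into a running crc32c value (no initial or final inversion)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc


def crc32c(data: bytes) -> int:
    """Standard crc32c: seed of all ones, result inverted at the end."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def chained_checksum(parts, seed: int = 0xFFFFFFFF) -> int:
    """Feed each ingredient in turn; same result as hashing their concatenation."""
    crc = seed
    for part in parts:
        crc = crc32c_update(crc, part)
    return crc


# Hypothetical ingredients: UUID + inode number + inode generation.
uuid = bytes(range(16))
parts = [uuid, struct.pack("<I", 12), struct.pack("<I", 1)]
full = chained_checksum(parts)
print(hex(full), hex(full & 0xFFFF))  # full value and its lower 16 bits
```

The `& 0xFFFF` at the end mirrors the cases in the table where only the lower 16 bits of the checksum are stored.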
2.9 Bigalloc
At the moment, the default size of a block is 4KiB, which is a commonly supported page size on most MMU-capable hardware. This is fortunate, as ext4 code is not prepared to handle the case where the block size exceeds the page size. However, for a filesystem of mostly huge files, it is desirable to be able to allocate disk blocks in units of multiple blocks to reduce both fragmentation and metadata overhead. The bigalloc feature provides exactly this ability.
The bigalloc feature (EXT4_FEATURE_RO_COMPAT_BIGALLOC) changes ext4 to use clustered allocation, so that each bit in the ext4 block allocation bitmap addresses a power of two number of blocks. For example, if the file system is mainly going to be storing large files in the 4-32 megabyte range, it might make sense to set a cluster size of 1 megabyte. This means that each bit in the block allocation bitmap now addresses 256 4k blocks. This shrinks the total size of the block allocation bitmaps for a 2T file system from 64 megabytes to 256 kilobytes. It also means that a block group addresses 32 gigabytes instead of 128 megabytes, also shrinking the amount of file system overhead for metadata.
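The numbers in this example can be re-derived with two small helpers. The function names are invented, and the sketch assumes one bit per allocation unit and a one-block bitmap per group.

```python
def bitmap_bytes(fs_bytes: int, unit_bytes: int) -> int:
    """Total allocation bitmap size when each bit tracks unit_bytes of disk."""
    return fs_bytes // unit_bytes // 8


def group_span_bytes(block_size: int, unit_bytes: int) -> int:
    """Bytes covered by one group: a one-block bitmap holds 8 * block_size bits."""
    return 8 * block_size * unit_bytes


two_tib = 2 << 40
print((1 << 20) // 4096)                       # 4k blocks per 1MiB cluster
print(bitmap_bytes(two_tib, 4096) >> 20)       # MiB of bitmaps without bigalloc
print(bitmap_bytes(two_tib, 1 << 20) >> 10)    # KiB of bitmaps with 1MiB clusters
print(group_span_bytes(4096, 1 << 20) >> 30)   # GiB per block group with bigalloc
```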
The administrator can set a block cluster size at mkfs time (which is stored in the s_log_cluster_size field in the superblock); from then on, the block bitmaps track clusters, not individual blocks. This means that block groups can be several gigabytes in size (instead of just 128MiB); however, the minimum allocation unit becomes a cluster, not a block, even for directories. TaoBao had a patchset to extend the “use units of clusters instead of blocks” to the extent tree, though it is not clear where those patches went; they eventually morphed into “extent tree v2”, but that code has not landed as of May 2015.
2.10 Inline Data
The inline data feature was designed to handle the case that a file’s data is so tiny that it readily fits inside the inode, which (theoretically) reduces disk block consumption and reduces seeks. If the file is smaller than 60 bytes, then the data are stored inline in inode.i_block. If the rest of the file would fit inside the extended attribute space, then it might be found as an extended attribute “system.data” within the inode body (“ibody EA”). This of course constrains the amount of extended attributes one can attach to an inode. If the data size increases beyond i_block + ibody EA, a regular block is allocated and the contents moved to that block.
Pending a change to compact the extended attribute key used to store inline data, one ought to be able to store 160 bytes of data in a 256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to that, the limit was 156 bytes due to inefficient use of inode space.
The inline data feature requires the presence of an extended attribute for “system.data”, even if the attribute value is zero length.
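The placement rule for inline data can be sketched as a split of the file contents into an i_block part and a “system.data” part. Everything here is illustrative: ea_capacity is a hypothetical parameter standing in for the free ibody EA space, and treating exactly 60 bytes as still fitting in i_block is an assumption of this sketch.

```python
I_BLOCK_BYTES = 60  # size of inode.i_block


def place_inline_data(data: bytes, ea_capacity: int):
    """Return (i_block part, system.data value), or None if the data needs a block."""
    if len(data) > I_BLOCK_BYTES + ea_capacity:
        return None  # too large: a regular block would be allocated instead
    # The EA part may be empty, but the attribute itself must still exist.
    return data[:I_BLOCK_BYTES], data[I_BLOCK_BYTES:]


print(place_inline_data(b"x" * 100, ea_capacity=100))
```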
2.10.1 Inline Directories
The first four bytes of i_block are the inode number of the parent directory. Following that is a 56-byte space for an array of directory entries; see struct ext4_dir_entry. If there is a “system.data” attribute in the inode body, the EA value is an array of struct ext4_dir_entry as well. Note that for inline directories, the i_block and EA space are treated as separate dirent blocks; directory entries cannot span the two.
Inline directory entries are not checksummed, as the inode checksum should protect all inline data contents.
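A minimal sketch of pulling apart the i_block of an inline directory is shown below. The function name is made up, and decoding the individual directory entries is left out because struct ext4_dir_entry is described elsewhere.

```python
import struct


def split_inline_dir(i_block: bytes):
    """Return (parent inode number, 56-byte directory entry area)."""
    if len(i_block) != 60:
        raise ValueError("i_block must be exactly 60 bytes")
    (parent_ino,) = struct.unpack_from("<I", i_block, 0)  # little-endian, like all ext4 fields
    return parent_ino, i_block[4:]


sample = struct.pack("<I", 2) + bytes(56)  # parent is the root directory, no entries
print(split_inline_dir(sample)[0])
```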
2.11 Large Extended Attribute Values
To enable ext4 to store extended attribute values that do not fit in the inode or in the single extended attribute block attached to an inode, the EA_INODE feature allows us to store the value in the data blocks of a regular file inode. This “EA inode” is linked only from the extended attribute name index and must not appear in a directory entry. The inode’s i_atime field is used to store a checksum of the xattr value; and i_ctime/i_version store a 64-bit reference count, which enables sharing of large xattr values between multiple owning inodes. For backward compatibility with older versions of this feature, the i_mtime/i_generation may store a back-reference to the inode number and i_generation of the one owning inode (in cases where the EA inode is not referenced by multiple inodes) to verify that the EA inode is the correct one being accessed.
2.12 Verity files
ext4 supports fs-verity, which is a filesystem feature that provides Merkle tree based hashing for individual readonly files. Most of fs-verity is common to all filesystems that support it; see Documentation/filesystems/fsverity.rst for the fs-verity documentation. However, the on-disk layout of the verity metadata is filesystem-specific. On ext4, the verity metadata is stored after the end of the file data itself, in the following format:
- Zero-padding to the next 65536-byte boundary. This padding need not actually be allocated on-disk, i.e. it may be a hole.
- The Merkle tree, as documented in Documentation/filesystems/fsverity.rst, with the tree levels stored in order from root to leaf, and the tree blocks within each level stored in their natural order.
- Zero-padding to the next filesystem block boundary.
- The verity descriptor, as documented in Documentation/filesystems/fsverity.rst, with optionally appended signature blob.
- Zero-padding to the next offset that is 4 bytes before a filesystem block boundary.
- The size of the verity descriptor in bytes, as a 4-byte little endian integer.
Verity inodes have EXT4_VERITY_FL set, and they must use extents, i.e. EXT4_EXTENTS_FL must be set and EXT4_INLINE_DATA_FL must be clear. They can have EXT4_ENCRYPT_FL set, in which case the verity metadata is encrypted as well as the data itself.
Verity files cannot have blocks allocated past the end of the verity metadata.
Verity and DAX are not compatible and attempts to set both of these flags on a file will fail.
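The layout list in this section translates into a handful of round-up operations. The following sketch computes the byte offsets of each piece; the function names and the sample sizes are invented, and tree_size and desc_size (including any appended signature) are assumed to be known by the caller.

```python
def round_up(value: int, alignment: int) -> int:
    """Smallest multiple of alignment that is >= value."""
    return -(-value // alignment) * alignment


def verity_layout(file_size: int, tree_size: int, desc_size: int, block_size: int = 4096) -> dict:
    """Byte offsets of the verity metadata stored after the file data."""
    tree_start = round_up(file_size, 65536)                  # pad to 65536-byte boundary
    desc_start = round_up(tree_start + tree_size, block_size)
    desc_end = desc_start + desc_size
    # The 4-byte size field ends exactly on a filesystem block boundary.
    size_field = round_up(desc_end + 4, block_size) - 4
    return {"tree_start": tree_start, "desc_start": desc_start, "size_field": size_field}


print(verity_layout(file_size=100000, tree_size=4096, desc_size=256))
```

Reading the metadata back would start from the other end: the last four bytes before the final block boundary give the descriptor size, as a little-endian integer.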
3 Global Structures
The filesystem is sharded into a number of block groups, each of which has static metadata at fixed locations.
3.1 Super Block
The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more.
If the sparse_super feature flag is set, redundant copies of the superblock and group descriptors are kept only in the groups whose group number is either 0 or a power of 3, 5, or 7. If the flag is not set, redundant copies are kept in all groups.
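The sparse_super rule is easy to express as a predicate. In this sketch (names invented), group 1 counts as a power of 3, 5, or 7 because it equals each base raised to the zeroth power.

```python
def is_power_of(value: int, base: int) -> bool:
    """True if value == base ** k for some k >= 0 (value must be >= 1)."""
    while value % base == 0:
        value //= base
    return value == 1


def has_backup(group: int, sparse_super: bool = True) -> bool:
    """Does this block group carry a copy of the superblock and descriptors?"""
    if not sparse_super:
        return True
    return group == 0 or any(is_power_of(group, base) for base in (3, 5, 7))


print([g for g in range(130) if has_backup(g)])
```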
The superblock checksum is calculated against the superblock structure, which includes the FS UUID.
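As a hedged illustration of how the table that follows maps onto bytes, this sketch decodes the fields from offset 0x0 through s_first_ino at 0x54 with Python's struct module. The names SB_HEAD, SB_FIELDS and parse_superblock_head are invented for this example; the field names, sizes and the 0xEF53 magic come from the table.

```python
import struct

# 13 x __le32, 6 x __le16, 4 x __le32, 2 x __le16, 1 x __le32 = 0x58 bytes.
SB_HEAD = struct.Struct("<13I6H4I2HI")
SB_FIELDS = (
    "s_inodes_count", "s_blocks_count_lo", "s_r_blocks_count_lo",
    "s_free_blocks_count_lo", "s_free_inodes_count", "s_first_data_block",
    "s_log_block_size", "s_log_cluster_size", "s_blocks_per_group",
    "s_clusters_per_group", "s_inodes_per_group", "s_mtime", "s_wtime",
    "s_mnt_count", "s_max_mnt_count", "s_magic", "s_state", "s_errors",
    "s_minor_rev_level", "s_lastcheck", "s_checkinterval", "s_creator_os",
    "s_rev_level", "s_def_resuid", "s_def_resgid", "s_first_ino",
)


def parse_superblock_head(raw: bytes) -> dict:
    """Decode the leading superblock fields and check the magic signature."""
    fields = dict(zip(SB_FIELDS, SB_HEAD.unpack_from(raw, 0)))
    if fields["s_magic"] != 0xEF53:
        raise ValueError("magic signature is not 0xEF53")
    return fields


# Build a synthetic superblock head: 4KiB blocks, first non-reserved inode 11.
values = [0] * len(SB_FIELDS)
values[SB_FIELDS.index("s_log_block_size")] = 2
values[SB_FIELDS.index("s_magic")] = 0xEF53
values[SB_FIELDS.index("s_first_ino")] = 11
parsed = parse_superblock_head(SB_HEAD.pack(*values))
print(hex(parsed["s_magic"]), 1 << (10 + parsed["s_log_block_size"]))
```

On a real image, the input would be the bytes read at offset 1024 from the start of the device or image file.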
The ext4 superblock is laid out as follows in struct ext4_super_block:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | s_inodes_count | Total inode count. |
0x4 | __le32 | s_blocks_count_lo | Total block count. |
0x8 | __le32 | s_r_blocks_count_lo | This number of blocks can only be allocated by the super-user. |
0xC | __le32 | s_free_blocks_count_lo | Free block count. |
0x10 | __le32 | s_free_inodes_count | Free inode count. |
0x14 | __le32 | s_first_data_block | First data block. This must be at least 1 for 1k-block filesystems and is typically 0 for all other block sizes. |
0x18 | __le32 | s_log_block_size | Block size is 2 ^ (10 + s_log_block_size). |
0x1C | __le32 | s_log_cluster_size | Cluster size is 2 ^ (10 + s_log_cluster_size) blocks if bigalloc is enabled. Otherwise s_log_cluster_size must equal s_log_block_size. |
0x20 | __le32 | s_blocks_per_group | Blocks per group. |
0x24 | __le32 | s_clusters_per_group | Clusters per group, if bigalloc is enabled. Otherwise s_clusters_per_group must equal s_blocks_per_group. |
0x28 | __le32 | s_inodes_per_group | Inodes per group. |
0x2C | __le32 | s_mtime | Mount time, in seconds since the epoch. |
0x30 | __le32 | s_wtime | Write time, in seconds since the epoch. |
0x34 | __le16 | s_mnt_count | Number of mounts since the last fsck. |
0x36 | __le16 | s_max_mnt_count | Number of mounts beyond which a fsck is needed. |
0x38 | __le16 | s_magic | Magic signature, 0xEF53 |
0x3A | __le16 | s_state | File system state. See super_state for more info. |
0x3C | __le16 | s_errors | Behaviour when detecting errors. See super_errors for more info. |
0x3E | __le16 | s_minor_rev_level | Minor revision level. |
0x40 | __le32 | s_lastcheck | Time of last check, in seconds since the epoch. |
0x44 | __le32 | s_checkinterval | Maximum time between checks, in seconds. |
0x48 | __le32 | s_creator_os | Creator OS. See the table super_creator for more info. |
0x4C | __le32 | s_rev_level | Revision level. See the table super_revision for more info. |
0x50 | __le16 | s_def_resuid | Default uid for reserved blocks. |
0x52 | __le16 | s_def_resgid | Default gid for reserved blocks. |
These fields are for EXT4_DYNAMIC_REV superblocks only. Note: the difference between the compatible feature set and the incompatible feature set is that if there is a bit set in the incompatible feature set that the kernel doesn’t know about, it should refuse to mount the filesystem. e2fsck’s requirements are more strict; if it doesn’t know about a feature in either the compatible or incompatible feature set, it must abort and not try to meddle with things it doesn’t understand… | |||
0x54 | __le32 | s_first_ino | First non-reserved inode. |
0x58 | __le16 | s_inode_size | Size of inode structure, in bytes. |
0x5A | __le16 | s_block_group_nr | Block group # of this superblock. |
0x5C | __le32 | s_feature_compat | Compatible feature set flags. Kernel can still read/write this fs even if it doesn’t understand a flag; fsck should not do that. See the super_compat table for more info. |
0x60 | __le32 | s_feature_incompat | Incompatible feature set. If the kernel or fsck doesn’t understand one of these bits, it should stop. See the super_incompat table for more info. |
0x64 | __le32 | s_feature_ro_compat | Readonly-compatible feature set. If the kernel doesn’t understand one of these bits, it can still mount read-only. See the super_rocompat table for more info. |
0x68 | __u8 | s_uuid[16] | 128-bit UUID for volume. |
0x78 | char | s_volume_name[16] | Volume label. |
0x88 | char | s_last_mounted[64] | Directory where filesystem was last mounted. |
0xC8 | __le32 | s_algorithm_usage_bitmap | For compression (Not used in e2fsprogs/Linux) |
Performance hints. Directory preallocation should only happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. | |||
0xCC | __u8 | s_prealloc_blocks | Number of blocks to try to preallocate for … files? (Not used in e2fsprogs/Linux) |
0xCD | __u8 | s_prealloc_dir_blocks | Number of blocks to preallocate for directories. (Not used in e2fsprogs/Linux) |
0xCE | __le16 | s_reserved_gdt_blocks | Number of reserved GDT entries for future filesystem expansion. |
Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNAL is set. | |||
0xD0 | __u8 | s_journal_uuid[16] | UUID of journal superblock |
0xE0 | __le32 | s_journal_inum | inode number of journal file. |
0xE4 | __le32 | s_journal_dev | Device number of journal file, if the external journal feature flag is set. |
0xE8 | __le32 | s_last_orphan | Start of list of orphaned inodes to delete. |
0xEC | __le32 | s_hash_seed[4] | HTREE hash seed. |
0xFC | __u8 | s_def_hash_version | Default hash algorithm to use for directory hashes. See super_def_hash for more info. |
0xFD | __u8 | s_jnl_backup_type | If this value is 0 or EXT3_JNL_BACKUP_BLOCKS (1), then the s_jnl_blocks field contains a duplicate copy of the inode’s i_block[] array and i_size. |
0xFE | __le16 | s_desc_size | Size of group descriptors, in bytes, if the 64bit incompat feature flag is set. |
0x100 | __le32 | s_default_mount_opts | Default mount options. See the super_mountopts table for more info. |
0x104 | __le32 | s_first_meta_bg | First metablock block group, if the meta_bg feature is enabled. |
0x108 | __le32 | s_mkfs_time | When the filesystem was created, in seconds since the epoch. |
0x10C | __le32 | s_jnl_blocks[17] | Backup copy of the journal inode’s i_block[] array in the first 15 elements and i_size_high and i_size in the 16th and 17th elements, respectively. |
64bit support is valid only if EXT4_FEATURE_INCOMPAT_64BIT is set. | |||
0x150 | __le32 | s_blocks_count_hi | High 32-bits of the block count. |
0x154 | __le32 | s_r_blocks_count_hi | High 32-bits of the reserved block count. |
0x158 | __le32 | s_free_blocks_count_hi | High 32-bits of the free block count. |
0x15C | __le16 | s_min_extra_isize | All inodes have at least # bytes. |
0x15E | __le16 | s_want_extra_isize | New inodes should reserve # bytes. |
0x160 | __le32 | s_flags | Miscellaneous flags. See the super_flags table for more info. |
0x164 | __le16 | s_raid_stride | RAID stride. This is the number of logical blocks read from or written to the disk before moving to the next disk. This affects the placement of filesystem metadata, which will hopefully make RAID storage faster. |
0x166 | __le16 | s_mmp_interval | Number of seconds to wait in multi-mount prevention (MMP) checking. In theory, MMP is a mechanism to record in the superblock which host and device have mounted the filesystem, in order to prevent multiple mounts. This feature does not seem to be implemented… |
0x168 | __le64 | s_mmp_block | Block # for multi-mount protection data. |
0x170 | __le32 | s_raid_stripe_width | RAID stripe width. This is the number of logical blocks read from or written to the disk before coming back to the current disk. This is used by the block allocator to try to reduce the number of read-modify-write operations in a RAID5/6. |
0x174 | __u8 | s_log_groups_per_flex | Size of a flexible block group is 2 ^ s_log_groups_per_flex. |
0x175 | __u8 | s_checksum_type | Metadata checksum algorithm type. The only valid value is 1 (crc32c). |
0x176 | __le16 | s_reserved_pad | |
0x178 | __le64 | s_kbytes_written | Number of KiB written to this filesystem over its lifetime. |
0x180 | __le32 | s_snapshot_inum | inode number of active snapshot. (Not used in e2fsprogs/Linux.) |
0x184 | __le32 | s_snapshot_id | Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.) |
0x188 | __le64 | s_snapshot_r_blocks_count | Number of blocks reserved for active snapshot’s future use. (Not used in e2fsprogs/Linux.) |
0x190 | __le32 | s_snapshot_list | inode number of the head of the on-disk snapshot list. (Not used in e2fsprogs/Linux.) |
0x194 | __le32 | s_error_count | Number of errors seen. |
0x198 | __le32 | s_first_error_time | First time an error happened, in seconds since the epoch. |
0x19C | __le32 | s_first_error_ino | inode involved in first error. |
0x1A0 | __le64 | s_first_error_block | Number of the block involved in the first error. |
0x1A8 | __u8 | s_first_error_func[32] | Name of function where the error happened. |
0x1C8 | __le32 | s_first_error_line | Line number where error happened. |
0x1CC | __le32 | s_last_error_time | Time of most recent error, in seconds since the epoch. |
0x1D0 | __le32 | s_last_error_ino | inode involved in most recent error. |
0x1D4 | __le32 | s_last_error_line | Line number where most recent error happened. |
0x1D8 | __le64 | s_last_error_block | Number of block involved in most recent error. |
0x1E0 | __u8 | s_last_error_func[32] | Name of function where the most recent error happened. |
0x200 | __u8 | s_mount_opts[64] | ASCIIZ string of mount options. |
0x240 | __le32 | s_usr_quota_inum | Inode number of user quota file. |
0x244 | __le32 | s_grp_quota_inum | Inode number of group quota file. |
0x248 | __le32 | s_overhead_blocks | Overhead blocks/clusters in fs. (Huh? This field is always zero, which means that the kernel calculates it dynamically.) |
0x24C | __le32 | s_backup_bgs[2] | Block groups containing superblock backups (if sparse_super2) |
0x254 | __u8 | s_encrypt_algos[4] | Encryption algorithms in use. There can be up to four algorithms in use at any time; valid algorithm codes are given in the super_encrypt table below. |
0x258 | __u8 | s_encrypt_pw_salt[16] | Salt for the string2key algorithm for encryption. |
0x268 | __le32 | s_lpf_ino | Inode number of lost+found |
0x26C | __le32 | s_prj_quota_inum | Inode that tracks project quotas. |
0x270 | __le32 | s_checksum_seed | Checksum seed used for metadata_csum calculations. This value is crc32c(~0, $orig_fs_uuid). |
0x274 | __u8 | s_wtime_hi | Upper 8 bits of the s_wtime field. |
0x275 | __u8 | s_mtime_hi | Upper 8 bits of the s_mtime field. |
0x276 | __u8 | s_mkfs_time_hi | Upper 8 bits of the s_mkfs_time field. |
0x277 | __u8 | s_lastcheck_hi | Upper 8 bits of the s_lastcheck field. |
0x278 | __u8 | s_first_error_time_hi | Upper 8 bits of the s_first_error_time field. |
0x279 | __u8 | s_last_error_time_hi | Upper 8 bits of the s_last_error_time field. |
0x27A | __u8 | s_pad[2] | Zero padding. |
0x27C | __le16 | s_encoding | Filename charset encoding. |
0x27E | __le16 | s_encoding_flags | Filename charset encoding flags. |
0x280 | __le32 | s_orphan_file_inum | Orphan file inode number. |
0x284 | __le32 | s_reserved[94] | Padding to the end of the block. |
0x3FC | __le32 | s_checksum | Superblock checksum. |
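As a minimal sketch of how the offsets in this table translate into code, the snippet below pulls a few little-endian fields out of a 1024-byte superblock image. The function name and the returned dictionary keys are hypothetical, and the image is synthetic rather than read from a real device. The `_hi` half of the block count is only combined when the 64bit incompatible feature bit (0x80) is set.

```python
import struct

EXT4_MAGIC = 0xEF53      # s_magic at offset 0x38
INCOMPAT_64BIT = 0x80    # bit in s_feature_incompat


def parse_superblock(sb: bytes) -> dict:
    """Extract a handful of fields from a 1024-byte superblock image."""
    if len(sb) < 1024:
        raise ValueError("superblock image must be at least 1024 bytes")
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04X}")
    (blocks_lo,) = struct.unpack_from("<I", sb, 0x4)
    (log_block_size,) = struct.unpack_from("<I", sb, 0x18)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 0x20)
    (incompat,) = struct.unpack_from("<I", sb, 0x60)
    (blocks_hi,) = struct.unpack_from("<I", sb, 0x150)
    # The high 32 bits are only meaningful on 64bit filesystems.
    high = blocks_hi << 32 if incompat & INCOMPAT_64BIT else 0
    return {
        "block_size": 2 ** (10 + log_block_size),
        "blocks_count": blocks_lo | high,
        "blocks_per_group": blocks_per_group,
    }


# Build a synthetic image so the sketch runs without a real filesystem.
image = bytearray(1024)
struct.pack_into("<I", image, 0x4, 0x10)            # s_blocks_count_lo
struct.pack_into("<I", image, 0x18, 2)              # 2 ^ (10 + 2) = 4096
struct.pack_into("<I", image, 0x20, 32768)          # s_blocks_per_group
struct.pack_into("<H", image, 0x38, EXT4_MAGIC)     # s_magic
struct.pack_into("<I", image, 0x60, INCOMPAT_64BIT) # s_feature_incompat
struct.pack_into("<I", image, 0x150, 1)             # s_blocks_count_hi
info = parse_superblock(bytes(image))
```

Reading each field at its documented offset, rather than unpacking the whole structure in one format string, keeps the sketch easy to check against the table.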
The superblock state is some combination of the following:
Value | Description |
---|---|
0x0001 | Cleanly umounted |
0x0002 | Errors detected |
0x0004 | Orphans being recovered |
The superblock error policy is one of the following:
Value | Description |
---|---|
1 | Continue |
2 | Remount read-only |
3 | Panic |
The filesystem creator is one of the following:
Value | Description |
---|---|
0 | Linux |
1 | Hurd |
2 | Masix |
3 | FreeBSD |
4 | Lites |
The superblock revision is one of the following:
Value | Description |
---|---|
0 | Original format |
1 | v2 format w/ dynamic inode sizes |
Note that EXT4_DYNAMIC_REV refers to a revision 1 or newer filesystem.
The superblock compatible features field is a combination of any of the following:
Value | Description |
---|---|
0x1 | Directory preallocation (COMPAT_DIR_PREALLOC). |
0x2 | “imagic inodes”. Not clear from the code what this does (COMPAT_IMAGIC_INODES). |
0x4 | Has a journal (COMPAT_HAS_JOURNAL). |
0x8 | Supports extended attributes (COMPAT_EXT_ATTR). |
0x10 | Has reserved GDT blocks for filesystem expansion (COMPAT_RESIZE_INODE). Requires RO_COMPAT_SPARSE_SUPER. |
0x20 | Has directory indices (COMPAT_DIR_INDEX). |
0x40 | “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized block groups? (COMPAT_LAZY_BG) |
0x80 | “Exclude inode”. Not used. (COMPAT_EXCLUDE_INODE). |
0x100 | “Exclude bitmap”. Seems to be used to indicate the presence of snapshot-related exclude bitmaps? Not defined in kernel or used in e2fsprogs (COMPAT_EXCLUDE_BITMAP). |
0x200 | Sparse Super Block, v2. If this flag is set, the SB field s_backup_bgs points to the two block groups that contain backup superblocks (COMPAT_SPARSE_SUPER2). |
0x400 | Fast commits supported. Although fast commits blocks are backward incompatible, fast commit blocks are not always present in the journal. If fast commit blocks are present in the journal, JBD2 incompat feature (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) gets set (COMPAT_FAST_COMMIT). |
0x1000 | Orphan file allocated. This is the special file for more efficient tracking of unlinked but still open inodes. When there may be any entries in the file, we additionally set proper rocompat feature (RO_COMPAT_ORPHAN_PRESENT). |
The superblock incompatible features field is a combination of any of the following:
Value | Description |
---|---|
0x1 | Compression (INCOMPAT_COMPRESSION). |
0x2 | Directory entries record the file type. See ext4_dir_entry_2 below (INCOMPAT_FILETYPE). |
0x4 | Filesystem needs recovery (INCOMPAT_RECOVER). |
0x8 | Filesystem has a separate journal device (INCOMPAT_JOURNAL_DEV). |
0x10 | Meta block groups. See the earlier discussion of this feature (INCOMPAT_META_BG). |
0x40 | Files in this filesystem use extents (INCOMPAT_EXTENTS). |
0x80 | Enable a filesystem size of 2^64 blocks (INCOMPAT_64BIT). |
0x100 | Multiple mount protection (INCOMPAT_MMP). |
0x200 | Flexible block groups. See the earlier discussion of this feature (INCOMPAT_FLEX_BG). |
0x400 | Inodes can be used to store large extended attribute values (INCOMPAT_EA_INODE). |
0x1000 | Data in directory entry (INCOMPAT_DIRDATA). (Not implemented?) |
0x2000 | Metadata checksum seed is stored in the superblock. This feature enables the administrator to change the UUID of a metadata_csum filesystem while the filesystem is mounted; without it, the checksum definition requires all metadata blocks to be rewritten (INCOMPAT_CSUM_SEED). |
0x4000 | Large directory >2GB or 3-level htree (INCOMPAT_LARGEDIR). Prior to this feature, directories could not be larger than 4GiB and could not have an htree more than 2 levels deep. If this feature is enabled, directories can be larger than 4GiB and have a maximum htree depth of 3. |
0x8000 | Data in inode (INCOMPAT_INLINE_DATA). |
0x10000 | Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT). |
The superblock read-only compatible features field is a combination of any of the following:
Value | Description |
---|---|
0x1 | Sparse superblocks. See the earlier discussion of this feature (RO_COMPAT_SPARSE_SUPER). |
0x2 | This filesystem has been used to store a file greater than 2GiB (RO_COMPAT_LARGE_FILE). |
0x4 | Not used in kernel or e2fsprogs (RO_COMPAT_BTREE_DIR). |
0x8 | This filesystem has files whose sizes are represented in units of logical blocks, not 512-byte sectors. This implies a very large file indeed! (RO_COMPAT_HUGE_FILE) |
0x10 | Group descriptors have checksums. In addition to detecting corruption, this is useful for lazy formatting with uninitialized groups (RO_COMPAT_GDT_CSUM). |
0x20 | Indicates that the old ext3 32,000 subdirectory limit no longer applies (RO_COMPAT_DIR_NLINK). A directory’s i_links_count will be set to 1 if it is incremented past 64,999. |
0x40 | Indicates that large inodes exist on this filesystem (RO_COMPAT_EXTRA_ISIZE). |
0x80 | This filesystem has a snapshot (RO_COMPAT_HAS_SNAPSHOT). |
0x100 | Quota (RO_COMPAT_QUOTA). |
0x200 | This filesystem supports “bigalloc”, which means that file extents are tracked in units of clusters (of blocks) instead of blocks (RO_COMPAT_BIGALLOC). |
0x400 | This filesystem supports metadata checksumming. (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though GDT_CSUM must not be set) |
0x800 | Filesystem supports replicas. This feature is neither in the kernel nor e2fsprogs. (RO_COMPAT_REPLICA) |
0x1000 | Read-only filesystem image; the kernel will not mount this image read-write and most tools will refuse to write to the image. (RO_COMPAT_READONLY) |
0x2000 | Filesystem tracks project quotas. (RO_COMPAT_PROJECT) |
0x8000 | Verity inodes may be present on the filesystem. (RO_COMPAT_VERITY) |
0x10000 | Indicates orphan file may have valid orphan entries and thus we need to clean them up when mounting the filesystem (RO_COMPAT_ORPHAN_PRESENT). |
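The three feature sets imply a simple decision procedure at mount time, sketched below. The "known" masks are a hypothetical support set chosen purely for illustration, not what any real kernel supports. Unknown compatible bits never block a kernel mount, so they are not examined here.

```python
# Hypothetical set of features this imaginary implementation understands.
KNOWN_INCOMPAT = 0x2 | 0x40 | 0x80 | 0x200   # filetype, extents, 64bit, flex_bg
KNOWN_RO_COMPAT = 0x1 | 0x2 | 0x8 | 0x400    # sparse_super, large_file, huge_file, metadata_csum


def mount_decision(incompat: int, ro_compat: int,
                   known_incompat: int = KNOWN_INCOMPAT,
                   known_ro_compat: int = KNOWN_RO_COMPAT) -> str:
    """Return how a kernel-like implementation may mount the filesystem."""
    if incompat & ~known_incompat:
        return "refuse"        # an unknown incompatible bit forbids any mount
    if ro_compat & ~known_ro_compat:
        return "read-only"     # an unknown ro_compat bit forbids writing
    return "read-write"
```

For example, a filesystem with RO_COMPAT_VERITY (0x8000) set would be mounted read-only by this imaginary implementation, while one with INCOMPAT_ENCRYPT (0x10000) set would be refused outright.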
The s_def_hash_version field is one of the following:
Value | Description |
---|---|
0x0 | Legacy. |
0x1 | Half MD4. |
0x2 | Tea. |
0x3 | Legacy, unsigned. |
0x4 | Half MD4, unsigned. |
0x5 | Tea, unsigned. |
The s_default_mount_opts field is any combination of the following:
Value | Description |
---|---|
0x0001 | Print debugging info upon (re)mount. (EXT4_DEFM_DEBUG) |
0x0002 | New files take the gid of the containing directory (instead of the fsgid of the current process). (EXT4_DEFM_BSDGROUPS) |
0x0004 | Support userspace-provided extended attributes. (EXT4_DEFM_XATTR_USER) |
0x0008 | Support POSIX access control lists (ACLs). (EXT4_DEFM_ACL) |
0x0010 | Do not support 32-bit UIDs. (EXT4_DEFM_UID16) |
0x0020 | All data and metadata are committed to the journal. (EXT4_DEFM_JMODE_DATA) |
0x0040 | All data are flushed to the disk before metadata are committed to the journal. (EXT4_DEFM_JMODE_ORDERED) |
0x0060 | Data ordering is not preserved; data may be written after the metadata has been written. (EXT4_DEFM_JMODE_WBACK) |
0x0100 | Disable write flushes. (EXT4_DEFM_NOBARRIER) |
0x0200 | Track which blocks in a filesystem are metadata and therefore should not be used as data blocks. This option will be enabled by default on 3.18, hopefully. (EXT4_DEFM_BLOCK_VALIDITY) |
0x0400 | Enable DISCARD support, where the storage device is told about blocks becoming unused. (EXT4_DEFM_DISCARD) |
0x0800 | Disable delayed allocation. (EXT4_DEFM_NODELALLOC) |
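Note that the three journal-mode values share bits (0x0060 is 0x0020 | 0x0040), so they are best read as one two-bit field rather than as independent flags. The sketch below decodes s_default_mount_opts on that assumption. The short option names are hypothetical labels for readability, not identifiers taken from the kernel.

```python
JMODE_MASK = 0x0060   # covers EXT4_DEFM_JMODE_DATA, _ORDERED and _WBACK

JMODE_NAMES = {0x0020: "journal", 0x0040: "ordered", 0x0060: "writeback"}

# Independent single-bit options (labels are illustrative).
FLAG_NAMES = {
    0x0001: "debug",
    0x0002: "bsdgroups",
    0x0004: "xattr_user",
    0x0008: "acl",
    0x0010: "uid16",
    0x0100: "nobarrier",
    0x0200: "block_validity",
    0x0400: "discard",
    0x0800: "nodelalloc",
}


def decode_default_mount_opts(value: int) -> list:
    """Turn s_default_mount_opts into a list of readable option names."""
    names = [name for bit, name in FLAG_NAMES.items() if value & bit]
    mode = JMODE_NAMES.get(value & JMODE_MASK)
    if mode is not None:
        names.append("data=" + mode)
    return names
```

Testing the bits 0x0020 and 0x0040 separately would misreport a writeback filesystem as using both journal and ordered mode, which is why the mask is applied first.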
The s_flags field is any combination of the following:
Value | Description |
---|---|
0x0001 | Signed directory hash in use. |
0x0002 | Unsigned directory hash in use. |
0x0004 | To test development code. |
The s_encrypt_algos list can contain any of the following:
Value | Description |
---|---|
0 | Invalid algorithm (ENCRYPTION_MODE_INVALID). |
1 | 256-bit AES in XTS mode (ENCRYPTION_MODE_AES_256_XTS). |
2 | 256-bit AES in GCM mode (ENCRYPTION_MODE_AES_256_GCM). |
3 | 256-bit AES in CBC mode (ENCRYPTION_MODE_AES_256_CBC). |
Total size of the superblock is 1024 bytes.
3.2 Block Group Descriptors¶
Each block group on the filesystem has one of these descriptors associated with it. As noted in the Layout section above, the group descriptors (if present) are the second item in the block group. The standard configuration is for each block group to contain a full copy of the block group descriptor table unless the sparse_super feature flag is set.
Notice how the group descriptor records the location of both bitmaps and the inode table (i.e. they can float). This means that within a block group, the only data structures with fixed locations are the superblock and the group descriptor table. The flex_bg mechanism uses this property to group several block groups into a flex group and lay out all of the groups’ bitmaps and inode tables into one long run in the first group of the flex group.
If the meta_bg feature flag is set, then several block groups are grouped together into a meta group. Note that in the meta_bg case, however, the first and last two block groups within the larger meta group contain only group descriptors for the groups inside the meta group.
flex_bg and meta_bg do not appear to be mutually exclusive features.
In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the block group descriptor is only 32 bytes long and therefore ends at bg_checksum. On an ext4 filesystem with the 64bit feature enabled, the block group descriptor expands to at least the 64 bytes described below; the size is stored in the superblock field s_desc_size.
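To illustrate how the descriptor size feeds into locating a descriptor, here is a rough sketch. It assumes the primary descriptor table in group 0, starting in the block right after the one holding the superblock, and it assumes meta_bg is not in use (with meta_bg the descriptors are spread across the meta groups instead). The function names are hypothetical.

```python
def descriptor_size(has_64bit: bool, s_desc_size: int) -> int:
    """Descriptors are 32 bytes unless the 64bit feature is enabled."""
    return s_desc_size if has_64bit else 32


def group_desc_location(group: int, block_size: int,
                        first_data_block: int, desc_size: int = 32):
    """Return (block number, byte offset within that block) for a descriptor.

    Assumption: primary table in group 0, no meta_bg.
    """
    per_block = block_size // desc_size
    block = first_data_block + 1 + group // per_block
    offset = (group % per_block) * desc_size
    return block, offset
```

With 4KiB blocks and 64-byte descriptors, each table block holds 64 descriptors, so the descriptor for group 100 would sit in the second table block.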
If gdt_csum is set and metadata_csum is not set, the block group checksum is the crc16 of the FS UUID, the group number, and the group descriptor structure. If metadata_csum is set, then the block group checksum is the lower 16 bits of the checksum of the FS UUID, the group number, and the group descriptor structure. Both block and inode bitmap checksums are calculated against the FS UUID, the group number, and the entire bitmap.
The block group descriptor is laid out in struct ext4_group_desc.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | bg_block_bitmap_lo | Lower 32-bits of location of block bitmap. |
0x4 | __le32 | bg_inode_bitmap_lo | Lower 32-bits of location of inode bitmap. |
0x8 | __le32 | bg_inode_table_lo | Lower 32-bits of location of inode table. |
0xC | __le16 | bg_free_blocks_count_lo | Lower 16-bits of free block count. |
0xE | __le16 | bg_free_inodes_count_lo | Lower 16-bits of free inode count. |
0x10 | __le16 | bg_used_dirs_count_lo | Lower 16-bits of directory count. |
0x12 | __le16 | bg_flags | Block group flags. See the bgflags table below. |
0x14 | __le32 | bg_exclude_bitmap_lo | Lower 32-bits of location of snapshot exclusion bitmap. |
0x18 | __le16 | bg_block_bitmap_csum_lo | Lower 16-bits of the block bitmap checksum. |
0x1A | __le16 | bg_inode_bitmap_csum_lo | Lower 16-bits of the inode bitmap checksum. |
0x1C | __le16 | bg_itable_unused_lo | Lower 16-bits of unused inode count. If set, we needn’t scan past the (sb.s_inodes_per_group - gdt.bg_itable_unused) th entry in the inode table for this group. |
0x1E | __le16 | bg_checksum | Group descriptor checksum; crc16(sb_uuid+group_num+bg_desc) if the RO_COMPAT_GDT_CSUM feature is set, or crc32c(sb_uuid+group_num+bg_desc) & 0xFFFF if the RO_COMPAT_METADATA_CSUM feature is set. The bg_checksum field in bg_desc is skipped when calculating crc16 checksum, and set to zero if crc32c checksum is used. |
These fields only exist if the 64bit feature is enabled and s_desc_size > 32. | |||
0x20 | __le32 | bg_block_bitmap_hi | Upper 32-bits of location of block bitmap. |
0x24 | __le32 | bg_inode_bitmap_hi | Upper 32-bits of location of inode bitmap. |
0x28 | __le32 | bg_inode_table_hi | Upper 32-bits of location of inode table. |
0x2C | __le16 | bg_free_blocks_count_hi | Upper 16-bits of free block count. |
0x2E | __le16 | bg_free_inodes_count_hi | Upper 16-bits of free inode count. |
0x30 | __le16 | bg_used_dirs_count_hi | Upper 16-bits of directory count. |
0x32 | __le16 | bg_itable_unused_hi | Upper 16-bits of unused inode count. |
0x34 | __le32 | bg_exclude_bitmap_hi | Upper 32-bits of location of snapshot exclusion bitmap. |
0x38 | __le16 | bg_block_bitmap_csum_hi | Upper 16-bits of the block bitmap checksum. |
0x3A | __le16 | bg_inode_bitmap_csum_hi | Upper 16-bits of the inode bitmap checksum. |
0x3C | __u32 | bg_reserved | Padding to 64 bytes. |
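The lo/hi split in this table can be sketched as follows: the `_lo` fields are always read, and the `_hi` fields are combined only when the descriptor is larger than 32 bytes. The function name and dictionary keys are hypothetical; checksum verification is not attempted.

```python
import struct


def parse_group_desc(raw: bytes, desc_size: int = 32) -> dict:
    """Decode one group descriptor, merging the lo and hi halves."""
    (bb_lo, ib_lo, it_lo, fb_lo, fi_lo, dirs_lo, flags,
     _excl_lo, _bb_csum_lo, _ib_csum_lo, unused_lo,
     checksum) = struct.unpack_from("<IIIHHHHIHHHH", raw, 0)
    hi = (0,) * 7
    if desc_size > 32:
        # bg_block_bitmap_hi .. bg_itable_unused_hi, offsets 0x20 to 0x33
        hi = struct.unpack_from("<IIIHHHH", raw, 0x20)
    return {
        "block_bitmap": bb_lo | hi[0] << 32,
        "inode_bitmap": ib_lo | hi[1] << 32,
        "inode_table": it_lo | hi[2] << 32,
        "free_blocks": fb_lo | hi[3] << 16,
        "free_inodes": fi_lo | hi[4] << 16,
        "used_dirs": dirs_lo | hi[5] << 16,
        "itable_unused": unused_lo | hi[6] << 16,
        "flags": flags,
        "checksum": checksum,
    }


# Synthetic 64-byte descriptor for demonstration.
desc = bytearray(64)
struct.pack_into("<III", desc, 0x0, 100, 101, 102)  # bitmap and table locations
struct.pack_into("<H", desc, 0xC, 5)                # bg_free_blocks_count_lo
struct.pack_into("<H", desc, 0x12, 0x4)             # EXT4_BG_INODE_ZEROED
struct.pack_into("<I", desc, 0x20, 1)               # bg_block_bitmap_hi
struct.pack_into("<H", desc, 0x2C, 1)               # bg_free_blocks_count_hi
wide = parse_group_desc(bytes(desc), 64)
narrow = parse_group_desc(bytes(desc[:32]), 32)
```

Parsing the same bytes with desc_size set to 32 ignores everything from offset 0x20 onward, which mirrors how a filesystem without the 64bit feature would see the descriptor.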
Block group flags can be any combination of the following:
Value | Description |
---|---|
0x1 | inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT). |
0x2 | block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT). |
0x4 | inode table is zeroed (EXT4_BG_INODE_ZEROED). |
3.3 Block and inode Bitmaps¶
The data block bitmap tracks the usage of data blocks within the block group.
The inode bitmap records which entries in the inode table are in use.
As with most bitmaps, one bit represents the usage status of one data block or inode table entry. This implies a block group size of 8 * number_of_bytes_in_a_logical_block.
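A short sketch of the arithmetic above, together with a bit test. It assumes the bitmap occupies exactly one logical block and that bits are numbered from the least significant bit of byte 0 (little-endian bit order); the function names are hypothetical.

```python
def max_blocks_per_group(block_size: int) -> int:
    """One bitmap block holds 8 bits per byte, one bit per block."""
    return 8 * block_size


def bit_is_set(bitmap: bytes, index: int) -> bool:
    """Test bit `index`, counting from the low bit of byte 0 (assumed order)."""
    return bool(bitmap[index // 8] >> (index % 8) & 1)


blocks = max_blocks_per_group(4096)   # blocks covered by one 4KiB bitmap
group_bytes = blocks * 4096           # bytes of storage in such a group
```

For the common 4KiB block size this works out to 32768 blocks, or 128MiB, per block group.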
NOTE: If BLOCK_UNINIT is set for a given block group, various parts of the kernel and e2fsprogs code pretend that the block bitmap contains zeros (i.e. all blocks in the group are free). However, it is not necessarily the case that no blocks are in use: if meta_bg is set, the bitmaps and group descriptor live inside the group. Unfortunately, ext2fs_test_block_bitmap2() will return ‘0’ for those locations, which produces confusing debugfs output.
3.4 Inode Table¶
Inode tables are statically allocated at mkfs time. Each block group descriptor points to the start of the table, and the superblock records the number of inodes per group. See the section on inodes for more information.
3.5 Multiple Mount Protection¶
Multiple mount protection (MMP) is a feature that protects the filesystem against multiple hosts trying to use the filesystem simultaneously. When a filesystem is opened (for mounting, or fsck, etc.), the MMP code running on the node (call it node A) checks a sequence number. If the sequence number is EXT4_MMP_SEQ_CLEAN, the open continues. If the sequence number is EXT4_MMP_SEQ_FSCK, then fsck is (hopefully) running, and open fails immediately. Otherwise, the open code will wait for twice the specified MMP check interval and check the sequence number again. If the sequence number has changed, then the filesystem is active on another machine and the open fails. If the MMP code passes all of those checks, a new MMP sequence number is generated and written to the MMP block, and the mount proceeds.
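The open-time checks described above can be modelled as a small function. This is a simplified sketch, not the kernel's code: `read_seq` and `wait` are hypothetical callables standing in for reading the MMP block and sleeping, and the two constants are the values believed to be used in the kernel headers, which this document does not list. Writing the new sequence number after a successful check is left out.

```python
# Assumed values; they are not given in this document.
EXT4_MMP_SEQ_CLEAN = 0xFF4D4D50
EXT4_MMP_SEQ_FSCK = 0xE24D4D50


def mmp_open_check(read_seq, wait, check_interval: int) -> bool:
    """Return True if the open may proceed, False if it must fail."""
    seq = read_seq()
    if seq == EXT4_MMP_SEQ_CLEAN:
        return True              # cleanly released: continue immediately
    if seq == EXT4_MMP_SEQ_FSCK:
        return False             # fsck is (hopefully) running
    wait(2 * check_interval)     # give another node time to update the block
    return read_seq() == seq     # a changed number means someone else is live
```

Passing the reader and the sleep function in as parameters keeps the sketch testable without touching a disk or actually sleeping.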
While the filesystem is live, the kernel sets up a timer to re-check the MMP block at the specified MMP check interval. To perform the re-check, the MMP sequence number is re-read; if it does not match the in-memory MMP sequence number, then another node (node B) has mounted the filesystem, and node A remounts the filesystem read-only. If the sequence numbers match, the sequence number is incremented both in memory and on disk, and the re-check is complete.
The hostname and device filename are written into the MMP block whenever an open operation succeeds. The MMP code does not use these values; they are provided purely for informational purposes.
The checksum is calculated against the FS UUID and the MMP structure. The MMP structure (struct mmp_struct) is as follows:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __le32 | mmp_magic | Magic number for MMP, 0x004D4D50 (“MMP”). |
0x4 | __le32 | mmp_seq | Sequence number, updated periodically. |
0x8 | __le64 | mmp_time | Time that the MMP block was last updated. |
0x10 | char[64] | mmp_nodename | Hostname of the node that opened the filesystem. |
0x50 | char[32] | mmp_bdevname | Block device name of the filesystem. |
0x70 | __le16 | mmp_check_interval | The MMP re-check interval, in seconds. |
0x72 | __le16 | mmp_pad1 | Zero. |
0x74 | __le32[226] | mmp_pad2 | Zero. |
0x3FC | __le32 | mmp_checksum | Checksum of the MMP block. |
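The layout in this table maps onto a single little-endian format string for the first 0x74 bytes, with the checksum read separately from offset 0x3FC. The sketch below is illustrative: the function name and dictionary keys are hypothetical, the checksum is returned but not verified, and the block is synthetic.

```python
import struct

MMP_MAGIC = 0x004D4D50
MMP_FORMAT = "<IIQ64s32sHH"   # mmp_magic through mmp_pad1, 0x74 bytes


def parse_mmp(block: bytes) -> dict:
    """Decode the informational fields of a 1024-byte MMP block."""
    (magic, seq, mtime, nodename, bdevname,
     interval, _pad1) = struct.unpack_from(MMP_FORMAT, block, 0)
    if magic != MMP_MAGIC:
        raise ValueError(f"bad MMP magic 0x{magic:08X}")
    (checksum,) = struct.unpack_from("<I", block, 0x3FC)
    return {
        "seq": seq,
        "time": mtime,
        # Names are NUL-padded character arrays.
        "nodename": nodename.split(b"\0", 1)[0].decode("ascii", "replace"),
        "bdevname": bdevname.split(b"\0", 1)[0].decode("ascii", "replace"),
        "check_interval": interval,
        "checksum": checksum,
    }


# Synthetic block for demonstration.
mmp_block = bytearray(1024)
struct.pack_into(MMP_FORMAT, mmp_block, 0,
                 MMP_MAGIC, 42, 1700000000, b"nodeA", b"sda1", 5, 0)
struct.pack_into("<I", mmp_block, 0x3FC, 0xDEADBEEF)
mmp = parse_mmp(bytes(mmp_block))
```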
3.6 Journal (jbd2)¶
The journal, introduced in ext3, protects the filesystem against metadata inconsistencies in the case of a system crash. Up to 10,240,000 file system blocks (see mke2fs(8) for more details on journal size limits) can be reserved inside the filesystem as a place to land “important” data writes on-disk as quickly as possible. Once the important data transaction is fully written to the disk and flushed from the disk write cache, a record of the data being committed is also written to the journal. At some later point in time, the journal code writes the transactions to their final locations on disk (this could involve a lot of seeking or a lot of small read-write-erases) before erasing the commit record. Should the system crash during the second slow write, the journal can be replayed all the way to the latest commit record, guaranteeing the atomicity of whatever gets written through the journal to the disk. The effect of this is to guarantee that the filesystem does not become stuck midway through a metadata update.
For performance reasons, ext4 by default only writes filesystem metadata through the journal. This means that file data blocks are not guaranteed to be in any consistent state after a crash. If this default guarantee level (data=ordered) is not satisfactory, there is a mount option to control journal behavior. If data=journal, all data and metadata are written to disk through the journal. This is the slowest but safest mode. If data=writeback, dirty data blocks are not flushed to the disk before the metadata are written to disk through the journal.
In data=ordered mode, ext4 also supports fast commits, which help reduce commit latency significantly. The default data=ordered mode works by logging metadata blocks to the journal. In fast commit mode, ext4 only stores the minimal delta needed to recreate the affected metadata in fast commit space that is shared with JBD2. Once the fast commit area fills up, or if a fast commit is not possible, or if the JBD2 commit timer goes off, ext4 performs a traditional full commit. A full commit invalidates all the fast commits that happened before it, and thus empties the fast commit area for further fast commits. This feature needs to be enabled at mkfs time.
The journal inode is typically inode 8. The first 68 bytes of the journal inode are replicated in the ext4 superblock. The journal itself is a normal (but hidden) file within the filesystem. The file usually consumes an entire block group, though mke2fs tries to put it in the middle of the disk.
All fields in jbd2 are written to disk in big-endian order. This is the opposite of ext4.
NOTE: Both ext4 and ocfs2 use jbd2.
The maximum size of a journal embedded in an ext4 filesystem is 2^32 blocks. jbd2 itself does not seem to care.
3.6.1 Layout¶
Generally speaking, the journal has this format:
Superblock | descriptor_block (data_blocks or revocation_block) [more data or revocations] commit_block | [more transactions…] |
---|---|---|
 | One transaction | |
Notice that a transaction begins with either a descriptor and some data, or a block revocation list. A finished transaction always ends with a commit. If there is no commit record (or the checksums don’t match), the transaction will be discarded during replay.
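The rule that only committed transactions survive replay can be shown with a toy model. The sketch below works on a list of hypothetical `(kind, sequence)` pairs rather than real journal blocks, ignores checksums entirely, and stops at the first incomplete or out-of-place record; it is a simplification, not the jbd2 recovery algorithm.

```python
def replayable_transactions(records) -> list:
    """Return the sequence IDs of transactions that end with a commit.

    `records` is an iterable of (kind, sequence) pairs, where kind is one of
    "descriptor", "data", "revoke" or "commit" (a toy representation).
    """
    done = []
    open_seq = None
    for kind, seq in records:
        if open_seq is None:
            if kind not in ("descriptor", "revoke"):
                break            # a transaction must start with one of these
            open_seq = seq
        elif seq != open_seq:
            break                # record belongs to a different transaction
        elif kind == "commit":
            done.append(seq)     # the commit record seals the transaction
            open_seq = None
    return done


log = [
    ("descriptor", 1), ("data", 1), ("commit", 1),
    ("revoke", 2), ("commit", 2),
    ("descriptor", 3), ("data", 3),   # crash before the commit was written
]
committed = replayable_transactions(log)
```

In this model, transaction 3 is dropped because its commit record never reached the journal.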
3.6.2 External Journal¶
Optionally, an ext4 filesystem can be created with an external journal device (as opposed to an internal journal, which uses a reserved inode). In this case, on the filesystem device, s_journal_inum
should be zero and s_journal_uuid
should be set. On the journal device there will be an ext4 super block in the usual place, with a matching UUID. The journal superblock will be in the next full block after the superblock.
1024 bytes of padding | ext4 Superblock | Journal Superblock | One transaction: descriptor_block (data_blocks or revocation_block) [more data or revocations] commit_block | [more transactions…] |
---|---|---|---|---|
3.6.3 Block Header¶
Every block in the journal starts with a common 12-byte header struct journal_header_s:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __be32 | h_magic | jbd2 magic number, 0xC03B3998. |
0x4 | __be32 | h_blocktype | Description of what this block contains. See the jbd2_blocktype table below. |
0x8 | __be32 | h_sequence | The transaction ID that goes with this block. |
The journal block type can be any one of:
Value | Description |
---|---|
1 | Descriptor. This block precedes a series of data blocks that were written through the journal during a transaction. |
2 | Block commit record. This block signifies the completion of a transaction. |
3 | Journal superblock, v1. |
4 | Journal superblock, v2. |
5 | Block revocation records. This speeds up recovery by enabling the journal to skip writing blocks that were subsequently rewritten. |
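As a minimal sketch of how the header and block type tables fit together, the following Python function decodes the 12-byte header from a raw journal block. The function name and the short type labels are illustrative choices, not kernel identifiers; the offsets, magic number, and big-endian layout come from the tables above.

```python
import struct

JBD2_MAGIC = 0xC03B3998

# Short labels for the h_blocktype values listed in the table above.
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock_v1",
    4: "superblock_v2",
    5: "revoke",
}


def parse_journal_header(block: bytes):
    """Return (block type label, transaction ID), or None without a jbd2 header."""
    if len(block) < 12:
        return None
    # h_magic, h_blocktype, h_sequence: three big-endian 32-bit fields.
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence
```

A block that does not start with the magic number (for example, a journalled data block) simply yields None here.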
3.6.4 Super Block¶
The super block for the journal is much simpler than ext4’s. The key data kept within are the size of the journal and where to find the start of the log of transactions.
The journal superblock is recorded as struct journal_superblock_s, which is 1024 bytes long:
Offset | Type | Name | Description |
---|---|---|---|
Static information describing the journal. | |||
0x0 | journal_header_t (12 bytes) | s_header | Common header identifying this as a superblock. |
0xC | __be32 | s_blocksize | Journal device block size. |
0x10 | __be32 | s_maxlen | Total number of blocks in this journal. |
0x14 | __be32 | s_first | First block of log information. |
Dynamic information describing the current state of the log. | |||
0x18 | __be32 | s_sequence | First commit ID expected in log. |
0x1C | __be32 | s_start | Block number of the start of log. Contrary to the comments, this field being zero does not imply that the journal is clean! |
0x20 | __be32 | s_errno | Error value, as set by jbd2_journal_abort(). |
The remaining fields are only valid in a v2 superblock. | |||
0x24 | __be32 | s_feature_compat | Compatible feature set. See the table jbd2_compat below. |
0x28 | __be32 | s_feature_incompat | Incompatible feature set. See the table jbd2_incompat below. |
0x2C | __be32 | s_feature_ro_compat | Read-only compatible feature set. There aren’t any of these currently. |
0x30 | __u8 | s_uuid[16] | 128-bit uuid for journal. This is compared against the copy in the ext4 super block at mount time. |
0x40 | __be32 | s_nr_users | Number of file systems sharing this journal. |
0x44 | __be32 | s_dynsuper | Location of dynamic super block copy. (Not used?) |
0x48 | __be32 | s_max_transaction | Limit of journal blocks per transaction. (Not used?) |
0x4C | __be32 | s_max_trans_data | Limit of data blocks per transaction. (Not used?) |
0x50 | __u8 | s_checksum_type | Checksum algorithm used for the journal. See jbd2_checksum_type for more info. |
0x51 | __u8[3] | s_padding2 | |
0x54 | __be32 | s_num_fc_blocks | Number of fast commit blocks in the journal. |
0x58 | __u32 | s_padding[41] | |
0xFC | __be32 | s_checksum | Checksum of the entire superblock, with this field set to zero. |
0x100 | __u8 | s_users[16*48] | ids of all file systems sharing the log. e2fsprogs/Linux don’t allow shared external journals, but I imagine Lustre (or ocfs2?), which use the jbd2 code, might. |
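The table can be turned into a small decoder. The sketch below reads the static and dynamic fields, and the first few v2-only fields, from a 1024-byte buffer. The function name and the returned dictionary shape are assumptions made for illustration; the checksum is not verified.

```python
import struct

JBD2_MAGIC = 0xC03B3998


def parse_journal_superblock(buf: bytes) -> dict:
    """Decode selected fields of a 1024-byte jbd2 superblock (all big-endian)."""
    if len(buf) < 1024:
        raise ValueError("expected at least 1024 bytes")
    magic, blocktype, _sequence = struct.unpack_from(">3I", buf, 0x0)
    if magic != JBD2_MAGIC or blocktype not in (3, 4):
        raise ValueError("not a jbd2 superblock")
    names = ("s_blocksize", "s_maxlen", "s_first",
             "s_sequence", "s_start", "s_errno")
    sb = dict(zip(names, struct.unpack_from(">6I", buf, 0xC)))
    sb["version"] = 1 if blocktype == 3 else 2
    if sb["version"] == 2:
        # Fields from 0x24 onward are only valid in a v2 superblock.
        feats = struct.unpack_from(">3I", buf, 0x24)
        sb.update(zip(("s_feature_compat", "s_feature_incompat",
                       "s_feature_ro_compat"), feats))
        sb["s_uuid"] = bytes(buf[0x30:0x40])
        (sb["s_nr_users"],) = struct.unpack_from(">I", buf, 0x40)
    return sb
```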
The journal compat features are any combination of the following:
Value | Description |
---|---|
0x1 | Journal maintains checksums on the data blocks. (JBD2_FEATURE_COMPAT_CHECKSUM) |
The journal incompat features are any combination of the following:
Value | Description |
---|---|
0x1 | Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE) |
0x2 | Journal can deal with 64-bit block numbers. (JBD2_FEATURE_INCOMPAT_64BIT) |
0x4 | Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) |
0x8 | This journal uses v2 of the checksum on-disk format. Each journal metadata block gets its own checksum, and the block tags in the descriptor table contain checksums for each of the data blocks in the journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2) |
0x10 | This journal uses v3 of the checksum on-disk format. This is the same as v2, but the journal block tag size is fixed regardless of the size of block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3) |
0x20 | Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) |
Journal checksum type codes are one of the following. crc32 or crc32c are the most likely choices.
Value | Description |
---|---|
1 | CRC32 |
2 | MD5 |
3 | SHA1 |
4 | CRC32C |
3.6.5 Descriptor Block¶
The descriptor block contains an array of journal block tags that describe the final locations of the data blocks that follow in the journal. Descriptor blocks are open-coded instead of being completely described by a data structure, but here is the block structure anyway. Descriptor blocks consume at least 36 bytes, but use a full block:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | journal_header_t | (open coded) | Common block header. |
0xC | struct journal_block_tag_s | open coded array[] | Enough tags either to fill up the block or to describe all the data blocks that follow this descriptor block. |
Journal block tags have any of the following formats, depending on which journal feature and block tag flags are set.
If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is defined as struct journal_block_tag3_s, which looks like the following. The size is 16 or 32 bytes.
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __be32 | t_blocknr | Lower 32-bits of the location of where the corresponding data block should end up on disk. |
0x4 | __be32 | t_flags | Flags that go with the descriptor. See the table jbd2_tag_flags for more info. |
0x8 | __be32 | t_blocknr_high | Upper 32-bits of the location of where the corresponding data block should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is not enabled. |
0xC | __be32 | t_checksum | Checksum of the journal UUID, the sequence number, and the data block. |
This field appears to be open coded. It always comes at the end of the tag, after t_checksum. This field is not present if the “same UUID” flag is set. | |||
0x10 | char | uuid[16] | A UUID to go with this tag. This field appears to be copied from the j_uuid field in struct journal_s, but only tune2fs touches that field. |
The journal tag flags are any combination of the following:
Value | Description |
---|---|
0x1 | On-disk block is escaped. The first four bytes of the data block just happened to match the jbd2 magic number. |
0x2 | This block has the same UUID as previous, therefore the UUID field is omitted. |
0x4 | The data block was deleted by the transaction. (Not used?) |
0x8 | This is the last tag in this descriptor block. |
If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag is defined as struct journal_block_tag_s, which looks like the following. The size is 8, 12, 24, or 28 bytes:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __be32 | t_blocknr | Lower 32-bits of the location of where the corresponding data block should end up on disk. |
0x4 | __be16 | t_checksum | Checksum of the journal UUID, the sequence number, and the data block. Note that only the lower 16 bits are stored. |
0x6 | __be16 | t_flags | Flags that go with the descriptor. See the table jbd2_tag_flags for more info. |
This next field is only present if the super block indicates support for 64-bit block numbers. | |||
0x8 | __be32 | t_blocknr_high | Upper 32-bits of the location of where the corresponding data block should end up on disk. |
This field appears to be open coded. It always comes at the end of the tag, after t_flags or t_blocknr_high. This field is not present if the “same UUID” flag is set. | |||
0x8 or 0xC | char | uuid[16] | A UUID to go with this tag. This field appears to be copied from the j_uuid field in struct journal_s, but only tune2fs touches that field. |
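Because the tag array is open coded, a reader walking a descriptor block has to work out the size of each tag from the journal feature bits and the tag's own flags. The following sketch shows that calculation; the function name is hypothetical, while the flag values are taken from the jbd2_incompat and tag flag tables above.

```python
JBD2_FEATURE_INCOMPAT_64BIT = 0x2
JBD2_FEATURE_INCOMPAT_CSUM_V3 = 0x10
TAG_FLAG_SAME_UUID = 0x2  # "same UUID as previous" tag flag


def journal_tag_bytes(incompat: int, tag_flags: int) -> int:
    """Size in bytes of one journal block tag, including any trailing UUID."""
    if incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3:
        # journal_block_tag3_s is fixed at 16 bytes regardless of 64BIT.
        size = 16
    else:
        # journal_block_tag_s: 8 bytes, plus t_blocknr_high with 64BIT.
        size = 8
        if incompat & JBD2_FEATURE_INCOMPAT_64BIT:
            size += 4
    if not (tag_flags & TAG_FLAG_SAME_UUID):
        size += 16  # the open-coded uuid[16] follows the tag
    return size
```

The six possible results (16 and 32 with CSUM_V3; 8, 12, 24, and 28 without) match the sizes quoted in the text.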
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a struct jbd2_journal_block_tail, which looks like this:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __be32 | t_checksum | Checksum of the journal UUID + the descriptor block, with this field set to zero. |
3.6.6 Data Block¶
In general, the data blocks being written to disk through the journal are written verbatim into the journal file after the descriptor block. However, if the first four bytes of the block match the jbd2 magic number then those four bytes are replaced with zeroes and the “escaped” flag is set in the descriptor block tag.
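The escaping rule is easy to get backwards, so here is a minimal sketch of both directions. The function names are hypothetical; the magic number and the "escaped" tag flag value (0x1) come from the tables in the previous sections.

```python
# The jbd2 magic number 0xC03B3998 as it appears on disk (big-endian).
JBD2_MAGIC_BYTES = bytes.fromhex("c03b3998")
TAG_FLAG_ESCAPE = 0x1


def escape_for_journal(data: bytes):
    """Return (journal copy of the block, tag flags to record for it)."""
    if data[:4] == JBD2_MAGIC_BYTES:
        # Zero the first four bytes so the copy cannot be mistaken
        # for a journal metadata block.
        return bytes(4) + data[4:], TAG_FLAG_ESCAPE
    return data, 0


def unescape_on_replay(journal_copy: bytes, tag_flags: int) -> bytes:
    """Restore the original block contents before writing them in place."""
    if tag_flags & TAG_FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + journal_copy[4:]
    return journal_copy
```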
3.6.7 Revocation Block¶
A revocation block is used to prevent replay of a block in an earlier transaction. This is used to mark blocks that were journalled at one time but are no longer journalled. Typically this happens if a metadata block is freed and re-allocated as a file data block; in this case, a journal replay after the file block was written to disk will cause corruption.
NOTE: This mechanism is NOT used to express “this journal block is superseded by this other journal block”, as the author (djwong) mistakenly thought. Any block being added to a transaction will cause the removal of all existing revocation records for that block.
Revocation blocks are described in struct jbd2_journal_revoke_header_s, are at least 16 bytes in length, but use a full block:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | journal_header_t | r_header | Common block header. |
0xC | __be32 | r_count | Number of bytes used in this block. |
0x10 | __be32 or __be64 | blocks[0] | Blocks to revoke. |
After r_count is a linear array of block numbers that are effectively revoked by this transaction. The size of each block number is 8 bytes if the superblock advertises 64-bit block number support, or 4 bytes otherwise.
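A sketch of reading the revoked block numbers out of a revocation block follows. It assumes that r_count counts bytes from the start of the block, so that it includes the 16 bytes of header and r_count itself; the function name and the is_64bit parameter are illustrative, and the trailing checksum is not verified.

```python
import struct

JBD2_MAGIC = 0xC03B3998
REVOKE_BLOCKTYPE = 5


def parse_revoke_block(block: bytes, is_64bit: bool) -> list:
    """Return the list of block numbers revoked by this block."""
    magic, blocktype, _sequence, r_count = struct.unpack_from(">4I", block, 0)
    if magic != JBD2_MAGIC or blocktype != REVOKE_BLOCKTYPE:
        raise ValueError("not a revocation block")
    if not 16 <= r_count <= len(block):
        raise ValueError("implausible r_count")
    # Entries are 8 bytes with 64-bit block number support, else 4 bytes.
    fmt, width = (">Q", 8) if is_64bit else (">I", 4)
    return [struct.unpack_from(fmt, block, offset)[0]
            for offset in range(16, r_count - width + 1, width)]
```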
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation block is a struct jbd2_journal_revoke_tail, which has this format:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __be32 | r_checksum | Checksum of the journal UUID + revocation block |
3.6.8 Commit Block¶
The commit block is a sentry that indicates that a transaction has been completely written to the journal. Once this commit block reaches the journal, the data stored with this transaction can be written to their final locations on disk.
The commit block is described by struct commit_header, which is 60 bytes long according to the field offsets below (but uses a full block):
Offset | Type | Name | Description |
---|---|---|---|
0x0 | journal_header_s | (open coded) | Common block header. |
0xC | unsigned char | h_chksum_type | The type of checksum to use to verify the integrity of the data blocks in the transaction. See jbd2_checksum_type for more info. |
0xD | unsigned char | h_chksum_size | The number of bytes used by the checksum. Most likely 4. |
0xE | unsigned char | h_padding[2] | |
0x10 | __be32 | h_chksum[JBD2_CHECKSUM_BYTES] | 32 bytes of space to store checksums. If JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the first __be32 is the checksum of the journal UUID and the entire commit block, with this field zeroed. If JBD2_FEATURE_COMPAT_CHECKSUM is set, the first __be32 is the crc32 of all the blocks already written to the transaction. |
0x30 | __be64 | h_commit_sec | The time that the transaction was committed, in seconds since the epoch. |
0x38 | __be32 | h_commit_nsec | Nanoseconds component of the above timestamp. |
3.6.9 Fast commits¶
The fast commit area is organized as a log of tag-length-value (TLV) records. Each TLV begins with a struct ext4_fc_tl, which stores the tag and the length of the entire field, followed by a variable-length, tag-specific value. Here is the list of supported tags and their meanings:
Tag | Meaning | Value struct | Description |
---|---|---|---|
EXT4_FC_TAG_HEAD | Fast commit area header | struct ext4_fc_head | Stores the TID of the transaction after which these fast commits should be applied. |
EXT4_FC_TAG_ADD_RANGE | Add extent to inode | struct ext4_fc_add_range | Stores the inode number and extent to be added in this inode |
EXT4_FC_TAG_DEL_RANGE | Remove logical offsets to inode | struct ext4_fc_del_range | Stores the inode number and the logical offset range that needs to be removed |
EXT4_FC_TAG_CREAT | Create directory entry for a newly created file | struct ext4_fc_dentry_info | Stores the parent inode number, inode number and directory entry of the newly created file |
EXT4_FC_TAG_LINK | Link a directory entry to an inode | struct ext4_fc_dentry_info | Stores the parent inode number, inode number and directory entry |
EXT4_FC_TAG_UNLINK | Unlink a directory entry of an inode | struct ext4_fc_dentry_info | Stores the parent inode number, inode number and directory entry |
EXT4_FC_TAG_PAD | Padding (unused area) | None | Unused bytes in the fast commit area. |
EXT4_FC_TAG_TAIL | Mark the end of a fast commit | struct ext4_fc_tail | Stores the TID of the commit, CRC of the fast commit of which this tag represents the end of |
3.6.10 Fast Commit Replay Idempotence¶
Fast commit tags are idempotent, provided the recovery code follows certain rules. The guiding principle that the commit path follows while committing is that it stores the result of a particular operation instead of storing the procedure.
Let’s consider this rename operation: ‘mv /a /b’. Let’s assume dirent ‘/a’ was associated with inode 10. During fast commit, instead of storing this operation as a procedure “rename a to b”, we store the resulting file system state as a “series” of outcomes:
- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount
Now when recovery code runs, it needs to “enforce” this state on the file system. This is what guarantees idempotence of fast commit replay.
Let’s take an example of a procedure that is not idempotent and see how fast commits make it idempotent. Consider the following sequence of operations:
1. rm A
2. mv B A
3. read A
If we store this sequence of operations as is, then the replay is not idempotent. Let’s say that while in replay, we crash after (2). During the second replay, file A (which was actually created as a result of the “mv B A” operation) would get deleted. Thus, the file named A would be absent when we try to read A. So, this sequence of operations is not idempotent. However, as mentioned above, instead of storing the procedure, fast commits store the outcome of each procedure. Thus the fast commit log for the above procedure would be as follows:
(Let’s assume dirent A was linked to inode 10 and dirent B was linked to inode 11 before the replay)
1. Unlink A
2. Link A to inode 11
3. Unlink B
4. Inode 11
If we crash after (3) we will have file A linked to inode 11. During the second replay, we will remove file A (inode 11), but we will then recreate it and make it point to inode 11. We won’t find B, so we’ll just skip that step. At this point, the refcount for inode 11 is not reliable, but that gets fixed by the replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure into a series of idempotent outcomes, fast commits ensure idempotence during the replay.
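The argument above can be checked with a toy model. The sketch below is not the kernel's replay code: the directory is a plain dictionary, the tag names are hypothetical, and the inode tag is simplified to carry only a link count. It replays the outcome-style log once, then simulates a crash after step (3) followed by a full second replay, and confirms both paths reach the same state.

```python
import copy


def replay(fs: dict, log: list) -> dict:
    """Enforce each recorded outcome on fs; safe to run any number of times."""
    for tag, *args in log:
        if tag == "unlink":
            # A dirent that is already gone is simply skipped.
            fs["dirents"].pop(args[0], None)
        elif tag == "link":
            name, ino = args
            fs["dirents"][name] = ino
        elif tag == "inode":
            ino, nlink = args
            # The inode tag carries the final state, so it overwrites
            # whatever an interrupted replay left behind.
            fs["nlink"][ino] = nlink
    return fs


# Before the replay, dirent A is linked to inode 10 and B to inode 11.
initial = {"dirents": {"A": 10, "B": 11}, "nlink": {10: 1, 11: 1}}
log = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]

clean = replay(copy.deepcopy(initial), log)
crashed = replay(copy.deepcopy(initial), log[:3])  # crash after step (3)
recovered = replay(crashed, log)                   # second, complete replay
```

Here clean and recovered compare equal, and in both A ends up pointing at inode 11.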
3.6.11 Journal Checkpoint¶
Checkpointing the journal ensures all transactions and their associated buffers are submitted to the disk. In-progress transactions are waited upon and included in the checkpoint. Checkpointing is used internally during critical updates to the filesystem including journal recovery, filesystem resizing, and freeing of the journal_t structure.
A journal checkpoint can be triggered from userspace via the ioctl EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags. Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN can be used to verify input to the ioctl. It returns error if there is any invalid input, otherwise it returns success without performing any checkpointing. This can be used to check whether the ioctl exists on a system and to verify there are no issues with arguments or flags. The other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be discarded or zero-filled, respectively, after the journal checkpoint is complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT cannot both be set. The ioctl may be useful when snapshotting a system or for complying with content deletion SLOs.
3.7 Orphan file¶
In unix there can be inodes that are unlinked from the directory hierarchy but are still alive because they are open. In case of a crash, the filesystem has to clean up these inodes, as otherwise they (and the blocks referenced from them) would leak. Similarly, if we truncate or extend a file, we may not be able to perform the operation in a single journalling transaction. In such a case we track the inode as an orphan so that, in case of a crash, extra blocks allocated to the file get truncated.
Traditionally ext4 tracks orphan inodes in the form of a singly linked list, where the superblock contains the inode number of the last orphan inode (s_last_orphan field) and each inode contains the inode number of the previously orphaned inode (we overload the i_dtime inode field for this). However, this filesystem-global singly linked list is a scalability bottleneck for workloads that result in heavy creation of orphan inodes. When the orphan file feature (COMPAT_ORPHAN_FILE) is enabled, the filesystem has a special inode (referenced from the superblock through s_orphan_file_inum) with several blocks. Each of these blocks has the following structure:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | Array of __le32 entries | Orphan inode entries | Each __le32 entry is either empty (0) or it contains inode number of an orphan inode. |
blocksize-8 | __le32 | ob_magic | Magic value stored in orphan block tail (0x0b10ca04) |
blocksize-4 | __le32 | ob_checksum | Checksum of the orphan block. |
When a filesystem with the orphan file feature is mounted writable, we set the RO_COMPAT_ORPHAN_PRESENT feature in the superblock to indicate there may be valid orphan entries. If we see this feature when mounting the filesystem, we read the whole orphan file and process all orphan inodes found there as usual. When cleanly unmounting the filesystem we remove the RO_COMPAT_ORPHAN_PRESENT feature to avoid unnecessary scanning of the orphan file and also to make the filesystem fully compatible with older kernels.
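As an illustration of the orphan block layout, the following sketch collects the non-empty entries from one block of the orphan file. The function name is hypothetical and the block checksum is deliberately not verified here; the entry width, tail position, magic value, and little-endian byte order come from the table above.

```python
import struct

ORPHAN_BLOCK_MAGIC = 0x0B10CA04


def orphan_inodes_in_block(block: bytes) -> list:
    """Return the inode numbers recorded in one orphan file block."""
    blocksize = len(block)
    # The tail occupies the last 8 bytes: ob_magic, then ob_checksum.
    magic, _checksum = struct.unpack_from("<II", block, blocksize - 8)
    if magic != ORPHAN_BLOCK_MAGIC:
        raise ValueError("bad orphan block magic")
    count = (blocksize - 8) // 4
    entries = struct.unpack_from(f"<{count}I", block, 0)
    # An entry of 0 means the slot is empty.
    return [ino for ino in entries if ino != 0]
```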
4 Dynamic Structures¶
Dynamic metadata are created on the fly when files and blocks are allocated to files.
4.1 Index Nodes¶
In a regular UNIX filesystem, the inode stores all the metadata pertaining to the file (time stamps, block maps, extended attributes, etc), not the directory entry. To find the information associated with a file, one must traverse the directory files to find the directory entry associated with a file, then load the inode to find the metadata for that file. ext4 appears to cheat (for performance reasons) a little bit by storing a copy of the file type (normally stored in the inode) in the directory entry. (Compare all this to FAT, which stores all the file information directly in the directory entry, but does not support hard links and is in general more seek-happy than ext4 due to its simpler block allocator and extensive use of linked lists.)
The inode table is a linear array of struct ext4_inode. The table is sized to have enough blocks to store at least sb.s_inode_size * sb.s_inodes_per_group bytes. The number of the block group containing an inode can be calculated as (inode_number - 1) / sb.s_inodes_per_group, and the offset into the group’s table is (inode_number - 1) % sb.s_inodes_per_group. There is no inode 0.
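The two formulas above translate directly into code. In this sketch the function name is hypothetical, and the third return value (the byte offset of the record within the group's inode table) is derived by multiplying the slot index by the on-disk inode record size.

```python
def locate_inode(inode_number: int, inodes_per_group: int, inode_size: int):
    """Return (block group, index in that group's table, byte offset in table)."""
    if inode_number < 1:
        raise ValueError("there is no inode 0")
    index = inode_number - 1
    group = index // inodes_per_group
    slot = index % inodes_per_group
    return group, slot, slot * inode_size
```

For example, with 8192 inodes per group and 256-byte inode records, the journal inode (inode 8) is slot 7 of group 0, at byte offset 1792 in that group's table.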
The inode checksum is calculated against the FS UUID, the inode number, and the inode structure itself.
The inode table entry is laid out in struct ext4_inode.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | i_mode | File mode. See the table i_mode below. |
0x2 | __le16 | i_uid | Lower 16-bits of Owner UID. |
0x4 | __le32 | i_size_lo | Lower 32-bits of size in bytes. |
0x8 | __le32 | i_atime | Last access time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the checksum of the value. |
0xC | __le32 | i_ctime | Last inode change time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the lower 32 bits of the attribute value’s reference count. |
0x10 | __le32 | i_mtime | Last data modification time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the number of the inode that owns the extended attribute. |
0x14 | __le32 | i_dtime | Deletion Time, in seconds since the epoch. |
0x18 | __le16 | i_gid | Lower 16-bits of GID. |
0x1A | __le16 | i_links_count | Hard link count. Normally, ext4 does not permit an inode to have more than 65,000 hard links. This applies to files as well as directories, which means that there cannot be more than 64,998 subdirectories in a directory (each subdirectory’s ‘..’ entry counts as a hard link, as does the ‘.’ entry in the directory itself). With the DIR_NLINK feature enabled, ext4 supports more than 64,998 subdirectories by setting this field to 1 to indicate that the number of hard links is not known. |
0x1C | __le32 | i_blocks_lo | Lower 32-bits of “block” count. If the huge_file feature flag is not set on the filesystem, the file consumes i_blocks_lo 512-byte blocks on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in inode.i_flags, then the file consumes i_blocks_lo + (i_blocks_hi << 32) 512-byte blocks on disk. If huge_file is set and EXT4_HUGE_FILE_FL IS set in inode.i_flags, then this file consumes i_blocks_lo + (i_blocks_hi << 32) filesystem blocks on disk. |
0x20 | __le32 | i_flags | Inode flags. See the table i_flags below. |
0x24 | 4 bytes | i_osd1 | See the table i_osd1 for more details. |
0x28 | 60 bytes | i_block[EXT4_N_BLOCKS=15] | Block map or extent tree. See the section “The Contents of inode.i_block”. |
0x64 | __le32 | i_generation | File version (for NFS). |
0x68 | __le32 | i_file_acl_lo | Lower 32-bits of extended attribute block. ACLs are of course one of many possible extended attributes; I think the name of this field is a result of the first use of extended attributes being for ACLs. |
0x6C | __le32 | i_size_high / i_dir_acl | Upper 32-bits of file/directory size. In ext2/3 this field was named i_dir_acl, though it was usually set to zero and never used. |
0x70 | __le32 | i_obso_faddr | (Obsolete) fragment address. |
0x74 | 12 bytes | i_osd2 | See the table i_osd2 for more details. |
0x80 | __le16 | i_extra_isize | Size of this inode - 128. Alternately, the size of the extended inode fields beyond the original ext2 inode, including this field. |
0x82 | __le16 | i_checksum_hi | Upper 16-bits of the inode checksum. |
0x84 | __le32 | i_ctime_extra | Extra change time bits. This provides sub-second precision. See Inode Timestamps section. |
0x88 | __le32 | i_mtime_extra | Extra modification time bits. This provides sub-second precision. |
0x8C | __le32 | i_atime_extra | Extra access time bits. This provides sub-second precision. |
0x90 | __le32 | i_crtime | File creation time, in seconds since the epoch. |
0x94 | __le32 | i_crtime_extra | Extra file creation time bits. This provides sub-second precision. |
0x98 | __le32 | i_version_hi | Upper 32-bits for version number. |
0x9C | __le32 | i_projid | Project ID. |
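The three cases described for i_blocks_lo can be summarized in a short function. This is a sketch with a hypothetical name and parameters; i_blocks_hi stands for the upper 16 bits of the block count stored in the Linux variant of the osd2 field, and huge_file_feature says whether the filesystem-wide huge_file feature flag is set.

```python
EXT4_HUGE_FILE_FL = 0x40000


def bytes_on_disk(i_blocks_lo: int, i_blocks_hi: int, i_flags: int,
                  huge_file_feature: bool, block_size: int) -> int:
    """Space consumed by a file, in bytes, following the i_blocks_lo rules."""
    if not huge_file_feature:
        # Only the low 32 bits count, in 512-byte units.
        return i_blocks_lo * 512
    count = i_blocks_lo + (i_blocks_hi << 32)
    if i_flags & EXT4_HUGE_FILE_FL:
        # The count is in filesystem blocks rather than 512-byte units.
        return count * block_size
    return count * 512
```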
The i_mode value is a combination of the following flags:
Value | Description |
---|---|
0x1 | S_IXOTH (Others may execute) |
0x2 | S_IWOTH (Others may write) |
0x4 | S_IROTH (Others may read) |
0x8 | S_IXGRP (Group members may execute) |
0x10 | S_IWGRP (Group members may write) |
0x20 | S_IRGRP (Group members may read) |
0x40 | S_IXUSR (Owner may execute) |
0x80 | S_IWUSR (Owner may write) |
0x100 | S_IRUSR (Owner may read) |
0x200 | S_ISVTX (Sticky bit) |
0x400 | S_ISGID (Set GID) |
0x800 | S_ISUID (Set UID) |
These are mutually-exclusive file types: | |
0x1000 | S_IFIFO (FIFO) |
0x2000 | S_IFCHR (Character device) |
0x4000 | S_IFDIR (Directory) |
0x6000 | S_IFBLK (Block device) |
0x8000 | S_IFREG (Regular file) |
0xA000 | S_IFLNK (Symbolic link) |
0xC000 | S_IFSOCK (Socket) |
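Since the file types are mutually exclusive values in the top four bits while the permission bits are independent flags, decoding i_mode means masking the two parts separately. A minimal sketch, with a hypothetical function name and human-readable labels chosen for illustration:

```python
FILE_TYPES = {
    0x1000: "FIFO",
    0x2000: "character device",
    0x4000: "directory",
    0x6000: "block device",
    0x8000: "regular file",
    0xA000: "symbolic link",
    0xC000: "socket",
}


def describe_mode(i_mode: int):
    """Split i_mode into (file type label, permission and set-id bits)."""
    # The type is a value, not a bit set, so compare the whole top nibble.
    file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")
    permissions = i_mode & 0x0FFF
    return file_type, permissions
```

For instance, an i_mode of 0x81A4 describes a regular file with permissions 0644 (octal).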
The i_flags field is a combination of these values:
Value | Description |
---|---|
0x1 | This file requires secure deletion (EXT4_SECRM_FL). (not implemented) |
0x2 | This file should be preserved, should undeletion be desired (EXT4_UNRM_FL). (not implemented) |
0x4 | File is compressed (EXT4_COMPR_FL). (not really implemented) |
0x8 | All writes to the file must be synchronous (EXT4_SYNC_FL). |
0x10 | File is immutable (EXT4_IMMUTABLE_FL). |
0x20 | File can only be appended (EXT4_APPEND_FL). |
0x40 | The dump(1) utility should not dump this file (EXT4_NODUMP_FL). |
0x80 | Do not update access time (EXT4_NOATIME_FL). |
0x100 | Dirty compressed file (EXT4_DIRTY_FL). (not used) |
0x200 | File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used) |
0x400 | Do not compress file (EXT4_NOCOMPR_FL). (not used) |
0x800 | Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was EXT4_ECOMPR_FL (compression error), which was never used. |
0x1000 | Directory has hashed indexes (EXT4_INDEX_FL). |
0x2000 | AFS magic directory (EXT4_IMAGIC_FL). |
0x4000 | File data must always be written through the journal (EXT4_JOURNAL_DATA_FL). |
0x8000 | File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4) |
0x10000 | All directory entry data should be written synchronously (see dirsync) (EXT4_DIRSYNC_FL). |
0x20000 | Top of directory hierarchy (EXT4_TOPDIR_FL). |
0x40000 | This is a huge file (EXT4_HUGE_FILE_FL). |
0x80000 | Inode uses extents (EXT4_EXTENTS_FL). |
0x100000 | Verity protected file (EXT4_VERITY_FL). |
0x200000 | Inode stores a large extended attribute value in its data blocks (EXT4_EA_INODE_FL). |
0x400000 | This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL). (deprecated) |
0x01000000 | Inode is a snapshot (EXT4_SNAPFILE_FL). (not in mainline) |
0x04000000 | Snapshot is being deleted (EXT4_SNAPFILE_DELETED_FL). (not in mainline) |
0x08000000 | Snapshot shrink has completed (EXT4_SNAPFILE_SHRUNK_FL). (not in mainline) |
0x10000000 | Inode has inline data (EXT4_INLINE_DATA_FL). |
0x20000000 | Create children with the same project ID (EXT4_PROJINHERIT_FL). |
0x80000000 | Reserved for ext4 library (EXT4_RESERVED_FL). |
Aggregate flags: | |
0x705BDFFF | User-visible flags. |
0x604BC0FF | User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel’s EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of these flags in a special manner and they are masked out of the set of flags that are saved directly to i_flags. |
The osd1 field has multiple meanings depending on the creator:
Linux:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | l_i_version | Inode version. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the upper 32 bits of the attribute value’s reference count. |
Hurd:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | h_i_translator | ?? |
Masix:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | m_i_reserved | ?? |
The osd2 field has multiple meanings depending on the filesystem creator:
Linux:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | l_i_blocks_high | Upper 16-bits of the block count. Please see the note attached to i_blocks_lo. |
0x2 | __le16 | l_i_file_acl_high | Upper 16-bits of the extended attribute block (historically, the file ACL location). See the Extended Attributes section below. |
0x4 | __le16 | l_i_uid_high | Upper 16-bits of the Owner UID. |
0x6 | __le16 | l_i_gid_high | Upper 16-bits of the GID. |
0x8 | __le16 | l_i_checksum_lo | Lower 16-bits of the inode checksum. |
0xA | __le16 | l_i_reserved | Unused. |
Hurd:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | h_i_reserved1 | ?? |
0x2 | __u16 | h_i_mode_high | Upper 16-bits of the file mode. |
0x4 | __le16 | h_i_uid_high | Upper 16-bits of the Owner UID. |
0x6 | __le16 | h_i_gid_high | Upper 16-bits of the GID. |
0x8 | __u32 | h_i_author | Author code? |
Masix:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | h_i_reserved1 | ?? |
0x2 | __u16 | m_i_file_acl_high | Upper 16-bits of the extended attribute block (historically, the file ACL location). |
0x4 | __u32 | m_i_reserved2[2] | ?? |
4.1.1 Inode Size
In ext2 and ext3, the inode structure size was fixed at 128 bytes (EXT2_GOOD_OLD_INODE_SIZE) and each inode had a disk record size of 128 bytes. Starting with ext4, it is possible to allocate a larger on-disk inode at format time for all inodes in the filesystem to provide space beyond the end of the original ext2 inode. The on-disk inode record size is recorded in the superblock as s_inode_size. The number of bytes actually used by struct ext4_inode beyond the original 128-byte ext2 inode is recorded in the i_extra_isize field for each inode, which allows struct ext4_inode to grow for a new kernel without having to upgrade all of the on-disk inodes. Access to fields beyond EXT2_GOOD_OLD_INODE_SIZE should be verified to be within i_extra_isize. By default, ext4 inode records are 256 bytes, and (as of August 2019) the inode structure is 160 bytes (i_extra_isize = 32). The extra space between the end of the inode structure and the end of the inode record can be used to store extended attributes. Each inode record can be as large as the filesystem block size, though this is not terribly efficient.
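The i_extra_isize check described above can be sketched as follows. This is only an illustration: the helper name `field_fits` is hypothetical, and the example field offsets (0x84 for i_ctime_extra, 0x90 for i_crtime) are assumed from the inode table earlier in this document.

```python
EXT2_GOOD_OLD_INODE_SIZE = 128  # size of the original ext2 inode, in bytes


def field_fits(field_offset, field_size, i_extra_isize):
    """Return True if a field lies entirely inside the part of the inode
    that this particular on-disk inode claims to use.

    field_offset is measured from the start of the inode record.
    (Hypothetical helper; offsets passed in are the caller's assumption.)
    """
    field_end = field_offset + field_size
    return field_end <= EXT2_GOOD_OLD_INODE_SIZE + i_extra_isize


# With the default i_extra_isize of 32, a 4-byte field at 0x90 is usable;
# an inode written with i_extra_isize = 4 does not reach that far.
print(field_fits(0x90, 4, 32), field_fits(0x90, 4, 4))
```

A reader of the inode should fall back to the 32-bit fields (or treat the value as absent) whenever this check fails.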
4.1.2 Finding an Inode
Each block group contains sb->s_inodes_per_group inodes. Because inode 0 is defined not to exist, this formula can be used to find the block group that an inode lives in: bg = (inode_num - 1) / sb->s_inodes_per_group. The particular inode can be found within the block group’s inode table at index = (inode_num - 1) % sb->s_inodes_per_group. To get the byte address within the inode table, use offset = index * sb->s_inode_size.
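The three formulas above can be written as a small function. This is a minimal sketch; the function name and the example superblock values (8192 inodes per group, 256-byte inode records) are assumptions chosen for illustration.

```python
def locate_inode(inode_num, inodes_per_group, inode_size):
    """Return (block group, index in that group's inode table, byte offset
    within the inode table) for an inode number."""
    if inode_num < 1:
        raise ValueError("inode numbers start at 1; inode 0 does not exist")
    bg = (inode_num - 1) // inodes_per_group      # integer division
    index = (inode_num - 1) % inodes_per_group
    offset = index * inode_size
    return bg, index, offset


# Assumed example values for s_inodes_per_group and s_inode_size.
print(locate_inode(12, 8192, 256))
```

The byte offset is relative to the start of the group's inode table, whose location comes from the block group descriptor.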
4.1.3 Inode Timestamps
Four timestamps are recorded in the lower 128 bytes of the inode structure: inode change time (ctime), access time (atime), data modification time (mtime), and deletion time (dtime). The four fields are 32-bit signed integers that represent seconds since the Unix epoch (1970-01-01 00:00:00 GMT), which means that the fields will overflow in January 2038. If the filesystem does not have the orphan_file feature, inodes that are not linked from any directory but are still open (orphan inodes) have the dtime field overloaded for use with the orphan list. The superblock field s_last_orphan points to the first inode in the orphan list; dtime is then the number of the next orphaned inode, or zero if there are no more orphans.
If the inode structure size sb->s_inode_size is larger than 128 bytes and the i_extra_isize field is large enough to encompass the respective i_[cma]time_extra field, the ctime, atime, and mtime inode fields are widened to 64 bits. Within this “extra” 32-bit field, the lower two bits are used to extend the 32-bit seconds field to be 34 bits wide; the upper 30 bits are used to provide nanosecond timestamp accuracy. Therefore, timestamps should not overflow until May 2446. dtime was not widened. There is also a fifth timestamp to record inode creation time (crtime); this field is 64 bits wide and decoded in the same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible through the regular stat() interface, though debugfs will report them.
We use the 32-bit signed time value plus (2^32 * (extra epoch bits)). In other words:
Extra epoch bits | MSB of 32-bit time | Adjustment for signed 32-bit to 64-bit tv_sec | Decoded 64-bit tv_sec | Valid time range |
---|---|---|---|---|
0 0 | 1 | 0 | -0x80000000 - -0x00000001 | 1901-12-13 to 1969-12-31 |
0 0 | 0 | 0 | 0x000000000 - 0x07fffffff | 1970-01-01 to 2038-01-19 |
0 1 | 1 | 0x100000000 | 0x080000000 - 0x0ffffffff | 2038-01-19 to 2106-02-07 |
0 1 | 0 | 0x100000000 | 0x100000000 - 0x17fffffff | 2106-02-07 to 2174-02-25 |
1 0 | 1 | 0x200000000 | 0x180000000 - 0x1ffffffff | 2174-02-25 to 2242-03-16 |
1 0 | 0 | 0x200000000 | 0x200000000 - 0x27fffffff | 2242-03-16 to 2310-04-04 |
1 1 | 1 | 0x300000000 | 0x280000000 - 0x2ffffffff | 2310-04-04 to 2378-04-22 |
1 1 | 0 | 0x300000000 | 0x300000000 - 0x37fffffff | 2378-04-22 to 2446-05-10 |
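The decoding rule (signed 32-bit seconds plus 2^32 times the extra epoch bits, with nanoseconds in the upper 30 bits) can be sketched like this. The function name is hypothetical, and both arguments are assumed to be the raw unsigned 32-bit values as read from the little-endian on-disk fields.

```python
def decode_timestamp(raw_seconds, extra):
    """Decode a [cma]time / crtime pair into (tv_sec, nanoseconds).

    raw_seconds: the 32-bit seconds field, read as an unsigned integer.
    extra:       the matching 32-bit *_extra field, read as unsigned.
    """
    # Reinterpret the 32-bit seconds field as a signed value.
    if raw_seconds & 0x80000000:
        seconds = raw_seconds - (1 << 32)
    else:
        seconds = raw_seconds
    epoch_bits = extra & 0x3          # lower two bits extend the epoch
    nanoseconds = extra >> 2          # upper 30 bits hold nanoseconds
    return seconds + (epoch_bits << 32), nanoseconds


# Epoch bits "0 1" with the MSB of the 32-bit time set: first second
# past the 2038 rollover.
print(decode_timestamp(0x80000000, 0b01))
```

This follows the table as written and does not try to reproduce the historical kernel bug mentioned below for pre-1970 dates.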
This is a somewhat odd encoding since there are effectively seven times as many positive values as negative values. There have also been long-standing bugs decoding and encoding dates beyond 2038, which don’t seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels incorrectly use the extra epoch bits 1,1 for dates between 1901 and 1970. At some point the kernel will be fixed and e2fsck will fix this situation, assuming that it is run before 2310.
4.2 The Contents of inode.i_block
Depending on the type of file an inode describes, the 60 bytes of storage in inode.i_block can be used in different ways. In general, regular files and directories will use it for file block indexing information, and special files will use it for special purposes.
4.2.1 Symbolic Links
The target of a symbolic link will be stored in this field if the target string is less than 60 bytes long. Otherwise, either extents or block maps will be used to allocate data blocks to store the link target.
4.2.2 Direct/Indirect Block Addressing
In ext2/3, file block numbers were mapped to logical block numbers by means of an (up to) three level 1-1 block map. To find the logical block that stores a particular file block, the code would navigate through this increasingly complicated structure. Notice that there is neither a magic number nor a checksum to provide any level of confidence that the block isn’t full of garbage.
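i.i_block Offset | Where It Points |
---|---|
0 to 11 | Direct map to file blocks 0 to 11. |
12 | Indirect block (file blocks 12 to ($block_size / 4) + 11, or 12 to 1035 if 4KiB blocks). Indirect block offsets 0 to ($block_size / 4) - 1 each map directly to one file block, for ($block_size / 4) blocks in total (1024 if 4KiB blocks). |
13 | Double-indirect block (file blocks ($block_size / 4) + 12 to ($block_size / 4) ^ 2 + ($block_size / 4) + 11, or 1036 to 1049611 if 4KiB blocks). Double-indirect block offsets 0 to ($block_size / 4) - 1 each point to an indirect block (1024 indirect blocks if 4KiB blocks); each indirect block maps directly to ($block_size / 4) blocks (1024 if 4KiB blocks). |
14 | Triple-indirect block (file blocks ($block_size / 4) ^ 2 + ($block_size / 4) + 12 to ($block_size / 4) ^ 3 + ($block_size / 4) ^ 2 + ($block_size / 4) + 11, or 1049612 to 1074791435 if 4KiB blocks). Triple-indirect block offsets 0 to ($block_size / 4) - 1 each point to a double-indirect block (1024 if 4KiB blocks); each double-indirect block points to ($block_size / 4) indirect blocks, and each indirect block maps directly to ($block_size / 4) blocks (1024 if 4KiB blocks). |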
Note that with this block mapping scheme, it is necessary to fill out a lot of mapping data even for a large contiguous file! This inefficiency led to the creation of the extent mapping scheme, discussed below.
Notice also that a file using this mapping scheme cannot be placed higher than 2^32 blocks.
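To make the navigation through the block map concrete, here is a rough sketch that turns a file block number into the chain of offsets to follow. The function name is hypothetical, and it assumes 4-byte block pointers, so each indirect block holds $block_size / 4 entries.

```python
def block_map_path(file_block, block_size=4096):
    """Return the offsets to follow: first the i_block slot (0 to 14), then
    the offset inside each successive indirect block."""
    ptrs = block_size // 4              # pointers per indirect block
    if file_block < 0:
        raise ValueError("file block numbers are non-negative")
    if file_block < 12:
        return [file_block]             # direct slots 0 to 11
    rel = file_block - 12
    if rel < ptrs:
        return [12, rel]                # single indirect
    rel -= ptrs
    if rel < ptrs ** 2:
        return [13, rel // ptrs, rel % ptrs]        # double indirect
    rel -= ptrs ** 2
    if rel < ptrs ** 3:
        return [14, rel // ptrs ** 2, (rel // ptrs) % ptrs, rel % ptrs]
    raise ValueError("file block is beyond the reach of the block map")


# First block that needs the double-indirect block with 4KiB blocks.
print(block_map_path(1036))
```

Each element after the first is an index into a block that must itself be read from disk, which is why a deep lookup costs up to three extra reads.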
4.2.3 Extent Tree
In ext4, the file to logical block map has been replaced with an extent tree. Under the old scheme, allocating a contiguous run of 1,000 blocks requires an indirect block to map all 1,000 entries; with extents, the mapping is reduced to a single struct ext4_extent with ee_len = 1000. If flex_bg is enabled, it is possible to allocate very large files with a single extent, at a considerable reduction in metadata block use, and some improvement in disk efficiency. The inode must have the extents flag (0x80000) set for this feature to be in use.
Extents are arranged as a tree. Each node of the tree begins with a struct ext4_extent_header. If the node is an interior node (eh.eh_depth > 0), the header is followed by eh.eh_entries instances of struct ext4_extent_idx; each of these index entries points to a block containing more nodes in the extent tree. If the node is a leaf node (eh.eh_depth == 0), then the header is followed by eh.eh_entries instances of struct ext4_extent; these instances point to the file’s data blocks. The root node of the extent tree is stored in inode.i_block, which allows for the first four extents to be recorded without the use of extra metadata blocks.
The extent tree header is recorded in struct ext4_extent_header, which is 12 bytes long:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | eh_magic | Magic number, 0xF30A. |
0x2 | __le16 | eh_entries | Number of valid entries following the header. |
0x4 | __le16 | eh_max | Maximum number of entries that could follow the header. |
0x6 | __le16 | eh_depth | Depth of this extent node in the extent tree. 0 = this extent node points to data blocks; otherwise, this extent node points to other extent nodes. The extent tree can be at most 5 levels deep: a logical block number can be at most 2^32, and the smallest n that satisfies 4*(((blocksize - 12)/12)^n) >= 2^32 is 5. |
0x8 | __le32 | eh_generation | Generation of the tree. (Used by Lustre, but not standard ext4). |
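The depth bound quoted in the eh_depth description can be checked numerically. The sketch below simply evaluates the stated inequality; the function name is hypothetical, and integer division is assumed for (blocksize - 12)/12 since an extent block holds a whole number of 12-byte entries.

```python
def smallest_sufficient_depth(block_size):
    """Smallest n with 4 * ((block_size - 12) // 12) ** n >= 2 ** 32."""
    per_block = (block_size - 12) // 12   # 12-byte entries after the header
    n = 0
    while 4 * per_block ** n < 2 ** 32:
        n += 1
    return n


# The bound of 5 corresponds to the smallest block size, 1KiB.
print(smallest_sufficient_depth(1024), smallest_sufficient_depth(4096))
```

Under these assumptions the value 5 is the worst case for 1KiB blocks; larger block sizes satisfy the inequality with a smaller n.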
Internal nodes of the extent tree, also known as index nodes, are recorded as struct ext4_extent_idx, and are 12 bytes long:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | ei_block | This index node covers file blocks from ‘block’ onward. |
0x4 | __le32 | ei_leaf_lo | Lower 32-bits of the block number of the extent node that is the next level lower in the tree. The tree node pointed to can be either another internal node or a leaf node, described below. |
0x8 | __le16 | ei_leaf_hi | Upper 16-bits of the previous field. |
0xA | __u16 | ei_unused | |
Leaf nodes of the extent tree are recorded as struct ext4_extent, and are also 12 bytes long:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | ee_block | First file block number that this extent covers. |
0x4 | __le16 | ee_len | Number of blocks covered by extent. If the value of this field is <= 32768, the extent is initialized. If the value of the field is > 32768, the extent is uninitialized and the actual extent length is ee_len - 32768. Therefore, the maximum length of an initialized extent is 32768 blocks, and the maximum length of an uninitialized extent is 32767. |
0x6 | __le16 | ee_start_hi | Upper 16-bits of the block number to which this extent points. |
0x8 | __le32 | ee_start_lo | Lower 32-bits of the block number to which this extent points. |
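The three 12-byte layouts above can be decoded with a few lines of Python. This is a sketch under the stated table layouts only: the function name and the returned tuple shapes are hypothetical, and no checksum or bounds validation beyond the basics is attempted.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
EXT_INIT_MAX_LEN = 32768   # ee_len values above this mark uninitialized extents


def parse_extent_node(buf):
    """Parse one extent tree node (for example the 60 bytes of i_block).

    Returns (depth, records). For a leaf, each record is
    (file_block, length, initialized, start_block); for an interior node,
    each record is (file_block, child_block).
    """
    magic, entries, max_entries, depth, _generation = struct.unpack_from("<HHHHI", buf, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("bad extent header magic")
    if entries > max_entries:
        raise ValueError("eh_entries exceeds eh_max")
    records = []
    for i in range(entries):
        offset = 12 + 12 * i          # entries follow the 12-byte header
        if depth == 0:
            ee_block, ee_len, start_hi, start_lo = struct.unpack_from("<IHHI", buf, offset)
            initialized = ee_len <= EXT_INIT_MAX_LEN
            length = ee_len if initialized else ee_len - EXT_INIT_MAX_LEN
            records.append((ee_block, length, initialized, (start_hi << 32) | start_lo))
        else:
            ei_block, leaf_lo, leaf_hi, _unused = struct.unpack_from("<IIHH", buf, offset)
            records.append((ei_block, (leaf_hi << 32) | leaf_lo))
    return depth, records
```

For an interior node, the caller would read each child_block from disk and call the function again until it reaches depth 0.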
Prior to the introduction of metadata checksums, the extent header + extent entries always left at least 4 bytes of unallocated space at the end of each extent tree data block (because (2^x % 12) >= 4). Therefore, the 32-bit checksum is inserted into this space. The 4 extents in the inode do not need checksumming, since the inode is already checksummed. The checksum is calculated against the FS UUID, the inode number, the inode generation, and the entire extent block leading up to (but not including) the checksum itself.
struct ext4_extent_tail is 4 bytes long:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | eb_checksum | Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock) |
4.2.4 Inline Data
If the inline data feature is enabled for the filesystem and the flag is set for the inode, it is possible that the first 60 bytes of the file data are stored here.
4.3 Directory Entries
In an ext4 filesystem, a directory is more or less a flat file that maps an arbitrary byte string (usually ASCII) to an inode number on the filesystem. There can be many directory entries across the filesystem that reference the same inode number–these are known as hard links, and that is why hard links cannot reference files on other filesystems. As such, directory entries are found by reading the data block(s) associated with a directory file for the particular directory entry that is desired.
4.3.1 Linear (Classic) Directories
By default, each directory lists its entries in an “almost-linear” array. I write “almost” because it is not a linear array in the memory sense: directory entries are not split across filesystem blocks. Therefore, it is more accurate to say that a directory is a series of data blocks and that each block contains a linear array of directory entries. The end of each per-block array is signified by reaching the end of the block; the last entry in the block has a record length that takes it all the way to the end of the block. The end of the entire directory is of course signified by reaching the end of the file. Unused directory entries are signified by inode = 0. By default the filesystem uses struct ext4_dir_entry_2 for directory entries; if the “filetype” feature flag is not set, it uses struct ext4_dir_entry instead.
The original directory entry format is struct ext4_dir_entry, which is at most 263 bytes long, though on disk you’ll need to reference dirent.rec_len to know for sure.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | inode | Number of the inode that this directory entry points to. |
0x4 | __le16 | rec_len | Length of this directory entry. Must be a multiple of 4. |
0x6 | __le16 | name_len | Length of the file name. |
0x8 | char | name[EXT4_NAME_LEN] | File name. |
Since file names cannot be longer than 255 bytes, the new directory entry format shortens the name_len field and uses the space for a file type flag, probably to avoid having to load every inode during directory tree traversal. This format is ext4_dir_entry_2, which is at most 263 bytes long, though on disk you’ll need to reference dirent.rec_len to know for sure.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | inode | Number of the inode that this directory entry points to. |
0x4 | __le16 | rec_len | Length of this directory entry. |
0x6 | __u8 | name_len | Length of the file name. |
0x7 | __u8 | file_type | File type code, see ftype table below. |
0x8 | char | name[EXT4_NAME_LEN] | File name. |
The directory file type is one of the following values:
Value | Description |
---|---|
0x0 | Unknown. |
0x1 | Regular file. |
0x2 | Directory. |
0x3 | Character device file. |
0x4 | Block device file. |
0x5 | FIFO. |
0x6 | Socket. |
0x7 | Symbolic link. |
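Walking one block of a linear directory, as described above, amounts to hopping from entry to entry by rec_len and skipping entries whose inode is 0. The following is a minimal sketch for the ext4_dir_entry_2 layout; the generator name is hypothetical, and it deliberately ignores special cases not covered in this section (such as any alternate rec_len encoding for very large blocks).

```python
import struct


def iter_dir_entries(block):
    """Yield (inode, file_type, name_bytes) for each in-use entry in one
    directory data block laid out as struct ext4_dir_entry_2."""
    offset = 0
    while offset + 8 <= len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, offset)
        # rec_len must cover the 8-byte fixed part plus the name, be a
        # multiple of 4, and stay inside the block.
        if rec_len < 8 + name_len or rec_len % 4 or offset + rec_len > len(block):
            raise ValueError("corrupt rec_len at offset %d" % offset)
        if inode != 0:                      # inode 0 marks an unused entry
            name = bytes(block[offset + 8:offset + 8 + name_len])
            yield inode, file_type, name
        offset += rec_len                   # last entry runs to end of block
```

Because a checksum tail or an htree node presents itself as an entry with inode 0, this loop skips over them without special handling.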
To support directories that are both encrypted and casefolded, we must also include hash information in the directory entry. We append ext4_extended_dir_entry_2 to ext4_dir_entry_2 except for the entries for dot and dotdot, which are kept the same. The structure follows immediately after name and is included in the size listed by rec_len. If a directory entry uses this extension, it may be up to 271 bytes.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | hash | The hash of the directory name |
0x4 | __le32 | minor_hash | The minor hash of the directory name |
In order to add checksums to these classic directory blocks, a phony struct ext4_dir_entry is placed at the end of each leaf block to hold the checksum. The directory entry is 12 bytes long. The inode number and name_len fields are set to zero to fool old software into ignoring an apparently empty directory entry, and the checksum is stored in the place where the name normally goes. The structure is struct ext4_dir_entry_tail:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | det_reserved_zero1 | Inode number, which must be zero. |
0x4 | __le16 | det_rec_len | Length of this directory entry, which must be 12. |
0x6 | __u8 | det_reserved_zero2 | Length of the file name, which must be zero. |
0x7 | __u8 | det_reserved_ft | File type, which must be 0xDE. |
0x8 | __le32 | det_checksum | Directory leaf block checksum. |
The leaf directory block checksum is calculated against the FS UUID, the directory’s inode number, the directory’s inode generation number, and the entire directory entry block up to (but not including) the fake directory entry.
4.3.2 Hash Tree Directories
A linear array of directory entries isn’t great for performance, so a new feature was added to ext3 to provide a faster (but peculiar) balanced tree keyed off a hash of the directory entry name. If the EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a hashed btree (htree) to organize and find directory entries. For backwards read-only compatibility with ext2, this tree is actually hidden inside the directory file, masquerading as “empty” directory data blocks! It was stated previously that the end of the linear directory entry table was signified with an entry pointing to inode 0; this is (ab)used to fool the old linear-scan algorithm into thinking that the rest of the directory block is empty so that it moves on.
The root of the tree always lives in the first data block of the directory. By ext2 custom, the ‘.’ and ‘..’ entries must appear at the beginning of this first block, so they are put here as two struct ext4_dir_entry_2 entries and not stored in the tree. The rest of the root node contains metadata about the tree and finally a hash->block map to find nodes that are lower in the htree. If dx_root.info.indirect_levels is non-zero then the htree has two levels; the data block pointed to by the root node’s map is an interior node, which is indexed by a minor hash. Interior nodes in this tree contain a zeroed out struct ext4_dir_entry_2 followed by a minor_hash->block map to find leaf nodes. Leaf nodes contain a linear array of all struct ext4_dir_entry_2; all of these entries (presumably) hash to the same value. If there is an overflow, the entries simply overflow into the next leaf node, and the least-significant bit of the hash (in the interior node map) that gets us to this next leaf node is set.
To traverse the directory as a htree, the code calculates the hash of the desired file name and uses it to find the corresponding block number. If the tree is flat, the block is a linear array of directory entries that can be searched; otherwise, the minor hash of the file name is computed and used against this second block to find the corresponding third block number. That third block number will be a linear array of directory entries.
To traverse the directory as a linear array (such as the old code does), the code simply reads every data block in the directory. The blocks used for the htree will appear to have no entries (aside from ‘.’ and ‘..’) and so only the leaf nodes will appear to have any interesting content.
The root of the htree is in struct dx_root, which is the full length of a data block:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __le32 | dot.inode | inode number of this directory. |
0x4 | __le16 | dot.rec_len | Length of this record, 12. |
0x6 | u8 | dot.name_len | Length of the name, 1. |
0x7 | u8 | dot.file_type | File type of this entry, 0x2 (directory) (if the feature flag is set). |
0x8 | char | dot.name[4] | “.\0\0\0” |
0xC | __le32 | dotdot.inode | inode number of parent directory. |
0x10 | __le16 | dotdot.rec_len | block_size - 12. The record length is long enough to cover all htree data. |
0x12 | u8 | dotdot.name_len | Length of the name, 2. |
0x13 | u8 | dotdot.file_type | File type of this entry, 0x2 (directory) (if the feature flag is set). |
0x14 | char | dotdot_name[4] | “..\0\0” |
0x18 | __le32 | struct dx_root_info.reserved_zero | Zero. |
0x1C | u8 | struct dx_root_info.hash_version | Hash type, see dirhash table below. |
0x1D | u8 | struct dx_root_info.info_length | Length of the tree information, 0x8. |
0x1E | u8 | struct dx_root_info.indirect_levels | Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR feature is set; cannot be larger than 2 otherwise. |
0x1F | u8 | struct dx_root_info.unused_flags | |
0x20 | __le16 | limit | Maximum number of dx_entries that can follow this header, plus 1 for the header itself. |
0x22 | __le16 | count | Actual number of dx_entries that follow this header, plus 1 for the header itself. |
0x24 | __le32 | block | The block number (within the directory file) that goes with hash=0. |
0x28 | struct dx_entry | entries[0] | As many 8-byte struct dx_entry as fits in the rest of the data block. |
The directory hash is one of the following values:
Value | Description |
---|---|
0x0 | Legacy. |
0x1 | Half MD4. |
0x2 | Tea. |
0x3 | Legacy, unsigned. |
0x4 | Half MD4, unsigned. |
0x5 | Tea, unsigned. |
0x6 | Siphash. |
Interior nodes of an htree are recorded as struct dx_node, which is also the full length of a data block:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __le32 | fake.inode | Zero, to make it look like this entry is not in use. |
0x4 | __le16 | fake.rec_len | The size of the block, in order to hide all of the dx_node data. |
0x6 | u8 | name_len | Zero. There is no name for this “unused” directory entry. |
0x7 | u8 | file_type | Zero. There is no file type for this “unused” directory entry. |
0x8 | __le16 | limit | Maximum number of dx_entries that can follow this header, plus 1 for the header itself. |
0xA | __le16 | count | Actual number of dx_entries that follow this header, plus 1 for the header itself. |
0xC | __le32 | block | The block number (within the directory file) that goes with the lowest hash value of this block. This value is stored in the parent block. |
0x10 | struct dx_entry | entries[0] | As many 8-byte struct dx_entry as fits in the rest of the data block. |
The hash maps that exist in both struct dx_root and struct dx_node are recorded as struct dx_entry, which is 8 bytes long:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __le32 | hash | Hash code. |
0x4 | __le32 | block | Block number (within the directory file, not filesystem blocks) of the next node in the htree. |
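The hash->block lookup used when traversing the htree can be sketched as a search for the last map entry whose hash does not exceed the target. The code below is an illustration rather than the kernel's algorithm: the function names are hypothetical, it assumes the limit/count/block fields occupy the first 8-byte slot of the map with later entries following at 8-byte strides, it assumes entries are sorted by hash, and it ignores the overflow (least-significant bit) convention described earlier.

```python
import bisect
import struct

DX_ROOT_HEADER = 0x20   # offset of `limit` in struct dx_root (table above)


def parse_dx_entries(block, header_offset):
    """Return the map as a list of (hash, block); slot 0 carries the block
    for the lowest hash, reported here with hash 0."""
    limit, count = struct.unpack_from("<HH", block, header_offset)
    if count < 1 or count > limit:
        raise ValueError("implausible dx count/limit")
    (first_block,) = struct.unpack_from("<I", block, header_offset + 4)
    entries = [(0, first_block)]
    for i in range(1, count):
        # Assumed layout: entry i starts 8 * i bytes after `limit`.
        entries.append(struct.unpack_from("<II", block, header_offset + 8 * i))
    return entries


def dx_lookup(entries, target_hash):
    """Pick the directory-file block whose hash range covers target_hash."""
    hashes = [h for h, _ in entries[1:]]
    # bisect_right counts the later entries with hash <= target_hash, which
    # is exactly the index of the last covering entry in `entries`.
    return entries[bisect.bisect_right(hashes, target_hash)][1]
```

The returned number is a block index within the directory file, so it still has to be mapped to a filesystem block through the directory inode's extent tree or block map.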
(If you think this is all quite clever and peculiar, so does the author.)
If metadata checksums are enabled, the last 8 bytes of the directory block (precisely the length of one dx_entry) are used to store a struct dx_tail, which contains the checksum. The limit and count entries in the dx_root/dx_node structures are adjusted as necessary to fit the dx_tail into the block. If there is no space for the dx_tail, the user is notified to run e2fsck -D to rebuild the directory index (which will ensure that there’s space for the checksum). The dx_tail structure is 8 bytes long and looks like this:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | u32 | dt_reserved | Zero. |
0x4 | __le32 | dt_checksum | Checksum of the htree directory block. |
The checksum is calculated against the FS UUID, the htree index header (dx_root or dx_node), all of the htree indices (dx_entry) that are in use, and the tail block (dx_tail).
4.4 Extended Attributes
Extended attributes (xattrs) are typically stored in a separate data block on the disk and referenced from inodes via inode.i_file_acl*. The first use of extended attributes seems to have been for storing file ACLs and other security data (selinux). With the user_xattr mount option it is possible for users to store extended attributes so long as all attribute names begin with “user”; this restriction seems to have disappeared as of Linux 3.0.
There are two places where extended attributes can be found. The first place is between the end of each inode entry and the beginning of the next inode entry. For example, if inode.i_extra_isize = 28 and sb.s_inode_size = 256, then there are 256 - (128 + 28) = 100 bytes available for in-inode extended attribute storage. The second place where extended attributes can be found is in the block pointed to by inode.i_file_acl. As of Linux 3.11, it is not possible for this block to contain a pointer to a second extended attribute block (or even the remaining blocks of a cluster). In theory it is possible for each attribute’s value to be stored in a separate data block, though as of Linux 3.11 the code does not permit this.
Keys are generally assumed to be ASCIIZ strings, whereas values can be strings or binary data.
Extended attributes, when stored after the inode, have a header ext4_xattr_ibody_header that is 4 bytes long:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __le32 | h_magic | Magic number for identification, 0xEA020000. This value is set by the Linux driver, though e2fsprogs doesn’t seem to check it(?) |
The beginning of an extended attribute block is in struct ext4_xattr_header, which is 32 bytes long:
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __le32 | h_magic | Magic number for identification, 0xEA020000. |
0x4 | __le32 | h_refcount | Reference count. |
0x8 | __le32 | h_blocks | Number of disk blocks used. |
0xC | __le32 | h_hash | Hash value of all attributes. |
0x10 | __le32 | h_checksum | Checksum of the extended attribute block. |
0x14 | __u32 | h_reserved[3] | Zero. |
The checksum is calculated against the FS UUID, the 64-bit block number of the extended attribute block, and the entire block (header + entries).
Following the struct ext4_xattr_header or struct ext4_xattr_ibody_header is an array of struct ext4_xattr_entry; each of these entries is at least 16 bytes long. When stored in an external block, the struct ext4_xattr_entry entries must be stored in sorted order. The sort order is e_name_index, then e_name_len, and finally e_name. Attributes stored inside an inode do not need to be stored in sorted order.
Offset | Type | Name | Description |
---|---|---|---|
0x0 | __u8 | e_name_len | Length of name. |
0x1 | __u8 | e_name_index | Attribute name index. There is a discussion of this below. |
0x2 | __le16 | e_value_offs | Location of this attribute’s value on the disk block where it is stored. Multiple attributes can share the same value. For an inode attribute this value is relative to the start of the first entry; for a block this value is relative to the start of the block (i.e. the header). |
0x4 | __le32 | e_value_inum | The inode where the value is stored. Zero indicates the value is in the same block as this entry. This field is only used if the INCOMPAT_EA_INODE feature is enabled. |
0x8 | __le32 | e_value_size | Length of attribute value. |
0xC | __le32 | e_hash | Hash value of attribute name and attribute value. The kernel doesn’t update the hash for in-inode attributes, so for that case this value must be zero, because e2fsck validates any non-zero hash regardless of where the xattr lives. |
0x10 | char | e_name[e_name_len] | Attribute name. Does not include trailing NULL. |
Attribute values can follow the end of the entry table. There appears to be a requirement that they be aligned to 4-byte boundaries. The values are stored starting at the end of the block and grow towards the xattr_header/xattr_entry table. When the two collide, the overflow is put into a separate disk block. If the disk block fills up, the filesystem returns -ENOSPC.
The first four fields of the ext4_xattr_entry are set to zero to mark the end of the key list.
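To make the entry layout and the end-of-list marker concrete, here is a rough Python sketch that walks the entry table of an external attribute block. It rests on several assumptions that go beyond the text: entries are taken to be padded to a 4-byte boundary (mirroring the alignment noted for values), the header is not validated, and values stored in a separate inode (non-zero e_value_inum) are not followed. The name `iter_block_entries` and the sample block are hypothetical.

```python
import struct

# e_name_len, e_name_index, e_value_offs, e_value_inum, e_value_size, e_hash
ENTRY_FORMAT = "<BBHIII"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 16 bytes before e_name
HEADER_SIZE = 32


def iter_block_entries(block: bytes):
    """Yield (name_index, name, value) for each entry in an xattr block."""
    pos = HEADER_SIZE
    while pos + ENTRY_SIZE <= len(block):
        (name_len, name_index, value_offs,
         value_inum, value_size, _hash) = struct.unpack_from(ENTRY_FORMAT, block, pos)
        # The first four fields being zero marks the end of the key list.
        if not (name_len or name_index or value_offs or value_inum):
            return
        name = block[pos + ENTRY_SIZE:pos + ENTRY_SIZE + name_len]
        value = None
        if value_inum == 0:
            # For a block, e_value_offs is relative to the start of the block.
            value = block[value_offs:value_offs + value_size]
        yield name_index, name, value
        # Assumption: the next entry starts at the next 4-byte boundary.
        pos += (ENTRY_SIZE + name_len + 3) & ~3


# Synthetic 1 KiB block holding a single "user.fubar" attribute.
buf = bytearray(1024)
value = b"hello"
value_offs = len(buf) - 8  # values are placed at the end of the block
buf[value_offs:value_offs + len(value)] = value
struct.pack_into(ENTRY_FORMAT, buf, HEADER_SIZE, 5, 1, value_offs, 0, len(value), 0)
buf[HEADER_SIZE + ENTRY_SIZE:HEADER_SIZE + ENTRY_SIZE + 5] = b"fubar"
found = list(iter_block_entries(bytes(buf)))
```

Because the rest of the synthetic block is zero-filled, the walk stops at the first all-zero entry after "fubar".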
4.4.1 Attribute Name Indices
Logically speaking, extended attributes are a series of key=value pairs. The keys are assumed to be NULL-terminated strings. To reduce the amount of on-disk space that the keys consume, the beginning of the key string is matched against the attribute name index. If a match is found, the attribute name index field is set, and the matching string is removed from the key name. Here is a map of name index values to key prefixes:
Name Index | Key Prefix |
---|---|
0 | (no prefix) |
1 | “user.” |
2 | “system.posix_acl_access” |
3 | “system.posix_acl_default” |
4 | “trusted.” |
6 | “security.” |
7 | “system.” (inline_data only?) |
8 | “system.richacl” (SuSE kernels only?) |
For example, if the attribute key is “user.fubar”, the attribute name index is set to 1 and the “fubar” name is recorded on disk.
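The prefix mapping can be sketched as a small lookup table. The helper names `split_key` and `join_key` are hypothetical, and choosing the longest matching prefix is an assumption made here to resolve keys such as "system.posix_acl_access", which also begin with "system."; the text does not spell out the matching rule.

```python
NAME_INDEX_PREFIXES = {
    1: "user.",
    2: "system.posix_acl_access",
    3: "system.posix_acl_default",
    4: "trusted.",
    6: "security.",
    7: "system.",
    8: "system.richacl",
}


def split_key(key: str):
    """Return (name_index, stored_name) for a full attribute key."""
    best_index, best_prefix = 0, ""  # index 0 means no prefix
    for index, prefix in NAME_INDEX_PREFIXES.items():
        # Assumption: prefer the longest prefix that matches.
        if key.startswith(prefix) and len(prefix) > len(best_prefix):
            best_index, best_prefix = index, prefix
    return best_index, key[len(best_prefix):]


def join_key(name_index: int, stored_name: str) -> str:
    """Rebuild the full key from the on-disk index and name."""
    return NAME_INDEX_PREFIXES.get(name_index, "") + stored_name


index, stored = split_key("user.fubar")
```

For "user.fubar" this yields index 1 and the stored name "fubar", matching the example in the text.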
4.4.2 POSIX ACLs
POSIX ACLs are stored in a reduced version of the Linux kernel's (and libacl’s) internal ACL format. The key difference is that the version number is different (1) and the e_id field is only stored for named user and group ACLs.
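The following Python sketch decodes an ACL value under assumptions that are not stated in the text above: a 4-byte little-endian version header, followed by entries made of a 16-bit tag and a 16-bit permission mask, with a 32-bit e_id appended only for named user and group entries. The tag constants are likewise assumed to follow the usual kernel POSIX ACL values. The function name `decode_acl` and the sample value are hypothetical.

```python
import struct

ACL_VERSION = 1
# Assumed tag values (not given in this document).
ACL_USER_OBJ, ACL_USER, ACL_GROUP_OBJ = 0x01, 0x02, 0x04
ACL_GROUP, ACL_MASK, ACL_OTHER = 0x08, 0x10, 0x20


def decode_acl(value: bytes):
    """Return a list of (tag, perm, e_id) tuples; e_id is None when absent."""
    (version,) = struct.unpack_from("<I", value, 0)
    if version != ACL_VERSION:
        raise ValueError(f"unexpected ACL version {version}")
    pos = 4
    entries = []
    while pos < len(value):
        tag, perm = struct.unpack_from("<HH", value, pos)
        pos += 4
        e_id = None
        if tag in (ACL_USER, ACL_GROUP):
            # Only named user and group entries carry an e_id.
            (e_id,) = struct.unpack_from("<I", value, pos)
            pos += 4
        entries.append((tag, perm, e_id))
    return entries


# Synthetic ACL: owner rw-, named user 1000 r--, group r--, mask r--, other ---.
sample = (
    struct.pack("<I", ACL_VERSION)
    + struct.pack("<HH", ACL_USER_OBJ, 6)
    + struct.pack("<HHI", ACL_USER, 4, 1000)
    + struct.pack("<HH", ACL_GROUP_OBJ, 4)
    + struct.pack("<HH", ACL_MASK, 4)
    + struct.pack("<HH", ACL_OTHER, 0)
)
acl = decode_acl(sample)
```

Omitting e_id for the owner, owning group, mask, and other entries is what makes the on-disk form smaller than the in-memory one.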