NFS

A “principal” is a client/user or server/service instance that participates in network communication and has a unique name.

1 NFSv4 client identifier

This document explains how the NFSv4 protocol identifies client instances in order to maintain file open and lock state during system restarts. A special identifier and principal are maintained on each client. These can be set by administrators, scripts provided by site administrators, or tools provided by Linux distributors.

There are risks if a client’s NFSv4 identifier and its principal are not chosen carefully.

1.1 Introduction

The NFSv4 protocol uses “lease-based file locking”. Leases help NFSv4 servers provide file lock guarantees and manage their resources.

Simply put, an NFSv4 server creates a lease for each NFSv4 client. The server collects each client’s file open and lock state under the lease for that client.

The client is responsible for periodically renewing its leases. While a lease remains valid, the server holding that lease guarantees the file locks the client has created remain in place.

If a client stops renewing its lease (for example, if it crashes), the NFSv4 protocol allows the server to remove the client’s open and lock state after a certain period of time. When a client restarts, it indicates to servers that open and lock state associated with its previous leases is no longer valid and can be destroyed immediately.

In addition, each NFSv4 server manages a persistent list of client leases. When the server restarts and clients attempt to recover their state, the server uses this list to distinguish amongst clients that held state before the server restarted and clients sending fresh OPEN and LOCK requests. This enables file locks to persist safely across server restarts.

1.2 NFSv4 client identifiers

Each NFSv4 client presents an identifier to NFSv4 servers so that they can associate the client with its lease. Each client’s identifier consists of two elements:

  • co_ownerid: An arbitrary but fixed string.
  • boot verifier: A 64-bit incarnation verifier that enables a server to distinguish successive boot epochs of the same client.

The NFSv4.0 specification refers to these two items as an “nfs_client_id4”. The NFSv4.1 specification refers to these two items as a “client_owner4”.

NFSv4 servers tie this identifier to the principal and security flavor that the client used when presenting it. Servers use this principal to authorize subsequent lease modification operations sent by the client. Effectively this principal is a third element of the identifier.

As part of the identity presented to servers, a good “co_ownerid” string has several important properties:

  • The “co_ownerid” string identifies the client during reboot recovery, therefore the string is persistent across client reboots.
  • The “co_ownerid” string helps servers distinguish the client from others, therefore the string is globally unique. Note that there is no central authority that assigns “co_ownerid” strings.
  • Because it often appears on the network in the clear, the “co_ownerid” string does not reveal private information about the client itself.
  • The content of the “co_ownerid” string is set and unchanging before the client attempts NFSv4 mounts after a restart.
  • The NFSv4 protocol places a 1024-byte limit on the size of the “co_ownerid” string.

1.3 Protecting NFSv4 lease state

NFSv4 servers utilize the “client_owner4” as described above to assign a unique lease to each client. Under this scheme, there are circumstances where clients can interfere with each other. This is referred to as “lease stealing”.

If distinct clients present the same “co_ownerid” string and use the same principal (for example, AUTH_SYS and UID 0), a server is unable to tell that the clients are not the same. Each distinct client presents a different boot verifier, so it appears to the server as if there is one client that is rebooting frequently. Neither client can maintain open or lock state in this scenario.

If distinct clients present the same “co_ownerid” string and use distinct principals, the server is likely to allow the first client to operate normally but reject subsequent clients with the same “co_ownerid” string.

If a client’s “co_ownerid” string or principal are not stable, state recovery after a server or client reboot is not guaranteed. If a client unexpectedly restarts but presents a different “co_ownerid” string or principal to the server, the server orphans the client’s previous open and lock state. This blocks access to locked files until the server removes the orphaned state.

If the server restarts and a client presents a changed “co_ownerid” string or principal to the server, the server will not allow the client to reclaim its open and lock state, and may give those locks to other clients in the meantime. This is referred to as “lock stealing”.

Lease stealing and lock stealing increase the potential for denial of service and in rare cases even data corruption.

1.4 Selecting an appropriate client identifier

By default, the Linux NFSv4 client implementation constructs its “co_ownerid” string starting with the words “Linux NFS” followed by the client’s UTS node name (the same node name, incidentally, that is used as the “machine name” in an AUTH_SYS credential). In small deployments, this construction is usually adequate. Often, however, the node name by itself is not adequately unique, and can change unexpectedly. Problematic situations include:

  • NFS-root (diskless) clients, where the local DHCP server (or equivalent) does not provide a unique host name.
  • “Containers” within a single Linux host. If each container has a separate network namespace, but does not use the UTS namespace to provide a unique host name, then there can be multiple NFS client instances with the same host name.
  • Clients across multiple administrative domains that access a common NFS server. If hostnames are not assigned centrally then uniqueness cannot be guaranteed unless a domain name is included in the hostname.

Linux provides two mechanisms to add uniqueness to its “co_ownerid” string:

  • nfs.nfs4_unique_id

This module parameter can set an arbitrary uniquifier string via the kernel command line, or when the “nfs” module is loaded.

  • /sys/fs/nfs/client/net/identifier

This virtual file, available since Linux 5.3, is local to the network namespace in which it is accessed and so can provide distinction between network namespaces (containers) when the hostname remains uniform.

Note that this file is empty on name-space creation. If the container system has access to some sort of per-container identity then that uniquifier can be used. For example, a uniquifier might be formed at boot using the container’s internal identifier:

    sha256sum /etc/machine-id | awk '{print $1}' \
        > /sys/fs/nfs/client/net/identifier

1.5 Security considerations

The use of cryptographic security for lease management operations is strongly encouraged.

If NFS with Kerberos is not configured, a Linux NFSv4 client uses AUTH_SYS and UID 0 as the principal part of its client identity. This configuration is not only insecure, it increases the risk of lease and lock stealing. However, it might be the only choice for client configurations that have no local persistent storage. “co_ownerid” string uniqueness and persistence is critical in this case.

When a Kerberos keytab is present on a Linux NFS client, the client attempts to use one of the principals in that keytab when identifying itself to servers. The “sec=” mount option does not control this behavior. Alternately, a single-user client with a Kerberos principal can use that principal in place of the client’s host principal.

Using Kerberos for this purpose enables the client and server to use the same lease for operations covered by all “sec=” settings. Additionally, the Linux NFS client uses the RPCSEC_GSS security flavor with Kerberos and the integrity QOS to prevent in-transit modification of lease modification requests.

1.6 Additional notes

The Linux NFSv4 client establishes a single lease on each NFSv4 server it accesses. NFSv4 mounts from a Linux NFSv4 client of a particular server then share that lease.

Once a client establishes open and lock state, the NFSv4 protocol enables lease state to transition to other servers, following data that has been migrated. This hides data migration completely from running applications. The Linux NFSv4 client facilitates state migration by presenting the same “client_owner4” to all servers it encounters.

2 See Also

  • nfs(5)
  • kerberos(7)
  • RFC 7530 for the NFSv4.0 specification
  • RFC 8881 for the NFSv4.1 specification.

3 Making Filesystems Exportable

3.1 Overview

All filesystem operations require a dentry (or two) as a starting point. Local applications have a reference-counted hold on suitable dentries via open file descriptors or cwd/root. However remote applications that access a filesystem via a remote filesystem protocol such as NFS may not be able to hold such a reference, and so need a different way to refer to a particular dentry. As the alternative form of reference needs to be stable across renames, truncates, and server-reboot (among other things, though these tend to be the most problematic), there is no simple answer like ‘filename’.

The mechanism discussed here allows each filesystem implementation to specify how to generate an opaque (outside of the filesystem) byte string for any dentry, and how to find an appropriate dentry for any given opaque byte string. This byte string will be called a “filehandle fragment” as it corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments and dentries will be termed “exportable”.

3.2 Dcache Issues

The dcache normally contains a proper prefix of any given filesystem tree. This means that if any filesystem object is in the dcache, then all of the ancestors of that filesystem object are also in the dcache. As normal access is by filename this prefix is created naturally and maintained easily (by each object maintaining a reference count on its parent).

However when objects are included into the dcache by interpreting a filehandle fragment, there is no automatic creation of a path prefix for the object. This leads to two related but distinct features of the dcache that are not needed for normal filesystem access.

  1. The dcache must sometimes contain objects that are not part of the proper prefix, i.e. that are not connected to the root.
  2. The dcache must be prepared for a newly found (via ->lookup) directory to already have a (non-connected) dentry, and must be able to move that dentry into place (based on the parent and name in the ->lookup). This is particularly needed for directories as it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

  1. A dentry flag DCACHE_DISCONNECTED which is set on any dentry that might not be part of the proper prefix. This is set when anonymous dentries are created, and cleared when a dentry is noticed to be a child of a dentry which is in the proper prefix. If the refcount on a dentry with this flag set becomes zero, the dentry is immediately discarded, rather than being kept in the dcache. If a dentry that is not already in the dcache is repeatedly accessed by filehandle (as NFSD might do), a new dentry will be allocated for each access, and discarded at the end of the access.

Note that such a dentry can acquire children, name, ancestors, etc. without losing DCACHE_DISCONNECTED - that flag is only cleared when subtree is successfully reconnected to root. Until then dentries in such subtree are retained only as long as there are references; refcount reaching zero means immediate eviction, same as for unhashed dentries. That guarantees that we won’t need to hunt them down upon umount.

  2. A primitive for creation of secondary roots - d_obtain_root(inode). Those do not bear DCACHE_DISCONNECTED. They are placed on the per-superblock list (->s_roots), so they can be located at umount time for eviction purposes.
  3. Helper routines to allocate anonymous dentries, and to help attach loose directory dentries at lookup time. They are:
  • d_obtain_alias(inode) will return a dentry for the given inode.

If the inode already has a dentry, one of those is returned. If it doesn't, a new anonymous (IS_ROOT and DCACHE_DISCONNECTED) dentry is allocated and attached. In the case of a directory, care is taken that only one dentry can ever be attached.

  • d_splice_alias(inode, dentry) will introduce a new dentry into the tree;

either the passed-in dentry or a preexisting alias for the given inode (such as an anonymous one created by d_obtain_alias), if appropriate. It returns NULL when the passed-in dentry is used, following the calling convention of ->lookup.

3.3 Filesystem Issues

For a filesystem to be exportable it must:

  1. provide the filehandle fragment routines described below.

  2. make sure that d_splice_alias is used rather than d_add when ->lookup finds an inode for a given parent and name.

If inode is NULL, d_splice_alias(inode, dentry) is equivalent to:

        d_add(dentry, inode), NULL

Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)

Typically the ->lookup routine will simply end with a:

        return d_splice_alias(inode, dentry);
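
For illustration, a minimal ->lookup along these lines might look like the following sketch. The helper example_inode_by_name() is hypothetical and stands in for whatever directory search the filesystem already performs; only the ->lookup signature and d_splice_alias() come from the VFS:

    #include <linux/dcache.h>
    #include <linux/fs.h>

    /* Hypothetical filesystem-specific directory search: returns the
     * child's inode, NULL for a negative entry, or an ERR_PTR(). */
    static struct inode *example_inode_by_name(struct inode *dir,
                                               const struct qstr *name);

    static struct dentry *examplefs_lookup(struct inode *dir,
                                           struct dentry *dentry,
                                           unsigned int flags)
    {
            struct inode *inode = example_inode_by_name(dir, &dentry->d_name);

            /* d_splice_alias() copes with NULL and ERR_PTR() inodes and,
             * for directories, reuses any disconnected alias the inode
             * already has instead of adding a second dentry. */
            return d_splice_alias(inode, dentry);
    }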

A file system implementation declares that instances of the filesystem are exportable by setting the s_export_op field in the struct super_block. This field must point to a “struct export_operations” struct which has the following members:

  • encode_fh (optional)

Takes a dentry and creates a filehandle fragment which can later be used to find or create a dentry for the same object. The default implementation creates a filehandle fragment that encodes a 32bit inode and generation number for the inode encoded, and if necessary the same information for the parent.

  • fh_to_dentry (mandatory)

Given a filehandle fragment, this should find the implied object and create a dentry for it (possibly with d_obtain_alias).

  • fh_to_parent (optional but strongly recommended)

Given a filehandle fragment, this should find the parent of the implied object and create a dentry for it (possibly with d_obtain_alias). May fail if the filehandle fragment is too small.

  • get_parent (optional but strongly recommended)

When given a dentry for a directory, this should return a dentry for the parent. Quite possibly the parent dentry will have been allocated by d_alloc_anon. The default get_parent function just returns an error so any filehandle lookup that requires finding a parent will fail. ->lookup("..") is not used as a default as it can leave ".." entries in the dcache which are too messy to work with.

  • get_name (optional)

When given a parent dentry and a child dentry, this should find a name in the directory identified by the parent dentry, which leads to the object identified by the child dentry. If no get_name function is supplied, a default implementation is provided which uses vfs_readdir to find potential names, and matches inode numbers to find the correct match.

  • flags

Some filesystems may need to be handled differently than others. The export_operations struct also includes a flags field that allows the filesystem to communicate such information to nfsd. See the Export Operations Flags section below for more explanation.

A filehandle fragment consists of an array of 1 or more 4-byte words, together with a one-byte “type”. The decode_fh routine should not depend on the stated size that is passed to it. This size may be larger than the original filehandle generated by encode_fh, in which case it will have been padded with nuls. Rather, the encode_fh routine should choose a “type” which indicates to decode_fh how much of the filehandle is valid, and how it should be interpreted.

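To make the registration step concrete, the sketch below shows a filesystem wiring a minimal export_operations table into its superblock. The “examplefs” names are hypothetical; generic_fh_to_dentry() is the stock helper for the default 32-bit inode-plus-generation encoding (type FILEID_INO32_GEN), and examplefs_iget() stands in for the filesystem's own inode lookup:

    #include <linux/exportfs.h>
    #include <linux/fs.h>

    /* Hypothetical: turn an (inode number, generation) pair into an inode. */
    static struct inode *examplefs_iget(struct super_block *sb,
                                        u64 ino, u32 gen);

    static struct dentry *examplefs_fh_to_dentry(struct super_block *sb,
                                                 struct fid *fid,
                                                 int fh_len, int fh_type)
    {
            /* Decode the default inode/generation layout and return a
             * dentry, allocating a disconnected one if necessary. */
            return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
                                        examplefs_iget);
    }

    static const struct export_operations examplefs_export_ops = {
            .fh_to_dentry = examplefs_fh_to_dentry,
            /* .encode_fh, .fh_to_parent, .get_parent and .get_name may be
             * added as described above; omitting encode_fh selects the
             * default 32-bit inode + generation encoding. */
    };

    static int examplefs_fill_super(struct super_block *sb, void *data,
                                    int silent)
    {
            /* ... usual superblock setup ... */
            sb->s_export_op = &examplefs_export_ops;
            return 0;
    }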

3.4 Export Operations Flags

In addition to the operation vector pointers, struct export_operations also contains a “flags” field that allows the filesystem to communicate to nfsd that it may want to do things differently when dealing with it. The following flags are defined:

  • EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem

RFC 1813 recommends that servers always send weak cache consistency (WCC) data to the client after each operation. The server should atomically collect attributes about the inode, do an operation on it, and then collect the attributes afterward. This allows the client to skip issuing GETATTRs in some situations but means that the server is calling vfs_getattr for almost all RPCs. On some filesystems (particularly those that are clustered or networked) this is expensive and atomicity is difficult to guarantee. This flag indicates to nfsd that it should skip providing WCC attributes to the client in NFSv3 replies when doing operations on this filesystem. Consider enabling this on filesystems that have an expensive ->getattr inode operation, or when atomicity between pre and post operation attribute collection is impossible to guarantee.

  • EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs

Many NFS operations deal with filehandles, which the server must then vet to ensure that they live inside of an exported tree. When the export consists of an entire filesystem, this is trivial. nfsd can just ensure that the filehandle live on the filesystem. When only part of a filesystem is exported however, then nfsd must walk the ancestors of the inode to ensure that it’s within an exported subtree. This is an expensive operation and not all filesystems can support it properly. This flag exempts the filesystem from subtree checking and causes exportfs to get back an error if it tries to enable subtree checking on it.

  • EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking

On some exportable filesystems (such as NFS) unlinking a file that is still open can cause a fair bit of extra work. For instance, the NFS client will do a “sillyrename” to ensure that the file sticks around while it’s still open. When reexporting, that open file is held by nfsd so we usually end up doing a sillyrename, and then immediately deleting the sillyrenamed file just afterward when the link count actually goes to zero. Sometimes this delete can race with other operations (for instance an rmdir of the parent directory). This flag causes nfsd to close any open files for this inode before calling into the vfs to do an unlink or a rename that would replace an existing file.

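A filesystem that needs one of these behaviors simply sets the corresponding bits in the flags field of its export_operations table, for example (a sketch, reusing the hypothetical examplefs table from the previous section):

    static const struct export_operations examplefs_export_ops = {
            .fh_to_dentry = examplefs_fh_to_dentry,
            /* Hypothetical clustered filesystem: skip subtree checking
             * and suppress NFSv3 WCC attributes. */
            .flags        = EXPORT_OP_NOSUBTREECHK | EXPORT_OP_NOWCC,
    };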

4 Reference counting in pnfs

There are several inter-related caches. We have layouts which can reference multiple devices, each of which can reference multiple data servers. Each data server can be referenced by multiple devices. Each device can be referenced by multiple layouts. To keep all of this straight, we need to reference count.

4.1 struct pnfs_layout_hdr

The on-the-wire command LAYOUTGET corresponds to struct pnfs_layout_segment, usually referred to by the variable name lseg. Each nfs_inode may hold a pointer to a cache of these layout segments in nfsi->layout, of type struct pnfs_layout_hdr.

We reference the header for the inode pointing to it, across each outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, LAYOUTCOMMIT), and for each lseg held within.

Each header is also (when non-empty) put on a list associated with struct nfs_client (cl_layouts). Being put on this list does not bump the reference count, as the layout is kept around by the lseg that keeps it in the list.

4.2 deviceid_cache

lsegs reference device ids, which are resolved per nfs_client and layout driver type. The device ids are held in a RCU cache (struct nfs4_deviceid_cache). The cache itself is referenced across each mount. The entries (struct nfs4_deviceid) themselves are held across the lifetime of each lseg referencing them.

RCU is used because the deviceid is basically a write once, read many data structure. The hlist size of 32 buckets needs better justification, but seems reasonable given that we can have multiple deviceid’s per filesystem, and multiple filesystems per nfs_client.

The hash code is copied from the nfsd code base. A discussion of hashing and variations of this algorithm can be found here.

4.3 data server cache

file driver devices refer to data servers, which are kept in a module level cache. Its reference is held over the lifetime of the deviceid pointing to it.

4.4 lseg

lseg maintains an extra reference corresponding to the NFS_LSEG_VALID bit which holds it in the pnfs_layout_hdr’s list. When the final lseg is removed from the pnfs_layout_hdr’s list, the NFS_LAYOUT_DESTROYED bit is set, preventing any new lsegs from being added.

4.5 layout drivers

PNFS utilizes what is called layout drivers. The STD defines 4 basic layout types: “files”, “objects”, “blocks”, and “flexfiles”. For each of these types there is a layout-driver with a common function-vectors table which are called by the nfs-client pnfs-core to implement the different layout types.

Files-layout-driver code is in: fs/nfs/filelayout/.. directory
Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory

4.6 blocks-layout setup

TODO: Document the setup needs of the blocks layout driver

5 RPC Cache

This document gives a brief introduction to the caching mechanisms in the sunrpc layer that is used, in particular, for NFS authentication.

5.1 Caches

The caching replaces the old exports table and allows for a wide variety of values to be cached.

There are a number of caches that are similar in structure though quite possibly very different in content and use. There is a corpus of common code for managing these caches.

Examples of caches that are likely to be needed are:

  • mapping from IP address to client name
  • mapping from client name and filesystem to export options
  • mapping from UID to list of GIDs, to work around NFS’s limitation of 16 gids.
  • mappings between local UID/GID and remote UID/GID for sites that do not have uniform uid assignment
  • mapping from network identity to public key for crypto authentication.

The common code handles such things as:

  • general cache lookup with correct locking
  • supporting ‘NEGATIVE’ as well as positive entries
  • allowing an EXPIRED time on cache items, and removing items after they expire, and are no longer in-use.
  • making requests to user-space to fill in cache entries
  • allowing user-space to directly set entries in the cache
  • delaying RPC requests that depend on as-yet incomplete cache entries, and replaying those requests when the cache entry is complete.
  • clean out old entries as they expire.

5.1.1 Creating a Cache

  • A cache needs a datum to store. This is in the form of a structure definition that must contain a struct cache_head as an element, usually the first. It will also contain a key and some content. Each cache element is reference counted and contains expiry and update times for use in cache management.
  • A cache needs a “cache_detail” structure that describes the cache. This stores the hash table, some parameters for cache management, and some operations detailing how to work with particular cache items.

The operations are:

  • struct cache_head *alloc (void)

This simply allocates appropriate memory and returns a pointer to the cache_head embedded within the structure.

  • void cache_put(struct kref *)

This is called when the last reference to an item is dropped. The pointer passed is to the ‘ref’ field in the cache_head. cache_put should release any references created by ‘cache_init’ and, if CACHE_VALID is set, any references created by cache_update. It should then release the memory allocated by ‘alloc’.

  • int match(struct cache_head *orig, struct cache_head *new)

test if the keys in the two structures match. Return 1 if they do, 0 if they don’t.

  • void init(struct cache_head *orig, struct cache_head *new)

Set the ‘key’ fields in ‘new’ from ‘orig’. This may include taking references to shared objects.

  • void update(struct cache_head *orig, struct cache_head *new)

Set the ‘content’ fields in ‘new’ from ‘orig’.

  • int cache_show(struct seq_file *m, struct cache_detail *cd, struct cache_head *h)

Optional. Used to provide a /proc file that lists the contents of a cache. This should show one item, usually on just one line.

  • int cache_request(struct cache_detail *cd, struct cache_head *h, char **bpp, int *blen)

Format a request to be sent to user-space for an item to be instantiated. *bpp is a buffer of size *blen. bpp should be moved forward over the encoded message, and *blen should be reduced to show how much free space remains. Return 0 on success or <0 if not enough room or other problem.

  • int cache_parse(struct cache_detail *cd, char *buf, int len)

A message from user space has arrived to fill out a cache entry. It is in ‘buf’ of length ‘len’. cache_parse should parse this, find the item in the cache with sunrpc_cache_lookup_rcu, and update the item with sunrpc_cache_update.

  • A cache needs to be registered using cache_register(). This includes it on a list of caches that will be regularly cleaned to discard old data. A skeletal example covering these steps is sketched below.
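
The following sketch pulls those steps together. It is loosely modelled on the ip_map cache that the kernel's svcauth code uses to map IP addresses to client names; all of the “example” names are hypothetical, the content is reduced to a fixed-size string to sidestep reference counting, and details of struct cache_detail vary a little between kernel versions:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/sunrpc/cache.h>

    struct example_map {
            struct cache_head h;       /* must be embedded, usually first */
            char addr[48];             /* key */
            char client[64];           /* content, meaningful once CACHE_VALID */
    };

    static struct cache_head *example_alloc(void)
    {
            struct example_map *em = kzalloc(sizeof(*em), GFP_KERNEL);

            return em ? &em->h : NULL;
    }

    static void example_put(struct kref *kref)
    {
            /* Drop anything init/update took references on, then free. */
            kfree(container_of(kref, struct example_map, h.ref));
    }

    static int example_match(struct cache_head *a, struct cache_head *b)
    {
            struct example_map *ma = container_of(a, struct example_map, h);
            struct example_map *mb = container_of(b, struct example_map, h);

            return strcmp(ma->addr, mb->addr) == 0;
    }

    static void example_init(struct cache_head *cnew, struct cache_head *citem)
    {
            struct example_map *new = container_of(cnew, struct example_map, h);
            struct example_map *item = container_of(citem, struct example_map, h);

            strscpy(new->addr, item->addr, sizeof(new->addr));       /* key */
    }

    static void example_update(struct cache_head *cnew, struct cache_head *citem)
    {
            struct example_map *new = container_of(cnew, struct example_map, h);
            struct example_map *item = container_of(citem, struct example_map, h);

            strscpy(new->client, item->client, sizeof(new->client)); /* content */
    }

    #define EXAMPLE_HASHSIZE 64
    static struct hlist_head example_table[EXAMPLE_HASHSIZE];

    static struct cache_detail example_cache = {
            .owner      = THIS_MODULE,
            .hash_size  = EXAMPLE_HASHSIZE,
            .hash_table = example_table,
            .name       = "example.map",   /* directory name under /proc/net/rpc */
            .cache_put  = example_put,
            .match      = example_match,
            .init       = example_init,
            .update     = example_update,
            .alloc      = example_alloc,
            /* .cache_request and .cache_parse drive the userspace channel
             * described under "Populating a cache" below.  Register with
             * cache_register() (cache_register_net() on current kernels). */
    };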

5.1.2 Using a cache

To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer to the cache_head in a sample item with the ‘key’ fields filled in. This will be passed to ->match to identify the target entry. If no entry is found, a new entry will be created, added to the cache, and marked as not containing valid data.

The item returned is typically passed to cache_check which will check if the data is valid, and may initiate an up-call to get fresh data. cache_check will return -ENOENT if the entry is negative or if an upcall is needed but not possible, -EAGAIN if an upcall is pending, or 0 if the data is valid.

cache_check can be passed a “struct cache_req*”. This structure is typically embedded in the actual request and can be used to create a deferred copy of the request (struct cache_deferred_req). This is done when the found cache item is not uptodate, but there is reason to believe that userspace might provide information soon. When the cache item does become valid, the deferred copy of the request will be revisited (->revisit). It is expected that this method will reschedule the request for processing.

The value returned by sunrpc_cache_lookup_rcu can also be passed to sunrpc_cache_update to set the content for the item. A second item is passed which should hold the content. If the item found by _lookup has valid data, then it is discarded and a new item is created. This saves any user of an item from worrying about content changing while it is being inspected. If the item found by _lookup does not contain valid data, then the content is copied across and CACHE_VALID is set.

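Putting those calls together, a kernel-side lookup path for the hypothetical example_map cache from “Creating a Cache” might look like the sketch below (error handling abbreviated; hash_str() is the sunrpc string hash):

    #include <linux/sunrpc/svcauth.h>      /* hash_str() */

    static struct example_map *example_lookup(char *addr, struct cache_req *rq)
    {
            struct example_map key, *found;
            struct cache_head *ch;
            int hash = hash_str(addr, 6);  /* 6 bits for a 64-bucket table */

            strscpy(key.addr, addr, sizeof(key.addr));

            ch = sunrpc_cache_lookup_rcu(&example_cache, &key.h, hash);
            if (!ch)
                    return NULL;           /* allocation failed */
            found = container_of(ch, struct example_map, h);

            /* cache_check() validates the entry, starts an upcall if needed
             * and may defer the request via rq.  On failure it drops the
             * reference for us; on success the caller must cache_put(). */
            if (cache_check(&example_cache, &found->h, rq) == 0)
                    return found;
            return NULL;
    }

When userspace later answers the upcall, the cache's cache_parse() routine builds a template entry (say tmpl) and installs the content with sunrpc_cache_update(&example_cache, &tmpl.h, &found->h, hash), following the update-or-replace rules described above.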

5.1.3 Populating a cache

Each cache has a name, and when the cache is registered, a directory with that name is created in /proc/net/rpc

This directory contains a file called ‘channel’ which is a channel for communicating between kernel and user for populating the cache. This directory may later contain other files of interacting with the cache.

The ‘channel’ works a bit like a datagram socket. Each ‘write’ is passed as a whole to the cache for parsing and interpretation. Each cache can treat the write requests differently, but it is expected that a message written will contain:

  • a key
  • an expiry time
  • a content.

with the intention that an item in the cache with the given key should be created or updated to have the given content, and the expiry time should be set on that item.

Reading from a channel is a bit more interesting. When a cache lookup fails, or when it succeeds but finds an entry that may soon expire, a request is lodged for that cache item to be updated by user-space. These requests appear in the channel file.

Successive reads will return successive requests. If there are no more requests to return, read will return EOF, but a select or poll for read will block waiting for another request to be added.

Thus a user-space helper is likely to:

open the channel.
  select for readable
  read a request
  write a response
loop.
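
A minimal user-space helper along those lines might look like the sketch below. The cache name and the one-line request/response format are hypothetical (they are whatever the cache's cache_request() and cache_parse() routines agreed on); here every request is answered with the content “unknown” and a far-future expiry time:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char req[8192], reply[8192];
            int fd = open("/proc/net/rpc/example.map/channel", O_RDWR);

            if (fd < 0)
                    return 1;

            for (;;) {
                    struct pollfd pfd = { .fd = fd, .events = POLLIN };
                    ssize_t n;

                    if (poll(&pfd, 1, -1) <= 0)
                            continue;
                    n = read(fd, req, sizeof(req) - 1);    /* one whole request */
                    if (n <= 0)
                            continue;
                    if (req[n - 1] == '\n')
                            n--;                           /* strip the newline */
                    req[n] = '\0';

                    /* Response: key, expiry time, content - one line. */
                    snprintf(reply, sizeof(reply), "%s 2147483647 unknown\n", req);
                    write(fd, reply, strlen(reply));       /* one whole response */
            }
    }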

If it dies and needs to be restarted, any requests that have not been answered will still appear in the file and will be read by the new instance of the helper.

Each cache should define a “cache_parse” method which takes a message written from user-space and processes it. It should return an error (which propagates back to the write syscall) or 0.

Each cache should also define a “cache_request” method which takes a cache item and encodes a request into the buffer provided.

Note

If a cache has no active readers on the channel, and has had no active readers for more than 60 seconds, further requests will not be added to the channel but instead all lookups that do not find a valid entry will fail. This is partly for backward compatibility: The previous nfs exports table was deemed to be authoritative and a failed lookup meant a definite ‘no’.

5.1.4 request/response format

While each cache is free to use its own format for requests and responses over channel, the following is recommended as appropriate and support routines are available to help: Each request or response record should be printable ASCII with precisely one newline character which should be at the end. Fields within the record should be separated by spaces, normally one. If spaces, newlines, or nul characters are needed in a field they must be quoted. Two mechanisms are available:

  • If a field begins ‘\x’ then it must contain an even number of hex digits, and pairs of these digits provide the bytes in the field.
  • otherwise a ‘\’ in the field must be followed by 3 octal digits which give the code for a byte. Other characters are treated as themselves. At the very least, space, newline, nul, and ‘\’ must be quoted in this way.
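
The quoting is implemented by helper routines in net/sunrpc/cache.c: qword_add() emits a space-terminated field with the octal escaping, qword_addhex() emits the ‘\x’ form, and qword_get() decodes either form when parsing. A cache_request routine for the hypothetical example_map cache could therefore be as small as the sketch below (the exact cache_request prototype has varied a little between kernel versions):

    static void example_request(struct cache_detail *cd, struct cache_head *h,
                                char **bpp, int *blen)
    {
            struct example_map *em = container_of(h, struct example_map, h);

            /* qword_add() escapes space, newline, backslash etc. as \ooo,
             * appends a space, and advances *bpp while shrinking *blen. */
            qword_add(bpp, blen, em->addr);
            (*bpp)[-1] = '\n';             /* turn the trailing space into EOL */
    }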

6 rpcsec_gss support for kernel RPC servers

This document gives references to the standards and protocols used to implement RPCGSS authentication in kernel RPC servers such as the NFS server and the NFS client’s NFSv4.0 callback server. (But note that NFSv4.1 and higher don’t require the client to act as a server for the purposes of authentication.)

RPCGSS is specified in a few IETF documents:

  • RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
  • RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt

There is a third version that we don’t currently implement:

  • RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt

6.1 Background

The RPCGSS Authentication method describes a way to perform GSSAPI Authentication for NFS. Although GSSAPI is itself completely mechanism agnostic, in many cases only the KRB5 mechanism is supported by NFS implementations.

The Linux kernel, at the moment, supports only the KRB5 mechanism, and depends on GSSAPI extensions that are KRB5 specific.

GSSAPI is a complex library, and implementing it completely in kernel is unwarranted. However GSSAPI operations are fundamentally separable into 2 parts:

  • initial context establishment
  • integrity/privacy protection (signing and encrypting of individual packets)

The former is more complex and policy-independent, but less performance-sensitive. The latter is simpler and needs to be very fast.

Therefore, we perform per-packet integrity and privacy protection in the kernel, but leave the initial context establishment to userspace. We need upcalls to request userspace to perform context establishment.

6.2 NFS Server Legacy Upcall Mechanism

The classic upcall mechanism uses a custom text based upcall mechanism to talk to a custom daemon called rpc.svcgssd that is provided by the nfs-utils package.

This upcall mechanism has 2 limitations:

  1. It can handle tokens that are no bigger than 2KiB

In some Kerberos deployments GSSAPI tokens can be quite big, up and beyond 64KiB in size, due to various authorization extensions attached to the Kerberos tickets, which need to be sent through the GSS layer in order to perform context establishment.

  2. It does not properly handle creds where the user is a member of more than a few thousand groups (the current hard limit in the kernel is 65K groups) due to a limitation on the size of the buffer that can be sent back to the kernel (4KiB).

6.3 NFS Server New RPC Upcall Mechanism

The newer upcall mechanism uses RPC over a unix socket to a daemon called gss-proxy, implemented by a userspace program called Gssproxy.

The gss_proxy RPC protocol is currently documented here.

This upcall mechanism uses the kernel rpc client and connects to the gssproxy userspace program over a regular unix socket. The gssproxy protocol does not suffer from the size limitations of the legacy protocol.

6.4 Negotiating Upcall Mechanisms

To provide backward compatibility, the kernel defaults to using the legacy mechanism. To switch to the new mechanism, gss-proxy must bind to /var/run/gssproxy.sock and then write “1” to /proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both steps.

Once the upcall mechanism is chosen, it cannot be changed. To prevent locking into the legacy mechanisms, the above steps must be performed before starting nfsd. Whoever starts nfsd can guarantee this by reading from /proc/net/rpc/use-gss-proxy and checking that it contains a “1”–the read will block until gss-proxy has done its write to the file.

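For example, an init script or service manager could gate nfsd startup on a check like the following sketch (user-space C; the function name is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    /* Returns 0 once gss-proxy has registered itself; the read blocks
     * until gss-proxy has written "1" to the file. */
    static int wait_for_gss_proxy(void)
    {
            char c = 0;
            int ok, fd = open("/proc/net/rpc/use-gss-proxy", O_RDONLY);

            if (fd < 0)
                    return -1;     /* no gss-proxy support in this kernel */
            ok = (read(fd, &c, 1) == 1 && c == '1');
            close(fd);
            return ok ? 0 : -1;
    }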

7 NFSv4.1 Server Implementation

Server support for minorversion 1 can be controlled using the /proc/fs/nfsd/versions control file. The string output returned by reading this file will contain either “+4.1” or “-4.1” correspondingly.

Currently, server support for minorversion 1 is enabled by default. It can be disabled at run time by writing the string “-4.1” to the /proc/fs/nfsd/versions control file. Note that to write this control file, the nfsd service must be taken down. You can use rpc.nfsd for this; see rpc.nfsd(8).

(Warning: older servers will interpret “+4.1” and “-4.1” as “+4” and “-4”, respectively. Therefore, code meant to work on both new and old kernels must turn 4.1 on or off before turning support for version 4 on or off; rpc.nfsd does this correctly.)

The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based on RFC 5661.

From the many new features in NFSv4.1 the current implementation focuses on the mandatory-to-implement NFSv4.1 Sessions, providing “exactly once” semantics and better control and throttling of the resources allocated for each client.

The table below, taken from the NFSv4.1 document, lists the operations that are mandatory to implement (REQ), optional (OPT), and NFSv4.0 operations that are required not to implement (MNI) in minor version 1. The first column indicates the operations that are not supported yet by the linux server implementation.

The OPTIONAL features identified and their abbreviations are as follows:

  • pNFS Parallel NFS
  • FDELG File Delegations
  • DDELG Directory Delegations

The following abbreviations indicate the linux server implementation status.

  • I Implemented NFSv4.1 operations.
  • NS Not Supported.
  • NS* Unimplemented optional feature.

7.1 Operations

Implementation status Operation REQ,REC, OPT or NMI Feature (REQ, REC or OPT) Definition
ACCESS REQ Section 18.1
I BACKCHANNEL_CTL REQ Section 18.33
I BIND_CONN_TO_SESSION REQ Section 18.34
CLOSE REQ Section 18.2
COMMIT REQ Section 18.3
CREATE REQ Section 18.4
I CREATE_SESSION REQ Section 18.36
NS* DELEGPURGE OPT FDELG (REQ) Section 18.5
DELEGRETURN OPT FDELG, Section 18.6
DDELG, pNFS
(REQ)
I DESTROY_CLIENTID REQ Section 18.50
I DESTROY_SESSION REQ Section 18.37
I EXCHANGE_ID REQ Section 18.35
I FREE_STATEID REQ Section 18.38
GETATTR REQ Section 18.7
I GETDEVICEINFO OPT pNFS (REQ) Section 18.40
NS* GETDEVICELIST OPT pNFS (OPT) Section 18.41
GETFH REQ Section 18.8
NS* GET_DIR_DELEGATION OPT DDELG (REQ) Section 18.39
I LAYOUTCOMMIT OPT pNFS (REQ) Section 18.42
I LAYOUTGET OPT pNFS (REQ) Section 18.43
I LAYOUTRETURN OPT pNFS (REQ) Section 18.44
LINK OPT Section 18.9
LOCK REQ Section 18.10
LOCKT REQ Section 18.11
LOCKU REQ Section 18.12
LOOKUP REQ Section 18.13
LOOKUPP REQ Section 18.14
NVERIFY REQ Section 18.15
OPEN REQ Section 18.16
NS* OPENATTR OPT Section 18.17
OPEN_CONFIRM MNI N/A
OPEN_DOWNGRADE REQ Section 18.18
PUTFH REQ Section 18.19
PUTPUBFH REQ Section 18.20
PUTROOTFH REQ Section 18.21
READ REQ Section 18.22
READDIR REQ Section 18.23
READLINK OPT Section 18.24
RECLAIM_COMPLETE REQ Section 18.51
RELEASE_LOCKOWNER MNI N/A
REMOVE REQ Section 18.25
RENAME REQ Section 18.26
RENEW MNI N/A
RESTOREFH REQ Section 18.27
SAVEFH REQ Section 18.28
SECINFO REQ Section 18.29
I SECINFO_NO_NAME REC pNFS files Section 18.45,
layout (REQ) Section 13.12
I SEQUENCE REQ Section 18.46
SETATTR REQ Section 18.30
SETCLIENTID MNI N/A
SETCLIENTID_CONFIRM MNI N/A
NS SET_SSV REQ Section 18.47
I TEST_STATEID REQ Section 18.48
VERIFY REQ Section 18.31
NS* WANT_DELEGATION OPT FDELG (OPT) Section 18.49
WRITE REQ Section 18.32

7.2 Callback Operations

Implementation status Operation REQ,REC, OPT or NMI Feature (REQ, REC or OPT) Definition
CB_GETATTR OPT FDELG (REQ) Section 20.1
I CB_LAYOUTRECALL OPT pNFS (REQ) Section 20.3
NS* CB_NOTIFY OPT DDELG (REQ) Section 20.4
NS* CB_NOTIFY_DEVICEID OPT pNFS (OPT) Section 20.12
NS* CB_NOTIFY_LOCK OPT Section 20.11
NS* CB_PUSH_DELEG OPT FDELG (OPT) Section 20.5
CB_RECALL OPT FDELG, Section 20.2
DDELG, pNFS
(REQ)
NS* CB_RECALL_ANY OPT FDELG, Section 20.6
DDELG, pNFS
(REQ)
NS CB_RECALL_SLOT REQ Section 20.8
NS* CB_RECALLABLE_OBJ_AVAIL OPT DDELG, pNFS Section 20.7
(REQ)
I CB_SEQUENCE OPT FDELG, Section 20.9
DDELG, pNFS
(REQ)
NS* CB_WANTS_CANCELLED OPT FDELG, Section 20.10
DDELG, pNFS
(REQ)

7.3 Implementation notes

  • SSV:

The spec claims this is mandatory, but we don’t actually know of any implementations, so we’re ignoring it for now. The server returns NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.

  • GSS on the backchannel:

Again, theoretically required but not widely implemented (in particular, the current Linux client doesn’t request it). We return NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.

  • DELEGPURGE:

mandatory only for servers that support CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that persist across client reboots). Thus we need not implement this for now.

  • EXCHANGE_ID:

implementation ids are ignored

  • CREATE_SESSION:

backchannel attributes are ignored

  • SEQUENCE:

no support for dynamic slot table renegotiation (optional)

  • Nonstandard compound limitations:

No support for a sessions fore channel RPC compound that requires both a ca_maxrequestsize request and a ca_maxresponsesize reply, so we may fail to live up to the promise we made in CREATE_SESSION fore channel negotiation.

See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.

8 Kernel NFS Server Statistics

  • Authors

Greg Banks <gnb@sgi.com> - 26 Mar 2009

This document describes the format and semantics of the statistics which the kernel NFS server makes available to userspace. These statistics are available in several text form pseudo files, each of which is described separately below.

In most cases you don’t need to know these formats, as the nfsstat(8) program from the nfs-utils distribution provides a helpful command-line interface for extracting and printing them.

All the files described here are formatted as a sequence of text lines, separated by newline (‘\n’) characters. Lines beginning with a hash ‘#’ character are comments intended for humans and should be ignored by parsing routines. All other lines contain a sequence of fields separated by whitespace.

8.1 /proc/fs/nfsd/pool_stats

This file is available in kernels from 2.6.30 onwards, if the /proc/fs/nfsd filesystem is mounted (it almost always should be).

The first line is a comment which describes the fields present in all the other lines. The other lines present the following data as a sequence of unsigned decimal numeric fields. One line is shown for each NFS thread pool.

All counters are 64 bits wide and wrap naturally. There is no way to zero these counters, instead applications should do their own rate conversion.

  • pool

The id number of the NFS thread pool to which this line applies. This number does not change.

Thread pool ids are a contiguous set of small integers starting at zero. The maximum value depends on the thread pool mode, but currently cannot be larger than the number of CPUs in the system. Note that in the default case there will be a single thread pool which contains all the nfsd threads and all the CPUs in the system, and thus this file will have a single line with a pool id of “0”.

  • packets-arrived

Counts how many NFS packets have arrived. More precisely, this is the number of times that the network stack has notified the sunrpc server layer that new data may be available on a transport (e.g. an NFS or UDP socket or an NFS/RDMA endpoint).

Depending on the NFS workload patterns and various network stack effects (such as Large Receive Offload) which can combine packets on the wire, this may be either more or less than the number of NFS calls received (which statistic is available elsewhere). However this is a more accurate and less workload-dependent measure of how much CPU load is being placed on the sunrpc server layer due to NFS network traffic.

  • sockets-enqueued

Counts how many times an NFS transport is enqueued to wait for an nfsd thread to service it, i.e. no nfsd thread was considered available.

The circumstance this statistic tracks indicates that there was NFS network-facing work to be done but it couldn’t be done immediately, thus introducing a small delay in servicing NFS calls. The ideal rate of change for this counter is zero; significantly non-zero values may indicate a performance limitation.

This can happen because there are too few nfsd threads in the thread pool for the NFS workload (the workload is thread-limited), in which case configuring more nfsd threads will probably improve the performance of the NFS workload.

  • threads-woken

Counts how many times an idle nfsd thread is woken to try to receive some data from an NFS transport.

This statistic tracks the circumstance where incoming network-facing NFS work is being handled quickly, which is a good thing. The ideal rate of change for this counter will be close to but less than the rate of change of the packets-arrived counter.

  • threads-timedout

Counts how many times an nfsd thread triggered an idle timeout, i.e. was not woken to handle any incoming network packets for some time.

This statistic counts a circumstance where there are more nfsd threads configured than can be used by the NFS workload. This is a clue that the number of nfsd threads can be reduced without affecting performance. Unfortunately, it’s only a clue and not a strong indication, for a couple of reasons:

  • Currently the rate at which the counter is incremented is quite slow; the idle timeout is 60 minutes. Unless the NFS workload remains constant for hours at a time, this counter is unlikely to be providing information that is still useful.
  • It is usually a wise policy to provide some slack, i.e. configure a few more nfsds than are currently needed, to allow for future spikes in load.

Note that incoming packets on NFS transports will be dealt with in one of three ways. An nfsd thread can be woken (threads-woken counts this case), or the transport can be enqueued for later attention (sockets-enqueued counts this case), or the packet can be temporarily deferred because the transport is currently being used by an nfsd thread. This last case is not very interesting and is not explicitly counted, but can be inferred from the other counters thus:

packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )

8.2 More

Descriptions of the other statistics file should go here.
