FUSE

1 Definitions

  • Userspace filesystem:

A filesystem in which data and metadata are provided by an ordinary userspace process. The filesystem can be accessed normally through the kernel interface.

  • Filesystem daemon:

The process(es) providing the data and metadata of the filesystem.

  • Non-privileged mount (or user mount):

A userspace filesystem mounted by a non-privileged (non-root) user. The filesystem daemon runs with the privileges of the mounting user. NOTE: this is not the same as mounts allowed with the “user” option in /etc/fstab, which is not discussed here.

  • Filesystem connection:

A connection between the filesystem daemon and the kernel. The connection exists until either the daemon dies or the filesystem is unmounted. Note that detaching (or lazy unmounting) the filesystem does not break the connection; in this case it exists until the last reference to the filesystem is released.

  • Mount owner:

The user who does the mounting.

  • User:

The user who is performing filesystem operations.

2 What is FUSE?

FUSE is a userspace filesystem framework. It consists of a kernel module (fuse.ko), a userspace library (libfuse.*) and a mount utility (fusermount).

One of the most important features of FUSE is allowing secure, non-privileged mounts. This opens up new possibilities for the use of filesystems. A good example is sshfs: a secure network filesystem using the sftp protocol.

The userspace library and utilities are available from the FUSE homepage.

3 Filesystem type

The filesystem type given to mount(2) can be one of the following:

  • fuse

This is the usual way to mount a FUSE filesystem. The first argument of the mount system call may contain an arbitrary string, which is not interpreted by the kernel.

  • fuseblk

The filesystem is block device based. The first argument of the mount system call is interpreted as the name of the device.

4 Mount options

  • fd=N

The file descriptor to use for communication between the userspace filesystem and the kernel. The file descriptor must have been obtained by opening the FUSE device (‘/dev/fuse’).

  • rootmode=M

The file mode of the filesystem’s root in octal representation.

  • user_id=N

The numeric user id of the mount owner.

  • group_id=N

The numeric group id of the mount owner.

  • default_permissions

By default FUSE doesn’t check file access permissions; the filesystem is free to implement its own access policy or leave it to the underlying file access mechanism (e.g. in case of network filesystems). This option enables permission checking, restricting access based on file mode. It is usually useful together with the ‘allow_other’ mount option.

  • allow_other

This option overrides the security measure restricting file access to the user mounting the filesystem. By default only root may use this option, but this restriction can be removed with a (userspace) configuration option.

  • max_read=N

With this option the maximum size of read operations can be set. The default is infinite. Note that the size of read requests is limited anyway to 32 pages (which is 128 KiB on i386).

  • blksize=N

Set the block size for the filesystem. The default is 512. This option is only valid for ‘fuseblk’ type mounts.
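The options above reach the kernel as a single comma-separated string in the data argument of mount(2). The following is a minimal sketch of how such a string could be assembled; the helper name `fuse_mount_data` and the example values are assumptions made for illustration, and a real mount would normally go through fusermount rather than code like this.

```python
import stat


def fuse_mount_data(fd, uid, gid, root_mode=stat.S_IFDIR, extra=()):
    """Build a comma-separated FUSE option string (hypothetical helper)."""
    opts = [
        f"fd={fd}",                 # descriptor obtained by opening /dev/fuse
        f"rootmode={root_mode:o}",  # octal representation, no "0o" prefix
        f"user_id={uid}",           # numeric user id of the mount owner
        f"group_id={gid}",          # numeric group id of the mount owner
    ]
    opts.extend(extra)              # e.g. "default_permissions", "max_read=N"
    return ",".join(opts)


if __name__ == "__main__":
    # Example values only; fd 3 is not a real /dev/fuse descriptor here.
    print(fuse_mount_data(3, 1000, 1000, extra=("default_permissions",)))
```

The root mode defaults to a directory type here, since the root of a filesystem is typically a directory.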

5 Control filesystem

There’s a control filesystem for FUSE, which can be mounted by:

mount -t fusectl none /sys/fs/fuse/connections

Mounting it under the ‘/sys/fs/fuse/connections’ directory makes it backwards compatible with earlier versions.

Under the fuse control filesystem each connection has a directory named by a unique number.

For each connection the following files exist within this directory:

  • waiting

The number of requests which are waiting to be transferred to userspace or being processed by the filesystem daemon. If there is no filesystem activity and ‘waiting’ is non-zero, then the filesystem is hung or deadlocked.

  • abort

Writing anything into this file will abort the filesystem connection. This means that all waiting requests will be aborted and an error returned for all aborted and new requests.

Only the owner of the mount may read or write these files.
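As a rough illustration of how the ‘waiting’ file might be inspected, the sketch below walks the control filesystem and collects the counter for each connection. The function name `waiting_requests` is hypothetical, and the code assumes that ‘waiting’ contains a single decimal number and that the control filesystem is mounted at the path shown above.

```python
import os


def waiting_requests(base="/sys/fs/fuse/connections"):
    """Map each connection directory name to the number in its 'waiting' file."""
    result = {}
    if not os.path.isdir(base):
        return result  # control filesystem not mounted here
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name, "waiting")
        try:
            with open(path) as f:
                result[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Not the mount owner, connection gone, or unexpected content.
            continue
    return result


if __name__ == "__main__":
    for conn, count in waiting_requests().items():
        print(f"connection {conn}: {count} waiting")
```

A non-zero count on an otherwise idle filesystem would be the hint of a hang or deadlock that the text describes.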

5.1 Interrupting filesystem operations

If a process issuing a FUSE filesystem request is interrupted, the following will happen:

  • If the request is not yet sent to userspace AND the signal is fatal (SIGKILL or unhandled fatal signal), then the request is dequeued and returns immediately.
  • If the request is not yet sent to userspace AND the signal is not fatal, then an interrupted flag is set for the request. When the request has been successfully transferred to userspace and this flag is set, an INTERRUPT request is queued.
  • If the request is already sent to userspace, then an INTERRUPT request is queued.

INTERRUPT requests take precedence over other requests, so the userspace filesystem will receive queued INTERRUPTs before any others.
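The three cases and the precedence rule can be summarized in a small model. This is only a sketch of the behavior as described in the text, not kernel code; the names `on_signal` and `DeviceQueue` are invented for the example.

```python
from collections import deque


def on_signal(sent_to_userspace, fatal):
    """Return the action taken when the requesting process is interrupted."""
    if not sent_to_userspace:
        if fatal:
            return "dequeue"          # request is dropped and returns at once
        return "flag"                 # INTERRUPT queued later, after transfer
    return "queue_interrupt"          # request already in userspace


class DeviceQueue:
    """Toy queue in which INTERRUPT requests are delivered before others."""

    def __init__(self):
        self.interrupts = deque()
        self.pending = deque()

    def queue(self, opcode, unique):
        target = self.interrupts if opcode == "INTERRUPT" else self.pending
        target.append((opcode, unique))

    def next_request(self):
        # Queued INTERRUPTs take precedence over every other request.
        if self.interrupts:
            return self.interrupts.popleft()
        if self.pending:
            return self.pending.popleft()
        return None


if __name__ == "__main__":
    q = DeviceQueue()
    q.queue("READ", 7)
    q.queue("INTERRUPT", 5)
    print(q.next_request())  # the INTERRUPT is handed out first
```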

The userspace filesystem may ignore the INTERRUPT requests entirely, or may honor them by sending a reply to the original request, with the error set to EINTR.

It is also possible that there’s a race between processing the original request and its INTERRUPT request. There are two possibilities:

  1. The INTERRUPT request is processed before the original request is processed
  2. The INTERRUPT request is processed after the original request has been answered

If the filesystem cannot find the original request, it should wait for some timeout and/or a number of new requests to arrive, after which it should reply to the INTERRUPT request with an EAGAIN error. In case 1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT reply will be ignored.
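On the daemon side, the handling described above could look roughly like the following sketch. The `Daemon` class and its method names are hypothetical, replies are recorded in a list instead of being written to the FUSE device, and the waiting period before the EAGAIN reply is left out for brevity. Errors are stored as negative errno values, which is an assumption about the reply format.

```python
import errno


class Daemon:
    """Toy filesystem daemon that honors INTERRUPT requests."""

    def __init__(self):
        self.in_flight = {}  # unique id -> operation name
        self.replies = []    # (unique id, error) pairs sent back to the kernel

    def receive(self, unique, op):
        self.in_flight[unique] = op

    def finish(self, unique, error=0):
        del self.in_flight[unique]
        self.replies.append((unique, error))

    def handle_interrupt(self, interrupt_unique, target_unique):
        if target_unique in self.in_flight:
            # Honor the interrupt: answer the ORIGINAL request with EINTR.
            self.finish(target_unique, -errno.EINTR)
        else:
            # Original not found (not yet seen, or already answered):
            # reply to the INTERRUPT itself with EAGAIN.
            self.replies.append((interrupt_unique, -errno.EAGAIN))


if __name__ == "__main__":
    d = Daemon()
    d.receive(10, "READ")
    d.handle_interrupt(11, 10)  # original found: READ gets EINTR
    d.handle_interrupt(12, 99)  # original unknown: INTERRUPT gets EAGAIN
    print(d.replies)
```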

6 Aborting a filesystem connection

It is possible to get into certain situations where the filesystem is not responding. Reasons for this may be:

  a. Broken userspace filesystem implementation
  b. Network connection down
  c. Accidental deadlock
  d. Malicious deadlock

(For more on c) and d) see later sections)

In any of these cases it may be useful to abort the connection to the filesystem. There are several ways to do this:

  • Kill the filesystem daemon. Works in case of a) and b)
  • Kill the filesystem daemon and all users of the filesystem. Works in all cases except some malicious deadlocks
  • Use forced umount (umount -f). Works in all cases but only if filesystem is still attached (it hasn’t been lazy unmounted)
  • Abort filesystem through the FUSE control filesystem. Most powerful method, always works.
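The last method amounts to writing to the ‘abort’ file of the connection. A minimal sketch follows, assuming the control filesystem is mounted at the usual location and the caller is the mount owner; the helper name `abort_connection` is made up for this example.

```python
import os


def abort_connection(conn_id, base="/sys/fs/fuse/connections"):
    """Write to the 'abort' file of one connection and return its path."""
    path = os.path.join(base, str(conn_id), "abort")
    with open(path, "w") as f:
        f.write("1")  # the content is irrelevant; any write aborts
    return path
```

The connection number is the name of the directory under the control filesystem, so it has to be looked up there first.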

7 How do non-privileged mounts work?

Since the mount() system call is a privileged operation, a helper program (fusermount) is needed, which is installed setuid root.

The implication of providing non-privileged mounts is that the mount owner must not be able to use this capability to compromise the system. Obvious requirements arising from this are:

  A. mount owner should not be able to get elevated privileges with the help of the mounted filesystem
  B. mount owner should not get illegitimate access to information from other users’ and the super user’s processes
  C. mount owner should not be able to induce undesired behavior in other users’ or the super user’s processes

8 How are requirements fulfilled?

  A. The mount owner could gain elevated privileges by either:

    1. creating a filesystem containing a device file, then opening this device
    2. creating a filesystem containing a suid or sgid application, then executing this application

    The solution is not to allow opening device files and to ignore setuid and setgid bits when executing programs. To ensure this, fusermount always adds “nosuid” and “nodev” to the mount options for non-privileged mounts.

  B. If another user is accessing files or directories in the filesystem, the filesystem daemon serving requests can record the exact sequence and timing of operations performed. This information is otherwise inaccessible to the mount owner, so this counts as an information leak.

    The solution to this problem will be presented in point 2) of C).

  C. There are several ways in which the mount owner can induce undesired behavior in other users’ processes, such as:

    1. mounting a filesystem over a file or directory which the mount owner could otherwise not be able to modify (or could only make limited modifications).

    This is solved in fusermount, by checking the access permissions on the mountpoint and only allowing the mount if the mount owner can do unlimited modification (has write access to the mountpoint, and mountpoint is not a “sticky” directory).

    2. Even if 1) is solved the mount owner can change the behavior of other users’ processes.

      i. It can slow down or indefinitely delay the execution of a filesystem operation, creating a DoS against the user or the whole system. For example, a suid application that locks a system file and then accesses a file on the mount owner’s filesystem could be stopped, causing the system file to be locked forever.
      ii. It can present files or directories of unlimited length, or directory structures of unlimited depth, possibly causing a system process to eat up diskspace, memory or other resources, again causing DoS.

    The solution to this as well as B) is not to allow access to the filesystem by processes that the mount owner could not otherwise monitor or manipulate. Since the mount owner can do all of the above to a process it can ptrace, even without a FUSE mount, the same criteria as used in ptrace can be used to check if a process is allowed to access the filesystem or not.

    Note that the ptrace check is not strictly necessary to prevent C/2/i; it is enough to check if the mount owner has enough privilege to send a signal to the process accessing the filesystem, since SIGSTOP can be used to get a similar effect.
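The mountpoint rule described above under C (write access to the mountpoint, and the mountpoint is not a “sticky” directory) can be approximated in a few lines. This is a loose model of the rule as stated in the text, not a transcription of what fusermount actually does; the function name `mountpoint_allowed` is an assumption.

```python
import os
import stat


def mountpoint_allowed(path):
    """Approximate the 'unlimited modification' test for a mountpoint."""
    st = os.stat(path)
    sticky = bool(st.st_mode & stat.S_ISVTX)  # e.g. /tmp is usually sticky
    writable = os.access(path, os.W_OK)       # checked for the calling user
    return writable and not sticky
```

A directory such as a typical /tmp would be rejected under this model because of its sticky bit, even though everyone can write to it.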

9 I think these limitations are unacceptable?

If a sysadmin trusts the users enough, or can ensure through other measures that system processes will never enter non-privileged mounts, the last limitation can be relaxed in several ways:

  • With the ‘user_allow_other’ config option. If this config option is set, the mounting user can add the ‘allow_other’ mount option which disables the check for other users’ processes.

    User namespaces have an unintuitive interaction with ‘allow_other’: an unprivileged user - normally restricted from mounting with ‘allow_other’ - could do so in a user namespace where they’re privileged. If any process could access such an ‘allow_other’ mount this would give the mounting user the ability to manipulate processes in user namespaces where they’re unprivileged. For this reason ‘allow_other’ restricts access to users in the same userns or a descendant.

  • With the ‘allow_sys_admin_access’ module option. If this option is set, super user’s processes have unrestricted access to mounts irrespective of allow_other setting or user namespace of the mounting user.

Note that both of these relaxations expose the system to potential information leak or DoS as described in points B and C/2/i-ii in the preceding section.
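The access rule and its two relaxations can be pictured with a simplified decision function. Everything here is a model built on assumptions: the `Cred` type and `may_access` are invented names, “super user” is reduced to an effective uid of 0, the ptrace-like criterion is approximated by requiring all user and group ids of the process to match the mount owner, and the user namespace restriction on ‘allow_other’ is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Cred:
    """Real, effective and saved ids of a process (simplified)."""
    uid: int
    euid: int
    suid: int
    gid: int
    egid: int
    sgid: int


def may_access(cred, owner_uid, owner_gid,
               allow_other=False, allow_sys_admin_access=False):
    """Decide whether a process may enter the mount (toy model)."""
    if allow_sys_admin_access and cred.euid == 0:
        return True   # module option: super user is always let in
    if allow_other:
        return True   # mount option: the per-user check is disabled
    # Default: only processes fully owned by the mount owner may enter.
    same_user = all(u == owner_uid for u in (cred.uid, cred.euid, cred.suid))
    same_group = all(g == owner_gid for g in (cred.gid, cred.egid, cred.sgid))
    return same_user and same_group


if __name__ == "__main__":
    owner = Cred(1000, 1000, 1000, 1000, 1000, 1000)
    suid_root = Cred(1000, 0, 0, 1000, 1000, 1000)  # suid program run by owner
    print(may_access(owner, 1000, 1000))      # owner's own process
    print(may_access(suid_root, 1000, 1000))  # kept out by default
```

Under this model a suid root program started by the mount owner is refused by default, which matches the concern raised in point C/2/i.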

10 Kernel - userspace interface

The following diagram shows how a filesystem operation (in this example unlink) is performed in FUSE.

|  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
|                                    |
|                                    |  >sys_read()
|                                    |    >fuse_dev_read()
|                                    |      >request_wait()
|                                    |        [sleep on fc->waitq]
|                                    |
|  >sys_unlink()                     |
|    >fuse_unlink()                  |
|      [get request from             |
|       fc->unused_list]             |
|      >request_send()               |
|        [queue req on fc->pending]  |
|        [wake up fc->waitq]         |        [woken up]
|        >request_wait_answer()      |
|          [sleep on req->waitq]     |
|                                    |      <request_wait()
|                                    |      [remove req from fc->pending]
|                                    |      [copy req to read buffer]
|                                    |      [add req to fc->processing]
|                                    |    <fuse_dev_read()
|                                    |  <sys_read()
|                                    |
|                                    |  [perform unlink]
|                                    |
|                                    |  >sys_write()
|                                    |    >fuse_dev_write()
|                                    |      [look up req in fc->processing]
|                                    |      [remove from fc->processing]
|                                    |      [copy write buffer to req]
|          [woken up]                |      [wake up req->waitq]
|                                    |    <fuse_dev_write()
|                                    |  <sys_write()
|        <request_wait_answer()      |
|      <request_send()               |
|      [add request to               |
|       fc->unused_list]             |
|    <fuse_unlink()                  |
|  <sys_unlink()                     |

Note

Everything in the description above is greatly simplified.
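The request life cycle in the diagram (pending list, processing list, answer) can be mimicked with a toy model. This sketch only tracks which list a request is on; the sleeping and waking on wait queues is not modeled, and the class and method names are loosely borrowed from the diagram rather than taken from the kernel source.

```python
from collections import deque


class Connection:
    """Toy model of the fc->pending and fc->processing lists."""

    def __init__(self):
        self.pending = deque()  # queued, not yet read by the daemon
        self.processing = {}    # read by the daemon, not yet answered
        self.answers = {}       # unique id -> error code from the daemon
        self.next_unique = 1

    def request_send(self, opcode, arg):
        """Caller side: queue a request and return its unique id."""
        unique = self.next_unique
        self.next_unique += 1
        self.pending.append((unique, opcode, arg))
        return unique

    def dev_read(self):
        """Daemon side: take the oldest pending request."""
        req = self.pending.popleft()
        self.processing[req[0]] = req
        return req

    def dev_write(self, unique, error):
        """Daemon side: answer a request that is being processed."""
        del self.processing[unique]
        self.answers[unique] = error


if __name__ == "__main__":
    fc = Connection()
    u = fc.request_send("UNLINK", "file")  # "rm /mnt/fuse/file"
    unique, opcode, arg = fc.dev_read()    # daemon picks it up
    fc.dev_write(unique, 0)                # daemon reports success
    print(opcode, arg, fc.answers[u])
```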

There are a couple of ways in which to deadlock a FUSE filesystem. Since we are talking about unprivileged userspace programs, something must be done about these.

Scenario 1 - Simple deadlock:

|  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
|                                    |
|  >sys_unlink("/mnt/fuse/file")     |
|    [acquire inode semaphore        |
|     for "file"]                    |
|    >fuse_unlink()                  |
|      [sleep on req->waitq]         |
|                                    |  <sys_read()
|                                    |  >sys_unlink("/mnt/fuse/file")
|                                    |    [acquire inode semaphore
|                                    |     for "file"]
|                                    |    *DEADLOCK*

The solution for this is to allow the filesystem to be aborted.
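The shape of this deadlock can be reproduced in userspace with an ordinary lock standing in for the inode semaphore. The sketch below is only an analogy: the "daemon" thread tries to take a lock that the "caller" holds while waiting for that same daemon, and a timeout is used so that the example terminates instead of hanging. All names are invented for the illustration.

```python
import threading

inode_lock = threading.Lock()  # stands in for the inode semaphore of "file"


def daemon_unlink(timeout=0.2):
    """Daemon side: try to take the inode lock, giving up after a timeout."""
    acquired = inode_lock.acquire(timeout=timeout)
    if acquired:
        inode_lock.release()
    return acquired


def caller_unlink():
    """Caller side: hold the inode lock while waiting for the daemon."""
    result = {}
    with inode_lock:
        t = threading.Thread(
            target=lambda: result.update(acquired=daemon_unlink()))
        t.start()
        t.join()  # the caller "sleeps" until the daemon is done
    return result["acquired"]


if __name__ == "__main__":
    # Without the timeout both sides would wait forever.
    print("daemon got the lock:", caller_unlink())
```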

Scenario 2 - Tricky deadlock

This one needs a carefully crafted filesystem. It’s a variation on the above, only the call back to the filesystem is not explicit, but is caused by a pagefault.

|  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
|                                    |
|  [fd = open("/mnt/fuse/file")]     |  [request served normally]
|  [mmap fd to 'addr']               |
|  [close fd]                        |  [FLUSH triggers 'magic' flag]
|  [read a byte from addr]           |
|    >do_page_fault()                |
|      [find or create page]         |
|      [lock page]                   |
|      >fuse_readpage()              |
|         [queue READ request]       |
|         [sleep on req->waitq]      |
|                                    |  [read request to buffer]
|                                    |  [create reply header before addr]
|                                    |  >sys_write(addr - headerlength)
|                                    |    >fuse_dev_write()
|                                    |      [look up req in fc->processing]
|                                    |      [remove from fc->processing]
|                                    |      [copy write buffer to req]
|                                    |        >do_page_fault()
|                                    |           [find or create page]
|                                    |           [lock page]
|                                    |           * DEADLOCK *

The solution is basically the same as above.

An additional problem is that while the write buffer is being copied to the request, the request must not be interrupted/aborted. This is because the destination address of the copy may not be valid after the request has returned.

This is solved by doing the copy atomically, and allowing abort while the page(s) belonging to the write buffer are faulted with get_user_pages(). The ‘req->locked’ flag indicates when the copy is taking place, and abort is delayed until this flag is unset.

11 Fuse I/O Modes

Fuse supports the following I/O modes:

  • direct-io
  • cached
    • write-through
    • writeback-cache

The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the FUSE_OPEN reply.

In direct-io mode the page cache is completely bypassed for reads and writes. No read-ahead takes place. Shared mmap is disabled.

In cached mode reads may be satisfied from the page cache, and data may be read-ahead by the kernel to fill the cache. The cache is always kept consistent after any writes to the file. All mmap modes are supported.

The cached mode has two sub modes controlling how writes are handled. The write-through mode is the default and is supported on all kernels. The writeback-cache mode may be selected by the FUSE_WRITEBACK_CACHE flag in the FUSE_INIT reply.

In write-through mode each write is immediately sent to userspace as one or more WRITE requests, and any cached pages are updated (previously uncached but fully written pages are cached as well). No READ requests are ever sent for writes, so when an uncached page is partially written, the page is discarded.

In writeback-cache mode (enabled by the FUSE_WRITEBACK_CACHE flag) writes go to the cache only, which means that the write(2) syscall can often complete very fast. Dirty pages are written back implicitly (background writeback or page reclaim on memory pressure) or explicitly (invoked by close(2), fsync(2) and when the last ref to the file is being released on munmap(2)). This mode assumes that all changes to the filesystem go through the FUSE kernel module (size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so it’s generally not suitable for network filesystems. If a partial page is written, then the page needs to be first read from userspace. This means that even for files opened for O_WRONLY it is possible that READ requests will be generated by the kernel.
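Both cached sub modes treat fully and partially written pages differently, so it can help to see which pages of a write fall into which group. The sketch below assumes a page size of 4096 bytes and that none of the touched pages is cached yet; the function name `classify_pages` is hypothetical. Under those assumptions, the partial pages are the ones discarded in write-through mode and the ones that trigger a READ request in writeback-cache mode.

```python
def classify_pages(offset, length, page_size=4096):
    """Split the pages touched by a write into (full, partial) index lists."""
    full, partial = [], []
    if length <= 0:
        return full, partial
    end = offset + length              # one past the last byte written
    first = offset // page_size
    last = (end - 1) // page_size
    for page in range(first, last + 1):
        page_start = page * page_size
        page_end = page_start + page_size
        if offset <= page_start and end >= page_end:
            full.append(page)          # every byte of the page is written
        else:
            partial.append(page)       # only part of the page is written
    return full, partial


if __name__ == "__main__":
    # A write of 5000 bytes at offset 100 covers no page completely.
    print(classify_pages(100, 5000))
```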