Linux select/poll/epoll Internals (Part 1): Implementation Basics

spartmap

2019-11-03


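All Linux source code referenced in this series is based on linux-4.14.143.

1. File abstraction and the poll operation

1.1 File abstraction

In the Linux kernel, a file is an abstraction: a device is a file, and a network socket is a file too.

The capabilities that every file abstraction can provide are defined in the file_operations struct.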

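In Linux, each open file is identified by a file descriptor (FD). An FD is just an integer: the kernel keeps the files a process has opened in an array, and the FD is the index into that array.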
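The "FD is an array index" idea can be observed from user space. POSIX requires open() to return the lowest-numbered descriptor that is not currently open in the process, so closing a descriptor and opening a file again should hand back the same slot. The following is only a minimal sketch: the helper name reopen_reuses_slot is made up for illustration, and it assumes a single-threaded process (no other thread opening files in between).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens `path`, closes it, opens it again, and
 * reports whether the second open() got the same descriptor number.
 * Returns 1 if the slot was reused, 0 if not, -1 if open() failed.
 */
int reopen_reuses_slot(const char *path)
{
    assert(path != NULL);

    int first = open(path, O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                 /* frees slot `first` in the process's FD table */

    int second = open(path, O_RDONLY);
    if (second < 0)
        return -1;

    int reused = (second == first);
    close(second);
    return reused;
}
```

Calling reopen_reuses_slot("/dev/null") in a single-threaded program is expected to return 1, because the slot freed by close() is the lowest free index when the second open() runs.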

The definition of these capabilities:

// Source: include/linux/fs.h
struct file_operations {
 struct module *owner;
 loff_t (*llseek) (struct file *, loff_t, int);
 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
 int (*iterate) (struct file *, struct dir_context *);
 int (*iterate_shared) (struct file *, struct dir_context *);
 // The foundation that select/poll/epoll are built on:
 // a non-blocking function that polls the file's state
 unsigned int (*poll) (struct file *, struct poll_table_struct *);
 // other function pointers omitted
} __randomize_layout;
// Source: include/linux/poll.h
// Queue-processing callback for poll
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
typedef struct poll_table_struct {
 // Queue-processing callback that a file's file_operations.poll
 // implementation invokes through poll_wait (when it is set)
 poll_queue_proc _qproc;
 // Events the poll operation is interested in
 unsigned long _key;
} poll_table;
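The dispatch idea behind file_operations can be imitated in user space with a struct of function pointers. This is only a toy sketch, not kernel code: every toy_ name and TOY_ constant is invented for illustration, and the fallback for a file type without a poll method loosely mirrors how select and poll in this kernel version treat such files as always ready.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POLLIN  0x1u
#define TOY_POLLOUT 0x4u

struct toy_file;

/* Stand-in for file_operations: one table of function pointers per file type. */
struct toy_file_ops {
    unsigned int (*poll)(struct toy_file *f);
};

/* Stand-in for struct file: points at the ops table of its type. */
struct toy_file {
    const struct toy_file_ops *ops;
    int pending_bytes;            /* only meaningful for the toy pipe */
};

/* A device that is always ready in both directions. */
static unsigned int toy_always_ready_poll(struct toy_file *f)
{
    (void)f;
    return TOY_POLLIN | TOY_POLLOUT;
}

/* A pipe-like file that is readable only when data is pending. */
static unsigned int toy_pipe_poll(struct toy_file *f)
{
    return f->pending_bytes > 0 ? TOY_POLLIN : 0;
}

const struct toy_file_ops toy_always_ready_ops = { .poll = toy_always_ready_poll };
const struct toy_file_ops toy_pipe_ops         = { .poll = toy_pipe_poll };
const struct toy_file_ops toy_plain_ops        = { .poll = NULL };

/* The caller never needs to know which kind of file it is holding. */
unsigned int toy_vfs_poll(struct toy_file *f)
{
    assert(f != NULL && f->ops != NULL);
    if (f->ops->poll == NULL)     /* no poll method: report "always ready" */
        return TOY_POLLIN | TOY_POLLOUT;
    return f->ops->poll(f);
}
```

The point of the design is that toy_vfs_poll, like the code behind select/poll/epoll, works against the function-pointer table alone, so any new file type only has to supply its own poll method.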

1.2 The file poll operation

Prototype of the poll function:

unsigned int (*poll) (struct file *, poll_table *);
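/**
 * Invoke the poll_table's callback, if it has one.
 *
 * @filp         the target file being monitored
 * @wait_address head of the wait queue for the events being monitored
 * @p            the poll_table passed down from the select/poll/epoll call
 */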
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
 if (p && p->_qproc && wait_address)
  p->_qproc(filp, wait_address, p);
}

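A concrete poll implementation for a file must do two things (these two points are effectively the contract):

1. Call poll_wait on every wait queue whose events the poll function cares about, so that the caller can be woken later. The implementation must treat the poll_table argument as an opaque object; it does not need to know its internal structure.

2. Return a bitmask describing the operations that can be performed right now without blocking.

Below is an example poll implementation from a driver, taken from https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch05s03.html :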

unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
 Scull_Pipe *dev = filp->private_data;
 unsigned int mask = 0;
 /*
 * The buffer is circular; it is considered full
 * if "wp" is right behind "rp". "left" is 0 if the
 * buffer is empty, and it is "1" if it is completely full.
 */
 int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
 // call poll_wait on each of the device's wait queues
 poll_wait(filp, &dev->inq, wait);
 poll_wait(filp, &dev->outq, wait);
 /* readable */
 if (dev->rp != dev->wp) mask |= POLLIN | POLLRDNORM;
 /* writable */
 if (left != 1) mask |= POLLOUT | POLLWRNORM;
 return mask;
}
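Seen from user space, the mask a file's poll implementation returns is what ends up in revents. The sketch below uses the real poll(2) system call on an ordinary pipe rather than the scull device; the helper names poll_now and pipe_readiness_demo are made up for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: polls one descriptor for `events` without blocking
 * (timeout 0) and returns the revents mask, or -1 if poll() itself fails.
 */
short poll_now(int fd, short events)
{
    assert(fd >= 0);

    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return pfd.revents;           /* 0 when nothing in `events` is ready */
}

/*
 * Returns 1 if an empty pipe is writable but not readable, and becomes
 * readable after one byte is written; 0 otherwise; -1 if pipe() fails.
 */
int pipe_readiness_demo(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int empty_readable = (poll_now(fds[0], POLLIN) & POLLIN) != 0;
    int empty_writable = (poll_now(fds[1], POLLOUT) & POLLOUT) != 0;

    ssize_t written = write(fds[1], "x", 1);
    int readable_after = (poll_now(fds[0], POLLIN) & POLLIN) != 0;

    close(fds[0]);
    close(fds[1]);
    return written == 1 && !empty_readable && empty_writable && readable_after;
}
```

A timeout of 0 makes poll() return immediately, which matches the "report what is possible right now" half of the contract; the waiting half only comes into play when the caller is prepared to sleep.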

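2. Waiting and wakeup in poll

The poll_table that the poll function receives carries only two things: the queue-processing function _qproc and the events of interest, _key.

When a concrete file implementation is set up, it initializes one or more event wait queues of type wait_queue_head_t.

The poll wait process:

1. When the poll function is called, its implementation calls poll_wait, which in turn calls _qproc.

2. _qproc builds a wait node that contains a wait_queue_entry struct (for select, the node is a poll_table_entry struct) and adds that wait_queue_entry to wait_address, the wait queue of the file being monitored. The wait_queue_entry specifies the wakeup function to run when an event occurs; select, for example, sets it to pollwake.

3. The poll function returns a bitmask of the operations that can currently be performed on the file without blocking.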
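The three steps above can be sketched as a user-space toy model. None of this is kernel code: every toy_ type, function, and TOY_ constant is invented for illustration, the wait queue is a fixed-size array instead of a linked list, and the poll table stores its own entries directly, whereas the kernel embeds poll_table in a larger caller-owned structure.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POLLIN      0x1u
#define TOY_POLLOUT     0x4u
#define TOY_MAX_WAITERS 8

struct toy_wait_entry;
typedef int (*toy_wake_func)(struct toy_wait_entry *entry, void *key);

/* Stand-in for wait_queue_entry: carries the wakeup callback. */
struct toy_wait_entry {
    toy_wake_func func;
    void *private_data;
};

/* Stand-in for wait_queue_head_t. */
struct toy_wait_queue {
    struct toy_wait_entry *entries[TOY_MAX_WAITERS];
    size_t count;
};

struct toy_poll_table;
typedef void (*toy_queue_proc)(struct toy_wait_queue *q, struct toy_poll_table *pt);

/* Stand-in for poll_table, plus storage for the caller's wait nodes. */
struct toy_poll_table {
    toy_queue_proc qproc;
    struct toy_wait_entry slots[TOY_MAX_WAITERS];
    size_t used;
};

/* Mirrors poll_wait(): call the queue proc only if everything is present. */
void toy_poll_wait(struct toy_wait_queue *q, struct toy_poll_table *pt)
{
    if (pt && pt->qproc && q)
        pt->qproc(q, pt);
}

/* Placeholder wakeup callback that a select-like caller would install. */
int toy_wake(struct toy_wait_entry *entry, void *key)
{
    (void)entry;
    (void)key;
    return 1;
}

/* Step 2: the caller's queue proc builds an entry and links it into the queue. */
void toy_select_qproc(struct toy_wait_queue *q, struct toy_poll_table *pt)
{
    assert(pt->used < TOY_MAX_WAITERS && q->count < TOY_MAX_WAITERS);
    struct toy_wait_entry *e = &pt->slots[pt->used++];
    e->func = toy_wake;
    e->private_data = pt;
    q->entries[q->count++] = e;
}

/* A toy "device" with separate read and write wait queues. */
struct toy_device {
    struct toy_wait_queue inq, outq;
    int readable, writable;
};

/* Steps 1 and 3: register on both queues, then report the current mask. */
unsigned int toy_device_poll(struct toy_device *dev, struct toy_poll_table *pt)
{
    unsigned int mask = 0;
    toy_poll_wait(&dev->inq, pt);
    toy_poll_wait(&dev->outq, pt);
    if (dev->readable) mask |= TOY_POLLIN;
    if (dev->writable) mask |= TOY_POLLOUT;
    return mask;
}
```

With qproc set to toy_select_qproc, one call to toy_device_poll leaves one entry on each of the device's two queues and still returns the current mask. With qproc set to NULL, the same call only reports the mask and registers nothing, which is the behaviour the guard in poll_wait provides.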

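The wakeup process when an event occurs:

1. When an event occurs, the concrete file implementation walks the wait queue and calls each entry's wakeup function, which performs the actual wakeup. The wakeup function has the type typedef int (*wait_queue_func_t)(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key).

2. The wakeup function uses the wait_queue_entry to find the wait node that _qproc built, uses the node's data to decide whether a wakeup is needed, and if so wakes the waiting process.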
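The wakeup side can be sketched in the same toy style. Again this is a user-space model with invented toy_ names, not kernel code: the mode and flags parameters of wait_queue_func_t are dropped, the key is a plain unsigned long instead of a void pointer, and "waking the process" is reduced to setting a flag. The key check is modelled on what pollwake does for select.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POLLIN      0x1ul
#define TOY_POLLOUT     0x4ul
#define TOY_MAX_WAITERS 8

struct toy_wait_entry;
typedef int (*toy_wake_func)(struct toy_wait_entry *entry, unsigned long key);

/* Stand-in for wait_queue_entry. */
struct toy_wait_entry {
    toy_wake_func func;
};

/* Stand-in for the caller's wait node (poll_table_entry in select). */
struct toy_wait_node {
    struct toy_wait_entry wait;   /* embedded queue entry */
    unsigned long key;            /* events this waiter cares about */
    int triggered;                /* set when the waiter should run */
};

/* Stand-in for wait_queue_head_t. */
struct toy_wait_queue {
    struct toy_wait_entry *entries[TOY_MAX_WAITERS];
    size_t count;
};

/* Recover the enclosing node from the embedded entry (the container_of idea). */
static struct toy_wait_node *toy_node_of(struct toy_wait_entry *entry)
{
    return (struct toy_wait_node *)((char *)entry - offsetof(struct toy_wait_node, wait));
}

/* Wakeup function: ignore events the waiter did not ask for. */
int toy_pollwake(struct toy_wait_entry *entry, unsigned long key)
{
    struct toy_wait_node *node = toy_node_of(entry);
    if (key && !(key & node->key))
        return 0;
    node->triggered = 1;          /* the kernel would wake the sleeping task here */
    return 1;
}

/* What a queue proc would do: fill in the node and link its entry. */
void toy_add_waiter(struct toy_wait_queue *q, struct toy_wait_node *node, unsigned long key)
{
    assert(q->count < TOY_MAX_WAITERS);
    node->wait.func = toy_pollwake;
    node->key = key;
    node->triggered = 0;
    q->entries[q->count++] = &node->wait;
}

/* Event source side: walk the queue and call each entry's wakeup function. */
int toy_wake_up(struct toy_wait_queue *q, unsigned long key)
{
    int woken = 0;
    for (size_t i = 0; i < q->count; i++)
        woken += q->entries[i]->func(q->entries[i], key);
    return woken;
}
```

If one waiter registers for TOY_POLLIN and another for TOY_POLLOUT on the same queue, toy_wake_up(&q, TOY_POLLIN) triggers only the first. A key of 0 is treated as "no event information", so every waiter is triggered and has to re-check the file's state itself.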

A small point of confusion: