Linux下新系统调用sync_file_range提高数据sync的效率
我们在做数据库程序或者IO密集型的程序的时候,通常在更新的时候,比如说数据库程序,希望更新有一定的安全性,我们会在更新操作结束的时候调用fsync或者fdatasync来flush数据到持久设备去。而且通常是以页面为单位,16K一次或者4K一次。 安全性保证了,但是性能就有很大的损害。而且我们更新的时候,通常是更新文件的某一个页面,那么由于是更新覆盖操作,对文件系统的元数据来讲的话,无需变更,所以我们通常不大关心元数据是否写入。 当更新非常频繁的时候,我们时候能够有其他方法减少性能损失。sync_file_range同学出场了。
Linux下系统调用sync_file_range只在内核2.6.17及更高版本是可用的, 我们常用的RHEL 5U4是支持的。
下面是这篇文章的介绍:
The 2.6.17 kernel will include two new system calls which expand the capabilities available to user-space programs in some interesting ways. This article contains a look at the current form of these new interfaces.
splice()
The splice() system call has a long history. First proposed by Larry McVoy in 1998; it was seen as a way of improving I/O performance on server systems. Despite being often mentioned in the following years, no splice() implementation was ever created for the mainline Linux kernel. That situation changed, however, just before the 2.6.17 merge window was closed when Jens Axboe's splice() patch, along with a number of modifications, was merged.
As of this writing, the splice() interface looks like this:
long splice(int fdin, int fdout, size_t len, unsigned int flags);
A call to splice() will cause the kernel to move up to len bytes from the data source fdin to fdout. The data will move through kernel space only, with a minimum of copying. In the current implementation, at least one of the two file descriptors must refer to a pipe device. That requirement is a limitation of the current code, and it could be removed at some future time.
The flags argument modifies how the copy is done. Currently implemented flags are:
SPLICE_F_NONBLOCK: makes the splice() operations non-blocking. A call to splice() could still block, however, especially if either of the file descriptors has not been set for non-blocking I/O.
SPLICE_F_MORE: a hint to the kernel that more data will come in a subsequent splice() call.
SPLICE_F_MOVE: if the output is a file, this flag will cause the kernel to attempt to move pages directly from the input pipe buffer into the output address space, avoiding a copy operation.
Internally, splice() works using the pipe buffer mechanism added by Linus in early 2005 - that is why one side of the operation is required to be a pipe for now. There are two additions to the ever-growing file_operations structure for devices and filesystems which wish to support splice():
ssize_t (*splice_write)(struct inode *pipe, struct file *out,
size_t len, unsigned int flags);
ssize_t (*splice_read)(struct file *in, struct inode *pipe,
size_t len, unsigned int flags);
The new operations should move len bytes between pipe and either in or out, respecting the given flags. For filesystems, there are generic implementations of these operations which can be used; there is also a generic_splice_sendpage() which is used to enable splicing to a socket. As of this writing, there are no splice() implementations for device drivers, but there is nothing preventing such implementations in the future, for char devices at least.
Discussions on the linux-kernel suggest that the splice() interface could change before it is set in stone with the 2.6.17 release. Andrew Tridgell has requested that an offset argument be added to specify where copying should begin - either that, or a separate psplice() should be added. There is also some concern about error handling; if a splice() call returns an error, how does the application tell whether the problem is with the input or the output? Resolving those issues may require some interface changes over the next month or so.
sync_file_range()
Early in the 2.6.17 process, some changes to the posix_fadvise() system call were merged. The new, Linux-specific options were meant to give applications better control over how data written to files is flushed to the physical media. The capabilities provided are needed, but there were concerns about extending a POSIX-defined function in a Linux-specific way. So, after some discussions, Andrew Morton pulled that patch back out and replaced it with a new system call:
long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);
This call will synchronize a file's data to disk, starting at the given offset and proceeding for nbytes bytes (or to the end of the file if nbytes is zero). How the synchronization is done is controlled by flags:
SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any already in-progress writeout of pages (in the given range) completes.
SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the given range which are not already under I/O.
SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until the newly-initiated writes complete.
An application which wants to initiate writeback of all dirty pages should provide the first two flags. Providing all three flags guarantees that those pages are actually on disk when the call returns.
The new implementation avoids distorting the posix_fadvise() system call. It also allows synchronization operations to be performed with a single call, instead of the multiple calls required by the previous attempt. In the future, it may also be possible to add other operations to the flags list; the ability to request metadata synchronization seems to be high on the list.
(Thanks to Michael Kerrisk - who agitated for this change - for providing some of the background information).
也可以man sync_file_range下他的具体作用, 但是请注意,sync_file_range是不可移植的。
sync_file_range – sync a file segment with disk
sync_file_range() permits fine control when synchronising the open file referred to by the file descriptor fd with disk.
offset is the starting byte of the file range to be synchronised. nbytes specifies the length of the range to be synchronised, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronised. Synchronisation is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.
sync_file_range可以让我们在做多个更新后,一次性的刷数据,这样大大提高IO的性能。 具体的实现在fs/sync.c里面,有兴趣的同学可以围观下。
著名的fio测试工具支持sync_file_range来做sync操作,我们在决定在我们的应用中使用该syscall之前不妨先fio测试一把。
祝大家玩的开心。