Published on 31st of July 2018

fastcat - A Faster `cat` Implementation Using Splice

fastcat logo

Lots of people asked me to write another piece about the internals of well-known Unix commands. Well, actually, nobody asked me, but it makes for a good intro. I'm sure you’ve read the previous parts about yes and ls — they are epic.

Anyway, today we talk about cat, which is used to concatenate files - or, more commonly, abused to print a file's contents to the screen.

# Concatenate files, the intended purpose
cat input1.txt input2.txt input3.txt > output.txt

# Print file to screen, the most common use case
cat myfile

Implementing cat

Here's a naive cat in Ruby:

#!/usr/bin/env ruby

def cat(args)
  args.each do |arg|
    IO.foreach(arg) do |line|
      puts line
    end
  end
end

cat(ARGV)

This program goes through each file and prints its contents line by line. Easy peasy! But wait, how fast is this tool?

I quickly created a random 2 GB file for the benchmark.

Let's compare the speed of our naive implementation with the system one using the awesome pv (Pipe Viewer) tool. All tests are averaged over five runs on a warm cache (file in memory).

# Ruby 2.5.1
> ./rubycat myfile | pv -r > /dev/null
[196MiB/s]

Not bad, I guess? How does it compare with my system's cat?

cat myfile | pv -r > /dev/null
[1.90GiB/s]

Uh oh, GNU cat is ten times faster than our little Ruby cat. 🐌

Making our Ruby cat a little faster

Our naive Ruby code can be tweaked a bit. Turns out line buffering hurts performance in the end1:

#!/usr/bin/env ruby

def cat(args)
  args.each do |arg|
    IO.copy_stream(arg, STDOUT)
  end
end

cat(ARGV)
rubycat myfile | pv -r > /dev/null
[1.81GiB/s]

Wow... we didn't really try hard, and we're already approaching the speed of a tool that gets optimized since 1971. 🎉

But before we celebrate too much, let's see if we can go even faster.

Splice

What initially motivated me to write about cat was this comment by user wahern on HackerNews:

I'm surprised that neither GNU yes nor GNU cat uses splice(2).

Could this splice thing make printing files even faster? — I was intrigued.

Splice was first introduced to the Linux Kernel in 2006, and there is a nice summary from Linus Torvalds himself, but I prefer the description from the manpage:

splice() moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe.

If you really want to dig deeper, here's the corresponding source code from the Linux Kernel, but we don't need to know all the nitty-gritty details for now. Instead, we can just inspect the header from the C implementation:

#include <fcntl.h>

ssize_t splice (int fd_in, loff_t *off_in, int fd_out,
                loff_t *off_out, size_t len,
                unsigned int flags);

To break it down even more, here's how we would copy the entire src file to dst:

const ssize_t r = splice (src, NULL, dst, NULL, size, 0);

The cool thing about this is that all of it happens inside the Linux kernel, which means we won't copy a single byte to userspace (where our program runs). Ideally, splice works by remapping pages and does not actually copy any data, which may improve I/O performance (reference).

File icon by Aleksandr Vector from the Noun Project.
terminal icon by useiconic.com from the Noun Project.

Using splice from Rust

I have to say I'm not a C programmer and I prefer Rust because it offers a safer interface. Here's the same thing in Rust:

#[cfg(any(target_os = "linux", target_os = "android"))]
pub fn splice(
    fd_in: RawFd,
    off_in: Option<&mut libc::loff_t>,
    fd_out: RawFd,
    off_out: Option<&mut libc::loff_t>,
    len: usize,
    flags: SpliceFFlags,
) -> Result<usize>

Now I didn't implement the Linux bindings myself. Instead, I just used a library called nix, which provides Rust friendly bindings to *nix APIs.

There is one caveat, though: We cannot really copy the file directly to standard out, because splice requires one file descriptor to be a pipe. The way around that is to create a pipe, which consists of a reader and a writer (rd and wr). We pipe the file into the writer, and then we read from the pipe and push the data to stdout.

You can see that I use a relatively big buffer of 16384 bytes (214) to improve performance.

extern crate nix;

use std::env;
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

use nix::fcntl::{splice, SpliceFFlags};
use nix::unistd::pipe;

const BUF_SIZE: usize = 16384;

fn main() {
    for path in env::args().skip(1) {
        let input = File::open(&path).expect(&format!("fcat: {}: No such file or directory", path));
        let (rd, wr) = pipe().unwrap();
        let stdout = io::stdout();
        let _handle = stdout.lock();

        loop {
            let res = splice(
                input.as_raw_fd(),
                None,
                wr,
                None,
                BUF_SIZE,
                SpliceFFlags::empty(),
            ).unwrap();

            if res == 0 {
                // We read 0 bytes from the input,
                // which means we're done copying.
                break;
            }

            let _res = splice(
                rd,
                None,
                stdout.as_raw_fd(),
                None,
                BUF_SIZE,
                SpliceFFlags::empty(),
            ).unwrap();
        }
    }
}

So, how fast is this?

fcat myfile | pv -r > /dev/null
[5.90GiB/s]

Holy guacamole. That's over three times as fast as system cat.

Operating System support

Nevertheless, in a production-grade implementation, the splice support could be activated on systems that support it, while using a generic implementation as a fallback.

Nice, but why on earth would I want that?

I have no idea. Probably you don't, because your bottleneck is somewhere else. That said, many people use cat for piping data into another process like

# Count all lines in C files
cat *.c | wc -l

or

cat kittens.txt | grep "dog"

In this case, if you notice that cat is the bottleneck try fcat (but first, try to avoid cat altogether).

With some more work, fcat could also be used to directly route packets from one network card to another, similar to netcat.

Lessons learned

That said, I still have no idea why GNU cat does not use splice on Linux. 🤔 The source code for fcat is on Github. Contributions welcome!

Thanks to Olaf Gladis for helping me run the benchmarks on his Linux machine and to Patrick Pokatilo and Simon Brüggen for reviewing drafts of the article.

Footnotes

1. Thanks to reader Freeky for making this code more idiomatic.
2. Thanks to reader masklinn for the hint.