Google's Chromium sandbox [LWN.net]

By Jake Edge
August 19, 2009

Creating a sandbox—a safe area in which to run untrusted code—is a difficult problem. The successful sandbox implementations tend to come with completely new languages (e.g. Java) that are specifically designed to support that functionality. Trying to sandbox C code is a much more difficult task, but one that the Google Chrome web browser team has been working on.

The basic idea is to restrict the WebKit-based renderer—along with the various image and other format libraries that are linked to it—so that browser-based vulnerabilities are unable to affect the system as a whole. A successful sandbox for the browser would eliminate a whole class of problems that plague Firefox and other browsers that require frequent, critical security updates. Essentially, the browser would protect users from bugs in the rendering of maliciously-crafted web pages, so that they could not lead to system or user data compromise.

The Chrome browser, and its free software counterpart, Chromium, is designed around the idea of separate processes for each tab, both for robustness and security. A misbehaving web page can only affect the process controlling that particular tab, so it won't bring the entire browser down if it causes the process to crash. In addition, these processes are considered to be "untrusted", in that they could have been compromised by some web page exploiting a bug in the renderer. The sandbox scheme works by severely restricting the actions that untrusted processes can take directly.

At some level, Linux already has a boundary that isolates programs from the underlying system: system calls. A program that does no system calls should not be able to affect anything else, at least permanently. But it is a trivial program indeed that does not need to call on some system services. A largely unknown kernel feature, seccomp, allows processes to call a very small subset of system calls—just read(), write(), sigreturn(), and exit()—aborting a process that attempts to call any other. That is the starting point for the Chromium sandbox.

But, there are other system calls that the browser might need to make. For one thing, memory allocation might require the brk() system call. Also, the renderer needs to be able to share memory with the X server for drawing. And so on. Any additional system calls, beyond the four that seccomp allows, have to be handled differently.

A proposed change to seccomp that would allow finer-grained control over which system calls were allowed didn't get very far. In any case, that wasn't a near-term solution, so Markus Gutschke of the Chrome team went in another direction. By splitting the renderer process into trusted and untrusted threads, some system calls could be allowed for the untrusted thread by making the equivalent of a remote procedure call (RPC) to the trusted thread. The trusted thread could then verify that the system call, and its arguments, were reasonable and, if so, perform the requested action.

Chrome team member Adam Langley describes it this way:

So that's what we do: each untrusted thread has a trusted helper thread running in the same process. This certainly presents a fairly hostile environment for the trusted code to run in. For one, it can only trust its CPU registers - all memory must be assumed to be hostile. Since C code will spill to the stack when needed and may pass arguments on the stack, all the code for the trusted thread has to [be] carefully written in assembly.

The trusted thread can receive requests to make system calls from the untrusted thread over a socket pair, validate the system call number and perform them on its behalf. We can stop the untrusted thread from breaking out by only using CPU registers and by refusing to let the untrusted code manipulate the VM in unsafe ways with mmap, mprotect etc.

There are still problems with that approach, however. For one thing, the renderer code is large, with many different system calls scattered throughout. Turning each of those into an RPC is possible, but then would have to be maintained by the Chromium developers going forward. The upstream projects (WebKit, et. al.) would not be terribly interested in those changes, so each new revision from upstream would need to be patched and then checked for new system calls.

Another approach might be to use LD_PRELOAD trickery to intercept the calls in glibc. That has its own set of problems as Langley points out: "we could try and intercept at dynamic linking time, assuming that all the system calls are via glibc. Even if that were true, glibc's functions make system calls directly, so we would have to patch at the level of functions like printf rather than write."

So, a method of finding and patching the system calls at runtime was devised. It uses a disassembler on the executable code, finds each system call and turns it into an RPC to the trusted thread. Correctly parsing x86 machine code is notoriously difficult, but it doesn't have to be perfect. Because the untrusted thread runs in seccomp mode, any system call that is missed will not lead to a security breach, as the kernel will abort the thread if it attempts any but the trusted four system calls. As Langley puts it:

But we don't need a perfect disassembler so long as it works in practice for the code that we have. It turns out that a simple disassembler does the job perfectly well, with only a very few corner cases.

The last piece of the puzzle is handling time-of-check-to-time-of-use race conditions. System call arguments that are passed in memory, via pointers or for system calls with too many arguments to fit in registers, can be changed by the, presumably subverted, untrusted thread between the time they are checked for validity and when they are used. To handle that, a trusted process, which is shared between all of the renderers, is created to check system calls that cannot be verified within the address space of the untrusted renderer.

The trusted process shares a few pages of memory with each trusted thread, which are read-only to the trusted thread, and read-write for the trusted process. System calls that cannot be handled by the trusted thread, either because some of the arguments live in memory, or because the verification process is too complex to be reasonably done in assembly code, are handed off to the trusted process. The arguments are copied by the trusted process into its address space, so they are immune to changes from the untrusted code.

While the current implementation is for x86 and x86-64—though there are still a few issues to be worked out with the V8 Javascript engine on x86-64—there is a clear path for other architectures. Adapting or writing a disassembler and writing the assembly language trusted thread are the two pieces needed to support each additional architecture. According to Langley:

The former is probably easier on many other architectures because they are likely to be more RISC like. The latter takes some work, but it's a coding problem only at this point.

There are some potential pitfalls in this sandbox mechanism. Bugs in the implementation of the trusted pieces—either coding errors or mistakes made in determining which system calls and arguments are "safe"—could certainly lead to problems. Currently, deciding which calls to allow is done on an ad hoc basis, by running the renderer, seeing which calls it makes, and deciding which are reasonable. The outcome of those decisions are then codified in syscall_table.c.

One additional, important area that is not covered by the sandbox are plugins like Flash. Restricting what plugins can do does not fit well with what users expect, which makes plugins a major vector for attack. Langley said that the plugin support on Linux is relatively new, but "our experience on Windows is that, in order for Flash to do all the things that various sites expect it to be able to do, the sandbox has to be so full of holes that it's rather useless". He is currently looking at SELinux as a way to potentially restrict plugins, but, for now, they are wide open.

This is a rather—some would say overly—complex scheme. It is still in the experimental stage, so changes are likely, but it does show one way to protect browser users from bugs in the HTML renderer that might lead to system or data compromise. It certainly doesn't solve all of the web's security problems, but could, over time, largely eliminate a whole class of attacks. It is definitely a project worth keeping an eye on.

[ Many thanks to Adam Langley, whose document was used as a basis for this article, and who patiently answered questions from the author. ]

(Log in to post comments)

Google's Chromium sandbox

Posted Aug 19, 2009 16:23 UTC (Wed) by jake (editor, #205) [Link]

I should have been more clear about why a thread is needed. Certain operations, memory allocation for example, cannot be done in one process on behalf of another because they don't share address space.

I don't think, but don't know for sure, that it is required to have a thread to do the disassembling. I believe that is done by the untrusted thread before it handles any user input, and before it enters seccomp mode.

jake

Google's Chromium sandbox

Posted Aug 20, 2009 0:43 UTC (Thu) by cventers (subscriber, #31465) [Link]

I should have been more clear about why a thread is needed. Certain operations, memory allocation for example, cannot be done in one process on behalf of another because they don't share address space.

On the contrary, I experimented with a technique to do just that. This may not be the perfect solution for Chrome's needs, but I played around with the idea of open()ing a shared memory segment on the vfs, using ftruncate() to resize it, and then sending the fd via a UNIX-domain socket to the untrusted process and allowing it to mmap() the pages.

Now, in my case, I was using this technique to allow dynamically-grown, runtime-allocated shared memory segments between untrusted processes. There are still complications (such as the need to install a SIGBUS handler since the untrusted process might ftruncate() the mmaped fd() to 0, causing the trusted process to fault when it tries to access its own mmap()), and perhaps the requirements for this kind of an implementation are not easy to satisfy for desktop applications. But it's Linux, and there's more than one way to do it. My implementation had the advantage of being architecture-agnostic, as well-behaved user-space code should be.

Google's Chromium sandbox

Posted Aug 20, 2009 0:58 UTC (Thu) by agl (subscriber, #4541) [Link]

That seems like a perfectly reasonable way to allocate memory for another
process. However, we would still need non-seccomp processes to receive the
file descriptor from the socket (recvmsg) and to do the mmap. The first
process need only share the descriptor table with the untrusted process, but
the second needs to share an address space for mmap to be effective. We
merge these two processes into one and, since it shares an address space, we
call it the 'trusted thread'.

Google's Chromium sandbox

Posted Aug 20, 2009 8:59 UTC (Thu) by mingo (subscriber, #31122) [Link]

Btw., (and i raised this on lkml too in the past - at that time the code i referred to was not upstream yet), there's a way you could further increase the restrictions (and hence, the security) of the untrusted seccomp thread: by the use of the C expressions filter engine that is included in the upstream kernel. (right now used by ftrace and will also be used by perfcounters)

The engine accepts an ASCII C-ish expression runtime, such as:

 "fd <= 2 && addr == 0x1234000 && len == 4096"

... and turns/parses that into a cached list of safe predicaments that the kernel will execute atomically on syscall arguments. Once parsed (by the kernel), the execution of the filter expression is very fast.

Despite it being used for tracing currently, the filter engine is generic and can be reused not just to limit trace entries of syscalls, but also to restrict execution on syscalls.

This is real, working code very close to what you need. With latest -tip you can use the filter engine on a per syscall basis, and the kernel knows about the parameter names of system calls. So on a testbox i can do this:

  # cd /debug/tracing/events/syscalls/sys_enter_read

  # echo "fd <= 2 && buf == 0x120000 && count == 1024" > filter

  # cat filter 
  fd <= 2 && buf == 0x120000 && count == 1024

... and from that point on the kernel can execute that filter expression to limit trace entries that match the expression.

All you need is a small extension to seccomp to allow the installation of such expressions from user-space, by passing in the ASCII string. The filter engine can be used by unprivileged user-space as well. (but obviously the untrusted sandboxed thread should not be allowed to modify it.)

The filter engine has no deep dependence on tracing (other than being used by it currently) - it is a safe parser and atomic script execution engine that can be utilized by unprivileged tasks too and so it could be reused in seccomp and could be reused by other Linux security frameworks as well, such as selinux or netfilter.

Google's Chromium sandbox

Posted Aug 20, 2009 14:41 UTC (Thu) by paragw (subscriber, #45306) [Link]

Does this approach work on a per process basis? I.e. do the restrictions
apply to a particular process/thread while others are not impacted?

How would one deal with which process can specify which other process or
thread can do what syscalls with what arguments and is the change permanent
and localized w.r.t the target thread? How does one go about safely modifying
the restrictions dynamically - the restricted thread needs to open a FD with
user permission that wasn't in the originally specified restrictions list?

From what you described there seem to be some significant usability problems
(need to have tracing enabled, debug file system mounted, user-space access
to the filtering mechanism and per PID operation etc.) that need to be
addressed before it can become generally usable?

Google's Chromium sandbox

Posted Aug 20, 2009 19:33 UTC (Thu) by mingo (subscriber, #31122) [Link]

Does this approach work on a per process basis? I.e. do the restrictions apply to a particular process/thread while others are not impacted?

It's an engine - and as such it takes ASCII strings, turns them into a 'filter object' in essence which you can then attach to anything and pass in values to evaluate.

Note that there's nothing 'tracing' about that concept.

Right now we attach such filters to tracepoints - such as syscall tracepoints.

It could be attached via seccomp and to an untrusted process as well, with minimal amount of code, if there's interest to share this facility for such purposes.

Generic sandbox needed

Posted Sep 4, 2009 20:18 UTC (Fri) by cmccabe (subscriber, #60281) [Link]

> It seems to me that this kind of sandboxing is required by many (all?)
> programs dealing with potentially hostile data....
>
> If the kernel would provide a flexible mechanism for an application to
> limit what it can do, the threat of hostile data could be reduced.

I thought that this was what selinux was all about.

The basic idea behind selinux is that rather than using identity-based security, you use capability-based security

Identity-based security works like this: I am a process started by bob, therefore I can do everything bob can do. Capability-based security works like this: bob starts a process and gives it only the capabilities it needs to do the work it's supposed to do.

So bob runs a spell-checker program (aspell or whatever), it shouldn't have the capability to open network sockets and send messages to evilhackers.com. It's the difference between giving the application a few keys, to open the doors it needs, and giving it the whole keyring, which is what we do with traditional uid / gid based security.

It seems like what the google people are trying to do here is to reinvent the selinux concept with seccomp. I'm curious as to why. I guess selinux is difficult to set up and configure, and a lot of distributions have been slow to adopt it. Perhaps they are also trying to be cross-platform?

I'm also curious why Google is using threads rather than processes here. If you don't want to share your memory with the untrusted guy, processes are the obvious solution. As other have noted, you can always use posix shared memory if you feel the need to directly access the memory of the untrusted guy. As a bonus, you could run the untrusted processes as "nobody," and prevent them from doing a lot of nasty things -- even on a system like openBSD, where seccomp and selinux are unheard-of.

P.S.
I seem to remember that the openBSD ssh daemon was written in a similar way. There was an trusted part which ran as root, and an untrusted part which ran as a regular user.