
vmsplice(): the making of a local root exploit


By Jonathan Corbet
February 12, 2008
As this is being written, distributors are working quickly to ship kernel updates fixing the local root vulnerabilities in the vmsplice() system call. Unlike a number of other recent vulnerabilities which have required special situations (such as the presence of specific hardware) to exploit, these vulnerabilities are trivially exploited and the code to do so is circulating on the net. Your editor found himself wondering how such a wide hole could find its way into the core kernel code, so he set himself the task of figuring out just what was going on - a task which took rather longer than he had expected.

The splice() system call, remember, is a mechanism for creating data flow plumbing within the kernel. It can be used to join two file descriptors; the kernel will then read data from one of those descriptors and write it to the other in the most efficient way possible. So one can write a trivial file copy program which opens the source and destination files, then splices the two together. The vmsplice() variant connects a file descriptor (which must be a pipe) to a region of user memory; it is in this system call that the problems came to be.
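The file-copy idea can be sketched in a few lines of userspace C. The helper name copy_file() is invented here and error handling is trimmed for brevity; the intermediate pipe is needed because splice() requires one end of each transfer to be a pipe:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not from the article: copy src to dst by
 * splicing both files through an intermediate pipe.  Error handling
 * is trimmed to the bare minimum for the sketch. */
static int copy_file(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int p[2];
    ssize_t n = -1;

    if (in < 0 || out < 0 || pipe(p) < 0)
        return -1;

    /* Drain the source into the pipe, then the pipe into the target;
     * the kernel moves the data without a round trip through user space. */
    while ((n = splice(in, NULL, p[1], NULL, 65536, SPLICE_F_MOVE)) > 0)
        if (splice(p[0], NULL, out, NULL, n, SPLICE_F_MOVE) != n)
            return -1;

    close(p[0]); close(p[1]); close(in); close(out);
    return n < 0 ? -1 : 0;
}
```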

The first step in understanding this vulnerability is that, in fact, it is three separate bugs. When word of this problem first came out, it was thought to affect only 2.6.23 and 2.6.24 kernels. Changes to the vmsplice() code had caused the omission of a couple of important permissions checks. In particular, if the application had requested that vmsplice() move the contents of a pipe into a range of memory, the kernel didn't check whether that application had the right to write to that memory. So the exploit could simply write a code snippet of its choice into a pipe, then ask the kernel to copy it into a piece of kernel memory. Think of it as a quick-and-easy rootkit installation mechanism.

If the application is, instead, splicing a memory range into a pipe, the kernel must, first, read in one or more iovec structures describing that memory range. The 2.6.23 vmsplice() changes omitted a check on whether the purported iovec structures were in readable memory. This looks more like an information disclosure vulnerability than anything else - though, as we will see, it can be hard to tell sometimes.
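A userspace model of the dropped validation might look like the following; access_ok() and TASK_SIZE here are simplified stand-ins for the kernel's macro and constant, not the real implementations:

```c
#include <stdbool.h>
#include <sys/uio.h>

#define TASK_SIZE (1UL << 47)   /* illustrative user/kernel boundary */

/* Simplified model of the kernel's access_ok(): a user range is
 * plausible only if it doesn't wrap and stays below the boundary. */
static bool access_ok(const void *addr, unsigned long len)
{
    unsigned long a = (unsigned long)addr;
    return len <= TASK_SIZE && a <= TASK_SIZE - len;
}

static bool iovecs_readable(const struct iovec *iov, unsigned long nr)
{
    unsigned long i;

    /* The omitted check: is the iovec array itself readable memory? */
    if (!access_ok(iov, nr * sizeof(*iov)))
        return false;
    /* ...and does each described range lie in user space at all? */
    for (i = 0; i < nr; i++)
        if (!access_ok(iov[i].iov_base, iov[i].iov_len))
            return false;
    return true;
}
```

Note how an iov_len of ULONG_MAX fails the very first comparison in access_ok(); that shape of check is exactly what was missing.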

These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in the 2.6.23.15 and 2.6.24.1 kernel updates, released on February 8.

On February 10, Niki Denev pointed out that the kernel still appeared to be vulnerable after the fix. In fact, the vulnerability was the result of a different problem - and a much worse one, in that kernels all the way back to 2.6.17 are affected. At this point, a large proportion of running Linux systems are vulnerable. This one has been fixed in the 2.6.22.18, 2.6.23.16, and 2.6.24.2 kernels, also released on the 10th. With luck, all of these bugs have now been firmly stomped - though we still need to see a lot of distributor updates.

The problem, once again, is in the memory-to-pipe implementation. The function get_iovec_page_array() is charged with finding a set of struct page pointers corresponding to the array of iovec structures passed in by the calling application. Those pointers are stored in this array:

    struct page *pages[PIPE_BUFFERS];

Where PIPE_BUFFERS happens to be 16. In order to avoid overflowing this array, get_iovec_page_array() does the following check:

    npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (npages > PIPE_BUFFERS - buffers)
	npages = PIPE_BUFFERS - buffers;

Here, off is the offset into the first page of the memory to be transferred, len is the length passed in by the application, and buffers is the current index into the pages array.

Now, if we turn our attention to the exploit code for a moment, we see it setting up a number of memory areas with mmap(); some of that setup is not necessary for the exploit to work, as it turns out. At the end, the code does this (edited slightly):

    iov.iov_base = map_addr;
    iov.iov_len  = ULONG_MAX;
    vmsplice(pi[1], &iov, 1, 0);

The map_addr address points to one of the areas created with mmap() which, crucially, is significantly more than PIPE_BUFFERS pages long. And the length is passed through as the largest possible unsigned long value.

Now let's go back to fs/splice.c, where the vmsplice() implementation lives. We note that, prior to the fix, the kernel did not check whether the memory area pointed to by the iovec structure was readable by the calling process. Once again, this looks like an information disclosure vulnerability - the process could cause any bit of kernel memory to be written to the pipe, from which it could be read. But the exploit code is, in fact, passing in a valid pointer - it's just the length which is clearly absurd.

Looking back at the code which calculates npages, we see something interesting:

    npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (npages > PIPE_BUFFERS - buffers)
	npages = PIPE_BUFFERS - buffers;

Since len will be ULONG_MAX when the exploit runs, the addition will cause an integer overflow - with the effect that npages is calculated to be zero. Which, one would think, would cause no pages to be examined at all. Except that there is an unfortunate interaction with another part of the kernel.
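The wraparound is easy to reproduce in user space. This model uses illustrative x86 values (4096-byte pages); map_addr in the exploit is page-aligned, so off is zero:

```c
/* A userspace model of get_iovec_page_array()'s page count, using a
 * common x86 configuration (4096-byte pages; values are illustrative). */
#include <limits.h>

#define PAGE_SIZE  4096UL
#define PAGE_SHIFT 12

static unsigned long npages_for(unsigned long off, unsigned long len)
{
    /* The kernel's expression, verbatim in shape: */
    return (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
/* npages_for(0, 3 * PAGE_SIZE) == 3, as expected; but
 * npages_for(0, ULONG_MAX) wraps around and yields 0. */
```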

Once npages has been calculated, the next line of code looks like this:

    error = get_user_pages(current, current->mm,
		       	   (unsigned long) base, npages, 0, 0,
		       	   &pages[buffers], NULL);

get_user_pages() is the core memory management function used to pin a set of user-space pages into memory and locate their struct page pointers. While the npages variable passed as an argument is an unsigned quantity, the prototype for get_user_pages() declares it as a simple int called len. And, to complete the evil, this function processes pages in a do {} while(); loop which ends thusly:

	len--;
    } while (len && start < vma->vm_end);

So, if get_user_pages() is passed a len argument of zero, it will pass through the mapping loop once, decrement len to a negative number, then continue faulting in pages until it hits an address which lacks a valid mapping. At that point it will stop and return. But, by then, it may have stored far more entries into the pages array than the caller had allocated space for.
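The off-by-one-test behavior can be modeled in a few lines (this is not the kernel's code; pages_stored() and its parameters are stand-ins for the loop's real state):

```c
/* A userspace model of the tail of get_user_pages()'s mapping loop.
 * With len == 0 the do/while body runs once before the test, len--
 * leaves it at -1 (nonzero), and the loop only stops when the address
 * walks off the end of the mapping. */
static int pages_stored(int len, unsigned long start,
                        unsigned long vm_end, unsigned long page_size)
{
    int stored = 0;

    do {
        stored++;            /* the kernel stores a struct page * here */
        start += page_size;
        len--;
    } while (len && start < vm_end);

    return stored;
}
/* pages_stored(16, 0, 48 * 4096, 4096) == 16: the bound holds.
 * pages_stored(0,  0, 48 * 4096, 4096) == 48: the whole region. */
```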

The practical result in this case is that get_user_pages() faults in (and stores struct page pointers for) the entire region mapped by the exploit code. That region (by design) has more than PIPE_BUFFERS pages - in fact, it has three times that many, so 48 pointers get stored into a 16-pointer array. And this turns the failure to read-verify the source array into a buffer overflow vulnerability within the kernel. Once that is in place, it is a relatively straightforward exercise for any suitably 31337 hacker to cause the kernel to jump into the code of his or her choice. Game over. (Update: as a linux-kernel reader pointed out, the story is a little more complicated still at this point; this is an unusual sort of buffer overflow attack).

The fix which was applied simply checks the address range that the application is trying to splice into the pipe. Since a range of length ULONG_MAX is unlikely to be valid, the vulnerability is closed - as are any potential information disclosure problems.
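In the same simplified userspace terms as above (TASK_SIZE is an illustrative boundary, and this is a sketch of the style of check, not the actual patch), an overflow-safe range validation looks like:

```c
/* Sketch of an overflow-safe range check: reject empty ranges and
 * validate base/len in a way that cannot itself wrap around. */
#include <stdbool.h>

#define TASK_SIZE (1UL << 47)   /* illustrative user/kernel boundary */

static bool range_ok(unsigned long base, unsigned long len)
{
    if (!len)
        return false;
    /* Compare against the room left below the boundary instead of
     * computing base + len, which could overflow. */
    return base < TASK_SIZE && len <= TASK_SIZE - base;
}
/* range_ok(0x1000, 4096) is true; range_ok(0x1000, ULONG_MAX) is
 * false, closing the door on the exploit's absurd length. */
```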

This vulnerability is a clear example of how a seemingly read-only vulnerability can be escalated into something rather more severe. It also shows what can happen when certain types of sloppiness find their way into the code - if get_user_pages() is asked to get zero pages, that's how many it should do. Your editor is working on a patch to clean that up a bit. Meanwhile, everybody should ensure that they are running current kernels with the vulnerability closed.

Index entries for this article
Kernel: Security/Vulnerabilities
Kernel: splice()
Security: Linux kernel
Security: Vulnerabilities/Privilege escalation



vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 0:46 UTC (Tue) by jd (subscriber, #26381) [Link]

It's also a good reason for checking ranges. Putting unsigned values into signed variables
could do all kinds of ugly things, as could the reverse, or any other unwise casting of types.
The fact that so few bugs of this kind have turned up over the entirety of the kernel lifespan
merits applause to the superb kernel coders out there, but as this illustrates, these sorts of
bugs happen even to the best.

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 0:48 UTC (Tue) by jwb (guest, #15467) [Link]

Can anyone explain why the fix only applied to len and not off?  Is it because the base
address does not come from userspace?

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 9:11 UTC (Tue) by and (guest, #2883) [Link]

If my understanding of this is correct, then off is the offset within the 
first page (i.e. off is always smaller than PAGE_SIZE).

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 22:48 UTC (Tue) by jd (subscriber, #26381) [Link]

If it's just casting, then fixing an unexploitable casting bug is tidier than leaving it,
explicitly states intention, and prevents these souped-up aggressive optimizing compilers used
on the kernel from optimizing in a problem sometime down the road. On the other hand,
unnecessary changes introduce risks of adding as many problems as they fix.

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 1:10 UTC (Tue) by dw (guest, #12017) [Link]

One of the several things over the years that has kept my nose firmly outside of the kernel
source tree is the strong and regular repugnance I feel when delving inside - I mean, look at
the number of ad-hoc calculations, bitshifts and whatnot mentioned in this article. Look at
the escalation of a simple bug into a major one by moving a single test to the end of the loop
in get_user_pages().

How much of a size/performance win was gained in return for sacrificing nice, simple
abstractions (or even just a bunch of well documented macros!), and simple, readable loops?

I notice that the NetBSD kernel bears some of the same resemblance. I will quite shamelessly
say that my userland C code never has, nor ever will look as ugly, unreadable, or ultimately
undocumented as most of the code in kernel space. Is this part of the
genius-hacker-pissing-contest culture? Neither I nor my current employer would ever hire someone
who wrote code like this.

Sorry, rant over. Perhaps I'm just a stupid noob, or perhaps I have a hint of professionalism.
I can never quite work that out.

Argument goes back to Athens

Posted Feb 12, 2008 3:08 UTC (Tue) by smitty_one_each (subscriber, #28989) [Link]

As carried out by Plato and Aristotle, courtesy of Raphael:
http://abyss.uoregon.edu/~js/glossary/aristotle.html

This code is there not for performance

Posted Feb 12, 2008 7:32 UTC (Tue) by khim (subscriber, #9252) [Link]

It's easy to say in userspace code: we'll just never use structures as big as 2GiB or larger - then we can safely use signed offsets. Or you can use 64-bit offsets if you want 4GB so badly. The problem with the kernel is that both solutions are inadequate: 64-bit offsets will require full-sized locking on many architectures - and that can sometimes cost you a 100-150% slowdown (or even more in rare cases), while a 2GB limit will make the system unusable (ask the HURD guys, who had such a limit on partition size a few years ago). Thus kernel code is full of strange arithmetic where you try to calculate something in the range of terabytes (think disks) with just 32-bit integers. Of course it's not "clean and easy to read", but it's necessary if you want a production kernel and not a toy.

This code is there not for performance

Posted Feb 12, 2008 12:44 UTC (Tue) by mjthayer (guest, #39183) [Link]

Perhaps I misunderstood here, but I think that the original poster was saying something else -
not that you do not do all these things (strange arithmetic and suchlike), but that you keep
the code in one place (i.e. a set of macros) instead of duplicating it in lots of places.

Not that it would have helped much here, since the problem was a failure to validate user
input.

This code is there not for performance

Posted Feb 12, 2008 13:50 UTC (Tue) by ms (subscriber, #41272) [Link]

I'm obviously going to sound like a broken record again. The problem is validating incoming
data. Now you next need to ask, "why should we validate user input?" to which the answer is
"because the user can supply silly values" to which the next question is "well why can the
user to supply silly values?" to which the answer is "because the type of the values the user
is supplying are too wide". So basically, you're using the wrong types. Users should not be
allowed to present "bad" input: the type system should prevent it. In an ideal world...

This code is there not for performance

Posted Feb 12, 2008 14:38 UTC (Tue) by nix (subscriber, #2304) [Link]

That's nice in theory, but I'm not sure it's entirely practical to require the users to
provide a different pointer type for every possible VM range!

(Also, kernel/user boundaries are necessarily special places: typing across the boundary is a
matter of consensus as much as enforcement.)

This code is there not for performance

Posted Feb 12, 2008 14:49 UTC (Tue) by ms (subscriber, #41272) [Link]

Yeah, I don't disagree. It is a hard thing to do - you tend to get into messes with
dependently typed languages and so forth - one could easily argue that they're not quite ready
for writing kernels in!

Typically, you first implement maths in the type system, then you can implement a basic "is
smaller than" so then you could effectively arbitrarily refine the int range to be "the value
must be an int and must also be smaller than x". That kinda thing. Of course, as soon as you
hit "the value must be a function which will terminate", you're in trouble...!

This code is there not for performance

Posted Feb 12, 2008 15:05 UTC (Tue) by nix (subscriber, #2304) [Link]

There are a good few languages (Cayenne and Qi spring to mind) with type systems that are so
powerful that they themselves trip the halting problem: compilation is no longer guaranteed to
terminate. :)

I suppose a simple ranged length type (length must be >0) would have sufficed here: you
wouldn't need separate types for every possible valid pointer range.

Halting problem ? Ha!

Posted Feb 12, 2008 20:17 UTC (Tue) by khim (subscriber, #9252) [Link]

C++ has them beat: its type system doesn't just trigger the halting problem, it's Turing-complete!

Halting problem ? Ha!

Posted Feb 12, 2008 21:00 UTC (Tue) by nix (subscriber, #2304) [Link]

C++'s template expander is modelled on ML's pattern matching. Cayenne and 
Qi are both perhaps two generations beyond that (both type systems being 
more powerful than Haskell's) in different directions: personally I prefer 
Qi's, but part of that is probably because it's possible to bootstrap the 
Qi implementation without being an ultra-guru.

Of course, C++ compilers often *appear* to not halt when compiling 
anyway ;}

Halting problem ? Ha!

Posted Feb 14, 2008 22:01 UTC (Thu) by lysse (guest, #3190) [Link]

I don't know if you're joking about C++, but one of the notable things about Qi is that its
type system *is* Turing-complete, by intent and proof (someone implemented SK combinators in
it).

Halting problem ? Ha!

Posted Feb 15, 2008 0:06 UTC (Fri) by ms (subscriber, #41272) [Link]

No, nix wasn't joking. If you use C++ templates and limit yourself to numbers then it really
is turing complete. Haskell, with the right flags (undecidable instances and overlapping
instances) is also Turing complete. Cayenne is deliberately so and many people are now really
thinking that it's just better to permit Turing completeness and let the programmer take
responsibility.

The alternative is more like what Epigram is looking at (amongst others), where you limit
recursion to the structure of terms and prevent infinite structures. That way, you can
still guarantee termination. I fear we may now be some way from the original issue though ;)

Halting problem ? Ha!

Posted Feb 15, 2008 21:48 UTC (Fri) by nix (subscriber, #2304) [Link]

I used the wrong terms, really. What Haskell, Cayenne and Qi provide over 
C++ is (radically) greater *expressiveness*. The syntax of Qi type 
definitions is especially strange, but it's a hell of a lot saner than 
trying to define anything complicated using C++ templates.

This code is there not for performance

Posted Feb 13, 2008 15:45 UTC (Wed) by werth1 (guest, #48435) [Link]

The answer to the second question, "why can the user supply silly values?", is more like:
Because he can.
The user is free to choose any language he wants to interface with the kernel.
In particular, the user is free to choose a language with no type checks, or only limited ones.
So any range checks on system functions have to be in the kernel itself.

This code is there not for performance

Posted Feb 16, 2008 11:28 UTC (Sat) by ernest (guest, #2355) [Link]

It is unfortunate that the CPU cannot enforce signedness and size types. 
Anybody programming in assembly can bypass any higher-level language type 
checks you have in mind. This is true even if the user has the best 
intentions.

Ernest.

This code is there not for performance

Posted Feb 17, 2008 19:50 UTC (Sun) by giraffedata (guest, #1954) [Link]

It is unfortunate that the CPU cannot enforce signedness and size types.

The unfortunateness is at a lower level than that. It's unfortunate that a CPU can't do ordinary integer math, where 2 + 2 = 4. I understand why the very first CPUs wrapped around integers -- it happens naturally with the simplest implementations. But I don't get why no CPU today provides even the option of trapping on an arithmetic overflow instead of wrapping around silently. They do it for floating point, but not for integers.

Trapping on overflow

Posted Feb 23, 2008 21:45 UTC (Sat) by anton (subscriber, #25547) [Link]

But I don't get why no CPU today provides even the option of trapping on an arithmetic overflow.
MIPS and Alpha have separate arithmetic instructions that trap on signed overflow (e.g., ADD on MIPS and ADDV on Alpha). IA-32 has INTO, which traps if OF is set. Apparently this instruction was so rarely used by programmers that AMD64 removed it in order to free up some opcode space, and did not even bother to allocate another (multi-byte) opcode for it; but you can still implement the functionality by combining JO (or JNO) with INT.

The existence of INTO has not helped against this security hole, though.

Trapping on overflow

Posted Feb 23, 2008 22:24 UTC (Sat) by giraffedata (guest, #1954) [Link]

MIPS and Alpha have separate arithmetic instructions that trap on signed overflow ...

Nice. Do you know if there is any way to make GCC (or any other C compiler) generate such instructions?

I can understand people resisting adding instructions to handle overflow, but if I could declare in my C program "no arithmetic in here is supposed to wrap around" and get signalled to death if it does, I'd do it a lot.

Trapping on overflow

Posted Feb 28, 2008 21:23 UTC (Thu) by anton (subscriber, #25547) [Link]

Apart from asm statements and modifying gcc I don't know of a way to get gcc or other compilers to use the trapping instructions for C code.

Concerning "no arithmetic in here is supposed to wrap around", unsigned arithmetic is supposed to wrap around in standard C, only signed arithmetic is allowed to trap (or do anything else) on overflow.
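(As a hedged aside from a later vantage point: GCC and Clang eventually grew checked-arithmetic builtins, and GCC's -ftrapv flag traps on signed overflow, giving C code a portable way to ask "did this wrap?")

```c
/* __builtin_add_overflow (GCC 5+/Clang) reports whether an addition
 * wrapped instead of letting it wrap silently. */
#include <stdbool.h>

static bool add_overflows(unsigned long a, unsigned long b,
                          unsigned long *sum)
{
    /* Returns true if a + b wrapped; *sum gets the (wrapped) result. */
    return __builtin_add_overflow(a, b, sum);
}
/* add_overflows(1, 2, &s) is false with s == 3;
 * add_overflows(ULONG_MAX, 4096, &s) is true. */
```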

Magic exists by necessity alone

Posted Feb 13, 2008 0:19 UTC (Wed) by jd (subscriber, #26381) [Link]

Well, yes, but ideally you'd have some kind of abstraction. Since the numbers and arithmetic are "magic", they must also be impermanent and subject to continual experimentation. Possibly by a coder, possibly by the person compiling, possibly by the OS itself. As such, those need to be the components that are most easily identified and changed in a consistent way - in terms of calculations and type ranges.

Making this "pure" beyond a certain level is, agreed, problematic. You don't have infinite CPU cycles, and although the compiler can do some optimization, it can't turn an abstract, general solution into something tuned to the considerably narrower range of special cases that are usable in practice and further constrained by the implementation details of a given architecture - which is all any real-world physical computer can ever be.

A way round this would be to have some sort of hypothetical generic architecture, which implemented the formal "pure" solution but was never - and could never - actually used in practice. As all usable architectures would necessarily be perfect subsets of the "pure" solution, it no longer matters if it is easy to read, you know it's just a re-arranged and contracted form.

However, here you run into a problem. This assumes a totally generic "pure" solution even exists, wholly independent of any architecture. Since you can't use such a solution, test-run such a solution or in many cases reverse-map onto such a solution, it's not obvious you could ever show a pure solution was indeed generic or was the solution the specific implementations were specific implementations of.

Magic exists by necessity alone

Posted Feb 15, 2008 23:23 UTC (Fri) by vonbrand (guest, #4458) [Link]

Even if a "perfect architecture" existed such that "real machines" are "perfect subsets" (real architectures are very different, sometimes in very weird ways), one of the problems in writing a kernel is that real hardware is as buggy (or more!) as the next piece of software (in the end, it is algorithms implemented in silicon, with the added burden that bugs can rarely be patched and the rebuilt package distributed to the users).

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 13:18 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

In my experience the kernel is documented where it needs to be.

Like my own code, it doesn't meet some arbitrary "programming by the book" rule about how many
comments there should be, but instead provides comments where it is expected that they would
be helpful to subsequent maintainers. Every year when I was assisting in the undergraduate
first programming class we had to first beat it into new students that they need to write
comments explaining what's going on in their code, and then beat it into them that useless
(e.g. i++; /* increment i */) explanations are worse than none at all.

Good documentation is like a good alarm system - it doesn't tell you about the ordinary, the
routine, which you are expected to already understand. Contrast AWS (a railway safety system
which alerts and requires action for all aspects except proceed, meaning it becomes more of a
routine annoyance than an actual safety aid) with ATP or TPWS (systems which do nothing until
an apparently unsafe condition occurs). Every comment is taking up space, both on the screen
and in the maintenance programmer's mind. A short function with one comment pointing at an
unusual condition for a loop may as a result be easier to comprehend than a similar function
with sixteen multi-line C++ style comments explaining mundane things you ought to be able to
pick up at a glance from the surrounding code ("this is a loop counter") or know as domain
knowledge for the system you're working on.

For example, if you work on code related to SCSI or filesystems or otherwise connected with
disks, you're expected to recognise that (bytes >> 9) converts from a byte count to a sector
count, since sectors are 512 bytes. Thus each incidence of this expression is not expected to
attract a comment even though it may initially seem inexplicable to someone who has never
worked with disks.

I've had to work on several parts of the Linux kernel (though I don't think any code I wrote
is currently in a Linus git repo) and I found it all satisfactorily documented. Unlike
Tanenbaum, if I were grading this project I would give it at least an A minus. Maybe you're
just too sensitive - do you have trouble eating good cheese? Are you revolted by the thought
that game isn't freshly killed when it's made into pies? Not everything needs to look like
pristine sample code in order to be maintainable.

code documented where it needs to be

Posted Feb 12, 2008 16:12 UTC (Tue) by rfunk (subscriber, #4054) [Link]

Reminds me of Steve Yegge's latest.

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 16:42 UTC (Tue) by utoddl (guest, #1232) [Link]

I basically agree with what you're saying, but I'd like to make just a couple of counter-points.

> ...beat it into them that useless (e.g. i++; /* increment i */) explanations are worse than none at all.

True, that's useless to you and me, but for the noob programmer who has come through elementary and high school maths, the expression i++ is not immediately obvious. By the second or third C program he should be familiar with it, certainly. But my point is that the "document what's not obvious" standard requires a judgment call, where "what's obvious" varies greatly with the experience of the coder.

> For example, if you work on code related to SCSI or filesystems or otherwise connected with disks, you're expected to recognise that (bytes >> 9) converts from a byte count to a sector count, since sectors are 512 bytes.

If you say so. I don't happen to work in that domain often, so it isn't obvious to me. However, a well-crafted macro BYTES2SECTORS(bytes) would give me a clue, and the magic numbers and operations on them would live in the definition of the macro, so you've got one place to maintain the conversion and it's clear to us noobs what it's related to. Again, the point isn't about this particular case, but if you can give magic numbers and operations on them a name, you can make the intent more clear (and perhaps avoid a problem, like (bytes << 9)). It's already clear to somebody, but will it be clear next week, or to the next programmer, who may be less experienced?
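For illustration only (this macro is the comment's hypothetical proposal, not an actual kernel macro), the idea amounts to:

```c
/* Hypothetical macro from the comment above: the magic shift gets a
 * name and lives in exactly one place. */
#define SECTOR_SHIFT 9
#define BYTES2SECTORS(bytes) ((bytes) >> SECTOR_SHIFT)
/* BYTES2SECTORS(4096) == 8; BYTES2SECTORS(1023) == 1 (rounds down). */
```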

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 18:14 UTC (Tue) by iabervon (subscriber, #722) [Link]

It is useless to the noob programmer who is looking at a piece of code with hundreds of
thousands of occurrences of the increment operator if they're all commented. It's worse than
useful to that same programmer if some but not all of them are commented (are the commented
ones somehow different?). The novice should have access to a tutorial and reference guide for
the language, and that should explain the notation. Even things which are unique to a
project's style shouldn't be documented in situ at every (or any) place they're used; they
should be documented once in a central location. So the standard should really be "comment
what's unusual about this location, given the context of the rest of the project; also
document what's unusual about this project, given the language it's written in; also have
language reference works available; and certainly don't duplicate any of this information,
since it will just get out of sync."

vmsplice(): the making of a local root exploit

Posted Feb 14, 2008 6:38 UTC (Thu) by jimparis (guest, #38647) [Link]

> > For example, if you work on code related to SCSI or filesystems or otherwise connected
> > with disks, you're expected to recognise that (bytes >> 9) converts from a byte count to
> > a sector count, since sectors are 512 bytes.

> However, a well-crafted macro BYTES2SECTORS(bytes) would give me a clue, and the magic
> numbers and operations on them would live in the definition of the macro, so you've got one
> place to maintain the conversion and it's clear to us noobs what it's related to.

But in the context of the kernel, hiding things in a BYTES2SECTORS macro can be downright
dangerous.  What types does it handle for input and output?  Does it sleep or have any locking
requirements?  If I pass an expression, do I have to worry about side effects if it gets
evaluated twice?  How does it behave if bytes is not a multiple of one sector?
Is it talking about the canonical 512-byte definition of a sector or the actual 2048-byte
sectors used on CD-ROMs?

"bytes >> 9" is absolutely clear.  I see instantly that it handles integer types and matches
the signedness.  It is free from side effects and fast.  I know that the answer is rounded
down to the nearest whole sector.  Sectors are 512 bytes.

I don't see why you would want to hide all of that information from a developer.  Your version
is only simpler at a quick glance to a casual observer, not someone who actually has to deal
with the code and what the statement actually does.

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 19:20 UTC (Tue) by bronson (subscriber, #4806) [Link]

I agree with everything you said, except...

> you're expected to recognise that (bytes >> 9) converts from a byte count to a sector count

Then why not define BYTES_TO_SECTORS(n) ((n)>>9)?  Putting this bare shift into code
introduces an informal lingo that, as it builds up, can really get in the way for new
maintainers.  True, one instance is no big deal, but trying to get up to speed on code that
has more than five or ten undocumented idioms like this is a real drag.

Even if you understand the lingo, you will likely start to think, "Ah, those are now sectors"
whenever you see >>9 in the code.  That's bad too.

vmsplice(): the making of a local root exploit

Posted Feb 13, 2008 7:27 UTC (Wed) by JoeF (guest, #4486) [Link]

For example, if you work on code related to SCSI or filesystems or otherwise connected with disks, you're expected to recognise that (bytes >> 9) converts from a byte count to a sector count, since sectors are 512 bytes.

Hmm, no. The use of magic numbers is something I beat out of undergrads. Use a define, or a macro.
When I do code reviews at my job, I always latch onto the use of magic numbers.
And, your example is flawed, anyway. While hardware sectors may be 512 bytes (currently), filesystems usually deal with larger blocks.

vmsplice(): the making of a local root exploit

Posted Feb 14, 2008 6:40 UTC (Thu) by jimparis (guest, #38647) [Link]

> While hardware sectors may be 512 bytes (currently), filesystems usually deal with larger
blocks.

All the more reason to just specify it clearly at the point of use rather than hiding it
away in a #define!  See my response above too.

vmsplice(): the making of a local root exploit

Posted Feb 14, 2008 19:44 UTC (Thu) by JoeF (guest, #4486) [Link]

You either need to add a comment, or use a define.
bytes >> 9 without a clarifier is something that should never be in production-quality code.
Magic numbers should never be in code without a clear explanation what they mean.
If you don't want a macro that can hide things, write something like bytes >> SECTOR_SHIFT.

vmsplice(): the making of a local root exploit

Posted Feb 14, 2008 20:42 UTC (Thu) by jimparis (guest, #38647) [Link]

> You either need to add a comment, or use a define.
> bytes >> 9 without a clarifier is something that should never be in production-quality code.
> Magic numbers should never be in code without a clear explanation what they mean. If you
> don't want a macro that can hide things, write something like bytes >> SECTOR_SHIFT.

If one doesn't want to hide things, hide it inside "SECTOR_SHIFT"?
I'm sorry, I remain unconvinced.
Using ">> 9" is essentially a comment saying "I'm converting this to a 512-byte sector count".
It's such a fundamentally basic operation that anyone working with filesystem or disk code
would understand it immediately.

Abstracting common constants is most useful when they might change, and they're just arbitrary
to begin with -- HZ for example.  Or things that actually have no real meaning, like magic
numbers that identify a filesystem.  For a number that has a real, well-known meaning, and for
which changing would involve huge logic changes in the code, I think macros like that are just
extra levels of obfuscation.

vmsplice(): the making of a local root exploit

Posted Feb 15, 2008 6:54 UTC (Fri) by JoeF (guest, #4486) [Link]

> Using ">> 9" is essentially a comment saying "I'm converting this to a 512-byte sector count".

No. It only says that you shift a value by 9 bits to the right.
It may say to you that you are converting a value to a 512-byte sector count. It does not necessarily say that to others.

> It's such a fundamentally basic operation that anyone working with filesystem or disk code would understand it immediately.

That mindset is what results in a lot of the flaws in all kinds of code. There will always be somebody working on the code for whom it isn't obvious. I venture that if you can't see that, you must still be in your first job and haven't yet had the task to maintain somebody else's code. I can tell you from (painful) experience that this kind of stuff is among the worst stuff out there.
">>9" makes implicit assumptions, and implicit assumptions, even if they may seem reasonable to the original author, are often not obvious to maintainers, possibly years down the road.
Oh, and just in case, I am writing filesystem code. I would never even get the idea to write something like ">>9" without a comment, or better, in a macro.

vmsplice(): the making of a local root exploit

Posted Feb 15, 2008 7:43 UTC (Fri) by jimparis (guest, #38647) [Link]

> I venture that if you can't see that, you must still be in your
> first job and haven't yet had the task to maintain somebody else's code.

And I venture that, in my experience (which is not as limited as you impolitely presume),
"somebody else's code" is a whole lot more difficult to work with when I have to dig through
arbitrary levels of opaque macros to figure out what they're really trying to do.

Clearly we have a difference of opinion.  I prefer code that is written clearly enough to
be self-explanatory.  Consider:
  inode->i_blocks = bytes >> 9;
vs:
  inode->i_blocks = BYTES_TO_BLOCKS(bytes);

If you maintain that the second version conveys more information than the first, then I'm
afraid we will just have to disagree.

vmsplice(): the making of a local root exploit

Posted Feb 16, 2008 1:52 UTC (Sat) by dododge (guest, #2870) [Link]

> Consider:
>   inode->i_blocks = bytes >> 9;
> vs:
>   inode->i_blocks = BYTES_TO_BLOCKS(bytes);

It's a bit of an unfair example, though, because you're computing
against a value called "bytes" and assigning it to something called
"blocks".  You've put enough context around the expression to make
it clear what the shift is trying to accomplish.

The problem is when someone assumes ">> 9" is inherently
self-documenting and throws it into the middle of a much more
complex statement.  Consider:

  process_frag_2(sig,(get_ent(curr) >> 9) + 2,HEX_ENCODE);
vs:
  process_frag_2(sig,BYTES_TO_BLOCKS(get_ent(curr)) + 2,HEX_ENCODE);

When I'm reading code, I'd much rather see the latter.  It doesn't
just tell me why the shift is being done; it even adds useful
information about the APIs for get_ent() and process_frag_2().

vmsplice(): the making of a local root exploit

Posted Feb 16, 2008 17:13 UTC (Sat) by bronson (subscriber, #4806) [Link]

I guess we will disagree.  Your way is still hostile to new maintainers because it carries so
much implicit information.  What if your code has this?

   inode->i_blocks = bytes >> 8;

What is a new maintainer to think?  Even if he does recognize that this expression is
different from the others, he's still confused.  Did you auto-optimize the expression (bytes
>> 9)*2?  Is it a typo or a thinko?

Will every potentially confusing use of the shift operator in your code include a comment
stating the author's intention?  If so, then will that convention still be true once someone
else has modified your code?

Also, you've given yourself a rather heavy restriction on variable naming haven't you?  Will
all your variable names really contain "blocks" and "bytes" as appropriate?  You'll probably
want to add this restriction as a comment to each file to which it applies or it won't last
long once other people start committing patches to your code!

Finally, your technique only works for trivial expressions like assignment.  What happens if
you need to actually perform a calculation?

A BYTES_TO_BLOCKS macro solves all these problems by making the conversion explicit.  Implicit
rules and unwritten conventions always make life hard for new maintainers.  Always.

vmsplice(): the making of a local root exploit

Posted Jan 11, 2009 4:06 UTC (Sun) by dibbles_you (guest, #45004) [Link]

cd /to/linux/kernel
# find . -name "*.c" | xargs grep ">>" | awk -F ">>" '{ for(n=2;n<=NF;n++) { c=$n; sub(/^=/,"",c); sub(/ /,"",c); gsub(/\(/,"",c); gsub(/\)/,"",c); c=substr(c,1,1); if (c>="0" && c<="9") print "number"; else print "text"; } }' > a
# cat a | grep text | wc -l
# cat a | grep number | wc -l

number=19877
text=4936

so it would seem the linux kernel believes numbers are usually clearer for others to read and understand.

if you believe otherwise submit a patch to the kernel removing all occurrences and replacing them with meaningful #defines.

#define'ing all numbers is stupid. Using identifiers that describe what they are should allow anybody to understand their context. If you use lots of operators and numbers in a single expression, split them up into multiple variables with appropriate names; if they are in conditions, use multiple conditions with comments if necessary. The compiler will optimize all of these away and make your code more readable, without needing 15 editors open or a code navigator to work out that the actual value of CLUSTERS_DIVIDE_BY_BYTES_IN_SECTORS_MULTIPLIED_BY_PI(x) is 2.

PS. the script is an approximation.

vmsplice(): the making of a local root exploit

Posted Feb 13, 2008 17:09 UTC (Wed) by landley (guest, #6789) [Link]

It's easier for you to read what you wrote than it is for you to read what 
someone else wrote, therefore the problem is with the other person?

Reading code is always harder than writing code.  When you write code, by 
definition you have a working model of the code in your head and 
understand all the concepts involved (or it won't work).  When you read 
someone else's code, even well commented source code, you're trying to 
reverse engineer their thought process based on a machine they built which 
was primarily designed to function, not to teach.

Beyond that, working code is often complicated because it has to deal with 
the real world rather than a theoretical model.

Go read this:
http://www.joelonsoftware.com/articles/fog0000000069.html


vmsplice(): the making of a local root exploit

Posted Feb 13, 2008 17:34 UTC (Wed) by landley (guest, #6789) [Link]

So code other people wrote is harder for you to read than code you wrote.  
Join the club.  For code you wrote, you'd better have a complete 
theoretical model in your head or it won't work.  For other people's code, 
you're trying to reverse engineer their thought process by taking apart a 
machine they built.

So if you haven't figured out yet that source code is inherently harder to 
read than it is to write, or that working code accumulates complexity as 
it has to deal with a real world that does not cleanly match simple 
theoretical models, you're definitely on the "newb" side rather than 
the "professional" side.

Read this:
http://www.joelonsoftware.com/articles/fog0000000053.html

And note how much else out there agrees with it:
http://www.spinellis.gr/codereading/
http://withoutane.com/rants/2007/when-you-read-code
http://blogs.msdn.com/oldnewthing/archive/2007/04/06/2036...

and so on...

vmsplice(): the making of a local root exploit

Posted Feb 13, 2008 18:03 UTC (Wed) by dw (guest, #12017) [Link]

As a matter of diligence I (like many, many people) document large swathes of my code in the
form of comments. Basically anywhere it has taken me more than a moment to think about, or
where the purpose of the code cannot be inferred by reading API documentation for the calls
made inside it.

Magical integer literals and bitshifts fall well within this purview, and I cannot see it as
"noobness" in these two specific cases. Imagine someone had to go back and fix all those
bitshifts when we move to Some New Compiler (in Some Hypothetical Future). Can't be regexed,
no central place where a change can ripple through the tree, and utterly dangerous, say, if
this New Compiler takes advantage of the fact that bitwise right shift on a signed number is
undefined according to ANSI C (I'm taking this as one, single example. There are hundreds
more).

I understand that other peoples' code is difficult to comprehend. Hell, I've read more than my
fair worth of Other Peoples' Code. I'm talking specifically about why the Linux kernel seems
to be so full of this, but other commenters have given good reasons for some of this already.

code obfuscation

Posted Feb 16, 2008 1:21 UTC (Sat) by man_ls (guest, #15091) [Link]

True. Any idiot can obfuscate his or her code, but it takes work and wit to make legible code.

distro update progress

Posted Feb 12, 2008 1:13 UTC (Tue) by dougg (subscriber, #1894) [Link]

Looks like Debian is first out of the blocks. [Several of my machines running  debian etch 4.0
(i386) have new kernels (2.6.18) now. Sadly my slug (NSLU2) running etch 4.0 has not yet been
fixed.] As I write there has been no update for Fedora 8 or Ubuntu server 7.10 ...

distro update progress

Posted Feb 12, 2008 1:30 UTC (Tue) by corbet (editor, #1) [Link]

Fedora updates are in our mailbox now, haven't seen a whole lot of others yet.

distro update progress

Posted Feb 12, 2008 1:37 UTC (Tue) by jengelh (subscriber, #33263) [Link]

"suser-jengelh/SUSE-10.3" repository updated as of Feb 12 01:18 UTC.

distro update progress

Posted Feb 12, 2008 2:09 UTC (Tue) by mrons (subscriber, #1751) [Link]

I took the fedora update  kernel-2.6.23.15-137.fc8 straight from the build system
(koji.fedoraproject.org) about 21 hours ago.

I couldn't wait for it to be distributed to the mirrors, I have lots of students with shell
accounts that read slashdot!

distro update progress

Posted Feb 12, 2008 8:17 UTC (Tue) by nix (subscriber, #2304) [Link]

It is too late. Now you have lots of new co-system-administrators. ;)

distro update progress

Posted Feb 12, 2008 11:49 UTC (Tue) by Velmont (guest, #46433) [Link]

You could always had used the quick hotfix to disable vmsplice (no reboot necessary):

http://www.ping.uio.no/~mortehu/disable-vmsplice-if-explo...

distro update progress

Posted Feb 12, 2008 18:58 UTC (Tue) by incase (guest, #37115) [Link]

That "fix" is even worse than the problem itself:
It first tests whether the exploit works, and overwrites parts of kernel memory along the way.
If your machine has only a few, trusted users, don't use it. If you have untrusted users (or
anticipate some remote exploit allowing an attacker to execute code under some non-root
account), it would be better to shut down the machine until you have an updated kernel
installed, either by patching your kernel yourself or by installing a distribution kernel
with the fixes in it.

distro update progress

Posted Feb 13, 2008 10:52 UTC (Wed) by Velmont (guest, #46433) [Link]

If you use the new hotfix, it will *not* use the exploit to get root but just disable
vmsplice.

Morten Hustveit made the patch while waiting for a pizza delivery, and didn't look at the
exploit - now the second version enables sysadmins to disable vmsplice more securely. ;-)

distro update progress

Posted Feb 12, 2008 4:25 UTC (Tue) by pr1268 (subscriber, #24648) [Link]

Slackware patched the kernel for both -current and -stable as of 0100 UTC (6:00 PM CST), according to the ChangeLogs -current and -stable.

The -current fix involves upgrading to Kernel 2.6.23.16, whereas the -stable fix appears to be a backported patch to 2.6.21.5 (which shipped with Slackware 12.0 back in July).

distro update progress

Posted Feb 12, 2008 5:38 UTC (Tue) by afalko (guest, #37028) [Link]

For all those curious about Gentoo:

I use vanilla-sources-2.6.24 in Gentoo; I needed to package.keyword vanilla-sources and mask
2.6.25.

gentoo-sources, which is supported by Gentoo security had 2.6.23.16 kernel stable within about
a day and half.

distro update progress

Posted Feb 12, 2008 7:38 UTC (Tue) by jmm (subscriber, #34596) [Link]

Updates for arm will be released soon. They're built from the same source package as the other
kernel images, but since arm is slow to compile and rarely used in a multiuser environment we
decided to go ahead with the other archs.

distro update progress

Posted Feb 12, 2008 22:40 UTC (Tue) by man_ls (guest, #15091) [Link]

My slug (NSLU2) running etch is now updated with linux-image-2.6.18-6-ixp4xx. Thanks! Debian rocks (as usual).

for the record

Posted Feb 23, 2008 22:06 UTC (Sat) by gvy (guest, #11981) [Link]

* Sun Feb 10 2008 Sergey Vlasov <vsu@altlinux> 2.6.18-alt12
- Security-related changes:
  + CVE-2008-0600: splice: fix user pointer access in get_iovec_page_array()
  + check iovec buffers in __bio_map_user_iov() (fixes issue with SG_IO)
  + guard against attempts to call get_user_pages() for 0 pages

Gentoo's gentoo-sources updated

Posted Feb 12, 2008 1:43 UTC (Tue) by dberkholz (guest, #23346) [Link]

Gentoo isn't currently sending out GLSAs for kernels because we have something like 50 kernel
sources and not enough people, but at least the primary, recommended one (gentoo-sources) has
been updated since yesterday.

(Unrelated: the comments didn't have a Reply link, so I had to stick this in a new thread.)

OT: reply

Posted Feb 12, 2008 12:57 UTC (Tue) by hummassa (guest, #307) [Link]

> (Unrelated: the comments didn't have a Reply link, so I had to stick
> this in a new thread.)

What happened to the "reply" button?

OT: reply

Posted Feb 12, 2008 13:22 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

I've definitely seen this phenomenon. I'm not quite sure how it happens, but it seems to be
possible to be in a state where the site doesn't give you any reply buttons, yet you're
allowed to read the article (so you're presumably logged in as a paying customer). Maybe
(editors?) we should try to pin down when exactly this happens ?

OT: reply

Posted Feb 12, 2008 14:44 UTC (Tue) by jake (editor, #205) [Link]

If anyone sees that problem again, I'd love to figure it out and fix it.  The source of a page
that exhibits this behaviour might help.  Also, does anyone remember if *all* comments had no
"Reply ..." button or if it was just one?

Rather than clutter the thread, email to jake, you can probably guess the domain, is fine.

thanks,

jake

OT: reply

Posted Feb 12, 2008 16:10 UTC (Tue) by rfunk (subscriber, #4054) [Link]

The one time I saw it, it was all comments missing the button, and 
reloading the page fixed the problem.  And I was using Konqueror as my 
browser.

OT: reply

Posted Feb 14, 2008 11:51 UTC (Thu) by intgr (subscriber, #39733) [Link]

I've seen this repeatedly with Firefox, probably because I always throw out all browser
cookies on exit; the refresh trick always fixes it.

This is how the problem occurs for me:

1. I'm not logged in and click on an article that requires subscription; LWN informs me that I
can't see this article.
2. I click on "log in to read this article"; https://lwn.net/login?target=/Articles/268783/
3. After filling in username and password, I'm redirected to the
http://lwn.net/Articles/268783/ page without the "reply to this comment" buttons -- as if LWN
thought I wasn't logged in.
4. Reloading the page makes the buttons appear.

To reproduce the bug again, I have to clear my cache (!) after logging out. This would suggest
that it depends on some external dependency (like CSS or JavaScript) that varies depending on
login, but gets cached pre-login by the browser. However I couldn't find such dependencies on
the LWN site.

OT: reply: FireFox versus CSS

Posted Feb 14, 2008 14:42 UTC (Thu) by hummassa (guest, #307) [Link]

I think it is a CSS problem. As I use non-standard colors, when I do the same operation ("log
in to read this article"), it comes back with the default colors instead of my colors. At this
point, I refresh the page... hence I don't encounter the "no buttons" problem.

OT: reply

Posted Feb 19, 2008 8:09 UTC (Tue) by intgr (subscriber, #39733) [Link]

So it turns out that this indeed is the problem; the stylesheet file sizes only differ in one byte so I didn't catch the difference the first time (I thought they were equal).

--- lwn_logged_out.css  2008-02-19 09:58:38.000000000 +0200
+++ lwn_logged_in.css   2008-02-19 09:57:04.000000000 +0200
@@ -135,7 +135,7 @@
 DIV.CommentBody { padding: 0px 4px 0px 4px; }
 P.CommentPoster { margin-top: 0px; }

-div.CommentReplyButton { display: none;
+div.CommentReplyButton { display: block;
                         text-align: right;
                         padding: 4px; }

The problem here is browser cache. When the user first visits LWN.net, not having logged in yet, the browser downloads /CSS/lwn which specifies that div.CommentReplyButton should be hidden.

Later when the user does log in, the browser doesn't bother re-fetching the stylesheet because it already has this URL cached. The Right Way to fix this is to have different stylesheet URLs depending on whether the user is logged in -- rather than using magic stylesheets.

OT: reply

Posted Feb 15, 2008 15:07 UTC (Fri) by madscientist (subscriber, #16861) [Link]

I see this happen somewhat often using FireFox too.  As you say, the buttons are ALL missing,
and a reload takes care of it.  I checked the source once on these but it looked fine: the CSS
for the buttons was all there.  They just didn't show up.

I've also had situations where the front page says there are replies, but when I click the
link I don't see any replies.  Then if I reload, the replies appear!  This happens more
rarely, but it's happened more than once.

At first I thought it was because I was using the greasemonkey add-on to distinguish between
read and unread comments, but I've since turned that off and I still see both of these
problems.

Linus saw it coming

Posted Feb 12, 2008 17:12 UTC (Tue) by proski (subscriber, #104) [Link]

Linus actually criticized vmsplice() from the performance standpoint. But his reference to "VM games" seems highly relevant to this issue. From http://kerneltrap.org/node/6506:
"I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home."
Having more than one way to do something has a downside. The less used way is less tested and less reviewed for security.

Linus saw it coming

Posted Feb 12, 2008 18:55 UTC (Tue) by eSk (guest, #45221) [Link]

Except that vmsplice() does not fall under the category of "VM games" since it does not actually go about modifying the page tables of the applications. Also Linus was referring to the performance downsides of playing "VM games". Considering that splice() and vmsplice() were added purely for performance reasons it would have been quite hypocritical to add vmsplice() to improve performance (if vmsplice() actually did play "VM games", that is). I'm not familiar with his criticism of vmsplice() from the performance standpoint that you refer to.

It should also be noted that Linus was in this quote quite wrong in his contempt for FreeBSD developers. FreeBSD developers were (and are) very much aware of the issues at hand. That's why the zero-copy socket option was disabled by default.

Linus saw it coming

Posted Feb 12, 2008 19:14 UTC (Tue) by rfunk (subscriber, #4054) [Link]

No, Linus was promoting vmsplice there, not criticizing it.

Bigger caches ? Where ?

Posted Feb 12, 2008 21:15 UTC (Tue) by khim (subscriber, #9252) [Link]

This is the place where Linus misses the point completely: there are no "bigger cache sizes". There will never be "bigger cache sizes". Never. The Pentium had a 16 KiB L1 cache in 1993; the PowerPC 601 had a 32 KiB cache in 1993! Today... The latest and greatest Itanium has 16 KiB L1D + 16 KiB L1I cache, the Intel Core 2 uses 32 KiB L1D + 32 KiB L1I caches, and the biggest caches of POWER6 and AMD Phenom reach just 64 KiB L1D + 64 KiB L1I! Basically a two-times increase in 15 years! I'm not sure we'll ever see anything bigger than the 256 KiB "cache" (local RAM) in Cell.

So while "cheap, slow" RAM is an abundant resource, L1 is as precious today as it was 10 years ago and as it will be 10 years from now! Use L1 very sparingly!

Bigger caches ? Where ?

Posted Feb 12, 2008 22:04 UTC (Tue) by zlynx (guest, #2285) [Link]

I think you're missing the point here.

L1 cache is, as I understand it, about as fast as chip registers.  So that 16 KB cache has
grown faster and faster over the years.

What is L2 and L3 cache now is *huge* and is likely as fast if not faster than that 1993
Pentium's 16 KB cache.

And all chip cache is hugely faster than system RAM.  Look at benchmarks that show the
optimization of certain algorithms.  When they use cache-optimized blocks instead of streaming
over the entire data set, performance improves a hundred fold in some cases.

Bigger caches ? Where ?

Posted Feb 12, 2008 22:51 UTC (Tue) by man_ls (guest, #15091) [Link]

Moore's law is still valid: transistors on a chip are doubling every 18 months. However clock speeds are not getting any faster; my old PIV was running at 3.2 GHz, and you seldom see those speeds nowadays. So guess where many (if not most) of all those extra transistors are going. Yep, into the caches.

Bigger caches ? Where ?

Posted Feb 13, 2008 20:23 UTC (Wed) by hmh (subscriber, #3838) [Link]

In mainstream consumer CPUs, maybe the clocks have stabilized (which IS a good thing, we
already waste too much power just to do office work).

Look at the IBM Power6 for some high-clock cores...

Bigger caches ? Where ?

Posted Feb 21, 2008 9:53 UTC (Thu) by renox (guest, #23785) [Link]

>Look at the IBM Power6 for some high-clock cores...

Except that apparently they had a lot of difficulty ramping up the clock..
Plus their CPU does *in order* integer processing, which induces a loss of efficiency..

It'd be interesting to have a study which analyses whether going in-order for integer
processing to increase the clock really did improve performance, or if it was just a
marketing gimmick a la P4.

Power 6 clock high-score and real performance

Posted Feb 23, 2008 17:14 UTC (Sat) by anton (subscriber, #25547) [Link]

> It'd be interesting to have a study which analyse whether going in order for integer
> processing to increase the clock really did improve performance

Take a look at SPEC CPU 2006 results.  Last I looked, a 4.7 GHz Power6 had similar speed to a
3 GHz Core 2 Duo 6850 in both SPECint2006 and SPECfp2006 results.

Bigger caches ? Where ?

Posted Feb 21, 2008 9:46 UTC (Thu) by renox (guest, #23785) [Link]

>This is place where Linus misses the point completely: there are no "bigger cache sizes".
There will never be "bigger cache sizes". Never. Pentium had 16 KiB L1 

Funny how you restrict the meaning of 'bigger cache size' to 'L1 cache size' to "make your
point".

Sure L1 won't grow, but L2 and L3 cache are also cache you know?


Bigger caches ? Where ?

Posted Mar 9, 2008 23:16 UTC (Sun) by phip (guest, #1715) [Link]

> There will never be "bigger cache sizes". Never.

You should check out PA-RISC.  They have L1 D-caches up to 2048K.

Slackware has released fixed kernel update packages

Posted Feb 12, 2008 19:18 UTC (Tue) by juhl (guest, #33245) [Link]

For those of you running Slackware. The ChangeLog for Slackware-current shows that there are now updated kernel packages, that fix this problem, available.

Mon Feb 11 17:47:58 CST 2008
a/kernel-generic-2.6.23.16-i486-1.tgz:
       Upgraded to Linux 2.6.23.16 uniprocessor generic.s (requires initrd) kernel.
       All of these kernel upgrades fix yesterday's local root exploit.
       The kernel headers did not change, so a glibc rebuild is not required.
       For more information, see:
       http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-0010
       http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-0163
       http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-0600
       (* Security fix *)
       If you use lilo, don't forget to run it again after the upgrade.
...

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 19:40 UTC (Tue) by bronson (subscriber, #4806) [Link]

Given how few userspace programs actually use vmsplice, is it safe to say that the largest
user of this system call is the exploit?  (Some quick google and koders searching implies
this; please tell me if my impression is wrong.)

I'm curious why nobody has been talking about purging this demonstrably scary call from the
kernel.  Why not redesign it so this sort of mistake will be easier to find in the future?

Ultimately, I guess here's my question: is there any quantifiable reason to believe the
current fix isn't a premature botch the way the previous fixes were?

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 20:57 UTC (Tue) by nix (subscriber, #2304) [Link]

The reason why it's not widely used yet is because it's quite new. Many 
people probably don't even have a glibc with this call in (it was added in 
2.5), and there's a long lag time after that before applications can start 
relying on it (or even benefiting detectably from it).

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 21:40 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

There's nothing particularly scary about this system call.

To put this into perspective, dozens of calls with this sort of problem have been found in the
history of Linux. Dozens in Windows NT. Dozens in official AT&T branded Unix. The problem was
fixed and you're still using the relevant system calls.

There are tools that are supposed to look for this sort of problem in your OS, but of course
if you keep inventing system calls, someone has to update the tools to know about the new
system calls, and if there were people being that vigilant, you'd hope that at least two of
the three bugs that led to this discussion would have been spotted before they got into the
Linus kernel tree.

What you're looking at here is boring human error. No interesting design lessons. No
fundamental principles that ought to be reconsidered (unless you consider the hilarious
proposals earlier in comments that Linux should be re-written in a dynamic type system where
the OS would spend all its time validating your input parameters and not getting any work
done...)

vmsplice(): the making of a local root exploit

Posted Feb 12, 2008 22:32 UTC (Tue) by nix (subscriber, #2304) [Link]

I think the intention was to model it on type-inferencing strongly typed 
systems, actually: but of course that sort of inferencing can't cross the 
separate-compilation-and-language abstraction boundary between the kernel 
and userspace (not to mention the privilege boundary, which is perhaps 
less significant in this case).

why didn't static checkers spot this?

Posted Feb 12, 2008 22:28 UTC (Tue) by adamgundy (subscriber, #5418) [Link]

I was under the impression that sparse and/or coverity were specifically designed to catch
this kind of thing (reading/writing userspace pointers which haven't been correctly checked
for validity)

vmsplice(): the making of a local root exploit

Posted Feb 13, 2008 20:32 UTC (Wed) by hmh (subscriber, #3838) [Link]

Apparently, there are a lot more technical details to this exploit.

See http://article.gmane.org/gmane.linux.kernel/639206 for some very interesting info from an
anonymous source in .hu.

vmsplice(): the making of a local root exploit

Posted Feb 14, 2008 8:16 UTC (Thu) by dale77 (guest, #1490) [Link]

Anyone know whether SELinux would have prevented this exploit?

vmsplice(): the making of a local root exploit

Posted Feb 14, 2008 13:44 UTC (Thu) by corbet (editor, #1) [Link]

The exploit is able to run arbitrary code in kernel mode, so the answer has to be "no." Unless one had previously configured SELinux to disallow access to vmsplice() altogether, of course.

One possible SELinux trick

Posted Feb 14, 2008 15:06 UTC (Thu) by corbet (editor, #1) [Link]

I just ran across this posting from James Morris on how SELinux (in recent kernels) can block the mapping of memory into very low addresses - a feature which would have defeated this particular exploit.

It's far from complete

Posted Feb 24, 2008 10:55 UTC (Sun) by fbh (guest, #49754) [Link]

Hello,

I'm surprised to see how far from complete this article is. It misses some important details
and doesn't actually explain how the exploit works...

For example, the interesting buffer overflow (which is actually not done by get_user_pages(),
BTW) is not used to put a return address on the stack at all! Hint: it's actually used to
put 0 and 0x1000.

And, surprise surprise, the exploit also mmap()s two areas starting at 0 and 0x1000. And these
areas are made up to look like a struct page structure for a compound page where the
destructor field is set to the kernel_code() address.

One thing I still don't understand is why the exploit code is written so weirdly.
For example, it does:

        pages[0] = *(void **) &(int[2]){0,PAGE_SIZE};

whereas it could have done instead:

        pages[0] = NULL;

Is it common for exploit code to be obfuscated?

It's far from complete

Posted Feb 25, 2008 0:19 UTC (Mon) by cras (guest, #7000) [Link]

I don't know if the exploit was supposed to work as a 64bit binary (I crashed my machine when 
testing one version of it), but that code doesn't translate to "NULL" on 64bit systems.


It's far from complete

Posted Feb 25, 2008 9:02 UTC (Mon) by fbh (guest, #49754) [Link]

Actually, you're right.

It's a trick to compute the addresses of the fake "struct page" structures on both 32- and
64-bit platforms.

It should work on 64-bit platforms. I don't know why it doesn't in your case, but it's
probably just a matter of tuning some values in the exploit code.

Thanks. 

QA?

Posted Dec 24, 2009 22:07 UTC (Thu) by barrygould (guest, #4774) [Link]

ISTM that someone should have tested the exploit against the 'fixed' kernel before declaring it fixed. (More QA is needed in the kernel development process.)


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds