A little fun for the last day of 2012:
how to crash a program via division?
x86’s IDIV instruction traps not only on division by zero,
but also on INT_MIN / -1 (signed integer overflow).
C/C++
Try to compile this little function.
longlongcrash()
{
return(1LL<<63)%-1;
}
It will crash the
tcc compiler and
the sparse checker.
SQL
You need a 64-bit PostgreSQL (older than 9.2.2/9.1.7/9.0.11)
installed on Windows.
Try the following SQL statement.
SELECT((-9223372036854775808)::int8)%(-1);
Instead of producing 0 from modulo -1,
it will crash your PostgreSQL server.
The modulo trick also worked in Firebird.
This is fixed by adding a check on the divisor.
The first two are straightforward.
The third one (multiplication) crashes PostgreSQL
because the overflow check is done via division.
The
fix
is simple: do the overflow check before the division, not after.
It’s interesting that the developers
fixed 32-bit division crashes,
but missed 64-bit cases.
My guess is that the developers did the tests on a 32-bit
Windows. In that case, since there’s no 64-bit IDIV instruction,
the compiler instead generates a call to a runtime function lldiv,
which doesn’t trap on INT_MIN / -1.
This would lead to the incorrect conclusion that 64-bit division
wouldn’t trap.
ClamAV bytecode
The ClamAV engine accepts bytecode signatures as extensions to
detect new viruses. You can write one to crash ClamAV’s interpreter,
as follows.
intentrypoint(void)
{
longlongx=1LL<<63;
inty=__is_bigendian()-1;
if(x/y)
foundVirus("idiv");
return0;
}
It basically does INT_MIN / -1, only to prevent ClamAV’s bytecode
compiler from optimizing away the division.
Then compile the function into bytecode, load it into clamscan,
and it will crash the interpreter.
Actually the interpreter does have a sanity check for signed division,
which was
introduced
in 2009.
CVE-2008-0166
is an infamous example of using uninitialized memory for random
number generation. A Debian maintainer commented out
two lines of code
to silence Valgrind, which complained about the uses of uninitialized
memory as a ``bonus’’ source of entropy, and this change caused
OpenSSL to generate bogus keys on Debian-based systems.
Actually, using uninitialized memory has always been a very bad
idea, not only because it confuses developers and tools like Valgrind,
but because a smart C compiler will kick you in the teeth. Here is
one example in FreeBSD and Mac OS X libc.
The srandomdev() function is used to seed random(), according
to its manpage, ``suitable for cryptographic use.’’ It will first
try /dev/random, which is non-blocking on FreeBSD and Mac OS X;
if that fails, it falls back to using current time and pid, with
some amazing extra bits.
You can see that the seed is computed using results from gettimeofday()
and getpid(), mixed with the value of an uninitialized stack
variable junk. Here’s the corresponding assembly code from Mac
OS X 10.6 (Snow Leopard).
/usr/lib/libSystem.B.dylib (10.6)
leaq0xe0(%rbp),%rdi
xorl%esi,%esi
callq0x001422ca;symbolstubfor:_gettimeofday
callq0x00142270;symbolstubfor:_getpid
movq0xe0(%rbp),%rdx
movl0xe8(%rbp),%edi
xorl%edx,%edi
shll$0x10,%eax
xorl%eax,%edi
xorl%ebx,%edi
callq0x00142d68;symbolstubfor:_srandom
Everything looks good so far. Now let’s look at the same code from
10.7 (Lion) and 10.8 (Mountain Lion).
/usr/lib/system/libsystem_c.dylib (10.7/10.8)
leaq0xd8(%rbp),%rdi
xorl%esi,%esi
callq0x000a427e;symbolstubfor:_gettimeofday
callq0x000a3882;symbolstubfor:_getpid
callq0x000a4752;symbolstubfor:_srandom
Wait, the entire seed computation is gone! The results of
gettimeofday() and getpid() are not used at all; srandom()
is called with some ``garbage’’ value.
I guess Apple has switched from GCC to LLVM for compiling libc in
newer Mac OS X. Since the C code contains undefined behavior, the
use of uninitialized variable junk, LLVM optimizes away the
computation more aggressively. You should see the same assembly
if compiling FreeBSD using LLVM.
There are several interesting questions to explore.
Is it possible to trigger this by somehow failing /dev/random
on FreeBSD and Mac OS X?
How random is the srandom() seed now compiled using
LLVM, which is just the value of %edi? Looks like it’s the
lower 32 bits of the address of the stack variable tv.
Does GCC generate ``correct’’ code? Probably not. The
last xorl instruction uses %ebx, which is unlikely to correspond
to the value of junk but the file descriptor of /dev/random.
movl%ebx,%edi
callq0x00141d48;symbolstubfor:_close$NOCANCEL
In short, just don’t use uninitialized memory for more randomness.
The use of junkwas introduced
in 1997. Google shows a bunch of similar usages and fixes.
Varnish developers fixed a similar usage, though the code is still being
used by others.
BTW, you can find an interesting comparison fd >= 0 there, where fd is a pointer!
The story started with
an email
from an IBM engineer to the ext4 mailing list back in 2009.
Hi,
Found kernel bug - fixpoint divide exception
While working with fsfuzzer
Environment: 2.6.31-rc7
Architecture: s390 and ppc64
------------[ cut here ]------------
...
Kernel BUG at 000003e00429d934 [verbose debug info unavailable]
fixpoint divide exception: 0009 [#1] SMP
...
Call Trace:
([<000003e00429d87c>] ext4_fill_super+0x13c0/0x2908 [ext4])
[<00000000000f46f2>] get_sb_bdev+0x13e/0x19c
[<000003e00429230e>] ext4_get_sb+0x2e/0x40 [ext4]
[<00000000000f3f98>] vfs_kern_mount+0xc0/0x168
[<00000000000f40ac>] do_kern_mount+0x58/0x114
[<000000000010e558>] do_mount+0x798/0x830
[<000000000010e6a0>] SyS_mount+0xb0/0x100
[<00000000000266be>] sysc_noemu+0x10/0x16
[<0000004e53f234e2>] 0x4e53f234e2
Last Breaking-Event-Address:
[<000003e00429d8d8>] ext4_fill_super+0x141c/0x2908 [ext4]
---[ end trace b5563edf9c0c9b52 ]---
...
The call trace shows that a division by zero occurred when mounting
an ext4 filesystem. Note that the bug was triggered using fsfuzzer
on s390 and ppc64. Why does this matter? Let’s take a look at the
code.
Since s_log_groups_per_flex is read from disk, it can be anything.
The only existing check is to make sure it’s non-zero. Let
s_log_groups_per_flex be 32 then. Now groups_per_flex becomes
1 << 32 = 0, and traps the division-by-zero exception when it’s
used as a divisor later
(see CVE-2009-4307).
Recall that the bug was found using fsfuzzer on s390 and ppc64.
Actually, if fsfuzzer were running on x86, the division by zero
wouldn’t have been triggered at all, because on x86 1 << 32
evaluates to 1, rather than 0. The reason is that s390 and ppc use
6 bits for the shifting amount, while x86 uses 5.
Division by zero is easy to fix, just to add a zero check
groups_per_flex == 0, isn’t it? That’s exactly the
initial patch
from the IBM engineer who discovered the bug.
The ext4 developer changed the patch a little bit
in the fix.
the existing zero check on s_log_groups_per_flex
(that is, groups_per_flex == 1), and
the newly proposed zero check groups_per_flex == 0
into a single check groups_per_flex < 2.
Let’s examine the fixes. First, what if s_log_groups_per_flex
is 36? groups_per_flex (1 << 36) is 0 on ppc, while it’s 16
on x86, which will bypass the check. This causes a bizarre
inconsistency not only across architectures, but also between
s_log_groups_per_flex and groups_per_flex — logically,
s_log_groups_per_flex is expected to be the log of groups_per_flex.
A more subtle part is that shifting an n-bit integer by n or
more bits is undefined in C. That means a compiler can make
aggressive assumptions, such as s_log_groups_per_flex < 32, which
further implies that groups_per_flex will never be zero. Thus,
in the initial patch, the zero check groups_per_flex == 0 would
be optimized away by a standard-conforming compiler (at least Clang
does so).
Fortunately, the check was changed to groups_per_flex < 2, which
hinders the optimization and fixes the division by zero. Of course,
future compilers may still break that, for example, by rewriting
the check to groups_per_flex == 1.
To sum up,
a better fix
is to check s_log_groups_per_flex rather than groups_per_flex,
so as to avoid the undefined behavior.
The major interface change in this version is that instead of writing
smulo32(...) for detecting signed 32-bit multiplication overflow,
you just call overflow_mul(...), and it will figure out the type
for you (using black magic).
I have also added a unit test using
Google Test.
It would be great if you can help generate and test ARM/PowerPC
implementations.
To give you a sense of how efficient libo is, below is libo’s
implementation of overflow_mul() for long on x86_64:
overflow_mul_l:
imulq%rdx,%rsi
movq%rsi,(%rdi)
seto%al
ret
on ppc:
overflow_mul_l:
mullw6,4,5
stw6,0(3)
mulhw3,4,5
srawi4,6,31
xor3,3,4
addic4,3,-1
subfe3,4,3
blr
For a comparison, this is the implementation from CERT’s
IntegerLib.
Since it’s non-trivial to do integer overflow checking correctly
and efficiently (see my
email to cfe-dev
and previous post on libo),
let’s try compiler support. Just gave it a shot in Clang.
My patch is available on the
builtin-overflow branch.
It introduces bool __overflow_*(T *, T, T) builtin functions,
which are easier to understand, less error-prone, and have better
performance (e.g., only one more jno instruction on x86 for most cases).
Here is an example.
void*malloc_array(size_tn,size_tsize)
{
size_tbytes;
if(__overflow_umul(&bytes,n,size))
returnNULL;
returnmalloc(bytes);
}
One more example: signed addition overflow detection mentioned
in the previous post. Below is the
implementation from CERT’s IntegerLib.
Anything suspicious? Despite the clever bit trick, the code is
undefined because signed overflow can happen before the check.
It also doesn’t work on 64-bit platform: sizeof(int) should be
sizeof(long). But now you can simply write __overflow_sadd,
without worrying about undefined behavior nor performance.
Interested in overflow builtins, found any bugs,
or got better ideas about the interface and function names?
Leave your comments or join the discussions at
bugzilla
and cfe-dev.
Writing efficient overflow checks in C is very challenging
(but good brain exercise). Even libraries written by security
experts, such as Microsoft’s SafeInt and CERT’s IntegerLib, could
contain broken overflow checks.
For another example, I have seen a lot of bit tricks for detecting
int + int overflow, many of which are actually questionable
because they rely on undefined behavior. At C-language level one
cannot compute the signed addition first before overflow checking
since signed integer overflow is undefined. Guess what C compilers
would do to the following code.
intsum(inta,intb)
{
intc;
assert(a>=0);
assert(b>=0);
c=a+b;/* both a and b are non-negative */
if(c<0)/* overflow? */
abort();
returnc;
}
GCC will optimize away the check c < 0, while Clang keeps
it. You may try another (broken) form c < a:
this time GCC keeps the check, while Clang removes it.
I am proposing an overflow detection library,
libo.
Unlike previous ones, it consists of assembly code, and due to the
fact that I am lazy, the assembly code is generated automatically
from a short program.
Let’s start with a simple task: writing an array allocator
malloc_array(n, size), where n is the number of elements
and size is the element size
(i.e., the non-zeroing version of calloc).
Here’s a popular way.
void*malloc_array(size_tn,size_tsize)
{
if(size&&n>SIZE_MAX/size)
returnNULL;
returnmalloc(n*size);
}
Unfortunately neither GCC nor Clang is able to optimize away the
division (dunno why though; looks like a straightforward transformation).
malloc_array:# @malloc_array
.cfi_startproc
# BB#0: # %entry
testq%rsi,%rsi
je.LBB0_3
# BB#1: # %land.rhs
movq$-1,%rax
xorl%edx,%edx
divq%rsi
cmpq%rdi,%rax
jae.LBB0_3
# BB#2: # %return
xorl%eax,%eax
ret
.LBB0_3:# %if.end
imulq%rdi,%rsi
movq%rsi,%rdi
jmpmalloc# TAILCALL
.Ltmp0:
.sizemalloc_array,.Ltmp0-malloc_array
.cfi_endproc
If you are crazy about performance, try this
trick
from the Boehm GC guys.
Another way is to promote the integer operation to a wider type,
such as 128-bit (assuming size_t is 64-bit); just don’t
forget to do the type cast before multiplication.
I am not sure how portable this is. The
grsecurity patch uses something similar
to modify kmalloc.
To use libo with Clang (and link-time optimization), one can just
use the IR. To use libo with GCC, one can invoke LLVM’s backends
to generate assembly. Assuming LLVM is implemented correctly, we
get a correct and efficient libo implementation for all architectures
LLVM supports.
One downside is that it may be difficult to do link-time optimization
when using GCC; this would end up with having a call to umulo64,
rather than inlining it. How to ``inline’’ a function purely written
in assembly code using GCC? Or is it possible to have LLVM generate
inline assembly instead? Of course it would be nicer if compilers
provide built-in functions for overflow detection.
What does malloc(n) return if n is a big number (e.g., -1 or SIZE_MAX)?
What does calloc(n, size) return if n and size are big numbers?
What does new T[n] return if n is a big number (assuming -fno-exception)?
I believe everyone expects the answer NULL for all the questions,
as NULL is used to indicate out of memory.
Unfortunately, in reality many memory allocators break (or used to
break) these expectations, which could lead to memory corruption and
security vulnerabilities, especially if the allocation size
can be controlled by an adversary.
Here goes a summary of some known and unknown vulnerabilities
(notably multiplication, rounding, and pointer overflows)
in popular memory allocators.
GLIBC malloc
An infamous multiplication overflow used to exist in the calloc
implementation in GNU libc and Microsoft’s C runtime, as detailed
in
RUS-CERT Advisory 2002-08:02.
This overflow vulnerability can be easily misused to mount buffer
overflow attacks. It was
fixed
in GLIBC in 2002.
Hoard
I cannot find the revision log of the Hoard
allocator, so it’s hard to say what vulnerabilities it used to have.
Though it’s easy to exploit the current version (3.8) via a rounding
overflow. Below is the buggy code snippet.
// and deduct that amount from the local heap bytes counter.
if(sz<=LargestObject){
intc=getSizeClass(sz);
void*ptr=_localHeap(c).get();
if(ptr){
assert(_localHeapBytes>=sz);
_localHeapBytes-=sz;
assert(getSize(ptr)>=sz);
returnptr;
}
}
When sz is (size_t)-1, a big value, the align call overflows
and rounds sz up to 0. Thus, malloc(-1) is basically malloc(0),
which could lead to a buffer overflow.
Hoard’s calloc implementation is also vulnerable to multiplication
overflow. With glibc Hoard is injected via __malloc_hook,
and its calloc won’t be used. However, the vulnerability can
be triggered on platforms that do not use glibc, such as Mac OS X.
jemalloc
The jemalloc allocator is
used by the libc of FreeBSD and NetBSD, as well as Mozilla Firefox.
It used to have the calloc multiplication overflow vulnerability.
The bug was
fixed in 2006.
Note that size is unsigned and incr is signed.
When size is big enough, incr is misinterpreted as negative
and causes the sbrk system call to shrink the memory.
Bionic malloc
Bionic, the Android libc, is derived from the BSD libc. When
libc.debug.malloc is set, allocation calls will be rerouted to
its own debugging library.
Again, the debugging library had calloc multiplication overflows
(CVE-2009-0607).
Also, its malloc implementation still has multiple rounding
overflows, at least in in chk_malloc and leak_malloc.
Hope that no production phone ships the debugging library.
TCMalloc
Google’s TCMalloc
repeated the calloc multiplication overflow bug and got it
fixed
in 2005.
I didn’t spot any serious rounding issues in TCMalloc,
but just a slip in ReportLargeAlloc:
the error message says ``tcmalloc: large alloc 0 bytes == (nil)’’
given malloc(-1), because the size to be output overflows.
APR pool
APR pool is
developed for the Apache web server. It is also adopted by other
projects such as Subversion. The interface is similar to
allocate-and-free allocators, but takes one extra parameter specifying
which pool to allocate from.
APR pool had a series of interesting pointer arithmetic fixes,
listed as follows.
In Aug 2002, the sanity check p + size < q was
changed
to size < q - p, where
the two pointers p (i.e., node->first_evail) and q (i.e.,
node->endp) point to the start and the end of a memory block,
respectively.
- if (size < (apr_size_t)(node->endp - node->first_avail)) {
+ if (size <= node_free_space(node)) {
Here p + size < q is more dangerous than size < q - p. One obvious
reason is that p + size could wrap around to a small pointer given
a big size, while the latter check doesn’t have the pointer
overflow issue.
But how about adding a wrap check p + size < p? A subtle
problem is that pointer overflow (or more precisely, out-of-bounds
pointer) is undefined in C, which means that the compiler can do
anything to the wrap check p + size < p, such as optimizing it
away (VU#162289).
I tried to compile the wrap check using GCC 4.6 and Clang 3.0 on x86:
it’s optimized as extracting the sign of size, without even looking
at the value of p.
movl8(%esp),%eax
shrl$31,%eax
ret
The size rounding vulnerability was also discovered and fixed in APR pool
(CVE-2009-2412)
in 2009.
boost::pool
Boost Pool has the multiplication overflow
issue in its ordered_malloc implementation.
We can see that total_req_size may wrap around to a small number
given a big n.
Boehm GC
Boehm GC is probably
the most popular open-source garbage collector. It provides C
allocation calls and overloads the C++ new operator.
Boehm GC’s calloc implementation, in both malloc.c and its
documentation,
simply redirects calloc(n, size) to GC_MALLOC((n) * (size)),
thus is vulnerable to multiplication overflow.
It also has the size rounding vulnerability, so GC_MALLOC(-1)
doesn’t return a null pointer either.
Safe by Design
Some allocators are designed to be used in ``safe’’
environment. For example, LLVM’s BumpPtrAllocator
doesn’t consider out-of-memory or overflow issues.
/* FIXME: This should look for multiplication overflow */
void*calloc(size_tnmemb,size_tsize)
{
void*ptr;
size*=nmemb;
ptr=malloc(size);
if(ptr)
memset(ptr,0,size);
returnptr;
}
Compiler
The C++ array allocation new T[n] has a potential integer overflow
vulnerability if n is not limited correctly, which could lead to
a buffer overflow. This is similar to calloc(n, sizeof(T)).
The code generated by GCC simply calculates the
allocation size as n * sizeof(T) without any overflow checks,
though GCC developers have been
discussing the problem
for a long time.
Clang
generates ``safe’’ code
that defends against integer overflow attacks, via setting the
allocation size to -1 if any multiplication overflow happens. This
implies an assumption that the underlying allocator must return a
null pointer given a large size. As we have seen, this assumption
doesn’t hold well in practice.
A year ago I introduced this
Doodle Fit game
to my roommate Victor.
We were playing it all day long. To stop the addiction, we then
wrote a program to find the answers, which successfully destroyed
the interest in playing the game. Got some time to rethink
the problem last night. Here goes an interesting alternative to
solve Doodle Fit, using 0-1 integer programming.
The rule of the game is simple: fit a set of given blocks into
a given shape (no rotation). For example, try to fit the three
blocks into the two-square shape.
This solution is pretty easy. Put the red block to the bottom-right
corner and the green block to the left-most, leaving the blue block
in the middle.
Let’s see how to model the game using a set of constraints.
We use to denote whether to place the
top-left corner of the block at : 1 for yes
and 0 for no. Take the red block for an example. Since it can
only be placed at two cells, or , this means
that one and only one of and is 1.
In other words, their sum must be 1. We have similar constraints
for the other two blocks.
Additionally, each cell will be covered by one and only one block.
For example, the top-left cell can be covered by either
the green block placed at , or the blue block placed at
; it cannot be covered by the red block. So the sum of
and must be 1. We have such a constraint
for each cell, as follows.
That’s all we need. Feed these constraints into a solver
that supports 0-1 integer programming, e.g.,
GLPK and
lp_solve,
and you can get the answer back like
,
indicating where to put these blocks.
I wrote a script doodle-fit.py
to automate the steps.
Prepare an input file two specifying the shape and the
blocks, split by ---, as follows.
##
####
##
---
*
**
---
*
*
---
*
**
Then generate the constraints using the script,
which writes them to two.mod in the GNU MathProg language.
$ python doodle-fit.py < two > two.mod
Invoke a solver to get a solution. I use GLPK here.
You may try to add --minisat to glpsol to solve the constraints
with the SAT solver instead.
$ glpsol --math two.mod -w two.sol
GLPSOL: GLPK LP/MIP Solver, v4.47
Parameter(s) specified in the command line:
--math two.mod -w two.sol
Reading model section from two.mod...
24 lines were read
Generating b0...
Generating b1...
Generating b2...
Generating c00...
Generating c01...
Generating c10...
Generating c11...
Generating c12...
Generating c13...
Generating c22...
Generating c23...
Model has been successfully generated
GLPK Integer Optimizer, v4.47
11 rows, 9 columns, 32 non-zeros
9 integer variables, all of which are binary
Preprocessing...
11 rows, 9 columns, 32 non-zeros
9 integer variables, all of which are binary
Scaling...
A: min|aij| = 1.000e+00 max|aij| = 1.000e+00 ratio = 1.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part = 8
Solving LP relaxation...
GLPK Simplex Optimizer, v4.47
11 rows, 9 columns, 32 non-zeros
* 0: obj = 0.000000000e+00 infeas = 0.000e+00 (3)
OPTIMAL SOLUTION FOUND
Integer optimization begins...
+ 0: mip = not found yet >= -inf (1; 0)
+ 0: >>>>> 0.000000000e+00 >= 0.000000000e+00 0.0% (1; 0)
+ 0: mip = 0.000000000e+00 >= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
Time used: 0.0 secs
Memory used: 0.1 Mb (121133 bytes)
Writing MIP solution to `two.sol'...
22 lines were written
Make sure you see something like OPTIMAL SOLUTION FOUND
in the output. If the SAT solver is used, you should see
SATISFIABLE. Otherwise, go back to check the input file,
or send me a bug report.
Finally, display the solution two.sol using the script again.
This is a classical integer overflow in the Linux kernel’s XFS
filesystem, which could further lead to a buffer overflow. To
exploit it, an attacker needs to craft a corrupted XFS, for example,
via a USB flash drive or remotely mounted filesystem.
The vulnerability seems to be introduced in commit ef14f0c1 in July 2009.
Note that count is passed into posix_acl_alloc() without any
validation; it is converted from aclp->acl_cnt, which is directly
read from disk.
posix_acl_alloc
structposix_acl*
posix_acl_alloc(intcount,gfp_tflags)
{
constsize_tsize=sizeof(structposix_acl)+
count*sizeof(structposix_acl_entry);
structposix_acl*acl=kmalloc(size,flags);
if(acl)
posix_acl_init(acl,count);
returnacl;
}
Given a large count, the allocation size for kmalloc() would
wrap to become a small integer. Back to xfs_acl_from_disk(),
this would cause out-of-bounds write in the subsequent loop.
Later I realized that there’s a recent commit in XFS’s git repository
that tried to fix the problem.
However the patch doesn’t really work: count is a signed integer;
an attacker could feed in a negative value and bypass the sanity
check. So here goes my follow-up patch.
xfs: fix acl count validation in xfs_acl_from_disk()link
There are a lot of blogs/tutorials/notes on how to set up digest
access authentication in lighttpd.
Well, one thing they may miss is that it is just broken for
several years.
Digest access authentication is specified by RFC 2617.
It is sort of between SSL and the basic authentication — lightweight,
easy to set up, while resistant to replay attacks.
It works as follows. When the client first requests a
protected resource, the server responds with 401 Unauthorized,
as well as realm and nonce. Then the client needs to send a second
request with the response value for authentication, which is
computed from password, the nonce, etc., as follows.
Here nc is short for nonce count and cnonce is client-side
nonce, both of which are optional.
An attacker that is eavesdropping the traffic cannot guess the
password, nor can the attacker replay the request to the server
since the nonce value would be different each time.
The problem with lighttpd’s mod_auth is that it doesn’t verify
the nonce in the client’s second request is the one it generated
earlier, which means that an attacker could use any nonce to bypass
the digest authentication. Thus, the digest authentication is
basically equivalent to the basic authentication in this case.
The lighttpd developers pointed me to their wiki.
The implementation of digest method is currently not completely compliant with the standard as it still allows a replay attack. (i.e. not secure)
Seriously, do you really have to teach people how to configure
digest authentication but hide the sentence in a bunch of
small bullets? ;-)