Fast integer overflow detection

Writing efficient overflow checks in C is very challenging (but good brain exercise). Even libraries written by security experts, such as Microsoft’s SafeInt and CERT’s IntegerLib, could contain broken overflow checks.

For another example, I have seen a lot of bit tricks for detecting int + int overflow, many of which are actually questionable because they rely on undefined behavior. At C-language level one cannot compute the signed addition first before overflow checking since signed integer overflow is undefined. Guess what C compilers would do to the following code.

int sum(int a, int b)
{
	int c;
	assert(a >= 0);
	assert(b >= 0);
	c = a + b; /* both a and b are non-negative */
	if (c < 0) /* overflow? */
		abort();
	return c;
}

GCC will optimize away the check c < 0, while Clang keeps it. You may try another (broken) form c < a: this time GCC keeps the check, while Clang removes it.

I am proposing an overflow detection library, libo. Unlike previous ones, it consists of assembly code, and due to the fact that I am lazy, the assembly code is generated automatically from a short program.

Let’s start with a simple task: writing an array allocator malloc_array(n, size), where n is the number of elements and size is the element size (i.e., the non-zeroing version of calloc).

Here’s a popular way.

void *malloc_array(size_t n, size_t size)
{
	if (size && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}

Unfortunately neither GCC nor Clang is able to optimize away the division (dunno why though; looks like a straightforward transformation).

malloc_array:                           # @malloc_array
	.cfi_startproc
# BB#0:                                 # %entry
	testq	%rsi, %rsi
	je	.LBB0_3
# BB#1:                                 # %land.rhs
	movq	$-1, %rax
	xorl	%edx, %edx
	divq	%rsi
	cmpq	%rdi, %rax
	jae	.LBB0_3
# BB#2:                                 # %return
	xorl	%eax, %eax
	ret
.LBB0_3:                                # %if.end
	imulq	%rdi, %rsi
	movq	%rsi, %rdi
	jmp	malloc                  # TAILCALL
.Ltmp0:
	.size	malloc_array, .Ltmp0-malloc_array
	.cfi_endproc

If you are crazy about performance, try this trick from the Boehm GC guys.

#define SQRT_SIZE_MAX ((1U << (sizeof(size_t) * 8 / 2)) - 1)

	if ((size | n) > SQRT_SIZE_MAX /* fast test */
	    && size && n > SIZE_MAX / size)
		return NULL;

Another way is to promote the integer operation to a wider type, such as 128-bit (assuming size_t is 64-bit); just don’t forget to do the type cast before multiplication. I am not sure how portable this is. The grsecurity patch uses something similar to modify kmalloc.

void *malloc_array(size_t n, size_t size)
{
	__uint128_t bytes = (__uint128_t)n * (__uint128_t)bytes;
	if (bytes > SIZE_MAX)
		return NULL;
	return malloc((size_t)bytes);
}

With the new libo, malloc_array could be implemented as follows.

void *malloc_array(size_t n, size_t size)
{
	size_t bytes;
	if (umuloz(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}

With link-time optimization, Clang produces very nice code: compared to the baseline code (without overflow checking), it has only one more jno.

malloc_array:                           # @malloc_array
	.cfi_startproc
# BB#0:                                 # %entry
	movq	%rdi, %rax
	mulq	%rsi
	jno	.LBB0_2
# BB#1:                                 # %return
	xorl	%eax, %eax
	ret
.LBB0_2:                                # %if.end
	movq	%rax, %rdi
	jmp	malloc                  # TAILCALL
.Ltmp0:
	.size	malloc_array, .Ltmp0-malloc_array
	.cfi_endproc

Here is the trick. LLVM internally supports arithmetic with overflow operations, based on which the libo generator builds up functions like this:

define i32 @umulo64(i64, i64, i64*) alwaysinline {
entry:
  %3 = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %0, i64 %1)
  %4 = extractvalue { i64, i1 } %3, 0
  store i64 %4, i64* %2
  %5 = extractvalue { i64, i1 } %3, 1
  %6 = sext i1 %5 to i32
  ret i32 %6
}

To use libo with Clang (and link-time optimization), one can just use the IR. To use libo with GCC, one can invoke LLVM’s backends to generate assembly. Assuming LLVM is implemented correctly, we get a correct and efficient libo implementation for all architectures LLVM supports.

One downside is that it may be difficult to do link-time optimization when using GCC; this would end up with having a call to umulo64, rather than inlining it. How to “inline” a function purely written in assembly code using GCC? Or is it possible to have LLVM generate inline assembly instead? Of course it would be nicer if compilers provide built-in functions for overflow detection.