Integer arithmetic from a computer point of view

This post is a work in progress, so if you find it incomplete and not readable probably it's not finished yet. I prefer to publish a little before than leave a post to rust in my drafts.

A thing that is overlooked is the way arithmetic operations work in a computer and specifically in the processing unit: not having a clear idea of how the operations are performed and their limitations can cause very important bug to happen and also help in case you want to reverse unknown code.

First of all, the arithmetic inside a CPU is done on register of fixed size via the ALU and since the register are size limited, all the arithmetic operations are intended modulo the size of the registers.

Integer encoding

We are interested of the case of a register containing a number: since the register has a fixed number of bits (let’s say is \(N\)) we can only represent (directly) unsigned values between \(0\) and \(2^N - 1\).

The formula is the following

Remember the following properties for binary numbers: completness

and shifting - multiplication relation

that means that left shifting a binary number is equivalent to multiplying the same number for a power of two.

For signed numbers there is not a unique way to represent them: the quick and dirty way would be to use the most significant bit as sign bit; this has two drawbacks:

  1. when you add a positive and negative number together you have to invert the operation to be a subtraction (and viceversa) and this need a circuitery in the processor.
  2. you have two zeros

I think that this encoding is not used by anyone in the real world (but I could be wrong), there are more efficient ways.

One’s complement

It consists in flipping all the bits of a number, in this way if you define the negative of a given number as the one’s complement of it you have the nice property that this two numbers summed are equal to zero and you don’t have to implement any particular circuitery on the processor since the operations are executed as usual.

The problem is that you have two zeros: all bits equal to zero and all equal to one.

Two’s complement

It’s an extension of the one’s complement: to obtain the negative representation of a number you have to take the one’s complement and add one: in this way you have an asymmetry between the minimum and maximum number that can be represented, i.e. you can represent values between \(-2^{N - 1}\) and \(2^{N - 1} - 1\). For example with 7 bits you have the interval \((-64, 63)\).

Normally in the code is this the way the negative numbers are represented.

Remember that a value into a register is not signed or unsigned by itself, it depends on how is used in the code.


Sign extension

In certain cases could be necessary to do operations between numbers having a different number of bits; if these numbers are unsigned it’s not big deal, but if instead we having signed ones we have to sign extend i.e. to complete the bits of the extended number with all 1s.

Let me make an example: if we have a 8-bit register with the decimal value \(-16\), its representation with two’s complement will be 0xef; now, if we want to put this value into a 16-bit register and represent the same number, we have to set as most significant byte 0xff, i.e. 0xffef: this because of a nice property of binary numbers, namely

in our case we have the Mth bit used for the sign and suppose we have other \(s\) bits

Some architectures have direct instructions to do that, like the movxs in x86, other instead use multiple operations to do the same


It’s all fine and good but as already said, we have a limited number of bits to represent numbers, so it’s possible that some operations couldn’t be done correctly: for example, if you want to sum, in a 8bits-register 0xff to any other number, you can’t fit the result in the register, you should have one bit more; for this reason in a CPU you have also some flags (i.e. one-bit values) usually contained in an unique register to indicate some particular properties of the last arithmetic operation. Take in mind that not all architectures have it.

Each system has its own nomenclature and specific flags, but I think the minimal set is composed of the following

Carry flag

Used in unsigned numbers to indicate that the result doesn’t fit in the register.

Overflow flag

Used for signed numbers to indicate that the resulting sign bit is not coherent with the correct result; for example with 4-bit numbers we can have the following four cases:

Zero flag

Programmation errors

Out of bounds




Do you find this post incomplete? probably because it's a work in progress. Let me know how do you want this to be completed