System call — How it works internally

Mar 6, 2021

A system call is a request that a user process makes to the operating system (OS) so that the OS performs an operation on its behalf. In modern operating systems, some operations cannot be performed directly by a user process and require the help of the OS. Examples of such operations are reading from disk, writing to disk, and forking a process.

Why do we need system calls

There are two main reasons why such operations are not directly available to a process:

The first one is protection: the OS must check that the process has the right to perform the requested operation.
The second one is bookkeeping: the OS needs to update its data structures when the operation is performed. If we take the fork system call, the OS needs to update many data structures related to processes to take the new process into account, add it to the run queue, and so on.
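To make the fork example concrete, here is a minimal sketch of what the bookkeeping looks like from user space, assuming a POSIX system. The helper name fork_and_wait is hypothetical. The kernel creates the child, tracks it, and later hands its exit status back to the parent.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with `code`, then return the exit
   status the parent collects, or -1 on failure.
   Hypothetical helper, for illustration only. */
int fork_and_wait(int code)
{
    assert(code >= 0 && code <= 255);     /* exit statuses are one byte wide */

    pid_t pid = fork();                   /* the kernel creates and registers the new process */
    if (pid < 0)
        return -1;                        /* the kernel refused, e.g. a process limit was hit */
    if (pid == 0)
        _exit(code);                      /* child: terminate right away */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)  /* parent: ask the kernel for the child's status */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling fork_and_wait(7) should return 7: the only way the parent can learn that value is through the process records the kernel maintains.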

System call wrapper function

There is a very important distinction to make. In C, when we open a file with fopen("filename.txt","r"), we are not making a system call directly. Instead, we are calling the fopen function of the C library, which makes the open system call for us. The code of the fopen function is defined in a C library implementation such as glibc, while the code of the open system call is defined in the kernel of the OS.
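The layers can be seen side by side in a short sketch, assuming Linux with glibc. The three helper names are hypothetical. Note that even open is a thin library wrapper, and that syscall is itself a libc function that merely loads the registers and traps into the kernel. The raw variant uses openat because not every Linux architecture provides a plain open system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Each helper returns 0 if `path` can be opened for reading, -1 otherwise. */

/* Highest level: stdio. fopen() sets up a buffered FILE and issues an
   open-family system call somewhere inside the C library. */
int can_open_stdio(const char *path)
{
    assert(path != NULL);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    fclose(fp);
    return 0;
}

/* Middle level: the thin libc wrapper around the system call. */
int can_open_wrapper(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* Lowest level reachable from portable C: name the system call by number. */
int can_open_raw(const char *path)
{
    assert(path != NULL);
    long fd = syscall(SYS_openat, AT_FDCWD, path, O_RDONLY, 0);
    if (fd < 0)
        return -1;
    close((int)fd);
    return 0;
}
```

All three should agree for any given path, since they end up in the same kernel code; only the amount of library code in between differs.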

Switching from user mode to kernel mode

When a process is executing, it can run in two modes: user mode or kernel mode. It runs in user mode when it is executing ordinary CPU instructions that don’t require any privilege, such as jumping to an address, loading from memory, or writing to memory. However, when the process needs a privileged operation, it has to hand control to the OS, which executes the operation on its behalf. This is what we call kernel mode.

When performing a system call, the process needs to switch from user mode to kernel mode as it is going to execute privileged instructions. Switching from user mode to kernel mode involves many changes in the state and privilege of the current process:

The notion of current privilege is stored in the CPU. For example, in x86 CPUs, this information is held in the two low-order bits of the CS register, a field called CPL (Current Privilege Level). Its value is 0 in kernel mode and 3 in user mode. So the first step of a system call is to change the value of CPL to 0.
The code segment is no longer the code segment of the process (which contains the executed program’s instructions). It is now the kernel code segment, because the instructions of open are not part of the executed program but of the kernel code.
The stack segment no longer points to the user stack but to the kernel stack. Each process (more generally, each thread) has two stacks: a user-mode stack, used while executing the program’s instructions, and a kernel-mode stack, used while executing kernel code such as system call handling. There are several reasons for having two stacks instead of one. The first is protection: no information that the kernel used while executing should be available to the process after switching back to user mode. The second is that kernel code should not be responsible for overflowing the stack of the user program.
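The CPL can be observed from an ordinary program. The sketch below assumes GCC or Clang inline assembly. The function names are hypothetical, and on non-x86 machines current_cpl simply returns -1. It reads CS and keeps its two low-order bits.

```c
#include <assert.h>
#include <stdint.h>

/* The CPL is the two low-order bits of the CS selector. */
int cpl_from_selector(uint16_t cs)
{
    return cs & 0x3;
}

/* Read the live CS register on x86; returns -1 on other architectures. */
int current_cpl(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint16_t cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));   /* reading CS is not privileged */
    int cpl = cpl_from_selector(cs);
    assert(cpl == 3);   /* code running in user mode can only ever observe 3 */
    return cpl;
#else
    return -1;
#endif
}
```

As example values, 64-bit Linux uses the selector 0x33 for user code and 0x10 for kernel code, so cpl_from_selector gives 3 and 0 respectively. A user program cannot simply write a new value into CS to lower its CPL; the change only happens through the controlled entry points described next.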

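For a program to switch from user mode to kernel mode, the classic mechanism is an interrupt. (Newer x86 CPUs also provide dedicated instructions, sysenter and syscall, for this purpose; this article focuses on the interrupt-based path.)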

In general, an interrupt can be a hardware interrupt, such as a timer interrupt that makes the CPU switch to kernel mode and, for example, perform a process switch. It can also be caused by the program itself: an exception such as a division by zero or a page fault, or a software interrupt raised deliberately, such as a system call.

Interrupts are handled by the OS by means of a table called the Interrupt Descriptor Table (IDT), which maps each type of interrupt to a function that the OS executes when that interrupt occurs (we call this function an interrupt handler). For example, on x86 the first entry of this table (vector 0) points to the code that is executed when a division by zero happens.
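Conceptually, the table is just an array indexed by interrupt number. The following toy model is a sketch with hypothetical names, not kernel code. A real IDT entry is a gate descriptor that holds a segment selector, an offset, and privilege bits rather than a plain function pointer. The vector numbers are the x86 ones: 0 for the divide error, 14 for the page fault, and 0x80 for the Linux system call gate.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_VECTORS 256   /* x86 has 256 interrupt vectors */

enum toy_event { TOY_UNHANDLED, TOY_DIVIDE_ERROR, TOY_PAGE_FAULT, TOY_SYSCALL };

typedef enum toy_event (*toy_handler)(void);

static enum toy_event on_divide_error(void) { return TOY_DIVIDE_ERROR; }
static enum toy_event on_page_fault(void)   { return TOY_PAGE_FAULT; }
static enum toy_event on_syscall(void)      { return TOY_SYSCALL; }

/* Toy IDT: vector number -> handler. Unlisted entries stay NULL. */
static const toy_handler toy_idt[TOY_NUM_VECTORS] = {
    [0]    = on_divide_error,
    [14]   = on_page_fault,
    [0x80] = on_syscall,
};

/* What the CPU does on an interrupt, reduced to a table lookup and a call. */
enum toy_event toy_dispatch(unsigned vector)
{
    assert(vector < TOY_NUM_VECTORS);
    toy_handler handler = toy_idt[vector];
    return handler != NULL ? handler() : TOY_UNHANDLED;
}
```

Here toy_dispatch(0x80) runs the system call handler, while a vector with no registered handler, such as 200, yields TOY_UNHANDLED.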

The “int $0x80” instruction

To make a system call on an x86 CPU (in 32-bit Linux, for example), we execute the int $0x80 instruction. int stands for interrupt, and 0x80 (the hexadecimal number corresponding to 128) is the position in the IDT of the system call handler. Before executing this instruction, we store in well-defined registers the number of the system call we want to execute (open, read, fork) and its arguments. For example, the system call number is stored in the EAX register.
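Here is a minimal sketch of issuing a system call this way, assuming Linux and GCC or Clang. The name raw_getpid is hypothetical. The int $0x80 path is compiled only on 32-bit x86, where SYS_getpid is the number the kernel expects in EAX. On any other architecture the sketch falls back to libc's syscall function, which uses whatever trap instruction the platform prefers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for our process ID without going through getpid(). */
long raw_getpid(void)
{
    long ret;
#if defined(__i386__)
    /* System call number goes in EAX, then we trap through IDT entry 0x80.
       The kernel leaves the return value in EAX. */
    __asm__ volatile("int $0x80"
                     : "=a"(ret)
                     : "a"((long)SYS_getpid)
                     : "memory");
#else
    /* Not 32-bit x86: let libc choose the trap instruction. */
    ret = syscall(SYS_getpid);
#endif
    assert(ret > 0);   /* process IDs are always positive */
    return ret;
}
```

getpid takes no arguments, which keeps the example small. A system call with arguments would load them into further registers before the trap.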

Executing this instruction performs the switch from user to kernel mode. It retrieves the kernel stack segment and stack pointer from the TSS (Task State Segment), and then pushes the user-mode registers onto this kernel stack: the user stack segment register SS, the stack pointer ESP, EFLAGS, the code segment register CS, and the instruction pointer EIP. The kernel code segment and instruction pointer are then loaded from the IDT entry, which the CPU locates through the IDTR register.

When int $0x80 finishes executing, the instruction pointer points to the first instruction of the system call handler, which does some checks and saves some state before executing the instructions of read, open, or whichever system call we made.

One thing worth emphasizing is the difference between a regular function call and a system call. When we execute a function call, we push the arguments onto the stack and jump to the function. A system call can’t do the same thing, for two reasons: the kernel does not execute on the same stack, so the arguments are stored in registers instead, and the kernel code is not in the same code segment, so a simple jump instruction will not work.

Returning from a system call: The “iret” instruction

When the kernel has finished executing the system call, it needs to hand control back to the user process. Just as there is a special instruction to switch from user to kernel mode, there is another special instruction that switches back to user mode: the iret instruction. It pops the user registers that were pushed by int (EIP, CS, EFLAGS, ESP, and SS) and stores them back in the corresponding registers. Once this is done, we have switched back to user mode and are ready to continue the execution of our program.
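The round trip can be modeled in a few lines of C. This is a toy simulation with hypothetical names, not what the hardware literally executes: toy_int saves the five user registers on a simulated kernel stack in the order x86 pushes them (SS, ESP, EFLAGS, CS, EIP) and jumps to the handler, and toy_iret pops them back in reverse.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_STACK_SLOTS 16

/* The registers involved in the mode switch. */
struct toy_cpu {
    uint32_t eip, cs, eflags, esp, ss;
};

/* Simulated kernel stack; `top` counts the used slots. */
struct toy_kernel_stack {
    uint32_t slots[TOY_STACK_SLOTS];
    int top;
};

static void toy_push(struct toy_kernel_stack *ks, uint32_t value)
{
    assert(ks->top < TOY_STACK_SLOTS);
    ks->slots[ks->top++] = value;
}

static uint32_t toy_pop(struct toy_kernel_stack *ks)
{
    assert(ks->top > 0);
    return ks->slots[--ks->top];
}

/* Model of `int`: save the user state, then enter the handler.
   On real hardware SS:ESP are also switched to the kernel stack taken
   from the TSS; here the separate toy stack stands in for that. */
void toy_int(struct toy_cpu *cpu, struct toy_kernel_stack *ks,
             uint32_t kernel_cs, uint32_t handler_eip)
{
    toy_push(ks, cpu->ss);
    toy_push(ks, cpu->esp);
    toy_push(ks, cpu->eflags);
    toy_push(ks, cpu->cs);
    toy_push(ks, cpu->eip);
    cpu->cs  = kernel_cs;     /* low two bits are 0, so CPL becomes 0 */
    cpu->eip = handler_eip;   /* first instruction of the handler */
}

/* Model of `iret`: restore the user state in reverse order. */
void toy_iret(struct toy_cpu *cpu, struct toy_kernel_stack *ks)
{
    cpu->eip    = toy_pop(ks);
    cpu->cs     = toy_pop(ks);
    cpu->eflags = toy_pop(ks);
    cpu->esp    = toy_pop(ks);
    cpu->ss     = toy_pop(ks);
}
```

After a toy_int followed by a toy_iret, every field of the toy_cpu structure holds its original value again and the kernel stack is empty, which is exactly why the user program can resume as if nothing had happened.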