A system call is a request made to the OS so that it performs an operation on behalf of a user process. In modern operating systems, some operations cannot be performed directly by a user process without the help of the OS. Examples of such operations are reading from disk, writing to disk, forking the process, etc.
Why do we need system calls
There are two main reasons why such operations are not available directly to the process. The first is protection: the OS must perform checks to verify that the process has the right to request the operation. The second is that the OS needs to update its own data structures when the operation is performed. For a fork, for example, the OS needs to update many process-related data structures to account for the new process, add it to the run queue, etc.
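The protection argument can be observed from user space: when a request does not pass the kernel's checks, the system call fails and reports the reason through errno. Here is a minimal sketch, assuming a POSIX system. The helper name try_write_to_invalid_fd is made up for this illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: asks the kernel to write to a file descriptor
 * that this process never opened. Returns the errno value reported
 * by the kernel, or 0 if the call unexpectedly succeeded. */
int try_write_to_invalid_fd(void)
{
    const char msg[] = "hello";

    errno = 0;
    /* -1 is never a valid descriptor, so the kernel's validation
     * rejects the request before any I/O takes place. */
    ssize_t n = write(-1, msg, sizeof msg - 1);
    return (n == -1) ? errno : 0;
}
```

The returned value is EBADF ("bad file descriptor"): the kernel checked the request and refused it instead of touching any device.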
System call wrapper function
In C, when we open a file with fopen("filename.txt","r") we are not making a system call. Rather, we are calling the fopen function of the C library, which will make the open system call for us. The code of the fopen function is defined within a C library implementation such as glibc, while the code of the open system call is defined within the kernel of the OS.
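To make the layering concrete, the sketch below reads a file in two ways: through the buffered stdio functions, and through the lower-level POSIX functions open, read and close. Note that even these POSIX functions are thin C library wrappers around the kernel's system calls. The helper names read_with_stdio and read_with_syscalls are made up for this illustration, and a POSIX system is assumed.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Reads up to cap-1 bytes through the C library's buffered stream API.
 * fopen/fread run in user space and issue the system calls internally.
 * Returns the number of bytes read, or -1 on failure. */
long read_with_stdio(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, cap - 1, fp);
    buf[n] = '\0';
    fclose(fp);
    return (long)n;
}

/* Same task through the thin POSIX wrappers around the system calls.
 * Returns the number of bytes read, or -1 on failure. */
long read_with_syscalls(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n = read(fd, buf, cap - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return (long)n;
}
```

Both functions end up requesting the same service from the kernel. The stdio version simply adds buffering and a FILE abstraction on top.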
Switching from user mode to kernel mode
When performing a system call, the process needs to switch from user mode to kernel mode. Switching from user mode to kernel mode involves many changes in the state and privilege of the current process:
- First, it now has a higher privilege level, which allows it to execute instructions that are not available in user mode. On x86 CPUs, this privilege information is held in the low 2 bits of the CS register, a field called CPL (Current Privilege Level). Its value is 0 in kernel mode and 3 in user mode.
- The code segment is no longer the code segment of the process (which includes the executed program instructions). It is now the kernel code segment, because the instructions of open are not in the executed program but in kernel code.
- The stack segment no longer points to the user stack but to the kernel stack. Each process (more generally, each thread) has two stacks: a user mode stack, used while executing the program's own code, and a kernel mode stack, used while executing kernel code such as system call handling. There are several reasons for having two stacks instead of one. Basically, it's a matter of protection: no information that the kernel used while executing should remain visible to the process after switching back to user mode, and kernel code should not be able to overflow the stack of the user program.
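The CPL described in the first point can be read from a user program, since reading the CS selector is permitted in user mode. The following is a minimal sketch, assuming an x86 or x86-64 target and a compiler that accepts GCC-style inline assembly (GCC or Clang). The function name current_privilege_level is made up for this illustration.

```c
#include <stdint.h>

/* Returns the Current Privilege Level (the low 2 bits of the CS
 * selector), or -1 when not compiled for an x86 target. */
int current_privilege_level(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint16_t cs;

    /* Copy the CS segment selector into a general-purpose register. */
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    return cs & 0x3;
#else
    return -1;
#endif
}
```

Called from an ordinary program, this returns 3, because the program runs in user mode. The same read performed inside the kernel would yield 0.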
For a program to switch from user mode to kernel mode, the classic mechanism is an interrupt. (Modern x86 CPUs also provide dedicated sysenter and syscall instructions, but here we focus on the interrupt path.) An interrupt can be a hardware interrupt, such as a timer interrupt that makes the CPU switch to kernel mode and, for example, perform a process switch, or it can be a software interrupt or exception caused by the program itself, such as a division by zero, a page fault, or a system call.
Interrupts are handled by the OS by means of a table called the Interrupt Descriptor Table (IDT), which maps each type of interrupt to a function that the OS executes when the interrupt happens (we call this function an interrupt handler). For example, on x86 the first entry of this table (vector 0) points to the handler that runs when a division by zero happens.
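Conceptually, the IDT behaves like an array of function pointers indexed by interrupt number. The toy model below illustrates only that lookup. It is not how a real IDT is laid out (real entries are gate descriptors holding a segment selector and an offset), and all names in it are made up for this illustration. The vector numbers used (0 for divide error, 14 for page fault, 0x80 for the system call handler) follow the x86 and 32-bit Linux conventions.

```c
#include <stddef.h>

#define IDT_ENTRIES 256   /* x86 defines 256 interrupt vectors */

typedef int (*interrupt_handler)(void);

/* Toy handlers: each one just returns its own vector number.
 * A real handler would save state and do the actual work. */
static int handle_divide_error(void) { return 0; }
static int handle_page_fault(void)   { return 14; }
static int handle_system_call(void)  { return 0x80; }

/* Toy IDT: index = vector number, value = handler to run.
 * Entries not listed here are initialized to NULL. */
static interrupt_handler toy_idt[IDT_ENTRIES] = {
    [0]    = handle_divide_error,
    [14]   = handle_page_fault,
    [0x80] = handle_system_call,
};

/* Looks up the handler for a vector and runs it.
 * Returns -1 if the vector is out of range or has no handler. */
int dispatch_interrupt(unsigned vector)
{
    if (vector >= IDT_ENTRIES || toy_idt[vector] == NULL)
        return -1;
    return toy_idt[vector]();
}
```

In this model, dispatch_interrupt(0) runs the divide error handler, while a vector with no registered handler yields -1.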
The “int $0x80” instruction
To make a system call on an x86 CPU we execute the int $0x80 instruction. int stands for interrupt, and 0x80 (hexadecimal for 128) is the position in the IDT of the system call handler. Before executing this instruction, we store in well-defined registers the number of the system call we want to execute (open, read, fork) and its arguments. For example, the system call number is stored in the EAX register.
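Here is a hedged sketch of what "system call number plus arguments" looks like from C on Linux. On 32-bit x86 it issues int $0x80 directly with the number in EAX. On any other architecture it falls back to the C library's generic syscall() function, which uses whatever entry instruction that platform provides. The function names raw_getpid and raw_write are made up for this illustration, and GCC-style inline assembly is assumed.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Asks the kernel for the process ID by system call number,
 * bypassing the dedicated getpid() wrapper. */
long raw_getpid(void)
{
#if defined(__i386__)
    long ret;

    /* EAX carries the system call number in and the result out. */
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"(SYS_getpid)
                      : "memory");
    return ret;
#else
    /* Not 32-bit x86: let the C library pick the entry mechanism. */
    return syscall(SYS_getpid);
#endif
}

/* Writes len bytes to fd by system call number. syscall() places
 * the number and the three arguments in the expected registers. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

For example, raw_write(1, "hi\n", 3) prints to standard output in the same way write(1, "hi\n", 3) would.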
Executing this instruction performs the switch from user mode to kernel mode. The CPU retrieves the kernel stack segment and stack pointer from the TSS (Task State Segment), then pushes the user mode registers onto this kernel stack: the user stack segment register SS, the stack pointer ESP, EFLAGS, the code segment register CS, and the instruction pointer EIP. The kernel code segment and instruction pointer are then retrieved from the IDT entry, which the CPU locates through the IDTR register.
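The saved user context can be pictured as a small structure sitting at the top of the kernel stack. The sketch below assumes 32-bit protected mode with a privilege change, where the CPU pushes SS, ESP, EFLAGS, CS and EIP in that order, each in a 32-bit slot. The struct and function names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Fields are listed from the top of the kernel
 * stack (lowest address, pushed last) to the bottom (pushed first). */
struct user_context_frame {
    uint32_t eip;     /* user instruction pointer, pushed last */
    uint32_t cs;      /* user code segment selector (low 16 bits used) */
    uint32_t eflags;  /* flags register */
    uint32_t esp;     /* user stack pointer */
    uint32_t ss;      /* user stack segment selector, pushed first */
};

/* Extracts the privilege level the interrupted code was running at:
 * the low 2 bits of the saved CS selector. */
unsigned saved_privilege_level(const struct user_context_frame *frame)
{
    return frame->cs & 0x3u;
}
```

A kernel can use the saved CS in this way to tell whether an interrupt arrived from user mode (level 3) or from kernel mode (level 0).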
When int $0x80 finishes executing, the instruction pointer points to the first instruction of the system call handler, which does some checks and saves some state before executing the instructions of open, or whatever system call we made.
One thing worth emphasizing is the difference between a regular function call and a system call. When we execute a function call, we push the arguments on the stack and jump to the function. A system call can't do the same thing for two reasons: the kernel doesn't execute on the same stack, so the arguments are passed in registers instead, and the kernel code is not in the same code segment, so a simple jump instruction will not work.
Returning from a system call: The “iret” instruction
When the kernel finishes executing the system call, it needs to hand control back to the user process. Just as there is a special instruction to switch from user to kernel mode, there is another special instruction that switches back to user mode: the iret instruction. It pops the user registers that were pushed by int (EIP, CS, EFLAGS, ESP, and SS) and stores them back in the corresponding registers. Having done this, we have switched back to user mode and are ready to continue the execution of our program.