The Go runtime scheduler’s clever way of dealing with system calls
One of Go’s signature features is goroutines, which are lightweight threads that are managed by the Go runtime. The Go runtime implements goroutines using a M:N work stealing scheduler to multiplex goroutines on to operating system threads. The scheduler has special terminology for three important entities; a G is a goroutine, an M is an OS thread (a ‘machine’), and a P is a ‘processor’, which at its core is a limited resource that must be claimed by an M in order to run Go code. Having a limited supply of Ps is how Go limits how many things it will do at once, so as to not overload the overall system; generally there is one P per actual CPU that the OS reports (the number of Ps is GOMAXPROCS
).
When a goroutine performs network IO or any system call operation that can definitely be done asynchronously, Go has an entire runtime subsystem, the netpoller, that converts what looks like multiple separate synchronous operations into a single wait (using operating system mechanisms like epoll
). Rather than actually making a blocking system call, your goroutine goes to sleep waiting for its network socket, just as if it was waiting for a channel to become ready. This is all conceptually straightforward, if tricky to implement efficiently.
However, network IO and similar things are far from the only system calls that Go programs can make, so Go has to deal with blocking system calls as well. The straightforward way to handle blocking system calls is for your goroutine’s M to release its P just before it makes the system call and then try to re-acquire a P after the system call resumes. If there’s no free P at that time, your goroutine gets parked in the scheduler along with everything else that’s waiting to run.
While all system calls are blocking in theory, not all are expected to be blocking in practice. For example, on modern systems the ‘system call’ to get the current time may not even enter the kernel (see vdso(7) on Linux). Having goroutines go through the full work of releasing their current P and then re-acquiring one for these system calls has two problems. First, there’s a bunch of overhead involved in locking (and releasing) all of the data structures involved. Second, if there’s more runnable goroutines than Ps, a goroutine that makes this sort of system call won’t be able to re-acquire a P and will have to park itself; the moment it released the P, something else was scheduled onto it. This is extra runtime overhead, is sort of unfair, and kind of defeats the purpose of having fast system calls (especially ones that don’t go into the kernel).
So the Go runtime and scheduler actually have two ways of handling blocking system calls, a pessimistic way for system calls that are expected to be slow and an optimistic way for ones that are expected to be fast. The pessimistic system call path implements the straightforward approach where the runtime actively releases the P before the system call, attempts to get it back afterward, and parks itself if it can’t. The optimistic system call path doesn’t release the P; instead, it sets a special P state flag and just makes the system call. A special internal goroutine, the sysmon goroutine, then comes along periodically and looks for P’s that have been sitting in this ‘making a system call’ state for too long and steals them away from the goroutine making the system call. When the system call returns, the runtime code checks to see if its P has been stolen out from underneath it, and if it hasn’t it can just go on (if the P has been stolen, the runtime tries to get another P and then may have to park your goroutine).
If everything works out, the optimistic system call path has very low overhead (mostly it requires a couple of atomic compare and swap operations). If things don’t work out and there’s more runnable goroutines than there are Ps, then one P will be unnecessarily idle for what is probably generally a few tens of microseconds (the sysmon goroutine runs at most once every 20 microseconds, but can run less frequently if there seems to be no need for it). There are probably worst case scenarios possible, but in general this seems to be a worthwhile tradeoff on the part of the Go runtime.