It is no secret we like performance. Hell, we wrote our own kernel for that very reason. We've also built a lot of tooling in the past to help us find hotspots and make it easier to profile Nanos. For instance, we ported ftrace over along with a few pandas/plotly scripts to help visualize hotspots. We've also previously addressed how the lack of tooling for unikernel debugging is a myth. We literally have an APM product that makes unikernel performance monitoring in production a breeze.
Today we'll introduce you to another tool. We've had simple strace-like support for well over a year and a half now. It is what allows us to find a wide variety of issues, like syscalls that error out or files that might be missing. That output looks like so:
➜ g ops run --trace g
booting /Users/eyberg/.ops/images/g.img ...
NET: starting DHCP for interface en1
NET: starting DHCPv6 for interface en1
en1: assigned 10.0.2.15
read program complete: 0xffff80c000000000 gitversion:
653ecb66a19fdd28a441bbc46178fcf849924463 1 run thread, cpu 0, frame
0xffffc00003402000, pc 0x465860, sp 0x7fcb8ed0, rv 0x0
1 arch_prctl
1 arch_prctl: code 0x1002, addr 0x54c5b0
1 direct return: 0, rsp 0x7fcb8e70
1 run thread, cpu 0, frame 0xffffc00003402000, pc 0x466165, sp
0x7fcb8e70, rv 0x0
1 sched_getaffinity
1 direct return: 8, rsp 0x7fcb6e50
1 run thread, cpu 0, frame 0xffffc00003402000, pc 0x4661d6, sp
0x7fcb6e50, rv 0x8
1 openat
1 "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" - not found
1 direct return: -2, rsp 0x7fcb8e30
1 run thread, cpu 0, frame 0xffffc00003402000, pc 0x4658da, sp
0x7fcb8e30, rv 0xfffffffffffffffe
1 mmap
1 mmap: addr 0x0000000000000000, length 0x40000, prot 0x3, flags 0x22,
fd -1, offset 0x0
1 alloc: 0x2fe72975000
1 anonymous, specified target 0x2fe72975000
1 vmap_remove_intersection: q [0x2fe72975000 0x2fe729b5000), r
[0x2fe72975000 0x2fe729b5000)
1 returning 0x2fe72975000
1 direct return: 3291867467776, rsp 0x7fcb8bf8
1 run thread, cpu 0, frame 0xffffc00003402000, pc 0x465ee4, sp
0x7fcb8bf8, rv 0x2fe72975000
1 page fault, vaddr 0x2fe72975000
We recently added support for showing the equivalent of 'strace -c', which is a syscall summary table showing the number of calls per syscall, the time spent in each, and any errors, along with VM exits, since Nanos is always provisioned as a virtual machine. Let's start with a simple C sample.
#include <stdio.h>
int main() {
printf("hello!\n");
}
When we run it we see the syscalls used, the number of times each was called, and the associated cost.
eyberg@box:~/z$ ops run --syscall-summary main
You are running an older version of Ops.
Update: Run `ops update`
booting /home/eyberg/.ops/images/main.img ...
en1: assigned 10.0.2.15
hello!
% time seconds usecs/call calls errors syscall
------ ------------ ------------ ------------ ------------------------------
48.53 0.000644 215 3 brk
25.16 0.000334 334 1 read
16.05 0.000213 213 1 write
3.31 0.000044 6 7 mmap
1.73 0.000023 6 4 pread64
1.58 0.000021 4 5 4 openat
0.97 0.000013 13 1 close
0.90 0.000012 3 4 mprotect
0.67 0.000009 9 1 uname
0.37 0.000005 5 1 1 access
0.22 0.000003 2 2 fstat
0.22 0.000003 1 3 3 stat
0.22 0.000003 2 2 1 arch_prctl
------ ------------ ------------ ------------ ------------------------------
100.00 0.001327 35 9 total
exit status 1
Here we can see that the calls to brk are taking the most time. Let's look at a Go sample to compare: same program, different language.
package main
import "fmt"
func main() {
fmt.Println("yo")
}
➜ g ops run --syscall-summary g
booting /Users/eyberg/.ops/images/g.img ...
en1: assigned 10.0.2.15
yo
% time seconds usecs/call calls errors syscall
------ ------------ ------------ ------------ ------------ ------------------
36.00 0.000180 1 147 mmap
17.80 0.000089 1 130 munmap
16.20 0.000081 81 1 sched_yield
14.40 0.000072 72 1 write
9.00 0.000045 15 3 clone
3.40 0.000017 17 1 readlinkat
2.00 0.000010 1 13 nanosleep
0.80 0.000004 4 1 1 openat
0.40 0.000002 0 9 futex
0.00 0.000000 0 3 fcntl
0.00 0.000000 0 44 clock_gettime
0.00 0.000000 0 113 rt_sigaction
0.00 0.000000 0 7 gettid
0.00 0.000000 0 11 rt_sigprocmask
0.00 0.000000 0 1 sched_getaffinity
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 8 sigaltstack
------ ------------ ------------ ------------ ------------ ------------------
100.00 0.000500 494 1 total
Interesting to see. In the C program brk dominates, while in Go it is mmap. Did you know both of these calls can provide similar functionality, albeit in different manners? We see mmap being used by Go because Go ships a small managed runtime with a garbage collector, and that is probably one of the reasons the Go team chose mmap for their memory allocator.
So even though the two programs are roughly the same, they behave quite differently at the syscall level. This is why it is important to have tooling to profile syscalls, and exactly where this tool comes in handy.