During my research on power/performance analysis of computational workloads, I frequently use the perf_events kernel subsystem to get a graph of relevant performance counter values such as cache misses and instructions retired. You can examine this if you want by installing the libpfm4 library on a Linux machine with an enough up-to-date kernel. For me, Ubuntu 10.10 or newer works perfect. After compilation of libpfm4, you can find a “syst” tool in the “perf_examples” folder which can be used to examine the perf_events. A sample output of this command is given below (Note: the long output lines seem a mess on small monitors. If this is the case for you, copy and paste them in a text editor with line wrapping disabled):
$ sudo ./syst -c 0 -e PERF_COUNT_HW_CACHE_L1D:MISS,PERF_COUNT_HW_CACHE_L1I:MISS,PERF_COUNT_HW_CACHE_LL:MISS,PERF_COUNT_HW_CACHE_L1D:ACCESS,PERF_COUNT_HW_CACHE_L1I:ACCESS,PERF_COUNT_HW_CACHE_LL:ACCESS ------------------------ CPU0 val=409989 raw=199299 ena=2000395421 run=972407540 ratio=0.49 delta=369213 PERF_COUNT_HW_CACHE_L1D:MISS CPU0 val=203841 raw=99091 ena=2000432800 run=972446620 ratio=0.49 delta=164564 PERF_COUNT_HW_CACHE_L1I:MISS CPU0 val=211172 raw=844 ena=2000444925 run=7995247 ratio=0.00 delta=105590 PERF_COUNT_HW_CACHE_LL:MISS CPU0 val=16091475 raw=8108192 ena=2000465017 run=1007996718 ratio=0.50 delta=9654365 PERF_COUNT_HW_CACHE_L1D:ACCESS CPU0 val=42724654 raw=21784141 ena=2000484350 run=1019992634 ratio=0.51 delta=30353998 PERF_COUNT_HW_CACHE_L1I:ACCESS CPU0 val=10031112 raw=100233 ena=2000504170 run=19989462 ratio=0.01 delta=10031112 PERF_COUNT_HW_CACHE_LL:ACCESS ------------------------ CPU0 val=501430 raw=261466 ena=3000602105 run=1564635603 ratio=0.52 delta=91441 PERF_COUNT_HW_CACHE_L1D:MISS CPU0 val=347087 raw=182375 ena=3000612244 run=1576652337 ratio=0.53 delta=143246 PERF_COUNT_HW_CACHE_L1I:MISS CPU0 val=375206 raw=5499 ena=3000615126 run=43976828 ratio=0.01 delta=164034 PERF_COUNT_HW_CACHE_LL:MISS CPU0 val=33934452 raw=14475612 ena=3000617370 run=1279990387 ratio=0.43 delta=17842977 PERF_COUNT_HW_CACHE_L1D:ACCESS CPU0 val=64070808 raw=29722497 ena=3000619469 run=1391989674 ratio=0.46 delta=21346154 PERF_COUNT_HW_CACHE_L1I:ACCESS CPU0 val=2344440 raw=112486 ena=3000622062 run=143969520 ratio=0.05 delta=18446744073701864944 PERF_COUNT_HW_CACHE_LL:ACCESS ------------------------ CPU0 val=693081 raw=273320 ena=4000722671 run=1577704276 ratio=0.39 delta=191651 PERF_COUNT_HW_CACHE_L1D:MISS CPU0 val=567389 raw=227722 ena=4000729723 run=1605693997 ratio=0.40 delta=220302 PERF_COUNT_HW_CACHE_L1I:MISS CPU0 val=509063 raw=8139 ena=4000731389 run=63964418 ratio=0.02 delta=133857 PERF_COUNT_HW_CACHE_LL:MISS CPU0 val=48617260 raw=21047221 ena=4000733390 run=1731984034 ratio=0.43 delta=14682808 PERF_COUNT_HW_CACHE_L1D:ACCESS CPU0 val=67881736 raw=40027485 ena=4000739182 run=2359095923 ratio=0.59 delta=3810928 PERF_COUNT_HW_CACHE_L1I:ACCESS CPU0 val=852461 raw=141288 ena=4000745498 run=663088489 ratio=0.17 delta=18446744073708059637 PERF_COUNT_HW_CACHE_LL:ACCESS ------------------------ CPU0 val=1271202 raw=406130 ena=5000845112 run=1597694641 ratio=0.32 delta=578121 PERF_COUNT_HW_CACHE_L1D:MISS CPU0 val=812083 raw=316607 ena=5000851173 run=1949682622 ratio=0.39 delta=244694 PERF_COUNT_HW_CACHE_L1I:MISS CPU0 val=68883 raw=14204 ena=5000856091 run=1031189733 ratio=0.21 delta=18446744073709111436 PERF_COUNT_HW_CACHE_LL:MISS CPU0 val=45801721 raw=21607567 ena=5000861964 run=2359222674 ratio=0.47 delta=18446744073706736077 PERF_COUNT_HW_CACHE_L1D:ACCESS CPU0 val=87662128 raw=41579483 ena=5000865063 run=2371986478 ratio=0.47 delta=19780392 PERF_COUNT_HW_CACHE_L1I:ACCESS CPU0 val=1364606 raw=188820 ena=5000867812 run=691967970 ratio=0.14 delta=512145 PERF_COUNT_HW_CACHE_LL:ACCESS ------------------------ CPU0 val=1427100 raw=511222 ena=6000961317 run=2149690055 ratio=0.36 delta=155898 PERF_COUNT_HW_CACHE_L1D:MISS CPU0 val=979620 raw=350930 ena=6000967977 run=2149729928 ratio=0.36 delta=167537 PERF_COUNT_HW_CACHE_L1I:MISS CPU0 val=89669 raw=18827 ena=6000969658 run=1259961587 ratio=0.21 delta=20786 PERF_COUNT_HW_CACHE_LL:MISS CPU0 val=57523422 raw=26794792 ena=6000974528 run=2795293771 ratio=0.47 delta=11721701 PERF_COUNT_HW_CACHE_L1D:ACCESS CPU0 val=131362760 raw=56725304 ena=6000980367 run=2591354160 ratio=0.43 delta=43700632 PERF_COUNT_HW_CACHE_L1I:ACCESS CPU0 val=1733386 raw=305015 ena=6000983863 run=1055962131 ratio=0.18 delta=368780 PERF_COUNT_HW_CACHE_LL:ACCESS
The delta values reported above state the counted values within the periods of 1s. One odd thing about that output is the very large outlier values such as the last-level cache misses of 18446744073701864944 in the second measurement interval. It got me quite a time until I found the reason for getting such values. First of all, if you look at the source code, you will see that they are not big outliers, but in fact small negative values which are interpreted as unsigned 64-bit integers. For example, 18446744073701864944 is in fact -7686672 when interpreted as unsigned integer. But why do we get negative numbers at all?
The root of the problem lies in a concept known as “event multiplexing”. The number or CPU performance counters are limited. For example, my Core 2 Quad Intel CPU only allows two cache-related events to be counted simultaneously. To overcome this limitation, the counters are time-shared among the desired counted events in software (in perf_events kernel subsystem). Thus, an event is in fact counted at only a fraction of the real time passed. For the fraction in which a counter is not counted, the value is estimated based on the other fraction. This causes some error in the event counting, which obviously may deviate the value into a negative number. We, in fact, trade accuracy for counting more events.