Hamid Fadishei's Blog

July 8, 2011

perf_events and the secret behind very very large outlier values

During my research on power/performance analysis of computational workloads, I frequently use the perf_events kernel subsystem to get a graph of relevant performance counter values such as cache misses and instructions retired. You can examine this if you want by installing the libpfm4 library on a Linux machine with an enough up-to-date kernel. For me, Ubuntu 10.10 or newer works perfect. After compilation of libpfm4, you can find a “syst” tool in the “perf_examples” folder which can be used to examine the perf_events. A sample output of this command is given below (Note: the long output lines seem a mess on small monitors. If this is the case for you, copy and paste them in a text editor with line wrapping disabled):

$ sudo ./syst -c 0 -e PERF_COUNT_HW_CACHE_L1D:MISS,PERF_COUNT_HW_CACHE_L1I:MISS,PERF_COUNT_HW_CACHE_LL:MISS,PERF_COUNT_HW_CACHE_L1D:ACCESS,PERF_COUNT_HW_CACHE_L1I:ACCESS,PERF_COUNT_HW_CACHE_LL:ACCESS
------------------------
CPU0 val=409989               raw=199299 ena=2000395421 run=972407540 ratio=0.49 delta=369213          PERF_COUNT_HW_CACHE_L1D:MISS
CPU0 val=203841               raw=99091 ena=2000432800 run=972446620 ratio=0.49 delta=164564           PERF_COUNT_HW_CACHE_L1I:MISS
CPU0 val=211172               raw=844 ena=2000444925 run=7995247 ratio=0.00 delta=105590               PERF_COUNT_HW_CACHE_LL:MISS
CPU0 val=16091475             raw=8108192 ena=2000465017 run=1007996718 ratio=0.50 delta=9654365       PERF_COUNT_HW_CACHE_L1D:ACCESS
CPU0 val=42724654             raw=21784141 ena=2000484350 run=1019992634 ratio=0.51 delta=30353998     PERF_COUNT_HW_CACHE_L1I:ACCESS
CPU0 val=10031112             raw=100233 ena=2000504170 run=19989462 ratio=0.01 delta=10031112         PERF_COUNT_HW_CACHE_LL:ACCESS
------------------------
CPU0 val=501430               raw=261466 ena=3000602105 run=1564635603 ratio=0.52 delta=91441          PERF_COUNT_HW_CACHE_L1D:MISS
CPU0 val=347087               raw=182375 ena=3000612244 run=1576652337 ratio=0.53 delta=143246         PERF_COUNT_HW_CACHE_L1I:MISS
CPU0 val=375206               raw=5499 ena=3000615126 run=43976828 ratio=0.01 delta=164034             PERF_COUNT_HW_CACHE_LL:MISS
CPU0 val=33934452             raw=14475612 ena=3000617370 run=1279990387 ratio=0.43 delta=17842977     PERF_COUNT_HW_CACHE_L1D:ACCESS
CPU0 val=64070808             raw=29722497 ena=3000619469 run=1391989674 ratio=0.46 delta=21346154     PERF_COUNT_HW_CACHE_L1I:ACCESS
CPU0 val=2344440              raw=112486 ena=3000622062 run=143969520 ratio=0.05 delta=18446744073701864944 PERF_COUNT_HW_CACHE_LL:ACCESS
------------------------
CPU0 val=693081               raw=273320 ena=4000722671 run=1577704276 ratio=0.39 delta=191651         PERF_COUNT_HW_CACHE_L1D:MISS
CPU0 val=567389               raw=227722 ena=4000729723 run=1605693997 ratio=0.40 delta=220302         PERF_COUNT_HW_CACHE_L1I:MISS
CPU0 val=509063               raw=8139 ena=4000731389 run=63964418 ratio=0.02 delta=133857             PERF_COUNT_HW_CACHE_LL:MISS
CPU0 val=48617260             raw=21047221 ena=4000733390 run=1731984034 ratio=0.43 delta=14682808     PERF_COUNT_HW_CACHE_L1D:ACCESS
CPU0 val=67881736             raw=40027485 ena=4000739182 run=2359095923 ratio=0.59 delta=3810928      PERF_COUNT_HW_CACHE_L1I:ACCESS
CPU0 val=852461               raw=141288 ena=4000745498 run=663088489 ratio=0.17 delta=18446744073708059637 PERF_COUNT_HW_CACHE_LL:ACCESS
------------------------
CPU0 val=1271202              raw=406130 ena=5000845112 run=1597694641 ratio=0.32 delta=578121                 PERF_COUNT_HW_CACHE_L1D:MISS
CPU0 val=812083               raw=316607 ena=5000851173 run=1949682622 ratio=0.39 delta=244694                 PERF_COUNT_HW_CACHE_L1I:MISS
CPU0 val=68883                raw=14204 ena=5000856091 run=1031189733 ratio=0.21 delta=18446744073709111436    PERF_COUNT_HW_CACHE_LL:MISS
CPU0 val=45801721             raw=21607567 ena=5000861964 run=2359222674 ratio=0.47 delta=18446744073706736077 PERF_COUNT_HW_CACHE_L1D:ACCESS
CPU0 val=87662128             raw=41579483 ena=5000865063 run=2371986478 ratio=0.47 delta=19780392             PERF_COUNT_HW_CACHE_L1I:ACCESS
CPU0 val=1364606              raw=188820 ena=5000867812 run=691967970 ratio=0.14 delta=512145                  PERF_COUNT_HW_CACHE_LL:ACCESS
------------------------
CPU0 val=1427100              raw=511222 ena=6000961317 run=2149690055 ratio=0.36 delta=155898               PERF_COUNT_HW_CACHE_L1D:MISS
CPU0 val=979620               raw=350930 ena=6000967977 run=2149729928 ratio=0.36 delta=167537               PERF_COUNT_HW_CACHE_L1I:MISS
CPU0 val=89669                raw=18827 ena=6000969658 run=1259961587 ratio=0.21 delta=20786                 PERF_COUNT_HW_CACHE_LL:MISS
CPU0 val=57523422             raw=26794792 ena=6000974528 run=2795293771 ratio=0.47 delta=11721701           PERF_COUNT_HW_CACHE_L1D:ACCESS
CPU0 val=131362760            raw=56725304 ena=6000980367 run=2591354160 ratio=0.43 delta=43700632           PERF_COUNT_HW_CACHE_L1I:ACCESS
CPU0 val=1733386              raw=305015 ena=6000983863 run=1055962131 ratio=0.18 delta=368780               PERF_COUNT_HW_CACHE_LL:ACCESS

The delta values reported above state the counted values within the periods of 1s. One odd thing about that output is the very large outlier values such as the last-level cache misses of 18446744073701864944 in the second measurement interval. It got me quite a time until I found the reason for getting such values. First of all, if you look at the source code, you will see that they are not big outliers, but in fact small negative values which are interpreted as unsigned 64-bit integers. For example, 18446744073701864944 is in fact -7686672 when interpreted as unsigned integer. But why do we get negative numbers at all?
The root of the problem lies in a concept known as “event multiplexing”. The number or CPU performance counters are limited. For example, my Core 2 Quad Intel CPU only allows two cache-related events to be counted simultaneously. To overcome this limitation, the counters are time-shared among the desired counted events in software (in perf_events kernel subsystem). Thus, an event is in fact counted at only a fraction of the real time passed. For the fraction in which a counter is not counted, the value is estimated based on the other fraction. This causes some error in the event counting, which obviously may deviate the value into a negative number. We, in fact, trade accuracy for counting more events.

Advertisements

Create a free website or blog at WordPress.com.