
The same is true with the AMD model where each processor can have local memory. All processors can indeed concurrently access their local memory quickly, especially with the integrated memory controller. But multi-thread and multi-process programs, at least from time to time, have to access the same memory regions to synchronize.

21. At a smaller level the same is true for two cores on the same processor. The costs are just a bit smaller. The RFO message is likely to be sent many times.

22. Which is why we see nowadays, for instance, AMD Opteron systems with three sockets. Each processor is exactly one hop away given that the processors only have three hyperlinks and one is needed for the Northbridge connection.

Figure 3.20 shows the results for the sequential Increment test. This graph uses a logarithmic scale for the Y axis, so do not be fooled by the apparently small differences. We still have about an 18% penalty for running two threads and now an amazing 93% penalty for running four threads. This means the prefetch traffic together with the write-back traffic is pretty much saturating the bus when four threads are used.

We use the logarithmic scale to show the results for the L1d range. What can be seen is that, as soon as more than one thread is running, the L1d is basically ineffective. The single-thread access times exceed 20 cycles only when the L1d is not sufficient to hold the working set. When multiple threads are running, those access times are hit immediately, even with the smallest working set sizes.
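To make the setup concrete, the following is a simplified sketch of such a multi-threaded sequential Increment run. It is not the harness used for the measurements, and the element layout, working set size, and repeat count are only illustrative. Each thread walks its own chain of list elements and adds one to the first pad member, assuming, as the discussion of RFO messages below suggests, that the threads do not work on shared data, so the slowdown comes mainly from the shared bus and memory interface.

    /* Simplified sketch of a multi-threaded sequential "Increment" test.
       Each thread walks its own pointer chain and increments the first pad
       element of every list element.  NPAD, NELEM, and ROUNDS are
       illustrative placeholders, not the values used for the measurements. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define NPAD   7            /* 8-byte pointer + 7 longs = 64-byte element */
    #define NELEM  (1 << 20)    /* elements in each thread's working set      */
    #define ROUNDS 100          /* how often each thread walks its list       */

    struct l {
      struct l *n;
      long int pad[NPAD];
    };

    static void *run(void *arg) {
      struct l *head = arg;
      for (int r = 0; r < ROUNDS; ++r)
        for (struct l *e = head; e != NULL; e = e->n)
          ++e->pad[0];          /* the "Increment" operation                  */
      return NULL;
    }

    int main(int argc, char *argv[]) {
      int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
      if (nthreads < 1 || nthreads > 4)
        nthreads = 1;
      pthread_t tids[4];

      for (int t = 0; t < nthreads; ++t) {
        /* Each thread gets its own sequentially linked working set. */
        struct l *elems = calloc(NELEM, sizeof(struct l));
        for (size_t i = 0; i + 1 < NELEM; ++i)
          elems[i].n = &elems[i + 1];
        pthread_create(&tids[t], NULL, run, elems);
      }
      for (int t = 0; t < nthreads; ++t)
        pthread_join(tids[t], NULL);
      puts("done");
      return 0;
    }

Timing the walks and dividing by the number of elements processed gives the cycles-per-list-element numbers plotted in the figures.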
One aspect of the problem is not shown here. It is hard to measure with this specific test program. Even though the test modifies memory and we therefore must expect RFO messages, we do not see higher costs for the L2 range when more than one thread is used. The program would have to use a large amount of memory and all threads must access the same memory in parallel. This is hard to achieve without a lot of synchronization which would then dominate the execution time.
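The effect can be provoked deliberately, though. The following minimal example, which is not part of the test program, has two threads write to counters that live in the same cache line; the writes repeatedly force RFO messages as the line bounces between the two caches. The PAD switch and the constants are illustrative.

    /* Minimal illustration of RFO traffic caused by writes to a shared
       cache line (false sharing).  With PAD == 0 both counters sit in the
       same 64-byte line; with PAD == 1 each counter gets its own line. */
    #include <pthread.h>
    #include <stdio.h>

    #define PAD  0
    #define ITER 100000000L

    struct counter {
      volatile long v;
    #if PAD
      char pad[64 - sizeof(long)];
    #endif
    };

    static _Alignas(64) struct counter counters[2];

    static void *bump(void *arg) {
      struct counter *c = arg;
      for (long i = 0; i < ITER; ++i)
        ++c->v;                 /* each store needs the line in M state */
      return NULL;
    }

    int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, bump, &counters[0]);
      pthread_create(&t2, NULL, bump, &counters[1]);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("%ld %ld\n", counters[0].v, counters[1].v);
      return 0;
    }

Building it with PAD set to 1 gives each counter its own 64-byte line; the cross-cache traffic disappears and the loops typically run several times faster.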
Figure 3.19: Sequential Read Access, Multiple Threads
(Cycles/List Element vs. Working Set Size in Bytes, 2^10 to 2^28; #Threads=1, 2, 4)

Figure 3.20: Sequential Increment, Multiple Threads
(Cycles/List Element, logarithmic scale, vs. Working Set Size in Bytes; #Threads=1, 2, 4)

Figure 3.21: Random Addnextlast, Multiple Threads
(Cycles/List Element vs. Working Set Size in Bytes; #Threads=1, 2, 4)
Finally, in Figure 3.21 we have the numbers for the Addnextlast test with random access of memory. This figure is provided mainly to show the appallingly high numbers. It now takes around 1,500 cycles to process a single list element in the extreme case. The use of more threads is even more questionable. We can summarize the efficiency of multiple thread use in a table.

    #Threads   Seq Read   Seq Inc   Rand Add
        2        1.69       1.69      1.54
        4        2.98       2.07      1.65

    Table 3.3: Efficiency for Multiple Threads

The table shows the efficiency for the multi-thread runs with the largest working set size in the three figures above. The numbers show the best possible speed-up the test program achieves for the largest working set size by using two or four threads. For two threads the theoretical limit for the speed-up is 2 and, for four threads, it is 4. The numbers for two threads are not that bad. But for four threads the numbers for the last test show that it is almost not worth it to scale beyond two threads. The additional benefit is minuscule. We can see this more easily if we represent the data in Figure 3.21 a bit differently.
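The efficiency values can be derived directly from the cycles-per-list-element curves: with n threads, each working on its own list, n elements are processed in the (longer) per-element time, so the speed-up over one thread is n divided by the per-element penalty factor. The short sketch below applies this to the 18% and 93% penalties quoted above for the sequential Increment test and reproduces the Seq Inc column of Table 3.3; the other columns presumably follow in the same way from their respective curves.

    /* Speed-up as summarized in Table 3.3: n threads each process their own
       list, so total throughput is n / (cycles per element), and the speed-up
       is that throughput relative to the single-thread case.  The 1.18 and
       1.93 factors are the 18% and 93% per-element penalties quoted in the
       text for the sequential Increment test. */
    #include <stdio.h>

    static double speedup(int nthreads, double penalty) {
      /* penalty = (cycles/element with n threads) / (cycles/element with 1) */
      return nthreads / penalty;
    }

    int main(void) {
      printf("2 threads: %.2f\n", speedup(2, 1.18));  /* ~1.69 */
      printf("4 threads: %.2f\n", speedup(4, 1.93));  /* ~2.07 */
      return 0;
    }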
The curves in Figure 3.22 show the speed-up factors, i.e., the performance relative to the code executed by a single thread. We have to ignore the smallest sizes; the measurements are not accurate enough. For the range of the L2 and L3 cache we can see that we indeed achieve almost linear acceleration. We almost reach factors of 2 and 4, respectively. But as soon as the L3 cache is not sufficient to hold the working set the numbers crash. They crash to the point that the speed-up of two and four threads is identical (see the fourth column in Table 3.3). This is one of the reasons why one can hardly find motherboards with sockets for more than four CPUs all using the same memory controller. Machines with more processors have to be built differently (see section 5).

Figure 3.22: Speed-Up Through Parallelism
(speed-up factor vs. Working Set Size in Bytes; #Threads=1, 2, 4)
These numbers are not universal. In some cases even working sets which fit into the last level cache will not allow linear speed-ups. In fact, this is the norm since threads are usually not as decoupled as is the case in this test program. On the other hand it is possible to work with large working sets and still take advantage of more than two threads. Doing this requires thought, though. We will talk about some approaches in section 6.
Special Case: Hyper-Threads  Hyper-Threads (sometimes called Symmetric Multi-Threading, SMT) are implemented by the CPU and are a special case since the individual threads cannot really run concurrently. They all share almost all the processing resources except for the register set. Individual cores and CPUs still work in parallel but the threads implemented on each core are limited by this restriction. In theory there can be many threads per core but, so far, Intel's CPUs at most have two threads per core. The CPU is responsible for time-multiplexing the threads. This alone would not make much sense, though. The real advantage is that the CPU can schedule another hyper-thread and take advantage of available resources such as arithmetic logic units (ALUs) when the currently running hyper-thread is delayed.
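Whether a second hyper-thread pays off depends on the cache behavior. The execution time of a program can be approximated with a simple model that assumes only one level of cache; using the variables defined below, one way to write such a model is:

    T_{exe} = N \left[ (1 - F_{mem})\, T_{proc} + F_{mem} \left( G_{hit}\, T_{cache} + (1 - G_{hit})\, T_{miss} \right) \right]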
The meaning of the variables is as follows:

    N       = number of instructions
    F_mem   = fraction of N that access memory
    G_hit   = fraction of loads that hit the cache
    T_proc  = number of cycles per instruction
    T_cache = number of cycles for a cache hit
    T_miss  = number of cycles for a cache miss
    T_exe   = execution time for the program

For it to make any sense to use two threads, the execution time of each of the two threads must be at most half of that of the single-threaded code. The only variable on either side is the number of cache hits. If we solve the equation for the minimum cache hit rate required to not slow down the thread execution by 50% or more, we get the graph in Figure 3.23.

Figure 3.23: Minimum cache hit rate required for multi-threaded speed-up
(X axis: 60% to 100%; Y axis: 0% to 100%)
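The model also shows how sensitive the execution time is to the hit rate. The small sketch below merely evaluates the formula for a range of hit rates; the latency and instruction-mix values are assumptions chosen for illustration, not numbers from the measurements.

    /* Evaluate the single-cache-level execution-time model.  The cycle counts
       (T_PROC, T_CACHE, T_MISS) and the memory fraction F_MEM are illustrative
       assumptions, not values taken from the paper's measurements. */
    #include <stdio.h>

    static double t_exe(double n, double f_mem, double g_hit,
                        double t_proc, double t_cache, double t_miss) {
      return n * ((1.0 - f_mem) * t_proc
                  + f_mem * (g_hit * t_cache + (1.0 - g_hit) * t_miss));
    }

    int main(void) {
      const double N = 1e9;
      const double F_MEM = 0.3, T_PROC = 1.0, T_CACHE = 3.0, T_MISS = 200.0;
      for (double g = 1.00; g > 0.799; g -= 0.05)
        printf("hit rate %.2f -> %5.2f cycles/instruction\n",
               g, t_exe(N, F_MEM, g, T_PROC, T_CACHE, T_MISS) / N);
      return 0;
    }

Even a modest drop in the hit rate multiplies the per-instruction cost, which is why the cache hit rate is the decisive variable in the condition described above.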
