

Performance Analysis of HPC Applications on Several Dell PowerEdge 12th Generation Servers

study were identical to this one, and therefore data from that analysis is leveraged for the Sandy
Bridge EP portion of this work.

4. Results and analysis

This section compares the performance characteristics of each of the applications described above on
the three server platforms.

Because the Dell PowerEdge R820 has twice as many cores per server as the PowerEdge M620 and the
PowerEdge M420, the performance comparison is made on the basis of core count rather than the number
of servers. This comparison is also helpful when studying applications that have per-core licensing
costs. For example, the PowerEdge R820 needs twice as many ANSYS Fluent licenses per server (32) as a
PowerEdge M620 or M420 (16). For all tests, the cores in each server were fully subscribed. For
example, a 32-core result indicates that the test used two PowerEdge M620 servers (2 × 16 cores), one
PowerEdge R820 (32 cores), or two PowerEdge M420 servers (2 × 16 cores). All application results in
this section are plotted relative to the performance of the PowerEdge M620 cluster.
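The core-count-based comparison can be sketched as follows. This is a minimal illustration, not part of the study's tooling: the names `CORES_PER_SERVER`, `servers_needed`, and `relative_performance` are hypothetical, the core counts are the 16 and 32 cores per server stated above, and `relative_performance` assumes a higher-is-better metric (for a time-to-solution metric the ratio would be inverted).

```python
# Hypothetical sketch: map a core count to the number of servers used per
# platform, assuming every core in each server is fully subscribed.
CORES_PER_SERVER = {
    "PowerEdge M620": 16,
    "PowerEdge M420": 16,
    "PowerEdge R820": 32,
}


def servers_needed(platform: str, total_cores: int) -> int:
    """Number of servers of `platform` that provide exactly `total_cores`."""
    per_server = CORES_PER_SERVER[platform]
    if total_cores % per_server != 0:
        # A partial server would leave cores idle, which these tests avoid.
        raise ValueError(f"{total_cores} cores cannot fully subscribe {platform} servers")
    return total_cores // per_server


def relative_performance(result: float, m620_result: float) -> float:
    """Normalize a result to the PowerEdge M620 result at the same core count.

    Assumes a higher-is-better metric (for example, a rating or jobs per day).
    """
    return result / m620_result


for platform in CORES_PER_SERVER:
    print(f"{platform}: {servers_needed(platform, 32)} server(s) for 32 cores")
```

For 32 cores this reports two M620 servers, two M420 servers, and one R820, matching the example in the text.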

Before turning to application performance, this section first examines the obvious differences in the
memory subsystems of the three server platforms. The Stream benchmark [6] is used to demonstrate, at
the microbenchmark level, the impact of each server's architecture on system memory bandwidth.
Subsequent sections analyze and explain application-level performance.

4.1. Memory bandwidth

The total memory bandwidth and the memory bandwidth per core of the three platforms, measured with the
Stream benchmark, are plotted in Figure 4. The height of each bar indicates the total memory bandwidth
of the system, and the value above each bar gives the memory bandwidth per core. The Dell PowerEdge
R620 is a rack-based server with an architecture and expected performance similar to those of the
PowerEdge M620 blade server.
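The per-core values can be related to the totals with simple arithmetic. The sketch below assumes that the per-core figure is the total Stream bandwidth divided by the number of cores in the server, and that the PowerEdge R620 has 16 cores like the PowerEdge M620 (an assumption; the R620 configuration is not listed here). It uses the ~78 GB/s and ~110 GB/s totals reported in this section; `bandwidth_per_core` is a hypothetical helper.

```python
# Sketch: per-core bandwidth = total Stream bandwidth / cores in the server.
def bandwidth_per_core(total_gbs: float, cores: int) -> float:
    """Memory bandwidth per core, in GB/s."""
    return total_gbs / cores


r620 = bandwidth_per_core(78.0, 16)   # 16 cores, assumed to match the PowerEdge M620
r820 = bandwidth_per_core(110.0, 32)  # 32 cores per PowerEdge R820
reduction = 1 - r820 / r620           # fraction by which R820 per-core bandwidth is lower

print(f"R620: {r620:.2f} GB/s per core")
print(f"R820: {r820:.2f} GB/s per core ({reduction:.0%} lower)")
```

This gives about 4.9 GB/s per core for the R620 (close to the 4.8 GB/s reported), about 3.4 GB/s per core for the R820, and a reduction of roughly 29 percent, consistent with the ~30 percent quoted for the PowerEdge R820.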

As expected, the PowerEdge R820 has the highest total memory bandwidth, measured at ~110 GB/s. The
corresponding bandwidth for the two-socket PowerEdge R620 is 78 GB/s. The Stream Triad benchmark
performs two reads and one write to memory for each array element. If additional data is transferred
to or from memory during the measurement period, it is not counted toward the reported memory
bandwidth. Therefore, the memory bandwidth available to certain applications may be higher than Stream
reports. On the Intel Xeon processor E5-4600 product family, an issued non-cacheable write instruction
still triggers a read for ownership (RFO) because of the cache coherency protocol. This extra read is
not counted by the benchmark, but it consumes memory bandwidth. This is explained in more detail in
[7]. If this extra read were counted by the benchmark, the effective memory bandwidth of the PowerEdge
R820 would be approximately two times that of the PowerEdge R620.
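The Triad byte accounting and the effect of the uncounted RFO read can be sketched as follows. This is an illustrative model, not the STREAM source code. It assumes double-precision (8-byte) elements, that Stream credits 24 bytes per Triad element (reads of `b` and `c`, write of `a`), and that on the PowerEdge R820 each write adds one uncounted 8-byte RFO read, while the PowerEdge R620 figure is used unadjusted, as the comparison above implies. The function names are hypothetical.

```python
# Illustrative model of STREAM Triad traffic (not the STREAM source).
ELEMENT_BYTES = 8  # double precision


def triad(a, b, c, scalar):
    """Triad kernel a[i] = b[i] + scalar * c[i]: reads b and c, writes a."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def counted_bytes(n):
    # Stream credits two reads (b, c) and one write (a) per element.
    return 3 * ELEMENT_BYTES * n


def actual_bytes_with_rfo(n):
    # Assumed model: each write to a[i] first reads the line for ownership,
    # adding one uncounted read per element.
    return 4 * ELEMENT_BYTES * n


def rfo_adjusted_bandwidth(reported_gbs):
    # Scale the reported figure by the actual/counted traffic ratio (4/3).
    return reported_gbs * actual_bytes_with_rfo(1) / counted_bytes(1)


r820_reported = 110.0  # GB/s, measured (approximate)
r620_reported = 78.0   # GB/s, measured; assumed to need no RFO adjustment
r820_effective = rfo_adjusted_bandwidth(r820_reported)
print(f"R820 effective: {r820_effective:.1f} GB/s")
print(f"Ratio vs R620: {r820_effective / r620_reported:.2f}")
```

Under these assumptions the R820 figure rises from ~110 GB/s to about 147 GB/s, roughly 1.9 times the R620's 78 GB/s, in line with the "approximately two times" estimate.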

This study uses the memory bandwidth actually measured and reported by Stream. Because an application
may exhibit the same behavior and incur the same RFO penalty, this measured value provides an
appropriate baseline for the analysis.

At 4.8 GB/s per core, the PowerEdge R620 has the highest memory bandwidth per core, whereas the memory
bandwidth per core of the PowerEdge R820 is measured to be ~30 percent lower. Because the
PowerEdge M420 has three memory channels when compared to PowerEdge R620 or PowerEdge R820
