Intel's HT (Hyper-Threading) and AMD's SMT (Simultaneous Multi-Threading) are meant to maximize CPU core utilization by duplicating the parts which feed the execution units, but not the execution units themselves (the parts which do the floating point math). Moreover, they work better with apps coded to be multi-threaded. Running many single-threaded apps (like Rosetta@home, Einstein@home etc.) may be counterproductive if the apps (including the data they actually process) don't fit into the (L3) cache of the CPU.
32 tasks on 32 cores: 47,000~52,000 secs (13h~14h30m) = 265% run time (32.65% performance loss)
16 tasks on 32 cores: 19,200~19,600 secs (5h20m~5h30m) = 100%

My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.
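As a sanity check on that conclusion, here is the throughput arithmetic behind those run times (a quick sketch; the midpoints of the quoted ranges are my assumption):

```python
# Rough throughput comparison from the run times quoted above.
# Midpoints of the quoted ranges are assumed; "tasks per day" assumes
# tasks run back-to-back with no idle time.

def tasks_per_day(simultaneous_tasks: int, runtime_secs: float) -> float:
    """How many tasks the whole CPU completes per 24 hours."""
    return simultaneous_tasks * 86400 / runtime_secs

full = tasks_per_day(32, 49500)   # midpoint of 47,000~52,000 secs
half = tasks_per_day(16, 19400)   # midpoint of 19,200~19,600 secs

print(f"32 tasks: {full:.1f} tasks/day")
print(f"16 tasks: {half:.1f} tasks/day")
```

With these midpoints the halved task count completes roughly 71 tasks/day versus roughly 56, so the 16-task configuration wins on total throughput, not just per-task run time.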
Recent AMD CPUs are built from 8-core CPU chiplets, plus a much larger I/O chip connecting the CPU chiplets to each other and to the outside world (RAM, PCIe bus, etc). As the (L2+L3) cache resides on the CPU chiplets, their sizes don't add up to a single shared cache. Intel CPUs have a single chip inside (this may change in the future), so the L3 cache is shared among all cores (and threads). Both architectures have pros and cons. For running many single-threaded tasks, Intel's architecture may be a little bit better (I didn't compare them, as I don't have a 10+ core Intel CPU).
However, it makes sense to reduce the number of simultaneous Rosetta@home (and other) tasks to the number of cores the CPU has, for troubleshooting, and to leave some memory bandwidth free for other apps.
That's why I suggested: "Try to limit the number of usable CPUs in BOINC manager to 50% (Options -> Computing preferences -> Use at most 50% of the processors)."
The actual percentage depends on many factors:
1. The number of GPUs in the system: the highest recommended number of CPU tasks is the maximum number of threads minus the number of GPUs in the system (leaving one thread free to feed each GPU).
2. The CPU app: Rosetta uses up to 2 GB of RAM per task (typically up to 1 GB), so it uses a lot of DDR bandwidth; running many of them is therefore more likely to be counterproductive because of the next factor.
3. The ratio of memory bandwidth to the number of processor cores: high (10+) core count processors have "only" 4-channel memory, which results in a lower memory-bandwidth-per-core ratio than a 2-core CPU with 2-channel memory has. To compensate for this, the L3 cache is made larger in high core count processors, but it may still be insufficient for many separate apps running simultaneously (depending on the app).
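Factor 3 can be illustrated with a back-of-the-envelope calculation. The per-channel figure below is my own illustrative assumption (roughly DDR4-3200), not a measurement from the post:

```python
# Bandwidth-per-core comparison (illustrative numbers, not measurements):
# one DDR4-3200 64-bit channel moves about 25.6 GB/s peak.

GBS_PER_CHANNEL = 25.6  # assumed: DDR4-3200, one channel

def bandwidth_per_core(channels: int, cores: int) -> float:
    """Peak memory bandwidth available per physical core, in GB/s."""
    return channels * GBS_PER_CHANNEL / cores

# 2-core CPU with 2-channel memory vs 32-core CPU with 4-channel memory
print(bandwidth_per_core(2, 2))    # 25.6 GB/s per core
print(bandwidth_per_core(4, 32))   # 3.2 GB/s per core
```

An 8x difference in peak bandwidth per core is exactly the gap the larger L3 cache is supposed to hide, and why it may fail to when every core runs a separate memory-hungry task.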
Let's evaluate the CPU-Z benchmark of my 4-core 8-thread i7-4770K:
[Screenshot: four CPU-Z benchmark results. Left column: 8 threads; right column: 6 threads. Upper row: older test (non-AVX); lower row: newer test (AVX). Look for the "Multi-thread ratio" value.]
Multi-Thread Ratio    8 threads     6 threads     4 threads (not in the picture)
older test (non-AVX)  5.13 (-2.87)  4.39 (-1.61)  3.77 (-0.23)
newer test (AVX)      5.52 (-2.48)  4.64 (-1.36)  3.82 (-0.18)

The most important outcome is that the multi-thread ratio is nowhere near 8 for 8 threads, and it is also much less than 6 for 6 threads.
You can read the above as: "If I can squeeze out another 0.8 core's worth of performance by running 8 threads, then I will", but the performance of real-world apps (especially those which use a lot of memory) may not scale up with the number of threads that well. This is especially true of single-threaded apps, as their memory usage and memory bandwidth needs also scale up with the thread count. They will have a negative impact on each other's performance, and also on the performance of other parts of the PC thought to be independent of the CPU (GPU apps), proving that they are not independent.
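The "0.8 core" reading becomes clearer if you work out the marginal gain per extra thread from the non-AVX row of the table above (a quick sketch using the quoted CPU-Z ratios):

```python
# Marginal gain per extra thread, from the CPU-Z multi-thread ratios
# quoted above for the 4-core/8-thread i7-4770K (non-AVX test).

ratios = {  # threads -> multi-thread ratio
    4: 3.77,
    6: 4.39,
    8: 5.13,
}

def marginal_gain(lo: int, hi: int) -> float:
    """Extra 'core equivalents' gained per additional thread."""
    return (ratios[hi] - ratios[lo]) / (hi - lo)

print(f"threads 1-4: {ratios[4] / 4:.2f} cores per thread")   # 0.94
print(f"threads 5-6: {marginal_gain(4, 6):.2f} cores per thread")  # 0.31
print(f"threads 7-8: {marginal_gain(6, 8):.2f} cores per thread")  # 0.37
```

Each thread up to the physical core count is worth almost a whole core; each HT thread beyond that is worth only about a third of one, and that's in a synthetic benchmark with no memory pressure.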
For example, GPUGrid's performance will suffer more as you increase the number of simultaneous CPU tasks, so you'll lose more credits through the degradation of GPU performance than you gain on the CPU side. According to my earlier measurements, running more than 1 Rosetta task causes a larger credit loss on GPUGrid than its gain (if you use a high-end GPU for GPUGrid). That's one of the reasons I build my PCs for GPU crunching on cheap (i3) CPUs and cheap motherboards (with only a single PCIe 3.0 x16 slot) with single GPUs. This way it doesn't bother me that I don't use their CPUs for crunching. Recent CPUs may perform a little better, but not enough to change my mind.
If you test a recent HT or SMT CPU, you'll get similar results: the multi-thread ratio scales up (almost) 1:1 with the number of cores your CPU has; above that, it scales at about 1:4 (or less). So there's not much gain in running more CPU tasks than the number of CPU cores (depending on how awful the app is), but it could be worth running 1 more task than that (on high core count CPUs it can be 2 or 3 more). You have to experiment to find the exact number; it should be near or equal to the number of cores if you use the computer while it's crunching.
The Rosetta@home app is a difficult one to benchmark, as it runs for a given period of time; a workunit has no fixed length (measured in the number of floating point operations). Therefore we can't compare completion times to see the actual performance degradation coming from overcommitting the CPU. We could compare the awarded credits per day, but that's not fixed either: if the given workunit gets lucky (= the protein is folded well), it receives more credits. The max credit I receive for 24h workunits is about 1,400 credits per core, as I run only as many simultaneous Rosetta@home tasks as my CPU has cores (or even fewer).
|ID: 54584 | Rating: 0|
I thought that running 64 tasks simultaneously would reduce the performance of the app greatly, so I disabled SMT in the BIOS. With "only" 32 tasks running, the run times were still quite high: 47,000~52,000 secs (13h~14h30m). I decided to further reduce the number of simultaneous tasks, so I set "use at most 50% of the processors". I also wrote a little batch program to periodically set the CPU affinity of each task to the even-numbered cores (to spread the running tasks between the CPU chiplets). The runtimes dropped to 19,200~19,600 secs (5h20m~5h30m), while the power consumption rose by 30W (and the CPU temperature went up by 7°C as well).
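For anyone who wants to reproduce the affinity trick: the original was a Windows batch program, but here is a sketch of the same idea using the Linux-only stdlib call os.sched_setaffinity (the even/odd sibling numbering is an assumption that matches how SMT siblings were paired on this machine):

```python
# Sketch of the "pin tasks to even-numbered cores" trick described above.
# Assumes Linux (os.sched_setaffinity is Linux-only) and that SMT siblings
# are numbered (0,1), (2,3), ... so even CPUs = one thread per physical core.

import os

def even_cores(n_cpus: int) -> set[int]:
    """Even-numbered logical CPUs, e.g. {0, 2, 4, 6} for n_cpus=8."""
    return set(range(0, n_cpus, 2))

def pin_to_even_cores(pid: int) -> None:
    """Restrict the given process to the even-numbered logical CPUs."""
    os.sched_setaffinity(pid, even_cores(os.cpu_count()))

# Example: pin the current process (pid 0 means "this process")
pin_to_even_cores(0)
print(sorted(os.sched_getaffinity(0)))
```

A real version would enumerate the Rosetta task PIDs (e.g. from /proc) and re-apply the mask periodically, since newly downloaded tasks start with the default affinity.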
Did you take into account the increase in core clock speed you'd get by doing this? Reducing the load (active cores) allows the CPU to boost its clock frequency.
Did you try testing the middle of the spectrum, and not just the extremes? It's well known that having the CPU at 100% load causes issues with available resources and slow run times, but I would like to see the results at 80-90% load. That way you should get increased performance from more cores working, which might outweigh the clock speed benefit of running at only 50% load.
Turning off HT/SMT likely helps if you don't have enough system memory. Rosetta seems to use up to 2 GB per task, meaning that if you wanted to run 60+ threads simultaneously, you should probably have more than 128 GB of system RAM. Faster RAM with low latency will help here.
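That sizing rule is easy to check with a bit of arithmetic (the 2 GB peak per task is from the posts above; the OS headroom figure is my own assumption):

```python
# Rule-of-thumb RAM sizing for Rosetta@home, using the ~2 GB peak per
# task mentioned above. The 8 GB OS headroom is an assumed safety margin.

def min_ram_gb(tasks: int, gb_per_task: float = 2.0,
               os_headroom_gb: float = 8.0) -> float:
    """Minimum system RAM (GB) to run `tasks` Rosetta tasks at peak usage."""
    return tasks * gb_per_task + os_headroom_gb

print(min_ram_gb(64))   # 136.0 -> more than 128 GB for 64 threads
print(min_ram_gb(32))   # 72.0  -> even 64 GB is tight for 32 threads
```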
For these kinds of CPU/RAM intensive workloads, I would probably go to the Epyc line of processors instead, giving you 8 memory channels rather than just the 4 you get with Threadripper.
|ID: 54586 | Rating: 0|
Zoltan, that was beautiful. Thank you so much for taking the time to run those tests and explain them in detail. I've been hoping for this for 2 years, ever since we discovered the problem with WCG's MIP, and I wondered whether it also hampered Rosetta.
|ID: 54587 | Rating: 0|
More information about bottlenecks on x86/GPU systems: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,42399_offset,0#627426
|ID: 55040 | Rating: 0|