
Message boards : Number crunching : multiple WUs on GTX970 and similar high-end cards

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 39423 - Posted: 8 Jan 2015 | 21:40:40 UTC

Some time ago we tested running 2 WUs concurrently on Kepler GPUs like the GTX660Ti. A clear benefit was observed for short-run WUs, whereas long-run WUs lost performance: their GPU utilization was already good enough, so switching between WUs only introduced further overhead.

With my new GTX970 running the recent Gerard CXCL12 WUs at just 7x% GPU utilization, I found an approximate throughput increase of 12.5% with 2 concurrent WUs. GPU utilization went up to 91% (the WDDM performance tax of Win Vista and newer applies).

I've been running like this since then, because so many of these "small" WUs (in terms of number of atoms) are still being distributed. But some of the usual big Noelias have also crept in. Now the interesting part: I'm seeing a throughput increase for these as well :)

They used to take about 21.5 ks each, with 162 W power draw and ~85% GPU utilization. Now two of them finish in about 38.5 ks. This includes some time with each WU running solo while the other one uploads, so let's assume 2 WUs every 39 ks - which yields a 10% throughput increase! Note: I've since increased the power limit, so my card now consumes 168 W (3.7% more) but runs at approximately the same boost state.
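
For anyone who wants to check the arithmetic, here it is spelled out as a minimal Python sketch (all figures are the ones quoted above):

```python
# Throughput: 1 WU at a time vs. 2 concurrent WUs.
# Figures from the post above: ~21.5 ks per WU solo, ~39 ks per pair.
solo_runtime_ks = 21.5      # one long WU running alone
paired_runtime_ks = 39.0    # two WUs finishing together (incl. upload gaps)

solo_throughput = 1 / solo_runtime_ks      # WUs per ks
paired_throughput = 2 / paired_runtime_ks  # WUs per ks

gain = paired_throughput / solo_throughput - 1
print(f"Throughput gain: {gain:.1%}")  # -> Throughput gain: 10.3%
```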

I assume this also applies to similarly powerful cards: GTX780Ti, Titan, Titan Z, GTX970 and GTX980. It may also work for GTX780 and GTX770, but I would expect the benefit to be lower.

It could also be that some improvement in Maxwell results in more efficient task switching, but I have no further information to judge this.

MrS
____________
Scanning for our furry friends since Jan 2002

Dayle Diamond
Joined: 5 Dec 12
Posts: 84
Credit: 1,663,883,415
RAC: 0
Message 39424 - Posted: 9 Jan 2015 | 0:55:37 UTC

That's fascinating! I've noticed temperatures on some of these new work units dropping to 45 degrees in cold weather, so I wondered whether the cards were really being put to full use. I wonder if we should all try to make our Maxwells more efficient, or if this type of work unit will be crunched away and we'd be back at lower output again. Developers, any feedback on what's coming down the pipeline? Is an application update possible, either one that swaps between two WUs packaged together or that otherwise increases throughput?

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 39436 - Posted: 10 Jan 2015 | 5:41:07 UTC - in response to Message 39424.
Last modified: 10 Jan 2015 | 6:02:14 UTC

The amount of GPU memory is a known factor; if the card doesn't have enough memory to run 2 tasks, it will be much slower overall.
As the 970 has 4GB GDDR5, memory amount isn't an issue for 2 tasks ATM.
My guess is that some other cards with 4GB might be capable of seeing improvements even around 85% GPU usage, but there is still going to be a point (say 90%) where you are not going to see any improvement, if not a performance loss.

I've never tested this with a Kepler GK110, and they are architecturally different from GK104/GK106/GK107, where I think there is less chance of a benefit and you are really limited to the GTX660Ti 3GB models and up (670 4GB, 680 4GB, 770). Possibly the 760 4GB too, but it's only got 1152 shaders, which may be an obstacle in itself.

I expect performance to vary by GPU model (big Keplers, small Keplers, Maxwells, bus width), how you tune it (GPU/RAM frequency), task type, and operating system (as GPU usage might be higher on XP and Linux to begin with).

I had a concern with the GTX660Ti 2GB that, even with smaller molecules (lower GPU usage), some RAM beyond 1.5GB IIRC was not always as easy to access (slower). So for 2 tasks of <750MB it might have been OK, but not >750MB. That might have been related to the super-scalar architecture, which Maxwells don't have. Note that Win Vista and above uses 30MB or so on GPU0, so for that card the tasks would need to be <735MB.
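
The per-task budget arithmetic above, as a small sketch (the 1.5GB fast region and the ~30MB desktop reservation are the figures from this post, not measured values):

```python
# Per-task VRAM budget when running 2 tasks on a GTX660Ti 2GB whose
# memory is only fast up to ~1.5GB (per the post above), on a machine
# where Vista+ reserves ~30MB on the display GPU (GPU0).
fast_vram_mb = 1500   # usable fast region
desktop_mb = 30       # Vista/7 desktop reservation on GPU0
tasks = 2

budget_per_task_mb = (fast_vram_mb - desktop_mb) / tasks
print(f"Each task must stay under ~{budget_per_task_mb:.0f} MB")  # ~735 MB
```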

The cache size might also come into play; the bigger the better.
GK104-300-KD-A2/GK106-400-A1/GK107 aren't as large (384K/384K/256K).
The other GK104 models (GTX770, GTX680, GTX670) have 512K.
GK110 has 1536K and GM107 has 2MB (though it's only got 1GB or 2GB GDDR5).

I expect the MCU load would be higher with 2 tasks, so relatively speaking the 970 might be slightly better off than the 980, as the 970 has a lower shaders-to-bandwidth ratio (less compute contending for each GB/s).
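
For reference, that ratio works out as follows (published core counts; both cards have a 256-bit, 7 Gbps GDDR5 bus, i.e. 224 GB/s):

```python
# Shaders per GB/s of memory bandwidth - a rough proxy for how hard the
# memory subsystem is pushed per unit of compute.
cards_cuda_cores = {"GTX 970": 1664, "GTX 980": 2048}
bandwidth_gbs = 224  # 256-bit bus at 7 Gbps on both cards

for name, shaders in cards_cuda_cores.items():
    print(f"{name}: {shaders / bandwidth_gbs:.1f} shaders per GB/s")
# GTX 970: 7.4, GTX 980: 9.1 -> the 970 has more bandwidth per shader
```
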
____________
FAQs

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 39482 - Posted: 14 Jan 2015 | 23:24:44 UTC
Last modified: 14 Jan 2015 | 23:24:56 UTC

Note: From my testing, Windows 8 and higher use about 200-250 MB of GPU RAM on any display device, and even more if you have any Modern UI applications running.

disturber
Joined: 11 Jan 15
Posts: 11
Credit: 62,705,704
RAC: 0
Message 39505 - Posted: 17 Jan 2015 | 1:42:04 UTC

I have an interesting observation. On my i7-3770K machine I have a 660Ti and a 970 running. I noticed that for some reason the temperature on the 660Ti had crept up to about 72C. So I investigated and found that the core frequency kept changing. The GPU load was 95%, memory controller at 41%. The card has a factory overclock: 1085 MHz core and 1502 MHz memory. The work unit was a NOELIA_PNP-1-10. It turned out that the card had run up against its TDP. After raising the power target to 108%, GPU-Z showed a 105% TDP power consumption. Has anyone else had to increase their power target to keep the card from clocking down?

My 970 is already over 90% utilized at 1397 MHz core and 1880 MHz memory, which is probably why I chose not to run 2 tasks on it. It would also require extra app configuration so that I don't end up running 2 tasks on the 660Ti as well, since it's in the same computer.
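
For reference, the standard way to run 2 tasks per GPU in BOINC is an app_config.xml in the project directory. A minimal sketch, assuming the long-run app is named "acemdlong" (check client_state.xml or the project directory on your own install for the actual app name):

```python
# Sketch: write a BOINC app_config.xml that runs 2 GPUGrid tasks per GPU.
# ASSUMPTION: the app name "acemdlong" and the project directory path -
# verify both against your own BOINC data directory before using this.
from pathlib import Path

APP_CONFIG = """<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

project_dir = Path("projects/www.gpugrid.net")  # relative to the BOINC data dir
(project_dir / "app_config.xml").write_text(APP_CONFIG)
print("Re-read config files from the BOINC Manager, or restart the client.")
```

Note that gpu_usage applies to every GPU in the host, which is exactly the problem here: to keep the 660Ti at one task you'd additionally exclude it from that app with an <exclude_gpu> entry (url, device_num and app) in cc_config.xml.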

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 39540 - Posted: 20 Jan 2015 | 20:12:50 UTC - in response to Message 39505.

Yes, the GTX660Ti and Noelia WUs in general won't benefit from 2 concurrent WUs. And regarding the power limit: the percentage is not relative to the TDP, but rather to the power target, which is "just" 130 W for normal GTX660Ti's, so it's not that bad.

My cards also run into power limits - but this is on purpose. By lowering the power limit (target) I'm keeping the 28 nm cards around 1.10 V, where the clock speed is almost as high as at full boost (losing around 70 MHz), but the energy efficiency is ~20% better.
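
To put disturber's GPU-Z percentages into watts under that assumption (a stock 130 W target; board vendors do ship different targets):

```python
# The GPU-Z percentage is relative to the card's power target, not TDP.
power_target_w = 130   # typical GTX660Ti power target (figure from this post)
limit_pct = 108        # disturber's raised power-limit setting
draw_pct = 105         # what GPU-Z reported under load

print(f"Limit: {power_target_w * limit_pct / 100:.0f} W")  # -> 140 W
print(f"Draw:  {power_target_w * draw_pct / 100:.0f} W")   # -> 136 W
```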

MrS
____________
Scanning for our furry friends since Jan 2002

VT
Joined: 28 Nov 13
Posts: 17
Credit: 153,786,987
RAC: 0
Message 39739 - Posted: 27 Jan 2015 | 6:36:33 UTC - in response to Message 39540.

Confirmed - I've been running 2 long WUs whenever I could over the last year, first on a Titan and now on a 970.

disturber
Joined: 11 Jan 15
Posts: 11
Credit: 62,705,704
RAC: 0
Message 39970 - Posted: 1 Feb 2015 | 14:37:47 UTC

Now that I have run GPUGrid for a few weeks, I have come to the conclusion that its tasks are a lot more demanding than E@H. First I had to decrease the overclock on the cards, until I added a bit more voltage to make them stable at the same clocks as under E@H. Then I found that I needed to increase the power target on the 660Ti. I also had to increase the voltage on the Gigabyte 970 to make it stable at the clock rates I was using for E@H. It has plenty of cooling, but with a GPU load of 91% it still runs much hotter than before: 72C vs 56C. This caused me to increase the fan to 85%. I also see that TDP is now around 75% vs roughly 50% for E@H, which explains the extra heat.

I am not sure I want to run both my GPUs at such elevated temperatures for the long run. Once I see where I top out on my RAC, I may scale back the clock rates on both cards.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 39981 - Posted: 1 Feb 2015 | 17:45:25 UTC - in response to Message 39970.

Yes, GPU-Grid is pushing the GPUs quite hard, pretty much like demanding games do. This is not to say Einstein wouldn't use them efficiently, but their algorithm is "less arithmetically dense", which means they have to spend more time moving data around rather than "just" crunching numbers.

If you feel your GPUs are working too hard, simply lower the power target / limit. The cards will adjust clock speeds and voltages accordingly and run more power-efficiently (due to the lower voltage). I'm aiming for ~1.10 V on 28 nm GPUs, which costs me 50 - 70 MHz compared to full-throttle operation but saves ~15% power consumption.
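
As a rough perf-per-watt estimate of that trade-off (assuming performance scales linearly with core clock, which it won't quite do in memory-bound phases, and taking ~1400 MHz full boost from disturber's 970 above):

```python
# Undervolting trade-off: give up a little clock, save more power.
full_clock_mhz = 1400   # approx. full boost (disturber's 970)
clock_loss_mhz = 60     # midpoint of the quoted 50-70 MHz sacrifice
power_saving = 0.15     # ~15% lower power draw at ~1.10 V

perf_ratio = (full_clock_mhz - clock_loss_mhz) / full_clock_mhz
power_ratio = 1 - power_saving
print(f"Performance: {perf_ratio:.1%} of full boost")     # ~95.7%
print(f"Perf per watt: {perf_ratio / power_ratio:.2f}x")  # ~1.13x
```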

MrS
____________
Scanning for our furry friends since Jan 2002
