
Message boards : Number crunching : Multiple Teslas in one box. Is there a limit per machine for tasks?

Coleslaw
Joined: 24 Jul 08
Posts: 36
Credit: 337,382,679
RAC: 18
Message 37067 - Posted: 16 Jun 2014 | 20:53:18 UTC

One of my teammates has a very impressive rig and is having trouble getting more than 8 work units at a time on his Teslas. He has 8 dual-GPU cards in his rig, and the only way he could feed all 16 GPUs was to add a second project. Is this a limitation on the server side?

He is running Scientific Linux.

http://paste.ubuntu.com/7654934/


System in question
http://www.gpugrid.net/show_host_detail.php?hostid=176860

Eagle07
Joined: 22 Jan 14
Posts: 3
Credit: 729,009,819
RAC: 0
Message 37068 - Posted: 16 Jun 2014 | 21:46:21 UTC - in response to Message 37067.
Last modified: 16 Jun 2014 | 21:48:17 UTC

All Teslas: the ones on the left are K10s, the ones on the right are M2090s.
I am not sure why, but GPUGrid seems to see only the first 9 GPU cores out of 16 on the K10s (devices 0-8).

Eagle07
Joined: 22 Jan 14
Posts: 3
Credit: 729,009,819
RAC: 0
Message 37069 - Posted: 16 Jun 2014 | 22:59:38 UTC - in response to Message 37068.

It gets more fun... I have seen GPUGrid try to use GPU 9 once in a while, but it hits an immediate computation error... oO.

I got a V7 (7.2.33.33) installed that works on SL 6.5, and I copied over the *.xml files and the slots. When I fired it back up, the work units landed in chaos. The GPUGrid work units that landed on GPU 9 or higher failed. Meanwhile, Einstein seems perfectly happy to use those cores.

Coleslaw
Joined: 24 Jul 08
Posts: 36
Credit: 337,382,679
RAC: 18
Message 37239 - Posted: 7 Jul 2014 | 16:41:16 UTC

Do the admins have any input on this issue? We were wondering whether there is some kind of server-side limit preventing more than 8 GPU work units on a single machine, or whether running more than 8 at a time is a known issue. It surprises me that nobody has chimed in, even to ask for additional information.

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 37241 - Posted: 7 Jul 2014 | 19:58:29 UTC - in response to Message 37239.

Don't know - I've some 8-GPU (K40) machines that run ok. What goes wrong exactly?
Who makes that server, BTW?

MJH

Eagle07
Joined: 22 Jan 14
Posts: 3
Credit: 729,009,819
RAC: 0
Message 37243 - Posted: 8 Jul 2014 | 4:46:26 UTC - in response to Message 37241.
Last modified: 8 Jul 2014 | 5:04:43 UTC

MJH wrote:
"Don't know - I've some 8-GPU (K40) machines that run ok. What goes wrong exactly? Who makes that server, BTW?"


It is the 9th GPU that pushes it over. 0-8 makes 9.
All work units fail instantly on GPU 9 (the 10th GPU core) or higher. They rarely try GPU numbers past 9, as the GPUs are given work units sequentially.

HP makes the SL270s Gen8 that these reside in.
The K10 is a dual-GPU card; the K40 is a single-GPU card, albeit significantly stronger.

I have tried 2 installs of SL 6.5. I have removed the 5th and 6th cards to make sure it was not the cards... The problem migrated to the next ones in line.

I updated the drivers to 337.19 and the problem persisted.
I dropped both servers down to 9 GPU cores and the problem goes away.

I would prefer to have all the K10s in one box... As it is, I have 3 empty slots in the left SL node and require 2 more servers to use all of them...

http://i.imgur.com/pEDLqoM.png

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 37245 - Posted: 8 Jul 2014 | 8:30:50 UTC - in response to Message 37243.
Last modified: 8 Jul 2014 | 8:53:49 UTC

4 possibilities come to mind:

1. BOINC can't facilitate any more GPUs
2. The ACEMD app limits the number
3. It's processor related
4. It's PCIe lane related.

I can't answer these possibilities, but I can explain my thinking:

1. Is there a BOINC GPU cap/limit? If so, that's the issue.
2. Are the apps limited to 8 GPUs? If so, then this is the problem.
3. The E5-2670 is an 8-core, 16-thread S2 processor, and there are two mounted. It would make sense to use one CPU until the next is needed (energy saving), and probably to use the second CPU before using HT. Is there a problem starting the second CPU up? Do the drivers or the app not see it? Do the processors' power settings need to be altered?
4. The maximum number of PCI Express lanes is 40 for that processor. I don't know whether the board supports twice that. Also, while the first slot might be PCIe 3.0 (possibly the 2nd, 3rd and 4th also), I doubt that they are all PCIe 3.0 and would expect some to drop to PCIe 2.0. These 40 lanes are also shared with other devices, so in reality you might only have 32, which means 4 lanes per GPU for 8 GPU cores. The point is, your GPUs might not operate if there are fewer than 4 lanes per GPU.
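
The lane arithmetic in point 4 can be sketched in a few lines of Python. This is only an illustration of the division being described: the 40-lane figure and the "maybe 32 left" estimate come from the post, while the function name, the 8 reserved lanes, and the two-CPU case are assumptions made for the example, not measured values for this board.

```python
def lanes_per_gpu(total_lanes: int, reserved_lanes: int, gpu_count: int) -> float:
    """Average PCIe lanes left for each GPU once other devices take their share."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return (total_lanes - reserved_lanes) / gpu_count


# Figures from the post: 40 lanes on one CPU, about 8 assumed lost to other devices.
print(lanes_per_gpu(40, 8, 8))    # 4.0 lanes each for 8 GPU cores
print(lanes_per_gpu(40, 8, 16))   # 2.0 lanes each if all 16 shared one CPU's lanes

# Assumption: the second CPU contributes its own 40 lanes (board dependent).
print(lanes_per_gpu(80, 16, 16))  # 4.0 lanes each across both CPUs
```

If the board only exposes one CPU's lanes to the GPU slots, 16 GPU cores would fall below the 4 lanes per GPU mentioned above; whether that actually stops a GPU from working is the open question in the post.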

Coleslaw
Joined: 24 Jul 08
Posts: 36
Credit: 337,382,679
RAC: 18
Message 37247 - Posted: 8 Jul 2014 | 12:41:55 UTC - in response to Message 37245.
Last modified: 8 Jul 2014 | 12:43:39 UTC

skgiven, if BOINC has a limitation on GPUs, it must be server side. As he stated above, he can fill the GPUs with Einstein work, just not GPUGrid. At least that is what I took from our conversation...

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 37320 - Posted: 20 Jul 2014 | 20:21:51 UTC - in response to Message 37243.

Eagle07 wrote: "It is the 9th GPU that pushes it over. 0-8 makes 9."


Yes. Our app has a limit of 8 GPUs/host. I'll see about getting that fixed, but it won't happen for a wee while.

Matt
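
Until the app limit is lifted, one possible way to keep GPUGrid off the higher-numbered devices while another project uses them is the BOINC client's `exclude_gpu` option in `cc_config.xml`. This was not suggested or tested in this thread, so treat the following as a minimal sketch under assumptions: the helper name `build_exclusions` is made up for the example, the project URL must match the one the client has attached, the device numbers must match BOINC's own numbering in its startup messages, and the right cutoff (8 here) depends on what the app actually tolerates on the host.

```python
import xml.etree.ElementTree as ET


def build_exclusions(project_url: str, total_gpus: int, usable_gpus: int) -> str:
    """Build cc_config.xml text that hides devices usable_gpus..total_gpus-1
    from one project, leaving them free for other projects."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for device in range(usable_gpus, total_gpus):
        entry = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(entry, "url").text = project_url
        ET.SubElement(entry, "device_num").text = str(device)
        ET.SubElement(entry, "type").text = "NVIDIA"
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed values: 16 GPU cores in the box, GPUGrid restricted to devices 0-7.
print(build_exclusions("http://www.gpugrid.net/", total_gpus=16, usable_gpus=8))
```

The generated text would go in `cc_config.xml` in the BOINC data directory, merged with any options already there, and the client has to re-read its config files or be restarted for the change to apply.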

Coleslaw
Joined: 24 Jul 08
Posts: 36
Credit: 337,382,679
RAC: 18
Message 37341 - Posted: 21 Jul 2014 | 17:17:17 UTC - in response to Message 37320.

Eagle07 wrote: "It is the 9th GPU that pushes it over. 0-8 makes 9."

MJH wrote: "Yes. Our app has a limit of 8 GPUs/host. I'll see about getting that fixed, but it won't happen for a wee while."


Thank you for the confirmation. This allows us to move on and not waste more time testing and tweaking.