
Message boards : Number crunching : Problem of misassignment of cuda4.2 vs cuda3.1 tasks

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 26125 - Posted: 3 Jul 2012 | 10:30:16 UTC

I have made some changes to the server to add some debugging code and some other smaller changes.

Let me know if you have been given a cuda3.1 workunit that you should not have received.

gdf

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 26127 - Posted: 3 Jul 2012 | 10:32:58 UTC - in response to Message 26125.

Thank you! When should we expect the change to be fully effective? Should we wait a day to make sure any older 3.1 tasks have cleared the queue?

This will make many crunchers very happy!
____________
Thx - Paul

Note: Please don't use driver version 295 or 296! Recommended versions are 266 - 285.

Profile GDF
Message 26129 - Posted: 3 Jul 2012 | 10:34:41 UTC - in response to Message 26127.

It's in effect now for all new requests.

gdf

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,794,611,851
RAC: 9,297,069
Level
Tyr
Message 26130 - Posted: 3 Jul 2012 | 10:41:07 UTC

On a sample of one (http://www.gpugrid.net/results.php?hostid=93580), last week's 3.1 allocation has been replaced by 4.2

It will be interesting to see if this little 420M laptop can complete it within 24 hours.

Profile GDF
Message 26132 - Posted: 3 Jul 2012 | 11:09:22 UTC - in response to Message 26130.

Good for now.

gdf

HA-SOFT, s.r.o.
Joined: 3 Oct 11
Posts: 100
Credit: 5,879,292,399
RAC: 0
Level
Tyr
Message 26134 - Posted: 3 Jul 2012 | 11:51:56 UTC - in response to Message 26125.
Last modified: 3 Jul 2012 | 11:52:07 UTC

This task is 3.1 but should be 4.2

http://www.gpugrid.net/result.php?resultid=5564707

Profile GDF
Message 26137 - Posted: 3 Jul 2012 | 14:48:18 UTC - in response to Message 26134.

The problem seems to be that your machine is marked as unreliable with the cuda4.2 application, so the server decides to give the cuda3.1 one which is reliable.

I'll contact Berkeley about it.
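
A rough sketch of the behaviour described above, for anyone trying to picture it. This is not the actual BOINC scheduler code: the `AppVersion` class, its `speed` and `reliable` fields, and the `pick_app_version` function are all hypothetical names, and the numbers are made up. It only illustrates how a per-host "unreliable" mark on the cuda4.2 app could push the choice back to cuda3.1.

```python
from dataclasses import dataclass


@dataclass
class AppVersion:
    plan_class: str  # e.g. "cuda42" or "cuda31"
    speed: float     # hypothetical projected speed on this host
    reliable: bool   # hypothetical per-host reliability mark


def pick_app_version(versions):
    """Pick the fastest version the host is marked reliable on.

    If no version is marked reliable, fall back to the fastest overall.
    """
    reliable = [v for v in versions if v.reliable]
    candidates = reliable or versions
    return max(candidates, key=lambda v: v.speed)


# Made-up host: cuda42 would be faster, but it is marked unreliable.
host_versions = [
    AppVersion("cuda42", speed=1.3, reliable=False),
    AppVersion("cuda31", speed=1.0, reliable=True),
]

print(pick_app_version(host_versions).plan_class)  # cuda31
```

Under this assumed rule, the same host would get cuda42 again as soon as that version is marked reliable.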

gdf

HA-SOFT, s.r.o.
Message 26138 - Posted: 3 Jul 2012 | 15:26:57 UTC - in response to Message 26137.

This host also gets 4.2 tasks.

All of my GTX5xx hosts get a mixture of tasks. My one GTX680 host gets only 4.2 tasks.

Profile Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 26
Level
Trp
Message 26139 - Posted: 3 Jul 2012 | 15:58:34 UTC

Still getting a mix, e.g. http://www.gpugrid.net/results.php?hostid=124305

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 26140 - Posted: 3 Jul 2012 | 16:15:31 UTC - in response to Message 26139.

Is a project reset needed following this morning's update?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile GDF
Message 26141 - Posted: 3 Jul 2012 | 16:57:32 UTC - in response to Message 26140.

It should not be required, but you never know.
gdf

Richard Haselgrove
Message 26142 - Posted: 3 Jul 2012 | 17:39:47 UTC - in response to Message 26137.

The problem seems to be that your machine is marked as unreliable with the cuda4.2 application, so the server decides to give the cuda3.1 one which is reliable.

Could this be the result of the high error count with

ERROR: file deven.cpp line 1106: # Energies have become nan

which some people got with the cuda4.2 app?

I had several myself with my GTX 470 (host 43404). That's not a good host to generalise from, because I run it under app_info.xml, but in case it helps, here are my observations.

For over 3 months, I was running the cuda3.1 app with a count of 0.5, and tasks from other projects running alongside GPUGrid on the same GPU (see thread 2897). A few tasks failed, but no more than usual.

Then I swapped to cuda4.2 in the same configuration. The failure rate soared - to over 50%, by eye - and all errors were of the type 'Energies have become nan'.

Finally, I set count=1 in app_info (so that GPUGrid has sole use of the GPU while running, although it is swapped out periodically so other projects can run). Since making that change, I haven't had a single error.
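
To make the effect of the count setting concrete, here is a minimal sketch, assuming the count is simply the fraction of one GPU that each task claims. The function name `tasks_per_gpu` is made up for illustration; it is not part of BOINC.

```python
import math
from fractions import Fraction


def tasks_per_gpu(count):
    """Number of tasks that fit on one GPU if each claims `count` of it."""
    share = Fraction(str(count))  # exact arithmetic, avoids float rounding
    if share <= 0:
        raise ValueError("count must be positive")
    return math.floor(1 / share)


print(tasks_per_gpu(0.5))  # 2: two tasks share the GPU
print(tasks_per_gpu(1))    # 1: the task has the GPU to itself
```

So with a count of 0.5 a GPUGrid task runs next to one other task, and with count=1 it runs alone.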

So, perhaps, other apps in GPU memory cause a problem? I see someone else was talking about memory being a possible suspect in the news threads.

All of which leads me to suspect a buffer overflow, or use of uninitialised memory, in the cuda4.2 app. I recently helped a developer on another project pin down an error which was causing invalid data to be processed: his comments after he'd found the bug were:

I recall I always got some junk at the end of arrays (array size can be any but processing is vectorized to float4) ....

Brook build took fold buffer from CPU memory. And there was guarding zeroes written at the end to allow vectorized fetch.
Now fetch done directly from GPU buffer. And (will check this) no guard elements at buffer end.
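
A small illustration of the guard-element issue in that quote, as I understand it. This is a plain Python simulation, not the other project's code: the list `stale` stands in for leftover GPU memory, and `vectorized_sum` and `write_with_guards` are hypothetical names. The point is that a fetch done four elements at a time reads past the end of a 5-element array, so whatever sits there gets processed unless it was zeroed.

```python
WIDTH = 4  # elements read per vectorized fetch (float4)


def vectorized_sum(memory, n):
    """Sum the first n elements, reading memory in whole WIDTH-sized chunks.

    Anything in the last chunk beyond element n is read and summed as well.
    """
    total = 0.0
    for start in range(0, n, WIDTH):
        total += sum(memory[start:start + WIDTH])
    return total


def write_with_guards(memory, data):
    """Copy data into memory and zero the remainder of the last chunk."""
    padded_len = -(-len(data) // WIDTH) * WIDTH  # round up to a multiple of WIDTH
    memory[:padded_len] = list(data) + [0.0] * (padded_len - len(data))


data = [1.0, 2.0, 3.0, 4.0, 5.0]
stale = [99.0] * 8  # stands in for junk left in memory by something else

unguarded = stale.copy()
unguarded[:len(data)] = data  # data copied in, tail left as it was

guarded = stale.copy()
write_with_guards(guarded, data)  # data copied in, tail zeroed

print(vectorized_sum(unguarded, len(data)))  # 312.0, junk included
print(vectorized_sum(guarded, len(data)))    # 15.0, the correct sum
```

The unguarded result depends entirely on what happened to be in memory beforehand, which fits the observation that errors come and go with what else is using the GPU.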

The test which let us track that one down was: "If the host is regularly producing errors, perform a complete cold restart (to zero GPU RAM), and then allow tasks to run while avoiding any application which might load large amounts of data into VRAM" - so no games, video playback, photo editing etc. If the errors go away when VRAM is kept 'clean', that might be a pointer.

Mark Henderson
Joined: 21 Dec 08
Posts: 51
Credit: 26,320,167
RAC: 0
Level
Val
Message 26145 - Posted: 3 Jul 2012 | 18:36:43 UTC
Last modified: 3 Jul 2012 | 18:38:00 UTC

I got this error a few times; I solved it by raising the voltage a bit. Overclocking less aggressively would probably help too.

Profile Marty
Joined: 8 Nov 08
Posts: 3
Credit: 241,504,865
RAC: 0
Level
Leu
Message 26147 - Posted: 3 Jul 2012 | 22:08:25 UTC

This host is also getting a mix of cuda31 and cuda42 tasks.

It hasn't had an error since I installed the GTX560 in it and started running GPUGRID again.

Profile Stoneageman
Message 26148 - Posted: 3 Jul 2012 | 23:49:23 UTC

Not had a 3.1 task since my last post, so looking promising.

Profile GDF
Message 26149 - Posted: 4 Jul 2012 | 7:05:52 UTC - in response to Message 26148.

We have now implemented a correction in the scheduler, suggested by David A., which according to him should fix the problem.

Let me know.

gdf

Profile GDF
Message 26151 - Posted: 4 Jul 2012 | 9:41:08 UTC - in response to Message 26149.

Any comment? Is the problem solved?

gdf

HA-SOFT, s.r.o.
Message 26152 - Posted: 4 Jul 2012 | 10:19:14 UTC - in response to Message 26151.

Just checked. Looks good. No new mixed tasks for me.

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Level
Glu
Message 26154 - Posted: 4 Jul 2012 | 12:00:19 UTC

3 Jul 2012 | 16:41:51 UTC: that's the date my last 3.1 task was sent. It's after your 10 o'clock change. But I must wait for more WUs; the current one is 4.2, but this means nothing ^^ The GTX 285 is slowing barely down on 4.2 apps, so I need more time to wait :/
____________
DSKAG Austria Research Team: http://www.research.dskag.at



JAMES DORISIO
Joined: 6 Sep 10
Posts: 8
Credit: 2,467,192,626
RAC: 1,849,421
Level
Phe
Message 26157 - Posted: 4 Jul 2012 | 16:10:42 UTC

This computer has not received any cuda 4.2 work units since updating the driver on 6/30/2012. The last one downloaded just a few minutes ago, and it was cuda 3.1 as well. Any suggestions? http://www.gpugrid.net/show_host_detail.php?hostid=79921
Thanks Jim

Mark Henderson
Message 26158 - Posted: 4 Jul 2012 | 16:40:23 UTC

Did you try a project reset?

Profile Stoneageman
Message 26159 - Posted: 4 Jul 2012 | 17:02:31 UTC

I was sent 3.1 tasks at 7:58 UTC and 8:32 UTC. No more so far.

HA-SOFT, s.r.o.
Message 26160 - Posted: 4 Jul 2012 | 17:05:14 UTC - in response to Message 26159.
Last modified: 4 Jul 2012 | 17:05:26 UTC

Looks promising. No mixed tasks so far.

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Message 26161 - Posted: 4 Jul 2012 | 17:50:50 UTC

Haven't received any more on my 570

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 851
Level
Trp
Message 26163 - Posted: 4 Jul 2012 | 19:40:14 UTC
Last modified: 4 Jul 2012 | 19:41:40 UTC

I've received a CUDA3.1 task today on one of my hosts. However, my hosts have received far fewer CUDA3.1 tasks lately (btw most of them are turned off because we have a heatwave here in Hungary).

Profile Stoneageman
Message 26165 - Posted: 4 Jul 2012 | 22:19:57 UTC
Last modified: 4 Jul 2012 | 22:23:06 UTC

Doh! Just received a 3.1 task at 21:19:29 UTC.

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 26166 - Posted: 4 Jul 2012 | 22:36:17 UTC

I just got a 3.1 a few hours ago.
____________
Thanks - Steve

Profile GDF
Message 26177 - Posted: 5 Jul 2012 | 8:48:36 UTC - in response to Message 26166.

Guys,
if this is acceptable, then I would be happy to leave it as it is.
The boinc scheduler has its own personality with many conditionals.

gdf

Profile Retvari Zoltan
Message 26178 - Posted: 5 Jul 2012 | 9:20:03 UTC - in response to Message 26177.

This problem can be handled on the cruncher's side with my workaround.
GDF, I just want to be sure... Could it generate any problems if I crunch the CUDA3.1 tasks with the CUDA4.2 client? (as my workaround works this way)

5pot
Message 26182 - Posted: 5 Jul 2012 | 13:16:34 UTC

Yeah, I got one (3.1) on my 570 again. Of course it always sneaks in when I'm sleeping. I too would like to know if the workaround is acceptable. I will be putting it in place later myself, with your permission, GDF.

Profile GDF
Message 26191 - Posted: 5 Jul 2012 | 22:48:07 UTC - in response to Message 26182.

What is the percentage of 4.2 that you get compared to 3.1? 95% or much less?

gdf

Profile skgiven
Message 26192 - Posted: 5 Jul 2012 | 23:02:19 UTC - in response to Message 26182.
Last modified: 5 Jul 2012 | 23:11:37 UTC

2 from the last 20 for me. So presently 90%
Prior to that (~2nd July) it was closer to 50/50.
- checked a few other people's systems and that seems to be about right.
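
For anyone who wants to work out the same figure for their own host, a minimal sketch. The list of plan classes is typed in by hand here (18 cuda42 and 2 cuda31, matching the 2 out of 20 above); `cuda42_share` is a made-up helper, not something the project provides.

```python
def cuda42_share(plan_classes):
    """Percentage of tasks in the list that were cuda42."""
    if not plan_classes:
        raise ValueError("no tasks to count")
    hits = sum(1 for p in plan_classes if p == "cuda42")
    return 100.0 * hits / len(plan_classes)


# Last 20 tasks on a host: 2 were cuda31, the rest cuda42.
recent = ["cuda42"] * 18 + ["cuda31"] * 2
print(cuda42_share(recent))  # 90.0
```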

I think Boinc may periodically choose to run the 3.1 tasks in order to reconfirm that 4.2 is still faster and more stable (or not). Being generic Boinc code, on some projects some tasks may run faster on one app than on another, depending on the system and the task type. Of note, the relative performance of some x86 and x64 tasks may vary from one system to another, and some apps might become statistically more or less stable when running different task types...

Just got my first 3.1 in a while. Of course it had to be a short one!

Profile GDF
Message 26193 - Posted: 6 Jul 2012 | 9:27:48 UTC - in response to Message 26192.

Ok,
that's fine for me. Hopefully we will also remove the cuda3.1 app soon at least for long tasks.

gdf
