Message boards : Number crunching : Problem of misassignment of cuda4.2 vs cuda3.1 tasks
Author | Message |
---|---|
I have made some changes to the server: some debugging code, plus some other smaller changes. | |
ID: 26125 | |
Thank you! When should we expect the change to be fully effective? Should we wait a day to make sure any older 3.1 tasks have cleared the queue? | |
ID: 26127 | |
It's in effect now for all new requests. | |
ID: 26129 | |
On a sample of one (http://www.gpugrid.net/results.php?hostid=93580), last week's 3.1 allocation has been replaced by 4.2. | |
ID: 26130 | |
Good for now. | |
ID: 26132 | |
This task is 3.1 but should be 4.2 | |
ID: 26134 | |
The problem seems to be that your machine is marked as unreliable with the cuda4.2 application, so the server decides to give the cuda3.1 one which is reliable. | |
ID: 26137 | |
This host also gets 4.2 tasks. | |
ID: 26138 | |
Still getting a mix, e.g. http://www.gpugrid.net/results.php?hostid=124305 | |
ID: 26139 | |
Is a project reset needed following this morning's update? | |
ID: 26140 | |
It should not be required, but you never know. | |
ID: 26141 | |
"The problem seems to be that your machine is marked as unreliable with the cuda4.2 application, so the server decides to give the cuda3.1 one which is reliable." Could this be the result of the high error count with "ERROR: file deven.cpp line 1106: # Energies have become nan", which some people got with the cuda4.2 app? I had several myself with my GTX 470 (host 43404). That's not a good host to generalise from, because I run it under app_info.xml, but in case it helps, here are my observations. For over 3 months, I was running the cuda3.1 app with a count of 0.5, with tasks from other projects running alongside GPUGrid on the same GPU (see thread 2897). A few tasks failed, but no more than usual. Then I swapped to cuda4.2 in the same configuration. The failure rate soared, to over 50% by eye, and all errors were of the type 'Energies have become nan'. Finally, I set count=1 in app_info (so that GPUGrid has sole use of the GPU while running, although it is swapped out periodically so other projects can run). Since making that change, I haven't had a single error. So perhaps other apps in GPU memory cause a problem? I see someone else was talking about memory being a possible suspect in the news threads. All of which leads me to suspect a buffer overflow, or use of uninitialised memory, in the cuda4.2 app. I recently helped a developer on another project pin down an error which was causing invalid data to be processed. His comment after he'd found the bug was: "I recall I always got some junk at the end of arrays (array size can be any but processing is vectorized to float4) ...." The test which let us track that one down was: "If the host is regularly producing errors, perform a complete cold restart (to zero GPU RAM), and then allow tasks to run while avoiding any application which might load large amounts of data into VRAM", so no games, video playback, photo editing etc. If the errors go away when VRAM is kept 'clean', that might be a pointer. | |
ID: 26142 | |
I got this error a few times; I solved it by raising the voltage a bit. Not overclocking as much would probably help too. | |
ID: 26145 | |
This host is also getting a mix of cuda31 and cuda42 tasks. | |
ID: 26147 | |
Not had a 3.1 task since my last post, so looking promising. | |
ID: 26148 | |
We have now implemented a correction in the scheduler, suggested by David A., which according to him should fix the problem. | |
ID: 26149 | |
Any comment? Is the problem solved? | |
ID: 26151 | |
Just checked. Looks good. No new mixed tasks for me. | |
ID: 26152 | |
3 Jul 2012, 16:41:51 UTC: that's the date my last 3.1 task was sent. It's after your 10 o'clock. But I must wait for more WUs; the current one is 4.2, but that means nothing ^^ The GTX 285 barely slows down on 4.2 apps, so I need more time to wait :/ | |
ID: 26154 | |
This computer has not received any cuda 4.2 work units since I updated the driver on 6/30/2012. The latest one downloaded a few minutes ago, and it was also cuda 3.1. Any suggestions? http://www.gpugrid.net/show_host_detail.php?hostid=79921 | |
ID: 26157 | |
Did you try a project reset? | |
ID: 26158 | |
Were sent 3.1 tasks at 7:58 UTC & 8:32 UTC. No more so far. | |
ID: 26159 | |
Looks promising. No mixed tasks so far. | |
ID: 26160 | |
Haven't received any more on my 570. | |
ID: 26161 | |
I've received a CUDA3.1 task today on one of my hosts. However, my hosts have been receiving far fewer CUDA3.1 tasks lately (btw most of them are turned off because we have a heatwave here in Hungary). | |
ID: 26163 | |
Doh! Just received a 3.1 task at 21:19:29 UTC. | |
ID: 26165 | |
I just got a 3.1 a few hours ago. | |
ID: 26166 | |
Guys, | |
ID: 26177 | |
This problem can be handled on the cruncher's side with my workaround. | |
ID: 26178 | |
Yeah, I got one (3.1) on my 570 again. Of course it always sneaks in when I'm sleeping. I too would like to know if the workaround is acceptable. I will be putting it in place later myself, with your permission, GDF. | |
ID: 26182 | |
What percentage of the tasks you get are 4.2 rather than 3.1? 95%, or much less? | |
ID: 26191 | |
2 of the last 20 were 3.1 for me, so presently 90% are 4.2. | |
ID: 26192 | |
Ok, | |
ID: 26193 | |
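
One post in the thread quotes a developer on another project who "always got some junk at the end of arrays (array size can be any but processing is vectorized to float4)". The sketch below illustrates that class of bug in plain Python. It is not code from the cuda4.2 app or from that other project. Every name in it is hypothetical, and using NaN as the "junk" value is only an assumption chosen to echo the 'Energies have become nan' error. A buffer is processed four values at a time, so when the valid length is not a multiple of four, the last step also reads whatever was left in the tail of the allocation.

```python
import math

LANES = 4  # a float4 step handles four floats at once


def padded_length(n, lanes=LANES):
    """Round n up to the next multiple of `lanes`."""
    return (n + lanes - 1) // lanes * lanes


def allocate_dirty(n_valid, junk=math.nan):
    """Simulate an uninitialised allocation: every slot starts as junk."""
    return [junk] * padded_length(n_valid)


def upload(buffer, values, clear_tail):
    """Copy the valid values in; optionally zero the padding slots."""
    buffer[:len(values)] = values
    if clear_tail:
        for i in range(len(values), len(buffer)):
            buffer[i] = 0.0


def sum_in_float4_steps(buffer, n_valid):
    """Sum in chunks of LANES; the last chunk may include padding slots."""
    total = 0.0
    for start in range(0, padded_length(n_valid), LANES):
        total += sum(buffer[start:start + LANES])
    return total


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # 6 values: not a multiple of 4

dirty = allocate_dirty(len(values))
upload(dirty, values, clear_tail=False)  # tail still holds junk

clean = allocate_dirty(len(values))
upload(clean, values, clear_tail=True)   # tail explicitly zeroed

print(math.isnan(sum_in_float4_steps(dirty, len(values))))  # True
print(sum_in_float4_steps(clean, len(values)))              # 21.0
```

With the tail left untouched, the junk leaks into the result; zeroing the padding removes the dependence on whatever was previously in memory. That is consistent with the "keep VRAM clean" test described in the thread, though whether the cuda4.2 app has this kind of bug is not established here.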