I've had several of these give an Error While Computing. Anyone else? These WUs seem to estimate at almost twice the computing time I normally see.
____________
|
|
|
|
I reported it 4 days ago for G92 cards (compute capability 1.1) like 9800GT, 8800 GT (G92)...
http://www.gpugrid.net/forum_thread.php?id=2274
|
|
|
Old man
Joined: 24 Jan 09 Posts: 42 Credit: 16,676,387 RAC: 0
|
Here is another one:
http://www.gpugrid.net/result.php?resultid=2935402
My card is a GTX 460.
stderr out
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 1.55 GHz
# Total amount of global memory: 804847616 bytes
# Number of multiprocessors: 7
# Number of cores: 56
MDIO ERROR: cannot open file "restart.coor"
</stderr_txt>
]]>
|
|
|
ignasi
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0
|
What drivers are you using? |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
DigitalDingus is using two 9600 GSO (767MB) cards with driver: 19745
(Q9450, XP x86).
The fail times look random:
Task | Work unit | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
2935235 | 1870438 | 8 Sep 2010 7:06:12 UTC | 8 Sep 2010 7:32:16 UTC | Error while computing | 1,496.16 | 11.69 | 6,409.23 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2934119 | 1869838 | 8 Sep 2010 2:53:40 UTC | 8 Sep 2010 7:02:58 UTC | Error while computing | 14,446.09 | 23.11 | 6,409.23 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2934086 | 1869814 | 8 Sep 2010 1:40:10 UTC | 8 Sep 2010 2:53:40 UTC | Error while computing | 2,728.41 | 11.33 | 6,322.41 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2931920 | 1868719 | 7 Sep 2010 12:15:59 UTC | 8 Sep 2010 0:21:49 UTC | Error while computing | 20,453.97 | 14.77 | 6,322.41 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2930618 | 1868078 | 7 Sep 2010 4:36:49 UTC | 12 Sep 2010 4:36:49 UTC | In progress | --- | --- | --- | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2930026 | 1867745 | 7 Sep 2010 4:03:02 UTC | 7 Sep 2010 4:36:49 UTC | Error while computing | 1,912.63 | 12.89 | 6,409.23 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2928799 | 1867124 | 6 Sep 2010 19:51:29 UTC | 6 Sep 2010 22:04:25 UTC | Error while computing | 7,864.14 | 8.88 | 6,016.70 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2928286 | 1866896 | 6 Sep 2010 15:19:01 UTC | 7 Sep 2010 18:40:38 UTC | Completed and validated | 72,823.73 | 1,372.66 | 4,535.61 | 5,669.51 | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2927745 | 1866582 | 6 Sep 2010 15:19:01 UTC | 6 Sep 2010 16:51:46 UTC | Error while computing | 5,424.13 | 41.77 | 6,016.70 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2925177 | 1865300 | 5 Sep 2010 21:53:33 UTC | 6 Sep 2010 15:19:01 UTC | Error while computing | 36,642.95 | 80.09 | 6,409.23 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
2924932 | 1865162 | 5 Sep 2010 20:14:39 UTC | 6 Sep 2010 15:19:01 UTC | Error while computing | 42,419.78 | 43.20 | 6,322.41 | --- | ACEMD2: GPU molecular dynamics v6.05 (cuda)
I would suggest you try the latest driver (258.96). If you keep getting failures, try to find out what else is running when these tasks crash (if anything).
Tapio, your task failed after 4 sec of GPU time. Some tasks seem to fail within 20 sec. These are not very significant and do not reduce your contribution by much. Your card seems to be running well. |
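For anyone who wants to tally a task list like the one above, here is a minimal sketch. The statuses and run times are copied by hand from the eleven rows quoted in this post; the variable names are made up for illustration and nothing here talks to the GPUGrid server.

```python
# Status and run time (seconds) copied from the eleven rows quoted above.
# None marks the task that was still in progress.
tasks = [
    ("Error while computing", 1496.16),
    ("Error while computing", 14446.09),
    ("Error while computing", 2728.41),
    ("Error while computing", 20453.97),
    ("In progress", None),
    ("Error while computing", 1912.63),
    ("Error while computing", 7864.14),
    ("Completed and validated", 72823.73),
    ("Error while computing", 5424.13),
    ("Error while computing", 36642.95),
    ("Error while computing", 42419.78),
]

# Run times of the failed tasks, and of every task that has finished.
failed = [t for status, t in tasks if status == "Error while computing"]
finished = [t for status, t in tasks if t is not None]

failure_rate = len(failed) / len(finished)
print(f"{len(failed)} of {len(finished)} finished tasks failed ({failure_rate:.0%})")
print(f"failures occurred between {min(failed):.0f} s and {max(failed):.0f} s of run time")
```

The wide spread between the shortest and longest failed run is what "random" refers to here: the tasks do not die at one fixed point.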
|
|
ftpd
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0
|
@skgiven,
I had the same problems with Windows XP Pro + GTS 250 + driver 258.96, after many hours of processing. See the other thread.
Success
____________
Ton (ftpd) Netherlands |
|
|
|
Will try the newer nVidia drivers, if any exist. Just upgraded to the latest BOINC in case it made a difference, but it did not. Other than that, I'll be crunching Collatz for a while I think. |
|
|
ftpd
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0
|
Driver 258.96 exists for this card.
Please try it!
Good luck
____________
Ton (ftpd) Netherlands |
|
|
Olivier
Joined: 12 Jun 09 Posts: 1 Credit: 2,063,022 RAC: 0
|
Same problem here, unfortunately. There's something wrong with those KASHIF units ... |
|
|
ftpd
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0
|
@skgiven
Hi Kev,
Again, after several hours (6) of processing, the task aborted. Windows XP Pro, GTS 250, driver 258.96.
It also raises a Windows error message that waits for an answer, so there was no further processing during the night. I do not like this kind of error. Please do not send these tasks to this type of GPU card anymore.
Good luck.
____________
Ton (ftpd) Netherlands |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
The HIVPR_n1_bound tasks seem very troublesome on CC1.1 cards. I made suggestions to allow crunchers to opt out of crunching some task types. It would involve some work for the scientists on the project design and server layout. If GDF can get it implemented it would allow crunchers to deselect troublesome projects, which would make it useful for other problems too.
Did an update try to automatically install on your system overnight?
I think the issue primarily relates to crunching those tasks, and only occasionally appears for other tasks, so perhaps this can be worked around by the programmers; you managed to crunch two revlo_TRYP work units in the last couple of days, so the card is still a useful, working card. We just need you to crunch the good tasks for that type of card.
|
|
|
ftpd
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0
|
The error from GPUgrid (HIVPR) causes a Windows error message, which was waiting for a reply (send or don't send to Microsoft). So all GPU tasks were waiting during the night.
Keep on crunching!
____________
Ton (ftpd) Netherlands |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
I expect the Microsoft Error was along the lines of,
acemd2_6.05_windows_intelx86__cuda *32 has stopped working.
If you are sitting at the computer and see this error message pop up, sometimes you can press the system restart button (on the computer case), and when it restarts the task is often able to pick up from the last checkpoint, so you don't lose the task. That would not work after a minute, never mind sometime overnight.
I'm guessing you have already restarted the system.
Do you know from the logs (error logs) if a system update occurred at the time of the error message, or if some backup, defrag or other heavy CPU app ran, just in case something other than the task/driver is at fault here? |
|
|
ftpd
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0
|
Hi Kev,
I use this machine only for crunching 24/7, so no back-up, no updates etc.
Just Gpugrid and RNA or Ibercivis or Freehal. I do not have to restart this system.
Success!
____________
Ton (ftpd) Netherlands |
|
|
|
I have the same problems with this card:
NVIDIA GPU 0: GeForce 9600 GT (driver version 25721, CUDA version 3010, compute capability 1.1, 496MB, 218 GFLOPS peak)
here's an example:
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [PmeRealSpace_compute_forces_kernel] [999]
Assertion failed: 0, file swanlib_nv.cpp, line 121
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
|
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Thanks for reporting the error. The same error has been posted up several times now, and the developers are aware of it.
A driver bug is catching out the applications when they run on CC1.1 cards. It does not always occur but is a concern. With long complex GPU calculations the odd error is always expected, but these tasks are more problematic than others.
Several suggestions and potential workarounds have been made. |
|
|
|
I expect the Microsoft Error was along the lines of,
acemd2_6.05_windows_intelx86__cuda *32 has stopped working.
If you are sitting at the computer and see this error message pop up, sometimes you can press the system restart button (on the computer case), and when it restarts the task is often able to pick up from the last checkpoint, so you don't lose the task. That would not work after a minute, never mind sometime overnight.
I did this trick several times over the last month (four 9800GT cards).
A system restart without clicking away the error pop-up mostly worked for me, even hours after the error happened.
With the current KASHIF_HIVPR_*_bound* (*_unbound*) errors it never worked. |
|
|
|
Computer ID 78963
Report deadline 15 Sep 2010 15:54:10 UTC
Run time 11402.593746
CPU time 736.2813
stderr out
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
MDIO ERROR: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Time per step (avg over 275000 steps): 11.463 ms
# Approximate elapsed time for entire WU: 11462.898 s
called boinc_finish
</stderr_txt>
]]>
Validate state Valid
Claimed credit 6322.41203703704
Granted credit 9483.61805555556
application version ACEMD2: GPU molecular dynamics v6.11 (cuda31)
With a 9800GTX+, it didn't work either.
____________
Knight Who Says Ni N! |
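As a quick sanity check on the figures in that GTX 480 result, the arithmetic can be reproduced as below. This is only a sketch: the four numbers are copied from the log and result summary above, the variable names are made up, and no claim is made about how the project actually derives its credit.

```python
# Figures copied from the GTX 480 log and result summary above.
time_per_step_ms = 11.463        # "Time per step (avg over 275000 steps)"
total_elapsed_s = 11462.898      # "Approximate elapsed time for entire WU"
claimed_credit = 6322.41203703704
granted_credit = 9483.61805555556

# Number of steps implied by the two timing lines (ms converted to s).
implied_steps = total_elapsed_s / (time_per_step_ms / 1000.0)

# Ratio of granted to claimed credit for this result.
credit_ratio = granted_credit / claimed_credit

print(f"implied steps: about {implied_steps:,.0f}")
print(f"granted / claimed credit: {credit_ratio:.3f}")
```

The timing lines work out to roughly one million steps for the whole work unit (the per-step time is rounded in the log), and the granted credit for this result is about 1.5 times the claimed credit.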
|
|
|
Fred ... you posted results from a good run on a 480, and it does not look like you are even running a 9800 anymore, so I'm not sure where you were going with that.
____________
Thanks - Steve |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Fred used to have a GTX470 and is now using a GTX480. That task completed on his 480 but failed on a GTX460 (not a 9800GTX+). I did see a 9800 failure against one of his GTX470 successes.
Fred, keep your good cards hooked up to GPUGrid, a GTX480 would be wasted anywhere else. |
|
|
|
Since the 9800GTX+ started making 'trouble', such as overheating, which resulted
in faults, I first got a GTX470, which I received in exchange for repairing a PII (Compaq).
Then I was able to buy a 'show model' that I had seen working
(all kinds of simulations). I bought it for €275 (normally €485 plus VAT).
I found out that these 'monsters' need a 650W PSU at minimum; 850W is better.
It draws 17A from its 8-pin and 17A from its 6-pin connectors, and an additional ~6 - 10A from the mainboard (ASUS P5E).
Now I have to find a way to get the 470 to work..........
But I'm glad I made the change: for GPUGrid it's working like a charm, and on
SETI@Home I can now run 3 MultiBeams (0.04 CPU + 0.33 GPU) at a time, so sometimes
BOINC 6.10.58, 64-bit, runs 7 SETI tasks and/or a mix of Einstein and other projects.
I use driver 258.96 and CUDA 3.1.
And it looks like those KASHIF_HIVPR WUs need compute capability
2.0. (2.1?)
____________
Knight Who Says Ni N! |
|
|
mwgiii
Joined: 22 Jan 09 Posts: 8 Credit: 988,332,833 RAC: 0
|
All of the KASHIF_HIVPR are generating errors on both of my machines.
Out of the first two pages of my Tasks (40 work units), I have had 24 work units error out, all KASHIF_HIVPR. It is killing my contributions; as ftpd said, the GPU crunching halts until I notice the error message.
____________
|
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Probably best to do a system restart and then abort the download of any KASHIF_HIVPR tasks that you pick up.
Hopefully you will pick up other work units. |
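The advice above amounts to a filter on task names. A minimal sketch, assuming task names contain the batch label `KASHIF_HIVPR` as they do in this thread; the helper `should_abort` and the sample task names are hypothetical and are not part of BOINC or GPUGrid.

```python
# Batch label reported as failing in this thread.
TROUBLESOME_LABELS = ("KASHIF_HIVPR",)

def should_abort(task_name: str) -> bool:
    """Return True if the task name carries a label reported as failing."""
    return any(label in task_name for label in TROUBLESOME_LABELS)

# Hypothetical task names, made up for illustration only.
queue = [
    "example-KASHIF_HIVPR_n1_bound-0",
    "example-TONI-0",
    "example-revlo_TRYP-0",
]

# Select the tasks that would be aborted by hand in the BOINC Manager.
to_abort = [name for name in queue if should_abort(name)]
print(to_abort)
```

The script only picks out names; the abort itself would still be done by hand on the Tasks tab.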
|
|
mwgiii
Joined: 22 Jan 09 Posts: 8 Credit: 988,332,833 RAC: 0
|
I reboot around every other day. If I see any more KASHIF tasks, I will abort them immediately.
____________
|
|
|
|
It also seems that all KASHIF.. tasks fail on my GeForce 9800 GT :-/
(OK, currently everything failed because of an OC attempt, but KASHIF tasks didn't work before the OC either ^^)
____________
|
|
|
|
I have the same error:
http://www.gpugrid.net/result.php?resultid=3030293
http://www.gpugrid.net/result.php?resultid=3028306
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [transpose_float2] [700]
acemd2_6.04_x86_64-pc-linux-gnu__cuda: ../swan/swanlib_nv.cpp:203: void swanRunKernel(const char*, int3, int3, size_t, ...): Assertion `0' failed.
SIGABRT: abort called
Stack trace (17 frames):
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(boinc_catch_signal+0x4d)[0x46438d]
/lib/libc.so.6(+0x324c0)[0x7f4e7810d4c0]
/lib/libc.so.6(gsignal+0x35)[0x7f4e7810d445]
/lib/libc.so.6(abort+0x180)[0x7f4e7810e860]
/lib/libc.so.6(__assert_fail+0xf1)[0x7f4e781064e1]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x459c20]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x45feae]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x46032f]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x45db09]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x45b400]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x45a864]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x428e20]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x41253c]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(sin+0xab0)[0x407f10]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(sin+0x2bb)[0x40771b]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f4e780f9d2d]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(sinh+0x49)[0x407569]
Exiting...
</stderr_txt>
]]>
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [PmeRealSpace_compute_forces_kernel] [700]
acemd2_6.04_x86_64-pc-linux-gnu__cuda: ../swan/swanlib_nv.cpp:203: void swanRunKernel(const char*, int3, int3, size_t, ...): Assertion `0' failed.
SIGABRT: abort called
Stack trace (14 frames):
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(boinc_catch_signal+0x4d)[0x46438d]
/lib/libc.so.6(+0x324c0)[0x7f1c49b544c0]
/lib/libc.so.6(gsignal+0x35)[0x7f1c49b54445]
/lib/libc.so.6(abort+0x180)[0x7f1c49b55860]
/lib/libc.so.6(__assert_fail+0xf1)[0x7f1c49b4d4e1]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x459c20]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x45d3f9]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x45a864]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x428e20]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda[0x41253c]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(sin+0xab0)[0x407f10]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(sin+0x2bb)[0x40771b]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f1c49b40d2d]
../../projects/www.gpugrid.net/acemd2_6.04_x86_64-pc-linux-gnu__cuda(sinh+0x49)[0x407569]
Exiting...
</stderr_txt>
]]>
Only on KASHIF tasks. TONI tasks always work fine. |
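To compare failures across logs like the two above, the kernel name and the bracketed number can be pulled out of each SWAN line. A minimal sketch, assuming the line format seen in this thread; the function name `find_swan_failures` is made up, and the meaning of the bracketed number is not explained anywhere in the thread, so the code only extracts it.

```python
import re

# Matches lines such as:
# SWAN : FATAL : Failure executing kernel sync [transpose_float2] [700]
SWAN_FATAL = re.compile(
    r"SWAN : FATAL : Failure executing kernel sync "
    r"\[(?P<kernel>[^\]]+)\] \[(?P<code>\d+)\]"
)

def find_swan_failures(stderr_text: str):
    """Return (kernel_name, code) pairs for every SWAN fatal line found."""
    return [
        (m.group("kernel"), int(m.group("code")))
        for m in SWAN_FATAL.finditer(stderr_text)
    ]

# Lines copied from the two logs quoted above.
sample = """\
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [transpose_float2] [700]
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [PmeRealSpace_compute_forces_kernel] [700]
"""

for kernel, code in find_swan_failures(sample):
    print(kernel, code)
```

Note that the `restart.coor` line also appears in the GTX 480 result earlier in the thread, which validated, so on its own it does not mark a failed task.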
|
|
|
I found the cause of my error: automatic suspend. After the restart, KASHIF tasks fail with an error. |
|
|
Saenger
Joined: 20 Jul 08 Posts: 134 Credit: 23,657,183 RAC: 0
|
I just had this one wrecked:
stderr out
<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.34 GHz
# Total amount of global memory: 536150016 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
</stderr_txt>
]]>
I don't have the faintest idea why it was restarted (or what "restart.coor" is good for at all), I don't run other projects on the GPU in parallel, and I wasn't doing anything on the machine at that time.
____________
Gruesse vom Saenger
For questions about Boinc look in the BOINC-Wiki |
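On the "process exited with code" lines in this thread: the three numbers look like one value printed three ways. In both lines of that form quoted here, the last number equals the exit code minus 256 (193 gives -63, 1 gives -255). The sketch below only checks that observation against the two quoted lines; it is a guess about the formatting, not documented BOINC behaviour.

```python
import re

# Pattern for lines such as: process exited with code 193 (0xc1, -63)
EXIT_LINE = re.compile(
    r"process exited with code (\d+) \(0x([0-9a-fA-F]+), (-?\d+)\)"
)

# The two lines of this form quoted in the thread.
lines = [
    "process exited with code 193 (0xc1, -63)",
    "process exited with code 1 (0x1, -255)",
]

checks = []
for line in lines:
    m = EXIT_LINE.search(line)
    code = int(m.group(1))
    hex_value = int(m.group(2), 16)
    third = int(m.group(3))
    # Record whether the hex form matches and whether third == code - 256.
    checks.append((code, hex_value == code, third == code - 256))

for entry in checks:
    print(entry)
```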
|
|