New NOELIA Longruns

Message boards : Number crunching : New NOELIA Longruns

Author	Message
HA-SOFT, s.r.o. Send message Joined: 3 Oct 11 Posts: 100 Credit: 5,879,292,399 RAC: 0 Level Scientific publications	Message 26446 - Posted: 25 Jul 2012 \| 19:54:23 UTC Last modified: 25 Jul 2012 \| 20:26:46 UTC
	All new NOELIA longrun tasks errored out on my PCs immediately after start. Other tasks run ok. EDIT: CUDA31 tasks run ok, CUDA42 not.
	ID: 26446 \| Rating: 0 \| rate: / Reply Quote

JugNut Send message Joined: 27 Nov 11 Posts: 11 Credit: 1,021,749,297 RAC: 0 Level Scientific publications	Message 26449 - Posted: 25 Jul 2012 \| 20:47:01 UTC - in response to Message 26446. Last modified: 25 Jul 2012 \| 21:19:53 UTC
	All new NOELIA longrun tasks errored out on my PCs immediately after start. Other tasks run ok. EDIT: CUDA31 tasks run ok, CUDA42 not. Me too, exactly the same. All NOELIAs fail so far, both CUDA31 and CUDA42. All other CUDA42 WUs work fine. GTX580, Win7 x64
	ID: 26449 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26451 - Posted: 25 Jul 2012 \| 22:11:42 UTC - in response to Message 26449. Last modified: 25 Jul 2012 \| 22:12:45 UTC
	What is wrong with these units? They have all crashed. http://www.gpugrid.net/results.php?hostid=127986&offset=0&show_names=1&state=0&appid=
	ID: 26451 \| Rating: 0 \| rate: / Reply Quote

flashawk Send message Joined: 18 Jun 12 Posts: 297 Credit: 3,572,627,986 RAC: 0 Level Scientific publications	Message 26452 - Posted: 25 Jul 2012 \| 23:25:48 UTC - in response to Message 26451.
	Same here, I've had 3 error out, 2 on a GTX 670 and 1 on a GTX 560 with another queued right now. ____________
	ID: 26452 \| Rating: 0 \| rate: / Reply Quote

Rayzor Send message Joined: 19 Jan 11 Posts: 13 Credit: 294,225,579 RAC: 0 Level Scientific publications	Message 26454 - Posted: 26 Jul 2012 \| 0:25:29 UTC
	So far, 3 out of 3 cuda31 tasks failed on a GTX 275, Windows XP Pro 32: http://www.gpugrid.net/results.php?hostid=124381
	run1_replica3-NOELIA_sh2fragment_run-0-4-RND4005_1
	run4_replica1-NOELIA_sh2fragment_run-0-4-RND5679_2
	run1_replica48-NOELIA_sh2fragment_run-0-4-RND0084_0
	ID: 26454 \| Rating: 0 \| rate: / Reply Quote

Carlos Augusto Engel Send message Joined: 5 Jun 09 Posts: 38 Credit: 2,880,758,878 RAC: 0 Level Scientific publications	Message 26456 - Posted: 26 Jul 2012 \| 1:40:42 UTC - in response to Message 26454.
	Error in all NOELIA's tasks on GTX 580 and GTX 570.
	run3_replica47-NOELIA_sh2fragment_run-0-4-RND8455_4 3597875 117522 26 Jul 2012 \| 1:11:26 UTC 26 Jul 2012 \| 1:17:34 UTC Error while computing 10.10 2.07 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run2_replica33-NOELIA_sh2fragment_run-0-4-RND3214_3 3597817 117522 26 Jul 2012 \| 0:59:06 UTC 26 Jul 2012 \| 1:05:16 UTC Error while computing 10.11 1.92 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run1_replica34-NOELIA_sh2fragment_run-0-4-RND4425_3 3597770 117522 26 Jul 2012 \| 0:16:07 UTC 26 Jul 2012 \| 0:22:12 UTC Error while computing 10.07 1.83 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run7_replica15-NOELIA_sh2fragment_run-0-4-RND0291_4 3598026 117522 26 Jul 2012 \| 0:46:39 UTC 26 Jul 2012 \| 0:52:50 UTC Error while computing 10.13 1.67 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run6_replica15-NOELIA_sh2fragment_run-0-4-RND1911_4 3597982 117522 26 Jul 2012 \| 0:09:58 UTC 26 Jul 2012 \| 0:16:07 UTC Error while computing 10.08 1.84 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run6_replica14-NOELIA_sh2fragment_run-0-4-RND6550_3 3597981 117522 26 Jul 2012 \| 0:34:26 UTC 26 Jul 2012 \| 0:40:31 UTC Error while computing 11.07 2.07 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run7_replica46-NOELIA_sh2fragment_run-0-4-RND8233_1 3598055 117522 26 Jul 2012 \| 0:04:58 UTC 26 Jul 2012 \| 0:09:58 UTC Error while computing 10.06 2.25 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run1_replica50-NOELIA_sh2fragment_run-0-4-RND3470_4 3597789 117522 26 Jul 2012 \| 1:05:16 UTC 26 Jul 2012 \| 1:11:26 UTC Error while computing 10.07 1.92 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run6_replica6-NOELIA_sh2fragment_run-0-4-RND3986_2 3598016 117522 26 Jul 2012 \| 0:22:12 UTC 26 Jul 2012 \| 0:28:21 UTC Error while computing 10.32 1.78 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run2_replica17-NOELIA_sh2fragment_run-0-4-RND4720_4 3597800 117522 26 Jul 2012 \| 0:52:50 UTC 26 Jul 2012 \| 0:59:06 UTC Error while computing 10.06 1.87 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run7_replica19-NOELIA_sh2fragment_run-0-4-RND9229_1 3598030 117522 26 Jul 2012 \| 0:28:21 UTC 26 Jul 2012 \| 0:34:26 UTC Error while computing 10.09 1.44 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run9_replica39-NOELIA_sh2fragment_run-0-4-RND8136_2 3598123 117522 26 Jul 2012 \| 0:40:31 UTC 26 Jul 2012 \| 0:46:39 UTC Error while computing 10.07 1.89 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run5_replica8-NOELIA_sh2fragment_run-0-4-RND4219_1 3597974 117522 25 Jul 2012 \| 22:11:24 UTC 26 Jul 2012 \| 0:04:58 UTC Error while computing 10.30 2.00 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	run7_replica38-NOELIA_sh2fragment_run-0-4-RND5635_1 3597691 101457 25 Jul 2012 \| 21:55:52 UTC 26 Jul 2012 \| 0:27:04 UTC Error while computing 15.40 1.72 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
	____________
	ID: 26456 \| Rating: 0 \| rate: / Reply Quote

[PUGLIA] kidkidkid3 Send message Joined: 23 Feb 11 Posts: 98 Credit: 1,285,571,653 RAC: 2,051,418 Level Scientific publications	Message 26461 - Posted: 26 Jul 2012 \| 5:31:09 UTC - in response to Message 26456.
	Hi Noelia, same error (twice) also for me in
	http://www.gpugrid.net/result.php?resultid=5664840
	http://www.gpugrid.net/result.php?resultid=5664472
	<core_client_version>7.0.28</core_client_version>
	<![CDATA[
	<message>
	- exit code 98 (0x62)
	</message>
	<stderr_txt>
	MDIO: cannot open file "restart.coor"
	ERROR: file deven.cpp line 1106: # Energies have become nan
	called boinc_finish
	</stderr_txt>
	]]>
	I'll stop or cancel your WU. k.
	____________ Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing. (Martin Luther King)
	ID: 26461 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26465 - Posted: 26 Jul 2012 \| 7:59:36 UTC Last modified: 26 Jul 2012 \| 8:10:55 UTC
	Just had a look at my tasks. Looks like all the sh2fragment long work units are failing for everybody, not just me. Obviously a bad batch of work units, seeing as everybody is failing them. I have sent her a PM so hopefully they will sort things out soon. ____________ BOINC blog
	ID: 26465 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26468 - Posted: 26 Jul 2012 \| 13:57:11 UTC - in response to Message 26465. Last modified: 26 Jul 2012 \| 13:59:02 UTC
	These NOELIA_sh2fragment units are all crashing with the same error message:
	<core_client_version>7.0.28</core_client_version>
	<![CDATA[
	<message>
	- exit code 98 (0x62)
	</message>
	<stderr_txt>
	MDIO: cannot open file "restart.coor"
	ERROR: file deven.cpp line 1106: # Energies have become nan
	called boinc_finish
	</stderr_txt>
	]]>
	Isn't it time to cancel this batch of units already?
	ID: 26468 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26492 - Posted: 28 Jul 2012 \| 3:49:48 UTC - in response to Message 26468.
	Just completed the first NOELIA_sh2fragment unit successfully. See link below: http://www.gpugrid.net/workunit.php?wuid=3601869 Whatever you did to fix the bug, worked. I have 2 more such units still crunching. Hopefully, they will be successful too.
	ID: 26492 \| Rating: 0 \| rate: / Reply Quote

SMTB1963 Send message Joined: 27 Jun 10 Posts: 38 Credit: 524,420,921 RAC: 0 Level Scientific publications	Message 26494 - Posted: 28 Jul 2012 \| 4:27:58 UTC - in response to Message 26492.
	Whatever you did to fix the bug, worked. Me too!
	ID: 26494 \| Rating: 0 \| rate: / Reply Quote

Stoneageman Send message Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 186 Level Scientific publications	Message 26495 - Posted: 28 Jul 2012 \| 10:12:48 UTC
	Just had a failure after 7hrs of a NOELIA run9 replica21 task on a reliable (up to now) card. The card is now working ok on a PAOLA task.
	<core_client_version>7.0.28</core_client_version>
	<![CDATA[
	<message>
	process exited with code 193 (0xc1, -63)
	</message>
	<stderr_txt>
	MDIO: cannot open file "restart.coor"
	SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1574.
	acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
	SIGABRT: abort called
	Stack trace (15 frames):
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
	/lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fa96d2cf4c0]
	/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa96d2cf445]
	/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7fa96d2d2bab]
	/lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7fa96d2c810e]
	/lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7fa96d2c81b2]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fa96d2ba76d]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]
	Exiting...
	</stderr_txt>
	ID: 26495 \| Rating: 0 \| rate: / Reply Quote
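The three numbers in "process exited with code 193 (0xc1, -63)" look like one value printed three ways: decimal, hexadecimal, and the same byte read as a signed 8-bit integer. The sketch below only checks that arithmetic; `describe_exit_code` is a hypothetical helper name, not part of BOINC or ACEMD.

```python
def describe_exit_code(code: int) -> str:
    """Render an exit status as decimal, hex, and signed 8-bit value."""
    low_byte = code & 0xFF
    # Two's-complement reading of the low byte: 128..255 map to -128..-1.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return f"{code} (0x{low_byte:x}, {signed})"


print(describe_exit_code(193))  # the crash reported above
print(describe_exit_code(98))   # the earlier bad-batch exit code
```

Run as written, the first call reproduces the "193 (0xc1, -63)" triple from the log; 98 (0x62) is below 128, so its signed reading is unchanged.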

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26496 - Posted: 28 Jul 2012 \| 14:39:07 UTC - in response to Message 26492.
	Just completed the first NOELIA_sh2fragment unit successfully. See link below: http://www.gpugrid.net/workunit.php?wuid=3601869 Whatever you did to fix the bug, worked. I have 2 more such units still crunching. Hopefully, they will be successful too. Two more of these units completed successfully. See links below: http://www.gpugrid.net/workunit.php?wuid=3601862 http://www.gpugrid.net/workunit.php?wuid=3601963 Though one took about 16 hours to complete, while the other two took about 9 to 10 hours.
	ID: 26496 \| Rating: 0 \| rate: / Reply Quote

HA-SOFT, s.r.o. Send message Joined: 3 Oct 11 Posts: 100 Credit: 5,879,292,399 RAC: 0 Level Scientific publications	Message 26498 - Posted: 28 Jul 2012 \| 17:55:46 UTC - in response to Message 26495.
	Just had a failure after 7hrs of a NOELIA run9 replica21 task on a reliable (up to now) card. The card is now working ok on a PAOLA task. <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed. SIGABRT: abort called Stack trace (15 frames): ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d] /lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fa96d2cf4c0] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa96d2cf445] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7fa96d2d2bab] /lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7fa96d2c810e] /lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7fa96d2c81b2] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fa96d2ba76d] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9] Exiting... </stderr_txt> Exactly the same on my GTX580 under Linux.
	ID: 26498 \| Rating: 0 \| rate: / Reply Quote

Stoneageman Send message Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 186 Level Scientific publications	Message 26500 - Posted: 28 Jul 2012 \| 20:42:01 UTC
	And I've had another fail after 7 hrs, as it did for the host before me. Considering aborting all NOELIA tasks now :-( http://www.gpugrid.net/workunit.php?wuid=3601979
	ID: 26500 \| Rating: 0 \| rate: / Reply Quote

ritterm Send message Joined: 31 Jul 09 Posts: 88 Credit: 244,413,897 RAC: 0 Level Scientific publications	Message 26502 - Posted: 28 Jul 2012 \| 22:18:16 UTC - in response to Message 26500.
	Considering aborting all NOELIA tasks now :-( Me, too... I thought maybe I was okay after this one finished successfully: run10_replica21-NOELIA_sh2fragment_fixed-0-4-RND7749_0 But then the one I had queued up next crashed after 14-plus hours: run9_replica1-NOELIA_sh2fragment_fixed-0-4-RND6355_1 Both had the "MDIO: cannot open file 'restart.coor'" message in the stderr output. ____________
	ID: 26502 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,249,865,968 RAC: 4,089,892 Level Scientific publications	Message 26504 - Posted: 28 Jul 2012 \| 22:55:19 UTC - in response to Message 26502. Last modified: 28 Jul 2012 \| 22:57:06 UTC
	Both had the "MDIO: cannot open file 'restart.coor'" message in the stderr output. This is a false error message. It appears in every task, even in the successful ones. BTW these "fixed" NOELIA tasks are running fine on all of my hosts. Perhaps you should lower your GPU clock (calibrate your overclocking settings to the more demanding CUDA 4.2 tasks, because these extra long tasks are even more demanding).
	ID: 26504 \| Rating: 0 \| rate: / Reply Quote
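To make the point about the false error message concrete, here is a minimal sketch of filtering stderr text so that the harmless "restart.coor" line is ignored and only the lines that accompanied real failures in this thread are reported. The marker lists are assumptions drawn only from the outputs quoted above, and `failure_lines` is a hypothetical helper, not a GPUGRID or BOINC tool.

```python
# Lines that show up even in successful tasks, per the post above.
BENIGN_MARKERS = ('cannot open file "restart.coor"',)
# Lines seen in this thread alongside actual failures (not an exhaustive list).
FAILURE_MARKERS = ("Energies have become nan", "SWAN : FATAL")


def failure_lines(stderr_text):
    """Return stderr lines that match a known failure marker."""
    hits = []
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if not line or any(m in line for m in BENIGN_MARKERS):
            continue
        if any(m in line for m in FAILURE_MARKERS):
            hits.append(line)
    return hits


bad_batch = '''MDIO: cannot open file "restart.coor"
ERROR: file deven.cpp line 1106: # Energies have become nan
called boinc_finish'''

print(failure_lines(bad_batch))
```

With the bad-batch output as input, only the "Energies have become nan" line is returned; a stderr containing nothing but the MDIO line yields an empty list.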

ritterm Send message Joined: 31 Jul 09 Posts: 88 Credit: 244,413,897 RAC: 0 Level Scientific publications	Message 26508 - Posted: 29 Jul 2012 \| 2:52:54 UTC - in response to Message 26504.
	Perhaps you should lower your GPU clock (calibrate your overclocking settings to the more demanding CUDA 4.2 tasks, because these extra long tasks are even more demanding). Even if I'm running at stock speeds? Other than two NOELIA's, I've had few, if any, comp errors with this card that weren't attributable to "bad" tasks. ____________
	ID: 26508 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26509 - Posted: 29 Jul 2012 \| 3:23:11 UTC Last modified: 29 Jul 2012 \| 3:25:46 UTC
	So far all the sh2fragment_fixed have been working on my two GTX670's. Make sure the work units have "fixed" in their name, otherwise they are probably the bad ones we already know about. They vary a bit in size, but have been taking around 8 hours (which is what the old cuda 3.1 long wu were taking before). ____________ BOINC blog
	ID: 26509 \| Rating: 0 \| rate: / Reply Quote
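The "fixed" check can be done on the task name alone. A small sketch, assuming the resubmitted batch is always marked by `_fixed-` where the original batch had `_run-`, as in the names posted in this thread; `is_fixed_batch` is a hypothetical helper.

```python
def is_fixed_batch(task_name: str) -> bool:
    """True if the task name carries the '_fixed-' batch marker."""
    return "_fixed-" in task_name


tasks = [
    "run1_replica3-NOELIA_sh2fragment_run-0-4-RND4005_1",      # original bad batch
    "run10_replica21-NOELIA_sh2fragment_fixed-0-4-RND7749_0",  # resubmitted batch
]

for name in tasks:
    label = "fixed" if is_fixed_batch(name) else "old batch, likely to fail"
    print(f"{name}: {label}")
```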

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,249,865,968 RAC: 4,089,892 Level Scientific publications	Message 26511 - Posted: 29 Jul 2012 \| 3:45:19 UTC - in response to Message 26508.
	Perhaps you should lower your GPU clock (calibrate your overclocking settings to the more demanding CUDA 4.2 tasks, because these extra long tasks are even more demanding). Even if I'm running at stock speeds? Other than two NOELIA's, I've had few, if any, comp errors with this card that weren't attributable to "bad" tasks. Yes. But if you are running your cards at stock speeds, I'd rather try increasing the GPU core voltage by 25mV (if your GPU temperatures allow it).
	ID: 26511 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,249,865,968 RAC: 4,089,892 Level Scientific publications	Message 26512 - Posted: 29 Jul 2012 \| 9:47:13 UTC - in response to Message 26504. Last modified: 29 Jul 2012 \| 9:48:52 UTC
	BTW these "fixed" NOELIA tasks are running fine on all of my hosts. One of these workunits was stuck on my GTX590 for 7 hours. The progress indicator had not increased since my previous post. A system restart helped. Bye-bye 24 hours bonus....
	ID: 26512 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26513 - Posted: 29 Jul 2012 \| 15:11:50 UTC Last modified: 29 Jul 2012 \| 15:13:31 UTC
	My first fixed NOELIA unit will soon be finished on cuda31. Unfortunately it is the first WU on my GTX 285 that needs more than 24 hours (25 hours ;)) to compute :( Bye-bye 24 hours bonus.... But it seems to work here.. (this one, we will see on the next ones ^^) ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26513 \| Rating: 0 \| rate: / Reply Quote

ritterm Send message Joined: 31 Jul 09 Posts: 88 Credit: 244,413,897 RAC: 0 Level Scientific publications	Message 26520 - Posted: 31 Jul 2012 \| 1:56:29 UTC
	And another one bites the dust after almost 7 hours: run9_replica14-NOELIA_sh2fragment_fixed-1-4-RND1629_2 Stderr output includes: "SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. Assertion failed: a, file swanlibnv2.cpp, line 59" ____________
	ID: 26520 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26521 - Posted: 31 Jul 2012 \| 5:36:46 UTC
	Hm my second one worked too. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26521 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26522 - Posted: 31 Jul 2012 \| 11:49:59 UTC - in response to Message 26513. Last modified: 31 Jul 2012 \| 11:52:16 UTC
	My first fixed NOELIA unit will soon be finished on cuda31. Unfortunately it is the first WU on my GTX 285 that needs more than 24 hours (25 hours ;)) to compute :( Bye-bye 24 hours bonus.... But it seems to work here.. (this one, we will see on the next ones ^^) Dskagcommunity, you might want to update to the 301.42 drivers. They should still work on your GTX285. That way you can get the speed advantages of the cuda 4.2 app. ____________ BOINC blog
	ID: 26522 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26523 - Posted: 31 Jul 2012 \| 12:55:20 UTC Last modified: 31 Jul 2012 \| 12:56:47 UTC
	I downgraded them because 42 runs much slower on the GTX 285. Up to 15000 secs! That included some WUs erroring out, because I don't want to change any of the stock-clocked card settings. So I prefer to compute one sort of WU in over 24 hours while the rest compute reliably and in good time. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26523 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26527 - Posted: 1 Aug 2012 \| 11:00:10 UTC - in response to Message 26523.
	I downgraded them because 42 runs much slower on the GTX 285. Up to 15000 secs! That included some WUs erroring out, because I don't want to change any of the stock-clocked card settings. So I prefer to compute one sort of WU in over 24 hours while the rest compute reliably and in good time. I wonder if you were getting the downclock bug, which only appears if the cards are running hot (but not overheating). It's supposedly fixed in the 304 (beta) drivers but I have heard of issues with running other project apps (Seti) with 304 drivers. What sort of temps is your GTX285 running at under load? ____________ BOINC blog
	ID: 26527 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26528 - Posted: 1 Aug 2012 \| 12:46:38 UTC Last modified: 1 Aug 2012 \| 12:54:23 UTC
	79-80 degrees in a room with AC, with a manually set fan speed, because normally it would run at 90-92, which I found a little too much for 24h operation. I don't install beta things. If there is a downclocking problem, then I hope 304 is soon available as a stable release ^^ Someone told me to raise the GPU voltage a little to kill the errors, but I don't want to touch these settings because I have only had bad experiences with such things. Either it runs as it came from the factory or not ;) That's the reason why I don't buy any OC cards. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26528 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26532 - Posted: 2 Aug 2012 \| 12:26:42 UTC - in response to Message 26528.
	79-80 degrees in a room with AC, with a manually set fan speed, because normally it would run at 90-92, which I found a little too much for 24h operation. I don't install beta things. If there is a downclocking problem, then I hope 304 is soon available as a stable release ^^ Someone told me to raise the GPU voltage a little to kill the errors, but I don't want to touch these settings because I have only had bad experiences with such things. Either it runs as it came from the factory or not ;) That's the reason why I don't buy any OC cards. Yep, that would be enough to trip the overheating bug. Well anyway, at least you know why 301.42 doesn't work well for your config. Hopefully they'll sort out the issues with 304 and get a good one out the door soon. ____________ BOINC blog
	ID: 26532 \| Rating: 0 \| rate: / Reply Quote

mhhall Send message Joined: 21 Mar 10 Posts: 23 Credit: 861,667,631 RAC: 0 Level Scientific publications	Message 26534 - Posted: 3 Aug 2012 \| 20:54:25 UTC
	My system is currently processing WU 3601638, which is now at 41 hours elapsed and 61% completed. This appears likely to take 2 or 2.5 times longer than most units that my system has worked on recently. Does this appear to be a WU that I should allow to run to completion, or is it indicating a problem? It seems that most WUs of this type have had problems with immediate failures. I don't know if this is also a problem with a cuda31 process on my machine (which seems to have been handling cuda42 work properly).
	ID: 26534 \| Rating: 0 \| rate: / Reply Quote
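A rough way to judge whether a long-running task like this is worth waiting for is to project the total run time from the elapsed time and the progress fraction. This is only a sketch, assuming progress is linear in time (a restart or a downclocked GPU would break that assumption); the function name is hypothetical.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation of total run time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Figures from the post above: 41 hours elapsed at 61% complete.
total = projected_total_hours(41, 0.61)
remaining = total - 41
print(f"projected total: {total:.1f} h, remaining: {remaining:.1f} h")
```

For 41 hours at 61%, the projection is a little over 67 hours in total, which is about 2.8 days.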

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 26548 - Posted: 5 Aug 2012 \| 10:59:21 UTC - in response to Message 26534.
	Presumably you mean this task:
	run10_replica19-NOELIA_sh2fragment_fixed-0-4-RND1077_1 3601638 1 Aug 2012 \| 14:31:44 UTC 4 Aug 2012 \| 22:41:22 UTC Completed and validated 240,549.78 7,615.19 67,500.00 Long runs (8-12 hours on fastest card) v6.16 (cuda31)
	There is probably more than one thing affecting performance in this case. As you say, it ran under the 3.1 app, which is around 50% slower for most tasks. Still, 2.8 days on a GTX 550 Ti is too long. Although the 3.1 tasks report a Stderr output file, it doesn't show anything interesting in this case:
	Stderr output
	<core_client_version>6.12.34</core_client_version>
	<![CDATA[
	<stderr_txt>
	# Using device 0
	# There is 1 device supporting CUDA
	# Device 0: "GeForce GTX 550 Ti"
	# Clock rate: 1.80 GHz
	# Total amount of global memory: 1072889856 bytes
	# Number of multiprocessors: 4
	# Number of cores: 32
	MDIO: cannot open file "restart.coor"
	# Using device 0
	# There is 1 device supporting CUDA
	# Device 0: "GeForce GTX 550 Ti"
	# Clock rate: 1.80 GHz
	# Total amount of global memory: 1072889856 bytes
	# Number of multiprocessors: 4
	# Number of cores: 32
	# Time per step (avg over 1930000 steps): 48.186 ms
	# Approximate elapsed time for entire WU: 240929.858 s
	18:17:15 (28048): called boinc_finish
	</stderr_txt>
	]]>
	The task was stopped just once during the run (a system restart, for example). I don't know what the expected 'Time per step' values are for these tasks on a similar card. In this case my guess is that your GPU downclocked for a period during the run, running at 100MHz for a while. Maybe after the restart it ran at normal clock rates. Increasing the GPU fan speed can sometimes prevent downclocking. One other possibility is that your CPU (a dual core Opteron) was struggling; I think the 3.1 app is more demanding of the CPU, and yours isn't really high end. I think your system also uses DDR2, which can cause some performance loss (seen in GPU utilization).
	If you are also running CPU tasks, 1 would be fine but 2 would result in poor performance for the GPU (unless you redefined nice values). Just running these tasks on the new app seems to avoid this problem: your card ran a similar NOELIA_sh2fragment_fixed task on the cuda42 app in almost 1/3rd the time. ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
	ID: 26548 \| Rating: 0 \| rate: / Reply Quote
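The figures in that stderr output can be cross-checked with a little arithmetic. The sketch below assumes that the "Approximate elapsed time for entire WU" is simply the average time per step multiplied by the total number of steps; that relationship is an inference from the numbers, not something the log states.

```python
SECONDS_PER_DAY = 86_400

time_per_step_s = 48.186e-3   # "Time per step (avg over 1930000 steps): 48.186 ms"
estimated_wu_s = 240929.858   # "Approximate elapsed time for entire WU"
reported_run_s = 240549.78    # run time from the task row (240,549.78 s)

# If estimate = time per step * total steps, the work unit has about this many steps.
implied_total_steps = estimated_wu_s / time_per_step_s
run_days = reported_run_s / SECONDS_PER_DAY

print(f"implied total steps: {implied_total_steps:,.0f}")
print(f"reported run time: {run_days:.2f} days")
```

Under that assumption the work unit comes out at roughly 5 million steps, and the reported run time matches the "2.8 days" quoted in the post. The 1,930,000 steps in the average are fewer than that total, which would fit an average taken only over the steps since the last restart.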

Post to thread

Message boards : Number crunching : New NOELIA Longruns

	About	Science	Volunteers	Performance	Forum	Join us	Donate

Author	Message
HA-SOFT, s.r.o. Send message Joined: 3 Oct 11 Posts: 100 Credit: 5,879,292,399 RAC: 0 Level Scientific publications	Message 26446 - Posted: 25 Jul 2012 \| 19:54:23 UTC Last modified: 25 Jul 2012 \| 20:26:46 UTC
	All new NOELIA longrun tasks errored out on my pc's immediately after start. Other tasks run ok. EDIT: CUDA31 tasks run ok, CUDA42 not.
	ID: 26446 \| Rating: 0 \| rate: / Reply Quote

JugNut Send message Joined: 27 Nov 11 Posts: 11 Credit: 1,021,749,297 RAC: 0 Level Scientific publications	Message 26449 - Posted: 25 Jul 2012 \| 20:47:01 UTC - in response to Message 26446. Last modified: 25 Jul 2012 \| 21:19:53 UTC
	All new NOELIA longrun tasks errored out on my pc's immediately after start. Other tasks run ok. EDIT: CUDA31 tasks run ok, CUDA42 not. Me too, exactly the same. All NOELIA's fail so far. Both CUDA31 & CUDA42 NOELIA'a All other WU's CUDA42 work fine. GTX580, win7 x64
	ID: 26449 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26451 - Posted: 25 Jul 2012 \| 22:11:42 UTC - in response to Message 26449. Last modified: 25 Jul 2012 \| 22:12:45 UTC
	What is wrong with these units? They have all crashed. http://www.gpugrid.net/results.php?hostid=127986&offset=0&show_names=1&state=0&appid=
	ID: 26451 \| Rating: 0 \| rate: / Reply Quote

flashawk Send message Joined: 18 Jun 12 Posts: 297 Credit: 3,572,627,986 RAC: 0 Level Scientific publications	Message 26452 - Posted: 25 Jul 2012 \| 23:25:48 UTC - in response to Message 26451.
	Same here, I've had 3 error out, 2 on a GTX 670 and 1 on a GTX 560 with another queued right now. ____________
	ID: 26452 \| Rating: 0 \| rate: / Reply Quote

Rayzor Send message Joined: 19 Jan 11 Posts: 13 Credit: 294,225,579 RAC: 0 Level Scientific publications	Message 26454 - Posted: 26 Jul 2012 \| 0:25:29 UTC
	So for, 3 out of 3 failed on a GTX 275, Windows XP Pro 32, cuda31 tasks http://www.gpugrid.net/results.php?hostid=124381 run1_replica3-NOELIA_sh2fragment_run-0-4-RND4005_1 run4_replica1-NOELIA_sh2fragment_run-0-4-RND5679_2 run1_replica48-NOELIA_sh2fragment_run-0-4-RND0084_0
	ID: 26454 \| Rating: 0 \| rate: / Reply Quote

Carlos Augusto Engel Send message Joined: 5 Jun 09 Posts: 38 Credit: 2,880,758,878 RAC: 0 Level Scientific publications	Message 26456 - Posted: 26 Jul 2012 \| 1:40:42 UTC - in response to Message 26454.
	Error in all NOELIA's tasks on GTX 580 and GTX 570. run3_replica47-NOELIA_sh2fragment_run-0-4-RND8455_4 3597875 117522 26 Jul 2012 \| 1:11:26 UTC 26 Jul 2012 \| 1:17:34 UTC Error while computing 10.10 2.07 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run2_replica33-NOELIA_sh2fragment_run-0-4-RND3214_3 3597817 117522 26 Jul 2012 \| 0:59:06 UTC 26 Jul 2012 \| 1:05:16 UTC Error while computing 10.11 1.92 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run1_replica34-NOELIA_sh2fragment_run-0-4-RND4425_3 3597770 117522 26 Jul 2012 \| 0:16:07 UTC 26 Jul 2012 \| 0:22:12 UTC Error while computing 10.07 1.83 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run7_replica15-NOELIA_sh2fragment_run-0-4-RND0291_4 3598026 117522 26 Jul 2012 \| 0:46:39 UTC 26 Jul 2012 \| 0:52:50 UTC Error while computing 10.13 1.67 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run6_replica15-NOELIA_sh2fragment_run-0-4-RND1911_4 3597982 117522 26 Jul 2012 \| 0:09:58 UTC 26 Jul 2012 \| 0:16:07 UTC Error while computing 10.08 1.84 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run6_replica14-NOELIA_sh2fragment_run-0-4-RND6550_3 3597981 117522 26 Jul 2012 \| 0:34:26 UTC 26 Jul 2012 \| 0:40:31 UTC Error while computing 11.07 2.07 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run7_replica46-NOELIA_sh2fragment_run-0-4-RND8233_1 3598055 117522 26 Jul 2012 \| 0:04:58 UTC 26 Jul 2012 \| 0:09:58 UTC Error while computing 10.06 2.25 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run1_replica50-NOELIA_sh2fragment_run-0-4-RND3470_4 3597789 117522 26 Jul 2012 \| 1:05:16 UTC 26 Jul 2012 \| 1:11:26 UTC Error while computing 10.07 1.92 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run6_replica6-NOELIA_sh2fragment_run-0-4-RND3986_2 3598016 117522 26 Jul 2012 \| 0:22:12 UTC 26 Jul 2012 \| 0:28:21 UTC Error while computing 10.32 1.78 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) 
run2_replica17-NOELIA_sh2fragment_run-0-4-RND4720_4 3597800 117522 26 Jul 2012 \| 0:52:50 UTC 26 Jul 2012 \| 0:59:06 UTC Error while computing 10.06 1.87 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run7_replica19-NOELIA_sh2fragment_run-0-4-RND9229_1 3598030 117522 26 Jul 2012 \| 0:28:21 UTC 26 Jul 2012 \| 0:34:26 UTC Error while computing 10.09 1.44 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run9_replica39-NOELIA_sh2fragment_run-0-4-RND8136_2 3598123 117522 26 Jul 2012 \| 0:40:31 UTC 26 Jul 2012 \| 0:46:39 UTC Error while computing 10.07 1.89 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run5_replica8-NOELIA_sh2fragment_run-0-4-RND4219_1 3597974 117522 25 Jul 2012 \| 22:11:24 UTC 26 Jul 2012 \| 0:04:58 UTC Error while computing 10.30 2.00 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) run7_replica38-NOELIA_sh2fragment_run-0-4-RND5635_1 3597691 101457 25 Jul 2012 \| 21:55:52 UTC 26 Jul 2012 \| 0:27:04 UTC Error while computing 15.40 1.72 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42) ____________
	ID: 26456 \| Rating: 0 \| rate: / Reply Quote

[PUGLIA] kidkidkid3 Send message Joined: 23 Feb 11 Posts: 98 Credit: 1,285,571,653 RAC: 2,051,418 Level Scientific publications	Message 26461 - Posted: 26 Jul 2012 \| 5:31:09 UTC - in response to Message 26456.
	Hi Noelia, same error (twice) also for me in http://www.gpugrid.net/result.php?resultid=5664840 http://www.gpugrid.net/result.php?resultid=5664472 <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> - exit code 98 (0x62) </message> <stderr_txt> MDIO: cannot open file "restart.coor" ERROR: file deven.cpp line 1106: # Energies have become nan called boinc_finish </stderr_txt> ]]> I'll stop or cancel your WU. k. ____________ Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing. (Martin Luther King)
	ID: 26461 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26465 - Posted: 26 Jul 2012 \| 7:59:36 UTC Last modified: 26 Jul 2012 \| 8:10:55 UTC
	Just had a look at my tasks. Looks like all the sh2fragment long work units are failing for everybody, not just me. Obviously a bad batch of work units, seeing as everybody is failing them. I have sent her a PM, so hopefully they will sort things out soon. ____________ BOINC blog
	ID: 26465 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26468 - Posted: 26 Jul 2012 \| 13:57:11 UTC - in response to Message 26465. Last modified: 26 Jul 2012 \| 13:59:02 UTC
	These NOELIA_sh2fragment units are all crashing with the same error message: <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> - exit code 98 (0x62) </message> <stderr_txt> MDIO: cannot open file "restart.coor" ERROR: file deven.cpp line 1106: # Energies have become nan called boinc_finish </stderr_txt> ]]> Isn't it time to cancel this batch of units already?
	ID: 26468 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26492 - Posted: 28 Jul 2012 \| 3:49:48 UTC - in response to Message 26468.
	Just completed the first NOELIA_sh2fragment unit successfully. See link below: http://www.gpugrid.net/workunit.php?wuid=3601869 Whatever you did to fix the bug, worked. I have 2 more such units still crunching. Hopefully, they will be successful too.
	ID: 26492 \| Rating: 0 \| rate: / Reply Quote

SMTB1963 Send message Joined: 27 Jun 10 Posts: 38 Credit: 524,420,921 RAC: 0 Level Scientific publications	Message 26494 - Posted: 28 Jul 2012 \| 4:27:58 UTC - in response to Message 26492.
	Whatever you did to fix the bug, worked. Me too!
	ID: 26494 \| Rating: 0 \| rate: / Reply Quote

Stoneageman Send message Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 186 Level Scientific publications	Message 26495 - Posted: 28 Jul 2012 \| 10:12:48 UTC
	Just had a failure after 7hrs of a NOELIA run9 replica21 task on a reliable (up to now) card. The card is now working ok on a PAOLA task. <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed. SIGABRT: abort called Stack trace (15 frames): ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d] /lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fa96d2cf4c0] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa96d2cf445] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7fa96d2d2bab] /lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7fa96d2c810e] /lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7fa96d2c81b2] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fa96d2ba76d] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9] Exiting... </stderr_txt>
	ID: 26495 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,133,622,824 RAC: 15,518,739 Level Scientific publications	Message 26496 - Posted: 28 Jul 2012 \| 14:39:07 UTC - in response to Message 26492.
	Just completed the first NOELIA_sh2fragment unit successfully. See link below: http://www.gpugrid.net/workunit.php?wuid=3601869 Whatever you did to fix the bug, worked. I have 2 more such units still crunching. Hopefully, they will be successful too. Two more of these units completed successfully. See links below: http://www.gpugrid.net/workunit.php?wuid=3601862 http://www.gpugrid.net/workunit.php?wuid=3601963 Though one took about 16 hours to complete, while the other two took about 9 to 10 hours.
	ID: 26496 \| Rating: 0 \| rate: / Reply Quote

HA-SOFT, s.r.o. Send message Joined: 3 Oct 11 Posts: 100 Credit: 5,879,292,399 RAC: 0 Level Scientific publications	Message 26498 - Posted: 28 Jul 2012 \| 17:55:46 UTC - in response to Message 26495.
	Just had a failure after 7hrs of a NOELIA run9 replica21 task on a reliable (up to now) card. The card is now working ok on a PAOLA task. <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed. SIGABRT: abort called Stack trace (15 frames): ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d] /lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fa96d2cf4c0] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa96d2cf445] /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7fa96d2d2bab] /lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7fa96d2c810e] /lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7fa96d2c81b2] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fa96d2ba76d] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9] Exiting... </stderr_txt> Exactly the same on my GTX580 under Linux.
	ID: 26498 \| Rating: 0 \| rate: / Reply Quote

Stoneageman Send message Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 186 Level Scientific publications	Message 26500 - Posted: 28 Jul 2012 \| 20:42:01 UTC
	And I've had another one fail after 7 hrs, as it did on the host before me. Considering aborting all NOELIA tasks now :-( http://www.gpugrid.net/workunit.php?wuid=3601979
	ID: 26500 \| Rating: 0 \| rate: / Reply Quote

ritterm Send message Joined: 31 Jul 09 Posts: 88 Credit: 244,413,897 RAC: 0 Level Scientific publications	Message 26502 - Posted: 28 Jul 2012 \| 22:18:16 UTC - in response to Message 26500.
	Considering aborting all NOELIA tasks now :-( Me, too... I thought maybe I was okay after this one finished successfully: run10_replica21-NOELIA_sh2fragment_fixed-0-4-RND7749_0 But then the one I had queued up next crashed after 14-plus hours: run9_replica1-NOELIA_sh2fragment_fixed-0-4-RND6355_1 Both had the "MDIO: cannot open file 'restart.coor'" message in the stderr output. ____________
	ID: 26502 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,249,865,968 RAC: 4,089,892 Level Scientific publications	Message 26504 - Posted: 28 Jul 2012 \| 22:55:19 UTC - in response to Message 26502. Last modified: 28 Jul 2012 \| 22:57:06 UTC
	Both had the "MDIO: cannot open file 'restart.coor'" message in the stderr output. This is a false error message. It appears in every task, even in the successful ones. BTW these "fixed" NOELIA tasks are running fine on all of my hosts. Perhaps you should lower your GPU clock (calibrate your overclocking settings to the more demanding CUDA 4.2 tasks, because these extra long tasks are even more demanding).
	ID: 26504 \| Rating: 0 \| rate: / Reply Quote
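
	For anyone sorting through many failed results, the point above (the restart.coor line shows up in every task, good or bad) can be turned into a small filter. This is only a minimal sketch, not anything the project provides: the function name and labels are made up for illustration, the signatures are just the strings quoted in this thread, and the two code names are the CUDA driver API names for 702 and 999.

```python
import re

# Message that, per the post above, appears in every task's stderr
# (successful ones included), so it is ignored when classifying.
BENIGN = ('MDIO: cannot open file "restart.coor"',)

# CUDA driver API error codes seen in this thread (names as in cuda.h).
CUDA_DRIVER_ERRORS = {
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    999: "CUDA_ERROR_UNKNOWN",
}


def triage_stderr(stderr_text):
    """Return a short label for a task's stderr text (hypothetical helper)."""
    text = stderr_text
    for msg in BENIGN:
        text = text.replace(msg, "")
    if "Energies have become nan" in text:
        return "NaN energies"
    match = re.search(r"Cuda driver error (\d+)", text)
    if match:
        code = int(match.group(1))
        name = CUDA_DRIVER_ERRORS.get(code, "unlisted code")
        return f"CUDA driver error {code} ({name})"
    return "no known failure signature"
```

	Fed the stderr snippets posted earlier, this separates the original bad batch (NaN energies, exit code 98) from the later driver-level failures after several hours, and does not flag a log that only contains the restart.coor line.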

ritterm Send message Joined: 31 Jul 09 Posts: 88 Credit: 244,413,897 RAC: 0 Level Scientific publications	Message 26508 - Posted: 29 Jul 2012 \| 2:52:54 UTC - in response to Message 26504.
	Perhaps you should lower your GPU clock (calibrate your overclocking settings to the more demanding CUDA 4.2 tasks, because these extra long tasks are even more demanding). Even if I'm running at stock speeds? Other than two NOELIA's, I've had few, if any, comp errors with this card that weren't attributable to "bad" tasks. ____________
	ID: 26508 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26509 - Posted: 29 Jul 2012 \| 3:23:11 UTC Last modified: 29 Jul 2012 \| 3:25:46 UTC
	So far all the sh2fragment_fixed tasks have been working on my two GTX 670s. Make sure the work units have "fixed" in their name; otherwise they are probably the bad ones we already know about. They vary a bit in size, but have been taking around 8 hours (which is what the old CUDA 3.1 long WUs were taking before). ____________ BOINC blog
	ID: 26509 \| Rating: 0 \| rate: / Reply Quote
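
	The "fixed in the name" check above is easy to script if you have a list of task names. A minimal sketch, assuming the name layout seen in this thread (runN_replicaM-OWNER_batch-...); the pattern is inferred from the quoted names only, the helper names are invented, and the trailing numeric fields are kept as raw text because their meaning is not stated here.

```python
import re

# Pattern inferred from task names quoted in this thread (an assumption,
# not an official format). Everything after the batch name stays unparsed.
TASK_NAME = re.compile(
    r"^run(?P<run>\d+)_replica(?P<replica>\d+)-"
    r"(?P<owner>[A-Z]+)_(?P<batch>[A-Za-z0-9_]+?)-(?P<rest>.+)$"
)


def parse_task_name(name):
    """Split a task name into its visible parts, or return None if it does not match."""
    match = TASK_NAME.match(name)
    if match is None:
        return None
    return match.groupdict()


def is_fixed_batch(name):
    """True if the batch part of the task name ends in '_fixed'."""
    info = parse_task_name(name)
    return info is not None and info["batch"].endswith("_fixed")


names = [
    "run10_replica21-NOELIA_sh2fragment_fixed-0-4-RND7749_0",
    "run3_replica47-NOELIA_sh2fragment_run-0-4-RND8455_4",
]
for name in names:
    print(name, "->", "fixed batch" if is_fixed_batch(name) else "old batch")
```

	Anything reported as an old batch would be a candidate for aborting; names that do not match the pattern at all are simply treated as not fixed.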

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,249,865,968 RAC: 4,089,892 Level Scientific publications	Message 26511 - Posted: 29 Jul 2012 \| 3:45:19 UTC - in response to Message 26508.
	Perhaps you should lower your GPU clock (calibrate your overclocking settings to the more demanding CUDA 4.2 tasks, because these extra long tasks are even more demanding). Even if I'm running at stock speeds? Other than two NOELIA's, I've had few, if any, comp errors with this card that weren't attributable to "bad" tasks. Yes. But if you are running your cards at stock speeds, I'd rather try increasing the GPU core voltage by 25mV (if your GPU temperatures allow it).
	ID: 26511 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,249,865,968 RAC: 4,089,892 Level Scientific publications	Message 26512 - Posted: 29 Jul 2012 \| 9:47:13 UTC - in response to Message 26504. Last modified: 29 Jul 2012 \| 9:48:52 UTC
	BTW these "fixed" NOELIA tasks are running fine on all of my hosts. One of these workunits was stuck on my GTX590 for 7 hours. The progress indicator had not increased since my previous post. A system restart helped. Bye-bye 24 hours bonus....
	ID: 26512 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26513 - Posted: 29 Jul 2012 \| 15:11:50 UTC Last modified: 29 Jul 2012 \| 15:13:31 UTC
	My first fixed NOELIA unit will soon be finished on cuda31. Unfortunately, it is the first WU on my GTX 285 that needs more than 24 hours (25 hours ;)) to compute :( Bye-bye 24 hours bonus.... But it seems to work here (this one at least; we will see with the next ones ^^) ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26513 \| Rating: 0 \| rate: / Reply Quote

ritterm Send message Joined: 31 Jul 09 Posts: 88 Credit: 244,413,897 RAC: 0 Level Scientific publications	Message 26520 - Posted: 31 Jul 2012 \| 1:56:29 UTC
	And another one bites the dust after almost 7 hours: run9_replica14-NOELIA_sh2fragment_fixed-1-4-RND1629_2 Stderr output includes: "SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. Assertion failed: a, file swanlibnv2.cpp, line 59" ____________
	ID: 26520 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26521 - Posted: 31 Jul 2012 \| 5:36:46 UTC
	Hm my second one worked too. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26521 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26522 - Posted: 31 Jul 2012 \| 11:49:59 UTC - in response to Message 26513. Last modified: 31 Jul 2012 \| 11:52:16 UTC
	My first fixed NOELIA unit will soon be finished on cuda31. Unfortunately, it is the first WU on my GTX 285 that needs more than 24 hours (25 hours ;)) to compute :( Bye-bye 24 hours bonus.... But it seems to work here (this one at least; we will see with the next ones ^^) Dskagcommunity, you might want to update to the 301.42 drivers. They should still work on your GTX285. That way you can get the speed advantages of the cuda 4.2 app. ____________ BOINC blog
	ID: 26522 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26523 - Posted: 31 Jul 2012 \| 12:55:20 UTC Last modified: 31 Jul 2012 \| 12:56:47 UTC
	I downgraded them because cuda42 runs much slower on the GTX 285, by up to 15,000 seconds! That includes some WUs that error out, because I don't want to change any of the card's stock clock settings. So I prefer to compute one type of WU in over 24 hours while the rest compute reliably and in good time. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26523 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26527 - Posted: 1 Aug 2012 \| 11:00:10 UTC - in response to Message 26523.
	I downgraded them because cuda42 runs much slower on the GTX 285, by up to 15,000 seconds! That includes some WUs that error out, because I don't want to change any of the card's stock clock settings. So I prefer to compute one type of WU in over 24 hours while the rest compute reliably and in good time. I wonder if you were getting the downclock bug, which only appears if the cards are running hot (but not overheating). It's supposedly fixed in the 304 (beta) drivers, but I have heard of issues with running other project apps (Seti) with 304 drivers. What sort of temps is your GTX285 running at under load? ____________ BOINC blog
	ID: 26527 \| Rating: 0 \| rate: / Reply Quote

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 26528 - Posted: 1 Aug 2012 \| 12:46:38 UTC Last modified: 1 Aug 2012 \| 12:54:23 UTC
	79-80 degrees, in a room with AC and a manually set fan speed, because normally it would run at 90-92, which I found a little too much for 24h operation. I don't install beta things. If there is a downclocking problem, then I hope 304 is available as a stable release soon ^^ Someone told me to raise the GPU voltage slightly to get rid of the errors, but I don't want to touch these settings because I have had only bad experiences with such things. Either it runs as it came from the factory or it doesn't ;) That's the reason why I don't buy any OC cards. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 26528 \| Rating: 0 \| rate: / Reply Quote

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 26532 - Posted: 2 Aug 2012 \| 12:26:42 UTC - in response to Message 26528.
	79-80 degrees, in a room with AC and a manually set fan speed, because normally it would run at 90-92, which I found a little too much for 24h operation. I don't install beta things. If there is a downclocking problem, then I hope 304 is available as a stable release soon ^^ Someone told me to raise the GPU voltage slightly to get rid of the errors, but I don't want to touch these settings because I have had only bad experiences with such things. Either it runs as it came from the factory or it doesn't ;) That's the reason why I don't buy any OC cards. Yep, that would be enough to trip the overheating bug. Well, anyway, at least you know why 301.42 doesn't work well for your config. Hopefully they'll sort out the issues with 304 and get a good one out the door soon. ____________ BOINC blog
	ID: 26532 \| Rating: 0 \| rate: / Reply Quote

mhhall Send message Joined: 21 Mar 10 Posts: 23 Credit: 861,667,631 RAC: 0 Level Scientific publications	Message 26534 - Posted: 3 Aug 2012 \| 20:54:25 UTC
	My system is currently processing WU 3601638, which appears to now be at 41 hours elapsed and 61% completed. This appears likely to take 2 or 2.5 times longer than most units that my system has worked on recently. Does this appear to be a WU that I should allow to run to completion, or is it indicating a problem? It seems that most WUs of this type have had problems with immediate failures. I don't know if this is also a problem with a cuda31 process on my machine (which seems to have been handling cuda42 work properly).
	ID: 26534 \| Rating: 0 \| rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 26548 - Posted: 5 Aug 2012 \| 10:59:21 UTC - in response to Message 26534.
	Presumably you mean this task: run10_replica19-NOELIA_sh2fragment_fixed-0-4-RND1077_1 3601638 1 Aug 2012 \| 14:31:44 UTC 4 Aug 2012 \| 22:41:22 UTC Completed and validated 240,549.78 7,615.19 67,500.00 Long runs (8-12 hours on fastest card) v6.16 (cuda31) There is probably more than one thing affecting performance in this case. As you say, it ran under the 3.1 app, which is around 50% slower for most tasks. Still, 2.8 days on a GTX 550 Ti is too long. Although the 3.1 tasks report a Stderr output file, it doesn't show anything interesting in this case: Stderr output <core_client_version>6.12.34</core_client_version> <![CDATA[ <stderr_txt> # Using device 0 # There is 1 device supporting CUDA # Device 0: "GeForce GTX 550 Ti" # Clock rate: 1.80 GHz # Total amount of global memory: 1072889856 bytes # Number of multiprocessors: 4 # Number of cores: 32 MDIO: cannot open file "restart.coor" # Using device 0 # There is 1 device supporting CUDA # Device 0: "GeForce GTX 550 Ti" # Clock rate: 1.80 GHz # Total amount of global memory: 1072889856 bytes # Number of multiprocessors: 4 # Number of cores: 32 # Time per step (avg over 1930000 steps): 48.186 ms # Approximate elapsed time for entire WU: 240929.858 s 18:17:15 (28048): called boinc_finish </stderr_txt> ]]> The task was stopped just once during the run (by a system restart, for example). I don't know what the expected 'Time per step' values are for these tasks on a similar card. In this case my guess is that your GPU downclocked for a period during the run; the GPU ran at 100MHz for a while. Maybe after the restart it ran at normal clock rates. Increasing the GPU fan speed can sometimes prevent downclocking. One other possibility is that your CPU (a dual core Opteron) was struggling; I think the 3.1 app is more demanding of the CPU, and yours isn't really high end. I think your system also uses DDR2, which can cause some performance loss (seen in GPU utilization).
If you are also running CPU tasks, 1 would be fine but 2 would result in poor performance for the GPU (unless you redefined nice values). Just running these tasks on the new app seems to avoid this problem - your card ran a similar NOELIA_sh2fragment_fixed task on the cuda42 app in almost 1/3rd the time. ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
	ID: 26548 \| Rating: 0 \| rate: / Reply Quote
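
	The figures in that stderr can be cross-checked with a little arithmetic. A minimal sketch, assuming the "Approximate elapsed time for entire WU" is simply the average time per step multiplied by the total step count (the log does not say this); the variable names are made up for illustration.

```python
# Values copied from the task line and stderr quoted above.
time_per_step_ms = 48.186        # "Time per step (avg over 1930000 steps)"
steps_averaged = 1_930_000       # steps covered by that average
approx_total_s = 240_929.858     # "Approximate elapsed time for entire WU"
reported_run_time_s = 240_549.78  # run time column of the task line

# Assumption: approx_total_s = time per step * total steps in the WU.
implied_total_steps = approx_total_s / (time_per_step_ms / 1000.0)

# Reported run time expressed in days.
run_time_days = reported_run_time_s / 86_400

# Share of the WU covered by the averaged steps.
fraction_averaged = steps_averaged / implied_total_steps

print(f"implied total steps: {implied_total_steps:,.0f}")
print(f"run time: {run_time_days:.2f} days")
print(f"steps in the average: {fraction_averaged:.1%} of the WU")
```

	Under that assumption the WU works out to about 5 million steps and the run time to about 2.8 days. If the average only covers the steps since the last restart (also an assumption), those 1,930,000 steps are roughly the final 39% of the run, which would fit a single restart at around the 61% mark reported earlier.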