NOELIA WUs getting "stuck"

Message boards : Graphics cards (GPUs) : NOELIA WUs getting "stuck"

ETQuestor Send message Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0 Level Scientific publications	Message 26611 - Posted: 15 Aug 2012 \| 20:36:39 UTC
	About 10% of my NOELIA WUs are getting "stuck": the "fraction done" output stops moving. This seems to happen most often with "run4" WUs, but I have also seen it with the other run numbers. If I restart BOINC, the WU starts over from 0.00000. Sometimes it freezes again at another spot, sometimes it finishes successfully after the restart, and sometimes it errors out. Link to the machine: http://www.gpugrid.net/show_host_detail.php?hostid=111125
	Linux x86_64 (Fedora 17 3.4.6-2.fc17.x86_64)
	NVIDIA UNIX x86_64 Kernel Module 304.37
	GeForce GTX 560 Ti
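The "stuck" symptom described above (the fraction done stops moving) can be checked mechanically. The following is only a minimal sketch: the function name `is_stalled`, the 30-minute window and the minimum-progress threshold are assumptions for illustration, and the sketch does not say how the samples are collected from the BOINC client.

```python
from typing import Sequence, Tuple


def is_stalled(
    samples: Sequence[Tuple[float, float]],
    window_s: float = 1800.0,   # assumed: 30 minutes without progress counts as stuck
    min_delta: float = 1e-5,    # assumed: smallest increase that counts as progress
) -> bool:
    """Return True if fraction_done has not advanced over the last window_s seconds.

    samples holds (timestamp_seconds, fraction_done) pairs, oldest first.
    A drop back to 0.0 (a restart from scratch) also counts as no forward progress.
    """
    if len(samples) < 2:
        return False
    latest_t, latest_f = samples[-1]
    # Keep only samples that are at least window_s older than the latest one.
    old_enough = [(t, f) for t, f in samples if latest_t - t >= window_s]
    if not old_enough:
        return False  # not enough history yet to decide
    _, reference_f = old_enough[-1]  # most recent sample outside the window
    return (latest_f - reference_f) < min_delta
```

A task that is merely slow, such as one that still creeps forward between samples, would not be flagged by this check; it only catches progress that has stopped entirely.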

Dylan Send message Joined: 16 Jul 12 Posts: 98 Credit: 386,043,752 RAC: 0 Level Scientific publications	Message 26612 - Posted: 15 Aug 2012 \| 21:15:16 UTC
	Hmm, I haven't encountered this issue, but I run Windows, so that might be why. For me, tasks sometimes restart when BOINC does, but not every time. Sorry I can't help you.

K1atOdessa Send message Joined: 25 Feb 08 Posts: 249 Credit: 370,320,941 RAC: 0 Level Scientific publications	Message 26615 - Posted: 16 Aug 2012 \| 3:55:32 UTC Last modified: 16 Aug 2012 \| 3:56:07 UTC
	This has happened to me twice so far (I am running Windows). The first time it was 2 days before I noticed, so I just aborted the WU (the utilization was practically 0% and progress was not moving). The second time I saw it, I closed and reopened BOINC, after which the WU errored out. Kind of a pain, but I'm just keeping an eye out at this point.

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851 Level Scientific publications	Message 26617 - Posted: 16 Aug 2012 \| 7:30:45 UTC Last modified: 16 Aug 2012 \| 7:34:15 UTC
	One of my hosts is crunching such a workunit right now. It has been running for 21h26m and is at 78.320%, progressing very slowly. Workunits of this type usually take less than 7 hours to complete. I tried pausing and restarting the task, then moved it to another GPU in the same host, but its speed did not change. I've double-checked that none of this host's GPUs is downclocked.

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 26622 - Posted: 16 Aug 2012 \| 11:31:21 UTC - in response to Message 26617. Last modified: 16 Aug 2012 \| 11:55:48 UTC
	The same workunit was aborted on another system with these verbosely challenged details:
	Stderr output
	<core_client_version>6.12.34</core_client_version>
	Most likely there is some problem with the tasks, but perhaps this needs more CPU or bandwidth. Does freeing up another CPU core/thread help (if it's bandwidth, you would need to suspend CPU tasks to notice)? Are GPU temps/fan speeds normal?
	____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851 Level Scientific publications	Message 26628 - Posted: 16 Aug 2012 \| 23:09:32 UTC - in response to Message 26622.
	It's finished after 27 hours...
	Stderr output:
	<core_client_version>6.10.60</core_client_version>
	<![CDATA[
	<stderr_txt>
	MDIO: cannot open file "restart.coor"
	No heartbeat from core client for 30 sec - exiting
	# Time per step (avg over 545000 steps): 19.800 ms
	# Approximate elapsed time for entire WU: 99000.143 s
	called boinc_finish
	</stderr_txt>
	]]>
	Since then my host finished a couple of workunits without any problems, and without any restart.
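The two timing lines in that stderr output can be cross-checked with simple arithmetic. This is a sketch only: it assumes the "approximate elapsed time" is the average time per step multiplied by the total number of steps, which the output suggests but does not state, and the variable names are made up for illustration.

```python
# Values copied from the stderr output above.
time_per_step_ms = 19.800
approx_elapsed_s = 99000.143

# Assumption: elapsed time = time per step * total steps, so the step
# count can be recovered by division (roughly 5.0 million steps).
implied_steps = approx_elapsed_s / (time_per_step_ms / 1000.0)

# 99000 s is about 27.5 hours, consistent with "finished after 27 hours".
elapsed_hours = approx_elapsed_s / 3600.0

# Earlier in the thread this type of workunit was said to usually take
# less than 7 hours, so this run was at least this many times slower.
usual_hours = 7.0
slowdown_factor = elapsed_hours / usual_hours

print(f"implied steps:   {implied_steps:,.0f}")
print(f"elapsed hours:   {elapsed_hours:.2f}")
print(f"slowdown factor: at least {slowdown_factor:.1f}x")
```

Note that the per-step time is averaged over 545000 steps only, presumably the steps since the last restart, so the whole-workunit figure is an extrapolation rather than a measurement.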

ETQuestor Send message Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0 Level Scientific publications	Message 26629 - Posted: 16 Aug 2012 \| 23:26:56 UTC - in response to Message 26622.
	I was seeing this issue with the much shorter "trypsin_lig" runs. When these ran successfully, they ran very quickly (like under an hour). The CPU has been under 20% utilization and the GPU/fan are normal. I have been running other ACEMD and long runs on this same machine without issue for almost a year.

ETQuestor Send message Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0 Level Scientific publications	Message 26630 - Posted: 16 Aug 2012 \| 23:29:11 UTC - in response to Message 26611.
	Here are some examples of failed WUs - three errored out and two I aborted after they got stuck.
	http://www.gpugrid.net/result.php?resultid=5726005
	http://www.gpugrid.net/result.php?resultid=5724879
	http://www.gpugrid.net/result.php?resultid=5723761
	http://www.gpugrid.net/result.php?resultid=5719313
	http://www.gpugrid.net/result.php?resultid=5713069

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 26632 - Posted: 17 Aug 2012 \| 12:44:47 UTC - in response to Message 26630.
	We are seeing three different errors here. Zoltan's system had a "No heartbeat from core client for 30 sec" error, and ETQuestor had 2 different errors (3 tasks were aborted). One error was a SIGABRT and the other was an "energies have become nan" error:
	SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
	acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
	SIGABRT: abort called
	Stack trace (15 frames):
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
	/lib64/libc.so.6(+0x359a0)[0x7fdff7ad39a0]
	/lib64/libc.so.6(gsignal+0x35)[0x7fdff7ad3925]
	/lib64/libc.so.6(abort+0x148)[0x7fdff7ad50d8]
	/lib64/libc.so.6(+0x2e6a2)[0x7fdff7acc6a2]
	/lib64/libc.so.6(+0x2e752)[0x7fdff7acc752]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
	/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdff7abf735]
	../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]
	Exiting...
	-- ERROR: file deven.cpp line 1106: # Energies have become nan
	Each of these errors can be caused by more than one problem, and there have been suggestions about them in the past.
	While the No Heartbeat error can simply be caused by not having enough access to the CPU, it's often due to not being able to write to the hard drive because other tasks are basically thrashing the drive (I think any process that reads or writes to the drive would be prioritised over all things BOINC). Things like automatic disk defrags, Windows search engines and AV scans can trigger this. I have also experienced this error when a SATA cable became loose (the sub-standard red ones)! I suspect that opening the Event Log window, or leaving it open, might contribute to this as well (if you have lots of cc_config flags set).
	The SIGABRT (an abort task signal) and the Not a Number errors could well be task related, but they could also be Linux setup/driver/library issues. In the past, similar errors were supposedly caused by BOINC running CPU benchmarks, amongst other things. A lib.so.6 "double free or corruption" was reported back in Jan when crunching a TONI task, though there are no suggestions in that thread. Soneageman also reported this error. Again, I don't see any specific helpful info.
	If the problem continues, look at the hardware (temps/fan speed/noise), the driver and the BOINC client (config/updates).
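To keep the three errors apart when looking through many results, the stderr text of each task can be sorted by the message it contains. A minimal sketch follows; the marker strings are taken from the outputs quoted in this thread, while the labels and the function name `classify_stderr` are made up for illustration and are not part of BOINC or ACEMD.

```python
from typing import Dict, List

# Marker strings as they appear in the stderr excerpts in this thread.
# The label names on the left are hypothetical.
KNOWN_SIGNATURES: Dict[str, str] = {
    "no_heartbeat": "No heartbeat from core client",
    "sigabrt": "SIGABRT",
    "energies_nan": "Energies have become nan",
}


def classify_stderr(stderr_text: str) -> List[str]:
    """Return the labels of all known error markers found in stderr_text."""
    return [
        label
        for label, marker in KNOWN_SIGNATURES.items()
        if marker in stderr_text
    ]
```

A task whose output matches none of the markers returns an empty list, which is what a task that simply hung and was aborted by hand would be expected to produce.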

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851 Level Scientific publications	Message 26633 - Posted: 17 Aug 2012 \| 16:23:17 UTC - in response to Message 26632.
	Quoting Message 26632:
		We are seeing three different errors here. Zoltan's system had a "No heartbeat from core client for 30 sec" error, .... While the No Heartbeat error can simply be caused by not having enough access to the CPU, it's often due to not being able to write to the hard drive, due to other tasks basically thrashing the drive (I think any process that reads or writes to the drive would be prioritised over all things Boinc). Things like automatic disk defrags, Windows search engines and AV scans can trigger this. I have also experienced this error when a SATA cable became loose (the sub-standard Red ones)! I suspect opening or leaving the Event Log window open might contribute to this as well (if you have lots of cc_config flags set).
	While this is true, it's not the source of the slowness of this workunit. It was slow right from the start. My rosetta@home tasks were going wild, using 400 to 850MB, so when I started Skype, the BOINC manager shut down every task because the memory used by BOINC applications exceeded the threshold of maximum usable physical memory (90%). It then restarted the tasks one by one, but rosetta@home tasks read 1.3GB (and write 130MB) at startup, and since I don't have an SSD in this PC, this could overwhelm the file system, causing tasks to start and stop several times (because of the "No heartbeat from core client" error) and rendering my PC unusable for a couple of minutes. Since then I've doubled the RAM in this host (it now has 12GB).
	The other 3 tasks running at the same time also experienced this "No heartbeat" error, but they didn't slow down. This error makes the BOINC manager stop the task and restart it from the last checkpoint. Here is a list of my workunits which experienced this error but didn't slow down:
	5743222 (2 times)
	5743031 (2 times)
	5742951 (2 times)
	5742426 (4 times)
	5741987 (2 times)
	5741772 (2 times)
	5741561 (2 times)
	5741537 (2 times)
	5740471
	5740130
	5739827
	5739821

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 26635 - Posted: 17 Aug 2012 \| 20:07:54 UTC - in response to Message 26633. Last modified: 17 Aug 2012 \| 20:39:22 UTC
	This is something of an aside, as the "no heartbeat" issue (30 sec without a response, then the task is stopped) is clouding the problem here. I think a delay needs to be introduced during task startup/restart, and this should really be done at the BOINC level rather than by the app or by the users. That said, a script to do so would be a good workaround, similar to the Linux startup delay but on a task-by-task basis (allowing a few seconds for each task to load).
	If possible, using a secondary hard drive should help avoid this issue. For normal usage, though, you would want an SSD for the system to boost system performance, especially startup/shutdown, rather than one used just for BOINC. Of course I'm guilty of buying an SSD just to support some of the more challenging projects, but then I like a challenge.
	I really see this as a problem with Rosetta and BOINC. Basically, I think BOINC should always prioritise GPU projects over CPU projects, even if it means using delayed writes or suspending the CPU project. Frankly, I never want any CPU project to interfere with a GPUGrid Long run for any reason (HDD, CPU, RAM...). If CPU projects could ascertain how many resources were available to them after the GPU project has started as a priority, this sort of problem should never happen.
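The task-by-task delay suggested above could be sketched as a small script around `boinccmd`. This is an illustration under several assumptions, not a tested workaround: it assumes `boinccmd` is on the PATH, that the tasks were suspended beforehand, and that a fixed delay between resumes is enough. The function names, the default delay and the task names in the usage note are hypothetical.

```python
import subprocess
import time
from typing import Callable, List, Sequence, Tuple


def build_resume_command(project_url: str, task_name: str) -> List[str]:
    """Build the boinccmd call that resumes one suspended task."""
    return ["boinccmd", "--task", project_url, task_name, "resume"]


def staggered_resume(
    tasks: Sequence[Tuple[str, str]],
    delay_s: float = 10.0,  # assumed pause; tune to how long a task takes to load
    run: Callable[..., object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Resume (project_url, task_name) pairs one at a time with a pause between them.

    run and sleep are injectable so the function can be exercised
    without a BOINC client present.
    """
    issued: List[List[str]] = []
    for index, (project_url, task_name) in enumerate(tasks):
        if index > 0:
            sleep(delay_s)  # give the previous task time to finish loading
        command = build_resume_command(project_url, task_name)
        run(command, check=True)
        issued.append(command)
    return issued
```

It would be called with something like `staggered_resume([("http://www.gpugrid.net/", "some_task_name")])`, listing the GPU task first so that it loads before the disk-heavy CPU tasks are resumed.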

Raptures Riot Send message Joined: 30 Apr 11 Posts: 6 Credit: 220,588,795 RAC: 0 Level Scientific publications	Message 26636 - Posted: 17 Aug 2012 \| 21:09:13 UTC
	I am getting a lot of 'Energies have become nan' errors, the same as many others. This usually occurs 3 or 4 hours into the calculation. I'm presuming 'nan' means 'indeterminate', which is a legitimate conclusion to the model, so I do not understand why the calculation ends in an error. Please, if this result is useful info, can a 'completed successfully' be awarded? There seems to be some disenchantment in the forums over this topic. I know everyone here is dedicated and respectful and we try real hard for results. Let's understand why 'nan' can be a good 3 letter word.
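On the meaning of 'nan': in floating-point arithmetic NaN stands for "not a number", the result of an undefined operation such as infinity minus infinity, and every later calculation that uses it is NaN as well. The sketch below illustrates that behaviour only; it is not ACEMD code, and the function `run_steps` and its error handling are assumptions made for the example.

```python
import math
from typing import Iterable


def run_steps(energy_changes: Iterable[float]) -> float:
    """Accumulate energy changes and stop as soon as the total becomes NaN."""
    energy = 0.0
    for step, change in enumerate(energy_changes):
        energy = energy + change
        if math.isnan(energy):
            # Every later step would also produce NaN, so there is
            # nothing meaningful left to compute from here on.
            raise FloatingPointError(f"energy became NaN at step {step}")
    return energy
```

Because the value cannot recover once it is NaN, a run that reaches this state has no usable result after that point, which is a plausible reason for reporting it as an error rather than as a completed task.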

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851 Level Scientific publications	Message 26637 - Posted: 17 Aug 2012 \| 21:18:05 UTC - in response to Message 26636.
	The nan error has its own thread.