The system cannot find the path specified

Morten Ross (Joined: 26 Sep 08, Posts: 6, Credit: 95,071,491, RAC: 0)
Message 27503 - Posted: 2 Dec 2012 | 11:21:13 UTC

I have a host with 2x GTX 590 and 2x GTX 690, and tasks exit with the error below at various stages of processing:

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>

As the app is poor at logging which device it is using or failing on, I am unable to see which devices are involved. As a comparison, this host is one of the top hosts on Seti@Home (now down, which is the reason I am crunching here) with no error tasks, so there is no hardware-related or Nvidia driver issue.

Regards
Morten
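The stderr text quoted above carries three useful fields in its SWAN line: the CUDA driver error code, the source file and the line number. In the CUDA driver API, 999 is the generic "unknown error" value (CUDA_ERROR_UNKNOWN), so the code alone does not identify a cause. A minimal sketch of pulling those fields out of a saved stderr text follows; the regular expression is written against the single line quoted in this thread and is an assumption about the general format, and the function name `find_swan_errors` is made up for illustration.

```python
import re

# Sample stderr text copied from the post above.
STDERR = '''MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59
'''

# Assumption: other app versions may word this message differently.
SWAN_FATAL = re.compile(
    r"SWAN\s*:\s*FATAL\s*:\s*Cuda driver error (?P<code>\d+) "
    r"in file '(?P<file>[^']+)' in line (?P<line>\d+)"
)

def find_swan_errors(stderr_text):
    """Return a list of (error_code, source_file, line_number) tuples."""
    return [
        (int(m.group("code")), m.group("file"), int(m.group("line")))
        for m in SWAN_FATAL.finditer(stderr_text)
    ]

print(find_swan_errors(STDERR))
```

Running this over the stderr of several failed tasks would show whether they all stop at the same place in swanlibnv2.cpp, although it still cannot say which GPU was involved.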

mikey (Joined: 2 Jan 09, Posts: 297, Credit: 6,230,518,347, RAC: 29,447,646)
Message 27505 - Posted: 2 Dec 2012 | 13:40:31 UTC - in response to Message 27503. Last modified: 2 Dec 2012 | 13:40:57 UTC

If you look at the top of the Event Log, it will tell you which GPUs loaded and which is #0 and which is #1. It should also tell you if it found GPU #1, for instance, but failed to load the drivers or whatever. Here is mine from this machine:

11/30/2012 10:51:03 AM | | ATI GPU 0: ATI Radeon HD 5700 series (Juniper) (CAL version 1.4.1741, 1024MB, 991MB available, 2720 GFLOPS peak)
11/30/2012 10:51:03 AM | | ATI GPU 1: ATI Radeon HD 5700 series (Juniper) (CAL version 1.4.1741, 1024MB, 991MB available, 2800 GFLOPS peak)
11/30/2012 10:51:03 AM | | OpenCL: ATI GPU 0: ATI Radeon HD 5700 series (Juniper) (driver version CAL 1.4.1741 (VM), device version OpenCL 1.2 AMD-APP (938.2), 1024MB, 991MB available)
11/30/2012 10:51:03 AM | | OpenCL: ATI GPU 1: ATI Radeon HD 5700 series (Juniper) (driver version CAL 1.4.1741 (VM), device version OpenCL 1.2 AMD-APP (938.2), 1024MB, 991MB available)

As you can see, I have two AMD 5770s in the machine. The first lines say what it found, while the last lines show that the drivers loaded properly.

I also see you are using Windows. If a GPU crashes in Windows, the ONLY way to get it back to normal is to reboot the machine, so you might try that first and see if it helps.
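For anyone who wants the device numbering from those startup lines in machine-readable form, here is a small sketch that extracts the device index, name, memory and peak GFLOPS. The pattern is an assumption based only on the two "ATI GPU" lines quoted above; NVIDIA lines and other BOINC versions are formatted differently and would need their own pattern. The name `parse_gpu_lines` is made up for illustration.

```python
import re

# The two detection lines quoted in the post above.
EVENT_LOG = """\
11/30/2012 10:51:03 AM | | ATI GPU 0: ATI Radeon HD 5700 series (Juniper) (CAL version 1.4.1741, 1024MB, 991MB available, 2720 GFLOPS peak)
11/30/2012 10:51:03 AM | | ATI GPU 1: ATI Radeon HD 5700 series (Juniper) (CAL version 1.4.1741, 1024MB, 991MB available, 2800 GFLOPS peak)
11/30/2012 10:51:03 AM | | OpenCL: ATI GPU 0: ATI Radeon HD 5700 series (Juniper) (driver version CAL 1.4.1741 (VM), device version OpenCL 1.2 AMD-APP (938.2), 1024MB, 991MB available)
"""

# Matches only the plain "ATI GPU n:" lines, not the "OpenCL:" ones.
GPU_LINE = re.compile(
    r"\| \| ATI GPU (?P<index>\d+): (?P<name>.+?) "
    r"\(CAL version [\d.]+, (?P<total_mb>\d+)MB, \d+MB available, "
    r"(?P<gflops>\d+) GFLOPS peak\)"
)

def parse_gpu_lines(log_text):
    """Map device index -> (name, total memory in MB, peak GFLOPS)."""
    devices = {}
    for m in GPU_LINE.finditer(log_text):
        devices[int(m.group("index"))] = (
            m.group("name"),
            int(m.group("total_mb")),
            int(m.group("gflops")),
        )
    return devices

for index, info in parse_gpu_lines(EVENT_LOG).items():
    print(index, info)
```

This only lists what BOINC detected at startup; it says nothing about which device a given task later ran or failed on.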

Morten Ross (Joined: 26 Sep 08, Posts: 6, Credit: 95,071,491, RAC: 0)
Message 27506 - Posted: 2 Dec 2012 | 15:54:38 UTC - in response to Message 27505.

Hi,

The top of the BOINC Manager event log is of no importance once the science app has loaded and is crunching. In other projects the app logs which GPU it loads the task on - not so here. What you are referring to is simply the list of GPU devices that BOINC has found on the system.

In the meantime I also see this crash behavior on my GTX 690-only host, so it also occurs on a host that does not mix GPU models. Also, again, there is no NVIDIA driver crash or restart involved in this scenario - only the app.

Morten

Retvari Zoltan (Joined: 20 Jan 09, Posts: 2343, Credit: 16,249,865,968, RAC: 4,089,892)
Message 27507 - Posted: 2 Dec 2012 | 23:41:58 UTC - in response to Message 27506.

Hi!

The MDIO: cannot open file "restart.coor" line is a false error message (it appears in every task); the line that follows it, however, is a real error.

The CUDA 4.2 tasks are faster than the CUDA 3.1 tasks, and they are more demanding at the same time, so they tolerate less overclocking. If you are overclocking your cards, you should recalibrate for the CUDA 4.2 tasks. For example, I had to set my GTX 590 to 625 MHz for the CUDA 4.2 client. It was running fine at 725 MHz with the CUDA 3.1 client. However, the CUDA 4.2 client is 40% faster than the CUDA 3.1 client, so it can still do about 20% more work at the lower frequency.

Maybe your GPU temperatures are too high (below 80°C is OK, above 90°C is dangerous).
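The "20% more work" figure in the post above can be checked with a line of arithmetic. The sketch below assumes that throughput scales linearly with GPU clock, which is a simplification and not something stated in the thread; the function name is made up for illustration.

```python
def relative_throughput(client_speedup, new_clock_mhz, old_clock_mhz):
    """Throughput of the new client at the new clock, relative to the
    old client at the old clock, assuming linear scaling with clock."""
    return client_speedup * new_clock_mhz / old_clock_mhz

# CUDA 4.2 client (1.40x) at 625 MHz versus CUDA 3.1 client at 725 MHz.
ratio = relative_throughput(1.40, 625, 725)
print(f"{ratio:.3f}")  # prints 1.207
```

Under that assumption the downclocked card still comes out roughly 20% ahead, which matches the figure quoted in the post.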

Morten Ross (Joined: 26 Sep 08, Posts: 6, Credit: 95,071,491, RAC: 0)
Message 27509 - Posted: 3 Dec 2012 | 1:07:00 UTC - in response to Message 27507.

Hi Retvari,

Temperature is not an issue - no card goes above 70°C. Overclocking is also not an issue, as all cards are on stock settings.

skgiven, Volunteer moderator, Volunteer tester (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Message 27511 - Posted: 3 Dec 2012 | 9:26:57 UTC - in response to Message 27509. Last modified: 3 Dec 2012 | 12:38:11 UTC

This post, or other posts in the same FAQ thread, might point you in the right direction: FAQ - Why does my run fail? Some answers

What's your CPU usage?

BTW, in the BOINC Tasks tab you can click on a task and select Properties to see which GPU it is running on, device 0 or device 1 for example. This, however, isn't listed in the BOINC logs, neither at the top of the page nor in the reams of messages. If you select a GPUGrid line and then 'Show only this project', it becomes a little easier to find some info, so long as you don't have lots of flags set, but what you'll find is something like this:

03/12/2012 08:06:43 | GPUGRID | Starting task p039_r2-TONI_AGGd4-5-100-RND1877_0 using acemdlong version 616 (cuda42) in slot 0

Slot 0 isn't a physical (PCIe) slot, and it doesn't correspond to the device either. It's just a logical slot that BOINC allocates the task to. A bit of a misnomer, and probably another relic from the CPU-based beginnings of BOINC.
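To make the fields in that "Starting task" line explicit, here is a minimal sketch that splits it into task name, application, version, plan class and slot. The pattern is an assumption derived from the one line quoted above and may not match other BOINC client versions. As the post says, the slot number is a logical slot, so it must not be read as a GPU index.

```python
import re

# The event log line quoted in the post above.
LINE = ("03/12/2012 08:06:43 | GPUGRID | Starting task "
        "p039_r2-TONI_AGGd4-5-100-RND1877_0 using acemdlong "
        "version 616 (cuda42) in slot 0")

# The plan class in parentheses is treated as optional (assumption).
STARTING_TASK = re.compile(
    r"Starting task (?P<task>\S+) using (?P<app>\S+) "
    r"version (?P<version>\d+)(?: \((?P<plan_class>[^)]+)\))? "
    r"in slot (?P<slot>\d+)"
)

def parse_starting_task(line):
    """Return a dict of the fields in a 'Starting task' line, or None."""
    m = STARTING_TASK.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["version"] = int(fields["version"])
    fields["slot"] = int(fields["slot"])  # logical slot, not a device number
    return fields

print(parse_starting_task(LINE))
```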

Morten Ross (Joined: 26 Sep 08, Posts: 6, Credit: 95,071,491, RAC: 0)
Message 27521 - Posted: 3 Dec 2012 | 15:34:20 UTC - in response to Message 27511.

Hi,

CPU utilization is stable at around 35%, so there are plenty of idle cores.

Other projects have science apps that actually log which device they run on, so when a task fails you can see whether it is the same device each time. This project does nothing to help find such a pattern, as the app logs absolutely no such info.

As a comparison, the Milkyway@Home app:

<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
Unrecognized XML in project preferences: max_gfx_cpu_pct
Skipping: 100
Skipping: /max_gfx_cpu_pct
Unrecognized XML in project preferences: allow_non_preferred_apps
Skipping: 1
Skipping: /allow_non_preferred_apps
Unrecognized XML in project preferences: nbody_graphics_poll_period
Skipping: 30
Skipping: /nbody_graphics_poll_period
Unrecognized XML in project preferences: nbody_graphics_float_speed
Skipping: 5
Skipping: /nbody_graphics_float_speed
Unrecognized XML in project preferences: nbody_graphics_textured_point_size
Skipping: 250
Skipping: /nbody_graphics_textured_point_size
Unrecognized XML in project preferences: nbody_graphics_point_point_size
Skipping: 40
Skipping: /nbody_graphics_point_point_size
Guessing preferred OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE4.1 path
Found 1 platform
Platform 0 information:
Name: AMD Accelerated Parallel Processing
Version: OpenCL 1.2 AMD-APP (1016.4)
Vendor: Advanced Micro Devices, Inc.
Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing
Profile: FULL_PROFILE
Using device 0 on platform 0
Found 2 CL devices
Device 'Cayman' (Advanced Micro Devices, Inc.:0x1002) (CL_DEVICE_TYPE_GPU)
Driver version: 1016.4 (VM)
Version: OpenCL 1.2 AMD-APP (1016.4)
Compute capability: 0.0
Max compute units: 24
Clock frequency: 890 Mhz
Global mem size: 2147483648
Local mem size: 32768
Max const buf size: 65536
Double extension: cl_khr_fp64
Build log:
--------------------------------------------------------------------------------
"D:\Users\EXCHTE~1\AppData\Local\Temp\OCL798C.tmp.cl", line 30: warning: OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^
LOOP UNROLL: pragma unroll (line 288)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 280)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 273)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 244)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 202)
Unrolled as requested!
--------------------------------------------------------------------------------
Build log:
--------------------------------------------------------------------------------
"D:\Users\EXCHTE~1\AppData\Local\Temp\OCL7A49.tmp.cl", line 27: warning: OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^
--------------------------------------------------------------------------------
Using AMD IL kernel
Binary status (0): CL_SUCCESS
Estimated AMD GPU GFLOP/s: 2734 SP GFLOP/s, 684 DP FLOP/s
Using a target frequency of 60.0
Using a block size of 6144 with 72 blocks/chunk
Using clWaitForEvents() for polling (mode -1)
Range: { nu_steps = 960, mu_steps = 1600, r_steps = 1400 }
Iteration area: 2240000
Chunk estimate: 5
Num chunks: 6
Chunk size: 442368
Added area: 414208
Effective area: 2654208
Initial wait: 12 ms
Integration time: 78.151557 s. Average time per iteration = 81.407872 ms
Integral 0 time = 78.963098 s
Running likelihood with 94785 stars
Likelihood time = 1.212664 s
<background_integral> 0.000692186091274 </background_integral>
<stream_integral> 186.259197021827840 1439.041456655146900 </stream_integral>
<background_likelihood> -3.216576106093333 </background_likelihood>
<stream_only_likelihood> -3.733580537591124 -4.441329969686433 </stream_only_likelihood>
<search_likelihood> -2.935394639475085 </search_likelihood>
16:02:13 (10500): called boinc_finish

Seti@Home app:

setiathome_CUDA: Found 6 CUDA device(s):
Device 1: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 3, pciSlotID = 0
clockRate = 1215 MHz
Device 2: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 4, pciSlotID = 0
clockRate = 1215 MHz
Device 3: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 8, pciSlotID = 0
clockRate = 1225 MHz
Device 4: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 12, pciSlotID = 0
clockRate = 1225 MHz
Device 5: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 13, pciSlotID = 0
clockRate = 1225 MHz
Device 6: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 9, pciSlotID = 0
clockRate = 1225 MHz
In cudaAcc_initializeDevice(): Boinc passed DevPref 3
setiathome_CUDA: CUDA Device 3 specified, checking...
Device 3: GeForce GTX 590 is okay
SETI@home using CUDA accelerated device GeForce GTX 590
mbcuda.cfg, processpriority key detected
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully
Multibeam x41x Preview, Cuda 4.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 0.431954
VRAM: cudaMalloc((void**) &dev_cx_DataArray, 1048576x 8bytes = 8388608bytes, offs256=0, rtotal= 8388608bytes
VRAM: cudaMalloc((void**) &dev_cx_ChirpDataArray, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 17825792bytes
VRAM: cudaMalloc((void**) &dev_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 17825800bytes
VRAM: cudaMalloc((void**) &dev_WorkData, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 27262984bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrum, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 31457288bytes
VRAM: cudaMalloc((void**) &dev_t_PowerSpectrum, 1048584x 4bytes = 1048608bytes, offs256=0, rtotal= 32505896bytes
VRAM: cudaMalloc((void**) &dev_GaussFitResults, 1048576x 16bytes = 16777216bytes, offs256=0, rtotal= 49283112bytes
VRAM: cudaMalloc((void**) &dev_PoT, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 55574568bytes
VRAM: cudaMalloc((void**) &dev_PoTPrefixSum, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 61866024bytes
VRAM: cudaMalloc((void**) &dev_NormMaxPower, 16384x 4bytes = 65536bytes, offs256=0, rtotal= 61931560bytes
VRAM: cudaMalloc((void**) &dev_flagged, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 66125864bytes
VRAM: cudaMalloc((void**) &dev_outputposition, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 70320168bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrumSumMax, 262144x 12bytes = 3145728bytes, offs256=0, rtotal= 73465896bytes
VRAM: cudaMallocArray( &dev_gauss_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=248, rtotal= 73474088bytes
VRAM: cudaMallocArray( &dev_null_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=144, rtotal= 73482280bytes
VRAM: cudaMalloc((void**) &dev_find_pulse_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 73482288bytes
VRAM: cudaMalloc((void**) &dev_t_funct_cache, 1966081x 4bytes = 7864324bytes, offs256=0, rtotal= 81346612bytes
Thread call stack limit is: 1k
boinc_exit(): requesting safe worker shutdown ->
Worker Acknowledging exit request, spinning->
boinc_exit(): received safe worker shutdown acknowledge ->

Checking which GPU a task is currently running on is not very helpful when what I need is the device it failed on.
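The kind of pattern hunting described in the post above could look like the following sketch. It relies on the "Boinc passed DevPref N" line from the SETI@home stderr quoted in that post; the GPUGRID stderr shown in this thread contains no such line, which is exactly the complaint, so those tasks end up under `None`. The function names and the sample results list are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Line format taken from the SETI@home stderr quoted above.
DEV_PREF = re.compile(r"Boinc passed DevPref (\d+)")

def device_from_stderr(stderr_text):
    """Return the device number BOINC passed to the app, or None if not logged."""
    match = DEV_PREF.search(stderr_text)
    return int(match.group(1)) if match else None

def failures_per_device(results):
    """Count failed tasks per device; results is an iterable of (stderr_text, failed)."""
    counts = Counter()
    for stderr_text, failed in results:
        if failed:
            counts[device_from_stderr(stderr_text)] += 1
    return counts

# Hypothetical results for illustration only; not real task data.
sample_results = [
    ("In cudaAcc_initializeDevice(): Boinc passed DevPref 3", True),
    ("In cudaAcc_initializeDevice(): Boinc passed DevPref 3", True),
    ("In cudaAcc_initializeDevice(): Boinc passed DevPref 1", False),
    ("SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.", True),
]

print(failures_per_device(sample_results))
```

If one device number dominated the failure counts, that would point at a single card; a stderr with no device line leaves that question open.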

Morten Ross (Joined: 26 Sep 08, Posts: 6, Credit: 95,071,491, RAC: 0)
Message 27543 - Posted: 4 Dec 2012 | 17:25:03 UTC - in response to Message 27521. Last modified: 4 Dec 2012 | 17:27:25 UTC

I've found a workaround, which of course is only for those with nothing else to do in life: restart BOINC Manager. Any task that has not yet been terminated will then restart from its checkpoint and successfully go beyond the point of the app crash.

I just got back from work, and 3 tasks have terminated on one host, while on the other they are hanging - waiting to be killed for exceeding the time limit. Restarting BOINC Manager restarted those tasks and saved them from a Computation error.

It's really strange that there is no app developer here to help with the root cause analysis...

The same system is rock stable in all other projects I participate in, so without input from a developer I guess the current state of the project is not for me.

microchip (Joined: 4 Sep 11, Posts: 110, Credit: 326,102,587, RAC: 0)
Message 27545 - Posted: 4 Dec 2012 | 18:09:41 UTC. Last modified: 4 Dec 2012 | 18:20:46 UTC

I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".

Team Belgium

mikey (Joined: 2 Jan 09, Posts: 297, Credit: 6,230,518,347, RAC: 29,447,646)
Message 27578 - Posted: 5 Dec 2012 | 22:07:54 UTC - in response to Message 27545.

It might help if you upgraded your version of BOINC from 6.12.34 to a more current one. There have been a TON of changes, and your problem could be addressed by one of them. I have two 560 Tis, which are different from yours but should be similar enough, and mine run just fine. Another difference is that I am on Windows, but other Linux folks don't seem to have the same problem, or one would expect them to be here too.

microchip (Joined: 4 Sep 11, Posts: 110, Credit: 326,102,587, RAC: 0)
Message 27604 - Posted: 7 Dec 2012 | 13:42:04 UTC - in response to Message 27578. Last modified: 7 Dec 2012 | 13:43:24 UTC

I can't at the moment, as the recent 7.0.28 version for Linux is compiled against a higher libc version than my current openSUSE 12.1 system provides. So 6.12.34 is the highest version I can run right now, and I'm really not up to compiling BOINC from source.

mikey (Joined: 2 Jan 09, Posts: 297, Credit: 6,230,518,347, RAC: 29,447,646)
Message 27623 - Posted: 9 Dec 2012 | 14:36:11 UTC - in response to Message 27604.

That is a good reason NOT to update then!

microchip (Joined: 4 Sep 11, Posts: 110, Credit: 326,102,587, RAC: 0)
Message 28544 - Posted: 16 Feb 2013 | 18:43:27 UTC - in response to Message 27545.

My problem was related to heat output. I had reduced the fan speed, so my GPU got hotter while crunching. After restoring the defaults, short WUs complete with no issues. However, long WUs still error out, so I've disabled them for now.

dskagcommunity (Joined: 28 Apr 11, Posts: 456, Credit: 817,865,789, RAC: 0)
Message 28545 - Posted: 16 Feb 2013 | 18:49:54 UTC - in response to Message 27545.

I had this problem on my GTX 560 Ti (384 cores) too, whenever I did not downclock the GPU memory and increase the GPU voltage to 1.025 V. Since making those changes it has run without errors.

DSKAG Austria Research Team: http://www.research.dskag.at

safe worker shutdown acknowledge -> Checking which GPU a task is running on is not very helpful when I require the device that it has failed on.
	ID: 27521 \| Rating: 0 \| rate: / Reply Quote
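The post above asks for a way to see whether failures cluster on one device. Where an app does log its device, as the Seti@Home stderr quoted above does with "Boinc passed DevPref 3", the tally could be done with a small script. This is only a sketch: the `tasks` list of (stderr text, failed) pairs and both function names are hypothetical, and it assumes the stderr text has been copied by hand from the task result pages. It cannot help with an app that logs no device at all; such tasks are counted under `None`.

```python
import re
from collections import Counter

# Pattern taken from the Seti@Home stderr quoted in the post above
# ("Boinc passed DevPref 3"); other apps will need a different pattern.
DEVICE_RE = re.compile(r"Boinc passed DevPref (\d+)")


def device_of(stderr_text):
    """Return the device number logged in a task's stderr, or None if absent."""
    match = DEVICE_RE.search(stderr_text)
    return int(match.group(1)) if match else None


def failures_per_device(tasks):
    """Count failed tasks per device.

    `tasks` is a hypothetical list of (stderr_text, failed) pairs.
    Failed tasks whose stderr names no device are counted under None.
    """
    counts = Counter()
    for stderr_text, failed in tasks:
        if failed:
            counts[device_of(stderr_text)] += 1
    return counts
```

If one device number dominates the counts, that card is the first candidate for a hardware, cooling, or clocking check.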

Morten Ross Send message Joined: 26 Sep 08 Posts: 6 Credit: 95,071,491 RAC: 0 Level Scientific publications	Message 27543 - Posted: 4 Dec 2012 \| 17:25:03 UTC - in response to Message 27521. Last modified: 4 Dec 2012 \| 17:27:25 UTC
	I've found a workaround, which of course is only for those with nothing else in life to do: restart Boincmanager. Any task that has not yet been terminated will then restart from its checkpoint and successfully get past the point where the app crashed. I just got back from work and 3 tasks have terminated on one host, while on the other they are hanging, waiting to be killed by "time limit exceeded". Restarting Boincmanager restarted the tasks and saved them from a Computation error. It's really strange that there is no app developer here to aid in the root cause analysis... The same system is rock stable in all other projects I participate in, so without input from a developer I guess the current state of the project is not for me.
	ID: 27543 \| Rating: 0 \| rate: / Reply Quote

microchip Send message Joined: 4 Sep 11 Posts: 110 Credit: 326,102,587 RAC: 0 Level Scientific publications	Message 27545 - Posted: 4 Dec 2012 \| 18:09:41 UTC Last modified: 4 Dec 2012 \| 18:20:46 UTC
	I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan". ____________ Team Belgium
	ID: 27545 \| Rating: 0 \| rate: / Reply Quote

mikey Send message Joined: 2 Jan 09 Posts: 297 Credit: 6,230,518,347 RAC: 29,447,646 Level Scientific publications	Message 27578 - Posted: 5 Dec 2012 \| 22:07:54 UTC - in response to Message 27545.
	It might help if you upgraded your version of BOINC from 6.12.34 to a more current one. There have been a TON of changes, and one of them could fix your problem. I have two 560 Ti's, which are different from your card but should be similar enough, and mine run just fine. Another difference is that I am on Windows, but other Linux folks don't seem to have the same problems or one would expect them to be here too.
	ID: 27578 \| Rating: 0 \| rate: / Reply Quote

microchip Send message Joined: 4 Sep 11 Posts: 110 Credit: 326,102,587 RAC: 0 Level Scientific publications	Message 27604 - Posted: 7 Dec 2012 \| 13:42:04 UTC - in response to Message 27578. Last modified: 7 Dec 2012 \| 13:43:24 UTC
	I can't at the moment, as the recent 7.0.28 version for Linux is compiled against a newer libc than my current openSUSE 12.1 system provides. So 6.12.34 is the highest version I can run right now, and I'm really not up to compiling BOINC from source. ____________ Team Belgium
	ID: 27604 \| Rating: 0 \| rate: / Reply Quote

mikey Send message Joined: 2 Jan 09 Posts: 297 Credit: 6,230,518,347 RAC: 29,447,646 Level Scientific publications	Message 27623 - Posted: 9 Dec 2012 \| 14:36:11 UTC - in response to Message 27604.
	That is a good reason NOT to update then!
	ID: 27623 \| Rating: 0 \| rate: / Reply Quote

microchip Send message Joined: 4 Sep 11 Posts: 110 Credit: 326,102,587 RAC: 0 Level Scientific publications	Message 28544 - Posted: 16 Feb 2013 \| 18:43:27 UTC - in response to Message 27545.
	My problem was related to heat output. I had reduced the fan speed, so my GPU got hotter while crunching. After restoring the defaults, short WUs complete with no issues. However, long WUs still error out, so I've disabled them for now. ____________ Team Belgium
	ID: 28544 \| Rating: 0 \| rate: / Reply Quote
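Since the post above traces the errors to GPU heat, it may help to watch temperatures while a task runs. The following is a minimal sketch, assuming an NVIDIA driver whose `nvidia-smi` supports the `--query-gpu` option and reports temperature for the installed card (older drivers and some GeForce cards may not). The function names and the 80 °C limit are illustrative assumptions, not values from this thread.

```python
import subprocess


def parse_temperatures(csv_text):
    """Parse 'index, temperature' CSV lines into {gpu_index: degrees_C}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def hot_gpus(temps, limit_c=80):
    """Return the indices of GPUs at or above limit_c (an arbitrary example limit)."""
    return sorted(index for index, temp in temps.items() if temp >= limit_c)
```

Calling `hot_gpus(read_temperatures())` every few minutes and noting the result next to each failed task would show whether errors coincide with a hot card.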

dskagcommunity Send message Joined: 28 Apr 11 Posts: 456 Credit: 817,865,789 RAC: 0 Level Scientific publications	Message 28545 - Posted: 16 Feb 2013 \| 18:49:54 UTC - in response to Message 27545.
	I had this problem on my 560 Ti (384 cores) too, until I downclocked the GPU memory and increased the GPU voltage to 1.025 V. Since then it has run without errors. ____________ DSKAG Austria Research Team: http://www.research.dskag.at
	ID: 28545 \| Rating: 0 \| rate: / Reply Quote