ubuntu cuda100 not surviving restart of client

Message boards : Number crunching : ubuntu cuda100 not surviving restart of client

Author	Message
JStateson Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0	Message 53165 - Posted: 27 Nov 2019 | 20:09:31 UTC Last modified: 27 Nov 2019 | 20:09:54 UTC
	Restarted the client and lost all 3 Linux cuda 100 tasks. I did not realize this was a problem. I probably should have suspended them all before restarting BOINC. This is unfortunate, as I don't always get GPUGrid Linux tasks, and I hate to lose the few I do get this way.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851	Message 53173 - Posted: 27 Nov 2019 | 22:50:09 UTC - in response to Message 53165. Last modified: 27 Nov 2019 | 22:51:22 UTC
	JStateson wrote:
	Restarted the client and lost all 3 Linux cuda 100 tasks. Did not realize this was a problem. I probably should have suspended them all before doing a restart of boinc. This is unfortunate as I don't always get gpugrid Linux tasks and the few I get I hate to lose this way.

	The reason for this error is in the stderr output of the task:

	<core_client_version>7.16.1</core_client_version>
	<![CDATA[
	<message>
	process exited with code 195 (0xc3, -61)</message>
	<stderr_txt>
	09:41:49 (11866): wrapper (7.7.26016): starting
	09:41:49 (11866): wrapper (7.7.26016): starting
	09:41:49 (11866): wrapper: running acemd3 (--boinc input --device 1)
	13:57:59 (13231): wrapper (7.7.26016): starting
	13:57:59 (13231): wrapper (7.7.26016): starting
	13:57:59 (13231): wrapper: running acemd3 (--boinc input --device 0)
	ERROR: /home/user/conda/conda-bld/acemd3_1570536635323/work/src/mdsim/context.cpp line 322: Cannot use a restart file on a different device!
	13:58:05 (13231): acemd3 exited; CPU time 5.243312
	13:58:05 (13231): app exit status: 0x9e
	13:58:05 (13231): called boinc_finish(195)
	</stderr_txt>
	]]>

	The task started on device 1 and was restarted on device 0, which ACEMD3 refuses to do. This can happen only on hosts with multiple GPUs (it is a known bug of the ACEMD3 app). To resolve this you should:
	1. make a note of the task-device pairs
	2. suspend all GPUGrid tasks (first the ones which are not running ["ready to start"])
	3. restart your host
	4. resume your GPUGrid tasks in the order of the device numbers (the task that was running on device 0 should be resumed first, and so on)
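The diagnosis and the resume order described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names (`devices_used`, `failed_on_device_change`, `resume_commands`) are hypothetical, the project URL passed in is an assumption that should be checked against your own client, and the commands are only built, not executed. The `boinccmd --task <url> <task_name> resume` form is the standard BOINC command-line syntax.

```python
import re

# Matches the wrapper line that records which GPU acemd3 was launched on,
# e.g. "wrapper: running acemd3 (--boinc input --device 1)".
DEVICE_RE = re.compile(r"wrapper: running acemd3 \([^)]*--device (\d+)\)")
RESTART_ERROR = "Cannot use a restart file on a different device!"


def devices_used(stderr_text):
    """Return the device numbers in the order acemd3 was launched on them."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def failed_on_device_change(stderr_text):
    """True if the restart error is present and the device number changed."""
    devices = devices_used(stderr_text)
    return RESTART_ERROR in stderr_text and len(set(devices)) > 1


def resume_commands(task_devices, project_url):
    """Build (but do not run) boinccmd calls, lowest device number first.

    task_devices maps task name -> device number noted before the restart.
    project_url is an assumption; use the URL your client shows for GPUGrid.
    """
    ordered = sorted(task_devices.items(), key=lambda item: item[1])
    return [
        ["boinccmd", "--task", project_url, name, "resume"]
        for name, _device in ordered
    ]


if __name__ == "__main__":
    # Task-device pairs as noted in step 1 (names taken from this thread).
    noted = {
        "test449-TONI_GSNTEST3-6-100-RND1891_0": 1,
        "initial_1911-ELISA_GSN4V1-9-100-RND1684_0": 0,
        "initial_1243-ELISA_GSN4V1-1-100-RND2537_0": 2,
    }
    for cmd in resume_commands(noted, "https://www.gpugrid.net/"):
        print(" ".join(cmd))
```

Resuming one task at a time, and waiting for each to start before resuming the next, is what makes the order matter; the sketch only produces the order and leaves the pacing to you.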

JStateson Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0	Message 53174 - Posted: 27 Nov 2019 | 23:29:14 UTC - in response to Message 53173. Last modified: 27 Nov 2019 | 23:33:28 UTC
	Retvari Zoltan wrote:
	This could happen only on hosts with multiple GPUs (this is a known bug of the ACEMD3 app). To resolve this you should
	1. make notes of task-device pairs
	2. suspend all GPUGrid tasks (first the ones which are not running ["ready to start"])
	3. restart your host
	4. resume your GPUGrid tasks in the order of the device numbers (the task that was running on device 0 should be resumed first and so on)

	Thanks, I was not aware of that! It is going to be a real problem, as there is a Windows 10 "feature 1909" update pending. However, Ubuntu will be unaffected.

	Not sure if you noticed, but my "El Cheapo" P102-100 mining card "D1" is far and away faster than the 1660Ti "D0" and especially the GTX-1070 "D2":

	GPUGRID 2.10 New version of ACEMD (cuda100) 0.983C + 1NV (d1) 99.87 02:30:22 (02:30:10) 04:16:50 57.000 Running tb85-nvidia test449-TONI_GSNTEST3-6-100-RND1891_0 12/2/2019 9:53:34 AM JStateson
	GPUGRID 2.10 New version of ACEMD (cuda100) 0.983C + 1NV (d0) 99.91 02:30:20 (02:30:12) 04:40:43 53.000 Running tb85-nvidia initial_1911-ELISA_GSN4V1-9-100-RND1684_0 12/2/2019 11:52:22 AM JStateson
	GPUGRID 2.10 New version of ACEMD (cuda100) 0.983C + 1NV (d2) 99.89 02:30:19 (02:30:09) 05:28:30 45.000 Running tb85-nvidia initial_1243-ELISA_GSN4V1-1-100-RND2537_0 12/2/2019 1:44:26 PM JStateson

	All 3 tasks above have been running for about 2:30:19, having started within 3 seconds of each other. My guess is that the mining card will finish an hour ahead of the 1660Ti and 2 hours ahead of the 1070.
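For a rough check of the finish-time guess, the elapsed time and progress figures in the task list above can be extrapolated linearly. This is a sketch under two assumptions that the thread does not confirm: that the 57.000 / 53.000 / 45.000 column is percent complete, and that progress stays linear until the task ends. The function names are hypothetical.

```python
def to_seconds(hms):
    """Convert an "HH:MM:SS" string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def projected_total(elapsed_hms, progress_percent):
    """Projected total run time in seconds, assuming linear progress."""
    return to_seconds(elapsed_hms) / (progress_percent / 100.0)


# (elapsed, assumed percent complete) per device, copied from the task list.
tasks = {
    "d1 (P102-100)": ("02:30:22", 57.0),
    "d0 (1660Ti)": ("02:30:20", 53.0),
    "d2 (GTX-1070)": ("02:30:19", 45.0),
}

if __name__ == "__main__":
    baseline = projected_total(*tasks["d1 (P102-100)"])
    for name, (elapsed, percent) in tasks.items():
        total = projected_total(elapsed, percent)
        gap_minutes = (total - baseline) / 60.0
        print(f"{name}: projected {total / 3600.0:.2f} h, "
              f"{gap_minutes:.0f} min behind d1")
```

The projection is only as good as the linearity assumption; different work units (TONI versus ELISA tasks here) may also not be directly comparable.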
