Message boards : Number crunching : Error While Computing

Author Message
jkdma
Joined: 21 Mar 20
Posts: 6
Credit: 53,007,324
RAC: 0
Message 58656 - Posted: 16 Apr 2022 | 17:02:55 UTC

The vast majority of the units my computer completes have been reported as 'Error While Computing'. This has been going on for a few months. For a while a few weeks ago, the units seemed to be much smaller and took only a few hours to complete. These seemed to be validated much more often than the large units that take a couple of days of crunching.

Is there a larger reason for this or is it a problem with my machine?

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,806,959
RAC: 6,468,484
Message 58657 - Posted: 17 Apr 2022 | 1:31:09 UTC

The acemd4 and python tasks are still being debugged by the admin developers.

So there are still lots of errors, and nothing is wrong with your host.

The acemd3 tasks have been stable for over a year. So they should validate on everyone's hardware.

Only investigate your hardware if the errors are with this type of task.

jkdma
Joined: 21 Mar 20
Posts: 6
Credit: 53,007,324
RAC: 0
Message 58658 - Posted: 17 Apr 2022 | 4:52:42 UTC - in response to Message 58657.
Last modified: 17 Apr 2022 | 5:30:24 UTC

I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end??

Erich56
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 18,783,925
Message 58659 - Posted: 17 Apr 2022 | 5:40:09 UTC - in response to Message 58658.
Last modified: 17 Apr 2022 | 5:40:25 UTC

What could cause this on my end??

Do you overclock your GPU? What's the temperature of the GPU?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,607,486,851
RAC: 8,616,181
Message 58660 - Posted: 17 Apr 2022 | 7:50:16 UTC - in response to Message 58657.

The acemd3 tasks have been stable for over a year.

And one of mine has just crashed on a normally stable computer.

Result 32884789: exit code 0, "Incorrect function", after 5 seconds.

The acemd3 application normally has a usage lifetime of around a year before it needs a software licence renewal. Are we reaching that time again? Shouldn't be - it was last refreshed on 10 Nov 2021.

Greg _BE
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 166,633
Message 58661 - Posted: 17 Apr 2022 | 7:56:05 UTC

Just to piggyback on this thread with something else...

Regarding run time vs. CPU time: over the course of working an ACEMD3 task (2+ days of run time, still far from the deadline), I am seeing a two-hour difference between the two. I have never seen that on my other projects. Everything is running OK, but is a two-hour gap normal?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,707,082,483
RAC: 79,561,535
Message 58662 - Posted: 17 Apr 2022 | 12:49:41 UTC - in response to Message 58661.

Just to piggyback on this thread with something else...

Regarding run time vs. CPU time: over the course of working an ACEMD3 task (2+ days of run time, still far from the deadline), I am seeing a two-hour difference between the two. I have never seen that on my other projects. Everything is running OK, but is a two-hour gap normal?


I would say no, that's not normal. I'm going to guess that you're also running the CPU at 100% utilization on some CPU project? That's probably the reason: you're starving the GPU task of CPU resources.

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,806,959
RAC: 6,468,484
Message 58663 - Posted: 17 Apr 2022 | 16:23:27 UTC - in response to Message 58658.

I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end??

Looking at your error:

08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!

You are having issues with either a hot GPU, a hot CPU, or flaky memory.

These are the typical issues that cause memory errors.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,707,082,483
RAC: 79,561,535
Message 58664 - Posted: 17 Apr 2022 | 16:32:21 UTC - in response to Message 58663.
Last modified: 17 Apr 2022 | 16:35:47 UTC

I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end??

Looking at your error:

08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!

You are having issues with either a hot GPU, a hot CPU, or flaky memory.

These are the typical issues that cause memory errors.


You quoted the wrong issue. "Detected memory leaks" is ubiquitous in the Windows ACEMD3 app. Even successful runs show that message. It’s benign and not indicative of any problem.

His real issue is here:

01:56:42 (4340): wrapper: running bin/acemd3.exe (--boinc --device 0)
ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\trajectory.cpp line 103: Cannot open XTC file
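
As a rough illustration of the triage described above, the following Python sketch scans a task's stderr text, skips the benign "Detected memory leaks!" notice, and returns only the lines that begin with "ERROR:". The helper name `find_real_errors` and the choice to match on the "ERROR:" prefix are assumptions made for this example; they are not part of BOINC or ACEMD3. The sample text is assembled from the log lines quoted in this thread.

```python
def find_real_errors(stderr_text):
    """Return lines starting with 'ERROR:' (hypothetical helper, not a BOINC API)."""
    errors = []
    for line in stderr_text.splitlines():
        stripped = line.strip()
        # "Detected memory leaks!" also appears on successful Windows runs,
        # so it is skipped rather than treated as a failure.
        if stripped.startswith("Detected memory leaks!"):
            continue
        if stripped.startswith("ERROR:"):
            errors.append(stripped)
    return errors


# Sample stderr assembled from lines quoted in this thread.
sample_log = "\n".join([
    "01:56:42 (4340): wrapper: running bin/acemd3.exe (--boinc --device 0)",
    "Detected memory leaks!",
    r"ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\trajectory.cpp line 103: Cannot open XTC file",
])

for err in find_real_errors(sample_log):
    print(err)
```

Run against the sample, this prints only the "Cannot open XTC file" line, which is the message worth reporting.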


Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,806,959
RAC: 6,468,484
Message 58665 - Posted: 17 Apr 2022 | 18:39:57 UTC

Thanks for the correction. I wasn't aware that memory leaks are a common problem on Windows hosts.

jkdma
Joined: 21 Mar 20
Posts: 6
Credit: 53,007,324
RAC: 0
Message 58667 - Posted: 18 Apr 2022 | 1:52:19 UTC - in response to Message 58664.

Ok, since Keith Myers quoted me, are you saying I have a different problem on my end or there is no problem on my end?

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,806,959
RAC: 6,468,484
Message 58668 - Posted: 18 Apr 2022 | 5:25:53 UTC - in response to Message 58667.

You had a problem with the task configuration. Server issue.

Not your hardware issue after all.

jkdma
Joined: 21 Mar 20
Posts: 6
Credit: 53,007,324
RAC: 0
Message 58670 - Posted: 18 Apr 2022 | 16:16:30 UTC - in response to Message 58668.

Thanks. Incidentally, I'm also getting 'error while computing' issues on Rosetta@Home units...

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,806,959
RAC: 6,468,484
Message 58671 - Posted: 18 Apr 2022 | 17:10:14 UTC - in response to Message 58670.

Then something is wrong with your Python environment, I guess. Rosetta is doing Python tasks also, I believe.

But there is still nothing wrong on your end. It is up to the project to package all the Python bits necessary to crunch the task and send it to you properly.

Greg _BE
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 166,633
Message 58697 - Posted: 21 Apr 2022 | 17:39:46 UTC
Last modified: 21 Apr 2022 | 17:40:13 UTC

ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!

http://www.gpugrid.net/result.php?resultid=32884878

ACEMD3 task
195 (0xc3) EXIT_CHILD_FAILED

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,707,082,483
RAC: 79,561,535
Message 58698 - Posted: 21 Apr 2022 | 22:34:28 UTC - in response to Message 58697.

ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!

http://www.gpugrid.net/result.php?resultid=32884878

ACEMD3 task
195 (0xc3) EXIT_CHILD_FAILED


This is a well-known issue. You can’t restart the task on a different GPU. Basically, you can’t interrupt a running task at all.

Greg _BE
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 166,633
Message 58702 - Posted: 22 Apr 2022 | 18:00:44 UTC - in response to Message 58698.

ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!

http://www.gpugrid.net/result.php?resultid=32884878

ACEMD3 task
195 (0xc3) EXIT_CHILD_FAILED


This is a well-known issue. You can’t restart the task on a different GPU. Basically, you can’t interrupt a running task at all.



I have had the same error. I suspend and shut down the client and exit via the menu at the end of my computing day. The next morning I start up again and the task resumes on the same GPU. But half a day or a full day later, it crashes.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,707,082,483
RAC: 79,561,535
Message 58703 - Posted: 22 Apr 2022 | 18:03:57 UTC - in response to Message 58702.

ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!

http://www.gpugrid.net/result.php?resultid=32884878

ACEMD3 task
195 (0xc3) EXIT_CHILD_FAILED


This is a well-known issue. You can’t restart the task on a different GPU. Basically, you can’t interrupt a running task at all.



I have had the same error. I suspend and shut down the client and exit via the menu at the end of my computing day. The next morning I start up again and the task resumes on the same GPU. But half a day or a full day later, it crashes.


You can see in your task log that it actually restarted on a different GPU. That's why it failed.

08:01:06 (9168): wrapper (7.9.26016): starting
08:01:06 (9168): wrapper: running bin/acemd3.exe (--boinc --device 1)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {389760} normal block at 0x000001DBD2145AF0, 8 bytes long.
Data: < > 00 00 CC D3 DB 01 00 00
..\lib\diagnostics_win.cpp(417) : {388486} normal block at 0x000001DBD2175230, 1080 bytes long.
Data: <8 $ > 38 1E 00 00 CD CD CD CD 24 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {153} normal block at 0x000001DBD2150860, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Object dump complete.
08:37:06 (15692): wrapper (7.9.26016): starting
08:37:06 (15692): wrapper: running bin/acemd3.exe (--boinc --device 0)
ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!


It started on device 1, then the final restart happened on device 0.

I would recommend not restarting your computer until the GPUGRID task finishes. I've even seen this issue happen after restarting on the same GPU following something like a driver update. Just don't interrupt the task at all.
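
The check above (comparing the `--device` value across wrapper starts) can be sketched in Python. This is only an illustration under assumptions: the names `DEVICE_PATTERN`, `devices_used`, and `switched_device` are made up for this example, and the pattern assumes wrapper lines look like the ones quoted in this thread.

```python
import re

# Matches the device index in wrapper lines such as
# "wrapper: running bin/acemd3.exe (--boinc --device 1)".
# The line format is assumed from the log excerpts in this thread.
DEVICE_PATTERN = re.compile(r"wrapper: running .*--device (\d+)")


def devices_used(stderr_text):
    """Return the device index of each wrapper start, in log order."""
    return [int(m.group(1)) for m in DEVICE_PATTERN.finditer(stderr_text)]


def switched_device(stderr_text):
    """True if the task was started on more than one distinct device."""
    return len(set(devices_used(stderr_text))) > 1


# The two wrapper start lines from the task log quoted above.
sample_log = """\
08:01:06 (9168): wrapper: running bin/acemd3.exe (--boinc --device 1)
08:37:06 (15692): wrapper: running bin/acemd3.exe (--boinc --device 0)
"""

print(devices_used(sample_log))     # [1, 0]
print(switched_device(sample_log))  # True
```

A result of `True` only says the device index changed between starts. It does not cover the case mentioned above where a restart on the same GPU fails after something like a driver update.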
