Message boards : Number crunching : What is 195 (0xc3) EXIT_CHILD_FAILED

Author Message
Yeti
Joined: 20 Jul 08
Posts: 3
Credit: 1,123,210,586
RAC: 3,863,517
Level
Met
Message 56889 - Posted: 22 May 2021 | 8:04:28 UTC
Last modified: 22 May 2021 | 8:04:49 UTC

This task ran to the end, but on finishing it threw this error:

195 (0xc3) EXIT_CHILD_FAILED

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:55:20 (4072): wrapper (7.9.26016): starting
11:55:20 (4072): wrapper: running acemd3.exe (--boinc input --device 0)
Detected memory leaks!

http://www.gpugrid.net/result.php?resultid=32585300
____________


Supporting BOINC, a great concept !

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,493,857,483
RAC: 71,175,505
Level
Trp
Message 56891 - Posted: 22 May 2021 | 15:32:59 UTC - in response to Message 56889.

Don’t worry about the memory leaks message. It’s harmless, the Windows app always shows this.

The source of your failure is further down:

07:55:53 (6200): wrapper (7.9.26016): starting
07:55:53 (6200): wrapper: running acemd3.exe (--boinc input --device 0)
ACEMD failed:
Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
09:42:41 (6200): acemd3.exe exited; CPU time 6346.656250
09:42:41 (6200): app exit status: 0x1
09:42:41 (6200): called boinc_finish(195)


I’d suspect either something wrong with the card, or a problem with the drivers.
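
To pick the real cause out of a long stderr log, it can help to scan for the lines the wrapper prints instead of stopping at the 195 code, since 195 (0xc3) only says that the child process failed. Here is a minimal Python sketch, assuming the log format quoted above. The function name `summarize_stderr` is made up for illustration and is not part of BOINC or the wrapper.

```python
import re

# Patterns taken from the log lines quoted in this thread.
APP_STATUS = re.compile(r"app exit status: (0x[0-9a-fA-F]+)")
BOINC_FINISH = re.compile(r"called boinc_finish\((\d+)\)")


def summarize_stderr(text):
    """Return the child exit status, wrapper exit code and ACEMD error text."""
    summary = {"app_status": None, "wrapper_code": None, "acemd_error": None}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = APP_STATUS.search(line)
        if m:
            summary["app_status"] = int(m.group(1), 16)
        m = BOINC_FINISH.search(line)
        if m:
            summary["wrapper_code"] = int(m.group(1))
        if "ACEMD failed:" in line:
            # The message is either on the same line or on the next one.
            rest = line.split("ACEMD failed:", 1)[1].strip()
            if not rest and i + 1 < len(lines):
                rest = lines[i + 1].strip()
            summary["acemd_error"] = rest
    return summary


log = """07:55:53 (6200): wrapper: running acemd3.exe (--boinc input --device 0)
ACEMD failed:
Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
09:42:41 (6200): acemd3.exe exited; CPU time 6346.656250
09:42:41 (6200): app exit status: 0x1
09:42:41 (6200): called boinc_finish(195)"""

result = summarize_stderr(log)
print(result)
print(hex(result["wrapper_code"]))  # 195 decimal is 0xc3
```

The "ACEMD failed" text and the "app exit status" value are the parts worth quoting when asking for help; the 195 is the same for every failure of this kind.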
____________

mrchips
Joined: 9 May 21
Posts: 7
Credit: 859,893,000
RAC: 3,458,304
Level
Glu
Message 57389 - Posted: 29 Sep 2021 | 11:22:20 UTC

All 6 of my latest tasks finished with EXIT_CHILD_FAILED
WU ran for only a few seconds

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,636,851
RAC: 8,782,039
Level
Tyr
Message 57390 - Posted: 29 Sep 2021 | 12:39:48 UTC - in response to Message 57389.

mrchips wrote:
All 6 of my latest tasks finished with EXIT_CHILD_FAILED
WU ran for only a few seconds

In your case, the actual error number is
app exit status: 0xc0000135
which is the Windows status code STATUS_DLL_NOT_FOUND.

See the thread New version tasks failing on Windows hosts (in the GPU area), and specifically message 57353, where user jjch gives instructions for installing a missing Microsoft Visual C++ Redistributable.
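
For readers who want to decode such values themselves, here is a small Python sketch. It assumes the "app exit status" is a Windows NTSTATUS value whenever its top two bits are set. The lookup table holds only a few well-known codes, and the helper name `describe_exit_status` is hypothetical.

```python
# A few well-known Windows NTSTATUS values; this table is deliberately tiny.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def describe_exit_status(status):
    """Describe an 'app exit status' value from a wrapper log (hypothetical helper)."""
    status &= 0xFFFFFFFF  # treat the value as an unsigned 32-bit number
    name = NTSTATUS_NAMES.get(status)
    if name is not None:
        return f"0x{status:08x} {name}"
    if status >> 30 == 0b11:
        # NTSTATUS codes with both top bits set carry error severity.
        return f"0x{status:08x} unrecognised NTSTATUS error"
    return f"ordinary exit code {status}"


print(describe_exit_status(0xC0000135))
print(describe_exit_status(0x1))
```

A status of 0x1, as in the other logs in this thread, is an ordinary exit code from the application itself, so the explanation has to come from the "ACEMD failed" line instead.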

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Message 57409 - Posted: 1 Oct 2021 | 8:32:12 UTC
Last modified: 1 Oct 2021 | 8:35:47 UTC

I do have the same error reported on the first task that I got in months. However the app exit status says 0x1.

The corresponding task stderr file reports that

ACEMD failed: Particle coordinate is nan
06:44:59 (8516): bin/acemd3.exe exited; CPU time 27435.906250
06:44:59 (8516): app exit status: 0x1

What might have caused this?

This is my Host and that is said Task.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,636,851
RAC: 8,782,039
Level
Tyr
Message 57410 - Posted: 1 Oct 2021 | 8:35:46 UTC - in response to Message 57409.

Bad data in the task files. Unless your GPU is so superheated that it can't do maths any more, there's nothing you can do.
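
As background on why a single bad value is fatal: "nan" is the floating-point "not a number" value, and any arithmetic that touches it also yields NaN. The toy Python sketch below illustrates that behaviour. It is not ACEMD's code, and the coordinates and the function name `first_nan_particle` are invented for the example.

```python
import math


def first_nan_particle(coords):
    """Return the index of the first particle with a NaN coordinate, else None."""
    for index, xyz in enumerate(coords):
        if any(math.isnan(c) for c in xyz):
            return index
    return None


# Made-up coordinates: the second particle has one corrupted component.
coords = [(0.0, 1.0, 2.0), (1.5, float("nan"), 0.3), (4.0, 4.0, 4.0)]

centre_x = sum(p[0] for p in coords) / len(coords)
centre_y = sum(p[1] for p in coords) / len(coords)

print(first_nan_particle(coords))  # index of the bad particle
print(centre_x, centre_y)          # the y average is poisoned by the NaN
```

Because the bad value spreads into every later result, a simulation can only stop once it sees one, whether the NaN came from the input data or from a miscalculation on the hardware.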

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Message 57411 - Posted: 1 Oct 2021 | 8:37:47 UTC

All right! Thanks Richard for the blazing fast response. Card is running at a mild 53-55 degrees Celsius, so that is unlikely to be the culprit. Already feared that there's nothing I can do...

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 57423 - Posted: 2 Oct 2021 | 0:42:10 UTC - in response to Message 57411.

bozz4science wrote:
I do have the same error reported on the first task that I got in months. However the app exit status says 0x1.

The corresponding task stderr file reports that
ACEMD failed: Particle coordinate is nan
06:44:59 (8516): bin/acemd3.exe exited; CPU time 27435.906250
06:44:59 (8516): app exit status: 0x1

What might have caused this?

This is my Host and that is said Task.

Richard Haselgrove wrote:
Bad data in the task files. Unless your GPU is so superheated that it can't do maths any more, there's nothing you can do.

bozz4science wrote:
All right! Thanks Richard for the blazing fast response. Card is running at a mild 53-55 degrees Celsius, so that is unlikely to be the culprit. Already feared that there's nothing I can do...

If I consider the contents of your post in the hardware enthusiast corner, I would say that the reason for the NAN (Not A Number) error in your case is undervolting your GPU.

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Message 57485 - Posted: 5 Oct 2021 | 12:47:21 UTC - in response to Message 57423.
Last modified: 5 Oct 2021 | 12:48:16 UTC

I think you may have been right, Zoltan! I hadn't run any GPUGrid tasks for several months and just used the same OC settings that had worked with other projects in the past. I forgot that the tasks here are extremely demanding and power hungry, so much so that the power limit on my card (100W out of 125W) didn't cut it. Core clock boost behaviour was erratic, and because of the low power limit the voltage dropped quite severely. By the time you wrote your comment, I had already removed the power limit (PL) on my card to get better boost behaviour. With your explanation I now know that the voltage dropped so low (~910mV) that the computation became unstable. It is now running in a 1,000-1,025mV range, using up to 135W at times. Thanks for your valuable comment!
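
To check whether a card is sitting against its power limit while a task runs, the readings can be pulled from `nvidia-smi` and compared. The Python sketch below assumes an NVIDIA driver install that provides `nvidia-smi` and its `power.draw`, `power.limit`, `temperature.gpu` and `clocks.sm` query fields. The 97% margin and all function names are arbitrary choices for illustration, and core voltage is not among these fields, so this only watches the power side.

```python
import subprocess

QUERY = "power.draw,power.limit,temperature.gpu,clocks.sm"


def read_gpu_line(index=0):
    """Ask nvidia-smi for one CSV line; needs an NVIDIA driver to be installed."""
    return subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    ).strip()


def parse_gpu_line(line):
    """Parse 'draw, limit, temp, clock'; fails if a field is reported as N/A."""
    draw, limit, temp, clock = (float(v) for v in line.split(","))
    return {
        "power_draw_w": draw,
        "power_limit_w": limit,
        "temperature_c": temp,
        "sm_clock_mhz": clock,
    }


def power_capped(sample, margin=0.97):
    """True when the draw is within an (assumed) 3% of the configured limit."""
    return sample["power_draw_w"] >= margin * sample["power_limit_w"]


# Made-up readings for illustration; on a real host use read_gpu_line().
sample = parse_gpu_line("99.1, 100.00, 54, 1650")
print(sample)
print(power_capped(sample))
```

A card that reports itself as power capped for most of a run, with an erratic clock, matches the behaviour described above and is a hint to raise the limit before blaming the task data.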
