Message boards : Number crunching : Task failing after 3.669 seconds

Erich56
Joined: 1 Jan 15
Posts: 831
Credit: 3,470,292,479
RAC: 1,400,320
Message 57608 - Posted: 14 Oct 2021 | 16:50:07 UTC

Any idea why this task failed with "computation error" about 1 hour after it started?

https://www.gpugrid.net/result.php?resultid=32654746

Ian&Steve C.
Joined: 21 Feb 20
Posts: 546
Credit: 4,544,826,357
RAC: 7,498,809
Message 57613 - Posted: 14 Oct 2021 | 18:07:32 UTC - in response to Message 57608.

Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT

Erich56
Message 57615 - Posted: 14 Oct 2021 | 18:12:07 UTC - in response to Message 57613.

Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT

which is definitely wrong. At least if "client" refers to me personally, I certainly did NOT abort the WU.

Further, under "result" as well as under "client status" it says: "Berechnungsfehler", i.e. "computation error".

Ian&Steve C.
Message 57618 - Posted: 14 Oct 2021 | 19:36:43 UTC - in response to Message 57615.

don't get too hung up on the verbiage used by BOINC.

ANY kind of error, be it pre-computation, during-computation issues, manual aborts, automatic aborts, or even things like upload errors (after computation has completed), will be classified as "Computation Error". This is the same for all projects. It's just the generic wording BOINC uses when there's an error it can't resolve, and more detailed info is usually in the logs or stderr output.

since it failed with aborted by client I can only assume some kind of issue between BOINC and the app, and the BOINC client itself just killed the task.

(unknown error) - exit code 194 (0xc2)


since all you have is "unknown error" I don't think there's much to run down here.
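
For anyone who wants to read these codes off the stderr page without guessing, here is a minimal Python sketch that pulls the exit code out of the <message> line and checks that the decimal and hex forms agree. The lookup table is deliberately partial: 194 is the name quoted above, while the name for 195 (which shows up in later posts in this thread) is an assumption recalled from BOINC's error_numbers.h, so verify it against the BOINC source before relying on it. The function name is made up for this example.

```python
import re

# Partial table of BOINC exit codes. 194 is the name quoted in this thread;
# 195 (EXIT_CHILD_FAILED) is an assumption recalled from BOINC's error_numbers.h.
BOINC_EXIT_CODES = {
    194: "EXIT_ABORTED_BY_CLIENT",
    195: "EXIT_CHILD_FAILED",
}

# Matches "exit code 194 (0xc2)" and "process exited with code 195 (0xc3, -61)".
_EXIT_RE = re.compile(r"(?:exit code|exited with code)\s+(\d+)\s+\(0x([0-9a-fA-F]+)")


def describe_exit(message):
    """Return a short description of the exit code found in a BOINC message line."""
    match = _EXIT_RE.search(message)
    if match is None:
        return "no exit code found"
    code = int(match.group(1))
    hex_code = int(match.group(2), 16)
    if code != hex_code:
        return f"inconsistent exit code: {code} vs 0x{hex_code:x}"
    name = BOINC_EXIT_CODES.get(code, "not in this partial table")
    return f"{code} (0x{code:x}): {name}"


if __name__ == "__main__":
    print(describe_exit("(unknown error) - exit code 194 (0xc2)"))
```

The name only tells you who ended the task, not why, so the stderr text is still the place to look for the actual cause.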

Erich56
Message 57624 - Posted: 15 Oct 2021 | 4:50:30 UTC - in response to Message 57618.

...
since it failed with aborted by client I can only assume some kind of issue between BOINC and the app, and the BOINC client itself just killed the task.

(unknown error) - exit code 194 (0xc2)

since all you have is "unknown error" I don't think there's much to run down here.

in a way, I was lucky anyway that this happened after about 1 hour, and not, say, after 15 hours or so :-)

Erich56
Message 57695 - Posted: 30 Oct 2021 | 17:02:48 UTC

now, a task failed after about 16 hours, a few minutes before it would have finished:
https://www.gpugrid.net/result.php?resultid=32658707

very annoying, of course.

Can anyone tell me what was going wrong with this task?

Keith Myers
Joined: 13 Dec 17
Posts: 813
Credit: 1,104,449,831
RAC: 2,464,013
Message 57696 - Posted: 30 Oct 2021 | 19:11:06 UTC - in response to Message 57695.
Last modified: 30 Oct 2021 | 19:12:50 UTC

Detected memory leaks!

Error invoking kernel: CUDA_ERROR_UNKNOWN (999)


Probably an error in the VRAM on the card. Try reducing the card temp by moving the fan speed up or reducing any overclocking.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1249
Credit: 3,359,461,168
RAC: 1,583,723
Message 57698 - Posted: 30 Oct 2021 | 20:10:53 UTC - in response to Message 57696.

Detected memory leaks!

All Windows users have that report from the app, on perfectly good tasks. I wouldn't worry about that.

Error invoking kernel: CUDA_ERROR_UNKNOWN (999)

Isn't that what happens after a reboot, particularly after the NVidia driver has been updated by Microsoft / Windows 10?

Probably an error in the VRAM on the card. Try reducing the card temp by moving the fan speed up or reducing any overclocking.

I think we'd need more evidence before making a leap of interpretation like that.

Is the video card in question driving a monitor? If so, are there any problems with the visible display? Colour blocks, bad pixels, that sort of thing? Have you changed any operating parameters - overclocked? undervolted?

mmonnin
Joined: 2 Jul 16
Posts: 307
Credit: 1,249,802,726
RAC: 2,439,803
Message 57700 - Posted: 31 Oct 2021 | 16:46:22 UTC - in response to Message 57698.

Detected memory leaks!

All Windows users have that report from the app, on perfectly good tasks. I wouldn't worry about that.

Error invoking kernel: CUDA_ERROR_UNKNOWN (999)

Isn't that what happens after a reboot, particularly after the NVidia driver has been updated by Microsoft / Windows 10?

Probably an error in the VRAM on the card. Try reducing the card temp by moving the fan speed up or reducing any overclocking.

I think we'd need more evidence before making a leap of interpretation like that.

Is the video card in question driving a monitor? If so, are there any problems with the visible display? Colour blocks, bad pixels, that sort of thing? Have you changed any operating parameters - overclocked? undervolted?

Windows Update doesn't include OpenCL, so that's not an issue here at GPUGrid.

jiipee
Joined: 4 Jun 15
Posts: 19
Credit: 2,882,296,327
RAC: 924,627
Message 57709 - Posted: 2 Nov 2021 | 9:09:28 UTC

Is acemd3 for Windows broken? All tasks seem to be failing:


Stderr output

<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
03:57:10 (9116): wrapper (7.9.26016): starting
03:57:10 (9116): wrapper: running bin/acemd3.exe (--boinc --device 0)
03:57:12 (9116): bin/acemd3.exe exited; CPU time 0.000000
03:57:12 (9116): app exit status: 0xc0000135
03:57:12 (9116): called boinc_finish(195)
0 bytes in 0 Free Blocks.
456 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 120166 bytes.
Dumping objects ->
...
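
One detail in the log above may help narrow this down: the wrapper reports the app's own exit status separately from the BOINC code. A small Python sketch, assuming the wrapper's line format is exactly as shown in this stderr, pulls that value out and looks it up. 0xC0000135 is the standard Windows NTSTATUS value STATUS_DLL_NOT_FOUND, which would fit an executable that exits with zero CPU time, though this does not say which DLL is missing. The helper names and the one-entry table are made up for illustration.

```python
import re

# One known NTSTATUS value: 0xC0000135 is STATUS_DLL_NOT_FOUND on Windows
# (a required DLL could not be located when the process was started).
NTSTATUS_NAMES = {0xC0000135: "STATUS_DLL_NOT_FOUND"}

_STATUS_RE = re.compile(r"app exit status:\s*(0x[0-9a-fA-F]+)")


def app_exit_statuses(stderr_text):
    """Return every 'app exit status' value found in wrapper output, as integers."""
    return [int(m.group(1), 16) for m in _STATUS_RE.finditer(stderr_text)]


def explain(status):
    """Look up a status in the (deliberately tiny) table above."""
    return NTSTATUS_NAMES.get(status, "unknown to this sketch")


sample = """03:57:12 (9116): bin/acemd3.exe exited; CPU time 0.000000
03:57:12 (9116): app exit status: 0xc0000135
03:57:12 (9116): called boinc_finish(195)"""

for status in app_exit_statuses(sample):
    print(f"0x{status:08x} -> {explain(status)}")
```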

Erich56
Message 57710 - Posted: 2 Nov 2021 | 13:10:11 UTC - in response to Message 57709.

Is acemd3 for Windows broken? All tasks seem to be failing:

How come you are receiving tasks at all? There have not been any new ones available for several days.

Keith Myers
Message 57717 - Posted: 2 Nov 2021 | 16:52:10 UTC - in response to Message 57710.

Check your preferences. I have been getting work every day, as have others.

zooxit
Joined: 4 Jul 21
Posts: 14
Credit: 59,210,528
RAC: 1,871,341
Message 57791 - Posted: 10 Nov 2021 | 18:40:35 UTC

Hi,
so what is the answer to this thread's title question (well, actually it is a statement :) )?
Why are the Python apps failing after 2-4 seconds?
Should I install something on my machine (running Debian 11)?

Keith Myers
Message 57793 - Posted: 10 Nov 2021 | 19:07:39 UTC - in response to Message 57791.

The tasks are beta and the scientists are still debugging the configuration parameters. Errors are to be expected.

If you have a task error, look up the task in your Tasks list and see whether it has been sent to many other hosts that have also errored out. If so, everything is normal.

However, if a wingman has completed the task successfully, look at the stderr.txt output of your task, read to the end, and see what kind of error was generated. If the error is local, you might be able to do something about it by restarting the host or updating the video drivers.

Beyond that, there is nothing else you can or need to do, such as downloading libraries, because each task is bundled with exactly the resources it needs to complete successfully. Or at least in theory. Again, these are beta tasks and are still being debugged.
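
The check described above can be written down as a small decision function. This is only a sketch of the reasoning, not anything BOINC provides: the function name and the 'success' / 'error' labels are made up, and you would fill in the wingman outcomes by hand from the workunit page.

```python
from typing import Sequence


def triage(own_failed: bool, wingman_outcomes: Sequence[str]) -> str:
    """Suggest a next step from your own result and the other hosts' outcomes.

    wingman_outcomes holds one hypothetical label per other host:
    'success' or 'error'.
    """
    if not own_failed:
        return "nothing to do"
    if any(outcome == "success" for outcome in wingman_outcomes):
        # Someone else finished the same task, so the problem is probably local.
        return "likely local: read stderr to the end, consider restart or driver update"
    if wingman_outcomes and all(outcome == "error" for outcome in wingman_outcomes):
        # Everyone failed the same task, which is expected for beta work.
        return "likely a bad task: expected for beta work"
    return "undetermined: wait for more results"


if __name__ == "__main__":
    print(triage(True, ["error", "success"]))
```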

zooxit
Message 57807 - Posted: 11 Nov 2021 | 20:11:40 UTC

Thanks @KeithMyers for the tips.

I checked the last 20 tasks I got and all of them failed after 3 seconds. They were all 'solved' by another host shortly thereafter, so...

Everything is updated on my host. It is Debian bullseye, though; the computers that finished the tasks were, as far as I saw, mostly running Ubuntu 20.04 LTS. But that is probably not the cause.

My STDERR says:
INTERNAL ERROR: cannot create temporary directory!
Might that be a permissions problem?


----------------------------------------------
The full STDERR:

<core_client_version>7.16.16</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:26:46 (1648902): wrapper (7.7.26016): starting
14:26:46 (1648902): wrapper (7.7.26016): starting
14:26:46 (1648902): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[1648927] INTERNAL ERROR: cannot create temporary directory!
[1648931] INTERNAL ERROR: cannot create temporary directory!
14:26:47 (1648902): /usr/bin/flock exited; CPU time 0.139614
14:26:47 (1648902): app exit status: 0x1
14:26:47 (1648902): called boinc_finish(195)
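
A quick way to test the permissions theory is to try creating a temporary directory directly. The Python sketch below does only that, using the standard tempfile module; the function name and the script name mentioned afterwards are made up for this example.

```python
import os
import tempfile


def can_create_temp_dir(base=None):
    """Try to create and remove a temporary directory.

    base=None means the system default temp location (usually /tmp on Linux).
    Returns (True, parent_dir) on success or (False, error_text) on failure.
    """
    try:
        path = tempfile.mkdtemp(dir=base)
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    os.rmdir(path)  # clean up the empty directory we just created
    return True, os.path.dirname(path)


if __name__ == "__main__":
    ok, detail = can_create_temp_dir()
    print("temp dir OK in" if ok else "temp dir FAILED:", detail)
```

Saved as, say, check_tmp.py, it could be run as the boinc user with sudo -u boinc python3 check_tmp.py. Note the limits of this test: a run from a shell is not subject to the sandboxing options in the systemd unit, so it can succeed even though the same operation fails inside the service.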

ServicEnginIC
Joined: 24 Sep 10
Posts: 462
Credit: 2,137,165,494
RAC: 1,470,965
Message 57808 - Posted: 11 Nov 2021 | 21:08:14 UTC - in response to Message 57807.
Last modified: 11 Nov 2021 | 21:09:22 UTC

My STDERR says:
INTERNAL ERROR: cannot create temporary directory!
Might that be a permissions problem?

The same problem was discussed in Message #55986.
A workaround was detailed there; you may want to try it.

zooxit
Message 57811 - Posted: 12 Nov 2021 | 9:30:41 UTC

Thanks!
So, I tried:
sudo systemctl edit boinc-client.service
and added:
[Service]
PrivateTmp=true
then rebooted

Waiting for tasks now to see if it works...

ServicEnginIC
Message 57816 - Posted: 12 Nov 2021 | 18:44:17 UTC - in response to Message 57811.

All right.
If that was it, you're done.
Now it's time to patiently wait for new Python WUs...

Keith Myers
Message 57820 - Posted: 12 Nov 2021 | 20:10:34 UTC

The bug where all tasks always ran on Device#0 was fixed this morning.
Should be smooth sailing from now on for python tasks.

ServicEnginIC
Message 57822 - Posted: 13 Nov 2021 | 0:19:07 UTC - in response to Message 57811.

If it is still failing, please double-check that your boinc-client.service looks similar to this:

After adding the stated lines, it is necessary to save the changes with Ctrl + O, confirm with Enter, exit with Ctrl + X, and then reboot.
(Excuse the menus being shown in Spanish :-)

zooxit
Message 57823 - Posted: 13 Nov 2021 | 10:29:55 UTC

Thanks for the help.
I don't know why (and I don't know whether boinc-client.service is case-sensitive), but I had mistyped:
PrivateTMP=true

So I corrected it to PrivateTmp=true and am now waiting for new tasks.


-----------------------------------------------
My complete boinc-client.service is like this (should I uncomment or add something else?):

### Editing /etc/systemd/system/boinc-client.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file

[Service]
PrivateTmp=true

### Lines below this comment will be discarded

### /lib/systemd/system/boinc-client.service
# [Unit]
# Description=Berkeley Open Infrastructure Network Computing Client
# Documentation=man:boinc(1)
# After=network-online.target
#
# [Service]
# Type=simple
# ProtectHome=true
# ProtectSystem=strict
# ProtectControlGroups=true
# ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
# Nice=10
# User=boinc
# WorkingDirectory=/var/lib/boinc
# ExecStart=/usr/bin/boinc
# ExecStop=/usr/bin/boinccmd --quit
# ExecReload=/usr/bin/boinccmd --read_cc_config
# ExecStopPost=/bin/rm -f lockfile
# IOSchedulingClass=idle
# # The following options prevent setuid root as they imply NoNewPrivileges=true
# # Since Atlas requires setuid root, they break Atlas
# # In order to improve security, if you're not using Atlas,
# # Add these options to the [Service] section of an override file using
# # sudo systemctl edit boinc-client.service
# #NoNewPrivileges=true
# #ProtectKernelModules=true
# #ProtectKernelTunables=true
# #RestrictRealtime=true
# #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# #RestrictNamespaces=true
# #PrivateUsers=true
# #CapabilityBoundingSet=
# #MemoryDenyWriteExecute=true
# #PrivateTmp=true #Block X11 idle detection
#
# [Install]
# WantedBy=multi-user.target

ServicEnginIC
Message 57825 - Posted: 13 Nov 2021 | 10:56:58 UTC - in response to Message 57823.
Last modified: 13 Nov 2021 | 10:59:31 UTC

As a general rule of thumb, everything in Linux is case-sensitive.
It should be right as it is now.
Upcoming tasks will eventually confirm it.
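
For anyone who wants to check an override file for this kind of typo without waiting for tasks, here is a minimal Python sketch. It uses the standard configparser module with case preserved, on the assumption that the override is simple enough to be read as INI-style text (real unit files allow things configparser does not fully model). The function names are made up for illustration. As far as I know, systemd simply ignores a directive name it does not recognise, such as PrivateTMP, and logs a warning, which is why the typo had no visible effect.

```python
import configparser


def unit_keys(text):
    """Parse unit-file text into {section: {directive: value}}, keeping case."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep directive names exactly as written
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def has_private_tmp(text):
    """True only if [Service] contains the exact directive name PrivateTmp, enabled."""
    service = unit_keys(text).get("Service", {})
    return service.get("PrivateTmp", "").lower() in {"true", "yes", "on", "1"}


good = "[Service]\nPrivateTmp=true\n"
typo = "[Service]\nPrivateTMP=true\n"

if __name__ == "__main__":
    print("correct spelling detected:", has_private_tmp(good))
    print("typo detected as enabled: ", has_private_tmp(typo))
```

The path to feed it would be the override shown in the previous post, /etc/systemd/system/boinc-client.service.d/override.conf.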
