Message boards : Number crunching : All Gerard WUs erroring

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42537 - Posted: 2 Jan 2016 | 20:14:14 UTC

Hi,

I'm seeing this happen with the most recently downloaded units; my wingmen also get the same error:

"process exited with code 212 (0xd4, -44)"

Not sure, but it could be only for Linux WUs.

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42538 - Posted: 2 Jan 2016 | 23:41:03 UTC - in response to Message 42537.

Also for Windows. Error message:

"(unknown error) - exit code -97 (0xffffff9f)"
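
For anyone decoding these messages: the numbers in parentheses appear to be the same exit status written in different ways (decimal, hex, and a signed reinterpretation). Below is a minimal Python sketch of that relationship. The helper name `as_signed` and the choice of 8-bit and 32-bit widths are assumptions inferred from the values quoted above, not something taken from the BOINC source.

```python
def as_signed(value: int, bits: int) -> int:
    """Reinterpret the low `bits` bits of value as a two's-complement signed integer."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


# Linux message: "process exited with code 212 (0xd4, -44)"
# 212 read as a signed 8-bit value is -44.
print(hex(212), as_signed(212, 8))  # prints: 0xd4 -44

# Windows message: "exit code -97 (0xffffff9f)"
# -97 stored in 32 bits is 0xffffff9f.
print(hex(-97 & 0xFFFFFFFF), as_signed(0xFFFFFF9F, 32))  # prints: 0xffffff9f -97
```

This only shows that the quoted numbers are consistent with each other; it says nothing about what the codes mean to the application.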

Bedrich Hajek
Joined: 28 Mar 09
Posts: 345
Credit: 4,248,187,009
RAC: 747,961
Message 42539 - Posted: 3 Jan 2016 | 1:23:18 UTC - in response to Message 42537.
Last modified: 3 Jan 2016 | 1:24:23 UTC

Hi,

I'm seeing this happening with the last downloaded units, wingmen also have the same error

"process exited with code 212 (0xd4, -44)"

Not sure but it could be only for linux WUs



Yes, there seems to be a batch of WUs that are failing on previously reliable Linux machines and on some mostly bad Windows hosts, but they are running fine on my Windows computers. One has already completed successfully at this time.

See links:

https://www.gpugrid.net/workunit.php?wuid=11397999


https://www.gpugrid.net/workunit.php?wuid=11398213


https://www.gpugrid.net/workunit.php?wuid=11398820


https://www.gpugrid.net/workunit.php?wuid=11398294

Max Ringler
Joined: 27 Apr 15
Posts: 2
Credit: 89,794,758
RAC: 0
Message 42540 - Posted: 3 Jan 2016 | 9:15:30 UTC

On my Windows 7 machine (i7-3770, GTX 980) I have just had ~10 GERARD WUs (with more in the queue and still coming in) that were running at less than 1% GPU usage (according to GPU-Z), while the progress in the BOINC Manager appeared to be normal or a little slow (~15 hours estimated per WU). All these WUs suddenly disappeared from the BOINC Manager without any error message, and without showing up among my results in my GPUGRID stats. Certainly there is something flawed with these WUs!

Max Ringler
Joined: 27 Apr 15
Posts: 2
Credit: 89,794,758
RAC: 0
Message 42541 - Posted: 3 Jan 2016 | 9:23:45 UTC - in response to Message 42540.

I missed the other WUs, but right now this happened to the WU:

e14s27_e9s23p1f368-GERARD_CXCL12_DIM_HEP_GLYCAM-0-1-RND5008

This WU was running at <1% GPU usage but at close to normal progress speed; however, it was restarting every ~10 hours or so. I have now cancelled this WU, and the next one in my queue seems to work normally again (e13s16_e8s26p11f203-GERARD_CXCL12_DIMPROTO3-0-1-RND2849; estimated time ~12 hours, 82% GPU usage).

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1960
Credit: 12,622,269,019
RAC: 6,563,983
Message 42542 - Posted: 3 Jan 2016 | 10:16:21 UTC

I have both kinds of these WUs:

1. Erroring on all hosts, including mine.
https://www.gpugrid.net/workunit.php?wuid=11396918
https://www.gpugrid.net/workunit.php?wuid=11396911

2. Erroring on all hosts, except on mine:
https://www.gpugrid.net/workunit.php?wuid=11397526
https://www.gpugrid.net/workunit.php?wuid=11398513
https://www.gpugrid.net/workunit.php?wuid=11397102
https://www.gpugrid.net/workunit.php?wuid=11397012
https://www.gpugrid.net/workunit.php?wuid=11398161
https://www.gpugrid.net/workunit.php?wuid=11398515
https://www.gpugrid.net/workunit.php?wuid=11396116
https://www.gpugrid.net/workunit.php?wuid=11398187

Bedrich Hajek
Joined: 28 Mar 09
Posts: 345
Credit: 4,248,187,009
RAC: 747,961
Message 42546 - Posted: 3 Jan 2016 | 12:15:47 UTC - in response to Message 42542.
Last modified: 3 Jan 2016 | 12:25:56 UTC

I have both kind of these WUs:

1. Erroring on all hosts, including mine.
https://www.gpugrid.net/workunit.php?wuid=11396918
https://www.gpugrid.net/workunit.php?wuid=11396911

2. Erroring on all hosts, except on mine:
https://www.gpugrid.net/workunit.php?wuid=11397526
https://www.gpugrid.net/workunit.php?wuid=11398513
https://www.gpugrid.net/workunit.php?wuid=11397102
https://www.gpugrid.net/workunit.php?wuid=11397012
https://www.gpugrid.net/workunit.php?wuid=11398161
https://www.gpugrid.net/workunit.php?wuid=11398515
https://www.gpugrid.net/workunit.php?wuid=11396116
https://www.gpugrid.net/workunit.php?wuid=11398187


So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem.

For the record, I have had 2 errors since the new year. All WUs on my machines are currently running okay, and I hope it stays that way! So I would say that I ran into 2 bad WUs.

ServicEnginIC
Joined: 24 Sep 10
Posts: 3
Credit: 915,817,191
RAC: 1,315,283
Message 42547 - Posted: 3 Jan 2016 | 12:16:31 UTC

I've found the same behavior on my Linux hosts, in WUs received since Jan-02-2016 past midday.
Consequently, the statistics are getting worse, possibly due to those failing Linux WUs...
This can be seen at the bottom of the "Server status" page.

https://www.gpugrid.net/server_status.php

On Jan-02-2016 at 22:41 UTC, the average error rate over the 25 kinds of WUs in progress was 20.9952%.
This had increased to 25.7552% by 11:44 UTC on Jan-03-2016.

____________

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1960
Credit: 12,622,269,019
RAC: 6,563,983
Message 42549 - Posted: 3 Jan 2016 | 14:07:23 UTC - in response to Message 42546.

So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem.

I have had four errors recently. That's a bit more than usual. The two aborted WUs are my fault.

Jim1348
Joined: 28 Jul 12
Posts: 641
Credit: 1,207,321,389
RAC: 81,982
Message 42550 - Posted: 3 Jan 2016 | 15:00:26 UTC

I haven't seen the problem yet on a pair of GTX 960s.
https://www.gpugrid.net/results.php?hostid=194224&offset=0&show_names=0&state=0&appid=

I had originally boosted the P2 memory clock as per ETA's suggestion (https://einstein.phys.uwm.edu/forum_thread.php?id=11044), but saw a few "simulation unstable" messages, though I don't think they led to actual errors at that point. But that was a little too close to the edge for me, so I removed that boost and the cards are back to factory default, which is not much of an overclock on these MSI 2GD5T OC cards. Maybe that keeps them stable on the most difficult work units.

northcup
Joined: 29 Dec 15
Posts: 1
Credit: 135,300
RAC: 0
Message 42551 - Posted: 3 Jan 2016 | 16:55:40 UTC

Task      Work unit  Computer  Sent                        Time reported              Status                   Run time (sec)  CPU time (sec)  Credit      Application
14814161  11399908   286919    3 Jan 2016 | 16:38:22 UTC   3 Jan 2016 | 16:39:05 UTC  Error while computing    0.00            0.00            ---         Long runs
14814079  11399366   286919    3 Jan 2016 | 16:16:38 UTC   3 Jan 2016 | 16:32:43 UTC  Error while computing    0.00            0.00            ---
14813534  11399465   286919    3 Jan 2016 | 13:04:05 UTC   3 Jan 2016 | 13:06:03 UTC  Error while computing    0.00            0.00            ---
14801182  11384321   286919    29 Dec 2015 | 20:21:34 UTC  1 Jan 2016 | 9:50:05 UTC   Completed and validated  212,450.23      4,110.21        135,300.00  Long runs

Same problem here, with one valid run from December last year. Greetings, Klaus

Rion Family
Joined: 13 Jan 14
Posts: 20
Credit: 10,047,579,615
RAC: 15,224,043
Message 42553 - Posted: 3 Jan 2016 | 17:52:02 UTC
Last modified: 3 Jan 2016 | 17:53:18 UTC

I have seen the same thing on my Linux host: all work units since the one below error out the same way.

Stderr output
<core_client_version>7.3.15</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>

</stderr_txt>
]]>

14811283 11398815 176528 3 Jan 2016 | 0:28:00 UTC 3 Jan 2016 | 1:08:39 UTC Error while computing 0.00 0.00 --- Long runs (8-12 hours on fastest card) v8.46 (cuda65)

opr
Joined: 24 May 11
Posts: 7
Credit: 81,450,687
RAC: 122,641
Message 42554 - Posted: 3 Jan 2016 | 19:04:13 UTC

Hello, I'm using Ubuntu 14.04 LTS. The Gerard WUs stopped after 1 second and were uploaded, with "Output file was absent" reported for four files each time. I ran some Collatz Conjecture work earlier today, but I guess that didn't mess up my computer, as others are having problems too.

opr

Jacob Klein
Joined: 11 Oct 08
Posts: 1093
Credit: 1,429,750,839
RAC: 1,080,631
Message 42556 - Posted: 3 Jan 2016 | 20:26:43 UTC

Not sure if it's related, but I too just had an error with a Gerard unit, which is a rare thing to happen for me.

http://www.gpugrid.net/workunit.php?wuid=11389493
Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT
(unknown error) - exit code 194 (0xc2)

Name e3s31_e2s25p1f424-GERARD_CXCL12_CHALC4_DIM1-0-1-RND7047_1
Workunit 11389493
Created 1 Jan 2016 | 22:45:42 UTC
Sent 1 Jan 2016 | 22:45:48 UTC
Received 3 Jan 2016 | 11:06:22 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT
Computer ID 153764
Report deadline 6 Jan 2016 | 22:45:48 UTC
Run time 80,101.12
CPU time 11,903.64
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.47 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 194 (0xc2)
</message>

Stroppy
Joined: 10 Feb 09
Posts: 4
Credit: 1,973,220,650
RAC: 759,309
Message 42561 - Posted: 4 Jan 2016 | 18:09:44 UTC

Since 16:48 UTC on the second of January, my Linux host (206986) has failed every WU it has received. My 2 Windows hosts are working as usual. A quick look through the task lists for the top 10 users shows the same pattern. Has anyone come up with a theory as to what is happening? In the meantime I have set that host to NNT (No New Tasks) to avoid causing any congestion on the server side.

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42562 - Posted: 4 Jan 2016 | 18:29:57 UTC

This issue continues to occur on all my hosts (Linux).

My guess is that the administrators are still on holiday. No complaint, they deserve it.

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 42563 - Posted: 4 Jan 2016 | 23:13:51 UTC - in response to Message 42562.

The Linux app binary has expired and needs to be updated. I'll get that done tomorrow, hopefully.

Stoneageman
Joined: 25 May 09
Posts: 211
Credit: 14,056,195,788
RAC: 3,314,766
Message 42565 - Posted: 5 Jan 2016 | 10:46:48 UTC

Thanks, Matt. Hope the update will improve its performance.

God is Love, JC proves it...
Joined: 24 Nov 11
Posts: 21
Credit: 112,611,152
RAC: 160,835
Message 42566 - Posted: 7 Jan 2016 | 0:05:06 UTC

WU e15s19_e14s24p1f286-GERARD_CXCL12_DIMPROTO3-0-1-RND3500_2 has been stuck at '85% "progress" ' for some 12 hours now.
I only have a 640, so WUs take 40-60 hours generally.
This task has already run for 69:58.
Is this part of a defective batch?
How many more hours should I sacrifice for this WU?
I am presuming that if I abort it, there will be zero credit for these 70 hours (even if it is a flawed WU?)

I run Win 7 on my HP-1120, i7-2600. (I am NOT going to 'upgrade' to Win 10 for months, until (I hope) MS gets all the garbage in Win8-10 patched up.)

Please advise.

Meanwhile I have paused it and am putting my GPU to better use.

Thanks.

____________
I think ∴ I THINK I am
My thinking neither is the source of my being
NOR proves it to you
God Is Love, Jesus proves it! ∴ we are

Jacob Klein
Joined: 11 Oct 08
Posts: 1093
Credit: 1,429,750,839
RAC: 1,080,631
Message 42567 - Posted: 7 Jan 2016 | 2:52:09 UTC - in response to Message 42566.

I'd suggest restarting the PC. And if the problem still persists, then abort the task.

Stephen Farrell
Joined: 3 Nov 14
Posts: 10
Credit: 57,322,675
RAC: 0
Message 42573 - Posted: 8 Jan 2016 | 11:26:53 UTC

Hi,

I was wondering if others are still having this problem as the issue still persists on both my Linux boxes.

captainjack
Joined: 9 May 13
Posts: 146
Credit: 1,008,101,213
RAC: 1,238,180
Message 42574 - Posted: 8 Jan 2016 | 13:31:18 UTC

Yep, the GPUGRID tasks are still not processing on my Linux boxes.

But my backup project is getting quite a bit of work done.

Stephen Farrell
Joined: 3 Nov 14
Posts: 10
Credit: 57,322,675
RAC: 0
Message 42576 - Posted: 8 Jan 2016 | 16:09:52 UTC

Okay, thanks for the update. I guess I'll just add a backup project myself until the issue is resolved.

Cheers.

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42577 - Posted: 8 Jan 2016 | 19:22:01 UTC

Same here, no joy for Linux hosts five days in a row. We don't seem to be worth anything to the project.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42583 - Posted: 9 Jan 2016 | 14:30:56 UTC
Last modified: 9 Jan 2016 | 14:31:37 UTC

The tasks were doing OK on my XP box. I moved the cards to a Win7 box and now they all error out in 2 seconds. Looks like moving the cards was a mistake but I can't move them back ATM.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42585 - Posted: 9 Jan 2016 | 20:32:11 UTC
Last modified: 9 Jan 2016 | 20:38:43 UTC

I managed to download 1 task that didn't error out in 2 seconds. *fingers crossed*

Still having issues getting tasks to download all the files needed to run. From the event log:

4680 GPUGRID 1/9/2016 3:25:00 PM Temporarily failed download of e17s19_e13s27p1f405-GERARD_CXCL12_CHALC2_MON1-0-pdb_file: transient HTTP error

After 5 attempts and 30 minutes the last file did finally download.
EDIT: Second task downloaded and is running. Stay tuned.

microchip
Joined: 4 Sep 11
Posts: 107
Credit: 232,638,839
RAC: 115,620
Message 42587 - Posted: 10 Jan 2016 | 15:12:24 UTC

Same here on Linux. All WUs error out, even after a reset of the project.
____________

Team Belgium
The Cyberpunk Movies Database

Stephen Farrell
Joined: 3 Nov 14
Posts: 10
Credit: 57,322,675
RAC: 0
Message 42591 - Posted: 12 Jan 2016 | 11:13:57 UTC - in response to Message 42585.

Hi nanoprobe,

did you successfully complete the work unit that started?

bormolino
Joined: 16 May 13
Posts: 28
Credit: 26,277,333
RAC: 40,076
Message 42597 - Posted: 13 Jan 2016 | 13:31:53 UTC

Still not running under Linux ...

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42598 - Posted: 13 Jan 2016 | 19:38:08 UTC - in response to Message 42591.

Hi nanoprobe,

did you successfully complete the work unit that started?



Yes. It had previously errored out on a Linux machine with 0 runtime and a Windows machine after about 60 minutes of run time. I have also received 6 more since that one that have completed and currently have 2 more in progress. For me all the version 8.4.1 tasks error out. Version 8.4.7 tasks seem to run fine with only an occasional error and unfortunately they run for hours before they go south.

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42600 - Posted: 13 Jan 2016 | 20:29:36 UTC

One day more without Linux crunching and without status info...who cares?

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42602 - Posted: 14 Jan 2016 | 1:36:14 UTC - in response to Message 42600.

One day more without Linux crunching and without status info...who cares?


Someone didn't get their nap today.

Bikermatt
Joined: 8 Apr 10
Posts: 37
Credit: 2,537,167,648
RAC: 4,114,644
Message 42615 - Posted: 15 Jan 2016 | 2:01:56 UTC - in response to Message 42602.

One day more without Linux crunching and without status info...who cares?


Someone didn't get their nap today.


No, don't be a jerk. This has been a known problem with a known cause for a week now and no one has bothered to fix it.

For many years there was a significant performance boost when crunching with Linux at this project. The developers actually recommended that you crunch with Linux. Many of us have dedicated Linux hosts to this project due to that fact. Now my Linux hosts are having to crunch mathematics crap and look for pulsars to keep my house warm.

Could someone please fix this?

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42616 - Posted: 15 Jan 2016 | 4:32:33 UTC - in response to Message 42615.

One day more without Linux crunching and without status info...who cares?


Someone didn't get their nap today.


No, don't be a jerk. This has been a known problem with a known cause for a week now and no one has bothered to fix it.

For many years there was a significant performance boost when crunching with Linux at this project. The developers actually recommended that you crunch with Linux. Many of us have dedicated Linux hosts to this project due to that fact. Now my Linux hosts are having to crunch mathematics crap and look for pulsars to keep my house warm.

Could someone please fix this?

No nap and lost your sense of humor? Go look in a mirror and take a chill pill man. This ain't life or death and GPUGrid doesn't revolve around you.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42617 - Posted: 15 Jan 2016 | 4:33:04 UTC - in response to Message 42615.
Last modified: 15 Jan 2016 | 4:35:18 UTC

That was weird. Triple post.?????

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42618 - Posted: 15 Jan 2016 | 4:33:21 UTC - in response to Message 42615.
Last modified: 15 Jan 2016 | 4:36:53 UTC

Can't explain the triple post.

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42619 - Posted: 15 Jan 2016 | 5:19:04 UTC - in response to Message 42618.

Can't explain the triple post.


You missed your nap? :)

Gerard
Volunteer moderator
Project developer
Project scientist
Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 42620 - Posted: 15 Jan 2016 | 10:35:21 UTC - in response to Message 42615.

Guys! Matt is trying to fix it, see https://www.gpugrid.net/forum_thread.php?id=4235 . Apparently the solution must not be trivial. Please be patient!

Bedrich Hajek
Joined: 28 Mar 09
Posts: 345
Credit: 4,248,187,009
RAC: 747,961
Message 42621 - Posted: 15 Jan 2016 | 11:45:24 UTC
Last modified: 15 Jan 2016 | 11:53:58 UTC

Now I am getting this same "Linux" error on both my Windows machines.


https://www.gpugrid.net/hosts_user.php?userid=19626


Also, when I downloaded a new unit and suspended a good unit to test it, the new unit would crash, and when I resumed the previously good unit, it crashed too.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42622 - Posted: 15 Jan 2016 | 12:24:05 UTC - in response to Message 42619.
Last modified: 15 Jan 2016 | 12:35:01 UTC

Can't explain the triple post.


You missed your nap? :)

Or I fell asleep at the keyboard. ;-)

FWIW most of the tasks I'm getting are resends that have failed at least once on a Linux host. So far they have all run to completion on my host. Win7, Xeon E5 2683, twin GTX 970. Along with GPUGrid tasks I'm also running a full load of CPU tasks minus 2 threads each for the cards if that means anything.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,348,955
RAC: 0
Message 42625 - Posted: 15 Jan 2016 | 14:42:34 UTC

Six errors in the last day on Windows, so it has nothing to do with the Linux app.
Also the site is very slow at the moment.
____________
Greetings from TJ

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 42626 - Posted: 15 Jan 2016 | 14:49:17 UTC

New app 848.

Matt

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Message 42627 - Posted: 15 Jan 2016 | 15:07:13 UTC - in response to Message 42626.

New app 848.

Matt

Matt,
Is it just me or are others having download issues? From logs:

13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error
13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1
13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error

I'm only having this issue here. Sometimes it takes hours to get all the files for 1 task to run.
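
To see how often each file is being retried, the event log lines can be tallied. The following is a rough Python sketch, assuming the log lines look exactly like the ones pasted above; the function name `count_failed_downloads` and the regular expression are my own and are not part of BOINC.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... Temporarily failed download of <file>: transient HTTP error"
# The pattern is an assumption based only on the log lines quoted in this post.
FAILED = re.compile(r"Temporarily failed download of (?P<file>\S+): (?P<reason>.+)$")


def count_failed_downloads(lines):
    """Return a Counter mapping file name -> number of failed download attempts."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group("file")] += 1
    return counts


log = [
    "13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error",
    "13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1",
    "13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error",
]

counts = count_failed_downloads(log)
for name, attempts in counts.items():
    print(attempts, name)
```

The "Backing off" line is deliberately ignored, since it records the retry delay rather than a separate failure.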

caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Message 42628 - Posted: 15 Jan 2016 | 16:15:14 UTC - in response to Message 42622.
Last modified: 15 Jan 2016 | 16:23:07 UTC

Across 5 computers, all Windows (7, 8.1, and 10), I have had about 130 errored-out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors, and the other 30 or so had time put in. One of the computers crashed unrecoverably and needed manual intervention in the BIOS and then the OS to get it back to a good state and running again. Either this caused aborted WUs or it was caused by WUs. I didn't get a chance to check logs, drivers, recent updates and so on; I just got it running again and walked away frustrated.

I did notice all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most being the general, ominous "Update to Windows" or "Security Update for Windows" where you need to actually read the KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would affect more than just me if it was.

I did notice two of the tasks (GERARD_CXCL12_FXCHALC4_DIM and GERARD_CXCL12_FXCHALC4_MON) had a 100% error rate, so I manually aborted about 5 of them at less than an hour into them. They might have been fixed and actually finished, but it was a gut reaction. Before these recent errors I was hovering around 28-32 errors for all of the computers combined, so coming in to see 160 today was a shock. Then seeing they didn't even spend time to crunch eased my fears a bit that it was something totally on my end.

Hope this all works out. After all, WUs are going to error out occasionally, and bad batches have been released and fixed in every project since distributed computing began. But if there is something I can do on my end to help decrease these errors, let me know that too.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org
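
For anyone who wants to check per-batch error rates like the ones mentioned in this post (for example GERARD_CXCL12_FXCHALC4_DIM), here is a small Python sketch that pulls the batch name out of a WU name and tallies failures. The name layout is an assumption inferred only from WU names quoted in this thread, the helper names `batch_of` and `error_rate_by_batch` are made up for illustration, and the pass/fail flags in the example are placeholders, not real results.

```python
import re
from collections import defaultdict

# Assumed layout, inferred only from names quoted in this thread:
#   <prefix>-<BATCH>-<step>-<steps>-RND<number>[_<replica>]
WU_NAME = re.compile(
    r"^(?P<prefix>[^-]+)-(?P<batch>\w+)-(?P<step>\d+)-(?P<steps>\d+)"
    r"-RND(?P<rnd>\d+)(?:_(?P<replica>\d+))?$"
)


def batch_of(name):
    """Return the batch part of a WU name, or None if the name does not fit the assumed layout."""
    match = WU_NAME.match(name)
    return match.group("batch") if match else None


def error_rate_by_batch(results):
    """results: iterable of (wu_name, failed) pairs. Returns {batch: fraction of WUs that failed}."""
    totals = defaultdict(lambda: [0, 0])  # batch -> [failed count, total count]
    for name, failed in results:
        batch = batch_of(name)
        if batch is None:
            continue  # skip names that do not match the assumed layout
        totals[batch][0] += int(failed)
        totals[batch][1] += 1
    return {batch: failed / total for batch, (failed, total) in totals.items()}


# Placeholder flags for illustration only; fill in your own task list.
example = [
    ("e3s31_e2s25p1f424-GERARD_CXCL12_CHALC4_DIM1-0-1-RND7047_1", True),
    ("e13s16_e8s26p11f203-GERARD_CXCL12_DIMPROTO3-0-1-RND2849", False),
]
print(error_rate_by_batch(example))
```

A batch that shows a rate of 1.0 across many hosts is more likely a bad batch than a local problem, which is the comparison being made above.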

Trotador
Joined: 25 Mar 12
Posts: 88
Credit: 1,239,614,530
RAC: 250,551
Message 42647 - Posted: 17 Jan 2016 | 7:26:19 UTC - in response to Message 42626.

New app 848.

Matt


working OK in my host

Stephen Farrell
Joined: 3 Nov 14
Posts: 10
Credit: 57,322,675
RAC: 0
Message 42650 - Posted: 17 Jan 2016 | 11:01:21 UTC - in response to Message 42647.

New app 848.

Matt


working OK in my host


Same here on my Linux hosts.

caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Message 42678 - Posted: 23 Jan 2016 | 6:41:20 UTC - in response to Message 42628.

Across 5 computers, all windows ranging from 7, 8.1, and 10, I have had about 130 errored out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so are with time put in. One of the computers crashed to an unrecoverable crash and needed manual assistance in the BIOS then the OS to get it back to a good state and running again. This caused aborted WUs or was caused by WUs. I didn't get a chance to check logs and drivers and recent updates and stuff, I just got it running again and walked away frustrated.

I did notice all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most the general ominous "Update to Windows" or "Security Update for Windows" that you need to actual read KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would be more than just me if it was.

Maybe a huge coincidence and maybe not, but another one of my computers died and would not boot to anything but the BIOS. The one mentioned above runs Windows 7 and this one runs Windows 10, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. With the first one, I was able to reset the BIOS to defaults, then got it to start in Safe Mode, and then without further changes it was able to reboot back to full Windows with no problems. On this one a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted to get it running outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete it in full mode. After that I reinstalled BOINC and added the projects again. It simply would not start without freezing until the full deletion. I suspect the cause was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder.

At first I didn't post here about it because, when I found it was a bad memory stick keeping it from starting, I suspected the power issue we had at the house that day. The memory failure may very well have come from the power issue. But the machine would not start properly until I dumped BOINC: it restarted fine several times with BOINC disabled, then froze as soon as I started BOINC manually, several times over. Through that, and even after uninstalling BOINC and all the NVIDIA drivers and services, I narrowed the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can cause programs to 'break' if part of the program is still in memory at the time of the crash and is not recoverable, but with BOINC's checkpoints I would not expect this to be a problem after a reinstall: even if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue on.

As this is an old subject now, related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. Just in case these units are still roaming around, and there are computers that run or ran into problems that may only be found later, this can serve as maybe some answer and possibly a solution (using Autoruns in Safe Mode [With Networking if you need to download it; it can be found with a Google search for "Sysinternals Autoruns"]).

And as always, any feedback is appreciated, whether this can help fix an issue or you can help me avoid one. And if I can confirm or help rule out any suspicions about something I forgot to mention, or may be able to answer, please also ask questions.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Profile caffeineyellow5
Avatar
Send message
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwat
Message 42679 - Posted: 23 Jan 2016 | 6:41:25 UTC - in response to Message 42628.

Across 5 computers, all windows ranging from 7, 8.1, and 10, I have had about 130 errored out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so are with time put in. One of the computers crashed to an unrecoverable crash and needed manual assistance in the BIOS then the OS to get it back to a good state and running again. This caused aborted WUs or was caused by WUs. I didn't get a chance to check logs and drivers and recent updates and stuff, I just got it running again and walked away frustrated.

I did notice all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most the general ominous "Update to Windows" or "Security Update for Windows" that you need to actual read KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would be more than just me if it was.

Maybe a huge coincidence and maybe not, but another one of my computers died and would not boot to anything but the BIOS. The one mentioned above runs Windows 7 and this one runs Windows 10, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. On the first one I was able to reset the BIOS to defaults, then got it to start in Safe Mode, and then, without any changes, it rebooted back to full Windows with no problems. On this one, a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted to get it running outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete it in full Windows. After that I reinstalled BOINC and added the projects again. It simply would not start without freezing until the full deletion. I suspect the culprit was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder.

At first I didn't post here about it because, when I found that a bad memory stick was keeping it from starting, I suspected the power issue we had at the house that day. And the memory failure may very well have come from the power issue. But the machine would not start normally until I dumped BOINC: it restarted fine several times with BOINC disabled, then froze as soon as I started BOINC manually, several times over. Through all of that, including uninstalling BOINC and all the NVIDIA drivers and services, I narrowed the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can 'break' programs if part of the program is still in memory at the time of the crash and is not recoverable. But with BOINC's checkpoints, I would not expect that to be a problem after a reinstall: even if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue on.

As this is an old subject now, related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. In case some of these units are still roaming around, and for computers that ran into problems that may only be found late, this can serve as a possible answer and maybe a solution: run Autoruns in Safe Mode (use Safe Mode with Networking if you need to download it; a Google search for "Sysinternals Autoruns" finds it).

As always, any feedback is appreciated, whether this helps fix an issue or you can help me avoid such problems. And if I can confirm or rule out any suspicions about something I forgot to mention, please ask.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 42680 - Posted: 23 Jan 2016 | 6:44:47 UTC

Now I am the one who can't explain the double post. lol

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Level
Leu
Message 42760 - Posted: 6 Feb 2016 | 1:27:20 UTC

Matt:
I'm still having the download issue. Anyone else?

Stephen Farrell
Joined: 3 Nov 14
Posts: 10
Credit: 57,322,675
RAC: 0
Level
Thr
Message 42792 - Posted: 10 Feb 2016 | 10:06:31 UTC - in response to Message 42760.

I haven't had any issues since the new app was released.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Level
Leu
Message 42809 - Posted: 12 Feb 2016 | 22:14:26 UTC

I have on both machines running here. Can someone at least look into this and try to fix it?

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 958,589,926
RAC: 758,897
Level
Glu
Message 42810 - Posted: 12 Feb 2016 | 23:04:17 UTC - in response to Message 42809.
Last modified: 12 Feb 2016 | 23:05:31 UTC

I have on both machines running here. Can someone at least look into this and try to fix it?
Non stop successful work all month for me. I can't see your machines to see if there is a useful error message since you have them hidden.

Have you tried resetting the project? That sometimes clears up http errors.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Level
Leu
Message 42814 - Posted: 13 Feb 2016 | 22:43:54 UTC - in response to Message 42810.

I have on both machines running here. Can someone at least look into this and try to fix it?
Non stop successful work all month for me. I can't see your machines to see if there is a useful error message since you have them hidden.

Have you tried resetting the project? That sometimes clears up http errors.

I'll unhide my machines but I don't see how that will help. The error message is always the same. Rebooting/resetting doesn't help. Here are the latest 2.

138 GPUGRID 2/13/2016 3:59:44 PM Temporarily failed download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
139 GPUGRID 2/13/2016 3:59:44 PM Backing off 00:02:33 on download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
140 GPUGRID 2/13/2016 3:59:47 PM Temporarily failed download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
141 GPUGRID 2/13/2016 3:59:47 PM Backing off 00:02:29 on download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
142 2/13/2016 3:59:48 PM Project communication failed: attempting access to reference site
143 2/13/2016 3:59:49 PM Internet access OK - project servers may be temporarily down.

Nick Name
Joined: 3 Sep 13
Posts: 15
Credit: 391,464,419
RAC: 1,005,760
Level
Asp
Message 42815 - Posted: 14 Feb 2016 | 2:08:59 UTC - in response to Message 42814.

I've seen this occasionally, but as far as I know I have not had files stuck trying to download for hours, as was mentioned in another thread. When I've seen it before, it resolved in a few minutes. Tonight I observed a couple of tasks stuck (probably around ten files in total), with a backoff time of one hour and 45 minutes. I manually tried downloading again and most of them completed, but a couple hung with the transient HTTP error. I still have one file left, which I guess will eventually finish. All my cards are busy, so it's not a problem right now.

Interestingly I had no problems uploading a completed job while this was happening.
____________
Team USA forum | Team USA page

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 958,589,926
RAC: 758,897
Level
Glu
Message 42816 - Posted: 14 Feb 2016 | 2:13:58 UTC - in response to Message 42814.
Last modified: 14 Feb 2016 | 2:15:26 UTC

You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download.

I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part.

kashi
Joined: 29 Jan 15
Posts: 3
Credit: 73,440,337
RAC: 0
Level
Thr
Message 42817 - Posted: 14 Feb 2016 | 4:30:49 UTC

Looked in the log. Thankfully I can't find any recent files that stuck for hours, like I've had in the past. However, to show that it's still happening, here are some recent ones that were stuck for a shorter time:

14/02/2016 10:33:03 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:33:03 AM | GPUGRID | Backing off 00:02:51 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:33:04 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:33:05 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:33:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:33:06 AM | GPUGRID | Backing off 00:02:24 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:33:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:33:08 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:35:31 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:35:55 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:36:42 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:36:42 AM | GPUGRID | Backing off 00:07:46 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:36:43 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:36:44 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:37:06 AM | GPUGRID | Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:37:06 AM | GPUGRID | Backing off 00:04:55 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:37:07 AM | | Project communication failed: attempting access to reference site
14/02/2016 10:37:08 AM | | Internet access OK - project servers may be temporarily down.
14/02/2016 10:56:49 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:56:49 AM | GPUGRID | Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:56:55 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:57:50 AM | GPUGRID | Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
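
For anyone who wants to total up how long downloads sat in backoff, here is a minimal Python sketch, not an official BOINC tool. It assumes event-log lines worded like the excerpt above ("Temporarily failed download of ...", "Backing off HH:MM:SS on download of ...", "Finished download of ..."). The function name summarize_downloads and the result keys are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import timedelta

# The patterns match only the message text, so whatever timestamp/project
# columns precede it do not matter (assumption: wording as in the excerpt).
FAILED = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")
BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")
FINISHED = re.compile(r"Finished download of (\S+)")


def summarize_downloads(log_text):
    """Return {file name: {"failures", "backoff", "finished"}} for a log excerpt."""
    stats = defaultdict(
        lambda: {"failures": 0, "backoff": timedelta(), "finished": False}
    )
    for line in log_text.splitlines():
        failed = FAILED.search(line)
        backoff = BACKOFF.search(line)
        finished = FINISHED.search(line)
        if failed:
            stats[failed.group(1)]["failures"] += 1
        elif backoff:
            hours, minutes, seconds, name = backoff.groups()
            # Add up the announced backoff intervals for this file.
            stats[name]["backoff"] += timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
        elif finished:
            stats[finished.group(1)]["finished"] = True
    return dict(stats)


if __name__ == "__main__":
    name = "e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file"
    # A few lines taken from the excerpt above, for the vel_file only.
    sample = "\n".join([
        f"14/02/2016 10:33:03 AM | GPUGRID | Temporarily failed download of {name}: transient HTTP error",
        f"14/02/2016 10:33:03 AM | GPUGRID | Backing off 00:02:51 on download of {name}",
        f"14/02/2016 10:37:06 AM | GPUGRID | Temporarily failed download of {name}: transient HTTP error",
        f"14/02/2016 10:37:06 AM | GPUGRID | Backing off 00:04:55 on download of {name}",
        f"14/02/2016 10:56:55 AM | GPUGRID | Finished download of {name}",
    ])
    for file_name, info in summarize_downloads(sample).items():
        print(file_name, info["failures"], info["backoff"], info["finished"])
```

Note that this only sums the backoff intervals the client announced; it does not measure the actual wall-clock time between the first failure and the finished download, which would need the timestamps parsed as well.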

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,883,797
RAC: 0
Level
Leu
Message 42819 - Posted: 14 Feb 2016 | 18:17:04 UTC - in response to Message 42816.
Last modified: 14 Feb 2016 | 18:17:43 UTC

You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download.

They all eventually download and run to completion. It's waiting for hours while the downloads are stuck that is the issue.

I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part.

I wish the issue would "resolve itself" but so far that has not happened.
