Message boards : Server and website : Warning: bad tasks re-appearing in the download queue

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1003
Credit: 2,522,570,166
RAC: 3,278,422
Message 54961 - Posted: 26 May 2020 | 14:08:57 UTC
Last modified: 26 May 2020 | 14:10:05 UTC

There were a number of bad workunits accidentally created around 13:00 - 14:00 UTC on 21 May. Examples:

WU 20154512
WU 20154529

They are timing out, and being resent. Although I'm pretty well acquainted with the tricksy ways of BOINC, I'm finding them hard to cope with. All the file downloads go into persistent 'transient HTTP error' - but it's not transient.

One successful way is to:

Wait until GPUGrid is idle (all previous tasks reported)
Stop BOINC
Edit client_state.xml
Change https:// to http:// for the affected download files only
Save file
Restart BOINC

but be very careful when editing that file - use text mode only.
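
The edit step above can be sketched as a small Python script. This is only a sketch under assumptions: it assumes each download URL in client_state.xml sits on its own line inside a `<download_url>` element, and the function name and the sample file names are hypothetical. It works on plain text rather than parsing the XML, in keeping with the "text mode only" advice.

```python
def downgrade_download_urls(state_text, affected_names):
    """Change https:// to http:// on download URLs of the affected files only.

    Assumption: each URL is on its own line inside <download_url>...</download_url>.
    Matching is by substring, so a name such as "x_1" would also match "x_10";
    pass names that are specific enough to avoid that.
    Returns the new text and the number of lines changed.
    """
    out = []
    changed = 0
    for line in state_text.splitlines(keepends=True):
        is_url_line = "<download_url>" in line and "https://" in line
        if is_url_line and any(name in line for name in affected_names):
            # Only the scheme changes; the rest of the line is left untouched.
            line = line.replace("https://", "http://", 1)
            changed += 1
        out.append(line)
    return "".join(out), changed


# Hypothetical usage, with BOINC stopped and a backup copy of the file kept:
#   from pathlib import Path
#   path = Path("client_state.xml")   # location depends on your installation
#   text, n = downgrade_download_urls(path.read_text(), ["name_of_a_stuck_file"])
#   path.write_text(text)
```

Checking the returned count against the number of stuck transfers is a cheap way to confirm that nothing else was touched.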

I'll try some other ways when the machine runs dry again - but downloading the files via a browser didn't work for me this time.

Richard Haselgrove
Message 54962 - Posted: 26 May 2020 | 14:36:50 UTC

We seem to have passed the bad batch, but I've got a fair few to deal with - and more will appear as people try to abort them, or in five days' time when they time out again.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 67
Credit: 923,643,067
RAC: 6,429,776
Message 54963 - Posted: 26 May 2020 | 15:10:45 UTC - in response to Message 54962.

I had like 10 of them in a row just hanging out and the system failed over to the backup project. I just nuked 'em all and it downloaded new work.

Aurum
Joined: 12 Jul 17
Posts: 252
Credit: 9,791,563,847
RAC: 3,936,285
Message 54964 - Posted: 26 May 2020 | 15:52:48 UTC

Not sure how to uniquely identify the defective WUs. But I had dozens sitting in my Transfer list this morning with Project backoffs of 3 to 5 hours. Aborting them also required a reboot to get more WUs to DL and many of those failed to DL. For me GG is going idle.

Richard Haselgrove
Message 54965 - Posted: 26 May 2020 | 16:06:36 UTC

I think it only needs a BOINC restart to clean them up, rather than a reboot - but I agree, it takes more than just a simple abort.

The trouble with aborting / nuking them is that they'll turn up again, like a bad penny. They'll each hang around until they reach their eighth error - which could be another 30 days away.

That's why I started this thread in the server area - I hoped I would attract advanced, skilled, users who knew how to edit files safely and efficiently, and help to get these blighters out of the way for good.

Aurum
Message 54966 - Posted: 26 May 2020 | 16:14:47 UTC

Can't the entire batch of bad pennies be extirpated from the server side???

Kicking the can down the road is only going to make this worse.

Richard Haselgrove
Message 54967 - Posted: 26 May 2020 | 16:19:23 UTC

That's the other reason for posting in the server area - I was hoping Toni would notice and come up with a Cunning Plan.

Take a look at host 508381. Reported the last working task at 26 May 2020 | 16:11:12, got three new ones at 26 May 2020 | 16:13:00. Takes less than two minutes to shut down, edit the file, and start again, once you've got the hang of it. Another one bites the dust.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2185
Credit: 15,823,741,735
RAC: 734,144
Message 54971 - Posted: 26 May 2020 | 22:37:46 UTC - in response to Message 54961.

Wait until GPUGrid is idle (all previous tasks reported)
Stop BOINC
Edit client_state.xml
Change https:// to http:// for the affected download files only
Save file
Restart BOINC

but be very careful when editing that file - use text mode only.

I'm not sure that the task generated by a "repaired" task would come using https, so we may do it 10 times.
(I've aborted the transfer of these tasks, then restarted the BOINC manager, as it thinks that "some downloads are stalled". It refers to the ones I've aborted.)

Richard Haselgrove
Message 54976 - Posted: 27 May 2020 | 6:19:23 UTC - in response to Message 54971.

I'm not sure that the task generated by a "repaired" task would come using https, so we may do it 10 times.

It probably doesn't, but I'm less worried about that. It downloads, it computes, it returns results, and it validates - that's the object of the exercise.

Check the examples I posted in the opening post - they won't be troubling us any more. But yours will be coming back, unless cancelled.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 947
Credit: 4,353,973
RAC: 71
Message 54977 - Posted: 27 May 2020 | 7:47:27 UTC - in response to Message 54976.

I tried to update the URLs in the database, so hopefully reissued results will have the correct URL. Thanks.

Toni
Message 54978 - Posted: 27 May 2020 | 7:49:13 UTC - in response to Message 54977.

Also, for the curious, the reason for the problem is that in some places the non-https gpugrid.org was used. Our certificate covered several domains but not that one.
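
To illustrate the general idea of a certificate covering some host names but not others, here is a minimal Python sketch. The function name and the example.org / example.net names are hypothetical and say nothing about the project's actual certificate; the matching is a simplified version of the usual rule (exact match, or a `*.` wildcard standing for exactly one leftmost label).

```python
def host_covered(hostname, cert_names):
    """Return True if hostname matches one of the names listed in a certificate.

    Simplified rule: a name matches exactly, or a pattern of the form
    "*.domain" matches exactly one extra label in front of "domain".
    """
    host = hostname.lower().rstrip(".")
    for pattern in cert_names:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".example.net"
            if host.endswith(suffix):
                label = host[: -len(suffix)]
                # The wildcard stands for one non-empty label without dots.
                if label and "." not in label:
                    return True
        elif host == pattern:
            return True
    return False


# Hypothetical certificate that lists the www name but not the bare domain.
names = ["www.example.org", "*.example.net"]
print(host_covered("www.example.org", names))
print(host_covered("example.org", names))
```

With those made-up names, an https download from the bare domain fails verification even though the same server answers for the www name.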

Richard Haselgrove
Message 54979 - Posted: 27 May 2020 | 8:17:55 UTC - in response to Message 54978.

Also, for the curious, the reason for the problem is that in some places the non-https gpugrid.org was used. Our certificate covered several domains but not that one.

Ah, thanks. If any do leak out (let's hope not), I'll try using that as a fix, and report back.
