
Message boards : Number crunching : Disk usage limit exceeded

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 48021 - Posted: 22 Oct 2017 | 0:43:41 UTC

So I'm starting to see these errors on this task:

e4s3_e1s13p0f400-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_1-0-1-RND1448_0

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded
</message>

# Access violation : progress made, try to restart
called boinc_finish


Anyone else?

Speedy
Joined: 19 Aug 07
Posts: 43
Credit: 28,391,082
RAC: 0
Message 48026 - Posted: 23 Oct 2017 | 1:27:26 UTC

Name: e2s4_e1s43p0f362-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_0-0-1-RND8662_3

Application version: Short runs (2-3 hours on fastest card) v9.18 (cuda80)


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded
</message>
<stderr_txt>

Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

Entry from the BOINC Manager event log: 23-Oct-17 1:43:58 PM | GPUGRID | Aborting task e2s4_e1s43p0f362-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_0-0-1-RND8662_3: exceeded disk limit: 286.70MB > 286.10MB

Run time: 52,759.76 seconds
CPU time: 9,484.81 seconds
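
[Editor's note: as far as I understand it, the client periodically sums the size of the files in the task's slot directory and aborts the task once that total exceeds the workunit's <rsc_disk_bound>; the 286.10MB ceiling in the log above corresponds to a 300,000,000-byte bound (300,000,000 / 2^20 ≈ 286.1 MB), the figure Richard quotes later in the thread. A minimal Python sketch of the same check; the slot path and the 300,000,000-byte bound in the usage line are illustrative, not taken from any particular host here.

import os
import sys

def slot_disk_usage(slot_dir):
    # Sum the sizes of all files under a BOINC slot directory --
    # roughly the figure the client compares against <rsc_disk_bound>.
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may disappear while the task is running
    return total

if __name__ == "__main__":
    # usage: python slot_usage.py /var/lib/boinc-client/slots/3 300000000
    slot_dir, disk_bound = sys.argv[1], int(sys.argv[2])
    used = slot_disk_usage(slot_dir)
    print("%s bytes used of %s allowed (%.1f%%)"
          % (format(used, ","), format(disk_bound, ","), 100.0 * used / disk_bound))
]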

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1617
Credit: 8,111,694,351
RAC: 18,129,731
Message 48027 - Posted: 23 Oct 2017 | 7:40:12 UTC

These seem to be the same sequence of workunits we've been discussing in the 'Bad batch of tasks?' thread.

Speedy's workunit 12788007 has the same error on three different computers.

Erich56
Joined: 1 Jan 15
Posts: 1120
Credit: 8,967,195,176
RAC: 31,162,806
Message 48028 - Posted: 23 Oct 2017 | 7:40:39 UTC

Most of the tasks from this batch are faulty.
See here: http://gpugrid.net/forum_thread.php?id=4632

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 48031 - Posted: 23 Oct 2017 | 14:24:34 UTC - in response to Message 48028.

Thanks to both you and Richard. I see now that they're part of that bad batch. OK, nothing we can do then....

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 48034 - Posted: 23 Oct 2017 | 21:38:55 UTC - in response to Message 48031.
Last modified: 23 Oct 2017 | 21:48:39 UTC

Thanks to both you and Richard. I see now that they're part of that bad batch. OK, nothing we can do then....
They are part of that bad batch, but they fail for a different reason.
The other thread (the 'Bad batch of tasks?' one) is about the tasks which fail right after they start with "the simulation became unstable" errors. Perhaps the algorithm that checks the simulation's stability is set too sensitively for this part of the batch.
This thread is about the tasks which run for hours until they exceed the disk usage limit set for them on the server and then error out. This is much more annoying than the 'original' failure, as it wastes electricity and time, and it could easily be fixed by raising the disk usage limit of the tasks (provided the high disk usage is actually needed, and not itself the result of another error).

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 48045 - Posted: 25 Oct 2017 | 10:09:55 UTC - in response to Message 48034.

Dears,

I think Adria has cancelled those tasks.

Perhaps they had more atoms than usual, which somewhat fooled the pre-submission checks and caused a higher failure rate. However, some WUs succeeded; when that happens, it's harder for us to tell what's going on.

Thanks for your patience.

T

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 48047 - Posted: 25 Oct 2017 | 12:44:03 UTC - in response to Message 48045.

Thank you Toni for the response and update.

It happens. Glad they can sort this out and hopefully find out what happened, so we can process these correctly. Science is both setbacks and successes. We learn from both.



Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 48048 - Posted: 25 Oct 2017 | 13:42:06 UTC - in response to Message 48047.
Last modified: 26 Oct 2017 | 14:06:51 UTC

Thanks... please consider that it's a consequence of the fact that we are interested in a variety of systems and conditions, which means we cannot make all workunits exactly the same.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1617
Credit: 8,111,694,351
RAC: 18,129,731
Message 48074 - Posted: 31 Oct 2017 | 12:54:00 UTC

I've just received a brand-new ADRIA short run task - e46s1_e44s2p0f280-ADRIA_FOLDPG80_crystal_ss_contacts_50_proteinG_2-0-1-RND2909_0. Hoping to catch any disk usage errors before they happen, I had a look at the file sizes.

The largest single upload file (_9) is allowed to reach 328,000,000 bytes. But the workunit as a whole is only allowed to use 300,000,000 bytes of disk space (<rsc_disk_bound>). That seems a touch inconsistent...

I'll try to keep an eye on the file size as it runs, and adjust the disk bound if I need to. So far, 1,308 KB in 15 minutes. Don't forget, we may have to allow for up to 150 MB of program files (like cufft64_80.dll) in the disk usage limit.
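
[Editor's note: if I read it right, the two limits Richard compares both live in the client's client_state.xml: the per-file ceiling is the <max_nbytes> value on the upload file's entry, and the 300,000,000 figure is the workunit's <rsc_disk_bound>. A rough Python sketch for pulling both out to spot this kind of mismatch; the state-file path is illustrative, the element names are as seen on a 7.6.x client (they may differ on other versions), and no attempt is made to map upload files back to their workunit.

# Rough sketch: list each workunit's <rsc_disk_bound> and the largest
# per-file <max_nbytes> found in client_state.xml, to spot mismatches like
# the 328,000,000 vs 300,000,000 one described above.
import xml.etree.ElementTree as ET

STATE = "/var/lib/boinc-client/client_state.xml"  # adjust for your install

root = ET.parse(STATE).getroot()

limits = []
for tag in ("file", "file_info"):  # tag name differs between client versions
    for f in root.iter(tag):
        m = f.findtext("max_nbytes")
        if m:
            limits.append((f.findtext("name"), float(m)))

for wu in root.iter("workunit"):
    bound = float(wu.findtext("rsc_disk_bound", default="0"))
    print("%s: rsc_disk_bound = %s bytes" % (wu.findtext("name"), format(int(bound), ",")))

if limits:
    name, size = max(limits, key=lambda x: x[1])
    print("largest single upload allowance: %s = %s bytes" % (name, format(int(size), ",")))
]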

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1617
Credit: 8,111,694,351
RAC: 18,129,731
Message 48075 - Posted: 31 Oct 2017 | 14:56:33 UTC - in response to Message 48074.

Reached 12.3 MB and 37.5% progress after 2 hours 15 minutes - I think this one is going to make it.

Looking at long-run tasks on a different machine, they've been given a <rsc_disk_bound> of 4 billion bytes - 4,000,000,000, more than ten times as much. It may have been the difference between the short and long run queue setups, rather than the tasks themselves, which caught Adria out last time.
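
[Editor's note: the "going to make it" judgement above is just a linear extrapolation of the output size against the task's progress fraction. A small sketch of the same back-of-envelope check, using the figures quoted in this thread (12.3 MB written at 37.5% progress, against the 300,000,000-byte short-queue bound); it ignores the program files that, as noted earlier, also count toward the limit.

def projected_final_size(current_bytes, progress_fraction):
    # Linear extrapolation of the output size at 100% progress --
    # the same back-of-envelope estimate described above.
    return current_bytes / progress_fraction

# Figures quoted in this thread: 12.3 MB at 37.5% progress,
# against a 300,000,000-byte <rsc_disk_bound>.
current = 12.3 * 1000 * 1000
bound = 300_000_000
estimate = projected_final_size(current, 0.375)
print("projected final size: %.1f MB (%s the disk bound)"
      % (estimate / 1e6, "within" if estimate < bound else "over"))
]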

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 48076 - Posted: 31 Oct 2017 | 15:05:20 UTC - in response to Message 48075.

Yes, it's inconsistent, but the real problem is that the _9 file should not become that large. I thought the workunits were cancelled... are they still around?

Thanks
T

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 48077 - Posted: 31 Oct 2017 | 15:22:26 UTC - in response to Message 48076.

Also, am I mistaken, or should they have been long WUs?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1617
Credit: 8,111,694,351
RAC: 18,129,731
Message 48078 - Posted: 31 Oct 2017 | 17:02:47 UTC - in response to Message 48077.

Also, am I mistaken, or should they have been long WUs?

Speedy's workunit, from the previous bad batch, ran for between 32,429 seconds (GTX 1080) and 115,002 seconds (GTX 960). Yes, I think those should have been 'long queue' values.

My current task from the new batch today is on course for a 6-hour run (GTX 970) and a 33 MB final file size - I think we're going to make it :-)

Speedy
Joined: 19 Aug 07
Posts: 43
Credit: 28,391,082
RAC: 0
Message 48080 - Posted: 31 Oct 2017 | 21:31:20 UTC - in response to Message 48078.
Last modified: 31 Oct 2017 | 21:32:23 UTC

Also, am I mistaken, or should they have been long WUs?

Speedy's workunit, from the previous bad batch, ran for between 32,429 seconds (GTX 1080) and 115,002 seconds (GTX 960). Yes, I think those should have been 'long queue' values.

My current task from the new batch today is on course for a 6-hour run (GTX 970) and a 33 MB final file size - I think we're going to make it :-)

The task that ran for 150,002 seconds was on a 970. I also agree these should have been in the long queue.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1617
Credit: 8,111,694,351
RAC: 18,129,731
Message 48081 - Posted: 31 Oct 2017 | 22:16:35 UTC - in response to Message 48080.

The task that ran for 150,002 seconds was on a 970.

I did look at the stderr_txt, and the first three starts all say 960. It's only now, looking more carefully, that I see the final part of the run was done on a 970.
