Message boards : News : ATM

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Message 60002 - Posted: 3 Mar 2023 | 10:39:46 UTC

Hello GPUGRID!

You’ve probably already noticed that a new app called “ATM” has been deployed with some test runs. We are working on its validation and deployment, so expect more jobs for this app soon. Let me briefly explain what this new app is about.

The ATM application

The new ATM application stands for Alchemical Transfer Method, a methodology Emilio Gallicchio et al. designed for absolute and relative binding affinity predictions. The ATM method allows us to estimate binding affinities for molecules against a specific protein, that is, how strongly they bind. This methodology falls under the category of alchemical free energy calculation methods, where unphysical intermediate states are used to estimate the free energy of physical processes (such as protein-ligand binding). The benefits of ATM, compared with other common free energy prediction methods (like the popular FEP), come from its simplicity: it can be used with any forcefield and does not require a lot of expertise to make it work properly.
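
For readers new to the terminology, a binding free energy can be related to a dissociation constant through the standard relation dG = RT ln(Kd / c0). The sketch below is only an illustration of that relation and is not part of the ATM code; the function name, the 1 M standard state and the 298.15 K default are assumptions made for the example.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant in mol/L.

    Uses dG = RT * ln(Kd / c0) with an assumed 1 M standard state.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


# A 1 nM binder versus a 1 uM binder: tighter binding gives a more negative dG.
dg_nm = binding_free_energy(1e-9)
dg_um = binding_free_energy(1e-6)

# Relative binding free energy between the two hypothetical ligands.
ddg = dg_nm - dg_um
```

A relative calculation targets the difference `ddg` between two ligands, while an absolute calculation targets each `dg` on its own.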

Measuring experimental binding affinities between candidate molecules and the targeted protein is one of the first steps in drug discovery projects, but synthesizing molecules and performing experiments is expensive. Having the capacity to perform computational binding affinity predictions, particularly during drug lead optimization, is extremely beneficial. We are actively working on testing and validating the ATM method so that we can start applying it to real drug discovery projects as soon as possible. Additionally, since these methods are usually applied to hundreds of molecules, they benefit a lot from the parallelization capabilities of GPUGRID, so if everything goes as expected, this could potentially generate lots of work units.

The ATM app is based on Python, similar to the PythonRL application, and we ship it with a specific Python environment.

Here are the two main references for the ATM method, for both absolute and relative binding affinity predictions:

Absolute binding free energy estimation with ATM: https://arxiv.org/pdf/2101.07894.pdf
Relative binding free energy estimation with ATM: https://pubs.acs.org/doi/10.1021/acs.jcim.1c01129

For now we are only able to send jobs to Linux machines, but we are hoping to have a Windows version soon.

Quico
Message 60003 - Posted: 3 Mar 2023 | 10:40:19 UTC

I’m brand new to GPUGRID, so apologies in advance if I make some mistakes. I’m looking forward to learning from you all and discussing this app :)

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Message 60005 - Posted: 3 Mar 2023 | 11:50:49 UTC

Welcome!

Let's start with some good news. I picked up one of your test tasks a couple of days ago.

T0_1-QUICO_TEST_ATM-0-1-RND8922_0

It ran right through without raising any red flags, and validated at the end. A good start.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Message 60006 - Posted: 3 Mar 2023 | 13:21:48 UTC - in response to Message 60003.

Thanks for creating an official topic on these types of tasks.

The latest problem observed was upload hangs due to the result file being too big. It didn't cause an error, but the file just never uploaded because its size exceeded the size limit of your Apache server. The only resolution for the user was to abort the transfer and hope it didn't get marked as an error.

Have you already addressed this issue, either by adjusting the Apache server's file size limit or by adjusting the tasks so they don't create such large result files?

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 23,381,327,249
RAC: 64,806,478
Message 60007 - Posted: 3 Mar 2023 | 20:20:09 UTC
Last modified: 3 Mar 2023 | 21:19:31 UTC

Welcome, and thanks for the info, Quico.

I noticed that on the last batch the upload was halted by the server: it refused to accept the result.
I checked the client_state file and the file size was below max_nbytes, but the upload still wasn't allowed.

In the past the maximum allowed file size has been 700 MB, and these files have been around 713-730 MB, so something else must control this cap. Changing it might help, but I don't see where the issue would be.

The event log for TL9_72-RAIMIS_TEST_ATM said:

<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>
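
The comparison described above can be sketched in a few lines. This is only an illustration: the regular expression handles just the two tags quoted here, and `ASSUMED_SERVER_LIMIT` is a hypothetical value standing in for whatever cap the web server actually enforces, which is not confirmed anywhere in this thread.

```python
import re


def parse_tag(text, tag):
    """Return the float value of <tag>...</tag> in text, or None if absent."""
    match = re.search(rf"<{tag}>([0-9.]+)</{tag}>", text)
    return float(match.group(1)) if match else None


snippet = """
<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>
"""

nbytes = parse_tag(snippet, "nbytes")
max_nbytes = parse_tag(snippet, "max_nbytes")

# Hypothetical server-side cap in bytes, NOT a confirmed GPUGRID setting.
ASSUMED_SERVER_LIMIT = 700 * 1000 * 1000

# The client-side limit from the workunit is satisfied...
within_client_limit = nbytes <= max_nbytes
# ...while a separate, smaller cap on the server could still reject the file.
within_server_limit = nbytes <= ASSUMED_SERVER_LIMIT
```

Under these assumptions the file passes the `max_nbytes` check but not the assumed server cap, which would match the symptom of an upload that never completes without the task erroring out.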


https://ibb.co/4pYBfNS
parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will test again on new units, but I would probably face the same issue if the server has not changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates

Erich56
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Message 60009 - Posted: 4 Mar 2023 | 6:05:52 UTC - in response to Message 60007.

In the past the maximum allowed file size has been 700 MB

Greger, are you sure it was 700mb?
From what I remember, it was 500mb

Richard Haselgrove
Message 60011 - Posted: 4 Mar 2023 | 9:10:39 UTC

I have one which is looking a bit poorly. It's 'running' on host 132158 (Linux Mint 21.1, GTX 1660 Super, 64 GB RAM), but it's only showing 3% progress after 18 hours.


(image from remote monitoring on a Windows computer)

Are there any files I can examine, or which would be useful to you for debugging - or should I simply abort it?

Dirk Broer
Joined: 4 Oct 09
Posts: 2
Credit: 154,422,019
RAC: 175,177
Message 60012 - Posted: 4 Mar 2023 | 9:50:47 UTC

I am trying to upload one, but can't get it to do the transfer:
Computer: MSI-B550-A-Pro
Project GPUGRID

Name TL9_82-RAIMIS_TEST_ATM-0-1-RND3943_1

Application ATM: Free energy calculations of protein-ligand binding 1.13 (cuda1121)
Workunit name TL9_82-RAIMIS_TEST_ATM-0-1-RND3943
State Uploading
Received 3/1/2023 4:46:17 PM
Report deadline 3/6/2023 4:46:16 PM
Estimated app speed 16.548,99 GFLOPs/sec
Estimated task size 1.000.000.000 GFLOPs
Resources 0,949 CPUs + 1 NVIDIA GPU
CPU time at last checkpoint 00:00:00
CPU time 05:27:34
Elapsed time 05:28:51
Estimated time remaining 00:00:00
Fraction done 100%
Virtual memory size 0,00 MB
Working set size 0,00 MB

Debug State: 4 - Scheduler: 0

Richard Haselgrove
Message 60013 - Posted: 4 Mar 2023 | 10:39:52 UTC

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it.

NB a previous user also failed with a task from the same workunit: 27418556

Quico
Message 60022 - Posted: 6 Mar 2023 | 9:51:35 UTC - in response to Message 60013.

Thanks everyone for the replies!

From what I have seen of the single test job I personally sent, one replica finished without issues but the other two blew up (Particle coordinate is NaN). I find this strange, because I have seen it in the preparation that I run locally but not during production; the errors should be different. I'll check a few things locally, since I changed a few things from my local runs, and we'll try again, also with different inputs.

Welcome, and thanks for the info, Quico.

I noticed that on the last batch the upload was halted by the server: it refused to accept the result.
I checked the client_state file and the file size was below max_nbytes, but the upload still wasn't allowed.

In the past the maximum allowed file size has been 700 MB, and these files have been around 713-730 MB, so something else must control this cap. Changing it might help, but I don't see where the issue would be.

The event log for TL9_72-RAIMIS_TEST_ATM said:
<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>


https://ibb.co/4pYBfNS
parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will test again on new units, but I would probably face the same issue if the server has not changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates


Thanks for this, I'll keep that in mind. From the successful run the file size is 498M, so it should be right at the limit @Erich56 mentions. But that's useful information for when I run bigger systems.

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it.

NB a previous user also failed with a task from the same workunit: 27418556


Hmmm, that's weird. It shouldn't softlock at that step. Although this warning pops up, it should keep running without issues. I'll ask around.
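
As background on the warning itself: it is printed when OpenMM is imported through its older `simtk.openmm` name, and on its own it is harmless. A minimal sketch of an import that prefers the current package name and falls back to the old one is shown below; the helper `import_first_available` is a hypothetical name for this example and is not taken from the ATM scripts.

```python
import importlib


def import_first_available(names):
    """Return the first module in names that imports cleanly, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the current top-level package; fall back to the deprecated alias.
openmm = import_first_available(["openmm", "simtk.openmm"])
if openmm is None:
    print("OpenMM is not installed in this environment")
```

Whether the scripts shipped with the app can be changed this way is for the project team to judge; the sketch only shows why the warning and the stall are likely unrelated.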

gemini8
Joined: 3 Jul 16
Posts: 31
Credit: 2,111,837,676
RAC: 7,921,388
Message 60029 - Posted: 7 Mar 2023 | 11:43:14 UTC

This task didn't want to upload, but neither would GPUGrid update when I aborted the upload.
Only got 24h time-outs.
____________
- - - - - - - - - -
Greetings, Jens

STE\/E
Joined: 18 Sep 08
Posts: 368
Credit: 1,073,371,048
RAC: 11,778,470
Message 60035 - Posted: 8 Mar 2023 | 12:14:20 UTC

I just aborted 1 ATM Wu https://www.gpugrid.net/result.php?resultid=33338739 that had been running for over 7 Days, it sat at 75% done the whole time. Got another one & it immediately jumped to 75% done. Probably just abort it & deselect any new ATM Wu's ...
____________
STE\/E

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Message 60036 - Posted: 8 Mar 2023 | 14:24:59 UTC
Last modified: 8 Mar 2023 | 14:40:30 UTC

Some still running, many failing.
Does ATM really just need one CPU?
I think I saw a new 1.1 GB executable DLing. Maybe the failures tried to run on the older version?
What are the VRAM and RAM minimum requirements for ATM?

Server Status shows both ATM and ATMbeta tasks but Tasks shows them all as ATM.
Strange, all my previously completed ATM WUs have vanished from my Tasks list?

Thanks for the papers, I'll read them later.

Richard Haselgrove
Message 60037 - Posted: 8 Mar 2023 | 15:23:40 UTC

Three successive errors on host 132158

All with "python: can't open file '/hdd/boinc-client/slots/2/Scripts/rbfe_explicit_sync.py': [Errno 2] No such file or directory"

Aurum
Message 60038 - Posted: 8 Mar 2023 | 16:56:45 UTC

I let some computers run off all other WUs so they were just running 2 ATM WUs. It appears they do only use one CPU each but that may just be a consequence of specifying a single CPU in the client_state.xml file. Might your ATM project benefit from using multiple CPUs?

<app_version>
<app_name>ATM</app_name>
<version_num>113</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<flops>46211986880283.171875</flops>
<plan_class>cuda1121</plan_class>
<api_version>7.7.0</api_version>

nvidia-smi reports ATM 1.13 WUs are using 550 to 568 MB of VRAM, so call it 0.6 GB VRAM. BOINCtasks reports all WUs are using less than 1.2 GB RAM. That means that my computers could easily run up to 20 ATM WUs simultaneously.

Sadly GPUGRID does not allow us to control the number of WUs we DL like LHC or WCG do. So we're stuck with 2 set by the ACEMD project. I never run more than a single PYTHON WU on a computer so I get two and abort one and then have to uncheck PYTHON in my GPUGRID Preferences just in case ACEMD or ATM WUs materialize. I wonder how many years it's been since GG has improved the UI to make it more user-friendly? When one clicks their Preferences they still get 2 Warnings and 2 Strict Standards that have never been fixed.
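
The memory arithmetic behind an estimate like this can be sketched as follows. The per-task figures are the rounded values quoted in this post; the host sizes (12 GB of VRAM, 64 GB of RAM) are hypothetical, and the result is only a memory ceiling that says nothing about GPU utilization or throughput.

```python
def max_concurrent_tasks(gpu_vram_mb, host_ram_mb,
                         vram_per_task_mb=600, ram_per_task_mb=1200):
    """Upper bound on simultaneous tasks from memory alone (ignores GPU load)."""
    return min(gpu_vram_mb // vram_per_task_mb, host_ram_mb // ram_per_task_mb)


# Hypothetical host: 12 GB of VRAM and 64 GB of RAM.
limit = max_concurrent_tasks(gpu_vram_mb=12_000, host_ram_mb=64_000)
```

On this assumed host VRAM is the tighter constraint, and in practice GPU load would likely cap the useful number well below the memory ceiling.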

Please add a link to your applications: https://www.gpugrid.net/apps.php

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,310,846,749
RAC: 6,672,061
Message 60039 - Posted: 8 Mar 2023 | 19:52:31 UTC

Is there a way to tell if an ATM WU is progressing? I have had only one succeed so far over the last several weeks. However, all of the failures so far were one of two types: either a failure to upload (with the upload aborted by me) or a simple "Error while computing", which happened very quickly.

However, I now have an ATM WU which has been processing for over seven hours. Looking at the WU properties, it shows the CPU time nearly equal to the elapsed time. The GPU shows processing spikes up to 99%, and the 'down' periods are short.

As others have reported, the Progress shows 75% steadily.

I am inclined to keep letting it compute, but want to know what behavior others have seen on successful ATM WUs.

Ian&Steve C.
Message 60041 - Posted: 8 Mar 2023 | 21:02:08 UTC
Last modified: 8 Mar 2023 | 21:03:20 UTC

let me explain something about the 75% since it seems many don't understand what's happening here. the 75% is in no way an indication of how much the task has progressed. it is totally a function of how BOINC acts with the wrapper when the tasks are setup in the way that they are.

the wrapper uses a job.xml file to instruct BOINC on different "subtasks" to perform over the course of the run of a single task from the project. in the <task> element there is an option to add a <weight> argument. this tells BOINC how much "weight", as a share of total task completion, this subtask is worth. weights are relative, so if they add up to 100, a weight of 1 is equal to 1% and so on. if this weight argument is not defined, each subtask gets equal weight.

in the case of the ATM tasks, the job.xml file has four subtasks, and no weights defined. the first 3 subtasks are just quick extractions and unpacking and complete quickly, which is why the tasks jump to 75% straight away. if it's staying at 75% indefinitely then that's pretty indicative that the task is stuck and probably won't make more progress.

by comparison, the PythonGPU tasks have 2 subtasks, but the first extraction task has a weight of 1 and the second run.py task has a weight of 99, which is why it doesn't have this kind of behavior. and the acemd3 tasks only have one subtask in the file, so it doesn't need a weight at all and progress is pretty linear.
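
The arithmetic behind these percentages can be sketched as below. This is a simplified model written for illustration, assuming the overall fraction is the weight-proportional sum over sequential subtasks; the function name `fraction_done` is made up for the example and is not the wrapper's actual code.

```python
def fraction_done(weights, completed, current_fraction=0.0):
    """Overall fraction done for sequential subtasks.

    weights: relative weight of each subtask, in run order.
    completed: number of subtasks already finished.
    current_fraction: progress (0 to 1) reported by the running subtask.
    """
    total = sum(weights)
    done = sum(weights[:completed])
    if completed < len(weights):
        done += weights[completed] * current_fraction
    return done / total


# Four subtasks with no weights set (equal weight): three quick unpack steps done.
atm_now = fraction_done([1, 1, 1, 1], completed=3)

# Two subtasks weighted 1 and 99: extraction done, main script not yet reporting.
python_gpu = fraction_done([1, 99], completed=1)

# Hypothetical re-weighting of the ATM job, with the main step half finished.
atm_reweighted = fraction_done([1, 1, 1, 97], completed=3, current_fraction=0.5)
```

With equal weights the three short steps account for 0.75 on their own, which is the figure everyone is seeing; the last line shows how the display would move if the main step carried most of the weight and reported its own progress.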

Richard Haselgrove
Message 60042 - Posted: 8 Mar 2023 | 21:59:23 UTC - in response to Message 60039.

I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

I looked at the task properties to see which slot directory it was running in (slot 2, in my case). Then I found the relevant directory, and poked about a bit.

I found our usual touchstone (stderr.txt) to be useless - it hadn't been touched in hours. But another file - run.log - is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?
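
A small script can do the same check without opening the file by hand. The sketch below assumes the log keeps exactly the line format shown in the two entries above, which is all that is known from this thread; the function name and regular expression are illustrative only.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - INFO - sync_re - "
    r"(Started|Finished): sample (\d+), replica (\d+)"
)


def latest_activity(lines):
    """Return (timestamp, event, sample, replica) for the last matching line, or None."""
    latest = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            latest = (stamp, match.group(2), int(match.group(3)), int(match.group(4)))
    return latest


# The three lines quoted above; a real run would read them from run.log instead.
log_lines = [
    "2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12",
    "2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12",
    "(duration: 12.440164870815352 s)",
]
stamp, event, sample, replica = latest_activity(log_lines)
```

Comparing `stamp` with the current time gives a quick "is it still alive" test, and `sample` gives a position in the run once the total number of samples is known.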

kksplace
Message 60043 - Posted: 8 Mar 2023 | 22:39:10 UTC - in response to Message 60042.

Thanks for the idea. Sure enough, that file is showing activity (On sample 324, replica 3 for me.) OK. Just going to sit and wait.

Ian&Steve, thanks for the explanation. Just one thought: what if the fourth item is just "do everything else"? Couldn't that mean going straight from 75% to 100% at some point (assuming it is progressing)?

Quico
Message 60045 - Posted: 9 Mar 2023 | 9:26:00 UTC - in response to Message 60042.

I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

I looked at the task properties to see which slot directory it was running in (slot 2, in my case). Then I found the relevant directory, and poked about a bit.

I found our usual touchstone (stderr.txt) to be useless - it hadn't been touched in hours. But another file - run.log - is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?


Thanks for this input (and everyone's). At least in the runs I sent recently we are expecting 341 samples.

I've seen that there were many crashes in the last batch of jobs I sent. I'll check if there were some issues on my end or it's just that the systems decided to blow up.

Richard Haselgrove
Message 60046 - Posted: 9 Mar 2023 | 10:43:01 UTC - in response to Message 60045.

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.
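
The prediction can be sketched as a linear extrapolation from two observations. The first is the run.log entry quoted earlier in the thread (sample 107 at 21:55 on 8 March); the second approximates "sample 266" by this post's timestamp, and it assumes the log and the forum use the same clock. Both are assumptions, as is a constant time per sample.

```python
from datetime import datetime


def estimate_finish(t_early, sample_early, t_late, sample_late, total_samples):
    """Linear extrapolation of the finish time from two (time, sample) observations."""
    per_sample = (t_late - t_early) / (sample_late - sample_early)
    remaining = total_samples - sample_late
    return t_late + per_sample * remaining


finish = estimate_finish(
    datetime(2023, 3, 8, 21, 55), 107,   # from the run.log excerpt
    datetime(2023, 3, 9, 10, 43), 266,   # approximated by this post's timestamp
    total_samples=341,
)
```

Under these assumptions the rate works out to a little under five minutes per sample, which puts the end of the run at roughly 16:45 on 9 March.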

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.

BOINC doesn't think it's checkpointed since the beginning, even though checkpoints are listed at the end of each sample in the job.log

BOINC Manager shows that the fraction done is 75.000% - and has displayed that figure, unchanging, since a few minutes into the run.

I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
<file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
<open_name>output.tar.bz2</open_name>
<copy_file/>
</file_ref>

More when it finishes.

Quico
Message 60047 - Posted: 9 Mar 2023 | 11:40:25 UTC - in response to Message 60046.
Last modified: 9 Mar 2023 | 11:56:09 UTC

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.



That's good to know, thanks. Next time I'll prepare them so they run for shorter amounts of time and finish over subsequent submissions. Is there an approximate time you suggest per task?


I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
<file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
<open_name>output.tar.bz2</open_name>
<copy_file/>
</file_ref>

More when it finishes.


Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Richard Haselgrove
Message 60048 - Posted: 9 Mar 2023 | 11:57:55 UTC - in response to Message 60047.

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.

Quico
Message 60049 - Posted: 9 Mar 2023 | 14:37:12 UTC - in response to Message 60048.

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.


Ah, I see. From what I've seen, the final upload archive has been around 500 MB for these runs. Taking into account what was mentioned about file sizes at the beginning of the thread, I'll tweak some parameters in order to avoid heavier files.

Ian&Steve C.
Message 60050 - Posted: 9 Mar 2023 | 14:50:08 UTC - in response to Message 60049.

you should also add weights to the <task> elements in the job.xml file that's being used, as well as adding some kind of progress reporting for the main script. jumping to 75% at the start and staying there for 12-24hrs until it jumps to 100% at the end is counterintuitive for most users and causes confusion about whether the task is doing anything or not.
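
One way the main script could report progress is by writing its fraction done to a small file after each sample; as far as I recall, the BOINC wrapper documentation describes a `<fraction_done_filename>` element in the job file for this purpose, but treat that as something to verify. The sketch below is hypothetical: the function name and the file name are made up, and the total of 341 samples is the figure given earlier in the thread.

```python
import os
import tempfile


def write_fraction_done(path, sample, total_samples):
    """Write progress as a number between 0 and 1, replacing the file atomically."""
    fraction = min(max(sample / total_samples, 0.0), 1.0)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{fraction:.6f}\n")
    os.replace(tmp_path, path)  # readers never see a half-written file
    return fraction


# Hypothetical file name; it would have to match the one named in the job file.
progress_file = os.path.join(tempfile.gettempdir(), "fraction_done_example.txt")
reported = write_fraction_done(progress_file, sample=266, total_samples=341)
```

Writing to a temporary file and renaming it is a design choice to avoid the reader picking up a partially written value; a plain overwrite would also work for a file this small.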

Aurum
Message 60051 - Posted: 9 Mar 2023 | 14:51:15 UTC - in response to Message 60047.
Last modified: 9 Mar 2023 | 14:53:22 UTC

Next time I'll prepare them so they run for shorter amounts of time and finish over subsequent submissions. Is there an approximate time you suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turn around time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.

Richard Haselgrove
Message 60052 - Posted: 9 Mar 2023 | 16:51:12 UTC
Last modified: 9 Mar 2023 | 17:01:11 UTC

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733
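
The difference between the two counting conventions can be made concrete with a short conversion. The sketch below uses the byte count from the `<nbytes>` value quoted earlier in the thread, since the byte count of this particular upload is not given; the function name is made up for the example.

```python
def describe_size(num_bytes):
    """Return the size in decimal megabytes (MB) and binary mebibytes (MiB)."""
    return num_bytes / 1000**2, num_bytes / 1024**2


# Byte count taken from the <nbytes> value quoted earlier in the thread.
mb, mib = describe_size(729766132)

# The same idea in reverse: how many bytes a "500" limit allows in each convention.
limit_decimal = 500 * 1000**2
limit_binary = 500 * 1024**2
```

The same file reads as about 730 in one convention and about 696 in the other, so whether an upload clears a limit can depend on which convention the server applies, and that is not stated in this thread.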

Quico
Message 60053 - Posted: 9 Mar 2023 | 18:29:11 UTC - in response to Message 60051.

Next time I'll prepare them so they run for shorter amounts of time and finish over subsequent submissions. Is there an approximate time you suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turn around time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.


Once the Windows version is live, my personal set-up will join the cause and I will have more feedback :)

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733


Thanks for the insight. I'll make it save frames less frequently in order to avoid bigger file sizes.

Ian&Steve C.
Message 60068 - Posted: 13 Mar 2023 | 16:26:17 UTC

nothing but errors from the current ATM batch. run.sh is missing or misnamed/misreferenced.

Aurum
Message 60069 - Posted: 13 Mar 2023 | 17:49:34 UTC
Last modified: 13 Mar 2023 | 17:49:46 UTC

I vaguely recall GG had a rule something like a computer can only DL 200 WUs a day. If it's still in place it would be absurd since the overriding rule is that a computer can only hold 2 WUs at a time.
At the rate ATM WUs are failing I could hit that limit, so I halted GG DLs.
Please delete all your WUs until you fix the bug.

Richard Haselgrove
Message 60074 - Posted: 14 Mar 2023 | 12:35:21 UTC

Today's tasks are running OK - the run.sh script problem has been cured.

I'm running one that the previous user aborted before it even started - no need for that any more (WU 27426736).

Ian&Steve C.
Message 60075 - Posted: 14 Mar 2023 | 12:51:35 UTC - in response to Message 60074.
Last modified: 14 Mar 2023 | 12:52:47 UTC

I wouldn't say "cured", but newer tasks seem to be fine. I'm still getting a good number of resends with the same problem. I guess they'll make their way through the meat grinder before defaulting out.

example: http://www.gpugrid.net/result.php?resultid=33357435

Richard Haselgrove
Message 60076 - Posted: 14 Mar 2023 | 14:47:33 UTC - in response to Message 60075.

My point was: if you get one of these, let it run - it may be going to produce useful science. If it's one of the faulty ones, you waste about 20 seconds, and move on.

Aurum
Message 60084 - Posted: 15 Mar 2023 | 9:28:37 UTC
Last modified: 15 Mar 2023 | 9:30:08 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.

Quico
Message 60085 - Posted: 15 Mar 2023 | 10:12:31 UTC - in response to Message 60076.
Last modified: 15 Mar 2023 | 10:16:48 UTC

Sorry about the missing run.sh issue of the past few days. It slipped past me. There were also a few re-send tests that crashed, but it should be fixed now.


Is there a way I could delete the failed/crashed files from the server?

We're also trying to find alternatives to avoid the filesize issue. I hope we can find a nice solution in the next few days.

Do the last few runs take less time, making them less of a drag to run? I'm trying to find the sweet spot for everyone/most of us.

Thanks everyone!

Quico
Message 60086 - Posted: 15 Mar 2023 | 10:13:40 UTC - in response to Message 60084.

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.


How low is it? It really shouldn't be the case at least taking into account the tests we performed internally.

Richard Haselgrove
Message 60087 - Posted: 15 Mar 2023 | 10:47:56 UTC - in response to Message 60085.

My host 508381 (GTX 1660 Ti) has finished a couple overnight, in about 9 hours. The last one finished just as I was reading your message, and I saw the upload size - 114 MB. Another failed with 'Energy is NaN', but that's another question.

The size and time figures are comfortable for me, but others will post their own views.

It would be helpful to work on the intermediate progress reports and checkpointing - at the moment, neither is reported to BOINC. This host (Linux Mint 20.3) spends the entire run reporting 75% progress: my other machine (Linux Mint 21.1) is stuck at 3%. Both run exactly the same build of BOINC.

Ian&Steve C.
Message 60091 - Posted: 15 Mar 2023 | 11:29:34 UTC - in response to Message 60086.
Last modified: 15 Mar 2023 | 11:45:09 UTC

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

i think the current size of the ATM tasks is pretty good. about 4hrs on a 3080Ti and about 5hrs on a 2080Ti.

I'll second Richard's comment that you should put some effort into checkpointing and fixing the completion reporting (add weights to the job.xml file)
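
To illustrate the weights idea: as far as I understand the BOINC wrapper, each task in job.xml contributes to the overall fraction done in proportion to its weight. Below is a minimal sketch of that arithmetic. The three-task split and the weight values are made up for illustration, not the project's actual job.xml.

```python
from typing import Sequence


def overall_fraction_done(weights: Sequence[float],
                          current: int,
                          fraction_in_current: float) -> float:
    """Overall progress when each task counts in proportion to its weight.

    weights             -- one weight per wrapper task, in execution order
    current             -- index of the task that is running right now
    fraction_in_current -- progress of that task, between 0.0 and 1.0
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    finished = sum(weights[:current])                  # tasks already completed
    partial = weights[current] * fraction_in_current   # share of the running task
    return (finished + partial) / total


# Hypothetical weights: two short setup tasks, one long simulation task.
hypothetical_weights = [1.0, 1.0, 98.0]
print(overall_fraction_done(hypothetical_weights, current=2, fraction_in_current=0.5))
```

With equal weights, the two setup tasks alone would account for two thirds of the progress bar before the simulation has done anything, which is why giving the long task most of the weight matters.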
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60093 - Posted: 15 Mar 2023 | 14:47:42 UTC - in response to Message 60086.
Last modified: 15 Mar 2023 | 14:53:46 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.
How low is it? That really shouldn't be the case, at least judging by the tests we performed internally.

GPUgrid is set to only DL 2 WUs per computer.

It used to be higher but since ACEMD WUs take around 12ish hours and have approximately 50% GPU utilization a normal BOINC client couldn't really make efficient use of more than 2. The history of setting the limit may have had something to do with DDOS attacks and throttling server access as a defense.

But Python WUs with a very low GPU utilization and ATM with about 25% utilization could run more. I believe it's possible for the work server to designate how many WUs of a given kind to send based on the client's hardware.

Some use a custom BOINC client that tricks the server into thinking their computer is more than one computer.

I suspect 1080s & 2080s could run 3 and 3080s could run 4 ATM WUs. Be nice to give it a try.

Checkpointing should be high on your To-Do List followed closely by progress reporting. File size is not an issue on the client side since you DL files over a GB. But increasing the limit on your server side would make that problem vanish. Run times have shortened and run fine, maybe a little shorter would be nice but not a priority.

Profile Stephen Uitti
Send message
Joined: 17 Mar 14
Posts: 4
Credit: 77,427,636
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 60094 - Posted: 15 Mar 2023 | 14:58:08 UTC

I noticed "Free energy calculations of protein ligand binding" in WUProp. For example, today's time is 0.03 hours. I checked, and I have 68 of these with a minimal total run time. They all get "Error while computing". I looked at a recent work unit, 27429650 T_CDK2_new_2_edit_1oiu_26_T2_2A_1-QUICO_TEST_ATM-0-1-RND4575_0
The log has this:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /t/boinclib/boinc-client/slots/8/tmp/pip-req-build-3qm67lb1
Running command git rev-parse -q --verify 'sha^5d7eac55295e8c6e777505c3ca7c998f1c85987d'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git checkout -q 5d7eac55295e8c6e777505c3ca7c998f1c85987d
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4


I'm running Linux Mint 19 (a bit out of date), git is git version 2.17.1
/usr/bin/python is Python 2.7.17 and /usr/bin/python3 is Python 3.6.9 -- this was common until recently
uname -a
Linux berfon 5.4.0-104-generic #118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
My machine has a gtx-950, so cuda tasks are OK.
It's having an issue writing to /t/boinclib/boinc-client/slots/8/tmp

sudo ls -ld /t/boinclib/boinc-client/slots/8/
drwxrwx--x 2 boinc boinc 4096 Mar 15 10:24 /t/boinclib/boinc-client/slots/8/
So it doesn't look like a permissions issue. The disk drive this is on has over 1 TB space free. It looks to me like git failed, and this is what is happening on all the work units.
My machine is running "New version of ACEMD" routinely.
My preferences for GPUGrid are set to run everything. I'm not sure which category this is in, but it must be one of the beta apps.

I hope this helps.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60095 - Posted: 15 Mar 2023 | 15:23:16 UTC - in response to Message 60093.

GPUgrid is set to only DL 2 WUs per computer.


it's actually 2 per GPU, for up to 8 GPUs. 16 per computer/host.

ACEMD WUs take around 12ish hours and have approximately 50% GPU utilization


acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. if you're only seeing 50%, sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.

____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60096 - Posted: 15 Mar 2023 | 17:53:15 UTC

I just started using nvitop for Linux and it gives a very different image of GPU utilization while running ATM: https://github.com/XuehaiPan/nvitop

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60097 - Posted: 15 Mar 2023 | 18:06:14 UTC - in response to Message 60096.
Last modified: 15 Mar 2023 | 18:09:46 UTC

i would probably give more trust to nvidia's own tools.

watch -n 1 nvidia-smi

or
watch -n 1 nvidia-smi --query-gpu=temperature.gpu,name,pci.bus_id,utilization.gpu,utilization.memory,clocks.current.sm,clocks.current.memory,power.draw,memory.used,pcie.link.gen.current,pcie.link.width.current --format=csv


but you said "acemd3" uses 50%. not ATM. overall I'd agree that ATM is closer to 50% effective or a little higher. it cycles between like 90 seconds @95+% and 30 seconds @0% and back and forth for the majority of the run.
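
to put a number on that cycling, here is a rough sketch that averages utilization samples taken once per second, for example from `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1` on a single-GPU host. The sample data below is synthetic (90 readings at 95% followed by 30 readings at 0%), not a real capture, and the function name is just for illustration.

```python
from typing import Iterable


def mean_utilization(lines: Iterable[str]) -> float:
    """Average GPU utilization (in percent) over one-per-second text samples.

    Each line is expected to hold a single number such as "95";
    blank lines are ignored.
    """
    values = [float(text) for text in (line.strip() for line in lines) if text]
    if not values:
        raise ValueError("no utilization samples given")
    return sum(values) / len(values)


# Synthetic samples mimicking one busy/idle cycle, not measured data.
synthetic_samples = ["95"] * 90 + ["0"] * 30
print(mean_utilization(synthetic_samples))
```

taken at face value those round figures average out above 50%, so a real capture over a full run would be the better guide.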
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60098 - Posted: 15 Mar 2023 | 18:09:25 UTC - in response to Message 60094.
Last modified: 15 Mar 2023 | 18:10:23 UTC

I'm running Linux Mint 19 (a bit out of date)
I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3.

My machine has a gtx-950, so cuda tasks are OK.
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60099 - Posted: 15 Mar 2023 | 18:14:54 UTC - in response to Message 60098.

Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.


very likely the min CC is 5.0 (Maxwell) since Kepler cards seem to be erroring with the message that the card is too old.

all cuda 11.x apps are supported by CUDA 11.1+ drivers. with CUDA 11.1, Nvidia introduced forward compatibility of minor versions. so as long as you have 450+ drivers you should be able to run any CUDA app up to 11.8. CUDA 12+ will require moving to CUDA 12+ compatible drivers.
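
a small sketch of the compute capability check, assuming the minimum really is 5.0 (that is a guess from the error messages, not something the project has confirmed). it compares versions as number pairs rather than as strings, so "10.0" is not mistaken for something lower than "5.0".

```python
from typing import Tuple


def parse_compute_capability(text: str) -> Tuple[int, int]:
    """Turn a string like "5.2" into the pair (5, 2)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def meets_minimum(card_cc: str, minimum_cc: str = "5.0") -> bool:
    """True if the card's compute capability is at least the minimum.

    The default minimum of 5.0 is an assumption taken from this thread.
    """
    return parse_compute_capability(card_cc) >= parse_compute_capability(minimum_cc)


print(meets_minimum("5.2"))  # GTX 950 (Maxwell), per the links quoted above
print(meets_minimum("3.5"))  # an example Kepler-class value
```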
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60100 - Posted: 15 Mar 2023 | 18:16:47 UTC - in response to Message 60095.

GPUgrid is set to only DL 2 WUs per computer.

it's actually 2 per GPU, for up to 8 GPUs. 16 per computer/host.
I'm sure you're right, it's been years since I put more than one GPU on a computer.

ACEMD WUs take around 12ish hours and have approximately 50% GPU utilization
acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. if you're only seeing 50%, sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.
Let me rephrase that since it's been a long time since there was a steady flow of ACEMD. I always run 2 ACEMD WUs per GPU with no other GPU projects running. I can't remember what ACEMD utilization was but I don't recall that they slowed down much by running 2 WUs together.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60101 - Posted: 15 Mar 2023 | 18:19:02 UTC - in response to Message 60100.

maybe not much slower, but also not faster.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60102 - Posted: 15 Mar 2023 | 18:20:10 UTC - in response to Message 60097.

i would probably give more trust to nvidia's own tools.

watch -n 1 nvidia-smi

nvitop does that but graphs it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60103 - Posted: 15 Mar 2023 | 18:22:37 UTC - in response to Message 60101.

maybe not much slower, but also not faster.

But it has an advantage: running a single ACEMD WU while the second GG WU sits idle waiting for it to finish, and then missing the quick turnaround bonus on it, feels like getting robbed :-) But who's counting?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60104 - Posted: 15 Mar 2023 | 18:26:29 UTC - in response to Message 60103.
Last modified: 15 Mar 2023 | 18:28:30 UTC

until your 12h task turns into two 25hr tasks when you run two at once, and you get robbed anyway. robbed of the bonus for two tasks instead of just one.

you can set your machine to not download excess tasks by setting a smaller cache size or playing with resource share. that way it won't download the second task until the first one is nearly finished. there are lots of options you can tweak to get the desired behavior.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60105 - Posted: 15 Mar 2023 | 21:13:38 UTC
Last modified: 15 Mar 2023 | 21:13:58 UTC

Picked up another ATM task but not holding out much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

Had hope since the task mentions new in the name.

T_CDK2_new_2_edit_26_1h1q_T4_2_1-QUICO_TEST_ATM-0-1-RND2833_2

[Errno 2] No such file or directory

openmm.OpenMMException: Illegal value for DeviceIndex: 1

Guess I will be the next guinea pig.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 3,058,436,456
RAC: 5,418,210
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60106 - Posted: 16 Mar 2023 | 1:28:51 UTC

Does the ATM app work with RTX 4000 series?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60107 - Posted: 16 Mar 2023 | 2:12:42 UTC - in response to Message 60106.

Does the ATM app work with RTX 4000 series?


Maybe. The Python app does, and the ATM is a similar kind of setup. You’ll have to try it and see.

Not sure how much progress the project has made for Windows though.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60108 - Posted: 16 Mar 2023 | 8:06:10 UTC - in response to Message 60098.

I'm running Linux Mint 19 (a bit out of date)
I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3.

My machine has a gtx-950, so cuda tasks are OK.
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.



Glad to know someone else also has the same problem with Mint 21.1. I will shift to some other flavour.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60111 - Posted: 18 Mar 2023 | 6:30:31 UTC

Got my first ATM Beta. Completed and validated.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60120 - Posted: 20 Mar 2023 | 14:45:24 UTC - in response to Message 60091.

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

i think the current size of the ATM tasks is pretty good. about 4hrs on a 3080Ti and about 5hrs on a 2080Ti.

I'll second Richard's comment that you should put some effort into checkpointing and fixing the completion reporting (add weights to the job.xml file)


That sounds like how ATM is intended to work for now. The idle GPU periods correspond to writing coordinates.

Happy to know that the size of the jobs is good!


Picked up another ATM task but not holding out much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

Had hope since the task mentions new in the name.

T_CDK2_new_2_edit_26_1h1q_T4_2_1-QUICO_TEST_ATM-0-1-RND2833_2

[Errno 2] No such file or directory

openmm.OpenMMException: Illegal value for DeviceIndex: 1

Guess I will be the next guinea pig.


I have seen your errors but I'm not sure why it's happening since I got several jobs running smoothly right now. I'll ask around.

The "new" tag is a leftover on my end from the receptor naming.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60121 - Posted: 20 Mar 2023 | 14:46:25 UTC

Another heads-up: it seems that the Windows app will be available soon! That way we'll be able to look into the progress reporting issue.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60123 - Posted: 20 Mar 2023 | 19:54:13 UTC - in response to Message 60121.

...it seems that the Windows app will be available soon!

that's good news - I'm looking forward to receiving ATM tasks :-)

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 3,058,436,456
RAC: 5,418,210
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60126 - Posted: 22 Mar 2023 | 6:52:36 UTC
Last modified: 22 Mar 2023 | 6:53:46 UTC

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60128 - Posted: 22 Mar 2023 | 11:15:48 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


As far as I know, we are doing the final tests.
I'll let you know once it's fully ready and I have the green light to send jobs through there.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60129 - Posted: 22 Mar 2023 | 11:32:53 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 3,058,436,456
RAC: 5,418,210
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60130 - Posted: 22 Mar 2023 | 14:37:45 UTC - in response to Message 60129.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60132 - Posted: 22 Mar 2023 | 14:45:55 UTC - in response to Message 60130.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?


no I don't run windows. i was just asking if you had the beta box selected because that's necessary.

but looking at the server, some people did get them. someone else earlier in this thread reported that they got and processed one also. very few went out, so unless your system asked when they were available, it would be easy to miss. you can set up a script to ask for them regularly, since BOINC will stop asking after so many requests with no tasks sent.
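
a bare-bones sketch of such a script, using boinccmd's `--project URL update` operation. the project URL and the 10 minute interval are assumptions: the URL has to match the one your client is attached with, and the interval should stay polite to the server. it also assumes boinccmd is on the PATH and can reach the local client.

```python
import subprocess
import time
from typing import List

# Assumptions for illustration; adjust both to your own setup.
PROJECT_URL = "https://www.gpugrid.net/"
INTERVAL_SECONDS = 600


def update_command(project_url: str) -> List[str]:
    """Build the boinccmd call that asks the client to contact the project."""
    return ["boinccmd", "--project", project_url, "update"]


def request_work_periodically(project_url: str, interval_s: int, rounds: int) -> None:
    """Trigger a project update, wait, and repeat for the given number of rounds."""
    for _ in range(rounds):
        # check=False: a failed call (e.g. client not running) should not end the loop
        subprocess.run(update_command(project_url), check=False)
        time.sleep(interval_s)


print(" ".join(update_command(PROJECT_URL)))
```

calling `request_work_periodically(PROJECT_URL, INTERVAL_SECONDS, rounds=144)` would keep asking for roughly a day.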
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60134 - Posted: 22 Mar 2023 | 14:54:54 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

I've yet to get a Windoze ATMbeta. They've been available for a while this morning and still nothing. That GPU just sits with bated breath.
What's the trick?

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 3,058,436,456
RAC: 5,418,210
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60135 - Posted: 22 Mar 2023 | 15:07:47 UTC - in response to Message 60132.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?


no I don't run windows. i was just asking if you had the beta box selected because that's necessary.

but looking at the server, some people did get them. someone else earlier in this thread reported that they got and processed one also. very few went out, so unless your system asked when they were available, it would be easy to miss. you can set up a script to ask for them regularly, since BOINC will stop asking after so many requests with no tasks sent.


Yep. As I said, I have an updater script running as well.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60136 - Posted: 22 Mar 2023 | 15:11:24 UTC - in response to Message 60135.

KAMasud got one on his Windows system. maybe he can share his settings.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60137 - Posted: 22 Mar 2023 | 15:26:35 UTC
Last modified: 22 Mar 2023 | 15:38:55 UTC

Quico, Do you have some cryptic requirements specified for your Win ATMbeta WUs?

I've even had my Win computer set to only request ATMbeta WUs and still got nothing.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60138 - Posted: 23 Mar 2023 | 8:57:10 UTC - in response to Message 60136.

KAMasud got one on his Windows system. maybe he can share his settings.

____________________

Yes, I did get an ATM task. Completed and validated with success. No, I do not have any special settings. The only thing I do is not run any other project with GPU Grid. I have a feeling that they interfere with each other. How? GPU Grid is all over my cores and threads. Lacks discipline. My take on the subject. Admin, sorry.
Even though resources are wasted, I am not after the credits.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60139 - Posted: 23 Mar 2023 | 9:34:34 UTC
Last modified: 23 Mar 2023 | 13:06:14 UTC

I think it's just a matter of very few tests being submitted right now. Once I have the green light from Raimondas I'll start sending jobs through the windows app as well.
I have a complete system prepared just for you ;)

PS: You can now check the pre-print of our initial benchmark in the lab with ATM!
https://arxiv.org/abs/2303.11065

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60140 - Posted: 23 Mar 2023 | 12:46:59 UTC

Still no checkpoints. Hopefully this is top of your priority list.

BTW, highlight your URL and click URL above and it'll be linkable:
https://arxiv.org/abs/2303.11065

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60141 - Posted: 23 Mar 2023 | 13:08:40 UTC - in response to Message 60140.

Done! Thanks for it.

Reporting should be live for the jobs I'll send later today. Please let me know if it works as expected, especially for the jobs with _BACE_ in their job name.

I'll also start sending jobs through Windows today as well.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 3,058,436,456
RAC: 5,418,210
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60142 - Posted: 23 Mar 2023 | 14:05:09 UTC

There are two different ATM apps on the server stats page, and also on the apps.php page. But in project preferences, there is only one ATM app listed. We need a way to select both/either in our project preferences.
____________
Reno, NV
Team: SETI.USA

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60143 - Posted: 23 Mar 2023 | 16:09:27 UTC

Let it be. It is more fun this way. Never know what you will get next and adjust.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60144 - Posted: 23 Mar 2023 | 16:23:31 UTC
Last modified: 23 Mar 2023 | 16:26:46 UTC

My new WU behaves differently but I don't think checkpointing is working. It reported the first checkpoint after a minute and after an hour has yet to report a second one. Progress is stuck at 0.2 but time remaining has decreased from 1222 days to 22 days.

The Windoze WUs are all failing.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 60145 - Posted: 23 Mar 2023 | 17:11:09 UTC
Last modified: 23 Mar 2023 | 17:24:02 UTC

I have started to get these ATM tasks on my windoze hosts.

All are failing like this:

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:28:53 (11872): wrapper (7.9.26016): starting
11:28:53 (11872): wrapper: running python.exe (bin/conda-unpack)
11:28:54 (11872): python.exe exited; CPU time 0.000000
11:28:54 (11872): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
analyze.sh
cntxt_0/
cntxt_0/PTP1B_new-23486-23479
p-0.dat
p-10.dat
p-11.dat
p-12.dat
p-13.dat
p-14.dat
p-15.dat
p-16.dat
p-17.dat
p-18.dat
p-19.dat
p-1.dat
p-20.dat
p-21.dat
p-2.dat
p-3.dat
p-4.dat
p-5.dat
p-6.dat
p-7.dat
p-8.dat
p-9.dat
PTP1B_new-23486-23479_0.xml
PTP1B_new-23486-23479_asyncre.cntl
PTP1B_new-23486-23479.inpcrd
PTP1B_new-23486-23479.prmtop
r0/
r0/PTP1B_new-23486-23479.dcd
r0/PTP1B_new-23486-23479_ckpt.xml
r0/PTP1B_new-23486-23479.out
r1/
r1/PTP1B_new-23486-23479.dcd
r1/PTP1B_new-23486-23479_ckpt.xml
r1/PTP1B_new-23486-23479.out
r10/
r10/PTP1B_new-23486-23479.dcd
r10/PTP1B_new-23486-23479_ckpt.xml
r10/PTP1B_new-23486-23479.out
r11/
r11/PTP1B_new-23486-23479.dcd
r11/PTP1B_new-23486-23479_ckpt.xml
r11/PTP1B_new-23486-23479.out
r12/
r12/PTP1B_new-23486-23479.dcd
r12/PTP1B_new-23486-23479_ckpt.xml
r12/PTP1B_new-23486-23479.out
r13/
r13/PTP1B_new-23486-23479.dcd
r13/PTP1B_new-23486-23479_ckpt.xml
r13/PTP1B_new-23486-23479.out
r14/
r14/PTP1B_new-23486-23479.dcd
r14/PTP1B_new-23486-23479_ckpt.xml
r14/PTP1B_new-23486-23479.out
r15/
r15/PTP1B_new-23486-23479.dcd
r15/PTP1B_new-23486-23479_ckpt.xml
r15/PTP1B_new-23486-23479.out
r16/
r16/PTP1B_new-23486-23479.dcd
r16/PTP1B_new-23486-23479_ckpt.xml
r16/PTP1B_new-23486-23479.out
r17/
r17/PTP1B_new-23486-23479.dcd
r17/PTP1B_new-23486-23479_ckpt.xml
r17/PTP1B_new-23486-23479.out
r18/
r18/PTP1B_new-23486-23479.dcd
r18/PTP1B_new-23486-23479_ckpt.xml
r18/PTP1B_new-23486-23479.out
r19/
r19/PTP1B_new-23486-23479.dcd
r19/PTP1B_new-23486-23479_ckpt.xml
r19/PTP1B_new-23486-23479.out
r2/
r2/PTP1B_new-23486-23479.dcd
r2/PTP1B_new-23486-23479_ckpt.xml
r2/PTP1B_new-23486-23479.out
r20/
r20/PTP1B_new-23486-23479.dcd
r20/PTP1B_new-23486-23479_ckpt.xml
r20/PTP1B_new-23486-23479.out
r21/
r21/PTP1B_new-23486-23479.dcd
r21/PTP1B_new-23486-23479_ckpt.xml
r21/PTP1B_new-23486-23479.out
r3/
r3/PTP1B_new-23486-23479.dcd
r3/PTP1B_new-23486-23479_ckpt.xml
r3/PTP1B_new-23486-23479.out
r4/
r4/PTP1B_new-23486-23479.dcd
r4/PTP1B_new-23486-23479_ckpt.xml
r4/PTP1B_new-23486-23479.out
r5/
r5/PTP1B_new-23486-23479.dcd
r5/PTP1B_new-23486-23479_ckpt.xml
r5/PTP1B_new-23486-23479.out
r6/
r6/PTP1B_new-23486-23479.dcd
r6/PTP1B_new-23486-23479_ckpt.xml
r6/PTP1B_new-23486-23479.out
r7/
r7/PTP1B_new-23486-23479.dcd
r7/PTP1B_new-23486-23479_ckpt.xml
r7/PTP1B_new-23486-23479.out
r8/
r8/PTP1B_new-23486-23479.dcd
r8/PTP1B_new-23486-23479_ckpt.xml
r8/PTP1B_new-23486-23479.out
r9/
r9/PTP1B_new-23486-23479.dcd
r9/PTP1B_new-23486-23479_ckpt.xml
r9/PTP1B_new-23486-23479.out
Rplots.pdf
run.log
run.sh
uwham_analysis.R
uwham_analysis.Rout
11:29:23 (11872): Library/usr/bin/tar.exe exited; CPU time 0.796875
11:29:23 (11872): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'run.bat' is not recognized as an internal or external command,
operable program or batch file.

11:29:24 (11872): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
11:29:24 (11872): app exit status: 0x1
11:29:24 (11872): called boinc_finish(195)


A script error?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60146 - Posted: 23 Mar 2023 | 17:46:00 UTC - in response to Message 60145.

I have started to get these ATM tasks on my windoze hosts.

All are failing like this:

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:28:53 (11872): wrapper (7.9.26016): starting
11:28:53 (11872): wrapper: running python.exe (bin/conda-unpack)
11:28:54 (11872): python.exe exited; CPU time 0.000000
11:28:54 (11872): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
11:29:23 (11872): Library/usr/bin/tar.exe exited; CPU time 0.796875
11:29:23 (11872): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'run.bat' is not recognized as an internal or external command,
operable program or batch file.

11:29:24 (11872): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
11:29:24 (11872): app exit status: 0x1
11:29:24 (11872): called boinc_finish(195)


A script error?


Hmm, I did send those this morning. They probably entered the queue once my Windows app was live, and it was looking for run.bat.
If that's the case expect many crashes incoming :_(

The tests I'm monitoring seem to be still running so there's still hope

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 3,058,436,456
RAC: 5,418,210
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60147 - Posted: 23 Mar 2023 | 19:33:22 UTC

FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out. They have already been issued many times over (too many) and failed every time. Looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

http://www.gpugrid.net/result.php?resultid=33375372

So there is hope.
____________
Reno, NV
Team: SETI.USA

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60148 - Posted: 23 Mar 2023 | 20:20:25 UTC - in response to Message 60147.

FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out. They have already been issued many times over (too many) and failed every time. Looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

http://www.gpugrid.net/result.php?resultid=33375372

So there is hope.

--------------
Welcome Zombie67. If you are looking for more excitement, Climate has implemented OpenIFS.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 78,050,793
RAC: 1,301,924
Level
Thr
Scientific publications
wat
Message 60149 - Posted: 23 Mar 2023 | 20:23:04 UTC - in response to Message 60148.

All openifs tasks are already sent.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 60150 - Posted: 24 Mar 2023 | 1:19:31 UTC - in response to Message 60147.

...But a few of the windows tasks I received today are actually working.


I have one that is working, but I had to add ATMs to my appconfig file to get them to more accurately show the time remaining, due to what Ian pointed out way upthread.
https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60041
I now see realistic time remaining.

My current app_config.xml:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd3</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>ATM</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<project_max_concurrent>1</project_max_concurrent>
<report_results_immediately/>
</app_config>


This task ran alongside a F@H task (project 18717) on a RTX3060 12GB card without any problem, in case anybody is interested.
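
A quick well-formedness check can catch slips such as a dropped angle bracket before the BOINC client reads the file. The following is only a minimal sketch using Python's standard library; the function name `summarize_app_config` and the trimmed `SAMPLE` text are illustrative assumptions, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def summarize_app_config(text):
    """Return (app name, gpu_usage, cpu_usage) tuples; raise ValueError if malformed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        raise ValueError(f"app_config.xml is not well-formed XML: {err}") from err
    if root.tag != "app_config":
        raise ValueError(f"unexpected root element <{root.tag}>")
    summary = []
    for app in root.findall("app"):
        name = app.findtext("name")
        # Missing values become NaN so that they stand out in the summary.
        gpu = float(app.findtext("gpu_versions/gpu_usage", default="nan"))
        cpu = float(app.findtext("gpu_versions/cpu_usage", default="nan"))
        summary.append((name, gpu, cpu))
    return summary


# Trimmed-down copy of the ATM entry, used only to exercise the function.
SAMPLE = """<app_config>
<app>
<name>ATM</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<project_max_concurrent>1</project_max_concurrent>
</app_config>"""

if __name__ == "__main__":
    for name, gpu, cpu in summarize_app_config(SAMPLE):
        print(f"{name}: gpu_usage={gpu}, cpu_usage={cpu}")
```

In practice the text would be read from the app_config.xml in the project folder rather than from an inline string.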

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 78,050,793
RAC: 1,301,924
Level
Thr
Scientific publications
wat
Message 60151 - Posted: 24 Mar 2023 | 2:54:44 UTC - in response to Message 60150.
Last modified: 24 Mar 2023 | 2:55:02 UTC

Why not
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>4</cpu_usage>
</gpu_versions>
</app>

?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60152 - Posted: 24 Mar 2023 | 9:48:54 UTC

So far, 2 WUs successfully completed, another one running.

https://www.gpugrid.net/workunit.php?wuid=27438037

https://www.gpugrid.net/workunit.php?wuid=27438416

https://www.gpugrid.net/workunit.php?wuid=27438497


kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 78,050,793
RAC: 1,301,924
Level
Thr
Scientific publications
wat
Message 60153 - Posted: 24 Mar 2023 | 11:47:30 UTC - in response to Message 60152.
Last modified: 24 Mar 2023 | 12:05:55 UTC

it still can't run run.bat
http://www.gpugrid.net/result.php?resultid=33377536

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60154 - Posted: 24 Mar 2023 | 12:16:41 UTC
Last modified: 24 Mar 2023 | 12:26:33 UTC

progress reporting is still not working.

instead of halting progress at 75%, it now halts at 0.19%. the weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses. 2%, 3%, 4%, etc until it hits 100% in a natural and linear way. The ATM tasks do not do this at all. they sit at 0.19% for hours and hours with no indication of when they will complete. is it 4hrs? is it 20hrs? there's no feedback to the user. when it's done it just jumps to 100% without warning.

makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now. but the earlier tasks from yesterday ("T_p38") do not.
____________

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60155 - Posted: 24 Mar 2023 | 13:37:41 UTC - in response to Message 60154.

progress reporting is still not working.

instead of halting progress at 75%, it now halts at 0.19%. the weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses. 2%, 3%, 4%, etc until it hits 100% in a natural and linear way. The ATM tasks do not do this at all. they sit at 0.19% for hours and hours with no indication of when they will complete. is it 4hrs? is it 20hrs? there's no feedback to the user. when it's done it just jumps to 100% without warning.

makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now. but the earlier tasks from yesterday ("T_p38") do not.


T_p38 were sent before the update so I guess it makes sense that they don't show reporting yet. Is the progress report for the BACE runs good? Is it staying stuck?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60156 - Posted: 24 Mar 2023 | 13:50:20 UTC - in response to Message 60155.

Yes, BACE looks good.

But something is wrong with CDK2_new. It jumped to 100% but is still running.
____________

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60157 - Posted: 24 Mar 2023 | 13:59:50 UTC - in response to Message 60140.

Hello Quico and everyone. Thank you for trying AToM-OpenMM on GPUGRID.

I am unsure if it is relevant to this issue, but AToM implements full checkpointing. Each replica's status is stored in a .xml file in the replica directory. We usually checkpoint every 10 mins, but this interval can be changed in the control file with the CHECKPOINT_TIME parameter (in seconds). Checkpointing is also triggered by SIGTERM or SIGINT signals sent to the main AToM process.

Launching the AToM job from the same folder reads the checkpoints and should restart the simulation as if it had kept running.
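
The checkpointing behaviour described above (a periodic timer plus a checkpoint on SIGTERM or SIGINT) can be sketched roughly as follows. This is not AToM's actual code: `CheckpointScheduler`, `run`, `do_sample` and `write_checkpoint` are hypothetical names chosen for illustration, and only the 600-second default mirrors the 10-minute interval mentioned in the post.

```python
import signal
import time


class CheckpointScheduler:
    """Decide when to checkpoint: on a timer, or as soon as a stop signal arrives."""

    def __init__(self, checkpoint_time=600.0, clock=time.monotonic):
        self.checkpoint_time = checkpoint_time  # seconds, in the spirit of CHECKPOINT_TIME
        self.clock = clock
        self.last_checkpoint = clock()
        self.stop_requested = False

    def install_handlers(self):
        # The handler only records the request; the checkpoint itself is
        # written from the main loop, at a point where the state is consistent.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        self.stop_requested = True

    def should_checkpoint(self):
        elapsed = self.clock() - self.last_checkpoint
        return self.stop_requested or elapsed >= self.checkpoint_time

    def mark_checkpointed(self):
        self.last_checkpoint = self.clock()


def run(num_samples, scheduler, do_sample, write_checkpoint):
    """Run samples until finished or a stop is requested; return the last sample done."""
    done = 0
    for isample in range(1, num_samples + 1):
        do_sample(isample)
        done = isample
        if scheduler.should_checkpoint():
            write_checkpoint(isample)
            scheduler.mark_checkpointed()
        if scheduler.stop_requested:
            break
    return done
```

Checking the flag only between samples means a stop request is honoured after the current sample finishes, so the sample count stored in the checkpoint matches the samples actually recorded.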

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 13,287,569,643
RAC: 39,768,569
Level
Trp
Scientific publications
watwatwatwatwat
Message 60158 - Posted: 24 Mar 2023 | 14:08:25 UTC
Last modified: 24 Mar 2023 | 14:13:21 UTC

The Python task must tell the BOINC client how many ticks there are to calculate (MAX_SAMPLES = 341 from *_asyncre.cntl, times 22 replicas) and report the end of each tick.

In addition, the elapsed time starts counting again at 0 after each restart. I don't know what the current situation is.

If the progress indicator is now OK, forget my reply.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60159 - Posted: 24 Mar 2023 | 14:09:29 UTC
Last modified: 24 Mar 2023 | 14:18:59 UTC

The ATM tasks also record that a task has checkpointed in the job.log file in the slot directory (or did so, a few debug iterations ago - see message 60046).

That file can be viewed while a task is running, but not after it's finished. It's written (I think) by the science app, but messages are passed to BOINC by the wrapper: that's probably where the problem is.

Edit: OK, I've downloaded a BACE task (resend _4) and a T_PTP1B_new task (resend _3). I'll watch them when the current pair of Abouh tasks have finished.

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60160 - Posted: 24 Mar 2023 | 15:45:51 UTC - in response to Message 60158.

The GPUGRID version of AToM:

https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py

has this:


# Report progress on GPUGRID
progress = float(isample)/float(num_samples - last_sample)
open("progress", "w").write(str(progress))



which checks out as far as I can tell. last_sample is retrieved from checkpoints upon restart, so the progress % should be tracked correctly across restarts.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60161 - Posted: 24 Mar 2023 | 15:46:40 UTC

OK, the BACE task is running, and after 7 minutes or so, I see:

2023-03-24 15:40:33 - INFO - sync_re - Started: checkpointing
2023-03-24 15:40:49 - INFO - sync_re - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5407383099664 s)

in the run.log file. So checkpointing is happening, but just not being reported through to BOINC.

Progress is 3.582% after eleven minutes.
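
For anyone who wants a rough time estimate from run.log while the progress bar is unreliable, here is a small sketch that pulls the per-sample durations out of lines like the ones above. The regular expression is based only on the three lines quoted here, and `max_samples` is a placeholder that has to be taken from the task's own control file.

```python
import re

# Matches e.g. "Finished: sample 1 (duration: 303.5407383099664 s)"
SAMPLE_RE = re.compile(r"Finished: sample (\d+) \(duration: ([0-9.]+) s\)")


def sample_durations(log_text):
    """Return {sample number: duration in seconds} for every finished sample."""
    return {int(m.group(1)): float(m.group(2)) for m in SAMPLE_RE.finditer(log_text)}


def estimate_remaining_hours(log_text, max_samples):
    """Estimate the hours left from the mean sample duration seen so far."""
    durations = sample_durations(log_text)
    if not durations:
        return None  # nothing finished yet, so no estimate is possible
    mean = sum(durations.values()) / len(durations)
    remaining = max_samples - max(durations)
    return max(remaining, 0) * mean / 3600.0


LOG = """2023-03-24 15:40:33 - INFO - sync_re - Started: checkpointing
2023-03-24 15:40:49 - INFO - sync_re - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5407383099664 s)
"""

if __name__ == "__main__":
    # 11 is an arbitrary placeholder, not the real sample count of this task.
    print(estimate_remaining_hours(LOG, max_samples=11))
```

The estimate assumes every sample takes about as long as the ones already logged, which may not hold if the GPU is shared with other work.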

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60162 - Posted: 24 Mar 2023 | 16:04:08 UTC - in response to Message 60157.

Actually, it is unclear if AToM's GPUGRID version checkpoints after catching termination signals. I'll ask Raimondas. Termination without checkpointing is usually okay, but progress since the checkpoint would be lost, and the number of samples recorded in the checkpoint file would not reflect the actual number of samples recorded.

Does anyone know if BOINC sends specific signals to terminate an app? Would the app pass the signal to the main AToM's python process?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60163 - Posted: 24 Mar 2023 | 16:20:44 UTC - in response to Message 60162.

The app seems to be both checkpointing, and updating progress, at the end of each sample. That will make re-alignment after a pause easier, but there's always some over-run, and data lost on restart. It's up to the application itself to record the data point reached, for use on restart, as an integral part of the checkpointing process.

I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60164 - Posted: 24 Mar 2023 | 16:20:50 UTC
Last modified: 24 Mar 2023 | 16:20:58 UTC

Seriously? Only 14 tasks a day?

GPUGRID 3/24/2023 9:17:44 AM This computer has finished a daily quota of 14 tasks

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60165 - Posted: 24 Mar 2023 | 16:42:27 UTC - in response to Message 60164.

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60166 - Posted: 24 Mar 2023 | 16:53:12 UTC

The T_PTP1B_new task, on the other hand, is not reporting progress, even though it's logging checkpoints in the run.log

A file is maintained in the slot folder, called 'boinc_task_state.xml' (it's probably written by the wrapper, though I'm not certain of that).

The current contents are:

<active_task>
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>T_PTP1B_new_23484_23482_T3_2A_1-QUICO_TEST_ATM-0-1-RND3714_3</result_name>
<checkpoint_cpu_time>10.942300</checkpoint_cpu_time>
<checkpoint_elapsed_time>30.176729</checkpoint_elapsed_time>
<fraction_done>0.001996</fraction_done>
<peak_working_set_size>8318976</peak_working_set_size>
<peak_swap_size>16592896</peak_swap_size>
<peak_disk_usage>1318196036</peak_disk_usage>
</active_task>

The <fraction_done> is reported as the 'progress%' figure - this one is reported as 0.199% by BOINC Manager (which truncates) and 0.200% by other tools (which round).

This task has been running for 43 minutes, and boinc_task_state.xml hasn't been re-written since the first minute.
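
The truncate-versus-round difference is easy to reproduce. The sketch below reads `fraction_done` from a shortened copy of the XML above and formats it both ways; the helper names are made up for illustration, and how BOINC Manager formats the value internally is not shown here.

```python
import math
import xml.etree.ElementTree as ET

# Shortened copy of the boinc_task_state.xml contents quoted above.
STATE = """<active_task>
<project_master_url>https://www.gpugrid.net/</project_master_url>
<checkpoint_cpu_time>10.942300</checkpoint_cpu_time>
<checkpoint_elapsed_time>30.176729</checkpoint_elapsed_time>
<fraction_done>0.001996</fraction_done>
</active_task>"""


def fraction_done(state_xml):
    """Extract <fraction_done> as a float."""
    return float(ET.fromstring(state_xml).findtext("fraction_done"))


def percent_truncated(fraction, places=3):
    """Percentage with extra digits cut off."""
    scale = 10 ** places
    return math.floor(fraction * 100 * scale) / scale


def percent_rounded(fraction, places=3):
    """Percentage rounded to the nearest value."""
    return round(fraction * 100, places)


if __name__ == "__main__":
    frac = fraction_done(STATE)
    print(f"truncated: {percent_truncated(frac):.3f}%")
    print(f"rounded:   {percent_rounded(frac):.3f}%")
```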

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60167 - Posted: 24 Mar 2023 | 20:30:16 UTC


task 27438680
Completed and validated. While the following task had a failure after a re-start.
task 27438865

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60169 - Posted: 24 Mar 2023 | 20:49:28 UTC

My BACE task 33378091 finished successfully after 5 hours, under Linux Mint 21.1 with a GTX 1660 Super.

Four previous attempts failed, two of them under Windows with a 0xc0000135 error in Python.exe - that's a missing DLL.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60170 - Posted: 24 Mar 2023 | 21:46:07 UTC

Task 27438853
Completed and validated. Short one though.

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60171 - Posted: 25 Mar 2023 | 2:28:51 UTC - in response to Message 60163.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.

Profile Landjunge
Send message
Joined: 2 Nov 08
Posts: 3
Credit: 9,780,421,724
RAC: 39,638,524
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60172 - Posted: 25 Mar 2023 | 9:32:49 UTC
Last modified: 25 Mar 2023 | 9:33:49 UTC

Hi, I have some "new_2" ATMs that have been running for 14h+ so far. Should I abort them?
Running Linux with RTX 3070 cards.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60173 - Posted: 25 Mar 2023 | 9:38:33 UTC - in response to Message 60171.
Last modified: 25 Mar 2023 | 9:50:33 UTC

The wrapper you're using at the moment is called "wrapper_26198_x86_64-pc-linux-gnu" (I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

That wrapper name looks as if it was prepared from BOINC code dating to around February 2017. At that time, BOINC was working on versions of the wrapper specifically intended for use with VirtualBox.

BOINC makes pre-compiled versions of the wrapper available for projects to use "as is", but some projects customise the source code to suit their own needs. I don't know which path GPUGrid has taken.

Edit - I just looked at the file name the first time. In stderr.txt, I see

20:37:54 (115491): wrapper (7.7.26016): starting

That would put the date back to around November 2015, but I guess someone has made some local modifications.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60174 - Posted: 25 Mar 2023 | 9:45:14 UTC - in response to Message 60172.

Hi, I have some "new_2" ATMs that have been running for 14h+ so far. Should I abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.

Profile Landjunge
Send message
Joined: 2 Nov 08
Posts: 3
Credit: 9,780,421,724
RAC: 39,638,524
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60175 - Posted: 25 Mar 2023 | 9:46:50 UTC - in response to Message 60174.

Hi, I have some "new_2" ATMs that have been running for 14h+ so far. Should I abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.



Thank you. I will let them run =)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60176 - Posted: 25 Mar 2023 | 11:32:54 UTC - in response to Message 60175.

And completed.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60177 - Posted: 25 Mar 2023 | 13:06:20 UTC - in response to Message 60165.
Last modified: 25 Mar 2023 | 13:53:41 UTC

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Quico, this behavior is intended to block misconfigured computers. In this case it's your Windows version that fails in seconds and is resent until it hits a Linux computer or fails 7 times. My Win computer was locked out of GG early yesterday, but all my Linux computers donated until WUs ran out.
In this example the first 4 failures all went to Win7 & 11 computers and then Linux completed it successfully:
https://www.gpugrid.net/workunit.php?wuid=27438768

And the Win WUs are failing in seconds again with today's tranche.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60183 - Posted: 25 Mar 2023 | 14:27:30 UTC

WUs failing on Linux computers:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4
fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4 did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

https://www.gpugrid.net/result.php?resultid=33379917

Profile Landjunge
Send message
Joined: 2 Nov 08
Posts: 3
Credit: 9,780,421,724
RAC: 39,638,524
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60184 - Posted: 25 Mar 2023 | 14:30:06 UTC

Any ideas why WUs are failing on a linux ubuntu machine with gtx1070?

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:01:49 (3551): wrapper (7.7.26016): starting
14:02:12 (3551): wrapper (7.7.26016): starting
14:02:12 (3551): wrapper: running bin/python (bin/conda-unpack)
14:02:13 (3551): bin/python exited; CPU time 0.280413
14:02:13 (3551): wrapper: running bin/tar (xjvf input.tar.bz2)
14:02:14 (3551): bin/tar exited; CPU time 0.840912
14:02:14 (3551): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/7/bin
+++ dirname /var/lib/boinc-client/slots/7/bin
++ local full_path_env=/var/lib/boinc-client/slots/7
+++ basename /var/lib/boinc-client/slots/7
++ local env_name=7
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/7
++ CONDA_PREFIX=/var/lib/boinc-client/slots/7
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(7) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/7/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/7/etc/conda/activate.d ']'
+++ ls -A /var/lib/boinc-client/slots/7/etc/conda/activate.d
++ '[' -n ocl-icd_activate.sh ']'
++ local _path
++ for _path in "$_script_dir"/*.sh
++ . /var/lib/boinc-client/slots/7/etc/conda/activate.d/ocl-icd_activate.sh
+++ conda_ocl_icd_activate
++++ ls /var/lib/boinc-client/slots/7/etc/OpenCL/vendors/
+++ [[ -z ocl-icd-system ]]
+ export PATH=/var/lib/boinc-client/slots/7:/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/7:/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/7/tmp
+ TMP=/var/lib/boinc-client/slots/7/tmp
+ mkdir -p /var/lib/boinc-client/slots/7/tmp
+ echo 'Install AToM'
+ REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/7/tmp/pip-req-build-0qwsbkqo
Running command git rev-parse -q --verify 'sha^172e6db924567cd0af1312d33f05b156b53e3d1c'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git checkout -q 172e6db924567cd0af1312d33f05b156b53e3d1c
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4
╰─> [0 lines of output]
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
14:02:22 (3551): bin/bash exited; CPU time 2.696428
14:02:22 (3551): app exit status: 0x1
14:02:22 (3551): called boinc_finish(195)

</stderr_txt>
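
One possible reading of the "exit code: -4" line, offered as a guess rather than a diagnosis: if pip is passing on the child's return code in Python's subprocess convention (an assumption here), a negative value means the process was killed by the signal with that number. The helper below is a hypothetical illustration of decoding such a value.

```python
import signal


def describe_returncode(returncode):
    """Turn a subprocess-style return code into a readable description."""
    if returncode >= 0:
        return f"exited with status {returncode}"
    try:
        # Negative codes are -N for "terminated by signal N" on POSIX systems.
        name = signal.Signals(-returncode).name
    except ValueError:
        name = f"signal {-returncode}"
    return f"terminated by {name}"


if __name__ == "__main__":
    print(describe_returncode(-4))
    print(describe_returncode(1))
```

On Linux, signal 4 is SIGILL (illegal instruction), which, if the assumption holds, would point toward the host CPU or the installed binaries rather than the GTX 1070 itself.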

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60185 - Posted: 25 Mar 2023 | 16:27:51 UTC - in response to Message 60173.

(I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

Just downloaded a BACE task for Windows. There may be trouble ahead...

The job.xml file reads:

<job_desc>
<unzip_input>
<zipfilename>windows_x86_64__cuda1121.zip</zipfilename>
</unzip_input>
<task>
<application>python.exe</application>
<command_line>bin/conda-unpack</command_line>
<weight>1</weight>
</task>
<task>
<application>Library/usr/bin/tar.exe</application>
<command_line>xjvf input.tar.bz2</command_line>
<setenv>PATH=$PWD/Library/usr/bin</setenv>
<weight>1</weight>
</task>
<task>
<application>C:/Windows/system32/cmd.exe</application>
<command_line>/c call run.bat</command_line>
<setenv>CUDA_DEVICE=$GPU_DEVICE_NUM</setenv>
<stdout_filename>run.log</stdout_filename>
<weight>1000</weight>
<fraction_done_filename>progress</fraction_done_filename>
</task>
</job_desc>


1) We had problems with python.exe triggering a missing DLL error. I'll run Dependency Walker over this one, to see what the problem is.

2) It runs a private version of tar.exe: Microsoft included tar as a system utility from Windows 10 onwards - but I'm running Windows 7. The MS utility wouldn't run for me - I'll try this one.

3) I'm not totally convinced of the cmd.exe syntax either, but we'll cross that bridge when we get to it.
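
The weights in the job.xml above may also explain the 0.19% figure reported earlier in the thread. Assuming the wrapper combines the task weights linearly (an assumption, the wrapper source is not quoted here), finishing the two weight-1 tasks out of a total of 1002 gives about 0.001996. The function below is only an illustration of that arithmetic.

```python
def fraction_after(weights, completed, current_fraction=0.0):
    """Overall fraction done once `completed` tasks have finished and the next
    task reports `current_fraction` of its own work."""
    total = sum(weights)
    done = sum(weights[:completed])
    if completed < len(weights):
        done += weights[completed] * current_fraction
    return done / total


# Weights of the three <task> entries: conda-unpack, tar, run.bat
WEIGHTS = [1, 1, 1000]

if __name__ == "__main__":
    # Both setup steps finished, main task not yet reporting any progress.
    print(round(fraction_after(WEIGHTS, 2), 6))
```

If the main task never writes its progress file, the display would stay at that value until the task finishes and then jump to 100%.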

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60186 - Posted: 25 Mar 2023 | 17:16:04 UTC - in response to Message 60185.
Last modified: 25 Mar 2023 | 17:42:46 UTC

First reports from Dependency Walker:

"Error opening file: The system cannot find the file specified" for
API-MS-WIN-CORE-PATH-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-ERROR-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-ROBUFFER-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-STRING-L1-1-0.DLL
DCOMP.DLL
IESHIMS.DLL

The API-MS-WIN group and IESHIMS.DLL usually resolve when delay-load files are loaded during the run. But I can't find DCOMP.DLL in either the unpacked libraries, or the Windows system disk.

DCOMP.DLL seems to be called from MSHTML.DLL, which is a Windows system file. But I still can't find it from there.

Enough for now - my head is spinning!

Edit - DCOMP.DLL is present on my Windows 10 - now Windows 11 - laptop. Another fine example of Microsoft version control.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60188 - Posted: 26 Mar 2023 | 8:24:32 UTC
Last modified: 26 Mar 2023 | 9:21:02 UTC

Just a note of warning: one of my machines is running a JNK1 task - been running for 13 hours.

It's running fine - the run log has reached sample 287, and progress has reached 1.2654867256637168

But that's over 100%, and the BOINC display has reached (and is pegged at) 100% - probably has been for several hours. Ignore it.

Edit: It's reached sample 298. And I've found a [task name].cntl file, which contains the line

MAX_SAMPLES = 341

One reason why this needs fixing: I have my BOINC client set up in such a way that it normally fetches the next task around an hour before the current one is expected to finish. Because this one was (apparently) running so fast, it reached that point over five hours ago - and it's still waiting. Sorry Abouh - your next result will be late!
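
A worked check of that number: the progress formula quoted upthread from atm.py divides by num_samples - last_sample, and 286 / 226 reproduces the value in the progress file. With MAX_SAMPLES = 341, a denominator of 226 would correspond to last_sample = 115. Both inputs are inferred from the arithmetic, not read from the task, and `progress_clamped` is a hypothetical alternative, not the project's fix.

```python
def progress_as_posted(isample, num_samples, last_sample):
    """Formula as quoted upthread from atm.py."""
    return float(isample) / float(num_samples - last_sample)


def progress_clamped(isample, num_samples):
    """Hypothetical alternative: fraction of all samples, limited to [0, 1]."""
    return min(max(isample / num_samples, 0.0), 1.0)


if __name__ == "__main__":
    # Inferred inputs: sample 286 of 341, restarted from a checkpoint at sample 115.
    print(progress_as_posted(286, 341, 115))
    print(progress_clamped(286, 341))
```

If isample counts from the start of the run rather than from the restart point, the quoted formula exceeds 1 once isample passes num_samples - last_sample, which would match a display pegged at 100% long before the task ends.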

Freewill
Send message
Joined: 18 Mar 10
Posts: 20
Credit: 23,118,082,894
RAC: 171,645,879
Level
Trp
Scientific publications
watwatwatwatwat
Message 60189 - Posted: 26 Mar 2023 | 11:48:39 UTC

I also noticed this latest round of BACE tasks have become much longer to run on my GPUs. Some are hitting > 24 hrs. I am going to stop taking new ones unless the # samples/task is trimmed down.

[SG] Felix
Send message
Joined: 29 Jan 16
Posts: 11
Credit: 32,223,035
RAC: 0
Level
Val
Scientific publications
watwat
Message 60190 - Posted: 26 Mar 2023 | 12:16:20 UTC

I had this one running for about 8 hours, but then I had to shut down my computer.
Unfortunately, it couldn't restart from the app checkpoint, and since there is no BOINC checkpoint, it crashed and reported no run time.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60191 - Posted: 26 Mar 2023 | 12:31:35 UTC

Forget about a re-start, these WUs cannot even take a suspension. I suspended my computer and this WU collapsed.
task 27438865

[SG] Felix
Send message
Joined: 29 Jan 16
Posts: 11
Credit: 32,223,035
RAC: 0
Level
Val
Scientific publications
watwat
Message 60192 - Posted: 26 Mar 2023 | 13:10:54 UTC

I'm a bit surprised right now. I looked at the resend, and it was successfully completed in just over 2 minutes. How come? That computer has more WUs that were successfully completed in such a short time. Am I doing something wrong?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60193 - Posted: 26 Mar 2023 | 13:44:47 UTC - in response to Message 60189.

I also noticed this latest round of BACE tasks have become much longer to run on my GPUs. Some are hitting > 24 hrs. I am going to stop taking new ones unless the # samples/task is trimmed down.


I agree, the 4-6hr runs are much better.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60194 - Posted: 26 Mar 2023 | 23:44:00 UTC

I have a task that reached 100% an hour ago, which means it is supposed to be finished, but it's still running.

https://www.gpugrid.net/workunit.php?wuid=27439822

I don't want to abort it, but this is annoying.

What would be a reasonable amount of time to let it run?

The runtime at posting time is 7 hours and 30 minutes.



Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60195 - Posted: 27 Mar 2023 | 1:25:12 UTC - in response to Message 60194.

My last ATM tasks spent at least a couple of hours at the 100% completion point.

Just let them run and eventually they will turn themselves in for validation.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60196 - Posted: 27 Mar 2023 | 1:39:49 UTC - in response to Message 60195.

That's a moot point now. It errored out.

https://www.gpugrid.net/result.php?resultid=33381994

I guess this goes with the territory.




Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60197 - Posted: 27 Mar 2023 | 4:03:36 UTC - in response to Message 60196.

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not let applications access the .tmp directory or something.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60198 - Posted: 27 Mar 2023 | 7:03:07 UTC - in response to Message 60197.

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not let applications access the .tmp directory or something.



The Boinc version is 7.20.7.

https://www.gpugrid.net/hosts_user.php?userid=19626


Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60199 - Posted: 27 Mar 2023 | 7:33:49 UTC

Another task failed.

https://www.gpugrid.net/result.php?resultid=33383003

03/27/2023 3:20:22 AM | GPUGRID | Computation for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 finished
03/27/2023 3:20:22 AM | GPUGRID | Output file MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5_0 for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 absent

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60200 - Posted: 27 Mar 2023 | 8:04:20 UTC - in response to Message 60199.

The output file will always be absent if the task fails - it doesn't get as far as writing it. The actual error is in the online report:

ValueError: Energy is NaN.

('Not a Number')

That's a science problem - not your fault.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60201 - Posted: 27 Mar 2023 | 8:48:51 UTC - in response to Message 60171.
Last modified: 27 Mar 2023 | 8:49:14 UTC

I've seen that you are unhappy with the last batch of runs because they take too much time. I've been experimenting with dividing the runs into different numbers of steps, looking for a sweet spot that you're happy with and that doesn't make it madness for me to organize all these runs and re-runs. I'll backtrack to the setting we had before. Apologies for that.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.


I'll ask Raimondas about this and the other things that have been mentioned since he's the one taking care of this issue.
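
As a rough illustration of the checkpoint-on-termination idea quoted above, here is a minimal Python sketch. It is not AToM's actual code, and it assumes the wrapper forwards a SIGTERM to the science app, which is exactly the open question here. The names StopRequested, install_handler, run_samples, run_one_sample and write_checkpoint are all hypothetical.

```python
import signal


class StopRequested:
    """Signal handler that only records that a stop was requested."""

    def __init__(self):
        self.flag = False

    def __call__(self, signum, frame):
        # Do no real work inside the handler; just remember the request.
        self.flag = True


def install_handler(stop):
    """Register the handler for SIGTERM (call once from the main thread)."""
    signal.signal(signal.SIGTERM, stop)


def run_samples(first, last, run_one_sample, write_checkpoint, stop):
    """Run samples first..last, checkpointing early if a stop was requested.

    Returns the number of the last completed sample.
    """
    for isample in range(first, last + 1):
        run_one_sample(isample)
        if stop.flag:
            # Finish the sample in progress, save state, then leave cleanly.
            write_checkpoint(isample)
            return isample
    write_checkpoint(last)
    return last


# Typical use (hypothetical callbacks):
#   stop = StopRequested()
#   install_handler(stop)
#   run_samples(115, 341, run_one_sample, write_checkpoint, stop)
```

The design choice in this sketch is to stop only at a sample boundary, so the checkpoint always corresponds to a completed sample. Whether that is fast enough for BOINC's shutdown timeout is not something this thread establishes.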

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60202 - Posted: 27 Mar 2023 | 8:52:22 UTC

I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for anyone?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60203 - Posted: 27 Mar 2023 | 9:08:39 UTC - in response to Message 60202.

I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for anyone?

It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem.

I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60204 - Posted: 27 Mar 2023 | 9:37:43 UTC - in response to Message 60203.
Last modified: 27 Mar 2023 | 9:38:19 UTC

I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for anyone?

It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem.

I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used.


This one is a rerun, meaning that 2/3 of the run was previously simulated.
Maybe it was expecting to start from 0 samples and, once it saw that we were at 228 from the beginning, it got confused.

I'll pass that on.

PS: But other runs have been reporting correctly?

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 13,287,569,643
RAC: 39,768,569
Level
Trp
Scientific publications
watwatwatwatwat
Message 60205 - Posted: 27 Mar 2023 | 9:43:06 UTC

https://www.gpugrid.net/result.php?resultid=33382097

I had to suspend this task at sample 149 and resumed it an hour later, but it started again with the Python install step and died. It should restart from sample 149.

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 13,287,569,643
RAC: 39,768,569
Level
Trp
Scientific publications
watwatwatwatwat
Message 60206 - Posted: 27 Mar 2023 | 9:47:44 UTC - in response to Message 60204.

see post https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60160

it should be
progress = float(isample)/float(num_samples)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60207 - Posted: 27 Mar 2023 | 10:22:32 UTC - in response to Message 60206.

Or possibly

progress = float(isample - last_sample)/float(num_samples - last_sample)

if you want a truncated resend to start from 0% - but might that affect paused/resumed tasks as well?
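
To make the difference between the two formulas concrete, here is a small Python sketch. It is only an illustration, not the code in the app: the function names are made up, and the clamping to the 0..1 range is an addition that neither formula above includes.

```python
def progress_absolute(isample: int, num_samples: int) -> float:
    """Fraction of the whole run completed: isample / num_samples."""
    return min(1.0, isample / num_samples)


def progress_relative(isample: int, num_samples: int, last_sample: int) -> float:
    """Fraction of this task's own share completed.

    last_sample is the number of samples already done before this task
    started (for example 114 for a resend that continues at sample 115).
    """
    remaining = num_samples - last_sample
    if remaining <= 0:
        return 1.0
    return min(1.0, max(0.0, (isample - last_sample) / remaining))


if __name__ == "__main__":
    for isample in (114, 115, 228, 341):
        print(
            isample,
            round(progress_absolute(isample, 341), 4),
            round(progress_relative(isample, 341, 114), 4),
        )
```

On the paused/resumed question: in this sketch a resumed task would only jump back towards 0% if last_sample were re-read from the newest checkpoint on every restart. Keeping last_sample fixed at the value the work unit was issued with would avoid that, but that is an assumption about how the value could be supplied, not a description of the current app.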

Aurum
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60208 - Posted: 27 Mar 2023 | 13:49:57 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers' time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 2,310,846,749
RAC: 6,672,061
Level
Phe
Scientific publications
wat
Message 60209 - Posted: 27 Mar 2023 | 14:28:18 UTC

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted


I agree with this. I had one error out on a restart two days ago after reaching nearly 100%, due to no checkpoints. Not only that, but it then showed only 37 seconds of CPU time, so it doesn’t show what really happened. My latest one did complete but showed no checkpoints. The long run time therefore means a higher risk of losing work to an interruption.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60210 - Posted: 27 Mar 2023 | 16:05:56 UTC - in response to Message 60208.
Last modified: 27 Mar 2023 | 16:07:19 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60211 - Posted: 27 Mar 2023 | 16:41:41 UTC - in response to Message 60198.

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not let applications access the .tmp directory or something.



The Boinc version is 7.20.7.

https://www.gpugrid.net/hosts_user.php?userid=19626



Not your fault, I got a couple errored tasks that duplicated yours. Just a bad batch of tasks went out.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 78,050,793
RAC: 1,301,924
Level
Thr
Scientific publications
wat
Message 60212 - Posted: 28 Mar 2023 | 0:39:31 UTC - in response to Message 60185.
Last modified: 28 Mar 2023 | 0:57:30 UTC

I have a problem with cmd. It exits with code 1 in 0 seconds.
Boinc version is 7.22.0 from https://github.com/BOINC/boinc/releases/tag/client_release%2F7.22%2F7.22.0

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60213 - Posted: 28 Mar 2023 | 14:55:14 UTC

I've got another very curious one.

PTP1B_new_20670_2qbs_23466_T4_2A-QUICO_TEST_ATM-0-1-RND0584_1

It started running about 2 hours ago, and says it's passed 60% progress. But now it seems to be making much slower work of it.

Looking at the run log, it started with MAX_SAMPLES: 114. The log entries run from

2023-03-20 06:25:58 - INFO - sync_re - Started: sample 1, replica 0
to
2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)
2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s)

Then it appears to start again, this time with MAX_SAMPLES: 341, logging from

2023-03-28 13:25:11 - INFO - sync_re - Started: sample 115, replica 0
(this is roughly when the task started running on my machine)
to, so far
2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s)

Note that each sample is taking roughly twice as long to complete as the ones before 114 - presumably run on a different machine?

The task is another resend, but the logging feels very strange. Is this how it's supposed to look?
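
For anyone who wants to compare the per-sample timings without reading the whole log by eye, here is a minimal Python sketch. It assumes only the line format shown in the excerpts above; the function name sample_durations is made up, and the embedded text is just the three "Finished" lines quoted in this post.

```python
import re
from statistics import mean

# Matches lines such as:
#   ... - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)
# The "Finished: ATM simulations" summary line does not match, by design.
SAMPLE_LINE = re.compile(r"Finished: sample (\d+) \(duration: ([\d.]+) s\)")


def sample_durations(log_text: str) -> dict[int, float]:
    """Return {sample number: duration in seconds} for every finished sample."""
    return {
        int(m.group(1)): float(m.group(2))
        for m in SAMPLE_LINE.finditer(log_text)
    }


log = """\
2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)
2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s)
2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s)
"""

durations = sample_durations(log)
earlier = [d for s, d in durations.items() if s <= 114]  # inherited part of the log
later = [d for s, d in durations.items() if s > 114]     # samples run locally
ratio = mean(later) / mean(earlier)
print(f"later samples take {ratio:.2f}x as long as earlier ones")
```

With only these three lines the ratio comes out at about 2.0, which matches the "roughly twice as long" impression. Feeding in the full run.log would give a better average.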

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60214 - Posted: 28 Mar 2023 | 15:20:32 UTC - in response to Message 60210.

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished.
I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60215 - Posted: 28 Mar 2023 | 17:18:27 UTC
Last modified: 28 Mar 2023 | 17:20:17 UTC

Looked at the errored tasks list on my account this morning and see another slew of badly misconfigured tasks went out.

Been seeing a lot of file not found errors.

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_new_2-50-60_0.xml'

Thankfully they fail fast and are purged shortly after working through the _7 iteration.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60216 - Posted: 28 Mar 2023 | 23:51:18 UTC - in response to Message 60214.

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished.
I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused.

________________

Completed after two days, four hours and forty minutes.
Now there is another problem. One task is showing 100% completed for the last four hours but it is still using the CPU for something. Not the GPU. The elapsed clock is still ticking but the remaining is zero.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60218 - Posted: 29 Mar 2023 | 1:47:31 UTC - in response to Message 60216.

This task PTP1B_23471_23468_2_2A-QUICO_TEST_ATM-0-1-RND8957_1 is currently doing the same on this host.

Been at 100% complete for at least an hour now.

I know to just leave them alone and they will eventually finish and report as validated.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60219 - Posted: 29 Mar 2023 | 7:09:23 UTC

This task reached "100% complete" in about 7 hours, and then ran for an additional 7 hours +, before actually finishing.

https://www.gpugrid.net/workunit.php?wuid=27442023


Anybody got that beat??????



Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60220 - Posted: 29 Mar 2023 | 7:20:09 UTC - in response to Message 60219.
Last modified: 29 Mar 2023 | 7:30:44 UTC

Anybody got that beat??????

The task I reported in Message 60213 (14:55 yesterday) is still running. It was approaching 100% when I went to bed last night, and it's still there this morning. I'll go and check it out after coffee (I can't see the sample numbers remotely).

As soon as I wrote that, it uploaded and reported! Ah well, my other Linux machine has got one in the same state.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60223 - Posted: 29 Mar 2023 | 8:23:53 UTC - in response to Message 60216.
Last modified: 29 Mar 2023 | 8:31:49 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished.
I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused.

________________

Completed after two days, four hours and forty minutes.
Now there is another problem. One task is showing 100% completed for the last four hours but it is still using the CPU for something. Not the GPU. The elapsed clock is still ticking but the remaining is zero.

_________________

Just woke up. The task was finished. Sent it home.
task 27441741

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60224 - Posted: 29 Mar 2023 | 8:31:04 UTC

OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60225 - Posted: 29 Mar 2023 | 9:22:38 UTC - in response to Message 60224.

OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.


I believe it's what I imagined. In the manual division I was doing before, I was splitting some runs into 2 or 3 steps: 114 - 228 - 341 samples. If the job ID has a 2A/3A in it, it's most probably starting from a previous checkpoint, and the progress report is going crazy with it. I'll pass this on to Raimondas to see if he can take a look at it.

Our first priority is to have these job divisions done automatically, like ACEMD does, so that we can avoid these really long jobs for everyone. Doing this manually makes it really hard to track all the jobs and the resends. So I hope that everything goes more smoothly in the next few days.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60226 - Posted: 29 Mar 2023 | 12:12:55 UTC - in response to Message 60225.

Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?

Ian&Steve C.
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60227 - Posted: 29 Mar 2023 | 12:26:14 UTC - in response to Message 60226.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.


115/(341-114) = 0.5066 = 50.66%

strikingly close. maybe "BOINC logic" in some form of rounding. but it's pretty clear that the 50% value is coming from this calculation.
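
Spelling the arithmetic out in Python, using only the numbers quoted in this thread (first local sample 115, MAX_SAMPLES 341, 114 samples inherited). The variable names are just for illustration.

```python
first_sample = 115   # first sample run on the volunteer's machine
max_samples = 341    # MAX_SAMPLES from the run log
done_before = 114    # samples inherited from the earlier step

denominator = max_samples - done_before   # 227
initial = first_sample / denominator      # progress shown after the first jump
per_sample = 1 / denominator              # progress added by each further sample

print(f"initial progress: {initial:.3%}")     # 50.661%
print(f"per-sample step : {per_sample:.3%}")  # 0.441%
```

The per-sample step of 0.441% matches the increments reported above. The initial value is about 0.3 percentage points below the observed 50.983%, and this calculation alone does not account for that gap.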


Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60228 - Posted: 29 Mar 2023 | 12:38:37 UTC - in response to Message 60227.

I thought I'd checked that, and got a different answer, but my mouse must have slipped on the calculator buttons.

The difference is probably the 0.2% program setup stages - it'll do. Thanks.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60229 - Posted: 29 Mar 2023 | 14:42:44 UTC

After that, it failed after 3 hours 20 minutes with a 'ValueError: Energy is NaN' error. Never mind - I tried.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 78,050,793
RAC: 1,301,924
Level
Thr
Scientific publications
wat
Message 60230 - Posted: 29 Mar 2023 | 17:59:59 UTC - in response to Message 60229.
Last modified: 29 Mar 2023 | 18:27:03 UTC

C:/Windows/system32/cmd.exe command creates c:\users\frolo\.exe\ folder.
On subsequent runs it gives "A subdirectory or file .exe already exists." error.

C:/Windows/system32/cmd.exe /c call test.bat outputs
The syntax of the command is incorrect.


C:\Windows\system32\cmd.exe /c call test.bat outputs
'test.bat' is not recognized as an internal or external command,
operable program or batch file.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60233 - Posted: 30 Mar 2023 | 9:51:23 UTC - in response to Message 60226.

Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?


The first 114 samples should be calculated by: T_PTP1B_new_20669_2qbr_23472_1A_3-QUICO_TEST_ATM-0-1-RND2542_0.tar.bz2
I've been doing all the division and resends manually, and we've been simplifying the naming convention for my sake. Now we are testing a multiple_steps protocol, just like in AceMD, which should help ease things and, I hope, mess less with the progress reporter.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60234 - Posted: 30 Mar 2023 | 11:31:27 UTC - in response to Message 60233.

Thanks. Be aware that out here in client-land we can only locate jobs by WU or task ID numbers - it's extremely difficult to find a task by name unless we can follow an ID chain.

Newer versions of the BOINC website tools do provide a rudimentary 'search by name' facility, but it requires a full task name - no wildcards or partial matches. And I know your colleagues on this project are very wary about updating the server code. We'll just have to live with it.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60237 - Posted: 30 Mar 2023 | 18:08:57 UTC - in response to Message 60234.

Yeah I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. If everything runs smoothly, every task should get 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the 24h+ runs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60238 - Posted: 30 Mar 2023 | 18:22:27 UTC - in response to Message 60237.

Two downloaded, the first has reached 6% with no problems.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60239 - Posted: 30 Mar 2023 | 18:43:26 UTC - in response to Message 60237.

Yeah I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. If everything runs smoothly, every task should get 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the 24h+ runs.


____________________

The issue is unstable tasks, restart problems and suspend problems. Quite a few of us have done year-plus runs on Climate. 24-hour runs are no problem.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60240 - Posted: 30 Mar 2023 | 19:42:00 UTC
Last modified: 30 Mar 2023 | 20:12:31 UTC

deleted

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60241 - Posted: 30 Mar 2023 | 20:11:40 UTC - in response to Message 60237.

I believe I just finished one of these ATMbeta tasks.

https://www.gpugrid.net/result.php?resultid=33393179

It never checkpointed but it did show correct estimations of time to finish plus the progress was correct and incremented correctly.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60242 - Posted: 30 Mar 2023 | 21:07:11 UTC - in response to Message 60241.
Last modified: 30 Mar 2023 | 21:07:30 UTC

I believe I just finished one of these ATMbeta tasks.

https://www.gpugrid.net/result.php?resultid=33393179

It never checkpointed but it did show correct estimations of time to finish plus the progress was correct and incremented correctly.

Same for me with Linux. Since there's no checkpointing I didn't bother to test suspending. I think all Windows WUs failed.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60243 - Posted: 31 Mar 2023 | 7:14:27 UTC
Last modified: 31 Mar 2023 | 8:09:36 UTC

My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141.

Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205).

Edit - yes, it did. I see you've put step information in the task names: these were

PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0
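
The step information in these names can be pulled apart mechanically. The following is a minimal sketch, assuming the names follow the layout `<system>-QUICO_ATM_OFF_STEPS-<step>-<total>-RND<tag>_<replica>`; the field names, and the idea that steps are zero-indexed with 70 samples each, are assumptions made for illustration, although they would be consistent with the two tasks above starting at samples 71 and 141.

```python
import re

# Assumed layout (not confirmed by the project):
#   <system>-QUICO_ATM_OFF_STEPS-<step>-<total>-RND<tag>_<replica>
TASK_NAME = re.compile(
    r"^(?P<system>.+)-QUICO_ATM_OFF_STEPS-(?P<step>\d+)-(?P<total>\d+)"
    r"-RND(?P<tag>\d+)_(?P<replica>\d+)$"
)


def parse_task_name(name):
    """Split a task name into its (assumed) fields."""
    match = TASK_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    fields = match.groupdict()
    return {
        "system": fields["system"],
        "step": int(fields["step"]),
        "total_steps": int(fields["total"]),
        "tag": int(fields["tag"]),
        "replica": int(fields["replica"]),
    }


def first_sample(step, samples_per_step=70):
    """First sample number of a step, assuming zero-indexed steps."""
    return step * samples_per_step + 1


info = parse_task_name("PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0")
print(info["step"], "of", info["total_steps"], "starts at", first_sample(info["step"]))
```

Under those assumptions, step 1 would begin at sample 71 and step 2 at sample 141.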

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,374,198,466
RAC: 15,190,309
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60244 - Posted: 31 Mar 2023 | 8:29:47 UTC - in response to Message 60243.
Last modified: 31 Mar 2023 | 8:34:51 UTC

My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141.

Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205).

Edit - yes, it did. I see you've put step information in the task names: these were

PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0


My observations are the same. When the units download, the estimated finish time reads 606 days.


https://www.gpugrid.net/results.php?hostid=534811&offset=0&show_names=0&state=0&appid=45


So far, in this batch, 3 WUs completed successfully, 1 errored and 1 is crunching, on a Windows 10 machine.


The units all crash on my other computer, which runs Windows 7 and is rather old (13 years). Maybe it's time to retire it from this project, though it still runs well on other projects, like Einstein and FAH.



https://www.gpugrid.net/results.php?hostid=544232&offset=0&show_names=0&state=0&appid=45

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60245 - Posted: 31 Mar 2023 | 12:52:00 UTC

My first ATM beta on Windows10 failed after some 6 hours :-(
https://www.gpugrid.net/result.php?resultid=33393839

Does anyone have an idea what exactly the problem was?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60246 - Posted: 31 Mar 2023 | 13:01:59 UTC - in response to Message 60245.

Does anyone have an idea what exactly the problem was?

It says

ValueError: Energy is NaN.

A science error (impossible result), rather than a computing error.
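
To illustrate why this counts as a science error rather than a crash: the simulation itself keeps running, but a sanity check on the reported energy rejects the result. Below is a minimal sketch of such a guard; the function name `check_energy` is hypothetical and this is not OpenMM's actual code, only the general idea behind the message.

```python
import math


def check_energy(energy):
    """Reject energies that cannot be physically meaningful."""
    if math.isnan(energy):
        raise ValueError("Energy is NaN.")
    if math.isinf(energy):
        raise ValueError("Energy is infinite.")
    return energy


# NaN is contagious: once one term is NaN, every sum that includes it is NaN.
good_terms = [-1520.4, 310.9, -42.0]
bad_terms = good_terms + [float("nan")]

print(check_energy(sum(good_terms)))
try:
    check_energy(sum(bad_terms))
except ValueError as error:
    print("rejected:", error)
```

Because a single bad value poisons every later sum, the check can only say that something went wrong, not where it first went wrong.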

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60247 - Posted: 31 Mar 2023 | 13:27:18 UTC - in response to Message 60246.

Potentially, it could also be due to an unstable overclock, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type of error from too much overclocking.

Of course, that is less likely if things are not overclocked or only mildly overclocked. I'm just raising the possibility.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60248 - Posted: 31 Mar 2023 | 18:06:07 UTC - in response to Message 60247.

Potentially, it could also be due to an unstable overclock, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type of error from too much overclocking.

Of course, that is less likely if things are not overclocked or only mildly overclocked. I'm just raising the possibility.

Thanks for this thought; it could well be the case. For some time, this old GTX 980 Ti has no longer followed the settings for GPU clock and power target, in the old NVIDIA Inspector as well as in the newer Afterburner.
Hence, particularly with ATM tasks, I noticed overclocking from the default 1152 MHz up to 1330 MHz. Not all the time, but often.
I have now experimented and found that I can control the GPU clock by reducing the fan speed, setting the GPU temperature to a fixed value and ticking "priorize temperature". So the clock now oscillates around 1100 MHz most of the time.
I will see whether the ATM tasks fail again or not.


kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 78,050,793
RAC: 1,301,924
Level
Thr
Scientific publications
wat
Message 60249 - Posted: 1 Apr 2023 | 6:29:10 UTC
Last modified: 1 Apr 2023 | 6:30:22 UTC

My ATM beta tasks crash.
http://www.gpugrid.net/result.php?resultid=33398437
Do you know why?

ZUSE
Avatar
Send message
Joined: 10 Jun 20
Posts: 7
Credit: 775,415,897
RAC: 7,145,733
Level
Glu
Scientific publications
wat
Message 60250 - Posted: 1 Apr 2023 | 7:00:18 UTC

me too.
All ATM tasks!

graphic card Tesla P4
Ryzen 5600G
32GB RAM
Windows 11
under Linux too

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60251 - Posted: 1 Apr 2023 | 7:16:21 UTC - in response to Message 60249.

Something in your Windows configuration has a problem running cmd.exe and calling the run.bat file. Windows barfs on the 0x1 exit error.

Same as the other fellow running Windows.

No concrete smoking gun flaw shown.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60254 - Posted: 1 Apr 2023 | 13:04:13 UTC

Another possibility is that your system restarted after an update, or that you suspended the task.
I have decided to let Intel, Microsoft, Dell and Acer update themselves when they want. It is not our fault if the WU crashes. The onus is on the project admins to make their WUs stable enough to fall back to the last checkpoint.
Our job is to keep our systems up to date and maintained to run these WUs to the best of our abilities.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60256 - Posted: 1 Apr 2023 | 15:28:48 UTC

The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 60257 - Posted: 2 Apr 2023 | 2:20:00 UTC - in response to Message 60256.

The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.


I agree. I have lost quite a few hours on WUs that were going to complete because I had to perform reboots and lost them. Is anyone addressing this issue yet?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60258 - Posted: 2 Apr 2023 | 2:39:38 UTC - in response to Message 60257.

Haven't heard or seen any comments by any of the devs. The acemd3 app hasn't been fixed in two years. And that is an internal application by Acellera.

No reason to expect any change in newest sub-project apps.

Not unless some dev has got a lot of time to dig into this type of bug.

And since almost all of the newer apps depend on external libraries, that falls to those external toolsets and devs outside of this project.

So probably not going to happen.

ZUSE
Avatar
Send message
Joined: 10 Jun 20
Posts: 7
Credit: 775,415,897
RAC: 7,145,733
Level
Glu
Scientific publications
wat
Message 60259 - Posted: 2 Apr 2023 | 8:26:04 UTC

it is just interesting that acemd3 runs through and ATM does not. errors appear after a few minutes

the system was neither restarted during the calculation nor was there an update

so the problem lies elsewhere

Exactly the same on Linux.

Windows and drivers are up to date

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60260 - Posted: 2 Apr 2023 | 8:31:36 UTC - in response to Message 60259.
Last modified: 2 Apr 2023 | 8:31:56 UTC

it is just interesting that acemd3 runs through and ATM does not. errors appear after a few minutes
the system was neither restarted during the calculation nor was there an update
so the problem lies elsewhere
Exactly the same on Linux.
Windows and drivers are up to date

If you're time-slicing with another GPU project that will cause a fatal "computation error" when BOINC switches between them.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60261 - Posted: 2 Apr 2023 | 11:03:21 UTC

Task failed.
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
__________________

Another one.

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
_____________________

Third one.

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
_________________________________

I have eleven failed tasks (proud of setting the record), all revolving around the same thing.

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60264 - Posted: 2 Apr 2023 | 17:55:46 UTC

Beta or not... how can a project send out tasks with this length of runtime and not have checkpointing of some sort?

The amount of resources wasted because of this has to be mind-boggling.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 60266 - Posted: 2 Apr 2023 | 22:22:50 UTC - in response to Message 60264.

I concur with bluestang. Some of the likely successful WUs I lost had 20+ hours wasted because of a necessary reboot. Project owners should make some sort of fix a priority.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60267 - Posted: 2 Apr 2023 | 23:46:57 UTC

Or just acknowledge you aren't willing to accept the project limitations and move onto other gpu projects that fit your usage conditions.

I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60268 - Posted: 3 Apr 2023 | 0:01:08 UTC - in response to Message 60267.

Or just acknowledge you aren't willing to accept the project limitations and move onto other gpu projects that fit your usage conditions.

I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted.

________________

Do you know what the problem is? Quico has not understood what abouh did at the very start. I am pretty sure that, whatever it is, if he brings it to the thread he will find an answer. There are a lot of people on the thread, you among them, who are willing to help to the best of their ability. Experts.
I have paused all Microsoft updates for five weeks, since they seem to trigger the rest of the updates, like Intel's. Just for these WUs.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60269 - Posted: 3 Apr 2023 | 3:08:20 UTC - in response to Message 60268.

But abouh's app is different from Quico's. They use different external tools. You can't apply the fixes abouh made to Quico's app.

The Python tasks use the PyTorch libraries and Quico's app uses the AToM libraries.

Plus the Python app is mostly a cpu app while the ATM app is mostly a gpu app.

They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60270 - Posted: 3 Apr 2023 | 8:10:32 UTC - in response to Message 60269.

But abouh's app is different from Quico's. They use different external tools. You can't apply the fixes abouh made to Quico's app.

The Python tasks use the PyTorch libraries and Quico's app uses the AToM libraries.

Plus the Python app is mostly a cpu app while the ATM app is mostly a gpu app.

They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic.

___________________

I am not saying anything but I agree with the sentiments of some.
Maybe, some of us can play with AToM libs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60271 - Posted: 3 Apr 2023 | 8:52:44 UTC - in response to Message 60270.

Having looked into the internal logging of Quico's tasks in some detail because of the progress %age problem, it's clear that it goes through the motions of writing a checkpoint normally - 70 times per task for the recent short runs, 341 per task for the very long ones. That's about once every five minutes on my machines, which would be perfectly acceptable to me.

I would judge the problem to be with the other end of the problem - re-starting the task after an interruption. That's more complicated, from the programmer's point of view - not only does the state of the science program's data have to be restored from disk in the proper format, all the wrapper's counters and timings have to be re-aligned and re-started.
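
The two halves described above can be sketched as follows. This is a minimal illustration and not the ATM application's code: the JSON file format and the names `write_checkpoint`, `load_checkpoint` and `run` are assumptions. It shows that writing a checkpoint is only useful if startup also reads it back and re-aligns the sample counter and elapsed-time counter.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path, default):
    """Return the saved state if a checkpoint exists, otherwise a fresh copy of default."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as handle:
        return json.load(handle)


def run(path, first_sample, last_sample, do_sample):
    """Run samples first_sample..last_sample, resuming from a checkpoint if present."""
    state = load_checkpoint(path, {"next_sample": first_sample, "elapsed_s": 0.0})
    for isample in range(state["next_sample"], last_sample + 1):
        state["elapsed_s"] += do_sample(isample)  # do_sample returns seconds spent
        state["next_sample"] = isample + 1
        write_checkpoint(path, state)
    return state
```

If a run is interrupted after sample 73, calling `run` again with the same path would continue at sample 74 with the elapsed time carried over, instead of starting from zero.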

By all means explore and learn about the tools and libraries used for these tasks, but I suspect you'll have to get down and dirty with the application's code as well. Let us know how you get on.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60273 - Posted: 4 Apr 2023 | 13:30:45 UTC

Impressive.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60274 - Posted: 4 Apr 2023 | 15:39:05 UTC

anyone any idea why this task:
https://www.gpugrid.net/result.php?resultid=33405348 failed after 5 1/2 hours?

This time, there was no overclocking involved. So the reason must have been a different one :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60275 - Posted: 4 Apr 2023 | 18:29:37 UTC - in response to Message 60274.

ValueError: Energy is NaN. In other words, not a number.

An impossible value got the task thrown out. There are a couple of possible reasons:

A misconfigured or "bad" task.

A GPU running overclocked or hot, which caused math errors.

[AF>FAH-Addict.net]toTOW
Send message
Joined: 28 Oct 10
Posts: 9
Credit: 25,781,299
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60276 - Posted: 4 Apr 2023 | 18:47:37 UTC
Last modified: 4 Apr 2023 | 18:54:28 UTC

All WUs seem to be failing the same way, with missing files:
https://www.gpugrid.net/result.php?resultid=33406732
https://www.gpugrid.net/result.php?resultid=33406795
https://www.gpugrid.net/result.php?resultid=33406795

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60277 - Posted: 4 Apr 2023 | 18:54:51 UTC - in response to Message 60276.

We see this frequently with misconfigured tasks. The researcher does a poor job of updating the task generation template when configuring new tasks.

It wastes time and resources for everyone.

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,258,731,270
RAC: 480,669
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60278 - Posted: 4 Apr 2023 | 21:34:20 UTC - in response to Message 60276.

seems to be failing the same way with missing files

Same here:
https://www.gpugrid.net/result.php?resultid=33406558
But it is the first failure among a dozen successfully completed tasks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60279 - Posted: 5 Apr 2023 | 6:43:22 UTC - in response to Message 60277.

Wastes time and resources for every one.

Well, as long as a task fails within a few minutes (I had a few of those yesterday), I think it's not that bad.
But I had one, the day before yesterday, which failed after some 5-1/2 hours - which is not good :-(

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60281 - Posted: 5 Apr 2023 | 8:47:03 UTC

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60282 - Posted: 5 Apr 2023 | 18:46:01 UTC

Thought I'd run a quick test to see if there was any progress on the restart front. Waited until a task had just finished, and let a new one start and run to the first checkpoint: then paused it, and waited while another project borrowed the GPU temporarily.

The test task was p38_2m_2j_5-QUICO_ATM_OFF_STEPS-1-5-RND9265_1. On restart, it started again from zero progress, zero elapsed time, and ran up to the 0.200% point: then it crashed as before. I didn't have any time to rescue any logs from the restart - my BOINC client cleaned and reused the slot for something else before I could catch it.

The website report says it ran for about 40 seconds, and stderr.txt contains the lines

Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/2/tmp/pip-req-build-368b4spp
fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied

That doesn't sound very hopeful. It's still a problem.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60283 - Posted: 5 Apr 2023 | 19:05:03 UTC - in response to Message 60281.

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

_______________________________

task 27451763
task 27451117
task 27452961
Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60284 - Posted: 5 Apr 2023 | 19:35:54 UTC

This "OFF" in the WU points towards Python. "AToM" also has something to do with Python.
I errored out on one of Abou's WU because my GPU was updated. Python? I do not know. I do not know how to dive under the bonnet but Google up "OFF Python" and "AToM Python", there is a relationship.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60285 - Posted: 5 Apr 2023 | 20:44:22 UTC - in response to Message 60284.

I think 'Python' is a programming language, and 'AToM' is a scientific program written in that language.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60286 - Posted: 6 Apr 2023 | 4:55:32 UTC - in response to Message 60279.

Well, as long as a tasks fails within a few minutes (I had a few such ones yesterday), I think it's not that bad.
But I had one, day before yesterday, which failed after some 5-1/2 hours - which is not good :-(

What I have noticed lately on my machines is that when ATM tasks fail, it is mostly after 60-90 seconds.
And stderr always says:

FileNotFoundError: [Errno 2] No such file or directory: 'thrombin_noH_2-1a-3b_0.xml'
23:18:10 (18772): C:/Windows/system32/cmd.exe exited; CPU time 18.421875

see here:
https://www.gpugrid.net/result.php?resultid=33409106

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60287 - Posted: 6 Apr 2023 | 7:46:00 UTC - in response to Message 60283.

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

_______________________________

task 27451763
task 27451117
task 27452961
Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off.

task 27452387
task 27452312
task 27452961
task 27452969
completed and validated.
One task in error,
task 33410323

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60289 - Posted: 6 Apr 2023 | 10:44:12 UTC - in response to Message 60286.

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60290 - Posted: 6 Apr 2023 | 12:09:09 UTC

Has anyone noticed that the WUs with 'Bace' in their name show progress as 100% while the Time Elapsed counter is still ticking? Task Manager shows the task is still busy computing. This goes on for hours on end, and one task went up to 24 hours in this state.
If a task is doing this, it does not mean the task has failed. Check the Task Manager first and let it complete. Currently, this task is doing it:
task 33409877
I wish someone would put up a notice that this project is not for people who switch off their computers at night or for other reasons.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60291 - Posted: 6 Apr 2023 | 12:10:38 UTC - in response to Message 60289.

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

same here, about 1 hour ago:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml'

Such errors, happening this often, may point to some kind of sloppy task configuration?
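
A failure of this kind could in principle be caught before any GPU time is spent. The following is a minimal sketch of a pre-flight check, assuming the list of required input files is known up front; the function names `missing_inputs` and `preflight` are hypothetical and nothing here reflects how the project actually builds its tasks.

```python
from pathlib import Path


def missing_inputs(workdir, required):
    """Return the required file names that are not present in workdir."""
    base = Path(workdir)
    return [name for name in required if not (base / name).is_file()]


def preflight(workdir, required):
    """Fail fast, naming every missing input, instead of failing mid-run."""
    missing = missing_inputs(workdir, required)
    if missing:
        raise FileNotFoundError(
            f"missing input files in {workdir}: " + ", ".join(missing)
        )


if __name__ == "__main__":
    try:
        preflight(".", ["MCL1_pmx-30-40_0.xml"])
    except FileNotFoundError as error:
        print(error)
```

The same check run on the submission side, against the files bundled for each task, would flag a stale template before the work units go out.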

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60292 - Posted: 6 Apr 2023 | 14:17:30 UTC - in response to Message 60291.
Last modified: 6 Apr 2023 | 14:18:30 UTC

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

same here, about 1 hour ago:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml'

such errors, happening often enough, may show some kind of sloppy tasks configuration ?

__________________

Same here. The task with 'MCL1' in its name lasted only 18 seconds.
task 33411408

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60293 - Posted: 6 Apr 2023 | 14:49:23 UTC
Last modified: 6 Apr 2023 | 15:30:56 UTC

This WU with 'Jnk1' in it, lasted ten seconds.
task 33411216


Edit. Now I have a WU with 'thrombin' in its name. Reached 100% in 15 minutes but is still busy with the GPU and CPU.
task 33413038

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60294 - Posted: 6 Apr 2023 | 20:16:42 UTC

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,258,731,270
RAC: 480,669
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60295 - Posted: 6 Apr 2023 | 21:45:11 UTC - in response to Message 60294.
Last modified: 6 Apr 2023 | 21:46:02 UTC

But it is showing progress as 100%

So with all ATM WUs, this is "normal".
Perhaps later the devs will be able to fix it.
So there is no need to be surprised by this fact in every post -_-

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60296 - Posted: 6 Apr 2023 | 22:52:16 UTC - in response to Message 60293.
Last modified: 6 Apr 2023 | 22:55:29 UTC

This WU with 'Jnk1' in it, lasted ten seconds.
task 33411216


Edit. Now I have a WU with 'thrombin' in its name. Reached 100% in 15 minutes but is still busy with the GPU and CPU.
task 33413038

_______________

Completed and validated.

No. For some reason, people are aborting, like this WU 'thrombin'. We normally watch the progress report. Instead, check the Task Manager. If there is a heartbeat, let it run.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60297 - Posted: 6 Apr 2023 | 23:23:17 UTC - in response to Message 60294.

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833

_______________

Completed and validated.
Auram?

Speedy
Send message
Joined: 19 Aug 07
Posts: 43
Credit: 28,391,082
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwat
Message 60301 - Posted: 11 Apr 2023 | 8:51:48 UTC

When tasks are available, what percentage of CPU do they require? Does the CPU usage fluctuate like it does with the other Python tasks?

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60302 - Posted: 11 Apr 2023 | 11:59:50 UTC - in response to Message 60301.

When tasks are available how much percentage of CPU do they require, does the CPU usage fluctuate like the other Python tasks?

One CPU is plenty for these tasks. They don't need a full GPU, so I run Einstein, Milkyway or OPNG alongside. The problem is that if BOINC time-slices an ATM WU, it will fail when it gets restarted.
The exception is when it gets time-sliced during the final step (zipping up, maybe?) after several hours; then it uploads and reports as valid.
The best way to ensure these ATM WUs succeed is not to run a different project, so that BOINC does not switch the GPU away and crash the WU when it restarts. Running 2 ATM WUs per GPU, or ACEMD plus ATM, is OK since BOINC doesn't switch away.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60303 - Posted: 11 Apr 2023 | 12:08:11 UTC - in response to Message 60297.
Last modified: 11 Apr 2023 | 12:10:05 UTC

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833
_______________
Completed and validated.
Auram?

Yes the failed WU is my Rig-11 which is having intermittent failures/reboots due to a MB/GPU issue of unknown origin. I've swapped GPUs several times and the problem stays with the Rig-11 MB so it's not a bad GPU. If I leave the GPU idle the CPU runs WUs fine. Einstein and Milkyway don't seem to cause the problem but Asteroids, GG and maybe OPNG do at random intervals. Also it might be time-slicing that I described in my penultimate reply.
It's probably time to scrap the MB. Since most are designed for gamers they stuff too much junk on them and compromise their reliability.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60304 - Posted: 11 Apr 2023 | 15:23:17 UTC

Looks like all of today's WUs are failing:

FileNotFoundError: [Errno 2] No such file or directory: 'CDK2_new_2_edit-1oiy-1h1q_0.xml'
It dumbfounds me why they still have it set to fail 7 times. If they fail at the end then that's several days of compute time wasted. Isn't two failures enough?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Scientific publications
wat
Message 60305 - Posted: 11 Apr 2023 | 15:27:48 UTC - in response to Message 60304.

I had two fail in this way, but the rest (20+ or so) are running fine. Certainly not "all" of them.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60306 - Posted: 11 Apr 2023 | 16:05:14 UTC

Strangely enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks while Python tasks were running.
The ATM tasks failed after a minute.
I checked my settings - it clearly says:
ATM (beta): no

So, how come that ATMs are being downloaded?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60307 - Posted: 11 Apr 2023 | 16:20:02 UTC - in response to Message 60306.

I think the beta toggle in preferences is 'sticky' in the scheduler.

I have seen similar. I didn't get Python beta until I set beta in preferences. I unset beta in preferences and still got beta Python tasks. Beta is set again for ATM.

Probably only a detach and reattach will fix it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Scientific publications
watwatwat
Message 60308 - Posted: 12 Apr 2023 | 2:08:15 UTC - in response to Message 60306.

Strangely enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks while Python tasks were running.
The ATM tasks failed after a minute.
I checked my settings - it clearly says:
ATM (beta): no

So, how come that ATMs are being downloaded?

I think ATMbeta is controlled by
Run test applications?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60309 - Posted: 12 Apr 2023 | 6:59:28 UTC - in response to Message 60308.

I think ATMbeta is controlled by
Run test applications?

Oh, this might explain it.
While I unchecked "ATM beta", I neglected to uncheck "Run test applications".

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Message 60310 - Posted: 12 Apr 2023 | 15:43:22 UTC
Last modified: 12 Apr 2023 | 15:43:37 UTC

This WU errored out for me with NaN at 913 seconds. I never overclock my GPUs, and I power-limited this 2080 Ti to 180 W, since GPUs are notorious for wasting energy. This NaN error is due to the calculation boundaries being set wrong.
https://www.gpugrid.net/workunit.php?wuid=27468777

bibi
Joined: 4 May 17
Posts: 14
Credit: 13,287,569,643
RAC: 39,768,569
Level
Trp
Message 60312 - Posted: 12 Apr 2023 | 15:56:19 UTC

Hello Quico,
I hope my interpretation is correct.
See https://boinc.berkeley.edu/trac/wiki/WrapperApp

If no task has checkpoint_filename defined, the job starts over from the first task and breaks with python pip.
The task with the run script should define checkpoint_filename. The progress file changes after each checkpoint, so it may be enough to specify progress as checkpoint_filename. Resume should then work exactly the same as a start from a checkpoint.
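
As a rough illustration of the resume behaviour described above, here is a minimal Python sketch. It is not the ATM app's or the BOINC wrapper's actual code: the JSON layout and the helper names save_checkpoint and load_checkpoint are hypothetical, and only the idea of a progress file that is rewritten after each checkpoint comes from this post.

```python
import json
import os


def save_checkpoint(path, isample):
    """Write the last completed sample index to `path` (hypothetical layout)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"isample": isample}, fh)
    # os.replace swaps the file in one step, so an interrupted write
    # never leaves a half-written checkpoint behind.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the sample index to resume from, or 0 for a fresh start."""
    try:
        with open(path) as fh:
            return json.load(fh)["isample"]
    except (FileNotFoundError, ValueError, KeyError):
        # No usable checkpoint: start from the first sample.
        return 0
```

With something like this, a resumed task would call load_checkpoint("progress") first and continue from the returned sample instead of starting over.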

progress
The formula should be changed as suggested by Richard Haselgrove in http://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60206:
progress = float(isample - last_sample)/float(num_samples - last_sample)
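
A small Python sketch of that formula, with guards added. The variable names come from the formula above; their meaning is my assumption (last_sample is the sample count the task started from, num_samples the count at which it ends), and the clamping and the function name progress_fraction are my own additions, not part of the suggestion.

```python
def progress_fraction(isample, last_sample, num_samples):
    """Fraction of this task's own samples that are done, between 0.0 and 1.0."""
    remaining = num_samples - last_sample
    if remaining <= 0:
        # Nothing left to do in this task, so report it as complete.
        return 1.0
    frac = float(isample - last_sample) / float(remaining)
    # Clamp so a stray index can never report below 0% or above 100%.
    return min(max(frac, 0.0), 1.0)
```

For a task that starts at sample 70 and ends at sample 140, this gives 0.0 at sample 70 and 0.5 at sample 105, so the progress bar starts at zero for each task.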



Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Message 60319 - Posted: 14 Apr 2023 | 13:21:34 UTC
Last modified: 14 Apr 2023 | 13:39:21 UTC

File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907

Watched a WU finish, and it spent 6 minutes out of 313 minutes at 100%.
No checkpointing.
Has it been confirmed that the calculation boundaries are correct and not the cause of the NaN errors?
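
For anyone wondering what that traceback means: the reporter checks the energy and stops the run once it is no longer a number. A minimal sketch of that kind of check in plain Python follows. This is not OpenMM's actual code; the function name check_energy and the extra test for infinite values are my assumptions.

```python
import math


def check_energy(energy_kj_per_mol):
    """Raise ValueError if the energy is not a finite number."""
    if math.isnan(energy_kj_per_mol):
        raise ValueError("Energy is NaN")
    if math.isinf(energy_kj_per_mol):
        raise ValueError("Energy is infinite")
    return energy_kj_per_mol
```

A check like this only reports that the simulation has already blown up; it says nothing about whether the cause is the hardware or the setup.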

Erich56
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Message 60320 - Posted: 15 Apr 2023 | 8:41:17 UTC - in response to Message 60319.

I, too, had such an error after the task had run for 7.885 seconds:

File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN


https://www.gpugrid.net/result.php?resultid=33436488

no overclocking.

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Message 60321 - Posted: 15 Apr 2023 | 11:07:03 UTC - in response to Message 60319.

File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907

Watched a WU finish, and it spent 6 minutes out of 313 minutes at 100%.
No checkpointing.
Has it been confirmed that the calculation boundaries are correct and not the cause of the NaN errors?


Wow! Six minutes is a significant improvement over the hours it was taking before. Just don't give it a kick and abort.

bibi
Joined: 4 May 17
Posts: 14
Credit: 13,287,569,643
RAC: 39,768,569
Level
Trp
Message 60322 - Posted: 15 Apr 2023 | 11:19:45 UTC - in response to Message 60312.

Too bad, it's not so simple after all. I wrote the checkpoint tag into the job.xml in the project directory under Windows and, after two samples, did a suspend/resume. Again the job started with the first task and died with python pip.

Erich56
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Message 60333 - Posted: 16 Apr 2023 | 18:26:41 UTC - in response to Message 60320.

I, too, had such an error after the task had run for 7.885 seconds:

File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN


https://www.gpugrid.net/result.php?resultid=33436488

no overclocking.


This time, the task errored out after 16.400 seconds :-(

https://www.gpugrid.net/result.php?resultid=33442242

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Message 60335 - Posted: 17 Apr 2023 | 8:10:17 UTC

It feels like there are at least four categories of ATMbeta WUs running simultaneously.
None have checkpointing.
Top priority should be to make checkpointing work.
The shotgun approach squanders a lot of compute time.

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Message 60336 - Posted: 17 Apr 2023 | 11:15:57 UTC

My nation, like many others, has gone into a default situation. The most expensive item is the supply of electricity, and they are frequently switching off the grid without informing us.
David H says the WUs are checkpointing. If they are checkpointing, then why are they not recovering? Well, recovering or not, I cannot do a thing about the electric grid. So, best of luck to the WUs, and as it is Ramadan, I have nothing left in the upper chamber to argue. Over and out.

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,754,760,632
RAC: 23,060,497
Level
Trp
Message 60352 - Posted: 26 Apr 2023 | 16:16:15 UTC

Still no checkpointing.
Suspend then unsuspend = crash.
Many WUs failing due to subprocess.

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Message 60353 - Posted: 28 Apr 2023 | 5:30:09 UTC

If there is a storm and the electricity goes, the WU crashes. I know that BOINCers do not restart for months on end, but I have to restart, and then the WU crashes. If the GPU updates or the system updates, the WU crashes. If the cat plays with the keyboard, the WU crashes.
I do not want catty remarks but will keep crashing them from now on. Who cares.

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,258,731,270
RAC: 480,669
Level
Met
Message 60354 - Posted: 30 Apr 2023 | 12:40:47 UTC - in response to Message 60353.
Last modified: 30 Apr 2023 | 12:42:32 UTC

Who cares.

No, it's not about who cares.

This is about which of the project employees has the knowledge and resources to implement the necessary functionality, and which of them has the time for it.
And as you should understand, they don't make decisions there on their own; it's not a hobby.
The necessary specialists may currently be involved in other, higher-priority projects for the institute, and neither we nor the employees themselves can influence this.
Deal with it.

The number of tearful posts about the problem will change nothing, no matter how much someone would like it to.
Unless, of course, the goal is once again just to let off steam somewhere out of indignation.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Message 60357 - Posted: 3 May 2023 | 8:34:49 UTC
Last modified: 3 May 2023 | 9:32:57 UTC

Task TYK2_m44_m55_5_FIX-QUICO_ATM_Sage_xTB-0-5-RND2847_0 (today):

FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m44_m55_0.xml'


Later - CDK2_miu_m26_4-QUICO_ATM_Sage_xTB-0-5-RND8419_0 running OK.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Message 60358 - Posted: 5 May 2023 | 7:14:35 UTC
Last modified: 5 May 2023 | 7:46:54 UTC

And a similar batch configuration error with today's BACE run, like

BACE_m24_m7e_5-QUICO_ATM_Sage_xTB-0-5-RND7993_0

08:05:32 (386384): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

(five so far)

Edit - now wasted 20 of the things, and switched to Python to avoid quota errors. I should have dropped in to give you a hand when passing through Barcelona at the weekend!

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Message 60359 - Posted: 5 May 2023 | 8:04:09 UTC

I cannot resource-share ATMbeta with other projects, because when it is stopped to run other projects, it ends up with an error.

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,258,731,270
RAC: 480,669
Level
Met
Message 60360 - Posted: 5 May 2023 | 9:01:39 UTC - in response to Message 60358.

And a similar batch configuration error with today's BACE run, like

Same for Win apps:
https://www.gpugrid.net/result.php?resultid=33475629
https://www.gpugrid.net/results.php?userid=101590
Sad : /

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1064
Credit: 40,231,533,983
RAC: 55,339
Level
Trp
Message 60361 - Posted: 5 May 2023 | 11:44:07 UTC - in response to Message 60359.

I cannot resource-share ATMbeta with other projects, because when it is stopped to run other projects, it ends up with an error.


Set all other GPU projects to a resource share of 0; then they won't run at all when you have ATM work.

Erich56
Joined: 1 Jan 15
Posts: 1120
Credit: 8,853,795,176
RAC: 32,883,185
Level
Tyr
Message 60362 - Posted: 5 May 2023 | 16:10:12 UTC

Many of the recent ATMs errored out after not even a minute; stderr says:

wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Der Befehl "run.bat" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.

In English: the command "run.bat" is either misspelled or could not be found.

What's up?

Keith Myers
Joined: 13 Dec 17
Posts: 1330
Credit: 7,038,742,459
RAC: 15,645,875
Level
Tyr
Message 60363 - Posted: 5 May 2023 | 16:16:35 UTC

The equivalent error shows up on Linux for a great many tasks.

bin/bash: run.sh: No such file or directory

BACE_m7g_m7c_3-QUICO_ATM_Sage_xTB-0-5-RND8127_3

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Message 60364 - Posted: 5 May 2023 | 17:19:23 UTC

Got a collection of twenty-one errored tasks. Suspended work fetch on that computer. The other is busy with Abous WU.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1616
Credit: 8,048,394,351
RAC: 19,365,580
Level
Tyr
Message 60365 - Posted: 5 May 2023 | 19:07:51 UTC

Now these are doing it as well: MCL1_m28_m47_1_FIX-QUICO_ATM_Sage_xTB-0-5-RND0954_0

18:09:56 (394275): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either).
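
The kind of pre-release check meant here could be as small as the following Python sketch. It is only an illustration: the script names run.sh and run.bat are taken from the error logs in this thread, while the idea of one directory per batch and the function name missing_files are hypothetical and say nothing about how the project actually prepares its batches.

```python
from pathlib import Path


def missing_files(batch_dir, required=("run.sh", "run.bat")):
    """Return the required file names that are absent from batch_dir."""
    root = Path(batch_dir)
    # Keep only the names that do not exist as regular files.
    return [name for name in required if not (root / name).is_file()]
```

A non-empty result would be the signal to hold the batch back and fix it before any tasks are sent out.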

bluestang
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Message 60366 - Posted: 5 May 2023 | 21:27:48 UTC - in response to Message 60365.
Last modified: 5 May 2023 | 21:28:11 UTC

Now these are doing it as well: MCL1_m28_m47_1_FIX-QUICO_ATM_Sage_xTB-0-5-RND0954_0

18:09:56 (394275): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either).



Exactly!

When you have more tasks that error (277) than are valid (240) ... that is pretty damn sad!