
Message boards : Number crunching : ADRIA WUs *still* have serious problems

Life v lies: Dont be a DN...
Joined: 14 Feb 20
Posts: 12
Credit: 11,950,453
RAC: 52
Level
Pro
Message 59133 - Posted: 19 Aug 2022 | 6:18:01 UTC
Last modified: 19 Aug 2022 | 6:24:14 UTC

I sure wish BOINC allowed for inclusion of screen shots!

8/16/2022 11:27:13 PM | GPUGRID | Starting task 2-ADRIA_FS_RNAfmnrb_EFL6_2-1-2-RND3133_1

From the event log (I have saved screen shots to document this):

date  time     progress  elapsed      remaining
8/18  10:01 A  57.799%   1d 10:34:05  1d 01:14:17
8/18  10:56 A  57.799%   1d 11:29:30  1d 01:54:45
8/18  11:00 A  57.799%   1d 11:32:26  1d 01:56:54

At this point, I suspended the WU for 1 hr 20 min, then restarted it.
The BOINC 'elapsed' timer mysteriously 'shortened' the actual run time by 12 hr 31 min:

8/18  12:19 P  57.799%   23:01:10     1d 14:28:10


LUCKILY, it does seem this tactic of pausing and restarting will save the WU,
https://gpugrid.net/result.php?resultid=33001055
and it is slated to finish after 1d 15:58 of run time on a GTX 1660-Ti
YET this faulty WU wasted 12 1/2 hours of GPU time
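
For readers who want to check the "12 1/2 hours" figure, the two elapsed readings quoted above can be subtracted directly. This is only a minimal sketch: the `parse_elapsed` helper is hypothetical (it is not part of BOINC), and it assumes the elapsed column uses the `Nd HH:MM:SS` form shown in the log excerpt.

```python
from datetime import timedelta
import re


def parse_elapsed(text: str) -> timedelta:
    """Parse a duration written like '1d 11:32:26' or '23:01:10'.

    Hypothetical helper for illustration; the format is assumed from the
    readings quoted in this post.
    """
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


before_suspend = parse_elapsed("1d 11:32:26")  # reading at 8/18 11:00 A
after_resume = parse_elapsed("23:01:10")       # reading at 8/18 12:19 P
lost = before_suspend - after_resume
print(lost)  # 12:31:16
```

The difference is 12 hours, 31 minutes and 16 seconds, which matches the figure reported in the post.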

PLEASE fix the problem(s) with ADRIA tasks

LLP, PhD, Prof Engr
____________

Life v lies: Dont be a DN...
Message 59138 - Posted: 20 Aug 2022 | 1:44:32 UTC

what does "Detected memory leaks!" mean??

https://gpugrid.net/result.php?resultid=33001055

Stderr output
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<stderr_txt>
23:27:19 (7328): wrapper (7.9.26016): starting
23:27:19 (7328): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!
Dumping objects ->

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Message 59140 - Posted: 20 Aug 2022 | 15:00:40 UTC - in response to Message 59138.

Read the FAQs

https://gpugrid.net/forum_thread.php?id=5272
____________

Life v lies: Dont be a DN...
Message 59149 - Posted: 23 Aug 2022 | 16:14:05 UTC - in response to Message 59140.

Thanks for your reply, but "Please ignore the message. ..." is a ludicrous answer by GPUGrid.

"Such messages are always present in Windows" I'm not sure what "Such messages" is supposed to mean, but with over 40 years of working with computers, a Masters and a PhD and being a registered PE (a licensed Professional Engineer) ... I have never seen any other "such messages"

"It's completely harmless.... not related to successful" well, there was no other error message in the task output, yet the task 'stalled' and wasted over 12 hours of GPU time on a not-that-bad GPU (NVidia GTX 1660 Ti)


If the GPUGrid project is willing to ask for and accept the in-kind donations of people's GPU time, then GPUGrid has an obligation to do what they can to resolve problematic tasks and code
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 59150 - Posted: 23 Aug 2022 | 16:27:28 UTC - in response to Message 59149.

I'm not sure what you're on about. you've only completed this one single task. and it was completed successfully. and you received credit for it.

what do you mean by "wasted over 12 hours of GPU time"? these tasks are VERY long running. 12hrs seems normal for that relatively weak GPU. and the ACEMD3 tasks can vary in length depending on what it's doing. there was a time when they only took 20mins, and a time where it took 24hrs. just depends on the work

so what's the problem exactly? it looks like you're complaining about a valid/successful task. i see nothing wrong with this task.

____________

Life v lies: Dont be a DN...
Message 59159 - Posted: 24 Aug 2022 | 15:59:27 UTC - in response to Message 59150.
Last modified: 24 Aug 2022 | 16:03:49 UTC

please read
Message 59133 - Posted: 19 Aug 2022 | 6:18:01 UTC
Quite clearly, you have not.

you've only completed this one single task

Wow.
So, I have a total "Credit: 11,845,453" by having completed one single task.
Amazing.

Indeed, I used to give high preference to GPUGrid among GPU projects because I felt its scientific merits deserved it, despite this project giving far, far less credit per GPU hour than a number of other projects (e.g., PrimeGrid, SRBase)

what do you mean by "wasted over 12 hours of GPU time"?

please read
Message 59133 - Posted: 19 Aug 2022 | 6:18:01 UTC
the first post in this thread.

LLP, PhD, Prof. Engr.
____________

Life v lies: Dont be a DN...
Message 59160 - Posted: 24 Aug 2022 | 16:12:36 UTC - in response to Message 59133.
Last modified: 24 Aug 2022 | 16:19:13 UTC

date  time     progress  elapsed      remaining
8/18  10:01 A  57.799%   1d 10:34:05  1d 01:14:17
8/18  10:56 A  57.799%   1d 11:29:30  1d 01:54:45
8/18  11:00 A  57.799%   1d 11:32:26  1d 01:56:54

At this point, I suspended the WU for 1 hr 20 min, then restarted it.
The BOINC 'elapsed' timer mysteriously 'shortened' the actual run time by 12 hr 31 min:

8/18  12:19 P  57.799%   23:01:10     1d 14:28:10

The above is from the event log (I have saved screen shots to document this)
Thus, this WU was 'hung up' for who knows how long.
At the very minimum, this WU wasted over 12 1/2 hours of GPU time


GPUGrid admins:
1. PLEASE fix all ongoing problem(s) with GPUGrid tasks
2. PLEASE use the Notices tab in BOINC Manager to communicate info (or a direct link to such info) regarding needed patches, mods, etc. for GPUGrid WUs to run properly
Thank you

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 59161 - Posted: 24 Aug 2022 | 16:28:52 UTC - in response to Message 59159.

It's clear that the spamming of your credentials hasn't aided in your critical thinking ability. It should be very clear that I was referencing RECENTLY completed tasks. Or rather I should say the singular "task". Not sure how you can extrapolate ONE issue on ONE system to be an endemic problem with all ADRIA tasks. Wow.

since this has not come up as a widespread issue, it's much more likely to be an issue with your system and nothing wrong with the tasks. I have completed thousands of these tasks with this application and earned billions of credits from them. and not once has this happened.

ACEMD3 tasks (of various campaigns, ADRIA or otherwise), are particularly stressful for the GPU as compared to many other projects, and stress areas of the GPU that other projects might not. and it's not uncommon to have driver crashes in Windows as a result of that stress. when the driver crashes in windows and tries to recover, I could see that hanging up a GPU task in BOINC. whenever a task is suspended and resumed in BOINC, that triggers it to restart from the last checkpoint (which is why the timer reset).
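
The restart-from-checkpoint behaviour described above can be illustrated with a toy model. This is a sketch only, not BOINC's or ACEMD3's actual code: the `ToyTask` class and its fields are hypothetical, and it assumes that the last checkpoint was written at the 23:01:10 mark and that the task then hung without progressing. The time values are the readings quoted earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Toy model of a checkpointing task; not BOINC's actual implementation."""
    progress: float = 0.0       # fraction done shown to the user
    elapsed_s: int = 0          # run time shown to the user
    ckpt_progress: float = 0.0  # progress stored in the last checkpoint
    ckpt_elapsed_s: int = 0     # run time stored in the last checkpoint

    def run(self, seconds: int, rate_per_s: float, hung: bool = False) -> None:
        """Advance the clock; a hung task burns time without progressing."""
        self.elapsed_s += seconds
        if not hung:
            self.progress += seconds * rate_per_s

    def checkpoint(self) -> None:
        """Record the current state as the restart point."""
        self.ckpt_progress = self.progress
        self.ckpt_elapsed_s = self.elapsed_s

    def suspend_and_resume(self) -> None:
        """Resume from the last checkpoint, discarding anything after it."""
        self.progress = self.ckpt_progress
        self.elapsed_s = self.ckpt_elapsed_s


task = ToyTask()
task.run(82_870, rate_per_s=0.57799 / 82_870)  # 23:01:10 of real work
task.checkpoint()                              # assumed last checkpoint
task.run(45_076, rate_per_s=0.0, hung=True)    # 12:31:16 with no progress
stalled_elapsed = task.elapsed_s               # 127946 s, i.e. 1d 11:32:26
task.suspend_and_resume()
print(stalled_elapsed, task.elapsed_s)         # 127946 82870
```

Under these assumptions the displayed timer drops from 1d 11:32:26 back to 23:01:10 on resume while the progress figure stays at 57.799%, which is consistent with the readings in the first post.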

you should update your drivers (looks like they aren't recent), update Windows, and verify your system is clean of dust or other issues that might cause thermal issues like bad airflow to the GPU. and finally it could be a faulty GPU or power problem or some other hardware issue with the system.

try less condescension and outrage, and more big picture thinking and problem solving that Engineers are known for.

NASA Engineer.


____________
