
Message boards : Number crunching : New ACEMD tasks are ending in computation error

Robert Meckley
Joined: 3 Nov 13
Posts: 4
Credit: 709,570,134
RAC: 230,655
Level
Lys
Message 53836 - Posted: 3 Mar 2020 | 20:02:32 UTC

Let me apologize in advance if this has already been covered in the Forum, but my search of the Forum came up dry. The problem I'm having is getting ACEMD tasks to complete without error. The last time I successfully ran an ACEMD task was Feb. 25th. Since then, according to my task summary, all tasks have ended prematurely in 'computation error'. (By the way, it doesn't appear that my task summary is accurately reporting what my GPU is doing, but that's a different matter.) My daily quota is down to 18 tasks a day.

To address this problem, I've updated the GPU driver, updated my operating system (W10-64), and rebooted several times to reestablish connection to the project, but nothing seems to work. Each time, I get work units that seem to run O.K., and I've even seen one run to completion; but according to my task summary, all I have is a few tasks that ended in 'computation error'. My GPU is a GTX 970; it is in good operating order and is not overclocked or overpowered. It runs files from my backup project just fine.

If anyone can help me solve this problem, I will be very appreciative. At this point I'm willing to try anything. I really like supporting this project and would love to continue doing so.

Keith Myers
Joined: 13 Dec 17
Posts: 514
Credit: 578,771,811
RAC: 1,645,906
Level
Lys
Message 53837 - Posted: 3 Mar 2020 | 20:22:23 UTC - in response to Message 53836.

I have had a couple of NaN errors but rarely. I chalk it up to a fluke.
Not sure why in your case that is all you get.

You could try stopping BOINC, deleting the contents of the Nvidia ComputeCache folder (the cached compute primitives), and letting them regenerate when you restart BOINC. On Windows the folder is here:

C:\Users\{username}\AppData\Roaming\NVIDIA\ComputeCache
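
For anyone who would rather script this step, here is a minimal Python sketch, not an official BOINC or NVIDIA tool. It assumes the cache lives under %APPDATA% as in the path above; the function names `default_cache_dir` and `clear_compute_cache` are made up for illustration. Stop BOINC before running it.

```python
import os
import shutil
from pathlib import Path


def default_cache_dir(appdata: str) -> Path:
    # Assumption: %APPDATA% points at C:\Users\{username}\AppData\Roaming,
    # so this matches the folder quoted in the post.
    return Path(appdata) / "NVIDIA" / "ComputeCache"


def clear_compute_cache(cache_dir: Path) -> int:
    """Delete everything inside cache_dir but keep the folder itself.

    Returns the number of top-level entries removed (0 if the folder
    does not exist).
    """
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a cached subfolder and its files
        else:
            entry.unlink()  # remove a single cached file or a symlink
        removed += 1
    return removed


if __name__ == "__main__":
    appdata = os.environ.get("APPDATA")
    if appdata:
        count = clear_compute_cache(default_cache_dir(appdata))
        print(count, "entries removed")
    else:
        print("APPDATA is not set; nothing done")
```

The folder itself is left in place so the driver can repopulate it the next time BOINC starts a GPU task.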

rod4x4
Joined: 4 Aug 14
Posts: 168
Credit: 1,901,549,768
RAC: 1,283,091
Level
His
Message 53842 - Posted: 3 Mar 2020 | 23:36:22 UTC - in response to Message 53836.

All of your failed tasks have failed on one or more other computers.
It seems you have been unlucky in grabbing bad or borderline tasks.
Most of your failed tasks have eventually succeeded on other computers.

Is your GPU factory overclocked? (e.g. EVGA GTX 970 SSC)
If it is, perhaps try power limiting the card to see if it helps reduce failure rate.

The GPU can be power limited as follows:
1. Open Command Prompt (as administrator).
2. Change to the folder C:\Program Files\NVIDIA Corporation\NVSMI
3. Run this command: nvidia-smi -pl 120

This will limit the card to 120 Watts power (GTX 970 is spec'd at 145 Watts)

The completion time of tasks will only be slightly affected.

This command will need to be executed every time the computer is rebooted.
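
Since the limit has to be reapplied after each reboot, one option is to wrap the command in a small script. The Python sketch below only builds and runs the same `nvidia-smi -pl` command quoted above. The executable path, the helper names, and the check against the 145 W figure are illustrative assumptions, not anything taken from NVIDIA's tooling.

```python
import subprocess
from typing import List

# Assumption: nvidia-smi.exe sits in the folder named in the post.
DEFAULT_SMI_EXE = r"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe"
GTX_970_SPEC_WATTS = 145  # figure quoted in the post


def build_power_limit_command(
    watts: int,
    smi_exe: str = DEFAULT_SMI_EXE,
    spec_watts: int = GTX_970_SPEC_WATTS,
) -> List[str]:
    """Return the argument list for `nvidia-smi -pl <watts>`."""
    # Illustrative guard: the goal here is to lower the limit, so refuse
    # values that are non-positive or above the quoted spec.
    if not 0 < watts <= spec_watts:
        raise ValueError(f"watts must be between 1 and {spec_watts}")
    return [smi_exe, "-pl", str(watts)]


def apply_power_limit(watts: int) -> None:
    # Must be run from an elevated (administrator) session.
    subprocess.run(build_power_limit_command(watts), check=True)


if __name__ == "__main__":
    # Only prints the command; call apply_power_limit(120) to run it.
    print(" ".join(build_power_limit_command(120)))
```

A script like this could then be started at boot, for example from Windows Task Scheduler with administrator rights, though that setup is not covered here.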

It is worth a try if you want to troubleshoot the issue.

Robert Meckley
Joined: 3 Nov 13
Posts: 4
Credit: 709,570,134
RAC: 230,655
Level
Lys
Message 53848 - Posted: 4 Mar 2020 | 14:15:05 UTC

I want to thank you both for your suggestions. I've deleted the compute primitives. I wasn't aware that this could be done, so we will see. I've also turned back power by 20%. I use the manufacturer's software to do this. I really didn't know you could use Windows to do this. This is something else I didn't know. I've run my 970 with the power down before to keep the operating temperature down, and maybe the last time I did this was Feb 25 when the last successful ACEMD task validated. (I'm an old man and I don't remember.) But, at any rate, I'm trying both of your suggestions and hopefully this will do the trick. I probably won't know until tomorrow, however, because my daily quota is down to 8 tasks per day and I'm running my backup project right now. Regardless of how this turns out, I appreciate you both taking the time to help me out.
