Message boards : Number crunching : Many errors in new version of application

Author Message
L
Joined: 22 Mar 14
Posts: 41
Credit: 439,177,758
RAC: 322,849
Level
Gln
Message 53276 - Posted: 3 Dec 2019 | 20:53:01 UTC

One of my hosts cannot process any job without errors, and I do not understand why: https://www.gpugrid.net/results.php?hostid=170784. For now I have stopped getting new WUs.

Jim1348
Joined: 28 Jul 12
Posts: 733
Credit: 1,478,749,566
RAC: 142,659
Level
Met
Message 53277 - Posted: 3 Dec 2019 | 21:06:41 UTC - in response to Message 53276.

The short (10 second) errors are not your fault, but a problem with the work units.

However, the long runs that fail after 54 hours probably mean that your GTX 650 is overclocked or too hot. I would just use the GTX 1060 anyway. In my opinion, your GTX 650 will use more electricity than it is worth.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 947
Credit: 4,353,973
RAC: 71
Level
Ala
Message 53278 - Posted: 3 Dec 2019 | 22:47:11 UTC - in response to Message 53277.

The errors occur at the end of the run. I'm inclined to think that the problem is with the software, permissions, or something like that.

L
Joined: 22 Mar 14
Posts: 41
Credit: 439,177,758
RAC: 322,849
Level
Gln
Message 53341 - Posted: 14 Dec 2019 | 7:20:54 UTC - in response to Message 53278.

Toni, +1. I think so too, because older versions of the WUs worked well.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 947
Credit: 4,353,973
RAC: 71
Level
Ala
Message 53342 - Posted: 14 Dec 2019 | 9:35:03 UTC - in response to Message 53341.

Reinstall BOINC.

archeye
Joined: 10 May 13
Posts: 10
Credit: 6,490,450
RAC: 0
Level
Ser
Message 53426 - Posted: 1 Jan 2020 | 15:41:21 UTC
Last modified: 1 Jan 2020 | 15:41:39 UTC

I had a similar issue today. Right at the end of the run it failed with:

Name initial_1730-ELISA_GSN4V1-41-100-RND8024_0
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Run time 103,023.27
CPU time 22,582.33

http://www.gpugrid.net/result.php?resultid=21587058

I was hoping for some advice on how to check if my PC/GPU(s) are working correctly.

If it were failing at the start of a run I would perhaps be less interested, though still concerned for my hardware. As it is, all this computing effort is effectively wasted, which is a shame.

I have now just suspended all tasks so I can shut down the computer and clean out the filters. I will see what happens with the next 2 GPUGRID tasks.

archeye
Joined: 10 May 13
Posts: 10
Credit: 6,490,450
RAC: 0
Level
Ser
Message 53427 - Posted: 1 Jan 2020 | 17:44:11 UTC - in response to Message 53426.
Last modified: 1 Jan 2020 | 18:24:47 UTC

So I cleaned and washed out my PC fan filter. The fan runs slower now, so that's good.

I just had 2 more tasks that were running together fail with the same error as previously posted.

This time I had just started up my MMO game, Elder Scrolls Online, when the screen went black and the fan speed dropped right down. I killed the game, checked my GPU tasks and saw that both had failed.

I have been playing this game and running GPU tasks together for the past week with no problem.

I will not run any GPU tasks for a while and see if I have any strange PC behaviour before attempting to run any more.

Edit:
Just as an afterthought, where a GPU task has already achieved one or more checkpoints and then there is a failure of some kind: rather than just exiting, why can't it just reload from the last checkpoint? It sort of makes more sense to me and this would also minimise the lost computing effort of the volunteer crunchers :)

Edit 2: The last checkpoint must already be preserved, because when you just switch off your computer while tasks are between checkpoints, that saved point is used as the starting point the next time you run BOINC Manager.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2185
Credit: 15,823,750,346
RAC: 732,335
Level
Trp
Message 53432 - Posted: 2 Jan 2020 | 8:00:46 UTC - in response to Message 53427.

Just as an afterthought, where a GPU task has already achieved one or more checkpoints and then there is a failure of some kind: rather than just exiting, why can't it just reload from the last checkpoint?
What would make it run smoothly for the 2nd (3rd...1000th) attempt without user intervention?

It sort of makes more sense to me and this would also minimise the lost computing effort of the volunteer crunchers :)
The tasks on such hosts would never finish this way; they would keep retrying until their deadline, which would slow down the processing of the given chain of workunits too much. Failed workunits serve as a source of self-protection for the project, and they also serve as a warning sign for the user.

Does your host have two GTX 980s?
Are those cards in SLI mode? (That could be the problem.)
The upper card tends to run hotter in this setup, so you should check the temperatures of your cards with MSI Afterburner (it works with other manufacturers' cards too, or you can use similar tools provided by the manufacturer of your card).
You can also use this tool to lower the temps of your cards by:
1. lowering its power target (=lowering GPU clock frequency and GPU voltage)
2. lowering its memory clock frequency
3. increasing its fan speed
These problems usually arise from high clock speeds combined with high temperatures.
The previous version of the GPUGrid app didn't tolerate frequent suspension well; perhaps this is the case for the new app too.
If you have two different Nvidia cards in the same system, you should provide a solution for the suspended GPUGrid tasks to restart on the same card on which they were processed previously. (The new app can't restart suspended tasks on a different card.) The easiest solution is to exclude the lesser card from GPUGrid by creating / editing cc_config.xml (see the exclude_GPU section here).
The new app utilizes the GPU much more than the previous version, so you may have to re-calibrate (lower) your overclock settings.
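
For anyone who would rather check the cards from a script than from Afterburner, here is a minimal sketch. It assumes the NVIDIA driver's `nvidia-smi` command-line tool is on the PATH; the function names and the 65 °C default threshold are only illustrative choices, not anything the project prescribes.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the same order is used when parsing.
QUERY_FIELDS = ["index", "name", "temperature.gpu", "clocks.sm", "power.draw"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = (field.strip() for field in record)
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd, text=True))


def hot_gpus(rows, limit_c=65):
    """Return the indices of GPUs whose temperature exceeds limit_c (deg C)."""
    return [int(r["index"]) for r in rows if float(r["temperature.gpu"]) > limit_c]
```

Calling `hot_gpus(read_gpus())` while a task is crunching lists the cards running above the threshold; logging `read_gpus()` every minute or so would also show whether the clocks or temperatures change just before a task fails.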

archeye
Joined: 10 May 13
Posts: 10
Credit: 6,490,450
RAC: 0
Level
Ser
Message 53433 - Posted: 2 Jan 2020 | 17:13:28 UTC - in response to Message 53432.

Thanks for the detailed reply; it helps to know we are well supported.

As for your avatar, it seems you are looking to the universe for answers, but I would imagine the answers more likely originate inside yourself.

Anyway, 1. Hardware,

Operating System: Windows 10 Pro 64-bit
CPU: Intel Core i7 @ 4.00GHz (Haswell 22nm Technology)
RAM: 32.0GB
Motherboard: MSI Z97-G45 GAMING (MS-7821) (SOCKET 0)
Graphics:
  ROG PG278Q (2560x1440@144Hz)
  SAMSUNG (1680x1050@59Hz)
  4095MB NVIDIA GeForce GTX 980 (NVIDIA) 37 °C
  4095MB NVIDIA GeForce GTX 980 (NVIDIA) 26 °C
  ForceWare version: 441.20
  SLI Enabled
Storage:
  447GB Crucial_CT480M500SSD1 (SATA (SSD))
  931GB TOSHIBA DT01ACA100 (SATA ) 27 °C
  8GB Microsoft Virtual Disk (File-backed Virtual (SSD))
Optical Drives: HL-DT-ST DVDRAM GH24NSC0
Audio: NVIDIA High Definition Audio

2. Info from Boinc Manager,

PC

01/01/2020 18:54:38 cc_config.xml not found - using defaults
01/01/2020 18:54:38 Starting BOINC client version 7.14.2 for windows_x86_64
01/01/2020 18:54:38 log flags: file_xfer, sched_ops, task
01/01/2020 18:54:38 Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
01/01/2020 18:54:38 Data directory: C:\ProgramData\BOINC
01/01/2020 18:54:38 Running under account Chris
01/01/2020 18:54:40 CUDA: NVIDIA GPU 0: GeForce GTX 980 (driver version 441.20, CUDA version 10.2, compute capability 5.2, 4096MB, 3292MB available, 4979 GFLOPS peak)
01/01/2020 18:54:40 CUDA: NVIDIA GPU 1: GeForce GTX 980 (driver version 441.20, CUDA version 10.2, compute capability 5.2, 4096MB, 3292MB available, 4979 GFLOPS peak)
01/01/2020 18:54:40 OpenCL: NVIDIA GPU 0: GeForce GTX 980 (driver version 441.20, device version OpenCL 1.2 CUDA, 4096MB, 3292MB available, 4979 GFLOPS peak)
01/01/2020 18:54:40 OpenCL: NVIDIA GPU 1: GeForce GTX 980 (driver version 441.20, device version OpenCL 1.2 CUDA, 4096MB, 3292MB available, 4979 GFLOPS peak)
01/01/2020 18:54:40 Host name: PC
01/01/2020 18:54:40 Processor: 8 GenuineIntel Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz [Family 6 Model 60 Stepping 3]
01/01/2020 18:54:40 Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 tm2 pbe fsgsbase bmi1 smep bmi2
01/01/2020 18:54:40 OS: Microsoft Windows 10: Professional x64 Edition, (10.00.18362.00)
01/01/2020 18:54:40 Memory: 31.95 GB physical, 36.70 GB virtual
01/01/2020 18:54:40 Disk: 366.27 GB total, 121.16 GB free
01/01/2020 18:54:40 Local time is UTC +1 hours
01/01/2020 18:54:40 No WSL found.
01/01/2020 18:54:40 VirtualBox version: 5.2.8


3. What would make it run smoothly for the 2nd (3rd...1000th) attempt without user intervention?
Well, I agree if it were left unchecked, but a one-time failure may be a glitch associated with other computing activity. Exiting only after 3 consecutive failures seems sensible to me.

However, you are free to run your project as best suits your needs, and as a volunteer helper I guess I support your decisions too.
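
To make the "give up after 3 consecutive failures" idea concrete, here is a minimal sketch. It is purely illustrative and is not how BOINC or the GPUGrid app actually behaves; `run_segment` is a hypothetical callable that resumes from a checkpoint, returns `(finished, new_checkpoint)`, and raises `RuntimeError` on a crash.

```python
def run_with_retries(run_segment, max_consecutive_failures=3):
    """Resume from the last checkpoint after a crash, but give up once the
    task has failed max_consecutive_failures times in a row.

    Returns (succeeded, last_checkpoint).
    """
    checkpoint = 0
    failures = 0
    while True:
        try:
            finished, checkpoint = run_segment(checkpoint)
        except RuntimeError:
            failures += 1
            if failures >= max_consecutive_failures:
                # A persistent fault: report the task as failed instead of
                # retrying until the deadline.
                return False, checkpoint
            continue  # retry from the last good checkpoint
        failures = 0  # a new checkpoint was reached, so reset the streak
        if finished:
            return True, checkpoint
```

Because the counter resets only when a new checkpoint is reached, a one-off glitch costs a single retry, while a host with a persistent fault still reports the error after three attempts instead of holding the workunit until its deadline, which is the concern raised above.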

4. Afterburner
I do use this and have tweaked the fan speed curve so it cuts in at a higher rate at lower temps.
There is no overclocking.
I have not looked into:
1. lowering its power target (=lowering GPU clock frequency and GPU voltage)
2. lowering its memory clock frequency
I just left that alone as I really don't have enough knowledge to know the impact of any changes there.

Suspending gpugrid tasks,

I use the eFMer BoincTasks app and eFMer TThrottle.
I set my CPU temp limit to 70 deg C and the GPU(s) to 65 deg C.
Sometimes I just need to turn off the computer, but if I think a bit ahead I will use the setting in the eFMer BoincTasks app to "suspend at checkpoint".

I do use SLI mode. Maybe an alternative is to just allow the GPU task to run on the second GPU, which is mostly providing PhysX.
Apparently I can configure the settings in the eFMer BoincTasks app config file to disable my GPU1 for BOINC task use.

That's enough for now, I think :)

Keith Myers
Joined: 13 Dec 17
Posts: 508
Credit: 524,027,144
RAC: 1,662,925
Level
Lys
Message 53434 - Posted: 2 Jan 2020 | 18:30:54 UTC

Since you have two of the same card type, you don't need to worry about restarting a paused/suspended task on another card and erroring out the task.

I would still be concerned about the cards being in SLI configuration. Anecdotal evidence at all the projects I crunch for says that computing on cards with an SLI connection is problematic. There is too much going on under the covers in the Nvidia driver to gang the hardware on both cards together, and that prevents proper calculations on both cards.

If you need to keep the SLI configuration for gaming, I would restrict computation to only one card by excluding a device.
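
Excluding a device is done with an `exclude_gpu` entry in cc_config.xml. As a rough sketch, the small script below generates such a file. The project URL and the device number are assumptions: the URL has to match the one your BOINC client shows for GPUGrid, and device 0 is just an example. The helper name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(project_url, device_num):
    """Return cc_config.xml text that excludes one GPU for one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.indent(root)  # pretty-print the tree; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed values: adjust the URL and device number to your own host.
    print(build_cc_config("https://www.gpugrid.net/", 0))
```

The printed XML goes into cc_config.xml in the BOINC data directory (C:\ProgramData\BOINC in the log posted above). The client should pick it up after a restart or after being told to re-read its config files. If a cc_config.xml already exists, merge the `exclude_gpu` block into its `options` section rather than overwriting the file.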

archeye
Joined: 10 May 13
Posts: 10
Credit: 6,490,450
RAC: 0
Level
Ser
Message 53435 - Posted: 4 Jan 2020 | 10:53:18 UTC - in response to Message 53434.
Last modified: 4 Jan 2020 | 10:53:43 UTC

Thanks, Keith, for the advice. I will certainly try that and allow only 1 GPU card for processing.

After that I will also experiment with disabling SLI while running both cards.

Keith Myers
Joined: 13 Dec 17
Posts: 508
Credit: 524,027,144
RAC: 1,662,925
Level
Lys
Message 53436 - Posted: 4 Jan 2020 | 18:41:46 UTC - in response to Message 53435.

It has been a long while since I ran Windows, but I seem to remember an SLI setting in the Nvidia control panel that toggles SLI on and off. It may require a host restart, though, but that might be a solution. The physical SLI connector is not the problem; it can be present and has no effect if the software enabling SLI is not turned on.

You could enable SLI for gaming and then, when finished, reboot the host with SLI disabled for crunching with both cards.

JochenZ
Joined: 15 Aug 09
Posts: 1
Credit: 165,754,715
RAC: 129,679
Level
Ile
Message 53442 - Posted: 9 Jan 2020 | 22:32:31 UTC - in response to Message 53426.

Hello,

Nearly 100% of the tasks also failed with exit code 195 on my ASUS GTX 1070 Ti, which I had manually overclocked: some tasks after minutes and some after hours of calculating.
After reducing the speed to the normal overclocking mode, all 7 tasks I received finished correctly. So I think excessive overclocking is the reason for the failures in the new version of the ACEMD tasks. Short-run and long-run tasks never failed before.
In other projects like Einstein or Milkyway I also got no failures with the higher manual overclock, only approx. 8% longer crunching times with the normal overclock.

But the worst part is that I get no new tasks from GPUGRID. Even when I trigger a request manually, no tasks are available, so I can't do any further tests.

Rampf, mampf my computer wants to crunch ;-)
(Denglish rhyme)
