Message boards : Number crunching : Everyone is getting computation errors

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 49862 - Posted: 15 Jul 2018 | 1:08:24 UTC

I suspended GPUGrid

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,464,071,749
RAC: 5,770,116
Message 49863 - Posted: 15 Jul 2018 | 1:16:15 UTC

Same. Errors on last 5 WUs within 2 seconds.

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 49864 - Posted: 15 Jul 2018 | 1:27:18 UTC

Ya, I was scrambling around downclocking my cards and turning up the voltage, having a litter of kittens and wondering how all 4 of my 1180's went bad at the same time.

It's not a good feeling; these things aren't cheap. I'm glad it's the work units rather than my cards.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 49865 - Posted: 15 Jul 2018 | 1:27:58 UTC

I get no errors on my Linux systems.
Tullio

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 49866 - Posted: 15 Jul 2018 | 1:35:24 UTC - in response to Message 49865.

Maybe it's a Windows-only thing: both the PABLO and ADRIA WUs are getting errors on Windows but not on Linux.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 49868 - Posted: 15 Jul 2018 | 1:57:39 UTC

On my Windows 10 PC, which is always kept up to date and runs the NVIDIA drivers, I also get errors, so I am running GPUGRID only on the two Linux boxes.
Tullio

Erich56
Joined: 1 Jan 15
Posts: 1131
Credit: 9,599,332,676
RAC: 32,036,678
Message 49870 - Posted: 15 Jul 2018 | 6:10:39 UTC

Same thing here: all newly downloaded tasks (regardless of whether they are PABLO or ADRIA) error out after a few seconds:

(unknown error) - exit code -44 (0xffffffd4)

did no one at GPUGRID notice this problem?
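
A side note on the two numbers in that error line: they appear to be the same value shown two ways, the signed exit code and its unsigned 32-bit hexadecimal form. The sketch below only illustrates that arithmetic. The helper names are made up for this example and are not part of BOINC or the GPUGRID app, and it says nothing about what the code -44 means inside the application.

```python
# Illustrative helpers (hypothetical names, not BOINC or ACEMD code).

def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit exit code."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return value - 0x100000000 if value & 0x80000000 else value


def format_exit_code(code: int) -> str:
    """Render an exit code as 'decimal (0xhex)', like the message above."""
    return f"{code} (0x{to_unsigned32(code):08x})"


print(format_exit_code(-44))  # prints: -44 (0xffffffd4)
```

Going the other way, to_signed32(0xFFFFFFD4) gives back -44, so either form can be used when searching logs for the same failure.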

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 49872 - Posted: 15 Jul 2018 | 8:59:11 UTC

I think this time the license of the Windows / CUDA8.0 client has expired, as the Windows XP / CUDA6.5 and the Linux / CUDA8.0 clients are working fine.
Too bad that my Windows XP hosts are offline for the summer.
Many workunits will be lost, because most of the hosts run Windows 10 or 7.
I think the Windows / CUDA8.0 client should be deprecated immediately.

[PUGLIA] kidkidkid3
Joined: 23 Feb 11
Posts: 96
Credit: 1,239,505,544
RAC: 3,091,023
Message 49873 - Posted: 15 Jul 2018 | 9:25:16 UTC - in response to Message 49872.

After a new license is purchased, all of us will need our daily quota reset to crunch WUs, is that correct?
Thanks
K.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 49874 - Posted: 15 Jul 2018 | 9:39:40 UTC - in response to Message 49873.

No need to reset it for the sake of just one day.


____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 49875 - Posted: 15 Jul 2018 | 10:57:19 UTC

The error rates of three workunit batches on the Server status page are in the red range (above 75%) now.

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 49884 - Posted: 15 Jul 2018 | 18:55:30 UTC

No word from the staff yet on when it's safe to start crunching? The Linux guys should have enough work to last the next couple of days. Doesn't anyone monitor the servers and software over the weekend?

I asked this question twice before and got no answer. What happened to the moderators?

kain
Joined: 3 Sep 14
Posts: 152
Credit: 752,657,245
RAC: 2,554,561
Message 49885 - Posted: 15 Jul 2018 | 19:32:28 UTC

It is a small (but still very productive) team, it is the weekend, and today was the World Cup final. Let them live ;)

MartinKanne
Joined: 27 Dec 16
Posts: 6
Credit: 53,210,225
RAC: 0
Message 49906 - Posted: 16 Jul 2018 | 20:07:16 UTC

I am not suspending GPUGrid, because each failing task only takes 3 to 8 seconds.
But I would be glad if GPUGrid were able to fix the problem within the next 48 hours.

MartinKanne
Joined: 27 Dec 16
Posts: 6
Credit: 53,210,225
RAC: 0
Message 49907 - Posted: 16 Jul 2018 | 20:09:37 UTC

I am not suspending GPUGrid, because each failing task only takes 3 to 8 seconds.
But I would be glad if GPUGrid were able to fix the problem within the next 48 hours.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 49908 - Posted: 16 Jul 2018 | 21:15:10 UTC

I was running an ADRIA_FOLDT1015 on my GTX 1060 (Ubuntu 16.04) when it crashed. Not only that, but it took out the QC work units running on the CPU also. I will lay off the GPU for a while; it is too warm anyway.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 10,774,998,466
RAC: 15,385,380
Message 49910 - Posted: 16 Jul 2018 | 22:05:07 UTC

Now, I am getting the same error on cuda 6.5 / windows xp.

http://www.gpugrid.net/results.php?hostid=411423&offset=0&show_names=0&state=5&appid=

http://www.gpugrid.net/result.php?resultid=18147634

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 49913 - Posted: 16 Jul 2018 | 22:33:05 UTC - in response to Message 49910.

Now, I am getting the same error on cuda 6.5 / windows xp.
Yep, me too. Too bad... At least my electricity bill will be the lowest in years...

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 49915 - Posted: 17 Jul 2018 | 1:01:19 UTC

I can't believe they haven't fixed this yet; it's over 4300 work units now and growing. It's obvious the Linux machines can't keep up. This is starting to get strange.

Erich56
Joined: 1 Jan 15
Posts: 1131
Credit: 9,599,332,676
RAC: 32,036,678
Message 49916 - Posted: 17 Jul 2018 | 4:50:07 UTC - in response to Message 49913.

Now, I am getting the same error on cuda 6.5 / windows xp.
Yep, me too. Too bad... At least my electricity bill will be the lowest in years...

same here :-(

is GPUGRID falling apart?

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 49917 - Posted: 17 Jul 2018 | 8:29:26 UTC
Last modified: 17 Jul 2018 | 8:32:04 UTC

Toni, who is probably the most qualified person for updating the app with a new ACEMD version, is currently on holiday without good internet. I am also on holiday at the moment, although I doubt I could have fixed it anyway. I told the guys at the lab to cancel the GPU workunits until it's fixed, so you might have to wait a few days before we fix it and send out new ones. I'm sorry, but some stuff is beyond my control sometimes.
Maybe Gianni can take a look at it while we are out; I have informed him as well.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1618
Credit: 8,452,194,351
RAC: 15,895,459
Message 49918 - Posted: 17 Jul 2018 | 8:39:55 UTC - in response to Message 49917.

Toni who is probably the most qualified person for updating the app with a new ACEMD version is currently on holidays without good internet. I am also on holidays currently although I doubt I could have fixed it anyway. I told the guys at the lab to cancel the GPU workunits until it's fixed, so you might have to wait a few days before we fix it and send out new ones. I'm sorry but some stuff is beyond my control sometimes.
Maybe Gianni can take a look at it while we are out, I have informed him as well.

It might be better to keep the tasks, but deprecate the Windows apps - that way, you would still get some work done (albeit at only ~20% capacity) by your Linux volunteers.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 49919 - Posted: 17 Jul 2018 | 8:51:25 UTC - in response to Message 49918.

Toni who is probably the most qualified person for updating the app with a new ACEMD version is currently on holidays without good internet. I am also on holidays currently although I doubt I could have fixed it anyway. I told the guys at the lab to cancel the GPU workunits until it's fixed, so you might have to wait a few days before we fix it and send out new ones. I'm sorry but some stuff is beyond my control sometimes.
Maybe Gianni can take a look at it while we are out, I have informed him as well.

It might be better to keep the tasks, but deprecate the Windows apps - that way, you would still get some work done (albeit at only ~20% capacity) by your Linux volunteers.
+1

[AF] fansyl
Joined: 26 Sep 13
Posts: 20
Credit: 1,714,356,441
RAC: 0
Message 49920 - Posted: 17 Jul 2018 | 9:09:49 UTC
Last modified: 17 Jul 2018 | 9:10:17 UTC

You are entitled to a holiday :-)
Courage to the whole team to fix this even if there is no emergency.

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,401,801,065
RAC: 7,715,030
Message 49921 - Posted: 17 Jul 2018 | 10:40:13 UTC - in response to Message 49919.

Toni who is probably the most qualified person for updating the app with a new ACEMD version is currently on holidays without good internet. I am also on holidays currently although I doubt I could have fixed it anyway. I told the guys at the lab to cancel the GPU workunits until it's fixed, so you might have to wait a few days before we fix it and send out new ones. I'm sorry but some stuff is beyond my control sometimes.
Maybe Gianni can take a look at it while we are out, I have informed him as well.

It might be better to keep the tasks, but deprecate the Windows apps - that way, you would still get some work done (albeit at only ~20% capacity) by your Linux volunteers.
+1

+2

Still crunching here.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 49922 - Posted: 17 Jul 2018 | 10:46:11 UTC

They are probably making sure the results returned so far are valid and scientifically useful, as trust in the results after something like this is probably slim.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 49923 - Posted: 17 Jul 2018 | 11:26:15 UTC

I am a new user and don't want to criticize. But I see that the minimum quorum is one. Why?
Tullio

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 49924 - Posted: 17 Jul 2018 | 11:37:38 UTC - in response to Message 49918.

It might be better to keep the tasks, but deprecate the Windows apps - that way, you would still get some work done (albeit at only ~20% capacity) by your Linux volunteers.

I will put my GTX 980 on Ubuntu to help. My GTX 1060 that crashed was overheating at 82C or more - it has a bad heatsink or voltage regulator or something.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 49927 - Posted: 17 Jul 2018 | 14:24:28 UTC - in response to Message 49924.

My GTX 1060 that crashed was overheating at 82C or more - it has a bad heatsink or voltage regulator or something.

Try taking off the heatsink and changing the thermal paste. Whatever you put on will definitely be better than stock and will last a lot longer. I recommend Arctic Silver 5, but make sure you don't get any on components because it is conductive.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 49928 - Posted: 17 Jul 2018 | 15:00:57 UTC - in response to Message 49927.

Try taking off the heatsink and changing the thermal paste.

Yes, I did that a few weeks ago, using Arctic MX-4. It didn't change a thing. I noticed several months ago that it was getting too warm for comfort, and have tried it now in three different machines. One of them has a 120mm rear exhaust fan, a 120mm top exhaust fan, and a 120mm front intake fan. It still ran at 80C, in a cool room. I think it is gone - either a heatpipe, or else the GPU chip itself or voltage regulator is causing too much current to flow.

I have an EVGA GTX 970, though, which will work fine until Nvidia decides to release something worth buying at a reasonable price.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 49942 - Posted: 18 Jul 2018 | 7:19:27 UTC - in response to Message 49923.

Well, apparently Gianni also knows how to deprecate apps. So now we will have Raimondas compiling the new app version, which may take a few days, and then he will deploy the new app. I assume we should have some sort of tutorial on this stuff for more people, but from what I gather, managing BOINC is a very esoteric business.

MartinKanne
Joined: 27 Dec 16
Posts: 6
Credit: 53,210,225
RAC: 0
Message 49991 - Posted: 22 Jul 2018 | 14:05:39 UTC

OK, a few days have passed.
Did anyone fix the problem?

[AF>P4G] anthony
Joined: 14 Mar 10
Posts: 14
Credit: 501,938,373
RAC: 0
Message 49992 - Posted: 22 Jul 2018 | 17:35:40 UTC

Hello,
I also have the same problem. For the past 5 or 6 days, all the WUs have failed with computation errors within 2 seconds. My configuration:
GTX 1060
Windows 10
Driver INF file: oem57.inf | Vendor: Nvidia | Class: Display | Version: 24.21.13.9836 | Date: 24/06/2018

Do I have to install the CUDA toolkit 9.1?

Erich56
Joined: 1 Jan 15
Posts: 1131
Credit: 9,599,332,676
RAC: 32,036,678
Message 49993 - Posted: 22 Jul 2018 | 18:55:55 UTC - in response to Message 49992.
Last modified: 22 Jul 2018 | 18:56:20 UTC

Do I have to install the CUDA toolkit 9.1?

No, the only thing that would help is to install Linux.
