Message boards : Number crunching : NOELIA_DIPEPT progressing very slowly

Author Message
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34977 - Posted: 11 Feb 2014 | 13:45:21 UTC
Last modified: 11 Feb 2014 | 15:15:08 UTC

I eventually noticed my credit dropping and found that a NOELIA WU was barely progressing. The temps were cool and it was advancing at about 0.002% per minute. After 235h (over 9 days) it was only 66% complete. I aborted the WU, and a SANTI_MAR is now progressing normally.

gluglux8x68-NOELIA_DIPEPT-0-2-RND0892
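Task: 7718045
Computer: 159186
Sent: 31 Jan 2014 | 22:54:24 UTC
Time reported or deadline: 11 Feb 2014 | 13:27:00 UTC
Status: Aborted by user
Run time (sec): 848,171.58
CPU time (sec): 7,994.09
Credit: ---
Application: Long runs (8-12 hours on fastest card) v8.03 (cuda55)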

So, just a reminder to keep an eye out for lazy tasks.
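
For anyone who wants to put a number on a "lazy" task, here is a rough Python sketch. It only extrapolates linearly from elapsed time and percent done. The function names, the 12 hour nominal run time (the top of the "Long runs" range, which is quoted for the fastest card only) and the slack factor of 3 are my own assumptions, not anything BOINC or the project provides.

```python
def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Extrapolate total run time, assuming the average rate so far continues."""
    if percent_done <= 0:
        raise ValueError("no progress yet, cannot extrapolate")
    return elapsed_hours * 100.0 / percent_done


def looks_stalled(elapsed_hours: float, percent_done: float,
                  nominal_hours: float = 12.0, slack: float = 3.0) -> bool:
    """Flag a task whose projected run time is far beyond the nominal one.

    nominal_hours and slack are assumed values; slower cards legitimately
    take longer, so tune them for your own hardware.
    """
    if percent_done <= 0:
        # No progress at all: treat as stalled once the nominal time has passed.
        return elapsed_hours >= nominal_hours
    return projected_total_hours(elapsed_hours, percent_done) > nominal_hours * slack


if __name__ == "__main__":
    # Figures from the aborted NOELIA task: 66% after 235 hours.
    print(round(projected_total_hours(235, 66), 1), "hours projected")
    print("stalled:", looks_stalled(235, 66))
```

With the figures above, the projection comes out at roughly 356 hours in total, so the task would have needed about another 5 days.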
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Joined: 28 Jul 12
Posts: 807
Credit: 1,562,732,971
RAC: 80,087
Level
His
Message 34984 - Posted: 11 Feb 2014 | 18:47:55 UTC - in response to Message 34977.
Last modified: 11 Feb 2014 | 18:49:31 UTC

I have had about a dozen complete without problems (GTX 660), but one errored out on all users who got it. It ended quickly fortunately, but maybe there are problems with the series.
http://www.gpugrid.net/workunit.php?wuid=5128425

But these days when I see a slow clock, I assume that the card is being over-stressed and slowing down to protect itself. I then reduce the clocks. It is counter-intuitive, but works; I haven't seen that problem in months.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 34991 - Posted: 12 Feb 2014 | 12:03:40 UTC
Last modified: 12 Feb 2014 | 12:04:06 UTC

Sounds nasty :/ I sent the post around, but I don't think there will be much on it as it seems to be an outlier (a very bad one, admittedly).

ecafkid
Joined: 31 Dec 10
Posts: 4
Credit: 1,359,947,817
RAC: 0
Level
Met
Message 34994 - Posted: 12 Feb 2014 | 13:15:05 UTC

I aborted this WU after 24 hours. It was at 0% and the Remaining column had only dashes in it.

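Task: 7764248
Computer: 166275
Sent: 11 Feb 2014 | 12:23:51 UTC
Time reported or deadline: 12 Feb 2014 | 12:32:31 UTC
Status: Aborted by user
Run time (sec): 86,754.02
CPU time (sec): 0.00
Credit: ---
Application: Long runs (8-12 hours on fastest card) v8.15 (cuda55)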

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 35004 - Posted: 12 Feb 2014 | 18:15:20 UTC - in response to Message 34994.

Nothing to do with 'NOELIA_DIPEPT progressing very slowly', but regarding your setup:
You get a lot of errors! These problems may stem from overheating GPUs. 80C is probably too hot. I suggest you control your GPU temperatures so that they run a bit cooler. Try MSI Afterburner.

http://www.gpugrid.net/result.php?resultid=7758914

I can't tell what the temps are for device 2, but many tasks seem to fail shortly after the GPU reaches 80C. I believe throttling starts at 80C, but in this case the issue may be due to the task being swapped from one of your Quadro K5000s to the other, or to your Tesla K20c.

I wonder if this works well for the Quadros and Teslas?
Possibly something for Matt to think about, but if the temp is kept below 80C then this wouldn't be happening.

Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [Quadro K5000] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 2 :
# Name : Quadro K5000
# ECC : Disabled
# Global mem : 4095MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 705MHz
# Memory clock : 2700MHz
# Memory width : 256bit
# Driver version : r331_00 : 33182
# GPU 0 : 74C
# GPU 1 : 65C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 1 : 66C
# GPU 0 : 79C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 1 : 69C
# GPU 0 : 80C
# GPU 1 : 70C
# GPU 1 : 71C
# GPU [Tesla K20c] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : Tesla K20c
# ECC : Enabled
# Global mem : 4095MB
# Capability : 3.5
# PCI ID : 0000:22:00.0
# Device clock : 705MHz
# Memory clock : 2600MHz
# Memory width : 320bit
# Driver version : r331_00 : 33182
SWAN : FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0

</stderr_txt>
]]>
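
If you have a lot of results to check, the temperature lines in the stderr output can be pulled out with a few lines of Python. This is only a sketch: the regular expression is written against the "# GPU n : nnC" lines as they appear in the output above, and the function name and the shortened sample text are mine. Note that the GPU numbers on the temperature lines are not necessarily the same as the SWAN device numbers, which is why the temps for device 2 can't be read off directly.

```python
import re

# Matches lines such as "# GPU 0 : 74C" but not the "# GPU [Quadro K5000] ..." header.
TEMP_LINE = re.compile(r"^#\s*GPU\s+(\d+)\s*:\s*(\d+)C$")


def max_temps(stderr_text: str) -> dict:
    """Return the highest temperature (in C) logged for each GPU index."""
    peaks = {}
    for line in stderr_text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            gpu, temp = int(match.group(1)), int(match.group(2))
            peaks[gpu] = max(temp, peaks.get(gpu, temp))
    return peaks


# Shortened excerpt of the stderr output quoted above.
SAMPLE = """\
# GPU [Quadro K5000] Platform [Windows] Rev [3203M] VERSION [55]
# GPU 0 : 74C
# GPU 1 : 65C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 1 : 71C
SWAN : FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1963.
"""

if __name__ == "__main__":
    for gpu, temp in sorted(max_temps(SAMPLE).items()):
        print(f"GPU {gpu}: peak {temp}C")
```

For the excerpt above this reports a peak of 80C on GPU 0 and 71C on GPU 1, with the fatal error following shortly after GPU 0 reached 80C.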


ecafkid
Joined: 31 Dec 10
Posts: 4
Credit: 1,359,947,817
RAC: 0
Level
Met
Message 35055 - Posted: 15 Feb 2014 | 16:25:00 UTC - in response to Message 35004.

Thanks! Is there an optimum temperature for them to run at? I know sometimes there is a magic temp at which things give their best performance. I really appreciate you looking into this. I am not a scientist or researcher; I just have some spare cycles on my computers that I feel should go to good use. I will probably bring three or four more online soon.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 35124 - Posted: 18 Feb 2014 | 17:42:38 UTC - in response to Message 35055.

Try to keep your GPUs below 70C and they should run fine. Often 70C to ~78C is OK, but when you go over 80C expect trouble, especially on the smaller cards. When GK104, GK106 and GK107 cards hit 70C they mostly stop boosting as high. The GK110 cards tend to throttle their boost when they hit 80C, so keeping below these temperatures helps. Ideally, you keep the GPUs as cool as possible, as that way they lose less power as heat. Power is one factor that can limit the GPU's clocks. Reliable voltage is another...
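
As a quick reference, those rules of thumb can be written down as a small Python lookup. Treat it as a sketch of the guidance in this post and nothing more: the thresholds are the approximate figures given above, not values read from the driver, and the table and function names are made up for the example.

```python
# Approximate temperature (C) at which each chip starts boosting less,
# taken from the rules of thumb above. These are not driver-reported limits.
BOOST_LIMIT_C = {"GK104": 70, "GK106": 70, "GK107": 70, "GK110": 80}

# Above this temperature the post says to expect trouble on any card.
TROUBLE_ABOVE_C = 80


def temp_status(chip: str, temp_c: int) -> str:
    """Classify a GPU temperature as 'ok', 'boost reduced' or 'expect trouble'."""
    limit = BOOST_LIMIT_C.get(chip.upper())
    if limit is None:
        raise KeyError(f"no rule of thumb recorded for chip {chip!r}")
    if temp_c > TROUBLE_ABOVE_C:
        return "expect trouble"
    if temp_c >= limit:
        return "boost reduced"
    return "ok"


if __name__ == "__main__":
    for chip, temp in [("GK104", 65), ("GK104", 72), ("GK110", 80), ("GK110", 85)]:
        print(chip, temp, "->", temp_status(chip, temp))
```

The 70C to ~78C band is reported as "boost reduced" for the smaller chips rather than as a failure, since tasks often still run fine in that range.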