
Message boards : News : acemdlong application 8.14 - discussion

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32322 - Posted: 27 Aug 2013 | 15:59:45 UTC

We decided to go live with the new application on ACEMD-Long.

It is version 8.00, and the server will automatically assign you a CUDA 4.2 or 5.5 version, depending on your driver version.

Any problems on this thread, please.

MJH

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32324 - Posted: 27 Aug 2013 | 16:54:04 UTC
Last modified: 27 Aug 2013 | 16:55:55 UTC

Saying none available :(

Short run queue is also dry.

8/27/2013 11:56:14 AM | GPUGRID | Scheduler request completed: got 0 new tasks
8/27/2013 11:56:14 AM | GPUGRID | No tasks sent
8/27/2013 11:56:14 AM | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
8/27/2013 11:56:14 AM | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,195,561,293
RAC: 1,590,531
Message 32325 - Posted: 27 Aug 2013 | 16:54:09 UTC
Last modified: 27 Aug 2013 | 17:12:57 UTC

Unfortunately I do not get new work on my GTX 670 (2048MB, driver 311.6), although you mention that the application will assign the CUDA version accordingly.

I get the following message in the BOINC log:
27/08/2013 11:51:53 a.m. | GPUGRID | Scheduler request completed: got 0 new tasks
27/08/2013 11:51:53 a.m. | GPUGRID | NVIDIA GPU: Upgrade to the latest driver to use all of this project's GPU applications
27/08/2013 11:51:53 a.m. | GPUGRID | No tasks sent
27/08/2013 11:51:53 a.m. | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
The server status page does show there are some long WUs available, though.
I personally do not like to change drivers while my systems have been running without flaws for months, unless a new driver is indispensable, stable and makes the video cards faster.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 32326 - Posted: 27 Aug 2013 | 17:10:47 UTC

I think it's probably that there are no WUs right now and the webpage doesn't update the count very often.

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,195,561,293
RAC: 1,590,531
Message 32328 - Posted: 27 Aug 2013 | 17:15:10 UTC

Are you producing new WUs at this very moment? I am asking because there were quite a lot (around 2000) a few days ago, but as somebody else noted, they disappeared very fast.

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 32330 - Posted: 27 Aug 2013 | 19:25:30 UTC - in response to Message 32328.

Noelia is checking a small batch and will then put up 1000.

gdf

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32342 - Posted: 28 Aug 2013 | 2:36:33 UTC
Last modified: 28 Aug 2013 | 3:10:53 UTC

I am beyond impressed.

30k seconds for a long-run nathan_kid on a 680:

# Time per step (avg over 5000000 steps): 6.104 ms
# Approximate elapsed time for entire WU: 30518.949 s

20k seconds for the same task type on a 780:

# Time per step (avg over 5000000 steps): 4.105 ms
# Approximate elapsed time for entire WU: 20526.378 s
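(Sanity check on those numbers: the elapsed-time estimate is just the per-step average multiplied by the step count: 6.104 ms × 5,000,000 steps = 30,520 s, about 8.5 hours, on the 680, and 4.105 ms × 5,000,000 = 20,525 s, about 5.7 hours, on the 780.)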

Rick A. Sponholz
Joined: 20 Jan 09
Posts: 52
Credit: 2,518,707,115
RAC: 0
Message 32343 - Posted: 28 Aug 2013 | 2:37:00 UTC
Last modified: 28 Aug 2013 | 2:37:21 UTC

Not getting any work units despite having installed the latest drivers:
8/27/2013 1:25:58 PM | | CUDA: NVIDIA GPU 0: GeForce GTX 690 (driver version 320.78, CUDA version 5.50, compute capability 3.0, 2048MB, 1955MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | CUDA: NVIDIA GPU 1: GeForce GTX 690 (driver version 320.78, CUDA version 5.50, compute capability 3.0, 2048MB, 1926MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | CUDA: NVIDIA GPU 2: GeForce GTX 690 (driver version 320.78, CUDA version 5.50, compute capability 3.0, 2048MB, 1955MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | CUDA: NVIDIA GPU 3: GeForce GTX 690 (driver version 320.78, CUDA version 5.50, compute capability 3.0, 2048MB, 1955MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | OpenCL: NVIDIA GPU 0: GeForce GTX 690 (driver version 320.78, device version OpenCL 1.1 CUDA, 2048MB, 1955MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | OpenCL: NVIDIA GPU 1: GeForce GTX 690 (driver version 320.78, device version OpenCL 1.1 CUDA, 2048MB, 1926MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | OpenCL: NVIDIA GPU 2: GeForce GTX 690 (driver version 320.78, device version OpenCL 1.1 CUDA, 2048MB, 1955MB available, 3132 GFLOPS peak)
8/27/2013 1:25:58 PM | | OpenCL: NVIDIA GPU 3: GeForce GTX 690 (driver version 320.78, device version OpenCL 1.1 CUDA, 2048MB, 1955MB available, 3132 GFLOPS peak)


BUT AM GETTING THIS:
8/27/2013 10:10:19 PM | GPUGRID | Sending scheduler request: To fetch work.
8/27/2013 10:10:19 PM | GPUGRID | Requesting new tasks for NVIDIA
8/27/2013 10:10:21 PM | GPUGRID | Scheduler request completed: got 0 new tasks
8/27/2013 10:10:21 PM | GPUGRID | NVIDIA GPU: Upgrade to the latest driver to use all of this project's GPU applications
8/27/2013 10:10:21 PM | GPUGRID | No tasks sent
8/27/2013 10:10:21 PM | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
____________

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32344 - Posted: 28 Aug 2013 | 3:25:19 UTC - in response to Message 32343.

Not getting any work units despite having installed the latest drivers:
8/27/2013 1:25:58 PM | | CUDA: NVIDIA GPU 0: GeForce GTX 690 (driver version 320.78, CUDA version 5.50, compute capability 3.0, 2048MB, 1955MB available, 3132 GFLOPS peak)....


BUT AM GETTING THIS:

8/27/2013 10:10:21 PM | GPUGRID | NVIDIA GPU: Upgrade to the latest driver to use all of this project's GPU applications....


Try the latest from here:


https://developer.nvidia.com/opengl-driver

Currently 326.84

Operator
____________

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32345 - Posted: 28 Aug 2013 | 3:46:51 UTC

Operator, please post your completed time on the long run for your Titan when finished.

Cheers

Matt
Joined: 11 Jan 13
Posts: 216
Credit: 846,538,252
RAC: 0
Message 32347 - Posted: 28 Aug 2013 | 5:22:28 UTC

Just updated to 320.49, the latest Nvidia driver for my 680. BOINC is telling me I still need to update my drivers in order to receive work.

I see Operator posted a link to another driver. Is this the one we need in order to receive any of the new WUs? I don't have a lot of experience with these and am wondering how installing this will impact day-to-day uses of the computer (internet, movies, gaming).

Thanks for any info. I'd like to keep contributing as much as possible while still being able to use my PC for other purposes.

HA-SOFT, s.r.o.
Joined: 3 Oct 11
Posts: 100
Credit: 5,879,292,399
RAC: 0
Message 32352 - Posted: 28 Aug 2013 | 7:59:08 UTC - in response to Message 32345.
Last modified: 28 Aug 2013 | 8:00:18 UTC

Operator, please post your completed time on the long run for your Titan when finished.

Cheers

Nathan Kid on Titan 19k seconds 3.9ms per step

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32353 - Posted: 28 Aug 2013 | 8:15:24 UTC - in response to Message 32347.
Last modified: 28 Aug 2013 | 8:17:15 UTC

Just updated to 320.49, the latest Nvidia driver for my 680. BOINC is telling me I still need to update my drivers in order to receive work.

I see Operator posted a link to another driver. Is this the one we need in order to receive any of the new WUs? I don't have a lot of experience with these and am wondering how installing this will impact day-to-day uses of the computer (internet, movies, gaming).

Thanks for any info. I'd like to keep contributing as much as possible while still being able to use my PC for other purposes.

Hi Matt,
You can install those drivers safely, but you can find the latest drivers at nVidia's official site as well. Look here for the GTX 680; if you then click on "start search" you get the latest for your card, 326.80.
Hope this helps.

Edit: it is a beta driver, but I have been using it for a week now on two rigs without any issues.
____________
Greetings from TJ

Profile Zarck
Joined: 16 Aug 08
Posts: 145
Credit: 328,473,995
RAC: 0
Message 32354 - Posted: 28 Aug 2013 | 8:19:04 UTC - in response to Message 32347.
Last modified: 28 Aug 2013 | 8:19:18 UTC

Just updated to 320.49, the latest Nvidia driver for my 680. BOINC is telling me I still need to update my drivers in order to receive work.

I see Operator posted a link to another driver. Is this the one we need in order to receive any of the new WUs? I don't have a lot of experience with these and am wondering how installing this will impact day-to-day uses of the computer (internet, movies, gaming).

Thanks for any info. I'd like to keep contributing as much as possible while still being able to use my PC for other purposes.


326.80

http://www.geforce.com/drivers

@+
*_*
____________

Profile Zarck
Joined: 16 Aug 08
Posts: 145
Credit: 328,473,995
RAC: 0
Message 32358 - Posted: 28 Aug 2013 | 9:53:53 UTC - in response to Message 32354.
Last modified: 28 Aug 2013 | 10:00:56 UTC

First long GPUGRID unit completed by my Titan.

http://www.gpugrid.net/results.php?hostid=156585

The second is in progress:

https://www.dropbox.com/s/5s9xejf5ffit7o7/GpuGridUnitLong.jpg

@+
*_*
____________

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 32361 - Posted: 28 Aug 2013 | 10:16:00 UTC - in response to Message 32322.

We decided to go live with the new application on ACEMD-Long.

It is version 8.00, and the server will automatically assign you a CUDA 4.2 or 5.5 version, depending on your driver version.

Any problems on this thread, please.

MJH


Does the difference between 4.2 and 5.5 make any difference for Fermi cards, or can "we" stay on the current drivers performance-wise?
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 32363 - Posted: 28 Aug 2013 | 10:27:24 UTC - in response to Message 32361.

There seems to be a small improvement in the new application but not in the Cuda version.
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

jlhal
Joined: 1 Mar 10
Posts: 147
Credit: 1,077,535,540
RAC: 0
Message 32365 - Posted: 28 Aug 2013 | 10:57:07 UTC

Hello guys and gals !
Still nothing to crunch for my GPUs ...

mer. 28 août 2013 12:49:09 CEST | GPUGRID | Sending scheduler request: To fetch work.
mer. 28 août 2013 12:49:09 CEST | GPUGRID | Requesting new tasks for NVIDIA
mer. 28 août 2013 12:49:11 CEST | GPUGRID | Scheduler request completed: got 0 new tasks
mer. 28 août 2013 12:49:11 CEST | GPUGRID | No tasks sent
mer. 28 août 2013 12:49:11 CEST | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
mer. 28 août 2013 12:49:11 CEST | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)

____________
Lubuntu 16.04.1 LTS x64

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32370 - Posted: 28 Aug 2013 | 12:03:58 UTC - in response to Message 32363.

There seems to be a small improvement in the new application but not in the Cuda version.


Well if you have a Titan or a 780 the apps now work! I think that was the point, at least it was for me (since March).

Now... I wish the server would give me more than just one WU at a time since I have TWO GPUs.

Operator

____________

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 32371 - Posted: 28 Aug 2013 | 12:07:28 UTC - in response to Message 32370.

Now... I wish the server would give me more than just one WU at a time since I have TWO GPUs.

Operator

I wish it would give me ANY WUs. Updated the 670 to 326.80 and STILL NO WORK :-(

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32373 - Posted: 28 Aug 2013 | 12:28:52 UTC

Odd error on a long run for my 780:

The simulation has become unstable. Terminating to avoid lock-up

All WUs are currently dry again as well.

GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 32384 - Posted: 28 Aug 2013 | 14:52:46 UTC

Running dry of WUs on all machines. No long or short WUs have been split for some time. As others say, it's not a driver issue, despite the new message from the server.

GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 32396 - Posted: 28 Aug 2013 | 16:06:10 UTC

They are flowing now, thank you.

Rick A. Sponholz
Joined: 20 Jan 09
Posts: 52
Credit: 2,518,707,115
RAC: 0
Message 32406 - Posted: 28 Aug 2013 | 19:16:44 UTC

Ok, I'm getting WUs again, BUT I give up trying to work out which new app names to use in the app_config.xml file. Can anyone help me? Thanks in advance, Rick
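PS: my best guess so far, untested. The app names seem to match the queue names used elsewhere in this thread (acemdlong and acemdshort), so a minimal app_config.xml in the usual BOINC layout would be:

<app_config>
   <app>
      <name>acemdlong</name>
      <gpu_versions>
         <!-- one task per GPU; adjust to taste -->
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
   <app>
      <name>acemdshort</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>

Corrections welcome if those names are wrong.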
____________

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32414 - Posted: 28 Aug 2013 | 19:53:56 UTC - in response to Message 32345.

Operator, please post your completed time on the long run for your Titan when finished.

Cheers


5pot;

The first 'non-beta' long WU for me was I79R10-NATHAN_KIDKIXc22_6-3-50-RND1517_0 at 19,232.48 seconds (5.34 hrs). http://www.gpugrid.net/workunit.php?wuid=4727415

The second long WU for me was I46R3-NATHAN_KIDKIXc22_6-2-50-RND7378_1 at 19,145.93 seconds (5.318314 hrs). http://www.gpugrid.net/workunit.php?wuid=4723649

W7x64
Dell T3500 12GB
Titan x2 (both EVGA factory OC'd, and on air - limited to 80C)
~1100MHz core clock (ballpark, not fixed), running 326.84 drivers from the developer site.

No crashes or funny business so far.

Operator
____________

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32415 - Posted: 28 Aug 2013 | 20:28:46 UTC

So about another 1k seconds shaved off.

I'm really, really impressed with these cards.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32418 - Posted: 28 Aug 2013 | 20:46:37 UTC - in response to Message 32414.


W7x64
Dell T3500 12GB
Titan x2 (both EVGA factory OC'd, and on air - limited to 80C)
1100hz (ballpark freq, not fixed) running 326.84 drivers from the developer site.


Hello Operator, you are the one I need to ask something.
What is the PSU in your Dell and how many GPU power plugs do you have? I have a T7400 with a 1000W PSU but only two 6-pin plugs and one unusual 8-pin GPU power plug, so I am very limited with this big box.
Thanks in advance; your answer is highly appreciated.

____________
Greetings from TJ

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32428 - Posted: 28 Aug 2013 | 23:17:17 UTC - in response to Message 32418.
Last modified: 28 Aug 2013 | 23:17:46 UTC


Hello Operator, you are the one I need to ask something.
What is the PSU in your Dell and how many GPU power plugs do you have? I have a T7400 with a 1000W PSU but only two 6-pin plugs and one unusual 8-pin GPU power plug, so I am very limited with this big box.
Thanks in advance; your answer is highly appreciated.


TJ;

I've sent you a PM so as not to crosspost.

Operator
____________

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 32441 - Posted: 29 Aug 2013 | 8:41:49 UTC - in response to Message 32370.

There seems to be a small improvement in the new application but not in the Cuda version.


Well if you have a Titan or a 780 the apps now work! I think that was the point, at least it was for me (since March).

Now... I wish the server would give me more than just one WU at a time since I have TWO GPUs.

Operator


Well, that might have been the point of the new app, but the question and answer you quoted were about the performance increase for Fermi cards, so nothing at all to do with Titan or 780.

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32512 - Posted: 29 Aug 2013 | 23:10:57 UTC

Getting the unknown error number crashes on kid WUs

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32515 - Posted: 30 Aug 2013 | 0:30:00 UTC

More have crashed, with the exact same unknown error number. I don't know if this is from the 8.02 or whatever app has been pushed out, but it only recently began doing this.

/sigh

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32519 - Posted: 30 Aug 2013 | 4:28:37 UTC

Please can we go back to 8.00 (or maybe 8.01)?

8.02 makes all the Nathans error out now.

It got so messed up I just reset the project.

Was doing fine with the 8.00 on Titans.

Operator
____________

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 32523 - Posted: 30 Aug 2013 | 7:45:40 UTC - in response to Message 32519.

Operator, 5pot,
looking at the stats you seem to be two of the 3 users who have problems with it and did not have them with 8.00. Would it be possible to restart the machines, just to make sure?

gdf

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32524 - Posted: 30 Aug 2013 | 7:58:56 UTC

8.02 is, in general, doing better than 8.00 but it looks like there's some regression that's affecting a few machines. For now I've reverted acemdlong to 8.00 (which will appear as 8.03 because of a bug in the server). 8.02 will stay on acemdshort for continued testing.

MJH

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,599,311,851
RAC: 8,786,170
Message 32527 - Posted: 30 Aug 2013 | 8:20:37 UTC - in response to Message 32512.

Getting the unknown error number crashes on kid WUs

It isn't the error number which is unknown, it's the plain-English description for it.

Yours got a 0xffffffffc0000005: mine has just died with a 0xffffffffffffff9f, description equally unknown. Task 7221543.

Profile Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 231
Message 32528 - Posted: 30 Aug 2013 | 8:41:59 UTC

I've not had any issues with 8.0, 8.1 or 8.2 on my 500 & 600 cards, either Linux or Win. The only problem now is getting fresh WUs from the server.

HA-SOFT, s.r.o.
Joined: 3 Oct 11
Posts: 100
Credit: 5,879,292,399
RAC: 0
Message 32529 - Posted: 30 Aug 2013 | 8:47:55 UTC - in response to Message 32528.

I've not had any issues with 8.0, 8.1 or 8.2 on my 500 & 600 cards, either Linux or Win. The only problem now is getting fresh WUs from the server.


Me too, on Linux with a 680 and a Titan.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32530 - Posted: 30 Aug 2013 | 8:52:05 UTC - in response to Message 32528.

Short runs (2-3 hours on fastest card) 0 801 1.95 (0.10 - 9.83) 501
ACEMD beta version 4 319 1.44 (0.01 - 5.58) 78
Long runs (8-12 hours on fastest card) 0 1,230 7.87 (0.27 - 47.95) 396

So, there are 4 Beta's and that's it?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 32532 - Posted: 30 Aug 2013 | 8:58:28 UTC - in response to Message 32530.

Noelia has over 1000 WUs to submit, but we cannot release them while these problems remain.

For now they are running on beta to test.

gdf

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32536 - Posted: 30 Aug 2013 | 12:42:36 UTC - in response to Message 32527.

Getting the unknown error number crashes on kid WUs

It isn't the error number which is unknown, it's the plain-English description for it.

Yours got a 0xffffffffc0000005: mine has just died with a 0xffffffffffffff9f, description equally unknown. Task 7221543.


Ty for the clarification.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,599,311,851
RAC: 8,786,170
Message 32538 - Posted: 30 Aug 2013 | 14:58:44 UTC - in response to Message 32536.

Getting the unknown error number crashes on kid WUs

It isn't the error number which is unknown, it's the plain-English description for it.

Yours got a 0xffffffffc0000005: mine has just died with a 0xffffffffffffff9f, description equally unknown. Task 7221543.

Ty for the clarification.

No problem.

Looking a little further down, my 0xffffffffffffff9f (more legibly described as 'exit code -97') also says

# Simulation has crashed.

Since then, I've had another with the same failure: Task 7224143.
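(For anyone decoding these by hand: the hex values are just 64-bit two's-complement integers. 0x10000000000000000 - 0xffffffffffffff9f = 0x61 = 97, so 0xffffffffffffff9f is -97. Likewise the low 32 bits of 0xffffffffc0000005 are 0xC0000005, which is the Windows access-violation status code.)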

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32542 - Posted: 30 Aug 2013 | 16:01:02 UTC - in response to Message 32532.

Noelia has over 1000 WU to submit but we cannot as there are still these problems.

Now they are running on beta to test.

gdf

Well, put 500 of them in the beta queue then please, so our rigs can run overnight while we sleep.
____________
Greetings from TJ

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32554 - Posted: 30 Aug 2013 | 22:11:52 UTC

Using the 803-55 application now (on Titans) I received two NATHAN KIDKIX WUs,

and both took almost 9k more seconds to complete than with the previous 800-55 version.

Still using the 326.84 drivers.

I don't think this 803 is actually comparable to the 800. I think something changed, and not in a good way.

http://www.gpugrid.net/result.php?resultid=7224469

http://www.gpugrid.net/result.php?resultid=7224304

I've also received at least one Nathan baxbimx and a NOELIA KLEBE that I had to abort because they were going nowhere. Just stuck.

Operator
____________

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32555 - Posted: 30 Aug 2013 | 22:13:26 UTC - in response to Message 32554.

803 and 800 are the exact same binary.

MJH

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32556 - Posted: 30 Aug 2013 | 22:19:31 UTC - in response to Message 32555.

803 and 800 are the exact same binary.

MJH


Okay. I think I found the issue. My Precision X was off.

Thanks,

Back in the fast lane now.

Operator.

____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32563 - Posted: 31 Aug 2013 | 12:12:17 UTC - in response to Message 32556.

...and 8.04 is what?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32565 - Posted: 31 Aug 2013 | 12:18:26 UTC - in response to Message 32563.

8.04 is a new beta app. Includes a bit more debugging to help me find the cause of the remaining common failure modes.

MJH

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32569 - Posted: 31 Aug 2013 | 15:14:01 UTC - in response to Message 32565.

8.04 is a new beta app. Includes a bit more debugging to help me find the cause of the remaining common failure modes.

MJH

Hello MJH,
Do we need to post the error messages or links to them, or do you find them yourself on the server side? I have 6 errors with 8.04, on the Harvey test and Noelia_Klebe beta WUs.
8.00 and 8.02 did okay on my 660 and 770.
I have the most errors on the 660; both cards have 2GB.
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32570 - Posted: 31 Aug 2013 | 15:24:36 UTC - in response to Message 32569.
Last modified: 31 Aug 2013 | 15:36:20 UTC

Hi,

No need to post errors here, they all end up in a database that I can inspect.

MJH

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,195,561,293
RAC: 1,590,531
Message 32613 - Posted: 2 Sep 2013 | 3:31:52 UTC

As the Stderr output reports:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>
I wanted to report that the WU "I74R6-NATHAN_KIDKIXc22_6-6-50-RND6702_0" (version 8.03) caused a driver crash and several blue screens until I was able to get rid of it.

The following task, "I83R9-NATHAN_baxbimx-4-6-RND5261_0" (short queue), seems to be going nowhere; I am pretty sure it will also be a blue screen tomorrow morning when I get to the office.

I would not say this video card never crashes, but this was the first time that I was not able to simply restart the computer and download a new task.

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,195,561,293
RAC: 1,590,531
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32635 - Posted: 2 Sep 2013 | 15:44:00 UTC

EDIT to my previous post, just so as not to make false accusations: the task "I83R9-NATHAN_baxbimx-4-6-RND5261_0" did not run until today because I forgot to un-suspend BOINC Manager. It is running fine now.

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32864 - Posted: 9 Sep 2013 | 21:56:38 UTC

The version 8.14 application that's been live on short is now out on acemdlong.
For those of you that haven't been paying attention to recent developments, this version has improved error recovery and fixes the suspend/resume problems.

MJH

Lazydude
Joined: 25 Sep 08
Posts: 12
Credit: 161,238,437
RAC: 0
Message 32883 - Posted: 11 Sep 2013 | 9:12:26 UTC

Hi!
I am getting a lot of these shutdown/restarts with 8.14; this did not happen with 8.00/8.03:
http://www.gpugrid.net/result.php?resultid=7266298

# GPU [GeForce GTX 780] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 780
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.5
# PCI ID : 0000:01:00.0
# Device clock : 954MHz
# Memory clock : 3024MHz
# Memory width : 384bit
# Driver version : r325_00 : 32680
# GPU 0 : 62C
# GPU 1 : 58C
# GPU 0 : 63C
# GPU 0 : 64C
# GPU 0 : 65C
# GPU 0 : 66C
# GPU 0 : 67C
# Access violation : progress made, try to restart]

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32885 - Posted: 11 Sep 2013 | 12:23:30 UTC

While I wasn't getting them with 8.03, I haven't had a crash yet on my 780s.

So while it does slow the task down a little bit, that's better than crashing the task, to me.

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 32888 - Posted: 11 Sep 2013 | 14:20:58 UTC

Hi, MJH:

I am getting short and long tasks for CUDA 4.2. My driver on the 650 Ti GPUs is 314.22 as I had too many failures with driver versions 320.57 and 320.18. What is the difference between CUDA 4.2 and 5.5 and do I need to do anything more to successfully process GPUGRID tasks in an optimum way? Despite wanting to, I cannot buy the higher performing video cards.

Thanks,

John

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32889 - Posted: 11 Sep 2013 | 14:48:15 UTC

Hi Matt,

I had 50% errors with your previous CRASH tests, the Santi's, on my 660, with the -97 error you wanted to see lots of. Have a look if you have time.

For the last 3 days I sometimes see this in the output file:

# BOINC suspending at user request (exit)

However I have done nothing; I was asleep, so the rig was unattended.
It is there when a WU failed, but also with a success, as in the last Nathan LR. On the 770 everything works perfectly.
Perhaps the latest drivers are optimized by nVidia for the 7xx series and not so good for the 6xx series of cards.
____________
Greetings from TJ

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32890 - Posted: 11 Sep 2013 | 14:54:02 UTC - in response to Message 32889.
Last modified: 11 Sep 2013 | 14:54:56 UTC

Since the last 3 days I some times see this in the output file:
# BOINC suspending at user request (exit)

However I have done nothing, I was asleep, so the rig was unattended.


Sometimes benchmarks automatically run, and GPU work gets temporarily suspended. Also, sometimes high-priority CPU jobs step in, and GPU work gets temporarily suspended.

You can always look in Event Viewer, or the stdoutdae.txt file, for more information.

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 32891 - Posted: 11 Sep 2013 | 15:19:07 UTC - in response to Message 32889.

Hi Matt,

I had 50% errors with your previous CRASH tests, the Santi's, on my 660, with the -97 error you wanted to see lots of. Have a look if you have time.

For the last 3 days I sometimes see this in the output file:
# BOINC suspending at user request (exit)

However I have done nothing; I was asleep, so the rig was unattended.
It is there when a WU failed, but also with a success, as in the last Nathan LR. On the 770 everything works perfectly.
Perhaps the latest drivers are optimized by nVidia for the 7xx series and not so good for the 6xx series of cards.

Add this line to your cc_config file to stop the CPU benchmarks from running:
<skip_cpu_benchmarks>1</skip_cpu_benchmarks>

If you're using a version of BOINC above 7.0.55, you can go to Advanced > Read config files and the change will take effect. If you're using a version lower than 7.0.55, you'll have to shut down and restart BOINC for the change to take effect.
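If you don't have a cc_config.xml yet, the option belongs in the <options> section; a minimal complete file in the standard BOINC layout, placed in the BOINC data directory, looks like:

<cc_config>
   <options>
      <!-- skip the periodic CPU benchmarks -->
      <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
   </options>
</cc_config>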

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32892 - Posted: 11 Sep 2013 | 15:24:10 UTC - in response to Message 32891.
Last modified: 11 Sep 2013 | 15:25:01 UTC

Since the last 3 days I some times see this in the output file:
# BOINC suspending at user request (exit)

However I have done nothing, I was asleep, so the rig was unattended.


Sometimes benchmarks automatically run, and GPU work gets temporarily suspended. Also, sometimes high-priority CPU jobs step in, and GPU work gets temporarily suspended.

You can always look in Event Viewer, or the stdoutdae.txt file, for more information.


I just did a brief test, where I ran CPU benchmarks manually using: Advanced -> Run CPU benchmarks.
The result in the slot's stderr.txt file was:
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)

I then ran a test where I selected Activity -> Suspend GPU, then Activity -> Use GPU based on preferences.
The result in the slot's stderr.txt file was:
# BOINC suspending at user request (exit)

So...
I no-longer believe your issue was caused by benchmarks.
But it still could have been caused by high-priority CPU jobs.

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32893 - Posted: 11 Sep 2013 | 16:30:39 UTC - in response to Message 32888.
Last modified: 11 Sep 2013 | 16:38:49 UTC

Hi, MJH:

I am getting short and long tasks for CUDA 4.2. My driver on the 650 Ti GPUs is 314.22 as I had too many failures with driver versions 320.57 and 320.18. What is the difference between CUDA 4.2 and 5.5 and do I need to do anything more to successfully process GPUGRID tasks in an optimum way? Despite wanting to, I cannot buy the higher performing video cards.

Thanks,

John


John,

I know you were asking MJH... I have a GTX 650 Ti (2GB) running the 326.84 drivers without any issues at all.

http://www.gpugrid.net/show_host_detail.php?hostid=155526

Running the older drivers will result in your system getting tasks for the older CUDA version, which is less efficient. The project is working toward bringing the codebase to the latest CUDA version (5.5) to keep up with the technology and newer GPUs.

You would be wise to try and run newer drivers to allow your system to run as efficiently as possible. Again, I can verify that my GTX 650 Ti runs fine on 326.84.

Operator
____________

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 32896 - Posted: 11 Sep 2013 | 17:42:08 UTC - in response to Message 32893.

Many thanks, Operator!!

John

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 32897 - Posted: 11 Sep 2013 | 18:33:33 UTC - in response to Message 32896.

John,

As Operator suggests, the 326.80 drivers are working very well on my GTX 650 Ti also (1 GB). The only errors have been on the betas.
http://www.gpugrid.net/results.php?userid=90514&offset=0&show_names=1&state=0&appid=

Profile Ascholten
Joined: 21 Dec 10
Posts: 7
Credit: 78,122,357
RAC: 0
Message 32898 - Posted: 11 Sep 2013 | 19:45:48 UTC - in response to Message 32897.

I don't know what happened here, but all of my tasks are not running. They start, run for a second or two, get a fraction of a percent done, then just stop utilizing the video card. The timer in BOINC is running but no progress is being made, and I can see they are not hitting the cards for any work.

This just happened out of the blue as far as I can see. I do have the latest Nvidia drivers, which I loaded a week or so ago, and I don't recall any updates that might have caused any problems.

I tried restarting my computer / BOINC, but no help.

Any ideas?
Thank you
Aaron

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 32899 - Posted: 11 Sep 2013 | 20:31:53 UTC - in response to Message 32897.

Hi, Jim:

Good to know: I will update the drivers in the next few days.

Keep on crunching....

John

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32900 - Posted: 11 Sep 2013 | 20:45:54 UTC - in response to Message 32889.
Last modified: 11 Sep 2013 | 20:49:45 UTC

TJ - that message probably doesn't indicate that the client is running benchmarks, unless you have a version or configuration I've not seen. For benchmarking, the message is along the lines of '(thread suspend)', not '(exit)'.

Looks like something your computer is doing is causing the client to suspend the work. High priority tasks from other projects? Antivirus scans? Cat walking on the keyboard?

Matt

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32901 - Posted: 11 Sep 2013 | 20:50:35 UTC - in response to Message 32898.

Ascholten - which of your machines is the problematic one?

MJH

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32902 - Posted: 11 Sep 2013 | 20:53:42 UTC - in response to Message 32888.


What is the difference between CUDA 4.2 and 5.5


Same code base, compiled with two different compiler versions.
4.2 is very old now and we need to move to 5.5 to stay current and be ready for future hardware.
(We had to skip 5.0 because of unresolved performance problems)

Matt

GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 32905 - Posted: 11 Sep 2013 | 21:28:09 UTC - in response to Message 32902.


What is the difference between CUDA 4.2 and 5.5


Same code base, compiled with two different compiler versions.
4.2 is very old now and we need to move to 5.5 to stay current and be ready for future hardware.
(We had to skip 5.0 because of unresolved performance problems)

Matt

Is 5.5 supposed to be any faster? I didn't notice any difference in crunching speed.

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32906 - Posted: 11 Sep 2013 | 21:36:40 UTC - in response to Message 32905.


Is 5.5 supposed to be any faster? I didn't notice any difference in crunching speed.


Nope.

M

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,195,561,293
RAC: 1,590,531
Message 32909 - Posted: 12 Sep 2013 | 0:15:11 UTC

Then I will stay with driver: 311.6 and CUDA 4.2. Works great for me!

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 32910 - Posted: 12 Sep 2013 | 2:12:24 UTC - in response to Message 32906.


Is 5.5 supposed to be any faster? I didn't notice any difference in crunching speed.


Nope.

M

FWIW 5.5 was about 5 minutes faster on the betas than the 4.2 for me.

http://www.gpugrid.net/result.php?resultid=7264729 <--4.2

http://www.gpugrid.net/result.php?resultid=7265118 <--5.5

Long runs seem to be a little faster also, by 18-20 minutes:

http://www.gpugrid.net/result.php?resultid=7226082 <--4.2

http://www.gpugrid.net/result.php?resultid=7232613 <--5.5

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32915 - Posted: 12 Sep 2013 | 14:03:27 UTC - in response to Message 32900.

TJ - that message probably doesn-t indicate that the client is running benchmarks unless you have a version or configuration I've not seen. For benchmarking the message is along the lines of '(thread suspended)' not '(exit)'.

Looks like something your computer is doing is causing the client to suspend the work. High priority tasks from other projects? Antivirus scans? Cat walking on the keyboard?

Matt

Aha, suspended is indeed something different than exit. Thanks for letting me know. Only 5 Rosetta tasks are running on this 2-Xeon, 4-cores-each computer, so 3 real cores are free to run GPUGRID and other things like antivirus. Indeed my AV runs at night a few times a week, but when I have watched I have never seen it suspend any BOINC work. And we don't have a cat or dog :)
Well, as long as the WU resumes and does not end in error, it is okay.
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32917 - Posted: 12 Sep 2013 | 15:27:12 UTC - in response to Message 32915.

Tools, Computing Preferences, Processor usage tab,
While processor usage is less than - set it to Zero.
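(The same preference can also be set from a file. Assuming the usual tag name for that GUI option, a minimal global_prefs_override.xml in the BOINC data directory would be:

<global_prefs_override>
   <!-- 0 = never suspend because of non-BOINC CPU usage -->
   <suspend_cpu_usage>0</suspend_cpu_usage>
</global_prefs_override>

Zero disables the CPU-usage suspend trigger; apply it with Advanced > Read local prefs file.)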
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32918 - Posted: 12 Sep 2013 | 17:23:10 UTC - in response to Message 32917.

Tools, Computing Preferences, Processor usage tab,
While processor usage is less than - set it to Zero.

I will try that immediately.
____________
Greetings from TJ

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32919 - Posted: 12 Sep 2013 | 18:14:20 UTC

MJH;

Running 8.14 long on my Titan box the last few days has been weird.

Not anywhere near the temp limits I have set (running at around 60+ instead of 81-82), and tasks are taking a lot longer than before, almost twice the time.

And there's a lot of this nonsense:

# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)

It's no wonder it takes so long.

And believe me there is nothing else this system is doing besides GPUGrid.

I didn't change anything...so what I'm asking is what did you folks change that is causing this?

Is it just those NATHAN_KIDc tasks?

Operator
____________

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32920 - Posted: 12 Sep 2013 | 19:08:35 UTC - in response to Message 32919.

Operator,

Well, the reason why the tasks are slow is because they keep getting suspended.


# BOINC suspending at user request (thread suspend)


Those log messages are printed in response to suspend/resume signals received from the client, and quite why that's happening so frequently here isn't clear, but it looks very much like a client error.

If it helps you to diagnose it: I think thread suspending should only happen when the client runs its benchmarks, with other suspend events causing the app to exit ("user request (exit)").


Mjh

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32921 - Posted: 12 Sep 2013 | 20:11:21 UTC - in response to Message 32920.

Event Log may also give more insight into why the client is suspending/resuming. I don't think it's a client error, but rather user configuration that is making it do so.

Does Event Log give you any hint?

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32922 - Posted: 12 Sep 2013 | 20:20:17 UTC - in response to Message 32920.

MJH;

I'm with you. I got it. We're on the same page.

Question is why this is doing this using 8.14 on NATHAN_KIDc and NOELIA longs on both my Titan and my GTX590 boxes. I don't see this behavior on my GTX 650Ti box though.

Both the Titan and GTX590 boxes now seem to be taking almost, if not exactly, twice the normal amount of time to process NATHAN/NOELIA (and possibly other newer) WUs.

Please take a look at:
http://www.gpugrid.net/results.php?hostid=147455

and;

http://www.gpugrid.net/results.php?hostid=152263

to see what I'm referring to.

Once I started processing with 8.14 instead of 8.03, that's when all the suspend/resume shenanigans started.

And we can't have shenanigans, now can we? ;-}

Operator
____________

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32923 - Posted: 12 Sep 2013 | 20:43:30 UTC - in response to Message 32922.

Operator,

Your 590 machine looks fine - it seems to me that it's just your Titan machine that's affected.
What does the client GUI say about the task states?

Matt

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32924 - Posted: 12 Sep 2013 | 22:44:29 UTC - in response to Message 32923.
Last modified: 12 Sep 2013 | 22:57:23 UTC

Operator,

Your 590 machine looks fine - it seems to me that it's just your Titan machine that's affected.
What does the client GUI say about the task states?

Matt


As in task state "2" ?

I may need a little direction here.

Oh and the Titan machine keeps switching to waiting WUs and then back again. Meaning it will work on two of them for 10 or 20% and then stop and start on the other two in the queue, and then go back to the first two.

Also I get the "waiting access violation" thing now and again.

And as I said before, the temps are running 64-66 degrees C, which is really strange. Nowhere near the 81-82 limits they ran at with 8.03.

I have already reset the project a couple of times thinking that would fix this. It hasn't. So I'm at a bit of a loss right now.

It doesn't seem to matter if my Precision X utility is on or not.

Operator.
____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32925 - Posted: 12 Sep 2013 | 23:15:21 UTC - in response to Message 32924.
Last modified: 12 Sep 2013 | 23:15:55 UTC

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).
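(File-based equivalent, if I remember the tag correctly: in global_prefs_override.xml the setting is cpu_scheduling_period_minutes, e.g.

<global_prefs_override>
   <!-- how long BOINC runs one app before considering a switch -->
   <cpu_scheduling_period_minutes>990</cpu_scheduling_period_minutes>
</global_prefs_override>

then Advanced > Read local prefs file to apply.)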
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32926 - Posted: 12 Sep 2013 | 23:16:40 UTC - in response to Message 32924.


Oh and the Titan machine keeps switching to waiting WUs and then back again. Meaning it will work on two of them for 10 or 20% and then stop and start on the other two in the queue, and then go back to the first two.


And that's in addition to the "thread suspend"ing? Can you say how frequently that is happening (watch the output to the stderr file as the app is running)? Does the task state that the client reports (running, suspended, etc.) match, or is it always showing the task as running, even when you can see that it has suspended?

Could you try running just a single task, and see whether that behaves itself? Use the app_config setting Beyond advises here:

http://www.gpugrid.net/forum_thread.php?id=3473&nowrap=true#32913

Matt

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32927 - Posted: 13 Sep 2013 | 2:09:00 UTC - in response to Message 32926.


And that's in addition to the "thread suspend"ing? Can you say how frequently that is happening (watch the output to the stderr file as the app is running)? Does the task state that the client reports (running, suspended, etc.) match, or is it always showing the task as running, even when you can see that it has suspended?

Could you try running just a single task, and see whether that behaves itself? Use the app_config setting Beyond advises here:

http://www.gpugrid.net/forum_thread.php?id=3473&nowrap=true#32913

Matt

I've just changed the computing prefs from 60.0 minutes to 990.0.

Don't know why this Titan box would start hopping around now when it never has before, while the GTX590 box has the exact same settings and doesn't do it with the 4 WUs it has waiting in the queue... but...

And as for the stderr;

Check this out, this is just a portion..

9/12/2013 8:09:37 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:09:37 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:14:11 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:15:26 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:17:38 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:19:38 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:22:10 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:23:00 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:27:13 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:28:06 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:30:32 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:36:27 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:39:18 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:48:54 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:51:12 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:55:34 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:56:47 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 9:01:11 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3

I will try running just one WU (suspending the others) and see what happens.

Operator
____________

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 32930 - Posted: 13 Sep 2013 | 12:49:35 UTC - in response to Message 32925.

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).

I use a higher setting too but this should only apply when running more than 1 project. If it's switching between WUs of the same project it's a BOINC bug (heaven forbid) and should be reported.

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32931 - Posted: 13 Sep 2013 | 13:30:03 UTC

My 780 completes them fine, but has constant access violations.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,599,311,851
RAC: 8,786,170
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32932 - Posted: 13 Sep 2013 | 13:56:07 UTC - in response to Message 32930.

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).

I use a higher setting too but this should only apply when running more than 1 project. If it's switching between WUs of the same project it's a BOINC bug (heaven forbid) and should be reported.

That is one side-effect of the "BOINC temporary exit" procedure, which I think is what Matt Harvey is using for his new crash-recovery application.

When an internal problem occurs, the new v8.14 application exits completely, and as far as BOINC knows, the GPU is free and available to schedule another task. If you have another task - from this or any other project - ready to start, BOINC will start it.

On my system, with (still) a stupidly high DCF, the sequence is:
Task 1 errors and exits
Task 2 starts on the vacant GPU
Task 1 becomes ready ('waiting to run') again.
BOINC notices a deadline miss looming
BOINC pre-empts Task 2, and restarts Task 1, marking it 'high priority'

That results in Task 2 showing 'Waiting to run', with a minute or two of runtime completed, before Task 1 finally completes.

With more normal estimates and no EDF, Task 1 and Task 2 would swap places every time a fault and temporary exit occurred. Even if you run minimal cache and don't normally fetch the next task until shortly before it is needed, you may find that a work fetch is triggered by the first error and temporary exit, and the swapping behaviour starts after that.

If the tasks swap places more than a few times each run, your GPU is marginal and you should investigate the cause.
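For the curious, here's what that looks like from the application's side - a minimal sketch around the public BOINC API call boinc_temporary_exit(); the work loop and fault-detection hooks are hypothetical stand-ins, not ACEMD's actual source:

    // Sketch only: boinc_temporary_exit() is the real BOINC API call;
    // everything else here is a hypothetical stand-in.
    #include "boinc_api.h"

    static int steps_left = 1000;                          // hypothetical amount of work
    bool work_remaining()      { return steps_left > 0; }
    void run_simulation_step() { --steps_left; }           // stands in for one chunk of GPU work
    bool gpu_fault_detected()  { return false; }           // stands in for a caught access violation

    int main(int argc, char** argv) {
        boinc_init();
        while (work_remaining()) {
            run_simulation_step();
            if (gpu_fault_detected()) {
                // The process exits completely. The client now sees a free GPU
                // and may start a different task before retrying this one -
                // which is exactly the swapping behaviour described above.
                boinc_temporary_exit(30, "GPU fault: restarting from checkpoint");
            }
        }
        boinc_finish(0);   // normal completion
    }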

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32933 - Posted: 13 Sep 2013 | 15:48:12 UTC - in response to Message 32932.



When an internal problem occurs, the new v8.14 application exits completely, and as far as BOINC knows, the GPU is free and available to schedule another task. If you have another task - from this or any other project - ready to start, BOINC will start it.

On my system, with (still) a stupidly high DCF, the sequence is:
Task 1 errors and exits
Task 2 starts on the vacant GPU
Task 1 becomes ready ('waiting to run') again.
BOINC notices a deadline miss looming
BOINC pre-empts Task 2, and restarts Task 1, marking it 'high priority'

That results in Task 2 showing 'Waiting to run', with a minute or two of runtime completed, before Task 1 finally completes.

With more normal estimates and no EDF, Task 1 and Task 2 would swap places every time a fault and temporary exit occurred. Even if you run minimal cache and don't normally fetch the next task until shortly before it is needed, you may find that a work fetch is triggered by the first error and temporary exit, and the swapping behaviour starts after that.

If the tasks swap places more than a few times each run, your GPU is marginal and you should investigate the cause.



I'm not sure what your last statement means: "your GPU is marginal".

I have reviewed some completed WUs for another Titan host to see if we were having similar issues.
http://www.gpugrid.net/results.php?hostid=156948

And this host (Anonymous) is posting completion times I used to get while running the 8.03 long app.

The major thing that jumped out at me was that this box is running driver 326.41.

Still showing lots of "Access violations" though.

I believe I'm running 326.84.

I just can't figure out why this started, seemingly out of the blue. I don't remember making any changes to the system, and this box does nothing but GPUGrid.

Operator
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,599,311,851
RAC: 8,786,170
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32934 - Posted: 13 Sep 2013 | 16:29:17 UTC - in response to Message 32933.

I'm not sure what your last statement means "your GPU is marginal".

For me, a marginal GPU is one which throws a lot of errors, and a GPU which throws a lot of errors is marginal, even if it recovers from them.

Marginal being faulty, badly installed, badly maintained, or just in need of some TLC.

Things like overclocking, overheating, under-ventilating, under-powering (small PSU), badly seating (in PCIe slot) - anything which makes it unhappy.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32935 - Posted: 13 Sep 2013 | 17:04:03 UTC - in response to Message 32934.

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.
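The idea behind such a tester is straightforward - write a known pattern into GPU memory, read it back, and count mismatches. A toy sketch of the concept, using only the standard CUDA runtime API (the real tool will be more thorough):

    // Toy GPU memory check, compiled with nvcc. Illustrative only.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 64 * 1024 * 1024;   // 64M words = 256 MB under test
        std::vector<unsigned> pattern(n, 0xA5A5A5A5u), readback(n);

        unsigned* d = nullptr;
        if (cudaMalloc(&d, n * sizeof(unsigned)) != cudaSuccess) {
            std::fprintf(stderr, "cudaMalloc failed\n");
            return 1;
        }
        cudaMemcpy(d, pattern.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);
        cudaMemcpy(readback.data(), d, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
        cudaFree(d);

        size_t bad = 0;
        for (size_t i = 0; i < n; ++i)
            if (readback[i] != pattern[i]) ++bad;
        std::printf("%zu mismatched words\n", bad);   // anything non-zero is suspect
        return bad ? 2 : 0;
    }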

Matt

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32936 - Posted: 13 Sep 2013 | 17:15:39 UTC - in response to Message 32935.

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt

Well, my GTX660 has a lot of -97 errors, especially with the CRASH (Santi) tests. The LRs and the Noelias are handled mostly error-free on it. Einstein, Albert and Milkyway run almost error-free on it too. So if it were something on the PC or in other software (I don't have much installed), wouldn't that also result in 50% errors with the other 3 GPU projects? Or not?
____________
Greetings from TJ

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32943 - Posted: 13 Sep 2013 | 23:14:45 UTC - in response to Message 32936.
Last modified: 13 Sep 2013 | 23:16:55 UTC

Well, my GTX660 has a lot of -97 errors, especially with the CRASH (Santi) tests. The LRs and the Noelias are handled mostly error-free on it. Einstein, Albert and Milkyway run almost error-free on it too. So if it were something on the PC or in other software (I don't have much installed), wouldn't that also result in 50% errors with the other 3 GPU projects? Or not?

No. Different projects have different apps, which use different parts of the GPU (causing different GPU usage and/or power draw).
I have two Asus GTX 670 DC2OG 2GD5 (factory overclocked) cards in one of my hosts, and they were unreliable with some batches of workunits. I upgraded the MB/CPU/RAM in this host, but its reliability stayed low with those batches; even the WLAN connection was lost when the GPU-related failures happened. Then I found a voltage-tweaking utility for Kepler-based cards, and with its help I raised the GPU's boost voltage by 25mV and its power limit by 25W. Since then this host hasn't had any GPU-related errors. Maybe you should give it a try too. It's a quite simple tool; if you put nvflash beside its files, it can directly flash the modified BIOS.

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32944 - Posted: 13 Sep 2013 | 23:34:38 UTC

The *only* other thing running besides GPUGrid was WCG. And I mean no anti-virus, nothing at all. Still got the access violations. So I suspended the WCG app. Still the access violations occurred.

Although my GPU appears to be functioning correctly (just in general), I bumped the voltage to 1.175 from 1.163. Will report back if this has any impact on the access violations.

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32945 - Posted: 13 Sep 2013 | 23:39:43 UTC - in response to Message 32943.

Thanks Zoltan, I will install them and try tomorrow (later today after sleep).
____________
Greetings from TJ

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32954 - Posted: 14 Sep 2013 | 17:37:41 UTC - in response to Message 32935.
Last modified: 14 Sep 2013 | 17:44:58 UTC

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt


Yesterday evening in response to Mr. Haselgrove's post I decided to do a thorough review of my Titan system.

I removed the 326.84 drivers and did a clean install of the 326.41 drivers.

Removed the Precision X utility completely.

Removed, inspected and reinstalled both Titan GPUs. No dust or other foreign contaminants were found inside the case or GPU enclosures.

All cabling checks out.

Confirmed all bios settings were as originally set and correct.

No other programs are starting with Windows except Teamviewer (which is on all my systems and was long before these problems started).

I reinstalled BOINC from scratch with no settings saved from the old installation.

I setup GPUGrid all over again (new machine number now).

No appconfig or any other XML tweak files are present.

Both GPUs are running with factory settings (EVGA GTX Titan SC).

I downloaded two long 8.14 WUs and they started. Then the second two downloaded and went from "Ready to Start" to "Waiting to Run" as the first two paused and the system started processing the newly arrived ones. And then after a few minutes it went back again to the first two that were downloaded.

Temps the whole time were nominal - that is to say, in the 70s C.

I monitored the task manager and occasionally would see one of the 8.14 apps drop off, leaving only one running, and then after a minute or so there would be two of them running again.

The results show multiple access violations:

http://www.gpugrid.net/result.php?resultid=7277767

http://www.gpugrid.net/result.php?resultid=7267751

It is also taking longer to complete these WUs using 8.14 (than with the 8.03 app).

Again, this started for me with the 8.14 app. None of this happened on the 8.03 app; even with a bit of overclocking it was dead stable.

I am confident with what I went through last evening that this issue is not with my system.

I have now set this box to do betas only.

- Update: the betas (8.14) are doing the same swapping as the normal long WUs were doing. I have to manually suspend two of them to keep this from happening.

On the other hand my dual GTX 590 system is happy as a clam.

Two steps forward, one step back.

Operator
____________

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32955 - Posted: 14 Sep 2013 | 17:54:40 UTC

Yours appear to be happening much more frequently than mine. Even after I increased the voltage, I still get access violations.

To reiterate, this is with absolutely no third party software running at all. In all honesty, I can only assume that something in the app is causing this behavior.

Why it's affecting Operator's Titans more than my 780s, I'm not sure.

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32957 - Posted: 14 Sep 2013 | 18:59:54 UTC - in response to Message 32954.
Last modified: 14 Sep 2013 | 19:02:07 UTC


I have now set this box to do betas only.

- Update: the betas (8.14) are doing the same swapping as the normal long WUs were doing. I have to manually suspend two of them to keep this from happening.



And now...

9/14/2013 12:44:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:48:12 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:51:13 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:54:26 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:57:43 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:00:45 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:03:48 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:07:02 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:10:19 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:13:22 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:16:24 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:19:27 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:22:29 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:25:32 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:28:34 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:31:38 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:34:40 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:37:33 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:40:50 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:43:52 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:46:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:49:59 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:53:02 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:56:04 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:58:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0

So even with the two waiting WUs manually suspended, the ones that are supposed to be running keep starting and stopping.

I really have no idea what could be causing this.

Operator
____________

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32961 - Posted: 14 Sep 2013 | 23:56:07 UTC - in response to Message 32957.

Hello Operator,

Perhaps a ridiculous idea and perhaps you have tried it already: what happens when you only use one Titan in that PC?

Have you set the minimum and maximum work buffer as low as possible, so that you don't get new WUs too fast?

It seems to be happening with the Titan and 780, the two top cards. The 326.41 drivers are not the latest ones on nVidia's site.
____________
Greetings from TJ

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32963 - Posted: 15 Sep 2013 | 3:18:04 UTC - in response to Message 32961.

Hello Operator,

Perhaps a ridiculous idea and perhaps you have tried it already: what happens when you only use one Titan in that PC?

Have you set the minimum and maximum work buffer as low as possible, so that you don't get new WUs too fast?

It seems to be happening with the Titan and 780, the two top cards. The 326.41 drivers are not the latest ones on nVidia's site.


TJ

I thought of taking one card out to see what happens and will probably try that tomorrow.

I observed today, running the beta apps (8.14), that there were times when both WUs were "Waiting to run - Scheduler: Access Violation". Meaning that with only two WUs, one for each GPU, nothing was getting done, because the app(s) had temporarily shut down.

So I would expect that with just one card the same thing would happen: just one WU, and sometimes it would simply stop and then restart.

I will try that tomorrow anyway.

Operator

____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32970 - Posted: 15 Sep 2013 | 12:15:55 UTC - in response to Message 32963.

Operator: this sounds like your card(s?) may experience computation errors which were not detected by the 8.03 app. I don't know for sure, but it could well be that the new recovery mode also added enhanced detection of faults. What I'd try:

- use 1 Titan (as TJ said) to rule out an insufficient PSU
- downclock the cards a fair bit (~50 MHz should do)... I know they're at factory clocks now, but it wouldn't be the first time that a manufacturer set overclocks too high
- try a card in a different PC

MrS
____________
Scanning for our furry friends since Jan 2002

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32977 - Posted: 15 Sep 2013 | 20:54:00 UTC

This morning I saw that my GTX660 was downclocked to 50% again (it would be nice, Matt, if we could see that in the stderr output file), so I rebooted the system.
Then in the evening I saw that one WU was "Waiting to Run" with only 2% left, while the other one was running - same as Operator had on his Titan. However, I did not intervene, and the WU finished with a good result, but I saw this in the output
(only a part of it, of course):

# GPU 0 : 64C
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3203] VERSION [42]
# SWAN Device 0 :
# Name : GeForce GTX 660


I have never seen this before. Perhaps "strange" things happened in the past as well, but now we can look for them.
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32978 - Posted: 15 Sep 2013 | 21:10:30 UTC - in response to Message 32970.
Last modified: 15 Sep 2013 | 21:52:55 UTC

I think Richard's explanation sums up what's going on well.

If a WU fails and the app doesn't crash out or fail the WU, but wants to restart it, the WU is stopped. As soon as this happens another WU will run (if it's already downloaded). In TJ's case the fix worked, but for Operator (who may have a dud card or some other issue), WUs keep trying to recover, one after the other.
Perhaps there should be a limit on the number of attempts to recover, say 20?

Operator, I thought you may have a dud card but I now think it's just a cooling issue.

The Access violations are occurring just after GPU0 reaches a high temp, usually 79°C or above - check your logs:

# GPU 0 : 80C
# GPU 1 : 74C
# Access violation : progress made, try to restart
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX TITAN
# ECC : Disabled
# Global mem : 4095MB
# Capability : 3.5
# PCI ID : 0000:03:00.0
# Device clock : 928MHz
# Memory clock : 3004MHz
# Memory width : 384bit
# Driver version : r325_00 : 32641

The issue is probably with the top GPU (closest to the CPU) getting too warm. If cooling is improved I bet the error rates will fall.

I suggest you test this by disabling the use of the top GPU for crunching at GPUGrid (hopefully GPU0 according to Boinc) using cc_config (and telling Boinc to re-read the config file):


    <cc_config>
    <options>
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
    <url>http://www.gpugrid.net/</url>
    <device_num>0</device_num>
    </exclude_gpu>
    </options>
    </cc_config>



You might still want to swap one or both cards into other system to check them.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32980 - Posted: 15 Sep 2013 | 23:09:09 UTC - in response to Message 32978.

Currently, when the app does a temporary exit, it tells the client to wait 30 secs before attempting a restart. I'll probably change this to an immediate restart; this should minimise the opportunity for the client to chop and change tasks.
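In API terms it's just the delay argument to the temporary-exit call (a sketch; the surrounding function is hypothetical):

    #include "boinc_api.h"

    void restart_after_fault(bool immediate) {
        // First argument: seconds the client should wait before rescheduling
        // this task. 30 leaves a window in which the client may start a
        // different task on the freed GPU; 0 requests an immediate restart,
        // minimising the chop-and-change behaviour.
        boinc_temporary_exit(immediate ? 0 : 30, "restarting after GPU fault");
    }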

MJH

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32982 - Posted: 15 Sep 2013 | 23:50:25 UTC

http://www.gpugrid.net/result.php?resultid=7280096

Here is an example of one that was constantly getting access violations. Notice how the temps were pretty much always in the low 70s, which AFAIK isn't hot - or rather, isn't hot enough to matter.

I have also tried running only one WU, and suspending the other, and the errors were still present.

I will bump up the fans a little bit; currently I have them at around 80%.

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32983 - Posted: 16 Sep 2013 | 0:02:12 UTC

I don't think the cards are too hot. I have had a GTX550Ti running at 79°C 24/7 with this project and almost no errors.
Moreover, Operator had no problems with version 8.03, and the temperatures should be around the same. And according to nVidia its maximum temperature is 95°C!
More strange is a lot of access violations and WU stops and starts, yet eventually the WU finished okay.
Matt knows absolutely what he is doing - he has made the app a whole lot better - but I guess he has overlooked a small thing between versions 8.03 and 8.14.
Perhaps he can have one more look tomorrow after a good sleep?
____________
Greetings from TJ

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32985 - Posted: 16 Sep 2013 | 3:57:26 UTC

I've dropped the clock, and still get the violations.

Let me ask this: does anyone have a GK110 that isn't getting access violations?

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32993 - Posted: 16 Sep 2013 | 11:09:21 UTC - in response to Message 32983.

I don't think the cards are too hot. I have had a GTX550Ti running at 79°C 24/7 with this project and almost no errors.

I don't think you can generalize from one card to the next even if they are in the same series, much less from a GTX550Ti to a GTX 660. If it fails when it hits a certain temperature, that looks like a smoking gun to me.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32995 - Posted: 16 Sep 2013 | 11:41:27 UTC - in response to Message 32985.
Last modified: 16 Sep 2013 | 11:52:36 UTC

Perhaps most of the access violations were already being recovered from by the card while running the earlier apps; there are errors that the card can recover from without intervention (recoverable errors).

Regarding safe temps - I agree with Jim, it's an individual thing.

When my GTX660Ti ran at around 78°C it had quite a lot of errors, and my 660 is likewise only good until it gets into the high 70s. Now that both are usually below 60°C, errors are very rare (just 1 Beta since 6th Sept). One of my GTX670's is fine in the high 70s, and my 470's were generally OK until they went into the 80s, but they use a Fermi architecture and the default fan profile allowed the cards to reach 93°C!

The GPU temperature of a card doesn't tell you how hot anything else is.

Perhaps Operator has different issues to 5pot, possibly Titan vs 780, or perhaps there are two different general issues; one that kicks in when the temp hits 80°C and a separate issue that can occur at lower temps.

Both systems have two GPUs (which might be significant, and can be tested by removing or disabling one), and both systems have one GPU that is noticeably hotter than the other: GPU0, which presumably sits between the other GPU and the CPU. This is common, but suggests insufficient cooling.
From experience of running single and multi-GPU setups: when you add a GPU, even once you get the GPU temperatures down to what you think is reasonable, the rest of the card is hotter, so you have to aim to reduce the ambient system temps a lot.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32999 - Posted: 16 Sep 2013 | 15:22:38 UTC - in response to Message 32995.
Last modified: 16 Sep 2013 | 15:23:13 UTC

Okay so let's take a look at somebody else.

After scanning the listing of top hosts I see another system configured similarly to mine - two GTX Titans, 3.20 Xeon processor, 12GB RAM, etc., all the same as mine - except this box is running Win8.

Now take a look at these results, riddled with access violations and temps that approach 80C from time to time. And the times are in most cases close to (but actually better than) my system's WU completion times.

http://www.gpugrid.net/results.php?hostid=156948

It's not just my system having these issues.

I will get around to pulling out one of the GPUs and running tests this evening, but I do not anticipate that with just one GPU installed that there will be any change from what my experience has been running 8.14 with both GPUs installed.

Titans do not just "crash" when they get to 80C. They do start reducing frequency to lower temps, but they don't "quit".

I know there has been discussion previously about Titans and memory mis-reporting by the driver. I notice that issue is still not resolved. Titans (currently) come standard with 6GB of memory, but they are always reported in BOINC as having only 4GB. No idea what that's about or if it even matters.

Operator
____________

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33001 - Posted: 16 Sep 2013 | 16:31:30 UTC

MJH

Does the 4X Titan E5 system in your facility only run Linux?

If it runs Windows, does it display the same issues (Access violations, etc.)?

Operator
____________

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33002 - Posted: 16 Sep 2013 | 16:39:05 UTC - in response to Message 33001.

We only run Linux in-house.

MJH

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33003 - Posted: 16 Sep 2013 | 16:43:03 UTC - in response to Message 33001.

The access violations appear to come from deep inside NVIDIA code. Maybe I'll have to get a Titan machine bumped to Windows for testing, since there's a limit to the information I can get remotely.


I know there has been discussion previously about Titans and memory mis-reporting by the driver. I notice that issue is still not resolved. Titans (currently) come standard with 6GB of memory, but they are always reported in BOINC as having only 4GB. No idea what that's about or if it even matters.


That's just because the client is using a 32-bit integer to hold the memory size; 4GB is the largest value representable in that datatype.
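A few lines illustrate it (assuming the value is clamped on its way into the 32-bit field - either way, nothing above 4GB can be represented):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    int main() {
        uint64_t actual = 6ULL * 1024 * 1024 * 1024;      // 6 GB from the driver
        uint32_t stored = static_cast<uint32_t>(
            std::min<uint64_t>(actual, UINT32_MAX));      // capped at 2^32 - 1
        std::cout << stored / (1024 * 1024) << " MB\n";   // prints 4095
    }

Hence the "Global mem : 4095MB" lines in the stderr output above.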

MJH

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33007 - Posted: 16 Sep 2013 | 18:12:25 UTC

Well, that's one step forward. Best of luck. "Deep inside NVIDIA code" doesn't sound good, but for me, while it does slow the tasks down, I'm just happy they're running and validating.

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33008 - Posted: 16 Sep 2013 | 19:23:49 UTC - in response to Message 33003.

The access violations appear to come from deep inside NVIDIA code. Maybe I'll have to get a Titan machine bumped to Windows for testing, since there's a limit to the information I can get remotely.

MJH


Thank you sir for the 32 bit reference on the memory size, makes perfect sense.

While we're on that topic, I don't suppose there is any hope of compiling a 64 bit version of the app and sending that out for testing?

It's my understanding that either a 32-bit or 64-bit app can be compiled by the toolkit.

Is this correct?

What's the downside of doing this?

Thanks,

Operator
____________

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33009 - Posted: 16 Sep 2013 | 19:27:59 UTC - in response to Message 33008.


While we're on that topic, I don't suppose there is any hope of compiling a 64 bit version of the app and sending that out for testing?


No, the Windows application will stay 32-bit for the near future, since that will work on all hosts. Importantly, there's no performance advantage to a 64-bit version. It may happen in the future, but not until 1) the transition to CUDA 5.5 is complete, and 2) 32-bit hosts contribute an insignificant fraction of our compute capability.

Matt

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33011 - Posted: 16 Sep 2013 | 19:38:45 UTC - in response to Message 32995.


Perhaps most of the access violations were already being recovered from by the card while running the earlier apps; there are errors that the card can recover from without intervention (recoverable errors).


The app will attempt recovery if it ran long enough to make a new checkpoint file. If it starts and crashes before that point, it will just abort the task, to avoid getting stuck in a loop.
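In outline the policy is roughly this (names and exit code illustrative, not the app's actual source):

    #include "boinc_api.h"

    void handle_fault(long checkpoint_at_start, long checkpoint_now) {
        if (checkpoint_now > checkpoint_at_start) {
            // Progress was made since launch: safe to retry from checkpoint.
            boinc_temporary_exit(30, "Access violation: progress made, restarting");
        } else {
            // No new checkpoint: abort the task rather than loop forever.
            boinc_finish(195);   // non-zero exit status; value illustrative
        }
    }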


Regarding safe temps - I agree with Jim, it's an individual thing.


My experience is that you get the best performance out of a Titan if the temperature is below 78C. By 80C it is throttling. Over 80C the thermal environment is too challenging for it to maintain its target (the card will be spending most of its time in the lowest performance state), and you should really try to improve the cooling.

Just turning up the fan speed can have counter-intuitive effects. When we tried this on our chassis, the increased airflow through the GPUs hindered airflow around the cards, actually making the top parts of the card, away from the thermal sensors, hotter.

MJH

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33014 - Posted: 16 Sep 2013 | 20:48:24 UTC

For tests you could just grab an old spare HDD and install some Windows on it - it doesn't even need to be activated.

What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

And you're right, Titans regulate themselves to "up to 80°C" by boosting clock speed and voltage, but not any higher. Hence it's their [b]expected temperature[/b]. They'll always like more cooling, though, unless you go below 30 K.. :D

MrS
____________
Scanning for our furry friends since Jan 2002

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33017 - Posted: 16 Sep 2013 | 21:04:54 UTC - in response to Message 33014.
Last modified: 16 Sep 2013 | 21:05:28 UTC



What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

MrS



MrS;

Easy enough to find out.

Here are the results of the last 4 WUs crunched on the Titan system using 8.03, before 8.14 got downloaded automatically:

http://www.gpugrid.net/result.php?resultid=7264399

http://www.gpugrid.net/result.php?resultid=7263637

http://www.gpugrid.net/result.php?resultid=7262985

http://www.gpugrid.net/result.php?resultid=7262745

I don't see any evidence of either errors or Access violations.

And here is the very first WU running on the 8.14 app:

http://www.gpugrid.net/result.php?resultid=7265074

So I don't think it's about heat issues, third party software, tribbles in the vent shafts, moon phases, any of that.

I've been through this system thoroughly.

It clearly started with 8.14. From the very first 8.14 WU I got.

Operator
____________

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33022 - Posted: 16 Sep 2013 | 21:31:35 UTC

What you're forgetting is that it most likely just wasn't reporting the errors before that point. He added more debug information as time went on.

My WU times are in line with previous batches as well.

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33024 - Posted: 16 Sep 2013 | 21:52:09 UTC - in response to Message 33017.

Operator, you and I don't think it is a heat issue, but the temperature readings were absent in 8.03 and were introduced between 8.04 and 8.14.
So somewhere it could have to do with temperature readings or other things regarding temperature. Matt should know, as he programmed the otherwise improved app.
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33026 - Posted: 16 Sep 2013 | 21:56:14 UTC

For these access violation problems, it seems that I'm going to have to set up a Windows system with a Titan in the lab and try to reproduce it. Unfortunately I'll not be back to do that until mid October at the earliest. I hope you can tolerate the current state of affairs until then?

Matt

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33028 - Posted: 16 Sep 2013 | 23:37:06 UTC

In the afternoon I looked at my PC and, just by coincidence, saw a WU (a CRASH) stop and another one start, and the GPU clock dropped to half because of it. Nothing I did with suspending/resuming or the EVGA software got the clock up again, short of rebooting the system - 1 day and 11 hours after its last boot for the same issue.

Now that WU is finished with good result I looked at it and found this again:

# BOINC suspending at user request (exit)

I did nothing, and the PC was only doing GPUGRID and 5 Rosetta WUs on the CPUs. The virus scanner was not in use; it runs during the night.
And I used the line from Operator in cc_config to never run a benchmark.

I think Matt has made a good diagnostic program, and we now get to see things we never saw before but that could have been happening all along. It would be nice, though, to see somewhere what all these messages mean (and what we could or could not do about them).
But only when you have time Matt, we know you are busy with programming and you need to get your PhD as well.

I am now 3 days error-free, even on my 660, so things have improved - for me at least. Thanks for that.
____________
Greetings from TJ

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33030 - Posted: 16 Sep 2013 | 23:42:31 UTC - in response to Message 33026.

For these access violation problems, it seems that I'm going to have to set up a Windows system with a Titan in the lab and try to reproduce it. Unfortunately I'll not be back to do that until mid October at the earliest. I hope you can tolerate the current state of affairs until then?

Matt


Like I said, they're running and validating. Fine with me.

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33032 - Posted: 17 Sep 2013 | 0:38:46 UTC - in response to Message 33026.

I hope you can tolerate the current state of affairs until then?

Matt


Matt;

Will have to do. Thanks for looking into it though. That's encouraging.

As I indicated I would, I removed one GPU and booted up to run long WUs.

Got one NATHAN_KIDc downloaded and running and the second, a NOELIA_INS "Ready to Start".

After one hour I came back to check and, sure enough, the first one had stopped and was now "Waiting to run", and the second one was running.

I'm sure they'll swap back and forth again several times before completion.

Operator

____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33033 - Posted: 17 Sep 2013 | 2:09:17 UTC - in response to Message 33032.

Operator:

I sent you a private message on GPUGrid with my email address, requesting some files from you. I'd like to help with your situation. Can you send me those files?

Thanks,
Jacob

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,599,311,851
RAC: 8,786,170
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33035 - Posted: 17 Sep 2013 | 10:08:00 UTC

Here's an anecdotal story, based on a random sample of one. YMMV - in fact, your system will certainly be different - but this may be of interest.

My GTX 670 host has been having a lot of problems - starting in August, which was particularly warm here. "Problems" were the occasional BSOD, but most commonly a total system freeze - Windows desktop shows on screen as normal, but the system clock stops updating and there's no response to mouse or keyboard. First suspect was overheating, so I installed extra side fans in an already well-ventilated HAF case and moved the machine to a cooler room - that seemed to improve things, but wasn't a complete cure.

Then, after this month's Windows security updates, it got much worse again - freezing every six hours or so. OS is Windows 7 Home Premium, 64-bit, and CPU is an 'Ivy Bridge' (third generation) i7 with HD 4000 graphics. Motherboard is by Gigabyte with Z77 express chipset.

Looking around, I found several Windows updates relevant to this configuration.

After consulting an experienced developer and system builder, I installed - in this order - the following updates:

1) Platform Update - http://support.microsoft.com/kb/2670838
2) Intel HD 4000 driver from the Intel site - Intel Download Centre
3) The two Driver Framework updates - Kernel-Mode and User-Mode
4) The most recent NVidia driver available - 326.80 Beta (using the 'clean install' option)

Since I did all that, the machine has run without error, and no errors have been logged in the most recent beta tasks. I'm going to try switching back to long tasks after the current beta has finished.

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33038 - Posted: 17 Sep 2013 | 12:56:52 UTC - in response to Message 33033.
Last modified: 17 Sep 2013 | 13:20:37 UTC

Jacob;

Files are in your inbox now.

Thanks,

Operator
____________

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33040 - Posted: 17 Sep 2013 | 13:49:12 UTC - in response to Message 33035.



After consulting an experienced developer and system builder, I installed - in this order - the following updates:

1) Platform Update - http://support.microsoft.com/kb/2670838
2) Intel HD 4000 driver from the Intel site - Intel Download Centre
3) The two Driver Framework updates from the list above - Kernel-Mode and User-Mode
4) The most recent NVidia driver available - 326.80 Beta (using the 'clean install' option)



Richard;

Thanks. My system board (Dell) has no integrated Intel HD video, discrete only.

I do have the platform updates already installed, and in fact have most if not all of the other updates you show there installed as well.

I actually did have Nvidia driver version 326.84 installed and reverted back to a clean install of 326.41 to determine if that had anything to do with the problem, but apparently it didn't. I think it's the way the 8.14 app runs on 780/Titan GPUs that is the issue. I don't see any of these problems with apps running on my 590 box. Matt (MJH) says he's going to have a go at investigating when he gets a chance.

I'm considering doing a Linux build to see if that makes any difference because it seems that the development branches may be different for Windows vs Linux GPUGrid apps. But I have very little experience with Linux in general so this would be time consuming for me to get spun up on.

Operator

____________

John C MacAlister
Send message
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwat
Message 33041 - Posted: 17 Sep 2013 | 15:25:33 UTC
Last modified: 17 Sep 2013 | 15:26:57 UTC

Hi, Folks

A 26 h run time forecast... Is this reasonable?

AMD FX-8350 with GTX 650 Ti.

Computer ID: 158482
Name: Panzer-001
Location: home
Avg. credit: 52,782.88
Total credit: 786,400
BOINC version: 7.0.64
CPU: AuthenticAMD AMD FX(tm)-8350 Eight-Core Processor [Family 21 Model 2 Stepping 0] (8 processors)
GPU: [2] NVIDIA GeForce GTX 650 Ti (1023MB), driver: 314.22
Operating System: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1 (06.01.7601.00)
Last contact: 17 Sep 2013 | 15:16:19 UTC

Name 35x7-SANTI_RAP74wtCUBIC-5-34-RND8406_1
Workunit 4779214
Created 17 Sep 2013 | 12:01:53 UTC
Sent 17 Sep 2013 | 15:16:19 UTC
Received ---
Server state In progress
Outcome ---
Client state New
Exit status 0 (0x0)
Computer ID 158482
Report deadline 22 Sep 2013 | 15:16:19 UTC
Run time 0.00
CPU time 0.00
Validate state Initial
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.14 (cuda42)

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33043 - Posted: 17 Sep 2013 | 15:45:04 UTC - in response to Message 33040.

Operator,

Would a bootable Linux image be useful for you?
Was planning to put one together for the memory tester anyway.

Matt

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33044 - Posted: 17 Sep 2013 | 16:12:48 UTC - in response to Message 33041.

Hi, Folks

26h run time forecast....Is this reasonable?

I wouldn't pay much attention to the forecast. See what the actual run time is; it should be about 18 hours.

John C MacAlister
Send message
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwat
Message 33045 - Posted: 17 Sep 2013 | 16:25:39 UTC - in response to Message 33044.
Last modified: 17 Sep 2013 | 16:26:19 UTC

Hi, Folks

26h run time forecast....Is this reasonable?

I wouldn't pay much attention to the forecast. See what the actual run time is; it should be about 18 hours.



Many thanks, Jim.

Regards,

John

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33048 - Posted: 17 Sep 2013 | 20:27:40 UTC - in response to Message 33017.

What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

MrS

MrS;

Easy enough to find out.

Here's the results of the last 4 WUs crunched on the Titan system using 8.03 before 8.14 got downloaded automatically:
...

I don't see any evidence of either errors or Access violations.

If I remember correctly, Matt only introduced the error handling with 8.11, and he may have also improved the error detection. So I still think it's possible that whatever triggers the error detection now was happening before but did not actually harm the WUs. It's just one possibility, though, which I don't think we can answer.

Matt, would it be sufficient if you got remote access to a Titan on Windows? I don't have any, but others might want to help. That would certainly be quicker than setting the system up yourself... although you might want to have a Windows system around to hunt nasty bugs anyway.

MrS
____________
Scanning for our furry friends since Jan 2002

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33049 - Posted: 18 Sep 2013 | 0:38:51 UTC - in response to Message 33048.

What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

MrS

MrS;

Easy enough to find out.

Here's the results of the last 4 WUs crunched on the Titan system using 8.03 before 8.14 got downloaded automatically:
...

I don't see any evidence of either errors or Access violations.


If I remember correctly Matt only introduced the error handling with 8.11. And may have also improved the error detection. So I still think it's possible that what ever triggers the error detection now was happening before, but did not actually harm the WUs. It's just one possibility, though, which I don't think we can answer.

MrS


MrS;

Looking back at the last 10 or so SANTI_RAP, NOELIA-INSP, and NATHAN_KIDKIX WUs that were run on the 8.03 app just before the switch to 8.14...

http://www.gpugrid.net/results.php?hostid=158641&offset=20&show_names=1&state=0&appid=

you can see that average completion times were about 20k seconds.

After 8.14? Sometimes double that due to the constant restarts.

So even if error checking was introduced with version 8.11, and there may have been hidden errors when running the 8.03 app (I'm not sure how that follows logically, though), the near doubling of work unit completion times immediately upon first use of the 8.14 app is enough of a smoking gun that something is amiss.

And that is the real problem here, I think: the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780-equipped systems like it) can get returned. If you like, look at it from the perspective of the "return on the kilowatts consumed".

Now, I am perfectly happy to wait till Matt has a chance to do some testing, and see where that takes us.

I'll put the second Titan GPU back in the case and continue as before until...whatever.

Operator
____________

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33054 - Posted: 18 Sep 2013 | 14:41:59 UTC

A CRASHNPT was suspended, with still 3% to finish, and another was running. I suppose this happened due to the "termination by the app to avoid hangup". So I suspended the other WU, and the one that was almost finished started again but failed immediately. So is this manual suspending not working properly anymore, or is it because the app had stopped the WU itself?
____________
Greetings from TJ

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33056 - Posted: 18 Sep 2013 | 16:03:19 UTC - in response to Message 33049.



And that is the real problem here, I think: the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780-equipped systems like it) can get returned. If you like, look at it from the perspective of the "return on the kilowatts consumed".

Operator



As an example of what I was referring to above:

With one Titan GPU installed and only one WU downloaded and crunching, the amount of time 'wasted' by the "Scheduler: Access violation, Waiting to Run" issue for I59R6-NATHAN_KIDc22_glu-6-10-RND3767_1 was 2 hours 47 minutes and 31 seconds of nothing happening.

This data came from the stdoutdae.txt file and was imported into Excel, where the "gaps" between restarts for this WU were totalled up.
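For anyone who wants to check the tally without Excel, here's a rough sketch (not the spreadsheet I used) that pulls the "Restarting task" timestamps for this WU out of stdoutdae.txt and prints the interval between consecutive restarts. Totalling the idle stretches also needs the matching suspend lines, but this shows the churn; it assumes the US-style timestamp format shown in my logs above.

    #include <ctime>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::ifstream log("stdoutdae.txt");
        const std::string needle =
            "Restarting task I59R6-NATHAN_KIDc22_glu-6-10-RND3767_1";
        std::string line;
        std::time_t prev = -1;
        while (std::getline(log, line)) {
            if (line.find(needle) == std::string::npos) continue;
            std::tm tm = {};
            tm.tm_isdst = -1;   // let mktime work out DST
            std::istringstream ts(line.substr(0, line.find(" | ")));
            ts >> std::get_time(&tm, "%m/%d/%Y %I:%M:%S %p");
            std::time_t t = std::mktime(&tm);
            if (prev != -1)
                std::cout << (t - prev) / 60.0 << " min since previous restart\n";
            prev = t;
        }
    }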

So this WU could have finished in 'real time' (not GPU time) almost three hours earlier than it did and would have allowed another WU to have been mostly completed if not for all the restarts.

Let me know if anybody sees this a different way.

Operator
____________

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33061 - Posted: 18 Sep 2013 | 16:56:54 UTC - in response to Message 33056.

I agree that it's possible that the loading and clearing of the app could use up a substantial amount of time. This again suggests that recoverable errors are now triggering the app suspension and recovery mechanism. Maybe the app just needs to be refined so that it doesn't get triggered so often.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33062 - Posted: 18 Sep 2013 | 17:04:08 UTC - in response to Message 33043.

Operator,

Would a bootable Linux image be useful for you?
Was planning to put one together for the memory tester anyway.

Matt

It would be nice to have a 64bit Linux image with BOINC, NVidia and ATI drivers installed if that's even possible. No need for anything else. All my boxes are AMD with both NVidia and AMD GPUs. Haven't had a lot of success getting Linux running so that BOINC will work for both GPU types.

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33065 - Posted: 18 Sep 2013 | 17:41:45 UTC - in response to Message 33056.



And that is the real problem here, I think: the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780-equipped systems like it) can get returned. If you like, look at it from the perspective of the "return on the kilowatts consumed".

Operator



As an example of what I was referring to above:

With one Titan GPU installed and only one WU downloaded and crunching, the amount of time 'wasted' by the "Scheduler: Access violation, Waiting to Run" issue for I59R6-NATHAN_KIDc22_glu-6-10-RND3767_1 was 2 hours 47 minutes and 31 seconds of nothing happening.

This data came from the stdoutdae.txt file and was imported into Excel where the 'gaps' between restarts for this WU were totalled up.

So this WU could have finished in 'real time' (not GPU time) almost three hours earlier than it did and would have allowed another WU to have been mostly completed if not for all the restarts.

Let me know if anybody sees this a different way.

Operator


Yours is doing something completely different from mine; why, I don't know. But since mine suspends and starts another task, very little is lost. In fact, my times are pretty much unchanged.

Your issue is odd and unique.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33066 - Posted: 18 Sep 2013 | 17:47:45 UTC - in response to Message 33062.


Haven't had a lot of success getting Linux running so that BOINC will work for both GPU types.


Unsurprising. It's difficult to do, and fragile when it's done. The trick is to do the installation in this order:
* Operating System's X, mesa packages
* Nvidia driver
* force a re-install X, mesa packages
* Catalyst
* Configure X server for the AMD card.
* Start X

MJH

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33069 - Posted: 18 Sep 2013 | 19:19:14 UTC - in response to Message 33065.



Yours is doing something completely different from mine; why, I don't know. But since mine suspends and starts another task, very little is lost. In fact, my times are pretty much unchanged.

Your issue is odd and unique.


To be clear I'm referring to the difference between the WU runtime showing in the results (20+k seconds) and the actual 'real' time the computer took to complete the WU from start to finish.

As an example, if you start a WU, it is the only one running, and it repeatedly starts and stops until it's finished, there will be a difference between the 'GPU runtime' and the actual clock time the WU took to complete.

Unless I'm way off base, GPU time is logged only while the WU is actively being worked on. If it's "Waiting to run", I don't think that time counts. That's why I said 2 hours, 47 minutes and 31 seconds of nothing happening were essentially lost.
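
(To put numbers on that: 2 h 47 m 31 s is 2×3600 + 47×60 + 31 = 10,051 seconds. Against a reported runtime of roughly 20,000 seconds, the WU tied up about 30,000 seconds of wall-clock time, i.e. roughly 50% more than the GPU time alone suggests.)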

Now, if I completely have this wrong about GPU time vs. 'real time' please jump in here and straighten me out!

Operator

____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33073 - Posted: 18 Sep 2013 | 20:02:03 UTC - in response to Message 33069.

Now, if I completely have this wrong about GPU time vs. 'real time' please jump in here and straighten me out!

You're certainly not wrong here... but looking at your tasks I actually see two different issues. With 8.03 you needed ~20 ks for the long runs.

Now you get some tasks which take ~23 ks and have many access violations and subsequent restarts. Not good, for sure. These use as much CPU time as GPU time.

Then there are the tasks taking 40-50 ks with lots of
"# BOINC suspending at user request (thread suspend)"
and the occasional access violation thrown in for fun. Here the GPU time is twice as high as the CPU time. These are the ones that really hurt, I think.

Maybe a stupid question, but just to make sure: do you have BOINC set to use 100% CPU time? And the CPU-load-based suspension disabled (the "suspend when non-BOINC CPU usage is above X%" setting at 0)? Do you run TThrottle? It's curious: which user is requesting this suspension?

So even if error checking was introduced with version 8.11, and there may have been hidden errors created when running the 8.03 app (I'm not sure how that follows logically though), the near doubling of the work unit completion times immediately upon initial usage of the 8.14 app is enough of a smoking gun that there is something amiss.

No doubt about something being wrong. The error I was speculating about is this: I don't know exactly how Matt's error detection works, but he certainly has to look for unusual or unwanted behaviour in the simulation. Now it could be that something fulfils his criteria which has been happening all along and is not an actual error, in the sense that the simulation can simply continue despite it.

It's just me speculating about a possibility, though, so don't spend too much time wondering about it. We can't do anything to research this.

Another wild guess: if some functionality added between 8.03 and 8.14 triggers the error, why not deactivate half of the additions (as far as possible) and run a bisection search through new betas? That could identify at least the offending functionality within a few days. If it's as simple as the temperature reporting, it could easily be removed for GK110 until nVidia fixes it.
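
(The bisection idea in a minimal Python sketch; the feature names and the error_appears callback, which would have to build a beta with only the given features enabled and report whether the error shows up, are hypothetical. It also assumes exactly one independently toggleable feature is responsible.)

def find_offending_feature(features, error_appears):
    # Repeatedly rebuild with only half of the remaining candidates
    # enabled; keep whichever half still reproduces the error.
    while len(features) > 1:
        half = features[:len(features) // 2]
        features = half if error_appears(half) else features[len(features) // 2:]
    return features[0]

# Hypothetical usage:
# find_offending_feature(["temperature_reporting", "error_detection",
#                         "boost_clock_query", "new_io_layer"],
#                        error_appears=run_beta_and_check)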

MrS
____________
Scanning for our furry friends since Jan 2002

5pot
Send message
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33078 - Posted: 18 Sep 2013 | 20:21:45 UTC

Correct, if you only have one WU and none downloaded. I'm saying, if you have one you're working on and one that's next in line, the time lost switching between tasks won't be nearly that large. Will it still affect real-time computation? Yes, but maybe by a couple of minutes.

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33092 - Posted: 19 Sep 2013 | 13:54:31 UTC - in response to Message 33043.

Operator,

Would a bootable Linux image be useful for you?
Was planning to put one together for the memory tester anyway.

Matt


Matt;

Yes it would!

Operator

____________

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33093 - Posted: 19 Sep 2013 | 14:13:10 UTC - in response to Message 33078.

Correct, if you only have one WU and none downloaded. I'm saying, if you have one you're working on and one that's next in line, the time lost switching between tasks won't be nearly that large. Will it still affect real-time computation? Yes, but maybe by a couple of minutes.


I agree. In part.

Most of the time, when one WU goes into the "Waiting to run" state, the next one in the queue resumes computation. But not always!

There have actually been times when all WUs were showing "Waiting to run" and absolutely nothing was happening. It doesn't happen often, I'll admit.

So I do have the system set to keep an additional 0.2 days' worth of work in the queue, and that does provide another WU for the system to start crunching when one is stopped for some reason.

But I was referring to a specific set of circumstances where I had only one Titan installed and only one WU downloaded (nothing waiting to start). That's the worst-case scenario.

When I calculated the 'gaps' between the stops and restarts (see the previous post about the 2:47:31), I was struck by the fact that most of the spans during which the app was not actively working (stopped and waiting to restart) were 2 minutes and 20 seconds long. Not every gap, but most of them, were precisely that length. Curious.
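
(Continuing the earlier log-parsing sketch: tallying the gap lengths would make such a dominant 2:20 pattern visible at a glance. The gaps list and the 10-second bucket width below are assumptions carried over from that sketch.)

from collections import Counter

def gap_histogram(gaps, bucket_seconds=10):
    # Round each gap down to a bucket and count occurrences; a spike
    # in the 140 s bucket would match the 2 min 20 s gaps.
    return Counter(int(g // bucket_seconds) * bucket_seconds for g in gaps)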

Operator

____________

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33290 - Posted: 30 Sep 2013 | 17:51:08 UTC

My troublesome GTX660 has now done a Noelia LR and a Santi LR, both without any interruption! So no "Terminating to avoid lock-up" and no "BOINC suspending at user request (exit)" (whatever that may be).
With SRs this happens at least once in most cases, even though I would expect an LR running for more than 12 hours to be more susceptible to interruption.
Could it be that the LRs are written differently than the SRs?

____________
Greetings from TJ

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33291 - Posted: 30 Sep 2013 | 18:22:23 UTC - in response to Message 33290.

I don't think a single successful completion with a Noelia long run proves much. My GTX 660s successfully completed three NOELIA_INS1P tasks before erroring out on the fourth. But a mere error is not a big deal; it was the slow run on a NATHAN_KIDKIXc22 that caused me the real problem.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33314 - Posted: 1 Oct 2013 | 21:07:24 UTC

Could be lower temperatures helping you, TJ (in this case it would be hardware-related).

MrS
____________
Scanning for our furry friends since Jan 2002

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33316 - Posted: 1 Oct 2013 | 21:23:39 UTC - in response to Message 33314.

Could be lower temperatures helping you, TJ (in this case it would be hardware-related).

MrS

Or throttling the GPU clock a little :)
However, that was a bit of joy too early: I again had a beta (Santi SR) with "The simulation has become unstable. Terminating to avoid lock-up" and downclocking overnight. Another beta ran without any interruption, so it's a bit random. I know now for sure that I don't like 660s.
____________
Greetings from TJ
