
Message boards : Graphics cards (GPUs) : *_pYEEI_* information and issues

Author Message
ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 14028 - Posted: 22 Dec 2009 | 10:46:02 UTC

Please use this thread to post any problem regarding all workunits tagged as *_pYEEI_*.

Thanks,
ignasi

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 14029 - Posted: 22 Dec 2009 | 10:49:26 UTC - in response to Message 14028.

I am already aware of very recent reports of *_reverse1_pYEEI_2112_* failing.
Let any new result die. They have been cancelled though.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,336,851
RAC: 8,787,904
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14030 - Posted: 22 Dec 2009 | 11:34:05 UTC - in response to Message 14029.

They have been cancelled though.

Thanks for doing that. It isn't too much of a problem when tasks crash after five seconds - it's much more annoying when they run for several hours first and then waste all that effort - but it is a bit wasteful of bandwidth when tasks take longer to download than they do to crunch!

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,336,851
RAC: 8,787,904
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14035 - Posted: 22 Dec 2009 | 22:27:13 UTC - in response to Message 14029.

They have been cancelled though.

I don't think you zapped them all. I'll let my copy of WU 1037889 get its five seconds of fame overnight, but I will be most surprised if it's still alive in the morning.

Profile Edboard
Avatar
Send message
Joined: 24 Sep 08
Posts: 72
Credit: 12,410,275
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 14039 - Posted: 23 Dec 2009 | 12:30:38 UTC
Last modified: 23 Dec 2009 | 12:32:22 UTC

I have received two more of them today (they errored as expected after 2-3 seconds). You can see them here and here.

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14041 - Posted: 23 Dec 2009 | 23:34:21 UTC - in response to Message 14039.
Last modified: 24 Dec 2009 | 0:10:33 UTC

I just got more and more, and now GPUGrid won't send me any new WUs ... I'll be back after a few days of Milkyway ... hopefully you will either have really deleted them, or they will have failed so many times by then that they are automagically taken out of the pool :-(
These are the 2312 series ... looks like the same error got put into the *replacement* batch?
____________
Thanks - Steve

vaio
Avatar
Send message
Joined: 6 Nov 09
Posts: 20
Credit: 10,781,505
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwat
Message 14043 - Posted: 24 Dec 2009 | 0:17:55 UTC

I came home today to a dead WU and a corrupted desktop.
This on a rock-steady rig with a GTS 250 at stock ... fine for >2 months until now.

Can't get work ... have it folding at the moment.

Never noted which WU it was ... had to reboot to sort out the problems.
The corrupted desktop went away with the reboot.
____________
Join Here
Team Forums

vaio
Avatar
Send message
Joined: 6 Nov 09
Posts: 20
Credit: 10,781,505
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwat
Message 14044 - Posted: 24 Dec 2009 | 0:22:10 UTC - in response to Message 14043.

Just had a look ... today's electricity bill contribution:

Task ID | WU ID | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
23 Dec 2009 13:53:20 UTC 23 Dec 2009 16:24:17 UTC Error while computing 7.32 6.92 0.04 --- Full-atom molecular dynamics v6.71 (cuda)
1661886 1042549 23 Dec 2009 13:49:32 UTC 23 Dec 2009 13:53:20 UTC Error while computing 5.38 4.84 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661884 1042547 23 Dec 2009 13:41:25 UTC 23 Dec 2009 13:45:50 UTC Error while computing 5.45 4.83 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661877 1042543 23 Dec 2009 13:24:37 UTC 23 Dec 2009 13:28:34 UTC Error while computing 5.35 4.94 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661868 1042537 23 Dec 2009 13:32:52 UTC 23 Dec 2009 13:36:48 UTC Error while computing 6.35 5.89 4,022.81 --- Full-atom molecular dynamics v6.71 (cuda)
1661862 1042532 23 Dec 2009 13:28:34 UTC 23 Dec 2009 13:32:52 UTC Error while computing 3.38 2.94 0.02 --- Full-atom molecular dynamics v6.71 (cuda)
1661803 1042484 23 Dec 2009 13:36:48 UTC 23 Dec 2009 13:41:25 UTC Error while computing 6.31 5.88 4,022.81 --- Full-atom molecular dynamics v6.71 (cuda)
1661790 1042473 23 Dec 2009 13:20:20 UTC 23 Dec 2009 13:24:37 UTC Error while computing 5.52 4.86 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661780 1042463 23 Dec 2009 13:45:50 UTC 23 Dec 2009 13:49:32 UTC Error while computing 5.34 4.83 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661734 1042434 23 Dec 2009 13:15:30 UTC 23 Dec 2009 13:20:20 UTC Error while computing 7.60 7.00 0.04 --- Full-atom molecular dynamics v6.71 (cuda)
1661728 1042430 23 Dec 2009 13:11:24 UTC 23 Dec 2009 13:15:30 UTC Error while computing 7.44 6.86 0.04 --- Full-atom molecular dynamics v6.71 (cuda)
1661471 1042269 23 Dec 2009 11:35:27 UTC 23 Dec 2009 13:11:24 UTC Error while computing 5,546.53 768.25 3,539.96 --- Full-atom molecular dynamics v6.71 (cuda)
1660795 1041835 23 Dec 2009 7:20:21 UTC 28 Dec 2009 7:20:21 UTC In progress --- --- --- --- Full-atom molecular dynamics v6.71 (cuda)
1660714 1030551 23 Dec 2009 6:01:18 UTC 23 Dec 2009 20:10:45 UTC Completed and validated 45,871.60 5,901.35 3,539.96 4,778.94 Full-atom molecular dynamics v6.71 (cuda)
1659183 1040817 22 Dec 2009 19:52:50 UTC 23 Dec 2009 11:41:26 UTC Completed and validated 56,499.53 4,107.08 4,531.91 6,118.08 Full-atom molecular dynamics v6.71 (cuda)
1658422 1035863 22 Dec 2009 15:30:32 UTC 23 Dec 2009 7:20:21 UTC Error while computing 4,779.29 644.83 3,539.96 --- Full-atom molecular dynamics v6.71 (cuda)
1655707 1038933 21 Dec 2009 23:47:15 UTC 23 Dec 2009 6:01:18 UTC Completed and validated 52,962.71 2,202.70 4,428.01 5,977.82 Full-atom molecular dynamics v6.71 (cuda)
1654210 1038221 22 Dec 2009 4:53:16 UTC 22 Dec 2009 19:55:27 UTC Completed and validated 53,802.30 2,652.13 4,428.01 5,977.82 Full-atom molecular dynamics v6.71 (cuda)
1653811 1037642 21 Dec 2009 13:08:13 UTC 21 Dec 2009 13:09:44 UTC Error while computing 4.61 0.16 0.00 --- Full-atom molecular dynamics v6.71 (cuda)

____________
Join Here
Team Forums

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 14045 - Posted: 24 Dec 2009 | 0:38:57 UTC - in response to Message 14044.
Last modified: 24 Dec 2009 | 0:39:10 UTC

Vaio,
what's your host?
gdf

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,336,851
RAC: 8,787,904
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14046 - Posted: 24 Dec 2009 | 0:51:24 UTC - in response to Message 14045.

Vaio,
what's your host?
gdf

It's host 55606

Although he's had a couple of pYEEIs, I reckon it was the IBUCH_TRYP at 13:11 which did the damage - corrupted the internal state of the card so badly that all subsequent tasks failed (even ones which are normally OK on G92). That would account for the corrupted desktop as well.

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 1,021,404
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 14047 - Posted: 24 Dec 2009 | 7:12:04 UTC - in response to Message 14046.

I had one fail today:

184-IBUCH_reverse1fix_pYEEI_2312-0-40-RND4766_4

I guess the "fix" in the name isn't quite there yet. :)

This WU has failed on 5 hosts so far.

WU: 1043207

stderr:

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 280"
# Clock rate: 1.35 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 30
# Number of cores: 240
ERROR: mdsim.cu, line 101: Failed to parse input file
called boinc_finish

</stderr_txt>
]]>

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 14048 - Posted: 24 Dec 2009 | 10:50:55 UTC - in response to Message 14047.

certainly not.

my apologies,
ignasi

canardo
Send message
Joined: 11 Feb 09
Posts: 4
Credit: 8,675,472
RAC: 0
Level
Ser
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 14052 - Posted: 24 Dec 2009 | 12:22:58 UTC - in response to Message 14028.

Hello Ignasi,
Just have a look here: comp id 26091.
Also, TONI's tasks fail, unfortunately at the end of the run.
Ciao,
Jaak
____________

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14053 - Posted: 24 Dec 2009 | 13:30:59 UTC

In general, do you test new configurations with the people who have opted in to "Run test applications"?
____________
Thanks - Steve

vaio
Avatar
Send message
Joined: 6 Nov 09
Posts: 20
Credit: 10,781,505
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwat
Message 14054 - Posted: 24 Dec 2009 | 16:46:53 UTC - in response to Message 14053.

I pulled down some new work today and it seems to be behaving so far.
Also, the downloading issue seems to have corrected itself ... whatever weighting I gave it, it would never give me more than one work unit at a time.

This morning it pulled two.
____________
Join Here
Team Forums

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 14070 - Posted: 29 Dec 2009 | 12:39:36 UTC - in response to Message 14052.

Hello Ignasi,
Just have a look here: comp id 26091.
Also, TONI's tasks fail, unfortunately at the end of the run.
Ciao,
Jaak


Please report this *HERG* fail on its thread as well, it will be helpful.

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14073 - Posted: 29 Dec 2009 | 21:38:36 UTC

Another quick fail batch ...
name: 333-IBUCH_reverse_pYEEI_2912-0-40-RND8124
application: Full-atom molecular dynamics
created: 29 Dec 2009 13:42:42 UTC
minimum quorum: 1
initial replication: 1
max # of error/total/success tasks: 5, 10, 6
Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
1683801 26061 29 Dec 2009 20:20:30 UTC 29 Dec 2009 20:21:53 UTC Error while computing 2.22 0.06 0.00 --- Full-atom molecular dynamics v6.71 (cuda23)
1685230 31780 29 Dec 2009 21:06:43 UTC 29 Dec 2009 21:08:06 UTC Error while computing 2.12 0.09 0.00 --- Full-atom molecular dynamics v6.71 (cuda23)
1685404 54778 29 Dec 2009 21:30:44 UTC 3 Jan 2010 21:30:44 UTC In progress --- --- --- --- Full-atom molecular dynamics v6.71 (cuda23)

____________
Thanks - Steve

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 1,021,404
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 14075 - Posted: 29 Dec 2009 | 22:37:36 UTC

Starting a little while ago, I've been receiving a bunch of these WUs -- and they're all getting compute errors within a few seconds of starting.

GPU is EVGA GTX280 (factory OC), CPU is C2Q Q6600 2.4 GHZ (no OC). Vista 32 bit.

You can follow my name link to get to the details on the computer and the WUs.

____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14076 - Posted: 29 Dec 2009 | 23:52:12 UTC - in response to Message 14075.

So now it looks like I have errored too many times and the server will not send me any work. See ya later ... I'm going back to Milkyway for another couple of days :-(
____________
Thanks - Steve

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 231
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14078 - Posted: 30 Dec 2009 | 1:06:12 UTC
Last modified: 30 Dec 2009 | 1:30:02 UTC

A couple of my GPUs have just choked on seven of these reverse_pYEEI WUs and are now idle. It begs the question why so many were sent out in the first place when they were suspect!
UPDATE: another four have taken out one more GPU :{

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 1,021,404
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 14079 - Posted: 30 Dec 2009 | 1:38:05 UTC - in response to Message 14078.

I *just* managed to squeak by. I had six of these error out, dropping my daily quota to 9. The next WU was the 9th of the day; fortunately, it's a different series and is crunching normally. If it had been another error I think this GPU would have been done for the day. (Unless it's still counting this as WUs per CPU core, in which case I had a lot of headroom.)


____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 231
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14080 - Posted: 30 Dec 2009 | 2:21:20 UTC
Last modified: 30 Dec 2009 | 2:57:17 UTC

UPDATE: Four more have trashed another GPU.
Aborted a boatload of these critters, yet still they come. It's like they are breeding!

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14081 - Posted: 30 Dec 2009 | 9:15:13 UTC

Can you PLEASE PLEASE PLEASE make sure WU batches are OK before sending them out.

Siegfried Niklas
Avatar
Send message
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwat
Message 14082 - Posted: 30 Dec 2009 | 9:54:46 UTC

GTX295 - Nine *_pYEEI_* WUs crashed in a row.

http://www.gpugrid.net/results.php?hostid=53295

"MDIO ERROR: syntax error in file "structure.psf", line number 1: failed to find PSF keyword
ERROR: mdioload.cu, line 172: Unable to read topology file"

No new work sent for 7.5 hours (recently got some new work).

Should I abort *_pYEEI_* on other GPUs (cache)?
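For context, the "failed to find PSF keyword" error means the first line of structure.psf does not begin with the literal "PSF" token that CHARMM/X-PLOR topology files are expected to start with. A minimal, illustrative check (Python; not the actual ACEMD parser) of what that test amounts to:

# Illustrative only: a CHARMM/X-PLOR PSF topology file should start
# with the token "PSF" on its first line.
def looks_like_psf(path="structure.psf"):
    with open(path) as f:
        first_tokens = f.readline().split()
    return bool(first_tokens) and first_tokens[0] == "PSF"

# A truncated or otherwise corrupted structure.psf fails this check
# immediately, which matches tasks that error out after a few seconds.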

hzels
Send message
Joined: 4 Sep 08
Posts: 7
Credit: 52,864,406
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14083 - Posted: 30 Dec 2009 | 11:11:14 UTC - in response to Message 14082.

last WUs all going down the drain:

<stderr_txt>
# Using CUDA device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 280"
# Clock rate: 1.55 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 260"
# Clock rate: 1.51 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: syntax error in file "structure.psf", line number 1: failed to find PSF keyword
ERROR: mdioload.cu, line 172: Unable to read topology file

called boinc_finish

</stderr_txt>

I'm over to Collatz for some days.

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 1,021,404
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 14085 - Posted: 30 Dec 2009 | 16:17:39 UTC

I just had another one of these fail:

1057058


____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14086 - Posted: 30 Dec 2009 | 16:29:08 UTC - in response to Message 14085.
Last modified: 30 Dec 2009 | 16:32:34 UTC

Had 2 fail in a few seconds on one system, 3 on another.
184-IBUCH_reverse_pYEEI_2912-0-40-RND6748 http://www.gpugrid.net/workunit.php?wuid=1056751
128-IBUCH_reverse_pYEEI_2912-0-40-RND3643 http://www.gpugrid.net/workunit.php?wuid=1056695
Also, could not get any tasks this morning between about 1am and noon, on the same system, but running a task now.

http://www.gpugrid.net/workunit.php?wuid=1056826
http://www.gpugrid.net/workunit.php?wuid=1056758

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14092 - Posted: 31 Dec 2009 | 0:36:13 UTC - in response to Message 14028.

Please use this thread to post any problem regarding all workunits tagged as *_pYEEI_*.

Thanks,
ignasi

As you can see (I hope), massive problems have been reported and many systems have been locked out of receiving new WUs (and are sitting idle) due to these faulty units. Don't you think it's about time to pull the rest? It looks like they're just being allowed to run until they fail so many times that the server cancels them. That's not showing any concern at all for the people who are doing your work.

I know they're not being canceled because I've received 22 of them so far today. Every one of those 22 had failed on several machines before being sent to me. That's just wrong.


Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14094 - Posted: 31 Dec 2009 | 14:09:24 UTC - in response to Message 14092.

In a way the _pYEEI_ tasks are SPAM!

I had to take extreme action yesterday - shut down my system for a couple of hours ;)

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 14107 - Posted: 3 Jan 2010 | 17:50:40 UTC - in response to Message 14094.

My most sincere apologies to everybody for all this.
I wanted to fill up the queue before going offline for some days, but obviously it didn't work as expected.

The balance between keeping crunchers supplied, not having an empty queue, and having a private life is always very sensitive to human error.

Sincerely,
ignasi

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 1,021,404
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 14109 - Posted: 3 Jan 2010 | 18:28:46 UTC - in response to Message 14107.

My most sincere apologies to everybody for all this.


No worries here; stuff happens. It's the nature of "free" distributed computing that there are going to be minor problems along the way.

Happy new year!

____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14111 - Posted: 3 Jan 2010 | 19:26:08 UTC - in response to Message 14109.

My most sincere apologies to everybody for all this.

Happy new year!

Thanks for letting us know what happened. Communication is appreciated.

Happy new year everyone!

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 231
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14112 - Posted: 3 Jan 2010 | 19:58:41 UTC - in response to Message 14107.


The balance between keeping crunchers supplied, not having an empty queue, and having a private life is always very sensitive to human error.

Sincerely,
ignasi


"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 14113 - Posted: 4 Jan 2010 | 9:33:01 UTC - in response to Message 14112.

"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)

I am afraid girlfriends are too jealous...

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14122 - Posted: 4 Jan 2010 | 23:52:38 UTC - in response to Message 14113.

"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)


I am afraid girlfriends are too jealous...


You have more than ONE!!! No wonder he can't get the WUs straight, he is sleep-deprived :-)

Keep up the good work, we'll crunch the best we can!
____________
Thanks - Steve

Profile [AF>Libristes>Jip] Elgran...
Avatar
Send message
Joined: 16 Jul 08
Posts: 45
Credit: 78,618,001
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14348 - Posted: 26 Jan 2010 | 14:27:46 UTC - in response to Message 14122.

Three compute errors (1, 2, 3) on this host.

AndyMM
Send message
Joined: 27 Jan 09
Posts: 4
Credit: 582,988,184
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14396 - Posted: 27 Jan 2010 | 10:10:20 UTC

Sorry, but going to say goodbye. The last 3 days have been non-stop computation errors, made even worse by the fact that the cards then just sat there doing nothing.

Switching all my GPUs to F@H. I do not accept having my money wasted on units processing for 17 hours and then showing a computing error.

GPUGRID Role account
Send message
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 14410 - Posted: 27 Jan 2010 | 13:56:28 UTC - in response to Message 14396.

Hi,

It's because you have been accepting beta work from us. If reliability of work is of paramount importance to you, don't track the beta application.

Matt

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14524 - Posted: 28 Jan 2010 | 0:30:51 UTC - in response to Message 14396.

Switching all my GPUs to F@H.

Your cards do a lot more work here than they can at F@H.

If the problem is Beta related you just need to turn the Betas off, as MJH said. It might also be that you need to restart the system; sometimes one failure can cause continuous failures (a runaway) until you restart. I say this because the problem was limited to your GTX 295, and not your GTX 275.
Many of your tasks seem to have been aborted by user - some immediately, and one after running for a long time:
286-IBUCH_esrever_pYEEI_0301-10-40-RND7408 - Aborted by user after 43,189.28 seconds.

Turn off Betas, restart and see how you get on.

AndyMM
Send message
Joined: 27 Jan 09
Posts: 4
Credit: 582,988,184
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14789 - Posted: 29 Jan 2010 | 14:08:33 UTC - in response to Message 14524.

Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta
I did see
"Run test applications?
This helps us develop applications, but may cause jobs to fail on your computer"

Which was already set to no.

Please advise: how do I turn off receiving beta work units?

Thanks

Andy

AndyMM
Send message
Joined: 27 Jan 09
Posts: 4
Credit: 582,988,184
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14793 - Posted: 29 Jan 2010 | 15:09:41 UTC - in response to Message 14789.

Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta
I did see
"Run test applications?
This helps us develop applications, but may cause jobs to fail on your computer"

Which was already set to no.

I have found the answer in another thread. So unless someone has switched the "Run Test Applications" off for me in the last 2 days I have never accepted Beta Applications.

I have re-attached a 275. I will leave that running for a few days. The 295s will stay on F@H for now; they run F@H (and used to run S@H) fine - it was only GPUGrid causing problems.

Also, FYI, it was me who aborted the work units after seeing this thread and relating it to the problems I had been having. After seeing work units process for hours and then show a computation error, I was not in the mood to waste any more time.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14799 - Posted: 29 Jan 2010 | 17:25:47 UTC - in response to Message 14793.

Other users cannot see whether you have Betas enabled or not; I just suggest you turn them off if you are having problems. There are many things that can cause errors, and we can only guess as we do not have all the info. I can't tell if your system has automatic updates turned on, or if your local BOINC client is set to use the GPU while the computer is in use. All I can do is suggest you disable automatic updates, as these force restarts and crash tasks, and turn off "Use GPU while computer is in use" if you watch any video on your system.

GL

AndyMM
Send message
Joined: 27 Jan 09
Posts: 4
Credit: 582,988,184
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 14823 - Posted: 30 Jan 2010 | 9:46:57 UTC - in response to Message 14799.

Thanks for the advice. The PCs are all part of a crunching farm I have, all headless and controlled by VNC.
Only 4 of them have 9-series or higher Nvidia cards suitable for GPUGrid. The rest are simple quads with built-in graphics running Rosetta and WCG.
Either way, I will leave a single 275 running on GPUGrid for now. The rest can stay on F@H.
Andy

Profile [AF>Libristes>Jip] Elgran...
Avatar
Send message
Joined: 16 Jul 08
Posts: 45
Credit: 78,618,001
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15015 - Posted: 5 Feb 2010 | 11:29:44 UTC

Compute error with a GTX295 GPU on this computer .

Profile X-Files 27
Avatar
Send message
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15647 - Posted: 8 Mar 2010 | 14:03:45 UTC

I got a weird WU (1949860): it errored out, but then reported a success?


# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.75 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.75 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Time per step: 59.524 ms
# Approximate elapsed time for entire WU: 37202.798 s
called boinc_finish

____________

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15648 - Posted: 8 Mar 2010 | 14:47:29 UTC - in response to Message 15647.

I've seen a recent handful of errors on my GTX295, and I know a team mate of mine has seen a few also. TONI WUs process fine (and I think they are more computationally intensive), so I think our OC is OK.

Are you seeing a higher failure rate on these WUs between last night and early this morning?
____________
Thanks - Steve

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 15774 - Posted: 16 Mar 2010 | 10:03:46 UTC - in response to Message 15648.

Still happening?

Could you post some of these failed results so I can double-check them?

thanks

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15775 - Posted: 16 Mar 2010 | 10:32:30 UTC - in response to Message 15774.

Still happening?

No. Everything looks good now :-)

____________
Thanks - Steve

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15776 - Posted: 16 Mar 2010 | 10:35:12 UTC - in response to Message 15774.

16-3-2010 10:40:54 GPUGRID Restarting task p34-IBUCH_chall_pYEEI_100301-15-40-RND6745_1 using acemd version 671
16-3-2010 10:40:55 GPUGRID Restarting task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 using acemd2 version 603
16-3-2010 10:58:32 GPUGRID Computation for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 finished
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_1 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_2 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_3 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Starting p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0
16-3-2010 10:58:34 GPUGRID Starting task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 using acemd2 version 603
16-3-2010 11:29:43 GPUGRID Computation for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 finished
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_1 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_2 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_3 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Starting a33-TONI_HERG79a-3-100-RND6672_0
16-3-2010 11:29:44 GPUGRID Starting task a33-TONI_HERG79a-3-100-RND6672_0 using acemd2 version 603

I am also using a GTX 295; both jobs failed after 45 min on device 1.
Yesterday 3 jobs out of 4 failed after almost 5 hours of processing.

I can use some help!!!!!!

See also "errors after 7 hours"

____________
Ton (ftpd) Netherlands

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15797 - Posted: 17 Mar 2010 | 14:18:07 UTC

Please HELP!!!

Today again 4 out of 5 jobs failed after more than 4 hours of processing!!

GTX 295 - Windows XP


____________
Ton (ftpd) Netherlands

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15809 - Posted: 18 Mar 2010 | 9:25:32 UTC

Again today, 6 out of 6 failed after 45 secs.

Windows XP - gtx 295 - driver 197.13

Also running: Windows XP - GTS 250 - driver 197.13 - no problems in about 10 hours.

Any ideas????
____________
Ton (ftpd) Netherlands

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 15810 - Posted: 18 Mar 2010 | 12:21:43 UTC - in response to Message 15809.
Last modified: 18 Mar 2010 | 12:24:51 UTC

@ftpd

I see all your errors, yes.
Your case is one of the hardest to debug. All the WUs you took were already started by somebody else, so it is not input file corruption - nor does that fit with the fact that they fail only after some execution time.
We also don't see any major failure attributable to the application alone, at least.
But what I observe in your case is that none of the other cards have such a high rate of failure with similar or even identical WUs and app version.

Have you considered that the source might be the card itself?
What brand is the card? Can you monitor temperature while running?
Is that your video output card?
Do you experience that sort of fails in other projects?

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15811 - Posted: 18 Mar 2010 | 12:31:59 UTC - in response to Message 15810.

Dear Ignasi,

Since yesterday I have also had problems with Milkyway WUs on the same card.

The temp is OK - about 65 C.

When processing with the 6.71 cuda app there are no problems - only with acemd2?

This computer works 24/7, and nothing else is using the card (except the monitor).

The card is 6 months old.

Regards,

Ton

PS Now processing device 1 = collatz and device 0 = gpugrid
____________
Ton (ftpd) Netherlands

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15813 - Posted: 18 Mar 2010 | 13:09:10 UTC - in response to Message 15811.

You may want to consider returning it to the seller or manufacturer (RTM) if it is under warranty. If you have tried it in more than one system with the same result, I think it is fair to say the issue is with the card. As you are now getting errors with other projects and the error rate is rising, the card might actually fail soon.

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15822 - Posted: 19 Mar 2010 | 0:48:15 UTC

looks like this WU is bad ...
p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5
I will be starting it in a few hours so we'll see if the string of errors continues for this WU.

____________
Thanks - Steve

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15826 - Posted: 19 Mar 2010 | 9:27:31 UTC

Last night same machine GTX 295 3 out of 4 were OK!!!!!!!!!!

____________
Ton (ftpd) Netherlands

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 15830 - Posted: 19 Mar 2010 | 10:02:08 UTC - in response to Message 15822.

looks like this WU is bad ...
p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5
I will be starting it in a few hours so we'll see if the string of errors continues for this WU.


Thanks Snow Crash,
certainly some WUs seem to be condemned to die. We have been discussing that internally, and it can either be that by chance a result gets corrupted when saved/uploaded/etc., or that particular cards are corrupting results from time to time.

Anyways, please let us know if you detect any pattern of failure regarding 'condemned WUs'.

cheers,
i

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15875 - Posted: 21 Mar 2010 | 13:09:48 UTC

You guys do such a good job that I have not seen another "one-off" WU error.

I just finished my first *long* WU and it took precisely twice as long as the previous pYEEI WUs, which I bet is exactly what you planned. Excellent work.

Can you tell us anything about the number of atoms and how much time these WUs model?
____________
Thanks - Steve

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 15902 - Posted: 22 Mar 2010 | 10:54:27 UTC - in response to Message 15875.
Last modified: 22 Mar 2010 | 10:55:07 UTC

Can you tell us anything about the number of atoms and how much time these WUs model?


Sure.
These *long* WUs are exactly twice as long as the previous ones with similar names. They model exactly 5 nanoseconds (ns) of ~36,000-atom systems (*pYEEI* & *PQpYEEIPI*). In these systems we have a protein (good old friend, the SH2 domain) and a ligand (phosphoTYR-GLU-GLU-ILE and PRO-GLN-phosphoTYR-GLU-GLU-ILE-PRO-ILE // amino acids) for which we are computing 'free energies of binding' - basically the strength of their interaction.
We wanted to increase the size for one main reason. Our 'optimal' simulation time for analysis is currently no shorter than 5 ns. That means our waiting time is made up of a normal WU (2.5 ns) + queuing + a normal WU (2.5 ns), and this, times 50, which is the number of WUs for one of these experiments.
As you may see, the time-to-answer can vary greatly. With WUs twice as long, we omit the queuing time in the middle. And with a faster application it shouldn't be much of a hassle.

However, it is still a test. We want to have your feedback on them.

thanks,
ignasi

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15908 - Posted: 22 Mar 2010 | 12:24:22 UTC - in response to Message 15902.

Looking at 9 hours of processing on a current, state-of-the-art GPU to return 5 ns worth of simulation puts into perspective just how important it is for all of us to pull together. I've read some of the papers you guys have published - not that I can follow any of it - but I always knew you were working at the atomic level (seriously cool, you rock!).

Also, knowing that with the normal size you ultimately need to put together 100 WUs back to back before you have anything that even makes sense to start looking at highlights why we need to turn these WUs around as quickly as possible. Best case scenario you don't get a finished run for more than 3 months ... and that's best case. I imagine it is more common to have to wait at least 4 months.

Running stable with a small cache will reduce the overall runtime of an experiment much more than any one GPU running fast. So, everyone ... get those cards stable, turn your cache down low, and keep on crunching!
____________
Thanks - Steve
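To put those turnaround numbers in rough perspective, here is a back-of-envelope sketch (Python, illustrative only) using figures mentioned in this thread: 5 ns and roughly 9 hours of crunching per *long* WU, and ignasi's figure of ~50 WUs per experiment. The 1.5-2 days of wall-clock time per step (download, host cache, crunch, upload, re-issue) is an assumption, not a measured value.

# Back-of-envelope turnaround estimate; all numbers are illustrative.
ns_per_long_wu = 5.0          # ns simulated per "long" WU (ignasi's post)
hours_per_long_wu = 9.0       # reported crunch time on a fast 2010-era GPU
wus_per_experiment = 50       # sequential WUs per experiment (ignasi's post)

pure_gpu_days = wus_per_experiment * hours_per_long_wu / 24.0
print(f"Pure GPU time per experiment: ~{pure_gpu_days:.0f} days")

# Each step also waits on download, host cache, upload and re-issue,
# so assume 1.5-2 days of wall-clock time per step:
for days_per_step in (1.5, 2.0):
    months = wus_per_experiment * days_per_step / 30.0
    print(f"At {days_per_step} days per step: ~{months:.1f} months per experiment")

The resulting ~2.5-3.3 months is broadly consistent with the 3-4 month estimates above, and shows why quick turnaround matters more than any single fast GPU.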

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15913 - Posted: 22 Mar 2010 | 14:19:33 UTC

Today 6 out of 6 failed after a few hours of processing! Also the long 6.71 WU.

What can we do about it???
____________
Ton (ftpd) Netherlands

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15914 - Posted: 22 Mar 2010 | 14:36:41 UTC - in response to Message 15913.

1. Are you connecting to this machine remotely?
2. Are you crunching anything else on this machine?
3. Can you suspend one of the WUs that are currently running to see if the other one will finish properly.
____________
Thanks - Steve

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15915 - Posted: 22 Mar 2010 | 14:43:40 UTC - in response to Message 15913.
Last modified: 22 Mar 2010 | 14:45:09 UTC

Your GTX295 is getting about 7K points per day on average. On that system it should be getting about 49K! It must be particularly annoying to have 4 tasks all fail after going more than 50% through a task; one task must have been about 20min from finishing!

RTM it, try it in a different system, or edit your config file to run only 1 task at a time on your GTX295 (28,500 would be a good bit better than 7,000, if that worked), or try Snow Crash's suggestion - suspend one task and let the other finish before beginning the second (you need to select "no new tasks" before starting the second task).

By the way, one of the tasks that failed on your GTX295 also failed for me on a card that very rarely fails, and also failed for someone else. So it is possible that that particular task was problematic.

At least your new GTS250 is running well!
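For reference, on BOINC clients of that era the usual way to "run only 1 task at a time" on a dual-GPU GTX 295 was to tell the client to ignore one of the card's two CUDA devices in cc_config.xml. A sketch of what that might look like (option names vary between client versions, so treat this as illustrative rather than definitive):

<cc_config>
  <options>
    <!-- Ignore the second CUDA device of the GTX 295 so only one
         GPU task runs at a time; remove once the card is stable. -->
    <ignore_cuda_dev>1</ignore_cuda_dev>
  </options>
</cc_config>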

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15921 - Posted: 22 Mar 2010 | 17:41:13 UTC - in response to Message 15914.

I am not connected remotely. It is my office machine, and it was crunching over the weekend.

It is also crunching Milkyway, Collatz and SETI - all CUDA GPU jobs.

I also run 1 GPUGrid job and 1 job from SETI or another project.

I also have a GTX 260 and a GTS 250 (in other machines) - no problems with those cards.


____________
Ton (ftpd) Netherlands

ftpd
Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15974 - Posted: 25 Mar 2010 | 12:54:52 UTC

25-3-2010 13:12:57 GPUGRID Computation for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 finished
25-3-2010 13:12:57 GPUGRID Output file p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0_1 for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 absent
25-3-2010 13:12:57 GPUGRID Output file p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0_2 for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 absent
25-3-2010 13:12:57 GPUGRID Output file p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0_3 for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 absent
25-3-2010 13:12:57 GPUGRID Starting p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0
25-3-2010 13:12:58 GPUGRID Starting task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 using acemd version 671
25-3-2010 13:13:34 GPUGRID Computation for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 finished
25-3-2010 13:13:34 GPUGRID Output file p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0_1 for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 absent
25-3-2010 13:13:34 GPUGRID Output file p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0_2 for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 absent
25-3-2010 13:13:34 GPUGRID Output file p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0_3 for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 absent

1 job failed after 3 hours 12 minutes and 1 job failed after 22 secs.

Any reasons?

Yesterday 4 jobs - all OK!!!
____________
Ton (ftpd) Netherlands

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 15975 - Posted: 25 Mar 2010 | 15:04:24 UTC - in response to Message 15974.

I was very stable until I started running both versions of the app. Then I started to get failures on the old 6.71, which made my system unstable, and the new 6.03 version would then start to crash as well. I would restart my computer, a couple of 6.03s would run and all was good - until I ran a 6.71, it errored, and my system became unstable again.

Last night in BOINC Manager I told it "No New Tasks" for GPUGrid.
Then I went to my GPUGrid preferences here on the website and told it to only send me ACEMD 2 (this is the new app version and is much faster).
Back in BOINC Manager I "Reset" GPUGrid.
Then I told it to accept new work.

So far everything looks good with no errors. I have a vague suspicion that one of the DLLs distributed with the apps is different but is not being replaced, and that is what causes problems on otherwise stable machines.

____________
Thanks - Steve

ignasi
Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 16041 - Posted: 29 Mar 2010 | 9:03:59 UTC - in response to Message 15975.

Actually, we shouldn't be distributing the old app anymore.
Some WUs sent last week did still use the old app, but that was a mistake.
In principle all new WUs are going to come with the new app.

Let's see whether that puts an end to the weird failures.

cheers,
i
