Message boards : Graphics cards (GPUs) : KASHIF_??? workunits fixed

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Level
Gly
Message 10003 - Posted: 20 May 2009 | 14:45:24 UTC

The newly submitted workunits called KASHIF_??? should now work even on G90 cards. The large KASHIF_dim workunits have been cut to half their length, and the data upload has been reduced to a quarter.

There may still be old workunits around with the same name; you can check the creation date on the web site.
The changes were applied at:
20 May 16:44 CEST
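
To tell old and new workunits apart by creation date, something like the following could be used. It is only a minimal sketch: the function name `created_after_fix` is made up for illustration, and the timestamp format is an assumption (a UTC string such as "20 May 2009 15:10:00 UTC"), so adjust it to whatever the web site actually shows.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; the cutoff is the time quoted in the post above.
CEST = timezone(timedelta(hours=2), "CEST")
FIX_APPLIED = datetime(2009, 5, 20, 16, 44, tzinfo=CEST)


def created_after_fix(created_utc: str) -> bool:
    """Return True if a workunit creation time (assumed UTC) is at or after the fix."""
    # Assumed format, e.g. "20 May 2009 15:10:00 UTC" (hypothetical, not confirmed).
    stamp = created_utc.replace(" UTC", "").strip()
    created = datetime.strptime(stamp, "%d %b %Y %H:%M:%S")
    return created.replace(tzinfo=timezone.utc) >= FIX_APPLIED
```

A workunit for which this returns False was created before the change and may still be one of the old ones.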

Hope it works.

gdf

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 10006 - Posted: 20 May 2009 | 15:43:52 UTC - in response to Message 10003.

For the Devs


Regards
Zy

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10016 - Posted: 21 May 2009 | 9:35:58 UTC
Last modified: 21 May 2009 | 10:02:06 UTC

That's music to my ears :)
Can you tell us a little about the problem and its fix?

EDIT:

GDF wrote:
So, it seems that there is a bug in the compiler/hardware which appears only on pre G200 cards.
We found a way to avoid it for now, but it limits what we can do, so it is not a solution.

Seems like the ball is in nVidia's court now.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile GDF
Message 10050 - Posted: 21 May 2009 | 20:54:26 UTC - in response to Message 10016.

A bug in one of the CUDA FFT routines.

gdf

Profile Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10056 - Posted: 22 May 2009 | 0:20:33 UTC - in response to Message 10050.
Last modified: 22 May 2009 | 0:27:46 UTC

I just had a couple of these:
"Cuda error in file '..\cuda/cutil.h' in line 968 : out of memory.
Memory usage: host: bytes device: bytes
Assertion failed: 0, file ..\cuda/cutil.h, line 968

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
"

Is that similar to the error we're talking about?
WUs: 482275 and 482302 (IBUCHs). I notice these all ran on GPU1, which may indicate a local problem.

Profile GDF
Message 10062 - Posted: 22 May 2009 | 7:18:24 UTC - in response to Message 10056.

Some time ago this was an Nvidia driver problem, which was sorted out with the latest drivers.

gdf

Profile GDF
Message 10093 - Posted: 23 May 2009 | 16:50:21 UTC - in response to Message 10062.

Nvidia is looking into the bug.

gdf

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 10157 - Posted: 25 May 2009 | 12:19:32 UTC

I had this WU die. It was run on a GTS250 (G92 chipset, I believe) using 185.85 drivers. It will get reported later tonight, so I'm not sure what the actual error is until then.
____________
BOINC blog

SkyeHunter
Joined: 7 Mar 09
Posts: 12
Credit: 1,254,285
RAC: 0
Level
Ala
Message 10158 - Posted: 25 May 2009 | 12:43:58 UTC - in response to Message 10157.

Had a few tasks crashing with a similar error message. The latest was a KASHIF one.

# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"

The card has been running overclocked but fairly cool. This should be safe, but then again, overclocking never is. Hot spring weather may play a role (hot attic)...

ExtraTerrestrial Apes
Message 10171 - Posted: 25 May 2009 | 21:24:56 UTC - in response to Message 10158.

Actually your error message is "Incorrect function. (0x1) - exit code 1 (0x1)", which is quite a generic one. It's not "the nasty bug" and might be related to OC and temperature.

MrS
____________
Scanning for our furry friends since Jan 2002

MarkJ
Message 10173 - Posted: 25 May 2009 | 21:38:00 UTC - in response to Message 10157.

I had this WU die. It was run on a GTS250 (G92 chipset, I believe) using 185.85 drivers. It will get reported later tonight, so I'm not sure what the actual error is until then.


Turns out it's the CUDA fft_data_swizzle_in error. So they don't appear to work on GTS250s with the 185.85 drivers.
____________
BOINC blog

SkyeHunter
Message 10180 - Posted: 26 May 2009 | 8:36:06 UTC - in response to Message 10171.

It's not "the nasty bug" and might be related to OC and temperature.

MrS


OK, will throttle back the GPU to half the OC. The core ran at about 65°C on hot days (high 50s during the night). I suspect it will be the memory chips, but to be safe I'll throttle down the CPU likewise.

Profile Zydor
Message 10190 - Posted: 26 May 2009 | 13:39:18 UTC - in response to Message 10171.
Last modified: 26 May 2009 | 13:40:14 UTC

Had a really strange one, and with the new WUs.

I've "lost" one (!)

The sequence is below, copying the key parts of the BOINC Manager messages:
25/05/2009 22:37:15 GPUGRID Computation for task p730000-IBUCH_pYEpYVk1_2105-3-10-RND7622_0 finished
25/05/2009 22:37:15 GPUGRID Starting 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0
25/05/2009 22:37:16 GPUGRID Starting task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 using acemd version 664

So far so good .... the RND7622 message correlates with my task list.

Here comes the next one, getting ready for completion of RND3111, downloaded automatically (cache set to 0.1)

26/05/2009 10:09:30 GPUGRID Sending scheduler request: To fetch work.
26/05/2009 10:09:30 GPUGRID Requesting new tasks
26/05/2009 10:09:35 GPUGRID Scheduler request completed: got 1 new tasks
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-LICENSE
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-COPYRIGHT
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-p1480000-IBUCH_pYEpYIk1_2105-2-10-RND5345_1

Still good, this correlates with the Task list... bear with me...

26/05/2009 12:13:53 climateprediction.net Scheduler request completed: got 0 new tasks
26/05/2009 12:14:43 GPUGRID Computation for task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 finished
26/05/2009 12:14:48 GPUGRID Starting p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0
26/05/2009 12:14:48 GPUGRID Starting task p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0 using acemd version 664
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
26/05/2009 12:14:57 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:15:32 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:28:49 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4

Kashif RND3111 has now finished and uploaded, and RND5345 has now started. All good; the latter is still crunching away.

The problem is that the Kashif has disappeared from sight: there is no record of it being downloaded as a task in the first place, nor of it being uploaded when it finished. There is nothing in my Task list at all. If I hadn't seen it coming through at this end, and had "blinked", it would have come and gone without me knowing. Something stopped it being recorded as issued, and something stopped it being recorded in the Task list as complete. I suspect the credit side went wonky as well, but the key issue is that a WU which was "never issued" was crunched and returned, yet according to the Task list it never existed and was never returned.

I have no doubt it lurks on the server somewhere right now, and server side all is probably normal, but it's not normal at this end. It was crunched and uploaded, and the thought occurred that since these are "new" WUs, maybe an unknown bug lurks. I don't know, but it's weird enough to report.
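
For anyone wanting to cross-check this kind of thing, here is a minimal sketch that pulls the per-task events out of BOINC Manager message lines like the ones quoted above. The line layout is assumed purely from those quoted messages, and the names `summarize`, `LINE_RE` and friends are made up for illustration; this is not anything the BOINC client itself provides.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the messages quoted above:
#   DD/MM/YYYY HH:MM:SS PROJECT message text
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")
STARTED_RE = re.compile(r"^Starting task (\S+) using")
FINISHED_RE = re.compile(r"^Computation for task (\S+) finished")
# Upload file names look like "<task name>_<file index>" in the quoted log.
UPLOADED_RE = re.compile(r"^Finished upload of (\S+)_(\d+)$")


def summarize(lines, project="GPUGRID"):
    """Collect started/finished timestamps and uploaded file indices per task."""
    tasks = defaultdict(
        lambda: {"started": None, "finished": None, "uploads": set()}
    )
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group(2) != project:
            continue  # skip other projects and lines that do not fit the shape
        stamp, text = match.group(1), match.group(3)
        started = STARTED_RE.match(text)
        finished = FINISHED_RE.match(text)
        uploaded = UPLOADED_RE.match(text)
        if started:
            tasks[started.group(1)]["started"] = stamp
        elif finished:
            tasks[finished.group(1)]["finished"] = stamp
        elif uploaded:
            tasks[uploaded.group(1)]["uploads"].add(int(uploaded.group(2)))
    return dict(tasks)
```

Any task that shows up here as started, finished and uploaded, but is missing from the Task list on the web site, would be one worth reporting.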

If that makes sense rofl:)

Looks like Hollywood released Gremlins 5, we were a secret Alpha for the pesky critters, and they ate my WU :)

Regards
Zy

ExtraTerrestrial Apes
Message 10214 - Posted: 26 May 2009 | 21:38:30 UTC - in response to Message 10180.

I suspect it will be the memory chips


Memory almost never fails due to higher temperatures, unless pushed really hard. That's because, in contrast to CPU and GPU, the memory frequency is not limited by temperature to begin with.

MrS
____________
Scanning for our furry friends since Jan 2002
