Message boards : Number crunching : Compute errors

Profile Bender10
Joined: 3 Dec 07
Posts: 167
Credit: 8,368,897
RAC: 0
Message 9242 - Posted: 3 May 2009 | 10:16:34 UTC

I've noticed more 'compute errors' in the last week, on a daily basis, on these types of WUs:

HIVPR
GIANNI_pYIpYVkp01
IBUCH

Running Linux64
boinc 6.4.5
177.80 driver
____________


Consciousness: That annoying time between naps......

Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9253 - Posted: 3 May 2009 | 12:38:46 UTC - in response to Message 9242.

Your tasks show the message:

<message>
process got signal 11
</message>

Does anyone know what this means?

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9272 - Posted: 3 May 2009 | 22:18:57 UTC - in response to Message 9253.
Last modified: 3 May 2009 | 22:28:16 UTC

It's a general system-wide error message, not BOINC specific; you can get it using databases (MySQL etc.), for example. It's hard to know what the absolute cause is because it's such a general error message. It could be that you hit a bug, or it's possible that a binary or one of the libraries it was linked to is corrupted, badly built, or misconfigured. It could have lost its link to the Project because of that, and fallen over. The error can also be caused by failing hardware. In general terms it's indicating that some link information needed to continue has been lost, and that can happen for a multitude of reasons.

In other threads I have seen on this over time, the general consensus has been that, after all the usual sensible checks for "standard" problems, if the "signal 11" error keeps repeating, you should run the cache dry and do a reset and, as a last resort, detach / reattach, which generates new files and so gets past the corruption if it was software based. Clearly that will not resolve it if it is hardware based; however, it seems by and large to surface as a software problem.

It's a sledgehammer-to-crack-a-nut approach; however, I have seen no other solution to this issue when it's surfaced elsewhere.

A real hairy example of what's behind this (UNIX, but this applies to other systems) is at signal 11 core explanations

Personally I have no idea what half that link means; however, someone out there might, and could enlighten us all :)

Regards
Zy

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9280 - Posted: 3 May 2009 | 22:51:05 UTC

Memory segmentation error. This may be caused by bad compiles, bad memory, kernel panics, and, most commonly, overclocking (OC), if the errors are randomly distributed ...
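
For anyone who wants to check what "signal 11" maps to on their own box, here is a minimal sketch using Python's standard `signal` module. The helper name `signal_name` is made up for illustration; on Linux the number 11 is expected to come back as SIGSEGV, the segmentation fault signal.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number, or "unknown" if undefined here."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The platform does not define a signal with this number.
        return "unknown"


# Look up the number reported in the task's <message> block.
print(signal_name(11))
```

This only names the signal; it says nothing about why the process received it.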

If it is a bad compile all tasks will fail at the same point on all systems everywhere.

Bad memory will usually be clustered in the same arena but will affect all projects ...

Kernel panics are rare, but caused by OS errors...

Overclocking ... well, enough said.
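
The rules of thumb above can be written down as a rough sketch. Everything here is an assumption for illustration: the `Failure` record, its fields, and the `likely_cause` function are hypothetical and not part of BOINC, and the thresholds are the crudest possible reading of the post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    host: str      # hypothetical: which machine the task failed on
    project: str   # hypothetical: which BOINC project the task belonged to
    step: int      # hypothetical: point in the task at which it failed


def likely_cause(failures: List[Failure]) -> str:
    """Guess a cause from the pattern of failures, following the rules of thumb."""
    if not failures:
        return "no data"
    hosts = {f.host for f in failures}
    projects = {f.project for f in failures}
    steps = {f.step for f in failures}
    # Same failure point on several different systems points at the build.
    if len(hosts) > 1 and len(steps) == 1:
        return "bad compile"
    # One machine failing across several projects points at its memory.
    if len(hosts) == 1 and len(projects) > 1:
        return "bad memory"
    # Anything else looks random, which is where overclocking is suspected.
    return "overclocking or other"


print(likely_cause([Failure("box1", "gpugrid", 5), Failure("box2", "gpugrid", 5)]))
```

Treat the result as a hint for where to look first, not a diagnosis.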

Profile Bender10
Joined: 3 Dec 07
Posts: 167
Credit: 8,368,897
RAC: 0
Message 9285 - Posted: 4 May 2009 | 1:41:58 UTC - in response to Message 9272.

I had looked around a bit before posting, and I found something similar to what you mention, Zydor. But fresh eyes on an issue sometimes help.

I spent some time looking while I was letting the cache run down. The solution came down to detach, memory issue or lib32 issues, not sure about the last one.....

So, I detached, and will see what happens...and keep the sledge hammer next to the bench...just in case..;)



Martin Chartrand
Joined: 4 Apr 09
Posts: 13
Credit: 17,030,367
RAC: 0
Message 9402 - Posted: 6 May 2009 | 21:49:46 UTC - in response to Message 9285.

I had a lot of errors initially as well.
I found out that when the wife switched to her side of the computer I would immediately get a "compute error".
So I changed the BOINC settings to allow work to be done ONLY on my side, and I disabled the "do GPU while computer is in use" option, and I did not get any other errors.
I haven't been crunching GPUGRID lately (waiting for a server), but give this a try.

Martin

Profile Bender10
Joined: 3 Dec 07
Posts: 167
Credit: 8,368,897
RAC: 0
Message 9408 - Posted: 7 May 2009 | 0:09:51 UTC - in response to Message 9402.

Thanks for the tip. I actually asked my wife if she had been trying to use that computer for something...Not that she would be using it, but I had to ask. Besides, it's in the dungeon, surrounded by barbed wire, claymores and a pair of guard wraiths...

It is a 24/7 cruncher. And is not used for much except to check the status of wu's, forum and email in the evening.

The detach trick did not work. So I am shutting it down until this weekend. I'll try a driver update and see what that does.

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9411 - Posted: 7 May 2009 | 1:30:56 UTC - in response to Message 9408.

If you remote into it that can also cause problems. Remote desktop and user switch ... bad moon rising ...

Profile Aardvark
Joined: 27 Nov 08
Posts: 28
Credit: 82,362,324
RAC: 0
Message 9425 - Posted: 7 May 2009 | 10:06:27 UTC

I've been getting compute errors on files of type 182-IBUCH_KID_shao_ba1-1-100-RND0470_1 only. This has happened on two separate machines: one with 32 bit Vista, 8800GT (O/C), client 6.6.20 & 182.50 driver; the other with 64 bit Vista, 9800 GX2 (not O/C), client 6.6.20 & 182.50 driver. Any suggestions?

Profile Bender10
Joined: 3 Dec 07
Posts: 167
Credit: 8,368,897
RAC: 0
Message 9427 - Posted: 7 May 2009 | 10:29:53 UTC - in response to Message 9425.

I powered up my xp64 box to take over for the buggy Linux64 machine. Same GPU and no problems.
