Message boards : Number crunching : Process exited with code 159 (0x9f, -97)

Profile microchip
Joined: 4 Sep 11
Posts: 110
Credit: 326,102,587
RAC: 0
Level
Asp
Message 34116 - Posted: 4 Dec 2013 | 16:11:31 UTC

This is often what I get while crunching SANTI short run tasks. I can complete one task, and the following two error out.

Honestly, GPUGRID is the only project with such a high failure rate. I don't know what's wrong, but these tasks are very buggy
____________

Team Belgium

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Message 34120 - Posted: 4 Dec 2013 | 19:08:15 UTC - in response to Message 34116.

159 is one of the codes in the list of project error codes in the FAQ section.

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile microchip
Message 34122 - Posted: 4 Dec 2013 | 20:14:57 UTC - in response to Message 34120.

Dagorath wrote: "159 is one of the codes listed in the list of project error codes from the FAQ section."


Thanks,

According to that list, the code corresponds to -97, which indicates a hardware issue. However, Milkyway, which loads the GPU as heavily as GPUGRID on my card and therefore produces the same temperature (77°C), runs just fine, as does Einstein.

Anyway, I added a big fan blowing towards the GPU. I have 2 GPUs in my system. The case is fully open on the GPU side, but the cards are fairly close to each other, so I think the faster GPU (a GTX 560) can't draw in enough air to cool down further. With the extra fan (120mm) blowing, the temperature dropped from 77°C to 68°C.

I will try to crunch a few GPUGRID WUs to see how it goes. Thanks for referring me to the codes. I overlooked that thread and honestly forgot about it.
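
For reference, the three numbers in the thread title (159, 0x9f, -97) are the same exit status read in different ways: 159 is the unsigned 8-bit value, 0x9f is that value in hex, and -97 is the same bit pattern read as a signed number. A minimal Python sketch, assuming the usual two's-complement convention; the helper name `as_signed` is made up for illustration:

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret an unsigned integer as a two's-complement signed value."""
    mask = (1 << bits) - 1
    value &= mask                      # keep only the low `bits` bits
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


print(hex(159))                   # 0x9f
print(as_signed(159, 8))          # -97
print(as_signed(0xFFFFFF9F, 32))  # -97 (same status shown as a 32-bit value)
print(-97 & 0xFF)                 # 159
```

So a report of "159" and a report of "-97" point at the same entry in the error code list.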
____________

Team Belgium

Dagorath
Message 34128 - Posted: 5 Dec 2013 | 3:10:03 UTC - in response to Message 34122.

I think it should run OK at 77 C. If the temperature decrease doesn't fix the problem then maybe it's clocked too high?

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile microchip
Message 34129 - Posted: 5 Dec 2013 | 11:35:53 UTC - in response to Message 34128.

Dagorath wrote: "I think it should run OK at 77 C. If the temperature decrease doesn't fix the problem then maybe it's clocked too high?"


I ran two more tasks. One succeeded and one failed. The temperature was around 68°C. The GPU runs at stock speeds; I haven't done anything to it. I also ran a memory test on the GPU, and after 50 iterations everything was OK.

No idea what's happening. Some tasks complete, others fail.
____________

Team Belgium

Dagorath
Message 34130 - Posted: 5 Dec 2013 | 20:11:52 UTC - in response to Message 34129.

Going through the tasks that have failed on your system, I notice that the second iterations are almost always successful on 5xx or newer GPUs, which leads me to believe the tasks themselves are not buggy. The second iterations seem to fail on 4xx GPUs. I must admit my sample is small, but it makes me wonder whether older GPUs have the required compute capability for SANTI tasks.
____________
BOINC <<--- credit whores, pedants, alien hunters

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 34131 - Posted: 5 Dec 2013 | 20:38:01 UTC

Which driver version are you running? It seems to be fairly new, since you're getting CUDA 5.5 tasks, but Linux doesn't show us anything else.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile microchip
Message 34142 - Posted: 6 Dec 2013 | 18:24:27 UTC

I run GPUGRID only on the GTX 560. The GT 440 is used for other projects (Milkyway/Einstein). The driver version is the latest stable, 331.20.

I've been able to crunch 5 tasks in a row without a single failure so far. I'm crunching my 6th task now and will see whether it succeeds.
____________

Team Belgium

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34149 - Posted: 7 Dec 2013 | 11:24:10 UTC - in response to Message 34142.
Last modified: 7 Dec 2013 | 11:25:38 UTC

Even with the 304.88 Linux drivers (which do not support CUDA 5.5) you still get CUDA 5.5 tasks; tasks are not strictly allocated by driver version. The tasks run and complete as normal.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile microchip
Message 34167 - Posted: 9 Dec 2013 | 7:15:53 UTC

I think the problem is fixed. I've crunched 12 short WUs in a row, without a single failure. That extra fan I added seems to have helped :)
____________

Team Belgium

Dagorath
Message 34168 - Posted: 9 Dec 2013 | 8:38:27 UTC - in response to Message 34167.

That's good to hear but I'm surprised it produced errors at 77 C. This spec sheet from NVIDIA claims the max. temperature is 98 C, not that one would ever want to run it that hot, but it seems like 77 C should be low enough to give error free results. So I wonder what software you use to check the temperature and does it give the temperature of both cards? Is it possible you reported the temperature of the other card which was getting better air flow and likely running cooler?

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile microchip
Message 34170 - Posted: 9 Dec 2013 | 11:40:15 UTC - in response to Message 34168.

Dagorath wrote: "That's good to hear but I'm surprised it produced errors at 77 C. This spec sheet from NVIDIA claims the max. temperature is 98 C, not that one would ever want to run it that hot, but it seems like 77 C should be low enough to give error free results. So I wonder what software you use to check the temperature and does it give the temperature of both cards? Is it possible you reported the temperature of the other card which was getting better air flow and likely running cooler?"


I use the NV control center to check temps. Without the extra fan, the GTX 560 was between 77 and 79°C when running GPUGRID. The GT 440, which doesn't run GPUGRID but Milkyway/Einstein, hovers around 60°C at full load. The GTX 560 is the first card and the GT 440 sits directly under it, leaving only a narrow passage between them. My theory is that the GTX 560 gets hotter because of that narrow opening and because the GT 440 also gives off heat on its backside, which gets sucked in by the GTX 560's fan.

I know, the NV spec says it can go a lot higher but then I get failed WUs. No, it's not possible I reported wrong temps as you'd really need to be blind to confuse the two. The NV center under Linux separates both cards pretty well.

Anyways. I'm happy now I can contribute more to GPUGRID :)
____________

Team Belgium

Dagorath
Message 34174 - Posted: 9 Dec 2013 | 18:18:13 UTC - in response to Message 34170.

microchip wrote: "The NV center under Linux separates both cards pretty well."


That's what I was wondering. Thanks :)

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile skgiven
Message 34176 - Posted: 9 Dec 2013 | 22:29:24 UTC - in response to Message 34170.

Swap the cards around and the 560 should be cooler.

____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile microchip
Message 34190 - Posted: 10 Dec 2013 | 14:51:47 UTC

Yeah, I did that. Reduced temps by a few degrees. My case really isn't made for multiple GPUs, as even on the bottom there isn't much space open
____________

Team Belgium

Profile robertmiles
Joined: 16 Apr 09
Posts: 498
Credit: 575,516,497
RAC: 11,537
Level
Lys
Message 34192 - Posted: 10 Dec 2013 | 19:28:57 UTC

Errors with 8.14:

Error 255:

http://www.gpugrid.net/result.php?resultid=7530621
The extended attributes are inconsistent.

http://www.gpugrid.net/result.php?resultid=7530360
The extended attributes are inconsistent.


Error -97:

http://www.gpugrid.net/result.php?resultid=7528287
(unknown error) - exit code -97 (0xffffff9f)
The simulation has become unstable. Terminating to avoid lock-up (1)
The simulation has become unstable. Terminating to avoid lock-up (1)


GPUGRID now set to no new tasks on this computer, probably until
something relevant is updated. Tasks still running on my other
computer, with a slower GPU.

GTX 560, at stock speeds.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 34194 - Posted: 11 Dec 2013 | 0:22:21 UTC
Last modified: 11 Dec 2013 | 0:24:07 UTC

My case isn't suited for multiple GPUs either, but I manage by using Precision-X to set a custom GPU fan curve that reaches maximum fan speed before hitting 70°C, thus keeping GPU Boost at maximum clock rates while keeping temperatures low enough to process tasks successfully. You might consider doing what I've done. (Despite being made by EVGA, any NVIDIA user can register/download/use Precision-X freely... though I'm not sure if it's available outside of Windows.)
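
A custom fan curve of this kind is essentially a mapping from GPU temperature to fan duty. The sketch below is not how Precision-X works internally (nothing here says how it does it); it is a plain piecewise-linear interpolation with made-up curve points, chosen so that the fan reaches 100% shortly before 70°C. The names `CURVE` and `fan_duty` are illustrative only.

```python
from bisect import bisect_right

# Hypothetical curve points: (temperature in °C, fan duty in %).
CURVE = [(30, 30), (50, 50), (60, 75), (68, 100)]


def fan_duty(temp_c: float, curve=CURVE) -> float:
    """Linearly interpolate the fan duty (%) for a temperature along the curve."""
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return float(curve[0][1])      # below the curve: hold the minimum duty
    if temp_c >= temps[-1]:
        return float(curve[-1][1])     # above the curve: hold the maximum duty
    i = bisect_right(temps, temp_c)    # index of the first point hotter than temp_c
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


print(fan_duty(55))   # 62.5
print(fan_duty(70))   # 100.0
```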

Dagorath
Message 34195 - Posted: 11 Dec 2013 | 1:50:10 UTC - in response to Message 34194.
Last modified: 11 Dec 2013 | 2:05:30 UTC

Jacob Klein wrote: "My case isn't suited for multiple GPUs either, but I manage it by using Precision-X to set a custom GPU fan curve where it will go maximum-fan before hitting 70°C, thus keeping GPU Boost at maximum clockrates, while keeping temperatures low enough to process tasks successfully. You might consider doing what I've done. (Despite being made by EVGA, any nVidia user can register/download/use Precision-X freely... though, not sure if it's available outside of Windows)."


It probably isn't available for Linux. A lot of Windows software will run on Linux under Wine (a Windows compatibility layer rather than an emulator), but if it needs hardware access then it usually won't run.

Does Precision-X re-flash the ROM on the card? Or do you have to keep Precision-X running while crunching in order to maintain the custom fanspeed curve?

The EVGA 660ti I have running in one of my Linux boxes allows the temperature to get to 80C before it gets really serious about boosting fan RPM so I wrote a Python script to put the fan control in manual mode. The script reads the temperature every 5 secs and adjusts the fan speed up or down to keep the GPU at whatever target temperature the user specifies. It uses an ncurses interface which keeps the RAM and CPU overhead quite low compared to a point 'n click GUI. It works very well.
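
A loop of that kind could look roughly like the sketch below. This is not gpu_d.py, just a minimal illustration of the idea: `read_temp_c` and `set_fan_percent` are hypothetical callables standing in for whatever the driver tools provide, and the step size, dead band, and limits are made-up values.

```python
import time
from typing import Callable, Optional


def next_fan_speed(temp_c: float, target_c: float, current: int,
                   step: int = 5, lo: int = 30, hi: int = 100) -> int:
    """Nudge the fan one step up if too hot, one step down if cool, then clamp."""
    if temp_c > target_c:
        current += step
    elif temp_c < target_c - 2:        # small dead band to avoid oscillating
        current -= step
    return max(lo, min(hi, current))


def control_loop(read_temp_c: Callable[[], float],
                 set_fan_percent: Callable[[int], None],
                 target_c: float = 70.0, interval_s: float = 5.0,
                 cycles: Optional[int] = None) -> None:
    """Poll the temperature every interval_s seconds and adjust the fan."""
    speed = 50                         # assumed starting duty in percent
    done = 0
    while cycles is None or done < cycles:
        speed = next_fan_speed(read_temp_c(), target_c, speed)
        set_fan_percent(speed)
        time.sleep(interval_s)
        done += 1
```

Keeping the decision in a pure function (`next_fan_speed`) makes it easy to check the behaviour without touching real hardware.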

The potential problem with the script is that if it should crash/hang, the fan will stay at whatever speed it is at unless the script catches the fault and recovers or is able to restore auto-fan-control before it exits. If bad luck continues and something should then happen to elevate GPU temperature the card will downclock if that mechanism works or if bad luck continues (they say it comes in threes) it will possibly fry. So far the script has run for 60 days without crashing but I would prefer to let the hardware/firmware do the temperature control and perhaps have a script that just monitors the card to verify that the card is doing the proper job. Trouble is, as I said earlier, the stock fan curve installed by EVGA lets the temp hit 80 C before kicking the fan speed up. I want it at 70 C.
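
One way to narrow that failure window, sketched under the assumption that the driver offers some way to switch between manual and automatic fan control: wrap the loop so automatic control is handed back however the loop ends. `enable_manual`, `restore_auto`, and `run_loop` are hypothetical callables, not real driver calls.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def manual_fan_control(enable_manual: Callable[[], None],
                       restore_auto: Callable[[], None]) -> Iterator[None]:
    """Switch to manual fan control and always hand control back on exit."""
    enable_manual()
    try:
        yield
    finally:
        restore_auto()   # runs on normal exit, on exceptions, and on Ctrl-C


def run_guarded(enable_manual: Callable[[], None],
                restore_auto: Callable[[], None],
                run_loop: Callable[[], None]) -> None:
    """Run the fan control loop with automatic control restored afterwards."""
    with manual_fan_control(enable_manual, restore_auto):
        run_loop()
```

This only covers exits that Python gets to handle. It does nothing if the process hangs or is killed outright, so it reduces the risk described above rather than removing it.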

I have gathered together scraps of info and software from various fora/newsgroups/blogs with which they claim I can reflash the ROM to recurve the temperature function and adjust clock freqs on cards that a lot of people have considered to be "locked". I haven't had time to try it but from feedback it apparently works perfectly. Yes, I am aware that re-flashing ROM can brick the card but if one saves a copy of the original content it's just a matter of flashing the card with the original bin file. It works easiest on Windows but IIUC it's doable on pure Linux too. By pure I mean not a dual-boot Win-Lin system.

If anybody wants links just ask and I'll give you what I have so far.
____________
BOINC <<--- credit whores, pedants, alien hunters

Profile microchip
Message 34201 - Posted: 11 Dec 2013 | 8:33:50 UTC

@Dagorath

that script could come in handy :)
links please?
____________

Team Belgium

Dagorath
Message 34223 - Posted: 12 Dec 2013 | 4:57:35 UTC - in response to Message 34201.

Actually, I meant I would provide links to the various bits of info I have found scattered around the 'net concerning how to reflash the BIOS on NVIDIA-based cards. I didn't mean the script, as I felt that's not a good solution for managing fan speed. However, if you are interested then I will provide it. It does more than just control fan speed: it shows various info (freqs, usage, etc.) which might be useful for anybody running without a desktop. I've been playing with gnuplot too and have been thinking the script could save interesting, plottable data to files and provide possibly informative graphs, even collate those with task type history or...

So, I'm tidying it up a bit so you can get it going without too much trouble. I'll post a download link in a few hours. Depending on feedback I might put it on Github so others can critique/add/improve/whatever.

____________
BOINC <<--- credit whores, pedants, alien hunters

Dagorath
Message 34249 - Posted: 12 Dec 2013 | 15:58:34 UTC - in response to Message 34201.

microchip wrote: "@Dagorath that script could come in handy :) links please?"


The script is named gpu_d.py. Get it and a readme in a zip file here.

I created a new thread named gpu_d in the cafe if anybody wants to comment on gpu_d. I suggest comments go in that thread rather than this one. If that thread grows beyond a few posts and the project admins/mods wish, I'll set up something for discussing gpu_d somewhere else, no problem.

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile microchip
Message 34260 - Posted: 12 Dec 2013 | 20:34:27 UTC

Thanks Dagorath :)
____________

Team Belgium

ExtraTerrestrial Apes
Message 34303 - Posted: 14 Dec 2013 | 13:58:10 UTC - in response to Message 34195.

Dagorath wrote: "It probably isn't available for Linux. A lot of Windows software will run on Linux under Wine (a Windows compatibility layer) but if it needs hardware access then it usually won't run.

Does Precision-X re-flash the ROM on the card? Or do you have to keep Precision-X running while crunching in order to maintain the custom fanspeed curve?"

Win only (like pretty much all the other usual tuning tools), needs hardware access (or at least to driver settings), no bios flash and doesn't need to be kept running. I think it really just changes driver settings, as custom settings have to be re-applied after e.g. a driver reset (or reboot).

MrS
____________
Scanning for our furry friends since Jan 2002

Dagorath
Message 34323 - Posted: 15 Dec 2013 | 9:01:59 UTC - in response to Message 34303.

OK thanks, MrS.

____________
BOINC <<--- credit whores, pedants, alien hunters
