Message boards : Number crunching : Diagnose error: exit code 98 (0x62)

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20290 - Posted: 30 Jan 2011 | 22:21:14 UTC
Last modified: 30 Jan 2011 | 22:27:45 UTC

Could somebody please help explain why this work unit failed? I recently added a 3rd graphics card (a GTX460, alongside 2 9800GTs) to my system (Windows 7 x64, Intel Core i7 965 Extreme Edition, 6 GB RAM, 260.99 x64 drivers installed), which caused me to take a peek at my stats and credits, and I noticed some tasks are failing.

I shouldn't have to set any "swan" windows variables to get these tasks to complete properly, and I don't think anything is overheating.

Is there a bug in the "ACEMD2: GPU molecular dynamics v6.13 (cuda31)" application? The errors in this failure say:
MDIO ERROR: cannot open file "restart.coor"
and
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish

Here's some extra info:
Work unit number: 2273537
Work unit name: p14-IBUCH_1_wtEGFR_110121-4-20-RND3367
Task number: 3625935
Outcome: Client error
Client state: Compute error
Exit status: 98 (0x62)
Computer ID: 65623
Run time: 79226.688393
CPU time: 4028.679
Validate state: Invalid
Claimed credit: 9485.43518518518
Application: ACEMD2: GPU molecular dynamics v6.13 (cuda31)

stderr out:
<core_client_version>6.12.12</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 1
# There are 3 devices supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 0.81 GHz
# Total amount of global memory: 1041825792 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515440640 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 2: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515571712 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
# Using device 1
# There are 3 devices supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 0.81 GHz
# Total amount of global memory: 1041825792 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515440640 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 2: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515571712 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Using device 1
# There are 3 devices supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 0.81 GHz
# Total amount of global memory: 1041825792 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515440640 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 2: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515571712 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Using device 1
# There are 3 devices supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 0.81 GHz
# Total amount of global memory: 1041825792 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515440640 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 2: "GeForce 9800 GT"
# Clock rate: 1.50 GHz
# Total amount of global memory: 515571712 bytes
# Number of multiprocessors: 14
# Number of cores: 112
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish

</stderr_txt>
]]>



Any help would be greatly appreciated! I don't want the work to go to waste!

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20295 - Posted: 31 Jan 2011 | 13:31:02 UTC - in response to Message 20290.
Last modified: 31 Jan 2011 | 13:37:08 UTC

Most of your failures are on device 1, a 9800GT.
I would suggest you remove it or don’t use it for now. Your GTX460 is device 0.

This failure occurred after 6 task restarts.
I don’t know how you have BOINC configured or what you were doing to make it stop and start so many times, but the task began running on your GTX460 and ended up running, and failing, on your device 1 (9800GT). Try to avoid restarts, suspending tasks and using the snooze feature.
My guess is that the 9800GT card's temperature may be partially to blame. Adding a GPU almost always increases the case temperature and card temperature. You should use fan-control software to keep the GPU temps below 70 °C. It usually improves stability.
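
A rough way to check the restart count and the device used for each start is to count the `# Using device N` lines in the task's stderr. The following is only a sketch in Python: it assumes the application prints exactly one such line per start (as the output in the first post suggests), and `count_starts` and the sample text are made up for illustration.

```python
import re
from collections import Counter


def count_starts(stderr_text):
    """Return a Counter mapping device index -> number of task starts.

    Assumption: every (re)start writes one '# Using device N' line.
    """
    starts = re.findall(r"^# Using device (\d+)\s*$", stderr_text, flags=re.MULTILINE)
    return Counter(int(device) for device in starts)


# Hypothetical, shortened stderr text (not copied from a real task).
sample = """# Using device 0
# There are 3 devices supporting CUDA
MDIO ERROR: cannot open file "restart.coor"
# Using device 1
# There are 3 devices supporting CUDA
# Using device 1
"""

starts = count_starts(sample)
total_starts = sum(starts.values())
print(dict(starts), "total starts:", total_starts)
```

The number of restarts is the total number of starts minus one, since the first start is not a restart.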

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20297 - Posted: 31 Jan 2011 | 13:52:10 UTC - in response to Message 20295.
Last modified: 31 Jan 2011 | 13:56:25 UTC

skgiven,

Thanks for your help so far, but I'm really looking for more details.

Regarding the task in the original post, can you give me more information about resolving the errors:
MDIO ERROR: cannot open file "restart.coor"
and
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish

I'm trying to find out why these errors are happening, based on the actual errors given in the stderr out. Is that the wrong approach?

Regarding suspending, I suspend BOINC when I game or when I want to watch a video with no lag. I game most evenings and suspend while I do, and sometimes I suspend several times in a day. As far as I knew, both BOINC and GPUGrid supported suspending and resuming - do you have reason to believe otherwise?

Regarding heat, the cards run at around 70-84 °C at full load 24/7, but the GPU fans are on automatic and run at around 30-70%, leading me to believe that the cards are not overheating. If the fans were at 100% on auto, then I might believe heat to be a problem, but that is not the case.

Is there any way to figure out what really went wrong? I don't want to blame heat unless there's no other explanation.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20298 - Posted: 31 Jan 2011 | 14:00:04 UTC - in response to Message 20297.

Here is another work unit that failed recently, with a very similar error:
http://www.gpugrid.net/result.php?resultid=3629042

The error in this task is:
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish

The error in the original post was:
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 73: cufftExecC2C (gridcalc2.1)
called boinc_finish

What is causing these?!?
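
To compare failures across tasks, the fatal `ERROR:` lines can be pulled apart into source line, failing call and tag. A minimal Python sketch, assuming every such line follows the same `ERROR: file <path> line <n>: <call> (<tag>)` layout as the two quoted above; `tally_errors` is a hypothetical helper, not part of BOINC or ACEMD.

```python
import re
from collections import Counter

# Layout assumed from the two error lines quoted in this thread.
ERROR_RE = re.compile(
    r"^ERROR: file (?P<file>.+) line (?P<line>\d+): (?P<call>\w+) \((?P<tag>[^)]*)\)\s*$",
    re.MULTILINE,
)


def tally_errors(stderr_text):
    """Count fatal errors by (source line, failing call, tag)."""
    return Counter(
        (int(m["line"]), m["call"], m["tag"]) for m in ERROR_RE.finditer(stderr_text)
    )


# Raw string so the backslashes in the Windows path are kept literally.
log = r"""ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 73: cufftExecC2C (gridcalc2.1)
called boinc_finish
"""

errors = tally_errors(log)
for (line_no, call, tag), count in sorted(errors.items()):
    print(f"line {line_no}: {call} ({tag}) x{count}")
```

Running it over the concatenated stderr of several failed tasks would show whether the failures cluster on one call site or are spread across several.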

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20299 - Posted: 31 Jan 2011 | 14:08:23 UTC - in response to Message 20297.
Last modified: 31 Jan 2011 | 14:33:13 UTC

Firstly, this is not an error -> MDIO ERROR: cannot open file "restart.coor"

The (cufft) error on line 106 might be a cuda error or just be a reporting element of the app and not specifically related to the failure. Only the program writer could know that, but what actually caused the error is most likely a latent cuda bug that creeps in and impacts on some cc1.1 cards every now and then, and/or a heat related problem. At this stage there is nothing that could be done about any cuda bug other than looking for a workaround. Given the number of errors you have, from different tasks, but on the same card, I would say that a heat related issue could be the underlying problem. Perhaps at the time of the error the card is being more stressed than normal; performing a calculation that causes the core to heat up, even for just a fraction of a second.

Just because a fan does not run at 100% does not mean the cards are at a comfortable temperature, and trust me when I say that 70 to 84 deg C is not comfortable - your case could double as a denaturing heat bucket.
Seriously, download EVGA Precision or something similar. It's free, and it's set-and-forget software. Aim for below 70 °C, and after a couple of days of running new tasks you can start to ask why tasks fail, if you need to.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20314 - Posted: 31 Jan 2011 | 20:23:08 UTC - in response to Message 20299.
Last modified: 31 Jan 2011 | 20:37:46 UTC

You said: Firstly, this is not an error -> MDIO ERROR: cannot open file "restart.coor" ... I say that it's either a legitimate error that should be fixed, or if it's not a problem, the line should be omitted from the output, since it is obviously misleading. It is obviously an error of some sort - it says error!

Also, you hinted that only the program writer could know if the real issue is a cuda bug or not. I'm asking for a program writer to look into the issue.

You've even said, in a different post, "Expect failures to happen when using a 9800GT." Why? If something is not working, it should either be an app bug that needs to be fixed on GPUGrid's side, or it should be a previously documented cuda bug that should be prominently placed as a sticky to these forums, with lists of driver version numbers that cause and fix the problem.

I will do a more thorough analysis of my tasks that had problems, including which cards they ran on, and which errors were produced... But I'm asking for a developer's assistance here also! Do they read these forums? Is a more-thorough investigation (including code inspection) too much to ask?
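
One way to start that analysis: for each failed task, record which device was active when the fatal error was written. A Python sketch, assuming the most recent `# Using device N` line before the `ERROR: file ...` line identifies the card that failed, and treating the `MDIO ERROR` line as non-fatal; `failing_device` is a hypothetical name.

```python
import re

DEVICE_RE = re.compile(r"^# Using device (\d+)\s*$")


def failing_device(stderr_text):
    """Return the device index in use when the first fatal error appears.

    Returns None if no fatal 'ERROR: file ...' line is found, or if one
    appears before any '# Using device N' line.
    """
    current = None
    for line in stderr_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = int(match.group(1))
        elif line.startswith("ERROR: file"):
            # 'MDIO ERROR: ...' does not start with this prefix, so it is skipped.
            return current
    return None


# Hypothetical, shortened stderr text; raw string keeps the path backslashes.
sample = r"""# Using device 0
MDIO ERROR: cannot open file "restart.coor"
# Using device 1
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish
"""

device = failing_device(sample)
print("failed on device", device)
```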

If this is not open source, then how do I get the developers to take a look?

- Jacob

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20323 - Posted: 31 Jan 2011 | 23:34:52 UTC - in response to Message 20314.

Sorry if that came across as rude. I am frustrated, and am looking to solve this problem by identifying where in the code things went wrong.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20324 - Posted: 1 Feb 2011 | 1:35:24 UTC - in response to Message 20323.
Last modified: 1 Feb 2011 | 7:23:56 UTC

Forums are for posting about problems and concerns; I expect grumpy, especially from crunchers that seldom post messages.

You said: Firstly, this is not an error -> MDIO ERROR: cannot open file "restart.coor" ... I say that it's either a legitimate error that should be fixed, or if it's not a problem, the line should be omitted from the output, since it is obviously misleading. It is obviously an error of some sort - it says error!

“cannot open file "restart.coor"” is a known false error. You have a point about the line being excluded from the output, and a method of dealing with this has been suggested previously, but as you know, I don’t write the code. Some error messages are not terminal application errors; they are just reports. This is common with other BOINC-based projects and there are some cross-project examples (the wiki has a list). It’s the same in many environments. Take an operating system, for example. You will get an error message if your time is not synchronised with a time server, and if one server cannot contact another over a VPN it will hit you with numerous errors (even if the cause is just that the Internet connection is down). Many things can trigger an error message.

Also, you hinted that only the program writer could know if the real issue is a cuda bug or not. I'm asking for a program writer to look into the issue.

You've even said, in a different post, "Expect failures to happen when using a 9800GT." Why? If something is not working, it should either be an app bug that needs to be fixed on GPUGrid's side, or it should be a previously documented cuda bug that should be prominently placed as a sticky to these forums, with lists of driver version numbers that cause and fix the problem.

Again, I don’t write the code, or debug it. If you want other crunchers and CA’s to express their opinion, make suggestions and try to help you, then feel free to ask in the forums. If you want the specific opinion of a programmer, ask the programmer directly, but don't be offended if they don't reply (they are busy doing research). My efforts at troubleshooting similar problems make me think that the cuda fast Fourier transform routine is more susceptible at specific times to failures on CC1.1 cards, especially when the failing card is running hot, but that is just my humble opinion.

I will do a more thorough analysis of my tasks that had problems, including which cards they ran on, and which errors were produced... But I'm asking for a developer's assistance here also! Do they read these forums? Is a more-thorough investigation (including code inspection) too much to ask?

While the researchers read the forums and occasionally respond, they do not develop cuda; NVidia does that. They just work with the tools in order to build applications to run tasks on various GPUs. The ACEMD programming tool is not open source, but the important thing is that the research is published.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20325 - Posted: 1 Feb 2011 | 3:45:10 UTC - in response to Message 20324.
Last modified: 1 Feb 2011 | 3:47:47 UTC

I understand that you do not have access to the code, or a debugger. But I want to investigate this issue to the best of my (our) abilities.

So, if we ignore this ignorable error:
MDIO ERROR: cannot open file "restart.coor"

... that leaves us with some failures that may or may not be related to:
ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)

You mentioned that you "think that the cuda fast file transfer routine is more susceptible at specific times to failures on CC1.1 cards, especially when the failing card is running hot".

My research indicates that cufft is actually "CUFFT, the NVIDIA® CUDA™ Fast Fourier Transform (FFT) library". Documentation on it can be found here:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUFFT_Library.pdf

I believe it is no coincidence that another user is also getting related errors with his/her 9800 GT. I believe we should be trying to find a fault in either the "ACEMD2: GPU molecular dynamics v6.13 (cuda31)" application, or Cuda itself.

If you cannot investigate further, could you please pass it along to a GPU Grid (ACEMD2) developer if possible? Or perhaps PM me with a way to get in contact with one? I've worked with nVidia in the past on graphics driver issues, and I wouldn't mind approaching them if we can conclude it's a Cuda fault.

Could a GPU Grid developer please look into why 9800 GT cards would be getting errors relating to cufft? It's wasting our resources, and we'd like to help with the project more by returning successful results.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20326 - Posted: 1 Feb 2011 | 7:33:54 UTC - in response to Message 20325.

My research indicates that cufft is actually "CUFFT, the NVIDIA® CUDA™ Fast Fourier Transform (FFT) library". Documentation on it can be found here:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUFFT_Library.pdf
Yeah, that was a typo, now corrected. As you say, the fft is read from an NVidia library, and NVidia are not going to rewrite a dll file for a 3 generation old card.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20328 - Posted: 1 Feb 2011 | 13:05:50 UTC - in response to Message 20326.
Last modified: 1 Feb 2011 | 13:06:16 UTC

Have you confirmed that the bug is not with ACEMD2?

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20330 - Posted: 1 Feb 2011 | 13:22:10 UTC - in response to Message 20326.

FYI, I just sent the following to Acellera, who is listed on GPUGrid.net's About Us page as ACEMD support. Hopefully they respond.

Hello acellera,

I'm trying to resolve some errors that I'm seeing on the GPUGrid.Net forums, regarding GPUGrid.Net ACEMD2's use of the "CUFFT, the NVIDIA® CUDA™ Fast Fourier Transform (FFT) library" cufft. I've tried to work with the GPUGrid.net forums to resolve the issue, but they're stumped. I noticed that you are listed on their "About" page as ACEMD support, which is why I came to you.

My forum post on the GPUGrid.net website is http://www.gpugrid.net/forum_thread.php?id=2421

The application and version is:
ACEMD2: GPU molecular dynamics v6.13 (cuda31)

The errors appear to regularly happen on GeForce 9800 GT cards, and might involve multiple restarts of the work unit.

I've seen the following errors:

ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 106: cufftExecC2C (gridCalc2.2)
called boinc_finish

ERROR: file c:\cygwin\home\speechserver\gpumd2_c\src\pme\CPME_cufft.cpp line 73: cufftExecC2C (gridcalc2.1)
called boinc_finish

Could you please take a look, and see if there's anything that can be done? I'm most interested in knowing if this is a bug with ACEMD2, or a bug with CUDA, or a result of pushing the GPUs too hard. A solution or workaround would be nice too! As it is, I feel like I'm wasting GPU power and electricity, and would much rather return useful successful results!

Thanks,
Jacob Klein

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20340 - Posted: 4 Feb 2011 | 3:17:13 UTC - in response to Message 20330.

skgiven:

I am further inclined to believe that this is not a heat problem, because I've done some extra monitoring with the 2 9800 GT fans at 100% and my case fan cranked up, so that the 2 cards sit at 58 °C and 48 °C at full load all day.

Yet I am still getting the same error on both of the cards.

How can we troubleshoot this further? Please help.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 20381 - Posted: 9 Feb 2011 | 13:32:06 UTC - in response to Message 20340.

I have switched to using "Swan_Sync" set to "0", and have adjusted BOINC to free up 3 threads since I have 3 video cards, while running the fans so that temperatures stay around 45-55 °C...

But I still have work units erroring out on the GeForce 9800 GT video cards with exit code 98 (0x62).

Can anybody (particularly a GPUGrid.net developer) please help me to determine if this is a GPUGrid.net application problem or an nVidia Cuda problem?? We should work together to get this problem resolved!

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 20391 - Posted: 10 Feb 2011 | 0:30:25 UTC - in response to Message 20381.
Last modified: 10 Feb 2011 | 0:33:30 UTC

I'm under the impression that CUDA FFT has subtle bugs. They have been discussed extensively in the past, and reported to NVIDIA. Unfortunately problems seem to come and go intermittently with driver versions and specific boards. We had to remove a Fermi from a machine, because it did not run well with 275's (the cause seemed to be FFT).
