Message boards : Number crunching : String WUs with errors

curiously_indifferent
Joined: 20 Nov 17
Posts: 21
Credit: 1,525,941,493
RAC: 4,167,939
Message 55276 - Posted: 11 Sep 2020 | 14:57:31 UTC

This morning I have had a string of WUs error out. The screen goes blank and the GPU fans run at max RPM. Here is the output from one of them:

Stderr output
<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:32:33 (9976): wrapper (7.9.26016): starting
10:32:33 (9976): wrapper: running acemd3.exe (--boinc input --device 0)
# Engine failed: Error initializing CUDA: CUDA_ERROR_UNKNOWN (999) at C:\Miniconda37-x64\conda-bld\openmm_1562766554928\work\platforms\cuda\src\CudaContext.cpp:148
10:32:36 (9976): acemd3.exe exited; CPU time 0.000000
10:32:36 (9976): app exit status: 0x1
10:32:36 (9976): called boinc_finish(195)
0 bytes in 0 Free Blocks.
440 bytes in 8 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 139016 bytes.
Dumping objects ->
{1614} normal block at 0x0000018150068220, 48 bytes long.
Data: <ACEMD_PLUGIN_DIR> 41 43 45 4D 44 5F 50 4C 55 47 49 4E 5F 44 49 52
{1603} normal block at 0x000001815006A1D0, 32 bytes long.
Data: <HOME=D:\ProgramD> 48 4F 4D 45 3D 44 3A 5C 50 72 6F 67 72 61 6D 44
{1592} normal block at 0x000001815006A830, 32 bytes long.
Data: <TMP=D:\ProgramDa> 54 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 61
{1581} normal block at 0x000001815006A710, 32 bytes long.
Data: <TEMP=D:\ProgramD> 54 45 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44
{1570} normal block at 0x000001815006ABF0, 32 bytes long.
Data: <TMPDIR=D:\Progra> 54 4D 50 44 49 52 3D 44 3A 5C 50 72 6F 67 72 61
{1559} normal block at 0x00000181500597A0, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {1556} normal block at 0x000001815006FEF0, 8 bytes long.
Data: < P > 00 00 03 50 81 01 00 00
{891} normal block at 0x0000018150052C50, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{202} normal block at 0x00000181500701C0, 8 bytes long.
Data: < P > 20 13 07 50 81 01 00 00
{196} normal block at 0x0000018150068290, 48 bytes long.
Data: <--boinc input --> 2D 2D 62 6F 69 6E 63 20 69 6E 70 75 74 20 2D 2D
{195} normal block at 0x0000018150070300, 16 bytes long.
Data: <x P > 78 F6 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{194} normal block at 0x000001815006F950, 16 bytes long.
Data: <P P > 50 F6 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{193} normal block at 0x000001815006FEA0, 16 bytes long.
Data: <( P > 28 F6 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{192} normal block at 0x000001815006FE50, 16 bytes long.
Data: < P > 00 F6 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{191} normal block at 0x0000018150066C50, 16 bytes long.
Data: < P > D8 F5 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{190} normal block at 0x0000018150066A70, 16 bytes long.
Data: < P > B0 F5 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{189} normal block at 0x00000181500685A0, 48 bytes long.
Data: <ComSpec=C:\Windo> 43 6F 6D 53 70 65 63 3D 43 3A 5C 57 69 6E 64 6F
{188} normal block at 0x0000018150066520, 16 bytes long.
Data: < G P > A0 47 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{187} normal block at 0x000001815006A4D0, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{186} normal block at 0x0000018150066A20, 16 bytes long.
Data: <xG P > 78 47 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{184} normal block at 0x0000018150066200, 16 bytes long.
Data: <PG P > 50 47 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{183} normal block at 0x00000181500669D0, 16 bytes long.
Data: <(G P > 28 47 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{182} normal block at 0x0000018150066CA0, 16 bytes long.
Data: < G P > 00 47 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{181} normal block at 0x0000018150066750, 16 bytes long.
Data: < F P > D8 46 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{180} normal block at 0x0000018150066110, 16 bytes long.
Data: < F P > B0 46 05 50 81 01 00 00 00 00 00 00 00 00 00 00
{179} normal block at 0x00000181500546B0, 280 bytes long.
Data: < a P P > 10 61 06 50 81 01 00 00 F0 AB 06 50 81 01 00 00
{178} normal block at 0x0000018150066CF0, 16 bytes long.
Data: < P > 90 F5 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{177} normal block at 0x00000181500662A0, 16 bytes long.
Data: <h P > 68 F5 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{176} normal block at 0x0000018150066070, 16 bytes long.
Data: <@ P > 40 F5 06 50 81 01 00 00 00 00 00 00 00 00 00 00
{175} normal block at 0x000001815006F540, 496 bytes long.
Data: <p` P acemd3.e> 70 60 06 50 81 01 00 00 61 63 65 6D 64 33 2E 65
{64} normal block at 0x0000018150066930, 16 bytes long.
Data: < > 80 EA 05 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x00000181500664D0, 16 bytes long.
Data: <@ > 40 E9 05 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x00000181500666B0, 16 bytes long.
Data: < W > F8 57 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x00000181500660C0, 16 bytes long.
Data: < W > D8 57 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000018150066B10, 16 bytes long.
Data: <P > 50 04 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000018150066390, 16 bytes long.
Data: <0 > 30 04 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x00000181500665C0, 16 bytes long.
Data: < > E0 02 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000018150066D90, 16 bytes long.
Data: < > 10 04 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x0000018150066700, 16 bytes long.
Data: <p > 70 04 02 0A F6 7F 00 00 00 00 00 00 00 00 00 00
{55} normal block at 0x0000018150066160, 16 bytes long.
Data: < > 18 C0 00 0A F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Any suggestions?
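
When reading a dump like the one above, the lines that matter are near the top: the exit code, the "# Engine failed" line and the boinc_finish call. The long list of "normal block" entries appears to be a debug heap report printed at exit rather than the cause of the failure. Here is a minimal Python sketch that pulls those fields out of a pasted stderr, assuming the line formats seen in this log. The summarize_stderr helper and its field names are made up for illustration and are not part of BOINC.

```python
import re

# Excerpt copied from the stderr above (raw string keeps the backslashes).
SAMPLE = r"""(unknown error) - exit code 195 (0xc3)</message>
10:32:33 (9976): wrapper: running acemd3.exe (--boinc input --device 0)
# Engine failed: Error initializing CUDA: CUDA_ERROR_UNKNOWN (999) at C:\Miniconda37-x64\conda-bld\openmm_1562766554928\work\platforms\cuda\src\CudaContext.cpp:148
10:32:36 (9976): app exit status: 0x1
10:32:36 (9976): called boinc_finish(195)
"""


def summarize_stderr(text):
    """Return the few fields that usually identify the failure (hypothetical helper)."""
    exit_code = re.search(r"exit code (\d+) \(0x[0-9a-fA-F]+\)", text)
    engine = re.search(r"^# Engine failed: (.*)$", text, re.MULTILINE)
    cuda = re.search(r"(CUDA_ERROR_\w+) \((\d+)\)", text)
    finish = re.search(r"called boinc_finish\((\d+)\)", text)
    return {
        "exit_code": int(exit_code.group(1)) if exit_code else None,
        "engine_error": engine.group(1) if engine else None,
        "cuda_error": (cuda.group(1), int(cuda.group(2))) if cuda else None,
        "boinc_finish": int(finish.group(1)) if finish else None,
    }


if __name__ == "__main__":
    for key, value in summarize_stderr(SAMPLE).items():
        print(f"{key}: {value}")
```

For this task the summary points at CUDA failing to initialise about three seconds after the wrapper started, before any science was computed (CPU time 0.000000).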

Keith Myers
Joined: 13 Dec 17
Posts: 1343
Credit: 7,715,200,648
RAC: 12,386,335
Message 55277 - Posted: 11 Sep 2020 | 15:20:24 UTC

Card has gone wonky in the system. Reboot the system.

curiously_indifferent
Joined: 20 Nov 17
Posts: 21
Credit: 1,525,941,493
RAC: 4,167,939
Message 55278 - Posted: 11 Sep 2020 | 17:00:27 UTC - in response to Message 55277.

Reboot did not solve the issue. It is crashing on Milkyway@Home also. I guess the GPU (RTX 2080 Ti) has issues. The driver version is 27.21.14.5206 and it ran fine on that for several weeks.

This GPU is fairly new; it is disappointing if it is failing already.

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Message 55280 - Posted: 12 Sep 2020 | 1:45:36 UTC - in response to Message 55278.

CI, did your Windows 10 recently update? (Mine did.)
That seems like it could be a driver/OS glitch. Maybe the next driver update will fix it if others are getting it while gaming.
Meanwhile, try testing the card with GPUPI or another math processing tester.
I would also experiment with re-installing BOINC, just in case the ACEMD file became corrupted.
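
Since the log shows the failure at CUDA initialisation, one quick check outside BOINC is to ask the driver to initialise CUDA directly. Below is a minimal Python sketch using ctypes against the CUDA driver API (cuInit and cuDeviceGetCount). The probe_cuda helper is hypothetical, and the library names are assumptions about a typical install. A clean result here does not prove the card is healthy under load; it only shows that the driver can bring CUDA up.

```python
import ctypes
import sys

# Small subset of CUresult codes from the CUDA driver API.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_result(code):
    """Map a CUresult integer to a readable name where we know it."""
    return CU_RESULT_NAMES.get(code, f"CUresult {code}")


def probe_cuda():
    """Try to initialise CUDA through the driver library; return a status string."""
    # Assumed library names: nvcuda.dll on Windows, libcuda.so.1 on Linux.
    if sys.platform == "win32":
        lib_name, loader = "nvcuda.dll", ctypes.WinDLL
    else:
        lib_name, loader = "libcuda.so.1", ctypes.CDLL
    try:
        cuda = loader(lib_name)
    except OSError:
        return "driver library not found: " + lib_name

    rc = cuda.cuInit(0)  # the flags argument must be 0
    if rc != 0:
        return "cuInit failed: " + describe_result(rc)

    count = ctypes.c_int(0)
    rc = cuda.cuDeviceGetCount(ctypes.byref(count))
    if rc != 0:
        return "cuDeviceGetCount failed: " + describe_result(rc)
    return f"CUDA initialised, {count.value} device(s) visible"


if __name__ == "__main__":
    print(probe_cuda())
```

If this reports a cuInit failure with the same CUDA_ERROR_UNKNOWN (999) seen in the task log, the problem is below BOINC and ACEMD, which would make a re-install of BOINC unlikely to help.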

curiously_indifferent
Joined: 20 Nov 17
Posts: 21
Credit: 1,525,941,493
RAC: 4,167,939
Message 55283 - Posted: 13 Sep 2020 | 1:37:02 UTC - in response to Message 55280.

I believe there was an OS update a couple of days before the GPU checked out. I did a restore to roll Win10 back to see if it would solve the issue - it did not.

I tested it with PrimeGrid and Milkyway@Home and then stressed it with OCCT - the GPU failed all three. I then swapped in a Radeon and another Nvidia card, and both worked fine in the failed GPU's slot.

Unfortunately, I think I have an RTX 2080 Ti paperweight now.

Keith Myers
Joined: 13 Dec 17
Posts: 1343
Credit: 7,715,200,648
RAC: 12,386,335
Message 55284 - Posted: 13 Sep 2020 | 3:45:19 UTC - in response to Message 55278.

> Reboot did not solve the issue. It is crashing on Milkyway@Home also. I guess the GPU (RTX 2080 Ti) has issues. The driver version is 27.21.14.5206 and it ran fine on that for several weeks.
>
> This GPU is fairly new; it is disappointing if it is failing already.

I think M$ updated your driver to their version which does not have OpenCL support and possibly normal CUDA support. Your reported driver version is NOT a normal Nvidia driver version.

Remove the M$ driver and download the correct latest driver directly from Nvidia.
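
A side note on the version string: Windows reports NVIDIA drivers in a four-part form, and by the commonly cited convention the last five digits carry the NVIDIA release number. This is a minimal Python sketch of that conversion, assuming the convention holds for the driver in question. The function name is made up for illustration.

```python
def nvidia_release_from_windows_version(win_version):
    """Convert a four-part Windows driver version to an NVIDIA-style release.

    Assumption: the last five digits of the third and fourth fields,
    read together, are the release number (e.g. ...14.5206 -> 452.06).
    """
    parts = win_version.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like 27.21.14.5206, got {win_version!r}")
    digits = (parts[2] + parts[3])[-5:]
    if len(digits) < 5:
        raise ValueError(f"not enough digits in {win_version!r}")
    return f"{digits[:3]}.{digits[3:]}"


if __name__ == "__main__":
    print(nvidia_release_from_windows_version("27.21.14.5206"))
```

Under that convention 27.21.14.5206 would correspond to release 452.06, so the four-part number by itself may not show where the driver came from. Checking the installed package against the one offered on Nvidia's download page is still the safer test.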

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,363,897,676
RAC: 29,196,793
Message 55285 - Posted: 13 Sep 2020 | 5:26:21 UTC - in response to Message 55284.

> I think M$ updated your driver to their version which does not have OpenCL support and possibly normal CUDA support. Your reported driver version is NOT a normal Nvidia driver version.
>
> Remove the M$ driver and download the correct latest driver directly from Nvidia.

I had the same experience some time ago when I crunched Folding@Home (during a lengthy period with no GPUGRID tasks available).
After I replaced the MS driver with the original driver directly from NVIDIA, everything worked fine.

However: OpenCL is not needed for GPUGRID tasks, is it?

curiously_indifferent
Joined: 20 Nov 17
Posts: 21
Credit: 1,525,941,493
RAC: 4,167,939
Message 55286 - Posted: 13 Sep 2020 | 16:06:40 UTC - in response to Message 55284.

> > Reboot did not solve the issue. It is crashing on Milkyway@Home also. I guess the GPU (RTX 2080 Ti) has issues. The driver version is 27.21.14.5206 and it ran fine on that for several weeks.
> >
> > This GPU is fairly new; it is disappointing if it is failing already.
>
> I think M$ updated your driver to their version which does not have OpenCL support and possibly normal CUDA support. Your reported driver version is NOT a normal Nvidia driver version.
>
> Remove the M$ driver and download the correct latest driver directly from Nvidia.


I downloaded this Nvidia driver a few weeks ago. The GPU ran for a week or two without issue using this driver. I have another RTX GPU that seems to be ok with this driver. I changed drivers to check, but the GPU was still bad.

I am fairly certain the GPU has failed.

Nick Name
Joined: 3 Sep 13
Posts: 53
Credit: 1,533,531,731
RAC: 0
Message 55287 - Posted: 13 Sep 2020 | 19:19:46 UTC

I had a similar problem once and it turned out the PSU had a couple of burnt pins. You can check that along with the power cable(s), but if you used the same connection for the other cards you tested, it can probably be ruled out.
____________
Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!
