On one of my computers, every task has started to fail. I just restarted the system. Is there any way to get a new task now? Any ideas on what happened? This system has been running fine for weeks.
http://www.gpugrid.net/show_host_detail.php?hostid=117970
thank you
|
|
|
nenym
Joined: 31 Mar 09 Posts: 137 Credit: 1,308,230,581 RAC: 0
|
"is there any way to get a new task now?"
I know a slightly odd way, but it affects your host's statistics:
- detach from GPUGRID
- change the hostname
- reboot
- attach to GPUGRID
Your other attached projects will only see a changed hostname (as far as I remember).
You may run into problems if the host is connected to a LAN.
|
|
|
|
Thanks for the hack. I want to keep my stats so I will just let the machine idle for a day or so.
Did anyone else have this issue? |
|
|
|
It looks like tasks continue to fail. Does anyone have any ideas of what might be wrong with this host?
Thx
|
|
|
|
The original clock rate was 1.88 GHz. Now it's 1.46 GHz and still failing. Is this the same card? Try underclocking the memory by 20%. Does it run other projects OK?
|
|
|
It is the same card in the same computer. I lowered the clock rate to see if that would correct the condition.
That computer is down now. It should be running again this weekend. We will see if power was an issue.
thx |
|
|
|
Same problem on a different host. Could the 275.33 drivers be the issue? I have a different host with 285 drivers and it appears to be working fine.
Any help is appreciated.
http://www.gpugrid.net/show_host_detail.php?hostid=119703 |
|
|
|
The hosts appear to be working correctly again. Were the work units bad? |
|
|
RichF
Joined: 6 Jan 09 Posts: 7 Credit: 5,741,255 RAC: 0
|
All my WUs have been failing for the past couple of days, too. Is this a widespread problem, and how can we fix it? Thanks. |
|
|
Old man
Joined: 24 Jan 09 Posts: 42 Credit: 16,676,387 RAC: 0
|
Tasks have failed here as well.
Name 9px10-MJHARVEY_MJHXA1-8-30-RND0616_5
Workunit 3395291
Created 5 May 2012 | 11:53:47 UTC
Sent 5 May 2012 | 15:23:02 UTC
Received 5 May 2012 | 15:26:05 UTC
Server state Over
Outcome Computation error
Client state Computation error
Exit status 98 (0x62)
Computer ID 123486
Report deadline 10 May 2012 | 15:23:02 UTC
Run time 2.70
CPU time 0.80
Validate state Invalid
Credit 0.00
Application version ACEMD2: GPU molecular dynamics v6.16 (cuda31)
Stderr output
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.21 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 1: "GeForce GTX 260"
# Clock rate: 1.30 GHz
# Total amount of global memory: 891748352 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47792) expected
ERROR: Unable to read bincoordfile
called boinc_finish
</stderr_txt>
]]>
Name 9px10-MJHARVEY_MJHXA1-8-30-RND0616
Application ACEMD2: GPU molecular dynamics
Created 4 May 2012 | 14:27:28 UTC
Minimum quorum 1
Initial replication 1
Max # of error/total/success tasks 7, 10, 6
Task      Computer   Sent                         Time reported or deadline    Status                     Run time (sec)   CPU time (sec)   Credit   Application
5326942   124335     4 May 2012 | 17:49:33 UTC    4 May 2012 | 17:54:12 UTC    Error while downloading    0.00             0.00             ---      ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5327658   112695     4 May 2012 | 20:20:30 UTC    4 May 2012 | 21:24:15 UTC    Error while computing      2.07             0.41             ---      ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5328368   124628     5 May 2012 | 2:08:11 UTC     5 May 2012 | 2:14:55 UTC     Error while computing      7.75             0.00             ---      ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329342   105945     5 May 2012 | 5:41:17 UTC     5 May 2012 | 5:48:23 UTC     Error while computing      3.67             0.81             ---      ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329857   102639     5 May 2012 | 11:26:08 UTC    5 May 2012 | 11:53:44 UTC    Error while computing      2.15             0.53             ---      ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5330904   123486     5 May 2012 | 15:23:02 UTC    5 May 2012 | 15:26:05 UTC    Error while computing      2.70             0.80             ---      ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5331534   ---        ---                          ---                          Unsent                     ---              ---              ---
As you can see, everyone else has failed to run this task as well :-(
|
|
RichF
Joined: 6 Jan 09 Posts: 7 Credit: 5,741,255 RAC: 0
|
Here is the error message I've been getting. Any help would be appreciated.
Stderr output
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 1
SWAN: FATAL : Unable to enumerate devices
Assertion failed: 0, file swanlib_nv.c, line 390
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]> |
|
|
|
I have also had failed workunits this week. Of the last five, three have failed: the first on Wednesday, the next on Friday and the latest tonight. For all of them there are messages like these in the BOINC log:
5.5.2012 23:56:56 GPUGRID Computation for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 finished
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_1 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_2 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_3 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
The following ACEMD2 workunit failed on Friday:
1x21-MJHARVEY_MJHXA1-8-30-RND8065_0
Stderr output
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -99 (0xffffff9d)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Using device 0
I also run Einstein@home, where I have about 2 million credit points, and their WUs have never failed. My graphics card is a Gigabyte GTX 560Ti 448, which runs at the reference clock speed of 723 MHz, and temps are between 70 and 75 C. I have lowered the fan speed with MSI Afterburner. It has been running GPUGRID workunits for about a week now. So should I suspect my computer as the cause of these failures?
Thank you
|
|
|
|
I got a failed WU because of: MDIO: cannot open file "output.restart.coor"
First time I've ever seen that. WU completed fine but errored when it tried to upload. No anti virus or backup running. Just a basic Win 7 install for crunching. This sucks, 21 hours wasted on a most likely valid WU because of a locked or disappearing file.
I see Mika_at_home has a similar error in his post above: MDIO: cannot open file "restart.coor". Is this happening to anyone else? Seems like a rash of errors recently. Could this be something needing fixing?
Stderr output
<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.90 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 8
# Number of cores: 64
MDIO: cannot open file "output.restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish
</stderr_txt>
]]>
NM* |
|
|
|
The MDIO: cannot open file "output.restart.coor" message is not a real error; it appears in every task, even the successful ones.
Your real error message is ERROR: get_Dvec() element 0 (b), and I think such an error cannot be caused by the upload, nor by "a locked or disappearing file". This error happened while the WU was being processed, probably near its completion, which is why it looks as if it was caused by the upload.
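The triage described above can be sketched in a few lines of Python. This is only an illustration: the list of benign messages contains just the one message named in this post, and the "fatal" prefixes are taken from the stderr dumps quoted in this thread, not from any ACEMD documentation. The function name real_errors is made up for the example.

```python
import re

# Messages this thread identifies as harmless (they also show up in successful tasks).
# Assumption: only the exact message named in the post above is treated as benign.
BENIGN_PATTERNS = [
    re.compile(r'^MDIO: cannot open file "output\.restart\.coor"$'),
]

# Prefixes that mark a genuine failure in the logs quoted in this thread (assumed list).
FATAL_PREFIXES = ("ERROR:", "SWAN: FATAL", "Assertion failed")


def real_errors(stderr_text):
    """Return the stderr lines that look like genuine failures."""
    found = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if any(p.match(line) for p in BENIGN_PATTERNS):
            continue  # known noise, skip it
        if line.startswith(FATAL_PREFIXES):
            found.append(line)
    return found


sample = """# Using device 0
MDIO: cannot open file "output.restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish"""

print(real_errors(sample))
```

For the log posted above, this leaves only the get_Dvec() line, which is the one worth reporting.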
|
|
|
Several of my work units failed very near the end of the calculation process. Any ideas on why? The clock rate has been reduced to see if that will correct the issue.
thank you |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Mika_at_home, it seems that your tasks are getting suspended and resumed many times. I think there is more chance of failures running this way. I suggest you configure Boinc Manager to allow GPU tasks to run when the system is in use.
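One way to see how often a task was suspended and resumed is to count the device banners in its stderr output. The sketch below assumes that every (re)start of the application prints a "# Using device N" line exactly once, which matches the logs quoted in this thread but is not a documented guarantee. The function name count_starts is made up for the example.

```python
import re

# Assumption: each (re)start of the app prints one "# Using device N" line.
BANNER = re.compile(r"^# Using device \d+$")


def count_starts(stderr_text):
    """Count how many times the application (re)started, judged by device banners."""
    return sum(1 for line in stderr_text.splitlines() if BANNER.match(line.strip()))


sample = """# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
MDIO: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Using device 0"""

starts = count_starts(sample)
print(f"{starts} starts, {max(starts - 1, 0)} resumes")
```

A task that ran without interruption would show a single banner; several banners in one log mean the task was stopped and restarted in between.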
All of the MJHARVEY tasks that failed on your system also failed on at least one other system, and some failed repeatedly on many systems, which suggests an issue with the tasks themselves; their workunits are flagged with the error 'Too many errors (may have bug)'.
Sometimes these issues are very difficult to track down, as they appear only rarely and only on some combinations of operating system, driver and GPU. In the 'Too many errors' case above, however, the problem seems independent of GPU, driver and operating system, and my guess is that it was a badly built task:
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47525) expected
ERROR: Unable to read bincoordfile
I would be more concerned about tasks that fail after 10K sec than about those that fail after 2 sec.
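For anyone who wants to check a downloaded input.coor by hand, here is a rough Python sketch. It rests on an assumption that is not confirmed anywhere in this thread: that the file uses a NAMD-style binary coordinate layout, i.e. a 4-byte little-endian signed atom count followed by three 8-byte doubles per atom. The function name check_bincoor is made up for the example.

```python
import os
import struct


def check_bincoor(path, expected_atoms):
    """Compare the atom count stored in the file header with the expected value.

    Assumed layout (NAMD-style, not confirmed for ACEMD):
    int32 atom count, then 3 float64 values per atom.
    """
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False, "file shorter than 4 bytes"
    (n_atoms,) = struct.unpack("<i", header)
    if n_atoms != expected_atoms:
        return False, f"number of atoms ({n_atoms}) != ({expected_atoms}) expected"
    expected_size = 4 + 24 * n_atoms  # header + 3 doubles per atom
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        return False, f"file size {actual_size} != {expected_size} expected"
    return True, "ok"


# Usage (the path and atom count are placeholders):
# ok, reason = check_bincoor("input.coor", 47525)
```

A nonsensical count such as a large negative number in the header points to a damaged or wrongly built input file, not to a problem with the GPU.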
Paul Raney, as different task types are failing on your system it's more likely that the issue is a setup one (GPU clock, overuse of CPU, interference from another program...). 'Energies have become nan' is often symptomatic of a GPU issue with the clock, voltage or temps (but may also be linked to overuse of the CPU).
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
wiyosaya
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Running on Windows, one thing that I have done is to turn off the BOINC screen saver. After doing so, I have rarely had any GPUGrid WUs report a computation error.
On most PCs these days with LCD monitors, screen savers are only eye candy, as LCD monitors do not suffer from burn-in as tube-based monitors did.
To elaborate a bit further, I set my screen saver to NONE on the two machines where I currently run GPUGrid. I am bringing a third machine online in the next week or so, and I will turn off the screen saver on that one, too.
____________
|
|
|
|
skgiven, thanks for your analysis and advice. I have now completed one ACEMD2 workunit with the GPU task always running. It didn't cause any problems, at least with web browsing and e-mail use. I also changed my screensaver to a simpler standard Windows screensaver. Now I will get my Einstein GPU WUs completed quicker, too. :)
-Mika
|
|
|
lohphat
Joined: 21 Jan 10 Posts: 44 Credit: 1,312,089,502 RAC: 6,463,713
|
All my GPUGRID WUs are failing. I even replaced my 9800 GTX with a GTX 680 and the WUs fail in less than 5 seconds.
I suspect the Nvidia driver. 301.10 is the only driver for the GTX 680. |
|
|
5pot
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0
|
gtx 680 for Windows has not been released yet. They are currently working on it. The Linux version was just released for beta; when the Windows version is released, it will be in beta as well.
|
|
lohphat
Joined: 21 Jan 10 Posts: 44 Credit: 1,312,089,502 RAC: 6,463,713
|
I just updated to the 301.24 beta driver and the WU failed again within 3 seconds.
Win7 x64
GTX 680
24GB RAM
i7 960 CPU |
|
|
5pot
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0
|
Read my above post.
Patience. |
|
|
lohphat
Joined: 21 Jan 10 Posts: 44 Credit: 1,312,089,502 RAC: 6,463,713
|
I don't fully understand your post:
"gtx 680 for Windows has not been released yet"
Yes it has. I have one. Installed. Running.
Or are you referring to the GPUGRID software compatible with the GTX 680 CUDA code? |
|
|
5pot
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0
|
Sorry for the misunderstanding.
Currently, the CUDA app used for both the short and long tasks does not work on the 680 under either Windows or Linux. A beta app is in progress that supports the Kepler series (CUDA 4.2) and also gives shorter runtimes on other series. However, this beta app is Linux-only at the moment.
Keep an eye out on the Graphics Cards section for when the Windows version will be released.
FYI: Some projects can run on Kepler, but they are not optimized for it (they don't work as well as they could). For now, my 680 is on Einstein@Home until GPUGRID releases the beta app for Windows.
Hope this cleared things up. |
|
|
lohphat
Joined: 21 Jan 10 Posts: 44 Credit: 1,312,089,502 RAC: 6,463,713
|
It did. Thanks! |
|
|
lohphat
Joined: 21 Jan 10 Posts: 44 Credit: 1,312,089,502 RAC: 6,463,713
|
I've been getting the small beta (cuda42) WUs today, which take < 2 min, and they all seem to have processed and uploaded properly.
|
|
|
Guys:
Any ideas on why this one failed? http://www.gpugrid.net/workunit.php?wuid=3472235
I just moved all my cards around, so it might just be fallout from the changes.
____________
Thx - Paul
Note: Please don't use driver version 295 or 296! Recommended versions are 266 - 285. |
|
|
|
Here is another weird failure
http://www.gpugrid.net/result.php?resultid=5459110
____________
Thx - Paul
|
|
nenym
Joined: 31 Mar 09 Posts: 137 Credit: 1,308,230,581 RAC: 0
|
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 570"
# Clock rate: 1.88 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Device 1: "GeForce GTX 570"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1275789312 bytes
# Number of multiprocessors: 15
# Number of cores: 120
Maybe it's the OC of device 0.
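The comparison above (two identical cards reporting different clock rates) can be automated with a small Python sketch. It assumes the banner format shown in the stderr dumps in this thread; the function names device_clocks and clock_outliers are made up for the example. A higher clock on one of two identical cards is only a hint of an overclock, not proof.

```python
import re

# Patterns follow the banner lines quoted in this thread (assumed stable format).
DEVICE = re.compile(r'^# Device (\d+): "(.+)"$')
CLOCK = re.compile(r"^# Clock rate: ([\d.]+) GHz$")


def device_clocks(stderr_text):
    """Return {device_index: (name, clock_ghz)} parsed from the device banner."""
    devices = {}
    current = None
    for line in stderr_text.splitlines():
        line = line.strip()
        m = DEVICE.match(line)
        if m:
            current = (int(m.group(1)), m.group(2))
            continue
        m = CLOCK.match(line)
        if m and current is not None:
            devices[current[0]] = (current[1], float(m.group(1)))
            current = None
    return devices


def clock_outliers(devices):
    """Among cards with the same model name, list devices clocked above the slowest one."""
    slowest = {}
    for name, ghz in devices.values():
        slowest[name] = min(ghz, slowest.get(name, ghz))
    return [idx for idx, (name, ghz) in sorted(devices.items()) if ghz > slowest[name]]


banner = """# Device 0: "GeForce GTX 570"
# Clock rate: 1.88 GHz
# Device 1: "GeForce GTX 570"
# Clock rate: 1.62 GHz"""

print(clock_outliers(device_clocks(banner)))
```

For the banner quoted above, device 0 is the one flagged, which matches the guess in this post.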
|
|
|
name 249-GIANNI_TEST7-0-5-RND2924
application ACEMD beta version
created 2 Jun 2012 | 9:34:56 UTC
I'm using an XPS 720 (4 cores, 64-bit Windows 7) with a GTX 560 Ti to work on this latest batch of work units. All have failed! Will you please acknowledge and advise?
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Francis Butts, I suggest you stop trying to crunch the Beta App. We know there is a problem with it (lots of posts say the 6.45 app fails tasks).
http://www.gpugrid.net/prefs.php?subset=project
Run test applications? No
|
|