Message boards : Number crunching : failing tasks lately
Author | Message |
---|---|
This afternoon, I had 4 tasks in a row which failed after a few seconds; see here: http://www.gpugrid.net/results.php?userid=125700&offset=0&show_names=1&state=0&appid= | |
ID: 52174 | Rating: 0 | rate: / Reply Quote | |
I've had three failed tasks over the last two days, but all the others have run normally. All the failed tasks had PABLO_V3_p27_sj403_IDP in their name. | |
ID: 52176 | Rating: 0 | rate: / Reply Quote | |
Yes, part of the PABLO_V3_p27_sj403_IDP series seems to be erroneous. | |
ID: 52177 | Rating: 0 | rate: / Reply Quote | |
The server status page shows an error rate of 56.37% for them. Which is high, isn't it? Overnight, the failure rate has risen to 57.98%. The remaining tasks from this series should be cancelled from the queue. | |
ID: 52179 | Rating: 0 | rate: / Reply Quote | |
The server status page shows an error rate of 56.37% for them. Which is high, isn't it? Meanwhile, the failure rate has passed the 60% mark. It's 60.12%, to be exact. And these faulty tasks are still in the download queue. WHY??? | |
ID: 52182 | Rating: 0 | rate: / Reply Quote | |
I thought we'd got rid of these, but I've just sent back e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0 - note the _0 replication. I was the first victim since the job was created at 11:25:23 UTC today, seven more to go. | |
ID: 52189 | Rating: 0 | rate: / Reply Quote | |
The failure rate now is close to 64%, so it's still climbing up. | |
ID: 52194 | Rating: 0 | rate: / Reply Quote | |
The failure rate now is close to 64%, so it's still climbing up. It's a holiday. Some admins won't even cancel tasks like that even if they are active; some will just let them error out the max # of times. | |
ID: 52195 | Rating: 0 | rate: / Reply Quote | |
Some will just let them error out the max # of times. The bad thing is that once a host gets more than 2 or 3 such faulty tasks in a row, it is considered unreliable and will no longer receive tasks for the next 24 hours. So the host is penalized for something that is not its fault. What surprises me even more is that the GPUGRID people don't seem to care :-( | |
ID: 52197 | Rating: 0 | rate: / Reply Quote | |
the failure rate has passed the 70% mark now. Great ! | |
ID: 52204 | Rating: 0 | rate: / Reply Quote | |
Meanwhile, the failure rate has passed the 75% mark. It is now 75.18%, to be exact. | |
ID: 52208 | Rating: 0 | rate: / Reply Quote | |
If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now. | |
ID: 52210 | Rating: 0 | rate: / Reply Quote | |
If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now. The question isn't whether or not I am unhappy. The question is rather what makes sense and what doesn't. Don't you think the only real solution to the problem would logically be to simply withdraw the remaining tasks of this faulty series from the download queue? Or can you explain the rationale for leaving them there? In a few more weeks, when all these tasks are used up, the error rate will be 100%. How does that serve the project? As I explained before: once a host happens to download such a faulty task 2 or 3 times in a row, it is blocked for 24 hours. So what sense does that make? | |
ID: 52211 | Rating: 0 | rate: / Reply Quote | |
So far as I can tell from my account pages, my machines are processing GPUGrid tasks just fine and at the normal rate. | |
ID: 52215 | Rating: 0 | rate: / Reply Quote | |
My machine has also failed numerous GPUGrid tasks lately, running on 2 GTX 1070 cards (individual, not SLI'd). | |
ID: 52216 | Rating: 0 | rate: / Reply Quote | |
http://www.gpugrid.net/result.php?resultid=7412820 This WU is from 2013.
I'll be skipping GPUGrid tasks from now on until it is resolved, as it is wasting CPU/GPU time that I can use for other projects on the machine.
The 3 recent errors wasted 17 seconds on your host in the past 4 days, so there's no reason to panic (even though your host didn't receive work for 3 days).
I'll refer back to these forums to check on updates though so I know when to restart GPUGRID tasks.
The project is running fine apart from this one bad batch, so you can do that right away. The number of resends may increase as this bad batch runs out, and that may cause a host to be "blacklisted" for 24 hours, but it takes many failing workunits in a row (so it is unlikely to happen, as the maximum number of daily workunits gets reduced by 1 after each error). The daily limit of the "Long runs (8-12 hours on fastest card) 9.22 windows_intelx86 (cuda80)" app for your host is currently 28, so this host would have to be extremely unlucky and receive 28 bad workunits in a row to get "banned" for 24 hours. | |
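As an illustration only, here is a minimal Python sketch of the daily-quota behaviour described above (not the actual BOINC server code): the "minus 1 per error" rule is taken from this post, and the recovery-on-success rule is an assumption.

MAX_DAILY_QUOTA = 28  # per-host limit quoted above for the cuda80 long-run app

def remaining_quota(outcomes, quota=MAX_DAILY_QUOTA):
    """outcomes: sequence of 'ok' / 'error' results in the order they were reported."""
    for outcome in outcomes:
        if outcome == "error":
            quota = max(0, quota - 1)                 # each error lowers the quota by 1
        else:
            quota = min(MAX_DAILY_QUOTA, quota * 2)   # assumed: a valid result restores it quickly
    return quota

print(remaining_quota(["error"] * 3))    # 25 -> the host still gets work
print(remaining_quota(["error"] * 28))   # 0  -> no new work for roughly 24 hours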
ID: 52217 | Rating: 0 | rate: / Reply Quote | |
Oops my bad, i sorted the tasks by 'errored' and mixed up the ones to paste. | |
ID: 52218 | Rating: 0 | rate: / Reply Quote | |
There are two more 'bad' batches at the moment in the 'long' queue: | |
ID: 52342 | Rating: 0 | rate: / Reply Quote | |
any idea why all tasks downloaded within the last few hours fail immediately? | |
ID: 52385 | Rating: 0 | rate: / Reply Quote | |
any idea why all tasks downloaded within the last few hours fail immediately?
No idea, but it's the same for others. I'm using Win7 Pro, and the work units crash at once:

Stderr output
<core_client_version>7.10.2</core_client_version>
<![CDATA[
<message> (unknown error) - exit code -44 (0xffffffd4)</message>
]]>

07.08.2019 14:17:11 | GPUGRID | Sending scheduler request: To fetch work.
07.08.2019 14:17:11 | GPUGRID | Requesting new tasks for NVIDIA GPU
07.08.2019 14:17:13 | GPUGRID | Scheduler request completed: got 1 new tasks
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:21 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:21 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:30 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:30 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:19:22 | GPUGRID | Starting task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4
07.08.2019 14:19:29 | GPUGRID | Computation for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 finished
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_0 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_1 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_2 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_3 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:37 | GPUGRID | Started upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
07.08.2019 14:19:39 | GPUGRID | Finished upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7

Another member of our team has the same problem on Win10. I'd really like to compare this with Linux, but I haven't received any work unit on my Debian machine for weeks.
____________
Greetings, Jens | |
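For anyone who wants to spot these instant failures in their own logs, a small hypothetical helper, a sketch only: it assumes the BOINC event log is saved to a text file with the same timestamp format as the excerpt above, and the file name is an assumption that depends on your BOINC data directory.

import re
from datetime import datetime

LOG_FILE = "stdoutdae.txt"  # assumed path of a saved BOINC event log; adjust as needed

started, finished = {}, {}
pattern = re.compile(r"(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) \| GPUGRID \| "
                     r"(Starting task|Computation for task) (\S+)")

with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%d.%m.%Y %H:%M:%S")
        task = m.group(3)
        (started if m.group(2) == "Starting task" else finished)[task] = ts

for task, t0 in started.items():
    if task in finished:
        runtime = (finished[task] - t0).total_seconds()
        if runtime < 60:  # a task that "finishes" this quickly has almost certainly errored out
            print(f"{task} failed after {runtime:.0f} s")

Run against the excerpt above it would flag the RND1985_4 task as failing after about 7 seconds.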
ID: 52386 | Rating: 0 | rate: / Reply Quote | |
any idea why all tasks downloaded within the last few hours fail immediately? Yes, I had checked that before I wrote my post above. I wonder whether the GPUGRID team has noticed this problem yet. | |
ID: 52390 | Rating: 0 | rate: / Reply Quote | |
Same here, all WUs fail with the same error code. | |
ID: 52392 | Rating: 0 | rate: / Reply Quote | |
It seems that the license for Windows 10 (and maybe for Windows 7/8, too) has expired. | |
ID: 52400 | Rating: 0 | rate: / Reply Quote | |
any idea why all tasks downloaded within the last few hours fail immediately? Things left to themselves tend to go from bad to worse. | |
ID: 52405 | Rating: 0 | rate: / Reply Quote | |
Several more tasks with computation errors, but nothing definite about just what kind of error. | |
ID: 52407 | Rating: 0 | rate: / Reply Quote | |
Same here... | |
ID: 52410 | Rating: 0 | rate: / Reply Quote | |
I actually got one to finish successfully: | |
ID: 52411 | Rating: 0 | rate: / Reply Quote | |
I actually got one to finish successfully: So it's clear that the license has expired. Changing the host's date can indeed be tricky, all the more so if other BOINC projects are running, which could be thoroughly confused by doing this. That happened to me the last time the license expired; it all ended up in a total mess. Let's hope it won't take too long until there is a new acemd with a valid license. | |
ID: 52415 | Rating: 0 | rate: / Reply Quote | |
I actually got one to finish successfully: I thought one of the reasons for the new app was to not need the license that keeps expiring. Plus Turing support in a BOINC wrapper to separate the science part from the BOINC part. | |
ID: 52416 | Rating: 0 | rate: / Reply Quote | |
They are not using the new app yet; the reason the app expired is that it's still the old app. | |
ID: 52418 | Rating: 0 | rate: / Reply Quote | |
They are not using the new app yet, the reason the app expired is because it's still the old app. And? I was replying to this part "new acemd with a valid license." The new app won't need a license from what I recall. | |
ID: 52419 | Rating: 0 | rate: / Reply Quote | |
I've seen some mentions of tasks still completing properly on some rather old versions of Windows, such as Windows XP. Could some people with at least one computer with such a version give more details? | |
ID: 52432 | Rating: 0 | rate: / Reply Quote | |
the "older versions" also include an expiration check. | |
ID: 52433 | Rating: 0 | rate: / Reply Quote | |
I'm using Win XP 64 and having just errors as well. | |
ID: 52436 | Rating: 0 | rate: / Reply Quote | |
No, you are using Windows 7 x64. | |
ID: 52437 | Rating: 0 | rate: / Reply Quote | |
Stderr output | |
ID: 52472 | Rating: 0 | rate: / Reply Quote | |
Hello everyone, | |
ID: 52474 | Rating: 0 | rate: / Reply Quote | |
No, you are using Windows 7 x64. You are right, my bad. But I was having errors with the new drivers. Then I rolled back to the 378.94 driver and it's running fine now. http://www.gpugrid.net/show_host_detail.php?hostid=413063 http://www.gpugrid.net/workunit.php?wuid=16717273 | |
ID: 52500 | Rating: 0 | rate: / Reply Quote | |
Hello everyone, That's fixed now. But the errors continue: 2 seconds into a Pablo unit and poof, they error out. I turned off the long-run units, and it seems there aren't any short-run units for the GPUs to do. | |
ID: 52503 | Rating: 0 | rate: / Reply Quote | |
But the errors continue, 2 seconds into a Pablo unit and poof they error out mikey, the tasks with errors were run on a Turing-based card (GTX 1660 Ti). These GPUs are not currently supported by the ACEMD2 app. The admins are working on the ACEMD3 app, which will support Turing-based GPUs. Hopefully it will be released soon. There are currently no short tasks in the queue. | |
ID: 52504 | Rating: 0 | rate: / Reply Quote | |
the faulty tasks seem to be back (erroring out after a few seconds): | |
ID: 52534 | Rating: 0 | rate: / Reply Quote | |
I had a task fail after a few seconds. | |
ID: 52807 | Rating: 0 | rate: / Reply Quote | |
At least it went wrong for everyone, not just for you. A bad workunit. | |
ID: 52808 | Rating: 0 | rate: / Reply Quote | |
here another one, from this morning, with error message: | |
ID: 52819 | Rating: 0 | rate: / Reply Quote | |
Same here | |
ID: 52820 | Rating: 0 | rate: / Reply Quote | |
Same here. Until the new app (ACEMD3) is released, you should assign this host to a venue that receives work only from the ACEMD3 queue, as the other two queues run the old client, which is incompatible with Turing cards. | |
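If changing venues is inconvenient, a client-side alternative is BOINC's <exclude_gpu> option in cc_config.xml, which keeps a given GPU away from specific applications. This is a rough sketch only: the app short names "acemdlong" and "acemdshort" are assumptions here and should be checked against the <app> entries in GPUGRID's section of client_state.xml, and device_num must match the Turing card on your host.

<cc_config>
  <options>
    <!-- keep the Turing card (device 0 here) away from the old ACEMD2 queues -->
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
      <app>acemdlong</app>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
      <app>acemdshort</app>
    </exclude_gpu>
  </options>
</cc_config>

After saving the file in the BOINC data directory, re-read the config files (or restart the client) for the exclusions to take effect.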
ID: 52822 | Rating: 0 | rate: / Reply Quote | |
obviously, the faulty tasks are back, here the next one from a minute ago: | |
ID: 52829 | Rating: 0 | rate: / Reply Quote | |
the next ones: | |
ID: 52892 | Rating: 0 | rate: / Reply Quote | |
and here some more: | |
ID: 52893 | Rating: 0 | rate: / Reply Quote | |
I think the license of the v9.22 app has expired this time. | |
ID: 52894 | Rating: 0 | rate: / Reply Quote | |
I think the license of the v9.22 app has expired this time. That's what I'm now suspecting, too :-( | |
ID: 52895 | Rating: 0 | rate: / Reply Quote | |
Any prediction when a continuous supply of new WUs will become available again? | |
ID: 52896 | Rating: 0 | rate: / Reply Quote | |
this is an increasingly annoying situation: | |
ID: 52923 | Rating: 0 | rate: / Reply Quote | |
Hi: | |
ID: 52928 | Rating: 0 | rate: / Reply Quote | |
I would like to contribute some useful results here with my Alienware laptop
I am afraid that laptop GPUs are not made for this kind of load :-( | |
ID: 52929 | Rating: 0 | rate: / Reply Quote | |
I would like to contribute some useful results here with my Alienware laptop | |
ID: 52930 | Rating: 0 | rate: / Reply Quote | |
I would like to contribute some useful results here with my Alienware laptop
I am afraid that laptop GPUs are not made for this kind of heavy load :-( | |
ID: 52931 | Rating: 0 | rate: / Reply Quote | |
My Dell G7 15 laptop is happily crunching. It's another matter that I have to give it a blast of air every day to get the dust out. | |
ID: 52932 | Rating: 0 | rate: / Reply Quote | |
Hi: The issue is with the scheduler on the GPUgrid servers. The scheduler is sending CUDA65 tasks to your laptop, all of which will fail due to an expired license (server end). Your laptop can process CUDA80 tasks, but you are at the mercy of the scheduler. For most hosts it sends the correct tasks; for a handful of hosts it is sending the wrong ones. This issue tends to affect Kepler GPUs (600-series), even though they are still supported. Some relevant posts discussing this issue are here: http://www.gpugrid.net/forum_thread.php?id=5000&nowrap=true#52924 http://www.gpugrid.net/forum_thread.php?id=5000&nowrap=true#52920 The project is in the middle of changing the application to a newer version; hopefully, when the new application (ACEMD3) is released, these issues will be smoothed out. | |
ID: 52933 | Rating: 0 | rate: / Reply Quote | |
... when the new Application is released (ACEMD3)... I am curious WHEN this will be the case. | |
ID: 52934 | Rating: 0 | rate: / Reply Quote | |
... when the new Application is released (ACEMD3)... I think you speak for all of us on this point.... | |
ID: 52935 | Rating: 0 | rate: / Reply Quote | |
At one point, when I saw the acemd2 long-task buffer dwindle down, I thought it was in preparation for the project deprecating the acemd2 applications and moving on to the new acemd3 applications. | |
ID: 52936 | Rating: 0 | rate: / Reply Quote | |
Sorry for my English. | |
ID: 52943 | Rating: 0 | rate: / Reply Quote | |
Sorry for my English. This post here applies to your issues as well: http://www.gpugrid.net/forum_thread.php?id=4954&nowrap=true#52933 | |
ID: 52944 | Rating: 0 | rate: / Reply Quote | |
Thank you. | |
ID: 52952 | Rating: 0 | rate: / Reply Quote | |
Sorry for my English. | |
ID: 53135 | Rating: 0 | rate: / Reply Quote | |
Is my graphics card too old for this? :\ Yes. | |
ID: 53145 | Rating: 0 | rate: / Reply Quote | |
Message boards : Number crunching : failing tasks lately