Author |
Message |
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
I don't see a thread about the current NOELIAs, but they are erroring on all my rigs. Please?
They locked the Noelia thread, too many complaints I'd guess. Here's some of the NOELIA WUs I've gotten lately:
http://www.gpugrid.net/workunit.php?wuid=4629901
http://www.gpugrid.net/workunit.php?wuid=4630935
http://www.gpugrid.net/workunit.php?wuid=4630045
http://www.gpugrid.net/workunit.php?wuid=4631659
Notice they've all been sent out 8 times with no successes and are now marked: "Too many errors (may have bug)". That's the understatement of the month. It's getting so that when a NOELIA WU downloads the Daleks in the basement get all excited and start yelling "EXTERMINATE, EXTERMINATE!". Gets kind of noisy sometimes... |
|
|
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
The last 5 Noelia's I got, finished all without error. I like to take them all...
You must have forgotten this one:
http://www.gpugrid.net/workunit.php?wuid=4633344 |
|
|
Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
I explained the reasons why I locked Noelia's thread in the last post of the thread. You can look it up.
Otherwise, the failure rate of the WU's is at acceptable values. Unfortunately she is simulating more complex stuff (bigger systems) than Santi's WU's right now, so they do have more failures than Santi's. |
|
|
|
I explained the reasons why I locked Noelia's thread in the last post of the thread. You can look it up.
You can avoid the crunchers complaining in the news thread by opening a new thread (named by the batch) for each and every new batch of workunits in a specialized topic (named "Workunit Batch Problems" or similar). It's too bad that there is no such topic yet. Usually we notice new batches when they're starting to fail on our hosts - obviously the topic of the problem-free batches would be empty, but it is good to know where to look (or post) in the forums if someone experiences failures with a batch, and to know how many others have problems with that batch. In my opinion it is your job to start such threads before we even receive the first workunits of them.
Otherwise, the failure rate of the WU's is at acceptable values. Unfortunately she is simulating more complex stuff (bigger systems) than Santi's WU's right now, so they do have more failures than Santi's.
I think you (the staff) and we (the crunchers) have different ideas about what is an acceptable failure rate.
There are some factors which make a host more prone to errors, making our judgement worse than yours:
- hosts running Windows (especially 7 and 8)
- hosts running multiple GPUs
- overclocked (even factory overclocked) GPUs (it is more complicated to overclock a Kepler than a Fermi based card.)
- hosts having updated drivers
- hosts which had an error and weren't rebooted afterwards
- we do like to pay the cost of crunching, but we don't like to pay the cost of failures
I am aware that none of the above is directly your concern, so it would be nice to have a topic where we can discuss our problems with each other without disturbing the news (or other) thread. |
|
|
|
You can avoid the crunchers complaining in the news thread by opening a new thread (named by the batch) for each and every new batch of workunits in a specialized topic (named "Workunit Batch Problems" or similar). It's too bad that there is no such topic yet. Usually we notice new batches when they're starting to fail on our hosts - obviously the topic of the problem-free batches would be empty, but it is good to know where to look (or post) in the forums if someone experiences failures with a batch, and to know how many others have problems with that batch. In my opinion it is your job to start such threads before we even receive the first workunits of them.
That's a very good suggestion and goes hand in hand with what was also requested lately: some information about runs / batches. What is being simulated, what are special requirements?
That would be of practical use and make us feel more involved. It should be worth the time to set these threads up.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Both very nice ideas actually. I will pass it along that everyone creates a new "problems" thread when he sends a batch.
Yes you do make a point on the failure rate. It's a big discussion though so maybe it should also be moved.
My current problem is the lack of a specific subforum for this. I will see if I can somehow get a new one made. |
|
|
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
I explained the reasons why I locked Noelia's thread in the last post of the thread. You can look it up.
Otherwise, the failure rate of the WU's is at acceptable values. Unfortunately she is simulating more complex stuff (bigger systems) than Santi's WU's right now, so they do have more failures than Santi's.
I read it. It was locked but not moved. I don't agree that the failure rate of Noelia WUs is acceptable. They're awful and getting worse. There should be a separate queue for WUs that require more than 1GB, or you should detect the GPU's memory size and send those WUs only to machines that can handle them. |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
...Yes you do make a point on the failure rate. It's a big discussion though so maybe it should also be moved.
My current problem is the lack of a specific subforum for this. I will see if I can somehow get a new one made.
One thing a project/app failure rate can not identify is the fact that individuals (or systems) with high failure rates leave the project and crunch elsewhere. Thus the failure rates may look about the same. You have to look at the number of active crunchers, work turnover... Even those that stay with the project move to the short queue, to crunch different WU's (despite the lesser credit) - to avoid system failures/crashes/restarts/loops/failed WU's/failures at other projects...
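To illustrate the point about hosts dropping out, here is a minimal sketch with made-up numbers (the host counts and failure figures are purely hypothetical, not project data). It shows how the aggregate failure rate can look better after a high-failure host leaves, even though less work gets done.

```python
def aggregate(hosts):
    """Return (total_tasks, overall_failure_rate) for a list of (tasks, failures)."""
    tasks = sum(t for t, _ in hosts)
    failures = sum(f for _, f in hosts)
    rate = failures / tasks if tasks else 0.0
    return tasks, rate


# Hypothetical hosts: (tasks attempted, tasks failed).
before = [(100, 5), (100, 5), (100, 60)]
# The host failing more than half of its tasks gives up and leaves.
after = [h for h in before if h[1] / h[0] < 0.5]

if __name__ == "__main__":
    for label, hosts in (("before", before), ("after", after)):
        tasks, rate = aggregate(hosts)
        print(f"{label}: {tasks} tasks, {rate:.1%} failed")
```

In this toy case the failure rate alone improves, which is why the number of active hosts and the work turnover need to be tracked alongside it.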
The bottom line is that you are reliant on crunchers. If crunchers are not happy, some just leave without saying a word. The more vocal complain, make suggestions and if nothing is done then they quit the project. This is the very reason why some people don't stick around, and that has been identified as a bigger problem to the project!
Even if you think the failure rate isn't that bad, to the individual cruncher, failing work is terrible, and if it crashes drivers or the system, or causes restarts and loss of other work, it's death. This imbalance of opinion needs to be redressed. The problems crunchers face with some WU types have not been dealt with to their satisfaction. It's not that suggestions haven't been made by the researchers, it's that they have been slow in coming and communication has not been great. Problems that exist for months, despite suggested workarounds, are like an unhealed wound. Some of the crunchers' suggestions might be a pain for you to deal with but they would keep crunchers happy.
Number Crunching might be an appropriate sub-forum, as it's 'about performance', though the name 'Crunching' would have been sufficient. Also, Wish List and other threads already contain many suggestions (that were not implemented, often due to staff shortages, funding or technical limitations).
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
Jim1348
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
One thing a project/app failure rate can not identify is the fact that individuals (or systems) with high failure rates leave the project and crunch elsewhere. Thus the failure rates may look about the same. You have to look at the number of active crunchers, work turnover... Even those that stay with the project move to the short queue, to crunch different WU's (despite the lesser credit) - to avoid system failures/crashes/restarts/loops/failed WU's/failures at other projects...
Excellent point. If I leave on a trip, I don't want my machine to seize up or BSOD while I am gone. I work on a variety of BOINC projects, and they would all be affected. So even if I am willing to put up with a failure on one, I don't want it to take down all the others too.
|
|
|
Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Let's see if I know how to move threads :)
Edit: Ok I found out how but it's no fun. I have to do it post by post
So on the subject. We take system crashes very seriously because it's obviously very difficult to have to reboot and lose all work from other projects, especially since it's volunteer work. So when there are systematic system crashes we cancel the batches as we have done before. I haven't heard of any systematic ones recently though so if I missed any please inform me. |
|
|
|
At Einstein@Home they've got a subforum group "Help Desk". One subforum there is called "Problems and Bug Reports" and whenever a new search / batch is started they post a feedback thread there. Most of the feedback still ends up in the news thread, but at least the mechanism is there :)
Granted, their searches and batches don't change as often as ours, but I still think something like this would be appropriate.
Regarding system crashes: I'm not aware of this currently happening. But failure rates of certain tasks on certain hosts seem really bad. Other hosts seem fine, though, so the average failure rate may be far lower than individual failure rates. And lacking statistically significant hard numbers I don't think we can say much about the reasons for this behaviour.
I know Noelia is pushing the boundaries with the new features and more complex systems she simulates. But I feel part of this could be handled better. For example the GPU memory requirements: BOINC knows how much memory cards have and has a mechanism to avoid overloading GPU mem. But it has to be told to use this, i.e. the expected maximum GPU mem needed has to be included in WUs. From what we're seeing this is not being done (cards with less mem outright failing tasks or becoming very slow). This should easily be avoided!
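As a rough sketch of the idea (not BOINC's actual scheduler code), the check could look like this once each WU carries a declared memory requirement. The `max_gpu_mem_mb` field, the headroom value and the example cards' memory figures are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    gpu_mem_mb: int  # total memory on the card


@dataclass
class WorkUnit:
    name: str
    max_gpu_mem_mb: int  # declared peak GPU memory (hypothetical field)


def eligible_hosts(wu, hosts, headroom_mb=100):
    """Return the hosts whose card can hold the WU plus some headroom."""
    needed = wu.max_gpu_mem_mb + headroom_mb
    return [h for h in hosts if h.gpu_mem_mb >= needed]


# Hypothetical example: a WU that needs a bit more than 1GB.
hosts = [Host("1GB card", 1024), Host("1.28GB card", 1280), Host("2GB card", 2048)]
big_wu = WorkUnit("example_big_wu", max_gpu_mem_mb=1100)

if __name__ == "__main__":
    print([h.name for h in eligible_hosts(big_wu, hosts)])
```

The important part is only that the requirement is stated per WU; without it, no scheduler can make this decision.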
And some of her tasks which require more memory also take longer. Longer than we'd like, as even mid-range Keplers struggle to make the 24h deadline. That's the current generation of GPUs, we can hardly buy anything faster without spending really big bucks.
And the probability of failures probably increases with runtime. These should be put into smaller work chunks.. or you should establish a "super queue" for such long tasks. But not many people would be able to participate there, making it pretty much redundant.
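A small worked model of why smaller chunks help, under two loud assumptions: failures arrive at a constant rate per hour (the 0.01/h used here is invented, not measured), and a failure throws away everything done on that task so far. Checkpoints and retries are ignored.

```python
import math


def failure_probability(rate_per_hour, hours):
    """Chance that a task of the given length fails at least once."""
    return 1.0 - math.exp(-rate_per_hour * hours)


def expected_hours_lost(rate_per_hour, hours):
    """Expected work lost per attempt if a failure at time t wastes t hours.

    This is the integral of t * rate * exp(-rate * t) from 0 to `hours`.
    """
    lam = rate_per_hour
    decay = math.exp(-lam * hours)
    return (1.0 - decay) / lam - hours * decay


if __name__ == "__main__":
    rate = 0.01  # hypothetical failures per hour
    one_long = expected_hours_lost(rate, 24)
    four_short = 4 * expected_hours_lost(rate, 6)
    print(f"P(fail) 24h task: {failure_probability(rate, 24):.3f}")
    print(f"P(fail)  6h task: {failure_probability(rate, 6):.3f}")
    print(f"expected loss, one 24h task: {one_long:.2f} h")
    print(f"expected loss, four 6h tasks: {four_short:.2f} h")
```

Under these assumptions the same 24 hours of science costs noticeably less wasted time when split into four pieces, because each failure can only destroy one piece.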
Then there's the topic of information, which I touched upon before. I think it would really help to lower frustration levels if we'd get more feedback:
- why the current WUs?
- any special requirements like driver versions, amount of memory
- we're getting xx% bonus credits for this, since it's "risk production" (*)
- once finished: what did you learn from this batch? (might sometimes be difficult to answer, or not look all that great.. but if there's anything to talk about, do so!)
(*) The credit bonus for the long runs queue used to be static, as far as I know. You could add a dynamic bit based on the average failure rate of batches.. though I don't know how much they actually differ.
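One way the dynamic part could be sketched, purely as an illustration: add a term proportional to the batch's observed failure rate on top of the static bonus, with a cap. The function name, the scale factor and the cap are all hypothetical; the real static bonus is whatever the project currently uses.

```python
def credit_bonus(static_bonus, failure_rate, scale=1.0, cap=1.0):
    """Return the total bonus as a fraction of base credit.

    static_bonus -- fixed bonus for the queue (hypothetical value)
    failure_rate -- observed fraction of failed tasks in the batch, 0..1
    scale        -- how strongly failures raise the bonus (assumption)
    cap          -- upper limit so a broken batch cannot pay without bound
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return min(cap, static_bonus + scale * failure_rate)


if __name__ == "__main__":
    # Hypothetical batch: 25% static bonus, 10% of tasks failing.
    print(credit_bonus(0.25, 0.10))
```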
Some of this was actually directed towards Noelia. But I feel you've got an open ear and we're in a broad discussion here anyway.. so I hope it's OK to just tell you about these thoughts of mine and hope they'll be passed on :)
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
dyeman
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0 Level
Scientific publications
|
When I checked this morning, out of 3 WUs one had completed (Santi), one had failed (Noelia, with the "ACEMD terminated abnormally" dialog displayed), while a third had managed 5% progress in nearly three hours (also Noelia). This was one with very low memory controller use - it also crashed as soon as I tried to suspend and resume it.
Looking at the several failures I have had in the previous 3 days, most if not all of them have had one or more failures prior to mine. Of the 3 WUs I am currently processing, two are resends after a previous failure (one Noelia and one Santi). The Noelia is one of the slow 290px. It has previously crashed after 16 hours processing on a 660ti. It is now 9% complete after nearly 4 hours.
So the failure rate on current WUs for me is more than 50%. It is happening on all of my cards (2 660 and 1 660ti - I've already moved the 560ti to Folding because of the performance impact running some Noelia WUs on 1GB cards).
Of the WUs I have been able to complete successfully in the last 3 days only one has been a Noelia, and on that one there had been two previous failures on the WU. This is not an acceptable failure rate as far as I am concerned.
|
|
|
|
I have had four separate system crashes in the last 24hrs. I am now aborting every Noelia task when possible. This I do with a heavy heart. The risk of crashing when running Noelia tasks is no longer acceptable to me. For some reason, this latest batch has been particularly difficult on all cards. |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
As people have pointed out, some WU's take way too long (days on the top GPU's).
This would not be detected on a 'failure-rate' model. You would also have to look at run times.
The current GigaFLOPs is only 846,081 and the number of active crunchers is quite low, even for the Northern summer months.
You probably have enough data to see what the best drivers are for each WU type and each GPU type. Query for it and share the results.
What about the latest Beta drivers?
If you recompiled for those would it also improve other card type performances, as well as Titan and 780's?
Queues:
On the Project Status Page, http://www.gpugrid.net/server_status.php, there are 3 queues,
Short runs (2-3 hours on fastest card) 1,320 928 2.56 (0.22 - 10.90) 404
ACEMD beta version 0 10 0.02 (0.01 - 0.06) 24
Long runs (8-12 hours on fastest card) 3,493 1,533 5.96 (0.80 - 27.02) 414
On our preference page, http://www.gpugrid.net/prefs.php?subset=project, there are four!
ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2
ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1
ACEMD beta
ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2
The 3.1 app is now deprecated. That queue space could be used for the more troublesome WU's, NOELIA and First Batch runs, and improve the credit accordingly (statically by +75%) or dynamically as MrS suggests.
Also, the amount of work we do for different research (paper/presentation/thesis) is not visible. It's about time it was so we can determine how much we contribute. After all the wee badges are based on this:
Active Research (put in MyAccount area)
Cancer Top 1% (11th/6582) contribution ongoing towards Nathaniel (who might want to mention some of the work he does and add a poster/presentation...).
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
|
Yesterday evening, I discovered that a NOELIA_klebe had been running for ~13 hours and was stuck at 0%. I suspended / resumed it, but that didn't make it progress. The GPU was also idling. Strange thing, the acemd process was consuming one full CPU core...
I aborted the NOELIA, then got a SANTI_RAP. These generally go well, so I was glad I got one, but my gladness didn't last long, pretty soon I discovered the same was happening: WU not progressing, GPU idling, one CPU core at 100%. Suspending / resuming the task didn't do anything. I killed Boinc, restarted it and it got in a state where it didn't start any tasks, did not respond to GUI RPC or even the command-line (boinccmd) and consumed one full CPU core (the boinc executable). Several Boinc restarts didn't help, nor rebooting the machine.
I ended up reinstalling Boinc, which finally fixed this.
All this on this machine - OS is Ubuntu 12.04 64-bit.
____________
|
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
The acemd process kept polling the CPU, and so appeared to consume one full CPU core.
I've experienced something similar on Ubuntu 13.04 (304.88 drivers). I shut down, powered the PSU off for a few minutes and when I started up things ran normally. The thing is, I was running a POEM WU on the card! >30h too.
It's long been the case that GPUGrid has bigger WU's than other GPU projects. This inherently means a higher task error/problem rate. If a WU fails after 6h, you lose 6h work and 1 WU at GPUGrid. At POEM 18 WU's would need to fail to reach 6h and several hundred at MW. Einstein's WU's are similar to the Short WU's here. Ditto for Albert. However, if a WU runs perpetually it's the same level of problem for any project. While some people don't like the idea of hard cut-off times, I've had the misfortune to run numerous tasks without progression on many different systems over the years, sometimes several hundred hours before I notice. Hal was desperate.
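The arithmetic behind that comparison, as a tiny sketch. The 20-minute POEM task length is not measured; it is simply what the figures in the post imply (6h divided by 18 tasks).

```python
import math


def failures_to_match(lost_minutes, task_minutes):
    """Failed tasks of length task_minutes needed to waste lost_minutes of work."""
    return math.ceil(lost_minutes / task_minutes)


if __name__ == "__main__":
    six_hours = 6 * 60
    print(failures_to_match(six_hours, 20))         # short tasks, ~20 min each (inferred)
    print(failures_to_match(six_hours, six_hours))  # one long task failing at 6h
```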
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
|
Typical! Looks like the server has run out of Santi long tasks. Anyone else getting some? Moving to short tasks for now :-( |
|
|
|
Hi, Folks:
Given the waste of resources recently experienced, I cannot continue processing GPUGrid WUs.
I fully support the ideas posted, in particular shorter and reliable WUs with our ability to select the contribution area.
Unfortunately the management in Barcelona appears unable, in my view, to communicate with us crunchers. After all, we contribute our resources pro bono and deserve more respect.
While I understand the importance of GPUGrid research, I have stopped contributing to GPUGrid for now. From time to time I will review Forum posts to determine my future involvement.
Regards,
John |
|
|
|
John, you wouldn't have to quit completely if you wanted to. The SR WUs still seem to be OK.
@All: keep two things in mind. First, it's the weekend. Second, even failed WUs are very likely to help in some way. Sure, this feels like involuntary beta testing.. but I'm sure the researchers gain some knowledge from running them. I just can't tell you what and how much, since I don't have such insider information either.
SK wrote: Also, the amount of work we do for different research (paper/presentation/thesis) is not visible. It's about time it was so we can determine how much we contribute. After all the wee badges are based on this:
I think the badges are quite nice already, if combined with general information on what the individual batches are doing. We already have some general information on HIV / Cancer etc. type of work, but this only covers the broad scope and highlights. Which is good and appreciated, but can not easily be connected to the individual batches we run.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
Jim1348
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
Hi, Folks:
Given the waste of resources recently experienced, I cannot continue processing GPUGrid WUs.
I fully support the ideas posted, in particular shorter and reliable WUs with our ability to select the contribution area.
John,
Me too. I like the project, but my GTX 660s are not now suited for the longs (BSODs or hangs with Santis, slow running and errors with Noelias).
It is not that I get them all the time, just enough that I can't trust it to run reliably for days at a time. I will check back later when the quality control improves (or is implemented).
|
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Long runs (8-12 hours on fastest card) 3,511 1,510 6.13 (0.62 - 19.10) 409
So, 3,511 of what?
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Message received. We will have a meeting on Monday to see what we can do about the problems. |
|
|
|
Long runs (8-12 hours on fastest card) 3,511 1,510 6.13 (0.62 - 19.10) 409
So, 3,511 of what?
I've 5 pieces of SANTI_RAP74wt, 1 NATHAN_KIDKIX, 1 NOELIA_KLEBE in progress at the moment.
Out of curiosity I've installed the v326.41 driver on one of my hosts. It has had no errors since then, but not enough workunits have completed on this host yet to say it solves our problems. I keep my fingers crossed that it does.
The 3rd heatwave of this summer is gone here, so I can crunch all day long.
I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress. |
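A minimal sketch of the progress-check part of such a watchdog, written in Python rather than as a batch file. It only decides whether a task looks stuck; how the progress samples are collected (for example by polling the BOINC client) and what to do about a stall (restart the client or the host) are left out, and the one-hour window and threshold are assumptions.

```python
def is_stalled(samples, window_s=3600.0, min_delta=0.001):
    """Decide whether a task has stopped progressing.

    samples   -- list of (timestamp_seconds, fraction_done), oldest first
    window_s  -- how far back to compare against (assumed: one hour)
    min_delta -- smallest progress that still counts as "moving"
    """
    if not samples:
        return False
    latest_t, latest_f = samples[-1]
    # Use the newest sample that is at least window_s older than the latest one.
    baseline = None
    for t, f in samples:
        if latest_t - t >= window_s:
            baseline = f
        else:
            break
    if baseline is None:
        return False  # not enough history yet to judge
    return latest_f - baseline < min_delta


if __name__ == "__main__":
    stuck = [(0, 0.0), (1800, 0.0), (3600, 0.0)]
    moving = [(0, 0.10), (3600, 0.15)]
    print(is_stalled(stuck), is_stalled(moving))
```

Keeping the decision separate from the restart action makes it easy to test the check on recorded progress values before letting it reboot anything.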
|
|
|
I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress.
In an ideal world the app itself would take care of this. But apparently it doesn't work like that yet, or at least for some errors.
The "assertion failed" error messages we're frequently seeing are actually checks built in by the devs which spotted a critical (non-correctable?) error and halt the app (error out).
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
|
I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress.
In an ideal world the app itself would take care of this.
In an ideal world an app couldn't crash other apps, nor the GPU driver, nor the OS.
But apparently it doesn't work like that yet, or at least for some errors.
We had similar problems before. Last time, when the application error popped up it was correctable by a system restart, and after the restart the app could continue from the last checkpoint. Now we're facing a different situation: the app hangs, or runs into an error after the restart, so the user has to click on a button to terminate the app, and I have to experiment with this function (the progress check is already working). Luckily I haven't had such an error since then, mostly because I've received only a couple of NATHAN_KIDKIXc22's, which are quite stable. Maybe the other batches were put on hold (or on lower priority) because of our complaints?
The "assertion failed" error messages we're frequently seeing are actually checks built in by the devs which spotted a critical (non-correctable?) error and halt the app (error out).
The devs should respond to this... |
|
|
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
I have had four separate system crashes in the last 24hrs. I am now aborting every Noelia task when possible. This I do with a heavy heart. The risk of crashing when running Noelia tasks is no longer acceptable to me. For some reason, this latest batch has been particularly difficult on all cards.
This has also been my experience. The SANTI and NATHAN WUs run fine. The NOELIA WUs have become so problematic that they're not even worth attempting. I'm tired of my GPUs locking up because of poor programming skills. |
|
|
dyeman
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0 Level
Scientific publications
|
Just got another Noelia (9-NOELIA_005p_express-2-3-RND7707) which crashed after about 3000 seconds.
Should I feel satisfaction that none of the previous 8 failures on this WU (including a Quadro) lasted more than 23 seconds? |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
A 50min failure is worse than a 23sec failure as you've wasted more time.
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
TJ
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
I don't think the problems are easy to solve as there are many different things that play a part.
My rigs have the least problems with Noelia's and the most trouble with Santi's. Others have more problems with Noelia's. Perhaps a card and/or driver issue? People have longer running WUs, some are suspending for gaming, some have BSODs, some have acemd crashes, reboots, and driver crashes. So all types of problems are present.
My quad core Vista 32 bit with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well. All SR Santi's (they take long on this card, LR more than 30-40 hours, that's why I run only SR on it), 51 in a row without error. So the driver seems to be no issue here.
My rig with the promoted good driver 310.99 has again had Santi SRs error out. Noelia's and Nathan's do fine (mostly) with all the different drivers as well. It's the GTX660. I had a bunch of betas with a few errors. I have now opted out of them to leave them to the Titans and 780s.
My AMD rig with the GTX770 and driver 320.49 has done all types, LR and SR, with no errors yet from the moment it started. It was off a few days due to heat. Win7 64 bit. 29 error-free so far, Noelia's included.
Cards are not OC unless by factory. I have most clocks set a bit lower.
Then we see a small group of crunchers with comments and complaints. Does this mean the rest have no problems? No, it doesn't. They don't look as closely as we do, or do not react on the forum. So the error rate has to be monitored by the servers. All the ones that run well, we don't hear about either. A status page like Einstein@Home has would be great to see all this.
Lastly, I want to stress that Noelia's WUs are not worse than the others for my cards, the GTX660 and 770.
We are here to help science and in science you have a lot of trial and error. Not everything can be completely controlled as in a lab as we are the lab.
____________
Greetings from TJ |
|
|
|
My quad core Vista 32 bit with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well. All SR Santi's (they take long on this card, LR more than 30-40 hours, that's why I run only SR on it), 51 in a row without error. So the driver seems to be no issue here.
Yes, it seems to be a problem with Kepler and above (but it could still be a driver issue then?). Fermi is running fine, except these "klebes" which need more than 1GB ;) 570s and 560ti 448 (all with 1,28GB and overclocked) are running fine here too (but with the 310.xx driver). I still use XP32 too.
I now have two workunits with high times, but only because I didn't notice that the only error I had, 2 days before, hung up one of the cards a bit ^^
____________
DSKAG Austria Research Team: http://www.research.dskag.at
|
|
|
TJ
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
My quad core Vista 32 bit with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well. All SR Santi's (they take long on this card, LR more than 30-40 hours, that's why I run only SR on it), 51 in a row without error. So the driver seems to be no issue here.
Yes, it seems to be a problem with Kepler and above (but it could still be a driver issue then?). Fermi is running fine, except these "klebes" which need more than 1GB ;) 570s and 560ti 448 (all with 1,28GB and overclocked) are running fine here too (but with the 310.xx driver). I still use XP32 too.
I now have two workunits with high times, but only because I didn't notice that the only error I had, 2 days before, hung up one of the cards a bit ^^
I had a Noelia WU (not the Klebe) that used less than 700MB on the GTX660. Seems that it depends on the WU?
____________
Greetings from TJ |
|
|
|
For sure, every scientist has several simulations, which you can recognize by the name. So they have different requirements.
____________
DSKAG Austria Research Team: http://www.research.dskag.at
|
|
|
|
I've made a rather particular observation:
One of my hosts has a WLAN connection (through USB), since the integrated LAN chip has no WinXP drivers. This host usually loses the WLAN connection daily, so I've made a scheduled batch program which checks the WLAN connection, and restarts the WLAN connection or the host if that's necessary to restore it. When the NOELIA tasks showed up, the WLAN connection's reliability dropped so much that I bought another USB-WLAN adapter based on a different chip to fix it, but the new one turned out to be as unreliable as the old one. Since the NOELIAs have gone, this host hasn't had a single occurrence of such a WLAN connectivity failure.... That is strange. |
|
|
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
Lastly, I want to stress that Noelia's WUs are not worse than the others for my cards, the GTX660 and 770.
You've only run 1 of the problem NOELIA long runs and while it completed it took a very long time for a 770. A sample of 1 doesn't mean a lot, but maybe a 770 can run these WUs. Maybe not. Try running them on any 1GB GPU. |
|
|
|
Lastly, I want to stress that Noelia's WUs are not worse than the others for my cards, the GTX660 and 770.
You've only run 1 of the problem NOELIA long runs and while it completed it took a very long time for a 770. A sample of 1 doesn't mean a lot, but maybe a 770 can run these WUs. Maybe not. Try running them on any 1GB GPU.
My 1GB GTX 650Ti has successfully run a couple of the problematic NOELIAs. They took ~45h to complete, but they sure did complete without error.
When these NOELIAs first appeared, I tried to avoid them. Then a heatwave came along and they were ideal with the relatively low GPU and CPU load.
____________
|
|
|
Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
After the meeting there are some updates on the subject:
1. The Noelia WU's were put on low priority due to the crashes. You should be getting mostly Nate's WU's on long now.
2. Now for every batch we send out we decided we will make a thread in the News section with the exact batch name. If someone forgets to do that send a message quick and I will remind them :D These threads will also contain information about the specific batch.
3. There are plans to test features (maybe on a new project?) such as adding hardware requirements to WU's so that they only run on specified hardware and thus prevent unnecessary crashes.
4. Shorter WU's with faster turnaround might make their appearance also in the following months.
I hope this solves the most immediate problems and adds some interesting stuff for the future. |
|
|
|
Thx for the updates, sound good.
____________
DSKAG Austria Research Team: http://www.research.dskag.at
|
|
|
|
Welcome news, but does (1) mean that these Noelia Wu's will eventually get released by the server when Nates start drying up? What % of Noelia WU's are left to crunch? |
|
|
Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Hm there are still quite a few left. I do not know why Gianni didn't outright cancel them. I will ask him tomorrow. Maybe because Noelia is on holidays. So yeah I guess they would come if Nate's dry up, but I think he suggested that he would keep it filled.(?)
In any case the error occurrences should go down. StoneageMan, I remember you said in another thread that they caused system crashes for you, no? That's weird, because until now no one else has reported such problems. I mean they crash, but only the WU's, not the machine. |
|
|
neilp62Send message
Joined: 23 Nov 10 Posts: 14 Credit: 7,882,259,205 RAC: 524,335 Level
Scientific publications
|
That's weird, because until now no one else reported such problems. I mean they crash but only the WU's not the machine.
Stefan, I agree with TJ that a lot of the severe platform crashes may have gone unreported. I am within the group of crunchers he described that don't look as closely or react on the forum as a few more active forum participants do. I've had 11 Noelia WU failures in the last 14 days, and most of these WUs also failed for several other users. In two instances, a single WU failure seems to have induced failure in other active WUs on my platform. One failure left my platform's GPU that is attached to the display in a state that required a full power down before Windows 7 would start properly. |
|
|
|
Thanks Stefan, sounds good so far!
About that new project: new, as in "GPU-Grid beta" or something? Or a new subproject / WUs type like the short and long queues, maybe "risk production"? Credit-wise it might be nicer to have them all combined under the same banner. But we and you might not be able to set things up as specifically as we want it, if they're not separate projects.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
StefanProject administrator Project developer Project tester Project scientist Send message
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Ok so now the Noelia WU's got cancelled totally.
Have a happy crash-free crunching month :D |
|
|
|
They were fun while they lasted. Though, I managed to finish the last few successfully, including two overnight, no errors since last Friday. It's nice to end on a high note. I hope the results are useful, and I hope these simulations resume once the bugs are fixed.
|
|
|
StefanProject administrator Project developer Project tester Project scientist Send message
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, they will probably come back in two or three weeks with some different parameters, which should decrease the error rate. |
|
|
pvhSend message
Joined: 17 Mar 10 Posts: 23 Credit: 1,173,824,416 RAC: 0 Level
Scientific publications
|
Ok so now the Noelia WU's got cancelled totally.
Excellent news! I will start up GPUGRID again when I get home... |
|
|
petebeSend message
Joined: 19 Nov 12 Posts: 31 Credit: 1,549,545,867 RAC: 0 Level
Scientific publications
|
Thanks very much for the update and follow-thru, Stefan! |
|
|
|
Thanks for the update, Stefan.
John |
|
|
|
Hi, Jim:
I will not be processing GPUGrid WUs for a while as I am concentrating on other areas of interest. I will keep an eye on the Forum and decide at a future date if I should contribute more.
Happy crunching!
John |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Ok so now the Noelia WU's got cancelled totally.
Have a happy crash-free crunching month :D
I am not happy with this, as on my rigs the Noelias did better than the Santis, and that is still the case: I again had Santi errors, both LR and SR, even with the latest beta drivers.
____________
Greetings from TJ |
|
|
StefanProject administrator Project developer Project tester Project scientist Send message
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Well, we cannot please everyone unfortunately :D It's great to hear that they worked fine on your machine. But on the last days the Noelia WU's had an incredibly high failure rate, so even if they worked for you they were crashing for nearly 30-40% of the users. So the general good had to prevail here :) |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Well, we cannot please everyone unfortunately :D It's great to hear that they worked fine on your machine. But on the last days the Noelia WU's had an incredibly high failure rate, so even if they worked for you they were crashing for nearly 30-40% of the users. So the general good had to prevail here :)
Aha, about a week ago we could read that the failure rate was acceptable according to the project. This however is another conclusion ;-)
Well never mind, the Santis keep failing on my rigs, and also for a lot of the wingmen who got them afterwards. So I guess the complaints about them will now increase. No longer from me: I set to LR and will no longer complain about them.
And I don't have to hurry to build new rigs, or update old ones with 690s, 780s and Titans.
Perhaps you could crunch a few of those Santis, Stefan, then you can see it for yourself.
____________
Greetings from TJ |
|
|
StefanProject administrator Project developer Project tester Project scientist Send message
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Well a week ago it was acceptable. For some reason it started increasing and became quite unacceptable (hence I said "on the last days").
Santi's are at 2-7% error rate which might be an all-time historical low or something like that :P
As for crunching them, I think my (single) GTX 280 might cry. |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Thanks for the clarification Stefan.
Yes indeed, the 280 will get very warm :)
____________
Greetings from TJ |
|
|
Beyond Send message
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
Ok so now the Noelia WU's got cancelled totally.
Have a happy crash-free crunching month :D
Thanks for this. Since the Noelia WUs disappeared I've had no crashes or failures at all on my 8 GPUGrid machines.
2. Now for every batch we send out we decided we will make a thread in the News section with the exact batch name. If someone forgets to do that send a message quick and I will remind them :D These threads will also contain information about the specific batch.
3. There are plans to test features (maybe on a new project?) such as adding hardware requirements to WU's so that they only run on specified hardware and thus prevent unnecessary crashes.
Great news on both counts although I don't see why a new project would be necessary. |
|
|
StefanProject administrator Project developer Project tester Project scientist Send message
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level
Scientific publications
|
Supposedly from what I understood some options (like WU deadlines? or hardware requirements?) are defined project wise. So we could not test them publicly on our main project as it could ruin everything. That's at least what I understood. |
|
|
|
Ok so now the Noelia WU's got cancelled totally.
Excellent news! I will start up GPUGRID again when I get home...
Hi, Stefan:
Thank you for this most welcome news! I will now run a couple of short run WUs and see what happens. I cannot turn my back on this important research: that's why I invested in two GTX 650Ti GPUs a few months ago. Until that time I had processed other WUs with ATI GPUs only.
Thanks, again.
John |
|
|
Beyond Send message
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
Santi's are at 2-7% error rate which might be an all-time historical low or something like that :P
Haven't had a single WU error since the Noelia WUs left over a week ago. Looking back, the very few errors I had with the Nathan and Santi WUs seem to have occurred after defective Noelia WUs put the GPUs into a bad state. This is with 8 machines running a range of GPUs from the lowly GTX 460/768, the 560 1GB, the 650 Ti 1GB and the 670 2GB. If my hypothesis is correct, without the Noelia WUs around to mess up the GPUs, the error rate for other WU types should be falling. |
|
|
|
Santi's are at 2-7% error rate which might be an all-time historical low or something like that :P
Haven't had a single WU error since the Noelia WUs left over a week ago. Looking back, the very few errors I had with the Nathan and Santi WUs seem to have occurred after defective Noelia WUs put the GPUs into a bad state. This is with 8 machines running a range of GPUs from the lowly GTX 460/768, the 560 1GB, the 650 Ti 1GB and the 670 2GB. If my hypothesis is correct, without the Noelia WUs around to mess up the GPUs, the error rate for other WU types should be falling.
This reminds me, on my windows xp computer, I had observed on 2 occasions, when the Noelia unit crashed, the subsequent non Noelia would not load into the GPU (the clock would run, but progress would stay at 0.00%). To fix this, I had to suspend the unit, reboot the computer, and then resume the unit. This non Noelia unit would then run normally.
I almost forgot about this. Thanks for jogging my memory with your post.
|
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
On my 660 the Santi's keep erroring so Noelia's WU have nothing to do with this!
So I have withdrawn the 660 and give it to Einstein and Albert.
____________
Greetings from TJ |
|
|
Beyond Send message
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
On my 660 the Santi's keep erroring so Noelia's WU have nothing to do with this!
So I have withdrawn the 660 and give it to Einstein and Albert.
Did you ever try to RMA it as suggested? |
|
|
|
I've written a batch program that watches for a stuck workunit; when that happens, it restarts the operating system. It could be programmed to take other actions (like deleting files from the failed workunit to make it run to an error instead of hanging at the next start), but a simple OS restart seems to resolve the majority of the WU hangs. It should work on all current Windows versions, 32 and 64 bit (XP, 7, 8).
The batch program consists of two batch files, which generate further batch files depending on how many workunits are running at the same time.
You have to save these batch files into the same directory, in which you have all access rights (write, read, execute, modify, delete), for example in a folder on your desktop.
I call the first file check.bat, to create it you should start notepad, copy and paste the following text, and then save it to your designated folder as "check.bat", and don't forget to set the file type to "all files" before you press "save" (or else the notepad will save it as check.bat.txt)
@ECHO OFF
IF "%ALLUSERSPROFILE%"=="%SYSTEMDRIVE%\ProgramData" GOTO Win7
SET SLOTDIR=%ALLUSERSPROFILE%\Application Data\BOINC\slots
GOTO WinXP
:Win7
SET SLOTDIR=%ALLUSERSPROFILE%\BOINC\slots
:WinXP
IF NOT EXIST slotnum.bat GOTO src4slots
CALL slotnum.bat
IF %SLOTNUM%==SLOTCHANGE GOTO src4slots
SET SLOTCOUNT=0
SET APPNAME=acemd.800-42.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i c
IF NOT %SLOTNUM%==%SLOTCOUNT% GOTO src4slots
IF %SLOTNUM%==0 GOTO end
FOR /L %%i IN (1,1,%SLOTNUM%) DO CALL slot%%i
IF EXIST slotnum.bat GOTO end
echo === RESTART: ACEMD stuck ==== >>check.log
DATE /t >>check.log
TIME /t >>check.log
echo . >>check.log
SHUTDOWN /r /f /d 4:5 /c "ACEMD stuck"
GOTO end
:src4slots
IF EXIST slotnum.bat DEL slotnum.bat /q /f
SET SLOTNUM=0
SET APPNAME=acemd.800-42.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i
SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (1,1,20) DO CALL slotcheck %%i
ECHO SET SLOTNUM=%SLOTNUM% >slotnum.bat
:end
If your host uses the CUDA5.5 client, the brown section is not needed.
If your host uses the CUDA4.2 client, the green section is not needed.
You can use this batch program to check any other client's progress (not only GPUGrid's): just replace the name of the acemd client with the name of that client's executable file at the end of the first line in the brown or the green section. Repeat these two sections once for each client whose progress you want to check.
The second batch file: (it must be named as slotcheck.bat, as the first batch file refers to this file with that name.)
IF NOT EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO end
IF NOT .%2.==.. GOTO count
IF %SLOTNUM%==8 SET SLOTNUM=9
IF %SLOTNUM%==7 SET SLOTNUM=8
IF %SLOTNUM%==6 SET SLOTNUM=7
IF %SLOTNUM%==5 SET SLOTNUM=6
IF %SLOTNUM%==4 SET SLOTNUM=5
IF %SLOTNUM%==3 SET SLOTNUM=4
IF %SLOTNUM%==2 SET SLOTNUM=3
IF %SLOTNUM%==1 SET SLOTNUM=2
IF %SLOTNUM%==0 SET SLOTNUM=1
DEL slot%SLOTNUM%.bat /q /f
ECHO IF EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO checkprogress >slot%SLOTNUM%.bat
ECHO IF EXIST slotnum.bat ECHO SET SLOTNUM=SLOTCHANGE ^>slotnum.bat >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :checkprogress >>slot%SLOTNUM%.bat
ECHO FIND "<fraction_done>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>%SLOTNUM%.txt >>slot%SLOTNUM%.bat
ECHO FC %SLOTNUM%.txt %SLOTNUM%.xml >>slot%SLOTNUM%.bat
ECHO IF ERRORLEVEL 1 GOTO ok >>slot%SLOTNUM%.bat
ECHO FIND "<result_name>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TYPE %SLOTNUM%.txt ^>^>check.log >>slot%SLOTNUM%.bat
ECHO DEL slotnum.bat /q /f >>slot%SLOTNUM%.bat
ECHO :ok >>slot%SLOTNUM%.bat
ECHO COPY %SLOTNUM%.txt %SLOTNUM%.xml /y >>slot%SLOTNUM%.bat
ECHO :end >>slot%SLOTNUM%.bat
FIND "<fraction_done>" <"%SLOTDIR%\%1\boinc_task_state.xml" >%slotnum%.xml
GOTO end
:count
IF %SLOTCOUNT%==8 SET SLOTCOUNT=9
IF %SLOTCOUNT%==7 SET SLOTCOUNT=8
IF %SLOTCOUNT%==6 SET SLOTCOUNT=7
IF %SLOTCOUNT%==5 SET SLOTCOUNT=6
IF %SLOTCOUNT%==4 SET SLOTCOUNT=5
IF %SLOTCOUNT%==3 SET SLOTCOUNT=4
IF %SLOTCOUNT%==2 SET SLOTCOUNT=3
IF %SLOTCOUNT%==1 SET SLOTCOUNT=2
IF %SLOTCOUNT%==0 SET SLOTCOUNT=1
:end
You should create a scheduled task to run "check.bat" every 10 minutes (shorter period is not recommended), with the highest access rights on Win7 and 8, or administrator privileges on WinXP (or else it won't be able to restart the OS)
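For the scheduled task itself, the built-in schtasks tool can presumably do the job. Here is a minimal Python sketch that only assembles the command line and does not run it. The task name (GPUGridCheck) and the folder (C:\gpugrid-check) are made-up placeholders, and the /RL HIGHEST switch does not exist on WinXP, so it is optional here:

```python
from typing import List


def build_schtasks_command(script_path: str,
                           task_name: str = "GPUGridCheck",  # hypothetical name
                           interval_minutes: int = 10,
                           highest: bool = True) -> List[str]:
    """Assemble a schtasks call that runs script_path every interval_minutes."""
    cmd = ["schtasks", "/Create",
           "/TN", task_name,
           "/TR", script_path,
           "/SC", "MINUTE",
           "/MO", str(interval_minutes)]
    if highest:
        # Run with highest privileges (Vista and later; leave out on WinXP).
        cmd += ["/RL", "HIGHEST"]
    return cmd


# Hypothetical location of check.bat, for illustration only.
print(" ".join(build_schtasks_command(r"C:\gpugrid-check\check.bat")))
```

Because check.bat refers to its helper files by relative name, the task most likely has to start in the folder that holds the batch files; that part is not covered by this sketch.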
Known limitations:
It can monitor 9 slots at the most.
It checks only the first 20 slots for the targeted clients (it can be easily modified in the green and blue section)
You should not pause any task that is monitored by this batch program and has already been processed to any extent (or else the batch program will restart your host's OS every 20 minutes)
If the monitored application writes its progress to the disk less frequently than every 10 minutes, you should increase the repetition interval accordingly.
I'm using it on WinXP and haven't tested it on other Windows versions (7, 8), but it should work.
It creates the following files:
- check.log: record of every restart with the date, time, workunit name and its progress
- slotnum.bat file: it tells the batch program how many slots it has to monitor
- slotn.bat file for every slot the batch program has to monitor
- n.xml and n.txt files to record every slot's progress |
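For readers who find the batch syntax hard to follow, the core idea (compare the fraction_done entry of boinc_task_state.xml between two runs) can be sketched in Python. This is only an illustration of the comparison, not a replacement for the batch files; the function names are made up for this sketch:

```python
import re
from pathlib import Path
from typing import Optional

# Matches the <fraction_done> entry the batch files search for with FIND.
FRACTION_RE = re.compile(r"<fraction_done>\s*([0-9.eE+-]+)\s*</fraction_done>")


def read_fraction_done(state_file: Path) -> Optional[str]:
    """Return the raw fraction_done text, or None if the file or tag is missing."""
    try:
        text = state_file.read_text(errors="replace")
    except OSError:
        return None
    match = FRACTION_RE.search(text)
    return match.group(1) if match else None


def is_stuck(previous: Optional[str], current: Optional[str]) -> bool:
    """A task counts as stuck when its progress is unchanged between two checks."""
    return previous is not None and previous == current
```

The values are compared as text, just as FC compares the two snapshot files, so no rounding is involved.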
|
|
|
There is a "typo" in the brown and green sections, as the slot numbering starts at 0, also the app names should be modified to reflect the new app versions:
For the CUDA5.5 client you should use:
SET APPNAME=acemd.802-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.803-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.804-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.802-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
SET APPNAME=acemd.803-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
SET APPNAME=acemd.804-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
For the CUDA 4.2 client you should use:
SET APPNAME=acemd.802-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.803-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.804-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.802-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
SET APPNAME=acemd.803-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
SET APPNAME=acemd.804-42.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
I've had a couple of stuck tasks in the last few days, and it seems to me that sometimes there is no boinc_task_state.xml file in the slot directory when it happens. The batch program still works in that case, but the result name is missing from the log file it creates. I've added additional debug info to slotcheck.bat (like the name of the stuck application); I'll publish the new version once I have more info about the missing boinc_task_state.xml. |
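The repeated SET APPNAME / FOR pairs above all follow one pattern, which can be written down compactly. A small Python sketch of that pattern (the helper name is made up for illustration); note that FOR /L %%i IN (0,1,20) includes both ends, so it covers slots 0 to 20:

```python
from typing import Iterable, List


def acemd_app_names(versions: Iterable[int], cuda: str) -> List[str]:
    """Build executable names such as acemd.802-55.exe for each app version."""
    return [f"acemd.{version}-{cuda}.exe" for version in versions]


# FOR /L %%i IN (0,1,20) is inclusive at both ends: slot directories 0..20.
SLOT_INDICES = range(0, 21)

for app in acemd_app_names([802, 803, 804], "55"):
    print(app, "is searched in slots", SLOT_INDICES[0], "to", SLOT_INDICES[-1])
```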
|
|
|
Here comes the second version of my monitoring batch programs.
This version counts the running tasks instead of looking for a stuck task, so there's no problem if the BOINC manager (or the user) pauses a task because of an overestimated remaining time of a new workunit.
You should set the number of concurrently running workunits in the line marked with indigo color. (Now it's set to 2, as I have 2 GPUs in my system)
When fewer workunits than that number have made progress, this batch program restarts the operating system. It could be programmed to take other actions (like deleting files from the failed workunit to make it run to an error instead of hanging at the next start), but a simple OS restart seems to resolve the majority of the WU hangs. It should work on all current Windows versions, 32 and 64 bit (XP, 7, 8).
The batch program consists of two batch files, which generate further batch files depending on how many workunits are running at the same time.
You have to save these batch files into the same directory, in which you have all access rights (write, read, execute, modify, delete), for example in a folder on your desktop.
I call the first file check.bat, to create it you should start notepad, copy and paste the following (colored) text, and then save it to your designated folder as "check.bat", and don't forget to set the file type to "all files" before you press "save" (or else the notepad will save it as check.bat.txt)
@ECHO OFF
IF "%ALLUSERSPROFILE%"=="%SYSTEMDRIVE%\ProgramData" GOTO Win7
SET SLOTDIR=%ALLUSERSPROFILE%\Application Data\BOINC\slots
GOTO WinXP
:Win7
SET SLOTDIR=%ALLUSERSPROFILE%\BOINC\slots
:WinXP
IF NOT EXIST slotnum.bat GOTO src4slots
CALL slotnum.bat
SET SLOTCOUNT=0
SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
SET APPNAME=acemd.814-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i c
IF NOT %SLOTNUM%==%SLOTCOUNT% GOTO src4slots
IF %SLOTNUM%==0 GOTO end
SET INPROGRESS=0
FOR /L %%i IN (1,1,%SLOTNUM%) DO CALL slot%%i
IF NOT EXIST slotnum.bat GOTO src4slots
IF %INPROGRESS% GEQ 2 GOTO end
IF %INPROGRESS%==%SLOTNUM% GOTO end
echo ======= RESTART: ACEMD stuck ======= >>check.log
SHUTDOWN /r /f /d 4:5 /c "ACEMD stuck"
GOTO end
:src4slots
SET SLOTNUM=0
SET APPNAME=acemd.800-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
SET APPNAME=acemd.814-55.exe
FOR /L %%i IN (0,1,20) DO CALL slotcheck %%i
ECHO SET SLOTNUM=%SLOTNUM% >slotnum.bat
:end
If your host is using the CUDA4.2 client, you should change the app name in the brown and the green sections to this:
SET APPNAME=acemd.800-42.exe
SET APPNAME=acemd.814-42.exe
You can use this batch program to check any other client's progress (not only GPUGrid's): just replace the name of the acemd client with the name of that client's executable file at the end of the first line in the brown or the green section. Repeat these two sections once for each client whose progress you want to check. (However, it's not recommended to mix different applications, since this version counts the running tasks instead of looking for a stuck task.)
The second batch file: (it must be named as slotcheck.bat, as the first batch file refers to this file with that name.)
IF NOT EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO end
IF NOT .%2.==.. GOTO count
IF %SLOTNUM%==8 SET SLOTNUM=9
IF %SLOTNUM%==7 SET SLOTNUM=8
IF %SLOTNUM%==6 SET SLOTNUM=7
IF %SLOTNUM%==5 SET SLOTNUM=6
IF %SLOTNUM%==4 SET SLOTNUM=5
IF %SLOTNUM%==3 SET SLOTNUM=4
IF %SLOTNUM%==2 SET SLOTNUM=3
IF %SLOTNUM%==1 SET SLOTNUM=2
IF %SLOTNUM%==0 SET SLOTNUM=1
DEL slot%SLOTNUM%.bat /q /f
ECHO IF NOT EXIST slotnum.bat GOTO end >slot%SLOTNUM%.bat
ECHO IF EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO checkprogress >>slot%SLOTNUM%.bat
ECHO IF EXIST slotnum.bat DEL slotnum.bat /q /f >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :checkprogress >>slot%SLOTNUM%.bat
ECHO FIND "<fraction_done>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>%SLOTNUM%.txt >>slot%SLOTNUM%.bat
ECHO FC %SLOTNUM%.txt %SLOTNUM%.xml >>slot%SLOTNUM%.bat
ECHO IF ERRORLEVEL 1 GOTO ok >>slot%SLOTNUM%.bat
ECHO ECHO . ^>^>check.log >>slot%SLOTNUM%.bat
ECHO DATE /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TIME /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO ECHO application %APPNAME% is stuck in slot %1 ^>^>check.log >>slot%SLOTNUM%.bat
ECHO IF NOT EXIST "%SLOTDIR%\%1\boinc_task_state.xml" ECHO %SLOTDIR%\%1\boinc_task_state.xml is not exists! ^>^>check.log >>slot%SLOTNUM%.bat
ECHO FIND "<result_name>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
rem ECHO TYPE "%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TYPE %SLOTNUM%.xml ^>^>check.log >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :ok >>slot%SLOTNUM%.bat
ECHO COPY %SLOTNUM%.txt %SLOTNUM%.xml /y >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==8 SET INPROGRESS=9 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==7 SET INPROGRESS=8 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==6 SET INPROGRESS=7 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==5 SET INPROGRESS=6 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==4 SET INPROGRESS=5 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==3 SET INPROGRESS=4 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==2 SET INPROGRESS=3 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==1 SET INPROGRESS=2 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==0 SET INPROGRESS=1 >>slot%SLOTNUM%.bat
ECHO :end >>slot%SLOTNUM%.bat
FIND "<fraction_done>" <"%SLOTDIR%\%1\boinc_task_state.xml" >%slotnum%.xml
GOTO end
:count
IF %SLOTCOUNT%==8 SET SLOTCOUNT=9
IF %SLOTCOUNT%==7 SET SLOTCOUNT=8
IF %SLOTCOUNT%==6 SET SLOTCOUNT=7
IF %SLOTCOUNT%==5 SET SLOTCOUNT=6
IF %SLOTCOUNT%==4 SET SLOTCOUNT=5
IF %SLOTCOUNT%==3 SET SLOTCOUNT=4
IF %SLOTCOUNT%==2 SET SLOTCOUNT=3
IF %SLOTCOUNT%==1 SET SLOTCOUNT=2
IF %SLOTCOUNT%==0 SET SLOTCOUNT=1
:end
You should create a scheduled task to run "check.bat" every 10 minutes (shorter period is not recommended), with the highest access rights on Win7 and 8, or administrator privileges on WinXP (or else it won't be able to restart the OS)
Known limitations:
It can monitor 9 slots at the most.
It checks only the first 20 slots for the targeted clients (it can be easily modified in the green and brown section)
You have to set the number of GPUs manually in the check.bat batch file (in the indigo colored line)
You should not pause more tasks (monitored by this batch program and already processed to any extent) than you've set in that line (or else the batch program will restart your host's OS every 20 minutes)
If the monitored application writes its progress to the disk less frequently than every 10 minutes, you should increase the repetition interval accordingly.
I'm using it on WinXP, and haven't tested on other Windows (7,8), but it should work.
It creates the following files:
- check.log: record of every restart with the date, time, workunit name and its progress
- slotnum.bat file: it tells the batch program how many slots it has to monitor
- slotn.bat file for every slot the batch program has to monitor
- n.xml and n.txt files to record every slot's progress |
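The restart decision of this second version is easier to see outside of batch syntax. A minimal Python sketch of the three checks in check.bat (no monitored slots, enough tasks progressing, or all monitored tasks progressing); the function name is made up, and expected_running stands for the number set in the indigo line:

```python
def should_restart(in_progress: int, monitored_slots: int,
                   expected_running: int = 2) -> bool:
    """Mirror the checks in check.bat that decide whether to restart the OS."""
    if monitored_slots == 0:
        return False  # nothing to monitor (IF %SLOTNUM%==0 GOTO end)
    if in_progress >= expected_running:
        return False  # enough tasks made progress (IF %INPROGRESS% GEQ 2)
    if in_progress == monitored_slots:
        return False  # every monitored task made progress
    return True       # too few tasks progressed: treat as stuck


print(should_restart(in_progress=1, monitored_slots=2))
```

With two GPUs and two monitored slots, a single task making progress is therefore enough to trigger a restart, which is why paused tasks matter for the limitation listed above.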
|
|
|
Task completion coincides with the scheduled start of my batch program surprisingly often, and in that case the previous versions trigger a false positive. So I've modified slotcheck.bat so that a task is not considered stuck when boinc_task_state.xml is missing from the slot directory, unless the file is missing for the second consecutive check:
IF NOT EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO end
IF NOT .%2.==.. GOTO count
IF %SLOTNUM%==8 SET SLOTNUM=9
IF %SLOTNUM%==7 SET SLOTNUM=8
IF %SLOTNUM%==6 SET SLOTNUM=7
IF %SLOTNUM%==5 SET SLOTNUM=6
IF %SLOTNUM%==4 SET SLOTNUM=5
IF %SLOTNUM%==3 SET SLOTNUM=4
IF %SLOTNUM%==2 SET SLOTNUM=3
IF %SLOTNUM%==1 SET SLOTNUM=2
IF %SLOTNUM%==0 SET SLOTNUM=1
DEL slot%SLOTNUM%.bat /q /f
ECHO IF NOT EXIST slotnum.bat GOTO end >slot%SLOTNUM%.bat
ECHO IF EXIST "%SLOTDIR%\%1\%APPNAME%" GOTO checkprogress >>slot%SLOTNUM%.bat
ECHO IF EXIST slotnum.bat DEL slotnum.bat /q /f >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :checkprogress >>slot%SLOTNUM%.bat
ECHO IF EXIST "%SLOTDIR%\%1\boinc_task_state.xml" GOTO chk2 >>slot%SLOTNUM%.bat
ECHO IF NOT EXIST %SLOTNUM%.txt GOTO stuck >>slot%SLOTNUM%.bat
ECHO DEL %SLOTNUM%.txt /q /f >>slot%SLOTNUM%.bat
ECHO GOTO ok2 >>slot%SLOTNUM%.bat
ECHO :chk2 >>slot%SLOTNUM%.bat
ECHO FIND "<fraction_done>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>%SLOTNUM%.txt >>slot%SLOTNUM%.bat
ECHO FC %SLOTNUM%.txt %SLOTNUM%.xml >>slot%SLOTNUM%.bat
ECHO IF ERRORLEVEL 1 GOTO ok >>slot%SLOTNUM%.bat
ECHO :stuck >>slot%SLOTNUM%.bat
ECHO ECHO . ^>^>check.log >>slot%SLOTNUM%.bat
ECHO DATE /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TIME /t ^>^>check.log >>slot%SLOTNUM%.bat
ECHO ECHO application %APPNAME% is stuck in slot %1 ^>^>check.log >>slot%SLOTNUM%.bat
ECHO IF NOT EXIST "%SLOTDIR%\%1\boinc_task_state.xml" ECHO %SLOTDIR%\%1\boinc_task_state.xml is not exists! ^>^>check.log >>slot%SLOTNUM%.bat
ECHO FIND "<result_name>" ^<"%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
rem ECHO TYPE "%SLOTDIR%\%1\boinc_task_state.xml" ^>^>check.log >>slot%SLOTNUM%.bat
ECHO TYPE %SLOTNUM%.xml ^>^>check.log >>slot%SLOTNUM%.bat
ECHO GOTO end >>slot%SLOTNUM%.bat
ECHO :ok >>slot%SLOTNUM%.bat
ECHO COPY %SLOTNUM%.txt %SLOTNUM%.xml /y >>slot%SLOTNUM%.bat
ECHO :ok2 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==8 SET INPROGRESS=9 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==7 SET INPROGRESS=8 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==6 SET INPROGRESS=7 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==5 SET INPROGRESS=6 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==4 SET INPROGRESS=5 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==3 SET INPROGRESS=4 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==2 SET INPROGRESS=3 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==1 SET INPROGRESS=2 >>slot%SLOTNUM%.bat
ECHO IF %%INPROGRESS%%==0 SET INPROGRESS=1 >>slot%SLOTNUM%.bat
ECHO :end >>slot%SLOTNUM%.bat
FIND "<fraction_done>" <"%SLOTDIR%\%1\boinc_task_state.xml" >%slotnum%.xml
COPY %SLOTNUM%.xml %SLOTNUM%.txt /y
GOTO end
:count
IF %SLOTCOUNT%==8 SET SLOTCOUNT=9
IF %SLOTCOUNT%==7 SET SLOTCOUNT=8
IF %SLOTCOUNT%==6 SET SLOTCOUNT=7
IF %SLOTCOUNT%==5 SET SLOTCOUNT=6
IF %SLOTCOUNT%==4 SET SLOTCOUNT=5
IF %SLOTCOUNT%==3 SET SLOTCOUNT=4
IF %SLOTCOUNT%==2 SET SLOTCOUNT=3
IF %SLOTCOUNT%==1 SET SLOTCOUNT=2
IF %SLOTCOUNT%==0 SET SLOTCOUNT=1
:end |
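The new rule (one missing boinc_task_state.xml is forgiven, a second one in a row is not) can be sketched in Python as a small state machine. In the batch version the n.txt file plays the role of the grace flag; the class and function names below are made up for this sketch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotState:
    last_fraction: Optional[str] = None  # plays the role of n.xml
    grace_available: bool = True         # plays the role of n.txt existing


def check_slot(state: SlotState, current_fraction: Optional[str]) -> bool:
    """Return True if the slot counts as making progress, False if stuck.

    current_fraction is None when boinc_task_state.xml is missing.
    """
    if current_fraction is None:
        if state.grace_available:
            state.grace_available = False  # first miss: forgive it once
            return True
        return False                       # second miss in a row: stuck
    state.grace_available = True           # file is back, grace is restored
    progressed = current_fraction != state.last_fraction
    state.last_fraction = current_fraction
    return progressed
```

A slot whose state file disappears at the moment a task completes is thus counted as progressing once, and only reported as stuck if the file is still absent at the next run.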
|
|
lohphatSend message
Joined: 21 Jan 10 Posts: 44 Credit: 1,308,255,633 RAC: 8,219,899 Level
Scientific publications
|
After looking at my stats I came here and found out about the stuck WUs and it looks like I wasted a MONTH of GPU time. I reset the project and am now getting different WUs.
Oh joy.
|
|
|
|
We seem to have a persistent problem with WU 4792977 (I60R5-NATHAN_KIDKIXc22_6-9-50-RND2135). Three computers have failed to run it so far, all with 'exit status 98' after two or three seconds. The error messages are variously
ERROR: file mdsim.cpp line 985: Invalid celldimension (linux)
ERROR: file pme.cpp line 85: PME NX too small (windows) |
|
|