Message boards : Graphics cards (GPUs) : Change in run time?

Nightlord
Joined: 22 Jul 08
Posts: 61
Credit: 5,461,041
RAC: 0
Message 2160 - Posted: 7 Sep 2008 | 12:31:01 UTC

Has something changed in the WUs to reduce the run time?

This host: http://www.gpugrid.net/results.php?hostid=6090 was running around 53000 seconds per WU, but the last two have taken around 38000 and then 35500 seconds, with no changes to that box. The WU it's running right now has completed 8.632% in 52.00 minutes, which extrapolates to about 36000 seconds!
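
For reference, the estimate above is a straight-line extrapolation: elapsed time divided by the fraction complete. A minimal sketch in Python, assuming progress is linear in time (the function name is made up for illustration and is not part of BOINC):

```python
def predict_total_seconds(elapsed_seconds, percent_complete):
    """Linearly extrapolate the total run time from progress so far."""
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be in (0, 100]")
    return elapsed_seconds / (percent_complete / 100.0)


# 52 minutes elapsed at 8.632% complete, as quoted in the post
estimate = predict_total_seconds(52 * 60, 8.632)
print(f"predicted total: {estimate:.0f} seconds")
```

The result is only as good as the elapsed-time figure fed into it, so a clock that under-reports elapsed time will make the WU look faster than it is.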

____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 2161 - Posted: 7 Sep 2008 | 12:34:14 UTC

I'd be suspicious. Maybe some error which the client doesn't detect? Have you tried rebooting?

MrS
____________
Scanning for our furry friends since Jan 2002

Nightlord
Message 2162 - Posted: 7 Sep 2008 | 12:59:24 UTC - in response to Message 2161.
Last modified: 7 Sep 2008 | 13:24:55 UTC

I just noticed that these are the first WUs crunched on that box since I upgraded to Boinc 6.3.10 to work with BoincView, so I was incorrect to state "no changes". However, I performed the upgrade on a couple of boxes at the same time and they do not show this behaviour.

I'll shut down Boinc, reboot that machine and see what happens. A shorter WU time is great if the science is good, but I don't want a bad box reporting and getting credit for "valid" results if they are not really valid.


/edit

This is what I get:

Previous
Run time 1:14:30, 11.682% complete = 38263 seconds predicted total run time

After re-boot
Run time 1:24:30, 12.858% complete. Since the previous reading: +600 seconds, +1.176% = ~51000 seconds predicted total run time

After re-boot it appears to be back at normal runtime.
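
The post-reboot figure is based on the change between the two readings rather than on the totals, so it reflects only the rate since the reboot. A rough sketch of that calculation, assuming the `H:MM:SS` run-time format shown above (both helper names are hypothetical):

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' run-time string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def predict_from_two_readings(time1, percent1, time2, percent2):
    """Extrapolate total run time from the progress made between two readings."""
    delta_seconds = hms_to_seconds(time2) - hms_to_seconds(time1)
    delta_percent = percent2 - percent1
    if delta_seconds <= 0 or delta_percent <= 0:
        raise ValueError("second reading must be later and further along")
    return delta_seconds / (delta_percent / 100.0)


# Readings before and after the reboot, taken from the post
estimate = predict_from_two_readings("1:14:30", 11.682, "1:24:30", 12.858)
print(f"predicted total at the post-reboot rate: {estimate:.0f} seconds")
```

Using the difference between readings avoids mixing the suspect pre-reboot rate into the post-reboot estimate.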

Other boxes upgraded to 6.3.10 are OK, so I don't think this is the real cause. What is a worry is that there were no errors reported and the WU were marked as valid.
____________

Nightlord
Message 2205 - Posted: 8 Sep 2008 | 19:53:15 UTC
Last modified: 8 Sep 2008 | 20:52:48 UTC

Well, now I'm confused. Yesterday after reboot the machine seemed to be crunching at normal run time. Today it is back running fast again.

Run time 7:04:47, 69.505% complete = ~36600 seconds

If GDF picks this up, can you confirm if the last few (fast) units on this computer are OK? They are marked valid, but seem much faster than all others.

/edit

So now I think I understand a little more. That host shows an average CPU efficiency of around 50%, which means something is taking CPU cycles. The box is a dedicated cruncher with no purpose other than to heat the room.

I noticed a couple of strange instances this evening where the displayed run time would freeze. I noticed that if I moved the mouse the displayed run time started ticking up again. That box has no screen saver or power options set.

So, then I left the System Monitor open and watched for this behaviour. Whenever the run time appeared to freeze, Boincmgr shot up to 50% CPU. I can confirm this is not the benchmarks running, it is something else.

I think this is the culprit behind both the low Average CPU Efficiency and the false run time. I suspect the run time is being incorrectly calculated based on the number of seconds that Boincmgr is active.

So the WUs are probably valid, but in fact they are probably taking twice as long rather than half as much time because the CPU is loaded by Boincmgr.

My conclusion is that I have a corrupt Boinc installation. I'll make a clean installation and see what happens.
____________

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 2210 - Posted: 9 Sep 2008 | 8:45:17 UTC - in response to Message 2205.

Yes, the WUs are correct, and they are actually slower than the previous ones.
As you said, you probably had some other process taking up resources.
CPU time is not the best way to estimate compute time. In the next application version I will print the time per step to stderr, which should be the same regardless of the WU and give an absolute measure of one card versus another.

gdf
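
To illustrate the point about CPU time, here is a small generic Python sketch (not the project's application code) that measures wall-clock time and CPU time around the same piece of work. The `time.sleep` call is only a stand-in for time spent waiting on something other than the CPU, such as a GPU kernel; the `measure` helper is hypothetical.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) it consumed."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process only
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping stands in for waiting on a device: wall time passes, CPU time barely moves.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

A per-step figure like the one GDF describes would presumably be a wall-clock time divided by the number of steps, which is why it would not depend on how busy the CPU is.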

Nightlord
Message 2213 - Posted: 9 Sep 2008 | 11:22:43 UTC - in response to Message 2210.


Thanks GDF!

Last night I made a clean installation of Boinc 6.3.10 and ran a normal CPU project for an hour. The same thing happened: after a few minutes boincmgr increased to 50% CPU and the wall time for the WUs stopped ticking. Moving the mouse restarted the wall time and dropped boincmgr back to 0%.

So I quickly re-installed Ubuntu and made a 100% clean host. I installed the drivers and downloaded a new Boinc 6.3.10. Finally, I connected to PS3grid and let it run overnight. This morning I find the same symptoms on this 100% clean installation. Wall time was frozen and boincmgr was at 50% CPU. Moving the mouse drops boincmgr and releases the wall time clock.

I know I upgraded the box from Boinc 6.3.8 to 6.3.10 just before this first started to happen last weekend, but this is now a clean OS install and a clean Boinc 6.3.10. So I don't know why boincmgr should stay at 50% CPU after a few minutes, why it only started recently and why it continues on a clean OS install.

Nonetheless, if the WUs are valid, then it's OK for now... I just wish I could fix it.


____________
