Message boards : Number crunching : GPUgrid disk writes 100GB+ a day, any way to reduce checkpoint frequency?
Author | Message |
---|---|
So basically, my logs look like this: | |
ID: 41646 | Rating: 0 | |
My GERARD_FXCXCL12_LIG tasks have been checkpointing about every 5.5 minutes. I have the minimum checkpoint interval set to 300 seconds (5 minutes), for whatever good that does, using BOINC 7.6.6 x64 (Win7 64-bit). Maybe your version of BOINC needs updating? | |
ID: 41647 | Rating: 0 | |
Thanks Jim, I am using the latest stable version. It's interesting that you are using the same OS as me, but with the development client, and are not seeing the issue. I will try that out. | |
ID: 41649 | Rating: 0 | |
That's interesting, doobedy! I have the same BOINC version as you and am seeing frequent file changes in a GPU-Grid slot directory as well. In Resource Monitor the data rate of the GPU-Grid process is really low, though, never exceeding 100 B/s (watched for ~1 min). | |
ID: 41651 | Rating: 0 | |
I've checked it on my hosts (with the built-in Task Manager), and I've found that at 50% completion the acemd.847-65.exe app had written nearly 3.5 GB. On a host with a GTX 980 and Windows XP x64 it will write ~23.26 GB every day. This much data will not shorten an SSD's lifetime significantly. However, the GPUGrid app does not actually checkpoint according to the period set in BOINC Manager. | |
ID: 41652 | Rating: 0 | |
"I've checked it on my hosts (with the built-in Task Manager), and I've found that at 50% completion the acemd.847-65.exe app had written nearly 3.5 GB. On a host with a GTX 980 and Windows XP x64 it will write ~23.26 GB every day. This much data will not shorten an SSD's lifetime significantly. However, the GPUGrid app does not actually checkpoint according to the period set in BOINC Manager." Like I said, I'm very laid back when it comes to SSDs. I get similar numbers to yours written by acemd.847-65.exe, but it's not the main culprit. What's causing an extra 100 GB on top of that can be seen (at least on my machine) by looking at the event log with checkpoint_debug enabled. It's logging checkpoints 4x a minute. In Resource Monitor, on the Disk tab, a constant stream of writes to restart.coor, restart.idx, and restart.vel shows up. Sort by Write (Bytes/sec) and they will be at the top, belonging to the System process. | |
ID: 41653 | Rating: 0 | |
"That's interesting, doobedy! I have the same BOINC version as you and am seeing frequent file changes in a GPU-Grid slot directory as well. In Resource Monitor the data rate of the GPU-Grid process is really low, though, never exceeding 100 B/s (watched for ~1 min)." See what I said above for some hints about tracking down this particular issue, but because it's showing up under System I/O, I think that's the only place it will show up in Task Manager, where you can sort by total writes. System will include lots of other I/O as well, like Windows logs, but it will give you an idea. If you want to do a quick sanity check, you can run something like http://www.ssdready.com/, which I used to confirm something weird was going on. Just run it, hit start, and it shows you total writes and estimated writes per day. I'd ignore the estimated SSD life, since it's very conservative, but it can give you a ballpark. I compared it with the S.M.A.R.T. data the drive reports, and they matched. | |
ID: 41654 | Rating: 0 | |
I used SSDReady to check my write volume: about 140 GB/day. With GPU-Grid on hold this reduced to 70 GB/day. I upgraded from 7.4.42 to 7.6.6 and the daily write load with GPU-Grid (and everything else, as before) reduced to 80 GB/day. | |
ID: 41658 | Rating: 0 | |
Interesting. Did you actually see checkpoints spaced 5 minutes apart in the event log? Are your preferences set to 300 seconds? | |
ID: 41659 | Rating: 0 | |
I set up a RAM disk, which I have mixed feelings about. If I lose WUs because of unexpected shutdowns, I'll have to give up on that. I already lost some after upgrading. It helps to have an uninterruptible power supply (UPS), especially in the spring and summer months when lightning can produce brief outages. In my area, they last for only a second or two, but I use a UPS on all my machines to bridge the gap. And if the machines are not stable but crash for other reasons, I would attend to that first before putting BOINC data on a RAM disk. | |
ID: 41660 | Rating: 0 | |
I'm still seeing frequent writes of those files. Just the overall transfer volume somehow went down. I haven't performed any further checks, though. | |
ID: 41661 | Rating: 0 | |
I keep my BOINC Data directory on a mechanical drive, even though the OS is installed on an SSD. | |
ID: 41662 | Rating: 0 | |
Whatever is left in my 33-34 GB/day is not influenced by the checkpoint time I set in BOINC. I tried 300 and 180 s, each over a whole night and without any WCG CEP2 WUs. The amount transferred was pretty much the same. | |
ID: 41663 | Rating: 0 | |
"Whatever is left in my 33-34 GB/day is not influenced by the checkpoint time I set in BOINC. I tried 300 and 180 s, each over a whole night and without any WCG CEP2 WUs. The amount transferred was pretty much the same." Funny, in the last 28 hours of uptime, my RAM disk process shows 218 GB written in Task Manager. That's with WCG going, so it's not an attempt to measure scientifically. I suspect GPUgrid ignores the checkpoint settings, but is behaving well on your machine since the upgrade to 7.6.6. Which is fine, as long as it's aiming for 5 minutes and not 15 seconds. I don't even know what percentage of projects use the checkpoint setting. Who would know how all of this works? Is it the Acemd executable that calls a checkpoint? Maybe it's using the checkpoint mechanism to do necessary work and it's not a bug at all. I really have no idea. Who is the developer who would know? | |
ID: 41665 | Rating: 0 | |
I have a UPS, although, despite Los Angeles's reputation for brownouts, we've had one power outage in my neighborhood in the past 5 years, and it was the classic Mylar balloon hitting a transformer. Even more tangentially, this particular computer was stress tested pretty hard. I don't usually tinker, and have never overclocked, but these modern chipsets make it easy to tweak useful things. I wanted my media/gaming computer in the living room to crunch 24/7, and do it while staying cool and silent. In a mini ITX case. So I undervolted the CPU core by about 10% and dropped the turbo speed by 0.2 GHz. It's 5% slower but 10 °C cooler, and CPU power usage dropped from 62 W to 50 W under load, measured at the wall. The fans stay quiet, and combined with a Maxwell GPU, I'm only using 170 W for the entire computer, and wish I had bought a GTX 980 instead of a 960. But GPU crunching wasn't on my radar at the time. Edited to add: I forgot my original point, which is that computers are computers. Can't trust them too much, especially when they are used to do more than crunch. I stream Netflix and play games while BOINC remains at 100% load in the background. It's happily done that for weeks, but not all programs are good neighbors. Lockups happen. | |
ID: 41666 | Rating: 0 | |
"I stream Netflix and play games while BOINC remains at 100% load in the background. It's happily done that for weeks, but not all programs are good neighbors. Lockups happen." I do most BOINC work on dedicated machines (both Ivy Bridge and Haswell), and they are quite stable. But on my main machine (Haswell), I record and edit video, play it back over the LAN, do BOINC (both CPU and GPU) and numerous other things. I have spent two years trying to get it stable. It is one thing after another. When you add too much stuff, there is always some straw that breaks the camel's back. | |
ID: 41668 | Rating: 0 | |
You can turn on the <checkpoint_debug> flag, in your cc_config.xml file or by changing "Event Log Options" on the user interface, to see the checkpoints happen, per http://boinc.berkeley.edu/wiki/Client_configuration ... and you'll see the lines in the Event Log that the original post shows. | |
ID: 41670 | Rating: 0 | |
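
For anyone who wants to try the checkpoint_debug suggestion from the last post, here is a minimal sketch of what the relevant part of cc_config.xml could look like, wrapped in a small Python check so the fragment can be verified as well-formed before it is copied into the BOINC data directory. The `CC_CONFIG` constant and the `checkpoint_debug_enabled` helper are illustrative names, not part of BOINC; the element names follow the client configuration page linked above. If you already have a cc_config.xml, the flag would need to be merged into its existing `<log_flags>` section rather than replacing the file.

```python
import xml.etree.ElementTree as ET

# Minimal cc_config.xml content that enables only the checkpoint_debug log flag.
# (Illustrative constant; an existing config file may contain other options.)
CC_CONFIG = """<cc_config>
  <log_flags>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>
"""


def checkpoint_debug_enabled(xml_text: str) -> bool:
    """Return True if the given config text turns on the checkpoint_debug flag."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    flag = root.find("./log_flags/checkpoint_debug")
    return flag is not None and (flag.text or "").strip() == "1"


if __name__ == "__main__":
    print(checkpoint_debug_enabled(CC_CONFIG))
```

After saving the file, the client has to re-read its config files (or be restarted) before the checkpoint lines start appearing in the Event Log.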
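
As a rough back-of-the-envelope check on the numbers reported in this thread, the sketch below estimates how much a single checkpoint would have to write to account for the extra ~100 GB/day at 4 checkpoints per minute, and what the daily total would be if the 300-second preference were honored. It assumes, purely for illustration, that all of the extra writes come from checkpoint files and that every checkpoint writes the same amount; neither is confirmed in the thread, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checkpoints_per_day(interval_s: float) -> float:
    """Number of checkpoints written in a day at a fixed interval."""
    return SECONDS_PER_DAY / interval_s


def daily_writes_gb(interval_s: float, gb_per_checkpoint: float) -> float:
    """Daily write volume if every checkpoint writes the same amount."""
    return checkpoints_per_day(interval_s) * gb_per_checkpoint


# Figures taken from the thread: 4 checkpoints a minute, ~100 GB/day extra.
observed_interval_s = 60 / 4
observed_extra_gb_per_day = 100.0

# Assumption: the whole extra volume is checkpoint data of constant size.
gb_per_checkpoint = observed_extra_gb_per_day / checkpoints_per_day(observed_interval_s)

# Projected volume if checkpoints happened every 300 s, as set in preferences.
projected_gb_per_day = daily_writes_gb(300, gb_per_checkpoint)

if __name__ == "__main__":
    print(f"interval: {observed_interval_s:.0f} s")
    print(f"checkpoints/day: {checkpoints_per_day(observed_interval_s):.0f}")
    print(f"per checkpoint: {gb_per_checkpoint * 1000:.1f} MB")
    print(f"projected at 300 s: {projected_gb_per_day:.1f} GB/day")
```

Under those assumptions each checkpoint comes to roughly 17 MB, and a 300-second interval would bring the checkpoint-related writes down to about 5 GB/day, a 20-fold reduction.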