
Message boards : Server and website : Optimized bandwidth

GDF (project administrator)
Message 54125 - Posted: 27 Mar 2020 | 22:37:54 UTC

We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.

gdf

Zalster
Message 54127 - Posted: 28 Mar 2020 | 0:44:18 UTC - in response to Message 54125.

Thank you!!

rod4x4
Message 54129 - Posted: 28 Mar 2020 | 3:47:14 UTC - in response to Message 54125.

We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.

gdf


I can confirm the site is far more responsive to browse via web.

Many thanks for your efforts!

kain
Message 54137 - Posted: 28 Mar 2020 | 14:53:59 UTC - in response to Message 54129.

We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.

gdf


I can confirm the site is far more responsive to browse via web.

Many thanks for your efforts!


Indeed, it's much faster now.

Thank you!

Aurum
Message 54139 - Posted: 28 Mar 2020 | 15:32:18 UTC

I still woke up this morning to a long queue of GG WUs needing to move up & down. When they're moving the transfer rate seems faster.

ISPs choke the upload rate to a mere 10% of your bandwidth, while downloads get 90%. Since GG won't send another WU until the completed WU has finished uploading, computers sit idle waiting, i.e. pregnant pauses.

What's wrong with increasing our ration to 3 WUs per GPU???

ServicEnginIC
Message 54141 - Posted: 28 Mar 2020 | 17:55:59 UTC - in response to Message 54125.
Last modified: 28 Mar 2020 | 17:56:26 UTC

On March 27th 2020 GDF wrote:

We have optimized the network so that bandwidth to the server should double...

Since March 14th, due to the coronavirus crisis, here in Spain all of us citizens are under a government home-confinement order.
Gianni, Toni, and all the GPUGrid team behind the scenes: thank you very much for your continued support during these hard times!!!

Hoping everybody stays healthy,

Aurum
Message 54590 - Posted: 6 May 2020 | 14:07:36 UTC
Last modified: 6 May 2020 | 14:26:05 UTC

I've been watching my BoincTasks Transfers page for the last several hours, wondering when it will clear. When I got up, there were almost no GG WUs actually running, since the UL queue was full. I've been using these commands in my cc_config file for a year or so; at first they seemed to help, but now I'm not so sure. Maybe it would help if everyone used them, or if they could be enforced by the server:

<max_file_xfers>9</max_file_xfers>
<max_file_xfers_per_project>3</max_file_xfers_per_project>


From https://boinc.berkeley.edu/wiki/Client_configuration
<max_file_xfers>N</max_file_xfers>
Maximum number of simultaneous file transfers (default 8).
<max_file_xfers_per_project>N</max_file_xfers_per_project>
Maximum number of simultaneous file transfers per project (default 2).
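
For reference, these options belong inside the <options> section of cc_config.xml in the BOINC data directory; a minimal file combining the two settings quoted above would look like this (the client rereads it on restart or via "Options → Read config files"):

```xml
<!-- cc_config.xml: client configuration, read at startup -->
<cc_config>
  <options>
    <max_file_xfers>9</max_file_xfers>
    <max_file_xfers_per_project>3</max_file_xfers_per_project>
  </options>
</cc_config>
```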

But it does not behave as described, maybe because things are actually happening faster in the computer than what's being displayed on the screen. Also, these commands lump CPU & GPU WUs together and treat DLs the same as ULs.

My Charter Spectrum ISP limits UL speeds to 10% of DL speeds so UL is always the choke point. I just ran a speed test with a GG transfer backlog trying to clear:
53.3 Mbps download and 4.66 Mbps upload.

It seems that it would be better if the GG server could limit the number of ULs from a given IP address.

For now I'm going to switch to 64 & 1 and see how that runs through the next couple of days of server backups:
<max_file_xfers>64</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>

Note: I don't know the first thing about how servers work.

Erich56
Message 54591 - Posted: 6 May 2020 | 14:42:30 UTC

What I have noticed today, since this morning, is several obvious GPUGRID server outages:
no access to the homepage, and stalled uploads and downloads :-(

Ian&Steve C.
Message 54592 - Posted: 6 May 2020 | 16:31:59 UTC

transfers are really sluggish.
website is hit or miss.

something doesn't seem right

Erich56
Message 54593 - Posted: 6 May 2020 | 16:36:46 UTC - in response to Message 54592.

something doesn't seem right

I am wondering whether the GPUGRID people are aware of the problem?
No comments here from their side so far :-(

Keith Myers
Message 54595 - Posted: 6 May 2020 | 17:27:01 UTC

Downloads are currently being limited by a single connection from the project to any host.

Toni (project administrator)
Message 54596 - Posted: 6 May 2020 | 17:37:17 UTC - in response to Message 54595.

I see nothing obviously wrong, so I hope it's some international connectivity issue.

Ian&Steve C.
Message 54597 - Posted: 6 May 2020 | 17:46:29 UTC - in response to Message 54595.

Downloads are currently being limited by a single connection from the project to any host.


is there a source for this?

Keith Myers
Message 54598 - Posted: 6 May 2020 | 17:47:51 UTC - in response to Message 54596.

I see nothing obviously wrong, so I hope it's some international connectivity issue.

Of course as soon as I post something about it, all the stalled uploads and downloads cleared out.

The only thing of consequence now is the project requested a 1 hour backoff.

Keith Myers
Message 54599 - Posted: 6 May 2020 | 17:49:52 UTC - in response to Message 54597.

Downloads are currently being limited by a single connection from the project to any host.


is there a source for this?

No, just what I was observing and after I posted that, the rest of the connections picked up and all the stalled tasks moved to the project on both hosts.

Toni says he sees nothing wrong on his end. Thinks there might be international connection issues that we were seeing.

Aurum
Message 54600 - Posted: 6 May 2020 | 18:31:07 UTC

I'm amazed that we've gone this long through an international crisis with connectivity being up. Should be no surprise that elements of the net start going down. Will probably get worse before it gets better.
My transfer queue cleared after 7 hours.

Keith Myers
Message 54601 - Posted: 6 May 2020 | 19:24:28 UTC

Well I'm back to stalled up/downloads again that I can't persuade to get moving.
Hoping that posting about it works the magic again.

Keith Myers
Message 54603 - Posted: 7 May 2020 | 1:44:21 UTC
Last modified: 7 May 2020 | 1:45:40 UTC

The problem seems to be that my hosts never receive an ACK from the project about successful uploads.

Half my pending uploads are sitting at 100% progress for the small files but never clear the list.

Richard Haselgrove
Message 54605 - Posted: 7 May 2020 | 8:30:07 UTC - in response to Message 54596.

I see nothing obviously wrong, so I hope it's some international connectivity issue.
I don't know what you're able to look at, but it's been particularly bad at certain times of day for the last 24 hours.

Yesterday morning, most attempts at most connections were failing until about 10:00 UTC. Then, suddenly the floodgates opened, and I managed to get all tasks uploaded, reported, and replaced over about 20 minutes. I went out for the day, but when I returned in the evening, most machines were queuing again and were still in backlog when I went to bed.

Starting this morning at about 06:05 UTC, most machines were running, but two were in local backoff. A single 'retry', and both uploaded, reported, and downloaded at full normal speed.

There was a small glitch around 06:45 UTC, but the rot set in an hour ago, just after 07:00 UTC. A few tasks have crept through, but I now have 9 tasks uploading, and 3 tasks downloading. Each task requires 16 separate server connections: 6 to upload, 1 scheduler contact, and 9 downloads.

Most of the delays seem to be failures to connect, so I'm not sure whether they would show up in internal monitoring - possibly only in slower turnround and reduced research throughput.

With the mothballing of SETI@Home, you will have the opportunity (extra volunteers) to complete much more bioscience research. But I would urge you to, perhaps, commission a network traffic audit from a networking specialist to try to locate the cause of these problems. Otherwise, you may find that the additional volunteers float away as suddenly as they arrived.

One additional problem is that every type of connection has to pass through the same bottleneck. Now to try connection number 17, to post this message.

Failed - "This site can’t be reached. The connection was reset." Take 2...

Richard Haselgrove
Message 54606 - Posted: 7 May 2020 | 9:12:50 UTC

Today's floodgates opened a little earlier. Just completed this morning's big exchange - I'm good for a few more hours.

Retvari Zoltan
Message 54609 - Posted: 7 May 2020 | 12:01:56 UTC

I had connection problems with these DC projects:
GPUGrid
Einstein@home
folding@home
However I didn't have connection problems with:
TN-Grid
Rosetta@home (I set 24h workunits for Rosetta, so this project may not be affected by this).

Aurum
Message 55300 - Posted: 16 Sep 2020 | 16:30:26 UTC - in response to Message 54596.

I see nothing obviously wrong, so I hope it's some international connectivity issue.
I'm in the USA. These are European BOINC projects that have never behaved like GG:
Ibercivis
LHC
TN-GRID
Asteroids
YoYo
Yafu
Universe
QuChemPedIA
I would look at your server configuration some more.

Aurum
Message 55387 - Posted: 30 Sep 2020 | 16:25:06 UTC
Last modified: 30 Sep 2020 | 16:42:41 UTC

If I stop babysitting GPUGrid for a couple of hours, i.e. clicking Retry All on the BoincTasks Transfer tab, this is what greets me:
https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg
I contacted Fred at eFMer and he pointed out the <refresh> command which I tested:

<config>
<refresh>
<uploads>60</uploads>
<downloads>60</downloads>
</refresh>
</config>

or
<config>
<refresh>
<auto>60</auto>
</refresh>
</config>

But sadly it only works on localhost and does not help my headless fleet.
Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly???
It would also help to eliminate the stifling 2 WU per GPU limitation.

Ian&Steve C.
Message 55389 - Posted: 30 Sep 2020 | 17:29:39 UTC - in response to Message 55387.
Last modified: 30 Sep 2020 | 17:32:35 UTC

If I stop babysitting GPUGrid for a couple of hours, i.e. clicking Retry All on the BoincTasks Transfer tab, this is what greets me:
https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg
I contacted Fred at eFMer and he pointed out the <refresh> command which I tested:
<config>
<refresh>
<uploads>60</uploads>
<downloads>60</downloads>
</refresh>
</config>

or
<config>
<refresh>
<auto>60</auto>
</refresh>
</config>

But sadly it only works on localhost and does not help my headless fleet.
Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly???
It would also help to eliminate the stifling 2 WU per GPU limitation.


you could write a script to issue the retry transfers. then just have it run locally on each system looping with a timed wait.

your systems are hidden so i can't really be more specific since I don't know what kind of setups you have.

Aurum
Message 55392 - Posted: 30 Sep 2020 | 18:20:02 UTC - in response to Message 55389.

your systems are hidden so i can't really be more specific since I don't know what kind of setups you have.
They're naked as a Jaybird now :-)

Ian&Steve C.
Message 55393 - Posted: 30 Sep 2020 | 18:50:55 UTC - in response to Message 55392.
Last modified: 30 Sep 2020 | 18:57:40 UTC

OK, since you are running Linux, try this. You can use the boinccmd tool to retry transfers.

You have to script it, since the boinccmd tool only seems to be able to apply the retry command to a single transfer; there is no "all" option.

This script will find the stuck transfers, grab their file names, then retry them for the given project. *Note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:


#!/bin/bash
# retry every stuck file transfer for the GPUGrid project
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
    boinccmd --file_transfer https://gpugrid.net "$i" retry
done


Name the script something like "update_transfers.sh", then make it executable:
sudo chmod +x update_transfers.sh


run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.

If you have a user-install version of BOINC (i.e. one that does not need to be "installed" and just runs from your home folder), then you need to put the script in the same directory as your boinccmd executable, and modify the script, replacing "boinccmd" with "./boinccmd".

you'll have to do this on each of your 40ish hosts.
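
Since `watch` stops when the login session ends, a cron entry is one possible way to keep this running unattended on each headless box. This is only a sketch; the script path below is an assumption, so adjust it to wherever you saved update_transfers.sh:

```shell
# add via `crontab -e`: retry stuck GPUGrid transfers every 10 minutes
*/10 * * * * /home/boinc/update_transfers.sh >/dev/null 2>&1
```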

Ian&Steve C.
Message 55394 - Posted: 30 Sep 2020 | 19:40:12 UTC
Last modified: 30 Sep 2020 | 19:44:07 UTC

but the real problem is the short cooldown time from the project combined with a block on communications from the same IP (it seems they have a 30 second timer on that). they need to fix that.

basically, if most of your computers are at the same location, and thus have the same external IP address, the GPUGRID servers will only allow communication from 1 system at a time. when you have a lot, it's very likely that multiple systems try to communicate with the project at the same time or within a very short interval. the project seems to reject these successive requests, BOINC thinks there's a problem, and eventually just stops trying until you manually intervene.

this is exacerbated by the short cooldown time. I think it's like 10 seconds or something ridiculously short, so the project is kind of forcing this behavior. if you only have 1 or 2 systems, especially if they are rather slow, you'll rarely or never run into this problem.

they need to fix one or both of these settings. either by shortening or turning off the IP block timer that they have setup, or by changing their project cooldown to something much longer, like 10 minutes.

Keith Myers
Message 55399 - Posted: 30 Sep 2020 | 22:15:43 UTC

they need to fix one or both of these settings. either by shortening or turning off the IP block timer that they have setup, or by changing their project cooldown to something much longer, like 10 minutes.

+100

Aurum
Message 55402 - Posted: 1 Oct 2020 | 1:50:57 UTC
Last modified: 1 Oct 2020 | 1:52:28 UTC

Thanks for the script. I installed it but I also upgraded Nvidia driver to 455.23.04 and when I rebooted I lost the headless computer.
Is anyone seeing a problem with 455.23.04???

Yes, all my rigs are behind the same external IP address.

I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.

Yea, today was the last day of TOU season so I can crunch 24x7 for the next 6 months. I'm watching BoincTasks All Computers Transfers and everything is flying and working well. I expect as the first round of 2080 WUs finish they'll start entering this banishment mode. Then the 1080s will follow into banishment.

You'd think these guys would want to get more work done faster instead of forcing less work to get done slower. They've implemented 3 things that reduce my throughput by about 75%: the max of 2 WUs per GPU, the IP blocker, and the short cooldown time.

I'm fine with 2 WU/GPU if they'd deliver work quickly. WCG implemented per-project limits in their Device Profiles (Project Limits):
"The following settings allow you to set the maximum number of tasks assigned to one of your devices for a project."

So I set the limits to threads+2 and return all WUs within 24 hours, even the massive ARPs. I'd like to do the same for GG but they've hog-tied me. This is really frustrating, since this is the only BOINC life-science GPU project available.

Ian&Steve C.
Message 55403 - Posted: 1 Oct 2020 | 2:51:46 UTC - in response to Message 55402.
Last modified: 1 Oct 2020 | 3:00:08 UTC


I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.


"cooldown" is an unofficial term. I don't know what it's called on the project server side maybe "requested delay"?. you can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for..."

basically, when your system completes a scheduler request, the project always tells it to wait some amount of time before communicating again. this is standard BOINC behavior, and every project has a different delay preset in its server configuration. SETI was 303 seconds. Einstein is like 60 seconds.

in the case of GPUGRID, that time is 31 seconds, which is much too short when you have many fast systems at the same IP. you don't need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.

rod4x4
Message 55404 - Posted: 1 Oct 2020 | 3:33:05 UTC - in response to Message 55403.


I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.


"cooldown" is an unofficial term. I don't know what it's called on the project server side maybe "requested delay"?. you can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for..."

basically when you communicate with the project for a schedule request. after the request is completed, the project always tells your system to wait some amount of time before trying to communicate again. this is standard BOINC behavior and every project has a different delay pre-set on their server configuration. SETI was 303 seconds. Einstein is like 60 seconds.

in the case of GPUGRID, that time is 31 seconds. which is much too short when you have many fast systems at the same IP. you dont need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.

Not sure it is a BOINC server setting stopping the communications, as it also affects access to the entire web site.
It is more likely to be at the perimeter of the network, probably part of a network-defence strategy against DDoS and similar attacks.
The settings may be out of GPUGrid's hands and controlled by the network administrators at UPF.

Ian&Steve C.
Message 55405 - Posted: 1 Oct 2020 | 3:55:19 UTC - in response to Message 55404.
Last modified: 1 Oct 2020 | 3:57:59 UTC


Not sure it is a BOINC server setting stopping the communications, as it also affects access to the entire web site.
It is more likely to be at the perimeter of the network, probably part of a network-defence strategy against DDoS and similar attacks.
The settings may be out of GPUGrid's hands and controlled by the network administrators at UPF.


it's a project server setting that controls the requested delay. the IP-blocking timeout must be something set up on their network.

it's the combination of the short deferral time and the IP-block timeout; either one by itself doesn't cause a problem.

I've forced my systems to go into a cooldown for 10 minutes after each scheduler request (using a custom BOINC client) and that fixed all issues for my systems. the IP-block timer is still in effect, but they no longer get stuck idle, because the chance of one system trying to communicate at the same time as another is drastically reduced.

the way things are by default almost guarantees that if you have more than 1 fast computer, you will have issues.

but the requested delay is absolutely in their control to change. I've seen many other projects adjust this value when they wanted to. there's no reason GPUGRID can't change it either, since they are using the BOINC server software.

Keith Myers
Message 55406 - Posted: 1 Oct 2020 | 7:03:59 UTC

Some projects are even more "busy" than GPUGrid. When I joined Universe I found it had a project delay time of 11 seconds. Totally ridiculous. No host or download server needs to be polled that often.

So the first thing I did was put in a 120 second cooldown for that project for my client.

As Ian states, that parameter is set in the project server software.

Aurum
Message 55419 - Posted: 4 Oct 2020 | 10:39:07 UTC

Not a peep from GG staff. You'd think lengthening the cooldown period would be trivial to try. This problem is just getting worse for me.

Aurum
Message 55432 - Posted: 5 Oct 2020 | 21:50:14 UTC

All I hear are crickets ;-(

Aurum
Message 55532 - Posted: 10 Oct 2020 | 9:50:21 UTC

Will GPUGrid ever outgrow the need for a babysitter???

Retvari Zoltan
Message 55535 - Posted: 10 Oct 2020 | 20:58:23 UTC - in response to Message 55532.
Last modified: 10 Oct 2020 | 21:00:49 UTC

Will GPUGrid ever outgrow the need for a babysitter???
While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side:
Reduce the work-cache setting on all of your hosts to roughly match the shortest workunits your hosts crunch.
That way a host asks for a new task only when the server will actually send a new (spare) one, so there are no futile work requests. This lowers the chance of getting your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a better chance of getting work as well. As there is plenty of work available at the moment, a new WU will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)
The right value depends on the GPU and the workunit too; as the present workunits are quite short, we should set a very short queue.
I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging".
0.01 days is the lowest you can set, and it is also the 'unit' (granularity) of the cache size, so you can't set 0.015 days.
With lesser cards you can try higher values:

days   seconds   h:mm:ss
0.01      864    0:14:24
0.02     1728    0:28:48
0.03     2592    0:43:12
0.04     3456    0:57:36
0.05     4320    1:12:00
0.06     5184    1:26:24
0.07     6048    1:40:48
0.08     6912    1:55:12
0.09     7776    2:09:36
0.10     8640    2:24:00

As my hosts (except for one) are crunching folding@home at the moment (btw my team is 789th as I write this), I haven't tested getting work on multiple hosts, only browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.)
I'm curious whether this fix works for you or not.
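
The table rows are just days × 86,400 seconds. As a sanity check, here is a small, self-contained shell snippet (nothing GPUGrid-specific, the variable name is my own) that converts a cache setting in days to seconds and h:mm:ss:

```shell
#!/bin/bash
# convert a BOINC work-cache setting (in days) to seconds and h:mm:ss
days=0.01
secs=$(awk -v d="$days" 'BEGIN{printf "%d", d*86400+0.5}')  # round to whole seconds
printf '%s days = %d s = %d:%02d:%02d\n' "$days" "$secs" \
  $((secs/3600)) $((secs % 3600 / 60)) $((secs % 60))
```

For 0.01 days this prints "0.01 days = 864 s = 0:14:24", matching the first row of the table.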

rod4x4
Message 55539 - Posted: 11 Oct 2020 | 0:56:52 UTC - in response to Message 55535.

Will GPUGrid ever outgrow the need for a babysitter???
While we're waiting (probably in vain) for that I've figured out the only way to mitigate this problem on our side:
You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
In this way the host will ask for a new task only when the server will actually send a new (spare) one, so there will be no futile requests for getting work, that results in a lower chance to get your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a bigger chance for getting work as well. As there is plenty of work available at the moment, a new wu will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)
The actual value depends on the GPU and the workunit too, as the present workunits are quite short we should set a very short queue.
I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging".
0.01 days is the lowest you can set, this is also the 'unit' for the size of the cache. (so you can't set 0.015 days.)
With lesser cards you can try higher values:
days   seconds   h:mm:ss
0.01      864    0:14:24
0.02     1728    0:28:48
0.03     2592    0:43:12
0.04     3456    0:57:36
0.05     4320    1:12:00
0.06     5184    1:26:24
0.07     6048    1:40:48
0.08     6912    1:55:12
0.09     7776    2:09:36
0.10     8640    2:24:00
As my hosts (except for one) are crunching folding@home at the moment (btw my team is the 789th as I wrote this), I haven't tested it with getting work on multiple host, only by browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.)
I'm curious about that fix is working for you or not.

This approach definitely has merit, but it would rely on a large percentage of GPUGrid users applying this method for any results to be seen.
Let's see how many people try this.

Retvari Zoltan
Message 55544 - Posted: 11 Oct 2020 | 13:21:05 UTC - in response to Message 55539.
Last modified: 11 Oct 2020 | 13:21:40 UTC

This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
No. When my WAN IP gets blocked by GPUGrid's DDoS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on any other user's WAN IP.

rod4x4
Message 55545 - Posted: 11 Oct 2020 | 14:53:18 UTC - in response to Message 55544.
Last modified: 11 Oct 2020 | 15:04:30 UTC

This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
No. When my WAN IP gets blocked by GPUGrid's DDoS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on any other user's WAN IP.

So are we dealing with DDoS protection, contention on a saturated link, rate limiting on an under-resourced link, a badly configured router, QoS putting our connections at the bottom of the list, or a combination of these factors?

Comments by volunteers in the forum indicate that DDoS protection and a saturated link are quite likely. The other options listed above also rate consideration.

The title of this thread also suggests the rules on the network-edge equipment have been modified to change bandwidth allocation.

I guess we will never really know unless it is identified by GPUGrid. We are really just hypothesizing.

It passes the time... and gives us a distraction.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55546 - Posted: 11 Oct 2020 | 15:26:30 UTC
Last modified: 11 Oct 2020 | 15:27:01 UTC

If you can find a way to make your system wait longer than the default 31-second comms delay between communications with the project, you will solve the problem.

Or the project could simply increase that delay so you don't have to find a workaround yourself.

Increasing the delay will not change the presence of the DDOS protections, but it will prevent users from hitting them. It's really the best solution. It just seems like the project admins either don't know where this setting is in their own server software, or the suggestion is falling on deaf ears.

This 100% solves the problem.
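
For reference, if GPUGrid runs a stock BOINC server, the setting in question is believed to live in the project's config.xml. The option name below is taken from my reading of the BOINC server documentation, and the 600-second value is just the 10-minute figure discussed later in this thread; none of this is verified against GPUGrid's actual setup:

```xml
<!-- Fragment of a BOINC project's config.xml, inside <config> ... </config>.
     The scheduler's reply delay sent to clients is derived from this value
     (to my understanding); 600 s would give a 10-minute comms delay. -->
<min_sendwork_interval>600</min_sendwork_interval>
```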
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55547 - Posted: 11 Oct 2020 | 17:33:06 UTC - in response to Message 55546.
Last modified: 11 Oct 2020 | 17:41:57 UTC

increasing the [default 31 seconds communications] delay will not change the presence of the DDOS protections, but it will prevent the users from hitting them.
I wonder about the ideal length of that delay. We don't know the exact rules of the DDOS protection that hits us, which, combined with the number of hosts a given user has behind a single WAN IP, would decide the ideal delay length. The delay can't be longer than the shortest workunit takes on the fastest GPU, because that would make those hosts starve (so it would be no better than the DDOS protection making them starve).
Taking the signs of the present DDOS protection and the present short workunits into consideration, I think there is a maximum number of hosts behind a single WAN IP that can work without some of them starving. Above that number, some random host behind that WAN IP will inevitably be hit by the DDOS protection.
So I think the delay should be around 600 seconds (10 minutes); but the workunits should also be made (at least) twice, even four times, as long. The present workunits should go to the short queue (for lesser GPUs).
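
As a back-of-envelope check (my own arithmetic, with a hypothetical 40 hosts behind one WAN IP), the request rate one IP presents to the server scales as hosts × 60 / delay:

```shell
# Rough requests-per-minute from one WAN IP. Assumptions: each host contacts
# the scheduler once per delay period, and 40 hosts is a hypothetical count.
hosts=40
delay=31    # current default comms delay, seconds
echo "$(( hosts * 60 / delay )) requests/minute at the 31 s default"
delay=600   # proposed 10-minute delay
echo "$(( hosts * 60 / delay )) requests/minute at 600 s"
# prints: 77 requests/minute at the 31 s default
#         4 requests/minute at 600 s
```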

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55551 - Posted: 12 Oct 2020 | 3:11:26 UTC - in response to Message 55547.

A 10-min delay seems pretty ideal to me. The fastest WUs I see on my 2080 Ti are about 15 mins. The 2080 Ti is about the fastest card right now (Titan RTX is barely faster, RTX 8000 is a little slower) until Ampere support is added and we can properly gauge how the 30-series cards will perform.

I have no objection to longer WUs, but they should be able to be restarted. Every time I have a power outage while my system is in the middle of one of those older PABLO WUs, I lose several hours of work, since they error out immediately when they try to resume.

I'm aware of the issue with restarting tasks on a different device, but all of my systems run identical GPUs within the same system. (Not even just the identical type; I run the same part-number SKU cards within a system.) But it even happens on my single-GPU system, so there's really no excuse there.

10 mins is the delay I run. And it definitely solved the problem. But again the most elegant solution is to have the project make a global change server side so that the clients don’t have to individually implement a workaround.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55553 - Posted: 12 Oct 2020 | 14:53:29 UTC - in response to Message 55535.

You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
I've tried this before, but not to the extreme of 0.01/0.01. As I recall, it reduces the chances of getting big WUs such as ARP & HST from WCG. I'll give it a try on all my computers today, but it'll probably take a day to see if it's working. I'm certain that 0.5/0.1 does not help GG.
Note that the GPUGrid server will send only two workunits per GPU for a given host.
This is part of the problem.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55554 - Posted: 12 Oct 2020 | 15:31:43 UTC - in response to Message 55551.

10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set your Preferences like this?
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55555 - Posted: 12 Oct 2020 | 16:13:55 UTC - in response to Message 55393.

...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

this script will search the stuck transfers, grab their file names, then retry them for the given project. *note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:

#!/bin/bash
for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer https://gpugrid.net $i retry;done

This wouldn't work, for the strangest reason: it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well, but I cannot get it to run from my crontab.

name the script something like "update_transfers.sh"
change permissions of the script to make it executable
sudo chmod +x update_transfers.sh

The script cannot use a dot, so I changed it to an underscore: BOINC_Retry_sh.
The script cannot be writable by a user other than root (aurum), so I did this:
sudo chmod 700 BOINC_Retry_sh
https://manpages.debian.org/stretch/cron/cron.8.en.html

So my script BOINC_Retry_sh is now this:
#!/bin/bash
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
    boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net $i retry
done

It wouldn't work without including --host localhost --passwd mypw but that might because I didn't store the script in the right folder.
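
For the record, here's what that sed filter extracts, run against a small fabricated sample of boinccmd --get_file_transfers output (the field layout is an assumption based on the posts above; only the sed expression itself is from the real script):

```shell
# Fabricated sample of "boinccmd --get_file_transfers" output; only the
# "name:" fields survive the sed filter, giving the file names to retry.
sample='   name: 3qhoA00_320_4-TONI_par_file_0_0
   direction: upload
   name: 2hy5B00_320_0-TONI_conf_file_enc'
printf '%s\n' "$sample" | sed -n -e 's/^.*name: //p'
# prints:
# 3qhoA00_320_4-TONI_par_file_0_0
# 2hy5B00_320_0-TONI_conf_file_enc
```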

run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.
Forgot the watch command and will revisit that now.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55558 - Posted: 12 Oct 2020 | 16:36:19 UTC
Last modified: 12 Oct 2020 | 16:40:20 UTC

It only took an hour to remind me of the problem with using a very short work queue (0.01/0.01). I believe I saw the same thing when I previously tried 0.1/0.1, which proved unworkable. I always run Milkyway along with GPUGrid, since GG dries up so quickly and I abhor idle computers.

Rig-02 13907 GPUGRID 12-10-2020 08:31 Not requesting tasks: don't need (CPU: ; NVIDIA GPU: job cache full)

So if I have MW WUs then I cannot DL a replacement GG WU.
I will not run GG exclusively just to implement a kluge.
GDF needs to fix this issue from his server side.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55562 - Posted: 12 Oct 2020 | 18:03:03 UTC - in response to Message 55555.

...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

this script will search the stuck transfers, grab their file names, then retry them for the given project. *note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:

#!/bin/bash
for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer https://gpugrid.net $i retry;done

This wouldn't work for the strangest reason, it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well but I cannot get it to run on my crontab.

name the script something like "update_transfers.sh"
change permissions of the script to make it executable
sudo chmod +x update_transfers.sh

The script cannot use a dot so I changed it to an underscore, BOINC_Retry_sh.
The script cannot be writable by a user other than root (aurum). So I did this:
sudo chmod 700 BOINC_Retry_sh
https://manpages.debian.org/stretch/cron/cron.8.en.html

So my script BOINC_Retry_sh is now this:
#!/bin/bash
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p');do
boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net $i retry;
done

It wouldn't work without including --host localhost --passwd mypw but that might because I didn't store the script in the right folder.

run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.
Forgot the watch command and will revisit that now.


There should be no reason you can't run a ".sh"; it's just a file extension. You could name it .anything, or use no extension at all as you did. It will execute the same either way; it's really inconsequential.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55571 - Posted: 12 Oct 2020 | 19:23:26 UTC - in response to Message 55562.

there should be no reason you cant run a ".sh" it's just a file extension. you could name it .anything or with no extension at all as you did. it will execute the same either way, its really inconsequential.

There is if you need to run it from a crontab:
As described above, the files under these directories have to pass some sanity checks including the following: be executable, be owned by root, not be writable by group or other and, if symlinks, point to files owned by root. Additionally, the file names must conform to the filename requirements of run-parts: they must be entirely made up of letters, digits and can only contain the special signs underscores ('_') and hyphens ('-'). Any file that does not conform to these requirements will not be executed by run-parts. For example, any file containing dots will be ignored. This is done to prevent cron from running any of the files that are left by the Debian package management system when handling files in /etc/cron.d/ as configuration files (i.e. files ending in .dpkg-dist, .dpkg-orig, and .dpkg-new).
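
Worth noting (my reading of that same manpage, so treat it as an assumption): those run-parts filename rules apply to the /etc/cron.hourly-style directories and /etc/cron.d, while a per-user crontab installed with crontab -e is not processed by run-parts, so a dotted script name should be fine there. A hypothetical entry (the paths and log file are made up for illustration):

```
# run the retry script every 10 minutes, appending output to a log
*/10 * * * * /home/aurum/update_transfers.sh >> /tmp/gpugrid_retry.log 2>&1
```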

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55572 - Posted: 12 Oct 2020 | 19:30:00 UTC - in response to Message 55571.
Last modified: 12 Oct 2020 | 19:30:21 UTC

well that's why my instructions were designed around running it in an open terminal ;). just open the terminal and run it there with the watch command
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55573 - Posted: 12 Oct 2020 | 19:35:51 UTC - in response to Message 55554.
Last modified: 12 Oct 2020 | 19:36:46 UTC

10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set this Preferences like this???
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}


I have a custom client that was developed by a team member, which overrides the default comms delay and forces a longer timeout to whatever you wish. this is how i KNOW that the issue is solved with a longer timeout, because i've done it (as have several other teammates). this software is locked to our team however, so even if I were to give you the BOINC client software, it wont work unless you are on our team.

doesnt sound like you use anything but service installs anyway. this is a custom BOINC client that runs stand alone from wherever you have it on your system. the benefit is that you dont have to "install" anything. you just copy the folder wherever you want, and run the executable from there. the downside is that it wont auto-run when you boot the system. but when you have a stable system with failover projects, it's not too much hassle. i reboot maybe every few months due to power outages or system upgrades.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55579 - Posted: 12 Oct 2020 | 22:10:26 UTC

I have another (much more sophisticated, yet not implemented) idea:
We should write a script that:
1. disables work fetch from GPUGrid
2. waits while there are two GPUGrid workunits per GPU on the host
3. enables work fetch from GPUGrid
4. waits until there are two GPUGrid workunits per GPU on the host
5. GOTO 1.

#1, #3 and #5 are trivial.
#2 and #4 are complex (especially checking how many usable Nvidia GPUs are present in the system); they should also include some sleep period. #4 should include the "update transfers" script.
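
A minimal sketch of the counting part of #2/#4, run here against a fabricated sample of boinccmd --get_tasks output (the field layout is assumed; on a live host you would pipe the real boinccmd output instead, as in the comment):

```shell
# Count GPUGrid tasks by grepping their project URL out of (assumed)
# "boinccmd --get_tasks" output.
# Live use would be:  boinccmd --get_tasks | grep -c 'gpugrid.net'
sample='1) -----------
   name: task_A
   project URL: https://www.gpugrid.net/
2) -----------
   name: task_B
   project URL: https://www.gpugrid.net/'
printf '%s\n' "$sample" | grep -c 'gpugrid.net'
# prints: 2
```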

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55583 - Posted: 13 Oct 2020 | 0:17:13 UTC - in response to Message 55579.

I have another (much more sophisticated, yet not implemented) idea:
We should write a script that:
1. disables work fetch from GPUGrid
2. waits while there are two GPUGrid workunits per GPU on the host
3. enables work fetch from GPUGrid
4. waits until there are two GPUGrid workunits per GPU on the host
5. GOTO 1.

#1, #3 and #5 are trivial.
#2 and #4 are complex (especially to check how many usable Nvidia GPUs are present in the system), they should also include some sleep period. #4 should include the "update transfers" script.


Just make it wait a set amount of time, rather than wait for x number of tasks. Would be much simpler.

boinccmd project update (to initiate send/receive)
Wait 20-30secs (to allow proj update to complete)
boinccmd set NNT
wait 10 mins
boinccmd allow NT

And just loop that.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55584 - Posted: 13 Oct 2020 | 2:17:09 UTC - in response to Message 55583.
Last modified: 13 Oct 2020 | 2:22:03 UTC

Here, I wrote it:


#!/bin/bash
while :
do
    ./boinccmd --project https://www.gpugrid.net update
    sleep 20
    ./boinccmd --project https://www.gpugrid.net nomorework
    sleep 10m
    ./boinccmd --project https://www.gpugrid.net allowmorework
    sleep 1
done


Easy. Put this script in whatever directory contains your boinccmd executable, and edit it to whatever suits your needs. This is an infinite loop, so it's best not to run it as a cronjob; just run it in a terminal and Ctrl+C it if you want to kill it.

I'm unsure if this will totally fix the problem though, since it will still do a scheduler request to report any finished work on 31-second cycles. Setting NNT only stops asking for new work; it doesn't stop reporting of completed work, and doesn't stop scheduler requests (there is no boinccmd option to do that, other than shutting off network comms to all projects, which is likely not desired). But since you won't finish WUs faster than 30 s anyway, maybe it works? BOINC stops trying after a while when there's nothing to do.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55593 - Posted: 13 Oct 2020 | 13:29:08 UTC - in response to Message 55584.

Since GPUGRID is back, I'm running my script on my computers. I've removed my custom 10-min timer from the custom BOINC client, so GPUGRID is running with the default comms delay of 31 seconds.

The script works as intended. It still reports completed work when tasks finish (if the 31 seconds has expired), but so far it doesn't seem to be causing any problems. Work stays topped up on each 10-min scheduler request.

Aurum, give this one a shot. Feel free to play around with the sleep values; try bumping it up to 15 mins if you still have issues with a 10-min timer.

Note: I am not running the previous update_transfers script at all. With the longer timer, the transfers don't seem to be getting clogged up, but this is only 2 systems at the same IP. I'd be curious to know if it helps with the 40+ systems you have at the same location.

Having the project change this setting server-side is still the best solution, so that you won't even report completed work during the 10-min deferred comms.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55594 - Posted: 13 Oct 2020 | 15:38:38 UTC

Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

Ian, The first script works great I just can't get my crontab to invoke it periodically. I'll try your new approach today.

Thanks

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55595 - Posted: 13 Oct 2020 | 16:08:04 UTC - in response to Message 55594.

Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

Ian, The first script works great I just can't get my crontab to invoke it periodically. I'll try your new approach today.

Thanks


As you'll find out, you'll probably need to remove the "./" prefix on the boinccmd lines, since my version runs a user install of BOINC and calls the executable directly from its own directory.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55596 - Posted: 13 Oct 2020 | 16:57:16 UTC

It does not seem to get WUs already suffering from the Project Backoff syndrome to move, but for WUs that finish after your script starts running, it works. So I added training wheels & invoked your Retry script:

#!/bin/bash
while :
do
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    /home/aurum/BOINC_Retry.sh
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done

It's working good so far on 2 rigs. I'll be adding it to more.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55597 - Posted: 13 Oct 2020 | 17:07:00 UTC - in response to Message 55596.

Yeah, nothing in this script will retry the stuck transfers, and if you have too many stuck transfers, the scheduler requests won't even get new work (you'll see in the event log that you have too many stuck uploads).

Clear your pending transfers, and ideally you won't need the retry-transfers script anymore, but no promises. With so many systems, you might need to run both together: just have it retry any pending transfers every few mins or something. Experiment with different values and find the setup that works for you.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55598 - Posted: 13 Oct 2020 | 20:53:53 UTC - in response to Message 55597.

I think that Aurum needs the "retry transfers" script if all of his hosts are behind the same WAN IP.
The only solution for that many hosts is to make the workunits longer, or to lighten the DDOS protection, but I think the latter is out of GPUGrid's control.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55602 - Posted: 14 Oct 2020 | 18:34:30 UTC
Last modified: 14 Oct 2020 | 18:41:36 UTC

It won't work without a Retry. I've seen WUs that completed after BOINC_Nap.sh started running that still went into the fatal Download pending (Project backoff) mode. E.g.,
Rig-44 GPUGRID 3qhoA00_320_4-TONI_MDADex2sq-15-par_file 0.000 178.34 K 00:04:02 - 00:17:28 0.00 KBps Download pending (Project backoff: 01:07:59)
So I inserted the Retry:

#!/bin/bash
while :
do
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --host localhost --passwd pw --file_transfer https://www.gpugrid.net $i retry
    done
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done
It's a bit unnerving to look at my Projects tab and see most of my GG hosts set to "No new work," but watch for 10 minutes and they turn on and off. The downside is that if one wants to gracefully shut down ARPs, with their 2-hour checkpoints, the script turns GG back on when it needs to stay off. Suspending a WU stops new DLs.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55603 - Posted: 14 Oct 2020 | 19:37:14 UTC - in response to Message 55602.

I saw a few instances of transfers getting stuck. But they usually clear on the next attempt in a few mins. The first back off from a stuck transfer is rather short. Then they get longer and longer on each successive retry failure. Having 1 stuck task doesn’t prevent downloads of more work. But having a lot of stuck ones does. I ran my script without automatic transfer retries on my systems for over 24hrs and even though one would occasionally get stuck, it always eventually cleared itself without intervention. That was my point. Getting stuck occasionally isn’t a problem if it eventually gets uploaded, where in my case they always did. You just have to trust it a bit and not get too anxious if you see a stuck one. I can see how having 40+ systems might be a different situation though. So if you absolutely need it, then do what works for you.

I don’t know what you are referring to with ARPs and 2hr checkpoints though. Care to elaborate? What needs to stay off?


____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55604 - Posted: 14 Oct 2020 | 20:58:56 UTC - in response to Message 55603.

ARP is one of the many projects of World Community Grid.
Africa Rainfall Project

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55605 - Posted: 14 Oct 2020 | 22:37:12 UTC - in response to Message 55604.

OK, that only seems to further the confusion. My script won't change anything with WCG or its projects, nor do I know what he means by suspending WUs, since my script doesn't do that either; it just stops getting new work for GPUGRID. So I don't see the issue or the connection between this script and WCG/ARP or suspending WUs.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55608 - Posted: 15 Oct 2020 | 15:47:09 UTC
Last modified: 15 Oct 2020 | 15:48:16 UTC

Africa Rainfall Project WUs have 2-to-3-hour checkpoints. If one wants to avoid discarding all that work, an orderly shutdown is required: select "No New Work" for all projects, wait for everything to checkpoint, and then shut down.
This command in the script reverses that:

/usr/bin/boinccmd --project https://www.gpugrid.net allowmorework

It switches GG back to Allow New Work long enough to start additional 1-2 hour GG WUs going. Just a small occasional nuisance.

I still have not heard a peep out of GDF or Toni about whether they intend to fix their server issue.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55819 - Posted: 25 Nov 2020 | 22:52:04 UTC

Now that COVID moonshot sprint 5 is finished at folding@home, my hosts have run out of work.
So I've put them back to GPUGrid, and immediately got blocked by that DDOS defense.
I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so a given host contacts the GPUGrid servers 5 times to download a task.
To reduce that, I've increased the number of simultaneous file transfers per project to 10 (and the global number to 20) by putting

<max_file_xfers>20</max_file_xfers>
<max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of the cc_config.xml file, and re-read the config files.

I can see in the log that the manager starts all 9 downloads at the same time:
The "starting" and the "finished" messages were mixed up before, now all 9 "starting download of ..." messages are in a block, having the same timestamp.

It seems to help, at least I can access the forum.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1249
Credit: 3,353,161,168
RAC: 1,341,500
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55825 - Posted: 26 Nov 2020 | 9:42:15 UTC - in response to Message 55819.

There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:

26/11/2020 09:29:07 | GPUGRID | [http] [ID#12984] Info: Connection #7366 to host www.gpugrid.net left intact
26/11/2020 09:29:08 | GPUGRID | Finished upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_0
26/11/2020 09:29:08 | GPUGRID | Started upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_2
26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:18 | GPUGRID | Sending scheduler request: To report completed tasks.
26/11/2020 09:29:18 | GPUGRID | Reporting 1 completed tasks
26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:21 | GPUGRID | Started download of 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info: Re-using existing connection! (#7366) with host www.gpugrid.net

That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.

By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55827 - Posted: 26 Nov 2020 | 15:35:39 UTC - in response to Message 55825.

There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:
...
That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.
I didn't examine the http_debug log before, so that's the reason for the flaw in my logic, however...

By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
I thought that raising the "file transfers per project" limit would help, because I saw the same thing happen when the "per project" limit is 2 (or 5). Some of the files are downloaded, some of the downloads get stuck. After a few unsuccessful retries, the project backoff kicks in, even when the "per project" limit is low.
My point is that this unknown DDOS protection is triggered even if the BOINC manager reuses the open http connection(s).

In the meantime it turned out that this method is not an adequate workaround: the uploads/downloads still get stuck on my hosts.

So the only working method is the "file transfer retry" script.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1249
Credit: 3,353,161,168
RAC: 1,341,500
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55828 - Posted: 26 Nov 2020 | 16:53:36 UTC - in response to Message 55827.

My experience is slightly different. I have seven machines attached, made up of

2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Each machine may make a random attempt to download new work, but usually gets rebuffed because the machines are at the limit of 'one task and a spare' per GPU.

The fun starts when a machine completes a task and starts to upload. If no other machine has phoned home in the last few minutes (define 'few'?), it connects straight away, uploads all six files, reports, and downloads a replacement - all without a delay, reusing the same connection. The sample log I posted this morning came into that category.

A Linux machine may hit too soon, and be rebuffed. But it'll keep trying the same pair of uploads for a full two minutes. If they don't get through, each upload will be backed off for one or two minutes, but another two will be tried. Usually, two pairs - four minutes - will be enough to establish the connection, and the final four uploads will sail through. The first two will retry, usually get through (I'm not sure how long BOINC keeps the successful connection open), and then the report/replacement also follows immediately.

Windows machines have a problem. When rebuffed, they only keep retrying the first pair for 21 or 22 seconds. The second pair, likewise, is only retried for 21/22 seconds. The third pair will always be attempted, but the whole task only gets 66 seconds (maximum) to complete uploading. That isn't enough - if the uploading hasn't started by then, the six consecutive failed uploads are enough to drive BOINC into a project-wide backoff of well over an hour.

When writing that bit of code, David Anderson made an unconscious assumption that one task = one upload, so three consecutive failed uploads (the actual trigger) imply three failed tasks over a period of time, and hence a server experiencing problems. His safeguard is protecting us from a completely different problem than the one facing us here.

That's something I tried to address in https://github.com/BOINC/boinc/issues/3778, to singularly little effect.
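The trigger described above can be sketched as follows. This is a deliberate simplification for illustration, not BOINC's actual code; the three-failure threshold and the six-file task come from the description in this post:

```python
# Simplified model of the backoff trigger: a run of three consecutive
# failed uploads drives the whole project into backoff, even when all
# of the failures belong to a single multi-file task.

CONSECUTIVE_FAILURE_LIMIT = 3  # the trigger described in the post

def triggers_project_backoff(upload_results, limit=CONSECUTIVE_FAILURE_LIMIT):
    """upload_results: booleans in order, True = upload succeeded."""
    streak = 0
    for ok in upload_results:
        streak = 0 if ok else streak + 1
        if streak >= limit:
            return True
    return False

# One GPUGRID task = six upload files; if none get through within the
# ~66-second Windows retry window, the limit is exceeded immediately.
print(triggers_project_backoff([False] * 6))        # True
print(triggers_project_backoff([False, True] * 3))  # False
```

The second call shows why the heuristic made sense under the one-upload-per-task assumption: failures interleaved with successes never reach the threshold, so only a genuinely unreachable server would trip it.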

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55848 - Posted: 29 Nov 2020 | 17:41:15 UTC - in response to Message 55828.

My experience is slightly different. I have seven machines attached, made up of

2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Take your Windows machines to Folding. Their core_22 now has a CUDA version that works well.

I will bring my Linux machines here (mostly GTX 1070's). Their control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55849 - Posted: 29 Nov 2020 | 22:33:45 UTC - in response to Message 55848.
Last modified: 29 Nov 2020 | 22:35:07 UTC

From what the admins have posted, GPUGRID includes the whole Python package with the application, so the environment doesn’t matter. I run all my systems on Ubuntu 20.04 and no issues.

Or did you mean “they” as in FAH?
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55852 - Posted: 30 Nov 2020 | 2:35:43 UTC - in response to Message 55849.

Or did you mean “they” as in FAH?

Yes, it is the Folding control program that has the problem.
I am using Ubuntu 20.04 here too.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55862 - Posted: 1 Dec 2020 | 0:41:38 UTC - in response to Message 55848.
Last modified: 1 Dec 2020 | 0:42:29 UTC

Their [FAH] control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).
If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work.
If you do a clean install of Ubuntu 20.04, FAH won't work.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55865 - Posted: 1 Dec 2020 | 15:39:56 UTC - in response to Message 55862.

If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work.

Good thought, but whenever I do an upgrade, it never works. I always end up having to do a clean install anyway.
So I will just keep some machines on Ubuntu 18.04 for the time being.

By the way, I just did my usual efficiency tests on GPUGrid, and found that the GTX 1660 Ti and GTX 1650 Super are the best, a little ahead of both the GTX 1060 and GTX 1070, so those are the ones I will use here.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55869 - Posted: 2 Dec 2020 | 15:46:30 UTC - in response to Message 55819.

I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so a host contacts the GPUGrid servers 5 times to download one task.
To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting
<max_file_xfers>20</max_file_xfers>
<max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of cc_config.xml file, and re-read the config files.

Good idea. I routinely set that to 4, but 10 is better.
It is remarkable what you have to do (in most projects) to get work.
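Put together, a minimal cc_config.xml carrying the two settings quoted above looks like this (the option values come straight from the post; the file lives in the BOINC data directory, and a running client re-reads it after `boinccmd --read_cc_config`):

```xml
<cc_config>
  <options>
    <max_file_xfers>20</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
  </options>
</cc_config>
```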
