
Message boards : Server and website : Optimized bandwidth

GDF (project administrator)
Message 54125 - Posted: 27 Mar 2020 | 22:37:54 UTC

We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.

gdf

Zalster
Message 54127 - Posted: 28 Mar 2020 | 0:44:18 UTC - in response to Message 54125.

Thank you!!

rod4x4
Message 54129 - Posted: 28 Mar 2020 | 3:47:14 UTC - in response to Message 54125.

We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.

gdf


I can confirm the site is far more responsive to browse via web.

Many thanks for your efforts!

kain
Message 54137 - Posted: 28 Mar 2020 | 14:53:59 UTC - in response to Message 54129.

We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.

gdf


I can confirm the site is far more responsive to browse via web.

Many thanks for your efforts!


Indeed, it's much faster now.

Thank you!

Aurum
Message 54139 - Posted: 28 Mar 2020 | 15:32:18 UTC

I still woke up this morning to a long queue of GG WUs needing to move up & down. When they're moving the transfer rate seems faster.

ISPs choke the upload rate to a mere 10% of your bandwidth, while downloads get 90%. Since GG won't send another WU until the completed WU has finished uploading, computers sit idle waiting, i.e. pregnant pauses.

What's wrong with increasing our ration to 3 WUs per GPU???

ServicEnginIC
Message 54141 - Posted: 28 Mar 2020 | 17:55:59 UTC - in response to Message 54125.
Last modified: 28 Mar 2020 | 17:56:26 UTC

On March 27th 2020 GDF wrote:

We have optimized the network so that bandwidth to the server should double...

Since March 14th, due to the coronavirus crisis, here in Spain all of us citizens are under a government home-confinement order.
Gianni, Toni, and all the GPUGrid team behind the scenes: thank you very much for your continued support during these hard times!!!

Hoping everybody stays healthy,

Aurum
Message 54590 - Posted: 6 May 2020 | 14:07:36 UTC
Last modified: 6 May 2020 | 14:26:05 UTC

I've been watching my BoincTasks Transfers page for the last several hours, wondering when it will clear. When I got up, there were almost no GG WUs actually running, since the UL queue was full. I've been using these commands in my cc_config file for a year or so; at first they seemed to help, but now I'm not so sure. Maybe it would help if everyone used them, or if they could be enforced by the server:

<max_file_xfers>9</max_file_xfers>
<max_file_xfers_per_project>3</max_file_xfers_per_project>


From https://boinc.berkeley.edu/wiki/Client_configuration
<max_file_xfers>N</max_file_xfers>
Maximum number of simultaneous file transfers (default 8).
<max_file_xfers_per_project>N</max_file_xfers_per_project>
Maximum number of simultaneous file transfers per project (default 2).
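
For reference, these options belong inside the <options> section of cc_config.xml in the BOINC data directory; a minimal file combining the two settings quoted above would look like this (the client rereads it on restart or via "Options → Read config files"):

```xml
<!-- cc_config.xml: client configuration, read at startup -->
<cc_config>
  <options>
    <max_file_xfers>9</max_file_xfers>
    <max_file_xfers_per_project>3</max_file_xfers_per_project>
  </options>
</cc_config>
```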

But it does not behave as described, maybe because things are actually happening faster in the computer than what's being displayed on the screen. Also, these commands lump CPU & GPU WUs together and treat DLs the same as ULs.

My Charter Spectrum ISP limits UL speeds to 10% of DL speeds so UL is always the choke point. I just ran a speed test with a GG transfer backlog trying to clear:
53.3 Mbps download and 4.66 Mbps upload.

It seems that it would be better if the GG server could limit the number of ULs from a given IP address.

For now I'm going to switch to 64 & 1 and see how that runs through the next couple of days of server backups:
<max_file_xfers>64</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>

Note: I don't know the first thing about how servers work.

Erich56
Message 54591 - Posted: 6 May 2020 | 14:42:30 UTC

What I have noticed today, since this morning, is several obvious GPUGRID server outages:
no access to the homepage, and stalled uploads and downloads :-(

Ian&Steve C.
Message 54592 - Posted: 6 May 2020 | 16:31:59 UTC

transfers are really sluggish.
website is hit or miss.

something doesn't seem right

Erich56
Message 54593 - Posted: 6 May 2020 | 16:36:46 UTC - in response to Message 54592.

something doesn't seem right

I am wondering whether the GPUGRID people are aware of the problem?
No comments here from their side so far :-(

Keith Myers
Message 54595 - Posted: 6 May 2020 | 17:27:01 UTC

Downloads are currently being limited by a single connection from the project to any host.

Toni (project administrator)
Message 54596 - Posted: 6 May 2020 | 17:37:17 UTC - in response to Message 54595.

I see nothing obviously wrong, so I hope it's some international connectivity issue.

Ian&Steve C.
Message 54597 - Posted: 6 May 2020 | 17:46:29 UTC - in response to Message 54595.

Downloads are currently being limited by a single connection from the project to any host.


is there a source for this?

Keith Myers
Message 54598 - Posted: 6 May 2020 | 17:47:51 UTC - in response to Message 54596.

I see nothing obviously wrong, so I hope it's some international connectivity issue.

Of course as soon as I post something about it, all the stalled uploads and downloads cleared out.

The only thing of consequence now is the project requested a 1 hour backoff.

Keith Myers
Message 54599 - Posted: 6 May 2020 | 17:49:52 UTC - in response to Message 54597.

Downloads are currently being limited by a single connection from the project to any host.


is there a source for this?

No, just what I was observing and after I posted that, the rest of the connections picked up and all the stalled tasks moved to the project on both hosts.

Toni says he sees nothing wrong on his end. Thinks there might be international connection issues that we were seeing.

Aurum
Message 54600 - Posted: 6 May 2020 | 18:31:07 UTC

I'm amazed that we've gone this long through an international crisis with connectivity being up. Should be no surprise that elements of the net start going down. Will probably get worse before it gets better.
My transfer queue cleared after 7 hours.

Keith Myers
Message 54601 - Posted: 6 May 2020 | 19:24:28 UTC

Well I'm back to stalled up/downloads again that I can't persuade to get moving.
Hoping that posting about it works the magic again.

Keith Myers
Message 54603 - Posted: 7 May 2020 | 1:44:21 UTC
Last modified: 7 May 2020 | 1:45:40 UTC

The problem seems to be that my hosts never receive an ACK from the project about successful uploads.

Half my pending uploads are sitting at 100% progress for the small files but never clear the list.

Richard Haselgrove
Message 54605 - Posted: 7 May 2020 | 8:30:07 UTC - in response to Message 54596.

I see nothing obviously wrong, so I hope it's some international connectivity issue.
I don't know what you're able to look at, but it's been particularly bad at certain times of day for the last 24 hours.

Yesterday morning, most attempts at most connections were failing until about 10:00 UTC. Then, suddenly the floodgates opened, and I managed to get all tasks uploaded, reported, and replaced over about 20 minutes. I went out for the day, but when I returned in the evening, most machines were queuing again and were still in backlog when I went to bed.

Starting this morning at about 06:05 UTC, most machines were running, but two were in local backoff. A single 'retry', and both uploaded, reported, and downloaded at full normal speed.

There was a small glitch around 06:45 UTC, but the rot set in an hour ago, just after 07:00 UTC. A few tasks have crept through, but I now have 9 tasks uploading, and 3 tasks downloading. Each task requires 16 separate server connections: 6 to upload, 1 scheduler contact, and 9 downloads.

Most of the delays seem to be failures to connect, so I'm not sure whether they would show up in internal monitoring - possibly only in slower turnround and reduced research throughput.

With the mothballing of SETI@Home, you will have the opportunity (extra volunteers) to complete much more bioscience research. But I would urge you to, perhaps, commission a network traffic audit from a networking specialist to try to locate the cause of these problems. Otherwise, you may find that the additional volunteers float away as suddenly as they arrived.

One additional problem is that every type of connection has to pass through the same bottleneck. Now to try connection number 17, to post this message.

Failed - "This site can’t be reached. The connection was reset." Take 2...

Richard Haselgrove
Message 54606 - Posted: 7 May 2020 | 9:12:50 UTC

Today's floodgates opened a little earlier. Just completed this morning's big exchange - I'm good for a few more hours.

Retvari Zoltan
Message 54609 - Posted: 7 May 2020 | 12:01:56 UTC

I had connection problems with these DC projects:
GPUGrid
Einstein@home
folding@home
However I didn't have connection problems with:
TN-Grid
Rosetta@home (I set 24h workunits for Rosetta, so this project may not be affected by this).

Aurum
Message 55300 - Posted: 16 Sep 2020 | 16:30:26 UTC - in response to Message 54596.

I see nothing obviously wrong, so I hope it's some international connectivity issue.
I'm in the USA. These are European BOINC projects that have never behaved like GG:
Ibercivis
LHC
TN-GRID
Asteroids
YoYo
Yafu
Universe
QuChemPedIA
I would look at your server configuration some more.

Aurum
Message 55387 - Posted: 30 Sep 2020 | 16:25:06 UTC
Last modified: 30 Sep 2020 | 16:42:41 UTC

If I stop babysitting GPUGrid for a couple of hours, i.e. clicking Retry All on the BoincTasks Transfer tab, this is what greets me:
https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg
I contacted Fred at eFMer and he pointed out the <refresh> command which I tested:

<config>
<refresh>
<uploads>60</uploads>
<downloads>60</downloads>
</refresh>
</config>

or
<config>
<refresh>
<auto>60</auto>
</refresh>
</config>

But sadly it only works on localhost and does not help my headless fleet.
Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly???
It would also help to eliminate the stifling 2 WU per GPU limitation.

Ian&Steve C.
Message 55389 - Posted: 30 Sep 2020 | 17:29:39 UTC - in response to Message 55387.
Last modified: 30 Sep 2020 | 17:32:35 UTC

If I stop babysitting GPUGrid for a couple of hours, i.e. clicking Retry All on the BoincTasks Transfer tab, this is what greets me:
https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg
I contacted Fred at eFMer and he pointed out the <refresh> command which I tested:
<config>
<refresh>
<uploads>60</uploads>
<downloads>60</downloads>
</refresh>
</config>

or
<config>
<refresh>
<auto>60</auto>
</refresh>
</config>

But sadly it only works on localhost and does not help my headless fleet.
Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly???
It would also help to eliminate the stifling 2 WU per GPU limitation.


you could write a script to issue the retry transfers. then just have it run locally on each system looping with a timed wait.

your systems are hidden so i can't really be more specific since I don't know what kind of setups you have.

Aurum
Message 55392 - Posted: 30 Sep 2020 | 18:20:02 UTC - in response to Message 55389.

your systems are hidden so i can't really be more specific since I don't know what kind of setups you have.
They're naked as a Jaybird now :-)

Ian&Steve C.
Message 55393 - Posted: 30 Sep 2020 | 18:50:55 UTC - in response to Message 55392.
Last modified: 30 Sep 2020 | 18:57:40 UTC

OK, since you are running Linux, try this. You can use the boinccmd tool to retry transfers.

You have to script it, since the boinccmd tool only seems to be able to apply the retry command to a single transfer; there is no "all" option.

This script will find the stuck transfers, grab their file names, then retry them for the given project. *Note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:


#!/bin/bash
# retry every stuck file transfer for the GPUGrid project
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
    boinccmd --file_transfer https://gpugrid.net "$i" retry
done


Name the script something like "update_transfers.sh", then make it executable:
sudo chmod +x update_transfers.sh


run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.

If you have a user-install version of BOINC (i.e. one that does not need to be "installed" and just runs from your home folder), then you need to put the script in the same directory as your boinccmd executable, and modify the script, replacing "boinccmd" with "./boinccmd".

you'll have to do this on each of your 40ish hosts.
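
Since `watch` stops when the login session ends, a cron entry is one possible way to keep this running unattended on each headless box. This is only a sketch; the script path below is an assumption, so adjust it to wherever you saved update_transfers.sh:

```shell
# add via `crontab -e`: retry stuck GPUGrid transfers every 10 minutes
*/10 * * * * /home/boinc/update_transfers.sh >/dev/null 2>&1
```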

Ian&Steve C.
Message 55394 - Posted: 30 Sep 2020 | 19:40:12 UTC
Last modified: 30 Sep 2020 | 19:44:07 UTC

but the real problem is the short cooldown time from the project combined with a block on communications from the same IP (it seems they have a 30 second timer on that). they need to fix that.

basically, if most of your computers are at the same location, and thus have the same external IP address, the GPUGRID servers will only allow communication from 1 system at a time. when you have a lot, it's very likely that multiple systems try to communicate with the project at the same time or within a very short interval. the project seems to reject these successive requests, BOINC thinks there's a problem, and eventually just stops trying until you manually intervene.

this is exacerbated by the short cooldown time. I think it's like 10 seconds or something ridiculously short, so the project is kind of forcing this behavior. if you only have 1 or 2 systems, especially if they are rather slow, you'll rarely or never run into this problem.

they need to fix one or both of these settings. either by shortening or turning off the IP block timer that they have setup, or by changing their project cooldown to something much longer, like 10 minutes.

Keith Myers
Message 55399 - Posted: 30 Sep 2020 | 22:15:43 UTC

they need to fix one or both of these settings. either by shortening or turning off the IP block timer that they have setup, or by changing their project cooldown to something much longer, like 10 minutes.

+100

Aurum
Message 55402 - Posted: 1 Oct 2020 | 1:50:57 UTC
Last modified: 1 Oct 2020 | 1:52:28 UTC

Thanks for the script. I installed it but I also upgraded Nvidia driver to 455.23.04 and when I rebooted I lost the headless computer.
Is anyone seeing a problem with 455.23.04???

Yes, all my rigs are behind the same external IP address.

I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.

Yea, today was the last day of TOU season so I can crunch 24x7 for the next 6 months. I'm watching BoincTasks All Computers Transfers and everything is flying and working well. I expect as the first round of 2080 WUs finish they'll start entering this banishment mode. Then the 1080s will follow into banishment.

You'd think these guys would want to get more work done faster instead of forcing less work to get done slower. They've implemented 3 things that reduce my throughput by about 75%: the max of 2 WUs per GPU, the IP blocker, and the short cooldown time.

I'm fine with 2 WU/GPU if they'd deliver work quickly. WCG implemented per-project limits in their Device Profiles (Project Limits):
"The following settings allow you to set the maximum number of tasks assigned to one of your devices for a project."

So I set the limits to threads+2 and return all WUs within 24 hours, even the massive ARPs. I'd like to do the same for GG but they've hog-tied me. This is really frustrating, since this is the only BOINC life-science GPU project available.

Ian&Steve C.
Message 55403 - Posted: 1 Oct 2020 | 2:51:46 UTC - in response to Message 55402.
Last modified: 1 Oct 2020 | 3:00:08 UTC


I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.


"cooldown" is an unofficial term. I don't know what it's called on the project server side maybe "requested delay"?. you can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for..."

basically, when your system completes a scheduler request, the project always tells it to wait some amount of time before communicating again. this is standard BOINC behavior, and every project has a different delay preset in its server configuration. SETI was 303 seconds. Einstein is like 60 seconds.

in the case of GPUGRID, that time is 31 seconds, which is much too short when you have many fast systems at the same IP. you don't need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.

rod4x4
Message 55404 - Posted: 1 Oct 2020 | 3:33:05 UTC - in response to Message 55403.


I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.


"cooldown" is an unofficial term. I don't know what it's called on the project server side maybe "requested delay"?. you can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for..."

basically when you communicate with the project for a schedule request. after the request is completed, the project always tells your system to wait some amount of time before trying to communicate again. this is standard BOINC behavior and every project has a different delay pre-set on their server configuration. SETI was 303 seconds. Einstein is like 60 seconds.

in the case of GPUGRID, that time is 31 seconds. which is much too short when you have many fast systems at the same IP. you dont need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.

Not sure it is a BOINC server setting stopping the communications, as it also affects access to the entire web site.
It is more likely to be at the perimeter of the network, probably part of a network-defence strategy against DDoS and similar attacks.
The settings may be out of GPUGrid's hands and controlled by the network administrators at UPF.

Ian&Steve C.
Message 55405 - Posted: 1 Oct 2020 | 3:55:19 UTC - in response to Message 55404.
Last modified: 1 Oct 2020 | 3:57:59 UTC


Not sure it is a BOINC server setting stopping the communications, as it also affects access to the entire web site.
It is more likely to be at the perimeter of the network, probably part of a network-defence strategy against DDoS and similar attacks.
The settings may be out of GPUGrid's hands and controlled by the network administrators at UPF.


it's a project server setting that controls the requested delay. the IP-blocking timeout must be something set up on their network.

it's the combination of the short deferral time and the IP-block timeout; either one by itself doesn't cause a problem.

I've forced my systems to go into a cooldown for 10 minutes after each scheduler request (using a custom BOINC client) and that fixed all issues for my systems. the IP-block timer is still in effect, but they no longer get stuck idle, because the chance of one system trying to communicate at the same time as another is drastically reduced.

the way things are by default almost guarantees that if you have more than 1 fast computer, you will have issues.

but the requested delay is absolutely in their control to change. I've seen many other projects adjust this value when they wanted to. there's no reason GPUGRID can't change it either, since they are using the BOINC server software.

Keith Myers
Message 55406 - Posted: 1 Oct 2020 | 7:03:59 UTC

Some projects are even more "busy" than GPUGrid. When I joined Universe I found it had a project delay time of 11 seconds. Totally ridiculous. No host or download server needs to be polled that often.

So the first thing I did was put in a 120 second cooldown for that project for my client.

As Ian states, that parameter is set in the project server software.

Aurum
Message 55419 - Posted: 4 Oct 2020 | 10:39:07 UTC

Not a peep from GG staff. You'd think lengthening the cooldown period would be trivial to try. This problem is just getting worse for me.

Aurum
Message 55432 - Posted: 5 Oct 2020 | 21:50:14 UTC

All I hear are crickets ;-(

Aurum
Message 55532 - Posted: 10 Oct 2020 | 9:50:21 UTC

Will GPUGrid ever outgrow the need for a babysitter???

Retvari Zoltan
Message 55535 - Posted: 10 Oct 2020 | 20:58:23 UTC - in response to Message 55532.
Last modified: 10 Oct 2020 | 21:00:49 UTC

Will GPUGrid ever outgrow the need for a babysitter???
While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side:
Reduce the work-cache setting on all of your hosts to roughly match the shortest workunits your hosts crunch.
That way a host asks for a new task only when the server will actually send a new (spare) one, so there are no futile work requests. This lowers the chance of getting your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a better chance of getting work as well. As there is plenty of work available at the moment, a new WU will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)
The right value depends on the GPU and the workunit too; as the present workunits are quite short, we should set a very short queue.
I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging".
0.01 days is the lowest you can set, and it is also the 'unit' (granularity) of the cache size, so you can't set 0.015 days.
With lesser cards you can try higher values:

days   seconds   h:mm:ss
0.01      864    0:14:24
0.02     1728    0:28:48
0.03     2592    0:43:12
0.04     3456    0:57:36
0.05     4320    1:12:00
0.06     5184    1:26:24
0.07     6048    1:40:48
0.08     6912    1:55:12
0.09     7776    2:09:36
0.10     8640    2:24:00

As my hosts (except for one) are crunching folding@home at the moment (btw my team is 789th as I write this), I haven't tested getting work on multiple hosts, only browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.)
I'm curious whether this fix works for you or not.
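
The table rows are just days × 86,400 seconds. As a sanity check, here is a small, self-contained shell snippet (nothing GPUGrid-specific, the variable name is my own) that converts a cache setting in days to seconds and h:mm:ss:

```shell
#!/bin/bash
# convert a BOINC work-cache setting (in days) to seconds and h:mm:ss
days=0.01
secs=$(awk -v d="$days" 'BEGIN{printf "%d", d*86400+0.5}')  # round to whole seconds
printf '%s days = %d s = %d:%02d:%02d\n' "$days" "$secs" \
  $((secs/3600)) $((secs % 3600 / 60)) $((secs % 60))
```

For 0.01 days this prints "0.01 days = 864 s = 0:14:24", matching the first row of the table.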

rod4x4
Message 55539 - Posted: 11 Oct 2020 | 0:56:52 UTC - in response to Message 55535.

Will GPUGrid ever outgrow the need for a babysitter???
While we're waiting (probably in vain) for that I've figured out the only way to mitigate this problem on our side:
You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
In this way the host will ask for a new task only when the server will actually send a new (spare) one, so there will be no futile requests for getting work, that results in a lower chance to get your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a bigger chance for getting work as well. As there is plenty of work available at the moment, a new wu will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)
The actual value depends on the GPU and the workunit too, as the present workunits are quite short we should set a very short queue.
I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging".
0.01 days is the lowest you can set, this is also the 'unit' for the size of the cache. (so you can't set 0.015 days.)
With lesser cards you can try higher values:
days   seconds   h:mm:ss
0.01      864    0:14:24
0.02     1728    0:28:48
0.03     2592    0:43:12
0.04     3456    0:57:36
0.05     4320    1:12:00
0.06     5184    1:26:24
0.07     6048    1:40:48
0.08     6912    1:55:12
0.09     7776    2:09:36
0.10     8640    2:24:00
As my hosts (except for one) are crunching folding@home at the moment (btw my team is the 789th as I wrote this), I haven't tested it with getting work on multiple host, only by browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.)
I'm curious about that fix is working for you or not.

This approach definitely has merit, but it would rely on a large percentage of GPUGrid users applying this method for any results to be seen.
Let's see how many people try this.

Retvari Zoltan
Message 55544 - Posted: 11 Oct 2020 | 13:21:05 UTC - in response to Message 55539.
Last modified: 11 Oct 2020 | 13:21:40 UTC

This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
No. When my WAN IP gets blocked by GPUGrid's DDoS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on any other user's WAN IP.

rod4x4
Message 55545 - Posted: 11 Oct 2020 | 14:53:18 UTC - in response to Message 55544.
Last modified: 11 Oct 2020 | 15:04:30 UTC

This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
No. When my WAN IP gets blocked by GPUGrid's DDoS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on any other user's WAN IP.

So are we dealing with DDoS protection, contention on a saturated link, rate limiting on an under-resourced link, a badly configured router, QoS putting our connections at the bottom of the list, or a combination of these factors?

Comments by volunteers in the forum indicate that DDoS protection and a saturated link are quite likely. The other options listed above also rate consideration.

The title of this thread also suggests the rules on the network-edge equipment have been modified to change bandwidth allocation.

I guess we will never really know unless it is identified by GPUGrid. We are really just hypothesizing.

It passes the time... and gives us a distraction.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55546 - Posted: 11 Oct 2020 | 15:26:30 UTC
Last modified: 11 Oct 2020 | 15:27:01 UTC

If you can find a way to make your system wait longer than the default 31-second comms delay between communications with the project, you will solve the problem.

Or the project could simply increase that delay so you don't have to find a workaround yourself.

Increasing the delay will not change the presence of the DDOS protections, but it will prevent users from hitting them. It's really the best solution. It just seems like the project admins either don't know where this setting is in their own server software, or the suggestion is falling on deaf ears.

This 100% solves the problem.
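
For reference, if GPUGrid runs a stock BOINC server, the setting in question is believed to live in the project's config.xml. The option name below is taken from my reading of the BOINC server documentation, and the 600-second value is just the 10-minute figure discussed later in this thread; none of this is verified against GPUGrid's actual setup:

```xml
<!-- Fragment of a BOINC project's config.xml, inside <config> ... </config>.
     The scheduler's reply delay sent to clients is derived from this value
     (to my understanding); 600 s would give a 10-minute comms delay. -->
<min_sendwork_interval>600</min_sendwork_interval>
```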
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55547 - Posted: 11 Oct 2020 | 17:33:06 UTC - in response to Message 55546.
Last modified: 11 Oct 2020 | 17:41:57 UTC

increasing the [default 31 seconds communications] delay will not change the presence of the DDOS protections, but it will prevent the users from hitting them.
I wonder about the ideal length of that delay. We don't know the exact rules of the DDOS protection that hits us, which, combined with the number of hosts a given user has behind a single WAN IP, would decide the ideal delay length. The delay can't be longer than the shortest workunit takes on the fastest GPU, because that would make those hosts starve (so it would be no better than the DDOS protection making them starve).
Taking the signs of the present DDOS protection and the present short workunits into consideration, I think there is a maximum number of hosts behind a single WAN IP that can work without some of them starving. Above that number, some random host behind that WAN IP will inevitably be hit by the DDOS protection.
So I think the delay should be around 600 seconds (10 minutes); but the workunits should also be made (at least) twice, even four times, as long. The present workunits should go to the short queue (for lesser GPUs).
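
As a back-of-envelope check (my own arithmetic, with a hypothetical 40 hosts behind one WAN IP), the request rate one IP presents to the server scales as hosts × 60 / delay:

```shell
# Rough requests-per-minute from one WAN IP. Assumptions: each host contacts
# the scheduler once per delay period, and 40 hosts is a hypothetical count.
hosts=40
delay=31    # current default comms delay, seconds
echo "$(( hosts * 60 / delay )) requests/minute at the 31 s default"
delay=600   # proposed 10-minute delay
echo "$(( hosts * 60 / delay )) requests/minute at 600 s"
# prints: 77 requests/minute at the 31 s default
#         4 requests/minute at 600 s
```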

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55551 - Posted: 12 Oct 2020 | 3:11:26 UTC - in response to Message 55547.

A 10-min delay seems pretty ideal to me. The fastest WUs I see on my 2080 Ti are about 15 mins. The 2080 Ti is about the fastest card right now (Titan RTX is barely faster, RTX 8000 is a little slower) until Ampere support is added and we can properly gauge how the 30-series cards will perform.

I have no objection to longer WUs, but they should be able to be restarted. Every time I have a power outage while my system is in the middle of one of those older PABLO WUs, I lose several hours of work, since they error out immediately when they try to resume.

I'm aware of the issue with restarting tasks on a different device, but all of my systems run identical GPUs within the same system. (Not even just the identical type; I run the same part-number SKU cards within a system.) But it even happens on my single-GPU system, so there's really no excuse there.

10 mins is the delay I run. And it definitely solved the problem. But again the most elegant solution is to have the project make a global change server side so that the clients don’t have to individually implement a workaround.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55553 - Posted: 12 Oct 2020 | 14:53:29 UTC - in response to Message 55535.

You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
I've tried this before, but not to the extreme of 0.01/0.01. As I recall, it reduces the chances of getting big WUs such as ARP & HST from WCG. I'll give it a try on all my computers today, but it'll probably take a day to see if it's working. I'm certain that 0.5/0.1 does not help GG.
Note that the GPUGrid server will send only two workunits per GPU for a given host.
This is part of the problem.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55554 - Posted: 12 Oct 2020 | 15:31:43 UTC - in response to Message 55551.

10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set your Preferences like this?
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55555 - Posted: 12 Oct 2020 | 16:13:55 UTC - in response to Message 55393.

...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

this script will search the stuck transfers, grab their file names, then retry them for the given project. *note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:

#!/bin/bash
for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer https://gpugrid.net $i retry;done

This wouldn't work, for the strangest reason: it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well, but I cannot get it to run from my crontab.

name the script something like "update_transfers.sh"
change permissions of the script to make it executable
sudo chmod +x update_transfers.sh

The script cannot use a dot, so I changed it to an underscore: BOINC_Retry_sh.
The script cannot be writable by a user other than root (aurum), so I did this:
sudo chmod 700 BOINC_Retry_sh
https://manpages.debian.org/stretch/cron/cron.8.en.html

So my script BOINC_Retry_sh is now this:
#!/bin/bash
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
    boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net $i retry
done

It wouldn't work without including --host localhost --passwd mypw but that might because I didn't store the script in the right folder.
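
For the record, here's what that sed filter extracts, run against a small fabricated sample of boinccmd --get_file_transfers output (the field layout is an assumption based on the posts above; only the sed expression itself is from the real script):

```shell
# Fabricated sample of "boinccmd --get_file_transfers" output; only the
# "name:" fields survive the sed filter, giving the file names to retry.
sample='   name: 3qhoA00_320_4-TONI_par_file_0_0
   direction: upload
   name: 2hy5B00_320_0-TONI_conf_file_enc'
printf '%s\n' "$sample" | sed -n -e 's/^.*name: //p'
# prints:
# 3qhoA00_320_4-TONI_par_file_0_0
# 2hy5B00_320_0-TONI_conf_file_enc
```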

run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.
Forgot the watch command and will revisit that now.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55558 - Posted: 12 Oct 2020 | 16:36:19 UTC
Last modified: 12 Oct 2020 | 16:40:20 UTC

It only took an hour to remind me of the problem with using a very short work queue (0.01/0.01). I believe I saw the same thing when I previously tried 0.1/0.1, which proved unworkable. I always run Milkyway along with GPUGrid, since GG dries up so quickly and I abhor idle computers.

Rig-02 13907 GPUGRID 12-10-2020 08:31 Not requesting tasks: don't need (CPU: ; NVIDIA GPU: job cache full)

So if I have MW WUs then I cannot DL a replacement GG WU.
I will not run GG exclusively just to implement a kluge.
GDF needs to fix this issue from his server side.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55562 - Posted: 12 Oct 2020 | 18:03:03 UTC - in response to Message 55555.

...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

this script will search the stuck transfers, grab their file names, then retry them for the given project. *note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:

#!/bin/bash
for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer https://gpugrid.net $i retry;done

This wouldn't work for the strangest reason, it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well but I cannot get it to run on my crontab.

name the script something like "update_transfers.sh"
change permissions of the script to make it executable
sudo chmod +x update_transfers.sh

The script cannot use a dot so I changed it to an underscore, BOINC_Retry_sh.
The script cannot be writable by a user other than root (aurum). So I did this:
sudo chmod 700 BOINC_Retry_sh
https://manpages.debian.org/stretch/cron/cron.8.en.html

So my script BOINC_Retry_sh is now this:
#!/bin/bash
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p');do
boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net $i retry;
done

It wouldn't work without including --host localhost --passwd mypw but that might because I didn't store the script in the right folder.

run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.
Forgot the watch command and will revisit that now.


There should be no reason you can't run a ".sh"; it's just a file extension. You could name it .anything, or use no extension at all as you did. It will execute the same either way; it's really inconsequential.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55571 - Posted: 12 Oct 2020 | 19:23:26 UTC - in response to Message 55562.

there should be no reason you cant run a ".sh" it's just a file extension. you could name it .anything or with no extension at all as you did. it will execute the same either way, its really inconsequential.

There is if you need to run it from a crontab:
As described above, the files under these directories have to pass some sanity checks including the following: be executable, be owned by root, not be writable by group or other and, if symlinks, point to files owned by root. Additionally, the file names must conform to the filename requirements of run-parts: they must be entirely made up of letters, digits and can only contain the special signs underscores ('_') and hyphens ('-'). Any file that does not conform to these requirements will not be executed by run-parts. For example, any file containing dots will be ignored. This is done to prevent cron from running any of the files that are left by the Debian package management system when handling files in /etc/cron.d/ as configuration files (i.e. files ending in .dpkg-dist, .dpkg-orig, and .dpkg-new).
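
Worth noting (my reading of that same manpage, so treat it as an assumption): those run-parts filename rules apply to the /etc/cron.hourly-style directories and /etc/cron.d, while a per-user crontab installed with crontab -e is not processed by run-parts, so a dotted script name should be fine there. A hypothetical entry (the paths and log file are made up for illustration):

```
# run the retry script every 10 minutes, appending output to a log
*/10 * * * * /home/aurum/update_transfers.sh >> /tmp/gpugrid_retry.log 2>&1
```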

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55572 - Posted: 12 Oct 2020 | 19:30:00 UTC - in response to Message 55571.
Last modified: 12 Oct 2020 | 19:30:21 UTC

well that's why my instructions were designed around running it in an open terminal ;). just open the terminal and run it there with the watch command
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55573 - Posted: 12 Oct 2020 | 19:35:51 UTC - in response to Message 55554.
Last modified: 12 Oct 2020 | 19:36:46 UTC

10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set this Preferences like this???
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}


I have a custom client that was developed by a team member, which overrides the default comms delay and forces a longer timeout to whatever you wish. this is how i KNOW that the issue is solved with a longer timeout, because i've done it (as have several other teammates). this software is locked to our team however, so even if I were to give you the BOINC client software, it wont work unless you are on our team.

doesnt sound like you use anything but service installs anyway. this is a custom BOINC client that runs stand alone from wherever you have it on your system. the benefit is that you dont have to "install" anything. you just copy the folder wherever you want, and run the executable from there. the downside is that it wont auto-run when you boot the system. but when you have a stable system with failover projects, it's not too much hassle. i reboot maybe every few months due to power outages or system upgrades.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55579 - Posted: 12 Oct 2020 | 22:10:26 UTC

I have another (much more sophisticated, yet not implemented) idea:
We should write a script that:
1. disables work fetch from GPUGrid
2. waits while there are two GPUGrid workunits per GPU on the host
3. enables work fetch from GPUGrid
4. waits until there are two GPUGrid workunits per GPU on the host
5. GOTO 1.

#1, #3 and #5 are trivial.
#2 and #4 are complex (especially checking how many usable Nvidia GPUs are present in the system); they should also include some sleep period. #4 should include the "update transfers" script.
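
A minimal sketch of the counting part of #2/#4, run here against a fabricated sample of boinccmd --get_tasks output (the field layout is assumed; on a live host you would pipe the real boinccmd output instead, as in the comment):

```shell
# Count GPUGrid tasks by grepping their project URL out of (assumed)
# "boinccmd --get_tasks" output.
# Live use would be:  boinccmd --get_tasks | grep -c 'gpugrid.net'
sample='1) -----------
   name: task_A
   project URL: https://www.gpugrid.net/
2) -----------
   name: task_B
   project URL: https://www.gpugrid.net/'
printf '%s\n' "$sample" | grep -c 'gpugrid.net'
# prints: 2
```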

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55583 - Posted: 13 Oct 2020 | 0:17:13 UTC - in response to Message 55579.

I have another (much more sophisticated, yet not implemented) idea:
We should write a script that:
1. disables work fetch from GPUGrid
2. waits while there are two GPUGrid workunits per GPU on the host
3. enables work fetch from GPUGrid
4. waits until there are two GPUGrid workunits per GPU on the host
5. GOTO 1.

#1, #3 and #5 are trivial.
#2 and #4 are complex (especially to check how many usable Nvidia GPUs are present in the system), they should also include some sleep period. #4 should include the "update transfers" script.


Just make it wait a set amount of time, rather than wait for x number of tasks. Would be much simpler.

boinccmd project update (to initiate send/receive)
Wait 20-30secs (to allow proj update to complete)
boinccmd set NNT
wait 10 mins
boinccmd allow NT

And just loop that.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55584 - Posted: 13 Oct 2020 | 2:17:09 UTC - in response to Message 55583.
Last modified: 13 Oct 2020 | 2:22:03 UTC

Here, I wrote it:


#!/bin/bash
while :
do
    ./boinccmd --project https://www.gpugrid.net update
    sleep 20
    ./boinccmd --project https://www.gpugrid.net nomorework
    sleep 10m
    ./boinccmd --project https://www.gpugrid.net allowmorework
    sleep 1
done


Easy. Put this script in whatever directory contains your boinccmd executable, and edit it to whatever suits your needs. This is an infinite loop, so it's best not to run it as a cronjob; just run it in a terminal and Ctrl+C it if you want to kill it.

I'm unsure if this will totally fix the problem though, since it will still do a scheduler request to report any finished work on 31-second cycles. Setting NNT only stops asking for new work; it doesn't stop reporting of completed work, and doesn't stop scheduler requests (there is no boinccmd option to do that, other than shutting off network comms to all projects, which is likely not desired). But since you won't finish WUs faster than 30 s anyway, maybe it works? BOINC stops trying after a while when there's nothing to do.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55593 - Posted: 13 Oct 2020 | 13:29:08 UTC - in response to Message 55584.

Since GPUGRID is back, I'm running my script on my computers. I've removed my custom 10-min timer from the custom BOINC client, so GPUGRID is running with the default comms delay of 31 seconds.

The script works as intended. It still reports completed work when tasks finish (if the 31 seconds has expired), but so far it doesn't seem to be causing any problems. Work stays topped up on each 10-min scheduler request.

Aurum, give this one a shot. Feel free to play around with the sleep values; try bumping it up to 15 mins if you still have issues with a 10-min timer.

Note: I am not running the previous update_transfers script at all. With the longer timer, the transfers don't seem to be getting clogged up, but this is only 2 systems at the same IP. I'd be curious to know if it helps with the 40+ systems you have at the same location.

Having the project change this setting server-side is still the best solution, so that you won't even report completed work during the 10-min deferred comms.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55594 - Posted: 13 Oct 2020 | 15:38:38 UTC

Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

Ian, The first script works great I just can't get my crontab to invoke it periodically. I'll try your new approach today.

Thanks

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55595 - Posted: 13 Oct 2020 | 16:08:04 UTC - in response to Message 55594.

Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

Ian, The first script works great I just can't get my crontab to invoke it periodically. I'll try your new approach today.

Thanks


As you'll find out, you'll probably need to remove the "./" prefix on the boinccmd lines, since my version runs a user install of BOINC and calls the executable directly from its own directory.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55596 - Posted: 13 Oct 2020 | 16:57:16 UTC

It does not seem to get WUs already suffering from the Project Backoff syndrome to move, but for WUs that finish after your script starts running, it works. So I added training wheels & invoked your Retry script:

#!/bin/bash
while :
do
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    /home/aurum/BOINC_Retry.sh
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done

It's working good so far on 2 rigs. I'll be adding it to more.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55597 - Posted: 13 Oct 2020 | 17:07:00 UTC - in response to Message 55596.

Yeah, nothing in this script will retry the stuck transfers, and if you have too many stuck transfers, the scheduler requests won't even get new work (you'll see in the event log that you have too many stuck uploads).

Clear your pending transfers, and ideally you won't need the retry-transfers script anymore, but no promises. With so many systems, you might need to run both together: just have it retry any pending transfers every few mins or something. Experiment with different values and find the setup that works for you.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55598 - Posted: 13 Oct 2020 | 20:53:53 UTC - in response to Message 55597.

I think that Aurum needs the "retry transfers" script if all of his hosts are behind the same WAN IP.
The only solution for that many hosts is to make the workunits longer, or to lighten the DDOS protection, but I think the latter is out of GPUGrid's control.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55602 - Posted: 14 Oct 2020 | 18:34:30 UTC
Last modified: 14 Oct 2020 | 18:41:36 UTC

It won't work without a Retry. I've seen WUs that completed after BOINC_Nap.sh started running that still went into the fatal Download pending (Project backoff) mode. E.g.,
Rig-44 GPUGRID 3qhoA00_320_4-TONI_MDADex2sq-15-par_file 0.000 178.34 K 00:04:02 - 00:17:28 0.00 KBps Download pending (Project backoff: 01:07:59)
So I inserted the Retry:

#!/bin/bash
while :
do
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --host localhost --passwd pw --file_transfer https://www.gpugrid.net $i retry
    done
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done
It's a bit unnerving to look at my Projects tab and see most of my GG hosts set to "No new work," but watch for 10 minutes and they turn on and off. The downside is that if one wants to gracefully shut down ARPs, with their 2-hour checkpoints, the script turns GG back on when it needs to stay off. Suspending a WU stops new DLs.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55603 - Posted: 14 Oct 2020 | 19:37:14 UTC - in response to Message 55602.

I saw a few instances of transfers getting stuck. But they usually clear on the next attempt in a few mins. The first back off from a stuck transfer is rather short. Then they get longer and longer on each successive retry failure. Having 1 stuck task doesn’t prevent downloads of more work. But having a lot of stuck ones does. I ran my script without automatic transfer retries on my systems for over 24hrs and even though one would occasionally get stuck, it always eventually cleared itself without intervention. That was my point. Getting stuck occasionally isn’t a problem if it eventually gets uploaded, where in my case they always did. You just have to trust it a bit and not get too anxious if you see a stuck one. I can see how having 40+ systems might be a different situation though. So if you absolutely need it, then do what works for you.

I don’t know what you are referring to with ARPs and 2hr checkpoints though. Care to elaborate? What needs to stay off?


____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55604 - Posted: 14 Oct 2020 | 20:58:56 UTC - in response to Message 55603.

ARP is one of the many projects of World Community Grid.
Africa Rainfall Project

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55605 - Posted: 14 Oct 2020 | 22:37:12 UTC - in response to Message 55604.

OK, that only seems to further the confusion. My script won't change anything with WCG or its projects, nor do I know what he means by suspending WUs, since my script doesn't do that either; it just stops getting new work for GPUGRID. So I don't see the issue or the connection between this script and WCG/ARP or suspending WUs.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 0
Level
Trp
Scientific publications
watwatwat
Message 55608 - Posted: 15 Oct 2020 | 15:47:09 UTC
Last modified: 15 Oct 2020 | 15:48:16 UTC

Africa Rainfall Project WUs have 2-to-3-hour checkpoints. If one wants to avoid discarding all that work, an orderly shutdown is required: select "No New Work" for all projects, wait for everything to checkpoint, and then shut down.
This command in the script reverses that:

/usr/bin/boinccmd --project https://www.gpugrid.net allowmorework

It switches GG back to Allow New Work long enough to start additional 1-2 hour GG WUs going. Just a small occasional nuisance.

I still have not heard a peep out of GDF or Toni about whether they intend to fix their server issue.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55819 - Posted: 25 Nov 2020 | 22:52:04 UTC

Now that COVID moonshot sprint 5 is finished at folding@home, my hosts have run out of work.
So I've put them back to GPUGrid, and immediately got blocked by that DDOS defense.
I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so a given host contacts the GPUGrid servers 5 times to download a task.
To reduce that, I've increased the number of simultaneous file transfers per project to 10 (and the global number to 20) by putting

<max_file_xfers>20</max_file_xfers>
<max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of the cc_config.xml file, and re-read the config files.

I can see in the log that the manager starts all 9 downloads at the same time:
The "starting" and the "finished" messages were mixed up before, now all 9 "starting download of ..." messages are in a block, having the same timestamp.

It seems to help, at least I can access the forum.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1249
Credit: 3,353,161,168
RAC: 1,341,500
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55825 - Posted: 26 Nov 2020 | 9:42:15 UTC - in response to Message 55819.

There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:

26/11/2020 09:29:07 | GPUGRID | [http] [ID#12984] Info: Connection #7366 to host www.gpugrid.net left intact
26/11/2020 09:29:08 | GPUGRID | Finished upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_0
26/11/2020 09:29:08 | GPUGRID | Started upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_2
26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:18 | GPUGRID | Sending scheduler request: To report completed tasks.
26/11/2020 09:29:18 | GPUGRID | Reporting 1 completed tasks
26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:21 | GPUGRID | Started download of 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info: Re-using existing connection! (#7366) with host www.gpugrid.net

That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.

By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55827 - Posted: 26 Nov 2020 | 15:35:39 UTC - in response to Message 55825.

There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:
...
That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.
I didn't examine the http_debug log before, so that's the reason for the flaw in my logic, however...

By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
I thought that raising the "file transfers per project" limit would help, because I saw the same thing happen when the "per project" limit is 2 (or 5). Some of the files are downloaded, some of the downloads get stuck. After a few unsuccessful retries, the project backoff kicks in, even when the "per project" limit is low.
My point is that this unknown DDOS protection is triggered even if the BOINC manager reuses the open http connection(s).

In the meantime it turned out that this method is not an adequate workaround: the uploads/downloads still get stuck on my hosts.

So the only working method is the "file transfer retry" script.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1249
Credit: 3,353,161,168
RAC: 1,341,500
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55828 - Posted: 26 Nov 2020 | 16:53:36 UTC - in response to Message 55827.

My experience is slightly different. I have seven machines attached, made up of

2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Each machine may make a random attempt to download new work, but usually gets rebuffed because the machines are at the limit of 'one task and a spare' per GPU.

The fun starts when a machine completes a task and starts to upload. If no other machine has phoned home in the last few minutes (define 'few'?), it connects straight away, uploads all six files, reports, and downloads a replacement - all without a delay, reusing the same connection. The sample log I posted this morning came into that category.

A Linux machine may hit too soon, and be rebuffed. But it'll keep trying the same pair of uploads for a full two minutes. If they don't get through, each upload will be backed off for one or two minutes, but another two will be tried. Usually, two pairs - four minutes - will be enough to establish the connection, and the final four uploads will sail through. The first two will retry, usually get through (I'm not sure how long BOINC keeps the successful connection open), and then the report/replacement also follows immediately.

Windows machines have a problem. When rebuffed, they only keep retrying the first pair for 21 or 22 seconds. The second pair, likewise, is only retried for 21/22 seconds. The third pair will always be attempted, but the whole task only gets 66 seconds (maximum) to complete uploading. That isn't enough - if the uploading hasn't started by then, the six consecutive failed uploads are enough to drive BOINC into a project-wide backoff of well over an hour.

When writing that bit of code, David Anderson made an unconscious assumption that one task = one upload, so three consecutive failed uploads (the actual trigger) imply three failed tasks over a period of time, and hence a server experiencing problems. His safeguard is protecting us from a completely different problem than the one facing us here.

That's something I tried to address in https://github.com/BOINC/boinc/issues/3778, to singularly little effect.
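The trigger described above can be sketched as follows. This is a deliberate simplification for illustration, not BOINC's actual code; the three-failure threshold and the six-file task come from the description in this post:

```python
# Simplified model of the backoff trigger: a run of three consecutive
# failed uploads drives the whole project into backoff, even when all
# of the failures belong to a single multi-file task.

CONSECUTIVE_FAILURE_LIMIT = 3  # the trigger described in the post

def triggers_project_backoff(upload_results, limit=CONSECUTIVE_FAILURE_LIMIT):
    """upload_results: booleans in order, True = upload succeeded."""
    streak = 0
    for ok in upload_results:
        streak = 0 if ok else streak + 1
        if streak >= limit:
            return True
    return False

# One GPUGRID task = six upload files; if none get through within the
# ~66-second Windows retry window, the limit is exceeded immediately.
print(triggers_project_backoff([False] * 6))        # True
print(triggers_project_backoff([False, True] * 3))  # False
```

The second call shows why the heuristic made sense under the one-upload-per-task assumption: failures interleaved with successes never reach the threshold, so only a genuinely unreachable server would trip it.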

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55848 - Posted: 29 Nov 2020 | 17:41:15 UTC - in response to Message 55828.

My experience is slightly different. I have seven machines attached, made up of

2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Take your Windows machines to Folding. Their core_22 now has a CUDA version that works well.

I will bring my Linux machines here (mostly GTX 1070's). Their control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 544
Credit: 4,501,706,357
RAC: 4,863,736
Level
Arg
Scientific publications
wat
Message 55849 - Posted: 29 Nov 2020 | 22:33:45 UTC - in response to Message 55848.
Last modified: 29 Nov 2020 | 22:35:07 UTC

From what the admins have posted, GPUGRID includes the whole Python package with the application, so the environment doesn’t matter. I run all my systems on Ubuntu 20.04 and no issues.

Or did you mean “they” as in FAH?
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55852 - Posted: 30 Nov 2020 | 2:35:43 UTC - in response to Message 55849.

Or did you mean “they” as in FAH?

Yes, it is the Folding control program that has the problem.
I am using Ubuntu 20.04 here too.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2304
Credit: 16,123,726,240
RAC: 2,237,409
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55862 - Posted: 1 Dec 2020 | 0:41:38 UTC - in response to Message 55848.
Last modified: 1 Dec 2020 | 0:42:29 UTC

Their [FAH] control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).
If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work.
If you do a clean install of Ubuntu 20.04, FAH won't work.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55865 - Posted: 1 Dec 2020 | 15:39:56 UTC - in response to Message 55862.

If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work.

Good thought, but whenever I do an upgrade, it never works. I always end up having to do a clean install anyway.
So I will just keep some machines on Ubuntu 18.04 for the time being.

By the way, I just did my usual efficiency tests on GPUGrid, and found that the GTX 1660 Ti and GTX 1650 Super are the best, a little ahead of both the GTX 1060 and GTX 1070, so those are the ones I will use here.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 48,215
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55869 - Posted: 2 Dec 2020 | 15:46:30 UTC - in response to Message 55819.

I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so a host contacts the GPUGrid servers 5 times to download one task.
To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting
<max_file_xfers>20</max_file_xfers>
<max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of cc_config.xml file, and re-read the config files.

Good idea. I routinely set that to 4, but 10 is better.
It is remarkable what you have to do (in most projects) to get work.
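Put together, a minimal cc_config.xml carrying the two settings quoted above looks like this (the option values come straight from the post; the file lives in the BOINC data directory, and a running client re-reads it after `boinccmd --read_cc_config`):

```xml
<cc_config>
  <options>
    <max_file_xfers>20</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
  </options>
</cc_config>
```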
