Message boards : Number crunching : some hosts won't get tasks
This one is a real head-scratcher for me. Ever since the new app was released, two of my hosts have not been able to receive tasks. They don't give any error or other obvious sign that anything is wrong; they just always get the "no tasks available" response.

ID: 57146
We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-DDoS system, but I don't think we know much about it other than that it happens.

ID: 57148
> We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-DDoS system, but I don't think we know much about it other than that it happens.

That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address). In that case, a scheduler request would fail but occasionally get through. I've worked around this problem for a long time and nothing has changed in that regard. These systems are spread across 3 physical locations, and one of the systems (the 8x 2070) is actually the only host at its IP; it's not competing with any other system. So that's not the issue here.

I have no problem making scheduler requests, and it's always asking for work, but these two for some reason always get the response that no tasks are available. It seems unlikely that they would be THAT unlucky to never get a resend when 3 other systems are occasionally picking them up.

ID: 57149
> That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

I am well familiar with the temporary block. There are two problems present; the second problem is longer-term.

ID: 57153
> That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

Can you link to some additional information about this second case? I've never seen that discussed here, only the one I mentioned. But again, the server is responding, so it's not actually being blocked from communication; the server just always responds that there are no tasks, even when there probably are at times.

ID: 57154
It has been over a year since I last saw it mentioned. I searched my own posts, but unfortunately the search function does not work correctly.

ID: 57156
I found this interesting: Message #54344 from Retvari Zoltan.

ID: 57157
Thanks for digging, but those both describe different situations. In the case from Zoltan, the user was getting a message that tasks won't finish in time. I am not getting any such message, only that tasks are not available.

ID: 57158
I'm assuming GPUGrid is your only GPU project?

ID: 57159
GPUGRID is the only GPU project with a non-zero resource share.

ID: 57160
this "feels" like a similar issue being described here. maybe not exactly the same, but something similar at least. | |
ID: 57175 | Rating: 0 | rate:
![]() ![]() ![]() | |
> Richard, do you remember anything about this?

Yes, I remember it well. Message 150509 was one of my better bits of bug-hunting. But I also draw your attention to Message 150489:

> All my machines have global_prefs_override.xml files, so are functioning normally in spite of the oddities.

ID: 57176
I have an override file as well. What's the significance of that? It has my local settings, but what's the significance to this issue?

ID: 57179
> I have an override file as well. What's the significance of that? It has my local settings, but what's the significance to this issue?

An override file always takes precedence over any project preference file. It is completely local to the host.

ID: 57180
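For reference, a minimal global_prefs_override.xml looks something like the sketch below. The element names are standard BOINC preferences, but the particular elements and values shown are examples only; any element present in the file overrides the matching web (project) preference, and anything omitted keeps using the web value.

```xml
<!-- Example global_prefs_override.xml, kept in the BOINC data directory.
     Values are illustrative; include only the settings you want to override locally. -->
<global_preferences>
   <run_if_user_active>1</run_if_user_active>
   <max_ncpus_pct>75</max_ncpus_pct>
   <cpu_usage_limit>100</cpu_usage_limit>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.5</work_buf_additional_days>
</global_preferences>
```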
The significance lies in the fact that Einstein has re-written large parts of their server code in Drupal. In some respects, their re-write didn't exactly correspond with the original Berkeley PHP (or whatever it was) version of the code.

ID: 57181
I figured the exact Einstein issue was not causing any issue at GPUGRID, just that some aspects of that situation feel similar to what's happening now.

ID: 57182
> ...the other two hosts have not received anything since July 1st. always getting the "no tasks available" message...

July 1st is exactly the date when the new application version ACEMD 2.12 (cuda1121) was launched, and both of your problematic hosts haven't received any task of this new version. Simply a coincidence? I think not.

ID: 57183
> ...the other two hosts have not received anything since July 1st. always getting the "no tasks available" message...

I agree with this, but so far I can find no difference between the setup of the two bad hosts that would prevent them from getting work. It's the same as on hosts that are getting work.

ID: 57184
I was thinking of some subtle change in the requirements for task sending on the server side, more than on your hosts' side...

ID: 57185
> I was thinking of some subtle change in the requirements for task sending on the server side, more than on your hosts' side...

Yeah, but if the hosts look the same from the outside, they should meet the same requirements. I think it's something less obvious, where the server isn't telling me what the problem is.

ID: 57186
I suggest you force BOINC / GPUGrid to assign a new host ID to your non-working hosts.

ID: 57195
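Assuming the boinccmd CLI is available, one way to do that is to detach the host from the project and re-attach it, which normally makes the scheduler create a fresh host record. A rough sketch, where the account key is a placeholder for the key shown on the project website, and the URL should match the one displayed in the BOINC Manager; note that detaching abandons any tasks in progress:

```
# Sketch only: detach and re-attach GPUGRID so the scheduler issues a new host ID.
# Replace <YOUR_ACCOUNT_KEY> with the account/weak key from the project website.
boinccmd --project https://www.gpugrid.net/ detach
boinccmd --project_attach https://www.gpugrid.net/ <YOUR_ACCOUNT_KEY>
```

The old host entry can then be merged into the new one from the website, as noted below.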
Might be a solution. Easy enough to do and you can always merge the old hostID back into the new ID.

ID: 57196
> I suggest you force BOINC / GPUGrid to assign a new host ID to your non-working hosts.

Could be a solution, but right now not much work is available anyway. I will wait until work is plentiful again and reassess. If I'm still not getting work when there are thousands of tasks ready to send, then I'll do it. I'd really prefer not to, though.

ID: 57197
Last ACEMD3 work unit seen was 27077654 (8th July 2021). It errored out. This same error seems to happen on others' hosts too, yet one has successfully completed it:

    <core_client_version>7.9.3</core_client_version>
    <![CDATA[
    <message>
    process exited with code 195 (0xc3, -61)
    </message>
    <stderr_txt>
    10:36:10 (18462): wrapper (7.7.26016): starting
    10:36:10 (18462): wrapper (7.7.26016): starting
    10:36:10 (18462): wrapper: running acemd3 (--boinc input --device 0)
    acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
    10:36:11 (18462): acemd3 exited; CPU time 0.000578
    10:36:11 (18462): app exit status: 0x7f
    10:36:11 (18462): called boinc_finish(195)
    </stderr_txt>
    ]]>

Perhaps some bugs waiting to be solved?

ID: 57198
> Last ACEMD3 work unit seen was 27077654 (8th July 2021). It errored out. This same error seems to happen on others' hosts too, yet one has successfully completed it:

You need to install the Boost 1.74 package from your distribution or from a PPA. I have no idea what system you have since your computers are hidden, and the install process will vary from distribution to distribution; on Ubuntu there is a PPA for it. That will fix your error.

ID: 57199
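For what it's worth, a rough sketch of the fix on a Debian/Ubuntu-based system. The package name is a distribution-specific assumption: Ubuntu 21.04 and later carry Boost 1.74 in the standard repositories, while older releases and CentOS 6/7 would need a PPA, a third-party repo, or a local Boost build instead.

```
# Install the Boost filesystem runtime library that acemd3 is looking for
sudo apt update
sudo apt install libboost-filesystem1.74.0

# Confirm the missing shared object is now resolvable by the loader
ldconfig -p | grep libboost_filesystem.so.1.74.0
```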
Ok, thanks for the info. My computers run mostly CentOS 6/7, but there is one Linux Mint and one Win10 also.

ID: 57200
I think it's resolved now.

ID: 57267
Ian,

    Tue 07 Sep 2021 09:03:21 BST | | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91, CUDA version 11.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
    Tue 07 Sep 2021 09:03:21 BST | | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91.03, device version OpenCL 1.2 CUDA, 5942MB, 3974MB available, 5153 GFLOPS peak)

All of which seems to match your settings, but I've still never been sent a task beyond version 212. Any ideas?

ID: 57279
I have to assume that their CUDA version is "11.2.1", the .1 denoting the Update 1 release, based on the fact that their app plan class is cuda1121.

ID: 57282
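A quick way to check what CUDA level the client is actually reporting. The path below is the usual Linux package location for the BOINC data directory and may differ on other installs:

```
# coproc_info.xml is rewritten when the client starts; 11020 = CUDA 11.2, 11040 = CUDA 11.4
grep '<cudaVersion>' /var/lib/boinc-client/coproc_info.xml

# The driver's supported CUDA level is also shown in the nvidia-smi header
nvidia-smi | head -n 4
```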
OK, I'll see your 465 and raise you 470 (-:

    Wed 08 Sep 2021 12:04:41 BST | | CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57, CUDA version 11.4, compute capability 7.5, 4096MB, 3972MB available, 5530 GFLOPS peak)
    Wed 08 Sep 2021 12:04:41 BST | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57.02, device version OpenCL 3.0 CUDA, 5942MB, 3972MB available, 5530 GFLOPS peak)

It sounds plausible: coproc_info had <cudaVersion>11020</cudaVersion>; it now has <cudaVersion>11040</cudaVersion>. No tasks on the first request, but as you say, they're as rare as hen's teeth. I'll leave it trying and see what happens.

ID: 57283
OK, so I've got a Cryptic_Scout task running with v217 and cuda1121.

ID: 57285
Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

ID: 57286
> Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

The nvidia-smi command will quickly confirm this.

ID: 57287
Yup, so it has.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
    | 55%   87C    P2   126W / 125W |   1531MiB /  5941MiB |    100%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 166...  Off  | 00000000:05:00.0 Off |                  N/A |
    | 31%   37C    P8    11W / 125W |      8MiB /  5944MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                 89MiB |
    |    0   N/A  N/A     49977      C   bin/acemd3                        302MiB |
    |    0   N/A  N/A     50085      C   ...nux-gnu__GW-opencl-nvidia     1135MiB |
    |    1   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                  4MiB |
    +-----------------------------------------------------------------------------+

acemd3 running on GPU 0 is conclusive. And so to bed.

ID: 57288
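Not a fix for the scheduling quirk itself, but if both projects keep landing on device 0, one possible workaround is to pin each project to its own card with <exclude_gpu> entries in cc_config.xml. The assignment below is purely illustrative, and the URLs must match the ones shown in the BOINC Manager:

```xml
<!-- Example cc_config.xml: keep Einstein off device 0 and GPUGRID off device 1 -->
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://einstein.phys.uwm.edu/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

BOINC picks the change up after "boinccmd --read_cc_config" or a client restart.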
After not crunching for several months, I started back again about a month ago. It took some time due to limited work units, but I received some GPUGrid WUs starting the first week of October; now I haven't received any since October 6th. I have tried snagging one when some are showing as available and only receive the message "No tasks are available for New version of ACEMD" in the BOINC Manager event log. Any ideas what I may have changed or not set correctly? (I am receiving and crunching Einstein and Milkyway WUs. GPUGrid resource share is set 15 times higher than Einstein and 50 times higher than Milkyway.)

ID: 57583
No tasks available. Your system looks fine to me.

ID: 57584