
Message boards : Number crunching : Managing non-high-end hosts

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57627 - Posted: 16 Oct 2021 | 11:33:53 UTC

Managing non-high-end (slow) hosts

The current extremely long ACEMD3 Gpugrid tasks represent a serious challenge for non-high-end hosts like mine.
My current fastest host, based on a Turing GTX 1660 Ti GPU, takes about 1 day and 4.5 hours to process one of these tasks (around 103,000 seconds).
My slowest one, based on a Pascal GTX 1050 Ti GPU, takes about 3 days and 9 hours on the same kind of tasks (around 291,000 seconds).

Are these slower hosts worth using for processing such heavy tasks?
My personal opinion: absolutely yes, as long as they are reliable systems.
It would be of no use if a host returned an invalid/failed result after holding a task for several days.
Let's take an example to support my opinion:
Task e7s106_e5s196p1f1036-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0214_7 was recently processed on my slowest host, mentioned above.
It took 290,627 seconds to return a valid result. That is: 3 days, 8 hours, 43 minutes and 47 seconds...
But taking a close look at Work Unit #27082868, from which it was hanging, it had previously failed on 7 other hosts.
The maximum allowed number of failed tasks for current work units is 7, so the task would not have been resent to any other host if mine had also failed, and the work unit would have been lost for its intended scientific purpose...
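To double-check run times like these, a tiny POSIX-shell helper (my own, purely illustrative) converts a task's run time in seconds into days/hours/minutes/seconds:

```shell
#!/bin/sh
# Convert a task run time in seconds into days, hours, minutes and seconds.
dhms() {
    s=$1
    printf '%dd %dh %dm %ds\n' \
        "$((s / 86400))" "$((s % 86400 / 3600))" "$((s % 3600 / 60))" "$((s % 60))"
}

dhms 290627   # the run time above: 3d 8h 43m 47s
dhms 432000   # the 5-day deadline: 5d 0h 0m 0s
```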

Over time, I've had to adapt the BOINC Manager settings on my hosts, trying to squeeze out the maximum performance and avoid missing deadlines.
Here are my experiences:

My Computing preferences for a 4-core CPU host look as follows:

Where:
-1) Use at most 50% of the CPUs, so the CPU is not overcommitted and can keep feeding the GPU. This leaves two full CPU cores free to attend to general system requirements.
-2) Use at most 100% of the CPU time, to prevent CPU throttling.
-3) Never suspend, so the task is processed with the minimum pauses possible (preferably in one go).
-4) Set the tasks buffer to a minimum, so no time is wasted waiting for the current task to finish.
-5) Switch between tasks every 9999 minutes, giving the current task enough time to be processed in one go before switching to the next.

My Network preferences look as follows:

I set a certain download/upload rate limit, so that one host doesn't saturate the network bandwidth for my other hosts.
But I try to set the limits high, because download + upload times count when a task is close to its deadline, or to the credit bonus limit...

And my Disk and memory preferences look as follows:

-1) The current Gpugrid environment uses a lot of disk space. I set all available space to be usable by BOINC, except for a certain safety margin so the disk doesn't end up saturated.
-2) Regarding memory usage, I raise the default limits to empirically tested values, as high as possible without the system becoming unresponsive to its other general tasks.
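On headless hosts, the same Computing, Network and Disk limits can be pinned down without the GUI: BOINC reads a global_prefs_override.xml file from its data directory that overrides the website preferences. Here is a sketch mirroring the settings above (the element names are the standard BOINC ones; the byte-rate, disk and memory values are placeholders to adapt to each host):

```xml
<!-- global_prefs_override.xml (BOINC data directory) -->
<global_preferences>
   <max_ncpus_pct>50</max_ncpus_pct>                     <!-- 1) at most 50% of the CPUs -->
   <cpu_usage_limit>100</cpu_usage_limit>                <!-- 2) at most 100% of CPU time -->
   <run_if_user_active>1</run_if_user_active>            <!-- 3) never suspend while the PC is in use -->
   <work_buf_min_days>0.01</work_buf_min_days>           <!-- 4) minimal tasks buffer -->
   <work_buf_additional_days>0</work_buf_additional_days>
   <cpu_scheduling_period_minutes>9999</cpu_scheduling_period_minutes>  <!-- 5) switch every 9999 min -->
   <max_bytes_sec_down>8000000</max_bytes_sec_down>      <!-- download limit, bytes/s (placeholder) -->
   <max_bytes_sec_up>8000000</max_bytes_sec_up>          <!-- upload limit, bytes/s (placeholder) -->
   <disk_min_free_gb>5</disk_min_free_gb>                <!-- safety margin so the disk never fills up -->
   <ram_max_used_busy_pct>90</ram_max_used_busy_pct>     <!-- memory ceiling, tune empirically -->
</global_preferences>
```

After editing the file, "Read local prefs file" in BOINC Manager (or restarting the client) applies it.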

And regarding system reliability:
- I empirically test every system and each GPU for safe overclock settings, if I apply any. I prefer a robust, reliable system to a slightly faster but sporadically failing one.
- I frequently check temperatures on every host, and try to keep them at reasonably low levels. Lower temperatures result in a more reliable and faster system.
- I perform preventive hardware maintenance when a temperature rise is detected in some component, or when a host starts failing tasks for no known reason.
Regarding this last matter, I share my experiences in The hardware enthusiast's corner thread.

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57628 - Posted: 16 Oct 2021 | 17:50:27 UTC

When you are commanding a fleet of assorted slow GPUs (currently 8, in my case), it is difficult to keep in mind how long each of them will take to finish its tasks.
Here is a screenshot of the spreadsheet that I use for this purpose, at the time of writing this:



For me, it is useful for knowing when to request new tasks, as the ones in process are about to finish.
This helps keep my GPUs crunching most of the time.
An editable copy of this spreadsheet can be downloaded from here.

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,040,000,871
RAC: 1,015,809
Message 57631 - Posted: 17 Oct 2021 | 17:01:54 UTC

The slowest GPUs that I am using for GPUgrid are GTX 1070s. These average about 36 hours per task on Windows hosts.

They seem to work fine, just need to let them run. A stable system with a reliable connection is really the best you can do.

I have not tried anything slower, like a GTX 1060 or some old Nvidia Grid K2 cards. The K2s are only running Milkyway jobs, which they do well with.

There comes a time when the technology outpaces the physical hardware we have available but I am a firm believer of using what we have as long as possible.

Remember, I'm the guy with the "Ragtag fugitive fleet" of old HP/HPE servers and workstations that I have saved from the scrap pile.

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57632 - Posted: 17 Oct 2021 | 22:08:16 UTC - in response to Message 57631.

There comes a time when the technology outpaces the physical hardware we have available but I am a firm believer of using what we have as long as possible.

Yeah,
looking at an old table of my working GPUs from 2019:



I published this table in Message #52974, in the "Low power GPUs performance comparative" thread.
Since then, I've retired all my Maxwell GPUs from Gpugrid production:
the GTX 750 and GTX 750 Ti, for not being able to process the current ADRIA tasks inside the 5-day deadline.
I estimate that the GTX 950 could process them in about 3 days and 20 hours, but it isn't worth it due to its low energy efficiency.
And the Pascal GT 1030, I estimate, would take about 6 days and 10 hours...

Remember, I'm the guy with the "Ragtag fugitive fleet" of old HP/HPE servers and workstations that I have saved from the scrap pile.

Unforgettable, since today you are at the top of the Gpugrid users ranking by RAC ;-)

Finrond
Joined: 26 Jun 12
Posts: 12
Credit: 867,976,385
RAC: 0
Message 57648 - Posted: 25 Oct 2021 | 14:59:18 UTC

I've been running these tasks on a 1060 6 GB; they take 161,000-163,000 seconds to complete. I will keep running tasks on this card until it can no longer meet the deadlines. I've been trying to hit the billion-point milestone, which would take another year or so if I could reliably get work units. Let's hope I can still run this card for that long!

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57701 - Posted: 31 Oct 2021 | 19:15:15 UTC

Some considerations about Gpugrid tasks deadlines:

The criterion for receiving credits for a valid returned result is not strictly that it fits inside the 5-day deadline... Let's explain it.

The official deadline for a Gpugrid task is currently 5 days, counting from the moment the Gpugrid server starts sending it to the receiving host.
5 days are exactly 432,000 seconds.
This time includes the task download time, the idle time until the task starts being processed, the pause times (if any), the total processing time, and the upload time of the result.
When this time has passed, the server will generate a copy of the same task and resend it to a new host.

Here comes the hint:
An overdue task will receive no credits for a returned result when it goes past its deadline AND a valid result is received from the other host first.

Continuing to process a task past its deadline is, in a way, a bet.
Depending on the receiving host, in practice the deadline is extended by the time that this new host takes to receive, process, and report a valid result for the task.
If the new host is the fastest one for the current set of ADRIA tasks, it will take a mere 5 hours to process... That is the minimum deadline extension that you might expect for an overdue task.
My fastest host takes 1 day and 4.5 hours with these tasks. That is the minimum extension that you could expect from my fleet.
If the resent task lands on a slower-than-average host, you can expect an even longer extension.
- If you report a valid result for your overdue task even 1 second before the new host, both tasks will receive the base (no bonus) credit amount.
- If the new host reports a valid result even 1 second before yours, it will get the credits awarded, and your task will be labeled "Completed, too late to validate": 0 credits.
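The resulting credit tiers (the +50% and +25% bonus thresholds of 24 and 48 hours are mentioned later in this thread) can be condensed into a small illustrative shell function keyed on total turnaround time:

```shell
#!/bin/sh
# Credit tier for a valid GPUGrid result, given total turnaround time in seconds.
# Thresholds as discussed in this thread: +50% under 24 h, +25% under 48 h,
# base credit under the 5-day (432,000 s) deadline.
credit_tier() {
    t=$1
    if   [ "$t" -lt  86400 ]; then echo "+50% bonus (returned within 24 hours)"
    elif [ "$t" -lt 172800 ]; then echo "+25% bonus (returned within 48 hours)"
    elif [ "$t" -lt 432000 ]; then echo "base credit (within the 5-day deadline)"
    else echo "overdue: credit only if your result validates before the resent copy"
    fi
}

credit_tier 103000   # my GTX 1660 Ti (1 day 4.5 hours): +25% bonus
credit_tier 291000   # my GTX 1050 Ti (3 days 9 hours): base credit
```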

One practical example:
I decided to test whether I could get an old Maxwell GTX 750 Ti GPU to finish one of the current extremely long ADRIA tasks within the deadline.
I woke up my host #325908, applied an aggressive +230 MHz overclock to the GPU, and received task #32657449.
It took 473,615 seconds to process this task. Too long to fit the deadline!
This task was hanging from WU #27084724.
My host returned a valid result 45,150 seconds past the original deadline, but 17,295 seconds before the host that received the resent task.
Both tasks received 450,000 credits... This time I won my bet! 🎉

If I had aborted my task, failed it, or taken more than 17,295 extra seconds, the other host would have received 675,000 credits: the 50% bonus for returning its result within the first 24 hours.
It is an undesirable side effect, and I apologize for it.
My example task exceeded the 5-day deadline by 45,150 seconds (12 hours, 32 minutes and 30 seconds in excess).
I'll give a new task one last try, carefully readjusting the overclock on this 46-watt low-power-consuming GPU and its harboring host.
If it still misses the 5-day deadline, I'll definitively retire this GPU from processing the current Gpugrid ADRIA tasks.
Conversely, if I'm successful, I'll publish the measures taken. I strongly doubt it, since I have to trim more than 13 hours off the processing time 🎬⏱️

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,040,000,871
RAC: 1,015,809
Message 57702 - Posted: 1 Nov 2021 | 1:13:39 UTC

Well said ServicEnginIC

At one time there was data on the Performance tab that would give users a rough idea of where their GPUs would fit in.

There are a great number of variables that affect GPU performance, but users could at least tell if they were in the ballpark.

I would very much like to see the Performance tab reinstated so that these general comparisons are available.

While it's good for users to keep using GPUs as long as they are viable, there comes a time when they should be retired or used for other projects.

I hate to see people burn time and energy with little or no result to show for it.

Maybe someone could come up with a general guide on which GPUs are useful for GPUgrid, going back 2 or 3 generations.

If users had a breakpoint at which they should seriously consider upgrading to something more recent, then they would at least have something to work toward.



Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 57704 - Posted: 1 Nov 2021 | 9:32:24 UTC - in response to Message 57702.


If users had a breakpoint at which they should seriously consider upgrading to something more recent, then they would at least have something to work toward.

well, the point here is that in many cases it's not just a matter of removing the "old" graphics card and putting in a new one.
Often enough (as with some of my rigs, too), new-generation cards are not well compatible with the remaining old PC hardware.

So, in many cases, being up to date GPU-wise would mean buying a new PC :-(

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57705 - Posted: 1 Nov 2021 | 9:36:13 UTC - in response to Message 57702.
Last modified: 1 Nov 2021 | 9:44:26 UTC

I fully agree, jjch.

I reproduce my own words from Message #53367, in response to user Piri1974, where both sides of your argument were mentioned...

But anyway, I would not recommend buying the GT710 or GT730 anymore unless you need their very low consumption.

I find both of these models perfect for office computers.
Especially the fanless models, which offer silent and smooth operation for office applications, together with their low power consumption.
But I agree that their performance is rather scarce for processing at GPUGrid.
I've made a kind request for the Performance tab to be rebuilt.
At the end of that tab there was a graph named "GPU performance ranking (based on long WU return time)".
Currently this graph is blank.
When it worked, it showed a very useful classification of GPUs according to their respective performance at processing GPUGrid tasks.
The GT 1030 sometimes appeared at the far right (lowest performance) of the graph, and other times appeared as a legend outside the graph.
The GTX 750 Ti always appeared borderline on this graph, and the GTX 750 did not appear at all.
I always considered that a kind invitation not to use "out of graph" GPUs...

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57707 - Posted: 1 Nov 2021 | 12:04:56 UTC - in response to Message 57704.

So, in many cases, being up to date GPU-wise would mean buying a new PC :-(

Right.
And I find it particularly annoying when trying to upgrade Windows hosts...
As a hardware enthusiast, I've upgraded four of my Linux rigs myself, from old socket LGA 775 Core 2 Quad processors and DDR3 RAM motherboards to new i3/i5/i7 processors and DDR4 RAM ones.
I find the Linux OS to be very resilient to this kind of change, usually with no need for more care than upgrading the hardware.
But one of them is a Linux/Windows 10 dual-boot system.
While Linux took the changes smoothly, I had to buy a new Windows 10 license to replace the previously existing one...
I related it in detail in my Message #55054, in "The hardware enthusiast's corner" thread.

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57736 - Posted: 3 Nov 2021 | 20:04:42 UTC - in response to Message 57701.

Overclocking to wacky (?) limits

Conversely, if I'm successful, I'll publish the measures taken. I strongly doubt it, since I have to trim more than 13 hours off the processing time 🎬⏱️

Well, we already have a verdict:

Task e13s161_e5s39p1f565-ADRIA_AdB_KIXCMYB_HIP-1-2-RND6743_1 was sent to the mentioned Linux host #325908 on 29 Oct 2021 at 19:03:02 UTC.
This same host took 473,615 seconds to process a previous task, and I set myself the challenge of trimming more than 13 hours off the processing time to fit a new task within the deadline.
To achieve this, I had to carefully study several strategies to apply, and I'll try to share with you everything I've learnt along the way.

We're talking about a GV-N75TOC-2GI graphics card.
I've found this 46-watt Gigabyte card to be very tolerant of heavy overclocking, probably due to a good design and to its extra 6-pin power connector. Other manufacturers choose to take all the power from the PCIe slot for cards consuming 75 watts or less...
This graphics card isn't in its original shape: I had to refurbish its cooling system, as I related in my Message #55132 in "The hardware enthusiast's corner" thread.
It is currently installed on an ASUS P5E3 PRO motherboard, also refurbished (Message #56488).

Measures taken:

The mentioned motherboard has two PCIe x16 V2.0 slots, slot #0 occupied by the GTX 750 Ti graphics card and slot #1 unused.
To gain the maximum bandwidth available for the GPU, I entered the BIOS setup and disabled the integrated PATA (IDE) interface, sound, Ethernet, and RS232 ports.
Communications are managed by a WiFi NIC installed in a PCIe x1 slot, and storage is handled by a SATA SSD.
I also set eXtreme Memory Profile (X.M.P.) to ON, and the Ai Clock Twister parameter to STRONG (the highest performance available).

Overclocking options had previously been enabled on this Ubuntu Linux host by means of the following terminal command:

sudo nvidia-xconfig --thermal-configuration-check --cool-bits=28 --enable-all-gpus

The setting is persistent, and executing the command once is enough.

After that, in Nvidia X Server Settings, options for adjusting the fan curve and the GPU and memory frequency offsets become available.
First of all, I adjust the GPU fan setting to 80%, thus enhancing cooling compared to the default fan curve.
Then, I apply a +200 MHz offset to the memory clock, increasing it from the original 5400 MHz to 5600 MHz (GDDR at 2800 MHz x 2).
And finally, I gradually increase the GPU clock until the power limit for the GPU is reached while working at full load.
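The same adjustments can also be scripted with nvidia-settings instead of the GUI. The attribute names below are the standard NV-CONTROL ones, but the performance-level index in brackets varies per GPU (query it with nvidia-settings -q GPUPerfModes), so treat this as a sketch. It is written as a dry run (the commands are echoed, not executed), since actually applying them needs a running X session with Coolbits enabled:

```shell
#!/bin/sh
# Dry-run sketch of the Nvidia X Server Settings steps above.
# Set RUN= (empty) to actually apply; needs X running and Coolbits enabled.
GPU=0      # GPU index
LVL=3      # performance level: varies per GPU, check: nvidia-settings -q GPUPerfModes
RUN=echo

$RUN nvidia-settings -a "[gpu:$GPU]/GPUFanControlState=1"                   # manual fan control
$RUN nvidia-settings -a "[fan:$GPU]/GPUTargetFanSpeed=80"                   # fan at 80%
$RUN nvidia-settings -a "[gpu:$GPU]/GPUMemoryTransferRateOffset[$LVL]=200"  # +200 MHz effective
$RUN nvidia-settings -a "[gpu:$GPU]/GPUGraphicsClockOffset[$LVL]=100"       # raise gradually
```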
To determine the power limit, the following command is useful:

sudo nvidia-smi -q -d power

For this particular graphics card, the power limit is factory-set at 46.20 W, which coincides with the maximum allowed for this kind of GTX 750 Ti GPU.
And the final clock settings look this way.
With this setup, let's look at an interesting redundancy check, by means of the following nvidia-smi GPU monitoring command:

sudo nvidia-smi dmon

As can be seen at the previous link, the GPU consistently reaches a maximum frequency of 1453 MHz, with power consumption of more than 40 watts, frequently reaching 46 and even 47 watts.
Temperatures maintain a comfortable level of 54 to 55 °C, and GPU usage is at 100% most of the time.
That's good... as long as the processing remains reliable... Will it?

⏳️🤔️

This new task e13s161_e5s39p1f565-ADRIA_AdB_KIXCMYB_HIP-1-2-RND6743_1 was processed by this heavily overclocked GTX 750 Ti GPU in 405,229 seconds, and a valid result was returned on 03 Nov 2021 at 11:48:55 UTC.
OK, I finally managed to trim the processing time by 68,386 seconds. That is: 18 hours, 59 minutes and 46 seconds less than the previous task.
It fit within the deadline with a margin of 7 hours, 14 minutes and 7 seconds to spare... (The transition from summer to winter time gave an extra hour this Sunday !-)

Challenge completed!

🤗️

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 57886 - Posted: 27 Nov 2021 | 12:32:48 UTC
Last modified: 27 Nov 2021 | 12:37:56 UTC

Yesterday, a new batch of "_ADRIA_BanditGPCR_APJ_b0-" tasks came out.
It seems that the project has attended to the request to reduce the size of the tasks.
This will allow slower GPUs to process them within the 5-day deadline.

Based on estimates from this morning (about 7:00 UTC), all my currently working GPUs will return their tasks in time.
The GTX 1660 Ti is running for the full bonus (+50% for a result returned within 24 hours).
From the GTX 1650 SUPER down to the GTX 1050 Ti, they are running for the mid bonus (+25% for a result returned within 48 hours).
Even the GTX 750 Ti will get its base credit with no problem, for returning its result within the 5-day deadline (120 hours).



This might also solve the problem of sporadic oversized result files failing to upload. Good.

Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 58021 - Posted: 6 Dec 2021 | 19:27:14 UTC

could it be that a GTX 1650 is not able to crunch the current series of tasks?

Today, for the first time, I tried GPUGRID on my host with a GTX 1650 inside, and the task failed after some 4 hours:

https://www.gpugrid.net/result.php?resultid=32721627

Excerpt from the stderr:

ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_TIMEOUT (702)
19:28:44 (6344): bin/acemd3.exe exited; CPU time 13408.734375

So, as sorry as I am, I will go back to crunching other projects on this host :-(

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Message 58022 - Posted: 6 Dec 2021 | 20:29:33 UTC - in response to Message 58021.

could it be that a GTX 1650 is not able to crunch the current series of tasks?
I don't think so.

Today, for the first time, I tried GPUGRID on my host with a GTX 1650 inside, and the task failed after some 4 hours
I can suggest only the usual:
check your GPU temperatures, and lower the GPU frequency by 50 MHz.

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 58023 - Posted: 6 Dec 2021 | 21:22:43 UTC - in response to Message 58021.

could it be that a GTX 1650 is not able to crunch the current series of tasks?

I'm processing the current ADRIA tasks on five GTX 1650 graphics cards of varied brands and models.
They're behaving rock stable, and getting the mid bonus (valid result returned within 48 hours) when working 24/7.
Perhaps a noticeable difference from yours is that all of them are working under Linux, where variables such as antivirus and other interfering software can be discarded...
As always, Retvari Zoltan's wise advice is to be taken into account.

Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 58024 - Posted: 7 Dec 2021 | 6:43:52 UTC

well, the GPU temperature was at 60/61 °C, so not too high, I would guess.
The GPU runs at stock frequency, but I could try to go below it, as suggested by Zoltan.

However, meanwhile my suspicion is that the old processor
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
may be the culprit.

I will give it one more try; if it fails again, I guess this host is not able to successfully run GPUGRID (however, it works fine with WCG GPU tasks and with F&H).

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 58025 - Posted: 7 Dec 2021 | 8:57:01 UTC - in response to Message 58024.

However, meanwhile my suspicion is that the old processor
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
may be the culprit.

Surely you're right.
For a two-core CPU, I'd recommend setting "Use at most 50% of the CPUs" in the BOINC Manager Computing preferences.
This will leave one CPU core free for feeding the GPU.
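A finer-grained alternative is an app_config.xml in the gpugrid project directory, telling the BOINC scheduler to budget one full CPU core per GPU task instead of halving the CPU count globally. This is a sketch: the app name acemd3 is an assumption inferred from the bin/acemd3.exe binary seen in the stderr logs, so verify it in client_state.xml before using it:

```xml
<!-- app_config.xml, placed in projects/www.gpugrid.net/ -->
<app_config>
   <app>
      <name>acemd3</name>  <!-- assumed app name; check <app_name> in client_state.xml -->
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>  <!-- one task per GPU -->
         <cpu_usage>1.0</cpu_usage>  <!-- reserve a full CPU core to feed it -->
      </gpu_versions>
   </app>
</app_config>
```

BOINC picks the file up after "Read config files" in the Manager, or after a client restart.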

Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 58026 - Posted: 7 Dec 2021 | 9:39:41 UTC - in response to Message 58025.

However, meanwhile my suspicion is that the old processor
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
may be the culprit.

Surely you're right.
For a two-core CPU, I'd recommend setting "Use at most 50% of the CPUs" in the BOINC Manager Computing preferences.
This will leave one CPU core free for feeding the GPU.

the question then, though, is: how long would it take a task to finish? Probably 4-5 days :-(

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 58028 - Posted: 7 Dec 2021 | 10:54:14 UTC - in response to Message 58026.

Each of my five GTX 1650 graphics cards is currently taking less than 48 hours, in time to get the mid bonus (+25%).
I published a table of computing times in Message #57886.

Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 58029 - Posted: 7 Dec 2021 | 11:59:18 UTC - in response to Message 58028.
Last modified: 7 Dec 2021 | 11:59:45 UTC

Each of my five GTX 1650 graphics cards is currently taking less than 48 hours, in time to get the mid bonus (+25%).
I published a table of computing times in Message #57886.

I just looked up your CPUs: they are generations newer than my old Intel Core 2 Duo E7400 @ 2.80 GHz.

I am afraid that this new series of GPUGRID tasks demands too much of the old CPU :-(

But, as said, I'll give it another try once new tasks become available.

Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 58030 - Posted: 7 Dec 2021 | 14:34:26 UTC - in response to Message 58029.

But, as said, I'll give it another try once new tasks become available.

As recommended, I set "Use at most 50% of the CPUs" in the BOINC Manager Computing preferences, and I lowered the GPU frequency by 50 MHz.

Now new tasks were downloaded, but they failed after less than a minute.

What I noticed is that the stderr says:

ACEMD failed:
Particle coordinate is nan


https://www.gpugrid.net/result.php?resultid=32722445
https://www.gpugrid.net/result.php?resultid=32722418

So the question now is:
did my changes to the settings cause the tasks to fail that quickly after starting, or are the tasks misconfigured?

BTW, at the same time another machine (with an Intel Core i7-4930K CPU and a GTX 980 Ti inside) got a new task, and it is working well.

This could indicate that the tasks are NOT misconfigured, but rather that the changes in the settings are the reason for the failure. No idea.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Message 58031 - Posted: 7 Dec 2021 | 15:15:57 UTC - in response to Message 58030.
Last modified: 7 Dec 2021 | 15:17:54 UTC

Firstly, I don't think the issue you experienced is related to overclocking or GPU temps at all; when temps or OC are the culprit you'll usually get a "particle coordinate is nan" error (though the nan error can also come from a bad WU and not be your fault, more on that later). Your error was a CUDA timeout: likely the driver crashed and the app couldn't hook back in. I'm on the fence about whether your CPU is the ultimate reason for this. It's certainly a very old platform to be running Windows 10 on, so it's possible there are some issues. If you're comfortable trying Linux, particularly a lightweight version with less system overhead, you might try that to see if such an old system gives you a better experience.

Your CPU being a Core2Duo, that architecture does not have a dedicated PCIe link to the CPU. It uses the older design where the PCIe lanes and memory connect to the Northbridge chipset, and the chipset has a single 10.6 GB/s link to the CPU. The memory takes most of this bandwidth, unfortunately, and since GPUGRID is pretty heavy on bus use, I can see some conflicts happening on this kind of architecture. But CPU power itself shouldn't be an issue if you're running 1 GPU and no CPU work.

Secondly, with regard to the work that has been flowing this morning, a lot of them are bad WUs giving the nan error, so you can't jump to the conclusion that whatever you changed wasn't effectual. I have errored out something like 80% of the WUs received this morning, and my system is rock solid stable. If you check the WU records for your tasks this morning, you will see that all of your wingmen errored out too, so it's not just you.

Erich56
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Message 58032 - Posted: 7 Dec 2021 | 17:10:16 UTC - in response to Message 58031.

Ian&Steve C., thanks for the thorough explanations.

I tried it one more time, and again the task failed after some 7,000 seconds.

Excerpt from stderr:

ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_TIMEOUT (702)

The complete report can be seen here: https://www.gpugrid.net/result.php?resultid=32722583

So it seems clear that the current setup (hardware, software) is not working with GPUGRID :-(
Which is too bad, because before, with a GTX 750 Ti inside, I had successfully crunched many hundreds of GPUGRID tasks.
Maybe the new GTX 1650 does not fit well into the overall setup, or the GPUGRID tasks strain the system more than ever before.

As mentioned earlier, everything works well with the GPU tasks from WCG and with Folding&Home.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Message 58033 - Posted: 7 Dec 2021 | 17:31:12 UTC - in response to Message 58032.
Last modified: 7 Dec 2021 | 17:35:40 UTC

The very first thing I would recommend you try is to totally wipe out your existing Nvidia drivers with DDU: https://www.guru3d.com/files-details/display-driver-uninstaller-download.html

Boot into safe mode and run DDU to do a complete removal of the driver from all areas of your system, including the registry, and make sure to select the option that prevents Windows from installing the driver automatically (or unplug the network cable so it can't). Then reinstall the latest stable (WHQL) Nvidia driver for your platform. This will eliminate driver corruption (common on Windows) as a potential cause of your problem.

But if that still doesn't help, then I think the high PCIe use of GPUGRID ACEMD3 tasks is what's killing you, combined with the old architecture of that platform, which wasn't designed to handle this kind of thing, or maybe even instability/errata in the chipset itself (chipsets age and degrade just like any other silicon). Folding and other projects don't exhibit that behavior and won't stress the same subsystems that GPUGRID will. Maybe the 750 Ti wasn't fast enough to bring this problem to light and the 1650 is causing more stress? Certainly possible, but without some trial-and-error testing it's impossible to give a conclusive answer.

Honestly, I would recommend just replacing the whole platform with something more modern: the CPU, motherboard, and memory. You can get stuff only a few years old for dirt cheap; it will outperform the old setup and pay for itself over time through lower power use. Something like a first-generation AMD Ryzen, or a 14 nm Intel platform with 4 cores, 8+ GB of DDR4 memory and a low-end motherboard, will be very cheap, run circles around your current platform, be more compatible with modern systems and software, and use the same or less power doing it.

Just my .02

Keith Myers
Joined: 13 Dec 17
Posts: 1070
Credit: 1,450,990,714
RAC: 426,047
Message 58035 - Posted: 7 Dec 2021 | 22:52:25 UTC

Big slug of bad work went out with NaN errors.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 58039 - Posted: 8 Dec 2021 | 8:43:25 UTC

Both my GTX 1060 on a Windows 10 host and GTX 1650 on a Windows 11 host have completed and validated their tasks.
Tullio
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58049 - Posted: 10 Dec 2021 | 17:04:39 UTC - in response to Message 58024.
Last modified: 10 Dec 2021 | 17:26:04 UTC

However, meanwhile my suspicion is that the old processor
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
may be the culprit.
I've reanimated (that was quite an adventure on its own) one ancient DQ45CB motherboard with a Core2Duo E8500 CPU in it, and I've put a GTX 1080Ti in it to test with GPUGrid, but there's no work available at the moment. You can follow the unfolding of this adventure here.
EDIT: I've managed to receive one task...
EDIT2: It failed because I forgot to install the Visual C++ runtime :(

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58055 - Posted: 10 Dec 2021 | 22:34:22 UTC - in response to Message 58049.
Last modified: 10 Dec 2021 | 22:42:50 UTC

I was lucky again: the host received another workunit, and it has been running just fine for 90 minutes (it needs another 12 hours to complete).
The Core2Duo is definitely struggling to feed the GTX 1080 Ti (the GPU usage has frequent deep drops), but I don't think it will run into that "Error invoking kernel: CUDA_ERROR_LAUNCH_TIMEOUT (702)" error. We'll see. I've tried to maximize GPU usage by changing process affinities and the priority of acemd3.exe, but it made little difference.
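For anyone who prefers to script that kind of tweak rather than do it by hand, here is a minimal sketch using only the Python standard library. It is Linux-only (`os.sched_setaffinity`); on Windows the same is done from Task Manager as described above. The core number chosen is purely illustrative.

```python
# Minimal sketch of scripted CPU-affinity tuning (Linux only; illustrative core
# numbers). On Windows, Task Manager or third-party tools do the equivalent.
import os

def set_affinity(pid, cores):
    """Restrict a process to the given logical CPUs and return the new mask."""
    os.sched_setaffinity(pid, cores)   # pid 0 means "the calling process"
    return os.sched_getaffinity(pid)

mask = set_affinity(0, {0})            # pin ourselves to logical CPU 0
# Priority: os.nice(5) lowers our own priority; raising priority (negative
# values), as was done for acemd3.exe above, normally needs elevated privileges.
```
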

Erich56
Send message
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58063 - Posted: 11 Dec 2021 | 7:29:44 UTC - in response to Message 58049.

However, meanwhile my suspicion is that the old processor
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
may be the culprit.
I've reanimated (that was quite an adventure on its own) one ancient DQ45CB motherboard with a Core2Duo E8500 CPU in it, and I've put a GTX 1080Ti in it to test with GPUGrid, but there's no work available at the moment. You can follow the unfolding of this adventure here.
EDIT: I've managed to receive one task...
EDIT2: It failed because I forgot to install the Visual C++ runtime :(

Hm, I could try to run a GPUGRID task on my still-existing box with an Intel Core2Duo E8400 CPU, an Abit IP35 Pro motherboard, and a GTX 970 GPU.
Currently, this box crunches FAH and/or WCG (GPU tasks) without problems.
However, the GTX 970 gets very warm (although I dedusted it recently), so for FAH I have to underclock it to about 700 MHz, which is far below the default clock of 1152 MHz. I'm afraid the same would be true for GPUGRID, and a task would run, if at all, forever.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58065 - Posted: 11 Dec 2021 | 9:48:47 UTC - in response to Message 58063.

The task finished successfully in 12h 35m 23s.
On a Core i3-4xxx it takes about 12h 1m 44s, so the Core2Duo took about 34m more; it was only about 4.6% slower than an i3-4xxx.
I've noticed that the present acemd3 app does not use a full CPU core (thread) on Windows, while it does on Linux. There's a discrepancy between the run time and the CPU time, and the CPU usage is lower on Windows.
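The quoted slowdown figure can be sanity-checked with a quick calculation from the two run times:

```python
# Run times quoted above (Core2Duo E8500 vs. a Core i3-4xxx).
c2d = 12 * 3600 + 35 * 60 + 23   # 12h 35m 23s = 45323 s
i3  = 12 * 3600 + 1 * 60 + 44    # 12h  1m 44s = 43304 s

extra = c2d - i3                  # 2019 s, i.e. about 34 minutes more
slowdown = 100 * extra / i3       # ~4.7%, consistent with the "~4.6% slower" above
```
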

Erich56
Send message
Joined: 1 Jan 15
Posts: 944
Credit: 3,683,395,665
RAC: 853,259
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58068 - Posted: 11 Dec 2021 | 11:33:33 UTC - in response to Message 58065.
Last modified: 11 Dec 2021 | 11:34:02 UTC

I've noticed that the present acemd3 app does not use a full CPU core (thread) on Windows, while it does on Linux. There's a discrepancy between the run time and the CPU time, and the CPU usage is lower on Windows.

Hm, I actually cannot confirm; see here:

e7s141_e3s56p0f226-ADRIA_BanditGPCR_APJ_b1-0-1-RND0691_0 27100764 588817 10 Dec 2021 | 12:29:46 UTC 11 Dec 2021 | 5:50:18 UTC Fertig und Bestätigt 31,250.27 31,228.75 420,000.00 New version of ACEMD v2.19 (cuda1121)
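As a side note, the CPU-usage point can be read straight off the row above by dividing CPU time by run time (numbers copied from the task listed):

```python
run_time = 31250.27    # elapsed seconds, from the task row above
cpu_time = 31228.75    # CPU seconds, from the same row

cpu_fraction = cpu_time / run_time
print(f"{cpu_fraction:.1%}")   # ~99.9%: this Windows task did keep a full core busy
```
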

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58070 - Posted: 11 Dec 2021 | 12:18:35 UTC - in response to Message 58068.

I've noticed that the present acemd3 app does not use a full CPU core (thread) on Windows, while it does on Linux. There's a discrepancy between the run time and the CPU time, and the CPU usage is lower on Windows.

Hm, I actually cannot confirm; see here:

e7s141_e3s56p0f226-ADRIA_BanditGPCR_APJ_b1-0-1-RND0691_0 27100764 588817 10 Dec 2021 | 12:29:46 UTC 11 Dec 2021 | 5:50:18 UTC Fertig und Bestätigt 31,250.27 31,228.75 420,000.00 New version of ACEMD v2.19 (cuda1121)
The discrepancy is smaller in some cases; perhaps it depends on more factors than just the OS. Newer CPUs show less discrepancy. I'll test it with my E8500. I'm using Windows 11 on it now, but I couldn't get a new workunit yet. My next attempt will be with Linux.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58076 - Posted: 12 Dec 2021 | 13:35:40 UTC - in response to Message 58031.

Your CPU being a Core2Duo, this architecture does not have a dedicated PCIe link to the CPU. It uses the older design where the PCIe lanes and memory connect to the Northbridge chipset, and the chipset has a single 10.6 GB/s link to the CPU. The memory will take most of this bandwidth, unfortunately, and since GPUGRID is pretty heavy on bus use, I can see some conflicts happening on this kind of architecture. But the CPU power itself shouldn't be an issue if you're trying to run 1 GPU and no CPU work.

Inspired by your timely remark, I've done some experimenting on the difference between the newer dedicated-PCIe-link architecture and the older one based on an intermediate chipset.

I still have in production two Linux hosts based on the older architecture, both with the same Asus P5E3 PRO motherboard.
Main characteristics are: Intel X48/ICH9R chipset, DDR3 RAM, PCIe rev. 2.0, CPU socket LGA775 (the same as the previously mentioned Core 2 Duo E7400 and E8500 CPUs).
- Host #482132
- Host #325908
Both hosts also use the same low-power Intel Core 2 Quad Q9550S CPU.
Host #482132 harbors an Asus EX-GTX1050TI-4G graphics card (GTX 1050 Ti GPU).
Host #325908 mounts a Gigabyte GV-N75TOC-2GI graphics card (GTX 750 Ti GPU).
Psensor utility graphs for both hosts are the following (a GPUGRID task running on each):

Host #482132:

Host #325908:

Before going further, let's mention that the Q9550S CPU TDP is 65 Watts, and that this CPU series ran at a fixed clock frequency, 2.83 GHz in this case, so there is no power increase due to turbo frequencies.
This makes it easy to keep full-load CPU temperatures at low levels, around 40 ºC.
The GPU TDPs are also relatively low: 75 Watts for the GTX 1050 Ti and 46 Watts for the GTX 750 Ti, respectively.
This helps keep their full-load temperatures at around 50 ºC, even though both cards are overclocked.

Now, for comparison, I'll take one of my newly refurbished hosts, Host #557889.
It is based on the newer architecture with a dedicated PCIe link to the CPU, on a Gigabyte Z390UD motherboard.
Main characteristics are: Intel Z390 chipset, DDR4 RAM, PCIe rev. 3.0, CPU socket LGA1151.
This host mounts a 9th-generation Intel Core i5-9400F CPU.
Rated TDP for this processor is also 65 Watts at its 2.90 GHz base clock, but here increased power consumption comes into play, due to turbo frequencies of up to 4.10 GHz...
Two of the three available PCIe slots on the mainboard are occupied by GTX 1650 based graphics cards.
Psensor utility graph for this host is the following (2x GPUGRID tasks running, one on each GPU):

Host #557889:

A general temperature rise can be observed at this host, due to the mentioned extra CPU power consumption and the higher density (two graphics cards) in the same computer case.
And here is the conclusion we were looking for:
While the older architecture used 41% and 36% of PCIe 2.0 bandwidth respectively, the newer architecture is properly feeding two GPUs with only 1% to 2% PCIe 3.0 bandwidth usage each.
But that seems not to be an impediment for the older architecture to reliably manage the current ADRIA GPUGRID tasks... Slow but safe.
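To put those percentages on a common scale, they can be converted into rough GB/s figures. A back-of-the-envelope sketch, assuming the old cards run at x16 electrical width (an assumption, not stated above) and using theoretical per-direction link maxima; real usable bandwidth is lower:

```python
# Per-lane, per-direction theoretical throughput in GB/s
# (PCIe 2.0: 5 GT/s with 8b/10b; PCIe 3.0: 8 GT/s with 128b/130b).
LANE_GBPS = {"2.0": 0.500, "3.0": 0.985}

def pcie_gbs(pct, gen, lanes):
    """Convert a reported utilisation percentage into GB/s for a given link."""
    return pct / 100 * LANE_GBPS[gen] * lanes

old_1050ti = pcie_gbs(41, "2.0", 16)   # ~3.3 GB/s on the X48 host
old_750ti  = pcie_gbs(36, "2.0", 16)   # ~2.9 GB/s
new_each   = pcie_gbs(2, "3.0", 8)     # ~0.16 GB/s per GTX 1650 -- surprisingly low
```
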

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 58080 - Posted: 12 Dec 2021 | 15:08:56 UTC - in response to Message 58076.

It would be better to run the same GPU on both systems for a more apples-to-apples comparison.

Something about your 1% PCIe use doesn't seem right. Last year I had a 1650, and it used the normal PCIe bandwidth at ~80% on a PCIe 3.0 x4 link, and about 20% on a PCIe 3.0 x16 link. I just spot-checked my 2080 Ti system, which showed ~20-25% use on a PCIe 3.0 x16 link, and ~40% on a PCIe 3.0 x8 link.

Other than PCIe generation (3.0), what are the link widths for each card? How do you have them populated? I'm assuming the two topmost slots? Those should be x16 and x4 respectively.

Also, what tasks were being processed when you did the test and took the screenshots? Only the ACEMD3 tasks exhibit the high PCIe use; I've seen much lower on the Python beta tasks.

Finally, keep in mind that the PCIe percentage is measured GPU-side; this value comes from the Nvidia driver, as a percentage of the GPU's total bandwidth. A bottleneck on the CPU side will not be reflected here. You could very well be at 40% on the PCIe 2.0 GPU but totally maxed out on the CPU side as the limiting factor. This is likely the case.
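One way to cross-check the value Psensor shows is to read the driver's own counters through NVML. A sketch using the third-party pynvml bindings (`pip install nvidia-ml-py`); only the pure conversion helper runs without an Nvidia driver, and how the driver folds TX/RX into its single reported percentage is an assumption here:

```python
# Sketch: read PCIe throughput from NVML and express it as % of the link maximum.
# The per-lane figure is the PCIe 3.0 theoretical maximum; combining TX+RX into
# one percentage is an assumption about how the driver's % is derived.

def utilisation_pct(kb_per_s, lanes, lane_gb_s=0.985):
    """Convert an NVML throughput reading (KB/s) into % of the link maximum."""
    return 100 * (kb_per_s / 1e6) / (lanes * lane_gb_s)

def sample_pcie(gpu_index=0, lanes=16):
    import pynvml                        # imported lazily: needs an Nvidia driver
    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)  # KB/s
    rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)  # KB/s
    pynvml.nvmlShutdown()
    return utilisation_pct(tx + rx, lanes)

# e.g. a reading of ~4.9e6 KB/s on an x16 gen-3 link works out to ~31%,
# the kind of figure the 470 driver reports for ACEMD3 tasks.
```
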
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58089 - Posted: 12 Dec 2021 | 21:04:40 UTC - in response to Message 58080.

Something about your 1% PCIe use doesn't seem right.

Thank you for your kind comments. I enjoy reading every one of them.
I'm also somewhat bewildered by it.
I've noted such a PCIe bandwidth usage reduction on my newer hosts after the new ACEMD version 2.19 was launched on November 10th.
All four of my 9th-generation i3/i5/i7 hosts are experiencing the same.
At the moment of taking the presented Psensor graphic, GPU0 was executing task e7s382_e3s59p0f6-ADRIA_BanditGPCR_APJ_b1-0-1-RND7973_0, and GPU1 was executing task e7s245_e3s77p0f86-ADRIA_BanditGPCR_APJ_b1-0-1-RND1745_0
At the same moment, I took this Boinc Manager screenshot.

Additionally, at the host I'm writing this on, I've just taken this combined Boinc Manager - Psensor image.
As can be seen, at this i3-9100F CPU / GTX 1650 SUPER GPU host, the behavior is very similar.

Other than PCIe generation (3.0), what are the link widths for each card? How do you have them populated? I'm assuming the two topmost slots? Those should be x16 and x4 respectively.

You're right.
On this particular Gigabyte Z390UD motherboard, a graphics card installed in PCIe 3.0 slot 0 runs at x8 link width, while a graphics card installed in PCIe slot 1 (and any eventually installed in PCIe slot 2) runs at x4.
At my most productive Host #480458, based on an i7-9700F CPU and 3x GTX 1650 GPUs, all three PCIe slots are used that way.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 58090 - Posted: 12 Dec 2021 | 22:57:15 UTC - in response to Message 58089.
Last modified: 12 Dec 2021 | 22:57:53 UTC

One thing that I just noticed: all of your hosts are running the New Feature Branch 495 drivers. These are "kinda-sorta" beta, and the recommended driver branch is still the 470 branch. So I wonder if this is just a reporting issue. Does the Nvidia-Settings application report the same PCIe value as Psensor?

Can you change one of these systems back to the 470 driver and re-check?
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58091 - Posted: 12 Dec 2021 | 23:30:36 UTC - in response to Message 58090.

One thing that I just noticed. All of your hosts are running the New Feature Branch 495 drivers

Good point. In the interim between ACEMD versions 2.18 and 2.19, I took the opportunity to update the Nvidia drivers to version 495 on all my hosts.
But I have no time left today for more than this:


and this:


from my Host #186626
...

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 58092 - Posted: 12 Dec 2021 | 23:36:53 UTC - in response to Message 58091.

Good to know that psensor is at least still reading the correct value from the new driver.

Definitely interested to know if the reading goes back after the switch back to 470 when you have time.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58093 - Posted: 13 Dec 2021 | 6:55:14 UTC - in response to Message 58092.

Definitely interested to know if the reading goes back after the switch back to 470 when you have time.

I am too.
But I've experienced that a GPUGRID task is sure to crash when the Nvidia driver is updated and the task is then restarted.
It crashes with the same error as when a task is restarted on a different device in a multi-GPU host.
Luckily, my Host #186626 finished its GPUGRID task overnight, so I've reverted it to Nvidia driver version 470.86.
Curiously, the PCIe load reduction is only reflected on my newer systems. As shown, PCIe usage on the older ones remains about the same as usual.
Now waiting to receive some (currently scarce) new tasks...

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 18,101
Level
Trp
Scientific publications
watwatwat
Message 58094 - Posted: 13 Dec 2021 | 14:24:27 UTC - in response to Message 58090.

One thing that I just noticed. All of your hosts are running the New Feature Branch 495 drivers. These are “kinda-sorta” beta and the recommended driver branch is still the 470 branch. So I wonder if this is just a reporting issue. Does the Nvidia-Settings application report the same PCIe value as Psensor?

I updated my driver on a couple of computers to 495, thinking higher is better. There was something hinky about it; I believe it reported something wrong. Then I read the Nvidia driver page and sure enough it's beta, so I reverted to 470.86 and will stick with the repository's recommended driver.

Linux Mint has another strange quirk. If you leave it to the Update Manager, it will give you the oldest kernel, now 5.4.0-91. If you click on the 5.13 tab, then one of the kernels, and install it, from then on it will keep you updated to the latest kernel, now 5.13.0-22. I don't know if this is a wise thing to do, or if I'm now running a beta kernel.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 58096 - Posted: 13 Dec 2021 | 15:13:23 UTC - in response to Message 58093.

Definitely interested to know if the reading goes back after the switch back to 470 when you have time.

I am too.
But I've experienced that a GPUGRID task is sure to crash when the Nvidia driver is updated and the task is then restarted.
It crashes with the same error as when a task is restarted on a different device in a multi-GPU host.
Luckily, my Host #186626 finished its GPUGRID task overnight, so I've reverted it to Nvidia driver version 470.86.
Curiously, the PCIe load reduction is only reflected on my newer systems. As shown, PCIe usage on the older ones remains about the same as usual.
Now waiting to receive some (currently scarce) new tasks...


I see that you picked up a task today :)
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1070
Credit: 1,450,990,714
RAC: 426,047
Level
Met
Scientific publications
watwatwatwatwat
Message 58098 - Posted: 13 Dec 2021 | 19:22:11 UTC - in response to Message 58094.

The later kernels are not necessarily considered beta. In fact the latest stable kernel is 5.15.7 as per the kernel.org site. https://www.kernel.org/

Some kernels are considered long-term or LTS and that is what distros ship that aren't rolling release distros.

Generally stated, later kernels are more secure, have more bug fixes, and are faster than earlier kernels, notwithstanding some speed regressions that pop up in the beta branches every once in a while.

Michael Larabel at Phoronix.com does the Linux community a great service by continually testing new kernels and regularly bisecting them to identify where a kernel regression occurs, passing that info along to the developers and kernel maintainers.

We should all support his efforts by turning off adblockers on his website or contribute as a paying member.

I do.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58099 - Posted: 13 Dec 2021 | 19:38:03 UTC - in response to Message 58096.

I see that you picked up a task today :)

Yes! They come and go, and I caught one today :)
Ctrl + click on the next two links, and new browser tabs will open for a direct comparison between Nvidia driver version 495.44 and Nvidia driver version 470.86 on the same Host #186626.
You're usually very insightful in your analysis.
There is a drastic change in reported PCIe % usage between the regular driver branch V470.86 (31-37%) and the New Feature driver branch V495.44 (0-2%) on this PCIe rev. 3.0 x16 test host.
Now a new question immediately arises: is this due to some bug in the measuring/reporting algorithm of the new driver branch, or due to a true improvement in PCIe bandwidth management?
I suppose that Nvidia driver developers usually have no time to read forums like this one to clarify... (?)

For the moment, I don't know whether I'll keep using V470 or reinstall V495. It will depend on whether processing times are better or worse.
In a few days of steady work, I'll collect enough data to evaluate and decide.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 58101 - Posted: 13 Dec 2021 | 20:06:29 UTC - in response to Message 58099.
Last modified: 13 Dec 2021 | 20:10:13 UTC

Glad to see my hunch about the driver being the culprit was correct.

My second hunch is that it might not impact processing speed, and that the issue is solely with what the driver is reporting, not what's actually being used. I think your actual PCIe use with the 495 driver is normal for GPUGRID (~30%), but the driver is incorrectly reporting it as 1%.

Another test might be to put the older 470 driver on your old LGA775 system and see what PCIe % it reports then. Maybe 100%? That might narrow it down to a scaling issue. Or maybe the old system and older GPUs use an older/legacy API that fetches the correct data, while newer GPUs on a newer API fetch the wrong data.

There's precedent for this kind of thing, where different API calls fetch the same "thing" but return different results. For example, in the stock BOINC code, everyone knows that Nvidia GPUs with more than 4 GB of memory will only report 4 GB, no matter the true size, returning the true size only for values below 4 GB. This is because the stock code uses a very old function that clamps the result to 32 bits (a 4095 MB value), while a newer function exists, which BOINC is not using, that returns the correct result. Both functions try to fetch the same thing, memory size, and in the case of >4 GB VRAM they return different results.
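That clamping behavior is easy to illustrate with a toy model (this is not the actual BOINC code, just the arithmetic of a value forced through a 32-bit field):

```python
MAX_32BIT = 2**32 - 1                      # bytes; just under 4096 MiB

def legacy_report_mib(true_bytes):
    """Toy model of an API whose return value saturates at 32 bits."""
    return min(true_bytes, MAX_32BIT) // (1024 * 1024)

print(legacy_report_mib(11 * 1024**3))     # 11 GB card -> reported as 4095 MiB
print(legacy_report_mib(2 * 1024**3))      #  2 GB card -> reported as 2048 MiB
```
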
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1070
Credit: 1,450,990,714
RAC: 426,047
Level
Met
Scientific publications
watwatwatwatwat
Message 58102 - Posted: 13 Dec 2021 | 21:54:00 UTC

The Nvidia 495.46 driver isn't considered a beta release now, just a New Feature branch.

I would posit that the main difference in PCIe usage between the 470 and 495 branch drivers is that the 495 branch now has the GBM (Generic Buffer Manager) API built into it, in support of the Wayland compositor.

I don't know exactly what the API is doing, but I surmise that since the name involves "buffer", it might be decreasing the amount of traffic going over the PCIe bus.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Level
Arg
Scientific publications
wat
Message 58103 - Posted: 13 Dec 2021 | 23:28:49 UTC - in response to Message 58102.

I don't think the PCIe use at 1% is "real"; I think the driver is just misreporting the real value. GBM doesn't seem to have anything to do with PCIe traffic; that would be up to the app. It seems the app is coded to store a bunch of stuff in system memory rather than GPU memory, and it constantly sends data back and forth. That is up to the app, not the driver.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58199 - Posted: 23 Dec 2021 | 21:11:48 UTC

With the arrival of the last phase of Python GPU beta tasks for Linux systems, all of my hosts started to fail them.
Then I read an explanation from abouh stating that the minimum dedicated GPU RAM should be 6 GB or more.
I've temporarily configured my GPUGRID preferences not to request beta apps on all of my 4 GB GPUs.
I also deduced from Ian&Steve C.'s Message #58172 that the minimum system RAM should be 16 GB.

Picking elements from several of my non-high-end hosts, I configured one that passed a first task, e1a18-ABOU_rnd_ppod_11-0-1-RND5755_4.
It was my Test Host #540272.
Then I reproduced that configuration on my regular Host #569442, and a second task, e1a16-ABOU_rnd_ppod_12-0-1-RND1123_2, also succeeded.
Therefore, I'm taking this configuration as my "Python GPU tasks minimum requirements", given my currently available hardware:
- GPU: Asus DUAL-GTX1660TI-O6G (reworked), endowed with 6 GB of dedicated RAM
- CPU: Intel Core i3-9100F @ 3.60 GHz, 4 cores / 4 threads
- System RAM: 16 GB DDR4 @ 2666 MHz
- Operating system: Ubuntu Linux 20.04.3 LTS
- Nvidia drivers: V495.44

I was fortunate to be present when catching the second mentioned task, and I was able to document several data points from the moment it started until it reached processing stability.
When this happened, my Host #569442 was already processing a PrimeGrid GPU task.
When the Python task started, an evident drop in GPU activity was observed.
The task's progress increased from 0% to 10%, while free system RAM decreased from 44% to 1% and GPU RAM usage increased until almost reaching 100% of its 6 GB.
Then the task's progress stayed fixed at 10% up to the end, after about 9 hours of true execution time.
As can be followed in this image, about 4 minutes of erratic CPU activity, followed by another 8 minutes of intense CPU activity, gave way to a stable, eye-catching periodic pattern:

This image can be better understood by reading abouh's explanation in his Message #58194:

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

The GPU keeps cycling between brief peaks of high activity (95% here) and valleys of low activity (11% here).
The CPU, on the other hand, shows peaks of lower activity in an anti-cyclic pattern: when GPU activity increases, CPU activity decreases, and vice versa. The period of this cycle was about 21 seconds on this system.
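That ~21 s period can also be estimated mechanically from a log of GPU-utilisation samples (e.g. nvidia-smi polled once per second). A sketch on synthetic data shaped like the pattern described, with peaks near 95% and valleys near 11%:

```python
def cycle_period(samples, interval_s=1.0, threshold=50):
    """Estimate the period as the mean spacing between upward threshold crossings."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < threshold <= samples[i]]
    if len(crossings) < 2:
        return None
    gaps = [b - a for a, b in zip(crossings, crossings[1:])]
    return interval_s * sum(gaps) / len(gaps)

# Synthetic trace: 3 s of high GPU activity every 21 s, otherwise near-idle.
trace = ([11] * 18 + [95] * 3) * 5
print(cycle_period(trace))   # -> 21.0
```
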
Result for this task:

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58366 - Posted: 12 Feb 2022 | 15:07:39 UTC

Lately, in the Canary Islands (Spain), electricity supply costs have been getting higher and higher, month after month.
I have an ongoing "experiment":
I've swapped my currently highest-performance graphics card, this reworked GTX 1660 Ti, into my daily-use host, and I've switched off all of my other crunching machines.
Over the next month(s) I should be able to estimate the effects of this.
It should bring some savings on my electricity bill (and, collaterally, in my BOINC RAC ;-) ...
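A rough way to estimate those savings in advance; every number below (wattages, tariff) is an assumed illustration, not a measurement of these hosts:

```python
# Hypothetical figures for a month of 24/7 crunching.
HOURS = 30 * 24               # one month
PRICE_EUR_KWH = 0.30          # assumed tariff, EUR per kWh

def monthly_cost(watts):
    return watts * HOURS / 1000 * PRICE_EUR_KWH

before = monthly_cost(150) + monthly_cost(120)  # e.g. two dedicated hosts...
after  = monthly_cost(170)                      # ...folded into one daily-use host
print(f"estimated saving: {before - after:.2f} EUR/month")
```
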
