
Message boards : Number crunching : GERARD_MO_TRV_ WUs

Bedrich Hajek
Joined: 28 Mar 09
Posts: 345
Credit: 4,248,187,009
RAC: 747,961
Message 44029 - Posted: 22 Jul 2016 | 0:25:47 UTC

I was able to complete GERARD_MO_TRV_ WUs successfully:

Name e2s38_e1s1p0f339-GERARD_MO_TRV_2-0-1-RND8501_0
Workunit 11676480
Created 21 Jul 2016 | 10:42:20 UTC
Sent 21 Jul 2016 | 10:45:23 UTC
Received 21 Jul 2016 | 23:51:41 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 26 Jul 2016 | 10:45:23 UTC
Run time 31,564.82
CPU time 31,399.55
Validate state Valid
Credit 244,050.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 56C
# GPU 0 : 64C
# GPU 1 : 57C
# GPU 0 : 65C
# GPU 1 : 58C
# GPU 0 : 66C
# GPU 1 : 59C
# Time per step (avg over 10000000 steps): 3.156 ms
# Approximate elapsed time for entire WU: 31561.041 s
# PERFORMANCE: 54223 Natoms 3.156 ns/day 0.000 ms/step 0.000 us/step/atom
19:48:09 (3464): called boinc_finish

</stderr_txt>
]]>

Name e2s14_e1s32p0f291-GERARD_MO_TRV_2-0-1-RND8906_0
Workunit 11676456
Created 21 Jul 2016 | 10:41:34 UTC
Sent 21 Jul 2016 | 10:45:23 UTC
Received 22 Jul 2016 | 0:20:30 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 26 Jul 2016 | 10:45:23 UTC
Run time 34,050.83
CPU time 33,910.92
Validate state Valid
Credit 244,050.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 58C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 66C
# GPU 1 : 56C
# GPU 1 : 57C
# GPU 1 : 58C
# GPU 1 : 59C
# Time per step (avg over 9885000 steps): 3.407 ms
# Approximate elapsed time for entire WU: 34073.917 s
# PERFORMANCE: 54223 Natoms 3.407 ns/day 0.000 ms/step 0.000 us/step/atom
20:17:14 (5252): called boinc_finish

</stderr_txt>
]]>


But these WUs are slow. GPU usage was 75% to 78% and GPU power usage was under 70% on a Windows 10 machine. This compares to 80% to 89% GPU usage and 80%+ power usage for the other GERARD and ADRIA units on the same computer.
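As a quick sanity check (my own arithmetic, not part of the task output), the first log's average step time reproduces its reported elapsed time almost exactly; the second WU's elapsed time is a few hundred seconds longer than steps × step time would suggest, which is consistent with the suspend/restart visible in that log:

```python
# Cross-check figures copied from the two stderr logs quoted above.
def elapsed_from_steps(ms_per_step: float, steps: int) -> float:
    """Elapsed seconds implied by an average step time in milliseconds."""
    return ms_per_step * steps / 1000.0

# First WU: 3.156 ms/step over 10,000,000 steps, reported 31,561.041 s.
print(elapsed_from_steps(3.156, 10_000_000))  # 31560.0 -- matches the log

# Second WU: 3.407 ms/step over 9,885,000 steps, reported 34,073.917 s.
# The ~400 s gap is plausibly the suspend/restart shown in that log.
print(elapsed_from_steps(3.407, 9_885_000))   # 33678.195
```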




Retvari Zoltan
Joined: 20 Jan 09
Posts: 1960
Credit: 12,622,269,019
RAC: 6,563,983
Message 44077 - Posted: 1 Aug 2016 | 7:09:05 UTC
Last modified: 1 Aug 2016 | 7:09:29 UTC

One of my hosts is processing a GERARD_MO_TRV_2-0-1-RND9009_0 workunit, but its progress is very slow.
It's at 42% in 7h 47m on a Windows XP x64 host with a GTX980Ti and a Core i7-4790k CPU.
It's processing the GPUGrid workunit and 3 Rosetta@home workunits simultaneously; the GPU usage is 41~63%, the average is 48%.
There's only a small increase (~3%) in GPU usage if I suspend the Rosetta@home workunits.
SWAN_SYNC is on.
GPU core clock is 1391MHz.
GPU FB usage is 9~15% (avg: 12%).
GPU temperature is 54°C.
GPU memory clock is 3500MHz.

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,618,071,004
RAC: 345,340
Message 44078 - Posted: 1 Aug 2016 | 19:19:45 UTC

Join the club. They run slow, have low GPU usage (run cool) and give poor credit :-)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1960
Credit: 12,622,269,019
RAC: 6,563,983
Message 44079 - Posted: 1 Aug 2016 | 21:16:19 UTC - in response to Message 44078.

Join the club. They run slow, have low GPU usage (run cool) and give poor credit :-)

It's finished in 17h 41m.
The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the scientific applications (to make sure the GPUGrid app would continue without error), and then restarted the OS (Windows XP x64). Since then it's been progressing at a normal rate.

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,618,071,004
RAC: 345,340
Message 44080 - Posted: 2 Aug 2016 | 16:30:55 UTC - in response to Message 44079.

The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the scientific applications (to make sure the GPUGrid app would continue without error), and then restarted the OS (Windows XP x64). Since then it's been progressing at a normal rate.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Betting Slip
Joined: 5 Jan 09
Posts: 668
Credit: 2,498,095,550
RAC: 15
Message 44081 - Posted: 2 Aug 2016 | 17:18:11 UTC - in response to Message 44080.

The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the scientific applications (to make sure the GPUGrid app would continue without error), and then restarted the OS (Windows XP x64). Since then it's been progressing at a normal rate.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.


Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1960
Credit: 12,622,269,019
RAC: 6,563,983
Message 44083 - Posted: 2 Aug 2016 | 19:30:17 UTC - in response to Message 44081.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.
I would put what you've said in another perspective: since the GPUGrid project doesn't find the cure for a nasty disease every other day, these collateral factors can motivate some crunchers to participate (or leave), therefore it is unwise to dismiss them as "irrelevant". They surely come second only to the results, but if the project's take on the participants' frustration causes negative feedback, then there will be fewer crunchers and, as a consequence, fewer results. I think that every project should avoid this, especially this one, which has such a fragile GPU app.
I've tested these "TRV" workunits under Windows 7, and they experience a higher WDDM impact than the FXCXCLs; I guess there is more interaction between the CPU and the GPU for the TRVs. However, the 50% GPU usage surely came from some kind of error, which persisted after the WU finished. I can't recall such behavior from the past.

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,618,071,004
RAC: 345,340
Message 44086 - Posted: 2 Aug 2016 | 22:58:08 UTC - in response to Message 44083.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.

I would put what you've said in another perspective: since the GPUGrid project doesn't find the cure for a nasty disease every other day, these collateral factors can motivate some crunchers to participate (or leave), therefore it is unwise to dismiss them as "irrelevant". They surely come second only to the results, but if the project's take on the participants' frustration causes negative feedback, then there will be fewer crunchers and, as a consequence, fewer results. I think that every project should avoid this, especially this one, which has such a fragile GPU app.
I've tested these "TRV" workunits under Windows 7, and they experience a higher WDDM impact than the FXCXCLs; I guess there is more interaction between the CPU and the GPU for the TRVs. However, the 50% GPU usage surely came from some kind of error, which persisted after the WU finished. I can't recall such behavior from the past.

My thoughts too. Always keep your customers (volunteers) as happy as possible, especially when they're paying a lot and not receiving anything in return (except for warm, fuzzy feelings).

Bedrich Hajek
Joined: 28 Mar 09
Posts: 345
Credit: 4,248,187,009
RAC: 747,961
Message 44087 - Posted: 2 Aug 2016 | 23:55:53 UTC - in response to Message 44086.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.

I would put what you've said in another perspective: since the GPUGrid project doesn't find the cure for a nasty disease every other day, these collateral factors can motivate some crunchers to participate (or leave), therefore it is unwise to dismiss them as "irrelevant". They surely come second only to the results, but if the project's take on the participants' frustration causes negative feedback, then there will be fewer crunchers and, as a consequence, fewer results. I think that every project should avoid this, especially this one, which has such a fragile GPU app.
I've tested these "TRV" workunits under Windows 7, and they experience a higher WDDM impact than the FXCXCLs; I guess there is more interaction between the CPU and the GPU for the TRVs. However, the 50% GPU usage surely came from some kind of error, which persisted after the WU finished. I can't recall such behavior from the past.

My thoughts too. Always keep your customers (volunteers) as happy as possible, especially when they're paying a lot and not receiving anything in return (except for warm, fuzzy feelings).


I agree.


caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Message 44096 - Posted: 6 Aug 2016 | 10:21:41 UTC - in response to Message 44079.

The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the scientific applications (to make sure the GPUGrid app would continue without error), and then restarted the OS (Windows XP x64). Since then it's been progressing at a normal rate.

I have seen this happen on my system with the 3 980 Ti Classys in it. One card will under-perform by a good percentage. Restarting the OS is the only thing that brings all 3 back to normal performance; restarting BOINC does not. It usually follows a failed work unit, but it's always the same (bottom) slot. I have even rotated the cards several times (and will continue to do so, since heat removal is toughest on the middle card and I am sure it causes extra wear and tear on that one over the others). I just keep an eye on it and reboot when needed. I haven't needed to recently, but it happened quite often around the turn of the year, as I remember. I am not sure which tasks we were running at the time that had errors like the MO_MORs have had recently. I have not seen this happen on any of the other machines with 2 cards in them, or on the single-card systems.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Bedrich Hajek
Joined: 28 Mar 09
Posts: 345
Credit: 4,248,187,009
RAC: 747,961
Message 44098 - Posted: 7 Aug 2016 | 11:58:46 UTC

I normally don't run 2 tasks on 1 GPU simultaneously on this project, but I decided to run an experiment using these "MO" units.

Here are 4 WUs which I ran using 1 CPU & 0.5 GPU:

e18s21_e13s127p0f196-GERARD_MO_MOR_2-0-1-RND8164_0 Run time 56,508.43 seconds
e18s20_e16s32p0f162-GERARD_MO_MOR_2-0-1-RND9014_0 Run time 54,610.82 seconds
e18s19_e11s4p0f108-GERARD_MO_MOR_2-0-1-RND5678_0 Run time 53,411.93 seconds
e18s16_e9s1p0f40-GERARD_MO_MOR_2-0-1-RND1625_0 Run time 55,218.16 seconds


The average run time for those 4 WUs is 54,937.34 seconds, but since I am running 2 at a time, the actual average GPU time per WU is half that, or 27,468.67 seconds. I think this actual GPU time should be used as the run-time number.


Here are 2 WUs which I ran using 1 CPU & 1 GPU from the first post in this thread:

e2s38_e1s1p0f339-GERARD_MO_TRV_2-0-1-RND8501_0 Run time 31,564.82 seconds
e2s14_e1s32p0f291-GERARD_MO_TRV_2-0-1-RND8906_0 Run time 34,050.83 seconds

The average run time for these 2 WUs is 32,807.83 seconds.


27,468.67 / 32,807.83 = 0.8373 - that's more than a 16% reduction in effective run time.
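The comparison works out as follows (a short script restating the post's own figures; nothing here beyond the numbers already quoted):

```python
# Run times (seconds) copied from this post.
two_per_gpu = [56_508.43, 54_610.82, 53_411.93, 55_218.16]  # MO WUs, 2 per GPU
one_per_gpu = [31_564.82, 34_050.83]                        # TRV WUs, 1 per GPU

avg_two = sum(two_per_gpu) / len(two_per_gpu)  # wall-clock average
effective = avg_two / 2                        # GPU time per WU when 2 share a card
avg_one = sum(one_per_gpu) / len(one_per_gpu)

ratio = effective / avg_one
print(f"{effective:,.2f} / {avg_one:,.2f} = {ratio:.4f}")         # ~0.8373
print(f"effective run-time reduction: {(1 - ratio) * 100:.1f}%")  # ~16.3%
```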

The GPU usage was 91% max, and power usage was 80%, on a Windows 10 computer. I don't think running a slow "MO" simultaneously with a normal-speed long GPUGRID WU would yield the same reduction in time, but you are welcome to experiment.

When I was running an "MO" WU simultaneously with an Einstein Arecibo WU, GPU usage was 94% max.

I did have one downclocking incident during the experiment, but everything else ran smoothly.

I know I am comparing "MO" to "TRV" units, but their run times are about the same.





Richard Haselgrove
Joined: 11 Jul 09
Posts: 883
Credit: 1,800,734,170
RAC: 1,136,636
Message 44099 - Posted: 7 Aug 2016 | 13:15:27 UTC

Here's an example of a different problem caused by the MO_TRV WUs - I don't think it's been discussed before.

e12s9_e5s8p0f353-GERARD_MO_TRV_2-0-1-RND6047

The first user seems to be a part-time or multi-project cruncher - no quarrel with that, it's what BOINC was written for. As I write this, they have 5 completed tasks visible, with an average turnaround of 3.53 days.

But as we know, these MO_TRV tasks run longer than usual, without any compensating increase in the estimated runtime passed to BOINC. Unsurprisingly, BOINC failed in its duty of completing the task by deadline - overrunning by about 2.5 hours.

So the server created a replacement task, and sent it to a replacement user - me, in this case.

So, I was asked to run a task which validated (science complete) when I was barely 15% done with it - that's a waste of resources. As you can see, I've aborted the redundant task, and with few tasks available these days, that GPU has moved back to another project, where there is still science to be done.

As an aside, if I'd allowed the wasteful task to continue, I would still have been awarded - without bonus - vastly more credit than at any other BOINC project I crunch for. So it wasn't the loss of bonus that caused me to abort the task, though I suspect it might have an influence on some people's thinking.

My major concern is the failure of the admins here to use the tools available - <rsc_fpops_est> - to help BOINC avoid deadline misses in the first place.
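For context (a simplified sketch of how the BOINC client uses <rsc_fpops_est>, not the actual client code, and with purely hypothetical numbers), the client's initial runtime estimate is essentially the workunit's claimed FLOP count divided by the device's effective speed, so an underestimated <rsc_fpops_est> makes BOINC schedule too optimistically and miss deadlines on slow or part-time hosts:

```python
def estimated_runtime_s(rsc_fpops_est: float, effective_flops: float) -> float:
    """Simplified BOINC-style estimate: claimed work / effective device speed."""
    return rsc_fpops_est / effective_flops

# Hypothetical illustration only: a WU tagged 5e15 FLOPs on a GPU the client
# rates at 2e11 effective FLOPS is estimated at roughly 6.9 hours...
est = estimated_runtime_s(5e15, 2e11)
print(f"estimate: {est / 3600:.1f} h")
# ...but if the true work is double the tag, the real runtime is ~13.9 hours,
# which a part-time host with a queue and a 5-day deadline can easily miss.
print(f"actual:   {2 * est / 3600:.1f} h")
```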

