Message boards : Server and website : Please check: windows workunits

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 947
Credit: 4,353,973
RAC: 48
Message 52644 - Posted: 18 Sep 2019 | 17:39:48 UTC
Last modified: 18 Sep 2019 | 17:40:35 UTC

We made another release of the acemd3 version, which should support CUDA 10.1 and higher (all cards with the corresponding driver, including RTX family).

Please check whether the WUs named DHFR207c for Windows support stop/restart and generally work as expected. (Under Linux, and with older CUDA versions on Windows, they already seem OK.)

Thanks!

Toni
Message 52645 - Posted: 18 Sep 2019 | 17:42:13 UTC - in response to Message 52644.
Last modified: 18 Sep 2019 | 17:42:25 UTC

BTW, a known problem: suspending on one card and restarting on a different card will fail.

Aurum
Joined: 12 Jul 17
Posts: 252
Credit: 9,791,563,847
RAC: 2,648,925
Message 52650 - Posted: 18 Sep 2019 | 18:16:12 UTC
Last modified: 18 Sep 2019 | 18:53:17 UTC

OK, so stop unneeded testing on Linux and just test on Win7.

I got one on Win7 and tried suspend/resume; it failed on a 1080 Ti.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 397
Credit: 5,243,582,375
RAC: 2,477,761
Message 52656 - Posted: 18 Sep 2019 | 23:02:35 UTC

I got one of the new cuda101 test units. It took almost a minute after the unit started for the GPU to start crunching. GPU usage was approximately 55% (lower than before), but power usage was between 70% and 80%, according to Afterburner. It ran fine. I suspended it and resumed it after about 30 seconds, and it crashed within a few seconds.

It was running on a Windows 7 computer with an RTX 2080 Ti card.

See link:


http://www.gpugrid.net/result.php?resultid=21391121


I ran one successfully, which I did not suspend and resume.


http://www.gpugrid.net/result.php?resultid=21391156


BTW, I also received a cuda 100 unit. Is this a new unit, or a leftover old unit from before? It has higher GPU usage (65%) and power usage (85%).

http://www.gpugrid.net/result.php?resultid=21391213





Toni
Message 52659 - Posted: 19 Sep 2019 | 9:46:22 UTC - in response to Message 52656.

Cuda 100 are leftovers. They are actually mislabeled 10.1.

More precisely, I'd like to know whether the following processes are still in the task manager after suspension:

- wrapper*.exe
- acemd3

Thanks!
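
The check requested above can be scripted instead of done by eye in Task Manager. What follows is only a minimal sketch, assuming Windows' built-in `tasklist` command and a Python install on the host. The two name patterns come from the post; the trailing `*` on `acemd3` is an assumption (so that `acemd3.exe` also matches), and the function names are hypothetical.

```python
import csv
import fnmatch
import io
import subprocess

# Patterns from the post above. The trailing "*" on acemd3 is an assumption
# so that an image name such as "acemd3.exe" also matches.
PATTERNS = ["wrapper*.exe", "acemd3*"]


def matching_processes(tasklist_csv, patterns=PATTERNS):
    """Return image names from `tasklist /FO CSV /NH` output that match a pattern."""
    names = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if not row:
            continue
        image_name = row[0]  # first CSV column is the image name
        if any(fnmatch.fnmatch(image_name.lower(), p.lower()) for p in patterns):
            names.append(image_name)
    return names


def running_now():
    """Windows only: list matching processes that are running right now."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return matching_processes(out)
```

Calling `running_now()` once before suspending and once after would show whether either process survived the suspension; an empty list means both are gone.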

Bedrich Hajek
Message 52660 - Posted: 19 Sep 2019 | 10:42:23 UTC - in response to Message 52659.

Both processes are gone from the task manager.




Bedrich Hajek
Message 52711 - Posted: 24 Sep 2019 | 2:00:05 UTC - in response to Message 52660.
Last modified: 24 Sep 2019 | 2:00:51 UTC

It is still happening. The unit starts and runs well, and then I suspend it. Both processes listed above disappear from the task manager. I then resume the task; both processes reappear briefly, then disappear again, and the unit crashes again.


http://www.gpugrid.net/result.php?resultid=21405696

rod4x4
Joined: 4 Aug 14
Posts: 164
Credit: 1,866,147,954
RAC: 1,224,645
Message 52722 - Posted: 25 Sep 2019 | 23:30:23 UTC - in response to Message 52659.
Last modified: 25 Sep 2019 | 23:31:30 UTC

Received my first v2.07 (cuda101) work unit, a36-TONI_TESTDHFR207c-10-30-RND9893_0, on a Win10 GTX 1060 host.

I could not test suspend/resume, as it was received and processed overnight.

One observation: the runtime is shorter than for the v2.06 (cuda 100) test work unit.

v2.07 cuda 101 runtime - 2897 seconds
http://www.gpugrid.net/result.php?resultid=21409526

v2.06 cuda 100 runtime - 3974 seconds
http://www.gpugrid.net/result.php?resultid=21404652

Assuming the work units are comparable, that is a 27% reduction in runtime.
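
A percentage computed from two runtimes can mean either a shorter runtime or a higher processing speed, and the two figures differ. The short calculation below uses only the two runtimes reported above and assumes, as the post does, that the work units are comparable.

```python
# Runtimes in seconds, as reported in the post above.
runtime_v206 = 3974  # v2.06, cuda 100
runtime_v207 = 2897  # v2.07, cuda 101

# Fraction of the runtime saved by the newer version.
runtime_reduction = (runtime_v206 - runtime_v207) / runtime_v206

# Throughput is work per unit time, so the speedup is the ratio of runtimes.
speedup = runtime_v206 / runtime_v207

print(f"runtime reduction: {runtime_reduction:.1%}")
print(f"throughput gain:   {speedup - 1:.1%}")
```

The 27% figure is the runtime reduction; expressed as throughput, the same two numbers give a gain of roughly 37%.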

Bedrich Hajek
Message 52724 - Posted: 26 Sep 2019 | 1:31:48 UTC

I managed to get 4 of these units on my Windows 10 computer, which has 2 video cards (a GTX 980 Ti and an RTX 2080 Ti). I decided to run the units simultaneously on both cards. The 3 units that ran on the 980 Ti card all crashed within about half an hour (not a suspend/resume error).

http://www.gpugrid.net/result.php?resultid=21410390

http://www.gpugrid.net/result.php?resultid=21410391

http://www.gpugrid.net/result.php?resultid=21410403


The unit that ran on the 2080 Ti finished successfully while this was going on.

http://www.gpugrid.net/result.php?resultid=21410392

This also caused Afterburner to become unresponsive.

In the past week or so, I was able to run a long unit on the 980 Ti and a new-version unit on the 2080 Ti simultaneously, or a single new-version unit on either card while running an Einstein or Milkyway unit on the other card, all successfully.

BTW, the 2080 Ti is more than twice as fast as the 980 Ti on this computer.




Billy Ewell 1931
Joined: 22 Oct 10
Posts: 28
Credit: 330,865,974
RAC: 357
Message 52725 - Posted: 26 Sep 2019 | 3:22:11 UTC

New version of ACEMD v2.07 (cuda101)
New version of ACEMD v2.07 (cuda101)

Both tasks were downloaded while the RTX 2080 was otherwise occupied. When started, both tasks errored out in sequence after each was paused and then resumed.
Machine: i7, Windows 10, RTX 2080

Obviously very disappointing.

Bedrich Hajek
Message 52727 - Posted: 26 Sep 2019 | 10:24:29 UTC - in response to Message 52724.

I received 2 more of these this morning. They ran on the 980 Ti card. Both crashed without any suspend and resume. This is a new observation: previously I was able to finish them when I was running Einstein units on the other card.

http://www.gpugrid.net/result.php?resultid=21411671

http://www.gpugrid.net/result.php?resultid=21411672


The long units are running well on this card, with only one recent exception, which was caused by an abrupt computer shutdown.



Bedrich Hajek
Message 52729 - Posted: 27 Sep 2019 | 0:28:34 UTC - in response to Message 52727.

One of these units finished successfully today on the 980 Ti card (no suspend/resume was done on this unit):


http://www.gpugrid.net/result.php?resultid=21412313


It took more than double the time to complete that the same unit takes on the 2080 Ti card.

Another interesting observation is that the new ACEMD version seems to be more CPU dependent. A unit running on a 2080 Ti with an Intel(R) Core(TM) i7-5820K CPU will finish in about a fourth less time than a unit running on a 2080 Ti with an AMD Phenom(tm) II X6 1090T.




Bedrich Hajek
Message 52749 - Posted: 30 Sep 2019 | 0:08:57 UTC

I got 2 more of these units today. I decided to run them both simultaneously on one card (1 CPU w/ 0.5 GPU each). They ran slowly together, at a rate of about 16% per hour each, versus about 54% per hour when running at 1 CPU w/ 1 GPU.
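
To put the two progress rates above side by side, here is a rough back-of-the-envelope calculation. It assumes both rates stay constant for the whole run, which the post does not claim.

```python
# Progress rates in percent of a work unit per hour, from the post above.
rate_shared_each = 16  # each of two units sharing one GPU
rate_alone = 54        # one unit with the whole GPU to itself

# Total progress per hour across both units when they share the card.
combined_shared = 2 * rate_shared_each

# Shared throughput as a fraction of single-unit throughput.
relative = combined_shared / rate_alone

# Hours to finish two units under each approach (constant rates assumed).
hours_two_shared = 100 / rate_shared_each
hours_two_serial = 2 * 100 / rate_alone

print(f"combined shared rate: {combined_shared}% per hour")
print(f"relative throughput:  {relative:.2f}")
print(f"two units, shared:    {hours_two_shared:.2f} h")
print(f"two units, serial:    {hours_two_serial:.2f} h")
```

Under that assumption, sharing the card delivers about 59% of the throughput of running one unit at a time, so the two units would finish sooner if run back to back.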

After running them for a few minutes, I decided to suspend one of them. Before the suspension, I had 2 wrapper tasks and 2 acemd3 tasks running in the task manager. After suspending 1 unit, the task manager showed 1 wrapper and 1 acemd3 running. After resuming the unit, 2 acemd3 tasks and only 1 wrapper were running, and then the unit crashed. It looks like the problem may be with the wrapper.

See links:

http://www.gpugrid.net/result.php?resultid=21420201

http://www.gpugrid.net/result.php?resultid=21420184



rod4x4
Message 52752 - Posted: 1 Oct 2019 | 9:55:16 UTC
Last modified: 1 Oct 2019 | 9:56:13 UTC

Received the e23s10_e19s1p1f205-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-1-2-RND4012_0 TEST v2.06 (cuda100) work unit on a Win10 host with a GTX 1060.

Let it run for 6 hours 16 minutes (50 minutes of run time left).
Suspended it for 2 minutes.
It failed on restart.

The wrapper and ACEMD3 tasks disappeared from Task Manager on suspend.
They briefly reappeared in Task Manager before the work unit failed.

Link to Work Unit here:
http://gpugrid.net/result.php?resultid=21422582

The observations for all users testing Suspend/Resume on these TEST work units seem to be consistent with the above pattern.
Are there any other symptoms you would like us to monitor when testing?

Toni
Message 52754 - Posted: 1 Oct 2019 | 10:28:35 UTC - in response to Message 52752.
Last modified: 1 Oct 2019 | 10:30:26 UTC

That's sufficient, thanks. We are investigating. Sorry for the failed WUs.

Looks like Windows apps fail on restart :(

The restart function itself (=process expected to disappear) seems correct.

Bedrich Hajek
Message 52755 - Posted: 1 Oct 2019 | 10:51:56 UTC - in response to Message 52754.

Is the restart function the same as the initial start function (which doesn't crash)? Have the saved work files from before the suspension been corrupted, or are they not interacting properly with the other files?



Toni
Message 52756 - Posted: 1 Oct 2019 | 12:45:18 UTC - in response to Message 52755.

It restarts from a checkpoint file, which is written periodically. There is a bug, possibly not in our code.
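
For readers unfamiliar with the pattern, restarting from a periodically written checkpoint can be sketched as below. This is a toy illustration in Python, not the acemd3 or wrapper code: the JSON file format, the function names, and the `stop_after` parameter (which stands in for the process being killed on suspend) are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a kill mid-write never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # rename over the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if there is no usable checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Toy loop: resume from the checkpoint if present, else start at step 0."""
    state = load_checkpoint(path) or {"step": 0}
    while state["step"] < total_steps:
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # hypothetical stand-in for the process being killed
    return state["step"]
```

In this sketch a run stopped at step 37 with a checkpoint every 10 steps resumes from step 30, so only the work since the last checkpoint is repeated. Writing to a temporary file and renaming it is one common way to avoid a corrupted checkpoint if the process is killed mid-write.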
