Message boards : Graphics cards (GPUs) : hERG: information and issues
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 13850 - Posted: 9 Dec 2009 | 12:01:16 UTC
Last modified: 9 Dec 2009 | 13:27:53 UTC

Dear crunchers,

I'm starting this topic to collect information and feedback on the HERG workunits in a single place. The idea (under test) is to provide both a quick-to-find reference for those of you curious about the purpose of the WUs you are crunching and a place to report issues.

This post, and the one below, may be updated from time to time.


Scientific rationale.

First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page.
This complex of four proteins (a tetramer) is found in many of the body's cells, most notably in heart tissue, where it plays a very important role: it conducts charged particles (potassium ions), which flow through it cyclically, ultimately governing the heartbeat.

The molecule is of special interest because interference with its functioning, e.g. through unintended side effects of drugs or congenital mutations, can cause potentially fatal alterations in the cardiac rhythm, including long QT syndrome.

The curious ones may find an image of the tetramer on our Flickr photostream.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 13851 - Posted: 9 Dec 2009 | 12:02:40 UTC
Last modified: 9 Dec 2009 | 18:18:42 UTC

Crunching issues.

The TONI_HERG workunits use the same parameters as many others. As far as we know, they have the same failure rate as other workunits, but I am trying to get some sounder statistics. If you see more HERG failures, it could simply be that there are many of those WUs out right now.


[This post reserved for future updates]

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13869 - Posted: 10 Dec 2009 | 18:49:47 UTC - in response to Message 13851.

http://www.gpugrid.net/forum_thread.php?id=1506

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 13870 - Posted: 10 Dec 2009 | 20:11:52 UTC - in response to Message 13851.

Crunching issues.

The TONI_HERG workunits use the same parameters as many others. As far as we know, they have the same failure rate as other workunits, but I am trying to get some sounder statistics. If you see more HERG failures, it could simply be that there are many of those WUs out right now.

The TONI_HERG WUs run fine on GTX 260 and above. On my 4 G92-based cards they almost always fail, so I now abort them on those cards when they arrive. Other WUs do much, much better; most types never fail on any of the cards.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 13875 - Posted: 11 Dec 2009 | 11:07:57 UTC - in response to Message 13870.
Last modified: 11 Dec 2009 | 11:16:34 UTC

So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards.

From what I see in SKGiven's task list for host 51279, he had at least three TONI_HERG tasks complete successfully as well: 1572466, 1606985 and 1558388. BTW, isn't the card overclocked at 1.85 GHz?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Message 13876 - Posted: 11 Dec 2009 | 11:48:07 UTC - in response to Message 13875.

So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards.

I would put it more strongly than that - they have a high probability of failing, even if some succeed. And by 'age' of the card, you mean the technology generation they incorporate.

I have three 9800GT series cards, all purchased in January this year. The straight 9800GTs are not overclocked; the 9800GTX+ runs on factory overclock settings. I haven't noticed any significant difference in failure rate between the cards, so I don't think the problem is related to (moderate) overclocking.

Also, I've been running the same drivers (190.38, 32-bit WinXP) since July: the increased error rate has become apparent much more recently than that - late October, IIRC. So I'm not inclined to blame it on drivers, either.

No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models - searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468

canardo
Joined: 11 Feb 09
Posts: 4
Credit: 8,675,472
RAC: 0
Message 13881 - Posted: 11 Dec 2009 | 17:36:57 UTC - in response to Message 13875.

Hello,
Just have a look here, comp id: 26091.
It worked fine until I upgraded to BOINC 6.10.18,
although it might be a coincidence with the HERG units coming in.
SETI & Einstein have no problems though.
Ciao,
Jaak

____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13906 - Posted: 13 Dec 2009 | 12:41:58 UTC - in response to Message 13875.
Last modified: 13 Dec 2009 | 13:33:06 UTC

So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards.

From what I see in SKGiven's task list for host 51279, he had at least three TONI_HERG tasks complete successfully as well: 1572466, 1606985 and 1558388. BTW, isn't the card overclocked at 1.85 GHz?


Yes, 3 tasks did complete on the GTS 250, but there were too many failures.
The clock settings are in fact factory settings. Yes, they are higher than on other cards, but the card is fairly new and the core sits at 66 degrees (5 fans on the case, plus GPU, CPU and PSU fans), and the system is on a UPS! The GTS 250 success rates are much higher for other tasks.

On the other hand, my 8800GTS 512MB G92 could not complete any TONI_HERG tasks. As there were so many being sent, I was down to an almost zero return for that card on the project. That card was also not able to handle other recent tasks too well. I guess it is down to the G92 core's limitations.

My GTS250 spec:
Palit card. 65nm, G92 rev A2. Bios 62.92.7D.00.10
11.9562, CUDA 3 (better than 2.3)!
GPU @745, Memory @1000MHz, Shaders @1848MHz
754M Transistors.
GPUGrid temp=66 Degrees C
For Ref. Einstein temp=48 Degrees C (but that barely uses the GPU)!

System: Q9400CPU @3.46GHz crunching other Boinc tasks (24/7, no outages as on UPS) and Win7 Pro 64bit. 4GB RAM plenty HDD space.

I will allow it to try another HERG task and report back tomorrow, hopefully!

The GTX260 is still working well for all tasks, but that uses a GT200 A2.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 13925 - Posted: 14 Dec 2009 | 18:53:27 UTC - in response to Message 13875.
Last modified: 14 Dec 2009 | 18:53:55 UTC

So, from what I understand, these WUs sometimes fail on older cards? I'm trying to collect statistics on non-overclocked cards.

As Richard states, "high probability of failing" is a better description. They will occasionally complete but usually fail. On the GTX 260 and above they run fine. BTW, they often fail on the new GTS 240 and GT 240 cards too, even with their 1.2 compute capability:

http://www.gpugrid.net/result.php?resultid=1592578
http://www.gpugrid.net/result.php?resultid=1590198
http://www.gpugrid.net/result.php?resultid=1610106

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13927 - Posted: 14 Dec 2009 | 19:16:26 UTC - in response to Message 13925.

My GTS250 managed to complete one! http://www.gpugrid.net/result.php?resultid=1625604

The success percentage of these HERG tasks for anything less than a GTX260 seems to be poor, with the older cards being less reliable.

Just because an NVidia card is new does not mean there is any new technology inside!

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 13947 - Posted: 15 Dec 2009 | 14:17:10 UTC - in response to Message 13927.

We are keeping an eye on the failure rate with respect to card type (in the absence of overclocking). As said, the matter is puzzling because there should be no major difference from other WU types. For now, I have reduced the number of HERG WUs out, and I may reduce their length a bit in order to increase the chances of correct termination.

Almost all of the failures seem to be related to the infamous CUDA FFT bug, over which we have little to no control (i.e., errors in "pme" or "fft" kernels).

Definitely, thanks for bearing with us.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Message 13951 - Posted: 15 Dec 2009 | 17:19:35 UTC - in response to Message 13947.

Almost all of the failures seem to be related to the infamous CUDA FFT bug, over which we have little to no control (i.e., errors in "pme" or "fft" kernels).

Could you give us a little bit more detail about this bug, as this is the first time I've heard about it? It may only be "infamous" in developer circles.

I'm aware of an infamous bug in the BOINC CUDA application which NVidia developed for SETI@home, but that just causes certain tasks ('VLAR') to run extremely slowly, and inhibits screen re-drawing while they're running. Apart from that, SETI is an extremely heavy user of FFTs at a wide range of problem sizes, and benefits enormously from the additional capabilities of cufft v2.3: I've not come across a single SETI task which has failed because of a CUDA FFT bug.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 13953 - Posted: 15 Dec 2009 | 17:55:52 UTC - in response to Message 13951.
Last modified: 15 Dec 2009 | 18:02:19 UTC

It's a long-standing issue that hits older cards especially hard. Please see here or here. As for FFT being OK with SETI, there are in fact many types of FFT, and it's not surprising that the bug only manifests for some of them.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Message 13954 - Posted: 15 Dec 2009 | 18:49:36 UTC - in response to Message 13953.

It's a long-standing issue that hits older cards especially hard. Please see here or here. As for FFT being OK with SETI, there are in fact many types of FFT, and it's not surprising that the bug only manifests for some of them.

I had hoped that you would direct me to a relevant discussion here. The only thing of relevance in those threads seems to be message 12734:

We have contacted AGAIN Nvidia yesterday.

gdf

That was almost three months ago, and is the very last post in the thread. Did he ever get a reply?

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13955 - Posted: 15 Dec 2009 | 20:08:41 UTC - in response to Message 13954.

Perhaps the FFT bug is being compounded by a mixture of G92/65nm cores and old firmware?

Reducing the work length would help, as the tasks that failed on my systems seemed to do so at random points in time. If they fail after 10 sec it's not really a problem that affects turnover, but failing after 6 h is not good.

Ultimately if you could match cards to work units it would resolve this issue. It might even be better than card pairing, though both could be done.

No hERG tasks for G92 cards would soon sort a lot of problems out.

Tom Philippart
Joined: 12 Feb 09
Posts: 57
Credit: 23,376,686
RAC: 0
Message 13957 - Posted: 15 Dec 2009 | 20:50:07 UTC

great to see this thread!! thanks a lot!

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 13982 - Posted: 18 Dec 2009 | 9:49:06 UTC - in response to Message 13957.
Last modified: 18 Dec 2009 | 9:49:35 UTC

I can just repeat what I have already said elsewhere in the forum.
We have furnished a reproducer of the bug to Nvidia, and we have followed up with them several times. Once they said that they were looking at it. Another time, they said that their technical staff were trying to find the problem and that there were discussions on what to do. But then nothing. This is common with Nvidia: we have sent several bug reproducers, but only once did they fix a bug in their FFT that we had reported. In my experience, they use bug reports to fix bugs on new chips, not older ones. It also makes some sense given the rate at which new GPUs are produced. So we have stopped reporting bugs for older cards.


GDF

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 14077 - Posted: 29 Dec 2009 | 23:57:17 UTC

Had two TONI_HERGs fail. They were run on a GTX295 (single PCB variety, so the newer model).

WU 1
WU 2

Both say "Cuda error: Kernel [pme_fill_charges_overflow] failed in file 'fillcharges.cu' in line 97 : unknown error".
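For anyone collecting failure statistics from their own task logs, the error line quoted above can be split into its parts with a short script. This is only a sketch: the regular expression is inferred from this one message, and the "pme or fft" test simply mirrors Toni's description of the affected kernels; neither is taken from the application itself.

```python
import re

# Pattern inferred from the single error line quoted in this thread;
# other messages from the application may be worded differently.
CUDA_ERROR_RE = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+)"
)


def parse_cuda_error(text):
    """Return kernel, file, line and reason as a dict, or None if no match."""
    match = CUDA_ERROR_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    return info


def looks_fft_related(kernel):
    """Heuristic only: treat kernels with 'pme' or 'fft' in the name as FFT-related."""
    name = kernel.lower()
    return "pme" in name or "fft" in name


sample = ("Cuda error: Kernel [pme_fill_charges_overflow] failed in file "
          "'fillcharges.cu' in line 97 : unknown error")
info = parse_cuda_error(sample)
print(info["kernel"], looks_fft_related(info["kernel"]))
```

Counting the parsed kernel names across many failed tasks would show whether failures on a given card cluster in the pme/fft kernels.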

I know there isn't much you can do if nvidia don't want to fix their software.
____________
BOINC blog

Profile robertmiles
Joined: 16 Apr 09
Posts: 502
Credit: 590,520,933
RAC: 39,302
Message 14159 - Posted: 8 Jan 2010 | 15:59:04 UTC - in response to Message 13982.

I can just repeat what I have already said elsewhere in the forum.
We have furnished a reproducer of the bug to Nvidia, and we have followed up with them several times. Once they said that they were looking at it. Another time, they said that their technical staff were trying to find the problem and that there were discussions on what to do. But then nothing. This is common with Nvidia: we have sent several bug reproducers, but only once did they fix a bug in their FFT that we had reported. In my experience, they use bug reports to fix bugs on new chips, not older ones. It also makes some sense given the rate at which new GPUs are produced. So we have stopped reporting bugs for older cards.


GDF


I've downloaded the Nvidia SDKs for the older CUDA versions. Are you interested in sending me the source code for the current Windows application and letting me check if whatever method you use to compile it also works with the older SDKs? Or would you prefer to download those SDKs yourself? I'd expect either method to produce versions with better support for some of the older Nvidia boards, IF they don't need major source code modifications to work at all.

I intended to start learning enough CUDA that I could start helping a few BOINC projects start a GPU version, but so far it looks like I won't be ready to actually start modifying the code very soon.

Another idea: Ask the BOINC developers to add more code for reporting the GPU chip type, in order to get more information about which of the older Nvidia boards are still usable.

Profile robertmiles
Joined: 16 Apr 09
Posts: 502
Credit: 590,520,933
RAC: 39,302
Message 14160 - Posted: 8 Jan 2010 | 17:46:38 UTC - in response to Message 13850.

First of all, some background information on the experiment: we are doing various studies on the so-called "hERG channel". You can find a (longish) description on Wikipedia's hERG page.
This complex of four proteins (a tetramer) is found in many of the body's cells, most notably in heart tissue, where it plays a very important role: it conducts charged particles (potassium ions), which flow through it cyclically, ultimately governing the heartbeat.


Since that means your software is now ready to handle a tetramer, here's some information on a trimer you're likely to be interested in as well:

A trimer of the gp120 protein that the HIV-1 virus uses to enter human cells. If your software can handle docking of assorted compounds to that trimer and choose those that dock to the trimer without too much being wasted on also docking to the single units of the gp120 protein elsewhere on the virus coat, you're likely to get the groups interested in HIV/AIDS research very interested in using your software.

At this moment, I'm having trouble getting the links from one of my other computers to this one, but will post several related links if they look useful for you.

Are you interested in getting enough grants that you will have to hire yet another researcher or two to handle them all?


Jari Pyyluoma
Joined: 2 Aug 08
Posts: 12
Credit: 1,165,835,704
RAC: 0
Message 14163 - Posted: 8 Jan 2010 | 20:23:49 UTC - in response to Message 14159.

Well, if the problem with the Toni work units can't be solved, a workaround had better be made. In other projects the participants can choose what work they want to do. Let people choose the Toni units if they want to, for instance if they know that their hardware does not fail them. It is reasonable to give slightly higher credit for problematic work. If you want to experiment with new work units, make it a voluntary choice.

Right now I do not trust this project, so I supervise the downloads to remove any Tonis that might show up. This also means that my GPUs mostly run another project, something that I am unhappy with.

Do not repeat the mistakes of other projects who have lost most of their donors. Do not put out work units that are of no use. Do not take chances with our time and our money.

I hope you make smart decisions so I can trust you again. Because I like this project.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 14189 - Posted: 13 Jan 2010 | 0:19:44 UTC - in response to Message 14163.

Jari Pyyluoma,
I broadly agree with your concerns; I too am annoyed by wasting some of my efforts:

Although my GTS250 has recently fared a bit better (RAC 8255), largely due to the changes made by the Techs, my 8800GTS 512MB is all but lost: RAC is about 900!
I know it is old kit, but there are lots of people that have old kit.

However, on a positive note, my recent reviewing has shown that some of the new cards, although not magnificent, can still contribute substantially.
The GT 240 in particular is a worthy card. Very reliable.
So I would suggest to anyone who wants to continue participating: sell your old cards and buy a new one.
A GT 240 can be purchased for between £60 and £80. The running costs are about one third of those of top-end compute capability 1.1 cards, so over 6 months of crunching you will save:

If you run a 9800 GTX for 6 months, the running cost = 180 W * £1.20 per watt-year * ½ year = £108.
Sell your card for £25 minimum. Buy a new GT 240 for £65 and you spend a net £40.
Run a GT 240 for 6 months = 60 W * £1.20 per watt-year * ½ year = £36.
Total cost of buying a new card and running it = £36 + £40 = £76.
So, over 6 months of crunching you would save £108 - £76 = £32.

Over a year of crunching that works out at a saving of £216 - £72 - £40 = £104.
Oh, plus you get a better card!

From a network manager's point of view, the fact that you would break even within 4 months is very attractive!
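The sums above can be re-run with your own prices and wattages. A minimal sketch, assuming the figures from this post (£1.20 per watt-year, 180 W versus 60 W, £65 purchase price, £25 resale value); the function names are made up for illustration.

```python
# Electricity rate as estimated in the post: GBP per watt drawn
# continuously for one year.
COST_PER_WATT_YEAR = 1.20


def running_cost(watts, years):
    """Electricity cost in GBP of drawing `watts` continuously for `years`."""
    return watts * COST_PER_WATT_YEAR * years


def net_saving(old_watts, new_watts, new_price, resale, years):
    """Saving from swapping cards, after the net cost of the upgrade."""
    upgrade_cost = new_price - resale
    return (running_cost(old_watts, years)
            - running_cost(new_watts, years)
            - upgrade_cost)


def break_even_months(old_watts, new_watts, new_price, resale):
    """Months of crunching until the electricity saved repays the upgrade."""
    saving_per_month = running_cost(old_watts - new_watts, 1.0) / 12
    return (new_price - resale) / saving_per_month


print(round(net_saving(180, 60, 65, 25, 0.5), 2))     # six months
print(round(net_saving(180, 60, 65, 25, 1.0), 2))     # one year
print(round(break_even_months(180, 60, 65, 25), 1))   # months to break even
```

With these inputs the break-even point comes out at a little over three months, which is where the "within 4 months" figure comes from.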

Under full load a GT 240 will use about 50 to 60 Watts and give you about 6900 points per day: about 125 points per day per Watt.
Under full load a GTS 250 will use about 184 Watts and give you about 8250 points per day (perhaps partially due to failures): about 44 points per day per Watt!
Even my GTX 260 sp216 (55nm version) only gets 14000 points per day and eats up about the same Watts as a GTS 250: about 76 points per day per Watt.
Given that three GT 240 cards would use less electricity than one GTS 250 and do more than twice the work, these cards are very efficient!

In terms of Points per Watt, the GT 240 IS BY FAR the most efficient card available to GPUGrid supporters!
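The efficiency comparison is just points per day divided by power draw. A small sketch using the figures quoted in this post; taking 55 W for the GT 240 (the midpoint of "50 or 60 Watts") and 184 W for the GTX 260 ("about the same" as the GTS 250) are assumptions made here for the calculation.

```python
# (points per day, watts under full load), as quoted in the post.
cards = {
    "GT 240": (6900, 55),            # 55 W: assumed midpoint of 50-60 W
    "GTS 250": (8250, 184),
    "GTX 260 sp216": (14000, 184),   # assumed equal to the GTS 250 draw
}


def points_per_watt(points_per_day, watts):
    """Points earned per day for each watt of power drawn."""
    return points_per_day / watts


# Rank the cards from most to least efficient.
ranking = sorted(cards, key=lambda name: points_per_watt(*cards[name]),
                 reverse=True)

for name in ranking:
    ppd, watts = cards[name]
    print(f"{name}: {points_per_watt(ppd, watts):.0f} points per day per watt")
```

Swapping in measured wall-socket wattage for your own cards would give a fairer comparison than the nominal figures used here.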

It will also do TEN times the work of an overclocked i7-920 running at 3.8GHz and using over 300W. It is a NO BRAINER!

Ref:
http://benchmarkreviews.com/index.php?option=com_content&task=view&id=423&Itemid=72&limit=1&limitstart=11
http://www.guru3d.com/article/msi-geforce-gt-240-review-test/5
My Stats!

Jari Pyyluoma
Joined: 2 Aug 08
Posts: 12
Credit: 1,165,835,704
RAC: 0
Message 14198 - Posted: 15 Jan 2010 | 5:30:11 UTC - in response to Message 14189.

Thanx, that sounds like great news. I just happened to get some nvidia cards from a friend. Yes, I also have the feeling that I should pass them on, and now you have proven it with numbers.

Well, I had a bunch of PS3s. I never quite understood the reasoning behind stopping that part of the project. It seems that the people running the project can change their minds from one day to the next. So buying a card just for this project is out of the question. ATI cards are better for the other projects.

The project has been pushing those flaky Toni work units on me, and I have been aborting them. It seems that it has always been someone with a 295 who finishes them. I wish this project had the backbone that folding@home has: they test their work before putting it into production, they react to feedback, and most of all they still support the PS3. I guess the people running this project feel let down by their university and can't find it in themselves to create great work. I have a very hard time understanding what the problem with funding is for a top-notch project like this. Maybe the university is too small and insignificant to be able to make its name known, compared with a university like Stanford.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 14201 - Posted: 15 Jan 2010 | 14:17:45 UTC - in response to Message 14198.
Last modified: 15 Jan 2010 | 14:34:13 UTC

Sony apparently stopped allowing the use of Linux on the PS3, which was required to run GPUGrid on it. Lots of G92 cards are struggling due to a number of things outside the control of the project, including the reliance on NVidia for code. If there is a bug in their code and the project team are not allowed to correct it, there is nothing they can do. If the project team were larger I am sure they would be better equipped to make more changes, perhaps even write tasks for the older cards, but as it is, things appear to be tight.
This is a good project and needs support. To me it makes sense to sell on old parts and replace them with new parts that are better and actually work well.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 14235 - Posted: 18 Jan 2010 | 13:31:20 UTC - in response to Message 14201.
Last modified: 18 Jan 2010 | 13:32:42 UTC

I agree that manually aborting WUs should not be necessary. In any case, BOINC does not currently provide a mechanism for letting people choose WUs. As already said, some classes of WUs have a higher probability of triggering bugs in some cards, but to the best of our knowledge this is not as simple as fixing a bug in our code.

We are working on the problem, of course; in the meantime, I've suspended the generation of the last HERG workunits (most of them were stopped before Christmas), even though this is not really a "solution": any bugs it triggers will not go away.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 14924 - Posted: 2 Feb 2010 | 15:01:16 UTC - in response to Message 14235.

I'm sending a few more workunits of the HERG type. In the meantime, if you want to see images of what you are crunching, have a look at the flickr page!

Profile robertmiles
Joined: 16 Apr 09
Posts: 502
Credit: 590,520,933
RAC: 39,302
Message 14928 - Posted: 2 Feb 2010 | 16:43:07 UTC - in response to Message 14235.

I agree that manually aborting WUs should not be necessary. In any case, BOINC does not currently provide a mechanism for letting people choose WUs. As already said, some classes of WUs have a higher probability of triggering bugs in some cards, but to the best of our knowledge this is not as simple as fixing a bug in our code.


World Community Grid allows participants to choose workunits by making different types of workunits different subprojects and allowing the participants to choose which subprojects to run. Any particular reason why you can't do the same, even if it requires providing the same application program under more than one name?

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 14930 - Posted: 2 Feb 2010 | 16:57:21 UTC - in response to Message 14928.

Well, bugs may come and go with cards, drivers, and versions of the application. We try to make all WUs run equally well, rather than fork (and maintain) separate queues.

Profile [AF>Libristes>Jip] Elgran...
Joined: 16 Jul 08
Posts: 45
Credit: 78,618,001
RAC: 0
Message 14951 - Posted: 3 Feb 2010 | 12:06:46 UTC
Last modified: 3 Feb 2010 | 12:07:19 UTC

I also have another compute error with this type of workunit.
My workunit
My computer

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 14968 - Posted: 3 Feb 2010 | 18:50:48 UTC - in response to Message 14951.

As soon as the new application is available, I'll migrate the workunits to it, fingers crossed that it will improve the situation.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 14969 - Posted: 3 Feb 2010 | 18:53:14 UTC - in response to Message 14968.
Last modified: 5 Feb 2010 | 12:03:22 UTC

Feel free to see the new molecular images on flickr...

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 14981 - Posted: 4 Feb 2010 | 11:11:31 UTC - in response to Message 14969.
Last modified: 4 Feb 2010 | 11:26:36 UTC

The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience...

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 15000 - Posted: 5 Feb 2010 | 1:26:43 UTC - in response to Message 14981.

I have had two crash and no success on a stable card.
____________
Thanks - Steve

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 15009 - Posted: 5 Feb 2010 | 10:22:00 UTC - in response to Message 15000.
Last modified: 5 Feb 2010 | 10:29:01 UTC

Which of the two hosts?

BTW, for those of you crunching beta, the L*_TONI_TEST WUs are the same as the HERG and HERGext ones.

Profile [AF>Libristes>Jip] Elgran...
Joined: 16 Jul 08
Posts: 45
Credit: 78,618,001
RAC: 0
Message 15017 - Posted: 5 Feb 2010 | 11:32:09 UTC

Another compute error with a GTX295 on this computer.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 15018 - Posted: 5 Feb 2010 | 12:00:40 UTC - in response to Message 15017.
Last modified: 5 Feb 2010 | 12:02:01 UTC

This does not seem to be HERG-specific; you also had an error on task 1816245 of another batch.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Message 15052 - Posted: 7 Feb 2010 | 9:49:01 UTC - in response to Message 13876.

.. No, it seems to be related to specific model types. TONI_HERG is a fairly recent addition to the list of problematic models - searching the message boards suggests that my report on 24 November was the first sighting. Previously, we had been commenting on IBUCH_TRYP and OTTO_HERG in thread 1468

Pleased to report that one of my 9800GT cards has successfully completed a TONI_HERG qext.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 15055 - Posted: 7 Feb 2010 | 11:28:11 UTC - in response to Message 15052.
Last modified: 7 Feb 2010 | 11:28:59 UTC

Good!

Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Level
Cys
Message 15057 - Posted: 7 Feb 2010 | 13:14:22 UTC

D1s30c47-TONI_HERGqext-2-60-RND0387_0

http://www.gpugrid.net/workunit.php?wuid=1148761


<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>D1s30c47-TONI_HERGqext-2-conf_file_enc</file_name>
<error_code>-119</error_code>
<error_message>MD5 check failed</error_message>
</file_xfer_error>

</message>
]]>


Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 15058 - Posted: 7 Feb 2010 | 15:42:50 UTC - in response to Message 15057.
Last modified: 7 Feb 2010 | 17:09:24 UTC

Thanks, well spotted. I tried to replace a parameter on the fly. Please post if it happens again.
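
For anyone wondering why an edited input file gets rejected: the workunit carries a checksum recorded when it was created, and the client compares it with what it downloads. The sketch below only illustrates that idea. It is not BOINC's actual code, and the helper names and the `fft_grid` file contents are made up for the example.

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

def check_download(data: bytes, expected_md5: str) -> bool:
    """True if the downloaded bytes match the checksum stored with the WU."""
    return md5_of(data) == expected_md5

# Hypothetical config contents, before and after an on-the-fly edit.
original = b"fft_grid 64\n"
edited = b"fft_grid 72\n"

recorded = md5_of(original)  # checksum stored when the WU was created

ok_original = check_download(original, recorded)
ok_edited = check_download(edited, recorded)
print(ok_original, ok_edited)
```

Changing the file on the server without updating the recorded checksum would produce exactly this kind of mismatch, which the client reports as "MD5 check failed".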

Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Level
Cys
Message 15060 - Posted: 7 Feb 2010 | 17:43:34 UTC - in response to Message 14981.

The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience...



I notice a computation time of 11 h to 14.5 h on highly overclocked GTX295 (700MHz)/GTX265 (750MHz) cards for the HERGqext.

Time per step: 62.932 ms

Example

The TONI_HERGext run for only ~6.5 h.

Time per step: 37.026 ms

Example

"slightly? :-) longer computation"
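
To put a number on "slightly", here is a quick check that uses only the two step times quoted above. The variable names are mine, and the 6.5 h baseline is the rough figure from the post.

```python
# Step times quoted above, in milliseconds.
hergqext_ms = 62.932
hergext_ms = 37.026

# Per-step slowdown of HERGqext relative to HERGext.
slowdown = hergqext_ms / hergext_ms

# Scale the ~6.5 h HERGext runtime by the same factor.
estimated_hours = 6.5 * slowdown

print(f"slowdown: {slowdown:.2f}x, estimated runtime: {estimated_hours:.1f} h")
```

That is a slowdown of about 1.7x per step, which would turn ~6.5 h into roughly 11 h, in line with the lower end of the range reported above.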

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 15061 - Posted: 7 Feb 2010 | 19:11:32 UTC - in response to Message 15060.
Last modified: 7 Feb 2010 | 19:26:25 UTC

I also noticed the increase, which was higher than expected. This is what I was trying to fix...

The new ones should be back to the norm.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Level
Arg
Message 15735 - Posted: 13 Mar 2010 | 18:10:42 UTC

I've had a run of three successive failures from the current batch of TONI_HERG with ACEMD v6.03, Windows XP32:

a43-TONI_HERG77a-1-100-RND4354_0
a317-TONI_HERG79a-0-100-RND8649_1
a268-TONI_HERG79a-1-100-RND6278_1

Three different machines, three CUDA cards - two 9800GT at stock, one 9800GTX+ factory overclocked.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Level
Arg
Message 15812 - Posted: 18 Mar 2010 | 12:43:10 UTC

Does anyone have any idea on these?

Since reporting these errors, all three cards have worked full time on GPUGrid (another refugee from SETI!), around 30 tasks completed, and with 100% success rate - including a couple of the long-running TONI_GA.

But I've continued to abort TONI_HERG on sight (apologies once again to the researchers on that project) until the situation is clearer.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 15814 - Posted: 18 Mar 2010 | 13:13:11 UTC - in response to Message 15812.

I think the only way round this sort of problem is for the server to identify each card's ability to complete the various types of work unit and allocate accordingly. If there is more than, say, a 25% chance of failure, then don't allocate the task unless there are no other tasks.
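
The allocation rule suggested in this post could look roughly like the sketch below. It is only an illustration: `pick_task` is a hypothetical helper, not part of the BOINC server, the failure rates are invented, and falling back to the least risky task when nothing is under the threshold is my own reading of "unless there are no other tasks".

```python
from typing import Dict, List, Optional, Tuple

FAILURE_THRESHOLD = 0.25  # the "say 25%" figure suggested in the post

def pick_task(card: str,
              queue: List[str],
              failure_rate: Dict[Tuple[str, str], float]) -> Optional[str]:
    """Choose a task type for a card model (hypothetical helper).

    failure_rate maps (card model, task type) to an observed failure
    fraction; pairs with no history are treated as 0.0 (an assumption).
    """
    if not queue:
        return None

    def risk(task_type: str) -> float:
        return failure_rate.get((card, task_type), 0.0)

    # Prefer the first queued task at or below the threshold.
    for task_type in queue:
        if risk(task_type) <= FAILURE_THRESHOLD:
            return task_type
    # Nothing safe is left: hand out the least risky one anyway.
    return min(queue, key=risk)

# Made-up failure rates, for illustration only.
rates = {("9800GT", "TONI_HERG"): 0.60, ("9800GT", "TONI_GA"): 0.05}
print(pick_task("9800GT", ["TONI_HERG", "TONI_GA"], rates))
print(pick_task("9800GT", ["TONI_HERG"], rates))
```

With these invented numbers the card gets TONI_GA while one is queued, and only receives TONI_HERG when nothing else is available.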

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Level
Arg
Message 15823 - Posted: 19 Mar 2010 | 8:23:45 UTC

Another slipped in while I was asleep:

a8-TONI_HERG77a-9-100-RND1351_1

Profile X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Level
Thr
Message 15966 - Posted: 24 Mar 2010 | 21:37:34 UTC

here's a bad WU:
http://www.gpugrid.net/workunit.php?wuid=1282907

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 15969 - Posted: 25 Mar 2010 | 11:29:44 UTC - in response to Message 15966.

This bad one seems to have been created by some file transfer error. It should fail immediately.

mwgiii
Joined: 22 Jan 09
Posts: 8
Credit: 958,583,971
RAC: 0
Level
Glu
Message 16173 - Posted: 5 Apr 2010 | 14:36:26 UTC
Last modified: 5 Apr 2010 | 14:38:51 UTC

I have now had 3 in a row fail.

1st: http://www.gpugrid.net/result.php?resultid=2093496
2nd: http://www.gpugrid.net/result.php?resultid=2096024
3rd: http://www.gpugrid.net/result.php?resultid=2103499

I rebooted after the 1st fail. The 2nd failed after 523 seconds and the 3rd after 9.1 seconds. The failures are also putting random sparkles on my screen.

Looking back at my history, I also had one fail on April 1st: http://www.gpugrid.net/result.php?resultid=2082136

All 4 have the same error message:
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [M_shake_position_kernel_step_1] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203


Intel Q9450 quad with Windows Vista Premium x64.
Nvidia 9800 GTX+ with driver 197.13.
Boinc 6.10.18

Profile JStateson
Joined: 31 Oct 08
Posts: 183
Credit: 3,327,276,529
RAC: 13
Level
Arg
Message 16195 - Posted: 7 Apr 2010 | 15:55:24 UTC

I had a similar error after crunching for 13 hours.

MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203


I do not know which GPU it failed on, either the GTS250 with 1 GB of memory or the 9800 GTX+ with 0.5 GB of memory.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 16197 - Posted: 7 Apr 2010 | 17:04:58 UTC - in response to Message 16195.

A GTS250 is very similar (almost identical) to a 9800 GTX+.
So it is probably not that important, unless you are getting lots of failures. The 1GB vs 500MB memory difference does not matter here.

ftpd
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 16220 - Posted: 9 Apr 2010 | 6:57:56 UTC

Crashed after 14 hours 25 minutes. GTS 250 - driver 197.13 - Windows XP.
Task 2120270.
____________
Ton (ftpd) Netherlands

Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Level
Cys
Message 16313 - Posted: 15 Apr 2010 | 18:12:32 UTC
Last modified: 15 Apr 2010 | 18:33:47 UTC

During the past weeks, some hERG WUs on my four 9800GT cards (Vista64) stopped with an "acemd... error bubble".
About 4 weeks ago I tried restarting the PC instead of clicking "OK" (with the "error bubble" still open). After the restart, the WU resumed from the checkpoint and finished valid!

I verified this behavior with 5 further WUs. Every (valid) result shows a similar "stderr out":

.....................................................................
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.52 GHz
# Total amount of global memory: 519634944 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.52 GHz
# Total amount of global memory: 519634944 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
called boinc_finish

Validate state Valid
..........................................................................

Last example: http://www.gpugrid.net/result.php?resultid=2158139
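
The pattern in the log above (a fatal error, a second device banner after the reboot, then "called boinc_finish") can be checked mechanically. The sketch below is a hedged illustration: `summarize_stderr` is a hypothetical helper, the marker strings are copied from the log quoted in this post, and the step count assumes the elapsed time is simply steps times time per step.

```python
import re

def summarize_stderr(text: str) -> dict:
    """Summarise an ACEMD stderr dump like the one quoted above."""
    step = re.search(r"# Time per step:\s*([\d.]+) ms", text)
    total = re.search(
        r"# Approximate elapsed time for entire WU:\s*([\d.]+) s", text)
    summary = {
        "starts": text.count("supporting CUDA"),     # device banners seen
        "fatal_errors": text.count("SWAN : FATAL"),  # kernel failures
        "finished": "called boinc_finish" in text,
        "steps": None,
    }
    if step and total:
        seconds_per_step = float(step.group(1)) / 1000.0
        # Assumption: elapsed time = steps * time per step.
        summary["steps"] = round(float(total.group(1)) / seconds_per_step)
    return summary

# Shortened copy of the log quoted above.
sample = """\
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
called boinc_finish
"""

info = summarize_stderr(sample)
print(info)
```

For this log the summary shows two starts, one fatal error and a clean finish, and the two timing figures imply roughly 625,000 steps for the whole WU.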
