Message boards : Graphics cards (GPUs) : Repeatable failure on Windows XP Home with multiple users logged in

Author Message
jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Message 7827 - Posted: 25 Mar 2009 | 8:58:08 UTC
Last modified: 25 Mar 2009 | 8:58:25 UTC

I turned on all the debugging features in BOINC in the hope of trapping something useful, since I can repeatedly make a work unit (WU) fail by logging into a second account on XP Home while the first account's session is suspended.
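
For anyone wanting to reproduce this kind of logging, here is a minimal sketch of how the debug flags might be switched on through a cc_config.xml file. The flag names are an assumption: they are taken from the bracketed tags that appear in the log excerpts in this thread, and the helper name build_cc_config is purely illustrative. Check the flag list against the documentation for your client version before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed flag names, copied from the bracketed tags seen in the logs
# in this thread (e.g. [task_debug], [app_msg_send]).
LOG_FLAGS = [
    "task_debug",
    "app_msg_send",
    "app_msg_receive",
    "cpu_sched_debug",
    "file_xfer_debug",
    "guirpc_debug",
]


def build_cc_config(flags):
    """Return a cc_config document (as a string) with each flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # One child element per flag, with "1" meaning enabled.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the document; it would then be saved as cc_config.xml
    # in the BOINC data directory and the client restarted.
    print(build_cc_config(LOG_FLAGS))
```

The output is a single-line XML document; the client should not care about the missing indentation, but that is untested here.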

I installed BOINC with the following configuration:


  • BOINC installed as a service
  • All users able to control BOINC
  • BOINC data files installed on the E: drive
  • CPDN Beta is also running


Here is the latest failure:


<core_client_version>6.6.15</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTS 250"
# Clock rate: 601714 kilohertz
# Total amount of global memory: 1073414144 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 0
# Device 0: "GeForce GTS 250"
# Clock rate: 1890000 kilohertz
# Total amount of global memory: 1073414144 bytes
# Number of multiprocessors: 16
# Number of cores: 128
Cuda error: Kernel [frc_sum_nb_forces] failed in file 'force.cu' in line 244 : unknown error.

</stderr_txt>
]]>


These are the messages that came from stdoutgui.txt when I switched user profile:

[03/25/09 07:58:02] TRACE [4028]: init_poll(): retrying connect: -1

[03/25/09 07:58:03] TRACE [4028]: init_poll(): sock = 476

[03/25/09 07:58:03] TRACE [4028]: init_poll(): sock = 476

[03/25/09 07:58:03] TRACE [4028]: init_poll(): sock = 476

[03/25/09 07:58:03] TRACE [4028]: init_
07:59:51: Error: can't open file 'E:\Program Files\BOINC\\RebootPending.txt' (error 2: the system cannot find the file specified.)
08:03:27: Error: can't open file 'E:\Program Files\BOINC\\RebootPending.txt' (error 2: the system cannot find the file specified.)


These are the messages from stdoutdae.txt when it failed about 10 minutes later:


25-Mar-2009 08:10:53 [GPUGRID] [app_msg_send] sent heartbeat to up388195-pYEpYV3_xyUS30000-1-10-ignasi_1

25-Mar-2009 08:10:53 [GPUGRID] [app_msg_receive] got msg from slot 4: <current_cpu_time>1.392656e+002</current_cpu_time>
<checkpoint_cpu_time>1.390000e+002</checkpoint_cpu_time>
<fraction_done>9.632000e-002</fraction_done>
<fpops_cumulative>2.491288e+015</fpops_cumulative>
<intops_cumulative>2.491288e+015</intops_cumulative>
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_cc_status/>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<cc_status>
<network_stat'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_messages>
<seqno>9249</seqno>
</get_messages>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<msgs>
<msg>
<project></pro'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_cc_status/>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<cc_status>
<network_stat'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_messages>
<seqno>9253</seqno>
</get_messages>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<msgs>
<msg>
<project></pro'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_results/>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:54 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<results>
<result>
<name'
25-Mar-2009 08:10:54 [GPUGRID] [app_msg_receive] got msg from slot 4: <current_cpu_time>1.392969e+002</current_cpu_time>
<checkpoint_cpu_time>1.390000e+002</checkpoint_cpu_time>
<fraction_done>9.632000e-002</fraction_done>
<fpops_cumulative>2.491288e+015</fpops_cumulative>
<intops_cumulative>2.491288e+015</intops_cumulative>
25-Mar-2009 08:10:54 [---] [cpu_sched_debug] Request CPU reschedule: application exited
25-Mar-2009 08:10:54 [---] [state_debug] set dirty: ACTIVE_TASK_SET::poll
25-Mar-2009 08:10:54 [GPUGRID] Computation for task up388195-pYEpYV3_xyUS30000-1-10-ignasi_1 finished
25-Mar-2009 08:10:54 [---] [state_debug] set dirty: handle_finished_apps
25-Mar-2009 08:10:54 [---] [cpu_sched_debug] Request CPU reschedule: handle_finished_apps
25-Mar-2009 08:10:54 [---] [cpu_sched_debug] schedule_cpus(): start
25-Mar-2009 08:10:54 [---] [rr_sim] rr_sim start: work_buf_total 181440.00
25-Mar-2009 08:10:54 [---] [cpu_sched_debug] Request enforce CPU schedule: schedule_cpus
25-Mar-2009 08:10:54 [---] [state_debug] set dirty: schedule_cpus
25-Mar-2009 08:10:54 [GPUGRID] [debt] CPU ineligible; debt 0.00
25-Mar-2009 08:10:54 [---] [debt] CUDA: no eligible projects
25-Mar-2009 08:10:54 [---] [status_debug] CLIENT_STATE::write_state_file(): Writing state file
25-Mar-2009 08:10:54 [---] [status_debug] CLIENT_STATE::write_state_file(): Done writing state file
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_cc_status/>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<cc_status>
<network_stat'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_messages>
<seqno>9318</seqno>
</get_messages>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<msgs>
<msg>
<project></pro'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_cc_status/>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<cc_status>
<network_stat'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_messages>
<seqno>9322</seqno>
</get_messages>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<msgs>
<msg>
<project></pro'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_results/>
</boinc_gui_rpc_request>
'
25-Mar-2009 08:10:56 [---] [guirpc_debug] GUI RPC reply: '<boinc_gui_rpc_reply>
<results>
<result>
<name'
25-Mar-2009 08:10:56 [GPUGRID] Started upload of up388195-pYEpYV3_xyUS30000-1-10-ignasi_1_0
25-Mar-2009 08:10:56 [GPUGRID] [file_xfer_debug] URL: http://www.ps3grid.net/PS3GRID_cgi/file_upload_handler
25-Mar-2009 08:10:56 [GPUGRID] Started upload of up388195-pYEpYV3_xyUS30000-1-10-ignasi_1_1
25-Mar-2009 08:10:56 [GPUGRID] [file_xfer_debug] URL: http://www.ps3grid.net
25-Mar-2009 08:11:00 [GPUGRID] [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
25-Mar-2009 08:11:00 [GPUGRID] [file_xfer_debug] parsing upload response: <data_server_reply>
<status>0</status>
</data_server_reply>
25-Mar-2009 08:11:00 [GPUGRID] [file_xfer_debug] parsing status: 0
25-Mar-2009 08:11:00 [GPUGRID] [file_xfer_debug] file transfer status 0
25-Mar-2009 08:11:00 [GPUGRID] Finished upload of up388195-pYEpYV3_xyUS30000-1-10-ignasi_1_0
25-Mar-2009 08:11:00 [GPUGRID] [file_xfer_debug] Throughput 27567 bytes/sec
25-Mar-2009 08:11:00 [GPUGRID] Started upload of up388195-pYEpYV3_xyUS30000-1-10-ignasi_1_2
25-Mar-2009 08:11:00 [GPUGRID] [file_xfer_debug] URL: http://www.ps3grid.net/PS3GRID_cgi/file_upload_handler
25-Mar-2009 08:11:00 [---] [state_debug] set dirty: pers_file_xfer_set poll
25-Mar-2009 08:11:00 [---] [status_debug] CLIENT_STATE::write_state_file(): Writing state file
25-Mar-2009 08:11:01 [---] [status_debug] CLIENT_STATE::write_state_file(): Done writing state file


Any ideas?

Rob

jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Message 7889 - Posted: 27 Mar 2009 | 0:13:45 UTC - in response to Message 7827.

I tried to capture some more errors using the new statefile debug output in 6.6.18:


26-Mar-2009 23:59:21 [GPUGRID] [app_msg_send] sent heartbeat to up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0
26-Mar-2009 23:59:21 [GPUGRID] [app_msg_receive] got msg from slot 4: <current_cpu_time>5.587501e+001</current_cpu_time>
<checkpoint_cpu_time>4.692188e+001</checkpoint_cpu_time>
<fraction_done>3.968000e-002</fraction_done>
<fpops_cumulative>2.557516e+015</fpops_cumulative>
<intops_cumulative>2.557516e+015</intops_cumulative>
26-Mar-2009 23:59:22 [GPUGRID] [app_msg_send] failed to send heartbeat to up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0
26-Mar-2009 23:59:23 [GPUGRID] [app_msg_send] failed to send heartbeat to up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0
26-Mar-2009 23:59:24 [GPUGRID] [app_msg_send] failed to send heartbeat to up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0
26-Mar-2009 23:59:25 [GPUGRID] [app_msg_send] failed to send heartbeat to up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0
26-Mar-2009 23:59:26 [GPUGRID] [app_msg_send] failed to send heartbeat to up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0
26-Mar-2009 23:59:27 [GPUGRID] [task_debug] Process for up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0 exited
26-Mar-2009 23:59:27 [GPUGRID] [task_debug] task_state=EXITED for up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0 from handle_exited_app
26-Mar-2009 23:59:27 [GPUGRID] [task_debug] result state=COMPUTE_ERROR for up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0 from CS::report_result_error
26-Mar-2009 23:59:27 [GPUGRID] [task_debug] Process for up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0 exited
26-Mar-2009 23:59:27 [GPUGRID] [task_debug] exit code 1 (0x1): Incorrect function. (0x1)
26-Mar-2009 23:59:28 [---] [statefile_debug] set dirty: ACTIVE_TASK_SET::poll
26-Mar-2009 23:59:28 [GPUGRID] Computation for task up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0 finished
26-Mar-2009 23:59:28 [GPUGRID] [task_debug] result state=COMPUTE_ERROR for up524947-pYIpYV3_xyUS1480000-4-10-ignasi_0 from CS::app_finished
26-Mar-2009 23:59:28 [---] [statefile_debug] set dirty: handle_finished_apps
26-Mar-2009 23:59:28 [---] [rr_sim] rr_sim start: work_buf_total 181440.00
26-Mar-2009 23:59:28 [---] [statefile_debug] Writing state file
26-Mar-2009 23:59:28 [---] [statefile_debug] Done writing state file
26-Mar-2009 23:59:28 [---] [guirpc_debug] GUI RPC Command = '<boinc_gui_rpc_request>
<get_cc_status/>
</boinc_gui_rpc_request>
'
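
The pattern in the log above (a run of failed heartbeats followed by the process exiting) can be picked out of a long stdoutdae.txt automatically. The following is a rough sketch, assuming the log lines have the same layout as the excerpt above; the function name summarize_heartbeat_failures and the regular expressions are illustrative and not part of BOINC.

```python
import re
from collections import defaultdict

# "26-Mar-2009 23:59:22 [GPUGRID] [app_msg_send] failed to send ..."
LINE_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[([^\]]*)\] (.*)$"
)
FAIL_RE = re.compile(r"\[app_msg_send\] failed to send heartbeat to (\S+)")
EXIT_RE = re.compile(r"\[task_debug\] Process for (\S+) exited")


def summarize_heartbeat_failures(lines):
    """Map task name -> (failed heartbeat count, first failure time, exit time).

    Only tasks that exited after at least one failed heartbeat are reported.
    """
    failures = defaultdict(list)
    summary = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines such as <fraction_done>...</fraction_done>
        timestamp, _project, message = match.groups()
        failed = FAIL_RE.search(message)
        if failed:
            failures[failed.group(1)].append(timestamp)
            continue
        exited = EXIT_RE.search(message)
        if exited:
            task = exited.group(1)
            # Record only the first exit line; the log repeats it.
            if task in failures and task not in summary:
                summary[task] = (len(failures[task]), failures[task][0], timestamp)
    return summary


if __name__ == "__main__":
    # Path is an assumption; point it at your own BOINC data directory.
    with open("stdoutdae.txt", encoding="utf-8", errors="replace") as log:
        for task, info in summarize_heartbeat_failures(log).items():
            print(task, info)
```

This only shows when heartbeats stopped getting through relative to the exit; it says nothing about why they failed.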


Rob

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 7900 - Posted: 27 Mar 2009 | 9:16:43 UTC

Thanks for your effort, but I'm not sure what anyone could do with this. I thought the reason for this was somewhat similar to the remote desktop problem: upon a connect or user switch, Windows takes the "graphical session" away from any programs still running under the old user, and thus BOINC can't find the GPU any more.

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 7909 - Posted: 27 Mar 2009 | 15:26:17 UTC

The no-heartbeat message can come from a number of places and, if memory serves, there is a longstanding bug associated with it. I ran into it recently because of an issue with my Internet connection. It went down in some strange way, and THAT meant I got communication hangs; eventually, since the heartbeat messages are sent over IP, this caused work on several systems for multiple projects to die ...

I don't know if that is happening here or not... or if there is something else going on ...

But, the two things that worry at me are the installing as service, and the installing with the files on drive E:

They are PROBABLY not the causes, but, I have to wonder if C: is so full that BOINC has to be on E:, might the system be resource starved? That resource might be the page file and that could cause delays and that could cause the loss of heartbeats ...

jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Message 7918 - Posted: 27 Mar 2009 | 22:41:08 UTC - in response to Message 7909.


But, the two things that worry at me are the installing as service


I tried installing BOINC as a service because the problem originally appeared when BOINC was installed as just an application. I thought it might be a permissions problem.


, and the installing with the files on drive E:

They are PROBABLY not the causes, but, I have to wonder if C: is so full that BOINC has to be on E:, might the system be resource starved? That resource might be the page file and that could cause delays and that could cause the loss of heartbeats ...


I installed BOINC on E: purely for better read/write times, rather than having it on the OS/swap file drive. My swap file is set to 4096 MB and there is plenty of space on C:.


I thought the reason for this was somewhat similar to the remote desktop problem: upon a connect or user switch, Windows takes the "graphical session" away from any programs still running under the old user, and thus BOINC can't find the GPU any more.


That sounds feasible to me and could explain why only the GPU tasks fail and not the CPDN tasks.

As I initially said, the problem never happens when logged in as a single user, only when multiple users are logged in simultaneously. Unfortunately, I cannot prevent this machine from being shared.

Thanks for your suggestions.

Rob

jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Message 8053 - Posted: 2 Apr 2009 | 6:15:05 UTC - in response to Message 7918.

Should we (I) be reporting this directly to Boinc instead of posting pleas on this board?

Rob

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8070 - Posted: 2 Apr 2009 | 17:14:56 UTC - in response to Message 8053.

Should we (I) be reporting this directly to Boinc instead of posting pleas on this board?

If you want answers, this is the place to start. Once you have a clear idea of what the problem is, and a possible cure, you should by all means post on the mailing lists, either Dev or Alpha as appropriate.
