CC: Taylor Webb <taylorwebb85@yahoo.com>, Dr Warnick <warnick@ee.byu.edu>,
   "Dr. Jeffs" <bjeffs@ee.byu.edu>, Vikas Asthana <vikasfeb8@gmail.com>,
   Mike Elmer <mike_elmer@yahoo.com>, Dave Carter <theeyrehead@hotmail.com>,
   German Cortes <gc76@cornell.edu>, R Ganesan <ganesh@naic.edu>
Message-ID: <F55B3D71-B477-475B-B18F-01724E678C58@byu.edu>
From: Brian Jeffs <bjeffs@BYU.EDU>
To: Phil Perillat <phil@naic.edu>
In-Reply-To: <4C24748A.7040000@naic.edu>
Content-Type: text/plain; charset="US-ASCII"; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0 (Apple Message framework v936)
Subject: Re: Finishing the single pole observations
Date: Fri, 25 Jun 2010 11:39:40 -0600
References: <990554.30651.qm@web54306.mail.re2.yahoo.com> <8D3C36FA-105D-40EF-9E57-256BCE78585C@byu.edu> <4C24748A.7040000@naic.edu>
Status:   

Phil and all,

This is a new type of problem we have not seen before.  Some thoughts  
and observations:

1.  The fact that the problem occurs even with the external firewire  
RAID array drive indicates this is not a problem with the onboard RAID  
array drives themselves, or with the drive controller since the  
firewire drive does not use the internal hard drive controller.  We  
have had no problems in the past with direct streaming of data to the
external drive.

2.  Are we sure this has not been happening all along in our super- 
fine scans?  Could you look at the earlier file sizes and see if you  
see any drop outs (short files)?  The fact that Karl was able to  
process all of the data argues against this as a possibility, but I  
think he did mention once that he was only using a central set of  
blocks, e.g. a one second window of correlations.  Perhaps the shorter  
files were long enough to give him that much data and he did not  
notice the problem??  If this is the case then the new data is still  
usable.
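
A minimal sketch of the file-size check suggested in item 2.  It assumes
all of the 4 second files from one scan sit in a single directory and
that a normal file is the typical (median) size, so anything well below
the median is flagged as short.  The function name and the 95% threshold
are illustrative choices, not part of the existing acquisition code.

```python
import os
import statistics


def find_short_files(directory, min_fraction=0.95):
    """Return sorted (name, size) pairs for files smaller than
    min_fraction times the median file size in `directory`.

    Assumption: full-length files all have roughly the same size, so the
    median is a fair estimate of the expected size.
    """
    sizes = []
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes.append((entry.name, entry.stat().st_size))
    if not sizes:
        return []
    median_size = statistics.median(size for _, size in sizes)
    threshold = min_fraction * median_size
    return sorted((name, size) for name, size in sizes if size < threshold)
```

Running this over the directories from the earlier super-fine scans
would show whether short files were already present before this week.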

3.  The "available samples per channel" display shows how much of the  
in-PC-memory FIFO buffer is being used.  This is not the on-card  
buffer.   We have seen this buffer overflow like this before under the  
following conditions:
a) the sample clock is too high so within a few minutes or seconds the  
buffer overflows because the write-to-hard drive process cannot keep  
up (We know our clock rate of 1.25 Msamp/s is sustainable for 20  
channels);
b) the hard drives are getting near full or are highly fragmented so  
disk response time bogs down (we have eliminated this possibility);
c)  the phantom Windows process.  For years we have been unable to
identify what Windows is doing when, at about 26 minutes of
continuous sampling every time, the buffer overflows due to slow PC
response to I/O demands.  At higher sample rates this occurs sooner,
but the pattern is always the same: buffer loading stays small and
stable until the buffer suddenly balloons and things crash.  There
could be some new, unknown Windows activity going on that hits us
randomly and causes the buffer to overflow.  It would not show up on a
Task Manager check prior to, or during, a successful run because the
processes are likely started by Windows just prior to the buffer
crash, and you would have a hard time catching them.
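
To put the buffer numbers in time units, here is a small sketch that
converts an "available samples per channel" reading into seconds of
unwritten data.  It assumes the display counts unread samples per
channel in the PC-memory FIFO and that 1.25 Msamp/s is the per-channel
sample clock; the function name is illustrative.

```python
SAMPLE_RATE = 1.25e6  # samples per second per channel (assumed per-channel rate)


def backlog_seconds(available_samples_per_channel, sample_rate=SAMPLE_RATE):
    """Seconds of data waiting in the FIFO for a given display reading."""
    return available_samples_per_channel / sample_rate


# Normal readings (200 to 500) versus the readings seen with short files.
for reading in (200, 500, 2_000_000, 3_000_000):
    print(reading, "samples/channel =", backlog_seconds(reading), "s of backlog")
```

Under these assumptions the normal readings of 200 to 500 are well under
a millisecond of backlog, while readings of 2 to 3 million correspond to
roughly 1.6 to 2.4 seconds during which the write-to-disk side fell
behind, a large fraction of one 4 second file.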

4.  Internet traffic?  This is wild speculation, but if there is some  
network traffic that the PC has to handle to see if the packets are  
intended for it, then this could be a cycle stealing process.  The  
network card I/O is interrupt driven, and without a DNS server on this
network, and with a different configuration for this local network
than we are generally used to, perhaps network activity, even traffic
not intended for the acquisition PC, can demand cycles and put
acquisition behind.


Possible solutions (in my preferred order of priority):

1.  Re-image the PC.  We sent a GHOST image copy, so I think we should  
immediately install that to restore the OS to the state it was in when  
we left Provo.  I think this is a priority must-do before attempting  
to take any more data on the LabView system.  Make sure to first save  
out to external hard drives all of the current VI codes and any other  
configuration files that may have been changed or updated since the  
image was taken here.

2.  Use a different remote desktop computer?  This is the first time
Vikas's computer has been used, and that is when the trouble seemed to
start.
Is his network connection to the acquisition machine set up the same  
way as Karl's was?  I was not involved, so I do not know.  Bogging  
down with remote desktop data transfers could have an effect on  
performance.  We had no problem though before when both Karl's and  
Mike's laptops were simultaneously running remote desktops.

3.  Swap the server PCs.  The one in the receiver room (TAU) is almost
identical and is intended as a backup.  If what we are seeing is a new  
hardware problem, this should solve it.

4.  Move to the Adlink 40-channel data acquisition system now instead
of Saturday.

I will be at my work number the rest of the work day, and at home most  
of the evening.  I will also have my cell phone on.

Good luck!

Brian

work:  801-422-3062
home: 801-225-5727
cell:  801-471-5801


Brian D. Jeffs
Professor
Department of Electrical and Computer Engineering
459 CB
Brigham Young University
Provo, UT 84602
USA

email: bjeffs@ee.byu.edu
phone: (801) 422-3062
FAX: (801) 422-0201


On Jun 25, 2010, at 3:19 AM, Phil Perillat wrote:

> Everyone,
>    Here is an update on the 24jun10 evening, 25jun10 morning  
> observation:
>
> - We had trouble with short files being recorded on the byu system
> throughout the evening and the morning.
> - No data: super fine grid, w51 map, grids tracking source in za  was
> completed successfully.
> - The super fine A2 grid was taken, but it had 40 short files out of
> the  963 4 second files.
> - The failures could be intermittent rather than continuous. We ran
> for more than 50 4 sec integrations on the A2 super fine grid without
> a short file, then it would start again.
>
> What we tried (without success)
>
> - stopping starting matlab.. problems continued.
> - rebooting the computer upstairs..
> - reformatting the F drive
>   we did this once after the computer reboot, once later in the
> evening. In both cases the problem persisted.
> - we switched and ran on the external drive for data recording. We
> still got short files.
>
> Looking at the labview display we could see the "available samples
> per channel" entry get large when there was a short file. This value
> would normally oscillate between 200 and 500. When it jumped to 2 to
> 3 million we would also see a short file.
>
> As far as I can tell, the vi is using a timer to decide when a 4
> second observation has completed. This assumes a
> constant i/o rate. If the i/o rate slows down on the input or output
> then you'd get a short file.
> Since the input buffers filled up, it points to a problem with the
> output.
>
> We monitored the cpu activity and didn't see anything active. No
> obvious extra tasks were seen.
> If the cpu was slowing down, then the input and output could both
> slow down.
> I'm not sure whether the large available samples per channel entry is
> out on the a/d cards or already in the
> matlab buffers that have read the data.
>
>
> phil
> ps. a log of what we tried to observe is  at;
> http://www.naic.edu/~phil/hardware/byuPhasedAr/byuPhasedAr.html#logfiles
> then go down to 24jun10.
>
>