pdev data taking measurements

LAST MOD: 17may18


230329: pset error timeouts caused by disc i/o slowdown.
180517: disc i/o fails with 40 Mbyte/sec voltage sampling.
23feb09: scp'ing data while running at 65 Mbytes/sec.
02dec08: pdevs1 disk doesn't keep up.
monitoring cpu/disc statistics with iostat
pdev raid disc i/o rates.

180517: disc i/o fails with 40 Mbyte/sec voltage sampling.

    s3279 was using lbw and the mock spectrometer to take some rfi data. The mock setup was:

    During the run, the mock program aborted with "FPGA1 PI fifo error 0x0470" or "FPGA1: DMA buffer overflow". The buffer pool monitor data was examined to see what was going on.

The Buffer pool:

    The mock data taking uses 50 shared-memory buffers of 20 Mbytes each (1 Gbyte total) for i/o buffering to disc. The process is: bpInp takes a free buffer from the freelist queue, fills it with data from the spectrometer, and passes it to bpOut, which writes it to disc and returns the buffer to the freelist.
    The bpMon program continually monitors the status of the shared-memory freelist queue as well as the bpInp and bpOut queues, recording how many free buffers remain.
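
    A minimal sketch of this freelist/bpInp/bpOut scheme is shown below. It is an illustration only, not the actual pdev code: the real system passes 20 Mbyte shared-memory buffers between separate processes, while here python threads and queues stand in for the shared memory, with the buffer size shrunk so the example runs anywhere.

import queue
import threading
import time

NBUFS   = 50
BUFSIZE = 20 * 1024                      # stands in for the real 20 Mbytes

freelist = queue.Queue()                 # empty buffers waiting to be filled
outq     = queue.Queue()                 # filled buffers waiting for the disc
for _ in range(NBUFS):
    freelist.put(bytearray(BUFSIZE))

def bp_inp(nbufs):
    """Input side: grab a free buffer, fill it, queue it for output.
    If the freelist empties, the hardware fifo backs up; that is the
    'DMA buffer overflow' abort seen in the s3279 run."""
    for _ in range(nbufs):
        try:
            buf = freelist.get(timeout=1.0)
        except queue.Empty:
            print("freelist empty: data taking would abort here")
            break
        buf[:] = bytes(BUFSIZE)          # stand-in for copying the fpga data
        outq.put(buf)
    outq.put(None)                       # tell the output side we're done

def bp_out():
    """Output side: write each buffer to disc, return it to the freelist."""
    while True:
        buf = outq.get()
        if buf is None:
            break
        time.sleep(0.001)                # stand-in for the disc write
        freelist.put(buf)

writer = threading.Thread(target=bp_out)
writer.start()
bp_inp(200)
writer.join()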

The s3279 disc writing errors:

    When voltage sampling is used, the i/o programs only use the 1st 1 Mbyte of each 20 Mbyte buffer for recording data. So instead of 20 Mbytes * 50 bufs = 1 Gbyte of buffer memory, there are only 50 * 1 = 50 Mbytes of buffer space available. This was initially chosen so there wasn't too long a delay between buffer outputs when slow voltage data sampling was used.
   
    s3279 was taking data at 40 million bytes/sec (around 38.15 2^20 bytes/sec). 50 1-Mbyte buffers gave a time slop of only 1.25 seconds.
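
    A quick check of that number (hypothetical arithmetic, using decimal Mbytes):

nbufs, used_per_buf, rate = 50, 1e6, 40e6     # bufs, bytes used per buf, bytes/sec
print(nbufs * used_per_buf / rate, "sec")     # -> 1.25 sec of buffering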

The plots show the number of free buffers available during the run (.ps) (.pdf):

SUMMARY:

processing: x101/180517/bpmonchk.pro

23feb09: scp data while taking data

    The scp program was run from galfas1 to copy a 2 Gbyte file from pdevs1 to galfas1 while pdevs1 was taking data in galfacts mode (65 Mbytes/sec). scp was run at the default nice level. The transfer completed without any aborts from data taking. The free buffer count went from 52 down to 41 buffers (20 Mbytes each) and stayed there even after the scp finished. Top was run and the cpu utilization was recorded during and after the scp:
-----------------------------------------------------------------------------------------------------
While scp pdevs1 to galfas1 was running with data taking:

top - 12:05:33 up 13 days, 17:29,  4 users,  load average: 1.78, 0.88, 1.12
Tasks: 169 total,   3 running, 165 sleeping,   0 stopped,   1 zombie
Cpu(s): 16.1%us, 16.3%sy,  0.0%ni, 43.8%id,  6.1%wa,  1.3%hi, 16.3%si,  0.0%st
Mem:   4033376k total,  3831064k used,   202312k free,     1116k buffers
Swap: 10233384k total,      196k used, 10233188k free,  3507700k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20726 phil      15   0 73888 1696  972 R   66  0.0   0:16.92 sshd
20264 root      16   0 1073m 995m 957m S   32 25.3   1:56.83 bpOut
20721 root      15   0 1031m 469m 469m S   20 11.9   0:07.33 psrv
20260 root      18   0 1041m 957m 957m S   20 24.3   1:08.09 bpInp
20720 root      15   0 1031m 732m 732m S   20 18.6   0:07.41 psrv
20727 phil      15   0 33180 1352 1024 R   10  0.0   0:02.43 scp

----------------------------------------------------------------------------------------------------
scp not running, data taking still running

top - 12:08:07 up 13 days, 17:31,  4 users,  load average: 2.77, 1.84, 1.46
Tasks: 166 total,   2 running, 163 sleeping,   0 stopped,   1 zombie
Cpu(s):  7.7%us, 10.4%sy,  0.0%ni, 71.6%id,  1.6%wa,  0.5%hi,  8.2%si,  0.0%st
Mem:   4033376k total,  3994320k used,    39056k free,      496k buffers
Swap: 10233384k total,      196k used, 10233188k free,  3665700k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20264 root      16   0 1073m 995m 957m S   40 25.3   2:53.05 bpOut
20260 root      18   0 1041m 957m 957m S   17 24.3   1:37.26 bpInp
20721 root      15   0 1031m 863m 863m R   13 21.9   0:30.77 psrv
20720 root      15   0 1031m 938m 938m S   11 23.8   0:31.54 psrv

sshd is using 66% of 1 cpu, probably doing the encryption.

processing: x101/090223/test.doc


02dec08: pdevs1 disk doesn't keep up. (top)

    Galfacts was running on 01dec08-02dec08 with 65 Mbytes/sec per box. At 00:27:10 an abort message was written to the message.log. The circumstances were:
The table below shows when the last few disc files were generated and their last modification times. The open times come from the message log: the message is written after the file has been successfully opened and the fits header info has been written. Column 2 is pdevs1, column 3 is pdevs2. For each pdev two files are generated, b0s0g0 and b0s1g0, for the two 170 MHz bands.

Times when fits files were opened.
(times are hh:mm:ss, with seconds after midnight in parens; s0/s1 are the b0s0g0/b0s1g0 files; difs are in seconds, pdevs1 minus pdevs2)

fileNum   pdevs1 open (s0, s1)               pdevs2 open (s0, s1)               openDif (s0, s1)   lastMod s0 (p1_p2)    lastMod s1 (p1_p2)    lastModDif (s0, s1)
00286     00:23:36 (1416), 00:23:36          00:23:36 (1416), 00:23:36          0, 0
00287     00:24:41 (1481), 00:24:42          00:24:41 (1481), 00:24:41          0, 1               00:25:47_00:25:46     00:25:46_00:25:47     1, -1
00288     00:25:47 (1547), 00:25:46 (1546)   00:25:46 (1546), 00:25:47 (1547)   0, 0               00:27:04_00:26:52     00:26:58_00:26:52     12, 6
00289     00:27:08 (1628), 00:27:04 (1624)   00:26:52 (1612), 00:26:52 (1612)   16, 12
abortMsg  00:27:10 (1630), 00:27:10 (1630)   00:27:10 (1630), 00:27:10

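    The lastMod times above come from the files' modification times on disc. A hypothetical helper (not the analysis code actually used here) to read a file's last-mod time in the secMidnite convention:

import os
from datetime import datetime

def lastmod_secmidnite(path):
    """Return a file's last-modification time as seconds after midnight."""
    m = datetime.fromtimestamp(os.stat(path).st_mtime)
    midnight = m.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((m - midnight).total_seconds())
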
    The scramnet pnt, agc info is logged to disc on pdevs1 once a second. Three times are recorded each second:
The plot shows the scramnet times close to the pnetctl abort (.ps) (.pdf):
Looking at the raid disc controller info: pdevs1 disk 5 had 43 timeouts. None of the other discs in the pdevs1 raid, nor any discs in the other pdevsN raids, had any timeouts.

summary


Monitoring pdev server cpu & disc with iostat.  (top)

    The iostat program is run on the pdev servers for about 12 hours each evening. Galfacts (65 Mbytes/sec) and a2133 (AUDS) run during this period. We would like to see what the cpu and disk are doing during data taking (especially during galfacts aborts).
    The iostat program is set to average over 10-second intervals and then record the info to disc. This info is then plotted. It contains:
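
    As a rough stand-in for the collection side (this is not the actual runiostat script; it assumes iostat from the sysstat package, where "iostat -x 10" prints extended device stats every 10 seconds):

import subprocess
import time

LOGFILE = "/tmp/iostat.log"              # hypothetical log path
HOURS   = 12                             # roughly the evening observing window

with open(LOGFILE, "a") as log:
    proc = subprocess.Popen(["iostat", "-x", "10"],
                            stdout=log, stderr=subprocess.STDOUT)
    try:
        time.sleep(HOURS * 3600)
    finally:
        proc.terminate()                 # stop iostat at the end of the window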


Daily monitoring of iostat on pdevsN servers

Date (start date)      Notes
29nov11 (.ps) (.pdf)   programs run:
                       • (14.3-16.5): p2627 voltage sampling, 40 Mbytes/sec
                         • io rate 40 Mbytes/sec
                         • no bit flipping of spectra, so the cpu is not busy
                         • 15.2: burst of 100 Mbytes/sec. They stopped to switch feeds; when they started back up it got behind for awhile.
                       • (16.5-19.25): p2030 pulsar search
                       • (19.5-01.25): a2130 galfacts
                         • the red read rate is probably the p2030 compression running after the p2030 run
19nov09 (.ps) (.pdf)   ok. alfalfa[-6:0], galfacts[0:6]
14dec08 (.ps) (.pdf)   ok
13dec08 (.ps) (.pdf)   no galfacts this night
09dec08 (.ps) (.pdf)   ok
07dec08 (.ps) (.pdf)   ok
06dec08 (.ps) (.pdf)   06dec08 and before only monitored pdevs1
05dec08 (.ps) (.pdf)   disc 5 of the pdevs1 raid array was changed before the run.
                       An azimuth power supply fault stopped data taking. There were no i/o aborts.
04dec08 (.ps) (.pdf)
03dec08 (.ps) (.pdf)   galfacts aborted around 1:29:21 am.
                       2nd page is a blowup near the failure. Note the large iowait.

processing: svn/pdev/pdev/monitor/runiostat to collect the data. x101/pdev/iostat/pltday.pro to make plot.

Pdev raid disc i/o rates: (top)

    The program discwr2 (/share/megs/phil/svn/pdev/pdev/scripts/discwr2) tests the raid disc writing speed on each pdevsN server.
The raid measurements can be found in the discwr2 output file.

    The program discwr2tmp tests the single-disc i/o rates (no raid speedup). The single disc is normally used for the galfacts compressed output.
The single-disc measurements can be found in the discwr2tmp output file.
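
    A minimal stand-in for this kind of test (not the actual discwr2 script; the path and sizes are made up): stream a large file to the disc under test, fsync, and report the rate.

import os
import time

PATH    = "/tmp/discwr2_test.dat"        # hypothetical file on the disc under test
CHUNK   = bytes(20 * 2**20)              # 20 Mbyte writes, matching the buffer size
NCHUNKS = 100                            # 2 Gbytes total

t0 = time.time()
with open(PATH, "wb") as f:
    for _ in range(NCHUNKS):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())                 # make sure the data really hit the disc
dt = time.time() - t0
print("%.1f Mbytes/sec" % (NCHUNKS * len(CHUNK) / 2**20 / dt))
os.remove(PATH)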

History:

