Generating the list of files

jan23

How the lists were generated

the lists of files

Intro.

    Data directories at AO were searched to generate lists of the files that belong to the various datasets (projects) taken at AO. This page describes how the lists were generated.

    Arun has set up a master directory (/share/projdir/) that has entries for most projects taken at AO. It includes files that are online and, often, links to files that are no longer online. This was the starting point for the searches.

ABBREVIATIONS:

    BKfiles: bookkeeping files

--> ToDo : 230130 ..




 Steps to make the file lists:

    The scripts and IDL routines are located in /share/megs/phil/x101/shutdown/archive/projscan/. The general outline is:


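Each step below gives, where available: the step, the command (cmd), its input (inp) and output (out), notes, and the date it was last run (lastrun, yymmdd).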
step: create ./proj/xxxxx ascii files
  • First create a list of all files (dirs) in /share/projdir
    • output to -> lists/projdir.list
  • Run find -L xxxx on each entry of projdir.list
    • ./proj/xxxxx holds an ascii list of the files in xxxx (and its subdirs)
  • --> The routine excludes:
    • any /share/projdir/xxxx starting with
      • r (radar data)
      • d (discs.. BKfiles)
      • l (linkxxx BKfiles)
      • m (md5 BKfiles)
      • F (FILECONTENT... BKfiles)
cmd: mkprojlist.sc
out: proj/xxx
notes: Each xxx lists the files in proj xxx. This includes files online and any links that point to non-online files. It also includes some BKfiles. It excludes radar projects.
lastrun: 230130
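
The logic of this first step can be sketched in Python as below. This is only an illustration, not the contents of mkprojlist.sc: the function name make_proj_lists and its arguments are made up for the example, and os.walk(followlinks=True) stands in for find -L.

```python
import os

# Leading characters of /share/projdir entries that are skipped
# (r=radar, d=discs, l=linkxxx, m=md5, F=FILECONTENT), as listed above.
SKIP_PREFIXES = ("r", "d", "l", "m", "F")


def make_proj_lists(projdir, outdir):
    """Write one ascii file per project holding every path below it.

    Hypothetical helper (not the author's script). Symlinks are followed,
    as with 'find -L'. Unlike find, os.walk does not detect link loops,
    so a link that points back at a parent directory would recurse forever.
    Returns the sorted names of the projects that were listed.
    """
    os.makedirs(outdir, exist_ok=True)
    kept = []
    for name in sorted(os.listdir(projdir)):
        if name.startswith(SKIP_PREFIXES):
            continue
        top = os.path.join(projdir, name)
        paths = []
        if os.path.isdir(top):
            for root, dirs, files in os.walk(top, followlinks=True):
                # Broken links show up in 'files', so they are listed too.
                for entry in sorted(dirs) + sorted(files):
                    paths.append(os.path.join(root, entry))
        else:
            paths.append(top)
        with open(os.path.join(outdir, name), "w") as fout:
            fout.write("\n".join(paths) + ("\n" if paths else ""))
        kept.append(name)
    return kept
```

A call such as make_proj_lists("/share/projdir", "./proj") would leave one list file per kept project in ./proj, including directories and links that no longer resolve.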
step: create ./projfixed/xxxxx ascii files
  • Exclude some xxxx that are duplicate links
    • XALFA,XLAFA,x101cor,x101mock,x101data,x101wapp
  • In xxxx exclude filenames that contain:
    • aaa/.log (log files)
    • FILELIST
    • .del (usually bookkeeping with .deleted)
    • .ls  (a bookkeeping listing file)
    • link,cron
  • Exclude any filenames that are directories
  • Exclude files that end in:
    • ~.
  • For t1193 proj
    • exclude t3323_20190703a_usrp_test/  (dup link)
  • For x101 proj
    • exclude x101/mock/pdevs (dup links)
  • For t2875 exclude dirs:
    • t2875/aeron5.aolcj04/, t2875/aeron5.aolck4 (dups)
  • For s3043 exclude dir:
    • s3043/pdevs1-201912 (dup)
  • Exclude all files that are broken links (not on disc)
cmd: fixprojlist.pro
inp: ./proj/*
out:
  • ./projfixed
  • lists/fixprojlist.log
  • lists/proj.allfixed
  • fixproj.sav
notes:
  • Makes a copy of ./proj in ./projfixed holding only the files that meet the criteria.
    • There will still be some unwanted files in this set (probably links that point to the same file)
  • Summary of file counts per project in lists/fixprojlist.log
  • List of all files in lists/proj.allfixed
  • fixproj.sav has:
    • fixI[nproj], lskipAr[10]
    • fixI is a struct array holding:
         PROJ            STRING    ''
         FILESIN         LONG                 0
         FILESOUT        LONG                 0
         SKIPCNT         LONG      Array[10]
    • SKIPCNT counts how many files of each type were skipped. See lskipAr for the labels.
lastrun: 230130
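
The per-file exclusion rules of this step can be sketched in Python as below. This is an illustration, not the code in fixprojlist.pro: the names skip_reason and fix_proj_list are made up, plain substring matching is an assumption about how the rules are applied, the "~." rule is read here as "ends in ~", and the project-specific duplicate-link exclusions are left out.

```python
import os

# Substrings that mark a filename for exclusion, taken from the list above.
# Plain substring matching is an assumption about how the rules are applied.
SKIP_SUBSTRINGS = (".log", "FILELIST", ".del", ".ls", "link", "cron")


def skip_reason(path):
    """Return a label saying why path is excluded, or None to keep it."""
    for sub in SKIP_SUBSTRINGS:
        if sub in path:
            return "contains " + sub
    if path.endswith("~"):          # assumed reading of the "~." rule
        return "ends in ~"
    if os.path.isdir(path):
        return "directory"
    if not os.path.exists(path):    # False for a link whose target is gone
        return "broken link"
    return None


def fix_proj_list(paths):
    """Split paths into (kept, skip_counts) using skip_reason()."""
    kept, counts = [], {}
    for p in paths:
        reason = skip_reason(p)
        if reason is None:
            kept.append(p)
        else:
            counts[reason] = counts.get(reason, 0) + 1
    return kept, counts
```

The counts dictionary plays the role that SKIPCNT plays in fixI: one counter per kind of skipped file, with the dictionary keys standing in for the lskipAr labels.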
step: run getduppaths.pro
notes: Check that there aren't any duplicate directories (there may still be duplicate files within directories). Probably no longer needed.
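
One way to check for duplicate directories is to resolve every directory to its real path and look for paths that occur more than once. The Python sketch below shows that idea only; it is not taken from getduppaths.pro, and the function name find_dup_dirs is made up for the example.

```python
import os
from collections import defaultdict


def find_dup_dirs(dirs):
    """Group directory paths that resolve to the same real location.

    Returns a dict mapping each real path to the list of input paths
    that point at it, keeping only groups with more than one member.
    """
    groups = defaultdict(list)
    for d in dirs:
        # realpath follows symlinks, so two links to one dataset collide.
        groups[os.path.realpath(d)].append(d)
    return {real: names for real, names in groups.items() if len(names) > 1}
```

An empty result means no two of the given directories point at the same place on disc.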

step: try to count the number of various types of files
cmd: splittypes.pro
inp: proj.allfixed
out: lists/proj.type
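
Counting by type amounts to testing each filename against an ordered set of rules and tallying the first match. The Python sketch below shows the pattern using filename suffixes for the generic types only. The suffix rules are placeholders: the rules splittypes.pro actually uses are not given here, and the data types (cor, wapp, mock, ri, atm) would need their own tests. The name count_types is made up for the example.

```python
from collections import Counter

# (label, test) pairs tried in order. The matching rules are placeholders:
# they are assumptions, not the rules used by splittypes.pro.
TYPE_RULES = [
    ("tar files", lambda n: n.endswith((".tar", ".tar.gz", ".tgz"))),
    ("idl sav files", lambda n: n.endswith(".sav")),
    ("postscript", lambda n: n.endswith((".ps", ".eps"))),
    ("gif", lambda n: n.endswith(".gif")),
]


def count_types(names):
    """Tally filenames by the first rule that matches each one."""
    counts = Counter()
    for name in names:
        for label, test in TYPE_RULES:
            if test(name):
                counts[label] += 1
                break
        else:
            # No rule matched this filename.
            counts["uncategorized"] += 1
    return counts
```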





The file lists


Lists with most files included (except radar)

list of files (dirs) in /share/projdir/
  count: 795
  link: lists/projdir.list
  notes: One file holding all entries in /share/projdir/. Nothing has been excluded.
  daterun: 230130

number of entries in ./proj/ after excluding radar and a few BKfiles
  count: 591
  link: ./proj
  notes:
  • Each entry in ./proj/xxx is an ascii file holding all the files below /share/projdir/xxx/
  • Radar projs and some bookkeeping files have been removed
  • Clicking on an entry xxx will list all the files in this path.
    • This can still include links not on disc
  • There are a few duplicate entries (2 directories pointing at the same dataset)
  • The decrease from 795 to 591 is mainly due to links in /share/projdir that are no longer on disc, have 0 entries, or are radar projects.

number of files below /share/projdir
  count: 1466983
  link: lists/proj.all (88MB)
  notes: One file holding all the filenames below /share/projdir/ (excludes radar and a few BKfiles). It's a big file...





Lists after excluding duplicates, links not on disc, etc

number of entries in ./projfixed/
  count: 465
  link: ./projfixed
  notes:
  • Each entry in ./projfixed/xxx is an ascii file holding all the files below /share/projdir/xxx that meet the inclusion criteria.
  • Differences with ./proj:
    • exclude some duplicate dir links
    • exclude some BKfiles.
    • exclude directories that don't have any online files.
    • also exclude some "non-data" files.. although maybe these should have been kept.
  • The decrease of entries 591->464 is mainly dirs with no online files.

summary of projfixed/xxx
  link: lists/fixprojlist.log
  notes:
  • An ascii listing with summary info (file counts) for the various kinds of files found in proj xxx
  • The cols show how many of each kind of file were excluded

# files in lists/proj.allfixed
  count: 854959
  link: lists/proj.allfixed (51MB)
  notes:
  • One file holding all entries in ./projfixed/ .
  • Similar to lists/proj.all with lots of files excluded.



File counts by file type

finp="./lists/projall.fixed1"
nlines=readasciifile(finp,fnmAr)
type              filecnt
cor                 16488
wapp                38674
mock (35MBytes)    598838
ri                    363
atm (9.3 MBytes)   130816
tar files               5
idl sav files       32074
postscript           4541
gif                  4535
uncategorized       10263


processing: see x101/closeout/archive/Readme for the routines used.