The Big Data Program: Arecibo Observatory Data Archive

#AOScienceNow: Big Data



Through the Big Data Program at the Arecibo Observatory (AO), we are developing the Arecibo Archives Data Catalog to facilitate access to AO's projects, observations, datasets, and attributes. Approximately half of the AO database is currently available in the catalog: https://www.naic.edu/datacatalog/

The purpose of the Data Catalog is to provide a user-friendly portal where users can browse, query, and explore the projects observed at Arecibo over more than 55 years. The catalog consolidates multiple data sources that have been built throughout AO's operation. Its main component is the Projects Catalog, which provides all of the technical information about a proposal or project. This is essentially the information that scientists submit in a proposal to receive Arecibo observing time. The Projects Catalog is complemented by the Observations Log, a Files Catalog, and an Attributes Catalog. The Observations Log provides a detailed log recorded by the observing scientists for each project. The Files and Attributes catalogs cover all of the raw data files that were captured in the observations, as well as key metadata of those files.

To build this catalog, the Big Data team first worked to identify and catalog all of the projects that have been carried out at Arecibo. This was no easy task, since the project documents were stored in many formats over the years. For each format, the team created scripts that scraped or extracted the technical information from the documents and saved it into a database. This first step is the foundation of the Data Catalog.
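As a rough illustration of this extraction step, the sketch below parses a plain-text proposal made of "Field: value" lines and stores the result in a SQLite table. The document layout, the field names (`Project ID`, `Title`, `PI`), the `projects` table, and the function names are all assumptions made for this example; the article does not describe the actual formats or scripts used by the team.

```python
import re
import sqlite3

# Hypothetical layout: one "Field: value" pair per line.
FIELD_PATTERN = re.compile(r"^(?P<key>[A-Za-z ]+):\s*(?P<value>.+)$")


def parse_proposal(text):
    """Return a dict of lower-cased field names to values found in `text`."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            key = match.group("key").strip().lower()
            fields[key] = match.group("value").strip()
    return fields


def save_project(conn, fields):
    """Insert one parsed proposal into a hypothetical `projects` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS projects ("
        "project_id TEXT PRIMARY KEY, title TEXT, pi TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO projects VALUES (?, ?, ?)",
        (fields.get("project id"), fields.get("title"), fields.get("pi")),
    )
    conn.commit()
```

In practice, each document format would need its own parser feeding the same table, which is why one script per format is a natural design.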

In a similar way, the team compiled the Observations Log from log information that existed in different locations. Most of the observations were already saved in a database, which made them easier to integrate into the catalog. The Files Catalog is being built as the datasets are copied to the Texas Advanced Computing Center (TACC). Once a dataset is copied, the team catalogs it and creates a record for it in the Catalog Database, recording the file location, corresponding project, and size. Finally, the Attributes Catalog is being actively populated by extracting headers, metadata, and attributes from the raw files. Scripts navigate the server's paths and extract the attributes from each file. These attributes are then saved into a database that records all scientific attributes along with the related file name and project.
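The file-cataloging step described above can be sketched as a directory walk that records each file's location, project, and size. This is a minimal sketch, not the team's actual script: the `files` table, the `catalog_files` function, and the assumption that the first directory level under the root names the project are all hypothetical.

```python
import os
import sqlite3


def catalog_files(root, db_path=":memory:"):
    """Walk `root` and record each file's path, project, and size in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, project TEXT, size_bytes INTEGER)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Assumption: the first directory under `root` is the project ID.
            parts = os.path.relpath(full_path, root).split(os.sep)
            project = parts[0] if len(parts) > 1 else None
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (full_path, project, os.path.getsize(full_path)),
            )
    conn.commit()
    return conn
```

Attribute extraction would follow the same walk, with a format-specific header reader called on each file in place of the size lookup.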

The importance of this catalog is hard to overstate. It is a stepping stone toward making Arecibo's datasets accessible to the scientific community and to curious minds. The Data Catalog project is a computing strategy that will make the necessary data and resources widely available to the scientific community, continuing the Arecibo Observatory’s legacy of enabling groundbreaking new results about our atmosphere, our Solar System, and our universe.




Article written by Eng. Julio Alvarado Negrón

Arecibo Media Contact
Ricardo Correa
Universidad Ana G. Méndez (UAGM)
787-878-2612 ext. 615
rcorrea@naic.edu

Big Data Manager
Eng. Julio Alvarado
University of Central Florida / Arecibo Observatory
julio.alvarado@ucf.edu

Keywords: observatory, arecibo, data, big, data, catalog, texas, TACC, advanced, computing