Written by Jorge Salazar for the Texas Advanced Computing Center
Millions of people have seen footage of the famed Arecibo radio telescope’s collapse in December 2020. What they would not have seen from those videos was Arecibo’s data center, located outside the danger zone. It stores the ‘golden copy’ of the telescope’s data — the original tapes, hard drives, and disk drives of sky scans since the 1960s.
Now, a new partnership will make sure that about three petabytes, or 3,000 terabytes, of telescope data is securely backed up off-site and made accessible to astronomers around the world, who will be able to use it to continue Arecibo Observatory’s legacy of discovery and innovation.
Within weeks of Arecibo’s collapse, the Texas Advanced Computing Center (TACC) entered into an agreement with the University of Central Florida (UCF), the Engagement and Performance Operations Center (EPOC), the Arecibo Observatory, the Cyberinfrastructure Center of Excellence Pilot (CI CoE Pilot), and Globus at the University of Chicago. Together, they’re moving the Arecibo radio telescope data to TACC’s Ranch, a long-term mass data storage system. Plans include expanding access to more than 50 years of astronomy data from the Arecibo Observatory, whose telescope had been the world’s largest single-dish radio telescope until 2016.
“I’m thrilled that UT Austin will become the home of the data repository for one of the most important telescopes of the past half-century,” said Dan Jaffe, Interim Executive Vice President and Provost of The University of Texas at Austin.
“As a young radio astronomer, I saw Arecibo as an amazing symbol of the commitment of our country to the science I loved,” Jaffe said. “Arecibo made important contributions across many fields — studies of planets, setting the scale for the expansion of the universe, understanding the clouds from which stars form, to name a few. Preserving these data and making them available for further study will allow Arecibo’s legacy to have an ongoing impact on my field.”
“Arecibo data has led to hundreds of discoveries over the last 50 years,” said Francisco Cordova, Director of the Arecibo Observatory. “Preserving it, and most importantly, making it available to researchers and students worldwide will undoubtedly help continue the legacy of the facility for decades to come. With advanced machine learning and artificial intelligence tools available now, and in the future, the data provides opportunity for even more discoveries and understanding of recently discovered physical phenomena.”
Since 2018, UCF has led the consortium that manages the Arecibo Observatory, which is owned and funded by the National Science Foundation (NSF). EPOC is a collaboration between Indiana University and the Energy Sciences Network (ESnet), which is funded by the U.S. Department of Energy’s Office of Science (SC) and managed by Lawrence Berkeley National Laboratory. EPOC had itself partnered with UCF a year before the collapse to profile the university’s scientific data movement activities.
“NSF is committed to supporting Arecibo Observatory as a vital scientific, educational, and cultural center, and part of that will be making sure that the vast amounts of data collected by the telescope continue to drive discovery,” said NSF Program Officer Alison B. Peck. “We’re gratified to see that this partnership will not only safely store copies of Arecibo Observatory’s data but also provide enhanced levels of access for current and future generations of astronomers.”
The data storage is part of the ongoing efforts at Arecibo Observatory to clean up debris from the 305-meter telescope’s 900-ton instrument platform and reopen remaining infrastructure. NSF is supporting a June 2021 workshop that will focus on actionable ways to support Arecibo Observatory’s future and create opportunities for scientific, educational, and cultural activities.
Sense of Urgency
“The collapse of the Arecibo Observatory platform certainly raised a sense of urgency within our team,” said Julio Alvarado, Big Data Program Manager at Arecibo. The Big Data team was already working on a strategic plan for its Data Management and Cloud programs. Those plans had to be prioritized and executed with unprecedented urgency. The legacy of the observatory relies on the data stored for more than 1,700 projects dating back to the 1960s.
Alvarado’s team reached out for help to UCF’s Office of Research, which connected Arecibo to two NSF-funded cyberinfrastructure projects: EPOC, led by Principal Investigators Jennifer Schopf and Dave Jent of Indiana University and Jason Zurawski of ESnet; and the Cyberinfrastructure Center of Excellence Pilot (CI CoE Pilot), led by Ewa Deelman of the University of Southern California.
“We got involved when the University of Central Florida noted they were having challenges in trying to identify a new data storage location off of the island, and were struggling with the demands of efficiently moving that data,” said Jason Zurawski, Science Engagement Engineer of ESnet and Co-PI of the EPOC project.
“Migrating the entire Arecibo data set, well over a petabyte in size, would take many months or even years if done inefficiently, but could take only weeks with proper hardware, software, and configurations,” said Hans Addleman, the Principal Network Systems Engineer for EPOC. The EPOC team provided the infrastructure skills and resources that helped Arecibo design its data transfer framework using the latest research tools and expertise. The CI CoE Pilot team is helping Arecibo evaluate its data storage solutions and design its future data management and stewardship experience to make Arecibo’s data easily accessible to the scientific community.
“Arecibo is an amazing project that has enabled astronomers, planetary scientists, and atmospheric scientists to collect and analyze extremely valuable scientific data over many decades,” said Ewa Deelman, Research Director at the USC Information Sciences Institute, and PI of the CI CoE Pilot project.
“The CI CoE Pilot project is very excited to be working with Arecibo, EPOC, TACC, and Globus members in this community effort, making sure the precious data is preserved and made easily findable, accessible, interoperable, and reusable (FAIR). Recently, we have also reached out to members of the International Virtual Observatory Alliance (IVOA), and in particular Bruce Berriman (Caltech/IPAC-NExScI, Vice-Chair of the IVOA Executive Committee) to explore Arecibo’s data role in the international community. The collaboration formed around and with Arecibo shows how NSF-funded projects can come together, amplify each other’s efforts and have an impact on the international scientific community,” Deelman added.
CI CoE Pilot contributes expertise in a number of areas spanning the Arecibo data lifecycle, including data archiving (Angela Murillo, Indiana University), identity management (Josh Drake, IU), semantic technologies (Chuck Vardeman, University of Notre Dame), visualization (Valerio Pascucci and Steve Petruzza, University of Utah), and workflow management (Mats Rynge and Karan Vahi, USC). The CI CoE Pilot effort is coordinated by Wendy Whitcup (USC).
As a result of Arecibo’s limited Internet connectivity, the University of Puerto Rico and Engine-4, a non-profit coworking space and laboratory, are contributing to the data transfer process by allowing Arecibo to share their Internet infrastructure. Further, the irreplaceable nature of the data required a solution that would guarantee data integrity while maximizing transfer speed. This motivated the use of Globus, a platform for research data management developed and operated by the University of Chicago.
The data transfer process started in mid-January 2021. Arecibo’s data landscape consists of three main sources: data on hard drives, data in the tape library, and data stored offsite. The archive holds over one petabyte of data on hard drives and over two petabytes of data on tape. This data includes information from thousands of observing sessions, equivalent to watching 120 years of HD video.
Currently, data is being transferred from Arecibo hard drives to TACC’s Ranch system, recently upgraded to expand its storage capabilities to an exabyte, or 1,000 petabytes. Ranch upgrades combine a DDN SFA14K DCR block storage system with a Quantum Scalar i6000 tape library.
Over 52,000 users archive data on Ranch from all facets of science, from the subatomic to the cosmic. Ranch is an allocated resource of the Extreme Science and Engineering Discovery Environment (XSEDE), funded by the National Science Foundation (NSF).
“Further phases will copy the Arecibo tape library to hard drives and then to TACC, and a later phase will copy data from offsite locations to TACC,” Alvarado said.
To guarantee continuity for the scientific community, Arecibo’s data is being copied to storage devices, which are then delivered to the University of Puerto Rico at Mayaguez and to the Engine-4 facilities for upload. This ensures that the research community can continue to access the existing data and conduct research with it. The data migration is carried out in coordination with Arecibo’s IT department, led by Arun Venkataraman.
Given time constraints and limitations in the networking infrastructure connecting the observatory, speed, security, and reliability were key to effectively moving the data.
The Globus service addressed these needs, while also providing a means to monitor the transfers and automatically recover from any transient errors. This was necessary to minimize the chance of losing or corrupting the valuable data collected by the telescope in its 50+ years of service.
The Globus service enabled the UCF and ESnet teams to securely and reliably move 12 terabytes of data per day. “Seeing the impact that our services can have on preserving the legacy of a storied observatory such as Arecibo is truly gratifying,” said Rachana Ananthakrishnan, Globus executive director at the University of Chicago.
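To put the reported rate in perspective, the figures quoted above can be combined in a rough back-of-envelope estimate. The sketch below is illustrative only: it assumes decimal units (1 petabyte = 1,000 terabytes) and a constant rate with no downtime, neither of which is stated by the project teams, and the function names are hypothetical.

```python
# Back-of-envelope transfer-time estimate using figures quoted in the article.
# Assumptions (not from the source): decimal units (1 PB = 1,000 TB) and a
# constant transfer rate with no downtime.

TB_PER_PB = 1_000
SECONDS_PER_DAY = 86_400


def days_to_transfer(size_tb: float, rate_tb_per_day: float) -> float:
    """Days needed to move size_tb at a steady rate_tb_per_day."""
    if rate_tb_per_day <= 0:
        raise ValueError("rate must be positive")
    return size_tb / rate_tb_per_day


def sustained_gbps(rate_tb_per_day: float) -> float:
    """Average throughput, in gigabits per second, implied by a daily rate."""
    bits_per_day = rate_tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


if __name__ == "__main__":
    rate = 12.0  # TB per day, the rate reported for the Globus transfers
    # Sizes are the article's lower bounds ("over one", "over two" petabytes).
    for label, size_pb in [("hard drives", 1), ("tapes", 2), ("full archive", 3)]:
        days = days_to_transfer(size_pb * TB_PER_PB, rate)
        print(f"{label}: about {days:.0f} days")
    print(f"average throughput: about {sustained_gbps(rate):.2f} Gbps")
```

Under these assumptions, one petabyte takes roughly 83 days at 12 terabytes per day and the full three petabytes about 250 days, which corresponds to an average of a little over 1 gigabit per second sustained around the clock. Because the archive sizes are quoted as lower bounds, the real durations would be somewhat longer.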
Arecibo’s Data Legacy
The data were collected from Arecibo’s 1,000-foot (305-meter) fixed spherical radio/radar telescope. Its frequency capabilities range from 50 megahertz to 11 gigahertz. Transmitters include an S-band (2,380 megahertz) radar system for planetary studies; a 430 megahertz radar system for atmospheric science studies; and a heating facility for ionospheric research.
Past achievements made with Arecibo include the discovery of the first ever binary pulsar, a find that tested Einstein’s General Theory of Relativity and earned its discoverers a Nobel Prize in 1993; the first radar maps of the Venusian surface and polar ice on Mercury; and the first planet found outside our solar system.
“The data is priceless,” Alvarado emphasized. Arecibo’s data includes a variety of astronomical, atmospheric, and planetary observations dating to the 1960s that can’t be duplicated.
“While some of the data led to major discoveries over the years, there are reams of data that have yet to be analyzed and could very likely yield more discoveries. Arecibo’s plan is to work with TACC to provide researchers access to the data and the tools necessary to easily retrieve data to continue the science mission at Arecibo,” he said.
The Arecibo IT and Big Data teams are in charge of the data during the migration phases of the project, during which public access is not available. As the migration and data management efforts progress, the data will be made available to the research community.
Arecibo, TACC, EPOC, CI CoE Pilot, and Globus will continue to work on building tools, processes, and frameworks to support continuous access to and analysis of the data by the research community. The data will be stored at TACC temporarily, supporting Arecibo’s goal of providing open access to the data. Arecibo will continue to work with the groups on the design and development of a permanent storage solution.