Active Data

Scientific applications often involve computation-intensive workflows and may generate large amounts of derived data. In this paper we consider the data life cycle, which starts when the data is first generated and tracks its progress through replication, distribution, deletion and possible re-computation. We describe the design and implementation of an infrastructure, called Active Data, which combines existing Grid middleware to support the scientific data life cycle in a platform-neutral environment.

The Grid provides an infrastructure that enables sharing of heterogeneous resources. Over the last few years, the development of the Grid has attracted a great deal of interest because of its ability to bring together distributed computation and storage resources. This facilitates the composition of large-scale, resource-intensive scientific applications that require much more processing and data storage than a standalone system can provide.

Many current and proposed scientific experiments generate, and require access to, massive amounts of data distributed across multiple geographic locations. For example, the LHC experiments are expected to produce terabytes to petabytes of high-energy physics data for further processing and analysis by thousands of researchers worldwide. Managing such amounts of distributed data is a challenging and tedious task, and it forces users to spend much of their time on data management rather than on scientific experiments and investigations.

The Data Grid architecture attempts to provide high-performance access to distributed data in an integrated environment. However, most Data Grid systems require application developers to adopt different data access interfaces and protocols. This shortcoming is particularly significant in large collaborations where multiple Data Grid systems are in place. Active Data deals with this by providing a more flexible data access layer above the specific replica middleware.

A Grid Data Life Cycle - Features

  • Computation generates the original copy of data
  • This copy of the data is replicated to other sites
    -Better performance and scalability
  • Data replicas may be removed
    -Limited storage space, not needed anymore, hardware failures, etc.
    -Difficult to manage many files
  • Derived data may be stored as computation procedures
    -Virtual Data Grid (e.g. Chimera)
  • Re-create deleted data dynamically
  • Use buffering for seamless re-creation?
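
The life cycle above can be read as a small state machine. The following is a minimal sketch under that reading, not the Active Data implementation: the state names, the `TRANSITIONS` table and the `advance` helper are all hypothetical and chosen only for illustration.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical life-cycle states for one logical data item."""
    GENERATED = auto()   # original copy produced by a computation
    REPLICATED = auto()  # copies exist at other sites
    DELETED = auto()     # all copies removed; only the procedure remains


# Assumed transitions: data may be replicated or deleted after generation,
# and deleted data re-enters the cycle by being re-computed.
TRANSITIONS = {
    State.GENERATED: {State.REPLICATED, State.DELETED},
    State.REPLICATED: {State.REPLICATED, State.DELETED},
    State.DELETED: {State.GENERATED},
}


def advance(current, target):
    """Return the new state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

In this sketch a deleted item cannot be replicated directly; it must first be re-created, which mirrors the "re-create deleted data dynamically" step in the list.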

Key Features of Active Data

  • Data deletion
    -Allow data to be deleted when procedures required to re-compute the data exist
    -Special remove command uses logical file names for multiple Data Grid systems

  • Dynamic data regeneration
    -Automatically rerun the recorded computation (e.g. an arbitrary program or a Kepler workflow)

  • Logical file namespace
    -Uses a unique file identifier that may be mapped to any file stored in any supported storage system
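
The three features above interact: a logical name hides where replicas live, deletion is permitted only when a re-computation procedure is recorded, and a read of deleted data triggers regeneration. The sketch below illustrates that interaction with an in-memory stand-in. The class `ActiveCatalog`, its methods, and the `lfn:` and `site` names are hypothetical; they are not the Active Data interface or any Grid middleware API.

```python
class ActiveCatalog:
    """Hypothetical in-memory catalog; an illustration, not the real system."""

    def __init__(self):
        self._replicas = {}    # logical name -> set of physical locations
        self._procedures = {}  # logical name -> callable that recreates content
        self._storage = {}     # physical location -> content (stand-in for storage)

    def register(self, lfn, location, content, procedure=None):
        """Record a replica and, optionally, how to re-compute the data."""
        self._replicas.setdefault(lfn, set()).add(location)
        self._storage[location] = content
        if procedure is not None:
            self._procedures[lfn] = procedure

    def remove(self, lfn):
        """Delete every replica, but only if the data can be re-computed."""
        if lfn not in self._procedures:
            raise PermissionError(f"{lfn}: no recorded procedure, refusing to delete")
        for location in self._replicas.pop(lfn, set()):
            self._storage.pop(location, None)

    def read(self, lfn, regenerate_to="site-local"):
        """Return the content, regenerating it if no replica is left."""
        locations = self._replicas.get(lfn)
        if locations:
            return self._storage[next(iter(locations))]
        if lfn not in self._procedures:
            raise KeyError(lfn)
        content = self._procedures[lfn]()  # rerun the recorded computation
        self.register(lfn, regenerate_to, content)
        return content


catalog = ActiveCatalog()
catalog.register("lfn:raw", "siteA:/raw", "raw")
catalog.register(
    "lfn:derived", "siteA:/derived", "RAW",
    procedure=lambda: catalog.read("lfn:raw").upper(),
)
catalog.remove("lfn:derived")          # allowed: a procedure is recorded
restored = catalog.read("lfn:derived")  # regenerated on demand
```

Here the original data has no recorded procedure, so `catalog.remove("lfn:raw")` would be refused, while the derived data can be deleted and is transparently re-created on the next read.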