BIOLOGY

Marine biological data can represent, for example, species presence and absence, abundance, or density.

Data Collection

Marine biology encompasses an enormous variety of microorganisms, plant life, crustaceans, fish species, and marine mammals, from the smallest algae to the largest creature on Earth today, the blue whale. Life in the ocean is closely linked to the physics and chemistry of seawater and the seabed, but the most important (and the most visible) characteristic of marine biology is its dependence on seasons. Species migration, feeding, and reproductive cycles are frequently governed by seasonal changes in ocean temperature and by the availability of sunlight to initiate the algal blooms that are so important for the higher trophic levels. In the ocean sciences, biological monitoring encompasses various specializations, including professionals from fisheries, marine microbiology, botany, and invertebrate specialties such as coral reef ecology, shellfish, seagrass, and rocky shores. In the benthic realm, communities can be found on various bottom types, from the deep sea to littoral fringes. Differences in the characteristics of the bottom, from sandy and coarse sediment to rocky hard bottoms, result in an uneven distribution of benthic organisms. Additionally, benthic communities on the same bottom type can differ due to a combination of physical and chemical variables that define niches or spatial preferences.

Biological observations of the benthos can take many forms, including direct visual observations in shallower waters; video or photographic observations from platforms such as remotely operated vehicles (ROVs), autonomous underwater vehicles (AUVs), unoccupied aerial vehicles (UAVs), or dropped cameras; and direct capture through methods such as grab sampling, ROV-based sampling, or fishing. Comparing observations of one location over time, or of different locations at any time, requires specific protocols that produce inter-comparable data. The type, frequency, and other specifics of benthic biological sampling depend on the monitoring goal or research question, which in turn determines the types of organisms to target and the spatial and temporal resolution of the data to be collected. Over 1700 methods and observing programs related to "benthic biological sampling" may be found in the open-access digital repository of ocean best practices documentation (http://www.oceanbestpractices.org). Benthic biological data collection often encompasses information about the presence or absence of various organisms, ranging from infauna to epibenthic life, including various forms of algae, reef-forming organisms, and organisms that bury themselves partly or entirely or spend time on the bottom. This information is typically complemented by data on the environmental variables that affect the occurrence of these organisms. It is generally recognized that the distribution of organisms within benthic communities is uneven or patchy. Factors such as variations in the environment, changes in sediment distribution, bottom currents, and larval settlement, among others, influence this patchiness. To address local, regional, and global biodiversity patterns, changes, and other biological or biophysical issues, it is essential to evaluate biotic and abiotic data together.
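
As a minimal sketch of evaluating biotic and abiotic data together, the example below turns per-station counts into presence/absence and density values and attaches the environmental variables recorded at the same station. The station codes, taxa, sampled area, and field names are all hypothetical and chosen only for illustration; a real program would follow the sampling protocol it has adopted.

```python
from collections import defaultdict

# Hypothetical grab-sample records: (station, taxon, individuals counted).
counts = [
    ("ST01", "Nephtys", 12),
    ("ST01", "Abra", 3),
    ("ST02", "Nephtys", 0),
    ("ST02", "Abra", 7),
]

# Hypothetical abiotic measurements taken at the same stations.
environment = {
    "ST01": {"depth_m": 18.0, "sediment": "sand"},
    "ST02": {"depth_m": 42.5, "sediment": "mud"},
}

GRAB_AREA_M2 = 0.1  # assumed seabed area sampled by one grab


def summarize(counts, environment, area_m2):
    """Return one record per station with presence, density, and environment."""
    per_station = defaultdict(dict)
    for station, taxon, n in counts:
        per_station[station][taxon] = {
            "present": n > 0,               # presence/absence
            "density_per_m2": n / area_m2,  # individuals per square metre
        }
    # Attach the abiotic variables so both kinds of data travel together.
    return {
        station: {"taxa": taxa, **environment.get(station, {})}
        for station, taxa in per_station.items()
    }


summary = summarize(counts, environment, GRAB_AREA_M2)
for station, record in summary.items():
    print(station, record)
```

Recording an explicit zero count, as for Nephtys at ST02, keeps a true absence distinguishable from a taxon that was simply not looked for.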

Data Processing

Marine biological data span a vast array of different data types, from numerical simulations of algal blooms to individual visual observations of invasive species in a limited region. It is therefore difficult to summarize a method of data processing that accounts for all data types. A good starting point for processing approaches is to search the Ocean Best Practices repository (http://www.oceanbestpractices.org). In the data life cycle, following data acquisition, the initial stage of data processing involves what some call “data wrangling”, which aims to clean and curate raw data to organize it into a structured and usable format. Data wrangling encompasses various methods to achieve this objective, each comprising several general steps. Furthermore, the naming and organization of files and folders constitute another critical stage in the data processing pipeline, particularly for long-term projects involving frequent data collection.
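
The sketch below illustrates what a small data-wrangling step might look like, assuming a raw CSV export with inconsistent whitespace, letter case, date formats, and length units. The column names, the two date formats, and the file-naming pattern are assumptions made for this example and are not a prescribed convention.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export with inconsistent whitespace, case, units, and dates.
RAW = """station,date,taxon,length,unit
 st01 ,2023-06-01,Mytilus edulis,4.2,cm
ST01,01/06/2023,mytilus edulis ,38,mm
ST02,2023-06-02,,5.0,cm
"""

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats present in the raw file
TO_MM = {"mm": 1.0, "cm": 10.0}          # conversion factors to millimetres


def parse_date(text):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def wrangle(raw_text):
    """Return (clean_rows, rejected_rows) from raw CSV text."""
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Collapse stray whitespace and normalize capitalization of the name.
        taxon = " ".join(row["taxon"].split()).capitalize()
        if not taxon:
            rejected.append(row)  # keep rejects for review; do not drop silently
            continue
        clean.append({
            "station": row["station"].strip().upper(),
            "date": parse_date(row["date"]),
            "taxon": taxon,
            "length_mm": float(row["length"]) * TO_MM[row["unit"].strip().lower()],
        })
    return clean, rejected


def build_filename(project, station, date_iso, stage):
    """Hypothetical naming pattern: project_station_YYYYMMDD_stage.csv."""
    return f"{project}_{station}_{date_iso.replace('-', '')}_{stage}.csv"


clean_rows, rejected_rows = wrangle(RAW)
print(clean_rows)
print(build_filename("demo", "ST01", "2023-06-01", "clean"))
```

Keeping rejected rows in a separate list leaves a trail for later review, which matters in long-term projects where the raw files are revisited.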

Processing data for practical interpretation is challenging. Before commencing data analysis, all of the elements above must be considered to ensure the integrity and usability of the data. Gathering data from various sources, regardless of their level of structure, and preparing it for visualization, modeling, or permanent storage is essential to the success of a project. Biological observations yield not only georeferenced data (occurrences) but also various types of data that can be used to estimate key ecological parameters, such as density, abundance, and measures of diversity (e.g., alpha and beta diversity). However, these observations often require some processing to ensure that the derived metrics are biologically meaningful and comparable across different locations or habitats. For instance, this involves grouping observations at a consistent taxonomic resolution, standardizing measurement units, and ensuring that temporal and spatial resolutions are compatible. Doing so avoids biases introduced by varying methodologies or taxonomic ambiguities.

In the U.S., individual biological observations are commonly combined with other observations of benthic environments and classified as belonging to specific benthic habitats defined by classification schemes such as the Coastal and Marine Ecological Classification Standard (CMECS). CMECS aims to ensure consistent descriptions of ecological features, and its four components are: Water column (ecological features in the water column, including physical parameters such as temperature, salinity, and currents), Geoform (details of the coast and seafloor characteristics), Substrate (characteristics of materials on the seabed, of geological, biological, and anthropogenic origin), and Biotic (biological features in the water column and on the seafloor). CMECS is built hierarchically, meaning that classifications are organized from top-level domains to increasingly specific sub-categories. Internationally, marine ecosystems are more commonly classified using the International Union for Conservation of Nature Global Ecosystem Typology.
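
To make the derived metrics mentioned above more concrete, the sketch below computes species richness and the Shannon index (two common measures of alpha diversity) and Whittaker's multiplicative beta diversity (regional richness divided by mean site richness) from counts per taxon. The sites and counts are hypothetical and are assumed to have been harmonized to the same taxonomic resolution beforehand. These particular indices are one reasonable choice among many, not a method prescribed by the text.

```python
import math

# Hypothetical counts per taxon at two sites, already harmonized to genus level.
site_a = {"Nephtys": 12, "Abra": 3, "Amphiura": 5}
site_b = {"Nephtys": 4, "Abra": 16}


def richness(counts):
    """Alpha diversity as the number of taxa with at least one individual."""
    return sum(1 for n in counts.values() if n > 0)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over taxa that are present."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log(n / total) for n in counts.values() if n > 0
    )


def whittaker_beta(*sites):
    """Whittaker's beta: gamma richness divided by mean alpha richness."""
    gamma = len({taxon for s in sites for taxon, n in s.items() if n > 0})
    mean_alpha = sum(richness(s) for s in sites) / len(sites)
    return gamma / mean_alpha


print("richness:", richness(site_a), richness(site_b))
print("Shannon:", shannon(site_a), shannon(site_b))
print("Whittaker beta:", whittaker_beta(site_a, site_b))
```

If one site were identified to species and the other only to genus, richness would differ for purely methodological reasons, which is why the taxonomic grouping step comes first.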

Data Management

A good data management plan is critical from the very first steps of a project. Such a plan encompasses a series of steps, beginning with the strategic planning of data collection, processing, preservation, and sharing. Several global benthic assessments, including Natural Geography in Shore Areas (NaGISA), SeagrassNet, and the Atlantic and Gulf Rapid Reef Assessment (AGRRA), have embraced data management and sharing standards and formats. Leading organizations in ocean research advocate for adherence to data management principles. For instance, the International Oceanographic Data and Information Exchange (IODE) is dedicated to advancing marine research, exploitation, and development by facilitating the exchange of oceanographic data and information among member states. Central to its mission is encouraging researchers to formulate comprehensive data management plans for projects involving marine data. Furthermore, the IODE underscores the importance of archiving data generated by research projects within its network of National Oceanographic Data Centers.

In the data life cycle, data acquisition entails both discovering existing data and generating new data. Metadata play a critical role in this process by making data discoverable, usable, and comprehensible. Well-crafted metadata allow users and information resources to grasp the content they are accessing or reviewing, understand its potential relevance to their objectives, gauge its value, and acknowledge its limitations. Balancing the simplification of observation data against reasonable data loss, while avoiding verbose complexity, was a key concept for the biological metadata standards reviewed below. To minimize the complexity of a project’s defined methods and results, and to reduce uncertainty about descriptions within a project, biological data standards set a specific language (controlled vocabulary) for metadata components. These language elements are implemented in the sections and sub-items of a metadata document and as recommended domain declaration values. A vocabulary shared among projects creates a greater opportunity for merging similar datasets, because it limits unknown or inconsistent descriptions of a project’s scope and findings.

The Federal Geographic Data Committee (FGDC) Biological Data Profile is often the default standard chosen for geographic metadata related to biological observations. The primary objectives of the standard are to set a common terminology and to define minimal metadata documentation requirements for biological projects with either geographic or non-geographic components. The metadata are distributed as a single Extensible Markup Language (XML) file. Another relevant standard for biological data is the Ecological Metadata Language (EML), a method for formalizing and standardizing the set of concepts that are essential for describing ecological data. EML has data producers follow a comprehensive metadata form that allows for extended and better-organized descriptions of sampling procedures and internal project communications. The metadata are stored in a single XML document.
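
As a rough illustration of metadata stored in a single XML document, the sketch below assembles a very small EML-style record with Python's standard library. The element names (dataset, title, creator, individualName, surName, abstract, para, contact) and the namespace follow EML 2.2.0 as far as this example goes, but the output is not validated against the schema, and the package identifier, title, and names are hypothetical. A real record would carry far more detail, such as coverage and methods.

```python
import xml.etree.ElementTree as ET

EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"  # EML 2.2.0 namespace
ET.register_namespace("eml", EML_NS)


def add_party(parent, tag, surname):
    """Append a creator/contact style element holding an individual's surname."""
    party = ET.SubElement(parent, tag)
    name = ET.SubElement(party, "individualName")
    ET.SubElement(name, "surName").text = surname
    return party


def build_eml(package_id, title, surname, abstract):
    """Return a minimal EML-style document as a string (not schema-validated)."""
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": package_id, "system": "example"},  # hypothetical values
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    add_party(dataset, "creator", surname)
    abstract_el = ET.SubElement(dataset, "abstract")
    ET.SubElement(abstract_el, "para").text = abstract
    add_party(dataset, "contact", surname)
    return ET.tostring(root, encoding="unicode")


eml_xml = build_eml(
    package_id="example.1.1",
    title="Benthic grab survey, hypothetical bay, 2023",
    surname="Doe",
    abstract="Counts of benthic infauna from grab samples at two stations.",
)
print(eml_xml)
```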

Ultimately, the fitness of a biological observational dataset depends on its ability to be shared effectively using suitable methods. Despite the inherent complications of uniformly standardizing data between biological datasets, several global databases have established well-structured data networks to share present and historical biological observations of the oceans. These databases share and distribute biological data through various methods, including relaying point observations, suitability based on combined sources (e.g., environmental conditions, peer-reviewed publications, historical accounts), project footprints, and metadata accounts. The distribution and storage methods for these database services are made possible through specific metadata scopes and the formatting of observational data. However, because observational data formats and metadata scopes are strictly dictated, some project components (e.g., field notes, complex geographic references, observation data subsets) are dropped from a shared data service due to data reporting restrictions. This loss of detail ultimately limits how practical it is to use the dataset outside the original project’s scope.

The Darwin Core was created to further expand the use of existing nomenclature practices by providing a more parsimonious metadata framework for cataloging biological data. Its developers recognized that, apart from the fixed taxonomical terms associated with biological observations, many commonly used and interchangeable descriptors of a project’s monitoring efforts did not necessarily share the same meaning between projects or among data repositories. The Darwin Core addresses this term inconsistency by providing specific, non-interchangeable definitions across metadata elements, which allows for better data integration across projects. Developed in the late 1990s, the Darwin Core started as a set of standard terms with rigidly defined semantics. These terms included taxonomical nomenclature and spatial and procedural context terms, including locality, sampling methods, and temporal descriptors. The referenced Darwin Core terms (or schema domains) are available for project review in the Darwin Core Quick Reference Guide. The Darwin Core is intended to be adaptable enough to accommodate new terms and extensions, but it ultimately seeks to promote flexibility and reuse of terms through data-sharing platforms and to avoid verbose complexity.

The Darwin Core is distributed through simple text or XML documents, typically in an archive format. These documents are encoded through a variety of encoding schemes, such as comma-separated values (CSV; tabular), JavaScript Object Notation (JSON), or Resource Description Framework (RDF). To distribute the different data components of a project (i.e., the observations, methods, and project description), Darwin Core partitions the project data layers into separate metadata documents or extensions. These extensions are files that follow the approved Darwin Core structure but may require additional fields that expand a project’s documentation beyond the scope of the Simple Darwin Core. The data within the documents are linked through a relational database construct using shared object identifiers (e.g., eventID, observationID, projectID). By separating metadata components into different organizational units through metadata extensions, the Darwin Core allows data management at different project phases. A common complication observed in many biological monitoring datasets is that projects may not be term-limited or are planned as continuing monitoring efforts. In this case, project managers submit metadata or observational findings on a set schedule that is often dictated by convenience or by capabilities based on available funding and resources rather than by the needs of an outside investigator.

Acknowledging the need to document global biodiversity better, the Darwin Core standard was developed to provide a more streamlined approach to sharing project biological observational data through community coordination. At its heart, the Darwin Core standard is not a complete metadata format like the FGDC or International Organization for Standardization (ISO) formats, but a model that allows similar datasets to be shared and integrated among different databases using a shared descriptive scheme. Many platforms integrate Darwin Core while using a base project metadata format, such as EML, within a Darwin Core archive (package). Today, the Darwin Core is a heavily utilized data-sharing format accounting for many occurrence entries.
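
The linking of separate documents through shared identifiers can be sketched as follows, assuming an event table and an occurrence table stored as CSV text and joined on eventID. The column headers are Darwin Core term names, but the records themselves are hypothetical. A real Darwin Core archive would also contain a descriptor file (meta.xml) that maps columns to terms, which is not shown here.

```python
import csv
import io

# Hypothetical event table: one row per sampling event.
EVENTS_CSV = """eventID,eventDate,decimalLatitude,decimalLongitude,samplingProtocol
EV1,2023-06-01,59.10,10.50,grab sample
EV2,2023-06-02,59.20,10.60,ROV video transect
"""

# Hypothetical occurrence table: each row points back to an event via eventID.
OCCURRENCES_CSV = """occurrenceID,eventID,scientificName,occurrenceStatus,individualCount
OC1,EV1,Abra alba,present,3
OC2,EV1,Nephtys hombergii,present,12
OC3,EV2,Abra alba,absent,0
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_event(events, occurrences):
    """Attach event fields to each occurrence; collect occurrences with no event."""
    by_id = {event["eventID"]: event for event in events}
    joined, orphans = [], []
    for occ in occurrences:
        event = by_id.get(occ["eventID"])
        if event is None:
            orphans.append(occ)  # broken link: flag it instead of guessing
        else:
            joined.append({**event, **occ})
    return joined, orphans


joined, orphans = join_on_event(read_rows(EVENTS_CSV), read_rows(OCCURRENCES_CSV))
for record in joined:
    print(record["occurrenceID"], record["eventDate"], record["scientificName"],
          record["occurrenceStatus"])
```

Because the event details live in one table, a continuing monitoring project can append new occurrences for a later event without rewriting the records already shared.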

