HowTo – Managing site and analytical data

Priority: medium
Updating: practice being updated as we begin to interpret data

This covers data about sampling locations, samples, and analytical results. Mapping data have their own HowTo: GIS Mapping. Together the tabular and map data constitute the project’s substantive database. There are certain administrative data managed separately.

Change log:

When	Who	Comment
2021 03 05	Sp17	First partial version launched. This is based on undocumented practice during prior DEC projects.
2021 09 15	Sp17	Continuing fleshing out. This may not be part of the initial QAPP reviewable content.
2021 10 25	Sp17	Incorporates anonymous site identifiers and excludes confidential data including that which can imply identities or locations. The tabular database does not exist yet.
2023 02 19 started	Sp17	[beginning major edit, not complete] Tabular database is mature prior to receiving the first pesticide analytical results. All recruited sites, all sampling points at such sites, all samples to date, and all field and lab measurements for all samples to date are incorporated into the database. The original objectives remain. The update here reflects cumulative adjustments to design and implementation through today. Confidentiality aspects are redesigned.
2023 06 05	Sp17	More updates prior to DEC review.
2023 06 21	Sp17	Converted to Markdown, minor updates. This remains an evolving area as we interpret the 2022 results.
2023 07 15	Sp17	The database needs to include chemical properties such as soil half lives and KOC/KFOC values. We are now using this for choosing analytes to focus on in lab. Began a table for this. We also may have a use for Statewide pesticide use to go along with our zip code level data.

Cross references:

1. Design and operating objectives

Maintain adequate compatibility with EQUIS requirements.
Provide for periodic summary reporting of progress in site acquisition, sample acquisition, and analytical results.
Provide a cleaned repository for all non-geographic data acquired or created by the project.
Provide for reporting detailed analytical results to landowners and lake volunteers.
Minimize the chance of DEC, the public, or data thieves from learning the identities or locations of cooperators.

2. Software infrastructure

We were originally (back to around 2000) maintaining data in spreadsheets per analytical batch, both at the DEC lab and the Cornell lab. Cedric Mason established an overlay on this, also in Excel, to consolidate data from the earliest analytical results through several years to export for EQUIS. Going forward it was clear that we needed to be recording much more metadata about sampling locations, samples, and analytical operations to support data submission into EQUIS. This really was not well supportable in Excel. Thus we began an EQUIS-dominated overlay on prior practice that began to use a tabular database. The incremental analytical results spreadsheets, one or more per batch of samples analyzed, will continue to exist, but they will need to be imported into a cumulative, EQUIS-oriented format.

The earlier format, evolved through a few rounds, proved itself for retrieval in 2019 when we were requested to extract all results for a subset of chemicals for use in NYS pollinator impact assessment. This was done after the data had been moved from MS Access (where other data outgrew the ability of Access to query it) into SQLite managed using DB Browser.

Note that SQLite is a single user database system with limited protection against concurrent access. The .db file is in Box thus there is some potential for concurrent use. Read-only concurrent is fine, concurrent update is a problem. Updating is currently only by SP17.

We also began to use Access then SQLite for cumulative PSUR pesticide use and sales data. That is a separate database from the one containing sites, sampling points, samples, and analytical results.

3. Data flow

The overall 2021-2025 project data flow is:

graph TD;
  subgraph well sampling;
    A{Cornell<br>deionized<br>water}-->B[collect field blank];
    B((collect well<br>field blank))-->C[field log];
    D{well}-->E((measure well<br>in situ));
    D{well}-->F((collect well<br>sample));
    E-->C;
    F-->C;
    end;

  subgraph lake sampling;
    O{Cornell<br>deionized<br>water}-->P((collect lake<br> field blank));
    P((collect well<br>field blank))-->Q[lake<br>sample<br>datasheet];
    R{lake}-->S((measure lake<br>in situ));
    R{lake}-->T((collect lake<br>sample));
    S-->Q;
    T-->Q;
    end;

  subgraph data core;
    C-->Cback((scanned<br>backup));
    N-->Nback((scanned<br>backup));
    Q-->Qback((scanned<br>backup));
    C-->G[(operating<br>database)];
    Q-->G;
    G-->H[\report to DEC\];
    G-->I[\reports to volunteers &<br>well owners\];
    G-->J[\submit data to EQUIS\];
    K{{other people's data}}-->H[\report to DEC\];
    end;

    subgraph Cornell lab;
      L{sample}-->M((Cornell<br>analyzes<br>samples));
      M-->N[lab log];
      N-->G;
      L-->X((Cornell<br>makes<br>aliquots))-->U{aliquot<br>from<br>Cornell};
    end;

    subgraph Cornell then DEC lab;
      G-->W[ship manifests<br>and labels<br>for aliquots];
      U-->V((DEC<br>analyzes\nsamples));
      W-->V;
      V-->G;
    end;

This HowTo is about the “operating database” part of the project’s data core. There are additional textual data in the field and lab logs, covered by other HowTos.

See also EQUIS guidance: https://www.dec.ny.gov/chemical/62440.html

4. Data structure

This is mostly dictated by EQUIS. It reflects a customary analytical lab records hierarchy and is somewhat oriented toward hazardous waste site monitoring.

Data are in a hierarchical structure of tables in which one parent record has zero or more child records. The ultimate root of the tree hierarchy is the monitored (or hazardous remediation) “facility”. Our submission into EQUIS is for a single facility and none of the tables include any facility identity.

The primary data entities of the operating database for this project are implicitly the project (entire database covers one project), and explicitly the site, sampling location, sample, and analytical result. A different project will have its own database file. For example, all DEC project data prior to the current project are in one database file that will be kept separate. Our glyphosate project data through 2021 are in a third database file set similarly designed.

We prefer USGS terminology except when EQUIS terminology is essential.

4.1 Project sites, EQUIS “Facilities” and EQUIS “Subfacilities”

Our Sites are groups of wells on nearby property or groups of sampling locations in lakes. Each Site will have a unique anonymous identifier (such as Golf-7) that will maintain the anonymous privacy of the owner, and will reflect whether the Site is a lake, a long term well, or a group of wells related to a categorical pesticide user. (Note that some “wells” are actually other ways of accessing groundwater to sample, such as water table ponds.) There will also be a confidential site name for use between Cornell and the landowner, typically the owner’s favored name.

EQUIS Facilities and Subfacilities reflect terminology from EQUIS’ hazardous waste site cleanup heritage and do not have much meaning for pesticide reconnaissance in groundwater and lakes. In other EQUIS content than ours, they serve a grouping purpose for data retrieval, much as we would use our Sites for retrieval. Our Site is semantically similar to a Facility, in that it is geographically defined. We have no need to subdivide our Sites since there will be few monitoring Locations per Site.

In earlier EQUIS submissions we used “upstate New York” as the single Facility when submitting data into EQUIS, effectively treating New York State as having three Facilities: upstate, Long Island, and New York City. Our Subfacilities were counties. Neither type of grouping seemed meaningful for future DEC users of the earlier pesticide monitoring data.

There will have to be some kind of geocoding of Locations for EQUIS purposes. With household wells we anonymized past sampling locations using zip code centroids. There are also major watershed, county, and DEC region codes in EQUIS which may have to be left null or filled in with placeholders to avoid disclosing identities indirectly for some of the categories. Perhaps a new kind of Location could be defined within EQUIS, ecological region, and the centroid of the region would be the geo-tag.

A Site has a project category such as Golf Course, Lake, Greenhouse, or Long Term. This may not have a meaning in EQUIS semantics. Site identifiers are assigned based on the calendar order when the site began to be tracked as a serious candidate. Their sorting order within a category is not otherwise significant.

ToDo in2023: Negotiate with DEC EQUIS managers a nomenclature for unique project site identifiers. This also relates to Locations (next section).

ToDo in 2023: Negotiate with DEC EQUIS managers a mapping from our Site identifiers to Facilities and Subfacilities, so that records exported to EQUIS can populate the EQUIS fields.

ToDo in 2023: Agree with DEC EQUIS managers on an anonymity-preserving way to geocode data, such as ecological regions characterized with latitude/longitude centroids.

4.2 EQUIS and Project “Locations”

EQUIS and Project Locations have identical semantics: the map view (2-dimensional) position of a sampling point. We call them sampling points since Location is too generic.

These are X/Y (longitude/latitude) positions in the landscape or waterscape, where water samples are drawn from. We map these locations in confidential GIS based on the latitude and longitude of the sampling point derived from a GIS map (with help from a digital orthophoto) or GPS reading. (Note that these are never rigorously surveyed.) If there are multiple sampling points vertically at the same XY location, such as from a cluster of wells screened at different depths, that is still one Location, by previous agreement with DEC EQUIS staff. (The depth of the sample below a vertical datum is an attribute of the Sample, not the Location.)

Both censored and exact XY positions are maintained for any location that cannot to be disclosed to DEC or the public. In the past the censored location has been the centroid of a zip code polygon for the mailing address of the owner of the location. This was adequate protection of the privacy of the landowner. This must be blurred farther than the zip code level for most of this project’s disclosures to DEC and public because there is also a category involved. For example, disclosing the zip code of a golf course will disclose the exact site because there are not many zip codes containing more than one golf course. For another category there are only a handful of candidate Sites in entire upstate thus they may not even be able to include an ecological region centroid. To cover cases like that, we might need an Upstate NY centroid to geo-tag data about a few entities.

Long term Sites have different anonymity and privacy considerations. We should be able to provide their counties (which means DEC regions tags would be acceptable), and will consider providing zip code centroids.

(Lakes are not confidential in earlier joint work of the current project. We should be able to carry over EQUIS coding decisions from earlier lake sampling.)

The source used to estimate the XY coordinates, including centroids, is part of submitted EQUIS metadata.

As a special case, the Cornell lab is a Location when there is an artificial sample created for quality assurance purposes. This would generally be limited to “lab blanks”, though it would also apply if Cornell fortified any field samples. (The sample being fortified retains its location identity and the fortified version has a link to the unfortified version.) Field blank artificial samples belong to the site where they are created. The DEC Air Lab can also be an analogous location.

We should begin to submit field blanks related to categorical site sampling in 2023. Blanks are not relevant to long term sites since these all use the owner’s pump and piping. We collected field blanks in earlier lake sampling.

4.3 EQUIS and Project “Samples”

EQUIS and the Project Samples have identical semantics. A Sample is a volume of water or solids removed from the environment to represent conditions at a sampling point at a specific time. A single Sample may be divided across multiple containers for protection against loss in handling or just to have more volume; the multiple containers are not tracked individually in the database or by EQuIS. (A Sample may have replicates which are tracked in the database via a suffix on the sample identifier.) In prior SWL practice we used single capital letter suffices for multiple containers and numeric or text suffices for tracked replicates.

The unique identifier of a sample consists of its Site ID, its sampling point identifier within the site, date and time of collection, and a replicate number or name. In the project database these are five separate columns, in EQUIS these attributes are merged into one long string which can be generated algorithmically from the project’s sample identifiers.

In EQUIS a replicate sample is linked to the original by listing the EQUIS identifier of the original sample in the record for the replicate.

Composite samples that draw from multiple locations or blend across time are NOT provided for in EQUIS or the project. Any such samples are unique samples with their own locations or timestamps, and a comment indicating which locations or times were drawn from to create the composite.

There is provision in the project’s samples table to indicate that a sample and any related analytical data should be excluded from the EQUIS export.

4.4 Well Characteristics

In past when we were not installing our own monitor wells we did not know much about the wells, beyond what an owner could tell us which was usually only a depth and the type of subsurface material being tapped. When we install our own wells or use newer existing wells (after 2000), we may have access to well stratigraphic logs. Thus we should be recording information about some of our wells in the database including the stratigraphy. Note that we are unable to log driven or augered wells, the only kinds we have installed as of 2023 07 15. We have a general idea of the order of strata encountered by the auger if the layers are greatly contrasting, in the form of what the auger brings up, but we do not know the depth of different materials.

EQUIS terms this data group Subsurface investigation; the tables are Lithology and Well. Lithology has one row per stratum per well, and Well has one per well. We may also use a WellConstruction table.

4.5 Time Variable Field Conditions

We will also be recording well water levels, pH, temperature, and specific conductance at sampling times, which is a new kind of accompanying data for samples. EQUIS considers these part of a “field activities” group rather than “Chemistry” group. There are tabular formats specified for WaterLevel and FieldResults. The EQUIS mandate dictates some of the structure of our own time series data so that a scripted export to EQUIS importing spreadsheets is possible, as well as what metadata to record and export.

4.6 Analyses (Chemistry)

EQUIS Chemistry results and our Analyses have identical meaning. The EQUIS chemistry table is created as a subset of our “Analyses” table. Each row of the table is one analytical result from a specified lab (or as measured in field) using a specified method for a specified parameter. There is provision for nonquantified results, such as “non detects”. There is limited provision to indicate data quality. EQUIS requires extensive metadata such as the date when the test was done and who did it.

Replication is permitted, i.e. a sample analyzed at different labs by the same method. As with sample replication, analyses have a replicate sequence number. When a lab splits a sample to do several parallel analyses on each part, that is considered a replicate sample rather than replicate analyses of the same sample. It is even possible for replicate analyses to be done on each replicate of a sample.

When there are replicated analyses in the data with the same method, or different methods and the same underlying parameter, there is no provision in EQUIS to indicate that one result is preferred over another. There is also no mechanism to indicate preference for one method over another for the same parameter in the same sample, unless one method is tagged as screening quality and the other normal. We have a “useit” attribute to demote lesser data below best data when there are replicates done. This would exclude the lesser data from an EQUIS submission or in reporting.

The analytical result record indicates which lab or field person did the analysis. Labs have registered names in EQUIS which may not be the same as a lab’s customary name – the name in EQUIS for Cornell’s Soil and Water Lab is “Cornell”, presumably because we were the first Cornell University entity who submitted data into EQUIS. Thus we keep both the EQUIS and project lab names in the data record for the result.

Each analysis record contains triple redundancy about what analyte was measured - a parameter name as used in the project, an EQUIS official name, and a Chemical Abstracts Service (CAS) multipart numeric substance identifier. As with lab names, the EQUIS official analyte name seems to be the one registered first into the EQUIS system.

There is also redundancy about the analytical method. We record our customary method name of the DEC Air lab, or Cornell in the field or lab, in our recognizable form, and also which official EQUIS method most closely matches ours. In a few cases we register additional EQUIS methods when our method is not a good match for what we do.

EQUIS requires submitting the date an analysis was performed, thus we keep track of this in the table. EQUIS does not explicitly track how a sample was stored between colletion and analysis. We generally freeze all samples shortly after they are collected. (see the Sample Storage HowTo.) Cornell SWL’s and the DEC Air lab’s testing indicate that most parameters are stable for years while frozen. The only exception has been sulfate at Cornell which reduces in concentration during storage, and is of minimal interest to the project.

The method encodings for EQUIS include any preprocessing after preservation, for example sample preconcentration or cleanup such as by solid phase extraction. Cornell will submit any changes in DEC Air lab sample processing as new EQUIS methods unless they match closely a method already in EQUIS.

Experimental: When there are field and lab blanks, these can be associated with analytical results of given samples. Our field blanks will be trip blanks, thus the result is associated with all samples of a given trip.

4.7 Referential integrity and valid value checking in EQUIS

The hierarchical data structure means that data must be entered in a particular order. This also helps to identify entry errors.

All analyses must have a sample record. Thus before adding an analytical result the sample record must be entered first.

All samples must have a location record. Thus before adding a sample record there must be a location record.

All replicate samples must have a parent sample from which they were derived. Thus the parent samples must be entered before the replicate sample.

There are many referential integrity checks in the EQUIS checker. It checks encodings for “valid” (registered) values, such an analyte names and laboratory names. The EQUIS valid values checker run against a trial export in and Excel spreadsheet file helps to correct data in the project database, i.e. the database is corrected and our SQL export script is rerun to create a fresh Excel trial export, then the fresh trial export is run through the EQUIS checker. Only the last version of the export that passes EQUIS checking is retained.

5. QC samples and analyses

There are a few kinds of QC checks, at the DEC lab, in the field, and. These create analytical data which are associated with another analytical result or with a sample or with a batch of samples, to aid in interpreting one or more analytical results on natural samples.

Field blanks - by Cornell (earlier by lake volunteers)
Lab blanks - by Cornell and DEC Air Lab
Recovery tests - by DEC Air Lab
“Sample” replicates - by DEC Air Lab (not a second field sample)
Analytical replicates - by DEC Air Lab
Lab spikes - these are unique samples with certain tags to indicate that they are not normal field samples - not envisioned for use in project
Split (multiple labs same method) sample results - this is simple in EQUIS which includes which lab did a test with every result datum - not envisioned for use in project on pesticide analyses

We are not doing replicate sampling in the field for this project. Any “sample” replication is done by multiple aliquots from the same field container set.

During this project we did one split lab cations test across two Cornell-based labs who both use similar ICP apparatus. We also sometimes repeat an analysis at the same lab by making a second aliquot from the same sample on a later date.

[todo: describe how we store and encode QC results for exporting to EQUIS; we have submitted replicate and field/lab blank data into EQUIS as conventional analytical results, not tagged as QC to EQUIS]

6. Lookup tables

The laboratories and EQUIS often use different names for analytes, thus there is a need for a lookup table to translate the lab’s name for the analyte into EQUIS name and EQUIS’s CAS code.

County names are used in the database, which need to be translated into a FIPS code for EQUIS.

EQUIS requires that locations be encoded in a particular way different from what a project may wish. In both cases the identifiers must be unique. A lookup table can be used to translate from location or sample attributes into an EQUIS-compatible identifier.

In general EQUIS uses a single column as a row key in its tables, concatenating several attributes of the location or sample into a single string value. Our database uses multi-column keys and employs SQL string expressions with the aid of suffices to compose EQUIS keys.

7. Exporting data

EQUIS - bulk, periodic uploads of data not previously submitted to EQUIS
Reports to DEC Bureau - progress reports on samples and analyses, status reports on samples in process
Reports to lake volunteers - all samples in one or more batches they collected; no location censoring
Reports to site owners - one or more samples from their wells; no location censoring, data only go to the land owner
Container labels to send to lake volunteers (initially manually created)
Container labels for a planned sampling trip (initially manually created)

8. Importing data: recurring

Kind of data	Source	Destination
After a Cornell sampling trip: field results and new sample list	Field log	Samples, Analyses
After a lake sample batch is received: field results and new sample list	Printed field log sheets	Samples, Analyses
DEC lab analytical results for submitted aliquots	DEC lab spreadsheet	Analyses, QC samples?
Cornell lab analytical results	Cornell lab spreadsheet	Analyses

9. History of problems and resolutions.

Earliest practice in county work did not include any formal information system, essentially spreadsheets with separate files or sheets for different types of data. Analytical results were in a format provided by DEC’s lab, analytical results from Cornell ELISA and IC work were in custom formats matched to the methods, and the other formats were invented by project staff and students.

When EQUIS became mandated, starting with conversion and supplementing of earlier data, several issues arose that influenced the following generation of routine data storage:

EQUIS encodings of things like analyte names and sampling site identifiers were required, replacing earlier practice. For example, the DEC pesticide lab used particular analyte names for active ingredients, the active ingredients have US EPA chemical numbers and official US EPA names which are used by Bureau of Pesticides and manufacturers. There are CAS registry numbers, which can be used to look up many common names of the compound. There are EQUIS analyte names established by the first person to enter an analytical result for the compound as a sample analyte. The DEC lab uses a particular name for their analytical result. That is five identifiers for the same chemical compound. Using EQUIS forces carrying around the EQUIS name and the CAS number alongside the DEC lab’s name.
Confidentiality and privacy of most locations had to become explicit, balancing the need for data reuse with honoring privacy agreements. We agreed to report confidential locations using the zip code centroid of the owner mailing address.
Data quality was heterogeneous, between the authoritative DEC lab and the inexperienced to semiprofessional work at Cornell. EQUIS allows effectively two grades of data quality rating: final and screening. We decided to encode all Cornell-produced data as “screening” and DEC lab data as fully reusable.
Quality control information such as matrix spike recovery results was kept within the DEC lab and not included in Cornell’s project reports. Some samples and analytes were tested in duplicate at the lab, along with spike recovery results. Over time the supplemental QC numbers began to be transmitted to Cornell from the lab but they were not reported into EQUIS or included in the project reporting. Later EQUIS submissions began to include the duplicate test results.
Much metadata required by EQUIS was absent, such as the date an analytical test was done in the lab. That was reconstructable for most Cornell lab work but such metadata had not been transmitted by DEC’s lab to Cornell until the second generation of EQUIS submissions. For older data, we used the date results were transmitted to Cornell as the analytical date, which is weeks to months conservatively late in the sense of elapsed time between when the sample was collected and when it was analyzed. EQUIS does not provide a means to encode the storage conditions for the sample.
The data had to be expressed in a conceptual hierarchy based on hazardous waste site monitoring, which is much different from reconnaissance monitoring. The hierarchy is facility -> subfacility -> sampling location -> date/time sampled -> analytical result. We somewhat arbitrarily chose to have all of upstate NY as the “upstate pesticides facility” and counties as the “pesticides subfacilities”, thus essentially discarding the top level of grouping and having the second be an environmentally meaningless civil division. When we later covered lakes and a Karst area, which each spanned multiple counties, using the county as the subfacility had the additional problem of splitting up the projects; the Lakes project has data about four lakes happening to be in four different counties and there is no indication that the lake samples were in one group nor that the samples were from lakes, as opposed to stormwater or rivers. They are coded as from generic “surface water”.
EQUIS requirements are a moving target, fortunately not changing much earlier meaning. Thus, when making an excerpt of our data for EQUIS, snapshots of both our data and the current EQUIS requirements were needed, to enable rerunning the data subsetting and recoding later. Typically this was done by making a snapshot of the database and the SQL queries that export a re-encoded version of the data in EQUIS format.
While some of the EQUIS content can be synthesized algorithmically from local data, in general there has to be duplicate metadata in EQUIS syntax and semantics. For example, a site might have been named CyC-13 originally but EQUIS would need to refer to it using a county FIPS code, a code indicating surface or ground water, and a sequence number for that county

With the completion of conversion of earlier data into EQUIS compatible format, we began to anticipate the need to record all EQUIS-mandated metadata in EQUIS-compatible formal metadata.