HowTo – Submitting data to DEC EQuIS [80%]

Priority: low
Updating: rare

Change log for EQUIS submission HOWTO:

Date Who Comment
2021 03 05 Sp17 First version from memory, skeleton of how we interact.
2021 07 30 Sp17 Beginning to fill in an evolved version of this for the new project.
2021 10 01 Sp17 Reformat for export to DOCX.
2023 06 14 Sp17 Adapt to Markdown.
2023 06 21 Sp17 Added a timing section at end.

Reference: https://www.dec.ny.gov/chemical/62440.html

1. Objectives

  • Get data and metadata accepted into EQUIS.

2. Quality assurance considerations

EQUIS is designed to facilitate reuse of data by NYS DEC staff across the agency. There are three facets of data submission into EQUIS that govern how we submit data:

  1. The locations of many project samples will be censored so that the identity of the landowners is not disclosed and cannot easily be guessed.
  2. The relatively low accuracy and precision requirements of the project limit reuse potential of its data to similar screening types of activity. The metadata accompanying measurements must include codings to reflect the nature of the data. EQUIS through 2019 did not have fine-grained data quality metadata encodings thus we may need to code everything as “screening” quality.
  3. EQUIS data structure is designed around hazardous waste cleanup thus it terminology, semantics, and encodings are not always a good match for the project. The database at Cornell must be based on the union of this project and EQUIS requirements, thus there will be redundancy in data maintenance.

3. Data model

EQUIS uses a hierarchical model of data representation that has been adapted to earlier Cornell database work supporting projects with the Bureau of Pest Management. The current project does not use counties for grouping samples, thus a somewhat different encoding could be worthwhile.

The primary EQUIS hierarchy (as we used it earlier) is:

  • Facility

    • Subfacility(ies)

      • (sampling) Location(s)

        • Sample(s)

          • Chemistry (analytical results)
Data type Dimensionality Apparent EQUIS usage Past Cornell adaptation Proposed present Cornell adaptation
Facility One per managed cleanup site Cluster data by hazwaste cleanup site Entirety of Upstate is one “facility”. Long Island is another “facility”. Implicitly New York City is a third facility for pesticide water monitoring purposes. (same as past Cornell)
Sub-facility One or more per managed cleanup site Subvidide a hazwaste facility into parts. Cluster locations within a cleanup site. County Category of site. Eight categories of categorical sites, a ninth for lakes, a tenth for “long term”. The subfacilities thus permit zooming in on sites of one category, on lakes together, and on the ensemble of “long term” sites.

Note that we cannot use counties because some of the categorical sites are unique within their county, thus listing the county would indirectly disclose the identity of the confidential cooperator.
Location One or more per cleanup site Precise 3-D (with depth) or 2-D location of recurring sampling. Sequence number within sampling location type (surface water, groundwater, others) within county. (Within-site sampling positions.) Sequence number within the subfacility.
Sample One or more per location Represent date, time, and depth below surface at an XY location (if vertical proifiling). Date and timestamp, depth suffix Date and timestamp. (Sampling depths are fixed per location.)
Analytical result One or more per sample Represent one chemical or physical measurement on a sample. (no adaptation needed) (no adaptation needed)

4. Preparatory Steps

  1. Periodically refresh the specifications required for EQUIS submissions [need download page URL]. These change with time. Data already within EQUIS do not have to be resubmitted to match later specifications. The project’s tabular results database is designed around cumulative EQUIS submission requirements as they existed in 2019.
  2. Submit new analytes, analytical methods, labs, and other items for assignment of new EQUIS codes, when there is no good match among the EQUIS “valid values” tables. It appears that the first person submitting a new code request determines the official EQUIS name of the real entity (such as analyte) forever after.
  3. Assign facility, subfacility, and location codes. Generally the facility and subfacility codes require DEC approval after negotiations among Cornell, the Bureau of Pest Management, and DEC EQUIS staff. Location codes are derived from subfacility codes and do not require approval. Our prior “facility” code uses one value for all of upstate NY beyind New York City. Subfacilities were counties. Locations were water medium types (surface water, ground water for example) and sequence numbers within types. The facility is not encoded within the data: an entire batch is for one facility.
    • For this project it may be advantageous to use the subfacility to encode the category of pesticide user hosting the wells, instead of the FIPS county number which would make the owner easier to deduce. Then the sequence numbering would be within the category.
  4. Agree between Cornell and Bureau of Pest Management on location censoring strategy so that an anonymous land owner’s identity is not disclosed or easily discoverable.
    1. Lakes can be fully disclosed and would encode latitude and longitude for each sampling location.
    2. Categorical sites on public lands, for example golf courses in State Parks, can have locations of wells fully disclosed.
    3. “Long term” wells can be located to the mailing zip code centroid.
    4. Categorical sites not on public lands can be located in “upstate” without a geographic coding. Even disclosing the DEC region may be too suggestive for rare categories.

5. Per data batch steps

  1. Decide which samples and which analytical results for those samples should be included in the batch. Tag new sampling locations (ones not previously submitted), samples, and analytical results from those samples for inclusion. (Each record in each table has provision to record which records should be included or excluded from EQUIS exporting.)
  2. Code censoring in SQL to prevent direct and indirect disclosure of identities of landowners. Generally the category of site should be included, such as “vineyard”, and the censoring should prevent guessing the identity of the site based or rarity of the category upstate. For example, there may be only two sod farms upstate on vulnerable aquifers, thus if either became cooperators their location would have to be coded as “upstate NY” rather than disclosing the zip code, county, or DEC region.
  3. Execute SQL code to create preliminary tables equivalent to EQUIS export format.
  4. Convert the preliminary tables to EQUIS export XLSX format. (Typically a bulk copy and paste operation.)
  5. Validate the export using the EQUIS checker. Correct problems and repeat steps 2-5 until the entire export passes the validator. Note that problems with data are corrected in the project database, not the EQUIS export. We want the final EQUIS export to require only the project database and the SQL code, with no manual editing of the export.
  6. Submit the results of #5 to DEC’s EQUIS submission email address.
  7. Revise and resubmit as requested.
  8. Record in the project database that the sample batch has been accepted.

Note that Cornell is not usually informed about acceptance, only contacted if there is a problem. Thus silence is assumed to be acceptance.

6. Checking data that have been accepted into EQUIS

Cornell does not have access to EQUIS thus must rely on a DEC staff member having EQUIS access to make a download from EQUIS. This does not necessarily need to be done for every batch.

7. Correcting previously submitted data

This is done by overlaying previous records that were accepted for submission with new information. Note: There is not a documented provision to delete records in this fashion, thus to delete there either must be a special manual request made, or the overlayed replacement record must be encoded in a way that indicates that there are no valid data in the record.

8. Timing

This work is far from urgent, and can be timed to be in a season when field work is at a minimum. This is typically the snowy months from December through March when travel conditions prevent collecting samples, and inhibit recruiting.

It is tempting to put it off for more than one year. There are disadvantages to delaying it:

  • There may be changes in EQUIS data specifications, particularly new mandates, that require going back to data not recorded in the project database or encoded differently. Getting it in earlier reduces the changes that can occur.
  • Registering new codes takes calendar time and may push closure of a submission into a busier season of year. Allow months of lead time.
  • Getting into a routine of entering the EQUIS encodings in parallel with the project encodings will save from having to change thousands of records later. It is best to do them on an ongoing basis as data batches are merged into the database, including keeping up with EQUIS encoding changes.
  • Installing and running the proofreading software infrequently leads to longer startup times to learn its licensing, installation, and use.
  • Going around and around with DEC EQUIS staff about a submission may turn up surprise problems. Starting the refresh of EQUIS codes and export SQL coding before the beginning of the winter slack season my be helpful if the interchange takes a long time. Getting new code requests to appear, including in the proofreading software, has taken the longest thus should start first.