HowTo – Transcribing DEC Air Lab data to database [60%]

Priority: medium
Updating: each new spreadsheet from DEC has had different format.

This HowTo covers the path from receiving a DEC Air lab analytical report in spreadsheet or PDF form to having the data merged with earlier DEC data and Cornell data in a form that allows cumulative summaries and EQUIS submission.

Change log for EQUIS submission HOWTO:

Date Who Comments
2021 03 05 Sp17 First version from memory, skeleton of how we interact.
2021 10 01 Sp17 Added change log. Reformat for export to DOCX.
2023 06 15 Sp17 We have been through one transcription for the 2022 pesticide results, without issue. Converted to Markdown format.
2023 06 29 Sp17 Minor editing about the computer aspects of the conversion process. Potential for scripting including conversion of anions and cations lab results.
2023 06 30 Sp17 Edited to reflect an improved way of converting the DEC format into our tabular database format.

Related howtos:

1. Objectives

  • Verify correctness of DEC lab submission.
  • Merge incremental data from one DEC submission into cumulative data.
  • Transform the DEC lab result data structure into an EQUIS compatible data structure.
  • Fill in EQUIS metadata columns in the tabular database for method, analyte, CAS number, analytical lab identity.

2. Overall Process

Prerequisites:

  • The DEC results in their original form should be proofread manually. Any obvious problems, such as duplicate analyte names in column headings, should be reconciled with DEC lab staff.
  • For referential integrity in the project database, there should already be sample and location records in the operating database related to each sample covered by the DEC report. Ideally the location and sample records are created long before the analytical data return from lab, such as when a location is first identified or a sample is brought back to the Cornell lab.
  • EQUIS requires dates of analysis in our cumulative database. If these have not been provided in the report from DEC, poll the lab staff for them. These do not need to be precise. In some years we have used the date we received the lab report as a “conservatively late” lab date estimate for EQUIS purposes.

The DEC reports through 2022 were organized as one row per sample and one column per analyte. Sometimes there is more than one table, breaking out analytes into groups. Replicates were embedded in the tables, as additional sample rows for all analytes or selected analytes. (When fewer than all analytes are covered there is a “not tested” result cell entry for the ones not tested, such as a single minus sign. An empty cell can also be used.) Qualitative results such as “not detected” or “not tested” are in the same cell as numerical results, thus mixing numbers with text. Units are implicit for the entire table, almost always µg/L. A detection limit is in a footnote or a header row. Sample identifiers are what Cornell sent plus a laboratory (sub-)sample ID (accession) number, which EQUIS makes provision for, thus we transcribe it. There is a considerable difference between this data structure and the format used in our project tabular database which is derived from EQUIS.

We begin transforming the DEC provided format in a copy of the DEC file (if spreadsheet), or we make our own spreadsheet version by copying from PDF and pasting into Excel.

We transform the DEC format within our the spreadsheet file in several steps:

  1. Insert additional columns near the left to match our tabular database structure. Most important are having sample identifiers (site ID, sampling location ID, sample date, sample time). The DEC lab often does this from our sample manifest file. Our simple sequential number is in the DEC file.
  2. Make a row above the DEC results table and fill in our database analyte names above the DEC names; these are usually identical to DEC ones. There can be no subscripts or footnotes, just plain text.
  3. Beneath the DEC results block, use formulae to create N copies of the block of sample identifiers. The first copied block will be for the first analyte, second block for the second analyte, etc.
    • Create one block copy at a time by creating one copying cell referring to the leftmost data cell of the first row of the DEC block, and copy that over to the second column after the its first row, and adjusting Then have a column in the block that copies the analyte name Include an empty column in the destination area for the analyte name.

matching our Analyses table keys and replicate numbers for samples and analyses. This is still in the format of one row per sample and one column per analyte.

Then we edit to fill in additional columns of the analyses table. The analyte names from DEC need to be filled in.

This process could be scripted, but with typically one batch of results per year and the lab’s format changing each time, the labor required for scripting might exceed the labor required for annual manual imports.

The transcription process is tedious and lengthy which may increase the transcription error rate. Proofreading the project database contents against the DEC original is essential. One thing that could be scripted is to create a SQL query with an output similar to the DEC format, and run that against the transcribed data. That would make the final proofreading faster and more reliable. The project database format is consistent over time thus only the SQL code for output format may need to be changed between DEC batches.

3. Sample Identification

The full sample IDs consist of location codes, dates, times, and sample replicate numbers. (We are unlikely to have samples from multiple depths at identical locations thus the single depth is simply a descriptor of the sample.) These IDs are in three columns of the destination database, probably one column in the DEC report. Thus the contents of the DEC report will need to be split.

The replicate numbers in DEC-tested samples are introduced by DEC when they do replication. The samples may be tested in a different order than in the manifest which was based on the original samples. Thus for proper linkage to sample records, the samples table will need to have records added, one per each of the DEC replicates.

The location code of a lab blank, i.e. Cornell deionized water, is Cornell, not any field site. This can be a companion to a field blank. The field blank has a location code of where it was taken in the field.

4. Analyte names

Our project uses the DEC Air lab’s analyte names which are often different from official EQUIS names. (EQUIS analyte names typically are established by the first approved EQUIS submitter of an analytical result for that compound.) In earlier DEC Air lab data imports we kept an EQUIS analyte name in every analysis table row, which is unnecessarily denormalized.

Later: We could eliminate the denormalized condition and reduce data entry labor by just include the DEC Air lab analyte name in the Analyses table and creating a lookup table that matches DEC Air names to EQUIS names. The matching table would also include CAS registry numbers.

5. Results

The DEC results combine detected and non-detected result and a qualifier into the same column, with non-detects signified by the ‘<’ then the detection limit. A detected value is a bare number, and a non-detection would typically be coded as “<0.025”. The project’s tabular database “analyses” table is close to EQUIS data structure, which uses a null value for not-detected, puts a qualifier string into a second column, and a detection limit into a third column. EQUIS only requires a detection limit for results with a qualifier, and insists on a null result value for a non-null qualifier. (It is OK in EQUIS to cite a detection limit when there is a null qualifier and a numerical result.)

The tabular database matches EQUIS by including a units name with every analytical result. We adopt EQUIS encodings of unit names. They do not allow Greek characters, thus we use “ug/l” for micrograms per liter.

All DEC Air lab results to date have been in ug/L, thus the same units apply to the whole batch.

6. Additional Quality Control Test results

These are not in the main table of the DEC report. They include results of matrix spike recovery tests. We assume that the lab adjusts for this in their final reported sample results.

No replicate results were provided in the 2022 analytical results. Earlier year results did include some which we encoded as replicates in our Analyses table. The replicate result has identification in EQUIS terms that links back to an original (“parent”) sample so the two results can be retrieved together and distinguished from one another.

7. More EQUIS metadata

EQUIS requires providing the date that an analysis was performed. We know that for our own analyses; in the field that is the date of sampling, and we record the date of analysis in our lab logbook and the any results spreadsheet. We must get analysis dates from the DEC lab. Typically the analytical work is split across groups of parameters, with each group performed on a different day.

EQUIS requies citing the identity of the analytical lab, per result, using their registered names. All SWL work uses a previously registered name of “Cornell”. The DEC Air lab does not have a registered name, so we use the generic DEC. We record our preferred lab name in another column. We also record who did the analysis, with DEC Air under a single name and individuals of SWL itemized.

There is provision for a lab sample ID. We use the DEC Air lab’s internal number which is in a column of the transmitted DEC spreadsheet and may be copied and pasted as one data column.

8. Manual transformation, earlier years and 2022

We are using DB Browser for SQLite3 as the destination database interface and the data come in via Microsoft Excel. Copying and pasting from Excel into DB Browser works very well, as long as the rows and columns have the same meanings in the same orders.

graph TD;
A[Edit DEC table in Excel<br>into same rows/columns<br>as DB Browser]-->C[Copy sample identities<br>site ID, point ID, date, time<br>from Excel]

B[Make empty rows in<br>DB Browser to receive<br>next analyte]-->D

C-->D[Paste into DB Browser]
A-->E[Copy one results<br>column from Excel]-->D[paste results into DB Browser]
G[replicate analyte name,<br>units, lab name, analyst<br>into every row of DB browser]-->D

D-->F[Move non-detected results<br>into detection limit column,<br>get rid of '<' signs, fill in 'nd'<br>for qualifier in these cells]

Manual transcription has worked over the years. Suppose there are 28 analytes of 50 samples, i.e. 50 rows and 28 columns in the DEC spreadsheet, not counting headings at top or sample ID information at left:

  1. Make 28x50=1400 empty rows in DB Browser Analyses table. (There must be a better way to do this than clicking an icon once per row.)
  2. Make one copy of the 50 multi-column Excel sample identities, and paste it repeatedly (28 times) into consecutive empty rows of DB Browser.
  3. As above, copy the DEC sample accession numbers for the 50 samples from Excel into the column for Lab_ID in dB browser. This is one copy pasted 28 times.
  4. Copy the first Excel column of results into the first 50 fresh rows result column. Copy second column of Excel results into next 50 rows, etc.
  5. Enter one cell of the DEC analyte name into the first cell of DBBrowser parameter column. The copy this and paste it into the 49 cells below. Repeat this enter, copy, paste block for each analyte, working down the cells of the parameter column and across the rows of the source spreadsheet to get the parameter names.
  6. Copy the 1400 cells in the results column of DB Browser into the narby (to right) detection limit cells. In the results column, empty all cells having nondetect encodings. (Best to paste in a ‘null’ entry.)
  7. In the detection limit column, empty all cells containing bare numbers. Delete the ‘<’ symbols from the cells containing non-detect sympbolism; can edit one such cell and paste it on top of the remaining cells for this parameter that had same format; detection limits vary by parameter but usually not per sample.
  8. In the qualifier column, for each row whose result is now blank, enter ‘nd’. These are also the rows where there is a detection limit value left after step 6.
  9. Take a long break to walk around, this is mind numbing.
  10. Fill in ‘DEC’ as the lab for the new 1400 rows (one entry, one copy, paste into next 1399 cells below ..). Do the equivalent for the ‘ug/l’ units. Do the equivalent for the “original_file” column, taking the DEC’s spreadsheet file name as the string to enter.
  11. Fill in EQUIS analyte names and EQUIS methods into batches of 50 consecutive rows. (Enter one, copy it, paste into next 49 cells. DB Browser restricts this replication pasting to one cell thus the copying has to be done separately for name and method.)
  12. Fill in analysis dates similarly to 11. In most cases, all samples would be analyzed for a given parameter on the same date. These dates have to be obtained from DEC, they are usually not in the spreadsheet.
  13. Proofread. (!!!) It is useful to write SQL to draw from the newly entered DB Browser records and reconstitute the format of the Excel table, so the SQL output can easily be proofread against the DEC spreadsheet cell by cell. Most of the cells have nondetect results thus the detect vs nondetect entries can be proofread by pattern (perhaps two SQL queries, one for the detects only with blanks for nondetects, and the other for the nondetects only with blanks for detects), and only the detection limits and detect values need to be proofread by digit. Perhaps a second person should do this.

9. Scripted conversion

In earlier projects with around 40 samples and 50 analytes = 2000 analytical results, this would take a few hours for one DEC file; fortunately only once per year. The 2022 DEC lab data vintage contained 28 analytes and 107 samples, 3000 values, still tractable manually. The 2023 vintage will cover more samples and more analytes as all categorical samples yield 2 samples per sampling point and long term sites emerge, with 50-90 analytes and 150-175 samples, thus 7,500-16,000 values to transcribe.

At this scale, it will prove useful to write a conversion script that does all transformations in the list above taking the DEC Excel spreadsheet (perhaps a CSV export) as input. There is no provision to do this in DB Browser which is a very basic data manager interface, but this could be coded in Python, to export a CSV. Then open the CSV in excel and mass copy and paste into DB Browser.

Another approach is to make an extra tab in the Excel file and structure that using formulae to build the destination row and column format, then copy and paste (paste values is desired and the default) into DB Browser.

These are the most important data of the project, very costly to produce, and easy to misplace a number, thus they deserve

9. Error detection and correction

During earlier transcriptions, we discovered mis-entered analyte names but were able to deduce the correct ones, confirming with the lab. Make a note in the lab log when such a correction is made.

The lab fills in Cornell sample identities to the results rows, based on our manifest in Excel spreadsheet. We also provide sample sequence numbers. The Cornell spreadsheet inhaler should check the sample identites in the DEC spreadsheet before trusting them. In some cases the identities are corrected after the manifest and samples were sent to DEC; for example this is to keep a sampling point ID identical across consecutive samples from there. We want the corrected IDs in the tabular database.