[2015-05-20] Thermodynamic Data Legacy Rescue Project Update

DCO ID 11121/8220-8367-7863-2704-CC

Date Submitted

  • 2015-05-20

Update Text

  • 1. Rescued Datasets:
    A number of rescued datasets have been registered on the DCO data portal. They are accessible through the DCO dataset browser under the EPC community: http://deepcarbon.net/dco_datasets

    2. Data Extraction Workflow:
    A data extraction workflow was established using the currently available facilities and was used to rescue the datasets.

    (1) Archive the PDF documents on a web server.
    (2) Log in to the virtual machine remotely with the user name ‘dco_user’. The password can be requested from Patrick (westp@rpi.edu).
    (3) Download the PDF documents from the web server to the virtual machine.
    (4) Launch the PDF2XL tool on the virtual machine. Read this document to learn the features of PDF2XL.
    (5) Load a PDF document into PDF2XL. Draw a bounding box around each table on a page, and the tool will recognize the structure of the table (i.e. columns and rows), the record in each cell, and the column headers. The result is shown in a spreadsheet in the same user interface.
    (6) Check the quality of the result by moving the highlight through the cells of the spreadsheet. Comment: unfortunately, the spreadsheet cannot be edited directly.
    (7) Export the result to an Excel table and then correct the errors identified in Step (6).
    (8) Locate the PDF record (in this case study all PDFs are journal papers) on the publisher's website, and download the citation information and the DOI. Paste the citation information and DOI into the Excel table generated in Step (7).
    (9) Register metadata of the original paper and the rescued dataset on a data portal, such as the DCO data portal.
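
    Steps (7) and (8) were carried out by hand in Excel. As a minimal sketch only, the citation step could be scripted as below, assuming the corrected table has been exported to CSV. The function name `attach_citation`, the column names `citation` and `doi`, and the sample values are hypothetical and not part of the project workflow.

```python
import csv
import io


def attach_citation(table_csv: str, citation: str, doi: str) -> str:
    """Return the CSV text with 'citation' and 'doi' columns added to every row.

    Assumes the first row of table_csv holds the column headers.
    """
    rows = list(csv.reader(io.StringIO(table_csv)))
    if not rows:
        return ""

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Extend the header row, then repeat the same citation and DOI on each data row.
    writer.writerow(rows[0] + ["citation", "doi"])
    for row in rows[1:]:
        writer.writerow(row + [citation, doi])
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder table, citation, and DOI for illustration only.
    sample = "T_K,Cp\n300,10.5\n"
    print(attach_citation(sample, "Author (1990)", "10.0000/example"))
```

    Repeating the citation on every row keeps each record traceable to its source paper even if rows from several tables are later merged.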

    3. Further Work:
    Dr. Mark Ghiorso suggested a workflow and, in particular, the metadata that a geoscientist would like to capture.
    (1) select a reference
    (2) locate that reference as an electronic document in the university library
    (3) download the reference
    (4) metadata: identify and record the chemical composition of the material being studied in the paper; this could be a mineral name found in the title of the paper or in the text, or it could be a chemical composition recorded in a table; identify the source of this material: is it natural or synthetic?
    (5) metadata: identify the structure of the material. Is this a solid, liquid or gas? If the material is a solid, what is the crystal system or space group that characterizes the structure?
    (6) metadata: identify how the experiments were done. Are these experiments (1) high-temperature drop calorimetry, (2) low-temperature adiabatic calorimetry, (3) differential scanning calorimetry, or (4) something else?
    (7) metadata: identify where the experiments were done: the name of the lab and where the lab is located
    (8) metadata: identify the experimental device that was used and who manufactured the device
    (9) metadata: identify the standards that were used, if any, to calibrate the measurement device
    (10) metadata: identify the capsule material, if any, that contained the experimental material; was the capsule sealed or open?
    (11) metadata: identify the temperature-time history of the experiment, if that information is reported
    (12) metadata: identify the measurement units for temperature, and heat content or heat capacity.
    (13) metadata: identify the mass of material used in the experiment
    (14) metadata: identify whether the experiment was performed in air or in contact with some other gas, and if so, which gas
    (15) metadata: identify how the experimental results are presented: tabular form or as graphs
    (16) data: tabulate experimental results for each data type (as in step (6)) as a function of measured temperature; include the precision of each measurement, if reported; it is possible that the precision could be stated in general terms somewhere in the text and not in the data tables.
    (17) store the original PDF document in the data repository for future reference and clarification of metadata
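
    The metadata items in the list above could be collected into a single record per paper. The following is a rough sketch of one possible layout, not a schema adopted by the project; the class name `CalorimetryRecord` and all field names are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Tuple


@dataclass
class CalorimetryRecord:
    """One rescued paper; field names are illustrative, not a project schema."""
    reference: str                                  # citation or DOI of the paper
    material: str                                   # mineral name or chemical composition
    source: Optional[str] = None                    # "natural" or "synthetic"
    phase_state: Optional[str] = None               # "solid", "liquid" or "gas"
    crystal_structure: Optional[str] = None         # crystal system or space group
    technique: Optional[str] = None                 # e.g. "DC", "AC", "DSC", "other"
    lab_name: Optional[str] = None
    lab_location: Optional[str] = None
    device: Optional[str] = None
    device_manufacturer: Optional[str] = None
    calibration_standards: Optional[str] = None
    capsule_material: Optional[str] = None
    capsule_sealed: Optional[bool] = None
    temperature_time_history: Optional[str] = None
    temperature_unit: Optional[str] = None
    heat_unit: Optional[str] = None                 # unit of heat content or heat capacity
    sample_mass: Optional[str] = None
    atmosphere: Optional[str] = None                # "air" or the name of another gas
    presentation: Optional[str] = None              # "table" or "graph"
    # Each measurement: (temperature, value, precision or None if not reported).
    measurements: List[Tuple[float, float, Optional[float]]] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of metadata fields that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = CalorimetryRecord(reference="example reference", material="quartz", technique="DSC")
    print(record.missing_fields())
```

    Leaving unreported items as empty values, rather than guessing, makes it easy to list what still has to be read from the original PDF.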

    For more automated document processing approaches such as NLP, Prof. Ghiorso also offered a few suggestions on topics that geoscientists are interested in:
    (1) How many papers report measurements using the technique of low-temperature adiabatic calorimetry (AC)? How many high-temperature drop calorimetry (DC)? How many differential scanning calorimetry (DSC)?
    (2) What substances have been investigated over what temperature ranges? Are there patterns to this distribution?
    (3) For each substance, are there measurements made using AC and DC/DSC? These measurements could be reported in separate papers.
    (4) What laboratories perform the majority of experiments? Make a list of labs in descending order.
    (5) When were the experiments done (i.e. date of publication)? Code these by type of experiment (AC, DC, DSC).
    (6) Where were the experiments done (i.e. location of lab)? Code these by type of experiment (AC, DC, DSC).
    (7) What substances investigated have experimental results that report heat content or heat capacity anomalies associated with: (a) structural phase transitions, (b) magnetic phase transitions, (c) electronic phase transitions, (d) cation-ordering effects in minerals, or (e) change in phase state (i.e. solid to liquid, liquid to gas)?
    (8) Are there substances that have been investigated by more than one laboratory?
    (9) As a function of time (i.e. date of publication), what units are used to report results of measurements?
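
    Once metadata have been captured per paper, several of the questions above reduce to simple counts. The sketch below illustrates this under the assumption that each paper is represented as a dictionary with the keys `substance`, `technique` and `lab`; these keys, the function names, and the sample records are hypothetical placeholders.

```python
from collections import Counter


def count_by_technique(records):
    """Number of papers per technique code (AC, DC, DSC)."""
    return Counter(r["technique"] for r in records)


def labs_by_activity(records):
    """Lab names in descending order of the number of papers."""
    return [lab for lab, _ in Counter(r["lab"] for r in records).most_common()]


def substances_with_multiple_labs(records):
    """Substances investigated by more than one laboratory, sorted by name."""
    labs_per_substance = {}
    for r in records:
        labs_per_substance.setdefault(r["substance"], set()).add(r["lab"])
    return sorted(s for s, labs in labs_per_substance.items() if len(labs) > 1)


if __name__ == "__main__":
    # Placeholder records for illustration only, not rescued data.
    records = [
        {"substance": "A", "technique": "AC", "lab": "Lab 1"},
        {"substance": "A", "technique": "DSC", "lab": "Lab 2"},
        {"substance": "B", "technique": "AC", "lab": "Lab 1"},
    ]
    print(count_by_technique(records))
    print(labs_by_activity(records))
    print(substances_with_multiple_labs(records))
```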