Introduction

  • PDS4 labels are written in XML and validated against so-called dictionaries. Each dictionary consists of an XML Schema (.xsd) file and a Schematron (.sch) file.
  • Dictionaries are built using a PDS-provided tool called LDDTool, for which some knowledge of XPath and XQuery is useful.
  • In fact, XQuery is generally very useful to know when working with PDS4 labels, since it can be used to extract information programmatically with very little effort.
  • This page aims to capture some examples, hints and tips on how to use XPath and XQuery, both standalone and in LDDTool.


The following label can be downloaded to follow along with the examples below: srn_raw_sc_mipa_20201016t2323_20201016t2345.xml

XPath

XPath is a query language that lets you select elements in an XML document; in PDS4-speak these can be classes or attributes. It specifies a path much like a typical file path, with forward slashes ('/') separating parent and child elements. The following example shows the start of a very minimal PDS4 product label:

<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1"
                       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                       xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1 https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1F00.xsd">

    <Identification_Area>
        <logical_identifier>urn:esa:psa:bc_mpo_serena:data_raw:srn_raw_sc_mipa_20201016t2323_20201016t2345</logical_identifier>
        <version_id>0.9</version_id>
        <title>BepiColombo MPO SERENA-MIPA Mercury Orbit Raw Science Data Product for period 2020-10-16 23:23 to 2020-10-16 23:45</title>
        <information_model_version>1.15.0.0</information_model_version>
        <product_class>Product_Observational</product_class>

...

So if we wanted to write the path to the version_id attribute, we would simply write:

/Product_Observational/Identification_Area/version_id

Executing this in an editor, application or supporting code should return the value "0.9".

There are many ways to evaluate XPath, but here we will demonstrate using xidel, a command-line tool (-s suppresses status output, -e gives the expression to evaluate):

$ xidel -se "/Product_Observational/Identification_Area/version_id" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
0.9
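
The same path can also be evaluated from code. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The LABEL string is a trimmed stand-in for the example label (values copied from the excerpt above), and the "pds" prefix is simply a name chosen here for the PDS namespace, which ElementTree needs spelled out.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the example label, not the full product.
LABEL = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <logical_identifier>urn:esa:psa:bc_mpo_serena:data_raw:srn_raw_sc_mipa_20201016t2323_20201016t2345</logical_identifier>
    <version_id>0.9</version_id>
  </Identification_Area>
</Product_Observational>"""

# PDS4 elements live in the PDS namespace, so map a prefix to it.
NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}

# fromstring() returns the root element (Product_Observational),
# so the path is written relative to it.
root = ET.fromstring(LABEL)
version_id = root.findtext("pds:Identification_Area/pds:version_id", namespaces=NS)
print(version_id)  # 0.9
```

To read the real label from disk, ET.parse(filename).getroot() would take the place of ET.fromstring(LABEL).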

Sometimes there may be multiple occurrences of an element; for example, the Modification_History can contain multiple Modification_Detail entries. The following query returns all the version_id entries in the history:

$ xidel -se "/Product_Observational/Identification_Area/Modification_History/Modification_Detail/version_id" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
0.9
0.6
0.5
0.4
0.3
0.2
0.1

You can also address the nth entry in a list, so if we want to know the version_id of the second Modification_Detail entry:

$ xidel -se "/Product_Observational/Identification_Area/Modification_History/Modification_Detail[2]/version_id" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
0.6
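
As a rough Python equivalent of the two queries above, the sketch below uses the standard xml.etree.ElementTree module on a trimmed stand-in label that holds only the first three history entries. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}

# Trimmed stand-in for the example label: only the first three history entries.
LABEL = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <version_id>0.9</version_id>
    <Modification_History>
      <Modification_Detail><version_id>0.9</version_id></Modification_Detail>
      <Modification_Detail><version_id>0.6</version_id></Modification_Detail>
      <Modification_Detail><version_id>0.5</version_id></Modification_Detail>
    </Modification_History>
  </Identification_Area>
</Product_Observational>"""

root = ET.fromstring(LABEL)
history = "pds:Identification_Area/pds:Modification_History/"

# Without a predicate, every Modification_Detail contributes its version_id.
all_versions = [
    e.text
    for e in root.findall(history + "pds:Modification_Detail/pds:version_id", NS)
]

# [2] selects the second Modification_Detail; XPath positions start at 1.
second = root.findtext(
    history + "pds:Modification_Detail[2]/pds:version_id", namespaces=NS
)

print(all_versions)  # ['0.9', '0.6', '0.5']
print(second)        # 0.6
```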

Sometimes you may want to find an element which appears in different parts of the tree, and here you can use the "//" operator and specify just part of the path, e.g.

$ xidel -se "//version_id" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
0.9
0.9
0.6
0.5
0.4
0.3
0.2
0.1

Here the first value is repeated since we are reading version_id from the Identification_Area directly as well as from the Modification_History.
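
The descendant search can be sketched in Python as well. With the standard xml.etree.ElementTree module the path is written ".//" because it is evaluated relative to the root element. The label is again a trimmed stand-in with three history entries.

```python
import xml.etree.ElementTree as ET

NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}

# Trimmed stand-in for the example label.
LABEL = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <version_id>0.9</version_id>
    <Modification_History>
      <Modification_Detail><version_id>0.9</version_id></Modification_Detail>
      <Modification_Detail><version_id>0.6</version_id></Modification_Detail>
      <Modification_Detail><version_id>0.5</version_id></Modification_Detail>
    </Modification_History>
  </Identification_Area>
</Product_Observational>"""

root = ET.fromstring(LABEL)

# ".//" matches version_id at any depth below the root, in document order.
found = [e.text for e in root.findall(".//pds:version_id", NS)]
print(found)  # ['0.9', '0.9', '0.6', '0.5']
```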

Selecting an element based on its children is also common. We showed above that you could pick a Modification_Detail by number, but what if you need to select by a child element? For example, you may want to find the entry where the version_id is 0.5 and return its modification date. This example does just that:

$ xidel -se "/Product_Observational/Identification_Area/Modification_History/Modification_Detail[version_id=0.5]/modification_date" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml 
2019-05-13
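
A child-value predicate can be sketched with Python's standard xml.etree.ElementTree module too, with one difference worth noting: ElementTree only compares against quoted strings, so the predicate is written as version_id='0.5' rather than as a number. The stand-in label below includes only the one modification_date that appears in the output above.

```python
import xml.etree.ElementTree as ET

NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}

# Trimmed stand-in: only the date shown in the xidel output is included.
LABEL = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <Modification_History>
      <Modification_Detail><version_id>0.6</version_id></Modification_Detail>
      <Modification_Detail>
        <modification_date>2019-05-13</modification_date>
        <version_id>0.5</version_id>
      </Modification_Detail>
    </Modification_History>
  </Identification_Area>
</Product_Observational>"""

root = ET.fromstring(LABEL)

# Select the Modification_Detail whose version_id child has the text "0.5",
# then step down to its modification_date.
date = root.findtext(
    "pds:Identification_Area/pds:Modification_History/"
    "pds:Modification_Detail[pds:version_id='0.5']/pds:modification_date",
    namespaces=NS,
)
print(date)  # 2019-05-13
```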

XPath also includes various functions. For example, to count the number of entries in the Modification_History:

$ xidel -se "count(/Product_Observational/Identification_Area/Modification_History/Modification_Detail)" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
7

You can also write queries that evaluate to a boolean (true or false), for example if you want to know if the version_id is below 1.0:

$ xidel -se "/Product_Observational/Identification_Area/version_id < 1.0" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
true

$ xidel -se "/Product_Observational/Identification_Area/version_id > 1.0" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
false
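
Note that these comparisons treat version_id as a decimal number. Assuming version_id follows the major.minor pattern, that is fine for telling 0.9 from 1.0, but a purely numeric comparison would rank 1.10 below 1.9. The sketch below shows one possible way to compare the two parts separately in Python; parse_version is a hypothetical helper, not part of any PDS tool.

```python
from typing import Tuple


def parse_version(version_id: str) -> Tuple[int, int]:
    """Split a 'major.minor' version_id into a tuple of integers."""
    major, minor = version_id.strip().split(".")
    return int(major), int(minor)


# Same answer as the numeric test for the example label.
print(parse_version("0.9") < (1, 0))  # True

# Numeric comparison and part-wise comparison disagree here.
print(float("1.10") > float("1.9"))                  # False
print(parse_version("1.10") > parse_version("1.9"))  # True
```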

Boolean functions like "not" can also be used, as can string functions like "contains" and "starts-with". Let's apply this to the logical_identifier:

$ xidel -se "contains(/Product_Observational/Identification_Area/logical_identifier, 'data_raw')" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
true
$ xidel -se "contains(/Product_Observational/Identification_Area/logical_identifier, 'data_rawww')" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
false
$ xidel -se "not(contains(/Product_Observational/Identification_Area/logical_identifier, 'data_rawww'))" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
true

A very common operation when working with PDS4 LIDs is splitting (tokenizing) the LID using a colon as the separator:

$ xidel -se "tokenize(/Product_Observational/Identification_Area/logical_identifier, ':')" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
urn
esa
psa
bc_mpo_serena
data_raw
srn_raw_sc_mipa_20201016t2323_20201016t2345

This lets you check, for example, whether the product belongs to the bc_mpo_serena bundle:

$ xidel -se "tokenize(/Product_Observational/Identification_Area/logical_identifier, ':')[4]='bc_mpo_serena'" srn_raw_sc_mipa_20201016t2323_20201016t2345.xml
true
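
The same LID checks are straightforward in plain Python once the logical_identifier has been read from the label. A minimal sketch, with the LID copied from the excerpt above. Remember that XPath positions are 1-based while Python indices are 0-based, so [4] becomes [3].

```python
lid = "urn:esa:psa:bc_mpo_serena:data_raw:srn_raw_sc_mipa_20201016t2323_20201016t2345"

# Equivalent of tokenize(..., ':')
parts = lid.split(":")

# Equivalent of tokenize(..., ':')[4] = 'bc_mpo_serena'
in_bundle = parts[3] == "bc_mpo_serena"

# Equivalent of contains(..., 'data_raw')
has_data_raw = "data_raw" in lid

print(parts[3], in_bundle, has_data_raw)  # bc_mpo_serena True True
```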

Schematron and LDDTool

Schematron is an XML validation language which makes heavy use of XPath. It is used (along with XML Schema) to constrain PDS4 labels and ensure that they are valid according to the stack of dictionaries referenced at the top of each label. It works by making assertions about the value or content of an element. If an assertion fails, a message is displayed to the user.

Some useful resources include:

In general, PDS4 Schematron files (.sch) are not written directly but are generated by LDDTool, which accepts as input an XML ingest file defining the dictionary's attributes, classes and rules. Here XPath is used in two places:

  1. to define the context
  2. to define the validation rule.

A simple example of defining a Schematron rule via an LDDTool ingest file is:

<DD_Rule>
    <local_identifier>modification_history</local_identifier>
    <rule_context>//pds:Identification_Area[not(pds:product_class='Product_SPICE_Kernel')]</rule_context>
    <DD_Rule_Statement>
        <rule_type>Assert</rule_type>
        <rule_test>pds:Modification_History</rule_test>
        <rule_message>Products MUST contain Modification_History</rule_message>
    </DD_Rule_Statement>
</DD_Rule>

This statement is translated into Schematron as:

  <sch:pattern>
    <sch:rule context="//pds:Identification_Area[not(pds:product_class='Product_SPICE_Kernel')]">
      <sch:assert test="pds:Modification_History">
        <title>modification_history/Rule</title>
        Products MUST contain Modification_History</sch:assert>
    </sch:rule>
  </sch:pattern>

The context determines where this rule applies. In this case it applies to every Identification_Area, except those in a Product_SPICE_Kernel. The assertion is evaluated relative to each node matched by the context, so unless you specify otherwise, a relative path in the test refers to children of that node. Here the rule therefore checks for the presence of Modification_History inside the Identification_Area of all products except SPICE kernels.
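
To make the context/test split concrete, here is a rough Python emulation of this one rule using the standard xml.etree.ElementTree module. It is only an illustration of the logic, not how a Schematron processor works internally. check_modification_history and make_label are hypothetical helpers, and the tiny labels built by make_label are not valid PDS4 products.

```python
import xml.etree.ElementTree as ET

PDS = "http://pds.nasa.gov/pds4/pds/v1"
NS = {"pds": PDS}
MESSAGE = "Products MUST contain Modification_History"


def check_modification_history(xml_text):
    """Return one message per Identification_Area that fails the assertion."""
    root = ET.fromstring(xml_text)
    failures = []
    # Context: every Identification_Area at any depth ...
    for area in root.iter("{%s}Identification_Area" % PDS):
        # ... except where product_class is Product_SPICE_Kernel.
        if area.findtext("pds:product_class", namespaces=NS) == "Product_SPICE_Kernel":
            continue
        # Test: a Modification_History child must exist (relative to the context).
        if area.find("pds:Modification_History", NS) is None:
            failures.append(MESSAGE)
    return failures


def make_label(product_class, with_history):
    """Build a tiny, non-valid label just for exercising the check."""
    history = "<Modification_History/>" if with_history else ""
    return (
        '<{0} xmlns="{1}"><Identification_Area>'
        "<product_class>{0}</product_class>{2}"
        "</Identification_Area></{0}>"
    ).format(product_class, PDS, history)


print(check_modification_history(make_label("Product_Observational", False)))
print(check_modification_history(make_label("Product_SPICE_Kernel", False)))
```

The first call reports the message once; the second returns an empty list because the context excludes SPICE kernels.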

Both the context and the rule can be considerably more complex. For example, here is a rule that checks that the label filename (minus extension) and the product ID (the last part of the LID) are the same - a PSA rule:

    <DD_Rule>    
        <local_identifier>logical_identifier_filename</local_identifier>
        <rule_context>pds:Product_Observational/pds:Identification_Area</rule_context>
        <rule_assign>name="file-name" value="replace(tokenize(document-uri(/), '/')[last()],'\.[^.]+$','')"</rule_assign>
        <rule_assign>name="lid" value="tokenize(pds:logical_identifier, ':')[last()]"</rule_assign>
        
        <DD_Rule_Statement>
            <rule_type>Assert</rule_type>
            <rule_test>
                lower-case($file-name) = $lid
            </rule_test>
            <rule_message>
                Label filename (<sch:value-of select="$file-name"/>) and last component of the LID (product ID) (<sch:value-of select="$lid"/>) must match for observational products
            </rule_message>
            <rule_description>Filename (minus extension) and the product ID (last part of the LID) must match for observational products (case can be different)</rule_description>
        </DD_Rule_Statement>
    </DD_Rule>

Several new concepts are introduced here:

  • rule_assign corresponds to the "let" statement in Schematron and allows the result of an XPath query to be assigned to a variable
    • this makes the rule easier to read, but more importantly allows the value to be used in messages telling the user what went wrong
  • the string function "lower-case" is used since LIDs have to be lower case, but filenames do not
  • embedding <sch:value-of select="$file-name"/> in the rule_message allows the value of the variable to be displayed in the validation output


The output Schematron for the above is:

  <sch:pattern>
    <sch:rule context="pds:Product_Observational/pds:Identification_Area">
      <sch:let name="file-name" value="replace(tokenize(document-uri(/), '/')[last()],'\.[^.]+$','')"/>
      <sch:let name="lid" value="tokenize(pds:logical_identifier, ':')[last()]"/>
      <sch:assert test="lower-case($file-name) = $lid">
        <title>logical_identifier_filename/Rule</title>
        Label filename (<sch:value-of select="$file-name"/>) and last component of the LID (product ID) (<sch:value-of select="$lid"/>) must match for observational products
      </sch:assert>
    </sch:rule>
  </sch:pattern>
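
The two let expressions can be hard to read at first, so the sketch below mirrors them step by step in Python. The function names and the file:/data/ path are hypothetical; only the LID is taken from the example label.

```python
import re


def product_id_from_uri(uri):
    """Mirror of the "file-name" variable: last path segment, minus extension."""
    file_name = uri.split("/")[-1]             # tokenize(document-uri(/), '/')[last()]
    return re.sub(r"\.[^.]+$", "", file_name)  # replace(..., '\.[^.]+$', '')


def lid_matches_filename(uri, lid):
    """Mirror of the assertion: lower-case($file-name) = $lid."""
    product_id = lid.split(":")[-1]            # tokenize(pds:logical_identifier, ':')[last()]
    return product_id_from_uri(uri).lower() == product_id


lid = "urn:esa:psa:bc_mpo_serena:data_raw:srn_raw_sc_mipa_20201016t2323_20201016t2345"
uri = "file:/data/srn_raw_sc_mipa_20201016t2323_20201016t2345.xml"  # hypothetical location

print(lid_matches_filename(uri, lid))          # True
print(lid_matches_filename(uri.upper(), lid))  # True, case is ignored on the filename
print(lid_matches_filename("file:/data/other.xml", lid))  # False
```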


