
Signpost ID Service

What is Signpost and how do I use it?


The Signpost digital ID system balances the needs of data archiving for persistent storage, that is, assigning identifiers to unique pieces of information, with the needs of active computation, that is, finding the locations of data on a living system where data may be physically moved or updated. We have built a simple implementation of this design as a two-layer identification service with a REST-like API.
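
As a minimal sketch of the resolution step, assuming the alias endpoint and response fields ('hashes', 'urls') used in the Nexrad example later on this page, and with a placeholder ARK value:

    import requests

    # Resolve a user-defined identifier through the Signpost REST-like API.
    # The endpoint and the 'hashes'/'urls' response fields follow the
    # Nexrad example below; the ARK value itself is a placeholder.
    ark_id = 'ark:/31807/osdc-example'
    resp = requests.get('https://signpost.opensciencedatacloud.org/alias/' + ark_id)
    record = resp.json()
    print(record['hashes'])  # hash-based identifiers (bottom layer)
    print(record['urls'])    # known locations of the identified data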

The top layer utilizes user-defined identifiers. These are flexible, may be of any format, including Archival Resource Keys (ARKs) and Digital Object Identifiers (DOIs), and provide a layer of human readability. Each user-defined identifier maps to hashes of the identified data objects. This allows for mutability, since an identifier can be reassigned to a different hash when the data changes, while the earlier hashes still identify the earlier data for reproducibility.
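
As an illustration only, a top-layer record can be pictured as a user-defined name pointing at the hash of the object it currently identifies; the record below is hypothetical, with field names echoing the JSON returned by the alias endpoint used later on this page:

    # Hypothetical top-layer (alias) record: a user-defined identifier
    # pointing at the hash of the data object it currently names.
    alias_record = {
        'name': 'ark:/31807/osdc-example',  # user-defined, human-readable (placeholder)
        'hashes': {'md5': 'd41d8cd98f00b204e9800998ecf8427e'},  # hash of the object's bytes
    }
    # Mutability: updating the object means assigning a different hash to
    # the same name, while the old hash still identifies the old bytes.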

The bottom layer utilizes hash-based identifiers. These are inflexible and identify data objects as unambiguously as possible. Hash-based identifiers guarantee the immutability of the identified data, allow duplicated data to be detected (identical content produces identical hashes), and allow verification upon retrieval. These identifiers map to the known locations of the identified data.
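
A minimal sketch of what the bottom layer guarantees, using Python's hashlib (the helper name here is ours, not part of the Signpost API): the retrieved bytes must reproduce the hash-based identifier they were fetched under, and identical content yields identical identifiers:

    import hashlib

    def sha256_id(path):
        # A hash-based identifier is computed over the raw bytes,
        # so it names the content as unambiguously as possible.
        with open(path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Verification upon retrieval: recompute and compare.
    # assert sha256_id('downloaded.dat') == expected_sha256
    # Duplicate detection: equal hashes imply identical bytes.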

By utilizing the Signpost digital identifier service, we can relocate data files from our data commons to another commons without researchers needing to change their code.

Working with the ID Service

After a user has received their Digital IDs, they may want to confirm integrity and download the data from their query. Below are example Python functions from the Nexrad Jupyter example that read a txt file produced by a Signpost query, check file integrity, check for a preferred repository, and then download the data to the directory passed to the download_from_arks function.

    import hashlib
    import os

    import requests

    # Read the ARK identifiers from the txt file generated by the search service.
    with open('testarks.txt', 'r', encoding='utf-8-sig') as f:
        id_service_arks = [line.strip() for line in f if line.strip()]

    for ark_id in id_service_arks:
        print(ark_id)


    # The hash recorded in Signpost should match the locally calculated hash.
    def confirm_hash(hash_algo, file_, actual_hash):
        # Hash the raw bytes, not decoded text, so the digest is reproducible.
        with open(file_, 'rb') as f:
            computed_hash = hash_algo(f.read()).hexdigest()
        return computed_hash == actual_hash


    def download_from_arks(id_service_arks, intended_dir, hash_confirmation=True,
                           pref_repo='https://griffin-objstore.opensciencedatacloud.org/'):
        hash_algo_dict = {'md5': hashlib.md5, 'sha1': hashlib.sha1, 'sha256': hashlib.sha256}

        for ark_id in id_service_arks:
            signpost_url = 'https://signpost.opensciencedatacloud.org/alias/' + ark_id
            resp = requests.get(signpost_url)

            # Parse the JSON response into a dictionary.
            signpost_dict = resp.json()

            # Get the repository URLs for this object.
            repo_urls = signpost_dict['urls']
            for url in repo_urls:
                print('url =', url)
                # If the preferred repo appears in a URL, opt for that one;
                # otherwise the last URL provided is used.
                if pref_repo in url:
                    break

            # If you're not in Jupyter, you can use the requests library
            # to download the file.
            r = requests.get(url)
            # The file path is needed for hash validation.
            file_name = url.split('/')[-1]
            file_path = os.path.join(intended_dir, file_name)
            with open(file_path, 'wb') as out:
                out.write(r.content)

            # Otherwise, this bash command can be run from Jupyter:
            # !sudo wget -P $intended_dir $url

            if hash_confirmation:
                # Get the dict of {hash type: hash value}.
                hashes = signpost_dict['hashes']
                # Iterate over the (hash type, hash value) pairs.
                for algo_name, expected_hash in hashes.items():
                    # Look up the matching hash function and fail if the
                    # downloaded file's hash differs from the metadata.
                    hash_algo = hash_algo_dict[algo_name]
                    assert confirm_hash(hash_algo, file_path, expected_hash), \
                        '%s hash calculated does not match hash in metadata' % algo_name


    # To download, run the function; make sure the directory 'mayfly_data' exists first.
    download_from_arks(id_service_arks, 'mayfly_data')

ARK Key Service

The OSDC Public Data Commons features a key service utilizing ARK codes as permanent identifiers for each dataset. More information can be found here: https://www.opensciencedatacloud.org/keyservice/