API Tutorial

AMRDC Data Repository API Introduction & Tutorial

In this tutorial, we will learn how to access and aggregate AMRDC data via the public API (application programming interface). The features exposed by the API will allow us to quickly search metadata, aggregate results, and even access the datafiles, all from within the programming language of our choice. Along the way, we will learn a bit about the architecture of APIs, HTTP requests, and wrangling JSON and other data streams in Python.

Web API fundamentals & JSON data structure

On the most basic level, a web API listens for requests at a given URL, or endpoint, and returns data. That data is packaged in a human- and machine-readable format. In the case of the AMRDC Data Repository API, the data exchange format used is JSON (JavaScript Object Notation), which follows the basic structure: {"key": "value"}.

Two things to note. First, JSON values can be strings, numbers, Booleans, null, or arrays (i.e. lists). Second, JSON objects can be nested, i.e. the value for a JSON key can itself be a JSON object with its own keys and values.

{
    "greeting": "Hello penguins!",
    "year": 2023,
    "happy_feet": true,
    "species": ["Emperor", "Gentoo", "Magellanic"],
    "pals": {"Polar Bear": false, "Fur Seal": true}
}
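In Python, the standard-library json module maps this structure onto native types. The following is a minimal sketch that parses the example object above from a string literal; the variable names are only illustrative.

```python
import json

# The example JSON document from above, as a Python string.
raw = """
{
    "greeting": "Hello penguins!",
    "year": 2023,
    "happy_feet": true,
    "species": ["Emperor", "Gentoo", "Magellanic"],
    "pals": {"Polar Bear": false, "Fur Seal": true}
}
"""

# json.loads() turns JSON text into Python objects:
# objects -> dict, arrays -> list, true/false -> True/False.
penguins = json.loads(raw)

first_species = penguins["species"][0]      # index into a JSON array
seal_is_pal = penguins["pals"]["Fur Seal"]  # follow a nested object

print(first_species, seal_is_pal)
```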

To illustrate this data exchange, the following link is an API endpoint that returns a list of dataset IDs from the repository, limited here to 10 datasets:

https://amrdcdata.ssec.wisc.edu/api/action/package_list?limit=10

Let's break the URL down:

https://amrdcdata.ssec.wisc.edu <-- main domain
/api/action                     <-- API path
/package_list                   <-- API endpoint
?limit=10                       <-- Specifies a parameter, 'limit', with the value 10
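The same breakdown can be reproduced in Python with the standard-library urllib.parse module. This is a small sketch; the variable names are our own, and splitting the path at its last slash is simply a convenient way to separate the API path from the endpoint.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://amrdcdata.ssec.wisc.edu/api/action/package_list?limit=10"
parts = urlsplit(url)

# Scheme + host make up the main domain.
domain = f"{parts.scheme}://{parts.netloc}"

# Split the path at its last slash: everything before is the API path,
# the final segment is the endpoint.
api_path, _, endpoint = parts.path.rpartition("/")

# parse_qs() returns each parameter as a list of string values.
params = parse_qs(parts.query)

print(domain, api_path, endpoint, params)
```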

If we open the link in a web browser, we should get a response that looks something like this:

{
  "help": "https://amrdcdata.ssec.wisc.edu/api/3/action/help_show?name=package_list",

  "success": true,

  "result": [
    "2-minute-palmos-automatic-weather-station-meteorological-measurements-palmer-station-2001-2017",
    "accurate-measurement-of-solar-and-infrared-radiation-fluxes-antarctic-plateau-at-concordia-station",
    "ago-4-automatic-weather-station-2012-quality-controlled-observational-data",
    "ago-4-automatic-weather-station-2012-reader-format-three-hour-observational-data",
    "ago-4-automatic-weather-station-2012-unmodified-ten-minute-observational-data",
    "ago-4-automatic-weather-station-2013-quality-controlled-observational-data",
    "ago-4-automatic-weather-station-2013-reader-format-three-hour-observational-data",
    "ago-4-automatic-weather-station-2013-unmodified-ten-minute-observational-data",
    "ago-4-automatic-weather-station-2014-quality-controlled-observational-data",
    "ago-4-automatic-weather-station-2014-reader-format-three-hour-observational-data"
  ]
}

This JSON response includes three keys: 'help', 'success', and 'result'.

The value for 'help' is a string representing an API endpoint which returns documentation on the 'package_list' function.
The value for 'success' will be 'true' or 'false', depending on the success of the API endpoint method.
The value for 'result' is the return value of the called API endpoint method (assuming 'success' is equal to 'true'). In this case, it is a list of strings representing the IDs of the first 10 datasets in the repository in alphabetical order.

As we will see, each API response will follow this structure, albeit with different value types depending on the method.
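Because every response shares this shape, it can be convenient to check 'success' once and hand back only 'result'. The helper below is a hypothetical sketch (the name unwrap is ours, not part of the API), and it assumes that a failed response carries its details under an 'error' key.

```python
def unwrap(response: dict):
    """Return the 'result' value of an API response, or raise if the call failed."""
    if not response.get("success", False):
        # Assumption: failed responses describe the problem under an 'error' key.
        details = response.get("error", "no details provided")
        raise RuntimeError(f"API call failed: {details}")
    return response["result"]


# A hand-written response in the same shape as the one shown above.
sample = {
    "help": "https://amrdcdata.ssec.wisc.edu/api/3/action/help_show?name=package_list",
    "success": True,
    "result": ["dataset-a", "dataset-b"],
}

print(unwrap(sample))
```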

Searching the repository

We can search the repository using the 'package_search' endpoint:

https://amrdcdata.ssec.wisc.edu/api/action/package_search?q=Byrd&rows=1000

https://amrdcdata.ssec.wisc.edu <-- main domain
/api/action                     <-- API path
/package_search                 <-- API endpoint
?q=Byrd                         <-- Specifies a parameter, 'q' (i.e. 'query'), with the value 'Byrd'
&rows=1000                      <-- Specifies a second parameter, 'rows' (i.e. # of results to return), with the value 1000
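Rather than typing query strings by hand, we can let urllib.parse.urlencode assemble them, which also takes care of escaping spaces and special characters. A minimal sketch follows; the helper name action_url and the BASE_URL constant are illustrative choices of ours.

```python
from urllib.parse import urlencode

BASE_URL = "https://amrdcdata.ssec.wisc.edu/api/action"


def action_url(endpoint: str, **params) -> str:
    """Build the URL for an API endpoint from keyword parameters."""
    query = urlencode(params)  # e.g. {'q': 'Byrd', 'rows': 1000} -> 'q=Byrd&rows=1000'
    if query:
        return f"{BASE_URL}/{endpoint}?{query}"
    return f"{BASE_URL}/{endpoint}"


search_url = action_url("package_search", q="Byrd", rows=1000)
print(search_url)
```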

Once again, the response will be a JSON object with "help", "success", and "result" keys. The value for "result" is a nested object with a few additional values:

{
  "help": "https://amrdcdata.ssec.wisc.edu/api/3/action/help_show?name=package_search",
  "success": true,
  "result": {
    "count": 98,
    "sort": "score desc, metadata_modified desc",
    "facets": {},
    "results": [
      // A list of datasets....
    ]
  }
}

For the 'package_search' endpoint, the value of "result" is an object containing the number of datasets matching the query ("count"), the sort order applied ("sort"), the facets used to filter the results ("facets"; more on this shortly!), and a "results" key holding a list of the datasets returned by the query.
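To make the nesting concrete, here is a sketch that walks a hand-written response of the same shape. The two datasets in it are placeholders, not real repository records.

```python
# A made-up response mirroring the structure of a package_search reply.
sample_response = {
    "help": "https://amrdcdata.ssec.wisc.edu/api/3/action/help_show?name=package_search",
    "success": True,
    "result": {
        "count": 2,
        "sort": "score desc, metadata_modified desc",
        "facets": {},
        "results": [
            {"name": "example-dataset-a", "title": "Example dataset A"},
            {"name": "example-dataset-b", "title": "Example dataset B"},
        ],
    },
}

result = sample_response["result"]

# "count" is the number of matches reported by the search.
total_matches = result["count"]

# Each entry in "results" is one dataset's metadata.
titles = [dataset["title"] for dataset in result["results"]]

print(total_matches, titles)
```

Note that "count" and the length of "results" need not agree: if the 'rows' parameter is smaller than the number of matches, the list holds only the first page of them.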

When we examine the list of "results", we will notice that each dataset is itself represented by a JSON object. For example:

{
  "license_title": "Creative Commons Attribution",
  "maintainer": "Antarctic Meteorological Research and Data Center",
  "maintainer_email": "amrdc@amrdc.ssec.wisc.edu",
  "metadata_created": "2022-06-08T17:15:55.678368",
  "metadata_modified": "2023-03-14T22:18:43.090902",
  "state": "active",
  "version": "1.0",
  "type": "dataset",
  "resources": [
      // Long list of resources here....
  ],
  "num_resources": 11,
  "tags": [
      // Long list of metadata tags here...
  ],
  "name": "sabrina-automatic-weather-station-2009-unmodified-ten-minute-observational-data",
  "title": "Sabrina Automatic Weather Station, 2009 unmodified ten-minute observational data.",
  "revision_id": "c3149988-0bfb-4f21-b12f-bfeab32c7813"
}

This is the dataset's metadata. Everything we need to identify, access, and cite this dataset is contained in this object.
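As a small illustration of putting this metadata to work, the sketch below assembles a one-line summary from a few of the fields shown above. The summarize function and the layout of its output are our own invention, not an official citation format.

```python
from datetime import datetime

# A subset of the metadata record shown above.
metadata = {
    "license_title": "Creative Commons Attribution",
    "maintainer": "Antarctic Meteorological Research and Data Center",
    "metadata_modified": "2023-03-14T22:18:43.090902",
    "num_resources": 11,
    "title": "Sabrina Automatic Weather Station, 2009 unmodified ten-minute observational data.",
}


def summarize(record: dict) -> str:
    """Build a short human-readable summary of a dataset's metadata."""
    # The timestamps are ISO 8601 strings, so fromisoformat() can parse them.
    modified = datetime.fromisoformat(record["metadata_modified"]).date()
    return (
        f"{record['title']} "
        f"{record['maintainer']}. "
        f"License: {record['license_title']}. "
        f"{record['num_resources']} resources, last modified {modified}."
    )


summary = summarize(metadata)
print(summary)
```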

Now that we understand the basic functionality of the web API and its data structures, we will learn how to access datasets programmatically using a scripting language.

Accessing metadata records, saving data + metadata

This tutorial assumes you have Python 3 installed, along with either the pip tool or an active conda environment.

You can apply these examples in your favorite programming language using any utility that can send an HTTP request and receive the response. As we saw above, you can even execute API requests in your web browser and get formatted JSON as a response.

Alternatively, you can execute the code directly in this notebook by opening it in Google Colab or downloading the .ipynb file and opening it in your favorite interactive notebook environment.

For a more technical treatment of this topic, refer to the Official CKAN API Guide.

Accessing the repository using the Python `requests` library

We'll start by importing the requests library, which we can use to make simple HTTP GET and POST requests. For convenience, we will also import urllib.parse, a standard-library module that parses and encodes strings for use in web URLs.

import requests
import urllib.parse

## Request a list of datasets from the repository
## requests.get() returns a `Response` object, which we map to a dictionary using .json()
response = requests.get('https://amrdcdata.ssec.wisc.edu/api/action/package_list').json()

## We then access the results using the 'result' key.
amrdc_datasets = response['result']

## Let's see how many datasets are in the repo.
len(amrdc_datasets)

Now let's use this list to look for Byrd AWS datasets.

byrd_aws_datasets = [dataset for dataset in amrdc_datasets if "byrd" in dataset]
len(byrd_aws_datasets)

If we want only Byrd AWS datasets from 1999....

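## Simple version: chain substring checks with `and`.
byrd_aws_1999_datasets = [dataset for dataset in amrdc_datasets if "byrd" in dataset and "1999" in dataset]
## The same pattern scales to any number of keywords with all().
## Here we also require "unmodified", which narrows the results further.
byrd_aws_1999_datasets = [dataset for dataset in amrdc_datasets if all(word in dataset for word in ["byrd", "unmodified", "1999"])]
byrd_aws_1999_datasets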
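If we expect to run many such filters, the keyword check can be wrapped in a small reusable function. This is a sketch only: filter_ids is a hypothetical helper, and the sample IDs are copied from the responses shown earlier so that the example runs without a network connection.

```python
def filter_ids(dataset_ids, words):
    """Return the IDs that contain every word in `words` (case-insensitive)."""
    wanted = [word.lower() for word in words]
    return [ds for ds in dataset_ids if all(w in ds.lower() for w in wanted)]


# A few dataset IDs taken from the example responses above.
sample_ids = [
    "ago-4-automatic-weather-station-2012-quality-controlled-observational-data",
    "sabrina-automatic-weather-station-2009-unmodified-ten-minute-observational-data",
    "ago-4-automatic-weather-station-2012-unmodified-ten-minute-observational-data",
]

matches = filter_ids(sample_ids, ["ago-4", "unmodified", "2012"])
print(matches)
```

With the live list, the equivalent call would be filter_ids(amrdc_datasets, ["byrd", "unmodified", "1999"]).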

You can turn any of these results into a static URL by appending the dataset ID to https://amrdcdata.ssec.wisc.edu/dataset/. Visit the resulting link in your browser to access the data.
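A short sketch of that step, using urllib.parse.quote to escape any characters that are not URL-safe. The function name dataset_url is our own, and the ID used here is the Sabrina dataset from the metadata example above.

```python
from urllib.parse import quote

DATASET_BASE = "https://amrdcdata.ssec.wisc.edu/dataset/"


def dataset_url(dataset_id: str) -> str:
    """Return the landing-page URL for a dataset ID."""
    # quote() leaves letters, digits and hyphens untouched and
    # percent-encodes anything that is not safe in a URL path.
    return DATASET_BASE + quote(dataset_id)


url = dataset_url(
    "sabrina-automatic-weather-station-2009-unmodified-ten-minute-observational-data"
)
print(url)
```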

This pattern will work fine for most use cases, but working with URL strings is often inconvenient. Luckily, there is a Python library that acts as a wrapper for these methods.

Accessing the repository using the Python `ckanapi` library

The AMRDC Data Repository is built on the open-source CKAN data management system. The ckanapi Python library is a simple, powerful Python wrapper for programmatically communicating with the CKAN API.

Install it with pip install ckanapi or conda install -c conda-forge ckanapi.

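## In a notebook, the %pip magic installs into the active kernel's environment.
## From a terminal, run `pip install ckanapi` instead.
%pip install ckanapi
import ckanapi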

Let's import the library and execute another search for the Byrd 1999 data.

from ckanapi import RemoteCKAN
amrdc_repository = RemoteCKAN('https://amrdcdata.ssec.wisc.edu/')

byrd_aws_1999_datasets2 = amrdc_repository.action.package_search(q="Byrd 1999")['results']

The library unwraps the API response and returns the value of "result" directly, so indexing it with ['results'] gives us a list of dictionaries, this time containing all of the metadata. Let's check that we got the same datasets as last time.

[dataset['title'] for dataset in byrd_aws_1999_datasets2]
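One caveat when comparing the two approaches: a single package_search call returns at most 'rows' datasets, so a large result set may arrive only partially. The sketch below pages through results using the 'rows' and 'start' parameters of CKAN's package_search. The helper search_all is hypothetical, and it takes the search function as an argument so that it can be tried here against a stand-in instead of the live API.

```python
def search_all(search, query, page_size=100):
    """Collect every match for `query` by requesting one page at a time."""
    datasets = []
    while True:
        # 'start' is the offset of the first result to return.
        page = search(q=query, rows=page_size, start=len(datasets))
        datasets.extend(page["results"])
        # Stop when a page comes back empty or we have everything reported.
        if not page["results"] or len(datasets) >= page["count"]:
            return datasets


# Stand-in for the real search function, serving five fake datasets.
def fake_search(q, rows, start):
    items = [{"title": f"dataset {i}"} for i in range(5)]
    return {"count": len(items), "results": items[start:start + rows]}


all_titles = [d["title"] for d in search_all(fake_search, "Byrd 1999", page_size=2)]
print(all_titles)
```

Against the repository itself, the call would look like search_all(amrdc_repository.action.package_search, "Byrd 1999").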