How to use cf-pandas#

cf-pandas is currently used mainly for two things: selecting the columns of a pandas DataFrame that represent axes or coordinates of the dataset, and selecting a variable through the accessor with a custom vocabulary whose regular expressions are matched against the column names. It also provides some other capabilities ported over from cf-xarray. Several classes and utilities support this functionality; they are used internally but are also helpful for other packages.

import cf_pandas as cfp
import pandas as pd

Get some data#

# Some data
url = "https://files.stage.platforms.axds.co/axiom/netcdf_harvest/basis/2013/BE2013_/data.csv.gz"
df = pd.read_csv(url)
df
                      time   longitude   latitude     z  profile  temperature  pressure  salinity  chlorophyll_a  conductivity     distance  segment
0      2013-08-07T22:26:00  -168.01784  65.409500   0.0        0      10.3291       0.0   30.7286            NaN           NaN         0.00        0
1      2013-08-07T22:26:00  -168.01784  65.409500  66.0        0          NaN       NaN       NaN            NaN           NaN         0.00        0
2      2013-08-07T22:26:00  -168.01784  65.409500  65.0        0          NaN       NaN       NaN            NaN           NaN         0.00        0
3      2013-08-07T22:26:00  -168.01784  65.409500  64.0        0          NaN       NaN       NaN            NaN           NaN         0.00        0
4      2013-08-07T22:26:00  -168.01784  65.409500  63.0        0          NaN       NaN       NaN            NaN           NaN         0.00        0
...                    ...         ...        ...   ...      ...          ...       ...       ...            ...           ...          ...      ...
12735  2013-09-24T22:59:00  -168.01384  60.516167  25.0      139          NaN       NaN       NaN            NaN           NaN  15575752.91        0
12736  2013-09-24T22:59:00  -168.01384  60.516167  24.0      139          NaN       NaN       NaN            NaN           NaN  15575752.91        0
12737  2013-09-24T22:59:00  -168.01384  60.516167  23.0      139          NaN       NaN       NaN            NaN           NaN  15575752.91        0
12738  2013-09-24T22:59:00  -168.01384  60.516167  32.0      139          NaN       NaN       NaN            NaN           NaN  15575752.91        0
12739  2013-09-24T22:59:00  -168.01384  60.516167  90.0      139          NaN       NaN       NaN            NaN           NaN  15575752.91        0

12740 rows × 12 columns

Basic accessor usage#

The terminology all comes from cf-xarray, which deals with multi-dimensional data and has more layers of standardized attributes. This package ports over the useful functionality and retains some of cf-xarray’s terminology and syntax, even where it does not fully apply to DataFrames. The aim is to let you think about and use DataFrames of data in much the same way as Datasets of data or model output.

When you use the cf-pandas accessor, it first validates the object by checking that columns representing time, latitude, and longitude are present and identifiable.

Following an approach copied directly from cf-xarray, cf-pandas contains a mapping of names from the CF conventions that defines the axes (“T”, “Z”, “Y”, “X”) and coordinates (“time”, “vertical”, “latitude”, “longitude”). These mappings are built in and are used to identify the columns containing axes and coordinates by name matching (column names are split on whitespace for the comparison).
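The name-matching idea can be sketched in plain Python. The lookup table below is a hypothetical stand-in chosen for illustration, not the table shipped with cf-pandas, and `identify` is an illustrative helper rather than a library function.

```python
# Hypothetical lookup table: NOT the mapping shipped with cf-pandas.
AXIS_NAMES = {
    "T": {"t", "time"},
    "Z": {"z"},
    "Y": {"y"},
    "X": {"x"},
}


def identify(columns, table):
    """Map each key to the columns whose whitespace-split words appear in the table."""
    found = {}
    for key, names in table.items():
        # Split the column name on whitespace and look for any shared word.
        hits = [col for col in columns if set(col.split()) & names]
        if hits:
            found[key] = hits
    return found


columns = ["time", "longitude", "latitude", "z", "salinity"]
axes = identify(columns, AXIS_NAMES)
print(axes)
```

With this toy table, only `time` and `z` are recognized as axes; `longitude` and `latitude` would need their own entries in a separate coordinates table.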

Check axes and coordinates mappings of the dataset:

df.cf.axes, df.cf.coordinates
({'Z': ['z'], 'T': ['time']},
 {'longitude': ['longitude'], 'latitude': ['latitude'], 'time': ['time']})

Check all available keys:

df.cf.keys()
{'T', 'Z', 'latitude', 'longitude', 'time'}

Is a certain key in the DataFrame?

"T" in df.cf, "X" in df.cf
(True, False)

Which CF standard names can be identified in the column names with strict matching? Column names are split on whitespace for this comparison.

df.cf.standard_names
{'latitude': ['latitude'], 'longitude': ['longitude'], 'time': ['time']}

Select variable#

Selecting a variable typically requires knowing the name of the column that holds it. The approach demonstrated here instead finds the column by matching its name against regular expressions that the user defines for each variable. cf-pandas provides helpers for building these expressions; see the Reg, Vocab, and widget classes, and the sections below, for more information.

Create custom vocabulary#

More information about custom vocabularies and the Vocab class is available here: https://cf-pandas.readthedocs.io/en/latest/demo_vocab.html

You can make regular expressions for your vocabulary by hand or use the Reg class in cf-pandas to do so.

# initialize class
vocab = cfp.Vocab()

# define a regular expression to represent your variable
reg = cfp.Reg(include="salinity", exclude="soil", exclude_end="_qc")

# Make an entry to add to your vocabulary
vocab.make_entry("salt", reg.pattern(), attr="standard_name")

# Add another entry to vocab
vocab.make_entry("temp", "temp")

vocab
{'salt': {'standard_name': '(?i)^(?!.*(soil))(?!.*(_qc)$)(?=.*salinity)'}, 'temp': {'standard_name': 'temp'}}
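To see what the generated pattern accepts, it can be tried against a few names with Python's built-in `re` module. The sketch below copies the pattern string from the output above and writes the same vocabulary by hand as a plain dictionary; the sample names are made up for illustration.

```python
import re

# Pattern copied from the vocabulary output above.
salt_pattern = r"(?i)^(?!.*(soil))(?!.*(_qc)$)(?=.*salinity)"

# A hand-written vocabulary with the same nested layout:
# nickname -> attribute -> regular expression.
manual_vocab = {
    "salt": {"standard_name": salt_pattern},
    "temp": {"standard_name": "temp"},
}

# Made-up names that probe the include and exclude rules.
samples = [
    "salinity",
    "Sea_Water_Salinity",  # accepted: (?i) makes the match case-insensitive
    "soil_salinity",       # rejected: contains "soil"
    "salinity_qc",         # rejected: ends with "_qc"
    "temperature",         # rejected: does not contain "salinity"
]

matches = [name for name in samples if re.search(salt_pattern, name)]
print(matches)  # ['salinity', 'Sea_Water_Salinity']
```

The two negative lookaheads implement `exclude` and `exclude_end`, and the positive lookahead implements `include`.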

Access variable#

Refer to the column of data you want by the nickname described in your custom vocabulary.

You can do this with a context manager, especially if you are using more than one vocabulary:

with cfp.set_options(custom_criteria=vocab.vocab):
    print(df.cf["salt"])
0        30.7286
1            NaN
2            NaN
3            NaN
4            NaN
          ...   
12735        NaN
12736        NaN
12737        NaN
12738        NaN
12739        NaN
Name: salinity, Length: 12740, dtype: float64

Alternatively, set a vocabulary once for general use in the rest of this kernel session:

cfp.set_options(custom_criteria=vocab.vocab)
df.cf["salt"]
0        30.7286
1            NaN
2            NaN
3            NaN
4            NaN
          ...   
12735        NaN
12736        NaN
12737        NaN
12738        NaN
12739        NaN
Name: salinity, Length: 12740, dtype: float64
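The difference between the two forms is scope. A minimal sketch of that pattern follows, assuming the `with` form restores the previous setting on exit while the bare call keeps the new one. `OPTIONS` and `set_options_sketch` are hypothetical stand-ins, not cf-pandas internals.

```python
# Hypothetical stand-ins: not the actual cf-pandas internals.
OPTIONS = {"custom_criteria": ()}


class set_options_sketch:
    """Apply options immediately; restore the old values when used in a `with` block."""

    def __init__(self, **kwargs):
        # Remember the values being replaced so they can be restored later.
        self.old = {key: OPTIONS[key] for key in kwargs}
        OPTIONS.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        OPTIONS.update(self.old)


with set_options_sketch(custom_criteria={"salt": {}}):
    inside = OPTIONS["custom_criteria"]   # vocabulary is active here
after_block = OPTIONS["custom_criteria"]  # previous value is back

set_options_sketch(custom_criteria={"salt": {}})
after_call = OPTIONS["custom_criteria"]   # stays set from here on
```

Scoping the setting to a block is convenient when several vocabularies are in play, since each block can name the one it needs without affecting later code.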

Display mapping of all variables in the dataset that can be identified using the custom criteria/vocab we defined above:

df.cf.custom_keys
{'salt': ['salinity'], 'temp': ['temperature']}

Other utilities#

Access all CF Standard Names#

sn = cfp.standard_names()
sn[:5]
['acoustic_signal_roundtrip_travel_time_in_sea_water',
 'aerodynamic_particle_diameter',
 'aerodynamic_resistance',
 'age_of_sea_ice',
 'age_of_stratospheric_air']

Use vocabulary to match any list#

This function is the logic under the hood of the cf-pandas accessor that selects which column matches a variable nickname according to the custom vocabulary; it comes almost exactly from cf-xarray. It is exposed as a separate function because it is useful in other scenarios too. Here we use the custom vocabulary from above to filter the standard names just retrieved.

cfp.match_criteria_key(sn, "salt", vocab.vocab)
['sea_water_practical_salinity_at_sea_floor',
 'tendency_of_sea_water_salinity',
 'sea_water_absolute_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content',
 'change_over_time_in_sea_water_preformed_salinity',
 'tendency_of_sea_water_salinity_due_to_vertical_mixing',
 'tendency_of_sea_water_salinity_due_to_sea_ice_thermodynamics',
 'sea_water_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_submesoscale_eddy_advection',
 'square_of_sea_surface_salinity',
 'sea_water_cox_salinity',
 'integral_wrt_depth_of_product_of_salinity_and_sea_water_density',
 'sea_water_practical_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_eddy_advection',
 'tendency_of_sea_water_salinity_due_to_horizontal_mixing',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_mesoscale_eddy_advection',
 'integral_wrt_depth_of_sea_water_practical_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_mesoscale_eddy_diffusion',
 'sea_surface_salinity',
 'change_over_time_in_sea_water_absolute_salinity',
 'tendency_of_sea_water_salinity_due_to_parameterized_eddy_advection',
 'ratio_of_sea_water_practical_salinity_anomaly_to_relaxation_timescale',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_dianeutral_mixing',
 'product_of_eastward_sea_water_velocity_and_salinity',
 'product_of_northward_sea_water_velocity_and_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_residual_mean_advection',
 'sea_water_salinity_at_sea_floor',
 'tendency_of_sea_water_salinity_due_to_advection',
 'sea_water_reference_salinity',
 'change_over_time_in_sea_water_practical_salinity',
 'sea_water_knudsen_salinity',
 'sea_water_preformed_salinity',
 'change_over_time_in_sea_water_salinity',
 'sea_ice_salinity']
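The matching step can be approximated in a few lines of plain Python. This is a sketch of the idea rather than the library's code: `match_names` is a hypothetical helper, it uses `re.search`, and it may differ from `match_criteria_key` in ordering, de-duplication, and the exact matching call.

```python
import re

# Vocabulary copied from the output earlier on this page.
vocab_dict = {
    "salt": {"standard_name": r"(?i)^(?!.*(soil))(?!.*(_qc)$)(?=.*salinity)"},
    "temp": {"standard_name": "temp"},
}


def match_names(names, key, criteria):
    """Return the names matching any pattern stored under `key` (hypothetical helper)."""
    patterns = list(criteria.get(key, {}).values())
    return [name for name in names if any(re.search(p, name) for p in patterns)]


# The same helper works on column names or on any other list of strings.
columns = ["time", "longitude", "latitude", "z", "temperature", "pressure", "salinity"]
standard_names = ["sea_water_salinity", "sea_ice_salinity", "age_of_sea_ice"]

salt_columns = match_names(columns, "salt", vocab_dict)
temp_columns = match_names(columns, "temp", vocab_dict)
salt_standard_names = match_names(standard_names, "salt", vocab_dict)
print(salt_columns, temp_columns, salt_standard_names)
```

Because the helper only needs a list of strings, the same vocabulary can screen column names, standard names, or variable names gathered from another source.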