How to use `cf-pandas`#

cf-pandas is currently used mainly for two things: selecting the columns of a DataFrame that represent axes or coordinates of the dataset, and selecting a variable from a pandas DataFrame through the accessor and a custom vocabulary whose regular expressions are matched against the column names. It also provides some other capabilities that have been ported over from cf-xarray. Several classes and utilities support this functionality; they are used internally but are also helpful for other packages.

import cf_pandas as cfp
import pandas as pd

Get some data#

# Some data
url = "https://files.stage.platforms.axds.co/axiom/netcdf_harvest/basis/2013/BE2013_/data.csv.gz"
df = pd.read_csv(url)
df

	time	longitude	latitude	z	profile	temperature	pressure	salinity	chlorophyll_a	conductivity	distance	segment
0	2013-08-07T22:26:00	-168.01784	65.409500	0.0	0	10.3291	0.0	30.7286	NaN	NaN	0.00	0
1	2013-08-07T22:26:00	-168.01784	65.409500	66.0	0	NaN	NaN	NaN	NaN	NaN	0.00	0
2	2013-08-07T22:26:00	-168.01784	65.409500	65.0	0	NaN	NaN	NaN	NaN	NaN	0.00	0
3	2013-08-07T22:26:00	-168.01784	65.409500	64.0	0	NaN	NaN	NaN	NaN	NaN	0.00	0
4	2013-08-07T22:26:00	-168.01784	65.409500	63.0	0	NaN	NaN	NaN	NaN	NaN	0.00	0
...	...	...	...	...	...	...	...	...	...	...	...	...
12735	2013-09-24T22:59:00	-168.01384	60.516167	25.0	139	NaN	NaN	NaN	NaN	NaN	15575752.91	0
12736	2013-09-24T22:59:00	-168.01384	60.516167	24.0	139	NaN	NaN	NaN	NaN	NaN	15575752.91	0
12737	2013-09-24T22:59:00	-168.01384	60.516167	23.0	139	NaN	NaN	NaN	NaN	NaN	15575752.91	0
12738	2013-09-24T22:59:00	-168.01384	60.516167	32.0	139	NaN	NaN	NaN	NaN	NaN	15575752.91	0
12739	2013-09-24T22:59:00	-168.01384	60.516167	90.0	139	NaN	NaN	NaN	NaN	NaN	15575752.91	0

12740 rows × 12 columns

Basic accessor usage#

The terminology comes from cf-xarray, which deals with multi-dimensional data and has more layers of standardized attributes. This package ports over the useful functionality and retains some of cf-xarray's terminology and syntax, even where it does not always apply. The aim is to let you think about and use DataFrames of data in much the same way as Datasets of data or model output.

When you use the cf-pandas accessor, it first validates the object: it checks that columns representing time, latitude, and longitude are present and identifiable.

Using an approach copied directly from cf-xarray, cf-pandas contains a mapping of names from the CF conventions that define the axes (“T”, “Z”, “Y”, “X”) and coordinates (“time”, “vertical”, “latitude”, “longitude”). These are built in and used to identify columns containing axes and coordinates using name matching (column names are split by white space for the comparison).
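The idea of name matching with white-space splitting can be sketched in plain Python. This is only an illustration: the `KNOWN_NAMES` lists and the `identify_columns` helper are hypothetical and are not the mapping or code that cf-pandas actually uses.

```python
# Hypothetical, simplified name lists for illustration only;
# the real mapping is built into cf-pandas.
KNOWN_NAMES = {
    "T": ("time",),
    "Z": ("z",),
    "latitude": ("latitude", "lat"),
    "longitude": ("longitude", "lon"),
}


def identify_columns(columns):
    """Map each key to the columns that contain one of its names as a whole token."""
    found = {}
    for key, names in KNOWN_NAMES.items():
        # Split each column name on white space and compare token by token.
        hits = [col for col in columns if any(tok in names for tok in col.split())]
        if hits:
            found[key] = hits
    return found


print(identify_columns(["time", "latitude [degrees_north]", "lon", "z", "salinity"]))
```

In this sketch a column such as "latitude [degrees_north]" is identified because one of its white-space-separated tokens is a known name, while a column such as "latitude_qc" would not be, since the whole token differs.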

Check axes and coordinates mappings of the dataset:

df.cf.axes, df.cf.coordinates

({'Z': ['z'], 'T': ['time']},
 {'longitude': ['longitude'], 'latitude': ['latitude'], 'time': ['time']})

Check all available keys:

df.cf.keys()

{'T', 'Z', 'latitude', 'longitude', 'time'}

Is a certain key in the DataFrame?

"T" in df.cf, "X" in df.cf

(True, False)

What CF standard names can be identified with strict matching in the column names? Column names will be split by white space for this comparison.

df.cf.standard_names

{'latitude': ['latitude'], 'longitude': ['longitude'], 'time': ['time']}

Select variable#

Selecting a variable typically requires knowing the name of the column that holds it. The approach demonstrated here instead selects the column by matching its name against regular expressions that the user defines for each variable. cf-pandas has helper functions for this process; see the Reg, Vocab, and widget classes, and the sections below, for more information.

Create custom vocabulary#

More information about custom vocabularies and the Vocab class is available at https://cf-pandas.readthedocs.io/en/latest/demo_vocab.html

You can make regular expressions for your vocabulary by hand or use the Reg class in cf-pandas to do so.

# initialize class
vocab = cfp.Vocab()

# define a regular expression to represent your variable
reg = cfp.Reg(include="salinity", exclude="soil", exclude_end="_qc")

# Make an entry to add to your vocabulary
vocab.make_entry("salt", reg.pattern(), attr="standard_name")

# Add another entry to vocab
vocab.make_entry("temp", "temp")

vocab

{'salt': {'standard_name': '(?i)^(?!.*(soil))(?!.*(_qc)$)(?=.*salinity)'}, 'temp': {'standard_name': 'temp'}}
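To get a feel for what the generated "salt" pattern accepts, it can be tried directly with Python's built-in `re` module. The candidate names below are made up for illustration, and the use of `re.match` is an assumption about how such a pattern would typically be applied, not a statement about cf-pandas internals.

```python
import re

# Pattern copied from the "salt" entry of the vocabulary shown above.
pattern = r"(?i)^(?!.*(soil))(?!.*(_qc)$)(?=.*salinity)"

# Hypothetical column names, not taken from the dataset.
candidates = [
    "salinity",
    "Sea_Water_Salinity",
    "soil_salinity",
    "salinity_qc",
    "salinity_qc_flag",
    "temperature",
]

# Keep the names the pattern accepts.
matches = [name for name in candidates if re.match(pattern, name)]
print(matches)
```

The pattern is case-insensitive, rejects any name containing "soil", rejects names that end in "_qc", and otherwise requires "salinity" to appear somewhere in the name. Because `exclude_end` only applies to the end of the name, "salinity_qc_flag" is still accepted.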

Access variable#

Refer to the column of data you want by the nickname described in your custom vocabulary.

You can do this with a context manager, especially if you are using more than one vocabulary:

with cfp.set_options(custom_criteria=vocab.vocab):
    print(df.cf["salt"])

0        30.7286
1            NaN
2            NaN
3            NaN
4            NaN
          ...   
12735        NaN
12736        NaN
12737        NaN
12738        NaN
12739        NaN
Name: salinity, Length: 12740, dtype: float64

Or you can set one for general use in this kernel:

cfp.set_options(custom_criteria=vocab.vocab)
df.cf["salt"]

0        30.7286
1            NaN
2            NaN
3            NaN
4            NaN
          ...   
12735        NaN
12736        NaN
12737        NaN
12738        NaN
12739        NaN
Name: salinity, Length: 12740, dtype: float64

Display mapping of all variables in the dataset that can be identified using the custom criteria/vocab we defined above:

df.cf.custom_keys

{'salt': ['salinity'], 'temp': ['temperature']}

Other utilities#

Access all CF Standard Names#

sn = cfp.standard_names()
sn[:5]

['acoustic_signal_roundtrip_travel_time_in_sea_water',
 'aerodynamic_particle_diameter',
 'aerodynamic_resistance',
 'age_of_sea_ice',
 'age_of_stratospheric_air']

Use vocabulary to match any list#

This is the logic under the hood of the cf-pandas accessor that selects which column matches a variable nickname according to the custom vocabulary. It is taken almost exactly from cf-xarray and is available as a separate function because it is useful in other scenarios too. Here we filter the standard names retrieved above with our custom vocabulary.

cfp.match_criteria_key(sn, "salt", vocab.vocab)

['sea_water_practical_salinity_at_sea_floor',
 'tendency_of_sea_water_salinity',
 'sea_water_absolute_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content',
 'change_over_time_in_sea_water_preformed_salinity',
 'tendency_of_sea_water_salinity_due_to_vertical_mixing',
 'tendency_of_sea_water_salinity_due_to_sea_ice_thermodynamics',
 'sea_water_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_submesoscale_eddy_advection',
 'square_of_sea_surface_salinity',
 'sea_water_cox_salinity',
 'integral_wrt_depth_of_product_of_salinity_and_sea_water_density',
 'sea_water_practical_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_eddy_advection',
 'tendency_of_sea_water_salinity_due_to_horizontal_mixing',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_mesoscale_eddy_advection',
 'integral_wrt_depth_of_sea_water_practical_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_mesoscale_eddy_diffusion',
 'sea_surface_salinity',
 'change_over_time_in_sea_water_absolute_salinity',
 'tendency_of_sea_water_salinity_due_to_parameterized_eddy_advection',
 'ratio_of_sea_water_practical_salinity_anomaly_to_relaxation_timescale',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_parameterized_dianeutral_mixing',
 'product_of_eastward_sea_water_velocity_and_salinity',
 'product_of_northward_sea_water_velocity_and_salinity',
 'tendency_of_sea_water_salinity_expressed_as_salt_content_due_to_residual_mean_advection',
 'sea_water_salinity_at_sea_floor',
 'tendency_of_sea_water_salinity_due_to_advection',
 'sea_water_reference_salinity',
 'change_over_time_in_sea_water_practical_salinity',
 'sea_water_knudsen_salinity',
 'sea_water_preformed_salinity',
 'change_over_time_in_sea_water_salinity',
 'sea_ice_salinity']
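Conceptually, this kind of filtering amounts to testing each name against the regular expressions stored under a vocabulary key. The following is a minimal sketch of that idea using only the standard library. The `match_names` helper is hypothetical, the short list of names is hand-picked, and anchoring with `re.match` is an assumption; it is not the implementation used by cf-pandas or cf-xarray.

```python
import re

# Vocabulary in the same nested-dict shape as the one printed earlier.
criteria = {
    "salt": {"standard_name": r"(?i)^(?!.*(soil))(?!.*(_qc)$)(?=.*salinity)"},
    "temp": {"standard_name": "temp"},
}


def match_names(names, key, criteria):
    """Return the names matched by any pattern stored under `key` (hypothetical helper)."""
    patterns = list(criteria.get(key, {}).values())
    return [name for name in names if any(re.match(p, name) for p in patterns)]


# A small hand-picked list of names for illustration.
names = [
    "sea_water_salinity",
    "sea_ice_salinity",
    "sea_water_temperature",
    "soil_moisture",
]

print(match_names(names, "salt", criteria))
```

A key that is absent from the vocabulary simply yields an empty list in this sketch.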
