Working with DataArrays
The central data structure used in hyperseti is the DataArray. It is returned when loading data (fil or h5), and internally. For example:
from hyperseti.data_array import from_h5
darr = from_h5('voyager_data.h5')
If you inspect this in jupyter you will see:
DataArray is similar to xarray, in that it labels a numpy-like array with dimensions, scales and attribute metadata. The issue with xarray is that for very large arrays, the coords that describe the axis are also very large (see pydata:discussion#5166). For example:
import xarray as xr
data = np.random.rand(1000000, 3)
frequency = np.linspace(1, 2, 1000000)
locs = ['a', 'b', 'c']
xdata = xr.DataArray(data, coords=[frequency, locs], dims=["frequency", "space"])
Here, we needed to generate a very large frequency array – this is slow to create, and uses a lot of memory. For high-time or high-frequency resolution radio data, this is problematic.
Introducing DimensionScale
Hyperseti’s solution is a class DimensionScale, which attaches to the DataArray to describe each axis.
A DimensionScale pretends it is a numpy array but is actually just composed of three values: start, stop, and step, that is:
dim_scale_value = start_value + step_size * i
It also has units (e.g. GHz) and a name (e.g. ‘frequency’):
d = np.arange(2**20)
ds = DimensionScale('frequency', 1.1, 1.9, len(d), 'GHz')
>> <DimensionScale 'frequency': start 1.1 GHz step 1.9 GHz nstep 1048576 >
Dimension scales can be indexed, and a new dimension scale will be generated:
ds[1024:1032:2]
>> <DimensionScale 'frequency': start 1946.6999999999998 GHz step 3.8 GHz nstep 4 >
And they can be converted into numpy arrays, or into astropy.Quantity datasets:
# generate numpy array
ds_array = np.asarray(ds)
# generate astropy.Quantity array
ds_astropy = ds.generate()
Parts of the DataArray
To construct a DataArray, you need to supply data, dims, scales, and attrs. Here’s how to initialize a new array:
darr = DataArray(data, dims, scales, attrs, slice_info=None, parent_shape=None)
These correspond to
data- A numpy-like dataset. This can be a numpy.ndarray, a cupy.ndarray, a h5py.Dataset, or anything else that is numpy-like.dims- The names of each axis of thedataarray, e.g. (frequency, time, polarization).scales- A set ofDimensionScales, one for each dimension indims.attrs- A dictionary of any other metadata you’d like to attach.
The slice_info and parent_shape are to do with if you have selected a subsection of data from a larger array, so you can keep track.
These will populate if you call the sel() method:
darr = from_h5('voyager_data.h5')
dsel = darr.sel({'frequency': slice(0, 4096, 2), 'time': slice(1, 7)})
Which returns a new DataArray: