## Working with DataArrays The central data structure used in hyperseti is the `DataArray`. It is returned when loading data (fil or h5), and internally. For example: ```python from hyperseti.data_array import from_h5 darr = from_h5('voyager_data.h5') ``` If you inspect this in jupyter you will see: image DataArray is similar to [xarray](http://xarray.pydata.org/en/stable/), in that it labels a numpy-like array with dimensions, scales and attribute metadata. The issue with xarray is that for very large arrays, the coords that describe the axis are also very large (see [pydata:discussion#5166](https://github.com/pydata/xarray/discussions/5156)). For example: ```python import xarray as xr data = np.random.rand(1000000, 3) frequency = np.linspace(1, 2, 1000000) locs = ['a', 'b', 'c'] xdata = xr.DataArray(data, coords=[frequency, locs], dims=["frequency", "space"]) ``` Here, we needed to generate a very large frequency array -- this is slow to create, and uses a lot of memory. For high-time or high-frequency resolution radio data, this is problematic. ### Introducing `DimensionScale` Hyperseti's solution is a class `DimensionScale`, which attaches to the `DataArray` to describe each axis. A `DimensionScale` pretends it is a numpy array but is actually just composed of three values: start, stop, and step, that is: ```python dim_scale_value = start_value + step_size * i ``` It also has units (e.g. GHz) and a name (e.g. 'frequency'): ```python d = np.arange(2**20) ds = DimensionScale('frequency', 1.1, 1.9, len(d), 'GHz') >> ``` Dimension scales can be indexed, and a new dimension scale will be generated: ``` ds[1024:1032:2] >> ``` And they can be converted into numpy arrays, or into astropy.Quantity datasets: ```python # generate numpy array ds_array = np.asarray(ds) # generate astropy.Quantity array ds_astropy = ds.generate() ``` ### Parts of the `DataArray` To construct a DataArray, you need to supply data, dims, scales, and attrs. Here's how to initialize a new array: ```python darr = DataArray(data, dims, scales, attrs, slice_info=None, parent_shape=None) ``` These correspond to * `data` - A numpy-like dataset. This can be a numpy.ndarray, a cupy.ndarray, a h5py.Dataset, or anything else that is numpy-like. * `dims` - The names of each axis of the `data` array, e.g. (frequency, time, polarization). * `scales` - A set of `DimensionScales`, one for each dimension in `dims`. * `attrs` - A dictionary of any other metadata you'd like to attach. The `slice_info` and `parent_shape` are to do with if you have selected a subsection of data from a larger array, so you can keep track. These will populate if you call the `sel()` method: ```python darr = from_h5('voyager_data.h5') dsel = darr.sel({'frequency': slice(0, 4096, 2), 'time': slice(1, 7)}) ``` Which returns a new DataArray: image