Data array of xarray data structure

Posted by greekuser on Fri, 07 Feb 2020 19:38:40 +0100

xarray.DataArray is xarray's labeled, multidimensional array. It has the following key properties:

  • values: a numpy.ndarray holding the array's values
  • dims: dimension names for each axis (for example, ('x', 'y', 'z'))
  • coords: a dict-like container of arrays (coordinates) that label each point (for example, one-dimensional arrays of numbers, datetime objects, or strings)
  • attrs: a dictionary for storing arbitrary metadata (attributes)

xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.
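
The following is a minimal sketch of what this means in practice. The array contents and the variable names (values, arr, shuffled) are made up for illustration; only mean(dim=...), sel() and isel() are taken from xarray itself.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative data: 4 daily time steps at 3 locations (values are arbitrary).
values = np.arange(12.0).reshape(4, 3)
times = pd.date_range("2000-01-01", periods=4)
locs = ["IA", "IL", "IN"]
arr = xr.DataArray(values, coords=[times, locs], dims=["time", "space"])

# numpy reduces over a positional axis; xarray accepts the dimension name.
mean_by_axis = values.mean(axis=0)
mean_by_name = arr.mean(dim="time")

# Coordinates allow lookup by label instead of by integer position.
il_series = arr.sel(space="IL")

# Arithmetic aligns on coordinate labels, not on position: the reordered
# copy is matched back to the original by its 'space' labels.
shuffled = arr.isel(space=[2, 0, 1])
total = arr + shuffled
```

Because the addition is aligned by label, total.sel(space="IL") holds twice the original "IL" column, even though "IL" sits at a different position in shuffled.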

DataArray objects can also have a name and can hold arbitrary metadata in the form of their attrs property. Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases (see FAQ, What is your approach to metadata?).

Create a DataArray

The DataArray constructor takes the following arguments:

  • data: a multidimensional array of values (such as a numpy ndarray, Series, DataFrame or pandas.Panel; note that pandas.Panel has been removed from recent pandas releases)
  • coords: a list or dictionary of coordinates. If it is a list, it should be a list of tuples, in which the first element is the dimension name and the second element is the corresponding array-like coordinate object.
  • dims: a list of dimension names. If omitted, and coords is a list of tuples, the dimension names are taken from coords.
  • attrs: a dictionary of attributes to add to the instance
  • name: a string that names the instance

The examples below assume the usual imports: import numpy as np, import pandas as pd, and import xarray as xr.

In [1]: data = np.random.rand(4, 3)

In [2]: locs = ['IA', 'IL', 'IN']

In [3]: times = pd.date_range('2000-01-01', periods=4)

In [4]: foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])

In [5]: foo
Out[5]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

Only data is required; all other parameters will be populated with default values:

In [6]: xr.DataArray(data)
Out[6]: 
<xarray.DataArray (dim_0: 4, dim_1: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Dimensions without coordinates: dim_0, dim_1

As you can see, dimension names are always present in the xarray data model: if you do not provide them, defaults of the form dim_N are created. Coordinates, however, are always optional, and dimensions do not get automatic coordinate labels.

Note:
This is different from pandas, where axes always have tick labels, which default to the integers [0, ..., n-1].
Prior to xarray v0.9, xarray copied this behavior: default coordinates were always created for each dimension if coordinates were not supplied explicitly. This is no longer the case.
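
A small sketch contrasting the two behaviors, assuming xarray v0.9 or later. The names raw, unlabeled and frame are chosen only for this example.

```python
import numpy as np
import pandas as pd
import xarray as xr

raw = np.zeros((4, 3))

# xarray: dimension names are generated, but no coordinates are attached.
unlabeled = xr.DataArray(raw)
print(unlabeled.dims)         # default names of the form dim_N
print(len(unlabeled.coords))  # number of coordinates on the array

# pandas: a default integer index is always created.
frame = pd.DataFrame(raw)
print(list(frame.index))

# Positional indexing by dimension name still works without coordinates.
first_row = unlabeled.isel(dim_0=0)
```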

Coordinates can be specified in the following ways:

  • A list of values whose length equals the number of dimensions, providing coordinate labels for each dimension. Each value must take one of the following forms:
    • A DataArray or Variable
    • A tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
    • A pandas object or scalar value, which is converted into a DataArray
    • A one-dimensional array or list, which is interpreted as the values of a one-dimensional coordinate variable along the dimension of the same name
  • A dictionary of the form {coord_name: coord}, where the values take the same forms as in the list. Supplying coordinates as a dictionary allows coordinates other than those corresponding to dimensions (more on that later). If coords is supplied as a dictionary, dims must be provided explicitly.

Provided as a list of tuples:

In [7]: xr.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

Dictionary:

In [8]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': ('space', [1, 2, 3])},
   ...:              dims=['time', 'space'])
   ...: 
Out[8]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    const    int64 42
    ranking  (space) int64 1 2 3

Provided as a dictionary with a coordinate that spans multiple dimensions:

In [9]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': (('time', 'space'), np.arange(12).reshape(4,3))},
   ...:              dims=['time', 'space'])
   ...: 
Out[9]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    const    int64 42
    ranking  (time, space) int64 0 1 2 3 4 5 6 7 8 9 10 11
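
The three-element tuple form (dims, data, attrs) is not exercised by the examples above. The sketch below shows one way it might be used to attach metadata to individual coordinates; the attribute names and values are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(4, 3)
times = pd.date_range("2000-01-01", periods=4)

# The (dims, data, attrs) tuple form attaches metadata to a coordinate.
# The attribute names and values here are illustrative only.
arr = xr.DataArray(
    data,
    coords={
        "time": ("time", times, {"long_name": "observation date"}),
        "space": ("space", ["IA", "IL", "IN"], {"long_name": "US state"}),
    },
    dims=["time", "space"],
)

print(arr["time"].attrs)
print(arr["space"].attrs)
```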

If you create a DataArray from a pandas Series, DataFrame, or pandas.Panel, any unspecified arguments in the DataArray constructor are filled in from the pandas object:

In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])

In [11]: df.index.name = 'abc'

In [12]: df.columns.name = 'xyz'

In [13]: df
Out[13]: 
xyz  x  y
abc      
a    0  2
b    1  3

In [14]: xr.DataArray(df)
Out[14]: 
<xarray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
       [1, 3]])
Coordinates:
  * abc      (abc) object 'a' 'b'
  * xyz      (xyz) object 'x' 'y'
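
The same applies to a Series. As a hedged sketch (the series contents and the index name "letter" are invented for illustration), a named index supplies the dimension name, and to_pandas() converts back:

```python
import pandas as pd
import xarray as xr

# A named index supplies the dimension name; its values become coordinates.
series = pd.Series(
    [10, 20, 30],
    index=pd.Index(["a", "b", "c"], name="letter"),
)
arr = xr.DataArray(series)
print(arr.dims)

# to_pandas() converts back; a one-dimensional DataArray becomes a Series.
roundtrip = arr.to_pandas()
print(type(roundtrip))
```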

DataArray properties

Let's take a look at the important properties of our array:

In [15]: foo.values
Out[15]: 
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])

In [16]: foo.dims
Out[16]: ('time', 'space')

In [17]: foo.coords
Out[17]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

In [18]: foo.attrs
Out[18]: {}

In [19]: print(foo.name)
None

The array values in a DataArray can be modified in place:

In [20]: foo.values = 1.0 * foo.values

Note:
The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xarray, use coordinates, or put separate DataArray objects in a single Dataset (see below).
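
As a rough sketch of the Dataset approach, the example below groups a float array and a string array that share a dimension. The variable names (measurement, quality) and all values are placeholders invented for this illustration.

```python
import numpy as np
import xarray as xr

locs = ["IA", "IL", "IN"]

# Each DataArray has a single dtype; a Dataset can group arrays of different
# dtypes that share a dimension. Names and values here are placeholders.
measurement = xr.DataArray([1.5, 2.5, 3.5], coords=[("space", locs)])
quality = xr.DataArray(["good", "bad", "good"], coords=[("space", locs)])
ds = xr.Dataset({"measurement": measurement, "quality": quality})

print(ds["measurement"].dtype, ds["quality"].dtype)

# Label-based selection applies to every variable in the Dataset at once.
il = ds.sel(space="IL")
```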

Now, fill in some missing metadata:

In [21]: foo.name = 'foo'

In [22]: foo.attrs['units'] = 'meters'

In [23]: foo
Out[23]: 
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters

The rename() method is another option; it returns a new data array instead of modifying the existing one:

In [24]: foo.rename('bar')
Out[24]: 
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters
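
To make the "returns a new data array" point concrete, here is a short sketch using a stand-in array (arr is a placeholder, not the foo object from the session above). It also shows that rename() accepts a dictionary for renaming dimensions or coordinates.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((4, 3)), dims=["time", "space"], name="foo")

renamed = arr.rename("bar")                    # new array with a new name
relabeled = arr.rename({"space": "location"})  # a dict renames dims/coords

# The original array keeps its name and dimension names.
print(arr.name, renamed.name, relabeled.dims)
```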

DataArray coordinates

The coords property is dict-like. Individual coordinates can be accessed by name from coords, or even by indexing the data array itself:

In [25]: foo.coords['time']
Out[25]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

In [26]: foo['time']
Out[26]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

These are also DataArray objects, which contain the tick labels for each dimension.
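
Since a coordinate is itself a DataArray, the usual DataArray tools apply to it. A brief sketch, using a stand-in array rather than foo:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2000-01-01", periods=4)
arr = xr.DataArray(np.zeros(4), coords=[("time", times)])

time_coord = arr["time"]
index = time_coord.to_index()    # the underlying pandas index
days = time_coord.dt.day.values  # datetime components via the .dt accessor

print(type(time_coord).__name__, type(index).__name__)
```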

Coordinates can also be set or removed using dictionary-like syntax:

In [27]: foo['ranking'] = ('space', [1, 2, 3])

In [28]: foo.coords
Out[28]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    ranking  (space) int64 1 2 3

In [29]: del foo['ranking']

In [30]: foo.coords
Out[30]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
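
The dictionary-like syntax modifies the array in place. If you would rather leave the original untouched, assign_coords() and drop_vars() return new objects instead. The sketch below assumes a version of xarray that provides drop_vars() (to my knowledge, v0.14 or later) and uses a stand-in array in place of foo.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.zeros((4, 3)),
    dims=["time", "space"],
    coords={"space": ["IA", "IL", "IN"]},
)

# Both calls return new objects; arr itself is left unchanged.
with_rank = arr.assign_coords(ranking=("space", [1, 2, 3]))
without_rank = with_rank.drop_vars("ranking")

print("ranking" in arr.coords, "ranking" in with_rank.coords)
```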

For more details, see Coordinates
