Brief introduction
This article introduces the two basic data structures in pandas, Series and DataFrame, and explains in detail how to create and index them, along with their other basic behaviors.
To use pandas, import the following libraries:
In [1]: import numpy as np
In [2]: import pandas as pd
Series
A Series is a one-dimensional labeled array; its axis labels are collectively called the index. Create a Series as follows:
>>> s = pd.Series(data, index=index)
Here data can be a Python dict, a NumPy ndarray, or a scalar value.
index is a list of axis labels. Next, let's look at each way of creating a Series.
Create from ndarray
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s
Out[67]:
a   -1.300797
b   -2.044172
c   -1.170739
d   -0.445290
e    1.208784
dtype: float64
Use the index attribute to get the labels:
s.index
Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Create from dict
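d = {'b': 1, 'a': 0, 'c': 2}

pd.Series(d)
Out[70]:
a    0
b    1
c    2
dtype: int64
Note: the output above comes from an older pandas release, which sorted the dict keys. With pandas 0.23 or later on Python 3.6 or later, the Series keeps the dict's insertion order (b, a, c).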
Create from scalar
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[71]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
Series and ndarray
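A Series behaves much like an ndarray: you can index it by position, slice it, filter it with a boolean mask, or pass a list of positions. Note that recent pandas 2.x releases deprecate treating integer keys such as s[0] or s[[4, 3, 1]] as positions when the index is not integer-based; s.iloc[0] and s.iloc[[4, 3, 1]] are the explicit positional forms.
s[0]
Out[72]: -1.3007972194268396

s[:3]
Out[73]:
a   -1.300797
b   -2.044172
c   -1.170739
dtype: float64

s[s > s.median()]
Out[74]:
d   -0.445290
e    1.208784
dtype: float64

s[[4, 3, 1]]
Out[75]:
e    1.208784
d   -0.445290
b   -2.044172
dtype: float64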
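Integer keys on a labeled Series can be ambiguous between position and label, so explicit positional access through .iloc is often the safer habit. The following is a minimal sketch of the same four operations, assuming a small Series with made-up values (not the random data above) so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]                  # element at position 0 (label 'a')
head = s.iloc[:3]                  # first three elements (labels a, b, c)
picked = s.iloc[[4, 3, 1]]         # arbitrary positions, in the order given
above_median = s[s > s.median()]   # boolean mask: keeps values above the median

print(first)
print(list(head.index))
print(list(picked.index))
print(list(above_median.index))
```

The boolean-mask form does not involve positions at all, so it reads the same with or without .iloc.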
Series and dict
If you access a Series by label, it behaves much like a dict. You can read and assign values by key:
s['a']
Out[80]: -1.3007972194268396

s['e'] = 12.

s
Out[82]:
a    -1.300797
b    -2.044172
c    -1.170739
d    -0.445290
e    12.000000
dtype: float64
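The dict analogy extends to membership tests and missing keys. The sketch below, which uses a small made-up Series, shows that `in` checks the index labels, that a missing label raises KeyError, and that get returns a default instead:

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_a = 'a' in s                 # membership tests the labels, like dict keys
has_z = 'z' in s

missing = s.get('z')             # None when the label is absent
fallback = s.get('z', np.nan)    # or a default that you supply

try:
    s['z']                       # plain lookup of a missing label
    raised = False
except KeyError:
    raised = True

print(has_a, has_z, missing, fallback, raised)
```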
Vectorization and label alignment
A Series supports vectorized operations, so you usually do not need to loop over its values one by one:
s + s
Out[83]:
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

s * 2
Out[84]:
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

np.exp(s)
Out[85]:
a         0.272315
b         0.129487
c         0.310138
d         0.640638
e    162754.791419
dtype: float64
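Label alignment is what distinguishes Series arithmetic from ndarray arithmetic: operands are matched by label rather than by position, and a label missing from either side yields NaN. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

left = s.iloc[1:]     # labels b, c, d
right = s.iloc[:-1]   # labels a, b, c

# The result covers the union of the labels. 'a' and 'd' each appear on
# only one side, so their sums are NaN; 'b' and 'c' are added to themselves.
total = left + right
print(total)
```

If the NaN results are unwanted, dropna() on the result removes them.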
Name property
Series also has a name attribute, which we can set when creating:
s = pd.Series(np.random.randn(5), name='something')

s
Out[88]:
0    0.192272
1    0.110410
2    1.442358
3   -0.375792
4    1.228111
Name: something, dtype: float64
Series also has a rename method, which returns a new Series with a different name:
s2 = s.rename("different")
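A short sketch, using made-up values, confirms that rename with a scalar argument leaves the original Series untouched, and shows one place where the name matters: it becomes the column label when the Series is converted to a DataFrame.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([1, 2, 3], name='something')
s2 = s.rename('different')

print(s.name)    # the original keeps its name
print(s2.name)   # the new Series carries the new name

# The name is used as the column label by to_frame().
frame = s2.to_frame()
print(list(frame.columns))
```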
DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns are Series, possibly of different types. You can think of it as a spreadsheet or an Excel table. A DataFrame can be created from the following kinds of data:
- A dict of one-dimensional ndarrays, lists, dicts, or Series
- A structured or record ndarray
- A two-dimensional NumPy ndarray
- Another DataFrame
Create from Series
You can create a DataFrame from a dict of Series. The resulting index is the union of the indexes of the individual Series, and positions with no value are filled with NaN:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

df
Out[92]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
Pass index to select and reorder the rows:
pd.DataFrame(d, index=['d', 'b', 'a'])
Out[93]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
Pass columns to select and reorder the columns. A column that does not exist in the dict ('three' here) is filled with NaN:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[94]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
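The row and column labels of the result are available through the index and columns attributes. The sketch below rebuilds the same dict of Series and checks the labels and the NaN filling described above:

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

df = pd.DataFrame(d)
print(df.index)      # union of the two Series indexes
print(df.columns)    # the dict keys

# Passing index and columns selects and reorders; unknown columns become NaN.
sub = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(sub)
```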
Create from ndarrays and lists
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}

pd.DataFrame(d)
Out[96]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[97]:
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
Create from structured array
You can create a DataFrame from a structured array. The field names become the column names. The 'S10' field is a 10-byte string (older code writes this as 'a10', an alias that NumPy 2.0 deprecates):
In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]:
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]:
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0
Create from a list of dicts
In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]:
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]:
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]:
   a   b
0  1   2
1  5  10
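In the outputs above, column c is displayed as 20.0 rather than 20. A key missing from one of the dicts becomes NaN, and because NaN is a floating-point value the whole column is stored as float64. A minimal sketch that checks this on the same data:

```python
import pandas as pd

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df2 = pd.DataFrame(data2)

# Each dict is one row; the columns are the union of all keys.
print(list(df2.columns))

# Row 0 has no 'c', so that cell is NaN and the column becomes float64.
print(df2.dtypes)
```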
Create from tuples
You can create a DataFrame with hierarchical (MultiIndex) rows and columns by passing a dict whose keys are tuples:
In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....:
Out[56]:
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
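With tuple keys, both axes become two-level MultiIndex objects, and a single cell is addressed by a pair of tuples. The sketch below uses a reduced version of the dict above (three columns, two rows) to keep the result easy to verify:

```python
import pandas as pd

# A reduced, illustrative subset of the dict shown above.
df_multi = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
})

# Outer dict keys form the column levels; inner dict keys form the row levels.
print(df_multi.columns.nlevels, df_multi.index.nlevels)

# Address one cell with a (row tuple, column tuple) pair.
cell = df_multi.loc[('A', 'B'), ('a', 'b')]
print(cell)
```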
Column selection, addition and deletion
You can treat a DataFrame like a dict of Series: getting, setting, and deleting columns uses the same syntax as the corresponding dict operations:
In [64]: df['one']
Out[64]:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [65]: df['three'] = df['one'] * df['two']

In [66]: df['flag'] = df['one'] > 2

In [67]: df
Out[67]:
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False
You can delete a column with del, or remove it and get it back as a Series with pop:
In [68]: del df['two']

In [69]: three = df.pop('three')

In [70]: df
Out[70]:
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
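Both del and pop modify the DataFrame in place. As a point of comparison that is not part of the walkthrough above, the drop method returns a new DataFrame and leaves the original alone. A minimal sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({
    'one': [1.0, 2.0],
    'two': [3.0, 4.0],
    'three': [5.0, 6.0],
    'four': [7.0, 8.0],
})

del df['two']                        # in place, returns nothing
three = df.pop('three')              # in place, returns the removed column

trimmed = df.drop(columns=['four'])  # new DataFrame; df still has 'four'

print(list(df.columns))
print(list(trimmed.columns))
print(three.tolist())
```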
If you insert a constant, the entire column will be filled:
In [71]: df['foo'] = 'bar'

In [72]: df
Out[72]:
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar
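When you assign a Series whose index does not match the DataFrame's index, it is aligned to the DataFrame's index and the missing labels become NaN. The one_trunc column in the output below comes from such an assignment:
In [73]: df['one_trunc'] = df['one'][:2]
By default, a new column is appended as the last column of the DataFrame. Use insert to place a column at a specific position:
In [75]: df.insert(1, 'bar', df['one'])

In [76]: df
Out[76]:
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN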
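The first argument of insert is the zero-based position of the new column, and the call modifies the DataFrame in place. The sketch below, on a small made-up frame, also shows that insert refuses a column name that already exists unless duplicates are explicitly allowed:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'flag': [False, False, True]},
                  index=['a', 'b', 'c'])

df['foo'] = 'bar'                 # plain assignment appends at the end
df.insert(1, 'bar', df['one'])    # position 1: right after 'one'
print(list(df.columns))

try:
    df.insert(0, 'bar', 0)        # 'bar' already exists
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```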
Use assign to derive new columns from existing columns:
In [77]: iris = pd.read_csv('data/iris.data')

In [78]: iris.head()
Out[78]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....:
Out[79]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000
Note that assign returns a new DataFrame and leaves the original DataFrame unchanged.
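The example above reads a local CSV file, so it cannot be run without that file. The following sketch reproduces the idea on a two-row frame typed in by hand (the values are the first two iris rows shown above) and passes a callable to assign, which is convenient in method chains because the callable receives the DataFrame being assigned to:

```python
import pandas as pd

# Two rows copied from the iris output above; no file is needed.
iris_small = pd.DataFrame({
    'SepalLength': [5.1, 4.9],
    'SepalWidth': [3.5, 3.0],
})

# The lambda is called with the DataFrame itself.
with_ratio = iris_small.assign(
    sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength']
)

print(list(with_ratio.columns))
print(list(iris_small.columns))   # the original has no new column
```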
The following table summarizes the basic indexing and selection operations on a DataFrame:
Operation | Syntax | Result |
---|---|---|
Select a column | df[col] | Series |
Select a row by label | df.loc[label] | Series |
Select a row by integer position | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |
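Each row of the table can be tried on a small frame. The sketch below uses made-up data and a shorter slice (1:3 instead of 5:10) because the frame has only four rows:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]},
                  index=['a', 'b', 'c', 'd'])

col = df['one']                # select a column          -> Series
row_by_label = df.loc['b']     # select a row by label    -> Series
row_by_pos = df.iloc[2]        # select a row by position -> Series
rows = df[1:3]                 # slice rows               -> DataFrame
masked = df[df['one'] > 2]     # boolean vector           -> DataFrame

for result in (col, row_by_label, row_by_pos, rows, masked):
    print(type(result).__name__)
```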
This article has been included in http://www.flydean.com/03-python-pandas-data-structures/