This post is a reading note on "Python for Data Analysis". Please do not reproduce it for commercial purposes.
1,Series
A Series is a one-dimensional array-like object that contains a sequence of values and an associated array of data labels, called its index. The simplest Series is formed from just an array of data:
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
print(obj)
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64
The string representation of a Series shows the index on the left and the values on the right. Since no index was specified for the data, a default index of the integers 0 through N-1 (where N is the length of the data) is created. The values and the index of a Series are available through its values and index attributes, respectively:
print(obj.values)
# [ 4  7 -5  3]
print(obj.index)
# RangeIndex(start=0, stop=4, step=1)
It is often desirable to create a Series with an index whose labels identify each data point:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
# d    4
# b    7
# a   -5
# c    3
# dtype: int64
print(obj2.index)
# Index(['d', 'b', 'a', 'c'], dtype='object')
Compared with NumPy arrays, a Series lets you use index labels when selecting a single value or a set of values:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2['a'])
# -5
obj2['d'] = 6
print(obj2[['c', 'a', 'd']])
# c    3
# a   -5
# d    6
# dtype: int64
Note the double brackets used to select the values of obj2 labeled 'c', 'a' and 'd': the inner ['c', 'a', 'd'] is a list of index labels, which here are strings rather than integers.
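The difference between a single label and a list of labels can be sketched as follows. This is a minimal illustration that reuses the obj2 data from above; the variable names single, subset and same_subset are made up for the example.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A single label returns one scalar value.
single = obj2['a']

# A list of labels returns a new Series, in the order the labels were given.
subset = obj2[['c', 'a', 'd']]

# .loc does the same selection and makes the label-based intent explicit.
same_subset = obj2.loc[['c', 'a', 'd']]

print(single)                # -5
print(list(subset.index))    # ['c', 'a', 'd']
```

Using .loc is optional here, but it avoids ambiguity when the index labels are themselves integers.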
Using NumPy functions or NumPy-style operations, such as filtering with a Boolean array, multiplying by a scalar, or applying math functions, preserves the link between index and value:
import numpy as np

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2[obj2 > 0])
# d    4
# b    7
# c    3
# dtype: int64
print(obj2 * 2)
# d     8
# b    14
# a   -10
# c     6
# dtype: int64
print(np.exp(obj2))
# d      54.598150
# b    1096.633158
# a       0.006738
# c      20.085537
# dtype: float64
Another way to think about a Series is as a fixed-length, ordered dictionary, since it maps index values to data values. It can be used in many contexts where you would use a dictionary:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print('b' in obj2)
# True
print('e' in obj2)
# False
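A few more dictionary-like operations, sketched on the same obj2 data. The variable names (has_b, has_7, missing, as_dict) are only for illustration.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, not at the values.
has_b = 'b' in obj2    # 'b' is a label
has_7 = 7 in obj2      # 7 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label.
missing = obj2.get('e', 0)

# A Series converts back to a plain dict of label -> value.
as_dict = obj2.to_dict()

print(has_b, has_7)    # True False
print(missing)         # 0
```

The point to remember is that `in` behaves as it does for a dict: it checks keys (labels) only.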
If you already have data in a Python dictionary, you can create a Series from it by passing the dictionary:
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
print(obj3)
# Ohio      35000
# Texas     71000
# Oregon    16000
# Utah       5000
# dtype: int64
When only a dictionary is passed, the index of the resulting Series is built from the dictionary's keys. Older pandas versions sorted the keys; recent versions keep the dictionary's insertion order, as in the output above. To control the order yourself, pass the keys in the order you want as the index argument:
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
# California        NaN
# Ohio          35000.0
# Oregon        16000.0
# Texas         71000.0
# dtype: float64
NaN ("not a number") is how pandas marks missing (NA) values. Here 'California' has no entry in sdata, so its value is NaN, and 'Utah' is excluded because it is not in states. The isnull and notnull functions in pandas detect missing data:
print(pd.isnull(obj4))
# California     True
# Ohio          False
# Oregon        False
# Texas         False
# dtype: bool
print(pd.notnull(obj4))
# California    False
# Ohio           True
# Oregon         True
# Texas          True
# dtype: bool
isnull and notnull are also instance methods of Series:
print(obj4.isnull())
# California     True
# Ohio          False
# Oregon        False
# Texas         False
# dtype: bool
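The last few steps can be combined into one short sketch: build a Series from a dictionary with an explicit index, then count and filter the missing entries. The names missing_count, present and same_mask are illustrative only.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

# 'California' is not a key of sdata, so it becomes NaN;
# 'Utah' is not in states, so it is left out.
obj4 = pd.Series(sdata, index=states)

# Count the missing entries and keep only the labels that have data.
missing_count = int(obj4.isnull().sum())
present = obj4[obj4.notnull()]

# isna/notna are aliases of isnull/notnull in current pandas.
same_mask = obj4.isna().equals(obj4.isnull())

print(missing_count)          # 1
print(list(present.index))    # ['Ohio', 'Oregon', 'Texas']
```

Because NaN is a floating-point value, the integer data is stored as float64 once a missing entry is present.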
Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:
obj4.name = 'population'
obj4.index.name = 'states'
print(obj4)
# states
# California        NaN
# Ohio          35000.0
# Oregon        16000.0
# Texas         71000.0
# Name: population, dtype: float64
The index of a Series can be altered in place by assignment:
obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)
# Bob      4
# Steve    7
# Jeff    -5
# Ryan     3
# dtype: int64
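One detail worth knowing when assigning an index: the new labels must match the length of the data. A minimal sketch (the error variable is only there to capture the exception for inspection):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Four values, four labels: the assignment succeeds.
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

# Four values, two labels: pandas raises ValueError and leaves the index as is.
try:
    obj.index = ['Bob', 'Steve']
    error = None
except ValueError as exc:
    error = exc

print(list(obj.index))    # ['Bob', 'Steve', 'Jeff', 'Ryan']
```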
2,DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dictionary, or other collection of one-dimensional arrays.
Although a DataFrame is two-dimensional, hierarchical indexing lets it represent higher-dimensional data in tabular form. Hierarchical indexing is one of the more advanced data-handling features in pandas.
There are many ways to construct a DataFrame. One of the most common is from a dictionary of equal-length lists or NumPy arrays:
data = {'states': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)
#    states  year  pop
# 0    Ohio  2000  1.5
# 1    Ohio  2001  1.7
# 2    Ohio  2002  3.6
# 3  Nevada  2001  2.4
# 4  Nevada  2002  2.9
# 5  Nevada  2003  3.2
The resulting DataFrame is given a default integer index automatically, as with Series, and its columns follow the order of the dictionary's keys (older pandas versions sorted them). For a large DataFrame, the head method selects only the first five rows:
print(frame.head())
#    states  year  pop
# 0    Ohio  2000  1.5
# 1    Ohio  2001  1.7
# 2    Ohio  2002  3.6
# 3  Nevada  2001  2.4
# 4  Nevada  2002  2.9
If you specify a sequence of columns, the DataFrame's columns are arranged in that order:
frame1 = pd.DataFrame(data, columns=['year', 'states', 'pop'])
print(frame1)
#    year  states  pop
# 0  2000    Ohio  1.5
# 1  2001    Ohio  1.7
# 2  2002    Ohio  3.6
# 3  2001  Nevada  2.4
# 4  2002  Nevada  2.9
# 5  2003  Nevada  3.2
If you pass a column that is not contained in the dictionary, it appears in the result filled with missing values:
frame2 = pd.DataFrame(data, columns=['year', 'states', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
#        year  states  pop debt
# one    2000    Ohio  1.5  NaN
# two    2001    Ohio  1.7  NaN
# three  2002    Ohio  3.6  NaN
# four   2001  Nevada  2.4  NaN
# five   2002  Nevada  2.9  NaN
# six    2003  Nevada  3.2  NaN
A column in a DataFrame can be retrieved as a Series either with dictionary-style notation or as an attribute:
print(frame2['states'])
# one        Ohio
# two        Ohio
# three      Ohio
# four     Nevada
# five     Nevada
# six      Nevada
# Name: states, dtype: object
print(frame2.year)
# one      2000
# two      2001
# three    2002
# four     2001
# five     2002
# six      2003
# Name: year, dtype: int64
Note that the returned Series has the same index as the DataFrame, and its name attribute is set appropriately. Rows can also be retrieved by label with the special loc attribute:
print(frame2.loc['three'])
# year      2002
# states    Ohio
# pop        3.6
# debt       NaN
# Name: three, dtype: object
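For comparison, here is a small sketch of selecting the same row by label with loc and by integer position with iloc. It rebuilds frame2 so that it runs on its own; row_by_label and row_by_pos are illustrative names.

```python
import pandas as pd

data = {'states': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'states', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

# loc selects by index label.
row_by_label = frame2.loc['three']

# iloc selects by integer position (0-based), so position 2 is the third row.
row_by_pos = frame2.iloc[2]

print(row_by_label['year'])    # 2002
print(row_by_pos.name)         # three
```

In both cases the row comes back as a Series whose index is the DataFrame's column labels.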
Columns can be modified by assignment. For example, the empty 'debt' column can be assigned a scalar value or an array of values:
frame2['debt'] = 16.5
print(frame2)
#        year  states  pop  debt
# one    2000    Ohio  1.5  16.5
# two    2001    Ohio  1.7  16.5
# three  2002    Ohio  3.6  16.5
# four   2001  Nevada  2.4  16.5
# five   2002  Nevada  2.9  16.5
# six    2003  Nevada  3.2  16.5
frame2['debt'] = np.arange(6)
print(frame2)
#        year  states  pop  debt
# one    2000    Ohio  1.5     0
# two    2001    Ohio  1.7     1
# three  2002    Ohio  3.6     2
# four   2001  Nevada  2.4     3
# five   2002  Nevada  2.9     4
# six    2003  Nevada  3.2     5
When you assign a list or array to a column, the length of the value must match the length of the DataFrame. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and any index values not present in the Series are filled with missing values:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)
#        year  states  pop  debt
# one    2000    Ohio  1.5   NaN
# two    2001    Ohio  1.7  -1.2
# three  2002    Ohio  3.6   NaN
# four   2001  Nevada  2.4  -1.5
# five   2002  Nevada  2.9  -1.7
# six    2003  Nevada  3.2   NaN
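The contrast between assigning a Series (aligned by label) and assigning a plain list (matched by length) can be sketched on a smaller, made-up frame. The column name 'bad' and the variable length_error are hypothetical and exist only for this example.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A Series is aligned on index labels: rows without a match become NaN,
# and labels that are not in the DataFrame ('four') are ignored.
frame['debt'] = pd.Series([-1.2, -1.5], index=['two', 'four'])

# A plain list carries no labels, so its length must match exactly.
try:
    frame['bad'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame['debt'].isnull().tolist())    # [True, False, True]
```

Alignment by label is convenient, but it also means a Series with the wrong index silently produces NaN rather than an error.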
Assigning to a column that does not exist creates a new column. The del keyword deletes columns, as it does with a dictionary.
To set up the del example, first add a new Boolean column that records whether the states column equals 'Ohio':
frame2['eastern'] = frame2.states == 'Ohio'
print(frame2)
#        year  states  pop  debt  eastern
# one    2000    Ohio  1.5   NaN     True
# two    2001    Ohio  1.7   NaN     True
# three  2002    Ohio  3.6   NaN     True
# four   2001  Nevada  2.4   NaN    False
# five   2002  Nevada  2.9   NaN    False
# six    2003  Nevada  3.2   NaN    False
Note: a new column cannot be created with the frame2.eastern attribute syntax.
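A short sketch of why the attribute syntax does not work, using a small made-up frame. The warning is suppressed here only to keep the output clean; created_by_attribute is an illustrative name.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'states': ['Ohio', 'Nevada']})

# Attribute assignment only sets a plain Python attribute on the object.
# pandas emits a UserWarning about it, and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.eastern = frame['states'] == 'Ohio'

created_by_attribute = 'eastern' in frame.columns

# Bracket assignment is what actually creates the column.
frame['eastern'] = frame['states'] == 'Ohio'

print(created_by_attribute)    # False
print(list(frame.columns))     # ['states', 'eastern']
```

Attribute access such as frame.states remains a convenient way to read an existing column; it is only creation that requires brackets.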
The del keyword can then be used to remove the column created above:
del frame2['eastern']
print(frame2.columns)
# Index(['year', 'states', 'pop', 'debt'], dtype='object')
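Note: a column selected from a DataFrame has traditionally been a view on the underlying data, not a copy, so in-place changes to the Series could be reflected in the DataFrame. Whether that happens depends on the pandas version and its copy-on-write setting, so it is best not to rely on it. If you need an independent copy, call the Series's copy method explicitly.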
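A minimal sketch of the safe pattern, on a small made-up frame: take an explicit copy when you want independent data, and assign through the DataFrame when you want to change the DataFrame. The name pop_copy is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# An explicit copy owns its data, so editing it never touches the DataFrame.
pop_copy = frame['pop'].copy()
pop_copy['one'] = 99.0

# To change the DataFrame itself, assign through the DataFrame.
frame.loc['two', 'pop'] = 2.0

print(frame['pop'].tolist())    # [1.5, 2.0, 3.6]
print(pop_copy.tolist())        # [99.0, 1.7, 3.6]
```

Written this way, the code behaves the same regardless of whether column selection returns a view or a copy.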
Another common form of data is a nested dictionary of dictionaries. If the nested dictionary is passed to the DataFrame constructor, pandas interprets the outer keys as the columns and the inner keys as the row index:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
print(frame3)
#       Nevada  Ohio
# 2001     2.4   1.7
# 2002     2.9   3.6
# 2000     NaN   1.5
You can transpose the DataFrame (swap rows and columns) with syntax similar to that of a NumPy array:
print(frame3.T)
#         2001  2002  2000
# Nevada   2.4   2.9   NaN
# Ohio     1.7   3.6   1.5
The keys of the inner dictionaries are combined to form the index of the result. Older pandas versions also sorted them; the output above keeps the order in which the keys first appear. If an index is specified explicitly, it is used as given instead:
print(pd.DataFrame(pop, index=[2001, 2002, 2003]))
#       Nevada  Ohio
# 2001     2.4   1.7
# 2002     2.9   3.6
# 2003     NaN   NaN
A dictionary of Series can also be used to construct a DataFrame:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
print(pd.DataFrame(pdata))
#       Ohio  Nevada
# 2001   1.7     2.4
# 2002   3.6     2.9
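When the Series in the dictionary do not share exactly the same labels, the row index of the result is the union of their indexes. A small sketch with made-up Series (ohio, nevada and by_column are illustrative names):

```python
import pandas as pd

ohio = pd.Series([1.5, 1.7], index=[2000, 2001])
nevada = pd.Series([2.4, 2.9], index=[2001, 2002])

# Each Series becomes a column. The row index is the union of both
# Series indexes, and positions without data are filled with NaN.
by_column = pd.DataFrame({'Ohio': ohio, 'Nevada': nevada})

print(sorted(by_column.index))    # [2000, 2001, 2002]
```

Only the year 2001 appears in both Series, so it is the only row with a value in every column.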
As with Series, the values attribute of a DataFrame returns the data it contains as a two-dimensional ndarray:
print(frame3.values)
# [[2.4 1.7]
#  [2.9 3.6]
#  [nan 1.5]]
If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns:
print(frame2.values)
# [[2000 'Ohio' 1.5 nan]
#  [2001 'Ohio' 1.7 nan]
#  [2002 'Ohio' 3.6 nan]
#  [2001 'Nevada' 2.4 nan]
#  [2002 'Nevada' 2.9 nan]
#  [2003 'Nevada' 3.2 nan]]
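The way the common dtype is chosen can be seen by comparing two small made-up frames, one purely numeric and one that mixes numbers and strings:

```python
import pandas as pd

numeric = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})
mixed = pd.DataFrame({'year': [2000, 2001], 'states': ['Ohio', 'Nevada']})

# Integer and float columns share a common numeric type: float64.
print(numeric.values.dtype)    # float64

# A string column has no common numeric type with integers,
# so the array falls back to the generic object dtype.
print(mixed.values.dtype)      # object
```

An object array is flexible but slow for numeric work, so it is usually better to select the numeric columns before calling values.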
Valid input for the DataFrame constructor
Type | Notes |
2D ndarray | A matrix of data; row and column labels are optional parameters |
Dictionary of arrays, lists, or tuples | Each sequence becomes a column of the DataFrame; all sequences must be the same length |
NumPy structured/record array | Treated the same as a dictionary of arrays |
Dictionary of Series | Each value becomes a column; the indexes of the Series are combined to form the row index of the result unless an index is passed explicitly |
Dictionary of dictionaries | Each inner dictionary becomes a column; the keys are combined to form the row index of the result |
List of dictionaries or Series | Each element becomes a row of the DataFrame; the dictionary keys or Series indexes are combined to form the column labels |
List of lists or tuples | Treated the same as a 2D ndarray |
Another DataFrame | The indexes of the original DataFrame are used unless different ones are passed explicitly |
NumPy MaskedArray | Like a 2D ndarray, except that masked values become NA/missing values in the resulting DataFrame |

3,Index object
Index objects in pandas hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(obj)
# a    0
# b    1
# c    2
# dtype: int64
print(index)
# Index(['a', 'b', 'c'], dtype='object')
print(index[1:])
# Index(['b', 'c'], dtype='object')
Index objects are immutable, so they cannot be modified by the user. Immutability makes it safer to share Index objects among several data structures:
labels = pd.Index(np.arange(3))
print(labels)
# Int64Index([0, 1, 2], dtype='int64')
# (pandas 2.0 and later print: Index([0, 1, 2], dtype='int64'))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)
# 0    1.5
# 1   -2.5
# 2    0.0
# dtype: float64
print(obj2.index is labels)
# True
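Immutability is easy to check directly. In this sketch, item assignment on an Index fails, while methods such as insert return a new Index and leave the original untouched. The names error and renamed are illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment on an Index raises TypeError.
try:
    index[1] = 'd'
    error = None
except TypeError as exc:
    error = exc

# "Changing" an Index means building a new one; the original is unchanged.
renamed = index.insert(1, 'z')

print(list(index))      # ['a', 'b', 'c']
print(list(renamed))    # ['a', 'z', 'b', 'c']
```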
Many users will rarely use the capabilities of an Index directly, but because some operations produce results containing indexed data, it is important to understand how indexes work. Some of the methods and properties of Index:
Method | Description |
append | Concatenate additional Index objects to the original, producing a new Index |
difference | Compute the set difference of two indexes |
intersection | Compute the set intersection of two indexes |
union | Compute the set union of two indexes |
isin | Compute a Boolean array indicating whether each value is contained in the passed collection |
delete | Compute a new Index with the element at position i deleted |
drop | Compute a new Index with the passed values deleted |
insert | Compute a new Index with an element inserted at position i |
is_monotonic | Returns True if each element is greater than or equal to the previous one (removed in pandas 2.0; use is_monotonic_increasing) |
is_unique | Returns True if the Index has no duplicate values |
unique | Compute the unique values in the Index |
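A few of the entries in the table can be tried on two small made-up indexes. This is only a sketch; left, right and the result names are illustrative, and is_monotonic_increasing is used in place of is_monotonic so that the code also runs on recent pandas versions.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Set-style operations each return a new Index.
union = left.union(right)
common = left.intersection(right)
only_left = left.difference(right)

# isin returns a Boolean array, one entry per element of `left`.
mask = left.isin(['a', 'c'])

# drop removes by value, delete removes by position.
without_b = left.drop('b')
without_first = left.delete(0)

print(list(union))        # ['a', 'b', 'c', 'd']
print(list(common))       # ['b', 'c']
print(list(only_left))    # ['a']
print(list(mask))         # [True, False, True]
```

None of these calls modifies left or right, which is consistent with Index objects being immutable.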