String functions
Pandas provides a set of string methods that make it convenient to operate on string data. Most importantly, these methods skip missing (NaN) values instead of raising an error. Almost all of them mirror Python's built-in string methods. To use them, access the Series (or Index) through its .str accessor and then call the method, which is applied to each element.
No. | Function | Description |
---|---|---|
1 | lower() | Converts strings in the Series/Index to lowercase. |
2 | upper() | Converts strings in the Series/Index to uppercase. |
3 | len() | Computes the length of each string. |
4 | strip() | Strips whitespace (including newlines) from both sides of each string in the Series/Index. |
5 | split(' ') | Splits each string on the given pattern. |
6 | cat(sep=' ') | Concatenates the Series/Index elements with the given separator. |
7 | get_dummies() | Returns a DataFrame of one-hot encoded values. |
8 | contains(pattern) | Returns True for each element that contains the substring or pattern, otherwise False. |
9 | replace(a,b) | Replaces the value a with the value b. |
10 | repeat(value) | Repeats each element the specified number of times. |
11 | count(pattern) | Returns the number of occurrences of the pattern in each element. |
12 | startswith(pattern) | Returns True if the element in the Series/Index starts with the pattern. |
13 | endswith(pattern) | Returns True if the element in the Series/Index ends with the pattern. |
14 | find(pattern) | Returns the position of the first occurrence of the pattern in each element, or -1 if it is not found. |
15 | findall(pattern) | Returns a list of all occurrences of the pattern. |
16 | swapcase() | Swaps the case of each letter (lowercase to uppercase and vice versa). |
17 | islower() | Checks whether all characters in each string in the Series/Index are lowercase and returns a Boolean value. |
18 | isupper() | Checks whether all characters in each string in the Series/Index are uppercase and returns a Boolean value. |
19 | isnumeric() | Checks whether all characters in each string in the Series/Index are numeric and returns a Boolean value. |
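As a minimal sketch of how a few of the methods in the table behave, the following applies them through the .str accessor to a small Series that contains one missing value. The sample strings and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the original text); one element is missing.
s = pd.Series(["  Tom ", "william rick", np.nan, "Alber@t", "1234"])

lowered = s.str.lower()        # the missing element stays missing
lengths = s.str.len()          # length of each string, whitespace included
stripped = s.str.strip()       # leading/trailing whitespace removed
has_at = s.str.contains("@", regex=False)  # literal substring test
numeric = s.str.isnumeric()    # True only for all-numeric strings

# cat() without `others` joins all elements into one string;
# missing values are omitted when na_rep is not given.
joined = stripped.str.cat(sep="_")

# get_dummies() splits on a separator and one-hot encodes the pieces.
dummies = pd.Series(["a|b", "a", "b|c"]).str.get_dummies(sep="|")

print(lowered)
print(lengths)
print(joined)
print(dummies)
```

Every call goes through .str, so each operation runs element by element and the missing entry does not raise an error.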
Reindexing
reindex() changes the row labels and column labels of a DataFrame. To reindex means to conform the data to a given set of labels along a particular axis. Several operations can be accomplished through reindexing:
- Reorder the existing data to match a new set of labels.
- Insert missing-value (NA) markers at label locations where no data exists for that label.
    >>> import pandas as pd
    >>> import numpy as np
    >>> N = 20
    >>> df = pd.DataFrame({
    ...     'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
    ...     'x': np.linspace(0, stop=N-1, num=N),
    ...     'y': np.random.rand(N),
    ...     'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
    ...     'D': np.random.normal(100, 10, size=(N)).tolist()
    ... })
    >>> # reindex the DataFrame; column 'B' does not exist, so it is filled with NaN
    >>> df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])
    >>> df_reindexed
               A     C   B
    0 2016-01-01   Low NaN
    2 2016-01-03  High NaN
    5 2016-01-06   Low NaN
1. Realigning the index with another object
Sometimes you want to take an object and reindex it so that its axes are labeled the same as those of another object. The reindex_like() method does this.
    >>> import pandas as pd
    >>> import numpy as np
    >>> df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
    >>> df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
    >>> df1 = df1.reindex_like(df2)
    >>> df1
           col1      col2      col3
    0 -2.467652 -1.211687 -0.391761
    1 -0.287396  0.522350  0.562512
    2 -0.255409 -0.483250  1.866258
    3 -1.150467 -0.646493 -0.222462
    4  0.152768 -2.056643  1.877233
    5 -1.155997  1.528719 -1.343719
    6 -1.015606 -1.245936 -0.295275
Note: here the df1 DataFrame is altered and reindexed to match df2, so only the rows whose labels appear in df2 are kept. The column names should match; otherwise NaN is filled in for the entire column.
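To illustrate the note about mismatched column names, here is a small sketch with deterministic data. The frames `left` and `right` and the column name `other` are hypothetical, chosen only to show the effect.

```python
import numpy as np
import pandas as pd

# `left` has columns col1..col3; `right` replaces col3 with a column named "other".
left = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["col1", "col2", "col3"])
right = pd.DataFrame(np.zeros((2, 3)), columns=["col1", "col2", "other"])

# Conform `left` to the row and column labels of `right`.
aligned = left.reindex_like(right)

print(aligned)
```

The result keeps only rows 0 and 1 (the labels present in `right`), drops `col3`, and adds an `other` column made up entirely of NaN, because `left` has no data under that label.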
2. Filling while reindexing
reindex() takes an optional parameter, method, which specifies how to fill the newly introduced labels. It accepts the following values:
- pad/ffill - fill values forward
- bfill/backfill - fill values backward
- nearest - fill from the nearest index value
    >>> import pandas as pd
    >>> import numpy as np
    >>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
    >>> df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
    >>> df2.reindex_like(df1)
           col1      col2      col3
    0  1.311620 -0.707176  0.599863
    1 -0.423455 -0.700265  1.133371
    2       NaN       NaN       NaN
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
    >>> df2.reindex_like(df1,method='ffill')
           col1      col2      col3
    0  1.311620 -0.707176  0.599863
    1 -0.423455 -0.700265  1.133371
    2 -0.423455 -0.700265  1.133371
    3 -0.423455 -0.700265  1.133371
    4 -0.423455 -0.700265  1.133371
    5 -0.423455 -0.700265  1.133371
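The example above only shows forward filling. As a rough sketch of how the three fill methods differ, the following reindexes a small Series with gaps in its index. The values and index are invented for illustration, and the index is kept sorted because the fill methods require a monotonic index.

```python
import pandas as pd

# Known values at labels 0, 3 and 6; labels 1, 2, 4, 5 are new.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 3, 6])
new_index = range(7)

forward = s.reindex(new_index, method="ffill")     # carry the previous value forward
backward = s.reindex(new_index, method="bfill")    # pull the next value backward
nearest = s.reindex(new_index, method="nearest")   # take the closest existing label

print(pd.DataFrame({"ffill": forward, "bfill": backward, "nearest": nearest}))
```

Label 2, for instance, receives 10.0 with ffill (from label 0) but 20.0 with bfill and nearest (from label 3).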
3. Limiting the fill while reindexing
The limit parameter provides additional control over filling while reindexing. It specifies the maximum number of consecutive missing labels to fill.
    >>> import pandas as pd
    >>> import numpy as np
    >>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
    >>> df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
    >>> df2.reindex_like(df1)
           col1      col2      col3
    0  0.247784  2.128727  0.702576
    1 -0.055713 -0.021732 -0.174577
    2       NaN       NaN       NaN
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
    >>> df2.reindex_like(df1,method='ffill',limit=1)
           col1      col2      col3
    0  0.247784  2.128727  0.702576
    1 -0.055713 -0.021732 -0.174577
    2 -0.055713 -0.021732 -0.174577
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
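For a reproducible comparison of different limit values, the same behavior can be sketched on a small Series with fixed values. The data below is illustrative only.

```python
import pandas as pd

# Two known values; labels 2, 3 and 4 are introduced by reindexing.
s = pd.Series([1.0, 2.0], index=[0, 1])
target = range(5)

one = s.reindex(target, method="ffill", limit=1)  # fill at most 1 consecutive gap
two = s.reindex(target, method="ffill", limit=2)  # fill at most 2 consecutive gaps

print(pd.DataFrame({"limit=1": one, "limit=2": two}))
```

With limit=1 only label 2 is filled and labels 3 and 4 stay NaN; with limit=2 labels 2 and 3 are filled and only label 4 stays NaN.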
4. Renaming
The rename() method lets you relabel an axis based on a mapping (a dictionary or a Series) or an arbitrary function.
    >>> import pandas as pd
    >>> import numpy as np
    >>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
    >>> df1
           col1      col2      col3
    0  0.486791  0.105759  1.540122
    1 -0.990237  1.007885 -0.217896
    2 -0.483855 -1.645027 -1.194113
    3 -0.122316  0.566277 -0.366028
    4 -0.231524 -0.721172 -0.112007
    5  0.438810  0.000225  0.435479
    >>> df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},index = {0 : 'apple', 1 : 'banana', 2 : 'durian'})
                  c1        c2      col3
    apple   0.486791  0.105759  1.540122
    banana -0.990237  1.007885 -0.217896
    durian -0.483855 -1.645027 -1.194113
    3      -0.122316  0.566277 -0.366028
    4      -0.231524 -0.721172 -0.112007
    5       0.438810  0.000225  0.435479
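The example above uses dictionaries. Since rename() also accepts a function, a minimal sketch of that variant might look like the following, where the frame and the `row_` prefix are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=["a", "b"])

# Each column label is passed to str.upper; each index label to the lambda.
renamed = df.rename(columns=str.upper, index=lambda label: f"row_{label}")

print(renamed)
print(df.columns.tolist())  # the original frame keeps its labels
```

rename() returns a new object by default, so `df` itself is left unchanged unless the result is assigned back.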