Pandas basics: string functions and reindexing

Posted by English Fire on Fri, 17 Jan 2020 13:47:50 +0100


String functions

Pandas provides a set of string functions that make it convenient to operate on string data. Most importantly, these functions skip missing/NaN values. Almost all of them mirror Python's built-in string methods and are accessed through the .str accessor of a Series, which applies the operation to each string element.

No.  Function             Description
1    lower()              Converts strings in the Series/Index to lowercase.
2    upper()              Converts strings in the Series/Index to uppercase.
3    len()                Computes the length of each string.
4    strip()              Removes whitespace (including newlines) from both sides of each string in the Series/Index.
5    split(' ')           Splits each string on the given pattern.
6    cat(sep=' ')         Concatenates the Series/Index elements with the given separator.
7    get_dummies()        Returns a DataFrame of one-hot encoded values.
8    contains(pattern)    Returns True for each element that contains the pattern, otherwise False.
9    replace(a,b)         Replaces the value a with the value b.
10   repeat(value)        Repeats each element the specified number of times.
11   count(pattern)       Returns the number of occurrences of the pattern in each element.
12   startswith(pattern)  Returns True if the element in the Series/Index starts with the pattern.
13   endswith(pattern)    Returns True if the element in the Series/Index ends with the pattern.
14   find(pattern)        Returns the position of the first occurrence of the pattern.
15   findall(pattern)     Returns a list of all occurrences of the pattern.
16   swapcase()           Swaps the case of each letter.
17   islower()            Checks whether all characters in each string in the Series/Index are lowercase; returns a Boolean value.
18   isupper()            Checks whether all characters in each string in the Series/Index are uppercase; returns a Boolean value.
19   isnumeric()          Checks whether all characters in each string in the Series/Index are numeric; returns a Boolean value.
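
A few of the functions listed above can be tried out through the .str accessor. The following is a minimal sketch; the sample strings are made up for illustration and do not come from the article. It shows that the missing entry stays missing instead of raising an error, and passes na=False to contains() so that the missing entry is reported as False.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not from the article); one entry is missing.
s = pd.Series(["  Tom ", "William Rick", np.nan, "alber@t"])

lowered = s.str.lower()                   # the missing entry stays missing
lengths = s.str.len()                     # length of each string, missing stays missing
stripped = s.str.strip()                  # leading/trailing whitespace removed
has_at = s.str.contains("@", na=False)    # na=False reports missing entries as False
joined = stripped.str.cat(sep="_")        # missing entries are skipped when joining

print(lowered)
print(lengths)
print(has_at)
print(joined)
```

Each call returns a new object, so the original Series s is left unchanged.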

Reindexing

reindex() changes the row and column labels of a DataFrame. Reindexing means conforming the data to match a given set of labels along a particular axis. Reindexing can accomplish several things:

  1. Reorder existing data to match a new set of labels.
  2. Insert missing-value (NA) markers at label locations where no data exists for that label.

>>>import pandas as pd
>>>import numpy as np
>>>N=20
>>>df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

#reindex the DataFrame; column 'B' does not exist in df, so it is filled with NaN
>>>df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
>>>df_reindexed
            A    C     B
0  2016-01-01  Low   NaN
2  2016-01-03  High  NaN
5  2016-01-06  Low   NaN
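
Because the example above uses random data, the two effects of reindexing are easier to check on a small fixed frame. This is a minimal sketch with made-up values (an assumption, not the article's data): rows are reordered to follow the new labels, and labels that do not exist in the original are filled with NaN.

```python
import pandas as pd

# Small fixed frame (illustrative values only).
df = pd.DataFrame({"A": [10, 20, 30], "C": ["x", "y", "z"]}, index=[0, 1, 2])

# Rows follow the order of the new labels; row label 5 and column 'B'
# are not present in df, so they are filled with NaN.
out = df.reindex(index=[2, 0, 5], columns=["A", "B"])
print(out)
```

reindex() returns a new DataFrame; df itself keeps its original rows and columns.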

1. Realigning the index with another object

Sometimes you want to reindex an object so that its axis labels match those of another object.

>>>import pandas as pd
>>>import numpy as np
>>>df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
>>>df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
>>>df1 = df1.reindex_like(df2)
>>>df1
          col1         col2         col3
0    -2.467652    -1.211687    -0.391761
1    -0.287396     0.522350     0.562512
2    -0.255409    -0.483250     1.866258
3    -1.150467    -0.646493    -0.222462
4     0.152768    -2.056643     1.877233
5    -1.155997     1.528719    -1.343719
6    -1.015606    -1.245936    -0.295275

Note: here df1 is changed and reindexed to match df2. Column names must match; otherwise the entire non-matching column is filled with NaN.
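
The column-name caveat can be seen in a minimal sketch where the two frames share only some column names. The frames and the column name 'other' are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["col1", "col2", "col3"])
# df2 has a column named 'other' that df1 does not have (hypothetical name).
df2 = pd.DataFrame(np.zeros((2, 3)), columns=["col1", "col2", "other"])

# The result takes its row and column labels from df2 and its values from df1.
aligned = df1.reindex_like(df2)
print(aligned)
```

The result has the two rows of df2; 'col3' is dropped because df2 does not have it, and 'other' contains only NaN because df1 has no data under that label.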

2. Filling while reindexing

reindex() takes an optional parameter, method, which specifies how to fill missing values and accepts the following values:

  1. pad/ffill - fill values forward
  2. bfill/backfill - fill values backward
  3. nearest - fill from the nearest index value
>>>import pandas as pd
>>>import numpy as np
>>>df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
>>>df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
>>>df2.reindex_like(df1)
         col1        col2       col3
0    1.311620   -0.707176   0.599863
1   -0.423455   -0.700265   1.133371
2         NaN         NaN        NaN
3         NaN         NaN        NaN
4         NaN         NaN        NaN
5         NaN         NaN        NaN

>>>df2.reindex_like(df1,method='ffill')
         col1        col2        col3
0    1.311620   -0.707176    0.599863
1   -0.423455   -0.700265    1.133371
2   -0.423455   -0.700265    1.133371
3   -0.423455   -0.700265    1.133371
4   -0.423455   -0.700265    1.133371
5   -0.423455   -0.700265    1.133371
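
The example above only shows ffill. As a minimal sketch of how the three methods differ, the following uses a small Series with made-up values (an assumption, not the article's data). Note that the method argument requires the original index to be monotonically increasing or decreasing.

```python
import pandas as pd

# Known values at labels 0, 5 and 10 (illustrative data).
s = pd.Series([1.0, 2.0, 3.0], index=[0, 5, 10])
new_index = [0, 2, 4, 6, 10]

forward = s.reindex(new_index, method="ffill")     # take the last known value before the label
backward = s.reindex(new_index, method="bfill")    # take the next known value after the label
nearest = s.reindex(new_index, method="nearest")   # take the value at the closest label

print(pd.DataFrame({"ffill": forward, "bfill": backward, "nearest": nearest}))
```

For label 4, for instance, ffill takes the value from label 0, while bfill and nearest both take the value from label 5.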

3. Limiting the fill while reindexing

The limit parameter provides additional control over filling while reindexing. It specifies the maximum number of consecutive missing values to fill.

>>>import pandas as pd
>>>import numpy as np
>>>df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
>>>df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
>>>df2.reindex_like(df1)
         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2         NaN         NaN         NaN
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN


>>>df2.reindex_like(df1,method='ffill',limit=1)
         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2   -0.055713   -0.021732   -0.174577
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN

4. Renaming

The rename() method lets you relabel an axis based on a mapping (a dict or Series) or an arbitrary function.

>>>import pandas as pd
>>>import numpy as np
>>>df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
>>>df1
         col1        col2        col3
0    0.486791    0.105759    1.540122
1   -0.990237    1.007885   -0.217896
2   -0.483855   -1.645027   -1.194113
3   -0.122316    0.566277   -0.366028
4   -0.231524   -0.721172   -0.112007
5    0.438810    0.000225    0.435479

>>>df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},index = {0 : 'apple', 1 : 'banana', 2 : 'durian'})
                c1          c2        col3
apple     0.486791    0.105759    1.540122
banana   -0.990237    1.007885   -0.217896
durian   -0.483855   -1.645027   -1.194113
3        -0.122316    0.566277   -0.366028
4        -0.231524   -0.721172   -0.112007
5         0.438810    0.000225    0.435479
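
The example above uses dictionaries; rename() also accepts functions, which are applied to every label on the axis. This is a minimal sketch with a made-up frame, where the 'row' prefix is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Each column label is passed to str.upper, each index label to the lambda.
renamed = df.rename(columns=str.upper, index=lambda i: f"row{i}")
print(renamed)
```

By default rename() returns a new DataFrame and leaves df unchanged.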

Topics: Python