Four skills of row column conversion in pandas

Posted by gszauer on Wed, 15 Dec 2021 15:24:45 +0100

Official account: Special House
Author: Peter
Editor: Peter

Hello, I'm Peter~

This article introduces the four row column conversion methods in Pandas, including:

  • melt
  • transpose T or transfer
  • wide_to_long
  • explode (explosion function)

Finally, answer a data processing question asked by a reader friend.

Pandas row column conversion

There are many methods in pandas to realize row column conversion:

Import library

import pandas as pd
import numpy as np

Function melt

Main parameters of melt:

pandas.melt(frame, 
            id_vars=None, 
            value_vars=None, 
            var_name=None, 
            value_name='value',
            ignore_index=True,  
            col_level=None)

The meaning of parameters is explained below:

  • Frame: the data frame to process.

  • id_vars: indicates column names that do not need to be converted

  • value_vars: indicates the column name to be converted. If all the remaining columns need to be converted, it is unnecessary to write

  • var_name and value_name: the column name corresponding to the custom setting, which is equivalent to taking a new column name

  • igonore_index: whether to ignore the original column name. The default is True, which means that the original index name is ignored and the natural index of 0,1,2,3,4... Is regenerated

  • col_level: this parameter is used if the column is a multi-level index column MultiIndex; This parameter is rarely used

Analog data

# Data to be converted: frame
df = pd.DataFrame({"col1":[1,1,1,1,1],
                   "col2":[3,3,3,3,3],
                   "col3":["a","a","a","b","b"]
                  })
df

id_vars

value_vars

The above two parameters are used simultaneously:

Convert multiple column attributes at the same time:

var_name and value_name

pd.melt(
    df,
    id_vars=["col1"],  # unchanged
    value_vars=["col3"],  # change
    var_name="col4",  # New column name
    value_name="col5" # The new column name of the corresponding value
)

ignore_index

Natural indexes are generated by default:

You can change it to False and use the original index:

Transpose function

The T attribute or transfer function in pandas implements the function of row to column conversion, which is exactly transpose

Simple transpose

Simulate a piece of data and view the transposed results:

Transpose using the transpose function:

There is another method: first transpose the value values, and then exchange the index and column names:

Finally, let's look at a simple case:

wide_to_long function

Literally: convert a dataset from a wide format to a long format

wide_to_long(
    df,
    stubnames,
    i,
    j,
    sep: str = "",
    suffix: str = "\\d+"

Specific explanation of parameters:

  • df: data frame to be converted
  • Stub names: the part of a wide table with the same column name
  • i: Column to use as id variable
  • j: Set columns for long format suffix columns
  • sep: sets the separator to delete. For example, if columns is A-2020, specify sep = '-' to delete the separator. The default value is empty.
  • Suffix: get "suffix" by setting regular expression. Default '\ D +' means to obtain a numeric suffix. "Suffix" without numbers can be obtained by '\ D +'

Analog data

Conversion process

Use functions to implement transformations:

Set multi-level index

First simulate a data:

If you are not used to multi-layer indexes, you can convert them to the following format:

sep and suffix

df5 = pd.DataFrame({
    'a': [1, 1, 2, 2, 3, 3, 3],
    'b': [1, 2, 2, 3, 1, 2, 3],
    'stu_one': [2.8, 2.9, 1.8, 1.9, 2.2, 2.3, 2.1],
    'stu_two': [3.4, 3.8, 2.8, 2.4, 3.3, 3.4, 2.9]
})
df5

pd.wide_to_long(
    df5, 
    stubnames='stu', 
    i=['a', 'b'], 
    j='number',
    sep='_', # Used when there is a connector in the column name; The default is empty
    suffix=r'\w+')  # Suffix based on regular expression; The default is number \ d +; It's changed here to \ w +, which means letters

Explosion function

explode(column, ignore_index=False)

This function has only two parameters:

  • column: element to explode
  • ignore_index: whether to ignore the index; The default is False and the original index is maintained

Analog data

Single field explosion

Perform an explosion process on a single field and transfer the wide table to the long table:

Parameter ignore_index

Multiple field explosion

The process of continuously exploding multiple fields:

Readers' doubts

Here to answer a reader's question, the data is in the form of simulation. The following data is required:

Proportion of each fruit in each shop

fruit = pd.DataFrame({
    "shop":["shop1","shop3","shop2","shop3",
            "shop2","shop1","shop3","shop2",
            "shop3","shop2","shop3","shop2","shop1"],
    "fruit":["Apple","Grape","Banana","Apple",
             "Grape","a mandarin orange","Pear","Hami melon",
             "Grape","Banana","Apple","Grape","a mandarin orange"],
    "number":[100,200,340,150,
              200,300,90,80,340,
              150,200,300,90]})
fruit

First, we need to count the sales of each shop and each fruit

Method 1: multi step

Method 1 uses a multi-step solution:

1. Total sales per shop

2. Add total shop_sum column

3. Generation proportion

Method 2: use the transform function

Topics: Python Machine Learning Data Analysis Data Mining pandas