Introduction and practice of Python data analysis - Notes

Posted by sam_h on Wed, 16 Feb 2022 06:12:24 +0100

Chapter 1 construction of experimental environment

This chapter will mainly introduce anaconda and Jupyter Notebook. Including how to install Anaconda on windows, Mac, linux and other platforms, as well as the basic startup and use methods of Jupyter Notebook.

1-1 guidance video

Mathematical science and machine learning

Mathematical science workflow

Specific course arrangement:

  • Chapter 1: Construction of experimental environment
  • Chapter 2: introduction to Numpy
  • Chapter 3: introduction to Pandas
  • Chapter 4: Pandas play data
  • Chapter 5: drawing and visualization - Matplotlib
  • Chapter 6: drawing and visualization Seaborn
  • Chapter 7: actual combat of data analysis project
  • Chapter 8: summary

Suitable for people:

  • Have certain self-study and practical ability
  • Have the most basic Python Foundation
  • I want to work in data analysis and machine learning related fields in the future


1-2 introduction to anaconda and Jupyter notebook

Anaconda/Jupyter notebook: open Data Science Platform

What is Anaconda?

  • The most famous Python data science platform
  • 750 + popular Python & R packages
  • Cross platform: Windows, Mac, Linux
  • conda: Extensible package management tool
  • Free distribution
  • A very active community

Installation of Anaconda

Download address

  • Now? https://www.anaconda.com/products/individual
  • Before: https://www.anaconda.com/download/

Check for correct installation:

cd ~/anaconda
bin/conda --version
conda 4.3.21

Conda: Package and Environment management

  • Install Packages
  • Update Packages
  • Create sandbox: Conda environment

Conda's Environment management
Create a new environment

conda create --name python34 python3.4

Activate an environment

activate python34 # for Windows
source activate python34 # for Linux & Mac

Exit an environment

deactivate python34 # for Windows
source deactivate python34 # for Linux & Mac

Delete an environment

conda remove --name python34 --all

package management of Conda
Conda's package management is a bit like pip
Install a Python package

conda install numpy

View installed Python packages

conda list
conda list -n python34 #View Python packages installed in the specified environment

Delete a Python package

conda remove --name python34 numpy

Data Science IDE vs Developer IDE

Data Science IDEs in Anaconda

From IPython to Jupyter

What is Ipython?

  • A powerful interactive shell
  • It's Jupyter's kernel
  • Support interactive data analysis and visualization

Ipython Kernel

  • Mainly responsible for running user code
  • Interact with the python shell through stdin/stdout
  • Interact with ZeroMQ and notebook with json message

What is Jupyter Notebook?

  • Formerly known as Ipython notebook
  • An open source Web application
  • You can create and share documents that contain code, views, and comments
  • It can be used in data statistics, analysis, modeling, machine learning and other fields

Interaction between Notebook and kernel

  • The core is Notebook server
  • Server save and notebook load

File format (. ipynb) of Notebook

  • A format defined by IPython Notebook (json)
  • You can read online data, CSV/XLS files
  • It can be converted to other formats (py,html,pdf,md, etc.)

NBViewer

  • An online ipynb format notebook presentation tool
  • Can be shared via url
  • Github integrates NBViewer
  • Easily integrate into blogs, emails, Wikis and Books through converter

Laboratory environment of this course

  • Installing Anaconda on Windows/Mac/Linux
  • Use Python 3 6 as the basic environment
  • Using Jupiter notebook as the programming IDE

1-3 installation demonstration of Anaconda on Mac

Download the macOS version installation package, python 3 6 + 64 bit version (as of February 15, 2022, python 3.9)
Anaconda3-2021.11-MacOSX-x86_64.pkg
Select Install for me only for other basic default options
It is not recommended to change the installation directory (1.44gb is required for installation)

~] ls
~] pwd
~] cd anaconda/
anaconda] ls
anaconda] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./conda list
bin] ./jupyter notebook # Open browser

1-4 Anaconda installation demonstration on windows

Download the Windows version installation package, python 3 6 + 64 bit version (as of February 15, 2022, python 3.9)
Anaconda3-2021.11-Windows-x86_64.exe
Select Just Me(recommended) and other basic default options
The installed Anaconda3 can be seen in the [Start Menu]
Open Jupiter notebook

1-5 installation demonstration of Anaconda on Linux

Download the Linux version installation package, python 3 6 + 64 bit version (as of February 15, 2022, python 3.9)
Copy installation package link

~] wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
~] ls
Anaconda3-2021.11-Linux-x86_64.sh
~] ls -lh
~] sh Anaconda3-2021.11-Linux-x86_64.sh # Select Default Options
~] pwd
/home/centos
~] cd anaconda3
anaconda3] ls
anaconda3] cd bin
anaconda3] ./conda --version
conda 4.3.21
anaconda3] ./jupyter notebook --no-browser # Copy link

Local terminal

~ ssh -N -f -L localhost:888:localhost:8888 gitlab-demo-ci
~ ssh -N -f -L localhost:888:localhost:8888 root@gitlab-demo-ci

The browser opens and the link is copied in

!ifconfig  # Corresponding to ifconfig in linux system

1-6 demonstration of using Jupiter notebook

cd anaconda3
cd jupyter-notebook/python-data-science
python-data-science git:(master) ls
README.md    demo.ipynb
python-data-science git:(master) xx/bin/jupyter notebook # Openable

Chapter 2 Introduction to Numpy

This chapter will introduce Numpy, the most basic library in the field of Python data science, review the basis of matrix operation, introduce the most important data structure Array and how to perform Array and matrix operation through Numpy.

2-1 five commonly used Python libraries in the field of data science

  • Numpy
  • Scipy
  • Pandas
  • Matplotlib
  • Scikit-learn

Numpy

  • N-dimensional array (matrix), fast and efficient, vector attribute operation
  • Efficient Index without loop
  • Open source, free cross platform, and its operation efficiency is comparable to that of C/Matlab

Scipy

  • Rely on Numpy
  • Designed for science and Engineering
  • A variety of commonly used scientific calculations are realized: linear algebra, Fourier transform, signal and image processing

Pandas

  • Structured data analysis tool (relying on Numpy)
  • It provides a variety of advanced data structures: time series, DataFrame and Panel
  • Strong data indexing and processing capabilities

Matplotlib

  • Python is the most widely used suite in 2D drawing
  • It can basically replace the drawing function of Matlab (scatter, curve, column, etc.)
  • You can draw exquisite 3D pictures through mplot3d

Scikit-learn

  • Python module of machine learning
  • Based on Scipy, it provides common machine learning algorithms: clustering and regression
  • Easy to learn API interface

2-2 matrix operation of basic review of Mathematics

Basic concepts

  • Matrix: rectangular data, that is, a two-dimensional array. Vector and scalar are special cases of matrix
  • Vector: refers to the matrix of 1xn or nx1
  • Scalar: 1x1 matrix
  • When the array of N dimensions is extended

Special matrix

  • All 0 and all 1 matrices

  • Identity matrix

Matrix addition and subtraction

  • The two matrices of addition and subtraction must have the same row and column
  • Addition and subtraction of corresponding elements of row and column

Array multiplication (dot multiplication)

  • Array multiplication (dot multiplication) is the multiplication between corresponding elements

Matrix multiplication

Let A be the matrix of mxp, B be the matrix of pxn, and matrix C of mxn be the product of A and B, denoted as C=AB, where the elements in row i and column j of matrix C can be expressed as:

Other knowledge of linear algebra

  • Linear algebra published by Tsinghua University
  • http://bs.szu.edu.cn/sljr/Up/day_110824/201108240409437708.pdf

2-3 creation and access of array

Jupyter notebook creates a new file array ipynb

# Creation and access of arrays
import numpy as np
# create from python list
list_1 = [1, 2, 3, 4]
list_1        #  [1, 2, 3, 4]
array_1 = np.array(list_1)
array_1        # array([1, 2, 3, 4])
list_2 = [5, 6, 7, 8]
array_2 = np.array([list_1,list_2])
array_2
# array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
array_2.shape    # (2, 4)
array_2.size    # 8
array_2.dtype    # dtype('int32 ') look at the computer, or dtype('int64')
array_3 = np.array([[1.0,2,3],[4.0,5,6]])
array_3.dtype        # dtype('float64')
array_4 = np.arange(1,10)
array_4        # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array_4 = np.arange(1, 10, 2)
array_4        # array([1, 3, 5, 7, 9])
np.zeros(5)        # array([0., 0., 0., 0., 0.])    # Zero matrix
np.zeros([2,3])        # Two dimensional zero matrix with two rows and three columns
# array([[0., 0., 0.],
       [0., 0., 0.]])
np.eye(5)    # Identity matrix with n=5
# array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
np.eye(5).dtype        # dtype('float64')
a = np.arange(1,10)
a        # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a[1]    # 2 (take the second element of the array)
a[1:5]    # array([2, 3, 4, 5]) takes the 2nd-5th element of the array
b = np.array([[1,2,3],[4,5,6]])
b
# array([[1, 2, 3],
       [4, 5, 6]])
b[1][0]        # 4
b[1,0]        # 4
c = np.array([[1,2,3],[4,5,6],[7,8,9]])
c
# array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
c[:2,1:]
# array([[2, 3],
       [5, 6]])


2-4 array and matrix operation

Jupyter notebook creates a new file array and matrix operation ipynb

# Quickly create an array
import numpy as np
np.random.randn(10)        # Returns a one-dimensional array of 10 decimal elements
# array([ 0.26674666, -0.91111093,  0.30684449, -0.80206634, -0.89176532,
        0.7950014 , -0.53259808, -0.09981816,  1.2960139 , -0.9668373 ])
np.random.randint(10)    # 0
np.random.randint(10,size=(2,3))    # Generate a 2x3 two-dimensional array with array elements [0,9]
# array([[7, 5, 8],
       [1, 5, 8]])
np.random.randint(10,size=20)        # Generate a one-dimensional array of 20 elements, array elements [0,9]
# array([5, 6, 4, 8, 0, 9, 6, 2, 2, 9, 2, 1, 4, 6, 1, 5, 8, 2, 3, 4])
np.random.randint(10,size=20).reshape(4,5)    # Reshape the one-dimensional array generating 20 elements into a 4x5 two-dimensional array, with array elements [0,9]
# array([[7, 1, 0, 5, 7],
       [8, 0, 3, 7, 9],
       [9, 0, 7, 3, 2],
       [9, 1, 5, 8, 7]])

# Array operation
a = np.random.randint(10,size=20).reshape(4,5)
b = np.random.randint(10,size=20).reshape(4,5)
a
# array([[2, 3, 8, 4, 8],
       [0, 7, 9, 9, 9],
       [1, 8, 1, 8, 6],
       [3, 4, 7, 5, 1]])
b
# array([[8, 4, 3, 1, 6],
       [4, 4, 6, 2, 9],
       [9, 4, 8, 5, 8],
       [6, 2, 5, 5, 8]])
a + b
# array([[10,  7, 11,  5, 14],
       [ 4, 11, 15, 11, 18],
       [10, 12,  9, 13, 14],
       [ 9,  6, 12, 10,  9]])
a - b
# array([[-6, -1,  5,  3,  2],
       [-4,  3,  3,  7,  0],
       [-8,  4, -7,  3, -2],
       [-3,  2,  2,  0, -7]])
a * b
# array([[16, 12, 24,  4, 48],
       [ 0, 28, 54, 18, 81],
       [ 9, 32,  8, 40, 48],
       [18,  8, 35, 25,  8]])
a / b
# An error may be reported to see if there is an element 0 in b
array([[0.25      , 0.75      , 2.66666667, 4.        , 1.33333333],
       [0.        , 1.75      , 1.5       , 4.5       , 1.        ],
       [0.11111111, 2.        , 0.125     , 1.6       , 0.75      ],
       [0.5       , 2.        , 1.4       , 1.        , 0.125     ]])
np.mat([[1,2,3],[4,5,6]])
# matrix([[1, 2, 3],
        [4, 5, 6]])
a
# array([[2, 3, 8, 4, 8],
       [0, 7, 9, 9, 9],
       [1, 8, 1, 8, 6],
       [3, 4, 7, 5, 1]])
np.mat(a)
# 
matrix([[2, 3, 8, 4, 8],
        [0, 7, 9, 9, 9],
        [1, 8, 1, 8, 6],
        [3, 4, 7, 5, 1]])

# Matrix operation
A = np.mat(a)
B = np.mat(b)
A
# matrix([[2, 3, 8, 4, 8],
        [0, 7, 9, 9, 9],
        [1, 8, 1, 8, 6],
        [3, 4, 7, 5, 1]])
B
# matrix([[8, 4, 3, 1, 6],
        [4, 4, 6, 2, 9],
        [9, 4, 8, 5, 8],
        [6, 2, 5, 5, 8]])
A + B
# matrix([[10,  7, 11,  5, 14],
        [ 4, 11, 15, 11, 18],
        [10, 12,  9, 13, 14],
        [ 9,  6, 12, 10,  9]])
A - B
# matrix([[-6, -1,  5,  3,  2],
        [-4,  3,  3,  7,  0],
        [-8,  4, -7,  3, -2],
        [-3,  2,  2,  0, -7]])
A * B    # An error is reported. The number of columns of A is inconsistent with the number of rows of B

a = np.mat(np.random.randint(10,size=20).reshape(4,5))
b = np.mat(np.random.randint(10,size=20).reshape(5,4))
a
# matrix([[9, 9, 3, 0, 5],
        [9, 4, 6, 4, 5],
        [9, 0, 7, 0, 9],
        [7, 2, 6, 0, 6]])
b
# matrix([[2, 2, 6, 4],
        [8, 9, 8, 0],
        [2, 1, 3, 9],
        [3, 1, 0, 2],
        [9, 3, 1, 4]])    
a * b
# matrix([[141, 117, 140,  83],
        [119,  79, 109, 118],
        [113,  52,  84, 135],
        [ 96,  56,  82, 106]])

# Array common functions
a = np.random.randint(10,size=20).reshape(4,5)
np.unique(a)    # De duplication of all elements in a
# array([0, 1, 2, 3, 4, 5, 6, 8, 9])
a
# array([[4, 2, 8, 4, 2],
       [6, 9, 6, 4, 0],
       [9, 2, 6, 9, 0],
       [1, 3, 8, 5, 9]])
sum(a)        # Sum all rows and columns in a
# array([20, 16, 28, 22, 11])
sum(a[0])    # Sum the first line in a
# 20
sum(a[:,0])    # Sum the first column in a
# 20
a.max()        # Maximum value in a
# 9
max(a[0])    # Maximum value of the first row in a
# 8
max(a[:,0])    # Maximum value of the first column in a
# 9

2-5 input and output of array

Jupyter notebook creates a new file, input and output of Array ipynb

# Serialize numpy array using pickle
import pickle
import numpy as np
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f = open('x.pk1','wb')
pickle.dump(x, f)
!ls        # windows system is available! dir
# Array.ipynb# array input and output ipynb
  x.pk1                    Array and matrix operation.ipynb
f = open('x.pk1','rb')
pickle.load(f)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.save('one_array', x)
!ls
# Array.ipynb# array input and output ipynb
  x.pk1                    one_array.npy
  Array and matrix operation.ipynb
np.load('one_array.npy')
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.arange(20)
y
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
np.savez('two_array.npz', a=x, b=y)
!ls
# Array.ipynb                        two_array.npz
  Array of input and output.ipynb        x.pk1
  one_array.npy                        Array and matrix operation.ipynb
np.load('two_array.npz')
# <numpy.lib.npyio.NpzFile at 0x17033c77df0>
c = np.load('two_array.npz')
c['a']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
c['b']
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

scipy document

  • Now? https://docs.scipy.org/doc/scipy/getting_started.html
  • Before: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

Chapter 3 Introduction to Pandas

This chapter will introduce pandas, the most important library for data analysis in the field of Python data science. We will start with the two most important data structures Series and DataFrame in pandas, introduce their creation and basic operation, and understand the relationship between Series and DataFrame through practical operation.

3-1 Pandas Series

Jupyter notebook creates a new file series ipynb

import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1
# 0    1
  1    2
  2    3
  3    4
  dtype: int64
s1.values
# array([1, 2, 3, 4], dtype=int64)
s1.index
# RangeIndex(start=0, stop=4, step=1)
s2 = pd.Series(np.arange(10))
s2            # Some computers dtype: int64
# 0    0
  1    1
  2    2
  3    3
  4    4
  5    5
  6    6
  7    7
  8    8
  9    9
  dtype: int32
s3 = pd.Series({'1':1, '2':2, '3':3})
s3
# 1    1
  2    2
  3    3
  dtype: int64
s3.values
# array([1, 2, 3], dtype=int64)
s3.index
# Index(['1', '2', '3'], dtype='object')
s4 = pd.Series([1,2,3,4],index=['A','B','C','D'])
s4
# A    1
  B    2
  C    3
  D    4
  dtype: int64
s4.values
# array([1, 2, 3, 4], dtype=int64)
s4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
s4['A']
# 1
s4[s4>2]
# C    3
  D    4
  dtype: int64
s4
# A    1
  B    2
  C    3
  D    4
  dtype: int64
s4.to_dict()
# {'A': 1, 'B': 2, 'C': 3, 'D': 4} 
s5 = pd.Series(s4.to_dict())
s5
# A    1
  B    2
  C    3
  D    4
  dtype: int64
index_1 = ['A', 'B', 'C', 'D','E']
s6 = pd.Series(s5,index=index_1)
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  dtype: float64
pd.isnull(s6)
# A    False
  B    False
  C    False
  D    False
  E     True
dtype: bool
pd.notnull(s6)
# A     True
  B     True
  C     True
  D     True
  E    False
  dtype: bool
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  dtype: float64
s6.name = 'demo'
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  Name: demo, dtype: float64
s6.index.name = 'demo index'
s6
# demo index
  A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  Name: demo, dtype: float64
s6.index
# Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
s6.values
# array([ 1.,  2.,  3.,  4., nan])

3-2 Pandas DataFrame

Jupyter notebook creates a new file dataframe ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import webbrowser
link = 'https://www.tiobe.com/tiobe-index/'
webbrowser.open(link)        # Open link in browser
True
df = pd.read_clipboard()    # Copy the first 10 data in the page table, including the header
df
# output
Position    Programming    Language    Ratings
0    21    SAS    0.66%    None
1    22    Scratch    0.64%    None
2    23    Fortran    0.58%    None
3    24    Rust    0.54%    None
4    25    (Visual)    FoxPro    0.52%
5    26    COBOL    0.42%    None
6    27    Dart    0.42%    None
7    28    Kotlin    0.41%    None
8    29    Lua    0.40%    None
9    30    Julia    0.40%    None

type(df)
# pandas.core.frame.DataFrame
df.columns
# Index(['Position', 'Programming', 'Language', 'Ratings'], dtype='object')
df.Ratings
#
0     None
1     None
2     None
3     None
4    0.52%
5     None
6     None
7     None
8     None
9     None
Name: Ratings, dtype: object

df_new = DataFrame(df,columns=['Programming','Language'])
df_new
# output
Programming    Language
0    SAS    0.66%
1    Scratch    0.64%
2    Fortran    0.58%
3    Rust    0.54%
4    (Visual)    FoxPro
5    COBOL    0.42%
6    Dart    0.42%
7    Kotlin    0.41%
8    Lua    0.40%
9    Julia    0.40%

df['Position']
#
0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: Position, dtype: int64

type(df['Position'])
pandas.core.series.Series
df_new = DataFrame(df,columns=['Programming','Language','Language1'])
df_new
# output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    NaN
2    Fortran    0.58%    NaN
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN

# Three ways of filling
df_new['Language1'] = range(0,10)
# df_new['Language1'] = np.arange(0,10)
# df_new['Language1'] = pd.Series(np.arange(0,10))
df_new
# output
Programming    Language    Language1
0    SAS    0.66%    0
1    Scratch    0.64%    1
2    Fortran    0.58%    2
3    Rust    0.54%    3
4    (Visual)    FoxPro    4
5    COBOL    0.42%    5
6    Dart    0.42%    6
7    Kotlin    0.41%    7
8    Lua    0.40%    8
9    Julia    0.40%    9

df_new['Language1'] = pd.Series([100,200], index=[1,2])
df_new
# output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    100.0
2    Fortran    0.58%    200.0
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN


3-3 in depth understanding of Series and Dataframe

Jupyter notebook creates a new file and deeply understands Series and dataframe ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = {'Country':['Belgium', 'India', 'Brazil'],
       'Capital':['Brussels','New Delhi', 'Brasilia'],
       'Population':[11190846, 1303171035, 207847528]}

#Series
s1 = pd.Series(data['Country'])
s1
# output
0    Belgium
1      India
2     Brazil
dtype: object

s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# RangeIndex(start=0, stop=3, step=1)
s1 = pd.Series(data['Country'],index=['A','B','C'])
# output
A    Belgium
B      India
C     Brazil
dtype: object

s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# Index(['A', 'B', 'C'], dtype='object')

#Dataframe
df1 = pd.DataFrame(data)
df1
# output
    Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df1['Country']
# output
0    Belgium
1      India
2     Brazil
Name: Country, dtype: object

cou = df1['Country']
type(cou)
# pandas.core.series.Series
df1.iterrows()
# <generator object DataFrame.iterrows at 0x0000018DD44C59E0>

for row in df1.iterrows():
    print(row),print(type(row)),print(len(row))
# output
(0, Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object)
<class 'tuple'>
2
(1, Country            India
Capital        New Delhi
Population    1303171035
Name: 1, dtype: object)
<class 'tuple'>
2
(2, Country          Brazil
Capital        Brasilia
Population    207847528
Name: 2, dtype: object)
<class 'tuple'>
2

for row in df1.iterrows():
    print(type(row[0]),row[0],row[1])
    break
# output
<class 'int'> 0 Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object

# <class 'int'>  ??
<class 'numpy.int64'> 


for row in df1.iterrows():
    print(type(row[0]),type(row[1]))
    break
# output
<class 'int'> <class 'pandas.core.series.Series'>

# <class 'int'>  ??
<class 'numpy.int64'> 


df1
# output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

data
# output
{'Country': ['Belgium', 'India', 'Brazil'],
 'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
 'Population': [11190846, 1303171035, 207847528]}


s1 = pd.Series(data['Country'])
s2 = pd.Series(data['Capital'])
s3 = pd.Series(data['Population'])
df_new = pd.DataFrame([s1,s2,s3])
df_new
# output
    0    1    2
0    Belgium    India    Brazil
1    Brussels    New Delhi    Brasilia
2    11190846    1303171035    207847528

df1
# output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df_new = df_new.T
df_new
# output
    0    1    2
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df_new = pd.DataFrame([s1,s2,s3], index=['Country','Capital','Population'])
df_new
# output
        0    1    2
Country    Belgium    India    Brazil
Capital    Brussels    New Delhi    Brasilia
Population    11190846    1303171035    207847528

df_new = df_new.T
df_new
# output
    Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

 


Datapandas-3 operation

Jupiter notebook creates a new file dataframe io ipynb

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

import webbrowser

link = 'http://pandas.pydata.org/pandas-docs/version/0.20/io.html'
webbrowser.open(link)    # Open the browser and return True; Copy page table content
# True

df1 = pd.read_clipboard()
df1
# output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_clipboard()
df1
# output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_csv('df1.csv')
!ls   # windows system is available! dir
# DataFrame IO.ipynb    df1.csv

!more df1.csv
# output
,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Msgpack,read_msgpack,to_msgpack
8,binary,Stata,read_stata,to_stata
9,binary,SAS,read_sas, 
10,binary,Python Pickle Format,read_pickle,to_pickle
11,SQL,SQL,read_sql,to_sql
12,SQL,Google Big Query,read_gbq,to_gbq

df1.to_csv('df1.csv',index=False)    # Remove index
!ls
# DataFrame IO.ipynb    df1.csv

!more df1.csv
# output
Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata
binary,SAS,read_sas, 
binary,Python Pickle Format,read_pickle,to_pickle
SQL,SQL,read_sql,to_sql
SQL,Google Big Query,read_gbq,to_gbq

df2 = pd.read_csv('df1.csv')
df2
# output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_json()
# output
'{"Format Type":{"0":"text","1":"text","2":"text","3":"text","4":"binary","5":"binary","6":"binary","7":"binary","8":"binary","9":"binary","10":"binary","11":"SQL","12":"SQL"},"Data Description":{"0":"CSV","1":"JSON","2":"HTML","3":"Local clipboard","4":"MS Excel","5":"HDF5 Format","6":"Feather Format","7":"Msgpack","8":"Stata","9":"SAS","10":"Python Pickle Format","11":"SQL","12":"Google Big Query"},"Reader":{"0":"read_csv","1":"read_json","2":"read_html","3":"read_clipboard","4":"read_excel","5":"read_hdf","6":"read_feather","7":"read_msgpack","8":"read_stata","9":"read_sas","10":"read_pickle","11":"read_sql","12":"read_gbq"},"Writer":{"0":"to_csv","1":"to_json","2":"to_html","3":"to_clipboard","4":"to_excel","5":"to_hdf","6":"to_feather","7":"to_msgpack","8":"to_stata","9":" ","10":"to_pickle","11":"to_sql","12":"to_gbq"}}'

pd.read_json(df1.to_json())
# output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq


df1.to_html()
# output
'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Format Type</th>\n      <th>Data Description</th>\n      <th>Reader</th>\n      <th>Writer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>text</td>\n      <td>CSV</td>\n      <td>read_csv</td>\n      <td>to_csv</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>text</td>\n      <td>JSON</td>\n      <td>read_json</td>\n      <td>to_json</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>text</td>\n      <td>HTML</td>\n      <td>read_html</td>\n      <td>to_html</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>text</td>\n      <td>Local clipboard</td>\n      <td>read_clipboard</td>\n      <td>to_clipboard</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>binary</td>\n      <td>MS Excel</td>\n      <td>read_excel</td>\n      <td>to_excel</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>binary</td>\n      <td>HDF5 Format</td>\n      <td>read_hdf</td>\n      <td>to_hdf</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>binary</td>\n      <td>Feather Format</td>\n      <td>read_feather</td>\n      <td>to_feather</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>binary</td>\n      <td>Msgpack</td>\n      <td>read_msgpack</td>\n      <td>to_msgpack</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>binary</td>\n      <td>Stata</td>\n      <td>read_stata</td>\n      <td>to_stata</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>binary</td>\n      <td>SAS</td>\n      <td>read_sas</td>\n      <td></td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>binary</td>\n      <td>Python Pickle Format</td>\n      <td>read_pickle</td>\n      <td>to_pickle</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>SQL</td>\n      <td>SQL</td>\n      <td>read_sql</td>\n      <td>to_sql</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>SQL</td>\n      <td>Google Big Query</td>\n      <td>read_gbq</td>\n      <td>to_gbq</td>\n    </tr>\n  </tbody>\n</table>'

df1.to_html('df1.html')
!ls
# DataFrame IO.ipynb    df1.csv        df1.html
df1.to_excel('df1.xlsx')


3-5 Selecting and indexing of dataframe

Jupiter notebook creating a new file selecting and indexing ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

!pwd    # pwd corresponds to windows system chdir
# /Users/xxx/xx

!ls /Users/xxx/xx/homework    # ls corresponds to windows system dir pwd
# movie_metadata.csv

imdb = pd.read_csv('/Users/xxx/xx/homework/movie_metadata.csv')
imdb
# output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
5043 rows × 28 columns

imdb.shape
# (5043, 28)

imdb.head()
# output
    color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
5 rows × 28 columns

imdb.tail(10)
# output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
5033    Color    Shane Carruth    143.0    77.0    291.0    8.0    David Sullivan    291.0    424760.0    Drama|Sci-Fi|Thriller    ...    371.0    English    USA    PG-13    7000.0    2004.0    45.0    7.0    1.85    19000
5034    Color    Neill Dela Llana    35.0    80.0    0.0    0.0    Edgar Tancangco    0.0    70071.0    Thriller    ...    35.0    English    Philippines    Not Rated    7000.0    2005.0    0.0    6.3    NaN    74
5035    Color    Robert Rodriguez    56.0    81.0    0.0    6.0    Peter Marquardt    121.0    2040920.0    Action|Crime|Drama|Romance|Thriller    ...    130.0    Spanish    USA    R    7000.0    1992.0    20.0    6.9    1.37    0
5036    Color    Anthony Vallone    NaN    84.0    2.0    2.0    John Considine    45.0    NaN    Crime|Drama    ...    1.0    English    USA    PG-13    3250.0    2005.0    44.0    7.8    NaN    4
5037    Color    Edward Burns    14.0    95.0    0.0    133.0    Caitlin FitzGerald    296.0    4584.0    Comedy|Drama    ...    14.0    English    USA    Not Rated    9000.0    2011.0    205.0    6.4    NaN    413
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
10 rows × 28 columns

imdb['color']
# output
0       Color
1       Color
2       Color
3       Color
4         NaN
        ...  
5038    Color
5039    Color
5040    Color
5041    Color
5042    Color
Name: color, Length: 5043, dtype: object

imdb['color'][0]
# 'Color'
imdb['color'][1]
# 'Color'

imdb[['color','director_name']]
# output
    color    director_name
0    Color    James Cameron
1    Color    Gore Verbinski
2    Color    Sam Mendes
3    Color    Christopher Nolan
4    NaN    Doug Walker
...    ...    ...
5038    Color    Scott Smith
5039    Color    NaN
5040    Color    Benjamin Roberds
5041    Color    Daniel Hsia
5042    Color    Jon Gunn
5043 rows × 2 columns

sub_df = imdb[['director_name','movie_title','imdb_score']]
sub_df
# output
director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1
...    ...    ...    ...
5038    Scott Smith    Signed Sealed Delivered    7.7
5039    NaN    The Following    7.5
5040    Benjamin Roberds    A Plague So Pleasant    6.3
5041    Daniel Hsia    Shanghai Calling    6.3
5042    Jon Gunn    My Date with Drew    6.6
5043 rows × 3 columns

sub_df.head()
# output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.head(5)
# output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.iloc[10:20,:]
# output
    director_name    movie_title    imdb_score
10    Zack Snyder    Batman v Superman: Dawn of Justice    6.9
11    Bryan Singer    Superman Returns    6.1
12    Marc Forster    Quantum of Solace    6.7
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest    7.3
14    Gore Verbinski    The Lone Ranger    6.5
15    Zack Snyder    Man of Steel    7.2
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian    6.6
17    Joss Whedon    The Avengers    8.1
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides    6.7
19    Barry Sonnenfeld    Men in Black 3    6.8

sub_df.iloc[10:20,0:2]
# output
director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df = sub_df.iloc[10:20,0:2]
tmp_df
# output
director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df.iloc[2:4,:]
# output
    director_name    movie_title
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest

tmp_df.loc[15:17,:]
# output
    director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'movie_title']
# output
    director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'director_name']
# output
    director_name
15    Zack Snyder
16    Andrew Adamson
17    Joss Whedon

3-6 Reindexing of series and Dataframe

Jupiter notebook creates a new file reindexing series and dataframe ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# series reindex
s1 = Series([1,2,3,4], index=['A','B','C','D'])
s1
# output
A    1
B    2
C    3
D    4
dtype: int64

# s1.reindex()    # Move the cursor over the method, press shift+tab to pop up the document, and press continuously to select the document detail level
s1.reindex(index=['A','B','C','D','E'])
# output
A    1.0
B    2.0
C    3.0
D    4.0
E    NaN
dtype: float64

s1.reindex(index=['A','B','C','D','E'],fill_value=0)
# output
A    1
B    2
C    3
D    4
E    0
dtype: int64

s1.reindex(index=['A','B','C','D','E'],fill_value=10)
# output
A     1
B     2
C     3
D     4
E    10
dtype: int64

s2 = Series(['A','B','C'], index=[1,5,10])
s2
# output
1     A
5     B
10    C
dtype: object

s2.reindex(index=range(15))
# output
0     NaN
1       A
2     NaN
3     NaN
4     NaN
5       B
6     NaN
7     NaN
8     NaN
9     NaN
10      C
11    NaN
12    NaN
13    NaN
14    NaN
dtype: object

s2.reindex(index=range(15),method='ffill')
# output
0     NaN
1       A
2       A
3       A
4       A
5       B
6       B
7       B
8       B
9       B
10      C
11      C
12      C
13      C
14      C
dtype: object

# reindex dataframe
df1 = DataFrame(np.random.rand(25).reshape([5,5]))
df1
# output
    0    1    2    3    4
0    0.255424    0.315708    0.951327    0.423676    0.975377
1    0.087594    0.192460    0.502268    0.534926    0.423024
2    0.817002    0.113410    0.468270    0.410297    0.278942
3    0.315239    0.018933    0.133764    0.240001    0.910754
4    0.267342    0.451077    0.282865    0.170235    0.898429


df1 = DataFrame(np.random.rand(25).reshape([5,5]), index=['A','B','D','E','F'], columns=['c1','c2','c3','c4','c5'])
df1
# output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B','C','D','E','F'])
# output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
C    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(columns=['c1','c2','c3','c4','c5','c6'])
# output
    c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN

df1.reindex(index=['A','B','C','D','E','F'],columns=['c1','c2','c3','c4','c5','c6'])
# output
    c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
C    NaN    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN


s1
# output
A    1
B    2
C    3
D    4
dtype: int64

s1.reindex(index=['A','B'])
# output
A    1
B    2
dtype: int64


df1
# output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B'])
# output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878

s1
# output
A    1
B    2
C    3
D    4
dtype: int64

s1.drop('A')
# output
B    2
C    3
D    4
dtype: int64

df1
# output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('A',axis=0)
# output
    c1    c2    c3    c4    c5
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('c1',axis=0)
# An error is reported. There is no such field in the line

df1.drop('c1',axis=1)
# output
    c2    c3    c4    c5
A    0.894546    0.932129    0.178442    0.303684
B    0.260677    0.708358    0.275914    0.369878
D    0.125907    0.191987    0.338194    0.009877
E    0.909661    0.227301    0.343989    0.610203
F    0.306472    0.197467    0.063800    0.813786

3-7 talk about NaN

Jupyter notebook create a new file talk about Nan ipynb

# NaN - means Not a Number
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

n = np.nan
type(n)
# float

m = 1
m + n
# nan


# Nan in Series
s1 = Series([1, 2, np.nan, 3, 4], index=['A','B','C','D','E'])
s1
# output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.isnull()
# output
A    False
B    False
C     True
D    False
E    False
dtype: bool

s1.notnull()
# output
A     True
B     True
C    False
D     True
E     True
dtype: bool

s1
# output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.dropna()
# output
A    1.0
B    2.0
D    3.0
E    4.0
dtype: float64

# Nan in DataFrame
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])
dframe
# output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0
3    NaN    NaN    NaN

dframe.isnull()
# output
    0    1    2
0    False    False    False
1    True    False    False
2    False    True    False
3    True    True    True

dframe.notnull()
# output
    0    1    2
0    True    True    True
1    False    True    True
2    True    False    True
3    False    False    False

df1 = dframe.dropna(axis=0)
df1
# output
    0    1    2
0    1.0    2.0    3.0


df1 = dframe.dropna(axis=1)
df1
# output
0
1
2
3

df1 = dframe.dropna(axis=1,how='any')
df1
# output
0
1
2
3

# output
df1 = dframe.dropna(axis=0,how='any')
df1
# output
    0    1    2
0    1.0    2.0    3.0

df1 = dframe.dropna(axis=0,how='all')
df1
# output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])
dframe2
# output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

df2 = dframe2.dropna(thresh=None)
df2
# output
0    1    2    3

df2 = dframe2.dropna(thresh=2)
df2
# output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

dframe2
# output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

dframe2.fillna(value=1)
# output
    0    1    2    3
0    1.0    2.0    3.0    1.0
1    2.0    1.0    5.0    6.0
2    1.0    7.0    1.0    9.0
3    1.0    1.0    1.0    1.0

dframe2.fillna(value={0:0,1:1,2:2,3:3})    # Column fill
# output
    0    1    2    3
0    1.0    2.0    3.0    3.0
1    2.0    1.0    5.0    6.0
2    0.0    7.0    2.0    9.0
3    1.0    1.0    2.0    3.0

df1
# output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

df2
# output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

df1.dropna()
# output
    0    1    2
0    1.0    2.0    3.0

df1.fillna(1)
# output
    0    1    2
0    1.0    2.0    3.0
1    1.0    5.0    6.0
2    7.0    1.0    9.0

Topics: Python Data Analysis