Python data analysis and mining learning notes (1) basic usage of numpy and pandas modules

Posted by dustbuster on Sun, 15 Dec 2019 17:54:32 +0100

I. Basic usage of numpy module:

The numpy module processes data efficiently and provides array support. Many other modules depend on it, such as pandas, scipy, and matplotlib, so it is the foundation for the rest of these notes.

(1) import:

import numpy

(2) create one-dimensional and two-dimensional arrays:

#Create a one-dimensional array
x=numpy.array(["1","3","r","u","a"])
#Create a 2D array
y=numpy.array([[1,2],[22,2],[11,8]])

Results:

>>> x
array(['1', '3', 'r', 'u', 'a'], dtype='<U1')
>>> y
array([[ 1,  2],
       [22,  2],
       [11,  8]])
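
As a small addition to the notes, the sketch below inspects the two arrays just created. It only uses the standard ndim, shape and dtype attributes; the exact integer width reported for y depends on the platform.

```python
import numpy

x = numpy.array(["1", "3", "r", "u", "a"])
y = numpy.array([[1, 2], [22, 2], [11, 8]])

# ndim is the number of dimensions, shape is the size along each dimension
print(x.ndim, x.shape)
print(y.ndim, y.shape)

# dtype is the element type: unicode strings for x, integers for y
print(x.dtype, y.dtype)
```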

(3) extract specific values from an array:

#Create a one-dimensional array
x=numpy.array(["1","3","r","u","a"])
#Create a 2D array
y=numpy.array([[1,2],[22,2],[11,8]])

#Output the first element of a one-dimensional array
print(x[0])
#Output the first element of the third row (index 2) of the 2D array
print(y[2][0])

Results:

1
11
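
The same lookups can also be written with a single pair of brackets. The following is a minimal sketch, assuming the same y as above; the variable names are only illustrative.

```python
import numpy

y = numpy.array([[1, 2], [22, 2], [11, 8]])

same = y[2, 0]        # equivalent to y[2][0]
third_row = y[2]      # the whole third row
first_col = y[:, 0]   # the first element of every row
last = y[-1, -1]      # negative indices count from the end

print(same, third_row, first_col, last)
```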

(4) maximum and minimum value of array:

#Take maximum and minimum
y1=y.max()#Maximum of all elements
y2=y.min()#Minimum of all elements

Results:

>>> y1
22
>>> y2
1
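
max() and min() also accept an axis argument when the extreme value of each column or each row is wanted instead of the overall one. A short sketch, again assuming the unsorted y from above:

```python
import numpy

y = numpy.array([[1, 2], [22, 2], [11, 8]])

col_max = y.max(axis=0)   # maximum of each column
row_min = y.min(axis=1)   # minimum of each row
flat_pos = y.argmax()     # position of the maximum in the flattened array

print(col_max, row_min, flat_pos)
```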

(5) array element sorting:

#Create a one-dimensional array
x=numpy.array(["1","3","r","u","a"])
#Create a 2D array
y=numpy.array([[1,2],[22,2],[11,8]])
#sort
x.sort()
y.sort()#Sort each row in place (along the last axis)

Results:

>>> x
array(['1', '3', 'a', 'r', 'u'], dtype='<U1')
>>> y
array([[ 1,  2],
       [ 2, 22],
       [ 8, 11]])
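
Note that the sort() method changes the array in place. If the original order must be kept, numpy.sort() returns a sorted copy instead. The sketch below illustrates this, along with sorting by column and reversing the order; it is an illustration, not part of the original notes.

```python
import numpy

x = numpy.array(["1", "3", "r", "u", "a"])
y = numpy.array([[1, 2], [22, 2], [11, 8]])

# numpy.sort returns a new array and leaves the original untouched
by_column = numpy.sort(y, axis=0)   # sort each column instead of each row
descending = numpy.sort(x)[::-1]    # sort ascending, then reverse

print(by_column)
print(descending)
print(y)  # still in its original order
```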

(6) slicing: extract a range of elements by index

#Create a one-dimensional array
x=numpy.array(["1","3","r","u","a"])

#Slice: extract a range of elements by index
#Format: array[start:stop], where stop is the last wanted index + 1 (stop itself is excluded)
x[1:3]#"3","r"
x[:3]#"1","3","r"
x[1:]#"3","r","u","a"
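
Slices can also take a step and negative indices, and they work on each axis of a 2D array. A brief sketch, assuming the unsorted x and y from the earlier examples:

```python
import numpy

x = numpy.array(["1", "3", "r", "u", "a"])
y = numpy.array([[1, 2], [22, 2], [11, 8]])

every_other = x[::2]     # start:stop:step, here every second element
last_two = x[-2:]        # the last two elements
first_rows = y[0:2, :]   # rows 0 and 1, all columns
second_col = y[:, 1]     # all rows, column 1

print(every_other, last_two)
print(first_rows, second_col)
```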

II. Basic usage of pandas module:

The pandas module is mainly used for data exploration and data analysis.

(1) import

import pandas as pda
#From here on, pda is an alias for pandas in the code (the more common alias is pd)

(2) create data:

Series: a one-dimensional sequence of values, each paired with an index label.
DataFrame: a two-dimensional table of rows and columns. The columns argument sets the column headers.

1) create from lists:

#Create data from lists
a=pda.Series([8,9,2,1])
b=pda.Series([8,9,2,1],index=["a","b","c","d"])

c=pda.DataFrame([[5,8,9,6],[3,5,7,9],[33,54,58,10],[2,12,55,78]])
d=pda.DataFrame([[5,8,9,6],[3,5,7,9],[33,54,58,10],[2,12,55,78]],columns=["one","two","three","four"])

Results:

>>> a
0    8
1    9
2    2
3    1
dtype: int64
>>> b
a    8
b    9
c    2
d    1
dtype: int64
>>> c
    0   1   2   3
0   5   8   9   6
1   3   5   7   9
2  33  54  58  10
3   2  12  55  78
>>> d
   one  two  three  four
0    5    8      9     6
1    3    5      7     9
2   33   54     58    10
3    2   12     55    78
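
Once created, values can be read back by their labels. The sketch below assumes the b and d defined above and uses ordinary label lookups and loc; the variable names are only illustrative.

```python
import pandas as pda

b = pda.Series([8, 9, 2, 1], index=["a", "b", "c", "d"])
d = pda.DataFrame(
    [[5, 8, 9, 6], [3, 5, 7, 9], [33, 54, 58, 10], [2, 12, 55, 78]],
    columns=["one", "two", "three", "four"],
)

value_c = b["c"]             # look up a Series value by its index label
column_two = d["two"]        # a single column comes back as a Series
cell = d.loc[2, "three"]     # row label 2, column "three"

print(value_c, cell)
print(column_two)
print(d.shape)               # (rows, columns)
```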

2) create from a dictionary:

#Create a DataFrame from a dictionary
e=pda.DataFrame({
    "one":4,
    "two":[3,2,1],
    "three":list(str(982)),
})

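A scalar value (here 4) is repeated to fill every row, while the lists must all have the same length. The result is as follows (older pandas versions sort the columns alphabetically, as shown here; newer versions keep the dictionary's insertion order):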

>>> e
   one three  two
0    4     9    3
1    4     8    2
2    4     2    1
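
To make the filling rule concrete, the sketch below rebuilds e and then tries a dictionary whose lists have different lengths. It assumes only that pandas raises a ValueError in that case; the exact error message varies between versions.

```python
import pandas as pda

e = pda.DataFrame({
    "one": 4,                  # scalar: repeated for every row
    "two": [3, 2, 1],
    "three": list(str(982)),   # the strings "9", "8", "2", not numbers
})
print(e)

# Lists of different lengths are not filled automatically
raised = False
try:
    pda.DataFrame({"two": [3, 2, 1], "three": [9, 8]})
except ValueError as err:
    raised = True
    print("ValueError:", err)
```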

(3) data acquisition:

f=d.head()#First rows, five by default (d has only four, so all are returned)
g=d.head(1)#First row only

h=d.tail()#Last rows, five by default (again all four here)
i=d.tail(2)#Last two rows

Results:

>>> f
   one  two  three  four
0    5    8      9     6
1    3    5      7     9
2   33   54     58    10
3    2   12     55    78
>>> g
   one  two  three  four
0    5    8      9     6
>>> h
   one  two  three  four
0    5    8      9     6
1    3    5      7     9
2   33   54     58    10
3    2   12     55    78
>>> i
   one  two  three  four
2   33   54     58    10
3    2   12     55    78
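
head() and tail() only reach the two ends of the table. For rows in the middle, positional slicing with iloc is one option; the following is a minimal sketch using the same d.

```python
import pandas as pda

d = pda.DataFrame(
    [[5, 8, 9, 6], [3, 5, 7, 9], [33, 54, 58, 10], [2, 12, 55, 78]],
    columns=["one", "two", "three", "four"],
)

top = d.head(2)        # first two rows
bottom = d.tail(1)     # last row
middle = d.iloc[1:3]   # rows at positions 1 and 2 (stop is excluded)

print(middle)
```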

(4) data statistics:

d.describe()

Results:

>>> d.describe()
             one        two     three       four
count   4.000000   4.000000   4.00000   4.000000
mean   10.750000  19.750000  32.25000  25.750000
std    14.885675  23.012678  28.04015  34.874776
min     2.000000   5.000000   7.00000   6.000000
25%     2.750000   7.250000   8.50000   8.250000
50%     4.000000  10.000000  32.00000   9.500000
75%    12.000000  22.500000  55.75000  27.000000
max    33.000000  54.000000  58.00000  78.000000

From top to bottom, the rows are: number of elements (count), mean, standard deviation, minimum, 25% quantile, 50% quantile (median), 75% quantile, and maximum.
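
Each row of describe() can also be computed on its own. The sketch below reproduces the figures for column "one"; note that std() in pandas is the sample standard deviation (it divides by n - 1).

```python
import pandas as pda

d = pda.DataFrame(
    [[5, 8, 9, 6], [3, 5, 7, 9], [33, 54, 58, 10], [2, 12, 55, 78]],
    columns=["one", "two", "three", "four"],
)

one = d["one"]
count = one.count()
mean = one.mean()
std = one.std()            # sample standard deviation
q25 = one.quantile(0.25)
median = one.median()      # same as the 50% quantile

print(count, mean, std, q25, median)
```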

(5) transpose (swap rows and columns)

d=pda.DataFrame([[5,8,9,6],[3,5,7,9],[33,54,58,10],[2,12,55,78]],columns=["one","two","three","four"])

j=d.T

Results:

>>> d.T
       0  1   2   3
one    5  3  33   2
two    8  5  54  12
three  9  7  58  55
four   6  9  10  78
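
Because d is square, the change in shape is easy to miss. The sketch below transposes only the first two rows so that the swap of index and columns is visible; small is a name introduced here for illustration.

```python
import pandas as pda

d = pda.DataFrame(
    [[5, 8, 9, 6], [3, 5, 7, 9], [33, 54, 58, 10], [2, 12, 55, 78]],
    columns=["one", "two", "three", "four"],
)

small = d.head(2)      # 2 rows, 4 columns
flipped = small.T      # 4 rows, 2 columns

print(small.shape, flipped.shape)
print(list(flipped.index))    # the old column names become the index
print(flipped.loc["two", 1])  # same cell as small.loc[1, "two"]
```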
