Statistics in Pandas and Numpy

Posted by ceruleansin on Fri, 18 Feb 2022 12:02:57 +0100

Numerical descriptive statistics

Arithmetic mean

S = [s_1, s_2, ..., s_n]

Each value in the sample is the sum of the true value and a random error.

mean = \frac{(s_1 + s_2 + ... + s_n) }{n}

The arithmetic mean represents an unbiased estimate of the true value.

m = np.mean(array)       # NumPy function
m = array.mean()         # ndarray method
m = df.mean(axis=0)      # pandas: column-wise mean

Case: mean analysis of film scoring data:

mean = ratings['John Carson'].mean()      # mean of one critic's column
mean = np.mean(ratings['John Carson'])    # equivalent NumPy call
means = ratings.mean(axis=1)              # row-wise mean: each film's average
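The ratings table used in this case is not defined in the excerpt above; a minimal, self-contained sketch with made-up scores (film titles as the row index, critics as columns — both hypothetical) shows the same three calls end to end:

```python
import numpy as np
import pandas as pd

# Hypothetical scoring data: one row per film, one column per critic
ratings = pd.DataFrame(
    {'John Carson': [8.5, 7.0, 9.0],
     'Mary Lane':   [9.0, 6.5, 8.0]},
    index=['Inception', 'Avatar', 'Tenet'])

col_mean = ratings['John Carson'].mean()    # one critic's average score
same = np.mean(ratings['John Carson'])      # equivalent NumPy call
row_means = ratings.mean(axis=1)            # each film's average score
print(col_mean, row_means['Inception'])
```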

Weighted average

When computing an average, samples may differ in importance; assigning each sample a weight reflects this.

Sample: S = [s_1, s_2, s_3, ..., s_n]

Weight: w = [w_1, w_2, w_3, ..., w_n]

weighted average:

a = \frac{s_1w_1 + s_2w_2 + ... + s_nw_n}{w_1+w_2+...+w_n}

Code implementation:

a = np.average(array, weights=w)  # w: array of weights, same length as array

Case: customize the weight and calculate the weighted average.

# Weighted mean: give the first critic's score triple weight
w = np.array([3, 1, 1, 1, 1, 1, 1])
np.average(ratings.loc['Inception'], weights=w)

# If the row contains NaN, mask out the missing scores and their weights
mask = ~pd.isna(ratings.loc['Inception'])
np.average(ratings.loc['Inception'][mask], weights=w[mask])
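Since the ratings frame is not defined in this excerpt, here is a self-contained sketch of the same NaN-masking trick on a made-up row of scores (the names, weights, and numbers are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical critic scores for one film, with one missing value
scores = pd.Series([9.0, np.nan, 8.0, 7.5],
                   index=['A', 'B', 'C', 'D'])
w = np.array([3, 1, 1, 1])          # critic A counts triple

mask = ~pd.isna(scores)             # keep only the non-missing scores
a = np.average(scores[mask], weights=w[mask])
print(a)                            # (3*9.0 + 8.0 + 7.5) / (3 + 1 + 1)
```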

Maximum and minimum values

np.max() / np.min() / np.ptp(): return the maximum / minimum / range (maximum minus minimum) of an array

import numpy as np
# Generate 9 random numbers between [10, 100)
a = np.random.randint(10, 100, 9)
print(np.max(a), np.min(a), np.ptp(a))

np.argmax() / np.argmin() and pd.Series.idxmax() / pd.Series.idxmin(): return the index of the largest / smallest element in an array

# In NumPy, argmax/argmin return the positional index of the extreme value
print(np.argmax(a), np.argmin(a))

# In pandas, idxmax/idxmin return the index label of the extreme value
print(series.idxmax(), series.idxmin())        # Series: a single label
print(dataframe.idxmax(), dataframe.idxmin())  # DataFrame: one label per column
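The series and dataframe above are placeholders; a small runnable illustration of the positional-index vs. label distinction:

```python
import numpy as np
import pandas as pd

a = np.array([42, 7, 99, 15])
print(np.argmax(a), np.argmin(a))   # positional indices: 2 1

s = pd.Series([42, 7, 99, 15], index=['w', 'x', 'y', 'z'])
print(s.idxmax(), s.idxmin())       # index labels: y x
```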


Median

Sort the samples by size and take the element in the middle.

If the number of samples is odd, the median is the middle element

[1, 2000, 3000, 4000, 10000000]

Here the median is 3000; unlike the mean, it is not dragged upward by the outlier 10000000.

If the number of samples is even, the median is the average of the two elements in the middle
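Both rules can be checked directly against np.median():

```python
import numpy as np

odd = [1, 2000, 3000, 4000, 10000000]
print(np.median(odd))      # middle element after sorting: 3000.0

even = [1, 2000, 3000, 4000]
print(np.median(even))     # mean of the two middle elements: 2500.0
```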


Case: implement the median by hand, then verify the result against NumPy's median API

np.median(): returns the median of an array

import numpy as np
# Load the closing-price column (column 6) from the CSV file
closing_prices = np.loadtxt('../../data/aapl.csv',
                            delimiter=',', usecols=(6,), unpack=True)
# Manual median: sort, then average the two middle elements
# (for an odd size, both indices point at the same element)
size = closing_prices.size
sorted_prices = np.sort(closing_prices)   # np.msort is deprecated
median = (sorted_prices[int((size - 1) / 2)] +
          sorted_prices[int(size / 2)]) / 2
# Same result via the API
median = np.median(closing_prices)

Standard deviation

Standard deviation measures how widely a set of data fluctuates around its mean, and therefore how stable it is.

sample: S = [s_1, s_2, s_3, ..., s_n]

Average: m = \frac{s_1 + s_2 + s_3 + ... + s_n}{n}

Deviation: the degree to which each value departs from the center point. D = [d_1, d_2, d_3, ..., d_n], where d_i = s_i - m (subtract the mean from each sample). The larger the absolute deviation, the farther the value lies from the center and the larger the fluctuation. Deviations can be positive or negative, so they are squared to make every term positive.

Squared deviation: Q = [q_1, q_2, q_3, ..., q_n], where q_i = d_i^2

Population variance: v = \frac{q_1+q_2+q_3 + ... + q_n}{n} , the mean of the squared deviations. The larger the variance, the stronger the fluctuation; the smaller the variance, the more stable the data.

standard deviation: s = \sqrt{v}

Sample variance: v' = \frac{q_1+q_2+q_3 + ... + q_n}{n-1} . Dividing by n-1 instead of n is known as Bessel's correction. A sample's values cluster around the sample mean rather than the true population mean, so the variance computed with n tends to underestimate the population variance; replacing n with n-1 compensates for that bias and yields an unbiased estimate.

Sample standard deviation: s' = \sqrt{v'}
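NumPy exposes both formulas through the ddof (delta degrees of freedom) parameter: ddof=0 (the default) divides by n, and ddof=1 applies Bessel's correction. A small check against the definitions above:

```python
import numpy as np

s = np.array([4.0, 8.0, 6.0, 2.0, 10.0])
m = s.mean()
q = (s - m) ** 2                           # squared deviations

pop_std = np.sqrt(q.sum() / s.size)        # population formula: divide by n
smp_std = np.sqrt(q.sum() / (s.size - 1))  # sample formula: divide by n-1

print(np.std(s))           # matches pop_std (ddof=0 is the default)
print(np.std(s, ddof=1))   # matches smp_std
```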

Case: compute the standard deviation of the scoring data:
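The scoring data itself is not included in this excerpt; a minimal sketch on made-up ratings (names and numbers are placeholders), noting that pandas' std() defaults to the sample formula (ddof=1) while np.std() defaults to the population formula (ddof=0):

```python
import numpy as np
import pandas as pd

# Hypothetical scoring data: rows are films, columns are critics
ratings = pd.DataFrame(
    {'John Carson': [8.5, 7.0, 9.0, 6.0],
     'Mary Lane':   [9.0, 6.5, 8.0, 7.5]},
    index=['Inception', 'Avatar', 'Tenet', 'Dune'])

print(ratings.std())                       # per-critic sample std (ddof=1)
print(np.std(ratings['John Carson'].values))   # population std (ddof=0)
print(ratings['John Carson'].std(ddof=0))      # same population value
```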