Numerical descriptive statistics
Arithmetic mean
S = [s_1, s_2, ..., s_n]
Each value in the sample is the sum of the true value and a measurement error.
mean = \frac{(s_1 + s_2 + ... + s_n) }{n}
The arithmetic mean represents an unbiased estimate of the true value.
m = np.mean(array)
m = array.mean()
m = df.mean(axis=0)
Case: mean analysis of film rating data:
mean = ratings['John Carson'].mean()
mean = np.mean(ratings['John Carson'])
means = ratings.mean(axis=1)
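The mean calls above can be tried end to end on a small made-up rating table. The film names, rater names, and scores below are illustrative assumptions, not data from the course:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are films, columns are raters
ratings = pd.DataFrame(
    {'John Carson': [4.0, 3.5, 5.0],
     'Michelle Peterson': [3.0, 4.5, 4.0]},
    index=['Inception', 'Avatar', 'Dune'])

print(ratings['John Carson'].mean())   # mean of one rater's scores
print(ratings.mean(axis=1))            # mean score of each film
```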
Weighted average
When samples differ in importance, each sample can be given its own weight in the average.
Sample: S = [s_1, s_2, s_3, ..., s_n]
Weight: w = [w_1, w_2, w_3, ..., w_n]
Weighted average:
a = \frac{s_1w_1 + s_2w_2 + ... + s_nw_n}{w_1+w_2+...+w_n}
Code implementation:
a = np.average(array, weights=w)
Case: define custom weights and calculate the weighted average.
# Weighted mean
w = np.array([3, 1, 1, 1, 1, 1, 1])
np.average(ratings.loc['Inception'], weights=w)
# Drop missing ratings (and their weights) before averaging
mask = ~pd.isna(ratings.loc['Inception'])
np.average(ratings.loc['Inception'][mask], weights=w[mask])
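The formula and `np.average` can be checked against each other on a few made-up numbers (the values and weights below are arbitrary):

```python
import numpy as np

s = np.array([90, 80, 70])   # hypothetical sample values
w = np.array([3, 2, 1])      # corresponding weights

# np.average applies the weighted-average formula directly
a = np.average(s, weights=w)
# Manual computation: sum(s_i * w_i) / sum(w_i)
manual = (s * w).sum() / w.sum()
print(a, manual)
```

Both expressions reduce to (90·3 + 80·2 + 70·1) / 6.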
Maximum and minimum values
np.max() / np.min() / np.ptp(): return the maximum / minimum / range (maximum minus minimum) of an array
import numpy as np
# Generate 9 random integers in [10, 100)
a = np.random.randint(10, 100, 9)
print(a)
print(np.max(a), np.min(a), np.ptp(a))
np.argmax() / np.argmin() and pandas idxmax() / idxmin(): return the index of the largest / smallest element in an array
# In NumPy, argmax/argmin return the positional index of the extreme value
print(np.argmax(a), np.argmin(a))
# In pandas, idxmax/idxmin return the index label of the extreme value
print(series.idxmax(), series.idxmin())
print(dataframe.idxmax(), dataframe.idxmin())
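The difference between positional indices and labels can be seen on a small example; the array values and index labels below are made up:

```python
import numpy as np
import pandas as pd

a = np.array([7, 42, 3, 15])
# argmax/argmin return positional indices into the array
print(np.argmax(a), np.argmin(a))        # 1 2

# idxmax/idxmin return index labels, not positions
series = pd.Series([7, 42, 3, 15], index=['a', 'b', 'c', 'd'])
print(series.idxmax(), series.idxmin())  # b c
```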
Median
Sort the samples by value and take the element in the middle.
If the number of samples is odd, the median is the single middle element:
[1, 2000, 3000, 4000, 10000000]
If the number of samples is even, the median is the average of the two middle elements:
[1, 2000, 3000, 4000, 5000, 10000000]
Case: implement the median algorithm by hand and verify it against NumPy's median API.
np.median()
import numpy as np
# Column 6 of aapl.csv holds the closing prices
closing_prices = np.loadtxt('../../data/aapl.csv', delimiter=',',
                            usecols=(6,), unpack=True)
size = closing_prices.size
sorted_prices = np.sort(closing_prices)
# Manual median: average the two middle elements
# (the two indices coincide when size is odd)
median = (sorted_prices[int((size - 1) / 2)] + sorted_prices[int(size / 2)]) / 2
print(median)
# NumPy's built-in median
median = np.median(closing_prices)
print(median)
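The case above depends on a local CSV file. The odd/even behaviour can also be checked directly on the two example lists from this section:

```python
import numpy as np

# Odd number of samples: the median is the middle element
odd = np.array([1, 2000, 3000, 4000, 10000000])
print(np.median(odd))    # 3000.0

# Even number of samples: the median averages the two middle elements
even = np.array([1, 2000, 3000, 4000, 5000, 10000000])
print(np.median(even))   # 3500.0
```

Note how the single huge value barely moves the median, while it would dominate the mean.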
Standard deviation
The standard deviation measures how widely a set of data fluctuates around its mean, i.e. how stable it is.
sample: S = [s_1, s_2, s_3, ..., s_n]
Average: m = \frac{s_1 + s_2 + s_3 + ... + s_n}{n}
Deviation: measures how far each value lies from the center.
D = [d_1, d_2, d_3, ..., d_n], d_i = s_i - m
Each deviation is the difference between a value and the mean. A large absolute deviation means the value is far from the center, i.e. the fluctuation is large. Deviations can be positive or negative, so we square them to make every term positive.
Deviation square: Q = [q_1, q_2, q_3, ..., q_n] q_i=d_i^2
Population variance: v = \frac{q_1+q_2+q_3 + ... + q_n}{n}
The variance is the mean of the squared deviations. The larger the variance, the stronger the fluctuation; the smaller the variance, the more stable the data.
Standard deviation: s = \sqrt{v}
Sample variance: v' = \frac{q_1+q_2+q_3 + ... + q_n}{n-1}
Dividing by n-1 instead of n is called "Bessel's correction". Sampled values tend to fall near the sample mean, so a variance computed with n systematically underestimates the population variance; replacing n with n-1 enlarges the estimate and makes it unbiased.
Sample standard deviation: s' = \sqrt{v'}
Case: compute the standard deviation of the rating data:
# pandas uses the sample standard deviation (ddof=1) by default
ratings.std(axis=0)
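The whole derivation (deviations, squared deviations, variance, standard deviation) can be walked through step by step and compared with NumPy's `np.std`, whose `ddof` parameter switches between the population (ddof=0, default) and sample (ddof=1) formulas. The sample values are made up:

```python
import numpy as np

s = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
m = s.mean()
d = s - m          # deviations d_i = s_i - m
q = d ** 2         # squared deviations q_i = d_i^2

v = q.sum() / s.size          # population variance: divide by n
std = np.sqrt(v)
print(std, np.std(s))         # np.std uses ddof=0 by default

v_prime = q.sum() / (s.size - 1)   # sample variance: Bessel's correction
print(np.sqrt(v_prime), np.std(s, ddof=1))
```

Here the mean is 5, the squared deviations sum to 32, so the population standard deviation is sqrt(32/8) = 2 and the sample standard deviation is sqrt(32/7).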