Clarifying the concepts of standardization and normalization (with code)

Posted by altis88 on Fri, 28 Jan 2022 21:57:49 +0100

  • Standardization and normalization are both data preprocessing methods. The two terms are often confused in Chinese translations, so this post clarifies them.

1. Four feature scaling methods

  • Reference: Wikipedia. Standardization and normalization are in fact two of the four feature scaling methods described there.

1.1 Rescaling (min-max normalization)

  • Rescaling (min-max normalization) is sometimes simply called normalization; it is what the term "normalization" usually refers to.
    $$x' = \frac{x-\min(x)}{\max(x)-\min(x)}$$
  • This compresses all values into $[0,1]$, removing the effect of differing units (dimensions) while preserving the relative distances between samples along each feature.
  • As an extension, the data can be compressed into an arbitrary interval $[a,b]$:
    $$x' = a+(b-a)\frac{x-\min(x)}{\max(x)-\min(x)}$$
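As a minimal sketch of the two formulas above, the helper below rescales a single feature into $[a,b]$, with $[0,1]$ as the default. The name `rescale` and the constant-feature check are assumptions made for illustration; they are not part of the original post.

```python
import numpy as np

def rescale(x, a=0.0, b=1.0):
    """Min-max rescale a 1-D feature into [a, b] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # max(x) - min(x) would be zero, so the formula is undefined
        raise ValueError("rescaling is undefined for a constant feature")
    return a + (b - a) * (x - x_min) / (x_max - x_min)

x = np.array([2.0, 4.0, 10.0])
unit = rescale(x)              # minimum maps to 0, maximum maps to 1
wide = rescale(x, -1.0, 1.0)   # minimum maps to -1, maximum maps to 1
print(unit)
print(wide)
```

The ordering and relative spacing of the values are the same in both results; only the interval changes.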

1.2 Mean normalization

  • Mean normalization; the usual translation is a literal rendering of this name.
    $$x' = \frac{x-\text{mean}(x)}{\max(x)-\min(x)}$$
  • This centers the samples around zero (the transformed feature has mean 0 and a total range of 1), removes the effect of differing units, and preserves the relative distances between samples along each feature.
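A small sketch of mean normalization on one feature, checking the two properties of the result: its mean is zero and its total range is 1. The helper name `mean_normalize` is an assumption made for illustration.

```python
import numpy as np

def mean_normalize(x):
    """Subtract the mean, then divide by the range max(x) - min(x) (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("mean normalization is undefined for a constant feature")
    return (x - x.mean()) / span

x = np.array([2.0, 4.0, 10.0])
y = mean_normalize(x)
print(y)                    # values centered around zero
print(y.mean())             # zero up to floating-point rounding
print(y.max() - y.min())    # range of the transformed feature
```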

1.3 Standardization (Z-score normalization)

  • Standardization (Z-score normalization); this is what "standardization" usually refers to.
    $$x' = \frac{x-\text{mean}(x)}{\sigma(x)}$$
  • This adjusts each feature to mean 0 and variance 1. Standardizing a normal distribution yields the standard normal distribution, but this does not mean that only normal distributions can be standardized, nor that every standardized distribution is standard normal. In fact, any distribution with a finite, nonzero variance can be standardized. Standardization changes the parameters of the distribution but not its type: it is only a translation and a scaling.

    Note: For a multidimensional normal distribution, the standardized distribution looks like a perfect circle/sphere (isotropic) only when the features are independent. If the features are correlated, standardizing each feature separately does not remove the correlation, so the result is not a perfect circle/sphere.
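Both claims can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original post: it uses an exponential sample as an arbitrary non-normal distribution, measures shape with the third standardized moment (skewness), and reuses the mean vector and covariance matrix from the example in section 2. The helper names `standardize` and `skewness` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(x):
    """Z-score a 1-D array using the population standard deviation (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def skewness(x):
    """Third standardized moment: unchanged by translation and positive scaling."""
    return np.mean(standardize(x) ** 3)

# 1) A skewed, non-normal sample has mean 0 and std 1 after standardization,
#    but its shape (here: skewness) is unchanged, so it is still not normal.
x = rng.exponential(scale=3.0, size=50_000)
z = standardize(x)
print(z.mean(), z.std())
print(skewness(x), skewness(z))

# 2) Correlated features stay correlated: after per-feature standardization the
#    covariance matrix equals the correlation matrix of the original data.
mu = np.array([-1.0, 2.0])
B = np.array([[0.6, 0.2], [0.2, 0.1]])
pts = rng.multivariate_normal(mu, B, size=50_000)
Z = (pts - pts.mean(axis=0)) / pts.std(axis=0)
cov_after = np.cov(Z, rowvar=False, bias=True)
corr_before = np.corrcoef(pts, rowvar=False)
print(cov_after)
print(corr_before)
```

The off-diagonal entry of `cov_after` is the sample correlation coefficient, which is far from zero for this covariance matrix, so the standardized cloud is still an elongated ellipse.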

1.4 Scaling to unit length

  • Scaling to unit length, often translated as unitization
    $$x' = \frac{x}{||x||}$$
  • Here $x$ is a whole sample (feature vector) rather than a single feature. The operation maps every sample onto the unit hypersphere centered at the origin.
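Unlike the three methods above, this one works per sample (row) rather than per feature (column). A minimal sketch, assuming the Euclidean norm and a hypothetical helper name `to_unit_length`:

```python
import numpy as np

def to_unit_length(X):
    """Divide each row (sample) of X by its Euclidean norm (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)   # shape (n_samples, 1)
    if np.any(norms == 0):
        raise ValueError("a zero vector cannot be scaled to unit length")
    return X / norms

X = np.array([[3.0, 4.0],
              [0.0, -2.0]])
U = to_unit_length(X)
print(U)                            # each row keeps its direction
print(np.linalg.norm(U, axis=1))    # every row now has length 1
```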

2. Examples

  • Generate a two-dimensional normal distribution with mean vector $\pmb{\mu} = \begin{bmatrix}-1\\2 \end{bmatrix}$ and covariance matrix $\pmb{B} = \begin{bmatrix}0.6 &0.2\\0.2 &0.1 \end{bmatrix}$
  • Plot samples from this distribution as follows. Note that the two dimensions are not independent
    # Jupyter magic for interactive figures; remove the next line when running as a plain script
    %matplotlib notebook
    import numpy as np
    import matplotlib.pyplot as plt
    
    fig = plt.figure(figsize = (5,5))
    mu = np.array([-1, 2])                      # mean vector
    sigma = np.array([[0.6,0.2],[0.2,0.1]])     # covariance matrix B
    points = np.random.multivariate_normal(mu,sigma,10000)
    
    a0 = fig.add_subplot(1,1,1,label='a0')
    # cmap is dropped: without per-point colors (c=...) matplotlib ignores it and warns
    a0.scatter(points[:,0], points[:,1],s=1,alpha=0.5)
    a0.grid(which='major',alpha=0.5)
    

  • Apply the four feature scaling methods to these samples and visualize the results:
    %matplotlib notebook
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.ticker import MultipleLocator
    
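    # Each function below modifies px (shape: n_samples x n_features) in place.
    def MinMaxNormalization(px):
        for i in range(px.shape[1]):        # one feature (column) at a time
            t = px[:,i]                     # a view, so writing to t updates px
            tmin,tmax = np.min(t),np.max(t)
            t[:] = (t-tmin)/(tmax-tmin)     # map the column into [0,1]
    
    def MeanNormalization(px):
        for i in range(px.shape[1]):   
            t = px[:,i]
            tmin,tmax,tmean = np.min(t),np.max(t),np.mean(t)
            t[:] = (t-tmean)/(tmax-tmin)    # center at 0, total range 1
    
    def Standardization(px):
        for i in range(px.shape[1]):   
            t = px[:,i]
            tmean,tstd = np.mean(t),np.std(t)   # population standard deviation (ddof=0)
            t[:] = (t-tmean)/tstd           # mean 0, variance 1
    
    def Scaling2Unit(px):
        norm = np.linalg.norm(px,axis=1)    # Euclidean norm of each sample (row), taken before any column changes
        for i in range(px.shape[1]):   
            t = px[:,i]
            t[:] = t/norm                   # divide each coordinate by its sample's norm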
            
    majorLocator = MultipleLocator(2) # Major ticks at multiples of 2
    p1 = points.copy()
    p2 = points.copy()
    p3 = points.copy()
    p4 = points.copy()
    
    MinMaxNormalization(p1)
    MeanNormalization(p2)
    Standardization(p3)
    Scaling2Unit(p4)
    
    fig = plt.figure(figsize = (12,3))
    a1 = fig.add_subplot(1,4,1,label='a1')
    a2 = fig.add_subplot(1,4,2,label='a2')
    a3 = fig.add_subplot(1,4,3,label='a3')
    a4 = fig.add_subplot(1,4,4,label='a4')
    
    subplot = {a1:(p1,'min-max normalization'),
               a2:(p2,'mean normalization'),
               a3:(p3,'standardization'),
               a4:(p4,'scaling to unit')}
    
    for ax in subplot:
        px,title = subplot[ax]
        ax.scatter(px[:,0], px[:,1],s=1,alpha=0.5)   # no cmap: it is ignored (with a warning) unless c=... is given
        ax.axis([-4,4,-4,4]) 
        ax.xaxis.set_major_locator(majorLocator)
        ax.yaxis.set_major_locator(majorLocator)
        ax.grid(which='major',alpha=0.5)                    
        ax.set_title(title)
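The plots can be complemented with a numerical check. The sketch below is a vectorized variant, not the code above: it uses NumPy broadcasting instead of the in-place column loops, draws its own samples with a seeded generator so that it runs on its own, and returns new arrays. The variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-1.0, 2.0])
B = np.array([[0.6, 0.2], [0.2, 0.1]])
X = rng.multivariate_normal(mu, B, size=10_000)   # shape (10000, 2)

# Per-feature statistics (axis=0 reduces over the samples)
col_min, col_max = X.min(axis=0), X.max(axis=0)
col_mean, col_std = X.mean(axis=0), X.std(axis=0)

min_max = (X - col_min) / (col_max - col_min)          # 1.1 rescaling
mean_norm = (X - col_mean) / (col_max - col_min)       # 1.2 mean normalization
standardized = (X - col_mean) / col_std                # 1.3 standardization
unit = X / np.linalg.norm(X, axis=1, keepdims=True)    # 1.4 unit length, per sample

print(min_max.min(axis=0), min_max.max(axis=0))              # each feature spans [0, 1]
print(mean_norm.mean(axis=0))                                # approximately 0 per feature
print(standardized.mean(axis=0), standardized.std(axis=0))   # approximately 0 and 1
print(np.linalg.norm(unit, axis=1)[:5])                      # every sample has length 1
```

The first three methods use statistics computed per column, while scaling to unit length uses one norm per row, which is why its plot collapses onto a circle instead of keeping the shape of the cloud.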