Borderline-SMOTE algorithm: introduction and Python implementation [source code attached]

Posted by alcapone on Tue, 04 Jan 2022 04:19:56 +0100

💖 About the author: Hello, I'm Brother Cheshen (handle: cheshen, 18 Fuxue Road) 🥇
⚡ About Cheshen: my fastest time from the dorm to the lab is 3 minutes, and my slowest is 3 and a half minutes (the extra half minute is spent waiting at the traffic light)

Recently, while writing my graduation thesis, I used the Borderline-SMOTE algorithm for fault diagnosis. Real operating conditions produce a lot of data, but the monitoring intervals are extremely uneven: some measurements are sampled monthly, some yearly, and others daily or in real time. Imbalanced data shows up in many places, so we need to generate more similar samples. There are many general virtual sample generation techniques: the Monte Carlo method, mega-trend diffusion (MTD), SMOTE, DNNs, Bootstrap and so on. Since I have been using Borderline-SMOTE recently, let's introduce it~

The Python source code is at the end of the article, help yourself!!!

🎉 Introduction to the Borderline-SMOTE algorithm

Borderline-SMOTE is an improved oversampling algorithm based on SMOTE. It uses only the minority-class samples on the class boundary to synthesize new samples, which improves the class distribution of the data set.

SMOTE assumes that the region between neighboring minority-class samples also belongs to the minority class, and it does not fully consider how the surrounding samples are distributed, which can cause overlap between classes. Borderline-SMOTE first identifies minority-class seed samples, which helps avoid this overlap. The principle of synthesizing samples from samples on the boundary is shown in the figure below.

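The Borderline-SMOTE sampling process divides the minority-class samples into three categories: Safe, Danger and Noise. A sample is Safe when more than half of its nearest neighbors belong to the minority class, Danger when at least half (but not all) of them belong to the majority class, and Noise when all of them belong to the majority class. In the end, only the minority-class samples labeled Danger are oversampled.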
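The three categories can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code used in the thesis: the function name `classify_minority`, the brute-force distance search, and the toy data are all made up for this example, and the thresholds follow the usual Borderline-SMOTE rule (all neighbors in the majority class gives Noise, at least half gives Danger, otherwise Safe).

```python
import numpy as np

def classify_minority(X, y, minority_label, m=5):
    """Label each minority sample as 'safe', 'danger' or 'noise' (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = {}
    for i in np.flatnonzero(y == minority_label):
        # Euclidean distance from sample i to every sample in the set
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # a sample is not its own neighbor
        nn = np.argsort(dist, kind="stable")[:m]
        # Number of majority-class samples among the m nearest neighbors
        n_major = int(np.sum(y[nn] != minority_label))
        if n_major == m:
            labels[int(i)] = "noise"
        elif n_major >= m / 2:
            labels[int(i)] = "danger"
        else:
            labels[int(i)] = "safe"
    return labels

# Toy data on a line: four majority points, then six minority points
X = np.array([[0, 0], [1, 0], [2, 0], [3, 0],            # majority (label 0)
              [1.5, 0],                                   # minority inside the majority cloud
              [4, 0],                                     # minority at the boundary
              [10, 0], [10.5, 0], [11, 0], [11.5, 0]])    # minority cluster far away
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

labels = classify_minority(X, y, minority_label=1, m=3)
print(labels)
```

With `m=3`, the point at 1.5 is surrounded only by majority samples (Noise), the point at 4 has two majority neighbors and one minority neighbor (Danger), and the far cluster is Safe. Only the Danger point would be used as a seed for oversampling.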

Borderline-SMOTE comes in two variants, Borderline-SMOTE1 and Borderline-SMOTE2. When generating new samples for a Danger point, Borderline-SMOTE1 interpolates towards a randomly selected minority-class sample among its k nearest neighbors (the same as SMOTE), while Borderline-SMOTE2 may select any sample among the k nearest neighbors, regardless of its class.


Let $S$ be the sample set, $S_{min}$ the minority-class sample set, $S_{maxj}$ the majority-class sample set, and $m$ the number of nearest neighbors. $x_i$ is a minority-class sample, $x_{ij}$ is the $j$-th attribute of $x_i$, $x_n$ is a nearest-neighbor sample of $x_i$, and $R_{ij}$ is a coefficient that takes the value 0.5 or 1. The steps of the synthesis algorithm are as follows.

  • Step 1: for each $x_i \in S_{min}$, find its $m$ nearest neighbors in $S$. They form the set $S_{NN}$, with $S_{NN} \subset S$.
  • Step 2: for each sample $x_i$, count the nearest neighbors that belong to the majority-class set, $|S_{NN} \cap S_{maxj}|$. The sample is treated as a boundary (Danger) sample when $\frac{m}{2} \le |S_{NN} \cap S_{maxj}| < m$.
  • Step 3: synthesize minority-class samples. For a boundary sample $x_i$ and one of its nearest neighbors $x_n$, the difference in attribute $j$ is $d_{ij} = x_{nj} - x_{ij}$, where $x_{nj}$ is the $j$-th attribute of $x_n$. The new minority sample is $h_{ij} = x_{ij} + d_{ij} \times rand(0, R_{ij})$.
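The interpolation in the last step can be sketched as follows. This is an illustration under assumptions: the helper name `synthesize` is made up, a single random factor is drawn per synthetic sample (as in the `Smote` class later in this post), and the reading of $R_{ij}$ as 1 for a minority-class neighbor and 0.5 for a majority-class neighbor follows the usual description of Borderline-SMOTE2 rather than anything stated explicitly above.

```python
import numpy as np

def synthesize(x_i, x_n, R, rng):
    """Return x_i + (x_n - x_i) * rand(0, R) (hypothetical helper)."""
    x_i = np.asarray(x_i, dtype=float)
    x_n = np.asarray(x_n, dtype=float)
    d = x_n - x_i               # per-attribute difference d_ij
    gap = rng.uniform(0.0, R)   # one random factor in [0, R) for the whole sample
    return x_i + gap * d

rng = np.random.default_rng(0)
x_i = [0.0, 0.0]   # boundary (Danger) sample
x_n = [2.0, 4.0]   # one of its nearest neighbors

h_minority_nn = synthesize(x_i, x_n, R=1.0, rng=rng)   # anywhere between x_i and x_n
h_majority_nn = synthesize(x_i, x_n, R=0.5, rng=rng)   # stays in the half closer to x_i
print(h_minority_nn, h_majority_nn)
```

Both new points lie on the line segment from `x_i` to `x_n`. With `R=0.5` the point cannot pass the midpoint, which keeps it closer to the minority sample when the neighbor belongs to the majority class.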

Compared with SMOTE, Borderline-SMOTE performs nearest-neighbor linear interpolation only for boundary samples, which makes the distribution of the synthesized minority samples more reasonable.

Wow, I haven't typed formulas in Markdown for a long time. My hands are a little rusty, ha ha.

🤗 Source code

Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/12/28 10:49
# @Author: cheshen, 18 Fuxue Road
# @Email   : yurz_control@163.com
# @File    : Borderline-SMOTE_imblearn.py

from collections import Counter

import numpy as np
import pandas as pd
from icecream import ic
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
import random
from sklearn.neighbors import NearestNeighbors

# Read data
def loaddata(filename):

    df = pd.read_excel(io=filename)
    # ic(df)

    return np.array(df)

# Generate 100 evenly spaced values between each lower bound and its upper bound
def gene_data(Lb, Ub):

    # Loop over the (Lb[i], Ub[i]) pairs; each pair yields one row of 100 values
    m = Lb.shape[0]
    gene_box = []
    for i in range(m):

        gene_sample = np.linspace(Lb[i], Ub[i], 100)
        # ic(gene_sample)
        gene_box.append(gene_sample)

    return gene_box

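class Smote:
    # Plain SMOTE (no borderline selection). N is the oversampling rate in percent
    # (N=100 -> one synthetic sample per input sample); k is the number of neighbors.

    def __init__(self,samples,N=10,k=5):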
        self.n_samples,self.n_attrs=samples.shape
        self.N=N
        self.k=k
        self.samples=samples
        self.newindex=0
       # self.synthetic=np.zeros((self.n_samples*N,self.n_attrs))

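    def over_sampling(self):
        # N is given in percent, so N=100 means one synthetic sample per input sample
        N=int(self.N/100)
        self.synthetic = np.zeros((self.n_samples * N, self.n_attrs))
        self.newindex=0
        # Ask for k+1 neighbors: each sample's nearest neighbor in its own set is itself
        neighbors=NearestNeighbors(n_neighbors=self.k+1).fit(self.samples)
        for i in range(len(self.samples)):
            # Drop the first hit (the sample itself) and keep its k true neighbors
            nnarray=neighbors.kneighbors(self.samples[i].reshape(1,-1),return_distance=False)[0][1:]
            self._populate(N,i,nnarray)
        return self.synthetic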


    # For each sample, generate N synthetic samples, each by interpolating towards
    # one randomly chosen neighbor out of its k nearest neighbors.
    def _populate(self,N,i,nnarray):
        for j in range(N):
            nn=random.randint(0,self.k-1)
            dif=self.samples[nnarray[nn]]-self.samples[i]
            gap=random.random()
            self.synthetic[self.newindex]=self.samples[i]+gap*dif
            self.newindex+=1



if __name__ == '__main__':

    # Read data
    datafile = "LbUbCL boundary.xlsx"
    df = loaddata(datafile)
    df_LB = df[:, 2]
    df_UB = df[:, 3]
    # ic(df_LB, df_UB)
    # ic(df_LB.shape, df_UB.shape)

    # Generate 100 evenly spaced values in each [LB, UB] interval as the initial samples for oversampling
    Initial_dt = np.array(gene_data(df_LB, df_UB))
    X = Initial_dt.T
    ic((Initial_dt.T).shape)     # success
    # Corresponding label y
    y = np.array([1]*100)

    # y = np.ones((16, 100))
    # ic(y)

    # print('Original dataset shape %s' % Counter(y))
    # Borderline-SMOTE via imblearn (left commented out: y above contains a single class,
    # and BorderlineSMOTE needs at least two classes to find boundary samples)
    # sm = BorderlineSMOTE(random_state=42, kind="borderline-1")
    # X_res, y_res = sm.fit_resample(X, y)
    # print('Resampled dataset shape %s' % Counter(y_res))
    # ic(X_res, y_res)

    # Plain SMOTE on the 100 generated samples (N=100: one synthetic sample per input sample)
    ss = Smote(X, N=100)

    res = ss.over_sampling()

    pd.DataFrame(res).to_excel("Borderline-SMOTE_result.xlsx")

    print(res)

    """-----------------------------------------------------------"""

    # print('Original dataset shape %s' % Counter(y))

    # sm = BorderlineSMOTE(random_state=42, kind="borderline-1")
    # X_res, y_res = sm.fit_resample(X, y)

    # print('Resampled dataset shape %s' % Counter(y_res))

    # icecream.ic(X_res, y_res.shape)
    # X1, y1 = make_classification(n_classes=2, class_sep=2,
    #                                                        weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
    #                                                        n_features=2, n_clusters_per_class=1, n_samples=100, random_state=9)

    # ic(X1, y1)

Note that the data set comes from my thesis, so I can't share it or walk through it here~
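Since the Excel file is not available, the script can still be tried with stand-in bounds. The sketch below uses invented lower and upper bounds for three variables (they are assumptions, not values from the thesis) and builds the same kind of 100-row input matrix that `gene_data` followed by a transpose produces. It also shows that `np.linspace` accepts array-valued endpoints, so the loop can be replaced by a single call.

```python
import numpy as np

# Hypothetical bounds for three variables (invented for illustration only)
Lb = np.array([0.0, 10.0, -1.0])
Ub = np.array([1.0, 20.0, 1.0])

# np.linspace broadcasts over array-valued endpoints: 100 rows, one column per variable
X = np.linspace(Lb, Ub, 100)

# Same result as looping over the bounds (as gene_data does) and transposing
X_loop = np.array([np.linspace(Lb[i], Ub[i], 100) for i in range(Lb.shape[0])]).T

print(X.shape)
```

The resulting `X` can be passed to `Smote(X, N=100)` in place of the matrix built from the thesis data.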

The plotting code is also attached below, for reference only:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/12/28 21:35
# @Author: cheshen, 18 Fuxue Road
# @Email   : yurz_control@163.com
# @File    : plot.py

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pylab import mpl, text
from matplotlib.font_manager import FontProperties

roman = FontProperties(fname=r'C:\Windows\Fonts\Times New Roman.ttf', size=10) # Times new roman
mpl.rcParams['font.sans-serif'] = ['SimSun']
fontcn = {'family': 'SimSun','size': 10} # 1pt = 4/3px
fonten = {'family':'Times New Roman','size': 10}


df = pd.read_excel("SMAPE_final.xlsx", sheet_name="Sheet2")
df2 = pd.read_excel("SMAPE_final.xlsx", sheet_name="Sheet4")
print(df)

t1 = list(df2.iloc[0, 1:])
t2 = list(df2.iloc[1, 1:])
t3 = list(df2.iloc[2, 1:])
# X2 = ["MTD(%)", "MT-MTD(%)", "MD-MTDB(%)"]
X2 = ['10', '15', '20', '25', '30']

plt.plot(X2, t1, linestyle="-.", marker="o", linewidth=2, label="MTD", markersize='8')
plt.plot(X2, t2, "-D", linewidth=2, label="MT-MTD", markersize='8')
plt.plot(X2, t3, "--v",linewidth=2, label="MD-MTDB", markersize='8')
plt.xlabel("Size of Sample", fontsize=15)
plt.ylabel("AveSMAPE(%)", fontsize=15)
plt.rcParams.update({'font.size':14})
plt.legend()
plt.show()


y1 = list(df.iloc[0, 1:])
y2 = list(df.iloc[1, 1:])
y3 = list(df.iloc[2, 1:])
X = ['10', '15', '20', '25', '30']

plt.plot(X, y1, '--o', linewidth=2, label='SMAPE without virtual samples', markersize='8')
plt.plot(X, y2, '-^', linewidth=2, label='SMAPE with virtual samples', markersize='8')
plt.xlabel("Size of Sample", fontsize=15)
plt.ylabel("AveSMAPE(%)", fontsize=15)
plt.rcParams.update({'font.size':14})
plt.legend()
plt.show()


# plt.subplot(211)

plt.bar(X, y3, width=0.8)
plt.xlabel("Size of Sample",  fontsize=15)
plt.ylabel("AvePCR(%)", fontsize=15)
plt.show()


Keep going~ Time to prepare the third chapter of my thesis, o(╥﹏╥)o




Everything is difficult at the beginning, then in the middle, and finally at the end.
