Background
Ben Caiji recently wanted to learn machine learning, so he started by learning naive Bayes.
If you think this article is well written, feel free to give it a like and leave a comment; following me wouldn't hurt either 🤗
Algorithm classification
Generally speaking, classification algorithms can be divided into generative algorithms and discriminative algorithms.
Generative algorithms
Roughly speaking, this kind of algorithm computes the probability of each label value given that every feature takes a particular value. The classic example is naive Bayes.
Discriminative algorithms
Roughly speaking, this kind of algorithm takes a set of feature values and decides which class the data belongs to based on how each feature influences the label value. A typical example is the decision tree, which we will cover in the next article.
The difference
A generative algorithm focuses on how the combination of feature values affects the label value, while a discriminative algorithm focuses on how each individual feature affects the label value.
Background knowledge
Suppose there are two events A and B with probabilities of occurring P(A) and P(B), and let P(A, B) denote the probability that both occur. Then, given that event A has occurred, the probability of event B is written P(B|A) and is computed as
$$P(B|A) = \frac{P(A, B)}{P(A)} = \frac{P(A|B) \cdot P(B)}{P(A)}$$
Generally speaking:
- If P(B|A) ≠ P(B), then A and B are not independent of each other: the occurrence of event A changes the probability of event B.
- If P(B|A) > P(B), the occurrence of event A makes event B more likely.
- Conversely, if P(B|A) < P(B), it makes event B less likely.
- If the two are equal, the occurrence of event A has no effect on event B.
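As a quick numeric sanity check of the formula, here is a small Python snippet; the probabilities in it are made up purely for illustration:

```python
# Bayes' rule with made-up, purely illustrative probabilities
p_a_given_b = 0.8   # P(A|B)
p_b = 0.3           # P(B)
p_a = 0.5           # P(A)

p_b_given_a = p_a_given_b * p_b / p_a   # P(B|A) = P(A|B)·P(B) / P(A)
print(p_b_given_a)                      # 0.48 > P(B) = 0.3, so A makes B more likely
```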
Naive Bayes
In naive Bayes, we assume that the features are conditionally independent of each other given the label. The goal is to predict the probability of each label value given a particular combination of feature values.
Suppose the data set $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ is generated independently and identically distributed from $P(X, Y)$, where $x$ denotes the features and $y$ the label. Then:
- Prior probability: the probability of each possible value of a variable, e.g. $P(Y=c_k), \: k=1,2,\dots,K$.
- Conditional probability: the probability that the features $X$ take the value $x$ given that the label $Y$ takes the value $c_k$, written $P(X=x|Y=c_k)$. It is computed as:
$$P(X=x|Y=c_k) = P(X_1=x_1, X_2=x_2, \dots, X_n=x_n|Y=c_k), \quad k=1,2,\dots,K$$
Under the conditional independence assumption, this factorizes into
$$P(X=x|Y=c_k) = \prod_{j=1}^{n} P(X_j=x_j|Y=c_k).$$
- Posterior probability: the reverse direction: the probability that the label $Y$ takes the value $c_k$ given that the features $X$ take the value $x$, written $P(Y=c_k|X=x)$. It is computed as:
$$P(Y=c_k|X=x) = \frac{P(X=x, Y=c_k)}{P(X=x)} = \frac{P(Y=c_k) \cdot P(X=x|Y=c_k)}{P(X=x)} = \frac{P(Y=c_k) \prod\limits_{j=1}^{n} P(X_j=x_j|Y=c_k)}{\sum\limits_{k} P(Y=c_k) \prod\limits_{j=1}^{n} P(X_j=x_j|Y=c_k)}$$
Naive Bayes computes the posterior probability of each label value and predicts the label value whose posterior probability is largest.
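Since the denominator is the same for every label value, comparing the posteriors reduces to the usual maximum a posteriori rule:

$$\hat{y} = \arg\max_{c_k} \; P(Y=c_k) \prod_{j=1}^{n} P(X_j=x_j \mid Y=c_k)$$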
An example
Suppose we have the following data set (made up by ourselves):
Serial number | x1 | x2 | y |
---|---|---|---|
1 | small | low | 0 |
2 | small | medium | 0 |
3 | small | medium | 1 |
4 | small | low | 1 |
5 | small | low | 0 |
6 | medium | low | 0 |
7 | medium | medium | 0 |
8 | medium | medium | 1 |
9 | medium | high | 1 |
10 | medium | high | 1 |
11 | large | high | 1 |
12 | large | medium | 1 |
13 | large | medium | 1 |
14 | large | high | 1 |
15 | large | high | 0 |
Here,
- Feature x1 represents age, with three possible values: {small, medium, large}.
- Feature x2 represents income, with three possible values: {low, medium, high}.
- The label y indicates whether the person has been cheated, with two possible values: {0, 1}.
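For the calculations and the code later in this post, the table can be written as a pandas DataFrame. A minimal sketch (the variable name `data` is the one reused by the test code further below):

```python
import pandas as pd

# The toy data set from the table above
data = pd.DataFrame({
    'x1': ['small'] * 5 + ['medium'] * 5 + ['large'] * 5,
    'x2': ['low', 'medium', 'medium', 'low', 'low',
           'low', 'medium', 'medium', 'high', 'high',
           'high', 'medium', 'medium', 'high', 'high'],
    'y':  [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
})
```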
Now, if a person is middle-aged (x1 = medium) and has a low income (x2 = low), has this person been cheated?
We can answer this with naive Bayes.
Solution idea
- Compute the prior probabilities of having been cheated and of not having been cheated, i.e. $P(Y=1)$ and $P(Y=0)$.
- Compute the conditional probabilities of "medium age" and "low income" within the cheated group and the not-cheated group, i.e. $P(x_1=\text{medium} \mid Y=0)$, $P(x_1=\text{medium} \mid Y=1)$ and $P(x_2=\text{low} \mid Y=0)$, $P(x_2=\text{low} \mid Y=1)$.
- Compute the posterior probabilities that this person has been cheated and has not been cheated, given medium age and low income, and take the larger of the two as the final classification.
Solution (working through the math)
Prior probabilities
$$P(Y=1)=\frac{9}{15}, \quad P(Y=0)=\frac{6}{15}$$
$$P(x_1=\text{medium})=\frac{5}{15}, \quad P(x_2=\text{low})=\frac{4}{15}$$
$$P(x_1=\text{medium},\ x_2=\text{low}) = P(x_1=\text{medium}) \times P(x_2=\text{low}) = \frac{4}{45}$$
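These marginal frequencies can be read off with pandas, assuming the `data` DataFrame from the sketch above:

```python
# Marginal (prior) probabilities from the table
print(data['y'].value_counts(normalize=True))    # Y=1: 9/15 = 0.6, Y=0: 6/15 = 0.4
print(data['x1'].value_counts(normalize=True))   # medium: 5/15 ≈ 0.333, ...
print(data['x2'].value_counts(normalize=True))   # low: 4/15 ≈ 0.267, ...
```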
Conditional probabilities (assuming the features are independent given the label)
$$P(x_1=\text{medium} \mid Y=0)=\frac{2}{6}, \quad P(x_1=\text{medium} \mid Y=1)=\frac{3}{9}$$
$$P(x_2=\text{low} \mid Y=0)=\frac{3}{6}, \quad P(x_2=\text{low} \mid Y=1)=\frac{1}{9}$$
$$P(x_1=\text{medium},\ x_2=\text{low} \mid Y=0) = P(x_1=\text{medium} \mid Y=0) \times P(x_2=\text{low} \mid Y=0) = \frac{1}{6}$$
$$P(x_1=\text{medium},\ x_2=\text{low} \mid Y=1) = P(x_1=\text{medium} \mid Y=1) \times P(x_2=\text{low} \mid Y=1) = \frac{1}{27}$$
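The same per-label frequencies can be checked with a groupby, again assuming the `data` DataFrame from above:

```python
# Per-label frequencies of each feature value, i.e. P(x_j = value | Y = y)
cond_x1 = data.groupby('y')['x1'].value_counts(normalize=True)
cond_x2 = data.groupby('y')['x2'].value_counts(normalize=True)
print(cond_x1[(0, 'medium')], cond_x1[(1, 'medium')])   # 2/6 ≈ 0.333, 3/9 ≈ 0.333
print(cond_x2[(0, 'low')], cond_x2[(1, 'low')])         # 3/6 = 0.5,   1/9 ≈ 0.111
```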
Posterior probabilities
$$P(Y=0 \mid x_1=\text{medium},\ x_2=\text{low}) = \frac{P(x_1=\text{medium},\ x_2=\text{low} \mid Y=0) \times P(Y=0)}{P(x_1=\text{medium},\ x_2=\text{low})} = \frac{\frac{1}{6} \times \frac{6}{15}}{\frac{4}{45}} = \frac{3}{4}$$
$$P(Y=1 \mid x_1=\text{medium},\ x_2=\text{low}) = \frac{P(x_1=\text{medium},\ x_2=\text{low} \mid Y=1) \times P(Y=1)}{P(x_1=\text{medium},\ x_2=\text{low})} = \frac{\frac{1}{27} \times \frac{9}{15}}{\frac{4}{45}} = \frac{1}{4}$$
Since $P(Y=0 \mid x_1=\text{medium},\ x_2=\text{low}) > P(Y=1 \mid x_1=\text{medium},\ x_2=\text{low})$, this person has most likely not been cheated.
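As a small check of the arithmetic, the same posteriors can be reproduced with exact fractions:

```python
from fractions import Fraction as F

evidence = F(4, 45)                    # P(x1=medium, x2=low)
p_y0 = F(1, 6) * F(6, 15) / evidence   # P(Y=0 | x1=medium, x2=low)
p_y1 = F(1, 27) * F(9, 15) / evidence  # P(Y=1 | x1=medium, x2=low)
print(p_y0, p_y1)                      # 3/4 1/4
```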
Code implementation
Implementing it ourselves
```python
import pandas as pd
from collections.abc import Iterable


class NaiveBayes:
    def __init__(self):
        # Prior probabilities
        self.pri = pd.DataFrame()
        # Conditional probabilities
        self.cond = pd.DataFrame()
        # Name of the label column; by default the last column is used
        self.y_col = ''
        # List of feature column names
        self.features = []

    def get_frequency(self, data, name=None, key=None):
        # Relative frequency of each value in the column
        freq = data.value_counts(normalize=True)
        # Set the column name and the index of the returned DataFrame
        name = freq.name if name is None else name
        key = freq.name if key is None else key
        freq = freq.to_frame(name=name)
        freq.index = pd.MultiIndex.from_product([[key], freq.index.tolist()])
        return freq

    def get_priori(self, data):
        # Initialise the result
        result = pd.DataFrame()
        # Iterate over all columns (features and label)
        for column in data.columns:
            # Frequency of each value in this column
            p = self.get_frequency(data[column], name='value', key=column)
            # Append to the result
            result = pd.concat([result, p])
        # Return the prior probabilities
        return result

    def get_conditional(self, data):
        # Initialise the result
        result = None
        # Group the rows by label value
        for cate, cate_index in data.groupby(self.y_col).groups.items():
            # Rows belonging to this label value
            x = data.loc[cate_index, self.features]
            # Frequencies of each feature value within this label value, i.e. P(X=x|Y=y)
            cate_df = None
            # Iterate over the feature columns
            for column in self.features:
                # Frequency of each value of this feature within the group
                freq = self.get_frequency(x[column], name=cate, key=column)
                # Initialise on the first feature, otherwise concatenate
                if cate_df is None:
                    cate_df = freq
                else:
                    cate_df = pd.concat([cate_df, freq])
            # Same logic as above: one column per label value
            if result is None:
                result = cate_df
            else:
                result = pd.concat([result, cate_df], axis=1)
        # Return the conditional probabilities
        return result

    def train(self, data, y_col='y'):
        if y_col not in data.columns:
            print(f'Column of y named {y_col} not in data. '
                  f'It will go by column {data.columns[-1]}')
            y_col = data.columns[-1]
        # Set the label column name
        self.y_col = y_col
        # Get the list of feature column names
        self.features = data.columns.tolist()
        self.features.remove(y_col)
        # Prior probabilities
        self.pri = self.get_priori(data)
        # Conditional probabilities
        self.cond = self.get_conditional(data)
        return

    def forward(self, pred_dict):
        # Each column of the result is one label value; the values are the
        # (unnormalised) probabilities of that label under the given features
        result = None
        # Traverse the feature-value dictionary and accumulate prod P(X_j=x_j|Y=y)
        for key, value in pred_dict.items():
            # Raise an error if the feature or value was never seen in training
            if (key, value) not in self.cond.index:
                raise ValueError('The key or value of input data not exists.')
            # Same logic as in get_conditional: initialise, then multiply
            if result is None:
                result = self.cond.loc[(key, value)].to_frame().T
            else:
                result *= self.cond.loc[(key, value)]
        # Multiply each column by the prior P(Y=y)
        result = result.apply(lambda x: x * self.pri.loc[(self.y_col, x.name)].values[0])
        # Normalise so the posteriors sum to 1
        result /= result.sum(axis=1).values[0]
        return result.iloc[0]

    def predict(self, item):
        # The input must be iterable (dict, list, tuple, ...); otherwise raise an error
        if isinstance(item, Iterable):
            # Raise an error if the number of feature values is wrong
            if len(item) != len(self.features):
                raise ValueError(f'The number of input value is [{len(item)}], '
                                 f'which is not equal to the number of features [{len(self.features)}].')
            # Build the feature-value dictionary
            pred_dict = {}
            # If the input is already a dictionary, use it directly
            if isinstance(item, dict):
                pred_dict = item
            # Otherwise, assign the values in the order of the feature columns
            else:
                for i, v in enumerate(item):
                    if isinstance(v, dict):
                        pred_dict.update(v)
                    else:
                        pred_dict[self.features[i]] = v
            # Return the label value with the largest posterior probability
            return self.forward(pred_dict).idxmax()
        else:
            raise TypeError(f'The type of input data [{type(item)}] is not iterable.')
```
Test (using the table above as the data):
```
Input[]:
model = NaiveBayes()
model.train(data)

x_pred = {'x1': 'medium', 'x2': 'low'}
prediction = model.predict(x_pred)
print(f'Value: {", ".join([f"{key}: {value}" for key, value in x_pred.items()])}. Prediction: {prediction}')
-------------------------------------------------------------------
Output[]:
Value: x1: medium, x2: low. Prediction: 0
```
The prediction result is correct!!!
Using sklearn
```
Input[]:
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

encoders = {}
for column in train_data:
    if train_data[column].dtype == object:
        encoders[column] = preprocessing.LabelEncoder()
        train_data[column] = encoders[column].fit_transform(train_data[column])

model = GaussianNB()
model.fit(train_data[train_data.columns[:-1]], train_data['y'])
model.predict(np.array([encoders[key].transform([value])[0]
                        for key, value in x_pred.items()]).reshape(1, -1))
-------------------------------------------------------------------
Output[]:
array([0], dtype=int64)
```
Here, train_data is the data set from the table above (e.g. a copy of data).
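Since both features here are categorical, sklearn's CategoricalNB is arguably a closer match to the hand calculation than GaussianNB. A minimal sketch of that alternative, reusing the encoders, train_data and x_pred from above (this is a suggestion, not part of the original code):

```python
from sklearn.naive_bayes import CategoricalNB

# Alternative: treat the label-encoded features as categorical rather than Gaussian
cat_model = CategoricalNB(alpha=1.0)   # alpha > 0 adds Laplace smoothing
cat_model.fit(train_data[train_data.columns[:-1]], train_data['y'])
cat_model.predict(np.array([encoders[key].transform([value])[0]
                            for key, value in x_pred.items()]).reshape(1, -1))
# Expected to predict class 0 for this example as well
```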
Ending
The example above shows how to use the naive Bayes algorithm for classification.
If you want to learn Python together, you can send me a private message to join the group.
That is all I wanted to share. My knowledge is still limited, so there are bound to be shortcomings; corrections are welcome.
If you have any questions, you can also leave a message in the comment section.