Background
Ben Caiji recently wanted to learn machine learning, so he started by learning naive Bayes.
If you think this article is well written, feel free to give it a like and leave a comment; following me wouldn't hurt either 🤗
Algorithm classification
Generally speaking, classification algorithms can be divided into generative algorithms and discriminative algorithms.
Generative algorithms
Roughly speaking, this kind of algorithm computes the probability of each label value given that every feature takes a particular value. The classic example is naive Bayes.
Discriminative algorithms
Roughly speaking, this kind of algorithm takes a set of feature values and decides which class the data belongs to based on how each feature influences the label value. A typical example is the decision tree, which we will cover in the next article.
The difference
A generative algorithm focuses on how the combination of feature values affects the label value, while a discriminative algorithm focuses on how each individual feature affects the label value.
Background knowledge
Suppose there are two events A and B with probabilities of occurring P(A) and P(B), and let P(A, B) denote the probability that both occur. Then, given that event A has occurred, the probability of event B is written P(B|A) and is computed as
$$P(B|A) = \frac{P(A, B)}{P(A)} = \frac{P(A|B) \cdot P(B)}{P(A)}$$
Generally speaking:
- If P(B|A) ≠ P(B), then A and B are not independent of each other: the occurrence of event A changes the probability of event B.
- If P(B|A) > P(B), the occurrence of event A makes event B more likely.
- Conversely, if P(B|A) < P(B), it makes event B less likely.
- If the two are equal, the occurrence of event A has no effect on event B.
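As a quick numeric sanity check of the formula, here is a small Python snippet; the probabilities in it are made up purely for illustration:

```python
# Bayes' rule with made-up, purely illustrative probabilities
p_a_given_b = 0.8   # P(A|B)
p_b = 0.3           # P(B)
p_a = 0.5           # P(A)

p_b_given_a = p_a_given_b * p_b / p_a   # P(B|A) = P(A|B)·P(B) / P(A)
print(p_b_given_a)                      # 0.48 > P(B) = 0.3, so A makes B more likely
```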
Naive Bayes
In naive Bayes, we assume that the features are conditionally independent of each other given the label. The goal is to predict the probability of each label value given a particular combination of feature values.
Suppose the data set $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ is generated independently and identically distributed from $P(X, Y)$, where $x$ denotes the features and $y$ the label. Then:
- Prior probability: the probability of each possible value of a variable, e.g. $P(Y=c_k), \: k=1,2,\dots,K$.
- Conditional probability: the probability that the features $X$ take the value $x$ given that the label $Y$ takes the value $c_k$, written $P(X=x|Y=c_k)$. It is computed as:
$$P(X=x|Y=c_k) = P(X_1=x_1, X_2=x_2, \dots, X_n=x_n|Y=c_k), \quad k=1,2,\dots,K$$
Under the conditional independence assumption, this factorizes into
$$P(X=x|Y=c_k) = \prod_{j=1}^{n} P(X_j=x_j|Y=c_k).$$
- Posterior probability: the reverse direction: the probability that the label $Y$ takes the value $c_k$ given that the features $X$ take the value $x$, written $P(Y=c_k|X=x)$. It is computed as:
$$P(Y=c_k|X=x) = \frac{P(X=x, Y=c_k)}{P(X=x)} = \frac{P(Y=c_k) \cdot P(X=x|Y=c_k)}{P(X=x)} = \frac{P(Y=c_k) \prod\limits_{j=1}^{n} P(X_j=x_j|Y=c_k)}{\sum\limits_{k} P(Y=c_k) \prod\limits_{j=1}^{n} P(X_j=x_j|Y=c_k)}$$
Naive Bayes computes the posterior probability of each label value and predicts the label value whose posterior probability is largest.
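Since the denominator is the same for every label value, comparing the posteriors reduces to the usual maximum a posteriori rule:

$$\hat{y} = \arg\max_{c_k} \; P(Y=c_k) \prod_{j=1}^{n} P(X_j=x_j \mid Y=c_k)$$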
An example
Suppose we have the following data set (made up by ourselves):
Serial number | x1 | x2 | y |
---|---|---|---|
1 | small | low | 0 |
2 | small | medium | 0 |
3 | small | medium | 1 |
4 | small | low | 1 |
5 | small | low | 0 |
6 | medium | low | 0 |
7 | medium | medium | 0 |
8 | medium | medium | 1 |
9 | medium | high | 1 |
10 | medium | high | 1 |
11 | large | high | 1 |
12 | large | medium | 1 |
13 | large | medium | 1 |
14 | large | high | 1 |
15 | large | high | 0 |
Here,
- Feature x1 represents age, with three possible values: {small, medium, large}.
- Feature x2 represents income, with three possible values: {low, medium, high}.
- The label y indicates whether the person has been cheated, with two possible values: {0, 1}.
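For the calculations and the code later in this post, the table can be written as a pandas DataFrame. A minimal sketch (the variable name `data` is the one reused by the test code further below):

```python
import pandas as pd

# The toy data set from the table above
data = pd.DataFrame({
    'x1': ['small'] * 5 + ['medium'] * 5 + ['large'] * 5,
    'x2': ['low', 'medium', 'medium', 'low', 'low',
           'low', 'medium', 'medium', 'high', 'high',
           'high', 'medium', 'medium', 'high', 'high'],
    'y':  [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
})
```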
Now, if a person is middle-aged (x1 = medium) and has a low income (x2 = low), has this person been cheated?
We can answer this with naive Bayes.
Solution idea
- Compute the prior probabilities of having been cheated and of not having been cheated, i.e. $P(Y=1)$ and $P(Y=0)$.
- Compute the conditional probabilities of "medium age" and "low income" within the cheated group and the not-cheated group, i.e. $P(x_1=\text{medium} \mid Y=0)$, $P(x_1=\text{medium} \mid Y=1)$ and $P(x_2=\text{low} \mid Y=0)$, $P(x_2=\text{low} \mid Y=1)$.
- Compute the posterior probabilities that this person has been cheated and has not been cheated, given medium age and low income, and take the larger of the two as the final classification.
Solution (working through the math)
Prior probabilities
$$P(Y=1)=\frac{9}{15}, \quad P(Y=0)=\frac{6}{15}$$
$$P(x_1=\text{medium})=\frac{5}{15}, \quad P(x_2=\text{low})=\frac{4}{15}$$
$$P(x_1=\text{medium},\ x_2=\text{low}) = P(x_1=\text{medium}) \times P(x_2=\text{low}) = \frac{4}{45}$$
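These marginal frequencies can be read off with pandas, assuming the `data` DataFrame from the sketch above:

```python
# Marginal (prior) probabilities from the table
print(data['y'].value_counts(normalize=True))    # Y=1: 9/15 = 0.6, Y=0: 6/15 = 0.4
print(data['x1'].value_counts(normalize=True))   # medium: 5/15 ≈ 0.333, ...
print(data['x2'].value_counts(normalize=True))   # low: 4/15 ≈ 0.267, ...
```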
Conditional probabilities (assuming the features are independent given the label)
$$P(x_1=\text{medium} \mid Y=0)=\frac{2}{6}, \quad P(x_1=\text{medium} \mid Y=1)=\frac{3}{9}$$
$$P(x_2=\text{low} \mid Y=0)=\frac{3}{6}, \quad P(x_2=\text{low} \mid Y=1)=\frac{1}{9}$$
$$P(x_1=\text{medium},\ x_2=\text{low} \mid Y=0) = P(x_1=\text{medium} \mid Y=0) \times P(x_2=\text{low} \mid Y=0) = \frac{1}{6}$$
$$P(x_1=\text{medium},\ x_2=\text{low} \mid Y=1) = P(x_1=\text{medium} \mid Y=1) \times P(x_2=\text{low} \mid Y=1) = \frac{1}{27}$$
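The same per-label frequencies can be checked with a groupby, again assuming the `data` DataFrame from above:

```python
# Per-label frequencies of each feature value, i.e. P(x_j = value | Y = y)
cond_x1 = data.groupby('y')['x1'].value_counts(normalize=True)
cond_x2 = data.groupby('y')['x2'].value_counts(normalize=True)
print(cond_x1[(0, 'medium')], cond_x1[(1, 'medium')])   # 2/6 ≈ 0.333, 3/9 ≈ 0.333
print(cond_x2[(0, 'low')], cond_x2[(1, 'low')])         # 3/6 = 0.5,   1/9 ≈ 0.111
```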
Posterior probabilities
$$P(Y=0 \mid x_1=\text{medium},\ x_2=\text{low}) = \frac{P(x_1=\text{medium},\ x_2=\text{low} \mid Y=0) \times P(Y=0)}{P(x_1=\text{medium},\ x_2=\text{low})} = \frac{\frac{1}{6} \times \frac{6}{15}}{\frac{4}{45}} = \frac{3}{4}$$
$$P(Y=1 \mid x_1=\text{medium},\ x_2=\text{low}) = \frac{P(x_1=\text{medium},\ x_2=\text{low} \mid Y=1) \times P(Y=1)}{P(x_1=\text{medium},\ x_2=\text{low})} = \frac{\frac{1}{27} \times \frac{9}{15}}{\frac{4}{45}} = \frac{1}{4}$$
Since $P(Y=0 \mid x_1=\text{medium},\ x_2=\text{low}) > P(Y=1 \mid x_1=\text{medium},\ x_2=\text{low})$, this person has most likely not been cheated.
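As a small check of the arithmetic, the same posteriors can be reproduced with exact fractions:

```python
from fractions import Fraction as F

evidence = F(4, 45)                    # P(x1=medium, x2=low)
p_y0 = F(1, 6) * F(6, 15) / evidence   # P(Y=0 | x1=medium, x2=low)
p_y1 = F(1, 27) * F(9, 15) / evidence  # P(Y=1 | x1=medium, x2=low)
print(p_y0, p_y1)                      # 3/4 1/4
```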
Code implementation
Implementing it ourselves
```python
import pandas as pd
from collections.abc import Iterable


class NaiveBayes:
    def __init__(self):
        # Prior probabilities
        self.pri = pd.DataFrame()
        # Conditional probabilities
        self.cond = pd.DataFrame()
        # Name of the label column; by default the last column is used
        self.y_col = ''
        # List of feature column names
        self.features = []

    def get_frequency(self, data, name=None, key=None):
        # Relative frequency of each value in the column
        freq = data.value_counts(normalize=True)
        # Set the column name and the index of the returned DataFrame
        name = freq.name if name is None else name
        key = freq.name if key is None else key
        freq = freq.to_frame(name=name)
        freq.index = pd.MultiIndex.from_product([[key], freq.index.tolist()])
        return freq

    def get_priori(self, data):
        # Initialise the result
        result = pd.DataFrame()
        # Iterate over all columns (features and label)
        for column in data.columns:
            # Frequency of each value in this column
            p = self.get_frequency(data[column], name='value', key=column)
            # Append to the result
            result = pd.concat([result, p])
        # Return the prior probabilities
        return result

    def get_conditional(self, data):
        # Initialise the result
        result = None
        # Group the rows by label value
        for cate, cate_index in data.groupby(self.y_col).groups.items():
            # Rows belonging to this label value
            x = data.loc[cate_index, self.features]
            # Frequencies of each feature value within this label value, i.e. P(X=x|Y=y)
            cate_df = None
            # Iterate over the feature columns
            for column in self.features:
                # Frequency of each value of this feature within the group
                freq = self.get_frequency(x[column], name=cate, key=column)
                # Initialise on the first feature, otherwise concatenate
                if cate_df is None:
                    cate_df = freq
                else:
                    cate_df = pd.concat([cate_df, freq])
            # Same logic as above: one column per label value
            if result is None:
                result = cate_df
            else:
                result = pd.concat([result, cate_df], axis=1)
        # Return the conditional probabilities
        return result

    def train(self, data, y_col='y'):
        if y_col not in data.columns:
            print(f'Column of y named {y_col} not in data. '
                  f'It will go by column {data.columns[-1]}')
            y_col = data.columns[-1]
        # Set the label column name
        self.y_col = y_col
        # Get the list of feature column names
        self.features = data.columns.tolist()
        self.features.remove(y_col)
        # Prior probabilities
        self.pri = self.get_priori(data)
        # Conditional probabilities
        self.cond = self.get_conditional(data)
        return

    def forward(self, pred_dict):
        # Each column of the result is one label value; the values are the
        # (unnormalised) probabilities of that label under the given features
        result = None
        # Traverse the feature-value dictionary and accumulate prod P(X_j=x_j|Y=y)
        for key, value in pred_dict.items():
            # Raise an error if the feature or value was never seen in training
            if (key, value) not in self.cond.index:
                raise ValueError('The key or value of input data not exists.')
            # Same logic as in get_conditional: initialise, then multiply
            if result is None:
                result = self.cond.loc[(key, value)].to_frame().T
            else:
                result *= self.cond.loc[(key, value)]
        # Multiply each column by the prior P(Y=y)
        result = result.apply(lambda x: x * self.pri.loc[(self.y_col, x.name)].values[0])
        # Normalise so the posteriors sum to 1
        result /= result.sum(axis=1).values[0]
        return result.iloc[0]

    def predict(self, item):
        # The input must be iterable (dict, list, tuple, ...); otherwise raise an error
        if isinstance(item, Iterable):
            # Raise an error if the number of feature values is wrong
            if len(item) != len(self.features):
                raise ValueError(f'The number of input value is [{len(item)}], '
                                 f'which is not equal to the number of features [{len(self.features)}].')
            # Build the feature-value dictionary
            pred_dict = {}
            # If the input is already a dictionary, use it directly
            if isinstance(item, dict):
                pred_dict = item
            # Otherwise, assign the values in the order of the feature columns
            else:
                for i, v in enumerate(item):
                    if isinstance(v, dict):
                        pred_dict.update(v)
                    else:
                        pred_dict[self.features[i]] = v
            # Return the label value with the largest posterior probability
            return self.forward(pred_dict).idxmax()
        else:
            raise TypeError(f'The type of input data [{type(item)}] is not iterable.')
```
Test (using the table above as the data):
```
Input[]:
model = NaiveBayes()
model.train(data)

x_pred = {'x1': 'medium', 'x2': 'low'}
prediction = model.predict(x_pred)
print(f'Value: {", ".join([f"{key}: {value}" for key, value in x_pred.items()])}. Prediction: {prediction}')
-------------------------------------------------------------------
Output[]:
Value: x1: medium, x2: low. Prediction: 0
```
The prediction result is correct!!!
Using sklearn
```
Input[]:
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

encoders = {}
for column in train_data:
    if train_data[column].dtype == object:
        encoders[column] = preprocessing.LabelEncoder()
        train_data[column] = encoders[column].fit_transform(train_data[column])

model = GaussianNB()
model.fit(train_data[train_data.columns[:-1]], train_data['y'])
model.predict(np.array([encoders[key].transform([value])[0]
                        for key, value in x_pred.items()]).reshape(1, -1))
-------------------------------------------------------------------
Output[]:
array([0], dtype=int64)
```
Here, train_data is the data set from the table above (e.g. a copy of data).
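Since both features here are categorical, sklearn's CategoricalNB is arguably a closer match to the hand calculation than GaussianNB. A minimal sketch of that alternative, reusing the encoders, train_data and x_pred from above (this is a suggestion, not part of the original code):

```python
from sklearn.naive_bayes import CategoricalNB

# Alternative: treat the label-encoded features as categorical rather than Gaussian
cat_model = CategoricalNB(alpha=1.0)   # alpha > 0 adds Laplace smoothing
cat_model.fit(train_data[train_data.columns[:-1]], train_data['y'])
cat_model.predict(np.array([encoders[key].transform([value])[0]
                            for key, value in x_pred.items()]).reshape(1, -1))
# Expected to predict class 0 for this example as well
```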
Ending
The example above shows how to use the naive Bayes algorithm for classification.
If you want to learn Python together, you can send me a private message to join the group.
That is all I wanted to share. My knowledge is still limited, so there are bound to be shortcomings; corrections are welcome.
If you have any questions, you can also leave a message in the comment section.