2. ALiPy tool classes
This section further introduces the ALiPy tool classes, covering data manipulation, query strategies, the Oracle, index management, performance measurement, and so on.
Reference link: http://parnec.nuaa.edu.cn/huangsj/alipy/
2.1 Data manipulation
For researchers, what is the first step of an experiment? Data preprocessing, and that is exactly what alipy.data_manipulate does.
However, collecting and digitizing the data still has to be done by yourself.
The module provides the following functions:
1) Data splitting: obtain the training set, test set, labeled set and unlabeled set indices;
2) Feature scaling.
2.1.1 Data splitting
ALiPy offers several ways to split a data set, e.g., by instance or for multi-label data sets. Only the simplest one is shown here:
import alipy as ap
import numpy as np
from sklearn.datasets import load_iris

np.random.seed(1)  # fix the random seed so the results are always the same


def test():
    """By instance"""
    data = load_iris()
    X, Y = data["data"], data["target"]
    tool = ap.ToolBox(X=X, y=Y)
    idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1,
                                                       initial_label_rate=0.1,
                                                       split_count=1)
    print("Training set index:", idx_tr)
    print("Test set index:", idx_te)
    print("Labeled set index:", idx_lab)
    print("Unlabeled set index:", idx_unlab)


if __name__ == '__main__':
    test()
The output is as follows:
Training set index: [array([ 14, 98, 75, 16, 131, 56, 141, 44, 29, 120, 94, 5, 102, 51, 78, 42, 92, 66, 31, 35, 90, 84, 77, 40, 125, 99, 33, 19, 73, 146, 91, 135, 69, 128, 114, 48, 53, 28, 54, 108, 112, 17, 119, 103, 58, 118, 18, 4, 45, 59, 39, 36, 117, 139, 107, 132, 126, 85, 122, 95, 11, 113, 123, 12, 2, 104, 6, 127, 110, 65, 55, 144, 138, 46, 62, 74, 116, 93, 100, 89, 10, 34, 32, 124, 38, 83, 111, 149, 27, 23, 67, 9, 130, 97, 105, 145, 87, 148, 109, 64, 15, 82, 41, 80, 52, 26, 76, 43, 24, 136, 121, 143, 49, 21, 70, 3, 142, 30, 147, 106, 47, 115, 13, 88, 8, 81, 60, 0, 1, 57, 22, 61, 63, 7, 86])]
Test set index: [array([ 96, 68, 50, 101, 20, 25, 134, 71, 129, 79, 133, 137, 72, 140, 37])]
Labeled set index: [array([ 14, 98, 75, 16, 131, 56, 141, 44, 29, 120, 94, 5, 102, 51])]
Unlabeled set index: [array([ 78, 42, 92, 66, 31, 35, 90, 84, 77, 40, 125, 99, 33, 19, 73, 146, 91, 135, 69, 128, 114, 48, 53, 28, 54, 108, 112, 17, 119, 103, 58, 118, 18, 4, 45, 59, 39, 36, 117, 139, 107, 132, 126, 85, 122, 95, 11, 113, 123, 12, 2, 104, 6, 127, 110, 65, 55, 144, 138, 46, 62, 74, 116, 93, 100, 89, 10, 34, 32, 124, 38, 83, 111, 149, 27, 23, 67, 9, 130, 97, 105, 145, 87, 148, 109, 64, 15, 82, 41, 80, 52, 26, 76, 43, 24, 136, 121, 143, 49, 21, 70, 3, 142, 30, 147, 106, 47, 115, 13, 88, 8, 81, 60, 0, 1, 57, 22, 61, 63, 7, 86])]
From the results, we can see that the union of the labeled set and the unlabeled set is exactly the training set, which is consistent with the active learning setting; the quick check below confirms this.
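A quick check of this statement (a sketch that reuses the idx_tr, idx_lab and idx_unlab variables returned above):

import numpy as np

# the labeled and unlabeled indices are disjoint ...
assert len(np.intersect1d(idx_lab[0], idx_unlab[0])) == 0
# ... and together they make up exactly the training indices
assert set(idx_lab[0]) | set(idx_unlab[0]) == set(idx_tr[0])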
In particular, split_count here controls the number of splits. When it is greater than 1, it is equivalent to specifying the k in k-fold cross validation: split_count independent splits are generated, as in the sketch below.
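A minimal sketch of split_count > 1, assuming the same iris data and ToolBox as above; each returned value is then a list containing one index array per split:

import alipy as ap
import numpy as np
from sklearn.datasets import load_iris

np.random.seed(1)

X, Y = load_iris(return_X_y=True)
tool = ap.ToolBox(X=X, y=Y)
idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1,
                                                   initial_label_rate=0.1,
                                                   split_count=5)
print(len(idx_tr))  # 5 -- one training index array per split
print(idx_lab[2])   # labeled indices of the third split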
2.1.2 Feature scaling
This covers min-max scaling and standard scaling:
1) Min-max scaling maps the attribute values of the dataset into [lower, upper]:
import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import minmax_scale

np.random.seed(1)  # fix the random seed so the results are always the same


def test():
    """By instance"""
    X, _ = load_iris(return_X_y=True)
    # The range is usually limited to [0, 1]
    X = minmax_scale(X=X, feature_range=(0, 1))
    print(X[:5])


if __name__ == '__main__':
    test()
The output is as follows:
[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]]
The reason for this operation is easy to understand. Suppose the attributes of your samples are height (cm), age (years) and code size (lines of code); let's not debate the choice of attributes for the moment.
When you compute the distance between two such samples, the code size may be in the tens of thousands while the height can hardly exceed 200, so the code size dominates the result. Is such a distance credible? The small example below illustrates the problem.
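A small illustration of this point (the numbers are hypothetical, made up only for this sketch):

import numpy as np
from alipy.data_manipulate import minmax_scale

# two samples: [height (cm), age (years), code size (lines)]
samples = np.array([[170.0, 25.0, 30000.0],
                    [180.0, 60.0, 31000.0]])

# without scaling, the code-size column dominates the Euclidean distance
raw_dist = np.linalg.norm(samples[0] - samples[1])
# after min-max scaling, all three attributes contribute on the same scale
scaled = minmax_scale(X=samples, feature_range=(0, 1))
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])

print(raw_dist)     # roughly 1000, almost entirely from the code size
print(scaled_dist)  # about 1.73, every attribute now weighs in equally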
2) Standard scaling subtracts the column mean from each attribute value and then divides the result by the corresponding standard deviation:
import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale

np.random.seed(1)  # fix the random seed so the results are always the same


def test():
    """By instance"""
    X, _ = load_iris(return_X_y=True)
    X = StandardScale(X=X)
    print(X[:5])


if __name__ == '__main__':
    test()
The output is as follows:
[[-1.09133619  2.34571507 -0.76175027 -1.73154806]
 [-1.38496925 -0.30381249 -0.76175027 -1.73154806]
 [-1.67860231  0.75599853 -0.7940552  -1.73154806]
 [-1.82541883  0.22609302 -0.72944534 -1.73154806]
 [-1.23815272  2.87562058 -0.76175027 -1.73154806]]
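A quick sanity check (a sketch, assuming StandardScale performs the usual per-column standardization described above): after scaling, every attribute column should have a mean close to 0 and a standard deviation close to 1.

import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale

X, _ = load_iris(return_X_y=True)
X = StandardScale(X=X)
print(np.round(X.mean(axis=0), 6))  # expected to be approximately [0. 0. 0. 0.]
print(np.round(X.std(axis=0), 6))   # expected to be approximately [1. 1. 1. 1.]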
2.2 Query strategies
In short, a query strategy decides how to select the instances to query from the unlabeled set. ALiPy implements seven different query strategies; for details, see: https://blog.csdn.net/weixin_44575152/article/details/100783835
2.2.1 Selection using the built-in logistic regression model
import alipy as ap
import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale
from alipy.query_strategy import QueryInstanceQBC

np.random.seed(1)  # fix the random seed so the results are always the same


def test():
    """By instance"""
    X, Y = load_iris(return_X_y=True)
    X = StandardScale(X=X)
    # Initialize the toolbox
    tool = ap.ToolBox(X=X, y=Y)
    idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1,
                                                       initial_label_rate=0.1,
                                                       split_count=1)
    # Specify the query strategy; different strategies take different hyperparameters
    qbc = QueryInstanceQBC(X, Y)
    # Use the default model (logistic regression)
    model = tool.get_default_model()
    model.fit(X[idx_lab[0]], Y[idx_lab[0]])
    # Here is the point --> get the query result
    idx_query = qbc.select(idx_lab[0], idx_unlab[0], batch_size=10, model=model)
    print(idx_lab[0])
    print(idx_query)


if __name__ == '__main__':
    test()
The output is as follows:
[ 14  98  75  16 131  56 141  44  29 120  94   5 102  51]
[123 128 126 121 138 149 114  85 111 103]
The first printed line is the current labeled set index and the second is the query result. The query is conducted within the unlabeled set, and the selected samples can then be labeled and used for subsequent model training.
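Putting the pieces together, a minimal active learning loop might look like the following sketch (assuming the X, Y, model, qbc, idx_lab and idx_unlab variables from the example above; IndexCollection is ALiPy's container for managing index sets):

from alipy.index import IndexCollection

lab_ind = IndexCollection(idx_lab[0])
unlab_ind = IndexCollection(idx_unlab[0])

for _ in range(5):  # five query rounds
    # retrain on the current labeled pool
    model.fit(X[lab_ind.index], Y[lab_ind.index])
    # query a batch from the unlabeled pool
    queried = qbc.select(lab_ind, unlab_ind, batch_size=10, model=model)
    # move the queried samples from the unlabeled pool to the labeled pool
    lab_ind.update(queried)
    unlab_ind.difference_update(queried)
    print(len(lab_ind), len(unlab_ind))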
2.2.2 Selection based on a prediction probability matrix
import alipy as ap
import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale
from alipy.query_strategy import QueryInstanceQBC

np.random.seed(1)  # fix the random seed so the results are always the same


def test():
    """By instance"""
    X, Y = load_iris(return_X_y=True)
    X = StandardScale(X=X)
    # Initialize the toolbox
    tool = ap.ToolBox(X=X, y=Y)
    idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1,
                                                       initial_label_rate=0.1,
                                                       split_count=1)
    # Specify the query strategy; different strategies take different hyperparameters
    qbc = QueryInstanceQBC(X, Y)
    # Use the default model
    model = tool.get_default_model()
    model.fit(X[idx_lab[0]], Y[idx_lab[0]])
    # =============================================
    # This is the part that differs from the previous example:
    # the model predicts class probabilities for the unlabeled set (a matrix),
    # and the strategy selects from that matrix instead of calling the model itself
    predict_lab = model.predict_proba(X[idx_unlab[0]])
    idx_query = qbc.select_by_prediction_mat(idx_unlab[0], predict_lab,
                                             batch_size=10)
    print(idx_query)
    # =============================================


if __name__ == '__main__':
    test()
The output is as follows:
[78 42 92]
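The benefit of select_by_prediction_mat is that any model exposing class probabilities can drive the query, not only ALiPy's default one. Below is a sketch that swaps in a scikit-learn classifier (RandomForestClassifier is just an illustrative choice; it reuses the X, Y, qbc, idx_lab and idx_unlab variables from the example above):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1)
rf.fit(X[idx_lab[0]], Y[idx_lab[0]])
# class probability matrix for the unlabeled set, shape (n_unlabeled, n_classes)
proba = rf.predict_proba(X[idx_unlab[0]])
idx_query = qbc.select_by_prediction_mat(idx_unlab[0], proba, batch_size=10)
print(idx_query)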
2.3 Oracle