Alipy package for active learning: detailed explanation of tools

Posted by jeankaleb on Mon, 17 Jan 2022 01:36:21 +0100

2. Alipy tool class

  this section takes a closer look at the Alipy tool classes, including data manipulation, query strategies, the Oracle, index management, performance measurement, etc.
  reference link: http://parnec.nuaa.edu.cn/huangsj/alipy/

2.1 data operation

  for us researchers, what is the first step of an experiment? Data preprocessing. That is exactly what alipy.data_manipulate does.
  however, data collection and digitization still need to be done by yourself.
  the functions of this module are as follows:
   1) data division: obtain training set index, test set index, labeled set index and unlabeled set index;
   2) feature scaling.

2.1.1 data division

There are several ways to split a dataset in Alipy, e.g. splitting by instance or splitting multi-label datasets. Only the simplest, instance-based split is shown here:

import alipy as ap
import numpy as np
from sklearn.datasets import load_iris
np.random.seed(1)  # fix the random seed so the results are reproducible


def test():
    """
    By instance
    """
    data = load_iris()
    X, Y = data["data"], data["target"]
    tool = ap.ToolBox(X=X, y=Y)
    idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1, initial_label_rate=0.1, split_count=1)
    print("Training set index:", idx_tr)
    print("Test set index:", idx_te)
    print("Tagged set index:", idx_lab)
    print("No label set index:", idx_unlab)


if __name__ == '__main__':
    test()

  the output is as follows:

Training set index: [array([ 14,  98,  75,  16, 131,  56, 141,  44,  29, 120,  94,   5, 102,
        51,  78,  42,  92,  66,  31,  35,  90,  84,  77,  40, 125,  99,
        33,  19,  73, 146,  91, 135,  69, 128, 114,  48,  53,  28,  54,
       108, 112,  17, 119, 103,  58, 118,  18,   4,  45,  59,  39,  36,
       117, 139, 107, 132, 126,  85, 122,  95,  11, 113, 123,  12,   2,
       104,   6, 127, 110,  65,  55, 144, 138,  46,  62,  74, 116,  93,
       100,  89,  10,  34,  32, 124,  38,  83, 111, 149,  27,  23,  67,
         9, 130,  97, 105, 145,  87, 148, 109,  64,  15,  82,  41,  80,
        52,  26,  76,  43,  24, 136, 121, 143,  49,  21,  70,   3, 142,
        30, 147, 106,  47, 115,  13,  88,   8,  81,  60,   0,   1,  57,
        22,  61,  63,   7,  86])]
Test set index: [array([ 96,  68,  50, 101,  20,  25, 134,  71, 129,  79, 133, 137,  72,
       140,  37])]
Tagged set index: [array([ 14,  98,  75,  16, 131,  56, 141,  44,  29, 120,  94,   5, 102,
        51])]
No label set index: [array([ 78,  42,  92,  66,  31,  35,  90,  84,  77,  40, 125,  99,  33,
        19,  73, 146,  91, 135,  69, 128, 114,  48,  53,  28,  54, 108,
       112,  17, 119, 103,  58, 118,  18,   4,  45,  59,  39,  36, 117,
       139, 107, 132, 126,  85, 122,  95,  11, 113, 123,  12,   2, 104,
         6, 127, 110,  65,  55, 144, 138,  46,  62,  74, 116,  93, 100,
        89,  10,  34,  32, 124,  38,  83, 111, 149,  27,  23,  67,   9,
       130,  97, 105, 145,  87, 148, 109,  64,  15,  82,  41,  80,  52,
        26,  76,  43,  24, 136, 121, 143,  49,  21,  70,   3, 142,  30,
       147, 106,  47, 115,  13,  88,   8,  81,  60,   0,   1,  57,  22,
        61,  63,   7,  86])]

  through the results, it can be seen that the union of the labeled set index and the unlabeled set index is exactly the training set index, which is consistent with the active learning setting.
  in particular, split_count controls the number of partitions. When it is greater than 1, it is equivalent to specifying the number k in k-fold cross validation (kCV): you get k independent splits of the same data.

2.1.2 feature scaling

  Alipy includes min-max scaling and standard scaling:
   1) min-max scaling maps the attribute values of each column of the dataset into [lower, upper]:

import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import minmax_scale
np.random.seed(1)  # fix the random seed so the results are reproducible


def test():
    """
    By instance
    """
    X, _ = load_iris(return_X_y=True)
    # The range is usually limited to [0, 1]
    X = minmax_scale(X=X, feature_range=(0, 1))
    print(X[:5])


if __name__ == '__main__':
    test()

  the output is as follows:

[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]]

  the reason for this operation is easy to understand. Suppose, for example, that the sample attributes at hand are height (cm), age (years) and code size (lines); let's not question this choice of attributes for the moment.
  when you compute the distance between two such samples, the code size may be in the tens of thousands while the height can hardly exceed 200, so the raw distance is dominated by a single attribute. Is such a distance credible?
   2) standard scaling subtracts the column mean from each attribute value and then divides the result by the corresponding standard deviation:

import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale
np.random.seed(1)  # fix the random seed so the results are reproducible


def test():
    """
    By instance
    """
    X, _ = load_iris(return_X_y=True)
    X = StandardScale(X=X)
    print(X[:5])


if __name__ == '__main__':
    test()

  the output is as follows:

[[-1.09133619  2.34571507 -0.76175027 -1.73154806]
 [-1.38496925 -0.30381249 -0.76175027 -1.73154806]
 [-1.67860231  0.75599853 -0.7940552  -1.73154806]
 [-1.82541883  0.22609302 -0.72944534 -1.73154806]
 [-1.23815272  2.87562058 -0.76175027 -1.73154806]]
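  the effect of both scalings can be checked numerically. The sketch below uses plain NumPy and made-up height/age/code-size samples: the raw Euclidean distance is dominated by the code-size column, while after standard scaling every column has mean 0 and standard deviation 1, so no single attribute dominates:

```python
import numpy as np

# Hypothetical samples: [height (cm), age (years), lines of code]
a = np.array([170.0, 25.0, 30000.0])
b = np.array([185.0, 40.0, 12000.0])

# Raw Euclidean distance: the code-size column swamps the other two
raw = np.linalg.norm(a - b)
print(round(raw))  # 18000 -- height and age barely contribute

# Standard scaling of a small sample matrix:
# subtract each column's mean, then divide by its standard deviation
X = np.array([[170.0, 25.0, 30000.0],
              [185.0, 40.0, 12000.0],
              [160.0, 30.0, 45000.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column now has mean ~0 and std ~1
print(np.allclose(X_std.mean(axis=0), 0.0, atol=1e-12))  # True
print(np.allclose(X_std.std(axis=0), 1.0))               # True
```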

2.2 query strategy

   in short, a query strategy decides how to select the instances to query from the unlabeled set. Alipy implements seven different query strategies; for details, see: https://blog.csdn.net/weixin_44575152/article/details/100783835

2.2.1 selection using built-in logistic regression model

import alipy as ap
import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale
from alipy.query_strategy import QueryInstanceQBC
np.random.seed(1)  # fix the random seed so the results are reproducible


def test():
    """
    By instance
    """
    X, Y = load_iris(return_X_y=True)
    X = StandardScale(X=X)
    # Initialize toolbox
    tool = ap.ToolBox(X=X, y=Y)
    idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1, initial_label_rate=0.1, split_count=1)
    # Specify the query strategy; of course, different query strategies take different hyperparameters
    qbc = QueryInstanceQBC(X, Y)
    # Use default model
    model = tool.get_default_model()
    model.fit(X[idx_lab[0]], Y[idx_lab[0]])
    """Here's the point --> Get query results"""
    idx_query = qbc.select(idx_lab[0], idx_unlab[0], batch_size=10, model=model)
    print(idx_lab[0])
    print(idx_query)


if __name__ == '__main__':
    test()

  the output is as follows:

[ 14  98  75  16 131  56 141  44  29 120  94   5 102  51]
[123 128 126 121 138 149 114  85 111 103]

   the query will be conducted in the unlabeled set, and the selected samples can be used for subsequent model training.
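   after each query, the selected indices are typically moved from the unlabeled pool to the labeled pool before retraining. A minimal sketch of this bookkeeping with plain NumPy arrays and hypothetical indices (alipy also provides an IndexCollection helper class for the same purpose):

```python
import numpy as np

idx_lab = np.array([14, 98, 75, 16])        # current labeled pool
idx_unlab = np.array([78, 42, 92, 66, 31])  # current unlabeled pool
idx_query = np.array([42, 31])              # hypothetical query result

# Move the queried indices: append to labeled, remove from unlabeled
idx_lab = np.concatenate([idx_lab, idx_query])
idx_unlab = np.setdiff1d(idx_unlab, idx_query)  # note: returns a sorted array

print(idx_lab)    # [14 98 75 16 42 31]
print(idx_unlab)  # [66 78 92]
```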

2.2.2 selection of prediction probability

import alipy as ap
import numpy as np
from sklearn.datasets import load_iris
from alipy.data_manipulate import StandardScale
from alipy.query_strategy import QueryInstanceQBC
np.random.seed(1)  # fix the random seed so the results are reproducible


def test():
    """
    By instance
    """
    X, Y = load_iris(return_X_y=True)
    X = StandardScale(X=X)
    # Initialize toolbox
    tool = ap.ToolBox(X=X, y=Y)
    idx_tr, idx_te, idx_lab, idx_unlab = tool.split_AL(test_ratio=0.1, initial_label_rate=0.1, split_count=1)
    # Specify the query strategy; of course, different query strategies take different hyperparameters
    qbc = QueryInstanceQBC(X, Y)
    # Use default model
    model = tool.get_default_model()
    model.fit(X[idx_lab[0]], Y[idx_lab[0]])
    """============================================="""
    # From now on, the parts that differ from the previous example are marked between these separator lines
    # The model predicts the class probabilities for the unlabeled set; the result is a matrix
    predict_lab = model.predict_proba(X[idx_unlab[0]])
    idx_query = qbc.select_by_prediction_mat(idx_unlab[0], predict_lab, batch_size=10)
    print(idx_query)
    """============================================="""


if __name__ == '__main__':
    test()

  the output is as follows:

[78 42 92]
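   to see where such a query fits, here is a sketch of one complete active learning loop. It deliberately avoids alipy and uses plain scikit-learn with least-confidence sampling; it is not one of the examples above, just an illustration of the loop that the queried indices feed into:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

np.random.seed(1)
X, Y = load_iris(return_X_y=True)

# Start with 10 labeled instances; the rest form the unlabeled pool
idx_all = np.random.permutation(len(Y))
idx_lab, idx_unlab = idx_all[:10], idx_all[10:]

for _ in range(5):  # 5 query rounds, 1 instance per round
    model = LogisticRegression(max_iter=1000)
    model.fit(X[idx_lab], Y[idx_lab])
    proba = model.predict_proba(X[idx_unlab])
    # Least confidence: query the instance whose top class probability is lowest
    pick = np.argmin(proba.max(axis=1))
    idx_lab = np.append(idx_lab, idx_unlab[pick])
    idx_unlab = np.delete(idx_unlab, pick)

print(len(idx_lab), len(idx_unlab))  # 15 135
```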

2.3 Oracle

