Feature selection algorithm in Spark ML

Posted by beselabios on Mon, 07 Mar 2022 09:40:15 +0100

Feature selection is the process of picking the most useful features out of a feature vector to form a new, more compact feature vector. It is widely used in high-dimensional data analysis, where it removes redundant and irrelevant features and can improve the performance of the learner.

1, VectorSlicer

VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector containing a sub-array of the original features. It is useful for extracting features from a vector column.

VectorSlicer accepts a vector column with specified indices and outputs a new vector column whose values are selected via those indices. There are two types of indices:

  • Integer indices that represent positions in the vector, set with setIndices().
  • String indices that represent feature names in the vector, set with setNames(). This requires the vector column to have an AttributeGroup, because the implementation matches on the name field of each Attribute.

Specifying features by integer or by string is equally acceptable, and integer indices and string names can be used together. At least one feature must be selected. Duplicate features are not allowed, so the selected indices and names must not overlap. Note that if feature names are selected, an exception is thrown when empty input attributes are encountered.

// Feature selection: VectorSlicer
// This class takes a feature vector and outputs a new feature vector containing a sub-array of the original features.
// A subset of features can be specified by indices (setIndices()) or by names (setNames()). At least one feature must be selected. Duplicate features are not allowed, so the selected indices and names must not overlap.
// The output vector orders features with the selected indices first (in the order given), followed by the selected names (in the order given).
import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

// Default numeric attribute (a numeric attribute can carry optional summary statistics).
val defaultAttr = NumericAttribute.defaultAttr
// withName returns a copy of the attribute with a new name.
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features")
  // Indices of the features to select from the vector column. Must not overlap with names. Default: empty array.
  .setIndices(Array(1))
  // Names of the features to select from the vector column. The names must be specified by ML attributes (org.apache.spark.ml.attribute.Attribute). Must not overlap with indices. Default: empty array.
  .setNames(Array("f3"))
// Equivalent to slicer.setIndices(Array(1, 2))
// or to slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)


|userFeatures        |features     |
|(3,[0,1],[-2.0,2.3])|(2,[0],[2.3])|
|[-2.0,2.3,0.0]      |[2.3,0.0]    |
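
The ordering and no-overlap rules described above can be illustrated without Spark. The following is only a minimal sketch in plain Scala: SliceSketch and its slice function are hypothetical names invented for this illustration, and the code mimics the documented behaviour rather than reproducing Spark's implementation.

```scala
// Hypothetical helper, not part of Spark: applies the VectorSlicer selection
// rules to a plain array of values that has one feature name per position.
object SliceSketch {
  def slice(
      values: Array[Double],
      featureNames: Array[String],
      indices: Array[Int],
      names: Array[String]): Array[Double] = {
    require(indices.nonEmpty || names.nonEmpty, "select at least one feature")
    // Resolve each selected name to its position in the vector.
    val nameIndices = names.map { n =>
      val i = featureNames.indexOf(n)
      require(i >= 0, s"unknown feature name: $n")
      i
    }
    // Selected indices come first, then selected names, each in the order given.
    val selected = indices ++ nameIndices
    require(selected.distinct.length == selected.length,
      "duplicate features: indices and names must not overlap")
    selected.map(i => values(i))
  }

  def main(args: Array[String]): Unit = {
    val values = Array(-2.0, 2.3, 0.0)
    val featureNames = Array("f1", "f2", "f3")
    // Index 1 first, then the feature named "f3".
    println(slice(values, featureNames, Array(1), Array("f3")).mkString("[", ",", "]"))
    // Index 2 first, then the feature named "f1".
    println(slice(values, featureNames, Array(2), Array("f1")).mkString("[", ",", "]"))
  }
}
```

Calling slice with Array(1) and Array("f2") fails the overlap check, because both refer to the same position.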


2, RFormula

RFormula selects the columns specified by an R model formula. Currently only a limited subset of the R operators is supported, including "~", ".", ":", "+" and "-". The basic operators are:

  • ~ separates the target from the terms
  • + concatenates terms; "+ 0" means removing the intercept
  • - removes a term; "- 1" means removing the intercept
  • : interaction (multiplication for numeric values, or binarized categorical values)
  • . all columns except the target
  • * factor crossing, which includes the terms and the interactions between them
  • ^ factor crossing to a specified degree

Assuming that a and b are double columns, the following simple examples illustrate the effect of RFormula:

  • y ~ a + b represents the model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1 and w2 are the coefficients.
  • y ~ a + b + a:b - 1 represents the model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2 and w3 are the coefficients.
  • y ~ a * b represents the model y ~ w0 + w1 * a + w2 * b + w3 * a * b, where w0 is the intercept and w1, w2 and w3 are the coefficients.
  • y ~ (a + b)^2 represents the model y ~ w0 + w1 * a + w2 * b + w3 * a * b, where w0 is the intercept and w1, w2 and w3 are the coefficients.
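
To make the correspondence between a formula and its model concrete, here is a small worked sketch in plain Scala, with no Spark involved. FormulaSketch, Design and the weights are made up for this illustration; the code only evaluates the models listed above for one pair of inputs and does not show how RFormula itself is implemented.

```scala
// Plain-Scala sketch (no Spark): each formula becomes a list of feature values
// plus a flag saying whether the model has an intercept.
object FormulaSketch {
  final case class Design(features: Array[Double], hasIntercept: Boolean)

  // y ~ a + b
  def additive(a: Double, b: Double): Design =
    Design(Array(a, b), hasIntercept = true)

  // y ~ a + b + a:b - 1
  def interactionNoIntercept(a: Double, b: Double): Design =
    Design(Array(a, b, a * b), hasIntercept = false)

  // y ~ a * b and y ~ (a + b)^2 both expand to a + b + a:b
  def crossed(a: Double, b: Double): Design =
    Design(Array(a, b, a * b), hasIntercept = true)

  // Evaluate the linear model: optional intercept w0 plus the weighted features.
  def predict(d: Design, w0: Double, w: Array[Double]): Double = {
    require(w.length >= d.features.length, "need one weight per feature")
    val weighted = d.features.zip(w).map { case (x, wi) => x * wi }.sum
    (if (d.hasIntercept) w0 else 0.0) + weighted
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = (2.0, 3.0)
    val w0 = 1.0
    val w = Array(0.5, 0.25, 0.125) // hypothetical w1, w2, w3
    println(predict(additive(a, b), w0, w))               // 1 + 0.5*2 + 0.25*3
    println(predict(interactionNoIntercept(a, b), w0, w)) // 0.5*2 + 0.25*3 + 0.125*6
    println(predict(crossed(a, b), w0, w))                // 1 + 0.5*2 + 0.25*3 + 0.125*6
  }
}
```

The only difference between the second and third model is the intercept flag, which is exactly what the "- 1" term controls.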

RFormula produces a vector column of features and a double or string column of labels. As when formulas are used in R for linear regression, string input columns are one-hot encoded and numeric columns are cast to doubles.

If the label column is of string type, it is first converted to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column is created from the response variable specified in the formula.
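
The string-label case can be tried out with a variation of the click data in which the response holds strings. This is a sketch under assumptions: the "yes"/"no" values, the object name RFormulaStringLabelSketch and the local SparkSession settings are invented for illustration. The RFormula calls themselves (setFormula, setFeaturesCol, setLabelCol, fit, transform) are the standard ones.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

object RFormulaStringLabelSketch {
  // Builds a tiny DataFrame whose response column "clicked" is a string,
  // then lets RFormula derive the features and the indexed label.
  def run(spark: SparkSession): DataFrame = {
    val df = spark.createDataFrame(Seq(
      (7, "US", 18, "yes"),
      (8, "CA", 12, "no"),
      (9, "NZ", 15, "no")
    )).toDF("id", "country", "hour", "clicked")

    new RFormula()
      .setFormula("clicked ~ country + hour")
      .setFeaturesCol("features")
      .setLabelCol("label")
      .fit(df)
      .transform(df)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RFormulaStringLabelSketch")
      .getOrCreate()
    val output = run(spark)
    output.printSchema() // the label column is a double even though clicked is a string
    output.select("clicked", "features", "label").show(false)
    spark.stop()
  }
}
```

For a numeric response, the same indexing can be requested explicitly with setForceIndexLabel(true), which is useful when the formula feeds a classification algorithm.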


// Feature selection: RFormula
// RFormula selects the columns specified by an R model formula.
import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

val formula = new RFormula()
  // The R formula, provided as a string.
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")
  // Whether to force indexing of the label regardless of whether it is numeric or string. Usually the label is indexed only when it is a string. If the formula is used by a classification algorithm, set this to true to force indexing even of a numeric label. Default: false.
  //   .setForceIndexLabel(false)
  // How StringIndexer orders the categories of a string feature column. The last category after ordering is dropped when encoding strings. Supported options: "frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc". Default: "frequencyDesc". With "alphabetDesc", RFormula drops the same category as R when encoding strings.
  //   .setStringIndexerOrderType("frequencyDesc")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

|      features|label|
|[0.0,0.0,18.0]|  1.0|
|[1.0,0.0,12.0]|  0.0|
|[0.0,1.0,15.0]|  0.0|


3, ChiSqSelector

ChiSqSelector performs chi-squared feature selection: it selects categorical features to use for predicting a categorical label. The selector supports five selection methods: numTopFeatures, percentile, fpr, fdr and fwe.

  • numTopFeatures: selects a fixed number of top features according to a chi-squared test.
  • percentile: similar to numTopFeatures, but selects a fraction of all features instead of a fixed number.
  • fpr: selects all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
  • fdr: selects all features whose false discovery rate is below a threshold.
  • fwe: selects all features whose p-value is below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is numTopFeatures, with the default number of top features set to 50.
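
The five rules above differ only in how they turn per-feature p-values into a set of selected features. The following plain-Scala sketch illustrates this on made-up p-values. SelectionRulesSketch and its functions are hypothetical names, and the fdr rule is written as the Benjamini-Hochberg procedure, which is my understanding of what Spark uses; treat that as an assumption, not as a copy of Spark's code.

```scala
// Plain-Scala sketch (no Spark). Every function takes one p-value per feature
// and returns the selected feature indices in ascending order.
object SelectionRulesSketch {
  // Feature indices ordered by ascending p-value (stable for ties).
  private def byPValue(p: Array[Double]): Array[Int] =
    p.indices.sortBy(i => p(i)).toArray

  // Keep the k features with the smallest p-values.
  def numTopFeatures(p: Array[Double], k: Int): Array[Int] =
    byPValue(p).take(k).sorted

  // Keep a fraction of all features, again those with the smallest p-values.
  def percentile(p: Array[Double], fraction: Double): Array[Int] =
    byPValue(p).take((p.length * fraction).toInt).sorted

  // Keep every feature whose p-value is below alpha.
  def fpr(p: Array[Double], alpha: Double): Array[Int] =
    p.indices.filter(i => p(i) < alpha).toArray

  // Same as fpr, but with the threshold scaled by 1/numFeatures.
  def fwe(p: Array[Double], alpha: Double): Array[Int] =
    p.indices.filter(i => p(i) < alpha / p.length).toArray

  // Benjamini-Hochberg (assumed): find the largest 1-based rank k with
  // p_(k) <= alpha * k / n and keep the k smallest p-values.
  def fdr(p: Array[Double], alpha: Double): Array[Int] = {
    val ordered = byPValue(p)
    val passing = ordered.indices.filter(r => p(ordered(r)) <= alpha * (r + 1) / p.length)
    if (passing.isEmpty) Array.empty[Int] else ordered.take(passing.max + 1).sorted
  }

  def main(args: Array[String]): Unit = {
    val p = Array(0.001, 0.03, 0.02, 0.2) // hypothetical p-values for 4 features
    println("numTopFeatures(2): " + numTopFeatures(p, 2).mkString(","))
    println("percentile(0.5):   " + percentile(p, 0.5).mkString(","))
    println("fpr(0.05):         " + fpr(p, 0.05).mkString(","))
    println("fdr(0.05):         " + fdr(p, 0.05).mkString(","))
    println("fwe(0.05):         " + fwe(p, 0.05).mkString(","))
  }
}
```

With these p-values, fwe is the strictest rule (its threshold is 0.05 / 4 = 0.0125), while fpr and fdr both keep three features.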

// Feature selection: ChiSqSelector
// ChiSqSelector stands for chi-squared feature selection. It operates on labeled data with categorical features.
// ChiSqSelector uses the chi-squared test of independence to decide which features to select.
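import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
// Encoders for createDataset; already in scope in spark-shell, where spark is the active SparkSession.
import spark.implicits._

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
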
val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  // Selector type. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
  .setSelectorType("numTopFeatures")
  // Number of features to select, ordered by ascending p-value. If there are fewer features than numTopFeatures, all features are selected. Only applies when selectorType = "numTopFeatures". Default: 50.
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
// Upper bound of the expected false discovery rate. Only applies when selectorType = "fdr". Default: 0.05.
//   .setFdr(0.05)

// Fraction of features to select, ordered by descending statistic value. Only applies when selectorType = "percentile". Default: 0.1.
//   .setPercentile(0.1)

// Highest p-value for a feature to be kept. Only applies when selectorType = "fpr". Default: 0.05.
//   .setFpr(0.05)

// Upper bound of the expected family-wise error rate. Only applies when selectorType = "fwe". Default: 0.05.
//   .setFwe(0.05)

val result = selector.fit(df).transform(df)

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()

ChiSqSelector output with top 1 features selected
| id|          features|clicked|selectedFeatures|
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
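
To see what the chi-squared test measures on this small data set, the Pearson statistic can be computed by hand for each feature column against the clicked label. The sketch below is plain Scala with no Spark. ChiSqSketch and chiSq are hypothetical names, and treating every distinct value as its own category is an assumption about how the test handles the features. It illustrates the statistic only and is not the selector's implementation.

```scala
// Plain-Scala sketch (no Spark): Pearson chi-squared statistic for one
// categorical feature against a categorical label.
object ChiSqSketch {
  // Returns (statistic, degrees of freedom).
  def chiSq(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
    require(feature.length == label.length && feature.nonEmpty, "need paired, non-empty columns")
    val n = feature.length.toDouble
    // Marginal counts per feature value and per label value.
    val featureCounts = feature.groupBy(identity).map { case (v, xs) => v -> xs.length.toDouble }
    val labelCounts = label.groupBy(identity).map { case (v, xs) => v -> xs.length.toDouble }
    // Observed counts per (feature value, label value) cell.
    val observed = feature.zip(label).groupBy(identity).map { case (k, xs) => k -> xs.length.toDouble }
    // Sum (observed - expected)^2 / expected over every cell of the contingency table.
    val statistic = (for {
      (f, fCount) <- featureCounts.toSeq
      (l, lCount) <- labelCounts.toSeq
    } yield {
      val expected = fCount * lCount / n
      val diff = observed.getOrElse((f, l), 0.0) - expected
      diff * diff / expected
    }).sum
    val dof = (featureCounts.size - 1) * (labelCounts.size - 1)
    (statistic, dof)
  }

  def main(args: Array[String]): Unit = {
    val clicked = Seq(1.0, 0.0, 0.0)
    // The four feature columns of the example data, one Seq per feature.
    val columns = Seq(
      Seq(0.0, 0.0, 1.0),
      Seq(0.0, 1.0, 0.0),
      Seq(18.0, 12.0, 15.0),
      Seq(1.0, 0.0, 0.1)
    )
    columns.zipWithIndex.foreach { case (col, i) =>
      val (stat, dof) = chiSq(col, clicked)
      println(s"feature $i: statistic = $stat, degrees of freedom = $dof")
    }
  }
}
```

Under these assumptions, the first two features give a statistic of about 0.75 with 1 degree of freedom, and the last two give about 3.0 with 2 degrees of freedom. The last two features therefore tie, so the statistic alone does not explain why the output above keeps the third feature; presumably the tie is resolved by feature order.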