Illustrated big data | Spark machine learning - workflow and feature engineering

Posted by hofmann777 on Wed, 09 Mar 2022 01:42:20 +0100

Author: Han Xinzi@ShowMeAI

Notice: All rights reserved. To reprint, please contact the platform and the author, and indicate the source.

1. Spark machine learning workflow

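1) Spark MLlib and ML

Spark provides MLlib and ML for machine learning on big data; both support modeling and applying models on massive data sets. MLlib (package spark.mllib) is the original RDD-based API, while ML (package spark.ml) is the newer DataFrame-based API on which pipelines are built.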

2) Machine learning workflow (Pipeline)

A typical machine learning process starts with data collection and passes through several steps before it produces the required output. It is a multi-step, pipelined job:

  • Source data ETL (extract, transform, load)
  • Data preprocessing
  • Feature extraction
  • Model training and cross-validation
  • Prediction on new data

MLlib is simple to use, but in some cases it makes the program structure complex and hard to understand and implement:

  • The target data set has a complex structure and needs to be processed several times.
  • Predicting on new data requires combining several trained models into one overall prediction.

ML Pipeline, introduced in Spark 1.2, can be used to build such complex machine learning workflow applications.

Several important concepts are explained below:

(1) DataFrame

ML uses the DataFrame from Spark SQL as its data set, which can hold a variety of data types. Compared with an RDD, a DataFrame carries schema information, which makes it closer to a two-dimensional table in a traditional database.

ML Pipeline uses it to store the source data. For example, the columns of a DataFrame can store text, feature vectors, true labels, and predicted labels.

(2) Transformer

A Transformer is an algorithm that converts one DataFrame into another. For example, a trained model is a Transformer: it takes a test DataFrame that has no predicted labels and converts it into another DataFrame that has them.

Technically, a Transformer implements the method transform(), which converts one DataFrame into another, generally by appending one or more columns.

(3) Estimator

An Estimator is the abstraction of a learning algorithm, or of any method that is trained on data. Within a Pipeline, it typically operates on DataFrame data and produces a Transformer. Technically, an Estimator implements the method fit(), which accepts a DataFrame and returns a Transformer.
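
The fit()/transform() contract can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: a "DataFrame" is modelled as a list of row dicts, and the class names MeanEstimator and ThresholdModel are hypothetical, chosen only for this illustration.

```python
# Toy illustration of the Estimator -> Transformer contract (not Spark code).
# A "DataFrame" is modelled here as a list of row dicts.


class ThresholdModel:
    """A Transformer: transform() returns new rows with one column appended."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        # Append a "prediction" column; the input rows are left unchanged.
        return [
            {**row, "prediction": 1.0 if row["feature"] > self.threshold else 0.0}
            for row in rows
        ]


class MeanEstimator:
    """An Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        mean = sum(row["feature"] for row in rows) / len(rows)
        return ThresholdModel(threshold=mean)


training = [{"feature": 1.0}, {"feature": 2.0}, {"feature": 6.0}]
model = MeanEstimator().fit(training)      # Estimator.fit() -> Transformer
predictions = model.transform([{"feature": 0.5}, {"feature": 4.0}])
print(predictions)
```

The learned state (here, the mean) lives in the object returned by fit(), which is why a fitted model can be applied to new data without access to the training set.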


(4) Parameter

A Parameter sets a parameter of a Transformer or an Estimator. All Transformers and Estimators share a common API for specifying parameters. A ParamMap is a set of (parameter, value) pairs.

(5) Pipeline (workflow)

A Pipeline chains multiple workflow stages (Transformers and Estimators) together to form a machine learning workflow and produce the final output.

3) Build a Pipeline workflow

val pipeline = new Pipeline().setStages(Array(stage1,stage2,stage3,...))

① First, define each PipelineStage (workflow stage) in the Pipeline.

  • Stages are either Transformers or Estimators.
  • Examples include feature extraction, feature transformation, and model training.
  • With Transformers and Estimators that each handle a specific task, you can arrange the PipelineStages in order and create a Pipeline that follows your processing logic.

② Then, pass the training data set as the argument to the fit() method of the Pipeline instance, which processes the source training data stage by stage.

  • This call returns an instance of the PipelineModel class, which is then used to predict the labels of the test data.

③ Each stage of the workflow runs in sequence, and the input DataFrame is transformed as it passes through each stage.

  • For a Transformer stage, the transform() method is called on the DataFrame.
  • For an Estimator stage, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, i.e. the fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
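
The stage-by-stage behaviour described in step ③ can be sketched as follows. This is a simplified plain-Python illustration, not Spark's source code: ToyPipeline, ToyPipelineModel, and the stage classes are hypothetical names, rows are plain dicts, and every fitted stage transforms the data before the next stage sees it.

```python
# Toy sketch of how fitting a pipeline walks through its stages (not Spark code).


class Lowercase:
    """Transformer stage: adds a lower-cased copy of the text column."""

    def transform(self, rows):
        return [{**r, "lower": r["text"].lower()} for r in rows]


class KnownWordCounter:
    """Transformer produced by VocabularyEstimator.fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            {**r, "known": sum(w in self.vocab for w in r["lower"].split())}
            for r in rows
        ]


class VocabularyEstimator:
    """Estimator stage: learns the set of words seen during fit()."""

    def fit(self, rows):
        vocab = {w for r in rows for w in r["lower"].split()}
        return KnownWordCounter(vocab)


class ToyPipelineModel:
    """Fitted pipeline: holds only Transformers and applies them in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for t in self.transformers:
            rows = t.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it to get a Transformer
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the transformed rows to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(
    [{"text": "Spark ML"}, {"text": "Hadoop MapReduce"}]
)
result = model.transform([{"text": "spark hadoop flink"}])
print(result)
```

Note that the fitted model contains no Estimators: each one has been replaced by the Transformer that its fit() call returned.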

4) Build Pipeline example

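from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Create (or reuse) the SparkSession used by all the snippets in this article.
spark = SparkSession.builder.appName("PipelineExample").getOrCreate()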

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row  # type: ignore
    print(
        "(%d, %s) --> prob=%s, prediction=%f" % (
            rid, text, str(prob), prediction   # type: ignore
        )
    )

2. Spark ML Feature Engineering Based on DataFrame

1) Feature engineering

2) Binarization

from pyspark.ml.feature import Binarizer

continuousDataFrame = spark.createDataFrame([(0, 1.1), (1, 8.5), (2, 5.2)], ["id", "feature"])
binarizer = Binarizer(threshold=5.1, inputCol="feature", outputCol="binarized_feature")
binarizedDataFrame = binarizer.transform(continuousDataFrame)
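
The thresholding rule behind binarization can be sketched in plain Python. This is an illustration of the rule only, not Spark code; the helper name binarize is hypothetical. Values strictly greater than the threshold map to 1.0, and values equal to or below it map to 0.0.

```python
def binarize(values, threshold):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


features = [1.1, 8.5, 5.2]   # the same feature values as in the Binarizer snippet
binarized = binarize(features, threshold=5.1)
print(binarized)
```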

3) Discretization with fixed boundaries

from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ["features"])
bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")

# Assign each value to a bucket according to the given split points
bucketedData = bucketizer.transform(dataFrame)
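
To make the bucket assignment concrete, here is a plain-Python sketch of the rule, assuming each bucket covers the half-open interval [splits[i], splits[i+1]) and the last bucket also includes its upper bound. The helper name bucketize is hypothetical and NaN handling is left out.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) that holds value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the split range")
    # bisect_right counts the split points <= value; subtracting 1 gives the bucket.
    # The clamp puts a value equal to the last split point into the final bucket.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
values = [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]
buckets = [bucketize(v, splits) for v in values]
print(buckets)
```

With n + 1 split points there are n buckets, so the five split points here yield bucket indices 0 to 3.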

4) Quantile-based discretization

from pyspark.ml.feature import QuantileDiscretizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 9.2), (6, 14.4)]
df = spark.createDataFrame(data, ["id", "hour"])
df = df.repartition(1)

# Discretize into 3 buckets whose boundaries are derived from the data's quantiles
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
result = discretizer.fit(df).transform(df)
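
The idea of deriving split points from the data can be sketched in plain Python. This is a simplified illustration that uses exact order statistics; Spark computes approximate quantiles, so its split points and bucket assignments may differ from the ones produced here. The helper names quantile_splits and assign_bucket are hypothetical.

```python
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Pick interior split points from the sorted data (exact, not approximate)."""
    ordered = sorted(values)
    n = len(ordered)
    interior = [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]
    return [-float("inf")] + interior + [float("inf")]


def assign_bucket(value, splits):
    """Index of the half-open bucket [splits[i], splits[i+1]) that holds value."""
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


hours = [18.0, 19.0, 8.0, 5.0, 2.2, 9.2, 14.4]   # the "hour" column used above
splits = quantile_splits(hours, num_buckets=3)
result = [assign_bucket(h, splits) for h in hours]
print(splits)
print(result)
```

Unlike fixed-boundary bucketing, the split points here depend on the data, so each bucket receives roughly the same number of rows.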

5) Magnitude scaling of continuous values

from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")

# Compute the maximum absolute value of each feature
scalerModel = scaler.fit(dataFrame)

# Rescale each feature to the range [-1, 1]
scaledData = scalerModel.transform(dataFrame)
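
The arithmetic behind max-abs scaling is simple enough to check by hand. The following plain-Python sketch (the helper name max_abs_scale is hypothetical) divides every column by its largest absolute value, which maps each feature into [-1, 1] without shifting it.

```python
def max_abs_scale(rows):
    """Divide every column by its largest absolute value."""
    columns = list(zip(*rows))
    max_abs = [max(abs(v) for v in col) for col in columns]
    return [
        [v / m if m != 0 else 0.0 for v, m in zip(row, max_abs)]
        for row in rows
    ]


rows = [[1.0, 0.1, -8.0], [2.0, 1.0, -4.0], [4.0, 10.0, 8.0]]   # same vectors as above
scaled = max_abs_scale(rows)
for row in scaled:
    print(row)
```

Because the data is only divided and never centered, zero entries stay zero, so sparsity is preserved.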

6) Standardization

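from pyspark.ml.feature import StandardScaler

dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

# Compute the summary statistics (mean and standard deviation) of each feature
scalerModel = scaler.fit(dataFrame)

# Standardize each feature to unit standard deviation
scaledData = scalerModel.transform(dataFrame)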
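
With withStd=True and withMean=False, each feature is divided by its standard deviation but not centered. A plain-Python sketch of that computation follows; the helper name scale_to_unit_std and the sample rows are hypothetical, and the sketch assumes the corrected (n - 1) sample standard deviation and a 0.0 output for zero-variance features.

```python
from statistics import stdev


def scale_to_unit_std(rows):
    """Divide every column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]   # corrected sample std (n - 1)
    return [
        [v / s if s != 0 else 0.0 for v, s in zip(row, stds)]
        for row in rows
    ]


rows = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]   # hypothetical data
scaled = scale_to_unit_std(rows)
for row in scaled:
    print(row)
```

Leaving withMean=False avoids turning sparse vectors into dense ones, which is why it is a common choice for sparse inputs such as libsvm files.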

7) Adding polynomial features

from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([2.0, 1.0]),), (Vectors.dense([0.0, 0.0]),), (Vectors.dense([3.0, -1.0]),)], ["features"])
polyExpansion = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")
polyDF = polyExpansion.transform(df)
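
A polynomial expansion of degree d produces every monomial of the input features with total degree from 1 to d. The plain-Python sketch below enumerates those monomials; the helper name polynomial_terms is hypothetical, and the order of the terms is an assumption of this sketch that may differ from the order Spark uses in its output vector.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_terms(vector, degree):
    """All monomials of the vector's entries with total degree 1..degree."""
    terms = []
    for d in range(1, degree + 1):
        # Each multiset of d feature indices corresponds to one monomial.
        for idx in combinations_with_replacement(range(len(vector)), d):
            terms.append(prod(vector[i] for i in idx))
    return terms


expanded = polynomial_terms([2.0, 1.0], degree=3)
print(len(expanded), expanded)
```

For two features and degree 3 this yields nine terms: x, y, x², xy, y², x³, x²y, xy², y³.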

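8) One-hot encoding of categorical features

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], ["id", "category"])
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

# In Spark 3.0+ OneHotEncoder is an Estimator and must be fitted first;
# in Spark 2.x it was a Transformer and transform() was called on it directly.
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.fit(indexed).transform(indexed)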
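
The two steps, indexing strings and then one-hot encoding the indices, can be sketched in plain Python. This is an illustration under stated assumptions, not Spark code: labels are indexed by descending frequency with ties broken alphabetically, and the last category is dropped from the vector so that it is encoded as all zeros. The helper names fit_string_index and one_hot are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Index labels by descending frequency; ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


categories = ["a", "b", "c", "a", "a", "c"]   # the "category" column used above
mapping = fit_string_index(categories)
encoded = [one_hot(mapping[c], len(mapping)) for c in categories]
print(mapping)
print(encoded)
```

Dropping the last category keeps the encoded columns linearly independent, which matters for models such as linear or logistic regression.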

9) Text feature extraction: word counts

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([(0, "a b c".split(" ")), (1, "a b b c a".split(" "))], ["id", "words"])
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
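
What fitting a count vectorizer involves can be sketched in plain Python: build a vocabulary from terms that occur in at least minDF documents, keep the vocabSize most frequent ones, then count them per document. The helper names fit_vocabulary and count_vector are hypothetical, and the alphabetical tie-breaking between equally frequent terms is an assumption of this sketch.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Keep terms seen in >= min_df documents, ordered by corpus frequency."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]


def count_vector(doc, vocabulary):
    """Dense vector of term counts, one entry per vocabulary term."""
    counts = Counter(doc)
    return [float(counts[t]) for t in vocabulary]


docs = ["a b c".split(" "), "a b b c a".split(" ")]   # same documents as above
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [count_vector(d, vocabulary) for d in docs]
print(vocabulary)
print(vectors)
```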

10) Text feature extraction: TF-IDF

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# Split each sentence into words, then hash the words into term-frequency vectors
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Fit the IDF weights on the term frequencies, then rescale them
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
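
The weighting itself can be reproduced in a few lines of plain Python. The sketch below assumes the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. It works on raw terms rather than hashed indices, so it shows the weighting only and not the hashing step; the helper names and the sample tokens are hypothetical.

```python
from collections import Counter
from math import log


def idf_weights(docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: log((m + 1) / (df + 1)) for t, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Multiply each term's raw count in the document by its IDF weight."""
    tf = Counter(doc)
    return {t: count * idf[t] for t, count in tf.items()}


docs = [["spark", "is", "fast"], ["spark", "is", "spark"]]   # hypothetical tokens
idf = idf_weights(docs)
vectors = [tf_idf(d, idf) for d in docs]
print(idf)
print(vectors)
```

Terms that occur in every document receive a weight of zero, which is how TF-IDF suppresses words that carry no discriminating information.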