Notes
Basic environment
- JDK 8
- Python 3.7
Setting up the Spark environment on Windows
First install JDK 8 and Python 3, which will not be repeated here.
Install Hadoop 2.7
- Download address: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz
- Extract the archive
- Download winutils for Hadoop: https://github.com/steveloughran/winutils
- Unzip the downloaded winutils into the bin directory of the Hadoop installation
- Set JAVA_HOME for Hadoop
Edit the etc\hadoop\hadoop-env.cmd file and set the actual Java installation directory:
set JAVA_HOME=%JAVA_HOME%
Change to
set JAVA_HOME=E:\study\jdk1.8.0_144
- Set the Hadoop environment variables
The method is the same as configuring the JDK environment variables:
create a new HADOOP_HOME variable whose value is the extracted Hadoop root directory,
then add %HADOOP_HOME%\bin to Path.
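For reference, a minimal sketch of the same step from a command prompt, assuming Hadoop was extracted to F:\ITInstall\hadoop-2.7.7 (the path seen in the hadoop version output below); adjust it to your own location:
:: persist HADOOP_HOME for the current user (takes effect in newly opened cmd windows)
setx HADOOP_HOME "F:\ITInstall\hadoop-2.7.7"
:: %HADOOP_HOME%\bin is still added to Path via
:: System Properties -> Environment Variables, exactly as for the JDK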
- Test in cmd whether Hadoop is installed
In cmd, run hadoop version:
C:\Users\Minke>hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /F:/ITInstall/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar
If the error "JAVA_HOME is incorrectly set" appears: generally this happens when the JDK is installed on drive C (a space in the path, e.g. under Program Files, trips up the script); move it to another disk.
Install Spark 2.4.x
Version 2.4.8 is installed here.
- Download address: https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
- Unzip it to a directory of your choice
- Set the SPARK_HOME environment variable and add %SPARK_HOME%\bin to Path
- Test in cmd
In cmd, run pyspark:
C:\Users\Minke>pyspark
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
22/02/11 17:21:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018 04:59:51)
SparkSession available as 'spark'.
>>>
- Run quit() to exit
- Test a Spark job
In cmd, run:
spark-submit %SPARK_HOME%/examples/src/main/python/pi.py
The calculation result can be seen in the log:
Pi is roughly 3.142780
Setting up the Spark environment on Linux
JDK 8 and Python 3.7 need to be installed in advance.
Only the single-node setup is demonstrated here; for installing a multi-node Hadoop cluster, please refer to:
- Download. The download address is the same as above.
You can download it with wget, or download it on Windows and copy it to Linux with a file-transfer tool.
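For example, a wget sketch using the same archive URL as in the Windows section:
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz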
- Extract:
tar -zvxf spark-2.4.8-bin-hadoop2.7.tgz -C /opt/module
cd /opt/module
mv spark-2.4.8-bin-hadoop2.7 spark-2.4.8
- Test:
cd /opt/module/spark-2.4.8
bin/spark-submit examples/src/main/python/pi.py
The result can be seen in the printed log: Pi is roughly 3.137780
- Set environment variables
vi /etc/profile and append:
#==================spark====================
export SPARK_HOME=/opt/module/spark-2.4.8
export PATH=$PATH:$SPARK_HOME/bin
Save with :wq, then run source /etc/profile
- Change the log level
In the conf directory, copy log4j.properties.template to log4j.properties,
then change the log4j.rootCategory=INFO, console line to the desired level.
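A minimal sketch of that step, assuming Spark lives in /opt/module/spark-2.4.8 and WARN is the desired level:
cd /opt/module/spark-2.4.8/conf
cp log4j.properties.template log4j.properties
# edit log4j.properties and change
#   log4j.rootCategory=INFO, console
# to
#   log4j.rootCategory=WARN, console
# (or in one go: sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' log4j.properties)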
Jupyter Notebook installation
Integrating PySpark with Jupyter Notebook in a Linux environment
- Install Jupyter Notebook
pip3 install jupyter
- Install findspark
The findspark package is required for Jupyter to access Spark.
pip3 install findspark
- Start Jupyter
If you don't know where the jupyter command is installed, locate it first:
find / -name jupyter
or:
cd /usr/local/python3/bin
The jupyter command is in this directory. If the directory is not in the PATH environment variable, start it like this:
./jupyter notebook --allow-root
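If the notebook runs on a remote Linux machine and is opened from another computer's browser, it may also need to listen on all interfaces; a sketch, assuming the default port 8888 is reachable:
./jupyter notebook --allow-root --ip=0.0.0.0
# the startup log prints a URL containing a token; open it in the browser,
# replacing localhost with the server's IP address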
- Open the Jupyter Notebook web page and test it
Create a new notebook:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))
rdd3 = rdd2.map(lambda word: (word, 1))
rdd5 = rdd3.reduceByKey(lambda a, b: a + b)
print(rdd5.collect())

sc.stop()
Output:
[('hello', 2), ('spark', 1), ('world', 1)]
Integrating PySpark with Jupyter Notebook in a Windows environment
Install Anaconda first: download the installation package (a Baidu search will find it) and install it; there are no special steps. Anaconda comes with Jupyter Notebook and, as an integrated environment, also makes it convenient to install other tools and Python packages, so it is recommended.
- Install Anaconda
- Go to the Anaconda installation directory
- Go to the Scripts directory
- Open a command line (cmd) in the Scripts directory. Be sure to use this directory, otherwise the installed packages and Jupyter Notebook cannot be found; this is a shortcoming of the Windows environment.
- Install findspark
pip3 install findspark
If the download is slow, consider switching to the Aliyun mirror, which is faster.
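A sketch of installing through a mirror, assuming the commonly used Aliyun PyPI index URL:
pip3 install findspark -i https://mirrors.aliyun.com/pypi/simple/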
- Test: start Jupyter Notebook and open the web page in your browser
- Create a new Python 3 notebook and run:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))
rdd3 = rdd2.map(lambda word: (word, 1))
rdd5 = rdd3.reduceByKey(lambda a, b: a + b)
print(rdd5.collect())

sc.stop()
Execution result:
[('hello', 2), ('spark', 1), ('world', 1)]