Running Multiple Workers of Apache Spark on Windows

The steps below have been tried on two different Windows 10 laptops, with two different Spark 2.x.x versions and with Spark 3.1.2.

Installing Prerequisites

PySpark requires Java 8 or later and a working Python installation; for the Spark versions covered here, Python 3.6 or later is recommended (3.7.x if you use Spark 2.x, see the note below).

  1. Java

To check whether Java is already available and find its version, open a Command Prompt and type the following command.

java -version

If the above command gives output like the following, you already have Java installed and can skip the installation steps below.

java version "1.8.0_271"
Java(TM) SE Runtime Environment (build 1.8.0_271-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode)

Java comes in two packages: the JRE and the JDK. Which one you need depends on whether you just want to run Java programs or also develop them.

You can download either of them from the Oracle website.

Do you want to run Java™ programs, or do you want to develop Java programs? If you want to run Java programs, but not develop them, download the Java Runtime Environment, or JRE™

If you want to develop applications for Java, download the Java Development Kit, or JDK™. The JDK includes the JRE, so you do not have to download both separately.

For our case, we just want to run Java, so we will download the JRE.
1. Download Windows x86 (e.g. jre-8u271-windows-i586.exe) or Windows x64 (jre-8u271-windows-x64.exe) version depending on whether your Windows is 32-bit or 64-bit
2. The website may ask for registration, in which case you can register using your email ID
3. Run the installer post download.

Note: The above two .exe files require admin rights for installation.

In case you do not have admin access to your machine, download the .tar.gz version (e.g. jre-8u271-windows-x64.tar.gz). Then, un-gzip and un-tar the downloaded file and you have a Java JRE or JDK installation.
You can use 7zip to extract the files. Extracting the .tar.gz file will give a .tar file; extract this one more time using 7zip.
Alternatively, you can run the command below in cmd on the downloaded file to extract it:

tar -xvzf jre-8u271-windows-x64.tar.gz

Make a note of where Java gets installed, as we will need the path later.
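If you are not sure where Java ended up, the commands below can help locate it (run them in cmd; the exact output depends on your installation, and `where java` may point to a launcher shim rather than the actual install folder, while the second command prints the java.home property):

where java
java -XshowSettings:properties -version 2>&1 | findstr "java.home"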

2. Python

Use Anaconda to install Python: https://www.anaconda.com/products/individual

Use the command below to check the version of Python.

python --version

Run the above command in an Anaconda Prompt if you used Anaconda to install Python. It should give output like the following.

Python 3.7.9

Note: Spark 2.x.x does not support Python 3.8, so please install Python 3.7.x. For more information, refer to this StackOverflow question. Spark 3.x.x supports Python 3.8.
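If your base Anaconda environment is already on Python 3.8 and you intend to use Spark 2.x, a separate Python 3.7 environment can be created instead. A sketch using conda (the environment name pyspark_env is just an example):

conda create -n pyspark_env python=3.7
conda activate pyspark_env
python --version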

Scripted setup

The following steps can be scripted as a batch file and run in one go. The script is provided after the walkthrough below.

Getting the Spark files

Download the required Spark version from the Apache Spark downloads page. Get the ‘spark-x.x.x-bin-hadoop2.7.tgz’ file, e.g. spark-2.4.3-bin-hadoop2.7.tgz.

Spark 3.x.x is also available prebuilt with Hadoop 3.2, but that Hadoop version caused errors when writing Parquet files, so the Hadoop 2.7 build is recommended.

Adjust the remaining steps accordingly for the Spark version you choose.

You can extract the files using 7zip. Extracting the .tgz file will give a .tar file; extract this one more time using 7zip. Alternatively, you can run the command below in cmd on the downloaded file to extract it:

tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz

Putting everything together

Setup folder

Create a folder for the Spark installation at a location of your choice, e.g. C:\spark_setup.

Extract the Spark archive and move the extracted folder into the chosen folder: C:\spark_setup\spark-2.4.3-bin-hadoop2.7

Adding winutils.exe

From this GitHub repository, download the winutils.exe file corresponding to the Spark and Hadoop version.

We are using Hadoop 2.7, hence download winutils.exe from hadoop-2.7.1/bin/.

Copy this file into the following paths, replacing any existing copy (create the \hadoop\bin directories if they do not exist), as shown in the commands after this list:

  • C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin
  • C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin
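Assuming winutils.exe was downloaded to your Downloads folder and the example paths above are used, the directory can be created and the file copied from a Command Prompt like this (adjust paths to your setup):

mkdir C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin
copy %USERPROFILE%\Downloads\winutils.exe C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin\
copy %USERPROFILE%\Downloads\winutils.exe C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin\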

Setting environment variables

We have to set up the environment variables below to let Spark know where the required files are.

In the Start Menu, type ‘Environment Variables’ and select ‘Edit the system environment variables’.

Click on ‘Environment Variables…’.

Add the following variables using ‘New…’:

  1. Variable name: SPARK_HOME
    Variable value: C:\spark_setup\spark-2.4.3-bin-hadoop2.7 (path to setup folder)
  2. Variable name: HADOOP_HOME
    Variable value: C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop
    OR
    Variable value: %SPARK_HOME%\hadoop
  3. Variable name: JAVA_HOME
    Variable value: Set it to the Java installation folder, e.g. C:\Program Files\Java\jre1.8.0_271
    Find it in ‘Program Files’ or ‘Program Files (x86)’ based on which version was installed above. In case you used the .tar.gz version, set the path to the location where you extracted it.
  4. Variable name: PYSPARK_PYTHON
    Variable value: python
    This environment variable is required to ensure that tasks involving Python workers, such as UDFs, work properly. Refer to this StackOverflow post.
  5. Select the ‘Path’ variable and click on ‘Edit…’.

Click on ‘New’ and add the Spark bin path, e.g. C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin or %SPARK_HOME%\bin.

All the required environment variables have now been set.
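Alternatively, the same variables can be set from a Command Prompt with setx. A sketch assuming the example paths above (note that setx only takes effect in newly opened prompts):

setx SPARK_HOME "C:\spark_setup\spark-2.4.3-bin-hadoop2.7"
setx HADOOP_HOME "C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jre1.8.0_271"
setx PYSPARK_PYTHON "python"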

Optional variables: set the variables below if you want to use PySpark with Jupyter Notebook. If these are not set, the PySpark session will start in the console.

  1. Variable name: PYSPARK_DRIVER_PYTHON
    Variable value: jupyter
  2. Variable name: PYSPARK_DRIVER_PYTHON_OPTS
    Variable value: notebook

Scripted setup

Edit and use the script below to (almost) automate the PySpark setup process.

Steps that aren’t automated: Java and Python installation, and updating the ‘Path’ variable.

Install Java and Python before running the script, and edit the ‘Path’ variable afterwards, as described in the walkthrough above.

Make changes to the script for the required Spark version and the installation paths. Save it as a .bat file and double-click it to run.
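A minimal sketch of such a script is shown below. It assumes the Spark .tgz and winutils.exe have already been downloaded to your Downloads folder and reuses the example version and paths from this walkthrough; edit these before running.

@echo off
REM Sketch of a PySpark setup script; edit the version and paths before running.
set SPARK_PKG=spark-2.4.3-bin-hadoop2.7
set SETUP_DIR=C:\spark_setup

REM Create the setup folder and extract the Spark archive into it.
mkdir %SETUP_DIR%
tar -xvzf %USERPROFILE%\Downloads\%SPARK_PKG%.tgz -C %SETUP_DIR%

REM Copy winutils.exe into both bin folders.
mkdir %SETUP_DIR%\%SPARK_PKG%\hadoop\bin
copy %USERPROFILE%\Downloads\winutils.exe %SETUP_DIR%\%SPARK_PKG%\bin\
copy %USERPROFILE%\Downloads\winutils.exe %SETUP_DIR%\%SPARK_PKG%\hadoop\bin\

REM Set the environment variables (visible in newly opened prompts only).
setx SPARK_HOME "%SETUP_DIR%\%SPARK_PKG%"
setx HADOOP_HOME "%SETUP_DIR%\%SPARK_PKG%\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jre1.8.0_271"
setx PYSPARK_PYTHON "python"

echo Remember to add %SETUP_DIR%\%SPARK_PKG%\bin to the Path variable manually, as described above.
pause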

Using PySpark in standalone mode on Windows

You might have to restart your machine after the above steps if the commands below don’t work.

Commands

Each command should be run in a separate Anaconda Prompt.

  1. Deploying Master
    spark-class.cmd org.apache.spark.deploy.master.Master -h 127.0.0.1
    Open your browser and navigate to: http://localhost:8080/. This is the SparkUI.
  2. Deploying Worker
    spark-class.cmd org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
    SparkUI will show the worker status. To run more than one worker, see the note after these commands.

  3. PySpark shell
    pyspark --master spark://127.0.0.1:7077 --num-executors 1 --executor-cores 1 --executor-memory 4g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false

Adjust num-executors, executor-cores, executor-memory and driver-memory as per your machine’s configuration. SparkUI will show the list of PySparkShell sessions.

Activate the required Python environment in the third Anaconda Prompt before running the above pyspark command.

The above command will open a Jupyter Notebook instead of the PySpark shell if you have also set the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.
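To run multiple workers on the same machine, repeat the worker command from step 2 in additional Anaconda Prompts. Each worker can optionally be capped with --cores and --memory; the values below are only examples, so adjust them to your machine. Every worker will appear separately in the SparkUI.

spark-class.cmd org.apache.spark.deploy.worker.Worker --cores 2 --memory 4g spark://127.0.0.1:7077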

Alternative

Run the command below to start a PySpark (shell or Jupyter) session using all the resources available on your machine. Activate the required Python environment before running the pyspark command.

pyspark --master local[*]
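To use only part of the machine instead, replace the * with the number of cores to allocate, for example:

pyspark --master local[4]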

Please let me know in the comments if any of the steps give errors or you face any issues.

Linux Administration Commands

As a system administrator, you may want to know who is on the system at any given point in time. You may also want to know what they are doing. In this article, let us review four different methods to identify who is on your Linux system.

1. Get the running processes of logged-in user using w

The w command shows the logged-in user names and what they are doing. The information is read from the /var/run/utmp file. The output of the w command contains the following columns:
  • Name of the user
  • User’s machine number or tty number
  • Remote machine address
  • User’s Login time
  • Idle time (time since the user’s last activity)
  • Time used by all processes attached to the tty (JCPU time)
  • Time used by the current process (PCPU time)
  • Command currently getting executed by the users
 
The following options can be used with the w command:
  • -h Omit the header line
  • -u Ignore the username while figuring out the current process and CPU times
  • -s Use the short format, which omits the login time, JCPU and PCPU columns

$ w
 23:04:27 up 29 days,  7:51,  3 users,  load average: 0.04, 0.06, 0.02
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
ramesh   pts/0    dev-db-server        22:57    8.00s  0.05s  0.01s sshd: ramesh [priv]
jason    pts/1    dev-db-server        23:01    2:53   0.01s  0.01s -bash
john     pts/2    dev-db-server        23:04    0.00s  0.00s  0.00s w

$ w -h
ramesh   pts/0    dev-db-server        22:57   17:43   2.52s  0.01s sshd: ramesh [priv]
jason    pts/1    dev-db-server        23:01   20:28   0.01s  0.01s -bash
john     pts/2    dev-db-server        23:04    0.00s  0.03s  0.00s w -h

$ w -u
 23:22:06 up 29 days,  8:08,  3 users,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
ramesh   pts/0    dev-db-server        22:57   17:47   2.52s  2.49s top
jason    pts/1    dev-db-server        23:01   20:32   0.01s  0.01s -bash
john     pts/2    dev-db-server        23:04    0.00s  0.03s  0.00s w -u

$ w -s
 23:22:10 up 29 days,  8:08,  3 users,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM               IDLE WHAT
ramesh   pts/0    dev-db-server        17:51  sshd: ramesh [priv]
jason    pts/1    dev-db-server        20:36  -bash
john     pts/2    dev-db-server         1.00s w -s

2. Get the user name and processes of logged-in users using the who and users commands

The who command lists the usernames that are currently logged in. The output of the who command contains the following columns: user name, tty number, date and time, and machine address.
$ who
ramesh pts/0        2009-03-28 22:57 (dev-db-server)
jason  pts/1        2009-03-28 23:01 (dev-db-server)
john   pts/2        2009-03-28 23:04 (dev-db-server)
 
To get a list of all usernames that are currently logged in, use the following:
$ who | cut -d' ' -f1 | sort | uniq
john
jason
ramesh

Users Command

The users command prints the names of the users currently logged in to the current host. It has no options other than --help and --version. If a user has ‘n’ terminals open, the user name will be shown ‘n’ times in the output.
$ users
john jason ramesh

3. Get the username you are currently logged in as using whoami

The whoami command prints the logged-in user name.
$ whoami
john
 
The whoami command gives the same output as id -un, as shown below:
$ id -un
john
 
The who am i command displays the logged-in user name and current tty details. The output of this command contains the following columns: logged-in user name, tty name, current date and time, and the IP address or hostname from which this user initiated the connection.
$ who am i
john     pts/2        2009-03-28 23:04 (dev-db-server)

$ who mom likes
john     pts/2        2009-03-28 23:04 (dev-db-server)

Warning: Don't try "who mom hates" command.
Also, if you su to some other user, this command will still display the originally logged-in user’s details.

4. Get the user login history at any time

The last command gives the login history for a specific username. If we don’t give any argument to this command, it lists the login history for all users. By default, this information is read from the /var/log/wtmp file. The output of this command contains the following columns:
  • User name
  • Tty device number
  • Login date and time
  • Logout time
  • Total working time
$ last jason
jason   pts/0        dev-db-server   Fri Mar 27 22:57   still logged in
jason   pts/0        dev-db-server   Fri Mar 27 22:09 - 22:54  (00:45)
jason   pts/0        dev-db-server   Wed Mar 25 19:58 - 22:26  (02:28)
jason   pts/1        dev-db-server   Mon Mar 16 20:10 - 21:44  (01:33)
jason   pts/0        192.168.201.11  Fri Mar 13 08:35 - 16:46  (08:11)
jason   pts/1        192.168.201.12  Thu Mar 12 09:03 - 09:19  (00:15)
jason   pts/0        dev-db-server   Wed Mar 11 20:11 - 20:50  (00:39)