“Hello World” of PySpark for Python & Pandas Users [Pandas vs PySpark]

A beginner's guide and introduction to PySpark

ABHISHEK GUPTA
4 min read · Apr 16, 2022

Audience of this article:

This article is aimed mainly at people who are new to PySpark but already have some familiarity with Pandas. It assumes an intermediate-level understanding of Python. However, even if you are completely new to Python and learning PySpark, you can still find the cases below useful for your learning.

You can consider this article as your “Hello World” of PySpark.

Basic Understanding:

We will look at different examples of how Pandas differs from PySpark, or in other words, we will explore how to perform some of the basic Pandas tasks in PySpark. But first, let's develop a basic understanding of Pandas and PySpark.

Pandas is a data analysis and manipulation tool built on top of Python, whereas PySpark is a Python API for Spark. Apache Spark is written in the Scala programming language, and PySpark was released to support the collaboration of Apache Spark and Python.

Important Concepts:

Before we jump into Pandas vs PySpark, here are two important concepts to keep in mind while using PySpark.

PySpark DataFrames are Immutable:

  • When you make changes, new object references are created
  • Old versions remain unchanged

PySpark is Lazy:

  • Compute does not happen until you request an output (see the sketch after this list).
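
A minimal sketch of both ideas, assuming an active SparkSession named spark (the column and variable names below are purely illustrative):

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])

# Immutability: withColumn returns a NEW DataFrame; df itself is unchanged.
df2 = df.withColumn('id_plus_one', df.id + 1)
print(df.columns)   # ['id', 'label'] -- the old version is untouched
print(df2.columns)  # ['id', 'label', 'id_plus_one']

# Laziness: this line only builds a plan, nothing is computed yet...
filtered = df2.filter(df2.id > 1)
# ...until an action (show, count, collect, write, ...) asks for output.
filtered.show()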

Architecture:

A very generalised architecture of how PySpark works would look like this:

User → Driver → Cluster Manager (Master) → Executors → Data

The user interacts with the driver, which in turn works through the cluster manager (master) to launch executors that operate on your data.
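
In code, the entry point behind this picture is the SparkSession, which acts as your handle to the driver. A minimal, hedged way to create one locally (the app name and master setting below are just illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession -- the user-facing handle to the driver.
# The app name and the local master are illustrative; on a real cluster the
# master is provided by the deployment (YARN, Kubernetes, standalone, ...).
spark = (
    SparkSession.builder
    .appName('pandas-vs-pyspark-demo')
    .master('local[*]')
    .getOrCreate()
)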


Let’s Go: Pandas vs PySpark

Here are some of the most common and general tasks that we perform in Pandas, along with examples of how to perform the same tasks in PySpark.

Disclaimer: there are multiple ways to do some of the things mentioned below, but to keep things organised and focused, only one or two methods are shown for each.

1. Load CSV:

The default representation of a PySpark DataFrame is just the schema, not the data (a quick check follows the PySpark snippet below).

Pandas:

import pandas as pd
df = pd.read_csv("file_name.csv")

PySpark:

df = spark.read.options(header=True, inferSchema=True) \
    .csv("practice.csv")
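
As noted above, evaluating a PySpark DataFrame on its own only prints the schema; an action such as show() is needed to see rows. A quick check (the output in the comment is illustrative):

# The default representation is only the schema, e.g.:
#   DataFrame[col_1: string, col_2: int]
print(df)

# An action is needed to actually display data:
df.show(5)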

2. View Dataframe

Pandas:

> df
> df.head(5)

PySpark:

> df.show()
> df.show(5)

3. Column names and data types:

Pandas:

> df.columns
> df.dtypes

PySpark:

> df.columns
> df.dtypes

4. Rename Columns:

Pandas:

> df.columns = ['a', 'b', 'c', 'd']
> df.rename(columns={'old_name': 'new_name'})

PySpark:

> df.toDF('a', 'b', 'c', 'd')
> df.withColumnRenamed('old_name', 'new_name')

5. Drop Columns:

Pandas:

> df.drop('col_name', axis=1)

PySpark:

> df.drop('col_name')

6. Filtering:

Pandas:

> df[df.col_name1 < 40]
> df[(df.col_name1 < 40) & (df.col_name2 == 20)]

PySpark:

> df[df.col_name1 < 40]
> df[(df.col_name1 < 40) & (df.col_name2 == 20)]

7. Add Column:

Pandas:

df['col_name_new'] = 5 * df.col_name1

PySpark:

df.withColumn('col_name_new', 5 * df.col_name1)

8. Fill Nulls:

Pandas:

df.fillna(0)

PySpark:

df.fillna(0)

9. Aggregation:

Pandas:

df.groupby(['col_1', 'col_2']) \
    .agg({'col_3': 'mean', 'col_4': 'min'})

PySpark:

df.groupby(['col_1', 'col_2']) \
    .agg({'col_3': 'mean', 'col_4': 'min'})

10. Standard Transformation:

Pandas:

import numpy as np
df['col_name_new'] = np.log(df.col_name1)

PySpark:

import pyspark.sql.functions as F
df.withColumn('col_name_new', F.log(df.col_name1))

Keep it in the JVM: using Spark's built-in functions (pyspark.sql.functions) keeps the compute happening on your data inside the JVM. That means you are not actually running any Python at all on your executors.
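
To make that concrete, here is a hedged sketch contrasting the two routes (column names are illustrative): the built-in function stays in the JVM, while an equivalent Python UDF ships every value out to a Python worker and back.

import math
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

# Stays in the JVM: Spark's built-in log, no Python runs on the executors.
df_jvm = df.withColumn('col_log', F.log(df.col_name1))

# Leaves the JVM: a Python UDF computing the same thing; every value is
# serialized to a Python worker and back, which is noticeably slower at scale.
py_log = F.udf(lambda x: math.log(x), DoubleType())
df_udf = df.withColumn('col_log_udf', py_log(df.col_name1))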

11. Row Conditional Statement:

Pandas:

df['col_name_new'] = df.apply(
    lambda r: 1 if r.col_1 > 20 else 2 if r.col_2 == 6 else 3, axis=1)

PySpark:

df.withColumn('col_name_new',
    F.when(df.col_1 > 20, 1)
     .when(df.col_2 == 6, 2)
     .otherwise(3))

12. Python When Required

Pandas:

df['col_name_new'] = df.col_name1.apply(lambda x: x + 1)

PySpark:

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
fn = F.udf(lambda x: x + 1, DoubleType())
df.withColumn('col_name_new', fn(df.col_name1))

13. Merge/Join Dataframes

Pandas:

left_df.merge(right_df, on='key_col')
left_df.merge(right_df, left_on='col_a', right_on='col_b')

PySpark:

left_df.join(right_df, on='key_col')
left_df.join(right_df, left_df.col_a == right_df.col_b)

14. Pivot Table

Pandas:

pd.pivot_table(df, values='col_d', \
    index=['col_a', 'col_b'], columns=['col_c'], \
    aggfunc=np.sum)

PySpark:

df.groupBy('col_a', 'col_b').pivot('col_c').sum('col_d')

15. Summary Statistics

Pandas:

df.describe()

PySpark:

> df.describe().show()
> df.summary().show()

16. Histogram

Pandas:

df.hist()

PySpark:

df.sample(False, 0.1).toPandas().hist()

17. SQL Support

Pandas:

N/A

PySpark:

df.createOrReplaceTempView('func')
df2 = spark.sql('select * from func')

There is a lot more you can learn after going through these examples, and I am sure you will feel much more confident about using PySpark after this. For the next level, refer to the sparkbyexamples tutorials.

Recently, for a new project, I had to learn how to use PySpark. I must admit, I learned it the hard way, making mistakes and learning one thing at a time as I progressed, especially during the EDA process. Therefore, I decided to bring all this knowledge and information into one place to make it easier to adapt to PySpark by understanding the similarities and differences with Pandas. I hope you find this useful in your learning.

If you have any feedback or suggestions, please leave them below.

Keep Exploring, Keep Learning. Thank you!
