Flattening Nested JSON in PySpark: A Practical Walkthrough

Working with nested JSON data in PySpark can be challenging — especially when you need to transform complex hierarchical structures into flat, analysis-ready tables.
In this walkthrough, we’ll use PySpark to read, inspect, and flatten nested restaurant JSON data. All examples come directly from a working PySpark notebook.
📂 Sample Input File — restaurant_json_data.json
Here’s a snippet from the JSON file used in this tutorial.
It contains multiple restaurant entries, each with deeply nested attributes such as location, user_rating, and establishment_types.
[
{
"results_found": 6835,
"results_shown": 20,
"results_start": "1",
"restaurants": [
{
"restaurant": {
"R": { "res_id": 17066603 },
"name": "The Coop",
"cuisines": "Southern, Cajun, Soul Food",
"average_cost_for_two": 25,
"user_rating": {
"aggregate_rating": "3.6",
"rating_text": "Good",
"votes": "432"
},
"location": {
"address": "610 W Morse Boulevard, Winter Park, FL 32789",
"city": "Orlando",
"latitude": "28.5973660000",
"longitude": "-81.3572190000"
},
"currency": "$",
"establishment_types": []
}
},
{
"restaurant": {
"R": { "res_id": 17059541 },
"name": "Maggiano's Little Italy",
"cuisines": "Italian",
"average_cost_for_two": 50,
"user_rating": {
"aggregate_rating": "4.4",
"rating_text": "Very Good",
"votes": "886"
},
"location": {
"address": "9101 International Drive, Orlando, FL 32819",
"city": "Orlando",
"latitude": "28.4332350000",
"longitude": "-81.4714470000"
},
"currency": "$",
"establishment_types": []
}
}
]
}
]
This nested JSON structure mimics real-world APIs (such as Zomato or Yelp), where restaurant data includes complex objects and arrays.
🚀 Step 1: Initialize SparkSession
We begin by creating a Spark session.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("Create DF on fly") \
.master("local[*]") \
.getOrCreate()
This sets up a local Spark environment with all cores available.
📥 Step 2: Load the JSON Data
Next, we load the nested JSON file into a DataFrame.
json_df = spark.read\
.format("json")\
.option("multiline","true")\
.load("data/input/restaurant_json_data.json")
👀 Step 3: View the Data
To understand what’s inside, let’s display the first few rows.
json_df.show()
🧱 Step 4: Examine the Schema
To flatten a JSON, you must first understand its structure.
Let’s print the schema.
json_df.printSchema()
This reveals the hierarchy with nested struct and array fields.
⚙️ Step 5: Install Dependencies
To convert small samples to pandas DataFrames later, we’ll install pandas.
%pip install pandas
🧩 Step 6: Explode the Restaurants Array
The restaurants field is an array. We’ll use explode() and explode_outer() to flatten it and extract nested values.
df_flat = json_df.select("*",explode("restaurants").alias("new_restaraunts"))\
.drop("restaurants")\
.select("*",explode_outer("new_restaraunts.restaurant.establishment_types")
.alias("establishment_type_new"),
"new_restaraunts.restaurant.R.res_id")
Then preview the first few rows:
df_flat.limit(5).toPandas()
🧾 Step 7: Review the Flattened Schema
df_flat.printSchema()
This shows that each restaurant record is now accessible as a top-level row.
🧮 Step 8: Extract Selected Fields
To simplify the result further, we can select only the required columns.
df_flat2 = json_df.select("*",explode("restaurants").alias("new_restaraunts"))\
.drop("restaurants")\
.select(
explode_outer("new_restaraunts.restaurant.establishment_types")
.alias("establishment_type_new"),
"new_restaraunts.restaurant.R.res_id",
"new_restaraunts.restaurant.name"
)
Preview:
df_flat2.limit(5).toPandas()
| establishment_type_new | res_id | name |
|---|---|---|
| None | 17066603 | The Coop |
| None | 17059541 | Maggiano’s Little Italy |
| None | 17064405 | Tako Cheena by Pom Pom |
| None | 17057797 | Bosphorous Turkish Cuisine |
| None | 17057591 | Bahama Breeze Island Grille |
✅ Summary
This notebook demonstrated how to:
Load nested JSON data using PySpark.
Inspect and understand its schema.
Use
explode()andexplode_outer()to flatten nested arrays.Extract specific nested fields into a flat table for analysis.
By following these steps, you can turn deeply nested JSONs into simple, structured DataFrames that are much easier to query, analyze, or export to CSV/Parquet.
🎓 References & Credits
This tutorial and notebook was inspired by the excellent YouTube video on PySpark by Manish Kumar, below:
🔗 Flatten Nested JSON in PySpark (YouTube)




