Skip to main content

Command Palette

Search for a command to run...

Flattening Nested JSON in PySpark: A Practical Walkthrough

Updated
4 min read
Flattening Nested JSON in PySpark: A Practical Walkthrough

Working with nested JSON data in PySpark can be challenging — especially when you need to transform complex hierarchical structures into flat, analysis-ready tables.

In this walkthrough, we’ll use PySpark to read, inspect, and flatten nested restaurant JSON data. All examples come directly from a working PySpark notebook.


📂 Sample Input File — restaurant_json_data.json

Here’s a snippet from the JSON file used in this tutorial.

It contains multiple restaurant entries, each with deeply nested attributes such as location, user_rating, and establishment_types.

[
  {
    "results_found": 6835,
    "results_shown": 20,
    "results_start": "1",
    "restaurants": [
      {
        "restaurant": {
          "R": { "res_id": 17066603 },
          "name": "The Coop",
          "cuisines": "Southern, Cajun, Soul Food",
          "average_cost_for_two": 25,
          "user_rating": {
            "aggregate_rating": "3.6",
            "rating_text": "Good",
            "votes": "432"
          },
          "location": {
            "address": "610 W Morse Boulevard, Winter Park, FL 32789",
            "city": "Orlando",
            "latitude": "28.5973660000",
            "longitude": "-81.3572190000"
          },
          "currency": "$",
          "establishment_types": []
        }
      },
      {
        "restaurant": {
          "R": { "res_id": 17059541 },
          "name": "Maggiano's Little Italy",
          "cuisines": "Italian",
          "average_cost_for_two": 50,
          "user_rating": {
            "aggregate_rating": "4.4",
            "rating_text": "Very Good",
            "votes": "886"
          },
          "location": {
            "address": "9101 International Drive, Orlando, FL 32819",
            "city": "Orlando",
            "latitude": "28.4332350000",
            "longitude": "-81.4714470000"
          },
          "currency": "$",
          "establishment_types": []
        }
      }
    ]
  }
]

This nested JSON structure mimics real-world APIs (such as Zomato or Yelp), where restaurant data includes complex objects and arrays.


🚀 Step 1: Initialize SparkSession

We begin by creating a Spark session.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Create DF on fly") \
    .master("local[*]") \
    .getOrCreate()

This sets up a local Spark environment with all cores available.


📥 Step 2: Load the JSON Data

Next, we load the nested JSON file into a DataFrame.

json_df = spark.read\
                .format("json")\
                .option("multiline","true")\
                .load("data/input/restaurant_json_data.json")

👀 Step 3: View the Data

To understand what’s inside, let’s display the first few rows.

json_df.show()

🧱 Step 4: Examine the Schema

To flatten a JSON, you must first understand its structure.
Let’s print the schema.

json_df.printSchema()

This reveals the hierarchy with nested struct and array fields.


⚙️ Step 5: Install Dependencies

To convert small samples to pandas DataFrames later, we’ll install pandas.

%pip install pandas

🧩 Step 6: Explode the Restaurants Array

The restaurants field is an array. We’ll use explode() and explode_outer() to flatten it and extract nested values.

df_flat = json_df.select("*",explode("restaurants").alias("new_restaraunts"))\
        .drop("restaurants")\
        .select("*",explode_outer("new_restaraunts.restaurant.establishment_types")
                .alias("establishment_type_new"),
                "new_restaraunts.restaurant.R.res_id")

Then preview the first few rows:

df_flat.limit(5).toPandas()

🧾 Step 7: Review the Flattened Schema

df_flat.printSchema()

This shows that each restaurant record is now accessible as a top-level row.


🧮 Step 8: Extract Selected Fields

To simplify the result further, we can select only the required columns.

df_flat2 = json_df.select("*",explode("restaurants").alias("new_restaraunts"))\
        .drop("restaurants")\
        .select(
            explode_outer("new_restaraunts.restaurant.establishment_types")
                .alias("establishment_type_new"),
            "new_restaraunts.restaurant.R.res_id",
            "new_restaraunts.restaurant.name"
        )

Preview:

df_flat2.limit(5).toPandas()
establishment_type_new res_id name
None 17066603 The Coop
None 17059541 Maggiano’s Little Italy
None 17064405 Tako Cheena by Pom Pom
None 17057797 Bosphorous Turkish Cuisine
None 17057591 Bahama Breeze Island Grille

✅ Summary

This notebook demonstrated how to:

  1. Load nested JSON data using PySpark.

  2. Inspect and understand its schema.

  3. Use explode() and explode_outer() to flatten nested arrays.

  4. Extract specific nested fields into a flat table for analysis.

By following these steps, you can turn deeply nested JSONs into simple, structured DataFrames that are much easier to query, analyze, or export to CSV/Parquet.


🎓 References & Credits

This tutorial and notebook was inspired by the excellent YouTube video on PySpark by Manish Kumar, below:
🔗 Flatten Nested JSON in PySpark (YouTube)