Parent-Child Design Pattern in Azure Data Factory

The Parent-Child design pattern in Azure Data Factory (ADF) is a powerful approach to orchestrating data pipelines dynamically. Rather than hard-coding one pipeline per table or task, a Parent pipeline looks up the work to be done at run time and invokes a parameterized Child pipeline for each item. This blog post outlines how to design a Parent pipeline that orchestrates a Child pipeline to create files dynamically in an Azure Data Lake Storage Gen2 container, using schema and table names fetched from a source database.


Use Case

We want to generate files in Azure Data Lake Storage Gen2 whose filenames are built dynamically from the source schema and table names. The Parent pipeline provides the control flow, and the Child pipeline performs the actual data movement.


Parent Pipeline

The Parent pipeline handles:

  1. Fetching Table Metadata:

    • A Lookup activity runs a query to fetch table schema and table names dynamically from the source database. Here's the query used:

        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_name LIKE 'Customer%'
      
    • This query returns a list of tables whose names start with "Customer" from the database.

  2. Iterating Over the Results:

    • A ForEach activity iterates through the array returned by the Lookup activity. Each iteration corresponds to a row in the table list (one table schema and table name pair).
  3. Calling the Child Pipeline:

    • Inside the ForEach activity, an Execute Pipeline activity is used to invoke the Child pipeline. The schema and table name from the current iteration are passed as parameters to the Child pipeline.
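
Here is a rough JSON sketch of the ForEach and Execute Pipeline activities just described, assuming the Lookup activity is named LookupTables and the Child pipeline is named ChildCopyPipeline (all names here are placeholders; the column references must match the columns returned by the query, table_schema and table_name):

    {
      "name": "ForEachTable",
      "type": "ForEach",
      "dependsOn": [
        { "activity": "LookupTables", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "items": {
          "value": "@activity('LookupTables').output.value",
          "type": "Expression"
        },
        "activities": [
          {
            "name": "RunChildCopy",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": {
                "referenceName": "ChildCopyPipeline",
                "type": "PipelineReference"
              },
              "waitOnCompletion": true,
              "parameters": {
                "schema": "@item().table_schema",
                "table_name": "@item().table_name"
              }
            }
          }
        ]
      }
    }

Setting waitOnCompletion to true makes the Parent wait for each Child run to finish, which is useful when later activities depend on the files having been written.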

Child Pipeline

The Child pipeline is responsible for:

  1. Copying Data to ADLS Gen2:

    • A Copy Data activity is used to transfer data from the source database to Azure Data Lake Storage Gen2.

    • The destination filename is dynamically defined based on the parameters passed from the Parent pipeline (schema and table_name). For instance, if the table is in the sales schema and is named CustomerInfo, the output file would be named sales.CustomerInfo.

    • This is achieved by parameterizing the filename on the Sink dataset used by the Copy Data activity. The filename is configured using the following expression:

        @concat(pipeline().parameters.schema, '.', pipeline().parameters.table_name)
      

      This expression concatenates the schema and table name with a dot separator.
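
In practice, the filename on the sink side is usually exposed as a dataset parameter, and the expression above is passed in from the Copy Data activity. Below is a minimal sketch of such a sink dataset, assuming a DelimitedText dataset named AdlsSinkDataset, a linked service named AdlsGen2LinkedService, and a container (file system) named output; all of these names are placeholders for your own environment:

    {
      "name": "AdlsSinkDataset",
      "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
          "referenceName": "AdlsGen2LinkedService",
          "type": "LinkedServiceReference"
        },
        "parameters": {
          "fileName": { "type": "string" }
        },
        "typeProperties": {
          "location": {
            "type": "AzureBlobFSLocation",
            "fileSystem": "output",
            "fileName": {
              "value": "@dataset().fileName",
              "type": "Expression"
            }
          },
          "columnDelimiter": ",",
          "firstRowAsHeader": true
        }
      }
    }

The Copy Data activity then supplies the value when it references the dataset as its sink:

    "outputs": [
      {
        "referenceName": "AdlsSinkDataset",
        "type": "DatasetReference",
        "parameters": {
          "fileName": "@concat(pipeline().parameters.schema, '.', pipeline().parameters.table_name)"
        }
      }
    ]

If you want an extension on the output files, append it in the same expression, for example @concat(pipeline().parameters.schema, '.', pipeline().parameters.table_name, '.csv').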


Steps to Create the Pipelines

1. Create the Parent Pipeline

  • Add a Lookup activity:

    • Configure it with the query to fetch schema and table_name from the source database; a sample Lookup configuration is sketched after this list.
  • Add a ForEach activity:

    • Set the Items property to the output of the Lookup activity.

    • Inside the ForEach activity, add an Execute Pipeline activity and pass the schema and table_name dynamically.
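
A minimal sketch of the Lookup activity configuration is shown below, assuming an Azure SQL Database source and a dataset named SourceSqlDataset (both placeholder choices). Note that First row only must be unchecked (firstRowOnly: false); otherwise the Lookup returns a single row instead of the array the ForEach activity needs:

    {
      "name": "LookupTables",
      "type": "Lookup",
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": "SELECT table_schema, table_name FROM information_schema.tables WHERE table_name LIKE 'Customer%'"
        },
        "dataset": {
          "referenceName": "SourceSqlDataset",
          "type": "DatasetReference"
        },
        "firstRowOnly": false
      }
    }

With that in place, the ForEach Items property is simply @activity('LookupTables').output.value, and each iteration exposes the current row through @item().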

2. Create the Child Pipeline

  • Add parameters to the pipeline:

    • schema and table_name.
  • Add a Copy Data activity:

    • Configure the Source dataset to read from the dynamic table (schema.table_name) if needed; a sample dynamic source query is sketched after this list.

    • Configure the Sink dataset:

      • Set the filename property to use the expression @concat(pipeline().parameters.schema, '.', pipeline().parameters.table_name).
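
If the Source side also needs to be dynamic, one option (again assuming a SQL source; the source type is a placeholder) is to build the query from the same pipeline parameters instead of hard-coding a table in the Source dataset:

    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": {
        "value": "@concat('SELECT * FROM ', pipeline().parameters.schema, '.', pipeline().parameters.table_name)",
        "type": "Expression"
      }
    }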

Benefits of this Design

  1. Reusability: The Child pipeline can be reused for different tasks since it is driven by parameters.

  2. Scalability: The Parent pipeline handles multiple table schemas and table names dynamically without manual configuration.

  3. Flexibility: You can extend this design for additional transformations or use cases.


Conclusion

The Parent-Child design pattern in Azure Data Factory is ideal for dynamic and scalable data orchestration. In this example, you’ve learned how to:

  1. Fetch dynamic metadata from a database.

  2. Use ForEach to iterate through the metadata.

  3. Invoke a parameterized Child pipeline to handle dynamic data movement and transformations.

This design can be a robust foundation for automating data workflows in Azure.