
Azure Data Factory (ADF) – Building Pipelines

Time to read: 5 minutes. By Arun Sirpal (Microsoft MVP), Editorial Contributor

Contents:

Introduction

Creating Azure Data Factory

Creating a Pipeline

Executing and Monitoring

 

Introduction

With many businesses moving towards Microsoft Azure, you may be wondering what the impact is on data integration techniques. How would you build the solutions required for ingestion and orchestration of data? What tools would you use? This is where Azure Data Factory (ADF) forms a central piece of your architecture. With this tool, you have access to a fully managed, serverless cloud data integration service that scales on demand.

This is done by building pipelines – data-driven workflows that usually perform the steps shown below.

Building a Pipeline

From a modern data warehousing point of view, it sometimes helps to see the bigger picture. If you browse https://azure.microsoft.com/en-gb/solutions/architecture/ you will see different solution architectures in which Azure Data Factory is the core tool for ingesting and moving data, whether structured or unstructured. It is important to state that Azure Data Factory does not actually store any data; its key purpose is to let you build data-driven workflows that orchestrate the movement of data and even allow for certain transformations, conceptually very similar to on-premises SSIS (SQL Server Integration Services). It does hold the credentials needed to authenticate to different Azure data sources, but these are encrypted.

Data movement in Azure Data Factory has been certified for several compliance standards, including HIPAA, HITECH and ISO 27001/27018.

 

Creating Azure Data Factory

Creating an Azure Data Factory to build your pipelines is relatively straightforward; all that is needed is an active Azure subscription. If you want to be more granular with permissions, the user signing into the subscription must be a member of the Contributor or Owner role, assuming they are not the administrator.

So, from the main page of the Azure portal select Create a resource on the left menu, select Analytics, and then select Data Factory.

 

Azure Portal

This will then take you to the main creation wizard. Here you will need to complete the common details such as resource group, location and whether you want Git integration.

 

Creation Wizard

Once you are happy with the settings, click Create; this will take you to the main Azure Data Factory dashboard.

 

Azure Data Factory
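As an aside, if you prefer to script this step rather than click through the portal, the same factory can be created with the Python management SDK. Below is a minimal sketch assuming the azure-identity and azure-mgmt-datafactory packages; the subscription ID, resource group, factory name and region are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"
resource_group = "rg-adf-demo"      # placeholder resource group
factory_name = "adf-movies-demo"    # placeholder factory name

# DefaultAzureCredential picks up Azure CLI, environment or managed identity credentials
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Equivalent of completing the creation wizard: resource group, name and location
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="uksouth")
)
print(factory.provisioning_state)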

Creating a Pipeline

To showcase the capability of implementing pipelines, I will create a basic pipeline that connects to Azure Data Lake Gen2 to extract a CSV file of movie data. From here I will apply transformations, filtering on a column to keep comedy movies where the year of production is greater than 1999, and then move the results into a table in Azure SQL Database for reporting. The high-level design looks like the below.

Microsoft Azure
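The transformation logic itself is simple: keep rows where the genre is Comedy and the year of production is greater than 1999. Purely as a local illustration of that filter (outside Azure Data Factory, with assumed column names genre and year and an assumed file name), it amounts to something like this:

# Local prototype of the filter, for illustration only; the real transformation
# runs inside the ADF mapping data flow. Column and file names are assumptions.
import pandas as pd

movies = pd.read_csv("movies.csv")  # the CSV held in Azure Data Lake Gen2
comedies = movies[(movies["genre"] == "Comedy") & (movies["year"] > 1999)]
print(len(comedies))  # row count to compare against the sink table later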

The image below shows what the pipeline design looks like in Azure Data Factory.

 

Azure Data Factory
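Behind that design, the source and sink connections are linked services, which is where the encrypted credentials mentioned earlier are held. The authoring UI creates these for you, but as a rough sketch the Data Lake Gen2 connection could equally be registered from the Python SDK; the storage URL and linked service name below are placeholders.

# Sketch only: registering an ADLS Gen2 linked service from the SDK.
# adf_client, resource_group and factory_name are as defined in the earlier factory sketch.
from azure.mgmt.datafactory.models import AzureBlobFSLinkedService, LinkedServiceResource

adls_linked_service = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(url="https://<storage-account>.dfs.core.windows.net")
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "ls_adls_movies", adls_linked_service
)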

Even though this is not a full guide on how and why certain settings should be configured, I will mention one important element when building the pipeline: previewing data. To do this you will need to enable Data Flow debug (red box below), which under the covers spins up a cluster for the data preview to work.

Data Flow Debug

Data Preview

Once happy with the pipeline, you should then validate and publish it, as shown via the blue boxes in the previous image.

Executing and Monitoring

Assuming a successful validation and publish, the next phase is to execute the pipeline. To do this, navigate to the pipeline and click Add trigger; here you will have the option to “Trigger now”.

 

Trigger now
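For completeness, the same on-demand execution can be started from the Python SDK as well. A minimal sketch, assuming the pipeline was published as MoviesPipeline (a placeholder name) and reusing the client from the earlier sketch:

# Equivalent of "Trigger now" from the SDK; keep the run id for monitoring.
run_response = adf_client.pipelines.create_run(
    resource_group, factory_name, "MoviesPipeline", parameters={}
)
print(run_response.run_id)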

 

Once triggered and successfully executed, you will then need to click the monitor icon in the left-hand Azure Data Factory UI panel (shown via the red box below).

 

Azure Data Factory UI panel

 

Here you will find the pipeline activity, which you can drill into to get some great execution statistics; look out for the binocular symbols shown below.

 

Binocular symbol

This section holds all the details about the pipeline, more specifically the row movement, the number of partitions utilised and processing times.

Row Movement
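The same run details can also be pulled programmatically rather than through the monitor blade. A small sketch that polls the run started above until it finishes, then prints its status and duration (again reusing the client and run_response from the earlier sketches):

import time

# Poll the pipeline run until it leaves the Queued/InProgress states
while True:
    run = adf_client.pipeline_runs.get(resource_group, factory_name, run_response.run_id)
    if run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print(run.status)          # Succeeded, Failed or Cancelled
print(run.duration_in_ms)  # overall processing time in milliseconds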

 

Just to confirm that the same row count seen above exists within the Azure SQL Database, a quick query shows that it is correct.

SELECT @@VERSION AS [Version];

SELECT COUNT(*) AS [RowCount] FROM [dbo].[movies];

Row count

Hopefully, after reading this blog post, you can see how simple it can be to build data pipelines to ingest and transform data in Azure.


Post Terms: Azure Data Factory | Cloud | Integration | System integration

About the Author

Arun Sirpal, writing here as a freelance blogger, is a four-time former Data Platform MVP, specialising in Microsoft Azure and Database technology. A frequent writer, his articles have been published on SQL Server Central and Microsoft TechNet alongside his own personal website. During 2017/2018 he worked with the Microsoft SQL Server product team on testing the vNext adaptive query processing feature and other Azure product groups. Arun is a member of Microsoft’s Azure Advisors and SQL Advisors groups and frequently talks about Azure SQL Database.

Education, Membership & Awards

Arun graduated from Aston University in 2007 with a BSc (Hon) in Computer Science and Business. He went on to work as an SQL Analyst, SQL DBA and later as a Senior SQL DBA, DBA Team Lead and now Cloud Solution Architect. Alongside his professional progress, Arun became a member of the Professional Association for SQL Server. He became a Microsoft Most Valuable Professional (MVP) in November 2017 and has since won the award for the fourth time.

