Time to read: 5 minutes
By Arun Sirpal (Microsoft MVP), Editorial Contributor
Contents:
Introduction
Creating Azure Data Factory
Creating a Pipeline
Executing and Monitoring
Introduction
With many businesses moving towards Microsoft Azure, you may be wondering what the impact is on data integration techniques. How would you build the solutions required for ingesting and orchestrating data? What tools would you use? This is where Azure Data Factory (ADF) forms the most important piece of your architecture. With this tool, you have access to a fully managed, serverless cloud data integration service that scales on demand.
This is done by building pipelines – data-driven workflows that typically perform the steps shown below.
From a modern data warehousing architecture point of view it is sometimes good to see the bigger picture. If you read the following link https://azure.microsoft.com/en-gb/solutions/architecture/ you will see different solution architectures where Azure Data Factory is the core tool for ingesting and moving data, whether structured or unstructured. It is important to state that Azure Data Factory does not actually store any data; its key purpose is to let you build data-driven workflows that orchestrate the movement of data and even allow for certain transformations, similar in concept to on-premises SSIS (SQL Server Integration Services). It does hold the credentials needed to authenticate to different Azure data sources, but these are encrypted.
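To make that last point concrete, here is a minimal sketch, using the azure-identity and azure-mgmt-datafactory Python packages, of how a linked service (the object that holds those encrypted credentials) might be registered. The subscription, resource group, factory name and connection string are placeholders, not values from this post.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString
)

# ADF keeps only the connection details (encrypted), never the data itself.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "StorageLinkedService", storage_ls
)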
Data movement in Azure Data Factory has been certified for several compliance standards, including HIPAA, HITECH and ISO 27001/27018.
Creating Azure Data Factory
Creating an Azure Data Factory to build your pipelines is relatively straightforward; all that is needed is an active Azure subscription. If you want to be more granular with permissions, then the user signing in to the subscription must be a member of the Contributor or Owner role, assuming they are not the administrator.
So, from the main page of the Azure portal select Create a resource on the left menu, select Analytics, and then select Data Factory.
This will then take you to the main creation wizard. Here you will need to complete the common details such as resource group, location and whether you want Git integration.
Once you are happy with the settings, click Create, which will then take you to the main Azure Data Factory dashboard.
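If you prefer scripting over clicking through the portal, the same factory can be created with the Azure Python SDK. A minimal sketch, again assuming the azure-identity and azure-mgmt-datafactory packages and placeholder names:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) the factory inside an existing resource group.
factory = adf_client.factories.create_or_update(
    "<resource-group>", "<factory-name>", Factory(location="uksouth")
)
print(factory.provisioning_state)  # should report 'Succeeded' once ready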
Creating a Pipeline
To showcase the capability of implementing pipelines I will create a basic pipeline that connects to Azure Data Lake Storage Gen2 to extract a CSV file containing movie data. From here I will apply two transformations to filter down to comedy movies where the year of production is greater than 1999. Then I will move the results to a table in an Azure SQL Database for reporting. The high-level design looks like the diagram below.
The image below shows what the pipeline design looks like in Azure Data Factory.
Even though this is not a full guide on how and why certain settings should be configured, I will mention one important element when building the pipeline: previewing data. To do this you will need to enable Data Flow debug (red box below), which under the covers creates a cluster so that the data preview can work.
Once happy with the pipeline, you should then validate and publish it, as shown via the blue boxes in the previous image.
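The data flow itself (source, filter, sink) is easiest to author visually as shown above, but the pipeline that runs it can also be defined in code. A rough sketch, assuming the azure-mgmt-datafactory Python SDK, the adf_client object from the earlier snippets, and a data flow already published under the placeholder name MovieFilterDataFlow:

from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference
)

# Wrap the published mapping data flow in a pipeline activity.
# 'MovieFilterDataFlow' and 'MoviePipeline' are placeholder names.
run_data_flow = ExecuteDataFlowActivity(
    name="FilterComedyMovies",
    data_flow=DataFlowReference(reference_name="MovieFilterDataFlow", type="DataFlowReference"),
)
pipeline = PipelineResource(activities=[run_data_flow])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "MoviePipeline", pipeline
)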
Executing and Monitoring
Assuming a successful validation and publish, the next phase is to execute the pipeline. To do this, navigate to the pipeline and click Add trigger, where you will have the option to “Trigger now”.
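“Trigger now” also has a programmatic equivalent. A small sketch, assuming the same azure-mgmt-datafactory client and the placeholder pipeline name used earlier:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off a single on-demand run of the published pipeline.
run_response = adf_client.pipelines.create_run(
    "<resource-group>", "<factory-name>", "MoviePipeline", parameters={}
)
print(run_response.run_id)  # keep this id for monitoring the run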
Once triggered and successfully executed, you will then need to click on the monitor icon in the left-hand Azure Data Factory UI panel (shown via the red box below).
Here you will find the pipeline activity, which you can drill into to get some great execution statistics; you will want to look out for the binocular symbols shown below.
This section holds all the details about the pipeline, more specifically the row movement, the number of partitions utilised and processing times.
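The same run statistics can also be pulled back over the API rather than through the monitor pane. A sketch, again assuming the azure-mgmt-datafactory client and the run_id returned by the trigger above:

from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import RunFilterParameters

# Overall status of the run (Queued / InProgress / Succeeded / Failed);
# run_id comes from the create_run call in the previous sketch.
pipeline_run = adf_client.pipeline_runs.get("<resource-group>", "<factory-name>", run_id)
print(pipeline_run.status)

# Per-activity details for roughly the last day, similar to what the monitor pane shows.
filter_params = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<factory-name>", run_id, filter_params
)
for activity in activity_runs.value:
    print(activity.activity_name, activity.status, activity.duration_in_ms)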
Just to confirm that the same row count seen above exists within the Azure SQL Database, a quick query shows that it is correct.
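-- Check the server version, then confirm the number of rows loaded into the movies table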
SELECT @@VERSION AS [Version]
SELECT COUNT(*) AS [RowCount] FROM [dbo].[movies]
Hopefully, after reading this blog post, you can see how simple it can be to build data pipelines to ingest and transform data in Azure.