In this blog series, we will examine common Business Intelligence scenarios using finance data, such as pivoting data, changing formats, and working with dates. The series will use Power BI and Azure Synapse Analytics to showcase the data quickly and scale effortlessly to meet business needs.
The series is accessible to business users who may not be comfortable with coding but who are familiar with Excel, since all code is supplied. Power BI is a natural next step for Excel users, and this blog series will cover some of the most common scenarios that an Excel user might come across during analytical work. The series also covers SQL, and all of the SQL code is provided.
There are a number of steps throughout the series, presented here at a high level.
In the first post, we will cover introductory data engineering with Power Query. Power Query is present in both Excel and Power BI, and it is incredibly useful for many projects. It is often estimated that data scientists spend 80% of their time fixing data, so this post introduces some simple ways to get Excel data into Azure Synapse Analytics and to do some simple data shaping with SQL and Power Query.
In the second post, we cover Azure Synapse Analytics in more detail. We will explore how we can shape the data in Azure Synapse Analytics, as well as explore some of the Studio interface.
In the third and final post, we will focus on Power BI. We will import the data into Power BI and create data visualisations using the data that we shaped in Azure Synapse Analytics. Here are the high-level steps:
Blog 1: Data Engineering with Power Query. (This post)
Blog 2: Shaping Data in Azure Synapse Analytics. (Coming soon)
Blog 3: Visualising Azure Synapse Analytics Data in Power BI. (Coming soon)
Data Engineering with Power Query
In this post, we will cover data engineering with Power Query, and we will import the data into Azure Synapse Analytics. We will use some Excel data files, and then transform and shape the Excel data to remove duplicates. As a final step, we will import the Excel files into Azure Synapse Analytics.
Data Sources in Excel
The blog series will use two data files. One is an example file downloaded from the World Data Bank website. The other contains data from the World Economic Outlook Database, which is published by the International Monetary Fund (IMF). It has been amended in advance to help the reader through the tutorial.
Using Power Query to transform and shape duplicate data
Open the CLASS.xlsx file from the World Data Bank website.
You can see the Excel file contains several columns, such as the Group name, Country Code, and Country Name.
However, there is a problem with this data: when we look through this example, we can see that the country names are repeated. Duplicate data is a very common occurrence in real-life examples, and it causes confusion. When Power BI users analyse the data, they may not see correct results, because duplicates also affect the relationships between tables. If the dimension holds multiple copies of each country, Power BI cannot tell which row is the right one to pick, and this can produce errors.
In this section, Power Query is used to remove the duplicate data. Power Query is accessed from the Data tab. We can select all the data by clicking on a column heading and pressing CTRL + A. The data is turned into an Excel table, which forms the data source for the query.
To open Power Query, navigate to the Data tab and select the From Table/Range option in the Get & Transform Data group.
Figure 1 Accessing the Power Query Editor in Excel
In this example, Power Query will be used to filter the data to the rows where the GroupName column holds the World group name. Other groups, such as ‘Arab World’ or ‘Europe’, will be removed.
To do that, select the filter on the column, look for the World option in the drop-down list, and select only that option. Here is an example:
Figure 2 Filtering data in Power Query
The table now displays the data for the World group only, with one row for each country. The filter removes the rows that carried the duplicate country names.
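For readers who would also like to see this step as code, here is a minimal Python sketch of the same filter. The rows below are invented sample data; the column names simply mirror those in CLASS.xlsx.

```python
# Invented sample rows that mimic the CLASS.xlsx layout.
rows = [
    {"GroupName": "World", "CountryCode": "ABW", "CountryName": "Aruba"},
    {"GroupName": "Arab World", "CountryCode": "ARE", "CountryName": "United Arab Emirates"},
    {"GroupName": "World", "CountryCode": "ARE", "CountryName": "United Arab Emirates"},
    {"GroupName": "Europe & Central Asia", "CountryCode": "ALB", "CountryName": "Albania"},
    {"GroupName": "World", "CountryCode": "ALB", "CountryName": "Albania"},
]

# Keep only the rows in the 'World' group, as the Power Query filter does.
world_rows = [row for row in rows if row["GroupName"] == "World"]

# Each country now appears exactly once.
print([row["CountryName"] for row in world_rows])
```

Because each country belongs to the World group exactly once, filtering to that single group removes the duplicates in one step.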
It is also possible to create an index column in Power Query. To do that, go to the Add Column tab at the top, select the Index Column option, and choose the From 1 option so that the index starts at the number 1.
Figure 3 Adding an Index Column to the data
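The same idea can be sketched in Python, where enumerate with start=1 mirrors the From 1 option; the rows here are invented samples.

```python
# Invented sample rows after the 'World' filter has been applied.
rows = [
    {"CountryName": "Aruba"},
    {"CountryName": "Afghanistan"},
    {"CountryName": "Angola"},
]

# Add an Index column starting at 1, like Power Query's 'From 1' option.
indexed_rows = [{"Index": i, **row} for i, row in enumerate(rows, start=1)]

print(indexed_rows[0])
```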
The next step is to commit the change. To do this, select the Close & Load button at the top left-hand side of the Power Query Editor. This action returns the data to the Excel workbook and closes the Power Query Editor.
Figure 4 Committing Power Query changes with the Close and Load functionality
The data will appear in a new worksheet in the Excel file. For the purposes of this article, copy all of this data into a new Excel file, which will be uploaded to the Azure Synapse Analytics storage account. From the storage account, the data can then be imported into the Azure Synapse Analytics SQL pool database.
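If you prefer to script this hand-off, the cleaned rows can also be written out with Python's standard csv module; CSV is used here only for simplicity, since writing .xlsx requires a third-party library, and the file name country_clean.csv is an illustrative choice rather than one used by the tutorial.

```python
import csv
import os
import tempfile

# Invented cleaned rows, ready to be uploaded to the storage account.
rows = [
    {"Index": 1, "CountryName": "Aruba", "CountryCode": "ABW"},
    {"Index": 2, "CountryName": "Afghanistan", "CountryCode": "AFG"},
]

# Write the rows to a file; in the tutorial, this is the file you would
# upload to the Azure Synapse Analytics storage account.
path = os.path.join(tempfile.gettempdir(), "country_clean.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Index", "CountryName", "CountryCode"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the contents survived the round trip.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["CountryName"])
```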
To summarise, the original data set has been cleaned so that each country appears only once in the data set.
Importing the Excel Files into the Azure Synapse Analytics storage account
To import the data into Azure Synapse Analytics, it is easiest to use the Azure Portal. In the portal, look for the Storage Account which underpins Azure Synapse Analytics; in this example, the storage account is called dlsballardchalmers. In the panel on the left-hand side, find the Data Lake Storage section and select the Containers option to look for the file store that was configured during the Azure Synapse Analytics setup. In this article, the file storage is called fsballardchalmers. Here is an example:
Figure 5 Navigating through the Azure Synapse Analytics storage workspace
Clicking on the file storage opens the functionality to upload the files that were created previously. Selecting the Upload button opens the Upload blob panel on the right-hand side. The folder icon opens a file explorer, which allows the user to navigate to the files, select them, and upload them to the storage account. An example is provided below.
Figure 6 Upload files to the Storage Account
The data is now in the Azure Synapse Analytics storage account, and it can be imported for use in the Azure Synapse Analytics SQL pool.
Importing data from the Storage Account into the Azure Synapse Analytics SQL pool
In this section, the focus is on importing the data from the files and loading it into the Azure Synapse Analytics SQL pool. There are two files to import: the IMFData file and the Country file.
Importing the IMF Excel File into Azure Synapse Analytics
The import is completed using Azure Synapse Analytics Studio, which is accessible from the Dedicated SQL Pool in Azure Synapse Analytics. The Ingest option, which appears towards the middle of the Studio home screen, offers the functionality to import the data. Here is an example:
Figure 7 Ingest option in Azure Synapse Analytics
Selecting the Ingest button opens the Copy Data Tool, which copies the data from the source to the destination. Here is an example of the Copy Data Tool.
Figure 8 The Copy Data Tool in Azure Synapse Analytics
The Copy Data Tool needs a source and a destination. In this example, the IMFData Excel file will form the source, and the Azure Synapse Analytics SQL pool will form the destination.
As a first step, select the Next button and choose the workspace default file storage; in this example, the file storage is called fsballardchalmers. Select the Next button, navigate to the IMFData.xlsx Excel file, and select the Choose button to pick it. When the file is selected, the screen details will appear as follows; select Next to continue:
Figure 9 Selecting the Source file in the Copy Data Tool
The next step is to retrieve the metadata for the source file. The Copy Data Tool can identify that the source is an Excel file, and it will list the worksheets in the Excel workbook. In this example, the worksheet name is Sheet1, and it can be selected from the drop-down list or typed into the field. For the purposes of the tutorial, the column headers are stored in the first row, so the First row as header option should be selected. Here is an illustration:
Figure 10 File Settings in the Copy Data Tool
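The First row as header option means the first line of the sheet supplies the column names rather than a row of data. Python's csv.DictReader behaves the same way, as this small sketch with invented data shows.

```python
import csv
import io

# A tiny in-memory file: the first line is the header row, as in Sheet1.
raw = "CountryName,Value\nAruba,3.5\nAfghanistan,2.1\n"

# DictReader consumes the first line as field names, not as data.
reader = csv.DictReader(io.StringIO(raw))
records = list(reader)

print(reader.fieldnames, len(records))
```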
Once the file format settings have been configured, select the Next button to configure the destination data store in the Copy Data Tool. The next step is to select the Azure Synapse Analytics destination store, which is available from the Azure option or the All option.
Figure 11 Selecting the Azure Synapse Analytics destination
To proceed, select the Azure Synapse Analytics destination, and then select the Next button.
In the next step, there is the option to map the source data to a table. The source data is the Excel file stored in Azure Data Lake storage, and this data will be imported and transformed into the destination table. On the left-hand side of the destination, we can set the schema name, and on the right-hand side, the table name. In the example below, the schema is set to dim, a shortened version of the word Dimension, which is a commonly used term in Business Intelligence to denote a way of describing data, and the table name is set to IMFData.
When we have specified the schema and table name, we can select the Next button. Here is an example:
Figure 12 Table Mapping in Azure Synapse Analytics
Next, we can review the column mapping, which maps the source data columns to the destination data columns. You can see that each source column is mapped to a destination column, and that the format of every column is the string format; for example, Value appears as a string rather than a number. To resolve this issue, the data will be transformed to appropriate destination data types in a later step.
Figure 13 Column mapping in the Copy Data Tool
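Conceptually, that later step casts the string values to numbers. A small Python sketch of the idea, using invented sample values, is shown below; in Azure Synapse Analytics itself the cast would be done in SQL.

```python
# Values arrive from the Copy Data Tool as strings (invented samples).
raw_values = ["3.5", "2.1", "n/a"]

def to_number(text):
    """Cast a string to float, returning None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

converted = [to_number(value) for value in raw_values]
print(converted)
```

Non-numeric entries become None rather than raising an error, mirroring how a NULL would appear in a database column.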
The final step is to configure the remainder of the settings. In this example, we will choose bulk insert as the way of loading the data. We can also enable a staging area by selecting the Enable staging option and choosing the storage information again; here, the workspace default storage is used as the staging area. A summary then appears, and selecting the Next button commences the data import.
This will deploy our pipeline for copying the data from the Excel file into Azure Synapse Analytics, and the data will be stored in the Azure Synapse Analytics SQL pool. Once the pipeline is created, we can select the Finish button, and the data will be imported.
Importing the Country Data Excel file
The methodology for importing the Country Data file is the same, using the Copy Data Tool.
The mapping between the Country source columns and destination columns is shown here as an illustration.
Figure 14 Country mapping columns between source and destination
For the import of the Country data, the Settings page will appear as follows once the Copy Data Tool has been configured.
Figure 15 Settings to import Country Data into Azure Synapse Analytics
As with the IMF data import, once the pipeline is created, we can select the Finish button and the data is imported.
Now the data has been landed in Azure Synapse Analytics, but it needs to be further shaped to correct underlying data quality issues. In the next steps, the data quality issues will be examined and resolved using Azure Synapse Analytics.
Cleaning up Azure Synapse Analytics Resources when not in use
Note that it is very important to pause the SQL pool when it is not being used. It is a billable resource, so the organisation is charged for data warehouse units and for the data stored in the SQL pool. The instructions to pause or delete the resources follow.
In the Azure Portal, click on your SQL pool. To pause compute, click the Pause button. When the SQL pool is paused, the Resume button replaces the Pause button. To resume compute, click Resume.
In this post, we covered some simple data engineering with Power Query, and we imported the data into Azure Synapse Analytics storage. Using Excel data files, we transformed and shaped the Excel data to remove duplicates. As a final step, we imported the Excel files into Azure Synapse Analytics storage, ready for our next steps in the second blog post. We look forward to seeing you there!