Working on Data Lakes with Azure
Data storage methods and techniques have evolved considerably as data science has grown in popularity. There are now many storage options and many providers that help developers meet their storage needs.
In this article, we will look at some of the basic operations and functionality offered by Azure Data Lake Storage and walk through the setup from scratch: creating an Azure Databricks service, then a Spark cluster within that service, and, finally, a file system in Azure Data Lake Storage. We will also look at some simple sample code in Scala that demonstrates a few of the tasks that can be performed with this system.
Introduction
Data lakes are collections of large datasets kept in their raw format. They hold large amounts of both structured and unstructured data, and the data can be stored in any state until it is needed for a particular purpose. One of the most convenient ways to work with data lakes is with the help of Microsoft Azure.
Microsoft Azure provides data lake storage through Azure Data Lake Storage, which lets users store data and run a wide range of big data analytics operations. A storage account can contain multiple containers, and each container can hold multiple files and folders for different purposes. Data accessed through the Azure Blob File System (ABFS) driver is encrypted in transit over TLS (the successor to SSL), and the service is compatible with a majority of existing solutions on the market, making it easy to connect to.
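As a minimal sketch of how the ABFS addressing scheme fits together, the snippet below composes an abfss:// URI from a container (file system) name, a storage account name, and a path within the container. The names used here (moviesfs, mystorageacct, demo/movies.csv) are placeholders for illustration only, not real resources.
// A minimal sketch of how an ABFS (Azure Blob File System) URI is composed.
// The container, account, and file names below are placeholders, not real resources.
val exampleFileSystem = "moviesfs"        // container (file system) inside the storage account
val exampleAccount    = "mystorageacct"   // ADLS Gen2 storage account
val examplePath       = "demo/movies.csv" // path to a file within the container

// abfss:// is the TLS-encrypted scheme used by the ABFS driver.
val exampleUri = s"abfss://$exampleFileSystem@$exampleAccount.dfs.core.windows.net/$examplePath"
// -> abfss://moviesfs@mystorageacct.dfs.core.windows.net/demo/movies.csv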
To follow this article, you will need at least a free trial account or subscription to Azure services. Even without a paid account, you can use the fourteen-day free trial to explore the workspace and follow along. Some basic knowledge of SQL and programming will suffice for the rest of the tutorial. So, without further ado, let us get started.
Create Your Azure Databricks Service:
To create an Azure Databricks service, the first step is to sign in to the Azure portal and choose 'Create a resource' from the portal menu. Once you click the 'Create a resource' option, you will be taken to the next window; there, click the Analytics category and choose the Azure Databricks option.
Once you have completed the steps above, you will be directed to the Azure Databricks service creation page. Here, you can view a set of properties and descriptions; select the options that are most suitable for your particular task. The account creation might take a few minutes to complete, and the status of the operation can be monitored with the progress bar at the top.
Create a Spark Cluster in Azure Databricks:
Once the previous steps are complete, we can proceed to create a Spark cluster in Azure Databricks. Spark is a general-purpose distributed data processing engine used in a wide range of applications. In addition to its core data processing engine, Spark includes libraries for machine learning, stream processing, SQL, and graph computation, which can be used together in a single application.
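As a small illustration of how these pieces fit together, the notebook-style sketch below builds a tiny in-memory DataFrame and queries it with Spark SQL. The sample data and the sample_movies view name are made up for demonstration, and the snippet assumes a Databricks notebook where spark and display are already available.
// A small illustration of Spark's DataFrame and SQL APIs working together.
// Assumes a Databricks notebook, where a SparkSession named `spark` is already available.
import spark.implicits._

// A tiny in-memory dataset standing in for real data.
val sampleMovies = Seq((1, "Toy Story"), (2, "Jumanji"), (3, "Heat")).toDF("movieId", "title")

// The DataFrame API and SQL can be mixed freely.
sampleMovies.createOrReplaceTempView("sample_movies")
display(spark.sql("SELECT movieId, title FROM sample_movies WHERE movieId > 1"))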
To create the Spark cluster, revisit the Databricks service in the Azure portal and launch the workspace. From the workspace, click the 'New Cluster' option; you will be redirected to a new cluster page, where you can set your requirements and proceed to create the cluster.
Create a file system in the Azure Data Lake Storage:
Once you enter the workspace, click on the 'create a new notebook' option and select 'Scala'. We will write our code blocks in Scala, since much of the readily available sample code for Azure Databricks is written in Scala. You can add additional storage files or data lakes at your convenience. In this article, I will be using a GitHub reference with the movies.csv file. Let us look at some sample code blocks to better understand Azure Databricks.
Code Sample
In this section, we will walk through a few simple Scala code blocks to better understand how to work with Azure Databricks. I recommend checking out the accompanying GitHub link to gain a better understanding of the data elements and workflow.
In the following code block, we attach the storage to our script. Using this method, you can attach your data lake to the system. We will use a slightly more secure and reliable way of attaching the data lake: authenticating with a service principal via OAuth (the data, in this instance, is the movies.csv file provided in the GitHub link).
// Service principal (OAuth) credentials and storage details -- fill in your own values.
val appID              = "" // application (client) ID of the service principal
val password           = "" // client secret of the service principal
val tenantID           = "" // Azure AD (directory) tenant ID
val fileSystemName     = "" // container (file system) name in the storage account
val storageAccountName = "" // ADLS Gen2 storage account name

// OAuth configuration used by the ABFS driver to authenticate against the data lake.
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> appID,
  "fs.azure.account.oauth2.client.secret" -> password,
  "fs.azure.account.oauth2.client.endpoint" -> ("https://login.microsoftonline.com/" + tenantID + "/oauth2/token"),
  "fs.azure.createRemoteFileSystemDuringInitialization" -> "true")
The next code block attaches the data lake to the cluster and mounts it so that it can be accessed as if it were a local drive.
// Mount the data lake container at /mnt/data using the OAuth configuration above.
dbutils.fs.mount(
  source = "abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/",
  mountPoint = "/mnt/data",
  extraConfigs = configs)
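To confirm that the mount succeeded, you can list its contents with the Databricks file system utilities; the short sketch below also shows how to remove the mount when it is no longer needed. It assumes the /mnt/data mount point from the code above and a Databricks notebook where dbutils and display are available.
// Quick sanity checks on the mount point created above (run in a Databricks notebook).
display(dbutils.fs.ls("/mnt/data"))   // list the files and folders at the root of the mount

// dbutils.fs.mounts()                // optionally, inspect all active mounts

// When the mount is no longer needed, it can be removed:
// dbutils.fs.unmount("/mnt/data")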
Finally, we can perform numerous operations on the data lake and its contents with a few lines of Scala. The following code block is a simple illustration of the kinds of functionality you can implement. Visit the GitHub resource mentioned previously for further information on the movies dataset and the code.
// Read the CSV without headers (columns get default names _c0, _c1, ...).
val rawDf = spark.read.csv("/mnt/data/demo/movies.csv")
display(rawDf)

// Read again, treating the first row as a header, then select two columns
// and write the result back to the data lake as CSV.
val df = spark.read.option("header", "true").csv("/mnt/data/demo/movies.csv")
val selected = df.select("movieId", "title")
selected.write.csv("/mnt/data/demo/movies-2.csv")
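As a small follow-up sketch, the mounted data can also be queried with Spark SQL by registering the DataFrame as a temporary view. This assumes the df DataFrame from the block above and uses only the movieId and title columns referenced there; the movies view name and the commented-out output path are illustrative.
// Register the DataFrame read above as a temporary view and query it with SQL.
// Assumes `df` from the previous block; only the movieId and title columns are used.
df.createOrReplaceTempView("movies")

val firstTen = spark.sql(
  "SELECT movieId, title FROM movies ORDER BY CAST(movieId AS INT) LIMIT 10")

display(firstTen)

// The same result could also be written back to the data lake, e.g. as Parquet:
// firstTen.write.parquet("/mnt/data/demo/movies-top10.parquet")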
Conclusion
In this article, we discussed a few of the basic operations and functionalities of Azure Data Lake Storage, along with a practical demonstration. Azure Databricks is an important service for solving big data analytics problems. Apart from setting up a basic Azure Databricks environment, we also briefly looked at a Scala code sample on a test dataset.
I highly recommend exploring further references to gain a more intuitive understanding of this topic. There are video guides that cover most of the essentials needed to get started with big data analytics in Azure, as well as tutorials that cover the significant steps for managing storage in Azure and the many operations that can be performed on it. Check them out to learn how to work with Azure Data Lake Storage both practically and theoretically.