Microsoft Azure
Overview
Microsoft Azure is a cloud computing service for building, testing, deploying, and managing applications and services through a global network of Microsoft-managed data centers. Data Management supports three flavors of Azure data storage: ADLS (Azure Data Lake Store) Gen1 and Gen2, and Azure Blob Storage. All support Azure Active Directory (AD) for authentication. In order for Data Management to authenticate with an Azure storage account, you must provide Azure AD authentication credentials. Check with your system administrator to determine which authentication method is used in your organization and obtain the appropriate credentials.
Microsoft ADLS
There are two versions of ADLS: Gen1 and Gen2. Both use Azure Active Directory (AD) for authentication. In order for Data Management to authenticate with Azure Data Lake storage, you must configure service-to-service authentication by creating an Azure Active Directory application, using the AD application to generate authentication credentials, and providing those credentials to Data Management. Consult your system administrator for details.
Authentication credentials are version-specific, and are not interchangeable between Gen1 and Gen2. If you have both Gen1 and Gen2 Data Lake Stores, you must configure each with its own credentials. While a misconfigured connection may successfully authenticate, functionality will be limited and unexpected behavior may occur.
Once you have configured access to an ADLS account, any Data Management browse dialog that has access to the account will display a new ADLS file system icon at the root level of the browse window. Expanding this item displays a list of all the authenticated ADLS accounts.
Paths in this file system use the DFS path prefix adls:///.
If you configured access to an ADLS Gen2 account, any browse dialog that has access to the account will display the DFS path prefix adl2:/// and a new file system labeled ADLS Gen2 Accounts. Expanding this item displays a list of all the authenticated ADLS Gen2 accounts.
Configure access to ADLS Gen1
To configure access to ADLS Gen1:
In the repository, open the Settings > Cloud folder.
Select the Azure icon.
Go to the Properties pane.
On the ADLS Settings tab, enter the authentication credentials generated by the Azure AD application:
Tenant ID: the Directory ID associated with the AD application.
Application ID: the Application ID associated with the AD application.
Client secret: the Authentication ID associated with the AD application. This may be a Password or Key Vault reference.
Select in the ADLS accounts grid and enter the names of ADLS accounts to access using these authentication credentials.
Optionally, configure Tuning settings:
Read ahead queue depth sets the queue depth to be used for parallelized read-aheads of files. Accept the default value of 15 to maximize read performance, or set a new value from 0 (no read-aheads) to 20.
Buffer size (MB) sets the size of the tool's internal read buffer. Accept the default value of 4 MB to maximize read performance, or set a new value from 1 to 4.
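The settings above can be sanity-checked before they are entered. The following is a minimal sketch, not part of Data Management: the class name AdlsGen1Settings, its field names, and the validate method are all hypothetical, and treating the three credential fields as required is an assumption. Only the defaults and ranges for the tuning settings come from the steps above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdlsGen1Settings:
    """Hypothetical mirror of the fields on the ADLS Settings tab."""
    tenant_id: str
    application_id: str
    client_secret: str
    accounts: List[str] = field(default_factory=list)
    read_ahead_queue_depth: int = 15  # documented default; allowed range 0 to 20
    buffer_size_mb: int = 4           # documented default; allowed range 1 to 4

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the values look usable."""
        problems = []
        # Assumption: all three credential fields must be non-empty.
        for name in ("tenant_id", "application_id", "client_secret"):
            if not getattr(self, name).strip():
                problems.append(f"{name} is required")
        if not 0 <= self.read_ahead_queue_depth <= 20:
            problems.append("read_ahead_queue_depth must be between 0 and 20")
        if not 1 <= self.buffer_size_mb <= 4:
            problems.append("buffer_size_mb must be between 1 and 4")
        return problems
```

Calling validate() on an instance built with the defaults returns an empty list; setting read_ahead_queue_depth to 21 or buffer_size_mb to 0 adds one message per out-of-range value.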
Configure access to ADLS Gen2
To configure access to ADLS Gen2:
In the repository, open the Settings > Cloud folder.
Select the Azure icon.
Go to the Properties pane.
On the ADLS Gen2 Settings tab, select in the ADLS accounts grid and enter the names of ADLS accounts.
For each account, select the Authentication type and enter the authentication credentials generated by the Azure AD application for that account.
Authentication credentials are version-specific, and are not interchangeable between Gen1 and Gen2.
Authentication type | Credential |
Access key | May be a Password or Key Vault reference. |
OAuth2 Service Principal | |
Azure Blob Storage
Azure Blob Storage (ABS) is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing large volumes of unstructured data, such as text or binary data.
To access ABS from Data Management, you need to know the Account name and authentication credentials for an Azure account that has been configured with a Storage Account and Azure Blob Storage. Consult your system administrator for details.
Once you have configured access to this account, any Data Management browse dialog that has access to the account will display a new file system icon labeled Azure Blob Storage at the root level of the browse window. Expanding this item displays a list of the containers in the account.
URLs and paths: ABS versus WASB
URLs in the Azure Blob Storage file system begin with abs:///. The same URL with the abs: scheme specifier will work everywhere for native access to the same blobs stored in ABS.
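The three path prefixes mentioned on this page (adls:///, adl2:///, and abs:///) can be told apart by their scheme. The sketch below is illustrative only: the function name storage_kind and the lookup table are hypothetical and not part of Data Management, and it makes no assumption about what follows the prefix in a path.

```python
from urllib.parse import urlsplit

# Scheme names are taken from the prefixes quoted in this document.
SCHEME_TO_STORE = {
    "adls": "ADLS",
    "adl2": "ADLS Gen2",
    "abs": "Azure Blob Storage",
}


def storage_kind(url: str) -> str:
    """Map a URL to the storage type implied by its scheme specifier."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in SCHEME_TO_STORE:
        raise ValueError(f"unrecognized scheme: {scheme!r}")
    return SCHEME_TO_STORE[scheme]
```

A URL that starts with any other scheme raises ValueError rather than being guessed at.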
Directories and files in ABS
Technically, Azure Blob Storage does not have directories. ABS stores blobs in a “flat” namespace. In ABS blob names, the slash character / is actually just another character. However, tools that interface with Azure Blob Storage, including the Azure portal web site and the Azure Storage Explorer, recognize the slash character as an indicator of a "virtual directory" and interpret the view accordingly. Data Management follows this lead, and attempts to interpret the / character as a directory indicator. Note, however, that this can lead to conflicting interpretations, because, for example, all of these blob names can co-exist in ABS: foo, foo/, foo/bar, and foo//.
Data Management follows these rules:
Any blob ending with a / character, such as foo/bar/, is interpreted as a directory without the trailing slash, e.g., foo/bar.
The existence of any blobs with an interior / character imputes the existence of intermediate folder(s). For example, a blob named foo/bar/glarp imputes a directory named foo and a directory named foo/bar.
Any indication of a directory supersedes a conflicting indication of a regular file. For example, if there is a blob named foo/bar and also a blob named foo/bar/glarp, the name foo/bar will be considered a directory, and you will not be able to treat foo/bar as a regular file. If you list foo you will see a single directory named foo/bar. If you attempt to read or write a file named foo/bar you will get an error.
If you use Data Management to create a directory named foo, this will result in an empty blob being created that is named foo/.
Data Management attempts to reject all inconsistent use that would create conflicting interpretations of a path as both directory and file. For example, if you have a “file” named foo/bar and you attempt to create a “file” named foo/bar/glarp, that will be rejected, because foo/bar is not a directory.
If you delete a “directory” using Data Management, all children will be deleted. For example, if you have blobs named foo/bar and foo/glarp, and you delete the imputed “directory” foo, both blobs will be deleted.
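The interpretation rules above can be modeled in a few lines. This is a minimal sketch, not Data Management's actual implementation: the function names interpret and blobs_deleted_with are hypothetical, and skipping blob names with empty segments (such as foo//) is an assumption, since the rules above do not say how those are handled.

```python
from typing import Iterable, List, Set, Tuple


def interpret(blob_names: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (directories, files) implied by a flat list of blob names."""
    directories: Set[str] = set()
    file_candidates: Set[str] = set()
    for name in blob_names:
        is_dir_marker = name.endswith("/")
        # A trailing slash marks a directory named without that slash.
        path = name[:-1] if is_dir_marker else name
        parts = path.split("/")
        if not path or "" in parts:
            continue  # assumption: names with empty segments are skipped
        # Interior slashes impute every intermediate directory.
        for i in range(1, len(parts)):
            directories.add("/".join(parts[:i]))
        if is_dir_marker:
            directories.add(path)
        else:
            file_candidates.add(path)
    # Any indication of a directory supersedes a conflicting file.
    files = file_candidates - directories
    return directories, files


def blobs_deleted_with(directory: str, blob_names: Iterable[str]) -> List[str]:
    """List the blobs removed when the given directory is deleted."""
    prefix = directory.rstrip("/") + "/"
    return [name for name in blob_names if name.startswith(prefix)]
```

With the blobs foo/bar and foo/bar/glarp, interpret reports foo and foo/bar as directories and only foo/bar/glarp as a file, matching the third rule. With foo/bar and foo/glarp, blobs_deleted_with("foo", ...) returns both names, matching the last rule.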
It is important to realize that this interpretation of the flat namespace in ABS as a directory structure is purely convention. Nothing requires or enforces any tools to share this interpretation. In particular, we’ve noticed that Azure Storage Explorer will recognize a blob named foo/ as a directory, but then claims that there is a “file” named foo/ under the directory named foo. Azure portal will claim that the directory foo contains a “file” named “empty file”.
Append operations are not supported
Data Management supports ABS block blobs. You can modify an existing block blob by inserting, replacing, or deleting blocks. However, append operations are not supported. Enabling the Append to existing file option in a Flat File Output tool or a CSV Output tool will result in an error.
Configure access to Azure Blob Storage
To configure access to Azure Blob Storage:
In the repository, open the Settings > Cloud folder.
Select the Azure icon.
Go to the Properties pane.
Select the Blob Storage Settings tab.
Enter the Account name.
Select the Authentication type and enter the authentication credentials generated by the Azure AD application for that account:
Authentication type | Credential |
Connection string | May be a Password or Key Vault reference. |
OAuth2 Service Principal | |