Deleting Blobs with Azure Automation Runbook

May 6, 2024 in Azure, Azure Automation


There are several ways of versioning data in Azure Blob Storage. A simple solution is to save different versions of the data files (e.g. csv or parquet) and to differentiate between versions by file name. However, this can quickly lead to a huge amount of data, because the number of files keeps growing. To keep storage under control, you should retain only a certain number of files and automatically delete the rest. This post shows how you can create an Azure Automation runbook that is triggered from Azure Data Factory via a webhook call and keeps only a user-specified number of the newest files in an Azure Storage Account.

Description of the process

In this use case, an Azure Data Factory pipeline pulls a new version of a dataset from an external data source (here: SAP Datasphere). After completion, a webhook activity triggers an Azure Automation runbook with several parameters such as storageAccountName, keyVaultName, or folderName. The runbook then scans the specified storage folder, sorts the files by date, and deletes all files except the newest ones (depending on the filesToKeep parameter). On successful completion, ADF receives a callback.

Prerequisites

  • Azure Automation Account
  • Azure KeyVault with AzureStorage Key stored as secret
  • Azure Data Factory

Step 1: Store Storage Account Access Key as Secret in Azure KeyVault

Copy Storage Account Access Key

If you don’t want to hardcode your storage credentials (which you never should do), you can store them in your Key Vault. To do so, copy the access key from your storage account.

Copy the storage account access key

Store Access Key in Azure KeyVault

Afterwards, create a secret for your access key. Give it a name that fits your naming conventions and make sure it is enabled.
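If you prefer scripting this step instead of using the portal, it can be sketched with the Az PowerShell module. All resource, account, vault, and secret names below are placeholders, not values from this post:

```powershell
# Sketch: read the first storage account access key and store it as a Key Vault secret.
# "my-rg", "mystorageaccount", "my-keyvault" and "storage-account-key" are placeholders.
$key = (Get-AzStorageAccountKey -ResourceGroupName "my-rg" -Name "mystorageaccount")[0].Value

# Key Vault secrets must be passed as a SecureString
$secretValue = ConvertTo-SecureString -String $key -AsPlainText -Force
Set-AzKeyVaultSecret -VaultName "my-keyvault" -Name "storage-account-key" -SecretValue $secretValue
```

This sketch requires an authenticated Az session and cannot run outside an Azure subscription.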

Step 2: Creating the Runbook

Create Azure Automation Account

If not yet available, create an Azure Automation account.

  1. Log in to Azure Portal: First, access your Azure portal.
  2. Navigate to Automation Accounts: Search for ‘Automation Accounts’ in the portal’s search bar.
  3. Create a New Automation Account: Click on ‘Add’ and fill in the necessary details like name, subscription, resource group, etc.
  4. Complete the Setup: Follow the on-screen instructions to create the account.

Add appropriate IAM rights to the Automation Account

For handling Key Vault information, you need to give your Automation account access to the storage account secret from before. For maximum security, it makes sense to grant those permissions only on the specific secret and to keep them as light as possible. In this example, the Automation account has the “Key Vault Secrets User” role, which lets it read the secret.
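If the Key Vault uses the Azure RBAC permission model (this does not work with classic access policies), the assignment can also be scripted. A sketch, in which the subscription ID, names, and scope are all placeholders:

```powershell
# Sketch: grant the Automation account's managed identity read access to one specific secret.
# Scoping the role assignment to the individual secret keeps the permissions minimal.
$account = Get-AzAutomationAccount -ResourceGroupName "my-rg" -Name "my-automation-account"

New-AzRoleAssignment -ObjectId $account.Identity.PrincipalId `
    -RoleDefinitionName "Key Vault Secrets User" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.KeyVault/vaults/my-keyvault/secrets/storage-account-key"
```

Like the previous sketch, this needs an authenticated Az session against your own subscription.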

The Runbook Powershell script

Create a new runbook and write the code. The following example is a working code snippet without any hardcoded variables, so the runbook can be reused for any use case in which a certain number of files has to be deleted. Afterwards, I will go over the code step by step and explain the different functions.

As this runbook consumes webhook data, it can unfortunately only be tested from within Azure Data Factory; the built-in test pane will not work.

This code snippet was last tested on 2024-04-15, so there might be newer methods or other best practices.

##################################################
# Receive Parameters
##################################################
Param
(
    # Object is required
    [Parameter(Mandatory=$True,Position=1)]
    [object] $WebhookData
)
 
# Initialize callbackUri variable
$callbackUri = $null

# Get all parameters from body (passed from Data Factory Web Activity)
if ($null -ne $WebhookData -and $null -ne $WebhookData.RequestBody) {
    $Parameters = (ConvertFrom-Json -InputObject $WebhookData.RequestBody)

    Write-Output "Successfully retrieved parameters: $Parameters"

    # Get folder_path parameter from set of parameters
    $folderPath = $Parameters.folderPath
    $containerName = $Parameters.containerName
    $storageAccountName = $Parameters.storageAccountName
    $keyVaultName = $Parameters.keyVaultName
    $secretName = $Parameters.secretName
    $filesToKeep = $Parameters.filesToKeep
    $callbackUri = $Parameters.callBackUri

    Write-Output "folderPath: $folderPath"
    Write-Output "containerName: $containerName"
    Write-Output "storageAccountName: $storageAccountName"
    Write-Output "keyVaultName: $keyVaultName"
    Write-Output "secretName: $secretName"
    Write-Output "filesToKeep: $filesToKeep"
    Write-Output "callbackUri: $callbackUri"

}

else {
    Write-Error "Error receiving webhook data."
}

#Error handling for empty callback URI
if ([string]::IsNullOrWhiteSpace($callbackUri)) {
    Write-Error "Callback URL is null or empty. Unable to send callback."
}


##################################################
# Set Azure Context
##################################################
# Use Managed Identity to authenticate with Azure Services
Connect-AzAccount -Identity

# context information
$contextInfo = Get-AzContext
Write-Output "Current Azure context: $contextInfo"

##################################################
# Key Vault Credentials
##################################################
# Trying to receive secret from Key Vault
$storageAccountKey = Get-AzKeyVaultSecret -VaultName $keyVaultName -Name $secretName -AsPlainText
if ($null -eq $storageAccountKey) {
    Write-Error "Error retrieving secret"
    exit
}
Write-Output "Successfully retrieved secret."

##################################################
# Access Blobs in Storage account
##################################################
# Context for accessing Storage Account
$context = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

$blobs = Get-AzStorageBlob -Container $containerName -Context $context -Prefix $folderPath
Write-Output "Processing folder: $folderPath"

##################################################
# Keep only certain amount of blobs, delete rest
##################################################

# Sort blobs by LastModified property in descending order
$sortedBlobs = $blobs | Sort-Object -Property LastModified -Descending

# Count the blobs
$blobCount = $sortedBlobs.Count
Write-Output "Number of files in ${folderPath}: $blobCount"


# Keep only the newest $filesToKeep blobs, delete the rest
if ($blobCount -gt $filesToKeep) {
    $blobsToDelete = $sortedBlobs | Select-Object -Skip $filesToKeep
    foreach ($blobToDelete in $blobsToDelete) {
        # Remove the blob from storage
        Remove-AzStorageBlob -Blob $blobToDelete.Name -Container $containerName -Context $context
        Write-Output "Deleted old blob: $($blobToDelete.Name)"
    }
}
# Output names of the remaining up to $filesToKeep blobs
$sortedBlobs | Select-Object -First $filesToKeep | ForEach-Object {
    Write-Output "Keeping blob: $($_.Name)"
}

##################################################
# Sending Callback to ADF
##################################################
# Define the body of the callback. Adjust this according to what ADF expects.
$body = @{
    status = "Completed"
} | ConvertTo-Json

# Send the callback
Invoke-RestMethod -Uri $callbackUri -Method Post -Body $body -ContentType "application/json"

Parameters

Param
(
    # Object is required
    [Parameter(Mandatory=$True,Position=1)]
    [object] $WebhookData
)

This part defines the parameter that is received from Azure Data Factory later. The WebhookData is a JSON-formatted object which contains all the necessary information for further processing. The Position argument references the whole object. This part has to be at the beginning of your script.

if ($null -ne $WebhookData -and $null -ne $WebhookData.RequestBody) {
    $Parameters = (ConvertFrom-Json -InputObject $WebhookData.RequestBody)

    Write-Output "Successfully retrieved parameters: $Parameters"

    # Get folder_path parameter from set of parameters
    $folderPath = $Parameters.folderPath
    $containerName = $Parameters.containerName
    $storageAccountName = $Parameters.storageAccountName
    $keyVaultName = $Parameters.keyVaultName
    $secretName = $Parameters.secretName
    $filesToKeep = $Parameters.filesToKeep
    $callbackUri = $Parameters.callBackUri

    Write-Output "folderPath: $folderPath"
    Write-Output "containerName: $containerName"
    Write-Output "storageAccountName: $storageAccountName"
    Write-Output "keyVaultName: $keyVaultName"
    Write-Output "secretName: $secretName"
    Write-Output "filesToKeep: $filesToKeep"
    Write-Output "callbackUri: $callbackUri"

}

else {
    Write-Error "Error receiving webhook data."
}

We then read the different variables from within the WebhookData object. The request body is converted from JSON and the parameters are extracted one by one. Please note that this code doesn’t handle a wrong input format or missing values. You have to make sure that the webhook body is correctly formatted in ADF and delivers all the necessary information.
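Because the test pane can’t supply WebhookData, you can at least sanity-check this parsing logic locally by building a mock object. All values below are made up for illustration:

```powershell
# Build a mock WebhookData object with a hypothetical request body
$mockBody = '{"folderPath":"exports/","containerName":"data","filesToKeep":3,"callBackUri":"https://example.com/callback"}'
$WebhookData = [pscustomobject]@{ RequestBody = $mockBody }

# Same parsing as in the runbook
$Parameters = ConvertFrom-Json -InputObject $WebhookData.RequestBody
Write-Output "folderPath: $($Parameters.folderPath)"   # folderPath: exports/
Write-Output "filesToKeep: $($Parameters.filesToKeep)" # filesToKeep: 3
```

This only exercises the JSON handling; the Azure cmdlets further down still require a real run triggered from ADF.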

Set Azure context

##################################################
# Set Azure Context
##################################################
# Use Managed Identity to authenticate with Azure Services
Connect-AzAccount -Identity

# context information
$contextInfo = Get-AzContext
Write-Output "Current Azure context: $contextInfo"

Connect-AzAccount -Identity ensures that the runbook authenticates via the managed identity of the Automation account (that’s why we gave it the necessary rights in the Key Vault). Get-AzContext simply logs the current context, which helps with debugging authentication issues.

Retrieve Storage Account Access Key from Key Vault

##################################################
# Key Vault Credentials
##################################################
# Trying to receive secret from Key Vault
$storageAccountKey = Get-AzKeyVaultSecret -VaultName $keyVaultName -Name $secretName -AsPlainText
if ($null -eq $storageAccountKey) {
    Write-Error "Error retrieving secret"
    exit
}
Write-Output "Successfully retrieved secret."

This section of the PowerShell script is designed to interact with Azure Key Vault to retrieve the access key for a storage account. It attempts to fetch the secret using the Get-AzKeyVaultSecret cmdlet, specifying the vault and secret names. If the secret cannot be retrieved (i.e., if it is null), the script logs an error message and terminates; otherwise, it logs a success message.

Access Blobs in Storage account

##################################################
# Access Blobs in Storage account
##################################################
# Context for accessing Storage Account
$context = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

$blobs = Get-AzStorageBlob -Container $containerName -Context $context -Prefix $folderPath
Write-Output "Processing folder: $folderPath"

This portion of the PowerShell script is responsible for accessing Azure Storage blobs. It first establishes a connection context to the storage account using the New-AzStorageContext cmdlet with the account name and key provided. The script then retrieves all blobs within a specified container and path prefix using the Get-AzStorageBlob cmdlet. Finally, it logs a message indicating that it is processing the specified folder path.

Keep only certain amount of blobs, delete rest

##################################################
# Keep only certain amount of blobs, delete rest
##################################################

# Sort blobs by LastModified property in descending order
$sortedBlobs = $blobs | Sort-Object -Property LastModified -Descending

# Count the blobs
$blobCount = $sortedBlobs.Count
Write-Output "Number of files in ${folderPath}: $blobCount"


# Keep only the newest $filesToKeep blobs, delete the rest
if ($blobCount -gt $filesToKeep) {
    $blobsToDelete = $sortedBlobs | Select-Object -Skip $filesToKeep
    foreach ($blobToDelete in $blobsToDelete) {
        # Remove the blob from storage
        Remove-AzStorageBlob -Blob $blobToDelete.Name -Container $containerName -Context $context
        Write-Output "Deleted old blob: $($blobToDelete.Name)"
    }
}
# Output names of the remaining up to $filesToKeep blobs
$sortedBlobs | Select-Object -First $filesToKeep | ForEach-Object {
    Write-Output "Keeping blob: $($_.Name)"
}

This script section is enforcing a retention policy based on modification dates. Initially, the script sorts all retrieved blobs in descending order based on the LastModified property, ensuring that the most recently updated files are processed first. It then counts the total number of blobs, outputting this number along with the folder path being processed to provide a clear log of the current state.

The script uses the given parameter $filesToKeep as a threshold, which determines how many of the most recent blobs should be retained. If the number of existing blobs exceeds this threshold, it identifies the blobs to delete by skipping the top $filesToKeep blobs and selecting the rest. Each of these excess blobs is then deleted from the storage container, and each deletion is logged to provide a record of which files were removed.

For clarity and record-keeping, the script also outputs the names of the blobs that continue to be stored, up to the $filesToKeep limit. This helps in monitoring and verifying that the correct files are retained and that older, less necessary files are pruned appropriately from the storage environment. This methodical approach ensures that storage usage is optimized while keeping the most relevant data readily available.
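The Skip/First split is easiest to see on a plain array. With a hypothetical list of file names already sorted newest first and $filesToKeep set to 2:

```powershell
# Hypothetical file names, already sorted newest first
$sortedBlobs = @("2024-05-06.csv", "2024-05-05.csv", "2024-05-04.csv", "2024-05-03.csv")
$filesToKeep = 2

# Everything after the first $filesToKeep entries gets deleted
$toDelete = $sortedBlobs | Select-Object -Skip $filesToKeep    # 2024-05-04.csv, 2024-05-03.csv
$toKeep   = $sortedBlobs | Select-Object -First $filesToKeep   # 2024-05-06.csv, 2024-05-05.csv
```

Because the list is sorted descending by LastModified, skipping the first entries always drops the oldest files, never the newest.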

Sending Callback to ADF

##################################################
# Sending Callback to ADF
##################################################
# Define the body of the callback. Adjust this according to what ADF expects.
$body = @{
    status = "Completed"
} | ConvertTo-Json

# Send the callback
Invoke-RestMethod -Uri $callbackUri -Method Post -Body $body -ContentType "application/json"

The process concludes by sending a callback to Azure Data Factory to indicate the completion of the task. A JSON payload is created with a status of “Completed,” which is then converted into a JSON string format. This JSON object is sent as the body of an HTTP POST request to the parameter callbackUri using the Invoke-RestMethod cmdlet, with the content type set to “application/json” to ensure proper handling by the receiving server.

Step 3: Call the runbook from within an ADF pipeline

Create Webhook for the runbook

To be able to call the runbook from within ADF, you need to create a webhook. When doing so, be sure to copy the URL, as it will not be viewable again after the creation process is finished. You will need this URL in your pipeline.

Create Webhook activity in ADF

For calling the runbook from within ADF, use the WebHook activity. Copy the WebHook-URL from the former step. Use the POST Method. The body has to be formatted in the following format:

@json(
    '{
        "folderPath": "your_folder_path/",
        "containerName": "your_container_name",
        "storageAccountName": "your_storage_account_name",
        "keyVaultName": "your_keyVault_name",
        "secretName": "your_secret_name",
        "filesToKeep": 15
    }'
)

filesToKeep is the number of newest blobs that the script keeps; note that JSON does not allow comments, so don’t put one inside the body. ADF automatically appends the callBackUri property to the body of a Webhook activity, which is why it doesn’t appear here.

Test or publish

As mentioned, for testing the script you have to use ADF, as the webhook data is necessary for the script to work. Once you have tested successfully, don’t forget to publish.

Conclusion

With this runbook, you have a reusable way of keeping your folders small and preventing them from becoming cluttered and too expensive.

I’m really interested in your opinions on this way of dealing with larger folders. Have you used similar approaches? Do you think there are better ways? Please leave a comment with your ideas.

External Links

Get-AzKeyVaultSecret Documentation
