DVC Bugfix: securely pull data from the Azure blob backend

Bug Fix: enable DVC to pull data from Azure blob in a more secure way

DVC is a "Data Version Control" system. It keeps your actual data files tucked away on suitable media, such as cloud-based blob storage, while keeping the code naturally in sync with the data, using git. This way you get to eat the cake (sync large, binary, data files with your code) while having it too (keeping the repository small and fast-performing).

Since the code and data are related, automatic CI pipelines can rely on such data and even test and validate it. However, for security reasons, we typically want to run the CI pipelines with reduced credentials, rather than give them full control of the cloud account that serves as the storage backend. It turned out that with the Azure blob backend, such reduced credentials caused DVC to fail, as described in this issue, which effectively prevented the use of that security measure.

Fortunately, some digging revealed the root cause for the issue, and the workaround I suggested in this PR was quickly reviewed and merged.

  • What was the issue?

In Azure, the common way to authenticate automated tasks to the blob storage is by using SAS tokens instead of the full credentials. However, using such a token for a dvc pull command in the CI pipeline resulted in the error message "This request is not authorized to perform this operation.", even though the SAS token did have read permission.

  • The root cause:

It turned out that during the initialization of the blob_service object, the existing implementation always attempted to create the container in the storage account. This is useful, e.g., for the first use of the dvc push command, when the container does not exist yet, and it does nothing if the container already exists. However, if you are only trying to pull and have only read permissions, the creation request is rejected and the command fails.

  • What was the solution?

Instead of always trying to create the container, we should first check whether it exists. If it does, there is no reason for the pull command to fail. If it does not, the command fails with a more comprehensible error message. And if you are indeed trying to push, you need write permissions anyway, so the error would be expected.
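
The difference between the two behaviours can be illustrated with a small, self-contained sketch. It does not use the real Azure SDK or DVC code: ReadOnlySasService is a hypothetical stand-in for a blob client authenticated with a read-only SAS token, and the function names init_always_create and init_check_first, as well as the container name "dvc-data", are made up for this illustration. The only assumption is that the client offers an existence check and a create call, and that the create call is rejected without write permission.

```python
class ReadOnlySasService:
    """Hypothetical stand-in for a blob client that holds a read-only SAS token."""

    def __init__(self, existing_containers):
        self._existing = set(existing_containers)

    def exists(self, container_name):
        # Reading metadata is allowed with read permission.
        return container_name in self._existing

    def create_container(self, container_name):
        # Any write attempt is rejected, even if the container already exists.
        raise PermissionError(
            "This request is not authorized to perform this operation."
        )


def init_always_create(service, container_name):
    """Old behaviour: unconditionally try to create the container."""
    service.create_container(container_name)
    return service


def init_check_first(service, container_name):
    """New behaviour: only try to create the container if it is missing."""
    if not service.exists(container_name):
        service.create_container(container_name)
    return service


if __name__ == "__main__":
    service = ReadOnlySasService({"dvc-data"})
    try:
        init_always_create(service, "dvc-data")
    except PermissionError as err:
        print("old behaviour:", err)
    init_check_first(service, "dvc-data")
    print("new behaviour: container found, pull can proceed")
```

With the check in place, a read-only token is enough as long as the container exists. A write is attempted only when the container is missing, which is exactly the case where write permission is needed anyway.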
