Hi,
I'm integrating GCS into our backup flow, and the goal is to have a single multipart upload performed by multiple workers.
How it works
We have a process which tars and compresses the backup data. The tarball is written in small chunks into a temp directory, so our goal is to start a multipart upload as soon as the tar process starts and upload each chunk as a part of the multipart upload. Once the tar process is finished, we complete the multipart upload.
This approach means we never need the full data locally before starting the upload, and it also allows distributing the upload across multiple workers under our control.
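To make the flow concrete, here is a minimal sketch of the dispatch side, not our actual implementation. The names `upload_as_produced`, `chunk_paths` and `upload_part` are hypothetical, and a thread pool stands in for our real workers. It only shows that each chunk is handed to a worker as soon as the tar process produces it, and that the part metadata is collected in part-number order for the final complete step.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_as_produced(chunk_paths, upload_part, max_workers=4):
    """Submit each chunk to a worker as soon as the producer yields it.

    chunk_paths: iterable yielding chunk file paths in tar order
                 (hypothetical; e.g. a generator watching the temp directory)
    upload_part: callable(path, part_number) -> etag, a stand-in for the
                 real per-part upload call
    Returns part metadata ordered by part number, ready for the complete step.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # enumerate() pulls from chunk_paths lazily, so a chunk is submitted
        # as soon as it exists; part numbers are 1-based.
        futures = [
            (number, pool.submit(upload_part, path, number))
            for number, path in enumerate(chunk_paths, start=1)
        ]
        # result() re-raises any worker exception, so a failed part
        # surfaces here instead of being silently dropped.
        return [
            {"PartNumber": number, "ETag": future.result()}
            for number, future in futures
        ]
```

The full tarball never has to exist on disk: the only input is the stream of chunk paths, and the upload function is free to run anywhere.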
The problem
The upload_chunks_concurrently method exposed by the Python SDK requires a filename and, internally, divides the file into parts and uploads them in parallel. This means I would need to have the full file locally before starting the multipart upload, which completely breaks our workflow. Also, because the multipart upload is handled by the library internally, it does not allow us to distribute the work across our own workers.
In short, we need full control over every stage of the multipart upload (initiate, upload parts, complete), similar to what boto3 exposes. This appears to be a limitation of the Python SDK specifically; the Java SDK, for example, seems to give you full control of the complete flow.
We already have this flow working on AWS S3 using the boto3 library and would like to achieve the same with GCS.
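For reference, the three stages we rely on in boto3 look roughly like the sketch below. It is simplified to a single process, and the function name `multipart_upload` and its `chunks` argument are illustrative only. The client methods themselves (`create_multipart_upload`, `upload_part`, `complete_multipart_upload`, `abort_multipart_upload`) are the ones boto3's S3 client provides.

```python
def multipart_upload(s3, bucket, key, chunks):
    """Run initiate + upload parts + complete with a boto3 S3 client.

    s3:     a client created with boto3.client("s3")
    chunks: iterable of bytes objects, one per part, in order
            (illustrative; in our flow each part is uploaded by a worker)
    """
    # Stage 1: initiate. The returned UploadId is all a worker needs,
    # together with bucket and key, to upload a part on its own.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        # Stage 2: upload parts. Each call is independent of the others.
        for number, body in enumerate(chunks, start=1):
            response = s3.upload_part(
                Bucket=bucket,
                Key=key,
                UploadId=upload_id,
                PartNumber=number,
                Body=body,
            )
            parts.append({"PartNumber": number, "ETag": response["ETag"]})
        # Stage 3: complete, listing every part number with its ETag.
        return s3.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the already uploaded parts are not left behind.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

Because the upload ID is a plain string, stage 2 can be spread over any number of processes or machines, which is exactly the property we want to keep on GCS.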
Workaround
Upon inspecting the upload_chunks_concurrently implementation, we were able to come up with a solution that uses the internal classes XMLMPUContainer and XMLMPUPart (the code below is just to illustrate something similar to our usage):
import os

from google.cloud import storage
from google.cloud.storage.transfer_manager import XMLMPUContainer, XMLMPUPart

client = storage.Client()
# bucket_name and object_key are defined elsewhere in our code.
upload_url = f"https://storage.googleapis.com/{bucket_name}/{object_key}"

# === PARENT PROCESS: Initiate upload ===
container = XMLMPUContainer(upload_url, filename=None)
container.initiate(transport=client._http, content_type="application/octet-stream")
upload_id = container.upload_id
# Spawn workers, passing them the upload_url and upload_id...

# === WORKER PROCESS: Upload a single part ===
def upload_part(upload_url, upload_id, part_filename, part_number):
    # Each chunk file is uploaded whole, so the byte range covers the full file.
    part = XMLMPUPart(
        upload_url=upload_url,
        upload_id=upload_id,
        filename=part_filename,
        start=0,
        end=os.path.getsize(part_filename),
        part_number=part_number,
    )
    part.upload(transport=client._http)
    return {"PartNumber": part_number, "ETag": part.etag}

# === PARENT PROCESS: Complete upload after all workers finish ===
# parts_metadata is the list of dicts returned by the workers.
container = XMLMPUContainer(upload_url, filename=None, upload_id=upload_id)
for part in parts_metadata:
    container.register_part(part["PartNumber"], part["ETag"])
container.finalize(transport=client._http)
Using these classes gives us total control of the multipart upload process and lets it integrate seamlessly with our workflow.
Questions
However, these classes originate from google.cloud.storage._media (a private module). Are they considered stable? Is using them directly discouraged?
If their use is discouraged, is there an alternative for streaming multipart uploads that achieves the same result?
Thanks for any help!