Public interface for manual multipart upload control #17494

@gustabowill

Hi,

I'm integrating GCS into our backup flow, and the goal is to have a single multipart upload performed by multiple workers.

How it works

We have a process which tars and compresses the backup data. The tarball is written in small chunks into a temp directory, so our goal is to start a multipart upload as soon as the tar process starts and upload each chunk as a part of the multipart upload. Once the tar process is finished, we complete the multipart upload.

This approach means we never need the full data locally before starting the upload, and it also allows distributing the upload across multiple workers under our control.
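The flow described above can be sketched roughly as follows. This is only an illustration under assumptions: `dispatch_chunks`, the `upload_part` callable, and the chunk file naming are hypothetical, and the sketch processes a snapshot of the temp directory rather than watching it while the tar process is still writing.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def dispatch_chunks(
    chunk_dir: str,
    upload_part: Callable[[str, int], str],
    max_workers: int = 4,
) -> List[Dict[str, object]]:
    """Upload every chunk file in chunk_dir as one numbered part.

    upload_part(path, part_number) is a hypothetical callable that uploads
    a single chunk and returns its ETag. Part numbers start at 1 and follow
    the lexicographic order of the chunk file names.
    """
    chunks = sorted(p for p in Path(chunk_dir).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            number: pool.submit(upload_part, str(path), number)
            for number, path in enumerate(chunks, start=1)
        }
        # Collect results in part-number order; result() re-raises worker errors.
        return [
            {"PartNumber": number, "ETag": futures[number].result()}
            for number in sorted(futures)
        ]
```

The returned list has the shape needed to complete the upload once the tar process has finished.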

The problem

The upload_chunks_concurrently method exposed by the Python SDK requires a filename and, internally, divides the file into parts and uploads them in parallel. This means I would need to have the full file locally before starting the multipart upload, which completely breaks our workflow. Also, because the multipart upload is handled by the library internally, it does not allow us to distribute the work across our own workers.

In short, we need full control over all stages of the multipart upload, i.e. initiate + upload parts + complete, similar to what boto3 exposes. This seems to be a limitation of the Python SDK specifically; the Java SDK, for example, appears to give you full control of the complete flow.

We already have this flow working on AWS S3 using the boto3 library and would like to achieve the same with GCS.
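For reference, the three stages on the S3 side look roughly like the sketch below. It is not our production code: `multipart_upload` is a hypothetical helper, the parts are uploaded sequentially for brevity (in our setup each `upload_part` call runs in a separate worker), and `s3` is assumed to be a boto3 S3 client such as `boto3.client("s3")`.

```python
from typing import Any, Dict, Iterable, List


def multipart_upload(
    s3: Any, bucket: str, key: str, chunks: Iterable[bytes]
) -> Dict[str, Any]:
    """Run initiate -> upload parts -> complete against an S3 client."""
    # Stage 1: initiate and keep the upload ID that ties the parts together.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts: List[Dict[str, Any]] = []
    try:
        # Stage 2: upload each chunk as a numbered part (numbers start at 1).
        for part_number, body in enumerate(chunks, start=1):
            response = s3.upload_part(
                Bucket=bucket,
                Key=key,
                UploadId=upload_id,
                PartNumber=part_number,
                Body=body,
            )
            parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
        # Stage 3: complete, listing every part with its ETag.
        return s3.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the unfinished upload does not linger in the bucket.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

The important property is that only the upload ID, part numbers, and ETags need to be passed between processes.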

Workaround

Upon inspecting the upload_chunks_concurrently implementation, we were able to come up with a solution that uses the internal classes XMLMPUContainer and XMLMPUPart (the code below is just to illustrate something similar to our usage):

import os

from google.cloud import storage
from google.cloud.storage.transfer_manager import XMLMPUContainer, XMLMPUPart

client = storage.Client()
upload_url = f"https://storage.googleapis.com/{bucket_name}/{object_key}"

# === PARENT PROCESS: Initiate upload ===
container = XMLMPUContainer(upload_url, filename=None)
container.initiate(transport=client._http, content_type="application/octet-stream")
upload_id = container.upload_id

# Spawn workers, passing them the upload_url and upload_id...

# === WORKER PROCESS: Upload a single part ===
def upload_part(upload_url, upload_id, part_filename, part_number):
    part = XMLMPUPart(
        upload_url=upload_url,
        upload_id=upload_id,
        filename=part_filename,
        start=0,
        end=os.path.getsize(part_filename),
        part_number=part_number,
    )
    part.upload(transport=client._http)
    return {"PartNumber": part_number, "ETag": part.etag}

# === PARENT PROCESS: Complete upload after all workers finish ===
container = XMLMPUContainer(upload_url, filename=None, upload_id=upload_id)
for part in parts_metadata:
    container.register_part(part["PartNumber"], part["ETag"])
container.finalize(transport=client._http)

Using these classes gives us total control of the multipart upload process, allowing seamless integration with our workflow.
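One detail the illustration leaves out is cleanup on failure. A minimal sketch of how the completion step could be wrapped is shown below; `finalize_or_cancel` is a hypothetical helper, and it assumes the container also offers a `cancel(transport)` method for aborting the upload, which I have not confirmed as a supported interface. Sorting the parts by number before registering them is a defensive choice, not a known requirement.

```python
from typing import Any, Dict, Iterable


def finalize_or_cancel(
    container: Any, transport: Any, parts_metadata: Iterable[Dict[str, Any]]
) -> None:
    """Register all parts and finalize; cancel the upload if anything fails.

    container is expected to behave like XMLMPUContainer. The cancel() call
    is an assumption about that class, not a documented public interface.
    """
    try:
        # Register parts in ascending part-number order (defensive).
        for part in sorted(parts_metadata, key=lambda p: p["PartNumber"]):
            container.register_part(part["PartNumber"], part["ETag"])
        container.finalize(transport=transport)
    except Exception:
        # Best-effort abort so an unfinished upload is not left behind.
        container.cancel(transport=transport)
        raise
```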

Questions

However, these classes originate from google.cloud.storage._media (a private module). Are they considered stable? Are we discouraged from using these classes directly?

In case their use is discouraged, is there any alternative for streaming multipart uploads which achieves the same result?

Thanks for any help!
