Batch upsert documents #1539
|
I'm not sure why a user would use that, and not just:
from pgml import Collection, Batch
collection = Collection("my_collection")
# batch = Batch(collection, 25, {"merge": True})
batch = []
# await batch.upsert_documents([{"id": 1}]) # Doesn't upsert yet
batch.append({"id": 1})
for i in range(23):
    # await batch.upsert_documents([{"id": i}]) # Doesn't upsert yet
    batch.append({"id": i})
# Upserts whatever is in the current batch
# and appends the document to the next batch
# await batch.upsert_documents([{"id": 1}])
await collection.upsert_documents(batch, {"merge": True})
# Upserts the final batch
# await batch.finish() |
|
Oh, I see. The automatic upsert once the batch hits the threshold is nice, but it is a bit confusing. I think most people in the Python world are used to the batching systems already built into the dataset they are operating on (for example, https://huggingface.co/docs/datasets/en/process#batch-processing). I'm not saying we shouldn't add it, but maybe we should clarify the name to like |
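For readers following the thread, here is a minimal sketch of the threshold behaviour being discussed. `BufferedUpserter` and `FakeCollection` are hypothetical names, not part of pgml, and the stand-in collection only records calls so the example runs without a database. It assumes the wrapper flushes as soon as the buffer reaches the configured size; the implementation in this PR may differ.

```python
import asyncio


class FakeCollection:
    """Hypothetical stand-in for a real collection; records each upsert call."""

    def __init__(self):
        self.calls = []

    async def upsert_documents(self, documents, args=None):
        self.calls.append((list(documents), args))


class BufferedUpserter:
    """Hypothetical wrapper: buffers documents and upserts once `size` is reached."""

    def __init__(self, collection, size, args=None):
        self.collection = collection
        self.size = size
        self.args = args
        self.buffer = []

    async def upsert_documents(self, documents):
        # Documents are only buffered here; a real upsert happens at the threshold.
        for doc in documents:
            self.buffer.append(doc)
            if len(self.buffer) >= self.size:
                await self.flush()

    async def flush(self):
        # Sends whatever is buffered; must be called once at the end of the stream.
        if self.buffer:
            pending, self.buffer = self.buffer, []
            await self.collection.upsert_documents(pending, self.args)


async def demo():
    collection = FakeCollection()
    batch = BufferedUpserter(collection, 25, {"merge": True})
    for i in range(60):
        await batch.upsert_documents([{"id": i}])
    await batch.flush()  # without this, the last 10 documents are never sent
    return collection.calls


calls = asyncio.run(demo())
print([len(docs) for docs, _ in calls])  # [25, 25, 10]
```

The final `flush()` call is the part that is easy to forget, which is where the naming concern comes from.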
|
Why not use the |
Datasets are only one of many possible data sources. For example, the use case that made me want this feature was streaming WET files from a
|
|
Right, having to call flush/finish is only an issue because of this new API you’re introducing. The example Silas gave doesn’t have the issue. |
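One way to picture the alternative for streaming sources, as a rough sketch: chunk any iterator into fixed-size lists and make one upsert call per chunk, so there is no trailing flush to remember. `chunked`, `upsert_stream`, `RecordingCollection`, and `fake_stream` are hypothetical helpers written for this illustration, and the stand-in collection only mimics the `upsert_documents(documents, args)` call shape used earlier in the thread.

```python
import asyncio
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable, including generators."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


class RecordingCollection:
    """Hypothetical stand-in for a real collection; records each upsert call."""

    def __init__(self):
        self.calls = []

    async def upsert_documents(self, documents, args=None):
        self.calls.append((list(documents), args))


async def upsert_stream(collection, documents, size, args=None):
    # One upsert per chunk; the last, shorter chunk is sent like any other.
    for chunk in chunked(documents, size):
        await collection.upsert_documents(chunk, args)


def fake_stream(n):
    """Simulates a source that yields documents one at a time."""
    for i in range(n):
        yield {"id": i}


collection = RecordingCollection()
asyncio.run(upsert_stream(collection, fake_stream(60), 25, {"merge": True}))
sizes = [len(docs) for docs, _ in collection.calls]
print(sizes)  # [25, 25, 10]
```

The trade-off is that the caller has to own the loop over the source, which is the convenience the batch wrapper is trying to provide.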