Release Note
This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.
💣 Breaking Changes
Terminate Python 3.7 support
⚠️ ⚠️ DocArray will now require Python 3.8. We can no longer assure compatibility with Python 3.7.
We decided to drop it for two reasons:
- Several dependencies of DocArray require Python 3.8.
- Python long-term support for 3.7 is ending this week. This means there will no longer
  be security updates for Python 3.7, making this a good time for us to change our requirements.
Changes to DocVec Protobuf definition (#1639)
In order to fix a bug in the DocVec protobuf serialization described in #1561,
we have changed the DocVec .proto definition.
This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray
v0.34.0 or later, and vice versa.
⚠️ ⚠️ We strongly recommend that everyone using Protobuf with DocVec upgrade to DocArray v0.34.0 or
later.
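The compatibility rule above can be written down as a small check. The following is a minimal sketch in plain Python, assuming you record the DocArray version that wrote each payload. The helpers parse_version and can_deserialize are hypothetical and not part of DocArray.

```python
from typing import Tuple

# First DocArray release that uses the new DocVec .proto definition.
NEW_PROTO_VERSION = (0, 34, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v0.33.0' or '0.33.0' into (0, 33, 0).

    Hypothetical helper; it only handles plain numeric versions.
    """
    return tuple(int(part) for part in version.lstrip("v").split("."))


def can_deserialize(writer_version: str, reader_version: str) -> bool:
    """Return True if both versions sit on the same side of the format change."""
    writer_is_new = parse_version(writer_version) >= NEW_PROTO_VERSION
    reader_is_new = parse_version(reader_version) >= NEW_PROTO_VERSION
    return writer_is_new == reader_is_new


print(can_deserialize("0.33.0", "0.34.0"))  # False
print(can_deserialize("0.34.0", "0.34.0"))  # True
```

A check like this can run before deserialization so that a mismatch produces a clear error message.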
🆕 Features
Allow users to check if a Document is already indexed in a DocIndex (#1633)
You can now check if a Document has already been indexed by using the in keyword:
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128))
     for _ in range(2000)])
index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index
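The in check relies on Python's standard membership protocol. The sketch below shows that protocol in plain Python with a hypothetical TinyIndex class that stores documents by ID. Keying by ID is an assumption of this sketch, which illustrates the mechanism only and is not DocArray's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict
from uuid import uuid4


@dataclass
class Doc:
    text: str
    # Each document gets a unique ID on creation (an assumption of this sketch).
    id: str = field(default_factory=lambda: uuid4().hex)


class TinyIndex:
    """Hypothetical in-memory index that stores documents by ID."""

    def __init__(self, docs=()):
        self._docs: Dict[str, Doc] = {doc.id: doc for doc in docs}

    def __contains__(self, doc: object) -> bool:
        # Python calls this method to evaluate `doc in index`.
        return isinstance(doc, Doc) and doc.id in self._docs


docs = [Doc(text="Example text") for _ in range(3)]
index = TinyIndex(docs)
print(docs[0] in index)               # True
print(Doc(text="New text") in index)  # False
```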
Support subindexes in InMemoryExactNNIndex (#1617)
You can now use the find_subindex method with InMemoryExactNNIndex.
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128))
     for _ in range(2000)])
index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index
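For readers unfamiliar with subindexes, the general idea is that documents nested inside a parent document can be searched directly, with each hit reported together with its parent. The plain-Python sketch below illustrates only that idea. Chunk, Parent, and search_nested are hypothetical names, and the code does not use or reproduce DocArray's find_subindex API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Chunk:
    text: str
    embedding: List[float]


@dataclass
class Parent:
    title: str
    chunks: List[Chunk] = field(default_factory=list)


def dot(a: List[float], b: List[float]) -> float:
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search_nested(
    parents: List[Parent], query: List[float], limit: int = 2
) -> List[Tuple[Parent, Chunk, float]]:
    """Score every nested chunk and return the best ones with their parents."""
    scored = [
        (parent, chunk, dot(query, chunk.embedding))
        for parent in parents
        for chunk in parent.chunks
    ]
    # Highest score first.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:limit]


parents = [
    Parent("a", [Chunk("a1", [1.0, 0.0]), Chunk("a2", [0.0, 1.0])]),
    Parent("b", [Chunk("b1", [0.5, 0.5])]),
]
best_parent, best_chunk, best_score = search_nested(parents, [1.0, 0.0])[0]
print(best_parent.title, best_chunk.text, best_score)  # a a1 1.0
```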
Flexible tensor types for protobuf deserialization (#1645)
You can deserialize any DocVec protobuf message to any tensor type,
by passing the tensor_type parameter to from_protobuf.
This means that you can choose at deserialization time whether to work with NumPy, PyTorch, or TensorFlow tensors.
from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here
proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)
assert isinstance(da_after.tensor, TensorFlowTensor)
⚙ Refactoring
Add DBConfig to InMemoryExactNNIndex
InMemoryExactNNIndex used to take index_file_path as its only constructor parameter, unlike the other
Document Indexes, which each accept their own DBConfig. index_file_path is now part of DBConfig, so the index
can be initialized from a DBConfig.
This will allow us to extend the config if more parameters are needed.
The parameters of DBConfig can also be passed at construction time as **kwargs, which keeps this change
compatible with the old usage.
These two initializations are equivalent:
from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')
index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
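One common way to support both call styles is to build the config from keyword arguments when no config object is passed. The sketch below shows that pattern in plain Python. Config and Index are hypothetical stand-ins, and DocArray's actual constructor logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    # Stand-in for a DBConfig with a single setting.
    index_file_path: Optional[str] = None


class Index:
    def __init__(self, db_config: Optional[Config] = None, **kwargs):
        # Use the given config, or build one from the keyword arguments.
        # Unknown keyword arguments raise a TypeError from the dataclass.
        # This sketch ignores kwargs when a config object is given.
        self.db_config = db_config if db_config is not None else Config(**kwargs)


from_config = Index(db_config=Config(index_file_path="index.bin"))
from_kwargs = Index(index_file_path="index.bin")
print(from_config.db_config == from_kwargs.db_config)  # True
```

Because the keyword arguments are forwarded to the config class, any field added to the config later becomes available in both call styles.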
🐞 Bug Fixes
Allow protobuf deserialization of BaseDoc with Union type (#1655)
Serialization of BaseDoc types that have Union fields of Python native types is now supported.
from typing import Union
from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    union_field: Union[int, str]

docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
# round-trip through a DataFrame and compare with the original
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2
When these Union types involve other BaseDoc types, an exception is thrown.
from docarray import DocList
from docarray.documents import ImageDoc, TextDoc

class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])
# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())
Cast limit to integer when passed to HNSWDocumentIndex (#1657, #1656)
If you call find or find_batched on an HNSWDocumentIndex, the limit parameter will automatically be cast to
integer.
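The effect of the cast can be illustrated without an index. In the plain-Python sketch below, top_k is a hypothetical helper that slices a ranked list. Slicing with a float raises a TypeError, so the limit is converted with int() first.

```python
from typing import List, Union


def top_k(ranked: List[str], limit: Union[int, float]) -> List[str]:
    """Return the first `limit` items, accepting limits such as 2.0."""
    # int() truncates toward zero, so 2.0 becomes 2.
    return ranked[: int(limit)]


ranked = ["a", "b", "c", "d"]
print(top_k(ranked, 2.0))  # ['a', 'b']
```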
Moved default_column_config from RuntimeConfig to DBConfig (#1648)
default_column_config contains specific configuration information about the columns and tables inside the backend's
database. This was previously put inside RuntimeConfig which caused an error because this information is required at
initialization time. This information has been moved inside DBConfig so you can edit it there.
from docarray.index import HNSWDocumentIndex
import numpy as np

db_config = HNSWDocumentIndex.DBConfig()
# edit the default column settings before the index is created
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HNSWDocumentIndex[MyDoc](db_config=db_config)
Fix issue with Protobuf (de)serialization for DocVec (#1639)
This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the
data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.
Fix order of returned matches when find and filter are combined in InMemoryExactNNIndex (#1642)
Hybrid search (find+filter) for InMemoryExactNNIndex was returning the matches with the lowest similarity scores
first. This was fixed by adding an option to sort matches in reverse order of their scores.
# prepare a query (db is an InMemoryExactNNIndex[MyDoc])
q_doc = MyDoc(embedding=np.random.rand(128), text='query')
query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)
results = db.execute_query(query)
# Before: results were sorted from worst to best matches
# Now: they are sorted in the correct order, showing better matches first
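The ordering fix amounts to sorting by score in descending order when higher scores mean better matches. A minimal plain-Python sketch follows, with made-up document names and scores. It shows the sorting idea only, not DocArray's internal code.

```python
from typing import List, Tuple

# (document name, similarity score); higher means more similar.
matches: List[Tuple[str, float]] = [
    ("doc_a", 0.12),
    ("doc_b", 0.93),
    ("doc_c", 0.57),
]

# An ascending sort puts the worst match first (the old behaviour).
worst_first = sorted(matches, key=lambda m: m[1])

# reverse=True puts the best match first (the fixed behaviour).
best_first = sorted(matches, key=lambda m: m[1], reverse=True)

print([name for name, _ in best_first])  # ['doc_b', 'doc_c', 'doc_a']
```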
Working with external Qdrant collections (#1632)
Previously, using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray raised a KeyError.
This has been fixed, and now you can use QdrantDocumentIndex to connect to externally initialized collections.
Other bug fixes
- DocVec equality (#1641, #1663)
- summary() called for LegacyDocument (#1637)
- DocList and DocVec coercion (#1568)
- update() on BaseDoc with tensor fields (#1628)
📗 Documentation Improvements
🤟 Contributors
We would like to thank all contributors to this release: