feat: schema for text index and search#267
Conversation
Codecov Report
@@ Coverage Diff @@
## main #267 +/- ##
==========================================
+ Coverage 86.05% 86.10% +0.04%
==========================================
Files 134 134
Lines 6253 6282 +29
==========================================
+ Hits 5381 5409 +28
- Misses 872 873 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
…na-ai/docarray into feat-elastic-search-with-text
| def find( | ||
| self: 'T', | ||
| query: Union['DocumentArray', 'Document', 'ArrayType', Dict], | ||
| metric: Union[ | ||
| str, Callable[['ArrayType', 'ArrayType'], 'np.ndarray'] | ||
| ] = 'cosine', | ||
| limit: Optional[Union[int, float]] = 20, | ||
| metric_name: Optional[str] = None, | ||
| exclude_self: bool = False, | ||
| only_id: bool = False, | ||
| index: str = 'text', | ||
| **kwargs, | ||
| ) -> Union['DocumentArray', List['DocumentArray']]: | ||
| """Returns approximate nearest neighbors given an input query. | ||
|
|
||
| :param query: the input query to search by | ||
| :param limit: the maximum number of matches, when not given defaults to 20. | ||
| :param metric_name: if provided, then match result will be marked with this string. | ||
| :param metric: the distance metric. | ||
| :param exclude_self: if set, Documents in results with same ``id`` as the query values will not be | ||
| considered as matches. This is only applied when the input query is Document or DocumentArray. | ||
| :param only_id: if set, then returning matches will only contain ``id`` |
There was a problem hiding this comment.
At this point, we chose to copy-paste the parent find method code and add more logic to it in order to support finx_by_text.
However, this is not sustainable and a lot of logic (regarding the metrics, matches, limit etc) does not concern the find by text logic. Furthermore, the extended interface (index parameter and the new supported query type) does not appear for the Parent DocumentArray usage in auto-completion.
Therefore, I suggest the following:
- applying the extended interface (index parameter and the new supported query type) to the parent DocumentArray class and documenting that these types/params only apply for storage backends that support find by text (might be available in the future for weaviate and qdrant as well)
- (option 1) adding an empty handle method to allow extending the behavior and extend it in DocumentArrayElastic
- (option 2) adding a method find_by_text to the parent FindMixin which concerns DocumentArray in general which raises a NotImplementedError or a NotSupportedError and add it to the find method in the parent FindMixin. Implement the method for DocumentArrayElastic. That way, only elastic storage backend (and in the future, backends that implement the method) will allow accepting text as input.
- Properly documenting that query type text and index parameter only concern storage backends that support find by text
- No more re-implementing this find method
- the private _find method only handles the find by vector case. We can possible rename it to
_find_by_vector
For step 2, I prefer option 2. @JoanFM what is your opinion here ?
There was a problem hiding this comment.
makes sense, I've updated find in the parent class to handle strand List[str] and removed it from elastic.find
| :param exclude_self: if set, Documents in results with same ``id`` as the query values will not be | ||
| considered as matches. This is only applied when the input query is Document or DocumentArray. | ||
| :param only_id: if set, then returning matches will only contain ``id`` | ||
| :param index: if the query is a string, text search will be performed on `index`, otherwise, ignored. By |
There was a problem hiding this comment.
again this change should be applied to the parent find mixin, we no more want to do this whole change here
| :param index: if the query is a string, text search will be performed on `index`, otherwise, ignored. By | |
| :param index: if the query is a string or list of string, text search will be performed on `index`, otherwise, ignored. By |
I think this behavior is not intended. I think we could assume that the |
| def find( | ||
| self: 'T', | ||
| query: Union['DocumentArray', 'Document', 'ArrayType', Dict], | ||
| query: Union['DocumentArray', 'Document', 'ArrayType', Dict, str, List[str]], |
There was a problem hiding this comment.
I do not think this is desired. I think Document and DocumentArray are complete enough abstractions to handle what we want.
There was a problem hiding this comment.
how would you select how search has to be done if you do da.find(da2) ?
I see two options
- add a new field
da.find(da2, by='text')orda.find(da2, by='embedding') - have different methods
da.find_by_vector(da2),da.find_by_text(da2)
| :param exclude_self: if set, Documents in results with same ``id`` as the query values will not be | ||
| considered as matches. This is only applied when the input query is Document or DocumentArray. | ||
| :param only_id: if set, then returning matches will only contain ``id`` | ||
| :param index: if the query is a string, text search will be performed on `index`, otherwise, ignored. By |
There was a problem hiding this comment.
This docstring does not help me understand what this parameter is about or does
The API does not behave in that way currently, but we will handle this in a separate PR. |
Co-authored-by: Han Xiao <han.xiao@jina.ai>
|
📝 Docs are deployed on https://ft-feat-elastic-search-with-text--jina-docs.netlify.app 🎉 |
Expected API examples
The behaviour we want to follow would be analogous to