Skip to content

feat: schema for text index and search#267

Merged
alaeddine-13 merged 33 commits into
mainfrom
feat-elastic-search-with-text
Apr 14, 2022
Merged

feat: schema for text index and search#267
alaeddine-13 merged 33 commits into
mainfrom
feat-elastic-search-with-text

Conversation

@davidbp

@davidbp davidbp commented Apr 7, 2022

Copy link
Copy Markdown
Contributor

Expected API examples

q = 'hello' - >  return DocumentArray
q =[ 'hello'] - >  return [DocumentArray] # this can't work unless .find is modfied
q =[ 'hello','hi'] - >  return [DocumentArray, DocumentArray]

The behaviour we want to follow would be analogous to

q = np.array([2,3,4]) - >  return DocumentArray
q = np.array([[2,3,4]]) - >  return [DocumentArray] # this can't work unless .find is modfied
q = np.array([[2,3,4],[2,3,4]]) - >  return [DocumentArray,DocumentArray]

@codecov

codecov Bot commented Apr 7, 2022

Copy link
Copy Markdown

Codecov Report

Merging #267 (787421b) into main (5069dc9) will increase coverage by 0.04%.
The diff coverage is 95.23%.

@@            Coverage Diff             @@
##             main     #267      +/-   ##
==========================================
+ Coverage   86.05%   86.10%   +0.04%     
==========================================
  Files         134      134              
  Lines        6253     6282      +29     
==========================================
+ Hits         5381     5409      +28     
- Misses        872      873       +1     
Flag Coverage Δ
docarray 86.10% <95.23%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/array/storage/elastic/seqlike.py 91.66% <ø> (-0.44%) ⬇️
docarray/array/mixins/find.py 88.60% <87.50%> (-0.29%) ⬇️
docarray/array/storage/elastic/find.py 89.13% <95.00%> (+0.55%) ⬆️
docarray/array/storage/elastic/backend.py 94.33% <100.00%> (+0.28%) ⬆️
docarray/array/storage/elastic/getsetdel.py 100.00% <100.00%> (ø)
docarray/helper.py 82.40% <0.00%> (+0.46%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5069dc9...787421b. Read the comment docs.

@github-actions github-actions Bot added size/m and removed size/s labels Apr 8, 2022
Comment thread docarray/array/storage/elastic/find.py Outdated
Comment on lines +135 to +156
def find(
self: 'T',
query: Union['DocumentArray', 'Document', 'ArrayType', Dict],
metric: Union[
str, Callable[['ArrayType', 'ArrayType'], 'np.ndarray']
] = 'cosine',
limit: Optional[Union[int, float]] = 20,
metric_name: Optional[str] = None,
exclude_self: bool = False,
only_id: bool = False,
index: str = 'text',
**kwargs,
) -> Union['DocumentArray', List['DocumentArray']]:
"""Returns approximate nearest neighbors given an input query.

:param query: the input query to search by
:param limit: the maximum number of matches, when not given defaults to 20.
:param metric_name: if provided, then match result will be marked with this string.
:param metric: the distance metric.
:param exclude_self: if set, Documents in results with same ``id`` as the query values will not be
considered as matches. This is only applied when the input query is Document or DocumentArray.
:param only_id: if set, then returning matches will only contain ``id``

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, we chose to copy-paste the parent find method code and add more logic to it in order to support finx_by_text.
However, this is not sustainable and a lot of logic (regarding the metrics, matches, limit etc) does not concern the find by text logic. Furthermore, the extended interface (index parameter and the new supported query type) does not appear for the Parent DocumentArray usage in auto-completion.
Therefore, I suggest the following:

  1. applying the extended interface (index parameter and the new supported query type) to the parent DocumentArray class and documenting that these types/params only apply for storage backends that support find by text (might be available in the future for weaviate and qdrant as well)
  2. (option 1) adding an empty handle method to allow extending the behavior and extend it in DocumentArrayElastic
  3. (option 2) adding a method find_by_text to the parent FindMixin which concerns DocumentArray in general which raises a NotImplementedError or a NotSupportedError and add it to the find method in the parent FindMixin. Implement the method for DocumentArrayElastic. That way, only elastic storage backend (and in the future, backends that implement the method) will allow accepting text as input.
  4. Properly documenting that query type text and index parameter only concern storage backends that support find by text
  5. No more re-implementing this find method
  6. the private _find method only handles the find by vector case. We can possible rename it to _find_by_vector

For step 2, I prefer option 2. @JoanFM what is your opinion here ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, I've updated find in the parent class to handle strand List[str] and removed it from elastic.find

Comment thread docarray/array/storage/elastic/find.py Outdated
Comment thread docarray/array/storage/elastic/find.py Outdated
:param exclude_self: if set, Documents in results with same ``id`` as the query values will not be
considered as matches. This is only applied when the input query is Document or DocumentArray.
:param only_id: if set, then returning matches will only contain ``id``
:param index: if the query is a string, text search will be performed on `index`, otherwise, ignored. By

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this change should be applied to the parent find mixin, we no more want to do this whole change here

Suggested change
:param index: if the query is a string, text search will be performed on `index`, otherwise, ignored. By
:param index: if the query is a string or list of string, text search will be performed on `index`, otherwise, ignored. By

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, updated

Comment thread docarray/array/mixins/find.py Outdated
Comment thread docarray/array/mixins/find.py Outdated
Comment thread tests/unit/array/mixins/test_find.py Outdated
@JoanFM

JoanFM commented Apr 13, 2022

Copy link
Copy Markdown
Member

Expected API examples

q = 'hello' - >  return DocumentArray
q =[ 'hello'] - >  return [DocumentArray] # this can't work unless .find is modfied
q =[ 'hello','hi'] - >  return [DocumentArray, DocumentArray]

The behaviour we want to follow would be analogous to

q = np.array([2,3,4]) - >  return DocumentArray
q = np.array([[2,3,4]]) - >  return [DocumentArray] # this can't work unless .find is modfied
q = np.array([[2,3,4],[2,3,4]]) - >  return [DocumentArray,DocumentArray]

I think this behavior is not intended. I think we could assume that the query is a DocumentArray in the rest of the interface. I would not vote for mixing interfaces @alaeddine-13

JoanFM
JoanFM previously requested changes Apr 13, 2022
def find(
self: 'T',
query: Union['DocumentArray', 'Document', 'ArrayType', Dict],
query: Union['DocumentArray', 'Document', 'ArrayType', Dict, str, List[str]],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is desired. I think Document and DocumentArray are complete enough abstractions to handle what we want.

@davidbp davidbp Apr 14, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would you select how search has to be done if you do da.find(da2) ?

I see two options

  • add a new field da.find(da2, by='text') or da.find(da2, by='embedding')
  • have different methods da.find_by_vector(da2) , da.find_by_text(da2)

Comment thread docarray/array/mixins/find.py Outdated
:param exclude_self: if set, Documents in results with same ``id`` as the query values will not be
considered as matches. This is only applied when the input query is Document or DocumentArray.
:param only_id: if set, then returning matches will only contain ``id``
:param index: if the query is a string, text search will be performed on `index`, otherwise, ignored. By

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring does not help me understand what this parameter is about or does

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@alaeddine-13 alaeddine-13 marked this pull request as ready for review April 13, 2022 14:34
@alaeddine-13

Copy link
Copy Markdown
Member

Expected API examples

q = 'hello' - >  return DocumentArray
q =[ 'hello'] - >  return [DocumentArray] # this can't work unless .find is modfied
q =[ 'hello','hi'] - >  return [DocumentArray, DocumentArray]

The behaviour we want to follow would be analogous to

q = np.array([2,3,4]) - >  return DocumentArray
q = np.array([[2,3,4]]) - >  return [DocumentArray] # this can't work unless .find is modfied
q = np.array([[2,3,4],[2,3,4]]) - >  return [DocumentArray,DocumentArray]

I think this behavior is not intended. I think we could assume that the query is a DocumentArray in the rest of the interface. I would not vote for mixing interfaces @alaeddine-13

The API does not behave in that way currently, but we will handle this in a separate PR.
The find method supports various query types here we are just extending the interface.
But in general I think we should have 1 method per input type it's better

Comment thread docs/advanced/document-store/elasticsearch.md Outdated
Comment thread docs/advanced/document-store/elasticsearch.md Outdated
Comment thread docs/advanced/document-store/elasticsearch.md Outdated
Comment thread docs/advanced/document-store/elasticsearch.md Outdated
alaeddine-13 and others added 2 commits April 14, 2022 09:32
Co-authored-by: Han Xiao <han.xiao@jina.ai>
Comment thread docarray/array/mixins/find.py Outdated
Comment thread docarray/array/storage/elastic/find.py Outdated
Comment thread docarray/array/storage/elastic/find.py Outdated
Comment thread docarray/array/storage/elastic/find.py Outdated
Comment thread docarray/array/storage/elastic/find.py Outdated
Comment thread docarray/array/storage/elastic/getsetdel.py Outdated
@github-actions

Copy link
Copy Markdown

📝 Docs are deployed on https://ft-feat-elastic-search-with-text--jina-docs.netlify.app 🎉

@alaeddine-13 alaeddine-13 merged commit d12710c into main Apr 14, 2022
@alaeddine-13 alaeddine-13 deleted the feat-elastic-search-with-text branch April 14, 2022 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants