
feat: subindex for document index #1428

Merged
AnneYang720 merged 75 commits into main from feat-subindex on May 11, 2023
Conversation

AnneYang720 commented Apr 20, 2023

Contributor

This PR is related to issue #1235

ToDo:
Subindex

  • init
    • abstract
    • change schema (add parent_id dynamically)
    • hnswlib
    • elastic
  • index
    • abstract
    • hnswlib
    • elastic
  • get
  • del
  • find
  • find (return both root and subindex results)
  • filter
  • tests
    • abstract common methods
    • hnswlib
    • elastic (v7&v8)
  • weaviate
  • qdrant
  • Documentation

Example

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray

class SimpleDoc(BaseDoc):
    simple_tens: NdArray[10]
    simple_text: str

class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]

my_docs = [
    MyDoc(
        id='0',
        docs=DocList[SimpleDoc](
            [SimpleDoc(simple_tens=np.ones(10) * (j + 1), simple_text=f'hello {j}') for j in range(10)]
        ),
    )
]
index = ElasticDocIndex[MyDoc](index_name='idx')
index.index(my_docs)  # indices named 'idx' and 'idx__docs' will be generated

doc = index['0']  # this doc should be complete
print(doc)
print(doc.docs[0])

query = np.ones(10)  # example query vector (not part of the original snippet)
# you can find on subindex level, and return results will be subindex docs
docs, scores = index.find(query, search_field='docs__simple_tens', limit=1)
# or you can use find_subindex, which can return both root and sub results
root, sub, scores = index.find_subindex(query, search_field='docs__simple_tens', limit=5)

# filter_subindex filters on the subindex level; the returned results are subindex docs
query = {'match': {'simple_text': 'hello 0'}}
docs = index.filter_subindex(query, subindex='docs', limit=5)

del index['0']  # deletes all of its subindex docs as well
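
The storage layout implied by the example above (a root index plus an 'idx__docs' subindex, a parent_id added to each nested doc, and cascading deletes) can be sketched with plain dictionaries. This is only an illustration under stated assumptions: ToyIndex and its methods are hypothetical and are not the DocArray implementation.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for a root index with one subindex."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)  # index name -> {doc id -> row}

    def index(self, docs):
        sub_name = f"{self.name}__docs"  # naming pattern taken from the example
        for doc in docs:
            # The root row stores only the root-level data.
            self.rows[self.name][doc["id"]] = {"id": doc["id"]}
            for sub in doc["docs"]:
                # parent_id is added on the fly; it is not in the user schema.
                self.rows[sub_name][sub["id"]] = {**sub, "parent_id": doc["id"]}

    def get(self, doc_id):
        # Reassemble the complete document from root and subindex rows.
        root = dict(self.rows[self.name][doc_id])
        root["docs"] = [
            {k: v for k, v in row.items() if k != "parent_id"}
            for row in self.rows[f"{self.name}__docs"].values()
            if row["parent_id"] == doc_id
        ]
        return root

    def delete(self, doc_id):
        # Deleting a root doc also deletes the subindex docs that point to it.
        del self.rows[self.name][doc_id]
        sub = self.rows[f"{self.name}__docs"]
        for sub_id in [i for i, r in sub.items() if r["parent_id"] == doc_id]:
            del sub[sub_id]


index = ToyIndex("idx")
index.index(
    [{"id": "0", "docs": [{"id": f"0-{j}", "simple_text": f"hello {j}"} for j in range(3)]}]
)
```

Calling index.get("0") rebuilds the nested document, and index.delete("0") leaves both the root index and the subindex empty.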

Signed-off-by: AnneY <evangeline-lun@foxmail.com>
@AnneYang720 AnneYang720 marked this pull request as draft April 20, 2023 07:48
@github-actions github-actions Bot added size/m and removed size/s labels Apr 21, 2023
@github-actions github-actions Bot removed the size/m label Apr 26, 2023
@github-actions

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>

@AnneYang720

Contributor Author

> > you can find on subindex level, and return results will be subindex docs
>
> Can we change it such that this one returns on the root level, not the subindex level? Or is that difficult to do?
> The reason I am asking is that normal nested search (not subindex) returns on the root level, so this should do the same; otherwise it will be confusing imo.
>
> @AnneYang720 what is the status on this?

This one is hard to change for two reasons:

  1. find() is called recursively. For example, with search_field='list_docs__docs__simple_tens', by the time find() is called at the subindex docs level, the root-level info is already lost.
  2. find_subindex() also uses find() to get the subindex results.

But we can let find() return on the root level and implement a new function such as find_subindex_only() as a substitute.
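
To make the first reason concrete, here is a minimal sketch of a recursive find over nested levels. It assumes a hypothetical dictionary layout (make_level, find and root_id_of are illustrative names, not DocArray functions) and assumes each nested doc carries a parent_id. The recursion only ever sees the current level, so the root has to be recovered afterwards by walking parent_id links upward.

```python
def make_level(docs, subindices=None):
    """Hypothetical layout: a level holds its docs and its named subindices."""
    return {"docs": docs, "subindices": subindices or {}}


def find(level, search_field, predicate):
    """Recurse along 'a__b__field' and return matches from the deepest level."""
    head, sep, rest = search_field.partition("__")
    if sep:
        # 'head' names a subindex: descend, and the current level is forgotten.
        return find(level["subindices"][head], rest, predicate)
    return [d for d in level["docs"] if predicate(d[head])]


def root_id_of(levels, doc):
    """Walk parent_id links upward through levels ordered deepest-first."""
    current = doc
    for level in levels:
        by_id = {d["id"]: d for d in level["docs"]}
        current = by_id[current["parent_id"]]
    return current["id"]


inner = make_level(
    [
        {"id": "s0", "parent_id": "m0", "simple_tens": 1},
        {"id": "s1", "parent_id": "m1", "simple_tens": 2},
    ]
)
mid = make_level(
    [{"id": "m0", "parent_id": "r0"}, {"id": "m1", "parent_id": "r0"}],
    {"docs": inner},
)
root = make_level([{"id": "r0"}], {"list_docs": mid})

# The hits are subindex docs; nothing in them names the root directly.
hits = find(root, "list_docs__docs__simple_tens", lambda v: v == 2)
root_id = root_id_of([mid, root], hits[0])
```

Here hits contains only the doc with id 's1', and root_id_of needs one lookup per level to get back to 'r0'.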

@JohannesMessner

Member

I see. Maybe we should drop the subindex functionality in .find() altogether? I would rather it not be there than be confusing. We could detect when a user passes a field that belongs to a subindex and then raise an error clearly stating that they should use find_subindex instead.
@AnneYang720 @samsja what do you think?
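
The proposed check could look roughly like the sketch below. It is an assumption-laden illustration: check_find_field and the subindex_fields argument are hypothetical, and the real index would derive the set of subindex fields from its schema.

```python
def check_find_field(subindex_fields, search_field):
    """Hypothetical guard: reject search fields that point into a subindex."""
    head = search_field.split("__", 1)[0]
    if head in subindex_fields:
        raise ValueError(
            f"'{search_field}' belongs to subindex '{head}'; "
            "use find_subindex() instead of find()."
        )
    return search_field


# A plain root-level field passes through unchanged.
accepted = check_find_field({"docs"}, "simple_tens")

# A field that starts with a subindex name is rejected with a clear message.
try:
    check_find_field({"docs"}, "docs__simple_tens")
    message = None
except ValueError as err:
    message = str(err)
```

Failing early keeps find() unambiguous, and the error message tells the user which method to call instead.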

@samsja

samsja commented May 10, 2023

Member

Agree on failing in the find method rather than being confusing. We can always make it compatible with find later (but not the other way around: if we commit to it now, we have to support it forever).

But it needs to be documented when to pick find and when to pick find_subindex.

@github-actions

📝 Docs are deployed on https://ft-feat-subindex--jina-docs.netlify.app 🎉

Development

Successfully merging this pull request may close these issues.

Subindex search + navigate and traverse nested structures with id, parent_id and root_id

4 participants