feat: avoid stack embedding for every search by maxwelljin · Pull Request #1586 · docarray/docarray

maxwelljin · 2023-05-30T09:41:55Z

In the current design of ExactNNSearchIndexer, all the embeddings in the index are re-stacked on every 'find' operation.

To optimize this, we've introduced an 'embedding_map'. The map is initialized lazily: embeddings are added to it only when they are actually requested, rather than up front. This improves batch vector search speed.

The cached embeddings are rebuilt only after the user inserts or deletes documents.
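To illustrate the idea, here is a minimal sketch of a lazily filled cache that is invalidated on insertion and deletion. The class name `LazyEmbeddingCache`, its methods, and the `stack_calls` counter are hypothetical and only stand in for the index; this is not the actual docarray implementation.

```python
from typing import Dict, List

import numpy as np


class LazyEmbeddingCache:
    """Hypothetical stand-in for the index, used only to show the caching pattern."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, np.ndarray]] = []
        # Maps a search field to its stacked (num_docs, dim) matrix.
        self._embedding_map: Dict[str, np.ndarray] = {}
        self.stack_calls = 0  # counts how often a re-stack actually happens

    def index(self, doc: Dict[str, np.ndarray]) -> None:
        self._docs.append(doc)
        self._embedding_map.clear()  # an insertion invalidates the cache

    def delete(self, position: int) -> None:
        del self._docs[position]
        self._embedding_map.clear()  # a deletion invalidates the cache

    def embeddings(self, search_field: str) -> np.ndarray:
        # Lazy: stack only when the field is requested and not cached yet.
        if search_field not in self._embedding_map:
            self.stack_calls += 1
            self._embedding_map[search_field] = np.stack(
                [doc[search_field] for doc in self._docs]
            )
        return self._embedding_map[search_field]
```

With this pattern, repeated searches on an unchanged index reuse the same stacked matrix, and the stacking cost is paid again only after the next insert or delete.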

Signed-off-by: Ge Jin <gejin@berkeley.edu>

maxwelljin · 2023-05-31T07:47:05Z

I have also created a test to verify whether the _embedding_map correctly updates when the user inserts or deletes a document from the list.

JoanFM

Make sure to have extensive testing, and to measure performance impact of the PR

maxwelljin · 2023-05-31T10:04:54Z

Make sure to have extensive testing, and to measure performance impact of the PR

I have performed extensive testing on the query function as requested. Here are the results:

Data: A list of documents (using DocList) with 3 embedding tensors, each of length 128. The length of the list is denoted as 'm'.
Query: A list of documents. The length of the list is denoted as 'n'.

The implementation of the 'find_batched' function can be divided into three parts:

1. Stacking the embeddings or retrieving them from the cache (embedding_map).
2. Using the comp_backend to perform matrix multiplication to calculate similarity and obtain the top 'k' results (only the indices).
3. For each query (since it's a batch query), adding the relevant documents to the result array. The code of part 3 starts at this line:
for _, (indices_per_query, scores_per_query) in enumerate( zip(top_indices, top_scores) ):
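The three parts can be sketched roughly as follows with plain NumPy. This is only an illustration under assumptions: documents are plain dicts with an "embedding" key, cosine similarity is the metric, and `find_batched_sketch` is a hypothetical name, not the function in docarray.

```python
import numpy as np


def find_batched_sketch(docs, query_emb, limit):
    """Illustrative three-part batched search; not docarray's find_batched."""
    # Part 1: stack all index embeddings into one (m, d) matrix.
    index_emb = np.stack([doc["embedding"] for doc in docs])

    # Part 2: one matrix multiplication gives all cosine similarities,
    # then keep the indices and scores of the top `limit` hits per query.
    index_norm = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    scores = query_norm @ index_norm.T  # shape (n, m)
    top_indices = np.argsort(-scores, axis=1)[:, :limit]
    top_scores = np.take_along_axis(scores, top_indices, axis=1)

    # Part 3: a Python-level loop that collects documents for each query.
    results = []
    for indices_per_query, scores_per_query in zip(top_indices, top_scores):
        matched = [docs[i] for i in indices_per_query]
        results.append((matched, scores_per_query))
    return results
```

Parts 1 and 2 run as a handful of array operations, while part 3 touches n * limit documents one by one in Python.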

When m = 10,000 and n = 5,000, the timings for the first batched find (all times are in seconds) are as follows:
Time for the 1st part: 0.04855059299734421
Time for the 2nd part: 0.239690676004102
Time for the 3rd part: 2.9729396229959093

In the second batched find (using the same query, thus utilizing the cache version), the timings are as follows:
Time for the 1st part: 0.011098969000158831
Time for the 2nd part: 0.08534913699986646
Time for the 3rd part: 3.03791297099815

When m = 20,000 and n = 10,000, the timings for the first batched find are:
Time for the 1st part: 0.15310250900074607
Time for the 2nd part: 0.6652847430013935
Time for the 3rd part: 5.121941960998811

In the second batched find:
Time for the 1st part: 0.01820297400263371
Time for the 2nd part: 0.3696962500034715
Time for the 3rd part: 4.840899665003235

In these runs, the optimized part (part one) is roughly 4 to 8 times faster on the cached call. However, part three takes most of the time (about 3 to 5 seconds per call), so the overall improvement is small. It may be worth optimizing the third part to get better end-to-end performance.
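As a quick check of why the end-to-end gain is small, the arithmetic can be worked through with the m = 10,000, n = 5,000 timings reported above (rounded to five decimals). The variable names are only for illustration.

```python
# Timings in seconds for m = 10,000 and n = 5,000, rounded from the runs above.
first_call = {"part1": 0.04855, "part2": 0.23969, "part3": 2.97294}
second_call = {"part1": 0.01110, "part2": 0.08535, "part3": 3.03791}

total_first = sum(first_call.values())
total_second = sum(second_call.values())

# Speedup of the cached stacking step alone (roughly 4.4x here).
part1_speedup = first_call["part1"] / second_call["part1"]

# Speedup of the whole call (roughly 1.04x here).
overall_speedup = total_first / total_second

# Fraction of the cached call spent in part 3 (about 97% here).
part3_share = second_call["part3"] / total_second

print(f"part 1 speedup: {part1_speedup:.2f}x")
print(f"overall speedup: {overall_speedup:.2f}x")
print(f"share of part 3: {part3_share:.1%}")
```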

Signed-off-by: Ge Jin <gejin@berkeley.edu>

JoanFM · 2023-05-31T11:30:46Z

Can you add a script to reproduce the results? And also compare with the potential gain in #1598

JoanFM · 2023-05-31T11:39:11Z

Can this one help @maxwelljin ? #1598

Signed-off-by: Ge Jin <gejin@berkeley.edu>

maxwelljin · 2023-05-31T12:54:15Z

Can this one help @maxwelljin ? #1598

Yeah, it's very helpful to include this optimization. I've included my test code in this pull request. To reproduce the results, add timers inside the find_batched function in in_memory.py, using perf_counter from the time module. I placed four timing points: the first before the line if cache is not None and search_field in cache:, the second before the line metric_fn = getattr(comp_backend.Metrics, metric), the third at the line batched_docs: List[DocList] = [], and the fourth at the end of the function. The time differences between consecutive points give the duration of each of the three parts.
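The timing approach looks roughly like this. It is a standalone sketch: `timed_stages` and the dummy workloads are hypothetical placeholders, and in the real measurement the perf_counter calls sit directly at the four lines mentioned above.

```python
from time import perf_counter


def timed_stages(stages):
    """Run (name, callable) pairs in order and return the seconds spent in each."""
    timings = {}
    checkpoint = perf_counter()  # first timing point
    for name, stage in stages:
        stage()
        now = perf_counter()  # next timing point
        timings[name] = now - checkpoint
        checkpoint = now
    return timings


# Dummy workloads standing in for the three parts of the function.
timings = timed_stages(
    [
        ("part 1", lambda: sum(range(10_000))),
        ("part 2", lambda: sorted(range(10_000), reverse=True)),
        ("part 3", lambda: [str(i) for i in range(10_000)]),
    ]
)
for name, seconds in timings.items():
    print(f"The {name} takes {seconds} seconds")
```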

The potential gain from #1598 is as follows:

Before the simplification:

Part 1 takes 0.14537029600614915 seconds
Part 2 takes 0.49421013099345146 seconds
Part 3 takes 4.325345135002863 seconds

Part 1 takes 0.018737553000391927 seconds
Part 2 takes 0.341395959003421 seconds
Part 3 takes 4.513382460994762 seconds

After the simplification:

Part 1 takes 0.1592650849997881 seconds
Part 2 takes 0.6122807069987175 seconds
Part 3 takes 2.7912922600007732 seconds

Part 1 takes 0.016888011996343266 seconds
Part 2 takes 0.3044082790001994 seconds
Part 3 takes 2.9079534870033967 seconds

I guess the reason part 3 dominates is that the first two parts of the code use built-in optimizations from the numpy/torch modules, which significantly improve their performance.

Signed-off-by: Ge Jin <gejin@berkeley.edu>

JoanFM · 2023-06-01T07:22:09Z

        search_field='tensor',
        limit=7,
        metric=metric,
+        cache={},


Can we remove this cache? It should not be considered when using find.

maxwelljin · 2023-06-01T07:24:09Z

I have finished a more thorough testing of our optimization on my local machine, and I would like to share the results:

Number of queries: 10000
Number of indexed documents: 10000
Before optimization: 4.8117079977993855 seconds
After optimization: 3.1941708801983624 seconds
Speedup: 1.50x

Number of queries: 10000
Number of indexed documents: 20000
Before optimization: 4.9425261351978405 seconds
After optimization: 3.0140384888014524 seconds
Speedup: 1.64x

Number of queries: 10000
Number of indexed documents: 30000
Before optimization: 6.004549367600703 seconds
After optimization: 3.56447235160158 seconds
Speedup: 1.68x

Number of queries: 10000
Number of indexed documents: 50000
Before optimization: 9.126258195599075 seconds
After optimization: 3.4907073292008137 seconds
Speedup: 2.61x
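For reference, the speedup is simply the time before divided by the time after. The snippet below recomputes it from the numbers above; the `runs` list just restates those measurements.

```python
# (indexed documents, seconds before, seconds after), copied from the results above.
runs = [
    (10000, 4.8117079977993855, 3.1941708801983624),
    (20000, 4.9425261351978405, 3.0140384888014524),
    (30000, 6.004549367600703, 3.56447235160158),
    (50000, 9.126258195599075, 3.4907073292008137),
]

# Speedup = time before / time after, keyed by index size.
speedups = {num_docs: before / after for num_docs, before, after in runs}

for num_docs, speedup in speedups.items():
    print(f"{num_docs} indexed documents: {speedup:.3f}x")
```

The time after optimization stays close to 3 to 3.6 seconds across index sizes, while the time before grows with the number of indexed documents, so the speedup increases with index size.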

Signed-off-by: Ge Jin <gejin@berkeley.edu>

maxwelljin and others added 12 commits May 24, 2023 18:25

Fix: Add a Custom issubclass Function to Handle Non-Class Inputs

25d6992

Merge branch 'main' into fix-docarray-list

e7fb89f

fix: support non-class data type for document class

326365f

fix: support non-class data type in document class

a2a6d62

fix: support non-class data type for the document class

bd94e30

fix: added unit tests to issues of non-class input

1e98c66

fix: support non-class data input for document class

1828b91

fix: support non-class data input for document class

d7364e4

fix: support non-class data input for document class

5cd75d2

Signed-off-by: Ge Jin <gejin@berkeley.edu>

fix: support non-class data input for document class

8a13e18

Signed-off-by: Ge Jin <gejin@berkeley.edu>

feat: avoid stack embedding each time

7efd5af

Signed-off-by: Ge Jin <gejin@berkeley.edu>

feat: avoid stack embedding each time

717d5a1

Signed-off-by: Ge Jin <gejin@berkeley.edu>

maxwelljin linked an issue May 30, 2023 that may be closed by this pull request

ExactNNSearchIndexer can be optimized to avoid stacking embeddings every time #1574

Closed

feat: avoid stack embedding each time

1724c4c

Signed-off-by: Ge Jin <gejin@berkeley.edu>

maxwelljin marked this pull request as ready for review May 31, 2023 07:47

JoanFM requested changes May 31, 2023

Comment thread docarray/index/backends/in_memory.py

Comment thread docarray/utils/find.py Outdated

feat: avoid stack embedding each time

bb706dd

Signed-off-by: Ge Jin <gejin@berkeley.edu>

feat: avoid stack embedding each time

671a8ae

Signed-off-by: Ge Jin <gejin@berkeley.edu>

JoanFM reviewed May 31, 2023

Comment thread tests/index/in_memory/test_in_memory.py Outdated

feat: avoid stack embedding each time

3132916

Signed-off-by: Ge Jin <gejin@berkeley.edu>

JoanFM approved these changes Jun 1, 2023

JoanFM requested changes Jun 1, 2023

feat: avoid stack embedding each time

ca85ab3

Signed-off-by: Ge Jin <gejin@berkeley.edu>

maxwelljin force-pushed the feat-avoid-stack-embedding branch from caabceb to ca85ab3 Compare June 1, 2023 07:40

JoanFM approved these changes Jun 1, 2023

JoanFM merged commit 110f714 into docarray:main Jun 1, 2023

samsja mentioned this pull request Jun 6, 2023

release 0.33.0 #1619

Closed

JoanFM merged 17 commits into docarray:main from maxwelljin:feat-avoid-stack-embedding
