Merged
35 commits
caa00fd
chore: first pr
jupyterjazz Jun 28, 2023
b45e3a6
docs: modify hnsw
jupyterjazz Jul 6, 2023
cad4e60
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 6, 2023
11bda62
docs: rough versions of inmemory and hnsw
jupyterjazz Jul 6, 2023
96319ca
chore: update branch
jupyterjazz Jul 6, 2023
f5825f8
docs: weaviate v1
jupyterjazz Jul 6, 2023
8aaedbe
docs: elastic v1
jupyterjazz Jul 17, 2023
4a3e25c
docs: introduction page
jupyterjazz Jul 17, 2023
db77beb
docs: redis v1
jupyterjazz Jul 17, 2023
82afb99
docs: qdrant v1
jupyterjazz Jul 17, 2023
befc786
docs: validate intro inmemory and hnsw examples
jupyterjazz Jul 17, 2023
9bdb0dc
docs: validate elastic and qdrant examples
jupyterjazz Jul 17, 2023
64f83bf
docs: validate code examples for redis and weaviate
jupyterjazz Jul 18, 2023
759900c
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 19, 2023
60cd4d4
chore: merge recent updates
jupyterjazz Jul 19, 2023
ca25feb
docs: milvus v1
jupyterjazz Jul 19, 2023
7fef5d8
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 24, 2023
fe572da
docs: validate milvus code
jupyterjazz Jul 24, 2023
10bc14b
docs: make redis and milvus visible
jupyterjazz Jul 24, 2023
6199a2a
docs: refine vol1
jupyterjazz Jul 26, 2023
fa8f919
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 26, 2023
c257a4e
docs: refine vol2
jupyterjazz Jul 26, 2023
ccf17e1
chore: pull recent updates
jupyterjazz Jul 26, 2023
f3ca77c
docs: update api reference
jupyterjazz Jul 27, 2023
21e3ad2
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 27, 2023
e6ef9c4
docs: apply suggestions
jupyterjazz Jul 31, 2023
19045ec
docs: separate nested data section
jupyterjazz Jul 31, 2023
5736334
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 31, 2023
41c7307
docs: apply suggestions vol2
jupyterjazz Jul 31, 2023
a32a1e5
fix: nested data imports
jupyterjazz Jul 31, 2023
8a8aa33
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Aug 1, 2023
ef0b7ef
docs: apply johannes suggestions
jupyterjazz Aug 1, 2023
6818688
chore: merge conflicts
jupyterjazz Aug 1, 2023
9268161
docs: apply suggestions
jupyterjazz Aug 1, 2023
b402802
docs: app sgg
jupyterjazz Aug 1, 2023
docs: refine vol2
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
jupyterjazz committed Jul 26, 2023
commit c257a4e672e4d1a086cad8a59de383a4d567bed6
14 changes: 7 additions & 7 deletions docs/user_guide/storing/docindex.md
@@ -45,17 +45,17 @@ Currently, DocArray supports the following vector databases:

## Basic Usage

For this user guide you will use the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]
because it doesn't require you to launch a database server. Instead, it will store your data locally.
Let's learn basic capabilities of Document Index with [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex].
It's easy because you don't need a database server, instead it saves your data locally.


!!! note "Using a different vector database"
You can easily use Weaviate, Qdrant, or Elasticsearch instead -- they share the same API!
You can easily use Weaviate, Qdrant, Redis, Milvus or Elasticsearch instead -- they share the same API!
To do so, check their respective documentation sections.

!!! note "InMemory-specific settings"
The following sections explain the general concept of Document Index by using
`InMemoryExactNNIndex` as an example.
For InMemory-specific settings, check out the `InMemoryExactNNIndex` documentation
!!! note "InMemoryExactNNIndex in more detail"
The following section only covers the basics of InMemoryExactNNIndex.
For a deeper understanding, please look into its documentation
[here](index_in_memory.md).

### Define Document Schema and Create Data
30 changes: 14 additions & 16 deletions docs/user_guide/storing/index_hnswlib.md
@@ -40,7 +40,7 @@ class MyDoc(BaseDoc):
docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10))

# Initialize a new HnswDocumentIndex instance and add the documents to the index.
doc_index = HnswDocumentIndex[MyDoc](work_dir='./my_index')
doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp_0')
doc_index.index(docs)

# Perform a vector search.
@@ -63,7 +63,7 @@ class MyDoc(BaseDoc):
text: str


db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db')
db = HnswDocumentIndex[MyDoc](work_dir='./tmp_1')
```

### Schema definition
@@ -105,7 +105,7 @@ You can work around this problem by subclassing the predefined Document and addi
embedding: NdArray[128]


db = HnswDocumentIndex[MyDoc](work_dir='test_db')
db = HnswDocumentIndex[MyDoc](work_dir='./tmp_2')
```

=== "Using Field()"
@@ -120,7 +120,7 @@ You can work around this problem by subclassing the predefined Document and addi
embedding: AnyTensor = Field(dim=128)


db = HnswDocumentIndex[MyDoc](work_dir='test_db3')
db = HnswDocumentIndex[MyDoc](work_dir='./tmp_3')
```

Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the
@@ -187,8 +187,8 @@ need to have compatible schemas.

Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.

By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find
similar Documents in the Document Index:
You can use the [find()][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc`
to find similar documents within the Document Index:

=== "Search by Document"

@@ -272,8 +272,6 @@ a list of `DocList`s, one for each query, containing the closest matching docume

## Filter

To filter Documents, the `InMemoryExactNNIndex` uses DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.

You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.
The query should follow the query language of the DocArray's [`filter_docs()`][docarray.utils.filter.filter_docs] function.
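
As a rough illustration of the operator semantics such a filter query relies on, the sketch below evaluates a query of the form `{'field': {'$op': value}}` against plain dictionaries. It is not DocArray's implementation: the `matches` helper and the `OPS` table are hypothetical names introduced only for this example, and only a handful of comparison operators are covered.

```python
import operator

# Hypothetical operator table for illustration; a real query language
# supports more operators than the comparison subset shown here.
OPS = {
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$eq': operator.eq,
}


def matches(record, query):
    """Return True if `record` satisfies every condition in `query`."""
    return all(
        OPS[op](record[field], value)
        for field, conditions in query.items()
        for op, value in conditions.items()
    )


# Plain dictionaries stand in for documents; prices are 0, 10, ..., 90.
books = [{'title': f'title {i}', 'price': i * 10} for i in range(10)]

# '$lt' excludes the boundary value, '$lte' includes it.
cheap_lt = [b for b in books if matches(b, {'price': {'$lt': 30}})]
cheap_lte = [b for b in books if matches(b, {'price': {'$lte': 30}})]
```

With a boundary of 30, `$lt` keeps the books priced 0, 10 and 20, while `$lte` also keeps the book priced 30. With a boundary that no price hits exactly, such as 29, the two operators select the same documents.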

@@ -289,10 +287,10 @@ class Book(BaseDoc):


books = DocList[Book]([Book(title=f'title {i}', price=i * 10) for i in range(10)])
book_index = HnswDocumentIndex[Book](work_dir='./tmp_0')
book_index = HnswDocumentIndex[Book](work_dir='./tmp_4')

# filter for books that are cheaper than 29 dollars
query = {'price': {'$lte': 29}}
query = {'price': {'$lt': 29}}
cheap_books = book_index.filter(query)

assert len(cheap_books) == 3
@@ -331,7 +329,7 @@ class SimpleSchema(BaseDoc):
# Create dummy documents.
docs = DocList[SimpleSchema](SimpleSchema(year=2000-i, price=i, embedding=np.random.rand(128)) for i in range(10))

doc_index = HnswDocumentIndex[SimpleSchema](work_dir='./tmp_9')
doc_index = HnswDocumentIndex[SimpleSchema](work_dir='./tmp_5')
doc_index.index(docs)

query = (
@@ -467,7 +465,7 @@ class MyDoc(BaseDoc):
text: str


db = HnswDocumentIndex[MyDoc](work_dir='./path/to/db')
db = HnswDocumentIndex[MyDoc](work_dir='./tmp_6')
```

To load existing data, you can specify a directory that stores data from a previous session.
@@ -488,7 +486,7 @@ import numpy as np


db = HnswDocumentIndex[MyDoc](
work_dir='/tmp/my_db',
work_dir='./tmp_7',
default_column_config={
np.ndarray: {
'dim': -1,
@@ -537,7 +535,7 @@ class Schema(BaseDoc):
tens_two: NdArray[10] = Field(M=4, space='ip')


db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db')
db = HnswDocumentIndex[Schema](work_dir='./tmp_8')
```

In the example above you can see how to configure two different vector fields, with two different sets of settings.
@@ -611,7 +609,7 @@ class YouTubeVideoDoc(BaseDoc):


# create a Document Index
doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2')
doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp_9')

# create some data
index_docs = [
@@ -688,7 +686,7 @@ class MyDoc(BaseDoc):


# create a Document Index
doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp3')
doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp_10')

# create some data
index_docs = [
11 changes: 6 additions & 5 deletions docs/user_guide/storing/index_in_memory.md
@@ -1,11 +1,11 @@
# In-Memory Document Index


[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all Documents in DocLists in memory.
[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] stores all Documents in memory using DocLists.
It is a great starting point for small datasets, where you may not want to launch a database server.

For vector search and filtering the InMemoryExactNNIndex utilizes DocArray's [`find()`][docarray.utils.find.find] and
[`filter_docs()`][docarray.utils.filter.filter_docs] functions.
For vector search and filtering the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]
utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][docarray.utils.filter.filter_docs] functions.
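
To make "exact" nearest-neighbour search concrete, here is a minimal pure-Python sketch of a brute-force search: every stored vector is scored against the query and the best matches are returned. It is only an illustration under stated assumptions, not how `InMemoryExactNNIndex` or `find()` is implemented. The helper names `cosine_similarity` and `exact_nn` are hypothetical, and cosine similarity is assumed as the metric purely for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_nn(query, vectors, limit):
    """Score every vector against the query and return the top `limit`
    (score, index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Three toy 2-d "embeddings" and a query that points mostly along the x axis.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = exact_nn([1.0, 0.1], vectors, limit=2)
top_indices = [index for _, index in top]
```

Because every vector is compared with the query, the result is exact, but the cost grows linearly with the number of stored documents, which is consistent with the recommendation to use this kind of index for small datasets.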

!!! note "Production readiness"
[InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] is a great starting point
@@ -183,8 +183,9 @@ need to have compatible schemas.

Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.

By using a document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find], you can find
similar Documents in the Document Index:
You can use the [find()][docarray.index.abstract.BaseDocIndex.find] function with a document of the type `MyDoc`
to find similar documents within the Document Index:


=== "Search by Document"

11 changes: 4 additions & 7 deletions docs/user_guide/storing/index_milvus.md
@@ -12,6 +12,10 @@ focusing on special features and configurations of Milvus.


## Basic Usage
!!! note "Single Search Field Requirement"
In order to utilize vector search, it's necessary to define 'is_embedding' for one field only.
This is due to Milvus' configuration, which permits a single vector for each data object.

```python
from docarray import BaseDoc, DocList
from docarray.index import MilvusDocumentIndex
@@ -205,13 +209,6 @@ similar Documents in the Document Index:
print(f'{scores=}')
```

To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the
basis of comparison between your query and the documents in the Document Index.

In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
In general, you could have multiple fields of type `NdArray` or `TorchTensor` or `TensorFlowTensor`, and you can choose
which one to use for the search.

The [find()][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
matching documents and their associated similarity scores.

2 changes: 1 addition & 1 deletion docs/user_guide/storing/index_qdrant.md
@@ -185,7 +185,7 @@ docs = DocList[MyDoc](
doc_index.index(docs)
```

That call to [index()][docarray.index.backends.qdrant.QdrantDocumentIndex.index] stores all Documents in `docs` into the Document Index,
That call to `index()` stores all Documents in `docs` into the Document Index,
ready to be retrieved in the next step.

As you can see, `DocList[MyDoc]` and `QdrantDocumentIndex[MyDoc]` are both parameterized with `MyDoc`.
4 changes: 4 additions & 0 deletions docs/user_guide/storing/index_weaviate.md
@@ -12,6 +12,10 @@ focusing on special features and configurations of Weaviate.


## Basic Usage
!!! note "Single Search Field Requirement"
In order to utilize vector search, it's necessary to define 'is_embedding' for one field only.
This is due to Weaviate's configuration, which permits a single vector for each data object.

```python
from docarray import BaseDoc, DocList
from docarray.index import WeaviateDocumentIndex