Merged
3 changes: 1 addition & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -108,8 +108,6 @@ class MyDocument(BaseDoc):

So not only can you define the types of your data, you can even **specify the shape of your tensors!**

Once you have your model in the form of a document, you can work with it!

```python
# Create a document
doc = MyDocument(
@@ -120,6 +118,7 @@ doc = MyDocument(
# Load image tensor from URL
doc.image_tensor = doc.image_url.load()


# Compute embedding with any model of your choice
def clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor: # dummy function
    return torch.rand(512)
20 changes: 7 additions & 13 deletions docarray/typing/tensor/tensor.py
@@ -42,9 +42,11 @@
class AnyTensor(AbstractTensor, Generic[ShapeT]):
"""
Represents a tensor object that can be used with TensorFlow, PyTorch, and NumPy type.
!!! note:
When doing type checking (mypy or the PyCharm type checker), this class will actually be replaced by a Union of the three
tensor types. You can reason about this class as if it were a Union.

---
'''python
```python
from docarray import BaseDoc
from docarray.typing import AnyTensor

@@ -54,9 +56,9 @@ class MyTensorDoc(BaseDoc):


# Example usage with TensorFlow:
import tensorflow as tf
# import tensorflow as tf

doc = MyTensorDoc(tensor=tf.zeros(1000, 2))
# doc = MyTensorDoc(tensor=tf.zeros(1000, 2))

# Example usage with PyTorch:
import torch
@@ -67,15 +69,7 @@ class MyTensorDoc(BaseDoc):
import numpy as np

doc = MyTensorDoc(tensor=np.zeros((1000, 2)))
'''
---

Returns:
Union[TorchTensor, TensorFlowTensor, NdArray]: The validated and converted tensor.

Raises:
TypeError: If the input value is not a compatible type (torch.Tensor, tensorflow.Tensor, numpy.ndarray).

```
"""

def __getitem__(self: T, item):
2 changes: 2 additions & 0 deletions docs/API_reference/typing/tensor/tensor.md
@@ -4,3 +4,5 @@
::: docarray.typing.tensor.ndarray
::: docarray.typing.tensor.tensorflow_tensor
::: docarray.typing.tensor.torch_tensor
::: docarray.typing.tensor.AnyTensor

1 change: 1 addition & 0 deletions docs/data_types/first_steps.md
@@ -12,3 +12,4 @@ This section covers the following sections:
- [3D Mesh](3d_mesh/3d_mesh.md)
- [Table](table/table.md)
- [Multimodal data](multimodal/multimodal.md)
- [Tensor](tensor/tensor.md)
222 changes: 222 additions & 0 deletions docs/data_types/tensor/tensor.md
@@ -0,0 +1,222 @@
# 🔢 Tensor

DocArray supports several tensor types that you can use inside `BaseDoc`.

The main ones are:

- [`NdArray`][docarray.typing.tensor.NdArray] for NumPy tensors
- [`TorchTensor`][docarray.typing.tensor.TorchTensor] for PyTorch tensors
- [`TensorFlowTensor`][docarray.typing.tensor.TensorFlowTensor] for TensorFlow tensors

The three of them wrap their respective framework's tensor type.

!!! note
    [`NdArray`][docarray.typing.tensor.NdArray] and [`TorchTensor`][docarray.typing.tensor.TorchTensor] are subclasses of their native tensor types. This means that they can be used natively in their frameworks.

!!! warning
    [`TensorFlowTensor`][docarray.typing.tensor.TensorFlowTensor] stores the pure `tf.Tensor` object inside the `tensor` attribute. This is due to a limitation of the TensorFlow framework that prevents you from subclassing the `tf.Tensor` object.

DocArray also supports [`AnyTensor`][docarray.typing.tensor.AnyTensor], which is the Union of the three previous tensor types.
Use it as a generic placeholder to indicate that a field can hold any tensor type (NumPy, PyTorch, TensorFlow).

## Tensor shape validation

All three tensor types support shape validation. This means that you can specify the shape of the tensor using type hint syntax: `NdArray[100, 100]`, `TorchTensor[100, 100]`, `TensorFlowTensor[100, 100]`.

Let's take an example:

```python
from docarray import BaseDoc
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: NdArray[100, 100]
```

If you try to pass a tensor with a different shape, an error will be raised:

```python
import numpy as np

try:
    doc = MyDoc(tensor=np.zeros((100, 200)))
except ValueError as e:
    print(e)
```

```bash
1 validation error for MyDoc
tensor
cannot reshape array of size 20000 into shape (100,100) (type=value_error)
```


If you pass a tensor with the correct shape, no error is raised:

```python
doc = MyDoc(tensor=np.zeros((100, 100)))
```

### Axes validation

You can check that a tensor has the expected number of axes by specifying named axes: `NdArray['x', 'y']`, `TorchTensor['x', 'y']`, or `TensorFlowTensor['x', 'y']`.

```python
from docarray import BaseDoc
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: NdArray['x', 'y']
```

Here you can only pass a tensor with two axes. `np.zeros((10, 12))` will work, but `np.zeros((10, 12, 3))` will raise an error.

### Axis names

You can specify that two axes should have the same dimension by giving them the same name: `NdArray['x', 'x']`, `TorchTensor['x', 'x']`, or `TensorFlowTensor['x', 'x']`.

```python
from docarray import BaseDoc
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: NdArray['x', 'x']
```

Here you can only pass a tensor with two axes that have the same dimension. `np.zeros((10, 10))` will work, but `np.zeros((10, 12))` will raise an error.

### Arbitrary number of axes

To specify that your shape can have an arbitrary number of axes, use an ellipsis: `NdArray['x', ...]` or `NdArray[..., 'x']`. Fixed dimensions work the same way, as in `NdArray[100, ...]`.

```python
from docarray import BaseDoc
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: NdArray[100, ...]
```

Here you can only pass a tensor whose first axis has dimension 100, followed by any number of further axes. `np.zeros((100, 10))` will work, but `np.zeros((10, 12))` will raise an error.
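
The matching rules from the three subsections above can be summarized in a few lines of plain Python. This is only an illustrative sketch: `shape_matches` is a hypothetical helper written for this page, not DocArray's validation code. It assumes that an integer must equal the axis size, that a repeated name must bind to the same size each time, and that `...` absorbs any number of axes.

```python
def shape_matches(shape, spec):
    """Return True if `shape` satisfies `spec` (hypothetical helper, not DocArray API)."""
    shape, spec = tuple(shape), list(spec)

    if Ellipsis in spec:
        # Split the spec around the ellipsis; it absorbs the axes in between
        pos = spec.index(Ellipsis)
        head, tail = spec[:pos], spec[pos + 1:]
        if len(shape) < len(head) + len(tail):
            return False
        pairs = list(zip(shape[: len(head)], head))
        if tail:
            pairs += list(zip(shape[-len(tail):], tail))
    else:
        # Without an ellipsis the number of axes must match exactly
        if len(shape) != len(spec):
            return False
        pairs = list(zip(shape, spec))

    bound = {}  # axis name -> size seen so far
    for size, axis in pairs:
        if isinstance(axis, int):
            if size != axis:
                return False
        elif bound.setdefault(axis, size) != size:
            # The same name was already bound to a different size
            return False
    return True


two_axes_ok = shape_matches((10, 12), ("x", "y"))
three_axes_rejected = not shape_matches((10, 12, 3), ("x", "y"))
same_name_ok = shape_matches((10, 10), ("x", "x"))
leading_100_ok = shape_matches((100, 10), (100, ...))
```

Each of the four results at the end mirrors one of the cases discussed in the text: two named axes, one axis too many, a repeated axis name, and a fixed leading dimension followed by an ellipsis.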

## Tensor type validation

You don't need to instantiate [`NdArray`][docarray.typing.tensor.NdArray], [`TorchTensor`][docarray.typing.tensor.TorchTensor], or [`TensorFlowTensor`][docarray.typing.tensor.TensorFlowTensor] directly yourself.

Instead, you should use them as type hints on [`BaseDoc`][docarray.base_doc.doc.BaseDoc] fields, where they perform data validation.
During this process, [`BaseDoc`][docarray.base_doc.doc.BaseDoc] will cast the native tensor type into the respective DocArray tensor type.

Let's look at an example:

```python
from docarray import BaseDoc
from docarray.typing import NdArray

import numpy as np


class MyDoc(BaseDoc):
    tensor: NdArray


doc = MyDoc(tensor=np.zeros(100))

assert isinstance(doc.tensor, NdArray) # True
```
Here you see that `doc.tensor` has been cast to an `NdArray`. It is still an instance of the native NumPy type:

```python
assert isinstance(doc.tensor, np.ndarray) # True as well
```

Since `NdArray` inherits from `np.ndarray`, you can also use it as a normal NumPy array. The same holds for PyTorch and `TorchTensor`.
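
To see why inheriting from the native type matters, here is a minimal NumPy-only sketch. `TaggedArray` is a hypothetical subclass introduced purely for illustration; it stands in for a class such as `NdArray` and does not reproduce DocArray's implementation.

```python
import numpy as np


class TaggedArray(np.ndarray):
    """Hypothetical ndarray subclass, standing in for a type like NdArray."""


# Reinterpret an existing array as the subclass without copying its data
arr = np.ones(4).view(TaggedArray)

# The object is both the subclass and a regular NumPy array
is_subclass_instance = isinstance(arr, TaggedArray)
is_native_instance = isinstance(arr, np.ndarray)

# Ordinary NumPy operations accept it directly
total = float(np.sum(arr))
shifted = arr + 1
```

Because the subclass is still an `np.ndarray`, functions such as `np.sum` and arithmetic operators work on it without any conversion step.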

## Type coercion with different tensor types

DocArray also supports type coercion between different tensor types. This means that if you pass a different tensor type to a tensor field, it will be converted to the correct tensor type.

For instance, if you define a field of type [`TorchTensor`][docarray.typing.tensor.TorchTensor] and you pass a NumPy array to it, it will be converted to a [`TorchTensor`][docarray.typing.tensor.TorchTensor].

```python
from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np


class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor


doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
```

```bash
📄 MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬────────────────────────────────────────────────────────╮
│ Attribute │ Value │
├─────────────────────┼────────────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64 │
╰─────────────────────┴────────────────────────────────────────────────────────╯
```

It also works in the other direction:

```python
from docarray import BaseDoc
from docarray.typing import NdArray
import torch


class MyTensorsDoc(BaseDoc):
    tensor: NdArray


doc = MyTensorsDoc(tensor=torch.zeros(512))
doc.summary()
```

```bash
📄 MyTensorsDoc : 157e6f5 ...
╭─────────────────┬────────────────────────────────────────────────────────────╮
│ Attribute │ Value │
├─────────────────┼────────────────────────────────────────────────────────────┤
│ tensor: NdArray │ NdArray of shape (512,), dtype: float32 │
╰─────────────────┴────────────────────────────────────────────────────────────╯
```

## `DocVec` with `AnyTensor`

[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] can be used with a `BaseDoc` which has a field of [`AnyTensor`][docarray.typing.tensor.AnyTensor] or any other Union of tensor types.

However, the `DocVec` needs to know the tensor type of the tensor field beforehand to create the correct column.

You can specify this with the `tensor_type` parameter of the [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] constructor:

```python
from docarray import BaseDoc, DocVec
from docarray.typing import AnyTensor, NdArray

import numpy as np


class MyDoc(BaseDoc):
    tensor: AnyTensor


docs = DocVec[MyDoc](
    [MyDoc(tensor=np.zeros(100)) for _ in range(10)], tensor_type=NdArray
)

assert isinstance(docs.tensor, NdArray)
```

!!! note
    `NdArray` will be used by default if both of the following hold:

    - you don't specify the `tensor_type` parameter
    - your tensor field is a Union of tensor types or [`AnyTensor`][docarray.typing.tensor.AnyTensor]
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -121,6 +121,8 @@ nav:
- data_types/3d_mesh/3d_mesh.md
- data_types/table/table.md
- data_types/multimodal/multimodal.md
- data_types/tensor/tensor.md

- Migration guide: migration_guide.md
- ...
- Glossary: glossary.md
3 changes: 3 additions & 0 deletions simple-dl.csv
@@ -0,0 +1,3 @@
id,text
2719443f09b267eb0ac9b4fa997d2031,doc 0
4e96c6bd6096549aacd19eecc208a6a3,doc 1
1 change: 1 addition & 0 deletions simple-dl.json
@@ -0,0 +1 @@
[{"id":"71f4114d146013ec8a35449949801701","text":"doc 0"},{"id":"e5ec7932de6aa0d8e51b3ff6ce81a9cc","text":"doc 1"}]