docs: add a tensor section to docs #1576
Merged
22 commits
c169645
docs: add mention and example for type coercion (#1572)
Tanguyabel cd33236
docs: add documentation on tensor
samsja 8e6c6d0
docs: add tensor type
samsja d3fc08b
fix: apply grammarly
samsja 6be17c5
Merge branch 'main' into docs-tensor-type
samsja 2b0a702
fix: fix some stuff
samsja 9d97468
feat: apply johannes suggestion
samsja 8a61145
fix: fix tf not inherit
samsja bade6dd
fix: remove note
samsja b8be899
fix: add only union for DocVec
samsja 688e4af
docs: rewrite type coercion
samsja 709af9c
docs: rewrite type coercion
samsja 926c3c4
docs: add tensor shape part
samsja fb4ac75
docs: fix link
samsja 8779385
feat: apply alex suggestion
samsja edae14f
docs: fix docstrng
samsja 005be19
docs: fix docstrng
samsja 3a555e0
docs: add icon
samsja bdda23f
fix: capitalize numpy
samsja ba275b1
Merge branch 'main' into docs-tensor-type
samsja 4e3ef00
fix: remove warning
samsja 7437b38
feat: apply alex suggestion
samsja
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,222 @@ | ||
| # 🔢 Tensor | ||
|
|
||
| DocArray supports several tensor types that you can use inside `BaseDoc`. | ||
|
|
||
| The main ones are: | ||
|
|
||
| - [`NdArray`][docarray.typing.tensor.NdArray] for NumPy tensors | ||
| - [`TorchTensor`][docarray.typing.tensor.TorchTensor] for PyTorch tensors | ||
| - [`TensorFlowTensor`][docarray.typing.tensor.TensorFlowTensor] for TensorFlow tensors | ||
|
|
||
| The three of them wrap their respective framework's tensor type. | ||
|
|
||
| !!! note | ||
|     [`NdArray`][docarray.typing.tensor.NdArray] and [`TorchTensor`][docarray.typing.tensor.TorchTensor] are subclasses of their respective native tensor types. This means that they can be used natively in their framework. | ||
|
|
||
| !!! warning | ||
|     [`TensorFlowTensor`][docarray.typing.tensor.TensorFlowTensor] stores the pure `tf.Tensor` object inside the `tensor` attribute. This is due to a limitation of the TensorFlow framework that prevents you from subclassing the `tf.Tensor` object. | ||
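
The practical difference between the two approaches described in the note and the warning can be sketched in plain Python. This is only an illustration under stated assumptions: `NativeTensor`, `native_sum`, `SubclassedTensor` and `WrappedTensor` are hypothetical stand-ins invented for this sketch, not DocArray or framework classes.

```python
class NativeTensor(list):
    """Hypothetical stand-in for a framework's native tensor type."""


def native_sum(t: NativeTensor) -> float:
    """Stand-in for a framework function that only accepts the native type."""
    if not isinstance(t, NativeTensor):
        raise TypeError("expected a NativeTensor")
    return sum(t)


class SubclassedTensor(NativeTensor):
    """Subclassing approach (as described for NdArray and TorchTensor)."""


class WrappedTensor:
    """Wrapping approach (as described for TensorFlowTensor)."""

    def __init__(self, tensor: NativeTensor) -> None:
        # The native object is kept in the `tensor` attribute.
        self.tensor = tensor


sub = SubclassedTensor([1.0, 2.0, 3.0])
wrapped = WrappedTensor(NativeTensor([1.0, 2.0, 3.0]))

print(native_sum(sub))             # the subclass is accepted directly
print(native_sum(wrapped.tensor))  # the wrapper has to be unwrapped first
```

With a wrapper, code that expects the native type needs the `.tensor` attribute; with a subclass, the object can be passed as is.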
|
|
||
| DocArray also supports [`AnyTensor`][docarray.typing.tensor.AnyTensor], which is the Union of the three previous tensor types. | ||
| Use it as a generic placeholder to indicate that a field can hold any of these tensor types (NumPy, PyTorch, TensorFlow). | ||
|
|
||
| ## Tensor shape validation | ||
|
|
||
| All three tensor types support shape validation. This means that you can specify the shape of the tensor using type hint syntax: `NdArray[100, 100]`, `TorchTensor[100, 100]`, `TensorFlowTensor[100, 100]`. | ||
|
|
||
| Let's take an example: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import NdArray | ||
|
|
||
|
|
||
| class MyDoc(BaseDoc): | ||
|     tensor: NdArray[100, 100] | ||
| ``` | ||
|
|
||
| If you try to pass a tensor with a different shape, an error will be raised: | ||
|
|
||
| ```python | ||
| import numpy as np | ||
|
|
||
| try: | ||
|     doc = MyDoc(tensor=np.zeros((100, 200))) | ||
| except ValueError as e: | ||
|     print(e) | ||
| ``` | ||
|
|
||
| ```bash | ||
| 1 validation error for MyDoc | ||
| tensor | ||
| cannot reshape array of size 20000 into shape (100,100) (type=value_error) | ||
| ``` | ||
|
|
||
|
|
||
| If you pass a tensor with the correct shape, no error is raised: | ||
|
|
||
| ```python | ||
| doc = MyDoc(tensor=np.zeros((100, 100))) | ||
| ``` | ||
|
|
||
| ### Axes validation | ||
|
|
||
| You can check that the number of axes is correct by specifying `NdArray['x','y']`, `TorchTensor['x','y']`, `TensorFlowTensor['x','y']`. | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import NdArray | ||
|
|
||
|
|
||
| class MyDoc(BaseDoc): | ||
|     tensor: NdArray['x', 'y'] | ||
| ``` | ||
|
|
||
| Here you can only pass a tensor with two axes. `np.zeros((10, 12))` will work, but `np.zeros((10, 12, 3))` will raise an error. | ||
|
|
||
| ### Axis names | ||
|
|
||
| You can specify that two axes should have the same dimensions with the syntax `NdArray['x', 'x']`, `TorchTensor['x', 'x']`, `TensorFlowTensor['x', 'x']`. | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import NdArray | ||
|
|
||
|
|
||
| class MyDoc(BaseDoc): | ||
|     tensor: NdArray['x', 'x'] | ||
| ``` | ||
|
|
||
| Here you can only pass a tensor with two axes that have the same dimensions. `np.zeros((10, 10))` will work but `np.zeros((10, 12))` will raise an error. | ||
|
|
||
| ### Arbitrary number of axes | ||
|
|
||
| To specify that your shape can have an arbitrary number of axes, use the syntax `NdArray['x', ...]`, or `NdArray[..., 'x']`. | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import NdArray | ||
|
|
||
|
|
||
| class MyDoc(BaseDoc): | ||
|     tensor: NdArray[100, ...] | ||
| ``` | ||
|
|
||
| Here you can only pass a tensor whose first axis has dimension 100, followed by any number of additional axes. `np.zeros((100, 10))` will work but `np.zeros((10, 12))` will raise an error. | ||
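
The matching rules from this section (fixed sizes, named axes, and `...`) can be summarized in a small pure-Python sketch. It is an illustration only, not DocArray's implementation: `shape_matches` is a hypothetical helper, it assumes at most one `...` per spec, and it only models the comparison. DocArray's own validation may do more; the error message shown earlier suggests that it attempts a reshape before failing.

```python
from typing import Dict, Sequence


def shape_matches(shape: Sequence[int], spec: Sequence[object]) -> bool:
    """Return True if `shape` satisfies `spec` (hypothetical helper).

    Spec entries: int = exact size, str = named axis (same name means same
    size), Ellipsis = any number of axes (assumed to appear at most once).
    """
    shape, spec = list(shape), list(spec)
    if spec.count(Ellipsis) > 1:
        raise ValueError("this sketch supports at most one Ellipsis")

    if Ellipsis in spec:
        pos = spec.index(Ellipsis)
        head, tail = spec[:pos], spec[pos + 1:]
        if len(shape) < len(head) + len(tail):
            return False
        # Leading dims are checked against the head, trailing dims against
        # the tail; everything in between is absorbed by the Ellipsis.
        pairs = list(zip(shape[:len(head)], head))
        if tail:
            pairs += list(zip(shape[-len(tail):], tail))
    else:
        if len(shape) != len(spec):
            return False
        pairs = list(zip(shape, spec))

    named: Dict[str, int] = {}
    for size, dim in pairs:
        if isinstance(dim, int):
            if size != dim:
                return False
        elif named.setdefault(dim, size) != size:
            # A repeated axis name must always map to the same size.
            return False
    return True


print(shape_matches((100, 100), (100, 100)))   # True
print(shape_matches((10, 12, 3), ("x", "y")))  # False: three axes, two expected
print(shape_matches((10, 12), ("x", "x")))     # False: sizes differ
print(shape_matches((100, 10), (100, ...)))    # True
```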
|
|
||
| ## Tensor type validation | ||
|
|
||
| You don't need to instantiate [`NdArray`][docarray.typing.tensor.NdArray], [`TorchTensor`][docarray.typing.tensor.TorchTensor], or [`TensorFlowTensor`][docarray.typing.tensor.TensorFlowTensor] directly. | ||
|
|
||
| Instead, you should use them as type hints on [`BaseDoc`][docarray.base_doc.doc.BaseDoc] fields, where they perform data validation. | ||
| During this process, [`BaseDoc`][docarray.base_doc.doc.BaseDoc] will cast the native tensor type into the respective DocArray tensor type. | ||
|
|
||
| Let's look at an example: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import NdArray | ||
|
|
||
| import numpy as np | ||
|
|
||
|
|
||
| class MyDoc(BaseDoc): | ||
|     tensor: NdArray | ||
|
|
||
|
|
||
| doc = MyDoc(tensor=np.zeros(100)) | ||
|
|
||
| assert isinstance(doc.tensor, NdArray) # True | ||
| ``` | ||
| Here, `doc.tensor` is an `NdArray`. It is also an instance of `np.ndarray`: | ||
|
|
||
| ```python | ||
| assert isinstance(doc.tensor, np.ndarray) # True as well | ||
| ``` | ||
|
|
||
| Because `NdArray` inherits from `np.ndarray`, you can also use it as a normal NumPy array. The same holds for PyTorch and `TorchTensor`. | ||
|
|
||
| ## Type coercion with different tensor types | ||
|
|
||
|
|
||
| DocArray also supports type coercion between different tensor types. This means that if you pass a different tensor type to a tensor field, it will be converted to the correct tensor type. | ||
|
|
||
| For instance, if you define a field of type [`TorchTensor`][docarray.typing.tensor.TorchTensor] and you pass a NumPy array to it, it will be converted to a [`TorchTensor`][docarray.typing.tensor.TorchTensor]. | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import TorchTensor | ||
| import numpy as np | ||
|
|
||
|
|
||
| class MyTensorsDoc(BaseDoc): | ||
|     tensor: TorchTensor | ||
|
|
||
|
|
||
| doc = MyTensorsDoc(tensor=np.zeros(512)) | ||
| doc.summary() | ||
| ``` | ||
|
|
||
| ```bash | ||
| 📄 MyTensorsDoc : 0a10f88 ... | ||
| ╭─────────────────────┬────────────────────────────────────────────────────────╮ | ||
| │ Attribute │ Value │ | ||
| ├─────────────────────┼────────────────────────────────────────────────────────┤ | ||
| │ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64 │ | ||
| ╰─────────────────────┴────────────────────────────────────────────────────────╯ | ||
| ``` | ||
|
|
||
| It also works in the other direction: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import NdArray | ||
| import torch | ||
|
|
||
|
|
||
| class MyTensorsDoc(BaseDoc): | ||
|     tensor: NdArray | ||
|
|
||
|
|
||
| doc = MyTensorsDoc(tensor=torch.zeros(512)) | ||
| doc.summary() | ||
| ``` | ||
|
|
||
| ```bash | ||
| 📄 MyTensorsDoc : 157e6f5 ... | ||
| ╭─────────────────┬────────────────────────────────────────────────────────────╮ | ||
| │ Attribute │ Value │ | ||
| ├─────────────────┼────────────────────────────────────────────────────────────┤ | ||
| │ tensor: NdArray │ NdArray of shape (512,), dtype: float32 │ | ||
| ╰─────────────────┴────────────────────────────────────────────────────────────╯ | ||
| ``` | ||
|
|
||
| ## `DocVec` with `AnyTensor` | ||
|
|
||
| [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] can be used with a `BaseDoc` which has a field of [`AnyTensor`][docarray.typing.tensor.AnyTensor] or any other Union of tensor types. | ||
|
|
||
| However, the `DocVec` needs to know the tensor type of the tensor field beforehand to create the correct column. | ||
|
|
||
| You can specify this with the `tensor_type` parameter of the [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] constructor: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc, DocVec | ||
| from docarray.typing import AnyTensor, NdArray | ||
|
|
||
| import numpy as np | ||
|
|
||
|
|
||
| class MyDoc(BaseDoc): | ||
|     tensor: AnyTensor | ||
|
|
||
|
|
||
| docs = DocVec[MyDoc]( | ||
|     [MyDoc(tensor=np.zeros(100)) for _ in range(10)], tensor_type=NdArray | ||
| ) | ||
|
|
||
| assert isinstance(docs.tensor, NdArray) | ||
| ``` | ||
|
|
||
| !!! note | ||
|     `NdArray` will be used by default if: | ||
|
|
||
|     - you don't specify the `tensor_type` parameter | ||
|     - your tensor field is a Union of tensor types or [`AnyTensor`][docarray.typing.tensor.AnyTensor] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| id,text | ||
| 2719443f09b267eb0ac9b4fa997d2031,doc 0 | ||
| 4e96c6bd6096549aacd19eecc208a6a3,doc 1 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| [{"id":"71f4114d146013ec8a35449949801701","text":"doc 0"},{"id":"e5ec7932de6aa0d8e51b3ff6ce81a9cc","text":"doc 1"}] |