66 commits
89fb6a4
feat: add multi modal document
alaeddine-13 Mar 9, 2022
f16b19d
test: add test simple
winstonww Mar 10, 2022
f084c0d
test: add multi modal nested test
alaeddine-13 Mar 10, 2022
74f8fe1
fix: fix iterable multi modal annotation
alaeddine-13 Mar 10, 2022
cec8fe9
test: cover iterable annotation
alaeddine-13 Mar 10, 2022
48cd192
feat: add meta_tags attribute
alaeddine-13 Mar 10, 2022
c5002b4
feat: store multi modal schema in meta_tags
alaeddine-13 Mar 10, 2022
d051f6c
test: assert correct schema
alaeddine-13 Mar 10, 2022
13c4690
fix: fix storing schema
alaeddine-13 Mar 10, 2022
56f7415
test: assert correct schema
alaeddine-13 Mar 10, 2022
18376d3
fix: lint
alaeddine-13 Mar 10, 2022
c714764
feat: add multi modal get attribute method
alaeddine-13 Mar 11, 2022
0899401
test: cover iterable of nested
alaeddine-13 Mar 11, 2022
411fe77
chore: rename meta_tags to metadata
alaeddine-13 Mar 14, 2022
c0421dd
test: assert nested
alaeddine-13 Mar 14, 2022
1f0069f
test: assert nested
alaeddine-13 Mar 14, 2022
3458c5b
fix: fix get_multi_modal_attribute
alaeddine-13 Mar 14, 2022
f0a4735
test: test get_multi_modal_attribute
alaeddine-13 Mar 14, 2022
2a654b9
fix: use ForwardRef to define types
alaeddine-13 Mar 14, 2022
c5211fe
feat: serialize and deserialize metadata attribute
alaeddine-13 Mar 14, 2022
06a1e98
fix: rebuild proto
alaeddine-13 Mar 14, 2022
ae18a9b
feat: extend traversal path syntax
alaeddine-13 Mar 14, 2022
254f9a9
test: cover simple traverse
alaeddine-13 Mar 14, 2022
6756a03
test: cover traverse with many attributes selector
alaeddine-13 Mar 14, 2022
38ae6ac
fix: fix traverse syntax
alaeddine-13 Mar 15, 2022
0f591b8
test: cover traverse
alaeddine-13 Mar 15, 2022
95a77cb
Merge branch 'main' into feat-multi-modal
alaeddine-13 Mar 15, 2022
a23d858
chore: separate by comma
alaeddine-13 Mar 16, 2022
a4c9103
chore: rename to _metadata
alaeddine-13 Mar 16, 2022
e6f2be6
test: cover traverse chunks attribute
alaeddine-13 Mar 16, 2022
188548f
feat: change traversal paths syntax
alaeddine-13 Mar 17, 2022
6ece923
test: adapt tests to syntax change
alaeddine-13 Mar 17, 2022
dce5114
refactor: refactor traversal path grammar definition
alaeddine-13 Mar 17, 2022
0a2d996
test: test separator
alaeddine-13 Mar 17, 2022
d4c950e
fix: make sure syntax accepts whitespace everywhere
alaeddine-13 Mar 17, 2022
5127695
test: cover whitespace and equivalent traversal paths
alaeddine-13 Mar 17, 2022
2e659a4
feat: add optional brackets around slice
alaeddine-13 Mar 17, 2022
e88c968
test: add optional square brackets over slice
alaeddine-13 Mar 17, 2022
755d9e4
feat: allow selecting single attribute without square brackets
alaeddine-13 Mar 17, 2022
6793fb3
test: test optional square brackets with single attributes
alaeddine-13 Mar 17, 2022
3af7214
chore: add warning comment
alaeddine-13 Mar 17, 2022
c546ad9
feat: allow single offset access in traversal paths
alaeddine-13 Mar 18, 2022
aca7d07
feat: add from_document method
alaeddine-13 Mar 18, 2022
5534b8c
test: test translation from docs to dataclasses
alaeddine-13 Mar 18, 2022
762c3f2
feat: support bool as primitive type
alaeddine-13 Mar 18, 2022
de142ff
fix: fix proto serialization
alaeddine-13 Mar 21, 2022
8de0a3d
test: cover proto serialization
alaeddine-13 Mar 21, 2022
bd13124
chore: apply suggestion
alaeddine-13 Mar 21, 2022
32743df
docs: update docstring
alaeddine-13 Mar 21, 2022
fddb749
feat: support modality types
alaeddine-13 Mar 21, 2022
bf8064f
test: adapt tests
alaeddine-13 Mar 21, 2022
b697baf
feat: improve types
alaeddine-13 Mar 22, 2022
9ec34f6
test: adapt tests
alaeddine-13 Mar 22, 2022
d8a351e
chore: add librosa to test requirements
alaeddine-13 Mar 22, 2022
389e6ac
fix: support providing custom Field
alaeddine-13 Mar 22, 2022
b998cc0
test: cover custom field type
alaeddine-13 Mar 22, 2022
c060008
chore: apply suggestion
alaeddine-13 Mar 23, 2022
c2b6800
feat: support bytes
alaeddine-13 Mar 23, 2022
acf88c2
chore: strong check with custom is_dataclass
alaeddine-13 Mar 23, 2022
807367d
test: cover default values passed to dataclass schema
alaeddine-13 Mar 23, 2022
3cfcab3
feat: support JSON type
alaeddine-13 Mar 23, 2022
8da0779
test: add JSON type test
alaeddine-13 Mar 23, 2022
0c3267f
test: install soundfile lib
alaeddine-13 Mar 23, 2022
903ed42
test: fix serializer
alaeddine-13 Mar 23, 2022
447ce51
ci: install soundfile lib
alaeddine-13 Mar 23, 2022
70e565d
Merge branch 'main' into feat-multi-modal
alaeddine-13 Mar 23, 2022
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -112,6 +112,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip install wheel
pip install --no-cache-dir ".[full,test]"
sudo apt-get install libsndfile1
- name: Test
id: test
run: |
110 changes: 97 additions & 13 deletions docarray/array/mixins/traverse.py
@@ -6,13 +6,73 @@
Optional,
Callable,
Tuple,
Dict,
List,
)

if TYPE_CHECKING:
from ... import DocumentArray, Document
from ...types import T


ATTRIBUTES_SEPARATOR = ','
PATHS_SEPARATOR = ','

SLICE_BASE = r'[-\d:]+'
WRAPPED_SLICE_BASE = r'\[[-\d:]+\]'

SLICE = rf'({SLICE_BASE}|{WRAPPED_SLICE_BASE})?'
SLICE_TAGGED = rf'(?P<slice>{SLICE})'

ATTRIBUTE_NAME = r'[a-zA-Z][a-zA-Z0-9]*'

# accepts both syntaxes: '.[att]' or '.att'
# However, this makes the grammar ambiguous. E.g:
# 'r.attr' should it be parsed into tokens 'r', '.', 'attr' or 'r', '.', 'att', 'r' ?
ATTRIBUTE = rf'\.(\[({ATTRIBUTE_NAME}({ATTRIBUTES_SEPARATOR}{ATTRIBUTE_NAME})*)\]|{ATTRIBUTE_NAME})'
ATTRIBUTE_TAGGED = rf'\.(\[(?P<attributes>{ATTRIBUTE_NAME}({ATTRIBUTES_SEPARATOR}{ATTRIBUTE_NAME})*)\]|(?P<attribute>{ATTRIBUTE_NAME}))'

SELECTOR = rf'(r|c|m|{ATTRIBUTE})'
SELECTOR_TAGGED = rf'(?P<selector>r|c|m|{ATTRIBUTE_TAGGED})'

REMAINDER = rf'({SELECTOR}{SLICE})*'
REMAINDER_TAGGED = rf'(?P<remainder>({SELECTOR}{SLICE})*)'

TRAVERSAL_PATH = rf'{SELECTOR}{SLICE}{REMAINDER}'
TRAVERSAL_PATH_TAGGED = rf'(?P<path>{SELECTOR_TAGGED}{SLICE_TAGGED}){REMAINDER_TAGGED}'


PATHS_REMAINDER_TAGGED = rf'(?P<paths_remainder>({PATHS_SEPARATOR}{TRAVERSAL_PATH})*)'

TRAVERSAL_PATH_LIST_TAGGED = (
rf'^(?P<traversal_path>{TRAVERSAL_PATH}){PATHS_REMAINDER_TAGGED}$'
)

ATTRIBUTE_REGEX = re.compile(rf'^{ATTRIBUTE}$')
TRAVERSAL_PATH_REGEX = re.compile(rf'^{TRAVERSAL_PATH_TAGGED}$')
TRAVERSAL_PATH_LIST_REGEX = re.compile(TRAVERSAL_PATH_LIST_TAGGED)


def _re_traversal_path_split(path: str) -> List[str]:
res = []
remainder = path
while True:
m = TRAVERSAL_PATH_LIST_REGEX.match(remainder)
if not m:
raise ValueError(
f'`path`:{path} is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure'
)
group_dict = m.groupdict()
current, remainder = group_dict['traversal_path'], group_dict['paths_remainder']
res.append(current)
if not remainder:
break
else:
remainder = remainder[1:]

return res


class TraverseMixin:
"""
A mixin used for traversing :class:`DocumentArray`.
@@ -36,13 +96,16 @@ def traverse(
- `r`: docs in this TraversableSequence
- `m`: all match-documents at adjacency 1
- `c`: all child-documents at granularity 1
- `r.[attribute]`: access attribute of a multi modal document
- `cc`: all child-documents at granularity 2
- `mm`: all match-documents at adjacency 2
- `cm`: all match-document at adjacency 1 and granularity 1
- `r,c`: docs in this TraversableSequence and all child-documents at granularity 1
- `r[start:end]`: access sub document array using slice

"""
for p in traversal_paths.split(','):

Member:
I guess the docstrings need to be updated as well.

Member Author:
Updated.
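
Editor's note on why the plain `split(',')` is replaced in this hunk: with the extended syntax, a comma can separate whole paths (`r,c`) or attribute names inside brackets (`r.[image,text]`), so splitting on every comma would cut a bracketed selector in half. The following is a minimal sketch of splitting on top-level commas only. It uses a simplified grammar and hypothetical names (`PATH_LIST`, `split_paths`), not the exact regexes from this PR.

```python
import re
from typing import List

# Simplified, hypothetical grammar (an assumption, not the PR's exact regexes):
# selectors are r / c / m, or an attribute selector `.name` / `.[a,b]`,
# each optionally followed by a slice such as `1:3` or `[1:3]`.
NAME = r'[a-zA-Z][a-zA-Z0-9]*'
ATTRIBUTE = rf'\.(?:\[{NAME}(?:,{NAME})*\]|{NAME})'
SLICE = r'(?:\[[-\d:]+\]|[-\d:]+)?'
PATH = rf'(?:(?:r|c|m|{ATTRIBUTE}){SLICE})+'
PATH_LIST = re.compile(rf'^(?P<head>{PATH})(?P<rest>(?:,{PATH})*)$')


def split_paths(paths: str) -> List[str]:
    """Split a comma-separated list of traversal paths on top-level commas."""
    paths = re.sub(r'\s+', '', paths)  # whitespace is ignored everywhere
    result = []
    while True:
        m = PATH_LIST.match(paths)
        if m is None:
            raise ValueError(f'invalid traversal path: {paths!r}')
        result.append(m.group('head'))
        rest = m.group('rest')
        if not rest:
            return result
        paths = rest[1:]  # drop the leading comma and continue


print(split_paths('r, c'))
print(split_paths('r.[image, text], c'))
```

The loop peels off one path at a time, mirroring the approach of `_re_traversal_path_split` in the diff: the regex validates the whole remaining string and captures the first path, and everything after the first top-level comma is re-parsed on the next iteration.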

traversal_paths = re.sub(r'\s+', '', traversal_paths)
for p in _re_traversal_path_split(traversal_paths):
yield from self._traverse(self, p, filter_fn=filter_fn)

@staticmethod
@@ -53,21 +116,33 @@ def _traverse(
):
path = re.sub(r'\s+', '', path)
if path:
cur_loc, cur_slice, _left = _parse_path_string(path)
group_dict = _parse_path_string(path)
cur_loc = group_dict['selector']
cur_slice = group_dict['slice']
remainder = group_dict['remainder']

if cur_loc == 'r':
yield from TraverseMixin._traverse(
docs[cur_slice], _left, filter_fn=filter_fn
docs[cur_slice], remainder, filter_fn=filter_fn
)
elif cur_loc == 'm':
for d in docs:
yield from TraverseMixin._traverse(
d.matches[cur_slice], _left, filter_fn=filter_fn
d.matches[cur_slice], remainder, filter_fn=filter_fn
)
elif cur_loc == 'c':
for d in docs:
yield from TraverseMixin._traverse(
d.chunks[cur_slice], _left, filter_fn=filter_fn
d.chunks[cur_slice], remainder, filter_fn=filter_fn
)
elif ATTRIBUTE_REGEX.match(cur_loc):
for d in docs:
for attribute in group_dict['attributes']:
yield from TraverseMixin._traverse(
d.get_multi_modal_attribute(attribute)[cur_slice],
remainder,
filter_fn=filter_fn,
)
else:
raise ValueError(
f'`path`:{path} is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure'
@@ -92,7 +167,8 @@ def traverse_flat_per_path(
:param filter_fn: function to filter docs during traversal
:yield: :class:``TraversableSequence`` containing the document of all leaves per path.
"""
for p in traversal_paths.split(','):
traversal_paths = re.sub(r'\s+', '', traversal_paths)
for p in _re_traversal_path_split(traversal_paths):
yield self._flatten(self._traverse(self, p, filter_fn=filter_fn))

def traverse_flat(
@@ -159,23 +235,31 @@ def _flatten(sequence) -> 'DocumentArray':
return DocumentArray(list(itertools.chain.from_iterable(sequence)))


def _parse_path_string(p: str) -> Tuple[str, slice, str]:
g = re.match(r'^([rcm])([-\d:]+)?([rcm].*)?$', p)
_this = g.group(1)
slice_str = g.group(2)
_next = g.group(3)
return _this, _parse_slice(slice_str or ':'), _next or ''
def _parse_path_string(p: str) -> Dict[str, str]:
g = TRAVERSAL_PATH_REGEX.match(p)
group_dict = g.groupdict()
group_dict['remainder'] = group_dict.get('remainder') or ''
group_dict['slice'] = _parse_slice(group_dict.get('slice') or ':')
if group_dict.get('attributes'):
group_dict['attributes'] = group_dict['attributes'].split(ATTRIBUTES_SEPARATOR)
elif group_dict.get('attribute'):
group_dict['attributes'] = [group_dict.get('attribute')]

return group_dict


def _parse_slice(value):
"""
Parses a `slice()` from string, like `start:stop:step`.
"""
if re.match(WRAPPED_SLICE_BASE, value):
value = value[1:-1]

if value:
parts = value.split(':')
if len(parts) == 1:
# slice(stop)
parts = [None, parts[0]]
parts = [parts[0], str(int(parts[0]) + 1)]
# else: slice(start, stop[, step])
else:
# slice()
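
Editor's note on the single-offset branch above: turning an offset `n` into the pair `[n, n + 1]` selects exactly one element for non-negative `n`. If the collapsed remainder of `_parse_slice` converts both parts with `int`, the offset `-1` would become `slice(-1, 0)`, which selects nothing. The sketch below is a standalone re-implementation under that assumption; the name `parse_slice` and the special-casing of `-1` are illustrative and not part of the PR.

```python
import re

# Hypothetical standalone version of the slice parser, for illustration only.
WRAPPED_SLICE = re.compile(r'^\[[-\d:]+\]$')


def parse_slice(value: str) -> slice:
    """Parse 'start:stop:step', '[start:stop]', or a single offset into a slice."""
    if WRAPPED_SLICE.match(value):
        value = value[1:-1]  # strip the optional square brackets
    if not value:
        return slice(None)  # empty string selects everything
    parts = value.split(':')
    if len(parts) == 1:
        start = int(parts[0])
        # A single offset selects one element. For -1 the stop must be None,
        # because slice(-1, 0) would be empty.
        stop = start + 1 if start != -1 else None
        return slice(start, stop)
    return slice(*[int(p) if p else None for p in parts])


items = list(range(5))
print(items[parse_slice('2')])      # single offset
print(items[parse_slice('[1:3]')])  # bracketed slice
print(items[parse_slice('-1')])     # last element
```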
4 changes: 3 additions & 1 deletion docarray/document/data.py
@@ -21,6 +21,7 @@
uri='',
mime_type='',
tags=dict,
_metadata=dict,
offset=0.0,
location=list,
modality='',
@@ -51,6 +52,7 @@ class DocumentData:
weight: Optional[float] = None
uri: Optional[str] = None
tags: Optional[Dict[str, 'StructValueType']] = None
_metadata: Optional[Dict[str, 'StructValueType']] = None
offset: Optional[float] = None
location: Optional[List[float]] = None
embedding: Optional['ArrayType'] = field(default=None, hash=False, compare=False)
@@ -65,7 +67,7 @@ def _non_empty_fields(self) -> Tuple[str]:
r = []
for f in fields(self):
f_name = f.name
if not f_name.startswith('_'):
if not f_name.startswith('_') or f_name == '_metadata':
v = getattr(self, f_name)
if v is not None:
if f_name not in default_values:
2 changes: 2 additions & 0 deletions docarray/document/mixins/__init__.py
@@ -7,6 +7,7 @@
from .featurehash import FeatureHashMixin
from .image import ImageDataMixin
from .mesh import MeshDataMixin
from .multimodal import MultiModalMixin
from .plot import PlotMixin
from .porting import PortingMixin
from .property import PropertyMixin
@@ -37,6 +38,7 @@ class AllMixins(
PortingMixin,
FeatureHashMixin,
GetAttributesMixin,
MultiModalMixin,
):
"""All plugins that can be used in :class:`Document`. """

9 changes: 9 additions & 0 deletions docarray/document/mixins/_property.py
@@ -118,6 +118,15 @@ def tags(self) -> Optional[Dict[str, 'StructValueType']]:
def tags(self, value: Dict[str, 'StructValueType']):
self._data.tags = value

@property
def _metadata(self) -> Optional[Dict[str, 'StructValueType']]:
self._data._set_default_value_if_none('_metadata')
return self._data._metadata

@_metadata.setter
def _metadata(self, value: Dict[str, 'StructValueType']):
self._data._metadata = value

@property
def offset(self) -> Optional[float]:
self._data._set_default_value_if_none('offset')
140 changes: 140 additions & 0 deletions docarray/document/mixins/multimodal.py
@@ -0,0 +1,140 @@
import base64

import typing
from enum import Enum

from docarray.types.multimodal import Image, Text, Field, is_dataclass
from docarray.types.multimodal import TYPES_REGISTRY

if typing.TYPE_CHECKING:
from docarray import Document, DocumentArray


class AttributeType(str, Enum):
DOCUMENT = 'document'
PRIMITIVE = 'primitive'
ITERABLE_PRIMITIVE = 'iterable_primitive'
ITERABLE_DOCUMENT = 'iterable_document'
NESTED = 'nested'
ITERABLE_NESTED = 'iterable_nested'


class MultiModalMixin:
@classmethod
def from_dataclass(cls, obj):
if not is_dataclass(obj):
raise ValueError(f'Object {obj.__name__} is not a dataclass instance')

from docarray import Document

root = Document()
tags = {}
multi_modal_schema = {}
for key, field in obj.__dataclass_fields__.items():
attribute = getattr(obj, key)
if field.type in [str, int, float, bool] and not isinstance(field, Field):
tags[key] = attribute
multi_modal_schema[key] = {
'attribute_type': AttributeType.PRIMITIVE,
'type': field.type.__name__,
}

elif field.type == bytes and not isinstance(field, Field):
tags[key] = base64.b64encode(attribute).decode()
multi_modal_schema[key] = {
'attribute_type': AttributeType.PRIMITIVE,
'type': field.type.__name__,
}
elif isinstance(field.type, typing._GenericAlias):

Member:
What does this mean?

Member Author:
It handles types like List[type].
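
Editor's note on the check discussed above: `typing._GenericAlias` and the `_name` attribute are private details of the `typing` module. On Python 3.8 and later, the public helpers `typing.get_origin` and `typing.get_args` expose the same information. The sketch below shows one way to detect `List[X]` or `Iterable[X]` and extract `X`; the helper name `iterable_item_type` is hypothetical and not part of this PR.

```python
import collections.abc
import typing
from typing import Dict, Iterable, List, Optional


def iterable_item_type(annotation) -> Optional[type]:
    """Return X for List[X] or Iterable[X]; return None for anything else."""
    origin = typing.get_origin(annotation)  # list for List[int], None for int
    if origin in (list, collections.abc.Iterable):
        args = typing.get_args(annotation)  # (int,) for List[int]
        return args[0] if args else None
    return None


print(iterable_item_type(List[int]))
print(iterable_item_type(Iterable[str]))
print(iterable_item_type(Dict[str, int]))
print(iterable_item_type(int))
```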

if field.type._name in ['List', 'Iterable']:

Member:
Isn't it safer to check the instance and test whether it can be iterated?

Member Author:
I find it better to stick to the type that the user explicitly provided.
If we rely on the dynamic type, we introduce implicit behaviour for the user.
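
Editor's note illustrating the trade-off in this thread: a runtime iterability check also accepts values the user never declared as collections, most notably plain strings. The sketch below compares the two approaches on a hypothetical dataclass (`Article` is an assumption made up for this example, not a type from the PR).

```python
import typing
from collections.abc import Iterable
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Article:  # hypothetical example dataclass
    title: str
    keywords: List[str]


article = Article(title='hello', keywords=['a', 'b'])

# Dynamic check: inspects the runtime value. A str is iterable too,
# so `title` would be treated as a collection of characters.
dynamic = {
    f.name: isinstance(getattr(article, f.name), Iterable) for f in fields(article)
}

# Annotation check: only fields explicitly declared as List[...] qualify.
declared = {f.name: typing.get_origin(f.type) is list for f in fields(article)}

print(dynamic)
print(declared)
```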

sub_type = field.type.__args__[0]
if sub_type in [str, int, float, bool]:
tags[key] = attribute
multi_modal_schema[key] = {
'attribute_type': AttributeType.ITERABLE_PRIMITIVE,
'type': f'{field.type._name}[{sub_type.__name__}]',
}

else:
chunk = Document()
for element in attribute:
doc, attribute_type = cls._from_obj(
element, sub_type, field
)
if attribute_type == AttributeType.DOCUMENT:
attribute_type = AttributeType.ITERABLE_DOCUMENT
elif attribute_type == AttributeType.NESTED:
attribute_type = AttributeType.ITERABLE_NESTED
else:
raise ValueError(
f'Unsupported type annotation inside Iterable: {sub_type}'
)
chunk.chunks.append(doc)
multi_modal_schema[key] = {
'attribute_type': attribute_type,
'type': f'{field.type._name}[{sub_type.__name__}]',
'position': len(root.chunks),
}
root.chunks.append(chunk)
else:
raise ValueError(f'Unsupported type annotation {field.type._name}')
else:
doc, attribute_type = cls._from_obj(attribute, field.type, field)
multi_modal_schema[key] = {
'attribute_type': attribute_type,
'type': field.type.__name__,
'position': len(root.chunks),
}
root.chunks.append(doc)

# TODO: may have to modify this?
root.tags = tags
root._metadata['multi_modal_schema'] = multi_modal_schema

return root

def get_multi_modal_attribute(self, attribute: str) -> 'DocumentArray':
from docarray import DocumentArray

if 'multi_modal_schema' not in self._metadata:
raise ValueError(
'the Document does not correspond to a Multi Modal Document'
)

if attribute not in self._metadata['multi_modal_schema']:
raise ValueError(
f'the Document schema does not contain attribute {attribute}'
)

attribute_type = self._metadata['multi_modal_schema'][attribute][
'attribute_type'
]
position = self._metadata['multi_modal_schema'][attribute].get('position')

if attribute_type in [AttributeType.DOCUMENT, AttributeType.NESTED]:
return DocumentArray([self.chunks[position]])
elif attribute_type in [
AttributeType.ITERABLE_DOCUMENT,
AttributeType.ITERABLE_NESTED,
]:
return self.chunks[position].chunks
else:
raise ValueError(
f'Invalid attribute {attribute}: must a Document attribute or nested dataclass'
)

@classmethod
def _from_obj(cls, obj, obj_type, field) -> typing.Tuple['Document', AttributeType]:
from docarray import Document

attribute_type = AttributeType.DOCUMENT

if is_dataclass(obj_type):
doc = cls.from_dataclass(obj)
attribute_type = AttributeType.NESTED
elif isinstance(field, Field):
doc = Document()
field.serializer(obj, field.name, doc)
else:

numb3r3 (Contributor), Mar 21, 2022:
Should we also consider the case elif obj_type == Document?

Member Author:
The typing module doesn't offer a Document type annotation.
If we're supposed to offer such an annotation, how do you suggest it should work?

Contributor:
I'm also not quite clear about it. Does it make sense to offer a Document annotation?

Member Author:
I don't think so; I think the modality of the type should be clear.
We could also add a Field type annotation to let users provide their custom types, but I'm not sure how high a priority that is.

raise ValueError(f'Unsupported type annotation')
return doc, attribute_type
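
Editor's note summarizing the primitive-handling part of `from_dataclass`: primitive fields go into `tags`, `bytes` values are base64-encoded so they can be stored as strings, and a parallel schema records each field's declared type. The sketch below reproduces only that part with the standard library. It does not use docarray, it skips the chunk-based handling of documents and nested dataclasses, and `build_schema` and `Sample` are hypothetical names. The error message uses `{obj!r}` because a plain instance generally has no `__name__` attribute.

```python
import base64
import dataclasses
import typing
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

PRIMITIVES = (str, int, float, bool)


def build_schema(obj: Any) -> Tuple[Dict[str, Any], Dict[str, Dict[str, str]]]:
    """Return (tags, schema) for the primitive fields of a dataclass instance."""
    if not dataclasses.is_dataclass(obj) or isinstance(obj, type):
        raise ValueError(f'{obj!r} is not a dataclass instance')
    tags: Dict[str, Any] = {}
    schema: Dict[str, Dict[str, str]] = {}
    for f in dataclasses.fields(obj):
        value = getattr(obj, f.name)
        args = typing.get_args(f.type)
        if f.type in PRIMITIVES:
            tags[f.name] = value
            schema[f.name] = {'attribute_type': 'primitive', 'type': f.type.__name__}
        elif f.type is bytes:
            # bytes are not JSON-friendly, so store them as a base64 string
            tags[f.name] = base64.b64encode(value).decode()
            schema[f.name] = {'attribute_type': 'primitive', 'type': 'bytes'}
        elif typing.get_origin(f.type) is list and args and args[0] in PRIMITIVES:
            tags[f.name] = value
            schema[f.name] = {
                'attribute_type': 'iterable_primitive',
                'type': f'List[{args[0].__name__}]',
            }
        else:
            # documents and nested dataclasses are out of scope for this sketch
            raise ValueError(f'Unsupported type annotation {f.type!r}')
    return tags, schema


@dataclass
class Sample:  # hypothetical example dataclass
    name: str
    score: float
    blob: bytes
    labels: List[str]


tags, schema = build_schema(Sample('a', 0.5, b'\x00\x01', ['x', 'y']))
print(tags)
print(schema)
```

Keeping the declared type name in the schema is what makes the reverse translation possible: a reader of `tags['blob']` knows to base64-decode it only because the schema records `bytes`.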
1 change: 1 addition & 0 deletions docarray/document/pydantic_model.py
@@ -39,6 +39,7 @@ class PydanticDocument(BaseModel):
weight: Optional[float]
uri: Optional[str]
tags: Optional[Dict[str, '_StructValueType']]
_metadata: Optional[Dict[str, '_StructValueType']]
offset: Optional[float]
location: Optional[List[float]]
embedding: Optional[Any]