description: Tokenizes a tensor of UTF-8 strings into words according to labels.
Tokenizes a tensor of UTF-8 strings into words according to labels.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.SplitMergeTokenizer()
split(
input
)
Alias for Tokenizer.tokenize.
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize(
input, labels, force_split_at_break_character=True
)
Tokenizes a tensor of UTF-8 strings according to labels.
>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
... [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = SplitMergeTokenizer()
>>> tokenizer.tokenize(strings, labels)
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
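The label rows in the example above can be derived mechanically from a known segmentation: the first character of each word gets a split label (0) and every following character gets a merge label (1). A minimal sketch, assuming a hypothetical helper `labels_from_words` that is not part of the library. The `pad_to` argument is also an assumption: it appends trailing 0 labels to keep the rows rectangular, mirroring the second row of the example, which carries 11 labels for the 10-character string `DearFriday`.

```python
def labels_from_words(words, pad_to=None):
    """Build split(0)/merge(1) labels for the string "".join(words).

    Hypothetical helper for illustration only; not a library function.
    """
    labels = []
    for word in words:
        if not word:
            continue  # an empty word contributes no characters
        labels.append(0)                       # first character starts a new word
        labels.extend([1] * (len(word) - 1))   # remaining characters merge into it
    if pad_to is not None:
        # Assumption: trailing padding labels sit beyond the end of the string.
        labels.extend([0] * (pad_to - len(labels)))
    return labels


rows = [["Hello", "Monday"], ["Dear", "Friday"]]
width = max(len("".join(row)) for row in rows)
labels = [labels_from_words(row, pad_to=width) for row in rows]
print(labels)
```

The resulting `labels` list has the same shape and values as the `labels` argument used in the example.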
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| `labels` | An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, with `labels[i1...iN, j]` being the split(0)/merge(1) label of the j-th character for `input[i1...iN]`. Here split means starting a new word with this character, and merge means appending this character to the previous word. |
| `force_split_at_break_character` | A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. When one or more ICU-defined whitespace characters are encountered and `force_split_at_break_character` is true, a new word is created at the first non-space character, regardless of that character's label. For instance, with `input="New York"`, each of the labels `[0, 1, 1, 0, 1, 1, 1, 1]`, `[0, 1, 1, 1, 1, 1, 1, 1]`, and `[0, 1, 1, 1, 0, 1, 1, 1]` yields the output tokens `["New", "York"]`. |
| Returns | |
|---|---|
| A `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`. |
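The split/merge semantics described above can be sketched in plain Python for a single string. This is an illustration under stated assumptions, not the library's implementation: `split_merge` is a hypothetical name, `str.isspace()` stands in for ICU whitespace detection, and only the `force_split_at_break_character=True` behaviour is modelled. Labels beyond the end of the string are ignored.

```python
def split_merge(text, labels):
    """Group the characters of `text` into words using split(0)/merge(1) labels.

    Hypothetical sketch; models only force_split_at_break_character=True.
    """
    tokens = []
    current = []  # characters of the word being built
    for ch, label in zip(text, labels):
        if ch.isspace():  # assumption: approximates ICU whitespace
            # Whitespace ends the current word and is not emitted.
            if current:
                tokens.append("".join(current))
                current = []
            continue
        if label == 0 or not current:
            # A split label, or the first character after whitespace,
            # starts a new word regardless of the label.
            if current:
                tokens.append("".join(current))
            current = [ch]
        else:
            current.append(ch)  # merge into the word in progress
    if current:
        tokens.append("".join(current))
    return tokens


print(split_merge("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
print(split_merge("New York", [0, 1, 1, 1, 1, 1, 1, 1]))
```

Working on one Python string at a time keeps the sketch readable; the real tokenizer operates on whole tensors of strings and returns a `RaggedTensor`.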
tokenize_with_offsets(
input, labels, force_split_at_break_character=True
)
Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.
>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
... [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = SplitMergeTokenizer()
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
>>> tokens
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
>>> starts
<tf.RaggedTensor [[0, 5], [0, 4]]>
>>> ends
<tf.RaggedTensor [[5, 11], [4, 10]]>
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| `labels` | An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, with `labels[i1...iN, j]` being the split(0)/merge(1) label of the j-th character for `input[i1...iN]`. Here split means starting a new word with this character, and merge means appending this character to the previous word. |
| `force_split_at_break_character` | A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. When one or more ICU-defined whitespace characters are encountered and `force_split_at_break_character` is true, a new word is created at the first non-space character, regardless of that character's label. For instance, with `input="New York"`, each of the labels `[0, 1, 1, 0, 1, 1, 1, 1]`, `[0, 1, 1, 1, 1, 1, 1, 1]`, and `[0, 1, 1, 1, 0, 1, 1, 1]` yields the output tokens `["New", "York"]`. |
| Returns | |
|---|---|
| A tuple `(tokens, start_offsets, end_offsets)` where: | |
| `tokens` | is a `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`. |
| `start_offsets` | is a `RaggedTensor` of int64s where `start_offsets[i1...iN, j]` is the byte offset for the start of the `j-th` token in `input[i1...iN]`. |
| `end_offsets` | is a `RaggedTensor` of int64s where `end_offsets[i1...iN, j]` is the byte offset immediately after the end of the `j-th` token in `input[i1...iN]`. |
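Because the offsets are byte offsets while the labels are per character, the two diverge as soon as the input contains multi-byte UTF-8 characters. The following plain-Python sketch shows how `[start, end)` byte ranges could be tracked alongside the tokens. It is an illustration only: `split_merge_with_offsets` is a hypothetical name, `str.isspace()` approximates ICU whitespace, and only the `force_split_at_break_character=True` behaviour is modelled.

```python
def split_merge_with_offsets(text, labels):
    """Return (tokens, starts, ends) with [start, end) UTF-8 byte offsets.

    Hypothetical sketch for a single string; not the library's implementation.
    """
    tokens, starts, ends = [], [], []
    pos = 0            # byte position of the current character in UTF-8
    in_word = False    # True while a word is being extended
    for ch, label in zip(text, labels):
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        if ch.isspace():                 # assumption: approximates ICU whitespace
            in_word = False              # whitespace closes the word, is not emitted
        elif label == 0 or not in_word:
            tokens.append(ch)            # start a new word at this character
            starts.append(pos)
            ends.append(pos + width)
            in_word = True
        else:
            tokens[-1] += ch             # merge into the current word
            ends[-1] = pos + width
        pos += width
    return tokens, starts, ends


# ASCII input: byte offsets coincide with character positions.
print(split_merge_with_offsets("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# Non-ASCII input: "\u00ef" and "\u00e9" each occupy two bytes in UTF-8.
print(split_merge_with_offsets("na\u00efveCaf\u00e9", [0, 1, 1, 1, 1, 0, 1, 1, 1]))
```

In the second call the first token has five characters but spans six bytes, so the second token starts at byte 6 rather than 5.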