SplitMergeTokenizer.md

description: Tokenizes a tensor of UTF-8 strings into words according to labels.

text.SplitMergeTokenizer

Tokenizes a tensor of UTF-8 strings into words according to labels.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.SplitMergeTokenizer()

Methods

`split`

split(
    input
)

Alias for `Tokenizer.tokenize`.

`split_with_offsets`

split_with_offsets(
    input
)

Alias for `TokenizerWithOffsets.tokenize_with_offsets`.

`tokenize`

tokenize(
    input, labels, force_split_at_break_character=True
)

Tokenizes a tensor of UTF-8 strings according to labels.

Example:

>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = SplitMergeTokenizer()
>>> tokenizer.tokenize(strings, labels)
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
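
The label rows in this example follow a regular pattern: 0 for the first character of each word and 1 for every character after it. As a minimal sketch, the helper below builds such a row from a desired segmentation. `labels_for_words` is a hypothetical name and not part of the library, and the helper counts Python characters, which is an assumption about how characters are counted for non-ASCII text.

```python
def labels_for_words(words):
    """Build split(0)/merge(1) labels for the string formed by concatenating `words`."""
    labels = []
    for word in words:
        if not word:
            continue  # an empty word contributes no characters
        labels.append(0)                      # first character starts a new word
        labels.extend([1] * (len(word) - 1))  # remaining characters merge into it
    return labels


print(labels_for_words(["Hello", "Monday"]))
# [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
```

For `["Dear", "Friday"]` the helper yields ten labels. The eleventh label in the example's second row appears to be padding that keeps both rows the same length.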

Args
`input`	An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.
`labels`	An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, with `labels[i1...iN, j]` being the split(0)/merge(1) label of the j-th character for `input[i1...iN]`. Here, split means starting a new word with this character, and merge means appending this character to the previous word.
`force_split_at_break_character`	A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:
* If `force_split_at_break_character` is true, a new word starts at the first non-space character, regardless of that character's label. For instance:
    * `input="New York"`, `labels=[0, 1, 1, 0, 1, 1, 1, 1]`, output `tokens=["New", "York"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 1, 1, 1, 1]`, output `tokens=["New", "York"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 0, 1, 1, 1]`, output `tokens=["New", "York"]`
* Otherwise, whether the first non-space character starts a new word depends on that character's label. For instance:
    * `input="New York"`, `labels=[0, 1, 1, 0, 1, 1, 1, 1]`, output `tokens=["NewYork"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 1, 1, 1, 1]`, output `tokens=["NewYork"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 0, 1, 1, 1]`, output `tokens=["New", "York"]`

Returns
A `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`.
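
The split/merge rules and the effect of `force_split_at_break_character` described above can be sketched in plain Python for a single string. This is an illustrative model only, not the library's implementation: `split_merge` is a hypothetical helper, `str.isspace()` stands in for ICU-defined whitespace, and treating a leading character labeled 1 as the start of a word is an assumption.

```python
def split_merge(text, labels, force_split_at_break_character=True):
    """Illustrative model of split(0)/merge(1) tokenization for one string."""
    tokens = []
    after_break = False  # True once whitespace was seen since the last kept character
    for ch, label in zip(text, labels):
        if ch.isspace():  # assumption: approximates ICU-defined whitespace
            after_break = True
            continue      # whitespace is never part of a token
        forced = after_break and force_split_at_break_character
        if label == 0 or forced or not tokens:
            tokens.append(ch)   # split: start a new word with this character
        else:
            tokens[-1] += ch    # merge: append to the previous word
        after_break = False
    return tokens


labels = [0, 1, 1, 0, 1, 1, 1, 1]
print(split_merge("New York", labels, force_split_at_break_character=True))
# ['New', 'York']
print(split_merge("New York", labels, force_split_at_break_character=False))
# ['NewYork']
```

With the flag disabled, only the label of the first non-space character decides whether a new word starts, which is why the `Y` labeled 1 is merged into `New`.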

`tokenize_with_offsets`

tokenize_with_offsets(
    input, labels, force_split_at_break_character=True
)

Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.

Example:

>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = SplitMergeTokenizer()
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
>>> tokens
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
>>> starts
<tf.RaggedTensor [[0, 5], [0, 4]]>
>>> ends
<tf.RaggedTensor [[5, 11], [4, 10]]>

Args
`input`	An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.
`labels`	An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, with `labels[i1...iN, j]` being the split(0)/merge(1) label of the j-th character for `input[i1...iN]`. Here, split means starting a new word with this character, and merge means appending this character to the previous word.
`force_split_at_break_character`	A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:
* If `force_split_at_break_character` is true, a new word starts at the first non-space character, regardless of that character's label. For instance:
    * `input="New York"`, `labels=[0, 1, 1, 0, 1, 1, 1, 1]`, output `tokens=["New", "York"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 1, 1, 1, 1]`, output `tokens=["New", "York"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 0, 1, 1, 1]`, output `tokens=["New", "York"]`
* Otherwise, whether the first non-space character starts a new word depends on that character's label. For instance:
    * `input="New York"`, `labels=[0, 1, 1, 0, 1, 1, 1, 1]`, output `tokens=["NewYork"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 1, 1, 1, 1]`, output `tokens=["NewYork"]`
    * `input="New York"`, `labels=[0, 1, 1, 1, 0, 1, 1, 1]`, output `tokens=["New", "York"]`

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens`	is a `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`.
`start_offsets`	is a `RaggedTensor` of int64s where `start_offsets[i1...iN, j]` is the byte offset for the start of the `j-th` token in `input[i1...iN]`.
`end_offsets`	is a `RaggedTensor` of int64s where `end_offsets[i1...iN, j]` is the byte offset immediately after the end of the `j-th` token in `input[i1...iN]`.
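
Because the offsets are byte offsets into the UTF-8 encoding, they can differ from character indices whenever the input contains multi-byte characters. The plain-Python sketch below models how half-open `[start, end)` byte offsets could be tracked alongside the tokens for a single string. It is an assumption-laden illustration, not the library's implementation: `split_merge_with_offsets` is a hypothetical helper, `str.isspace()` approximates ICU-defined whitespace, and labels are assumed to be given per Unicode character.

```python
def split_merge_with_offsets(text, labels, force_split_at_break_character=True):
    """Illustrative model: tokens plus [start, end) UTF-8 byte offsets for one string."""
    tokens, starts, ends = [], [], []
    byte_pos = 0         # byte offset of the current character in the UTF-8 encoding
    after_break = False  # True once whitespace was seen since the last kept character
    for ch, label in zip(text, labels):
        width = len(ch.encode("utf-8"))  # number of bytes this character occupies
        if ch.isspace():                 # assumption: approximates ICU whitespace
            after_break = True
        else:
            forced = after_break and force_split_at_break_character
            if label == 0 or forced or not tokens:
                tokens.append(ch)             # split: open a new token here
                starts.append(byte_pos)
                ends.append(byte_pos + width)
            else:
                tokens[-1] += ch              # merge: extend the current token
                ends[-1] = byte_pos + width
            after_break = False
        byte_pos += width
    return tokens, starts, ends


print(split_merge_with_offsets("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# (['Hello', 'Monday'], [0, 5], [5, 11])
```

Each end offset equals the byte position immediately after the token's last character, so for ASCII input `end - start` is the token's length in characters.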
