description: Tokenizes a tensor of UTF-8 strings into words according to labels.

text.SplitMergeTokenizer

Tokenizes a tensor of UTF-8 strings into words according to labels.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.SplitMergeTokenizer()

Methods

split

split(
    input
)

Alias for Tokenizer.tokenize.

split_with_offsets

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

tokenize(
    input, labels, force_split_at_break_character=True
)

Tokenizes a tensor of UTF-8 strings according to labels.

Example:

>>> import tensorflow_text as text
>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = text.SplitMergeTokenizer()
>>> tokenizer.tokenize(strings, labels)
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
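
The label rows in the example follow directly from the intended segmentation: the first character of each word gets 0 (split) and every other character gets 1 (merge). The sketch below builds such a row from a list of words. The helper name `labels_from_words` is hypothetical and not part of the library, and it assumes the words are concatenated without separators.

```python
def labels_from_words(words):
    """Build one row of split(0)/merge(1) labels for the concatenation of `words`.

    Hypothetical helper for illustration only; not a TensorFlow Text API.
    """
    labels = []
    for word in words:
        if not word:
            continue  # an empty word contributes no characters and no labels
        labels.append(0)                      # first character starts a new word
        labels.extend([1] * (len(word) - 1))  # remaining characters merge into it
    return labels


print(labels_from_words(["Hello", "Monday"]))  # [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(labels_from_words(["Dear", "Friday"]))   # [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
```

The second row in the example carries one more label than "DearFriday" has characters. Rows of a dense label list must share a length, so this sketch leaves any padding to the caller.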

Args
`input` An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.
`labels` An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, with `labels[i1...iN, j]` being the split(0)/merge(1) label of the j-th character of `input[i1...iN]`. Here "split" means starting a new word at this character, and "merge" means appending this character to the previous word.
`force_split_at_break_character` A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:
  • if `force_split_at_break_character` is true, a new word starts at the first non-space character regardless of that character's label, for instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["New", "York"]

    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["New", "York"]

    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]

  • otherwise, whether a new word starts at the first non-space character depends on that character's label, for instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["NewYork"]

    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["NewYork"]

    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]

Returns
A `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`.
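
The split/merge rules and the effect of `force_split_at_break_character` can be mimicked in plain Python for a single string. This is a minimal sketch for building intuition, not the library's implementation: the helper name `split_merge` is hypothetical, it works on one Python string rather than a tensor, and it uses `str.isspace()` as a stand-in for ICU-defined whitespace.

```python
def split_merge(text, labels, force_split_at_break_character=True):
    """Illustrative single-string model of split(0)/merge(1) tokenization.

    Hypothetical helper; assumes str.isspace() approximates ICU whitespace.
    """
    tokens = []
    after_break = False  # True once whitespace was seen since the last token character
    for ch, label in zip(text, labels):
        if ch.isspace():
            # Whitespace never becomes part of a token; just remember we saw it.
            after_break = True
            continue
        start_new = (
            not tokens                                            # very first character
            or label == 0                                         # explicit split label
            or (after_break and force_split_at_break_character)   # forced split
        )
        if start_new:
            tokens.append(ch)
        else:
            tokens[-1] += ch
        after_break = False
    return tokens


labels = [0, 1, 1, 0, 1, 1, 1, 1]
print(split_merge("New York", labels))                                        # ['New', 'York']
print(split_merge("New York", labels, force_split_at_break_character=False))  # ['NewYork']
```

With the flag off, the label on the first non-space character alone decides the outcome, which is why only the third label sequence in the list above still yields two tokens.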

tokenize_with_offsets

tokenize_with_offsets(
    input, labels, force_split_at_break_character=True
)

Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.

Example:

>>> import tensorflow_text as text
>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = text.SplitMergeTokenizer()
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
>>> tokens
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
>>> starts
<tf.RaggedTensor [[0, 5], [0, 4]]>
>>> ends
<tf.RaggedTensor [[5, 11], [4, 10]]>

Args
`input` An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.
`labels` An (N+1)-dimensional `Tensor` or `RaggedTensor` of `int32`, with `labels[i1...iN, j]` being the split(0)/merge(1) label of the j-th character of `input[i1...iN]`. Here "split" means starting a new word at this character, and "merge" means appending this character to the previous word.
`force_split_at_break_character` A bool indicating whether to force the start of a new word after an ICU-defined whitespace character. After one or more ICU-defined whitespace characters:
  • if `force_split_at_break_character` is true, a new word starts at the first non-space character regardless of that character's label, for instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["New", "York"]

    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["New", "York"]

    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]

  • otherwise, whether a new word starts at the first non-space character depends on that character's label, for instance:

    input="New York"
    labels=[0, 1, 1, 0, 1, 1, 1, 1]
    output tokens=["NewYork"]

    input="New York"
    labels=[0, 1, 1, 1, 1, 1, 1, 1]
    output tokens=["NewYork"]

    input="New York"
    labels=[0, 1, 1, 1, 0, 1, 1, 1]
    output tokens=["New", "York"]

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens` is a `RaggedTensor` of strings where `tokens[i1...iN, j]` is the string content of the `j-th` token in `input[i1...iN]`.
`start_offsets` is a `RaggedTensor` of int64s where `start_offsets[i1...iN, j]` is the byte offset for the start of the `j-th` token in `input[i1...iN]`.
`end_offsets` is a `RaggedTensor` of int64s where `end_offsets[i1...iN, j]` is the byte offset immediately after the end of the `j-th` token in `input[i1...iN]`.
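
Because the offsets are byte offsets while labels are per character, the two diverge as soon as the input contains multi-byte UTF-8 characters. The sketch below models this for a single string. It is an illustration under stated assumptions, not the library's implementation: `split_merge_with_offsets` is a hypothetical name, `str.isspace()` stands in for ICU-defined whitespace, and each offset pair is reported as a half-open `[start, end)` byte range.

```python
def split_merge_with_offsets(text, labels, force_split_at_break_character=True):
    """Illustrative single-string model returning tokens plus UTF-8 byte offsets.

    Hypothetical helper; assumes str.isspace() approximates ICU whitespace.
    """
    tokens, starts, ends = [], [], []
    after_break = False
    byte_pos = 0  # byte offset of the current character in the UTF-8 encoding
    for ch, label in zip(text, labels):
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        if ch.isspace():
            after_break = True
        else:
            start_new = (
                not tokens
                or label == 0
                or (after_break and force_split_at_break_character)
            )
            if start_new:
                tokens.append(ch)
                starts.append(byte_pos)
                ends.append(byte_pos + width)
            else:
                tokens[-1] += ch
                ends[-1] = byte_pos + width  # extend the current token's end
            after_break = False
        byte_pos += width
    return tokens, starts, ends


# ASCII input: byte offsets coincide with character positions.
print(split_merge_with_offsets("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# (['Hello', 'Monday'], [0, 5], [5, 11])

# Two-byte characters ("\u00fc", "\u00e8") shift the later byte offsets.
text_in = "Z\u00fcrich Gen\u00e8ve"
print(split_merge_with_offsets(text_in, [0] + [1] * 12))
# (['Zürich', 'Genève'], [0, 8], [7, 15])
```

In the second call the input has 13 characters but 15 bytes, so slicing the encoded bytes with the returned offsets recovers each token, whereas slicing the Python string would not.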