tf.strings.unicode_split_with_offsets

Splits each string into a sequence of character substrings and returns the start byte offset of each character.

tf.strings.unicode_split_with_offsets(
    input, input_encoding, errors='replace', replacement_char=65533, name=None
)

This op is similar to tf.strings.unicode_split(...), but it also returns the start offset of each character within its string. The offsets can be used to align the characters with the original byte sequence.
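The relationship between characters and byte offsets can be sketched in plain Python. This is only an illustration of what the op computes for a single, valid UTF-8 string; the helper name split_with_offsets is hypothetical, it is not TensorFlow's implementation, and it does not model the errors or replacement_char arguments.

```python
def split_with_offsets(data: bytes):
    """Hypothetical helper: split valid UTF-8 bytes into per-character
    substrings and record the byte offset where each character starts."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    chars, start_offsets = [], []
    position = 0
    for character in text:
        encoded = character.encode("utf-8")
        chars.append(encoded)
        start_offsets.append(position)
        position += len(encoded)  # multi-byte characters advance by more than 1
    return chars, start_offsets


chars, start_offsets = split_with_offsets("G\xf6\xf6dnight".encode("utf-8"))
print(chars)
print(start_offsets)
```

Each "ö" occupies two bytes in UTF-8, so the offsets skip from 1 to 3 to 5 rather than increasing by one per character.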

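Returns a tuple (chars, start_offsets) where:

chars[i1...iN, j] is the substring of input[i1...iN] that encodes its j-th character, when decoded using input_encoding.
start_offsets[i1...iN, j] is the start byte offset of the j-th character in input[i1...iN], when decoded using input_encoding.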

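Args:

input: An N-dimensional, potentially ragged, string tensor with shape [D1...DN]. N must be statically known.
input_encoding: String name of the Unicode encoding used to decode each string.
errors: Specifies the response when an input string cannot be converted using the indicated encoding. One of 'strict', 'replace', or 'ignore'.
replacement_char: The replacement code point used in place of invalid substrings in input when errors='replace'. Defaults to 65533 (U+FFFD).
name: A name for the operation (optional).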

Returns:

A tuple of N+1 dimensional tensors (chars, start_offsets).

The returned tensors are tf.Tensors if input is a scalar, or tf.RaggedTensors otherwise.

Example:

>>> input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
>>> result = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
>>> result[0].to_list()  # character substrings
[[b'G', b'\xc3\xb6', b'\xc3\xb6', b'd', b'n', b'i', b'g', b'h', b't'],
 [b'\xf0\x9f\x98\x8a']]
>>> result[1].to_list()  # offsets
[[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
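As a rough sketch of how the offsets can be used, the start offsets from the first string in the example above are enough to recover each character's byte span in the original input. The variable names here are illustrative, and the offsets are copied from the example output rather than computed by TensorFlow.

```python
# Original bytes and start offsets for the first string in the example.
original = "G\xf6\xf6dnight".encode("utf-8")
start_offsets = [0, 1, 3, 5, 6, 7, 8, 9, 10]

# Each character ends where the next one starts; the last ends at len(original).
end_offsets = start_offsets[1:] + [len(original)]

# Slice the original byte string to recover the per-character substrings.
spans = [original[start:end] for start, end in zip(start_offsets, end_offsets)]
byte_lengths = [end - start for start, end in zip(start_offsets, end_offsets)]
print(spans)
print(byte_lengths)
```

The recovered spans match the character substrings returned in result[0] for that string.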