Question

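I want to tokenize a string of concatenated characters based on a given dictionary and output the tokenized words that were found. For example, I have the following: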

dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'

The output should look like this:

[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]

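I want a list of tuples as output. The first element of each tuple is the word found in the dictionary, the second is its character offsets, and the third is the length of the word found. If characters are not found in the dictionary, we combine them into a single word, e.g. padthai in the case above. If multiple dictionary words match, we choose the longest one (yakkin rather than yak and kin).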

My current implementation is below. It starts at index 0 and then loops through the characters (it doesn't work yet).

import numpy as np

def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars/3)):
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)], (start, start + r), len(chars[start:start+r])))
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words

tokenize(chars, dictionary)  # gives only [('yakkin', (0, 6), 6)]

I'm having a hard time wrapping my head around this problem. Please feel free to comment or suggest!

Answer 1

It looks a bit nasty, but it works:

def tokenize(string, dictionary):
    # Sort dictionary words by length, longest first,
    # so that the longest possible match is found,
    # e.g. "yakkin" instead of "yak".
    sorted_dictionary = sorted(dictionary, key=len, reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no words from dictionary were found
            # at the beginning of substring,
            # looking for next appearance of dictionary words
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if word is not found, "str.find" method returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min" function
                # because there are no words from dictionary in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens

This gives the output:

>>> tokenize('yakkinpadthaikhaikoo', dictionary)
[('yakkin', (0, 6), 6), 
 ('padthai', (6, 13), 7), 
 ('khai', (13, 17), 4), 
 ('koo', (17, 20), 3)]
>>> tokenize('lolyakhaiyakkinpadthaikhaikoolol', dictionary)
[('lol', (0, 3), 3), 
 ('yak', (3, 6), 3), 
 ('hai', (6, 9), 3), 
 ('yakkin', (9, 15), 6), 
 ('padthai', (15, 22), 7), 
 ('khai', (22, 26), 4), 
 ('koo', (26, 29), 3), 
 ('lol', (29, 32), 3)]
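
For comparison, the same longest-first idea can probably be expressed with the standard `re` module: build an alternation of the dictionary words sorted by length and treat everything between two matches as one unknown token. This is only a sketch, not a drop-in replacement. The name `tokenize_regex` is made up for illustration, and it assumes empty strings in the dictionary should be ignored.

```python
import re


def tokenize_regex(chars, dictionary):
    # Hypothetical helper name. Empty strings are dropped (an assumption),
    # because an empty alternative would match at every position.
    words = sorted({w for w in dictionary if w}, key=len, reverse=True)
    if not words:
        # Nothing to match against: the whole input is one unknown token.
        return [(chars, (0, len(chars)), len(chars))] if chars else []

    # Longest words come first, so the alternation tries "yakkin" before "yak".
    pattern = re.compile("|".join(re.escape(w) for w in words))

    tokens = []
    pos = 0
    for match in pattern.finditer(chars):
        start, end = match.span()
        if start > pos:
            # Characters between two dictionary words form one unknown token.
            tokens.append((chars[pos:start], (pos, start), start - pos))
        tokens.append((match.group(), (start, end), end - start))
        pos = end
    if pos < len(chars):
        # Trailing characters after the last dictionary word.
        tokens.append((chars[pos:], (pos, len(chars)), len(chars) - pos))
    return tokens
```

On the two inputs shown above it should return the same lists as the loop-based version.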

Answer 2

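You can use find() to locate the starting index of each word, and the word's length is known from len(). Iterate over every word in the dictionary and your list is complete!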

def tokenize(chars, word_list):
    tokens = []
    for word in word_list:
        word_len = len(word)
        index = 0

        # skips words that appear in longer words
        skip = False
        for other_word in word_list:
            if word in other_word and len(other_word) > len(word):
                print("skipped word:", word)
                skip = True
        if skip:
            continue

        while index < len(chars):
            index = chars.find(word, index) # start search from index
            if index == -1: # find() returns -1 if not found
                break
            # Append the tuple and continue the search at the end of the word
            tokens.append((word, (index, word_len+index), word_len))
            index += word_len

    return tokens

We can then run it to get the following output:

>>> tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo'])

skipped word: yak
skipped word: kin
[('yakkin', (0, 6), 6), 
 ('khai', (13, 17), 4), 
 ('koo', (17, 20), 3)]
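
Note that this output contains only dictionary words, so padthai from the question is missing. One way to recover the unknown stretches might be a small post-processing step that sorts the tokens by offset and fills the gaps between them. The helper below is a sketch with a made-up name (`fill_gaps`), and it assumes the tokens passed in do not overlap.

```python
def fill_gaps(chars, tokens):
    # Hypothetical helper: assumes `tokens` are non-overlapping tuples of
    # the form (word, (start, end), length), as returned by tokenize().
    filled = []
    pos = 0
    # Sort by start offset, since tokenize() groups its results by word.
    for word, (start, end), length in sorted(tokens, key=lambda t: t[1][0]):
        if start > pos:
            # Characters not covered by any dictionary word become one token.
            filled.append((chars[pos:start], (pos, start), start - pos))
        filled.append((word, (start, end), length))
        pos = end
    if pos < len(chars):
        # Trailing characters after the last dictionary word.
        filled.append((chars[pos:], (pos, len(chars)), len(chars) - pos))
    return filled
```

Calling `fill_gaps(chars, tokenize(chars, word_list))` on the example should then include ('padthai', (6, 13), 7) between yakkin and khai.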
