I want to tokenize a string of concatenated characters against a given dictionary and output the tokenized words that are found. For example, I have the following:
dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'
The output should look like this:
[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]
I want a list of tuples as the output. The first element of each tuple is the word found in the dictionary, the second is the pair of character offsets, and the third is the length of the word found. Characters that are not found in the dictionary are merged into a single token, such as padthai above. If more than one dictionary word matches, we pick the longest one (yakkin rather than yak plus kin).
My current implementation is below. It starts at index 0 and loops over the characters (it doesn't work yet):
import numpy as np

def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars / 3)):
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)],
                                   (start, start + r),
                                   len(chars[start:start + r])))
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words

tokenize(chars, dictionary)  # gives only [('yakkin', (0, 6), 6)]
I'm having a hard time wrapping my head around this. Feel free to comment/suggest!
Answer 0 (score: 3)
It looks a bit hacky, but it works:
def tokenize(string, dictionary):
    # sort dictionary words by length, longest first,
    # because we want the longest possible match
    # (e.g. "yakkin" instead of "yak")
    sorted_dictionary = sorted(dictionary,
                               key=lambda word: len(word),
                               reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no dictionary word was found at the beginning
            # of the substring, so look for the next
            # appearance of any dictionary word
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if a word is not found, "str.find" returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min"
                # because no dictionary word appears in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens
which gives the output:
>>> tokenize('yakkinpadthaikhaikoo', dictionary)
[('yakkin', (0, 6), 6),
('padthai', (6, 13), 7),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
>>> tokenize('lolyakhaiyakkinpadthaikhaikoolol', dictionary)
[('lol', (0, 3), 3),
('yak', (3, 6), 3),
('hai', (6, 9), 3),
('yakkin', (9, 15), 6),
('padthai', (15, 22), 7),
('khai', (22, 26), 4),
('koo', (26, 29), 3),
('lol', (29, 32), 3)]
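The length-first sort is what makes the longest match win. For comparison, here is a compact sketch of the same leftmost, longest-match behavior built on a single regex alternation; this is my own illustration, not part of the answer above, and it assumes the dictionary contains no empty strings:

import re

def tokenize_regex(string, dictionary):
    # longest alternatives first, so the regex engine
    # prefers "yakkin" over "yak" at the same position
    pattern = '|'.join(sorted(map(re.escape, dictionary),
                              key=len, reverse=True))
    tokens = []
    start = 0
    for match in re.finditer(pattern, string):
        if match.start() > start:
            # characters before this match form an "unknown" token
            tokens.append((string[start:match.start()],
                           (start, match.start()),
                           match.start() - start))
        tokens.append((match.group(), match.span(), len(match.group())))
        start = match.end()
    if start < len(string):
        # trailing characters not covered by any dictionary word
        tokens.append((string[start:],
                       (start, len(string)),
                       len(string) - start))
    return tokens

Calling tokenize_regex('yakkinpadthaikhaikoo', dictionary) should produce the same four tuples as above, since re tries alternatives left to right and the longest word is listed first.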
Answer 1 (score: 1)
You can use find() to locate the starting index of each word, and the word's length is known from len(). Iterate over every word in the dictionary and your list is done!
def tokenize(chars, word_list):
    tokens = []
    for word in word_list:
        word_len = len(word)
        index = 0
        # skip words that also appear inside longer words
        skip = False
        for other_word in word_list:
            if word in other_word and len(other_word) > len(word):
                print("skipped word:", word)
                skip = True
        if skip:
            continue
        while index < len(chars):
            index = chars.find(word, index)  # start the search from index
            if index == -1:  # find() returns -1 if not found
                break
            # append the tuple and continue the search at the end of the word
            tokens.append((word, (index, word_len + index), word_len))
            index += word_len
    return tokens
Then we can run it to get the following output:
>>> tokenize('yakkinpadthaikhaikoo', ['yak', 'kin', 'yakkin', 'khai', 'koo'])
skipped word: yak
skipped word: kin
[('yakkin', (0, 6), 6),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
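One thing to note (my observation, not part of the answer): because the outer loop iterates over dictionary words rather than string positions, tokens are appended in dictionary order, and in this example the offsets only happen to come out ascending. If you need the tokens ordered by their position in the string, you can sort them afterwards:

tokens = tokenize('yakkinpadthaikhaikoo',
                  ['yak', 'kin', 'yakkin', 'khai', 'koo'])
tokens.sort(key=lambda token: token[1][0])  # order by start offset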