This is a question I got in an onsite interview at a tech company, and I think it's what ultimately killed my chances.

You are given a sentence, along with a dictionary whose keys are words and whose values are parts of speech.

The goal is to write a function that, given a sentence, replaces each word in turn with its part of speech from the dictionary. We can assume that everything in the sentence appears as a key in the dictionary.

For example, suppose we are given the following input:
```python
sentence = 'I am done; Look at that, cat!'
dictionary = {'!': 'sentinel', ',': 'sentinel',
              'I': 'pronoun', 'am': 'verb',
              'Look': 'verb', 'that': 'pronoun',
              'at': 'preposition', ';': 'preposition',
              'done': 'verb', 'cat': 'noun'}

output = 'pronoun verb verb sentinel verb preposition pronoun sentinel noun sentinel'
```
The tricky part is catching the sentinels. If there were no sentinels among the parts of speech, this would be easy. Is there a simple way to do it? Is there a library for it?
Answer 0 (score: 5)
Python's regular expression module `re` can be used to split the sentence into tokens.
```python
import re

sentence = 'I am done; Look at that, cat!'
dictionary = {'!': 'sentinel', ',': 'sentinel',
              'I': 'pronoun', 'am': 'verb',
              'Look': 'verb', 'that': 'pronoun',
              'at': 'preposition', ';': 'preposition',
              'done': 'verb', 'cat': 'noun'}

tags = []
for word in re.findall(r"[A-Za-z]+|\S", sentence):
    tags.append(dictionary[word])
print(' '.join(tags))
```
Output:

```
pronoun verb verb preposition verb preposition pronoun sentinel noun sentinel
```
The regular expression `[A-Za-z]+|\S` selects either a run of one or more letters (upper- or lowercase), matched by `[A-Za-z]+`, or (the `|` denotes alternation) any single non-whitespace character, matched by `\S`.
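To make the pattern concrete, here is a quick check (my addition, not part of the original answer) of the token list it produces on the example sentence:

```python
import re

sentence = 'I am done; Look at that, cat!'
# letters group into words; each punctuation mark becomes its own token
tokens = re.findall(r"[A-Za-z]+|\S", sentence)
print(tokens)
# ['I', 'am', 'done', ';', 'Look', 'at', 'that', ',', 'cat', '!']
```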
Answer 1 (score: 2)
Here is a less impressive but more explanatory solution.

Let's first define the example dictionary and sentence from the question:
```python
sentence = 'I am done; Look at that, cat!'
dictionary = {
    '!': 'sentinel',
    ',': 'sentinel',
    'I': 'pronoun',
    'that': 'pronoun',
    'cat': 'noun',
    'am': 'verb',
    'Look': 'verb',
    'done': 'verb',
    'at': 'preposition',
    ';': 'preposition',
}
```
For my solution, I define a recursive parsing function, aptly named `parse`.

`parse` first splits the sentence into words on whitespace, then tries to classify each word by looking it up in the supplied dictionary.

If a word is not found in the dictionary (because it has punctuation attached, etc.), `parse` splits the word into its constituent tokens and recursively parses it from there.
```python
def parse(sentence, dictionary):
    # split the words apart by whitespace;
    # some tokens may still be stuck together (e.g. "that,")
    words = sentence.split()
    # a list of strings containing the 'category' of each word
    output = []
    for word in words:
        if word in dictionary:
            # base case: the word is in the dictionary
            output.append(dictionary[word])
        else:
            # recursive case: the word still has tokens attached
            # get all the tokens contained in the word
            tokens = [key for key in dictionary.keys() if key in word]
            # sort the tokens longest first, so big words are more likely
            # to be preserved ("scat" -> "s", "cat" rather than "sc", "at")
            tokens.sort(key=len, reverse=True)
            # this is where we'll store the sub-result
            sub_output = None
            # iterate through the tokens to find a valid way to split the word
            for token in tokens:
                try:
                    # pad the token with spaces so split() can pull it apart
                    sub_output = parse(
                        word.replace(token, f" {token} "),
                        dictionary
                    )
                    # the word parsed; no need to try other combinations
                    break
                except AssertionError:
                    pass  # the word couldn't be split this way
            # if no split worked, the word is invalid
            # and the sentence can't be parsed
            assert sub_output is not None
            output.append(sub_output)
    # put it all together into a neat little string
    return ' '.join(output)
```
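To see why the padding trick works, here is a minimal trace (my illustration, not part of the original answer) of how a stuck token such as `'that,'` gets separated:

```python
word = 'that,'
token = ','
# surrounding the token with spaces lets split() pull it apart,
# which is exactly what the recursive call to parse() relies on
padded = word.replace(token, f" {token} ")
print(repr(padded))    # 'that , '
print(padded.split())  # ['that', ',']
```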
Here is how you would use it:
```python
# usage of parse
output = parse(sentence, dictionary)
# display the example output
print(output)
```
I hope my answer gives you some more insight into another way of approaching this problem.

Ta-da!
Answer 2 (score: 2)
If you are looking for a non-regex-based approach, you could try something like this:
```python
def tag_pos(sentence):
    output = []
    for word in sentence.split():
        if word not in dictionary:
            # separate the punctuation characters from the letters
            literal = ''.join([char for char in word if not char.isalpha()])
            word = ''.join([char for char in word if char.isalpha()])
            output.append(dictionary[word])
            if not len(literal) > 1:
                output.append(dictionary[literal])
            else:
                # several punctuation marks were attached; tag each one
                for char in literal:
                    output.append(dictionary[char])
        else:
            output.append(dictionary[word])
    return " ".join(output)

output = tag_pos(sentence)
print(output)
```
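As a further non-regex variant (my own sketch, not part of the original answer): since every non-word key in the example dictionary is a single character, you can pad them all with spaces in one pass using `str.translate`, after which a plain `split()` suffices:

```python
sentence = 'I am done; Look at that, cat!'
dictionary = {'!': 'sentinel', ',': 'sentinel',
              'I': 'pronoun', 'am': 'verb',
              'Look': 'verb', 'that': 'pronoun',
              'at': 'preposition', ';': 'preposition',
              'done': 'verb', 'cat': 'noun'}

# map every single-character non-letter key to itself surrounded by spaces
pad = str.maketrans({k: f' {k} ' for k in dictionary if not k.isalpha()})
tags = [dictionary[tok] for tok in sentence.translate(pad).split()]
print(' '.join(tags))
```

This only works while the punctuation keys are single characters, which `str.maketrans` requires; multi-character tokens would need one of the approaches above.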