I am facing some problems with regular expressions in Python.
I have POS-tagged text in the following format:
('play', 'NN')|('2', 'CD')|('dvd', 'NN')|('2', 'CD')|('for', 'IN')|('instance', 'NN')|('i', 'PRP')|('made', 'VBD')|('several', 'JJ')|('back', 'NN')|('ups', 'NNS')|('of', 'IN')|('my', 'PRP$')|('dvd', 'NN')|('movies', 'NNS')|('using', 'VBG')|('dvd', 'NN')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('it', 'PRP')|('plays', 'VBZ')|('the', 'DT')|('dvds', 'NNS')
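What I want to do is extract all the nouns from this text, and all nouns that occur together (with no other words between them) should end up in the same string. Every tag that starts with NN is a noun. This is the regex I wrote for it: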
re.compile(r"(\|?\([\'|\"][\w]+[\'|\"]\, \'NN\w?\'\)\|?)+")
I have only just started writing regular expressions, so apologies for the messy expression, but this is the output it produces:
["('play', 'NN')|", "|('dvd', 'NN')|", "|('instance', 'NN')|", "('ups', 'NNS')|", "('movies', 'NNS')|", "('w', 'NN')|", "('w', 'NN')|"]
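What I want are phrases like "back ups" and "dvd movies" from the corpus, i.e. nouns that occur together should appear together in the output.
What am I doing wrong? Can anyone please suggest a fix?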
Answer 0 (score: 0)
Can you just not use a regex? Wouldn't simply parsing the text give you what you want?
Thanks to mgilson for the comment.
import ast

nouns = []
for word_and_tag in pos_tagged_words.split("|"):
    # each piece looks like "('play', 'NN')" and parses into a (word, tag) tuple
    word, tag = ast.literal_eval(word_and_tag)
    if tag.startswith("NN"):
        # do something?
        # probably this...
        nouns.append(word)
# use nouns
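As written, the loop above collects each noun separately, so "back" and "ups" would end up as two entries. A minimal sketch of how the same loop could also join adjacent nouns; the function name `noun_phrases` and the shortened sample string are assumptions made for illustration, not part of the original answer:

```python
import ast

# Shortened sample in the question's format (assumed input for this sketch).
pos_tagged_words = (
    "('several', 'JJ')|('back', 'NN')|('ups', 'NNS')|('of', 'IN')"
    "|('my', 'PRP$')|('dvd', 'NN')|('movies', 'NNS')"
)

def noun_phrases(tagged_text):
    """Join runs of adjacent NN* words into single strings."""
    phrases = []
    current = []  # nouns in the run currently being built
    for word_and_tag in tagged_text.split("|"):
        word, tag = ast.literal_eval(word_and_tag)
        if tag.startswith("NN"):
            current.append(word)
        elif current:
            # a non-noun ends the current run of nouns
            phrases.append(" ".join(current))
            current = []
    if current:
        # flush a run that reaches the end of the text
        phrases.append(" ".join(current))
    return phrases

print(noun_phrases(pos_tagged_words))  # ['back ups', 'dvd movies']
```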
Answer 1 (score: 0)
You can do some really cool things with itertools here. Assuming you can reliably split the words on |:
import ast
import itertools

def word_yielder(word_str):
    tuples = (ast.literal_eval(t) for t in word_str.split('|'))
    for key, group in itertools.groupby(tuples, key=lambda t: t[1].startswith('NN')):
        if key:  # Have a group of nouns, join them together.
            yield (' '.join(t[0] for t in group), 'NN')
        else:  # Have a group of non-nouns
            for t in group:  # python3.x -- yield from :-)
                yield t
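The generator above yields the non-noun tuples as well. If only the joined noun phrases are wanted, a variant along these lines might do; `noun_groups` and the shortened sample are hypothetical names and data chosen for this sketch:

```python
import ast
import itertools

def noun_groups(word_str):
    """Yield one string per run of adjacent NN* words, skipping everything else."""
    tuples = (ast.literal_eval(t) for t in word_str.split('|'))
    for is_noun, group in itertools.groupby(tuples, key=lambda t: t[1].startswith('NN')):
        if is_noun:
            yield ' '.join(word for word, _tag in group)

# Shortened sample in the question's format (assumed input for this sketch).
sample = (
    "('dvd', 'NN')|('r', 'NN')|('w', 'NN')|('and', 'CC')"
    "|('it', 'PRP')|('plays', 'VBZ')|('the', 'DT')|('dvds', 'NNS')"
)
print(list(noun_groups(sample)))  # ['dvd r w', 'dvds']
```

groupby only merges consecutive items with the same key, which is exactly the "no other words in between" requirement from the question.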
Answer 2 (score: 0)
Here is a pyparsing solution:
from pyparsing import *
LPAR,RPAR,COMMA,VERT,QUOT = map(Suppress,"(),|'")
nountype = Combine(QUOT + "NN" + Optional(Word(alphas)) + QUOT)
nounspec = LPAR + quotedString.setParseAction(removeQuotes) + COMMA + nountype + RPAR
# match all nounspec's that have one or more separated by '|'s
noungroup = delimitedList(nounspec, delim=VERT)
# join the nouns, and return a new tuple when a nounspec list is found
noungroup.setParseAction(lambda tokens: (' '.join(tokens[0::2]), tokens[1]) )
# parse sample text
sample = """('play', 'NN')|('2', 'CD')|('dvd', 'NN')|('2', 'CD')|('for', 'IN')|('instance', 'NN')|('i', 'PRP')|('made', 'VBD')|('several', 'JJ')|('back', 'NN')|('ups', 'NNS')|('of', 'IN')|('my', 'PRP$')|('dvd', 'NN')|('movies', 'NNS')|('using', 'VBG')|('dvd', 'NN')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('r', 'NN')|('w', 'NN')|('and', 'CC')|('it', 'PRP')|('plays', 'VBZ')|('the', 'DT')|('dvds', 'NNS')"""
print(sum(noungroup.searchString(sample)).asList())
Prints:
[('play', 'NN'), ('dvd', 'NN'), ('instance', 'NN'), ('back ups', 'NN'), ('dvd movies', 'NN'), ('dvd r w', 'NN'), ('r w', 'NN'), ('dvds', 'NNS')]
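Coming back to the regex in the question: assuming that output came from re.findall, the likely culprit is the capturing group. When a pattern contains a group, findall returns the group's content rather than the whole match, and a repeated group keeps only its last repetition, which is why only "ups" and "movies" show up. A rough sketch of a regex-only alternative that uses a non-capturing group for the run; the names `ENTRY`, `RUN`, `WORD` and `noun_runs` are made up for illustration, and the sample is shortened:

```python
import re

# One tagged noun entry, e.g. ('ups', 'NNS')
ENTRY = r"\('\w+', 'NN\w*'\)"
# A run of noun entries separated by '|'; (?:...) groups without capturing
RUN = re.compile(r"%s(?:\|%s)*" % (ENTRY, ENTRY))
# Captures just the word inside a single noun entry
WORD = re.compile(r"\('(\w+)', 'NN\w*'\)")

def noun_runs(text):
    """Return one space-joined string per run of adjacent noun entries."""
    return [" ".join(WORD.findall(m.group(0))) for m in RUN.finditer(text)]

# Shortened sample in the question's format (assumed input for this sketch).
sample = (
    "('play', 'NN')|('2', 'CD')|('back', 'NN')|('ups', 'NNS')"
    "|('of', 'IN')|('dvds', 'NNS')"
)
print(noun_runs(sample))  # ['play', 'back ups', 'dvds']
```

This assumes every word is a plain \w+ token in single quotes, as in the sample; words containing quotes or punctuation would need a looser pattern.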