I'm working on a project for my NLP class, and I have a .txt file that looks like this:
(u'I', u'PRON')(u'am', u'VERB')(u'nobody', u'NOUN')(u':', u'.')(u'A', u'DET')(u'red', u'ADJ')(u'sinking', u'NOUN')(u'autumn', u'NOUN')(u'sun', u'NOUN')(u'Took', u'NOUN')(u'my', u'PRON')(u'name', u'NOUN')(u'away', u'ADV')(u'.', u'.')(u'Keep', u'VERB')(u'straight', u'VERB')(u'down', u'PRT')(u'this', u'DET')(u'block', u'NOUN')....
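So basically, it's just a bunch of tuples, each holding a word and its tag. I'm trying to iterate over this file and return a list of the words tagged "NOUN".
So the output might look like this: ["nobody", "autumn", ....]
I'm really not sure how to iterate over these tuples, or how to get rid of that u'' prefix. Can anyone help?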
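Answer 0 (score: 1)
Use a list comprehension to unpack all the tuples, apply the str function to each word to convert it to a plain string rather than unicode, and keep only the words whose type matches: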
output=[str(word) for word,wtype in tuplist if wtype.lower()=='noun']
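A small trick is to use the lower function to normalize the string before checking the condition. If you think there may be stray whitespace, you can also call strip() after it: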
wtype.lower().strip()=='noun'
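The comprehension above assumes a list called tuplist already exists. One possible way to build it from the raw text, sketched here under the assumption that the file contains nothing but back-to-back tuple literals and that no word contains the sequence ")(", is to turn the text into a list literal and parse it with ast.literal_eval. The variable raw is a shortened stand-in for the file contents.

```python
import ast

# Shortened stand-in for the text read from the .txt file (assumption)
raw = "(u'I', u'PRON')(u'am', u'VERB')(u'nobody', u'NOUN')(u':', u'.')(u'sun', u'NOUN')"

# Turn "(...)(...)" into "[(...),(...)]" so it parses as one list literal.
# Assumes no word itself contains the sequence ")(".
tuplist = ast.literal_eval("[" + raw.replace(")(", "),(") + "]")

output = [str(word) for word, wtype in tuplist if wtype.lower().strip() == 'noun']
print(output)
```

Parsing the literals this way also drops the u prefix, since each tuple element becomes an ordinary string object.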
Answer 1 (score: 1)
Given that your data is in a text file, here is a solution that uses a regular expression:
import re
s = r"(u'I', u'PRON')(u'am', u'VERB')(u'nobody', u'NOUN')(u':', u'.')(u'A', u'DET')(u'red', u'ADJ')(u'sinking', u'NOUN')(u'autumn', u'NOUN')(u'sun', u'NOUN')(u'Took', u'NOUN')(u'my', u'PRON')(u'name', u'NOUN')(u'away', u'ADV')(u'.', u'.')(u'Keep', u'VERB')(u'straight',u'VERB')(u'down', u'PRT')(u'this', u'DET')(u'block', u'NOUN')"
# Use a regex to split the data into (word, tag) pairs; \s* allows for a missing space after the comma
rx = re.compile(r"\(u'(.*?)',\s*u'(.*?)'\)")
# Find all the matches; findall returns a list of (word, tag) tuples
tuples = rx.findall(s)
# Get the nouns from the list of tuples
nouns = [word for word, tag in tuples if tag == "NOUN"]
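Since the question is about a .txt file rather than a string in the source, the same regex approach can be wrapped in a small function that reads the file first. This is only a sketch: nouns_from_file and PAIR_RX are hypothetical names, the file is assumed to be UTF-8, and every token is assumed to be written with single quotes as in the sample.

```python
import io
import re

# Matches one (u'word', u'TAG') pair; \s* tolerates a missing space after the comma
PAIR_RX = re.compile(r"\(u'(.*?)',\s*u'(.*?)'\)")

def nouns_from_file(path):
    """Return the words tagged NOUN in the file at `path` (hypothetical helper)."""
    # io.open accepts an encoding argument on both Python 2 and Python 3
    with io.open(path, encoding="utf-8") as handle:
        text = handle.read()
    return [word for word, tag in PAIR_RX.findall(text) if tag == "NOUN"]
```

It could be called as nouns_from_file("tagged.txt"), where tagged.txt is a placeholder for the actual file name.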
Answer 2 (score: 0)
You can do this with a list comprehension, for example:
lst = [i[0] for i in tuples if i[1] == "NOUN"]
List comprehension syntax can be a little confusing, so here is the equivalent loop:
lst = []
for i in tuples:
    if i[1] == "NOUN":
        lst.append(i[0])
Answer 3 (score: 0)
I'm assuming you will first read the lines from the text file and replace every u' with '. Then you can iterate over the tuples like this:
x = [('I', 'PRON'), ('am', 'VERB'), ('nobody', 'NOUN')]  # that would be the lines of your text file
array = []
for element in x:
    first, second = element
    if second == "NOUN":
        array.append(first)
print(array)
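One caveat about the replacement step: a plain text replace of u' also hits words that end in the letter u, because their closing quote forms the same two characters. A hedged sketch of a stricter substitution follows. It only strips a u that directly follows an opening parenthesis, a comma, or whitespace. The sample string raw is made up for illustration and is not from the original file.

```python
import re

# Made-up sample containing a word that ends in "u" (assumption, not from the file)
raw = "(u'I', u'PRON')(u'you', u'PRON')(u'sun', u'NOUN')"

# Naive replacement also eats the "u" at the end of 'you'
naive = raw.replace("u'", "'")

# Only strip a "u" that directly follows "(", "," or whitespace, i.e. a literal prefix
cleaned = re.sub(r"(?<=[(,\s])u'", "'", raw)

print(naive)
print(cleaned)
```

Here naive turns 'you' into 'yo', while cleaned keeps the word intact.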