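I have a small tagging script in Python (2.x). I am trying to tag each line of a corpus using a dictionary. The script generally works well, but I am looking for a slightly different result.
The code is as follows: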
def tag_corpus():
    corpus1=open("Corpus1.txt","r")
    dict1=open("Dictnew1.txt","r")
    dictw=dict1.read().lower().split()
    list1=[]
    for line in corpus1:
        linew=line.lower().split()
        for word in linew:
            if word in dictw:
                word_i=dictw.index(word)
                word_i1=word_i+1
                tag=dictw[word_i1]
                str1=word+"/"+tag
                list1.append(str1)
            else:
                str2=word+"/"+"NA"
                list1.append(str2)
    str3=" ".join(list1)
    print str3
The contents of "Corpus1.txt" are:
London is situtated over Thames .
London is a village near Burgundy .
London is situated near Ontario .
and "Dictnew1.txt" is:
London LOC Thames LOC Burgundy LOC Ontario LOC
The result I am getting is:
london/loc is/NA situtated/NA over/NA thames/loc ./NA london/loc is/NA a/NA village/NA near/NA burgundy/loc ./NA london/loc is/NA situated/NA near/NA ontario/loc ./NA
But I am looking for output that prints each string followed by its tagged version, like:
london is situtated over thames .
london/loc is/NA situtated/NA over/NA thames/loc .
Could anyone suggest how to do this?
Answer 0 (score: 0)
Does this produce the output you expect?
def tag_corpus():
    corpus1=open("Corpus1.txt","r")
    dict1=open("Dictnew1.txt","r")
    dictw=dict1.read().lower().split()
    for line in corpus1:
        # Reset the list for every line so tags do not accumulate across lines
        list1=[]
        linew=line.lower().split()
        for word in linew:
            if word in dictw:
                word_i=dictw.index(word)
                word_i1=word_i+1
                tag=dictw[word_i1]
                str1=word+"/"+tag
                list1.append(str1)
            else:
                str2=word+"/"+"NA"
                list1.append(str2)
        str3=" ".join(list1)
        # Print the line as read from the file, then its tagged version
        print line
        print str3
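
As a possible variation, here is a minimal sketch that keeps the per-line approach but looks tags up in a real dict instead of calling dictw.index(word). With the flat list, a corpus word that happens to equal a tag (for example "loc") would be found at a tag position and paired with whatever token follows it. The sketch assumes the dictionary file strictly alternates word and tag, as the sample does. The names build_tag_map and tag_lines are hypothetical, and the functions take text and lines as arguments rather than opening the files, so they can be tried without Corpus1.txt or Dictnew1.txt. It also lowercases the echoed line to match the output requested in the question.

```python
from __future__ import print_function  # lets the same code run on Python 2.x and 3


def build_tag_map(dict_text):
    """Turn 'word TAG word TAG ...' text into a {word: tag} mapping (hypothetical helper)."""
    tokens = dict_text.lower().split()
    # Assumption: even positions are words, odd positions are their tags.
    return dict(zip(tokens[0::2], tokens[1::2]))


def tag_lines(corpus_lines, tag_map, default="NA"):
    """Return a list of (lowercased line, tagged line) pairs, one per non-blank line."""
    results = []
    for line in corpus_lines:
        words = line.lower().split()
        if not words:
            continue  # skip blank lines
        tagged = ["%s/%s" % (w, tag_map.get(w, default)) for w in words]
        results.append((" ".join(words), " ".join(tagged)))
    return results


if __name__ == "__main__":
    tag_map = build_tag_map("London LOC Thames LOC Burgundy LOC Ontario LOC")
    corpus = ["London is situtated over Thames .\n"]
    for plain, tagged in tag_lines(corpus, tag_map):
        print(plain)
        print(tagged)
```

To use it with the files from the question, something like tag_lines(open("Corpus1.txt"), build_tag_map(open("Dictnew1.txt").read())) should work, since a file object iterates line by line.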