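I have a file that looks like the one below, and I'm planning to extract some elements from it to build a new file using Python 2.7, but I've run into a few problems I can't solve. I'm new to programming and hope someone can help. Thanks in advance!
Sorry for the confusion and the error caused by the first for loop, where line was not defined. That is a separate problem: I have to add a loop that prints each line first to get the rest of the code to work. See the updated code below.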
#File
POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1 (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
a 00014490 0.125 0 rich#12 plentiful#2 plenteous#1 copious#2 ample#2 affording an abundant supply; "had ample food for the party"; "copious provisions"; "food is plentiful"; "a plenteous grape harvest"; "a rich supply"
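Task:
I plan to extract three columns from this file and write them to a new file in this order: SynsetTerms, PosScore, NegScore. For now I'm using print to test instead of g.write().
Problems:
1. I'm trying to index every element of the .split() list that contains '#' and print it, but it doesn't work for more than two words.
2. I also tried to drop the '#' and the number after it by printing only [:-2], but that doesn't work when the number has more than one digit.
3. line has to be defined in the first step for the rest of the code to work. My idea was: if the first word and the next word both contain '#', print both of them.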
#INPUT1
# Fix previous error: Define 'line'
f = open("senti_test.txt","r")
for line in f:
    print line
f.close()
#INPUT2
f = open("senti_test.txt","r")
g = open("senti_test_new.txt", "w")
for num in xrange(4,len(line.split())):
    for line in f:
        if '#' in line.split()[num] and '#' in line.split()[num + 1]:
            print (line.split()[num][:-2] + '\t' + line.split()[2] + '\t' + line.split()[3] + '\n') + ('\n') + (line.split()[num + 1][:-2] + '\t' + line.split()[2] + '\t' + line.split()[3] + '\n')
        else:
            print line.split()[4][:-2] + '\t' + line.split()[2] + '\t' + line.split()[3] + '\n'
f.close()
g.close()
#OUTPUT
SynsetTer PosScore NegScore
able 0.125 0
rich# 0.125 0
plentiful 0.125 0
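Answer 0 (score: 1)
I hope this is closer to what you want. Once you've adjusted it, I'll leave it to you to swap the print for writes to the output file.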
f = open("so.txt","r")
for line in f:
    line_split = line.split()  # Split the line on whitespace
    pos_score = line_split[2]  # 3rd & 4th columns are the scores
    neg_score = line_split[3]
    for entry in line_split[4:]:  # step through the remainder of the line, looking for words to index
        if '#' in entry:
            # When a word is found, split off the #<num> and print the entry.
            print entry.split('#')[0], pos_score, neg_score
f.close()
I changed pos_score on the second line to 0.140 to make the output easier to read. Here is the output:
able 0.125 0
rich 0.140 0
plentiful 0.140 0
plenteous 0.140 0
copious 0.140 0
ample 0.140 0
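If it helps, here is one way the print loop above could be turned into something that produces the new file. Treat it as a sketch under a few assumptions: the function names extract_rows and write_rows are made up for illustration, the columns are separated by whitespace as in the sample, and the gloss is taken to start at the first token without a '#' (so a '#' inside the gloss text is not mistaken for a term). The sample lines are the ones from the question with the gloss shortened. The print call uses a single string argument so the code runs on Python 2.7 as well as Python 3.

```python
def extract_rows(lines):
    """Return (term, pos_score, neg_score) tuples from whitespace-separated lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or too-short lines
        pos_score, neg_score = fields[2], fields[3]
        for entry in fields[4:]:
            if '#' not in entry:
                # Assumption: the first token without '#' starts the gloss,
                # so there are no more terms on this line.
                break
            rows.append((entry.split('#')[0], pos_score, neg_score))
    return rows


def write_rows(rows, out_path):
    """Write one tab-separated row per term: SynsetTerms, PosScore, NegScore."""
    with open(out_path, "w") as g:
        for row in rows:
            g.write("\t".join(row) + "\n")


# Lines from the question, with the gloss shortened for the example.
sample = [
    "POS ID PosScore NegScore SynsetTerms Gloss",
    "a 00001740 0.125 0 able#1 having the necessary means or skill",
    "a 00014490 0.125 0 rich#12 plentiful#2 plenteous#1 copious#2 ample#2 affording an abundant supply",
]

rows = extract_rows(sample)
for row in rows:
    print("\t".join(row))
```

Because extract_rows only needs an iterable of lines, an open file object can be passed to it directly, and write_rows(rows, "senti_test_new.txt") would then create the output file. The header line produces no row here because SynsetTerms contains no '#'.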
Also note that you can keep the index number with a little extra code inside the loop:
if '#' in entry:
    # When a word is found, split off the #<num> and print the entry.
    word, idx = entry.split('#')
    print word, pos_score, neg_score, "\tindex=", idx
Output:
able 0.125 0 index= 1
rich 0.140 0 index= 12
plentiful 0.140 0 index= 2
plenteous 0.140 0 index= 1
copious 0.140 0 index= 2
ample 0.140 0 index= 2
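On problem 2 in the question: a slice like [:-2] assumes the '#<num>' suffix is always exactly two characters long, which is why rich#12 came out as rich# in the original output. Splitting on '#' does not depend on how many digits follow. A small illustration (the variable names are only for this example):

```python
entries = ["able#1", "rich#12"]

# Slicing removes a fixed two characters, whatever they are.
sliced = [entry[:-2] for entry in entries]

# Splitting on '#' separates the word from the number, however long the number is.
words = [entry.split('#')[0] for entry in entries]
numbers = [entry.split('#')[1] for entry in entries]

for entry, cut, word, num in zip(entries, sliced, words, numbers):
    print("%s -> slice: %s | split: %s, %s" % (entry, cut, word, num))
```

The slice happens to work for able#1 only because that suffix is two characters; for rich#12 it leaves the '#' behind.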