Python: Extracting certain elements from a file

Time: 2016-01-18 20:02:50

Tags: python

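I have a file that looks like the one shown below, and I plan to extract some elements from it to build a new file using Python 2.7, but I have run into a few problems I cannot solve. I am new to programming and hope someone can help. Thanks in advance!

Sorry for the inconvenience and the error caused by line not being defined before the first for loop. That is a separate problem: I have to add a loop that prints each line first for the rest of the code to work. See the updated code below.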

#File
POS ID  PosScore    NegScore    SynsetTerms Gloss
a   00001740    0.125   0   able#1  (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
a   00014490    0.125   0   rich#12 plentiful#2 plenteous#1 copious#2 ample#2   affording an abundant supply; "had ample food for the party"; "copious provisions"; "food is plentiful"; "a plenteous grape harvest"; "a rich supply"

Task:

I plan to extract three columns from this file and write them to a new file in this order: SynsetTerms, PosScore, NegScore. For now I am testing with print instead of g.write().

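Problems:

1. I am trying to index every element of the .split() list that contains '#' and print it, but it does not get past two words.

2. I also tried to remove the '#' by printing only [:-2], but that does not work for numbers with more than one digit.

3. line has to be defined in a first step for the rest of the code to work. I was thinking of checking whether the first word and the next word both contain '#', and printing them if so.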
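
For problem 2, a quick check shows why the fixed slice [:-2] breaks: it always drops exactly two characters, so it only works when the number after '#' has a single digit. Splitting on '#' does not depend on the length of the number. A minimal sketch using terms from the sample file (the variable names are illustrative only):

```python
# Terms taken from the sample file above.
terms = ["able#1", "rich#12", "plentiful#2"]

# A fixed slice always removes exactly two characters.
sliced = [term[:-2] for term in terms]

# Splitting on '#' keeps everything before the marker, whatever follows it.
split_off = [term.split('#')[0] for term in terms]

print(sliced)      # ['able', 'rich#', 'plentiful']
print(split_off)   # ['able', 'rich', 'plentiful']
```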

#INPUT1
# Fix previous error: Define 'line'
f = open("senti_test.txt","r")

for line in f:
    print line

f.close()

#INPUT2
f = open("senti_test.txt","r")
g = open("senti_test_new.txt", "w")

for num in xrange(4,len(line.split())):
    for line in f:
        if '#' in line.split()[num] and '#' in line.split()[num + 1]:
            print (line.split()[num][:-2] + '\t' + line.split()[2] + '\t' + line.split()[3] + '\n') + ('\n') + (line.split()[num + 1][:-2] + '\t' + line.split()[2] + '\t' + line.split()[3] + '\n')        
        else:
            print line.split()[4][:-2] + '\t' + line.split()[2] + '\t' + line.split()[3] + '\n'

f.close()
g.close()

#OUTPUT
SynsetTer   PosScore    NegScore

able    0.125   0

rich#   0.125   0

plentiful   0.125   0
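
Two things in INPUT2 seem to explain problem 3 and the truncated output. First, len(line.split()) is evaluated before "for line in f" runs, so line only exists if an earlier loop left it behind. Second, a file object can be iterated only once, so the inner loop sees data only during the first pass of the outer loop (num = 4). A small sketch of the second point, using io.StringIO as a stand-in for the real file (an assumption made so the example is self-contained):

```python
import io

# io.StringIO stands in for the opened file; it is iterated the same way.
f = io.StringIO(u"first line\nsecond line\n")

first_pass = [line for line in f]    # consumes every line
second_pass = [line for line in f]   # nothing is left to read

print(len(first_pass))    # 2
print(len(second_pass))   # 0

f.seek(0)                            # rewinding makes the lines available again
third_pass = [line for line in f]
print(len(third_pass))    # 2
```

Putting "for line in f" on the outside and looping over the split fields on the inside avoids both issues.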

1 answer:

Answer 0 (score: 1)

I hope this is closer to what you want. I'll leave it to you to adapt it and swap in the output file.

f = open("so.txt","r")

for line in f:
    line_split = line.split()      # Split the line on whitespace (spaces or tabs)
    pos_score = line_split[2]      # 3rd & 4th columns are the scores
    neg_score = line_split[3]
    for entry in line_split[4:]:   # step through the remainder of the line, looking for words to index
        if '#' in entry:
            # When a word is found, split off the #<num> and print the entry.
            print entry.split('#')[0], pos_score, neg_score

f.close()

I changed the pos_score on the second line to 0.140 to make the output easier to read. Here is the output:

able 0.125 0
rich 0.140 0
plentiful 0.140 0
plenteous 0.140 0
copious 0.140 0
ample 0.140 0
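
To go from printing to the file the question asks for, the same loop can write tab-separated rows in the order SynsetTerms, PosScore, NegScore. This is a minimal sketch rather than part of the original answer: extract_terms is a hypothetical helper name, and the header row is an assumption about the desired output.

```python
def extract_terms(in_path, out_path):
    """Write one 'word<TAB>PosScore<TAB>NegScore' row per term found in in_path."""
    # 'with' closes both files even if an error occurs part-way through.
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        dst.write("SynsetTerms\tPosScore\tNegScore\n")   # assumed header row
        for line in src:
            fields = line.split()
            if len(fields) < 5:
                continue               # skip blank or incomplete lines
            pos_score, neg_score = fields[2], fields[3]
            for entry in fields[4:]:
                if '#' in entry:
                    word = entry.split('#')[0]
                    dst.write(word + "\t" + pos_score + "\t" + neg_score + "\n")
```

With the file names from the question, the call would be extract_terms("senti_test.txt", "senti_test_new.txt"). As in the loop above, a '#' that happens to appear inside the gloss text would also be picked up.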

Also note that you can keep the index number with a little extra code in the loop:

    if '#' in entry:
        # When a word is found, split off the #<num> and print the entry.
        word, idx = entry.split('#')
        print word, pos_score, neg_score, "\tindex=", idx

Output:

able 0.125 0    index= 1
rich 0.140 0    index= 12
plentiful 0.140 0   index= 2
plenteous 0.140 0   index= 1
copious 0.140 0     index= 2
ample 0.140 0   index= 2
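
If the index is needed as an integer, or if a gloss might itself contain a '#', the terms can be read more strictly. The sketch below assumes every term has the form word#digits and that the terms come directly before the gloss, so it stops at the first token that does not fit. parse_terms is a hypothetical helper, not part of the answer above, and the sample line is the second data row of the file with its gloss shortened.

```python
def parse_terms(fields):
    """Return (word, index) pairs from the tokens that follow the two scores."""
    terms = []
    for entry in fields:
        # rpartition splits on the last '#'; sep is '' when there is no '#'.
        word, sep, idx = entry.rpartition('#')
        if not sep or not word or not idx.isdigit():
            break                      # first token that is not a term: the gloss begins
        terms.append((word, int(idx)))
    return terms


line = "a 00014490 0.125 0 rich#12 plentiful#2 plenteous#1 copious#2 ample#2 affording an abundant supply"
print(parse_terms(line.split()[4:]))
# [('rich', 12), ('plentiful', 2), ('plenteous', 1), ('copious', 2), ('ample', 2)]
```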