Python: delete lines that contain a word from a list

Time: 2013-11-13 10:11:29

Tags: python

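I am working on a script in Python and I can't seem to get it right. It takes two inputs:

  1. a data file
  2. a stop file

The data file consists of 4 tab-separated columns and is sorted. The stop file consists of a list of words and is also sorted.

The goal of the script is:

• If the string in column 1 of the data file matches a string in the stop file, delete the entire line.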

Here is a sample of the data file:

    abandonment-n   after+n-the+n-a-j   stop-n  1
    abandonment-n   against+n-the+ns    leave-n 1
    cake-n  against+n-the+vg    rest-v  1
    abandonment-n   as+n-a+vd   require-v   1
    abandonment-n   as+n-a-j+vg-up  use-v   1
    

Here is a sample of the stop file:

    apple-n
    banana-n
    cake-n
    pigeon-n
    

Here is the code I have so far:

    with open("input1", "rb") as oIndexFile:
            for line in oIndexFile: 
                lemma = line.split()
                #print lemma
    
    with open ("input2", "rb") as oSenseFile:
        with open("output", "wb") as oOutFile:
            for line in oSenseFile:
                concept, slot, filler, freq = line.split()
                nounsInterest = [concept, slot, filler, freq]
                #print concept
                if concept != lemma:
                    outstring = '\t'.join(nounsInterest)
                    oOutFile.write(outstring + '\n')
                else: 
                    pass
    

The desired output is as follows:

    abandonment-n   after+n-the+n-a-j-stop-n    1
    abandonment-n   against+n-the+ns-leave-n    1
    abandonment-n   as+n-a+vd-require-v 1
    abandonment-n   as+n-a-j+vg-up-use-v    1
    

Any insights?

As of now, the output I get is the following, which is basically just the input printed back unchanged:

    abandonment-n   after+n-the+n-a-j   stop-n  1
    abandonment-n   against+n-the+ns    leave-n 1
    cake-n  against+n-the+vg    rest-v  1
    abandonment-n   as+n-a+vd   require-v   1
    abandonment-n   as+n-a-j+vg-up  use-v   1
    

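Some things I have tried that still don't work:

Instead of "if concept != lemma:" I first tried "if concept not in lemma:"

which produces the same output as shown above.

I also suspected that the code was not actually using the first input file, but even after merging both loops, as follows: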

    with open ("input2", "rb") as oSenseFile:
        with open("tinput1", "rb") as oIndexFile:
            for line in oIndexFile: 
                lemma = line.split()
                with open("out", "wb") as oOutFile:
                    for line in oSenseFile:
                        concept, slot, filler, freq = line.split()
                        nounsInterest = [concept, slot, filler, freq]
                        if concept not in lemma:
                            outstring = '\t'.join(nounsInterest)
                            oOutFile.write(outstring + '\n')
                        else: 
                            pass
    

it produces a blank output file.

I have also tried a different approach:

    filename = "input1.txt" 
    filename2 = "input2.txt"
    filename3 = "output1"
    
    def fixup(filename): 
        fin1 = open(filename) 
        fin2 = open(filename2, "r")
        fout = open(filename3, "w") 
        for word in filename: 
            words = word.split()
        for line in filename2:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            if True in [concept in line for word in toRemove]:
                pass
            else:
                outstring = '\t'.join(nounsInterest)
                fout.write(outstring + '\n')
        fin1.close() 
        fin2.close() 
        fout.close()
    

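adapted from here, but without success. In this case, no output is produced at all.

Can someone point me in the right direction to solve this? Although the sample files are small, I will have to run this on large files. Thanks for your help.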

3 Answers:

Answer 0: (score: 4)

I think you are trying to do something like this:

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        nouns_interest = concept, slot, filler, freq = line.split()
        if concept not in lemma:
            outfile.write('\t'.join(nouns_interest) + '\n')

Your desired output seems to insert a hyphen between slot and filler, so you may want to use

            outfile.write('{}\t{}-{}\t{}\n'.format(*nouns_interest))
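
The filter above can be tried without touching the disk. The following is a minimal sketch, assuming Python 3 and text (str) lines rather than the binary mode used above; the function name filter_stop_lines and the in-memory sample lists are illustrative and not part of the answer. It uses the hyphenated output format suggested above.

```python
def filter_stop_lines(data_lines, stop_lines):
    """Return the data lines whose first column is not a stop word.

    Hypothetical helper: joins slot and filler with a hyphen, as in the
    desired output shown in the question.
    """
    # A set gives constant-time membership tests on average.
    stop_words = {s.strip() for s in stop_lines if s.strip()}
    kept = []
    for line in data_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines (an assumption)
        concept, slot, filler, freq = fields
        if concept not in stop_words:
            kept.append('{}\t{}-{}\t{}\n'.format(concept, slot, filler, freq))
    return kept


data = [
    "abandonment-n\tafter+n-the+n-a-j\tstop-n\t1\n",
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
    "abandonment-n\tas+n-a+vd\trequire-v\t1\n",
]
stops = ["apple-n\n", "banana-n\n", "cake-n\n", "pigeon-n\n"]

result = filter_stop_lines(data, stops)
for out in result:
    print(out, end="")
```

Both arguments are plain iterables of lines, so open file objects can be passed in directly. For very large inputs, the loop body could write each kept line to the output file instead of appending it to a list.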

Answer 1: (score: 1)

I haven't checked your logic, but you are overwriting lemma for every line there. Maybe append each word to a list instead?

lemma = []
for line in oIndexFile:
    lemma.append(line.strip())  # strip surrounding whitespace, including the newline

Or, as @gnibbler suggested, you could use a set for efficiency reasons:

lemma = set()
for line in oIndexFile:
    lemma.add(line.strip())
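
To see why the original loop never filters anything, here is a small sketch (assuming Python 3; the sample list stands in for the stop file) that compares the question's loop with the set-based one.

```python
stop_lines = ["apple-n\n", "banana-n\n", "cake-n\n", "pigeon-n\n"]

# The question's loop: lemma is rebound on every iteration,
# so only the last line of the stop file survives.
for line in stop_lines:
    lemma = line.split()
print(lemma)  # ['pigeon-n']

# A str never equals a list, so this test is true for every row.
always_true = "cake-n" != lemma

# Collecting every word instead keeps the whole stop list.
lemma_set = set()
for line in stop_lines:
    lemma_set.add(line.strip())

print("cake-n" in lemma_set)         # True
print("abandonment-n" in lemma_set)  # False
```

With the overwritten lemma, "concept not in lemma" only ever tests against the last stop word, which is why that variant gave the same unfiltered output on the sample data.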

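Edit: It looks like you don't want to split the line, but rather strip the newline. And yes, your logic is almost correct.

This is what the second part should look like: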

with open ("data_php.txt", "rb") as oSenseFile:
    with open("out_FILTER_LINES", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept not in lemma: #check if the concept exists in lemma
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass
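
The nested variant in the question comes out empty for a separate reason, which the sketch below tries to illustrate (assuming Python 3; io.StringIO and a temporary file stand in for the real files, and the two-row sample is made up). A file object can be iterated only once unless it is rewound, and opening the output in write mode inside the outer loop truncates it on every pass.

```python
import io
import os
import tempfile

sense_file = io.StringIO("abandonment-n\ta\tb\t1\ncake-n\tc\td\t1\n")
stop_words = ["apple-n", "cake-n"]

out_path = os.path.join(tempfile.mkdtemp(), "out")
rows_per_pass = []
for word in stop_words:
    # Mode "w" truncates the file each time it is opened.
    with open(out_path, "w") as out_file:
        rows = 0
        for line in sense_file:  # exhausted after the first pass
            concept = line.split()[0]
            if concept != word:
                out_file.write(line)
                rows += 1
        rows_per_pass.append(rows)

with open(out_path) as f:
    final_content = f.read()

print(rows_per_pass)        # [2, 0]
print(repr(final_content))  # ''
```

Reading the stop file once into a set and then making a single pass over the data file, as above, avoids both problems.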

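Answer 2: (score: 1)

If you are sure that the lines in the data file do not start with whitespace, then we don't need to split the line at all. This is a slight tweak of @gnibbler's answer.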

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        if not any(line.startswith(x) for x in lemma):
            outfile.write(line)
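
One caveat with startswith, sketched below under the assumption of Python 3 text lines and tab-separated columns: it is a prefix test, so a stop word such as cake-n would also remove a row whose first column is cake-nut-n (a made-up value for illustration), and it scans the whole stop list for every row. Comparing the exact first column against the set avoids both; str.partition is used here to cut the line at the first tab without splitting the rest.

```python
lemma = {"apple-n", "cake-n"}

lines = [
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
    "cake-nut-n\tas+n-a+vd\trequire-v\t1\n",  # hypothetical row
    "abandonment-n\tas+n-a+vd\trequire-v\t1\n",
]

# Prefix test: also drops "cake-nut-n", which is not a stop word.
by_prefix = [l for l in lines if not any(l.startswith(x) for x in lemma)]

# Exact test on the first tab-separated column, one set lookup per row.
by_column = [l for l in lines if l.partition("\t")[0] not in lemma]

print(len(by_prefix))  # 1
print(len(by_column))  # 2
```

The set lookup also keeps the cost per row roughly constant, which matters when the stop list and the data file are both large.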