How to delete specific sentences from a file using line numbers in Python

Asked: 2016-03-17 10:14:32

Tags: python file nltk

I want to delete specific lines that contain stop words or a matching string:

import time

from nltk.tokenize import sent_tokenize, word_tokenize

# Stop words to match in the sentences.
Mywords = ('hello', 'there', 'been')

with open('hello.txt', 'r') as f:
    raw = f.read()

# Split the raw text into sentences.
sent = sent_tokenize(raw)
length = len(sent)

print(length)
for i in range(length):
    time.sleep(2)
    # Tokenize the current sentence into words.
    thisWord = word_tokenize(sent[i])
    for word in thisWord:
        if word in Mywords:
            print("yes: ", sent[i])
        else:
            print("No:", sent[i])

print("End of Line")

1 Answer:

Answer 0: (score: 0)

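You cannot really delete lines from a file in place, but you can write all of the lines that do not contain any stop words to another file.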

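The following script first takes a list of stop words and converts it to a set(). It then reads your input file one line at a time. For each line it uses nltk.word_tokenize() to create a list of words, converts that list to a set, and takes its intersection with the stop words. If the intersection is not empty, the line must contain at least one stop word, and the script displays the stop words it found.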

If none are found, the line is written to the output.txt file:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Full list of stop words from nltk
#stop_words = set(stopwords.words('english'))

stop_words = set(['hello', 'there', 'been'])

with open('input.txt', 'r') as f_input, open('output.txt', 'w') as f_output:
    for line in f_input:
        line_words = set(word_tokenize(line))
        stop_words_present = line_words & stop_words

        if stop_words_present:
            print("Yes: '{}' contains {}".format(line.strip(), stop_words_present))     # Contains at least one stop word
        else:
            print("No:", line.strip())      # Contains no stop words
            f_output.write(line)
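
If the goal is to remove lines from the original file by line number, as the question title puts it, one possible approach is to record the numbers of the offending lines and then rewrite the file without them. The following is only a sketch under stated assumptions: the names simple_tokenize, find_stop_word_lines and delete_lines are hypothetical, and simple_tokenize is a regex stand-in for nltk.word_tokenize() so that the example runs without NLTK. Unlike the script above, it lowercases words before matching.

```python
import os
import re
import tempfile

STOP_WORDS = {"hello", "there", "been"}


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def find_stop_word_lines(path, stop_words):
    """Return the 1-based numbers of lines that contain at least one stop word."""
    numbers = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            # A non-empty set intersection means a stop word is present.
            if set(simple_tokenize(line)) & stop_words:
                numbers.append(number)
    return numbers


def delete_lines(path, line_numbers):
    """Rewrite the file at `path` without the given 1-based line numbers."""
    to_drop = set(line_numbers)
    directory = os.path.dirname(os.path.abspath(path))
    # Write the surviving lines to a temporary file in the same directory,
    # then swap it into place so the original is never left half-written.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp, \
                open(path, encoding="utf-8") as src:
            for number, line in enumerate(src, start=1):
                if number not in to_drop:
                    tmp.write(line)
        os.replace(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise
```

A call such as delete_lines('hello.txt', find_stop_word_lines('hello.txt', STOP_WORDS)) would then overwrite hello.txt with only the lines that contain no stop words. The file is still rewritten in full; the temporary file merely hides that step.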

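Note that nltk includes a full list of English stop words that you can use; just swap in the commented-out stop_words line above. If the list cannot be found, you may need to install it first by running the following mini script: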

import nltk

nltk.download()

This displays a download utility from which you can fetch stopwords, as follows:

[Screenshot: NLTK download helper]

Select Corpora, scroll down to stopwords, and then click the Download button.
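
As an alternative to the interactive downloader, nltk.download() also accepts a package identifier, so the stop word list can be fetched from a script. The sketch below is an illustration rather than part of the original answer: load_stop_words and its allow_download flag are hypothetical names, and the function falls back to a caller-supplied set when NLTK or the corpus is unavailable.

```python
def load_stop_words(fallback, allow_download=False):
    """Return NLTK's English stop words as a set, or `fallback` if unavailable."""
    try:
        import nltk
        from nltk.corpus import stopwords
    except ImportError:
        # NLTK itself is not installed.
        return set(fallback)

    try:
        return set(stopwords.words("english"))
    except LookupError:
        # The stopwords corpus has not been downloaded yet.
        if allow_download and nltk.download("stopwords", quiet=True):
            try:
                return set(stopwords.words("english"))
            except LookupError:
                pass
        return set(fallback)


stop_words = load_stop_words({"hello", "there", "been"})
```

Passing allow_download=True lets the function attempt the download itself, which requires network access.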