我正在尝试从文本文件中读取文本,读取行,删除包含特定字符串的行(在本例中为“bad”和“naughty”)。 我写的代码是这样的:
infile = file('./oldfile.txt')
newopen = open('./newfile.txt', 'w')
for line in infile :
if 'bad' in line:
line = line.replace('.' , '')
if 'naughty' in line:
line = line.replace('.', '')
else:
newopen.write(line)
newopen.close()
我是这样写的但是没有用。
重要的是,如果文本的内容是这样的:
good baby
bad boy
good boy
normal boy
我不希望输出有空行。 所以不喜欢:
good baby
good boy
normal boy
但是像这样:
good baby
good boy
normal boy
我应该从上面的代码中编辑什么?
答案 0 :(得分:49)
您可以使您的代码更简单,更易读,如此
bad_words = ['bad', 'naughty']
with open('oldfile.txt') as oldfile, open('newfile.txt', 'w') as newfile:
for line in oldfile:
if not any(bad_word in line for bad_word in bad_words):
newfile.write(line)
答案 1 :(得分:5)
您可以直接将该行包含在新文件中,而不是替换。
for line in infile :
if 'bad' not in line and 'naughty' not in line:
newopen.write(line)
答案 2 :(得分:5)
我用它来删除文本文件中不需要的单词:
bad_words = ['abc', 'def', 'ghi', 'jkl']
with open('List of words.txt') as badfile, open('Clean list of words.txt', 'w') as cleanfile:
for line in badfile:
clean = True
for word in bad_words:
if word in line:
clean = False
if clean == True:
cleanfile.write(line)
或者对目录中的所有文件执行相同的操作:
import os
bad_words = ['abc', 'def', 'ghi', 'jkl']
for root, dirs, files in os.walk(".", topdown = True):
for file in files:
if '.txt' in file:
with open(file) as filename, open('clean '+file, 'w') as cleanfile:
for line in filename:
clean = True
for word in bad_words:
if word in line:
clean = False
if clean == True:
cleanfile.write(line)
我确信必须有更优雅的方式来做到这一点,但这就是我想要的。
答案 3 :(得分:2)
else
仅与最后if
相关联。你想要elif
:
if 'bad' in line:
pass
elif 'naughty' in line:
pass
else:
newopen.write(line)
另请注意,我删除了行替换,因为您还是不写这些行。
答案 4 :(得分:2)
今天我需要完成一项类似的任务,所以我根据我所做的一些研究写了一个要点来完成任务。 我希望有人会觉得这很有用!
import os
os.system('cls' if os.name == 'nt' else 'clear')
oldfile = raw_input('{*} Enter the file (with extension) you would like to strip domains from: ')
newfile = raw_input('{*} Enter the name of the file (with extension) you would like me to save: ')
emailDomains = ['windstream.net', 'mail.com', 'google.com', 'web.de', 'email', 'yandex.ru', 'ymail', 'mail.eu', 'mail.bg', 'comcast.net', 'yahoo', 'Yahoo', 'gmail', 'Gmail', 'GMAIL', 'hotmail', 'comcast', 'bellsouth.net', 'verizon.net', 'att.net', 'roadrunner.com', 'charter.net', 'mail.ru', '@live', 'icloud', '@aol', 'facebook', 'outlook', 'myspace', 'rocketmail']
print "\n[*] This script will remove records that contain the following strings: \n\n", emailDomains
raw_input("\n[!] Press any key to start...\n")
linecounter = 0
with open(oldfile) as oFile, open(newfile, 'w') as nFile:
for line in oFile:
if not any(domain in line for domain in emailDomains):
nFile.write(line)
linecounter = linecounter + 1
print '[*] - {%s} Writing verified record to %s ---{ %s' % (linecounter, newfile, line)
print '[*] === COMPLETE === [*]'
print '[*] %s was saved' % newfile
print '[*] There are %s records in your saved file.' % linecounter
链接到要点:emailStripper.py
最佳, AZ
答案 5 :(得分:1)
使用python-textops包:
from textops import *
'oldfile.txt' | cat() | grepv('bad') | tofile('newfile.txt')
答案 6 :(得分:0)
to_skip = ("bad", "naughty")
out_handle = open("testout", "w")
with open("testin", "r") as handle:
for line in handle:
if set(line.split(" ")).intersection(to_skip):
continue
out_handle.write(line)
out_handle.close()
答案 7 :(得分:0)
bad_words = ['doc:', 'strickland:','\n']
with open('linetest.txt') as oldfile, open('linetestnew.txt', 'w') as newfile:
for line in oldfile:
if not any(bad_word in line for bad_word in bad_words):
newfile.write(line)
\n
是换行符的Unicode转义序列。
答案 8 :(得分:0)
Regex比我使用的可接受答案(对于我的23 MB测试文件)要快一点。但是里面没有很多。
import re
bad_words = ['bad', 'naughty']
regex = f"^.*(:{'|'.join(bad_words)}).*\n"
subst = ""
with open('oldfile.txt') as oldfile:
lines = oldfile.read()
result = re.sub(regex, subst, lines, re.MULTILINE)
with open('newfile.txt', 'w') as newfile:
newfile.write(result)
答案 9 :(得分:0)
试试这个效果很好。
import re
text = "this is bad!"
text = re.sub(r"(.*?)bad(.*?)$|\n", "", text)
text = re.sub(r"(.*?)naughty(.*?)$|\n", "", text)
print(text)