I'm trying to strip all punctuation from a text file. Is there a more efficient way to do this?
Here is my code:
fname = open("text.txt","r")
stripped = ""
for line in fname:
    for c in line:
        if c in '!,.?-':
            c = ""
        stripped = stripped + c
print(stripped)
Answer 0 (score: 0)
import re
with open("text.txt","r") as r:
    text = r.read()
with open("text.txt","w") as w:
    w.write(re.sub(r'[!,.?-]', '', text))
How about this?
Or an approach without regular expressions:
with open("text.txt","r") as r:
    text = r.read()
with open("text.txt","w") as w:
    for i in '!,.?-':
        text = text.replace(i, '')
    w.write(text)
Answer 1 (score: 0)
You can try a regular expression that replaces any of the punctuation characters with an empty string:
import re
with open('text.txt', 'r') as f:
    for line in f:
        # end='' avoids doubling the newline already present in each line
        print(re.sub(r'[.!,?-]', '', line), end='')
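Answer 2 (score: 0)
Using str.translate is usually faster than a regex, and faster than individual string operations or building the string piece by piece: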
# Python 2 solution
with open("text.txt","r") as fname:
    stripped = fname.read().translate(None, '!,.?-')
Note that '!,.?-' is not all punctuation. The best way to get all ASCII punctuation is to import string and use string.punctuation.
In Python 3, you can do this:
# Read as text and translate with str.translate
# If you're using the table more than once, always define once, use many times
delete_punc_table = str.maketrans('', '', '!,.?-')
with open("text.txt","r") as fname:
    stripped = fname.read().translate(delete_punc_table)
# Read as bytes to use Py2-like ultra-efficient translate then decode
with open("text.txt", "rb") as fname:
    stripped = fname.read().translate(None, b'!,.?-').decode('ascii')  # Or some other ASCII superset encoding
# If you use string.punctuation for the bytes approach
# you'd need to encode it, e.g. translate(None, string.punctuation.encode('ascii'))
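To tie the string.punctuation suggestion to the Python 3 str.translate approach, here is a minimal sketch. The names DELETE_ASCII_PUNCT and strip_ascii_punctuation are made up for illustration, and the table only covers ASCII punctuation, so characters such as curly quotes are left alone.

```python
import string

# Build the deletion table once and reuse it.
# string.punctuation covers ASCII punctuation only.
DELETE_ASCII_PUNCT = str.maketrans('', '', string.punctuation)

def strip_ascii_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(DELETE_ASCII_PUNCT)

print(strip_ascii_punctuation("Hello, world! Is this well-formed?"))
# prints: Hello world Is this wellformed
```

If non-ASCII punctuation also needs to go, the table would have to be extended with those characters; that is outside what string.punctuation provides.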
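Before Python 3.4, the "read as bytes, translate, then decode" approach was ridiculously better; in 3.4+ it is probably still slightly faster, but not by enough to make a huge difference.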
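The "ASCII superset encoding" remark applies to UTF-8 as well: in UTF-8, every byte of a multi-byte character is 0x80 or higher, so deleting ASCII punctuation bytes cannot damage a non-ASCII character. A small sketch, assuming the data is valid UTF-8 (strip_punct_utf8 is a hypothetical helper name):

```python
def strip_punct_utf8(raw, delete=b'!,.?-'):
    """Delete the given ASCII bytes from UTF-8 encoded data, then decode it."""
    # bytes.translate(None, delete) removes the bytes without remapping anything
    return raw.translate(None, delete).decode('utf-8')

sample = "café, naïve? déjà-vu!".encode('utf-8')
print(strip_punct_utf8(sample))
# prints: café naïve déjàvu
```

This would not be safe for encodings that are not ASCII supersets, UTF-16 for example.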
Timings for the various approaches on my machine (using Python 3.5 x64 for Windows):
>>> import random, re, string, timeit
# Make random ~100KB input
>>> data = ''.join(random.choice(string.printable) for i in range(100000))
# Using re.sub (with a compiled regex to minimize overhead)
>>> min(timeit.repeat('trans.sub("", data)', 'from __main__ import re, string, data; trans = re.compile(r"[" + re.escape(string.punctuation) + r"]")', number=1000))
17.47419076158849
# Using iterative str.replace
>>> min(timeit.repeat('d2 = data\nfor l in punc: d2 = d2.replace(l, "")', 'from __main__ import string, data; punc = string.punctuation', number=1000))
13.51673370949311
# Using str.translate
>>> min(timeit.repeat('data.translate(trans)', 'from __main__ import string, data; trans = str.maketrans("", "", string.punctuation)', number=1000))
1.5299288690396224
# Using bytes.translate then decoding as ASCII (without the decode, this is close to how Py2 would behave)
>>> bdata = data.encode("ascii")
>>> min(timeit.repeat('bdata.translate(None, trans).decode("ascii")', 'from __main__ import string, bdata; trans = string.punctuation.encode("ascii")', number=1000))
1.294337291624089
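Times are the best time for running 1000 translate loops over 3 test runs (taking the minimum is considered the best way to keep timing jitter from affecting the results), in seconds, for an input of 100,000 random printable characters. re.sub (even precompiled) isn't even close. Either translate approach is fine (bytes.translate is probably faster, but the code is also more complex). str.replace would be more competitive if the set of things to replace were smaller (using just '!,.?-' instead of all punctuation drops it to ~3 seconds), but for any reasonable number of characters it is going to be slower, and it does not scale as well as translate.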
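For anyone who wants to rerun the comparison, the following is a rough sketch rather than the exact benchmark above: it wraps the three approaches in functions (the function names and the sample string are made up for illustration), first checks that they produce the same result, and then times them with timeit. Absolute numbers will differ by machine and Python version.

```python
import re
import string
import timeit

PUNCT = string.punctuation
PUNCT_RE = re.compile("[" + re.escape(PUNCT) + "]")
PUNCT_TABLE = str.maketrans("", "", PUNCT)

def with_regex(text):
    """Remove punctuation with a precompiled character-class regex."""
    return PUNCT_RE.sub("", text)

def with_replace(text):
    """Remove punctuation with one str.replace call per character."""
    for ch in PUNCT:
        text = text.replace(ch, "")
    return text

def with_translate(text):
    """Remove punctuation with a prebuilt str.translate table."""
    return text.translate(PUNCT_TABLE)

# Illustrative input, not the random data used in the timings above
sample = "Wait... what?! It's a (mostly) well-behaved test, isn't it?" * 1000

# All three approaches should agree before their speed is compared
assert with_regex(sample) == with_replace(sample) == with_translate(sample)

if __name__ == "__main__":
    for func in (with_regex, with_replace, with_translate):
        # Best of 3 runs, 20 calls per run
        best = min(timeit.repeat(lambda: func(sample), number=20, repeat=3))
        print(f"{func.__name__}: {best:.4f} s")
```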