Question

Previously, I have been using the following code snippet to clean my data:

import unicodedata, re, io

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s): # see http://www.unicode.org/reports/tr44/#General_Category_Values
    return cc_re.sub('', s)

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)

I would like to keep the newline characters in the file.

Below is the time taken by cc_re.sub('', s) to process the first few lines (the first column is the time taken, the second column is len(s)):

0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208

As @ashwinichaudhary suggested, I tried s.translate(dict.fromkeys(control_chars)) and logged the same timing output:

0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209

But the code is really slow for my 1 GB of text. Is there any other way to clean out control characters?
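For reference, in Python 3 `str.translate` looks up code points (integers), so a deletion table has to be keyed by ordinals rather than by the characters themselves. Below is a minimal sketch, assuming Python 3 and assuming that only `\n` should survive; the names `KEEP`, `DELETE_TABLE` and `rm_control_chars_translate` are made up for illustration.

```python
import sys
import unicodedata

# Assumption: the newline is the only "C*" character worth keeping.
KEEP = {"\n"}

# Map every code point in a "C*" general category to None (= delete it).
# The table is built once, up front; building it takes a moment.
DELETE_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("C") and chr(cp) not in KEEP
}


def rm_control_chars_translate(s):
    """Return s without control characters, keeping those listed in KEEP."""
    return s.translate(DELETE_TABLE)
```

Whether this beats the regex on a given file is something to measure rather than assume.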

Answer 1

I found a solution that works character by character, and I benchmarked it using a 100K file:

import unicodedata, re, io
from time import time

# This is to generate randomly a file to test the script

from string import lowercase
from random import random

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars

fnam = 'filename.txt'

out=io.open(fnam, 'w')

for line in range(1000000):
    out.write(u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n')
out.close()


# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s):
    return cc_re.sub('', s)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
out=io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')
def rm_control_chars_1(s):
    return ''.join(c for c in s if c not in control_chars)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars_1(line)
        cleanfile.append(line)
out=io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

The output is:

114.625444174
0.0149750709534

I tried it on a 1 GB file (only with the second version) and it took 186 s.

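I also wrote another version of the same script, which is slightly faster (176 s) and more memory-efficient (for very large files that do not fit in RAM):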

t0 = time()
out=io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        out.write(rm_control_chars_1(line))
out.close()
print time() - t0
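The scripts above are Python 2. A rough Python 3 port of the streaming, set-based version might look like the sketch below. It assumes that newlines should be kept, as the question asks, and the names `CONTROL_CHARS`, `rm_control_chars_set` and `clean_stream` are illustrative only.

```python
import sys
import unicodedata

# Every character whose general category starts with "C", minus the newline
# (assumption: the newline is the only one that should be kept).
CONTROL_CHARS = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("C")
} - {"\n"}


def rm_control_chars_set(s):
    """Filter s character by character against the precomputed set."""
    return "".join(c for c in s if c not in CONTROL_CHARS)


def clean_stream(fin, fout):
    """Copy text from fin to fout line by line, dropping control characters."""
    for line in fin:
        fout.write(rm_control_chars_set(line))
```

With real files this would be called with two handles opened via `open(path, encoding="utf8")` and `open(path, "w", encoding="utf8")`, so that only one line is held in memory at a time.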

Answer 2

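Since in UTF-8 the ASCII control characters are encoded in a single byte (compatible with ASCII) with values below 32, I suggest this fast piece of code: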

#!/usr/bin/python
import sys

ctrl_chars = [x for x in range(0, 32) if x not in (ord("\r"), ord("\n"), ord("\t"))]
filename = sys.argv[1]

with open(filename, 'rb') as f1:
  with open(filename + '.txt', 'wb') as f2:
    b = f1.read(1)
    while b:  # read() returns an empty value at EOF (works on Python 2 and 3)
      if ord(b) not in ctrl_chars:
        f2.write(b)
      b = f1.read(1)

Would that do?
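Reading one byte per call is likely to be dominated by call overhead. A possible variant of the same byte-level idea, sketched here under the assumption of Python 3, reads large chunks and deletes the unwanted bytes with `bytes.translate`. This is safe on UTF-8 input because bytes below 0x80 never occur inside a multi-byte sequence. Like the code above, it only handles the single-byte controls below 32; controls such as U+0080 to U+009F are encoded in two bytes and would not be touched. The names `DELETE` and `strip_c0_stream` are hypothetical.

```python
# Bytes 0..31 except tab, newline and carriage return, as in the answer above.
DELETE = bytes(b for b in range(32) if b not in (9, 10, 13))


def strip_c0_stream(fin, fout, chunk_size=1 << 20):
    """Copy binary stream fin to fout, dropping the bytes listed in DELETE."""
    # iter(callable, sentinel) keeps calling read() until it returns b"".
    for chunk in iter(lambda: fin.read(chunk_size), b""):
        # With table=None, translate() only deletes; nothing is remapped.
        fout.write(chunk.translate(None, DELETE))
```

Both arguments are expected to be file objects opened in binary mode (`"rb"` and `"wb"`).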

Answer 3

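Does this have to be done in Python? How about cleaning the file before you read it in Python to begin with? Use sed, which will process it line by line anyway.

See removing control characters using sed.

If you pipe the result out to another file, you can then open that. I don't know how fast it would be, though. You can do it in a shell script and test it. According to {{3}}, sed handles 82M characters per second.

Hope it helps.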

Answer 4

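Do you want it to move really fast? Break your input into multiple chunks, wrap that data-munging code up as a method, and use Python's multiprocessing package to parallelize it, writing to some common text file. Going character by character is the easiest way to crunch this kind of thing, but it always takes a while.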

https://docs.python.org/3/library/multiprocessing.html
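One way the chunk-and-parallelize idea could look is sketched below, assuming Python 3. Here the parent process does all the writing instead of the workers sharing an output file, which is a simplification chosen for this sketch. The pattern only covers the ASCII controls (an assumption), and `clean_chunk`, `batches` and `parallel_clean` are invented names.

```python
import re
from itertools import islice
from multiprocessing import Pool

# Assumption: only ASCII controls (except \t, \n, \r) and DEL are removed.
CC_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+")


def clean_chunk(lines):
    """Clean one batch of lines; this runs inside a worker process."""
    return [CC_RE.sub("", line) for line in lines]


def batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_clean(src, dst, workers=4, batch_size=10000):
    """Clean src into dst, spreading batches of lines over worker processes."""
    # newline="" leaves the original line endings untouched.
    with open(src, encoding="utf8", newline="") as fin, \
         open(dst, "w", encoding="utf8", newline="") as fout, \
         Pool(workers) as pool:
        # imap yields results in input order, so lines stay in sequence.
        for cleaned in pool.imap(clean_chunk, batches(fin, batch_size)):
            fout.writelines(cleaned)
```

A call such as `parallel_clean("filename.txt", "filename_clean.txt")` should sit under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn fresh interpreters. Whether the pickling overhead pays off depends on the batch size and on how expensive the per-line work is.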

Answer 5

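I'm surprised no one has mentioned mmap, which might just be the right fit here.

Note: I'll put this in as an answer in case it's useful, and I apologize that I don't have the time to actually test and compare it right now.

You load the file into memory (sort of), and then you can actually run re.sub() on the object. This helps eliminate the I/O bottleneck and lets you change the bytes in place before writing everything back in one go.

After this, you can experiment with str.translate() vs. re.sub(), and also add any further optimisations, such as double-buffering CPU and I/O or using multiple CPU cores/threads.

But it will look something like this: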

import mmap

f = open('test.out', 'r')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

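A nice excerpt from the mmap documentation is:

..You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they're mutable, you can change a single character by doing obj[index] = 'a', ...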
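To make the idea a little more concrete, here is an untimed sketch assuming Python 3, where an mmap object is bytes-like and therefore needs a bytes pattern. The pattern is limited to the ASCII controls (an assumption), and `clean_with_mmap` is a hypothetical name. Note that `sub()` builds the whole cleaned result in memory before it is written.

```python
import mmap
import re

# Bytes pattern: ASCII controls except \t, \n, \r, plus DEL (an assumption).
CC_RE = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+")


def clean_with_mmap(src, dst):
    """Run the regex directly over a memory-mapped view of src."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Length 0 maps the whole file; mapping an empty file raises an error.
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as m:
            fout.write(CC_RE.sub(b"", m))
```

This maps the file read-only and writes a separate output file. Editing in place, as the answer suggests, would need a writable mapping and some care, because removing bytes changes the file length.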

Answer 6

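I would try a couple of things.

First, do the substitution with a replace-all regex.

Second, set up a regex character class with known control-character ranges instead of a class of individual control characters. (This is in case the engine doesn't optimize it into ranges. A range requires two conditionals at the assembly level, as opposed to an individual conditional for each character in the class.)

Third, since you are removing the characters, add a greedy quantifier after the class. This removes the need to enter the substitution subroutine after every single-character match; instead, all adjacent characters are grabbed at once.

I don't know Python's syntax for regex constructs off the top of my head, nor all the control codes in Unicode, but the result would look something like this: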

[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+

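The largest amount of time would be spent copying the results to another string. The smallest amount of time would be spent finding all the control codes, which would be minuscule.

All things being equal, the regex (as described above) is the fastest way to go.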
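In Python 3 syntax, the pattern above can be used more or less as written, since the `re` module understands `\uXXXX` escapes in string patterns. A small sketch follows; the names `CONTROL_RUNS` and `rm_control_runs` are made up, and the ranges are the answer's own, which keep `\n` and `\r` but drop the tab (U+0009).

```python
import re

# Ranges instead of individual characters, plus "+" so that a whole run of
# adjacent control characters is removed in a single substitution.
CONTROL_RUNS = re.compile(r"[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+")


def rm_control_runs(s):
    """Delete runs of ASCII control characters, keeping \\n and \\r."""
    return CONTROL_RUNS.sub("", s)
```

To cover other Unicode control characters as well, more ranges would have to be added to the class; that is left out here because the answer does not list them.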

Is there a faster way to clean out control characters in a file?
