Question

I have a 5 GB file in the following format:

dn: cn
changetype: add
objectclass: ine
hghsfgdsdsdsd
mail: surcom
surname: satya2
givenname: surya2
cn: surya2

dn: cn
changetype: add
objectclass: inetOrgPerson
surname: sa
sddsds
givenname: s
cn: sur

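As you can see, the objectclass and surname values spill over onto the next line, and I want each of them on a single line. The code below does that, but it raises a MemoryError on large files. Can this code be changed so that it handles large files efficiently?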

import re

pattern = re.compile(r"(\w+):(.*)")

with open("uservolvo2.ldif", "r") as f:
    new_lines = []

    for line in f:

        if line.endswith('\n'):
            line = line[:-1]

        if line == "":
            new_lines.append(line)
            continue    

        l = pattern.search(line)

        if l:
            new_lines.append(line)
        else:
            new_lines[-1] += line

with open("user_modified.ldif", "a") as f:
    f.write("\n".join(new_lines))
    f.write("\n\n")

Answer 1

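The memory error may come from joining new_lines into one big string before writing it. Instead, you could iterate over the list and write each line individually: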

with open("file_modified.txt", "a") as f:
    for line in new_lines:
        f.write(line+'\n')

Answer 2

I don't know how efficient a regex-based solution would be, and I haven't benchmarked it, but here is one possible approach using re.sub on the whole file:

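import re

text = """objectclass: ine
hghsfgdsdsdsd
mail: surcom
surname: satya2"""

# Capture the objectclass value (\1), everything in between (\2) and the
# surname value (\3), then emit surname directly after objectclass.
output = re.sub(r'objectclass:(\s*\S+)(.*?)surname:(\s*\S+)',
                "objectclass:\\1\nsurname:\\3\\2", text, flags=re.DOTALL)
print(output)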

This prints:

objectclass: ine
surname: satya2
hghsfgdsdsdsd
mail: surcom

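The logic above matches the objectclass: line and then everything up to the surname: line. We then rearrange the text into the order you want, with surname immediately after objectclass.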
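Running re.sub over the whole file means holding all 5 GB in memory at once. A rough sketch of one way around that is to apply a regex to one blank-line-separated record at a time. The names iter_records and join_continuations are hypothetical, and the sketch assumes a continuation line is any line that does not start with "name:", which is slightly stricter than the pattern in the question:

```python
import re

# Assumption: a newline belongs to a wrapped value when the line after it
# does not start with an attribute name followed by a colon.
CONTINUATION = re.compile(r"\n(?!\w+:)")

def iter_records(lines):
    """Yield one blank-line-separated record at a time as a single string."""
    record = []
    for line in lines:
        if line.strip() == "":
            if record:
                yield "".join(record)
                record = []
        else:
            record.append(line)
    if record:
        yield "".join(record)

def join_continuations(record):
    """Glue wrapped lines back onto the attribute line they belong to."""
    return CONTINUATION.sub("", record.rstrip("\n"))
```

With an open file object f, something like `for record in iter_records(f): out.write(join_continuations(record) + "\n\n")` would keep only one record in memory at a time.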

Answer 3

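I think the most efficient approach is to iterate over the original text file, create another empty text file (modified.txt), and append the processed lines to the new file.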

with open('file.txt', 'r') as file, open('modified.txt', 'a') as modified:
    # Iterating over the file object reads one line at a time,
    # so only the current line is held in memory.
    for line in file:
        # do processing
        modified.write(line)
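For the question's case, the processing step could look roughly like the sketch below. It reuses the pattern from the question and keeps only the line currently being assembled in memory. The names unfold_lines and unfold_file are hypothetical, and the sketch assumes, as the question's code does, that any non-empty line without a "name:" part continues the previous line:

```python
import re

# Same pattern as in the question: a line containing "name:" starts a new attribute.
ATTRIBUTE = re.compile(r"(\w+):(.*)")

def unfold_lines(lines):
    """Yield logical lines, joining continuation lines onto the previous one."""
    pending = None  # the line currently being assembled
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "" or ATTRIBUTE.search(line):
            # A blank line or a new attribute: the pending line is complete.
            if pending is not None:
                yield pending
            pending = line
        else:
            # Continuation: glue it onto the pending line.
            pending = line if pending is None else pending + line
    if pending is not None:
        yield pending

def unfold_file(src_path, dst_path):
    """Stream src_path to dst_path, writing one unfolded line at a time."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in unfold_lines(src):
            dst.write(line + "\n")
```

Calling unfold_file("uservolvo2.ldif", "user_modified.ldif") would then process the file from the question without building a list of all lines.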

Append the next line to the previous line
