Question

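I wrote a script that splits a long file into several smaller files. It currently splits wherever exactly 4 digits appear. I want to improve it so that it splits on exactly 4 digits only when they are at the start of a line.

Example input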

1020                                                                                                                                                                                                                                                            
200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla
1030
200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla

The desired output is the input above, split into groups:

200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla

and

200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla

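When one of the bla values happens to be 4 digits, it causes an extra split. How can I make sure that re.split only checks the first 4 or 5 characters of a line?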

import re

file = open('testnew.txt', 'r')

i=0
for x in re.split(r"\b[0-9]{4}\s+", file.read()):
    f = open('%d.txt' %i,'w')
    f.write(x)
    f.close()
    print (x,i)
    i = i+1

Answer 1

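It is better to read the file line by line. That way you will not run into memory problems if the file is very large, and you can run the 4-digit check on each line without an awkward split.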

doc = 0       # name of the current output file; stays 0 until a header is seen
towrite = ""  # text collected for the current output file
with open("somefile.txt", "r") as f:
    for i, line in enumerate(f):
        stripped = line.strip()
        if len(stripped) == 4 and stripped.isdigit():
            if i > 0:  # write the text collected before this header
                with open("{}.txt".format(doc), "w") as wfile:
                    wfile.write(towrite)
            doc = stripped  # the 4-digit header names the next file
            towrite = ""    # reset
        else:
            towrite += line
# write the last section
with open("{}.txt".format(doc), "w") as wfile:
    wfile.write(towrite)

Test file:

1234
43267583291483 1234 3213213
57489367483929 32133248 3728913
3267
32163721837362 4723 3291832
42189323471911 321113 3211111132
326189183828327 3218484828283 828238281
21838282387 3726173 6278
1111
1236274818 327813678
32167382167894829013 321

Result:

1234.txt

43267583291483 1234 3213213
57489367483929 32133248 3728913

3267.txt

32163721837362 4723 3291832
42189323471911 321113 3211111132
326189183828327 3218484828283 828238281
21838282387 3726173 6278

1111.txt

1236274818 327813678
32167382167894829013 321

Answer 2

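^ matches the start of the string

$ matches the end of the string

findall returns a list of all matches, or of the captured groups if the pattern contains (capturing groups)

(?:) is a non-capturing group

* is greedy, *? is not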
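These points can be checked with a short sketch. The sample string and variable names are made up for illustration and do not come from the asker's file. It also shows one detail not mentioned above: by default ^ matches only at the very start of the string, and the re.MULTILINE flag makes it match at the start of every line.

```python
import re

text = "1020\naaa 1234 bbb\n1030\nccc\n"

# By default, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\d{4}", text)
# With re.MULTILINE, ^ also matches right after every newline.
starts_multiline = re.findall(r"^\d{4}", text, flags=re.MULTILINE)

# With one capturing group, findall returns the group, not the whole match.
captured = re.findall(r"(\d{4})\n", text)
# A non-capturing group (?:...) groups without changing what findall returns.
whole = re.findall(r"(?:\d{4})\n", text)

# * is greedy (takes as much as it can); *? is lazy (takes as little as it can).
greedy = re.match(r"<.*>", "<a><b>").group()   # '<a><b>'
lazy = re.match(r"<.*?>", "<a><b>").group()    # '<a>'

print(starts_default, starts_multiline)  # ['1020'] ['1020', '1030']
print(captured, whole)                   # ['1020', '1030'] ['1020\n', '1030\n']
```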

This solution should work:

import re

with open('testnew.txt', 'r') as file:
    text = file.read()

i = 0
for x in re.findall(r"((?:.|\n)*?)(?:(?:^|\n)\d{4}\n|$)", text):
    if x.strip():  # skip empty and whitespace-only matches
        with open('%d.txt' % i, 'w') as f:
            f.write(x)
        print(x, i)
        i = i + 1
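A possible alternative, closer to the re.split call in the question, is to anchor the 4-digit pattern to whole lines with re.MULTILINE. This is only a sketch: split_sections is a hypothetical helper name, the sample text is invented, and allowing trailing blanks after the header is an assumption based on the padded first line of the example input.

```python
import re

def split_sections(text):
    """Split text on lines that consist of exactly four digits (hypothetical helper)."""
    # With re.MULTILINE, ^ and $ anchor to line boundaries, so a 4-digit
    # value in the middle of a line can no longer trigger a split.
    # [ \t]* tolerates trailing blanks after the header (an assumption).
    parts = re.split(r"^\d{4}[ \t]*$\n?", text, flags=re.MULTILINE)
    # Drop empty chunks, such as the one before the first header.
    return [p for p in parts if p.strip()]

sample = (
    "1020   \n"
    "2001 bla 1234 bla\n"
    "2002 bla bla\n"
    "1030\n"
    "2003 bla 5678 bla\n"
)
sections = split_sections(sample)
for i, section in enumerate(sections):
    print(i, repr(section))
```

Each element of sections can then be written to its own file with the same '%d.txt' % i loop as in the question.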

Answer 3

Reading line by line would work well. You can check whether the line's length is 4 and, if so, skip it.
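A minimal sketch of that idea, assuming a header is any line that contains exactly four digits once surrounding whitespace is stripped. The function name group_lines and the sample lines are illustrative only.

```python
def group_lines(lines):
    """Group lines into sections, starting a new one at each 4-digit header line."""
    sections = []
    current = None
    for line in lines:
        stripped = line.strip()
        if len(stripped) == 4 and stripped.isdigit():
            # Header line: skip it and start a new section.
            current = []
            sections.append(current)
        else:
            if current is None:
                # Text that appears before the first header.
                current = []
                sections.append(current)
            current.append(line)
    return ["".join(section) for section in sections]

sample = ["1020\n", "a 1234 b\n", "c\n", "1030\n", "d\n"]
print(group_lines(sample))  # ['a 1234 b\nc\n', 'd\n']
```

Because it accepts any iterable of lines, an open file object can be passed in directly, so the whole file is never read in a single call.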
