How to split a large text file based on a regular expression using Python

Asked: 2018-07-25 06:12:47

Tags: python regex python-3.x

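I have a large file with many lines, and some of those lines follow a unique pattern. I want to split the file on that pattern. The data in the text file is shown below: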

    commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date:   Sat Jun 9 04:11:37 2018 +0530

    configurations

commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date:   Sat Jun 9 02:59:56 2018 +0530

    remote

commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date:   Sat Jun 9 02:52:51 2018 +0530

    remote fix
    This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.

commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date:   Sat Jun 9 02:52:37 2018 +0530

    Revert "Removing the printf added"

    This reverts commit da0fac94719176009188ce40864b09cfb84ca590.

commit 8bfd4e7086ff5987491f280b57d10c1b6e6433fe
Date:   Sat Jun 9 02:52:18 2018 +0530

    Revert Bulk

    This reverts commit c2ee318635987d44e579c92d0b86b003e1d2a076.

commit bcb10c54068602a96d367ec09f08530ede8059ef
Date:   Fri Jun 8 19:53:03 2018 +0530

    fix crash observed

commit a84169f79fbe9b18702f6885b0070bce54d6dd5a
Date:   Fri Jun 8 18:14:21 2018 +0530

    Interface PBR

commit 254726fe3fe0b9f6b228189e8a6fe7bdf4aa9314
Date:   Fri Jun 8 18:12:10 2018 +0530

    Crash observed

commit 18e7106d54e19310d32e8b31d584cec214fb2cb7
Date:   Fri Jun 8 18:09:13 2018 +0530

    Changes to fix crash

My current code is below:

import re
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    txtrawdata = fp.read()    
    commits = re.split(r'^(commit|)[ a-zA-Z0-9]{40}$',txtrawdata)

print(commits)

Expected output: I want to split the text above at each line of the form "commit 18e7106d54e19310d32e8b31d584cec214fb2cb7" and turn the result into a Python list.

3 answers:

Answer 0 (score: 1)

An explanation of this regex is available on Regex101.

import re

with open(r'C:\gitlog.txt') as fp:  # path taken from the question
    data = fp.read()
groups = re.findall(r'(^\s*commit\s+[a-z0-9]+.*?)(?=^commit|\Z)', data, flags=re.DOTALL|re.MULTILINE)
for g in groups:
    print(g)
    print('-' * 80)

Prints:

commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date:   Sat Jun 9 04:11:37 2018 +0530

    configurations


--------------------------------------------------------------------------------
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date:   Sat Jun 9 02:59:56 2018 +0530

    remote


--------------------------------------------------------------------------------
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date:   Sat Jun 9 02:52:51 2018 +0530

    remote fix
    This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.


--------------------------------------------------------------------------------



...and so on
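The re.MULTILINE flag in the call above is what lets ^ and $ match at every line boundary; without it they only anchor to the start and end of the whole string, which is one reason the pattern in the question does not split anything. A minimal sketch of the difference, using a made-up two-commit log (the hashes here are placeholders, not real commits):

```python
import re

# Hypothetical two-commit log; the 40-character hashes are placeholders.
log = (
    "commit " + "a" * 40 + "\n"
    "    first\n"
    "\n"
    "commit " + "b" * 40 + "\n"
    "    second\n"
)

pattern = r'^commit [0-9a-f]{40}$'

# Without MULTILINE, ^ matches only at the start of the string and
# $ only at its end, so no single header line can satisfy both.
without_flag = re.findall(pattern, log)

# With MULTILINE, ^ and $ match at the start and end of each line.
with_flag = re.findall(pattern, log, flags=re.MULTILINE)

print(len(without_flag), len(with_flag))
```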

Answer 1 (score: 0)

import re
text = '''    commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date:   Sat Jun 9 04:11:37 2018 +0530

    configurations

commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date:   Sat Jun 9 02:59:56 2018 +0530

    remote

commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date:   Sat Jun 9 02:52:51 2018 +0530

    remote fix
    This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.'''

print(re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE))

This outputs:

['', 'Date:   Sat Jun 9 04:11:37 2018 +0530\n\n    configurations\n', 'Date:   Sat Jun 9 02:59:56 2018 +0530\n\n    remote\n', 'Date:   Sat Jun 9 02:52:51 2018 +0530\n\n    remote fix\n    This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.']
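Note that the split above discards the commit hashes themselves. If they are needed, one possible variation (a sketch, not part of the original answer) is to wrap the hash in a capturing group, since re.split then includes the captured text in the result. The sketch assumes full 40-character hashes and that no commit message line begins with "commit" followed by a hash; the name `commits` is only illustrative.

```python
import re

text = (
    "    commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30\n"
    "Date:   Sat Jun 9 04:11:37 2018 +0530\n"
    "\n"
    "    configurations\n"
    "\n"
    "commit 5c8deb3114b4ed17c5d2ea31842869515073670f\n"
    "Date:   Sat Jun 9 02:59:56 2018 +0530\n"
    "\n"
    "    remote\n"
)

# The capturing group makes re.split keep each hash in the result:
# ['', hash1, body1, hash2, body2, ...]
parts = re.split(r'^\s*commit ([0-9a-f]{40})\s*', text, flags=re.MULTILINE)

# Pair every hash with the text that follows it.
commits = list(zip(parts[1::2], parts[2::2]))

for commit_hash, body in commits:
    print(commit_hash)
    print(body)
```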

Answer 2 (score: 0)

This extracts the commit information:

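import re

commits = list()
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    for line in fp:
        # Raw string so that \s reaches the regex engine unchanged
        m = re.match(r'^commit\s+([a-f0-9]{40})$', line)
        if m:
            # group(0) is the whole "commit <hash>" text; group(1) is the hash alone
            commits.append(m.group(0))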

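commits is now just a list of the commit strings. If the format of your git log output changes, the matching regex will have to change as well. Make sure you generate the log with --no-abbrev-commit.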
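Because the loop above reads the file one line at a time, the same idea can be extended to collect each commit's full text without loading a large file into memory. The following is a rough sketch under some assumptions: `iter_commits` and `COMMIT_HEADER` are hypothetical names, the header pattern tolerates leading whitespace (as on the first line of the sample data), and any text before the first header would end up in the first chunk.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical header pattern: optional indentation, then a full 40-character hash.
COMMIT_HEADER = re.compile(r'^\s*commit\s+[0-9a-f]{40}\s*$')

def iter_commits(lines: Iterable[str]) -> Iterator[str]:
    """Yield one string per commit, consuming the input line by line."""
    block: List[str] = []
    for line in lines:
        # A new header closes the block collected so far.
        if COMMIT_HEADER.match(line) and block:
            yield ''.join(block)
            block = []
        block.append(line)
    if block:
        yield ''.join(block)

# Small in-memory sample; a file object opened with open() can be
# passed to iter_commits in the same way.
sample = [
    "    commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30\n",
    "Date:   Sat Jun 9 04:11:37 2018 +0530\n",
    "\n",
    "    configurations\n",
    "\n",
    "commit 499516b7e4f95daee4f839f34cc46df404b52d7a\n",
    "Date:   Sat Jun 9 02:52:51 2018 +0530\n",
    "\n",
    "    remote fix\n",
    "    This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.\n",
]

chunks = list(iter_commits(sample))
for chunk in chunks:
    print(chunk)
    print('-' * 80)
```

The line "This reverts commit ..." stays inside its commit because it does not consist solely of the word commit and a hash.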