Capturing blocks of text between 2 regular expressions

Asked: 2013-06-27 18:01:19

Tags: python regex text python-3.x

First of all, I should tell you that I am using Python 3.3.

I am just starting out with Python and regex, and I would like to know how to capture blocks of text from a .txt file between 2 regex matches.

Here is a sample of my file:

    commit 1f0883381054b796b643dcff974435633eed8a79
    this is
    commit 1
    bloc

    commit 2f0883381054b796b643dcff974435633eed8a78
    this is
    commit 2

    bloc

    commit 3f0883381054b796b643dcff974435633eed8a77
    this is

    commit 3


    bloc

    commit 4f0883381054b796b643dcff974435633eed8a76

    this is
    commit 
    4
    bloc

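So I want to capture the text between 2 regex matches that start with commit, followed by a space and 40 characters (I guess that is: ^commit.{41}). Note: lines that start with commit but have nothing after it should not match. And of course I need to be able to get the last commit block as well, which does not end with commit.{41} but with the end of the file.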
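One possible way to express this, as a minimal sketch rather than a definitive answer: match each header line, then lazily consume everything up to a lookahead for the next header or the end of the text. The name `split_commit_blocks` is made up for illustration, and the sketch assumes the 40 characters are lowercase hexadecimal, as in the sample above.

```python
import re

# A header is "commit", one space, then exactly 40 hex characters alone on a line.
# The body is matched lazily and stops just before the next header or at the
# very end of the text (\Z), so the last block is captured too.
COMMIT_BLOCK = re.compile(
    r"^commit ([0-9a-f]{40})$\n?(.*?)(?=^commit [0-9a-f]{40}$|\Z)",
    re.MULTILINE | re.DOTALL,
)


def split_commit_blocks(text):
    """Return a list of (commit_id, body) tuples, one per commit header."""
    return [(m.group(1), m.group(2)) for m in COMMIT_BLOCK.finditer(text)]
```

Because the next header is only looked at and not consumed, it is still available as the start of the following match. Lines such as `commit 1` inside a body do not end a block, since they are not followed by 40 hex characters.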

Once I have all the blocks, I will be able to carry on. This is what git log -p looks like:

commit 1f0883381054b796b643dcff974435633eed8a79
Merge: 4e1d5f7 8ffg9do
Author: name <email>
Date:   date of commit

    "comment 
    of the commit
    on multilines"

diff --…
index …
--- …
+++ …
@@ …
-…
+…
 …

diff --…
index …
--- …
+++ …
@@ …

commit 2f0883381054b796b643dcff974435633eed8a78
Author: name <email>
Date:   date

For example, take commitBlock[0]. That would be:

Merge: 4e1d5f7 8ffg9do
Author: name <email>
Date:   date of commit

    "comment 
    of the commit
    on multilines"

diff --…
index …
--- …
+++ …
@@ …
-…
+…
 …

diff --…
index …
--- …
+++ …
@@ …

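Extract the Author: line (= Author: name <email>)

Extract the comment of the commitBlock: "comment of the commit on multilines"

The same goes for the diffBlock.

N.B. As with the commitBlock, a diffBlock should stop at the next diff, at commit.{41}, or at the end of the file.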
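The Author: line and the diff blocks could be pulled out of a single commit block roughly as follows. This is only a sketch under a few assumptions: the block has already been cut at the commit boundaries (so a diff block only needs to stop at the next diff or at the end of the block), every diff block starts with `diff` followed by a space at the beginning of a line, and the function name `parse_commit_block` is hypothetical.

```python
import re


def parse_commit_block(block):
    """Pull the Author: line and the diff blocks out of one commit block."""
    # Without DOTALL, ".*" stops at the end of the Author: line.
    author = re.search(r"^Author:.*$", block, re.MULTILINE)
    # Each diff block starts at a line beginning with "diff " and runs lazily
    # up to the next such line or the end of the commit block.
    diffs = re.findall(r"^diff .*?(?=^diff |\Z)", block, re.MULTILINE | re.DOTALL)
    return {
        "author": author.group(0) if author else None,
        "diffs": diffs,
    }
```

The same lookahead idea is used here as for the commit blocks: the delimiter that ends one diff block is not consumed, so it can start the next one.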

I have tried several things:

This is what I had before I realized that I needed to get several blocks.

import re

source = open("file.txt","rt", encoding="ISO-8859-1")

for line in source:

    commit = re.findall('^commit.{41}',line) # line starts with "commit" and is followed by 41 characters
    merge = re.findall('^Merge:.*',line)
    author = re.findall('^Author:.*',line)
    date = re.findall('^Date:.*',line)
    signed = re.findall('Signed-off-by:.*', line)

    for commitLine in commit:
        print (commitLine)
        #post into DB

    for mergeLine in merge:
        print (mergeLine)
        #post into DB
    .
    .
    .

Or:

    re.findall('^commit.{41}(.*?)^commit.{41}|endoffile', source.read(), re.DOTALL|re.M)

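I also tried using re.split(). It works for the commitBlock! But I run into problems when I want to split out the diffBlock and the commentBlock, because I sometimes need the commit line to end a block, and because of the split it is no longer there.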

import os
import re
from pymongo import Connection


source = open("testSelection.txt","r", encoding="ISO-8859-1") #file that we want to analyse
sourceRead = open("testSelection.txt","r", encoding="ISO-8859-1").read() #writing source.read() bugs...
print(source.name)


commit = []
for line in source:
    if line[:6]=="commit":
        commitId = line[7:]
        commit.append(line)

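# commitBlock is not defined anywhere in the code as posted; this line is an
# assumption based on the re.split() attempt described above
commitBlock = re.split('^commit.{41}', sourceRead, flags=re.M)

f=1
while f < len(commitBlock):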
    lineCommitBlock = re.split('\n', commitBlock[f])
    diffBlock = re.split('\ndiff', commitBlock[f])
    print("-----------------------------NEW COMMIT BLOCK-------------------------------")
    print(commit[f-1])
    i=0
    while i < len(lineCommitBlock):
        if "Merge:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Author:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Date:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Signed-off-by:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Tested-by:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Reviewed-by:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        i += 1
    print("Before commentBlock <--------------------------------------------------------------------")

    print("After commentBlock <--------------------------------------------------------------------")
    j=1   
    while j < len(diffBlock):
        print(diffBlock[j])
        j += 1

    f += 1

source.close()

(I agree, it looks like a mess!)
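On the problem of the commit line disappearing after the split: when the pattern passed to re.split() contains a capturing group, the matched text is returned in the result list as well. A minimal sketch, assuming the headers are `commit`, a space, and 40 lowercase hex characters; the function name `split_keeping_headers` is made up for illustration.

```python
import re


def split_keeping_headers(text):
    """Split on commit header lines while keeping each header with its body."""
    # The parentheses make re.split() return the matched header too.
    parts = re.split(r"^(commit [0-9a-f]{40})$", text, flags=re.MULTILINE)
    # parts[0] is whatever precedes the first header; after that the list
    # alternates header, body, header, body, ...
    headers = parts[1::2]
    bodies = parts[2::2]
    return list(zip(headers, bodies))
```

Each body starts with the newline that followed its header, so it may need a strip() before further processing.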

Any idea how to solve my problem? Thanks!


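EDIT:
I have almost finished my work, so I think it would be quicker to find a way to capture the commentBlock with a regex or something similar than to use GitPython (which works with Python 2, not 3...). Can someone help me with this?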

What I need is to capture the text that comes after the line following Date:, up to diff, commit, or the end of the file.

I am really stuck on this...
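A possible sketch for the commentBlock, under the assumption that the message is everything between the Date: line and the first line that starts with `diff` followed by a space, a commit header, or the end of the text. The names `COMMENT_BLOCK` and `extract_comment` are hypothetical.

```python
import re

# Skip the Date: line, then capture lazily until a line starting with "diff ",
# a commit header, or the end of the text.
COMMENT_BLOCK = re.compile(
    r"^Date:[^\n]*\n(.*?)(?=^diff |^commit [0-9a-f]{40}$|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_comment(block):
    """Return the commit message of one block, or None if there is no Date: line."""
    m = COMMENT_BLOCK.search(block)
    if m is None:
        return None
    # Drop the blank lines around the message and the leading indentation.
    lines = [line.strip() for line in m.group(1).strip().splitlines()]
    return "\n".join(lines)
```

Since the message lines are indented in the git log -p output shown above, a message line that happens to begin with the word diff or commit would not sit at the start of a line and so would not end the capture early.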

0 answers:

No answers yet