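First of all, I should mention that I'm using Python 3.3.
I'm just getting started with Python and regex, and I'd like to know how to capture blocks of text that sit between two regex matches in a .txt file.
Here is a sample of my file: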
commit 1f0883381054b796b643dcff974435633eed8a79
this is
commit 1
bloc
commit 2f0883381054b796b643dcff974435633eed8a78
this is
commit 2
bloc
commit 3f0883381054b796b643dcff974435633eed8a77
this is
commit 3
bloc
commit 4f0883381054b796b643dcff974435633eed8a76
this is
commit
4
bloc
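So I want to capture the text between two lines that each start with commit followed by a space and 40 characters (I guess that is: ^commit.{41}).
Note: a line that starts with commit and has nothing after it should not count as a match.
Of course, I also need to be able to get the last commit block. It doesn't end with commit.{41}; it ends at the end of the file.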
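One possible way to get those blocks, shown only as a sketch: find every header line with re.finditer and slice the text between consecutive matches, so the last block simply runs to the end of the file. The name split_commit_blocks and the sample text are made up for illustration, and the pattern assumes the 40 characters are a lowercase hex hash.

```python
import re

# Header line: "commit", one space, then 40 hex characters, alone on its line.
# (Assumption: the 40 characters are a lowercase hex SHA-1.)
COMMIT_HEADER = re.compile(r"^commit [0-9a-f]{40}$", re.MULTILINE)


def split_commit_blocks(text):
    """Return a list of (header_line, body) pairs, one per commit."""
    headers = list(COMMIT_HEADER.finditer(text))
    blocks = []
    for index, match in enumerate(headers):
        # The body runs to the next header, or to end of file for the last one.
        if index + 1 < len(headers):
            end = headers[index + 1].start()
        else:
            end = len(text)
        body = text[match.end():end].strip("\n")
        blocks.append((match.group(0), body))
    return blocks


# Invented sample in the shape of the file shown above.
sample = """commit 1f0883381054b796b643dcff974435633eed8a79
this is
commit 1
bloc
commit 4f0883381054b796b643dcff974435633eed8a76
this is
commit
4
bloc
"""
commit_blocks = split_commit_blocks(sample)
for header, body in commit_blocks:
    print(header)
    print(body)
```

Because the header has to fill the whole line, lines such as commit 1 or a bare commit stay inside the body instead of starting a new block.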
Once I have all the blocks, I can carry on with my work. This is what the output of git log -p looks like:
commit 1f0883381054b796b643dcff974435633eed8a79
Merge: 4e1d5f7 8ffg9do
Author: name <email>
Date: date of commit
"comment
of the commit
on multilines"
diff --…
index …
--- …
+++ …
@@ …
-…
+…
…
diff --…
index …
--- …
+++ …
@@ …
commit 2f0883381054b796b643dcff974435633eed8a78
Author: name <email>
Date: date
For example, take commitBlock[0]. That would be:
Merge: 4e1d5f7 8ffg9do
Author: name <email>
Date: date of commit
"comment
of the commit
on multilines"
diff --…
index …
--- …
+++ …
@@ …
-…
+…
…
diff --…
index …
--- …
+++ …
@@ …
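Extract the Author: line (= Author: name <email>)
Extract the comment of the commitBlock:
"comment
of the commit
on multilines"
The same applies to diffBlock.
N.B. As with commitBlock, a diffBlock should stop at the next diff, at commit.{41}, or at the end of the file.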
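To make the goal concrete, here is a rough sketch of pulling those three pieces out of a single commit block. It assumes the comment is whatever sits between the Date: line and the first line starting with diff, and that every diff block starts with a line beginning with diff. The function name parse_commit_block and the sample block are hypothetical.

```python
import re


def parse_commit_block(block):
    """Split one commit body into (author_line, comment, diff_blocks)."""
    # Offsets of every line that starts with "diff".
    starts = [m.start() for m in re.finditer(r"^diff", block, re.MULTILINE)]
    # Each diff block ends where the next one starts, or at the end of the block.
    ends = starts[1:] + [len(block)]
    diff_blocks = [block[s:e].strip("\n") for s, e in zip(starts, ends)]

    # Everything before the first diff holds the metadata and the comment.
    header_part = block[:starts[0]] if starts else block
    author = re.search(r"^Author:.*$", header_part, re.MULTILINE)

    # Assumption: the comment is whatever follows the Date: line.
    date = re.search(r"^Date:.*$", header_part, re.MULTILINE)
    comment = header_part[date.end():].strip() if date else ""

    return (author.group(0) if author else None), comment, diff_blocks


# Invented block; the diff contents are placeholders, not real git output.
block = """Merge: 4e1d5f7 8ffg9do
Author: name <email>
Date: date of commit
"comment
of the commit
on multilines"
diff --a
index 1
+added
diff --b
index 2
"""
author, comment, diff_blocks = parse_commit_block(block)
print(author)
print(comment)
print(len(diff_blocks))
```

Slicing by match offsets keeps the diff line at the top of each diff block, which a plain re.split('\ndiff', ...) would throw away.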
I have tried several things:
This is what I had before I realized that I needed to get multiple blocks.
import re

source = open("file.txt", "rt", encoding="ISO-8859-1")
for line in source:
    commit = re.findall('^commit.{41}', line)  # line starts with "commit" and is followed by 41 characters
    merge = re.findall('^Merge:.*', line)
    author = re.findall('^Author:.*', line)
    date = re.findall('^Date:.*', line)
    signed = re.findall('Signed-off-by:.*', line)
    for commitLine in commit:
        print(commitLine)
        # post into DB
    for mergeLine in merge:
        print(mergeLine)
        # post into DB
.
.
.
Or: re.findall('^commit.{41}(.*?)^commit.{41}|endoffile', source.read(), re.DOTALL | re.M)
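A guess at why the findall attempt above does not behave as hoped: the closing ^commit.{41} is consumed by the match, so the following block has no opening header left to match; the | makes the whole pattern an alternative to the literal text endoffile; and with re.DOTALL the . in .{41} can also run across line breaks. A sketch that uses a lookahead and \Z (end of string) instead, assuming the hash is 40 lowercase hex characters; the sample text is invented:

```python
import re

BLOCK = re.compile(
    r"^commit [0-9a-f]{40}\n"            # opening header line (consumed)
    r"(.*?)"                             # block body, as short as possible
    r"(?=^commit [0-9a-f]{40}$|\Z)",     # stop before the next header or at end of string
    re.DOTALL | re.MULTILINE,
)

# Invented sample: two commits, the second containing a bare "commit" line.
text = (
    "commit " + "a" * 40 + "\n"
    + "first\nblock\n"
    + "commit " + "b" * 40 + "\n"
    + "second\ncommit\nblock\n"
)

bodies = BLOCK.findall(text)
print(bodies)
```

The lookahead only checks that a header follows without consuming it, so that header is still available as the start of the next match.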
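I also tried using re.split(). It works for commitBlock! But I ran into trouble when I wanted to split out the diffBlock and the commentBlock, because sometimes I need the commit line to mark where a block stops, and since split consumes it, it is no longer there.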
import os
import re
from pymongo import Connection

source = open("testSelection.txt", "r", encoding="ISO-8859-1")  # file that we want to analyse
sourceRead = open("testSelection.txt", "r", encoding="ISO-8859-1").read()  # writing source.read() bugs...
print(source.name)

# collect every line that starts with "commit"
commit = []
for line in source:
    if line[:6] == "commit":
        commitId = line[7:]
        commit.append(line)

# NOTE: commitBlock comes from the re.split() call on sourceRead mentioned above (not shown in this snippet)
f = 1
while f < len(commitBlock):
    lineCommitBlock = re.split('\n', commitBlock[f])
    diffBlock = re.split('\ndiff', commitBlock[f])
    print("-----------------------------NEW COMMIT BLOCK-------------------------------")
    print(commit[f-1])
    # print the metadata lines of this commit
    i = 0
    while i < len(lineCommitBlock):
        if "Merge:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Author:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Date:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Signed-off-by:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Tested-by:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        elif "Reviewed-by:" in lineCommitBlock[i]:
            print(lineCommitBlock[i])
        i += 1
    print("Before commentBlock <--------------------------------------------------------------------")
    print("After commentBlock <--------------------------------------------------------------------")
    # print every diff block (index 0 is the part before the first diff)
    j = 1
    while j < len(diffBlock):
        print(diffBlock[j])
        j += 1
    f += 1
source.close()
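(I agree, it looks like a mess!)
Any idea how to solve my problem? Thanks!
EDIT:
I'm almost done with my work, so I think it would be quicker to find a way to capture the commentBlock with a regex or something similar than to switch to GitPython (which is for Python 2, not 3...).
Can someone help me with this?
What I need is to capture the text that comes after the line following Date: and before diff, commit, or the end of the file.
I'm really stuck on this...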
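For reference, a sketch of the kind of pattern this describes: consume the Date: line, capture lazily, and stop at a lookahead for a line starting with diff, a commit header, or the end of the string. It assumes real git log -p output, where the comment is indented and surrounded by blank lines, and a 40-character lowercase hex hash. The sample log and the name COMMENT are invented.

```python
import re
import textwrap

COMMENT = re.compile(
    r"^Date:[^\n]*\n"                            # the whole Date: line
    r"(.*?)"                                     # the comment, as short as possible
    r"(?=^diff|^commit [0-9a-f]{40}$|\Z)",       # stop before diff, next commit, or EOF
    re.DOTALL | re.MULTILINE,
)

# Invented log in the shape of git log -p output.
lines = [
    "commit " + "1" * 40,
    "Author: name <email>",
    "Date: date of commit",
    "",
    "    comment",
    "    on two lines",
    "",
    "diff --hypothetical",
    "+added line",
    "commit " + "2" * 40,
    "Author: name <email>",
    "Date: date",
    "",
    "    last comment",
]
log = "\n".join(lines) + "\n"

# dedent removes the common indentation, strip removes the surrounding blank lines.
comments = [textwrap.dedent(m.group(1)).strip() for m in COMMENT.finditer(log)]
print(comments)
```

A commit with no comment simply yields an empty string, so the list stays aligned with the commits that have a Date: line.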