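Given a list of git log messages in which a commit line may be followed by a summary of files changed, insertions, and deletions (see the `data` string in the code below).
Each change summary is followed by a blank line. However, not all commits are separated by a blank line (see the first line: a merge commit that has no change information of its own).
How can this be converted into a DataFrame?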
import pandas as pd
from io import StringIO
data = '''\
f0a332fc65|User 1|2017-01-30 17:26:51|Merge branch 'dev' into master
877134c7be|User 1|2017-01-30 14:46:55|commitmsg 1
1 file changed, 15 insertions(+)
557b90502d|User 1|2017-01-30 14:38:52|commitmsg 2
10 files changed, 51 insertions(+), 56 deletions(-)
052788be45|User 2|2017-01-30 14:29:28|commitmsg 3
1 file changed, 1 deletion(-)
'''
df = pd.read_csv(StringIO(data), ???? )
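This question may be related to the multi-line input discussed here, but it is somewhat more complicated.
I have a working solution that reads the file in Python, essentially separating the change information from the rest and then merging the two DataFrames. I think it could be done faster without reading the file through Python, using only the pandas IO methods.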
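The two-step approach described above could look roughly like the following sketch. It is a reconstruction for illustration, not the asker's actual code: the names `commit_lines`, `stat_rows`, `commits`, `stats`, and `merged` are hypothetical, and it assumes that commit lines always contain a `|` while change summaries never do.

```python
from io import StringIO
import pandas as pd

data = """\
f0a332fc65|User 1|2017-01-30 17:26:51|Merge branch 'dev' into master
877134c7be|User 1|2017-01-30 14:46:55|commitmsg 1
1 file changed, 15 insertions(+)
557b90502d|User 1|2017-01-30 14:38:52|commitmsg 2
10 files changed, 51 insertions(+), 56 deletions(-)
052788be45|User 2|2017-01-30 14:29:28|commitmsg 3
1 file changed, 1 deletion(-)
"""

commit_lines, stat_rows = [], []
for line in StringIO(data):
    line = line.strip()
    if not line:
        continue  # skip blank separator lines
    if "|" in line:
        commit_lines.append(line)
    elif commit_lines:
        # A summary line belongs to the most recent commit line.
        stat_rows.append({"pos": len(commit_lines) - 1, "stats": line})

# First DataFrame: one row per commit (everything kept as strings).
commits = pd.read_csv(StringIO("\n".join(commit_lines)), sep="|", header=None,
                      names=["sha1", "author", "date", "message"], dtype=str)
# Second DataFrame: raw summary text, indexed by the commit's position.
stats = pd.DataFrame(stat_rows).set_index("pos")
# Left join on position; commits without a summary get NaN.
merged = commits.join(stats)
print(merged)
```

The merge commit keeps a missing value in the `stats` column, which is the case that makes a plain `read_csv` call awkward here.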
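Answer 0 (score: 2)
Here is an approach that reads everything into pandas in one go; some post-processing is then needed to get the resulting DataFrame into the format you want: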
import numpy as np
import pandas as pd
from io import StringIO
# `data` is the multi-line string defined in the question
# read the data with comma OR pipe as the column separator
# (a regex separator requires the Python parsing engine)
df = pd.read_csv(StringIO(data), sep=r',|\|', header=None, engine='python')
# extract the number of changed files (from column 0) and insert into column 4
df[4] = df[0].str.extract(r'(\d+) files? changed', expand=False)
# extract the number of insertions (from column 1) and insert into column 5
df[5] = df[1].str.extract(r'(\d+) insertions?', expand=False)
# extract the number of deletions (from column 1 or 2) and insert into column 6
df[6] = (df[1].str.extract(r'(\d+) deletions?', expand=False).fillna('')
         + df[2].str.extract(r'(\d+) deletions?', expand=False).fillna(''))
# replace empty strings with np.nan so they can be filled in later
df[6] = df[6].replace('', np.nan)
# make a mask of the rows you want to keep in the end (the commit rows)
keep_mask = df[0].str.match(r'^\w+$')
# for the rows that contain change, insertion, deletion data only:
# replace NaN counts with 0
df.loc[~keep_mask, [4, 5, 6]] = df.loc[~keep_mask, [4, 5, 6]].fillna(0)
# back fill the missing counts (columns 4-6) by at most one row;
# this fills the commit row directly above each change, insertion, etc. row
# with the appropriate values
df[[4, 5, 6]] = df[[4, 5, 6]].bfill(limit=1)
# drop the rows that contain change, insertion, etc. data only,
# then replace any 0 values with np.nan
df = df[keep_mask].replace(0, np.nan)
# name the columns what you want
df.columns = ['sha1', 'author', 'date', 'message', 'changes', 'insertions', 'deletions']
print(df)
         sha1  author                 date                         message  \
0  f0a332fc65  User 1  2017-01-30 17:26:51  Merge branch 'dev' into master
1  877134c7be  User 1  2017-01-30 14:46:55                     commitmsg 1
3  557b90502d  User 1  2017-01-30 14:38:52                     commitmsg 2
5  052788be45  User 2  2017-01-30 14:29:28                     commitmsg 3

  changes insertions deletions
0     NaN        NaN       NaN
1       1         15       NaN
3      10         51        56
5       1        NaN         1
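The step that carries this approach is the back fill limited to one row: it copies each summary row's counts into the commit row directly above it, but no further, which is why the merge commit stays empty. A minimal sketch of that behaviour in isolation, using a made-up toy column (`counts` and `filled` are illustrative names, not part of the answer):

```python
import numpy as np
import pandas as pd

# Toy column: rows 0, 1 and 3 stand for commit rows (no count yet),
# rows 2 and 4 stand for summary rows that carry a count.
counts = pd.Series([np.nan, np.nan, 15, np.nan, 51])

# Back fill by at most one row: row 1 takes the value of row 2,
# row 3 takes the value of row 4, and row 0 is left untouched.
filled = counts.bfill(limit=1)
print(filled.tolist())  # [nan, 15.0, 15.0, 51.0, 51.0]
```

Without `limit=1`, row 0 would also receive 15, attributing the changes of the following commit to the merge commit.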
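Answer 1 (score: 1)
Consider walking through the text file line by line, checking for each combination of changes, insertions, and deletions, saving the values to a temporary list, and appending that to a larger list that is then passed to the pd.DataFrame() call.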
# assumes `data`, `pd` and `StringIO` as defined in the question
rows = []
item = []
for line in StringIO(data):
    line = line.strip()
    if 'commitmsg' in line:
        # commit line: sha1|author|date|comment
        # (the 'commitmsg' marker is specific to the sample data
        # and also skips the merge commit)
        item = line.split('|')
    elif 'changed' in line:
        # leading integer of each comma-separated part of the summary
        chg = [int(part.split()[0]) for part in line.split(',')]
        if 'insertion' in line and 'deletion' in line:
            item.extend(chg)
        elif 'insertion' in line:
            item.extend(chg + [0])
        elif 'deletion' in line:
            item.extend([chg[0], 0, chg[1]])
        rows.append(item)
        item = []
df = pd.DataFrame(rows, columns=['sha1', 'author', 'date', 'comment',
                                 'changes', 'insertions', 'deletions'])
print(df)
#          sha1  author                 date      comment  changes  insertions  deletions
# 0  877134c7be  User 1  2017-01-30 14:46:55  commitmsg 1        1          15          0
# 1  557b90502d  User 1  2017-01-30 14:38:52  commitmsg 2       10          51         56
# 2  052788be45  User 2  2017-01-30 14:29:28  commitmsg 3        1           0          1
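The branching on 'insertion' and 'deletion' could also be folded into a single regular expression with optional groups. The following is only a sketch of that idea, not part of the answer above: `SHORTSTAT` and `parse_shortstat` are hypothetical names, and the pattern assumes summary lines shaped like the ones in the sample data.

```python
import re

# Files changed is mandatory; the insertion and deletion parts are optional.
SHORTSTAT = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)

def parse_shortstat(line):
    """Return (changes, insertions, deletions), or None if the line is not a summary."""
    match = SHORTSTAT.search(line)
    if match is None:
        return None
    # Groups that did not take part in the match are None; count them as 0.
    return tuple(int(group) if group else 0 for group in match.groups())

print(parse_shortstat("10 files changed, 51 insertions(+), 56 deletions(-)"))
print(parse_shortstat("1 file changed, 1 deletion(-)"))
```

Inside the loop, `item.extend(parse_shortstat(line))` would then replace the three `extend` branches for the summary lines shown here.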