pandas: read multi-line CSV input with different separators

Time: 2017-02-04 11:08:50

Tags: python pandas dataframe

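Given a list of git log messages in which each commit line may be followed by a summary of files changed, insertions and deletions (see the `data` string in the code below).

Each change summary is followed by an empty line. Not every commit has one, though: the first line of the sample, a merge commit, carries no change summary of its own.

How can this input be transformed into a DataFrame with one row per commit, holding the commit fields plus the number of files changed, insertions and deletions?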

import pandas as pd
from io import StringIO 

data = '''\
f0a332fc65|User 1|2017-01-30 17:26:51|Merge branch 'dev' into master
877134c7be|User 1|2017-01-30 14:46:55|commitmsg 1
 1 file changed, 15 insertions(+)

557b90502d|User 1|2017-01-30 14:38:52|commitmsg 2
 10 files changed, 51 insertions(+), 56 deletions(-)

052788be45|User 2|2017-01-30 14:29:28|commitmsg 3
 1 file changed, 1 deletion(-)
'''

df = pd.read_csv(StringIO(data), ???? )

This question may be related to the multi-line input discussed here, but it is a bit more complicated.

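I have a working solution that reads the file in Python, essentially extracting the change information from the rest and then merging the two DataFrames. I think this could be done faster without reading the file through Python, using only the pandas io methods.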
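The asker's working solution is not shown. As a minimal sketch of what a "two DataFrames, then merge" approach might look like, the code below splits the input into commit lines and change-summary lines, builds one frame from each, and left-joins them on the commit hash. The names `commit_rows`, `stat_rows` and `stat_pattern`, and the column names, are assumptions made for illustration.

```python
import pandas as pd
from io import StringIO

data = '''\
f0a332fc65|User 1|2017-01-30 17:26:51|Merge branch 'dev' into master
877134c7be|User 1|2017-01-30 14:46:55|commitmsg 1
 1 file changed, 15 insertions(+)

557b90502d|User 1|2017-01-30 14:38:52|commitmsg 2
 10 files changed, 51 insertions(+), 56 deletions(-)

052788be45|User 2|2017-01-30 14:29:28|commitmsg 3
 1 file changed, 1 deletion(-)
'''

commit_rows = []    # one entry per commit line
stat_rows = []      # one entry per change-summary line, keyed by sha1
current_sha = None

for line in StringIO(data):
    line = line.rstrip('\n')
    if '|' in line:
        # maxsplit=3 keeps any '|' inside the commit message intact
        fields = line.split('|', 3)
        current_sha = fields[0]
        commit_rows.append(fields)
    elif 'changed' in line and current_sha is not None:
        stat_rows.append({'sha1': current_sha, 'stat': line.strip()})

commits = pd.DataFrame(commit_rows, columns=['sha1', 'author', 'date', 'message'])
stats = pd.DataFrame(stat_rows, columns=['sha1', 'stat'])

# Named groups become column names; optional groups that do not match yield NaN
stat_pattern = (r'(?P<changes>\d+) files? changed'
                r'(?:, (?P<insertions>\d+) insertions?\(\+\))?'
                r'(?:, (?P<deletions>\d+) deletions?\(-\))?')
counts = stats['stat'].str.extract(stat_pattern).apply(pd.to_numeric)
stats = pd.concat([stats[['sha1']], counts], axis=1)

# A left join keeps commits without a change summary (such as the merge commit)
result = commits.merge(stats, on='sha1', how='left')
print(result)
```

The left join is what keeps the merge commit in the result with NaN counts. An inner join would drop it.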

2 answers:

Answer 0: (score: 2)

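Here is an approach that reads everything into pandas in one go. Some post-processing is then needed to get the resulting DataFrame into the format you want: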

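import pandas as pd
import numpy as np
from io import StringIO  # `data` is the sample string from the question

# read the data with comma OR pipe as the column separator
# (a regex separator needs a raw string and the python parsing engine)
df = pd.read_csv(StringIO(data), sep=r',|\|', header=None, engine='python')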

# extract the number of changes (from column 0) and insert into column 4 
df[4] = df[0].str.extract(r'(\d+) files? changed', expand=False)

# extract the number of insertions (from column 1) and insert into column 5
df[5] = df[1].str.extract(r'(\d+) insertions?', expand=False)

# extract the number of deletions (from column 1 or 2) and insert into column 6
df[6] = (df[1].str.extract(r'(\d+) deletions?', expand=False).fillna('')
         + df[2].str.extract(r'(\d+) deletions?', expand=False).fillna(''))

# replace empty strings with np.nan so they can be filled in later
df[6] = df[6].replace('', np.nan)

# make a mask of the rows you want to keep (in the end)
keep_mask = df[0].str.match(r'^\w+$')

# for the rows that contain change, insertion, deletion data only:
# replace NaN values with 0 
df[~ keep_mask] = df[~ keep_mask].fillna(0, axis=1)

# back fill any missing nan values (should only affect columns 4-6)
# this should fill the row above each change, insertion, etc. row 
# with the appropriate values
df = df.bfill(limit=1)

# drop the rows that contain change, insertion, etc. data only
df = df[keep_mask].copy()

# replace any 0 values with np.nan
df.replace(0, np.nan, inplace=True)

# name the columns what you want
df.columns = ['sha1', 'author', 'date', 'message', 'changes', 'insertions', 'deletions']

print(df)

         sha1  author                 date                         message  \
0  f0a332fc65  User 1  2017-01-30 17:26:51  Merge branch 'dev' into master   
1  877134c7be  User 1  2017-01-30 14:46:55                     commitmsg 1   
3  557b90502d  User 1  2017-01-30 14:38:52                     commitmsg 2   
5  052788be45  User 2  2017-01-30 14:29:28                     commitmsg 3   

  changes insertions deletions  
0     NaN        NaN       NaN  
1       1         15       NaN  
3      10         51        56  
5       1        NaN         1 
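In the output above, the three count columns appear to hold strings rather than numbers, since `str.extract` returns text. As a sketch of one possible follow-up step, the counts could be converted to pandas' nullable integer dtype so that missing values stay missing. The small frame below is a hypothetical stand-in for the result printed above, not the answer's actual object.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame printed above: counts are digit strings or NaN
df = pd.DataFrame({
    'sha1': ['f0a332fc65', '877134c7be', '557b90502d', '052788be45'],
    'changes': [np.nan, '1', '10', '1'],
    'insertions': [np.nan, '15', '51', np.nan],
    'deletions': [np.nan, np.nan, '56', '1'],
})

count_cols = ['changes', 'insertions', 'deletions']
for col in count_cols:
    # to_numeric parses the digit strings; 'Int64' (capital I) is the nullable integer dtype
    df[col] = pd.to_numeric(df[col]).astype('Int64')

# Optionally treat "no change summary" as zero instead of missing
filled = df[count_cols].fillna(0)

print(df.dtypes)
print(filled)
```

Whether missing counts should stay `<NA>` or become 0 depends on whether a merge commit without a summary should be distinguishable from a commit with zero insertions.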

Answer 1: (score: 1)

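Consider walking through the text file line by line, checking which combination of changes, insertions and deletions each summary line contains, saving the values to a temporary list, and appending that to a larger list which is then passed to the pd.DataFrame() call.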

import pandas as pd
from io import StringIO  # `data` is the sample string from the question

rows = []
item = []

for line in StringIO(data):
    if '|' in line:  # commit line; a commit with no summary line is overwritten by the next one
        item = line.replace('\n', '').split('|')

    elif 'changed' in line:
        # take the leading number of each part (i[:3] would truncate counts of 100 or more)
        chg = [int(i.split()[0]) for i in line.replace('\n', '').split(',')]

        if 'insertion' in line and 'deletion' in line:
            item.extend(chg)                

        elif 'insertion' in line:                
            item.extend(chg + [0])                                

        elif  'deletion' in line:            
            item.extend([chg[0], 0, chg[1]])

        rows.append(item)                
        item = []

df = pd.DataFrame(rows, columns=['sha1', 'author', 'date', 'comment',
                                 'changes', 'insertions', 'deletions'])    
print(df)

#          sha1  author                 date      comment  changes  insertions  deletions
# 0  877134c7be  User 1  2017-01-30 14:46:55  commitmsg 1        1          15          0
# 1  557b90502d  User 1  2017-01-30 14:38:52  commitmsg 2       10          51         56
# 2  052788be45  User 2  2017-01-30 14:29:28  commitmsg 3        1           0          1
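Note that the loop above leaves out the merge commit, because a row is only appended once a change-summary line has been seen. As a sketch of a variant that keeps every commit, the code below appends a row for each commit line with zero counts and overwrites those counts when a summary line follows. The helper `count` and the `'|' in line` test for commit lines are assumptions made for illustration.

```python
import re
import pandas as pd
from io import StringIO

data = '''\
f0a332fc65|User 1|2017-01-30 17:26:51|Merge branch 'dev' into master
877134c7be|User 1|2017-01-30 14:46:55|commitmsg 1
 1 file changed, 15 insertions(+)

557b90502d|User 1|2017-01-30 14:38:52|commitmsg 2
 10 files changed, 51 insertions(+), 56 deletions(-)

052788be45|User 2|2017-01-30 14:29:28|commitmsg 3
 1 file changed, 1 deletion(-)
'''

def count(pattern, line):
    """Return the integer captured by `pattern` in `line`, or 0 if it is absent."""
    found = re.search(pattern, line)
    return int(found.group(1)) if found else 0

rows = []
for line in StringIO(data):
    line = line.rstrip('\n')
    if '|' in line:
        # commit line: start a row with zero counts
        rows.append(line.split('|', 3) + [0, 0, 0])
    elif 'changed' in line and rows:
        # summary line: overwrite the counts of the most recent commit
        rows[-1][4:] = [count(r'(\d+) files? changed', line),
                        count(r'(\d+) insertions?\(\+\)', line),
                        count(r'(\d+) deletions?\(-\)', line)]

df = pd.DataFrame(rows, columns=['sha1', 'author', 'date', 'comment',
                                 'changes', 'insertions', 'deletions'])
print(df)
```

Searching for each count separately avoids the if/elif chain over insertion and deletion combinations, at the cost of scanning each summary line three times.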