I have a CSV file with the following contents:
a b
ca 12, 20, 45
ca 18, 27
ca 30, 32, 41, 49
ny 4, 12, 12, 37, 43
ny 33
ny 8, 10, 40, 44
How can I read this data into Python as a pandas DataFrame and get the mean and sum of each row?
Example of the expected sums:
a b
ca 77
45
152
ny 108
33
102
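Answer 0 (score: 1)
This is not easy, because the CSV is not well structured, as BrenBarn pointed out.
Solution:
The main problem is that the number of columns is not known in advance. Without the names parameter, read_csv fails with an error like this:
CParserError: Error tokenizing data. C error: Expected 4 fields in line 4, saw 5
To avoid it, pass names to read_csv, which means picking some constant that is at least as large as the widest row, e.g. N = 20: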
import pandas as pd
from io import StringIO  # pandas.compat.StringIO no longer exists in current pandas
temp=u""" a b
ca 12, 20, 45
ca 18, 27
ca 30, 32, 41, 49
ny 4, 12, 12, 37, 43
ny 33
ny 8, 10, 40, 44
"""
# after testing, replace StringIO(temp) with 'filename.csv'
N = 20
df = pd.read_csv(StringIO(temp), sep=r"\s+", names=range(N), skiprows=1)
# print(df)
# use the first column as the index, drop columns that are entirely NaN, cast to str
df = df.set_index(0).rename_axis('a').dropna(axis=1, how='all').astype(str)
# strip the trailing commas and spaces, then cast to float
df = df.apply(lambda x: x.str.strip(' ,')).astype(float)
# sum each row and, if necessary, cast to int
df1 = df.sum(axis=1).astype(int).rename('b').reset_index()
print(df1)
    a    b
0  ca   77
1  ca   45
2  ca  152
3  ny  108
4  ny   33
5  ny  102
# if repeated labels in column a should be blanked out
mask = df1.a != df1.a.shift()
df1['a'] = df1.a.where(mask, '')
print(df1)
    a    b
0  ca   77
1       45
2      152
3  ny  108
4       33
5      102
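If guessing a constant N feels fragile, the rows can also be parsed by hand before building the DataFrame. This is a different technique from the read_csv approach above, shown only as a minimal sketch. The names raw, records and manual are made up for illustration, and the sample string stands in for the real file.

```python
from io import StringIO

import pandas as pd

# Sample data standing in for the file on disk (an assumption for this sketch).
raw = """a b
ca 12, 20, 45
ca 18, 27
ca 30, 32, 41, 49
ny 4, 12, 12, 37, 43
ny 33
ny 8, 10, 40, 44
"""

records = []
lines = StringIO(raw)  # swap for open('file.csv') when reading from disk
next(lines)            # skip the header line
for line in lines:
    if not line.strip():
        continue  # ignore blank lines
    # The first whitespace-separated token is the label, the rest are the numbers.
    label, rest = line.split(None, 1)
    numbers = [float(tok) for tok in rest.split(",")]
    records.append({"a": label, "b": int(sum(numbers))})

manual = pd.DataFrame(records, columns=["a", "b"])
print(manual)
```

Since every line is split on its own, no column count is needed up front. The sketch assumes every data line has a label followed by at least one number.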
A more dynamic solution:
# get the maximum number of whitespace-separated fields on any line
data = []
with open('file.csv') as f:
    lines = f.readlines()
    for line in lines:
        data.append(len(line.split()))
# if necessary add 1
N = max(data)
print(N)
6
df = pd.read_csv('file.csv', sep=r"\s+", skiprows=1, names=range(N))
print(df)
    0    1    2    3    4     5
0  ca  12,  20,   45  NaN   NaN
1  ca  18,   27  NaN  NaN   NaN
2  ca  30,  32,  41,   49   NaN
3  ny   4,  12,  12,  37,  43.0
4  ny   33  NaN  NaN  NaN   NaN
5  ny   8,  10,  40,   44   NaN
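The frame printed above still contains the trailing commas, so the cleanup from the first solution has to follow. A possible end-to-end sketch of the dynamic variant is below. The helper read_ragged and the names raw, wide, values and sums are hypothetical, and the data is held in a string here so the example runs without a file.

```python
from io import StringIO

import pandas as pd

# Sample data standing in for 'file.csv' (an assumption for this sketch).
raw = """a b
ca 12, 20, 45
ca 18, 27
ca 30, 32, 41, 49
ny 4, 12, 12, 37, 43
ny 33
ny 8, 10, 40, 44
"""


def read_ragged(text):
    """Hypothetical helper: read whitespace-separated rows of unequal length."""
    # Width of the widest row, counted in whitespace-separated tokens.
    n_cols = max(len(line.split()) for line in text.splitlines())
    return pd.read_csv(StringIO(text), sep=r"\s+", names=range(n_cols), skiprows=1)


wide = read_ragged(raw)
# Same cleanup as in the first solution: index by label, strip commas, cast to float.
values = wide.set_index(0).rename_axis("a").astype(str)
values = values.apply(lambda col: col.str.strip(" ,")).astype(float)
sums = values.sum(axis=1).astype(int).rename("b").reset_index()
print(sums)
```

Because the column count is taken from the data itself, no dropna step is needed to discard unused columns.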
Answer 1 (score: 1)
Setup
import pandas as pd
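from io import StringIO  # pandas.compat.StringIO no longer exists in current pandas
# columns a and b are separated by two or more spaces
txt = """a   b
ca  12, 20, 45
ca  18, 27
ca  30, 32, 41, 49
ny  4, 12, 12, 37, 43
ny  33
ny  8, 10, 40, 44"""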
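Solution
Read the file using \s{2,} as the separator, which matches two or more whitespace characters. This splits each line into columns a and b. Column b can then be processed afterwards.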
df = pd.read_csv(StringIO(txt), sep=r'\s{2,}', engine='python', index_col=0)
# split b on commas, convert to float, sum each row, and cast back to int
df = df.b.str.split(r',\s*', expand=True).astype(float) \
       .sum(axis=1).astype(int).to_frame(name='b')
print(df)
      b
a
ca   77
ca   45
ca  152
ny  108
ny   33
ny  102
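The question also asks for the mean of each row, which neither answer computes. A minimal sketch that reuses the same splitting idea to produce both values is below. It assumes columns a and b are separated by at least two spaces, and the names nums and result are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Assumed layout: two or more spaces between column a and column b.
txt = """a   b
ca  12, 20, 45
ca  18, 27
ca  30, 32, 41, 49
ny  4, 12, 12, 37, 43
ny  33
ny  8, 10, 40, 44
"""

df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", engine="python")
# One column per number; shorter rows are padded with missing values.
nums = df["b"].astype(str).str.split(r",\s*", expand=True).astype(float)
# sum and mean skip missing values by default, so padding does not distort them.
result = pd.DataFrame({
    "a": df["a"],
    "sum": nums.sum(axis=1).astype(int),
    "mean": nums.mean(axis=1),
})
print(result)
```

For the first row this gives a sum of 77 and a mean of about 25.67, since only the three numbers actually present are counted.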