I have a pandas DataFrame with a single column, like this:
[col1]
area123
account,time,day,total,users
code1,50s,monday,5,6
code2,40s,monday,5,6
area234
account,time,day,total,users
code5,20s,monday,4,9
code2,40s,monday,2,6
area26
.
.
.
How can I split this into multiple rows with a new area column, as shown below?
[area] [account] [time] [day] [totals] [users]
area123 code1 50s monday 5 6
area123 code2 40s monday 5 6
area234 code5 20s monday 4 9
area234 code2 40s monday 2 6
. . . . . .
. . . . . .
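Note: the data repeats this structure every 4 rows: the area name, then the column names, then two rows of comma-separated values. So each area should be converted into 2 rows.
I was thinking of using a regex to split the data on the string 'area', or something like that.
Any help or direction would be great.
Thanks in advance.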
Answer 0 (score: 1)
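You can first create a numpy array of row positions with the modulo operator. Then insert a new column at the first position, built with mask and forward-filled with ffill. Filter the rows with numpy.in1d and boolean indexing. Finally, remove the original column with pop and expand it with str.split: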
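import numpy as np

# position of each row inside its 4-row block: 0 = area, 1 = header, 2 and 3 = data
a = np.arange(len(df.index)) % 4
print (a)
[0 1 2 3 0 1 2 3 0]
# keep the area name only on position 0, then forward-fill it over the block
df.insert(0, 'area', df['col1'].mask(a != 0).ffill())
# keep only the data rows (np.isin is the newer name for np.in1d)
df = df[np.in1d(a, [2,3])].reset_index(drop=True)
df[['account','time','day','total', 'users']] = df.pop('col1').str.split(',', expand=True)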
print (df)
area account time day total users
0 area123 code1 50s monday 5 6
1 area123 code2 40s monday 5 6
2 area234 code5 20s monday 4 9
3 area234 code2 40s monday 2 6
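The mask and ffill step is the least obvious part, so here is a minimal sketch of it in isolation. The sample frame is rebuilt by hand from the question (8 rows, column named col1), and the names rows, pos and area are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumed: exactly 4 lines per area).
rows = [
    "area123",
    "account,time,day,total,users",
    "code1,50s,monday,5,6",
    "code2,40s,monday,5,6",
    "area234",
    "account,time,day,total,users",
    "code5,20s,monday,4,9",
    "code2,40s,monday,2,6",
]
df = pd.DataFrame({"col1": rows})

# Position of each row inside its 4-row block: 0 = area, 1 = header, 2 and 3 = data.
pos = np.arange(len(df)) % 4

# mask() blanks out every row where the condition is True (everything except
# the area rows); ffill() then copies each area name down over its block.
area = df["col1"].mask(pos != 0).ffill()
print(area.tolist())
```

Each area name ends up repeated on all four rows of its block, which is why the later filter on positions 2 and 3 still leaves every data row labelled with its area.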
A more general solution, which does not rely on the 4-row pattern:
# rows containing a comma are header or data rows; the rest are area names
mask = df['col1'].str.contains(',')
df.insert(0, 'area', df['col1'].mask(mask).ffill())
# drop the area rows and the repeated header rows
df = df[~((df['col1'] == df['area'])|df['col1'].str.contains('account,time,day,total,users'))].copy()
df[['account','time','day','total', 'users']] = df.pop('col1').str.split(',', expand=True)
print (df)
area account time day total users
2 area123 code1 50s monday 5 6
3 area123 code2 40s monday 5 6
6 area234 code5 20s monday 4 9
7 area234 code2 40s monday 2 6
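The same idea can be wrapped in a function that reads the column names from the data instead of hard-coding them. This is only a sketch: split_area_blocks is a hypothetical helper, and it assumes that area lines never contain a comma and that the first comma line is the header repeated for every area.

```python
import pandas as pd

def split_area_blocks(df, col="col1"):
    """Hypothetical helper: turn area/header/data lines into a tidy frame."""
    s = df[col]
    has_comma = s.str.contains(",", regex=False)
    # Lines without a comma are taken to be area names; carry them downwards.
    area = s.where(~has_comma).ffill()
    # Assumption: the first comma line is the header, repeated for every area.
    header = s[has_comma].iloc[0]
    data = s[has_comma & (s != header)]
    out = data.str.split(",", expand=True)
    out.columns = header.split(",")
    out.insert(0, "area", area[data.index])
    return out.reset_index(drop=True)

# Sample rebuilt from the question, including a trailing area with no data yet.
rows = [
    "area123",
    "account,time,day,total,users",
    "code1,50s,monday,5,6",
    "code2,40s,monday,5,6",
    "area234",
    "account,time,day,total,users",
    "code5,20s,monday,4,9",
    "code2,40s,monday,2,6",
    "area26",
]
result = split_area_blocks(pd.DataFrame({"col1": rows}))
print(result)
```

Because the area is found by content rather than by position, areas with a different number of data rows, or none at all such as area26 here, should not break it. All values remain strings, so total and users would still need a numeric conversion.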
Answer 1 (score: 1)
numpy manipulation
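import numpy as np
import pandas as pd

c = df.col1.values
# every 4th value is an area name; repeat it once per data row (2 per area)
i = pd.Index(c[::4].repeat(2), name='area')
# the column names come from the first header row
j = np.char.split(c[1], ',').tolist()
# pair up the rows, keep every second pair (the data rows) and split on commas
v = np.char.split(c.reshape(-1, 2)[1::2].ravel().astype(str), ',').tolist()
pd.DataFrame(v, i, j).reset_index()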
area account time day total users
0 area123 code1 50s monday 5 6
1 area123 code2 40s monday 5 6
2 area234 code5 20s monday 4 9
3 area234 code2 40s monday 2 6
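The reshaping idea may be easier to follow with one row per 4-line block. This is a variant sketch, not the code above: it assumes the number of lines is an exact multiple of 4 and uses plain str.split instead of the numpy string functions. The sample array is rebuilt by hand from the question.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumed: exactly 4 lines per area).
rows = [
    "area123",
    "account,time,day,total,users",
    "code1,50s,monday,5,6",
    "code2,40s,monday,5,6",
    "area234",
    "account,time,day,total,users",
    "code5,20s,monday,4,9",
    "code2,40s,monday,2,6",
]
c = np.array(rows, dtype=object)

# One row per block: [area, header, data line 1, data line 2].
blocks = c.reshape(-1, 4)
areas = blocks[:, 0].repeat(2)       # each area name once per data line
columns = blocks[0, 1].split(",")    # header taken from the first block
values = [line.split(",") for line in blocks[:, 2:].ravel()]

result = pd.DataFrame(values, columns=columns)
result.insert(0, "area", areas)
print(result)
```

If the data ends with an incomplete block, such as the lone area26 line in the question, reshape raises a ValueError, so the array would have to be trimmed to a multiple of 4 first.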