How do I move rows into columns for every n rows?

Time: 2017-06-30 10:05:06

Tags: python regex pandas

I have a single-column pandas df, like this:

[col1]
area123
account,time,day,total,users
code1,50s,monday,5,6
code2,40s,monday,5,6
area234
account,time,day,total,users
code5,20s,monday,4,9
code2,40s,monday,2,6
area26
.
.
.

How can I split it into multiple rows with a new column for the area, like this:

[area]     [account]    [time]    [day]   [totals]   [users]
 area123     code1        50s     monday    5          6
 area123     code2        40s     monday    5          6
 area234     code5        20s     monday    4          9
 area234     code2        40s     monday    2          6
    .          .          .          .      .          .
    .          .          .          .      .          .

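Note: the data repeats this structure every 4 rows: an area name, then a row of column names, then two rows of comma-separated values. So each area should be converted into 2 rows.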

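I was thinking of using a regex to split the data on the string 'area', or something like that.

Any help or direction would be great.

Thanks in advance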

2 answers:

Answer 0: (score: 1)

You can first create a numpy array with the modulo operator (the column is named col here):

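import numpy as np

# position of each row within its 4-row block (0 = area, 1 = header, 2 and 3 = data)
a = np.arange(len(df.index)) % 4
print (a)
[0 1 2 3 0 1 2 3 0]

# keep the area name only where a == 0, then forward-fill it down the block
df.insert(0, 'area', df['col'].mask(a != 0).ffill())
# keep only the data rows
df = df[np.isin(a, [2,3])].reset_index(drop=True)
# split the comma-separated values into separate columns
df[['account','time','day','total', 'users']] = df.pop('col').str.split(',', expand=True)
print (df)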

      area account time     day total users
0  area123   code1  50s  monday     5     6
1  area123   code2  40s  monday     5     6
2  area234   code5  20s  monday     4     9
3  area234   code2  40s  monday     2     6
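
To try the steps above end to end, here is a minimal self-contained sketch. The helper name `split_blocks`, its parameters, and the `COLUMNS` constant are hypothetical names introduced for illustration; the sample rows are the ones from the question, and the column is called `col` as in the snippet above. It assumes every block has the layout area, header, data, data.

```python
import numpy as np
import pandas as pd

# Column names taken from the header row shown in the question.
COLUMNS = ['account', 'time', 'day', 'total', 'users']

def split_blocks(df, col='col', block=4, data_positions=(2, 3)):
    """Hypothetical helper: assumes rows repeat as area, header, data, data."""
    # Position of each row inside its block.
    pos = np.arange(len(df)) % block
    # Keep the area name on the first row of each block and forward-fill it.
    area = df[col].where(pos == 0).ffill()
    # Select only the data rows and split them on commas.
    keep = np.isin(pos, data_positions)
    values = df.loc[keep, col].str.split(',', expand=True)
    values.columns = COLUMNS
    values.insert(0, 'area', area[keep])
    return values.reset_index(drop=True)

# Sample data from the question, including the trailing incomplete block.
sample = pd.DataFrame({'col': [
    'area123', 'account,time,day,total,users',
    'code1,50s,monday,5,6', 'code2,40s,monday,5,6',
    'area234', 'account,time,day,total,users',
    'code5,20s,monday,4,9', 'code2,40s,monday,2,6',
    'area26',
]})

result = split_blocks(sample)
print(result)
```

The `block` and `data_positions` arguments only make the fixed layout explicit; if the real data has a different number of rows per area, the positional approach does not apply as written.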

A more general solution:

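# rows containing a comma are header or data rows; the remaining rows are area names
mask = df['col'].str.contains(',')
# blank out the comma rows, then forward-fill the area name
df.insert(0, 'area', df['col'].mask(mask).ffill())
# drop the area rows and the repeated header rows
df = df[~((df['col'] == df['area'])|df['col'].str.contains('account,time,day,total,users'))]
df[['account','time','day','total', 'users']] = df.pop('col').str.split(',', expand=True)
print (df)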
      area account time     day total users
2  area123   code1  50s  monday     5     6
3  area123   code2  40s  monday     5     6
6  area234   code5  20s  monday     4     9
7  area234   code2  40s  monday     2     6
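
Because this variant looks at the content of each row rather than its position, it should also cope with areas that have a different number of data rows. The sketch below illustrates that under stated assumptions: the variable names (`raw`, `header`, `tidy`) are made up for the example, the rows are a subset of the question's sample so that the blocks are uneven, and an area row is assumed to be any row without a comma.

```python
import pandas as pd

# Subset of the question's rows: area123 has one data row, area234 has two,
# and area26 has none yet.
raw = pd.DataFrame({'col': [
    'area123', 'account,time,day,total,users',
    'code1,50s,monday,5,6',
    'area234', 'account,time,day,total,users',
    'code5,20s,monday,4,9', 'code2,40s,monday,2,6',
    'area26',
]})

header = 'account,time,day,total,users'
s = raw['col']

# Rows with a comma are header or data rows; everything else is an area name.
has_comma = s.str.contains(',', regex=False)
area = s.where(~has_comma).ffill()

# Data rows are the comma rows that are not the repeated header.
is_data = has_comma & (s != header)

tidy = s[is_data].str.split(',', expand=True)
tidy.columns = header.split(',')
tidy.insert(0, 'area', area[is_data])
tidy = tidy.reset_index(drop=True)
print(tidy)
```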

Answer 1: (score: 1)

numpy manipulation

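import numpy as np
import pandas as pd

c = df.col1.values

# every 4th entry is an area name; repeat each one for its 2 data rows
i = pd.Index(c[::4].repeat(2), name='area')
# the second entry is the header row; split it into column names
j = np.char.split(c[1], ',').tolist()
# view the entries as pairs, keep every second pair (the data rows), and split on commas
v = np.char.split(c.reshape(-1, 2)[1::2].ravel().astype(str), ',').tolist()

pd.DataFrame(v, i, j).reset_index()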

      area account time     day total users
0  area123   code1  50s  monday     5     6
1  area123   code2  40s  monday     5     6
2  area234   code5  20s  monday     4     9
3  area234   code2  40s  monday     2     6
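
The same idea can also be sketched by reshaping the values into one row per area, which some readers may find easier to follow. This is an illustrative variant rather than the answer's exact code: the names `blocks`, `areas`, `columns`, `values`, and `out` are assumptions, plain Python `str.split` replaces the numpy string functions, and the input must consist of complete 4-row blocks (no trailing partial block).

```python
import numpy as np
import pandas as pd

# The two complete blocks from the question.
c = np.array([
    'area123', 'account,time,day,total,users',
    'code1,50s,monday,5,6', 'code2,40s,monday,5,6',
    'area234', 'account,time,day,total,users',
    'code5,20s,monday,4,9', 'code2,40s,monday,2,6',
], dtype=object)

# One row per area: [area, header, data, data].
blocks = c.reshape(-1, 4)

# Each area name is repeated once per data row in its block.
areas = np.repeat(blocks[:, 0], 2)
# Column names come from the first header row.
columns = blocks[0, 1].split(',')
# Flatten the data columns in row-major order and split each entry.
values = [row.split(',') for row in blocks[:, 2:].ravel()]

out = pd.DataFrame(values, columns=columns)
out.insert(0, 'area', areas)
print(out)
```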