Question

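I have a long series of data in which the meaningful values are sandwiched between runs of 0s. This is what it looks like:

The lengths of the zero runs and of the meaningful sequences both vary. I want to extract the meaningful sequences, each one into its own row of a DataFrame. For example, the data above would be extracted to: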

1
2   3   1
1

I use this code to "slice" out the meaningful data:

import pandas as pd
import numpy as np

raw = pd.read_csv('data.csv')

df = pd.DataFrame(index=np.arange(0, 10000),columns = ['DT01', 'DT02', 'DT03', 'DT04', 'DT05', 'DT06', 'DT07', 'DT08', 'DT02', 'DT09', 'DT10', 'DT11', 'DT12', 'DT13', 'DT14', 'DT15', 'DT16', 'DT17', 'DT18', 'DT19', 'DT20',])
a = 0
b = 0
n=0

for n in range(0,999999):
    if raw.iloc[n].values > 0:
        df.iloc[a,b] = raw.iloc[n].values
        a=a+1
        if raw [n+1] == 0:
            b=b+1
            a=0

But I keep getting KeyError: n, where n is the row right after the first row whose value is not 0.

What is wrong with my code? And is there a way to improve it in terms of speed and memory cost? Thank you very much.

Answer 1

Let's try this, which outputs a DataFrame:

df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
  .apply(lambda x: x.reset_index(drop=True)).unstack(1)

Output:

     0    1    2
0  1.0  NaN  NaN
1  2.0  3.0  1.0
2  1.0  NaN  NaN

Or as strings:

df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
  .apply(lambda x: ' '.join(x.astype(str)))

Output:

0        1
1    2 3 1
2        1
dtype: object

Or as lists:

df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
  .apply(list)

Answer 2

Try this; I have broken it down into steps:

df.LIST=df.LIST.replace({0:np.nan})
df['Group']=df.LIST.isnull().cumsum()
df=df.dropna()
df.groupby('Group').LIST.apply(list)
Out[384]: 
Group
2              [1]
4        [2, 3, 1]
8              [1]
Name: LIST, dtype: object

Data input:

df = pd.DataFrame({'LIST' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})

Answer 3

You can use:

df['Group'] = df['col'].eq(0).cumsum()
df = df.loc[ df['col'] != 0]

df = df.groupby('Group')['col'].apply(list)
print (df)

Group
2          [1]
4    [2, 3, 1]
8          [1]
Name: col, dtype: object

df = pd.DataFrame(df.groupby('Group')['col'].apply(list).values.tolist())
print (df)
   0    1    2
0  1  NaN  NaN
1  2  3.0  1.0
2  1  NaN  NaN
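
If the mix of integers and floats in the table above is unwanted, one possible variation (an assumption about what you need, not part of the original answer) is to cast the result to pandas' nullable `Int64` dtype. The sketch below reruns the same steps on the sample data from Answer 2; `nonzero`, `runs` and `wide` are illustrative names.

```python
import pandas as pd

# Assumed sample data, same values as in Answer 2 but in a column named 'col'.
df = pd.DataFrame({'col': [0, 0, 1, 0, 0, 2, 3, 1, 0, 0, 0, 0, 1, 0]})

group = df['col'].eq(0).cumsum()   # run counter, advances on zeros
nonzero = df['col'] != 0           # rows worth keeping

# One list per run of nonzero values.
runs = df.loc[nonzero, 'col'].groupby(group[nonzero]).apply(list)

# One row per run; Int64 keeps integers and uses <NA> for the padding.
wide = pd.DataFrame(runs.tolist()).astype('Int64')
print(wide)
```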

Answer 4

Let's first pack your raw data into a pandas DataFrame (in real life, you would probably use pd.read_csv() to produce this DataFrame):

raw = pd.DataFrame({'0' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})

The default index will help you locate the spans of zeros:

s1 = raw.reset_index()
s1['index'] = np.where(s1['0'] != 0, np.nan, s1['index'])
s1['index'] = s1['index'].ffill().fillna(0).astype(int)
s1[s1['0'] != 0].groupby('index')['0'].apply(list).tolist()
#[[1], [2, 3, 1], [1]]
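
As for the KeyError in the question: a likely cause (an assumption, since the CSV itself is not shown) is the line `raw [n+1] == 0`. Indexing a DataFrame with plain square brackets looks up a column label, not a row, so `raw[n+1]` fails as soon as the first nonzero row is reached. A minimal sketch using a small stand-in frame for `raw`:

```python
import pandas as pd

# Stand-in for the frame read from data.csv (assumed: a single column).
raw = pd.DataFrame({'0': [0, 0, 1, 0]})
n = 2  # position of the first nonzero row

caught = None
try:
    raw[n + 1]               # looks for a COLUMN labelled 3, not row 3
except KeyError as exc:
    caught = exc
    print('KeyError:', exc)

# Positional row access goes through .iloc instead.
next_value = raw.iloc[n + 1, 0]
print(next_value)
```

Even with that lookup fixed, the row-by-row loop will be slow on long inputs, and `range(0, 999999)` will run past the end of the data unless it is bounded by `len(raw)`; the groupby approaches above avoid both problems.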

KeyError when assigning values with a for loop in pandas

4 answers: