Question

我有一个161941行×76列的大数据CSV文件，其中我提取了161941行×3列的有用数据。

现在我的数据框看起来像这样

Extracted Dataframme of size 161941 rows × 3 columns

“ bKLR_Touchauswertung”列为定期数据，其外观为这种形式

"bKLR_Touchauswertung"
7
7
10
10
10
10
10
7
7
0
0
0
0
0
0
0
0
0
0
7
7
10
10
10
10
10
10
7
7
0
0
0
0
0
0
0
0
7
7
10
10
10
10
10
7
7
0
0
0
0
0
0

它一直重复到最后

我想从中得到的是

该列中的每个非零值集都应采用并将其作为新列添加到数据框。

让我们说，第一组非零值应作为新列“ set1”，依此类推。

如果我能找到任何可能的解决方案，那就太好了。谢谢， Abhinay

这是初始和预期数据帧的更详细示例：

这是我下面的数据框

               temp     toucha
Timestamp      

**185            83         7
191            83         7
197            83         10
.              .          .
.              .          .
.              .          .
2051           83         10**

2057           83         0
2063           83         0
2057           83         0
.              .          .
.              .          .
.              .          .
3000           83         0

**3006           83         7
3012           83         7
3018           83         10
.              .          .
.              .          .
.              .          .
6000           83         10**

6006           83         0
6012           83         0
6018           83         0
.              .          .
.              .          .
.              .          .
8000           83         0

此序列继续进行，

现在，我需要一个看起来像这样的数据框

                temp     toucha  set1   set2    ste3.............
Timestamp      

**185            83         7     7      0
191            83         7      7      0
197            83         10     10     0 
.              .          .      .      .
.              .          .      .      .
.              .          .      .      .
2051           83         10     10     0**

2057           83         0      0      0
2063           83         0      0      0
2057           83         0      0      0
.              .          .      .      .
.              .          .      .      .
.              .          .      .      .
3000           83         0      0      0

**3006           83         7     0      7
3012           83         7      0      7
3018           83         10     0      10
.              .          .      .      .
.              .          .      .      .
.              .          .      .      .
6000           83         10     0      10**

6006           83         0      0      0
6012           83         0      0      0
6018           83         0      0      0
.              .          .      .      .
.              .          .      .      .
.              .          .      .      . 
8000           83         0      0      0

Answer 1

如果您可以接受setxx列的编号不一定是连续的，则可以使用shift来检测0和非0值之间的变化，然后使用np.split来拆分数据帧索引这些变化。

完成此操作后，可以很容易地为每个序列添加一个新的0列，并在其中复制原始值。但是由于np.split，使用简单的连续索引会更容易。因此代码可能是：

# use a simple consecutive index
df.reset_index(inplace=True)

# split the indices on transition between null and non null values
subs = np.split(df.index.values,
                df[((df.toucha == 0)&(df.toucha.shift() != 0)
                     |(df.toucha != 0)&(df.toucha.shift() == 0))
                    ].index.values)

# process those sequences
for i, a in enumerate(subs):
    # ignore empty or 0 value sequences
    if len(a) == 0: continue
    if df.toucha[a[0]] == 0: continue
    df['set'+str(i)] = 0    # initialize a new column with 0
    df.loc[a, 'set'+str(i)] = df.toucha.loc[a]  # and copy values

# set the index back
df.set_index('Timestamp', inplace=True)

使用以下示例数据

           temp  toucha
Timestamp              
185          83       7
191          83       7
197          83      10
2051         83      10
2057         83       0
2063         83       0
2057         83       0
3000         83       0
3006         83       7
3012         83       7
3018         83      10
6000         83      10
6006         83       0
6012         83       0
6018         83       0
8000         83       0

它给出：

           temp  toucha  set0  set2
Timestamp                          
185          83       7     7     0
191          83       7     7     0
197          83      10    10     0
2051         83      10    10     0
2057         83       0     0     0
2063         83       0     0     0
2057         83       0     0     0
3000         83       0     0     0
3006         83       7     0     7
3012         83       7     0     7
3018         83      10     0    10
6000         83      10     0    10
6006         83       0     0     0
6012         83       0     0     0
6018         83       0     0     0
8000         83       0     0     0

Answer 2

# use a simple consecutive index
df.reset_index(inplace=True)

# split the indices on transition between null and non null values
subs = np.split(df.index.values,
            df[((df.toucha == 0)&(df.toucha.shift() != 0)
                 |(df.toucha != 0)&(df.toucha.shift() == 0))
                ].index.values)

# process those sequences
for i, a in enumerate(subs):
    # ignore empty or 0 value sequences
    if len(a) == 0: continue
    if df.toucha[a[0]] == 0: continue
    df['set'+str(i)] = 0    # initialize a new column with 0
    df.loc[a, 'set'+str(i)] = df.toucha.loc[a]  # and copy values

# set the index back
df.set_index('Timestamp', inplace=True)

“如何从Pandas Dataframe的单个列中提取定期数据”

2 个答案: