Question

我有一个熊猫数据框：

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)

输出：

               Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 0
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 0
11            third                 0
12         sentence                 0
13                .                 0

现在，我想基于Text列将Selection Value列重新分组为2D数组。 0（第一个整数，或1之后）和1（包括）之间出现的所有单词都应放入2D数组中。数据集的最后一个句子可能没有结束1。可以按照以下问题中的说明进行操作：Regroup pandas column into 2D list based on another column

[["Hi this is just a single sentence."],["This is another one"], ["This is a third sentence ."]]

我想进一步说明以下条件：如果列表中有超过max_number_of_cells_per_list个非NaN单元，则该列表应分为大致相等的部分，其中包含最多+/- 1个max_number_of_cells_per_list单元元素。

让我们说：max_number_of_cells_per_list = 2，则预期输出应为：

 [["Hi this is"], ["just a"], ["single sentence."],["This is another one"], ["This is"], ["a third sentence ."]]

示例：

基于“选择值”列，可以使用以下方法将单元格重新分组为以下2D列表：

[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]

输出（原始列表）：

[["Hi this is just a single sentence."],["This is another one"], ["This is a third sentence ."]]

让我们看一下这些列表中的单元格数量：

如您所见，list1有6个单元格，list 2有2个单元格，list 3有5个单元格。

现在，我要实现的目标是：如果列表中有多个单元格，则应将其拆分，以使每个结果列表具有所需单元格的+/- 1 。

例如max_number_of_cells_per_list = 2

修改后的列表：

您看到这样做的方法了吗？

编辑：重要说明：原始列表中的单元格不应放入相同的列表中。

编辑2：

               Text  Selection_Values  New
0                Hi                 0  1.0
1           this is                 0  0.0
2              just                 0  1.0
3                 a                 0  0.0
4            single                 0  1.0
5         sentence.                 1  0.0
6              This                 0  1.0
7               NaN                 0  0.0
8   is another one.                 1  1.0
9           This is                 0  0.0
10                a                 0  1.0
11            third                 0  0.0
12         sentence                 0  0.0
13                .                 0  NaN

Answer 1

IIUC，您可以执行以下操作：

n=2 #change this as you like for no. of splits
s=df.Text.dropna().reset_index(drop=True)
c=s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False)

[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]

[['Hi this is'], ['just a'], ['single sentence.'], 
    ['This is another one.'], ['This is a'], ['third sentence .']]

编辑：

d=dict(zip(df.loc[df.Text.notna(),'Text'].index,c.index))
ser=pd.Series(d)
df['new']=ser.reindex(range(ser.index.min(),
                        ser.index.max()+1)).map(c).fillna(False).astype(int)
print(df)

               Text  Selection_Values  new
0                Hi                 0    1
1           this is                 0    0
2              just                 0    1
3                 a                 0    0
4            single                 0    1
5         sentence.                 1    0
6              This                 0    1
7               NaN                 0    0
8   is another one.                 1    0
9           This is                 0    1
10                a                 0    0
11            third                 0    1
12         sentence                 0    0
13                .                 0    0

Answer 2

这是一个相当长且庞大的代码，但是可以完成工作！：）

selection_values = df["Selection_Values"].tolist()    
max_number_of_cells_per_list = 3
a = [[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]
print(a)

number_of_cells = 0
j = 0
for i in range(len(df['Text'])):
    if isinstance(df['Text'][i], str): 
        number_of_cells += 1



    if df["Selection_Values"][i] == 1 or i == len(df['Text'])-1:

        print("j: ", j)
        if number_of_cells > max_number_of_cells_per_list:
            print(number_of_cells,max_number_of_cells_per_list)
            print("\nmax number of cells reached")
            n = np.ceil(np.divide(number_of_cells,max_number_of_cells_per_list))
            print("deviding into ", n, " cells")
            add = int((i-j)/n)
            print("add", add)
            for k in range(int(n)):

                if k == n-1:
                    j = i
                else:
                    j += add
                print("j: ", j)
                selection_values[j] = 1
            print("\n")

        # Reset Cell Counter Every time a new list should start        
        number_of_cells = 0
        j = i


df['Selection_Values'] = selection_values
print("\n", df)


a = [[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]
print(a)

您得到：

               Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 0
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 0
11            third                 0
12         sentence                 0
13                .                 0
[['Hi this is just a single sentence.'], ['This is another one.'], ['This is a third sentence .']]
j:  0
6 3

max number of cells reached
deviding into  2.0  cells
add 2
j:  2
j:  5


j:  5
j:  8
5 3

max number of cells reached
deviding into  2.0  cells
add 2
j:  10
j:  13



                Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 1
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 1
11            third                 0
12         sentence                 0
13                .                 1
[['Hi this is just'], ['a single sentence.'], ['This is another one.'], ['This is a'], ['third sentence .']]

从熊猫数据帧构造2D数组

2 个答案: