我有一个熊猫数据框:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)
输出:
Text Selection_Values
0 Hi 0
1 this is 0
2 just 0
3 a 0
4 single 0
5 sentence. 1
6 This 0
7 NaN 0
8 is another one. 1
9 This is 0
10 a 0
11 third 0
12 sentence 0
13 . 0
现在,我想基于Text
列将Selection Value
列重新分组为2D数组。 0
(第一个整数,或1
之后)和1
(包括)之间出现的所有单词都应放入2D数组中。数据集的最后一个句子可能没有结束1
。可以按照以下问题中的说明进行操作:Regroup pandas column into 2D list based on another column
[["Hi this is just a single sentence."],["This is another one"], ["This is a third sentence ."]]
我想进一步说明以下条件:如果列表中有超过max_number_of_cells_per_list
个非NaN单元,则该列表应分为大致相等的部分,其中包含最多+/- 1个max_number_of_cells_per_list
单元元素。
让我们说:max_number_of_cells_per_list
= 2,则预期输出应为:
[["Hi this is"], ["just a"], ["single sentence."],["This is another one"], ["This is"], ["a third sentence ."]]
示例:
基于“选择值”列,可以使用以下方法将单元格重新分组为以下2D列表:
[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]
输出(原始列表):
[["Hi this is just a single sentence."],["This is another one"], ["This is a third sentence ."]]
让我们看一下这些列表中的单元格数量:
如您所见,list1有6个单元格,list 2有2个单元格,list 3有5个单元格。
现在,我要实现的目标是:如果列表中有多个单元格,则应将其拆分,以使每个结果列表具有所需单元格的+/- 1 。
例如max_number_of_cells_per_list
= 2
您看到这样做的方法了吗?
编辑: 重要说明:原始列表中的单元格不应放入相同的列表中。
编辑2:
Text Selection_Values New
0 Hi 0 1.0
1 this is 0 0.0
2 just 0 1.0
3 a 0 0.0
4 single 0 1.0
5 sentence. 1 0.0
6 This 0 1.0
7 NaN 0 0.0
8 is another one. 1 1.0
9 This is 0 0.0
10 a 0 1.0
11 third 0 0.0
12 sentence 0 0.0
13 . 0 NaN
答案 0 :(得分:4)
IIUC,您可以执行以下操作:
n=2 #change this as you like for no. of splits
s=df.Text.dropna().reset_index(drop=True)
c=s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False)
[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]
[['Hi this is'], ['just a'], ['single sentence.'],
['This is another one.'], ['This is a'], ['third sentence .']]
编辑:
d=dict(zip(df.loc[df.Text.notna(),'Text'].index,c.index))
ser=pd.Series(d)
df['new']=ser.reindex(range(ser.index.min(),
ser.index.max()+1)).map(c).fillna(False).astype(int)
print(df)
Text Selection_Values new
0 Hi 0 1
1 this is 0 0
2 just 0 1
3 a 0 0
4 single 0 1
5 sentence. 1 0
6 This 0 1
7 NaN 0 0
8 is another one. 1 0
9 This is 0 1
10 a 0 0
11 third 0 1
12 sentence 0 0
13 . 0 0
答案 1 :(得分:2)
这是一个相当长且庞大的代码,但是可以完成工作! :)
selection_values = df["Selection_Values"].tolist()
max_number_of_cells_per_list = 3
a = [[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]
print(a)
number_of_cells = 0
j = 0
for i in range(len(df['Text'])):
if isinstance(df['Text'][i], str):
number_of_cells += 1
if df["Selection_Values"][i] == 1 or i == len(df['Text'])-1:
print("j: ", j)
if number_of_cells > max_number_of_cells_per_list:
print(number_of_cells,max_number_of_cells_per_list)
print("\nmax number of cells reached")
n = np.ceil(np.divide(number_of_cells,max_number_of_cells_per_list))
print("deviding into ", n, " cells")
add = int((i-j)/n)
print("add", add)
for k in range(int(n)):
if k == n-1:
j = i
else:
j += add
print("j: ", j)
selection_values[j] = 1
print("\n")
# Reset Cell Counter Every time a new list should start
number_of_cells = 0
j = i
df['Selection_Values'] = selection_values
print("\n", df)
a = [[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]
print(a)
您得到:
Text Selection_Values
0 Hi 0
1 this is 0
2 just 0
3 a 0
4 single 0
5 sentence. 1
6 This 0
7 NaN 0
8 is another one. 1
9 This is 0
10 a 0
11 third 0
12 sentence 0
13 . 0
[['Hi this is just a single sentence.'], ['This is another one.'], ['This is a third sentence .']]
j: 0
6 3
max number of cells reached
deviding into 2.0 cells
add 2
j: 2
j: 5
j: 5
j: 8
5 3
max number of cells reached
deviding into 2.0 cells
add 2
j: 10
j: 13
Text Selection_Values
0 Hi 0
1 this is 0
2 just 1
3 a 0
4 single 0
5 sentence. 1
6 This 0
7 NaN 0
8 is another one. 1
9 This is 0
10 a 1
11 third 0
12 sentence 0
13 . 1
[['Hi this is just'], ['a single sentence.'], ['This is another one.'], ['This is a'], ['third sentence .']]