Question

I am trying to group data by the Name column so that each name appears as a single unique row.

Sample df

  Name            Description

'Apple'          'A Succulent Fruit'
'Bottom'         'Depending on the context body area'
'Jeans'          'A unisex clothing item'
'Boots'          'A type of shoe or a clothing item'
'Boots'          'A popular clothing item for the winter'
'Apple'          'some people name their children after this fruit'

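With this dataframe, I group by the unique names, use a regex pattern to extract keywords from a keyword list, and assign the result to a new column called 'Type'.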

keyword_list = ['Fruit','body area', 'clothing item']

Ideally, it should return the following:

     Name         Type

    'Apple'      'Fruit'
    'Bottom'     'body area'
    'Jeans'      'clothing item'
    'Boots'      'clothing item'

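This works as intended, but I am running into data loss. The dataframe of unique names is 933 x 1 ("Name" x "Type"), but the returned dataframe is 775 x 1 (the two should be the same size). This means some rows are being ignored or never actually appended.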

Here is the code I am currently using:

keyword_list = ['Fruit','body area', 'clothing item']

ptn = r'\b(' + '|'.join(keyword_list) + r')\b'

test_df = df.set_index('Name').Desc.str.extractall(ptn).reset_index(level=1, drop=False)[0]

pre_shape = test_df.groupby('Name').apply(lambda x: x.value_counts().idxmax(skipna=False)).to_frame('Type')

reshaped_df = pre_shape.pivot_table(index='Name', values='Type',
                                            aggfunc=lambda x: ' '.join(str(v) for v in x))

new_df = pd.merge(reshaped_df, odf, on=['Name'], how='inner') # 'odf' is another dataframe of size 933 x 1

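For every unique value ("Name"), the description column is non-empty and contains at least one keyword, so I am not sure why some of these rows are skipped.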
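One way to test that assumption is to check, per name, whether any description actually matches the pattern. The sketch below uses a few of the sample rows plus a hypothetical 'Scarf' row whose keyword is written in a different case; the names `has_kw`, `matched_any` and `unmatched` are illustrative and not part of the original code. The pattern is case-sensitive, so 'Clothing Item' does not count as 'clothing item'.

```python
import pandas as pd

keyword_list = ['Fruit', 'body area', 'clothing item']
# Non-capturing group, so str.contains does not warn about match groups
ptn = r'\b(?:' + '|'.join(keyword_list) + r')\b'

df = pd.DataFrame({
    'Name': ['Apple', 'Bottom', 'Jeans', 'Boots', 'Scarf'],
    'Description': [
        'A Succulent Fruit',
        'Depending on the context body area',
        'A unisex clothing item',
        'A type of shoe or a clothing item',
        'A winter Clothing Item',  # hypothetical row: keyword in a different case
    ],
})

# True for each row whose description contains at least one keyword
has_kw = df['Description'].str.contains(ptn, regex=True)

# Names for which no row matched at all
matched_any = has_kw.groupby(df['Name']).any()
unmatched = matched_any[~matched_any].index.tolist()
print(unmatched)
```

If `unmatched` is non-empty on the real data, those names are the ones that disappear later in the pipeline.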

Here is what I have tried:

new_df = pd.merge(reshaped_df, odf, on=['Name'], how='outer') # How set to 'outer'

This returns a df of the expected size, but the missing values are now simply NaN.
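The NaN rows are the names that never made it into `reshaped_df`. A minimal sketch of listing them with the `indicator=True` option of `pd.merge`, using small stand-in frames (the contents of `reshaped_df` and `odf` here are made up for illustration):

```python
import pandas as pd

# Stand-ins for the real frames; the values are assumptions for illustration
reshaped_df = pd.DataFrame({'Name': ['Apple', 'Boots'],
                            'Type': ['Fruit', 'clothing item']})
odf = pd.DataFrame({'Name': ['Apple', 'Boots', 'Scarf']})

# indicator=True adds a '_merge' column saying which side each row came from
check = pd.merge(odf, reshaped_df, on='Name', how='left', indicator=True)

# Names present in odf that have no extracted Type
missing = check.loc[check['_merge'] == 'left_only', 'Name'].tolist()
print(missing)
```

Inspecting the descriptions of the names in `missing` should show why the pattern did not match them.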

test_df = df.set_index('Name').Desc.str.extractall(ptn).reset_index(level=1, drop=False)[0] # Drop set to 'False'

This had no effect.
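That is consistent with how `str.extractall` behaves: it returns one row per match, so a row with no match does not appear in the result at all, and the `drop` argument of `reset_index` only decides whether the `match` index level is kept as a column. A small sketch with made-up data (the 'Mystery' row is hypothetical):

```python
import pandas as pd

s = pd.Series(
    ['A Succulent Fruit', 'nothing to see here'],
    index=pd.Index(['Apple', 'Mystery'], name='Name'),
)
ptn = r'\b(Fruit|body area|clothing item)\b'

# One row per match; 'Mystery' has no match and therefore no row
matches = s.str.extractall(ptn)
names_kept = matches.index.get_level_values('Name').unique().tolist()
print(names_kept)

# str.findall keeps every row and returns an empty list when nothing matches
found = s.str.findall(ptn)
print(found['Mystery'])
```

Because `str.findall` keeps the non-matching rows, it makes the gaps visible instead of silently dropping them.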

Does anyone have any ideas?

Answer 1

One way is to combine str.findall with mode:

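# Collect every keyword found in each description (one list per row)
df['Type'] = df.Description.str.findall('|'.join(keyword_list))
# Concatenate the lists per name and keep the most frequent keyword
s = df.groupby('Name')['Type'].apply(lambda x: pd.Series(x.sum()).mode()[0]).reset_index()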
s
Out[49]: 
     Name           Type
0   Apple          Fruit
1   Boots  clothing item
2  Bottom      body area
3   Jeans  clothing item
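A self-contained variant of the same idea, as a sketch rather than a drop-in replacement: it builds the lists with `str.findall`, flattens them with `explode`, and takes the most frequent keyword per name. The `re.escape` call, the `reindex` step that keeps names with no match as NaN, and the helper name `most_common` are additions that do not appear in the answer above.

```python
import re
import pandas as pd

keyword_list = ['Fruit', 'body area', 'clothing item']
# Escape each keyword so regex metacharacters are matched literally
ptn = '|'.join(re.escape(k) for k in keyword_list)

df = pd.DataFrame({
    'Name': ['Apple', 'Bottom', 'Jeans', 'Boots', 'Boots', 'Apple'],
    'Description': [
        'A Succulent Fruit',
        'Depending on the context body area',
        'A unisex clothing item',
        'A type of shoe or a clothing item',
        'A popular clothing item for the winter',
        'some people name their children after this fruit',
    ],
})

def most_common(keywords: pd.Series) -> str:
    # mode() returns the most frequent value(s); take the first on ties
    return keywords.mode().iloc[0]

# One row per (Name, keyword); rows without any keyword become NaN and are dropped
long_df = (
    df.assign(Type=df['Description'].str.findall(ptn))
      .explode('Type')
      .dropna(subset=['Type'])
)

# Most frequent keyword per name, re-aligned to every unique name so none vanish
result = (
    long_df.groupby('Name')['Type'].agg(most_common)
           .reindex(df['Name'].unique())
           .rename_axis('Name')
           .reset_index()
)
print(result)
```

With this layout a name whose descriptions contain no keyword stays in `result` with a NaN `Type`, so the output keeps one row per unique name. If keywords should match regardless of case, `str.findall` accepts `flags=re.IGNORECASE`.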
