熊猫-如何用逗号分隔的字符串分隔和分组

时间:2019-09-16 12:37:47

标签: python pandas

对于以下示例数据框,您将如何:

  1. 分隔searched_productsbought_products字段
  2. datepageproducts分组它们
  3. 带有显示每种产品计数的列

从此:

+------------+------+---------------------+-----------------+
| date       | page | searched_products   | bought_products |
+------------+------+---------------------+-----------------+
| 2019-01-01 | abc  | apple, orange       | orange          |
+------------+------+---------------------+-----------------+
| 2019-01-01 | def  | apple, pear, orange | orange, pear    |
+------------+------+---------------------+-----------------+
| 2019-01-01 | abc  | grapes, orange      | apple, grapes   |
+------------+------+---------------------+-----------------+
| 2019-01-02 | def  | apple               | apple, oranges  |
+------------+------+---------------------+-----------------+
| 2019-01-02 | ghi  | apple, grapes       | orange          |
+------------+------+---------------------+-----------------+
| 2019-01-02 | jkl  | pear, apple         | pear            |
+------------+------+---------------------+-----------------+
| etc        | etc  | etc                 | etc             |
+------------+------+---------------------+-----------------+

对此:

+------------+------+---------+----------+-----------+
| date       | page | product | searches | purchases |
+------------+------+---------+----------+-----------+
| 2019-01-01 | abc  | apple   | 1        | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-01 | abc  | orange  | 2        | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-01 | abc  | grapes  | 1        | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-01 | def  | apple   | 1        | NaN       |
+------------+------+---------+----------+-----------+
| 2019-01-01 | def  | pear    | 1        | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-01 | def  | orange  | 1        | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-02 | def  | apple   | 1        | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-02 | def  | orange  | NaN      | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-02 | ghi  | apple   | 1        | NaN       |
+------------+------+---------+----------+-----------+
| 2019-01-02 | ghi  | grapes  | 1        | NaN       |
+------------+------+---------+----------+-----------+
| 2019-01-02 | ghi  | orange  | NaN      | 1         |
+------------+------+---------+----------+-----------+
| 2019-01-02 | jkl  | apple   | 1        | NaN       |
+------------+------+---------+----------+-----------+
| 2019-01-02 | jkl  | pear    | 1        | 1         |
+------------+------+---------+----------+-----------+
| etc        | etc  | etc     | etc      | etc       |
+------------+------+---------+----------+-----------+

1 个答案:

答案 0 :(得分:3)

format对大熊猫0.25+的解决方案,将重复值拆分为多个值,然后按DataFrame.explode和最后一个GroupBy.size汇总计数:

s = (df.assign(searches=df['searched_products'].str.split(', '))
      .explode('searches')
      .groupby(['date','page','searches'])
      .size()
      .rename('searches'))
b = (df.assign(purchases=df['bought_products'].str.split(', '))
       .explode('purchases')
       .groupby(['date','page','purchases'])
       .size()
       .rename('purchases'))

df = pd.concat([s, b], axis=1).rename_axis(('date','page','product')).reset_index()
print (df)
        date page  product  searches  purchases
0   20190101  abc    apple       1.0        1.0
1   20190101  abc   grapes       1.0        1.0
2   20190101  abc   orange       2.0        1.0
3   20190101  def    apple       1.0        NaN
4   20190101  def      ear       1.0        NaN
5   20190101  def   orange       1.0        1.0
6   20190101  def     pear       NaN        1.0
7   20190102  def    apple       1.0        1.0
8   20190102  def  oranges       NaN        1.0
9   20190102  ghi    apple       1.0        NaN
10  20190102  ghi   grapes       1.0        NaN
11  20190102  ghi   orange       NaN        1.0
12  20190102  jkl    apple       1.0        NaN
13  20190102  jkl     pear       1.0        1.0