Question

I have a pandas DataFrame whose contents are scraped directly from a USDA text file. Here is a sample of the DataFrame:

Date       Region                  CommodityGroup                   InboundCity  Low   High
1/2/2019   Mexico Crossings        Beans,Cucumbers,Eggplant,Melons  Atlanta      4500  4700
1/2/2019   Eastern North Carolina  Apples and Pears                 Baltimore    7000  8000
1/2/2019   Michigan                Apples                           Boston       3800  4000

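I am looking for a programmatic way to break apart the multiple commodities in the "CommodityGroup" column (the commodities are separated by commas or by "and" in the table above), create a new row for each separated commodity, and repeat the rest of the column data in each new row. Desired example output: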

Date       Region                    CommodityGroup     InboundCity     Low     High
1/2/2019   Mexico Crossings          Beans              Atlanta         4500    4700
1/2/2019   Mexico Crossings          Cucumbers          Atlanta         4500    4700
1/2/2019   Mexico Crossings          Eggplant           Atlanta         4500    4700
1/2/2019   Mexico Crossings          Melons             Atlanta         4500    4700
1/2/2019   Eastern North Carolina    Apples             Baltimore       7000    8000
1/2/2019   Eastern North Carolina    Pears              Baltimore       7000    8000
1/2/2019   Michigan                  Apples             Boston          3800    4000

Any guidance on how to do this would be greatly appreciated!

Answer 1

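Use .str.split to split the column on the pattern ',| and ', which matches either ',' or ' and '. The '|' means OR.
Use .explode to turn the list elements into separate rows.
- If needed, .reset_index(drop=True) can be used after exploding.
  - df = df.explode('CommodityGroup').reset_index(drop=True)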

import pandas as pd

# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
        'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
        'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
        'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
        'Low': [4500, 7000, 3800],
        'High': [4700, 8000, 4000]}

# create the dataframe
df = pd.DataFrame(data)

# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')

# explode the CommodityGroup lists
df = df.explode('CommodityGroup')

# final
       Date                  Region CommodityGroup InboundCity   Low  High
0  1/2/2019        Mexico Crossings          Beans     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings      Cucumbers     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings       Eggplant     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings         Melons     Atlanta  4500  4700
1  1/2/2019  Eastern North Carolina         Apples   Baltimore  7000  8000
1  1/2/2019  Eastern North Carolina          Pears   Baltimore  7000  8000
2  1/2/2019                Michigan         Apples      Boston  3800  4000
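
If the raw USDA text turns out to have inconsistent spacing around the separators (an assumption; the sample in the question shows none), a slightly more tolerant pattern may help. The sketch below wraps the same split-then-explode steps in a helper; the name `split_commodities` and the messy sample strings are hypothetical and only for illustration. Note that `explode` needs pandas 0.25 or later.

```python
import pandas as pd


def split_commodities(df, column="CommodityGroup"):
    """Hypothetical helper: one row per commodity, other columns repeated."""
    out = df.copy()
    # Split on a comma or on the standalone word "and", tolerating extra spaces.
    out[column] = out[column].str.split(r"\s*,\s*|\s+and\s+")
    # One row per list element, then renumber the index from 0.
    out = out.explode(column).reset_index(drop=True)
    # Trim whitespace left at the very start or end of the original string.
    out[column] = out[column].str.strip()
    return out


# Made-up sample with irregular spacing (not taken from the USDA file).
sample = pd.DataFrame({
    "Region": ["Mexico Crossings", "Eastern North Carolina"],
    "CommodityGroup": ["Beans , Cucumbers,Eggplant", "Apples  and Pears"],
    "Low": [4500, 7000],
})

result = split_commodities(sample)
print(result)
```

Because the pattern requires whitespace on both sides of "and", a commodity name that merely contains those letters is left intact.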

Answer 2

You can try the following:

# wrap the chain in parentheses so it can span several lines
df = (df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
        .apply(lambda x: x.str.split(',| and ').explode())
        .reset_index())
print(df)

       Date                  Region InboundCity   Low  High CommodityGroup
0  1/2/2019        Mexico Crossings     Atlanta  4500  4700          Beans
1  1/2/2019        Mexico Crossings     Atlanta  4500  4700      Cucumbers
2  1/2/2019        Mexico Crossings     Atlanta  4500  4700       Eggplant
3  1/2/2019        Mexico Crossings     Atlanta  4500  4700         Melons
4  1/2/2019  Eastern North Carolina   Baltimore  7000  8000         Apples
5  1/2/2019  Eastern North Carolina   Baltimore  7000  8000          Pears
6  1/2/2019                Michigan      Boston  3800  4000         Apples
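
As the output shows, this approach moves CommodityGroup to the last column. If the original column order matters, one possible variation is to remember the column list first and reselect it at the end. This is only a sketch: the names `original_columns`, `keys`, and `exploded` are illustrative, and it works on the single column as a Series rather than through `apply`.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
    'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
    'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
    'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
    'Low': [4500, 7000, 3800],
    'High': [4700, 8000, 4000],
})

# Remember the column order before reshaping.
original_columns = list(df.columns)
# Every column except the one being split becomes part of the index.
keys = [c for c in original_columns if c != 'CommodityGroup']

exploded = (
    df.set_index(keys)['CommodityGroup']  # Series indexed by the other columns
      .str.split(',| and ')               # each string becomes a list
      .explode()                          # one row per list element
      .reset_index()                      # index levels back to columns
)

# Restore the original left-to-right column order.
exploded = exploded[original_columns]
print(exploded)
```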

Create new rows from a single cell string in a pandas DataFrame
