Creating new rows from a single cell string in a pandas DataFrame

Time: 2020-06-24 18:43:38

Tags: python pandas dataframe

I have a pandas DataFrame whose contents are scraped directly from a USDA text file. Here is a sample of the DataFrame:

    Date       Region                 CommodityGroup                    InboundCity  Low    High
    1/2/2019   Mexico Crossings       Beans,Cucumbers,Eggplant,Melons   Atlanta      4500   4700
    1/2/2019   Eastern North Carolina Apples and Pears                  Baltimore    7000   8000
    1/2/2019   Michigan               Apples                            Boston       3800   4000

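I am looking for a programmatic way to break apart the multiple commodities in the "CommodityGroup" column (each commodity is separated by a comma or by "and" in the table above), create a new row for each separated commodity, and repeat the rest of the column data in every new row. Example of the desired output: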

    Date       Region                    CommodityGroup     InboundCity     Low     High
    1/2/2019   Mexico Crossings          Beans              Atlanta         4500    4700
    1/2/2019   Mexico Crossings          Cucumbers          Atlanta         4500    4700
    1/2/2019   Mexico Crossings          Eggplant           Atlanta         4500    4700
    1/2/2019   Mexico Crossings          Melons             Atlanta         4500    4700
    1/2/2019   Eastern North Carolina    Apples             Baltimore       7000    8000
    1/2/2019   Eastern North Carolina    Pears              Baltimore       7000    8000
    1/2/2019   Michigan                  Apples             Boston          3800    4000

Any guidance on how to do this would be greatly appreciated!

2 answers:

Answer 0 (score: 2)

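  • Use .str.split to split the column with the pattern ',| and ', which matches either ',' or ' and '; the '|' means OR.
  • Use .explode to separate the list elements into individual rows
    • If needed, call .reset_index(drop=True) after exploding
      • df = df.explode('CommodityGroup').reset_index(drop=True)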
import pandas as pd

# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
        'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
        'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
        'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
        'Low': [4500, 7000, 3800],
        'High': [4700, 8000, 4000]}

# create the dataframe
df = pd.DataFrame(data)

# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')

# explode the CommodityGroup lists
df = df.explode('CommodityGroup')

# final
       Date                  Region CommodityGroup InboundCity   Low  High
0  1/2/2019        Mexico Crossings          Beans     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings      Cucumbers     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings       Eggplant     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings         Melons     Atlanta  4500  4700
1  1/2/2019  Eastern North Carolina         Apples   Baltimore  7000  8000
1  1/2/2019  Eastern North Carolina          Pears   Baltimore  7000  8000
2  1/2/2019                Michigan         Apples      Boston  3800  4000
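
If the scraped strings are less tidy than the sample (for example, spaces around the commas), the same split-then-explode idea can be sketched with a slightly more forgiving pattern. The untidy input string and the regex below are assumptions for illustration, not part of the original data; the index is reset so the repeated 0, 0, 0, ... labels shown above become a plain 0..n-1 range.

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around the separators
df = pd.DataFrame({
    'Region': ['Mexico Crossings', 'Eastern North Carolina'],
    'CommodityGroup': ['Beans, Cucumbers ,Eggplant', 'Apples and Pears'],
    'Low': [4500, 7000],
})

# Split on a comma or on the standalone word "and",
# swallowing any whitespace next to the separator
pattern = r'\s*,\s*|\s+and\s+'

out = (
    df.assign(CommodityGroup=df['CommodityGroup'].str.strip().str.split(pattern))
      .explode('CommodityGroup')   # one row per list element
      .reset_index(drop=True)      # replace the repeated index labels
)
print(out)
```

Because the pattern requires whitespace on both sides of "and", a commodity name that merely contains those letters is not split.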

Answer 1 (score: 2)

You can try the following:

# wrap the chain in parentheses so it can span several lines
df = (df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
        .apply(lambda x: x.str.split(',| and ').explode())
        .reset_index())
print(df)

       Date                  Region InboundCity   Low  High CommodityGroup
0  1/2/2019        Mexico Crossings     Atlanta  4500  4700          Beans
1  1/2/2019        Mexico Crossings     Atlanta  4500  4700      Cucumbers
2  1/2/2019        Mexico Crossings     Atlanta  4500  4700       Eggplant
3  1/2/2019        Mexico Crossings     Atlanta  4500  4700         Melons
4  1/2/2019  Eastern North Carolina   Baltimore  7000  8000         Apples
5  1/2/2019  Eastern North Carolina   Baltimore  7000  8000          Pears
6  1/2/2019                Michigan      Boston  3800  4000         Apples
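
Note that in the output above CommodityGroup ends up as the last column. If the original column order matters, one possible variation is to remember the order and reselect the columns at the end. The names original_columns and other_columns are introduced here only for illustration, and this is a sketch of the same set_index approach rather than the answerer's exact code.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
    'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
    'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
    'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
    'Low': [4500, 7000, 3800],
    'High': [4700, 8000, 4000],
})

# Remember the column order, and index by every column except the one to split
original_columns = df.columns.tolist()
other_columns = [c for c in original_columns if c != 'CommodityGroup']

out = (
    df.set_index(other_columns)['CommodityGroup']
      .str.split(',| and ')        # each cell becomes a list of commodities
      .explode()                   # one row per commodity, index values repeated
      .reset_index()[original_columns]   # restore the original column order
)
print(out)
```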