I have a pandas DataFrame whose contents are scraped directly from a USDA text file. Here is a sample of the DataFrame:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans,Cucumbers,Eggplant,Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples and Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
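I am looking for a programmatic way to break up the multiple commodities in the CommodityGroup column (each commodity is separated by a comma or by "and" in the table above), create a new row for each separated commodity, and repeat the rest of the column data on each new row. Desired sample output: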
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
Any guidance on this would be greatly appreciated!
Answer 0 (score: 2)
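Use .str.split to split the column with the pattern ',| and ', which matches either ',' or ' and '. The '|' is the regex OR.
Use .explode to separate the list elements into individual rows.
Optionally, renumber the rows afterwards with .reset_index(drop=True):
df = df.explode('CommodityGroup').reset_index(drop=True)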
import pandas as pd
# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
'Low': [4500, 7000, 3800],
'High': [4700, 8000, 4000]}
# create the dataframe
df = pd.DataFrame(data)
# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')
# explode the CommodityGroup lists
df = df.explode('CommodityGroup')
# final
Date Region CommodityGroup InboundCity Low High
0 1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1 1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1 1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
2 1/2/2019 Michigan Apples Boston 3800 4000
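The pattern ',| and ' matches the sample exactly, but a scraped text file may contain stray spaces around the separators. The following is a minimal sketch under that assumption: the messy input strings and the names raw, pattern and tidy are hypothetical, not taken from the USDA file.

```python
import pandas as pd

# Hypothetical messy input: extra spaces around commas and around "and"
raw = pd.DataFrame({
    'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
    'CommodityGroup': ['Beans, Cucumbers ,Eggplant,  Melons',
                       'Apples  and Pears',
                       ' Apples '],
})

# A comma with optional surrounding whitespace, OR the word "and"
# surrounded by at least one whitespace character on each side
pattern = r'\s*,\s*|\s+and\s+'

tidy = (raw.assign(CommodityGroup=raw['CommodityGroup']
                   .str.strip()          # drop leading/trailing spaces
                   .str.split(pattern))  # multi-character pattern is treated as a regex
           .explode('CommodityGroup')
           .reset_index(drop=True))

print(tidy)
```

Requiring whitespace on both sides of "and" keeps the pattern from splitting inside a word that merely contains those letters.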
Answer 1 (score: 2)
You can try the following:
# wrap the chain in parentheses so it can span several lines
df = (df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
        .apply(lambda x: x.str.split(',| and ').explode())
        .reset_index())
print(df)
Date Region InboundCity Low High CommodityGroup
0 1/2/2019 Mexico Crossings Atlanta 4500 4700 Beans
1 1/2/2019 Mexico Crossings Atlanta 4500 4700 Cucumbers
2 1/2/2019 Mexico Crossings Atlanta 4500 4700 Eggplant
3 1/2/2019 Mexico Crossings Atlanta 4500 4700 Melons
4 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Apples
5 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Pears
6 1/2/2019 Michigan Boston 3800 4000 Apples
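Note that the output above moves CommodityGroup to the last column. If the original column order matters, one possible variant is sketched below. It assumes pandas 1.1 or later, where explode accepts ignore_index; the name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
    'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
    'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
    'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
    'Low': [4500, 7000, 3800],
    'High': [4700, 8000, 4000],
})

# Replace the column with lists of commodities, then explode it in place.
# The column stays in its original position, and ignore_index=True
# (assumed available, pandas 1.1+) renumbers the rows from 0.
out = (df.assign(CommodityGroup=df['CommodityGroup'].str.split(',| and '))
         .explode('CommodityGroup', ignore_index=True))

print(out)
```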