Question

I'm not very familiar with pandas, so this may be a silly question. I'm trying to pivot the following data:

import pandas as pd

df = pd.DataFrame({
      'Country' : ['country1', 'country2', 'country3', 'country4'],
      'Industry' : ['industry1:\$20 \n industry4:\$30', 
                    'industry10:\$100', 
                    'industry3:\$2 \n industry4:\$30 \n industry12:\$10 \n industry1:\$3',
                    'industry1:\$20 \n industry4:\$30'
                   ],})

(The \n comes from the Excel extraction.)

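I need to pivot so that the industries become the index and the countries become the columns. My intuition is that I first need to do some "data unpacking" on the cells that hold several pieces of information, but I'm at a loss as to how to do that in pandas.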

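Thanks, everyone. Several of the answers below work well. I kept searching and found other posts related to this problem (some people call it "exploding pandas rows"). In the post below, someone wrote a generic explode() function that performs well: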

Split (explode) pandas dataframe string entry to separate rows
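For reference, pandas 0.25 and later ship a built-in `DataFrame.explode`, so a hand-written helper may not be needed. A minimal sketch, assuming the sample data from the question and that entries are separated by a newline with optional surrounding spaces; the column name `Entry` is made up for illustration:

```python
import pandas as pd

# Same data as in the question; '\\$' spells out the literal backslash + dollar
# that '\$' produces in a normal Python string.
df = pd.DataFrame({
    'Country': ['country1', 'country2', 'country3', 'country4'],
    'Industry': ['industry1:\\$20 \n industry4:\\$30',
                 'industry10:\\$100',
                 'industry3:\\$2 \n industry4:\\$30 \n industry12:\\$10 \n industry1:\\$3',
                 'industry1:\\$20 \n industry4:\\$30'],
})

# Split each cell into a list of "industry:\$value" strings,
# then give every list element its own row.
exploded = (df.assign(Entry=df['Industry'].str.split(r'\s*\n\s*'))
              .drop(columns='Industry')
              .explode('Entry')
              .reset_index(drop=True))
print(exploded)
```

Each `Entry` still holds the industry name and the value together, so a second split on the `:\$` separator is needed before pivoting.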

Answer 1

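You can use:

set_index on Country, then select the Industry column
split by the regex \s+\n\s+ (\s+ matches one or more whitespace characters)
stack to reshape into a Series
split again, this time on a different separator
reset_index twice; the first call drops the index level added by stack
rename the columns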

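df = (df.set_index(['Country'])['Industry']
        .str.split(r'\s+\n\s+', expand=True)      # one column per entry in the cell
        .stack()                                  # entries become rows
        .str.split(r':\\\$', expand=True)         # split on the literal ":\$"
        .reset_index(level=1, drop=True)          # drop the entry-position level
        .reset_index()                            # turn Country back into a column
        .rename(columns={0:'Industry', 1:'Val'})
     )
print(df)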
    Country    Industry  Val
0  country1   industry1   20
1  country1   industry4   30
2  country2  industry10  100
3  country3   industry3    2
4  country3   industry4   30
5  country3  industry12   10
6  country3   industry1    3
7  country4   industry1   20
8  country4   industry4   30
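
To get from this long format to the layout asked for in the question (industries as the index, countries as columns), one more reshaping step is needed. A minimal sketch, assuming each country/industry pair occurs only once and that the values should be numeric; the frame `long_df` is typed in by hand here to mirror the output above:

```python
import pandas as pd

# Long-format frame, typed in to mirror the printed result above.
long_df = pd.DataFrame({
    'Country': ['country1', 'country1', 'country2', 'country3', 'country3',
                'country3', 'country3', 'country4', 'country4'],
    'Industry': ['industry1', 'industry4', 'industry10', 'industry3', 'industry4',
                 'industry12', 'industry1', 'industry1', 'industry4'],
    'Val': ['20', '30', '100', '2', '30', '10', '3', '20', '30'],
})

# The string split leaves Val as text; convert it before doing arithmetic.
long_df['Val'] = pd.to_numeric(long_df['Val'])

# One row per industry, one column per country; missing pairs become NaN.
wide = long_df.pivot(index='Industry', columns='Country', values='Val')
print(wide)
```

If a country could list the same industry more than once, `pivot` raises an error; `pivot_table` with an explicit `aggfunc` such as `'sum'` is one way to handle that case.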

pandas pivot table with multiple pieces of information in a cell
