This is my current dataframe:
>>>df = {'most_exhibitions' : pd.Series(['USA (1) Netherlands (5)' ,
'United Kingdom (2)','China (3) India (5) Pakistan (8)','USA (11) India (4)'], index=['a', 'b', 'c','d']),
'name' : pd.Series(['Bob', 'Joe', 'Alex', 'Bill'], index=['a', 'b', 'c','d'])}
>>> df
name most_exhibitions
a Bob USA (1) India (5)
b Joe United Kingdom (2)
c Alex China (3) India (5) USA (8)
d Bill USA (11) India (4)
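I am trying to figure out how to split each cell and then create a new column for each country, with the corresponding count placed in that person's row. If the country already exists as a column, I want to put the count in that existing column.
So the final dataframe would look like this: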
#   name  most_exhibitions               USA  United Kingdom  China  India
#a  Bob   USA (1), India (5)               1                             5
#b  Joe   United Kingdom (2)                               2
#c  Alex  China (3), India (5), USA (8)    8                      3      5
#d  Bill  USA (11), India (4)             11                             4
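I want to write a loop or function that splits the data and then adds the new columns, but I cannot figure out how to do it. I ended up splitting and cleaning the data through a series of dictionaries, and now I don't know how to get the final dictionary into its own dataframe. I figure that if I can create this new dataframe, I will be able to append it to the old one. I also suspect I am doing this the hard way, and I am interested in any more elegant solution.
Here is what I have done so far: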
>>>country_rank_df['country_split'] = indexed_rankdata['most_exhibitions'].str.split(",").astype(str)
from collections import defaultdict
total_dict = defaultdict(list)
dict2 = defaultdict(list)
dict3 = defaultdict(list)
dict4 = defaultdict(list)
dict5 = defaultdict(list)
dict6 = defaultdict(list)
for name, country_count in zip(head_df['name'], head_df['most_exhibitions']):
    total_dict[name].append(country_count)
for key, value in total_dict.iteritems():
    for line in value:
        new_line = line.split('(')
        dict2[key].append(new_line)
for key, list_outside in dict2.iteritems():
    for list_inside in list_outside:
        for value in list_inside:
            new_line = value.split(',')
            dict3[key].append(new_line)
for key, list_outside in dict3.iteritems():
    for list_inside in list_outside:
        for value in list_inside:
            new_line = value.split(')')
            dict4[key].append(new_line)
for key, list_outside in dict4.iteritems():
    for list_inside in list_outside:
        for value in list_inside:
            new_line = value.strip()
            new_line = value.lstrip()
            dict5[key].append(new_line)
for key, list_outside in dict5.iteritems():
    new_line = filter(None, list_outside)
    dict6[key].append(new_line)
>>>dict6['Bob']
[['USA',
'1',
'India',
'5']]
Answer 0 (score: 2)
You can try this approach, which mainly uses string methods. Then I pivot and fillna the dataframe. I lose the original column most_exhibitions, but I hope it is not needed.
import pandas as pd
df = {'most_exhibitions' : pd.Series(['USA (1) Netherlands (5)' ,
'United Kingdom (2)','China (3) India (5) Pakistan (8)','USA (11) India (4)'], index=['a', 'b', 'c','d']),
'name' : pd.Series(['Bob', 'Joe', 'Alex', 'Bill'], index=['a', 'b', 'c','d'])}
df = pd.DataFrame(df)
#change ordering of columns
df = df[['name', 'most_exhibitions']]
print(df)
# name most_exhibitions
#a Bob USA (1) Netherlands (5)
#b Joe United Kingdom (2)
#c Alex China (3) India (5) Pakistan (8)
#d Bill USA (11) India (4)
#remove '(' and the last ')'
df['most_exhibitions'] = df['most_exhibitions'].str.replace('(', '', regex=False)
df['most_exhibitions'] = df['most_exhibitions'].str.strip(')')
#http://stackoverflow.com/a/34065937/2901002
#split on ')' and give each "country count" piece its own row (the index labels repeat)
s = df['most_exhibitions'].str.split(')').explode()
s.name = 'most_exhibitions'
print(s)
#a USA 1
#a Netherlands 5
#b United Kingdom 2
#c China 3
#c India 5
#c Pakistan 8
#d USA 11
#d India 4
#Name: most_exhibitions, dtype: object
df = df.drop( ['most_exhibitions'], axis=1)
df = df.join(s)
print(df)
# name most_exhibitions
#a Bob USA 1
#a Bob Netherlands 5
#b Joe United Kingdom 2
#c Alex China 3
#c Alex India 5
#c Alex Pakistan 8
#d Bill USA 11
#d Bill India 4
#extract the numbers and convert them to integers
df['numbers'] = df['most_exhibitions'].str.extract(r"(\d+)", expand=False).astype('int')
#extract the text of most_exhibitions (everything before the last space)
df['most_exhibitions'] = df['most_exhibitions'].str.rsplit(' ', n=1).str[0]
print(df)
# name most_exhibitions numbers
#a Bob USA 1
#a Bob Netherlands 5
#b Joe United Kingdom 2
#c Alex China 3
#c Alex India 5
#c Alex Pakistan 8
#d Bill USA 11
#d Bill India 4
#pivot dataframe
df = df.pivot(index='name', columns='most_exhibitions', values='numbers')
#NaN to empty string
df = df.fillna('')
print(df)
#most_exhibitions  India  Netherlands  Pakistan  China  USA  United Kingdom
#name
#Alex                  5                      8      3
#Bill                  4                                 11
#Bob                                5                     1
#Joe                                                                      2
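As a more compact variant of the steps above, here is a minimal sketch (not the answerer's code) that uses `Series.str.extractall` with a regular expression to pull out every country/count pair in one pass and then unstacks the result into one column per country. The pattern and the names `pairs` and `wide` are assumptions made for illustration. The sketch also strips whitespace from the country names, because splitting on ')' leaves a leading space on every piece after the first.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Bob", "Joe", "Alex", "Bill"],
        "most_exhibitions": [
            "USA (1) Netherlands (5)",
            "United Kingdom (2)",
            "China (3) India (5) Pakistan (8)",
            "USA (11) India (4)",
        ],
    },
    index=["a", "b", "c", "d"],
)

# One row per "Country (N)" pair; the named groups become the columns.
pattern = r"(?P<country>[^()]+?)\s*\((?P<count>\d+)\)"
pairs = df["most_exhibitions"].str.extractall(pattern)
pairs["country"] = pairs["country"].str.strip()
pairs["count"] = pairs["count"].astype(int)

# Drop the per-row match counter, then turn the countries into columns.
# Rows stay keyed by the original index labels (a, b, c, d).
wide = (
    pairs.reset_index(level="match", drop=True)
    .set_index("country", append=True)["count"]
    .unstack("country")
)
print(wide)
```

Missing combinations come out as NaN, so `wide.fillna('')` can be applied afterwards if blank cells are preferred for display.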
EDIT:
I tried adding all the columns with the merge function to get the desired output:
import pandas as pd
df = {'most_exhibitions' : pd.Series(['USA (1) Netherlands (5)' ,
'United Kingdom (2)','China (3) India (5) Pakistan (8)','USA (11) India (4)'], index=['a', 'b', 'c','d']),
'name' : pd.Series(['Bob', 'Joe', 'Alex', 'Bill'], index=['a', 'b', 'c','d'])}
df = pd.DataFrame(df)
#change ordering of columns
df = df[['name', 'most_exhibitions']]
print(df)
# name most_exhibitions
#a Bob USA (1) Netherlands (5)
#b Joe United Kingdom (2)
#c Alex China (3) India (5) Pakistan (8)
#d Bill USA (11) India (4)
#copy original to new dataframe for joining original df
df1 = df.reset_index().copy()
#remove '(' and the last ')'
df['most_exhibitions'] = df['most_exhibitions'].str.replace('(', '', regex=False)
df['most_exhibitions'] = df['most_exhibitions'].str.strip(')')
#http://stackoverflow.com/a/34065937/2901002
#split on ')' and give each "country count" piece its own row (the index labels repeat)
s = df['most_exhibitions'].str.split(')').explode()
s.name = 'most_exhibitions'
print(s)
#a USA 1
#a Netherlands 5
#b United Kingdom 2
#c China 3
#c India 5
#c Pakistan 8
#d USA 11
#d India 4
#Name: most_exhibitions, dtype: object
df = df.drop( ['most_exhibitions'], axis=1)
df = df.join(s)
print(df)
# name most_exhibitions
#a Bob USA 1
#a Bob Netherlands 5
#b Joe United Kingdom 2
#c Alex China 3
#c Alex India 5
#c Alex Pakistan 8
#d Bill USA 11
#d Bill India 4
#extract the numbers and convert them to integers
df['numbers'] = df['most_exhibitions'].str.extract(r"(\d+)", expand=False).astype('int')
#extract the text of most_exhibitions (everything before the last space)
df['most_exhibitions'] = df['most_exhibitions'].str.rsplit(' ', n=1).str[0]
print(df)
# name most_exhibitions numbers
#a Bob USA 1
#a Bob Netherlands 5
#b Joe United Kingdom 2
#c Alex China 3
#c Alex India 5
#c Alex Pakistan 8
#d Bill USA 11
#d Bill India 4
#pivot dataframe
df = df.pivot(index='name', columns='most_exhibitions', values='numbers')
#NaN to empty string
df = df.fillna('')
df = df.reset_index()
print(df)
#most_exhibitions  name  India  Netherlands  Pakistan  China  USA  United Kingdom
#0                 Alex      5                      8      3
#1                 Bill      4                                 11
#2                  Bob                   5                     1
#3                  Joe                                                         2
print(df1)
# index name most_exhibitions
#0 a Bob USA (1) Netherlands (5)
#1 b Joe United Kingdom (2)
#2 c Alex China (3) India (5) Pakistan (8)
#3 d Bill USA (11) India (4)
df = pd.merge(df1,df, on=['name'])
df = df.set_index('index')
print(df)
#       name                  most_exhibitions  India  Netherlands  Pakistan  \
#index
#a       Bob           USA (1) Netherlands (5)                   5
#b       Joe                United Kingdom (2)
#c      Alex  China (3) India (5) Pakistan (8)      5                      8
#d      Bill                USA (11) India (4)      4
#
#       China  USA  United Kingdom
#index
#a               1
#b                               2
#c          3
#d              11
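The merge on `name` assumes that every name is unique. As a hedged alternative, the sketch below parses each cell with a small plain-Python function (the kind of loop-or-function approach the question asks about), builds a dataframe from the resulting list of dictionaries, and joins it to the original on the index, so the original `most_exhibitions` column is kept. The helper name `parse_counts` and the regular expression are assumptions for illustration, not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Bob", "Joe", "Alex", "Bill"],
        "most_exhibitions": [
            "USA (1) Netherlands (5)",
            "United Kingdom (2)",
            "China (3) India (5) Pakistan (8)",
            "USA (11) India (4)",
        ],
    },
    index=["a", "b", "c", "d"],
)


def parse_counts(cell):
    """Turn 'USA (1) India (5)' into {'USA': 1, 'India': 5} (hypothetical helper)."""
    pairs = re.findall(r"([^()]+)\((\d+)\)", cell)
    return {country.strip(): int(count) for country, count in pairs}


# One dictionary per row; keys missing from a row become NaN in the dataframe.
counts = pd.DataFrame(
    [parse_counts(cell) for cell in df["most_exhibitions"]],
    index=df.index,
)

# Join on the shared index instead of merging on the name column.
result = df.join(counts)
print(result)
```

Because the join uses the index, two people with the same name would still get their own rows.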