My pandas DataFrame looks like this:
date     | location              | occurance
------------------------------------------------------
somedate | united_kingdom_london | 5
somedate | united_state_newyork  | 5
I would like to transform it into:
date     | country        | city    | occurance
---------------------------------------------------
somedate | united kingdom | london  | 5
somedate | united state   | newyork | 5
I am new to Python. After some research I wrote the following code, but it does not seem to extract the country and city:
df.location= df.location.replace({'-': ' '}, regex=True)
df.location= df.location.replace({'_': ' '}, regex=True)
temp_location = df['location'].str.split(' ').tolist()
location_data = pd.DataFrame(temp_location, columns=['country', 'city'])
Thanks in advance for any replies.
Answer 0 (score: 3)
Starting from this:
df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})
Try this:
df['Country'] = df['location'].str.rpartition('_')[0].str.replace("_", " ")
df['City'] = df['location'].str.rpartition('_')[2]
df[['Date','Country', 'City', 'occurence']]
       Date         Country     City  occurence
0  somedate  united kingdom   london          5
1  somedate    united state  newyork          5
Borrowing @MaxU's idea:
df[['Country'," " , 'City']] = (df.location.str.replace('_',' ').str.rpartition(' ', expand= True ))
df[['Date','Country', 'City','occurence' ]]
       Date         Country     City  occurence
0  somedate  united kingdom   london          5
1  somedate    united state  newyork          5
Answer 1 (score: 1)
Another solution, using str.rsplit, which also works well when the country contains no _ (only one word):
import pandas as pd
df = pd.DataFrame({'date': {0: 'somedate', 1: 'somedate', 2: 'somedate'},
                   'location': {0: 'slovakia_bratislava',
                                1: 'united_kingdom_london',
                                2: 'united_state_newyork'},
                   'occurance': {0: 5, 1: 5, 2: 5}})
print(df)
       date               location  occurance
0  somedate    slovakia_bratislava          5
1  somedate  united_kingdom_london          5
2  somedate   united_state_newyork          5
df[['country','city']] = df.location.str.replace('_', ' ').str.rsplit(n=1, expand=True)
# change the column order and drop the location column
cols = df.columns.tolist()
df = df[cols[:1] + cols[3:5] + cols[2:3]]
print(df)
       date         country        city  occurance
0  somedate        slovakia  bratislava          5
1  somedate  united kingdom      london          5
2  somedate    united state     newyork          5
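Answer 2 (score: 0)
Try this:
# Assumes the underscores in location have already been replaced with spaces, as in the question
splits = df['location'].str.split(' ')
temp_location = {}
# Index inside each list via the .str accessor; splits[0:-1] would slice rows of the Series instead
temp_location['country'] = splits.str[:-1].str.join(' ').tolist()
temp_location['city'] = splits.str[-1].tolist()
location_data = pd.DataFrame(temp_location)
If you want it back in the original df:
df['country'] = splits.str[:-1].str.join(' ')
df['city'] = splits.str[-1]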
Answer 3 (score: 0)
Consider using rfind():
import pandas as pd
df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})
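# Everything before the last "_" is the country; swap its remaining underscores for spaces
df['country'] = df['location'].apply(lambda x: x[0:x.rfind('_')].replace('_', ' '))
# Everything after the last "_" is the city
df['city'] = df['location'].apply(lambda x: x[x.rfind('_')+1:])
df = df[['Date', 'country', 'city', 'occurence']]
print(df)
#        Date         country     city  occurence
# 0  somedate  united kingdom   london          5
# 1  somedate    united state  newyork          5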
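One edge case worth keeping in mind: str.rfind returns -1 when the string contains no underscore, so the slices above would silently drop the last character from the country and put the whole string in the city. A minimal sketch of a guarded version follows; the helper name split_at_last_underscore is made up for illustration.

```python
def split_at_last_underscore(location):
    """Split 'a_b_c' into ('a_b', 'c'); return ('', location) if there is no '_'."""
    idx = location.rfind('_')
    if idx == -1:
        # No separator found: treat the whole string as the city
        return '', location
    return location[:idx], location[idx + 1:]


print(split_at_last_underscore('united_kingdom_london'))
print(split_at_last_underscore('london'))
```

With the DataFrame above, the helper could be applied per row, for example df['location'].apply(lambda x: split_at_last_underscore(x)[1]) for the city.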
Answer 4 (score: 0)
Something like this:
import pandas as pd
df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})
# Reverse each string, replace the first "_" (the last one in the original), then reverse back
df.location = df.location.str[::-1].str.replace("_", " ", n=1).str[::-1]
newcols = pd.DataFrame(df.location.str.split(" ").tolist(),
                       columns=["country", "city"])
newcols.country = newcols.country.str.replace("_", " ")
df = pd.concat([df, newcols], axis=1)
df.drop("location", axis=1, inplace=True)
print(df)
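       Date  occurence         country     city
0  somedate          5  united kingdom   london
1  somedate          5    united state  newyork
You can use a regular expression in the replace to handle more complex patterns, but if you only need the word after the last _, I find it easier to reverse the string twice as a hack than to fiddle with regular expressions.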
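For comparison, the regular-expression route mentioned above might look like the sketch below. It assumes the same sample data; the pattern _(?=[^_]*$) uses a lookahead to match only an underscore that is followed by no further underscores, that is, the last one.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# Replace only the last "_" with a space, leaving earlier underscores in place
location = df['location'].str.replace(r'_(?=[^_]*$)', ' ', regex=True)

# The single space now separates country from city
newcols = location.str.split(' ', expand=True)
newcols.columns = ['country', 'city']
newcols['country'] = newcols['country'].str.replace('_', ' ', regex=False)

result = pd.concat([df.drop(columns='location'), newcols], axis=1)
print(result)
```

This avoids the double reversal at the cost of a less obvious pattern, so which version reads better is largely a matter of taste.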
Answer 5 (score: 0)
I would use the .str.extract() method:
In [107]: df
Out[107]:
       Date               location  occurence
0  somedate  united_kingdom_london          5
1  somedate   united_state_newyork          5
2  somedate         germany_munich          5
In [108]: df[['country','city']] = (df.location.str.replace('_',' ')
   .....:                           .str.extract(r'(.*)\s+([^\s]*)', expand=True))
In [109]: df
Out[109]:
       Date               location  occurence         country     city
0  somedate  united_kingdom_london          5  united kingdom   london
1  somedate   united_state_newyork          5    united state  newyork
2  somedate         germany_munich          5         germany   munich
In [110]: df = df.drop('location', axis=1)
In [111]: df
Out[111]:
       Date  occurence         country     city
0  somedate          5  united kingdom   london
1  somedate          5    united state  newyork
2  somedate          5         germany   munich
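PS: note that a row with a two-word country + one-word city cannot be reliably distinguished from a row with a one-word country + two-word city (unless you have a complete list of countries to check against)...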
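The check against a country list mentioned in the PS could be sketched as follows. The list KNOWN_COUNTRIES, the helper split_location, and the two-word city new_york in the sample data are all assumptions made up for illustration; a real list would need to be complete.

```python
import pandas as pd

# Hypothetical reference list; in practice this would be a complete list of countries
KNOWN_COUNTRIES = ['united_kingdom', 'united_state', 'germany']


def split_location(location, countries=KNOWN_COUNTRIES):
    """Return (country, city), preferring the longest known country prefix."""
    for country in sorted(countries, key=len, reverse=True):
        prefix = country + '_'
        if location.startswith(prefix):
            city = location[len(prefix):]
            return country.replace('_', ' '), city.replace('_', ' ')
    # Unknown country: fall back to splitting at the last underscore
    country, _, city = location.rpartition('_')
    return country.replace('_', ' '), city


df = pd.DataFrame({'location': ['united_kingdom_london',
                                'united_state_new_york',
                                'germany_munich']})

pairs = [split_location(loc) for loc in df['location']]
df['country'] = [country for country, _ in pairs]
df['city'] = [city for _, city in pairs]
print(df)
```

Trying the longest names first keeps a shorter country name from matching when a longer one shares its prefix. The fallback means unknown countries still behave like the last-underscore split used in the answers above.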