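Split location into city, state code, and country in pandas

Time: 2019-07-03 18:24:41

Tags: python split

I would like to know how to split a Location column into several new columns, such as city, state code, and country, in pandas. From this: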

 'Location': {0: 'Warszawa, Poland',
  1: 'San Francisco, CA, United States',
  2: 'Los Angeles, CA, United States',
  3: 'Sunnyvale, CA, United States',
  4: 'Sunnyvale, CA, United States',
  5: 'San Francisco, CA, United States',
  6: 'Sunnyvale, CA, United States',
  7: 'Kraków, Poland',
  8: 'Shanghai, China',
  9: 'Mountain View, CA, United States',
  10: 'Boulder, CO, United States',
  11: 'Boulder, CO, United States',
  12: 'Xinyi District, Taiwan',
  13: 'Tel Aviv-Yafo, Israel',
  14: 'Wrocław, Poland',
  15: 'Singapore'}

To this:

 'Country': {0: 'Poland',
  1: 'United States',
  2: 'United States',
  3: 'United States',
  4: 'United States',
  5: 'United States',
  6: 'United States',
  7: 'Poland',
  8: 'China',
  9: 'United States',
  10: 'United States',
  11: 'United States',
  12: 'Taiwan',
  13: 'Israel',
  14: 'Poland',
  15: 'Singapore'}

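Thanks.

3 answers:

Answer 0 (score: 1)

I'm not sure this is the best approach, so others are welcome to comment or suggest a better one. I tried to split the data, but the challenge is that non-US entries contain only a city and a country, while US entries contain a city, a state, and a country. So I could not split everything with a single method. Below are the two methods I used to split the data; you then have to work out how to merge the results into one DataFrame.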

 import pandas as pd

 b = pd.DataFrame({'Location': {0: 'Warszawa, Poland',
  1: 'San Francisco, CA, United States',
  2: 'Los Angeles, CA, United States',
  3: 'Sunnyvale, CA, United States',
  4: 'Sunnyvale, CA, United States',
  5: 'San Francisco, CA, United States',
  6: 'Sunnyvale, CA, United States',
  7: 'Kraków, Poland',
  8: 'Shanghai, China',
  9: 'Mountain View, CA, United States',
  10: 'Boulder, CO, United States',
  11: 'Boulder, CO, United States',
  12: 'Xinyi District, Taiwan',
  13: 'Tel Aviv-Yafo, Israel',
  14: 'Wrocław, Poland',
  15: 'Singapore'}})

# Split once at the first comma into the city and the remainder. This works
# well for foreign addresses, or any entry with just a city and a country.
c = b['Location'].str.split(',', n=1, expand=True)
c.columns = ['City', 'Country']

Output:

     City       Country
0   Warszawa    Poland
1   San Francisco   CA, United States
2   Los Angeles CA, United States
3   Sunnyvale   CA, United States
4   Sunnyvale   CA, United States
5   San Francisco   CA, United States
6   Sunnyvale   CA, United States
7   Kraków  Poland
8   Shanghai    China

The second method is:

# State matches [^,\s]+ so it excludes the comma; a second comma separates it from the country.
regex = r'(?P<City>[^,]+)\s*,\s*(?P<State>[^,\s]+)\s*,\s*(?P<Country>[^,]+)'
df = b['Location'].str.extract(regex)
df  # Splits the data into City, State and Country, so it works well for US addresses.

Output:

    City          State   Country
0   NaN           NaN     NaN
1   San Francisco CA      United States
2   Los Angeles   CA      United States
3   Sunnyvale     CA      United States
4   Sunnyvale     CA      United States
5   San Francisco CA      United States
6   Sunnyvale     CA      United States
7   NaN           NaN     NaN
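
One possible way to merge the two cases into a single DataFrame is to split on the comma, take the first piece as the city and the last as the country, and treat the middle piece as a state only when there are exactly three parts. This is a sketch under that assumption about the data, not a general address parser. The helper name `split_location` is hypothetical, and a single-part entry such as 'Singapore' is treated here as a country with no city.

```python
import pandas as pd

b = pd.DataFrame({'Location': ['Warszawa, Poland',
                               'San Francisco, CA, United States',
                               'Singapore']})

def split_location(location):
    """Hypothetical helper: return (city, state, country) for one string."""
    parts = [p.strip() for p in location.split(',')]
    country = parts[-1]
    # Assumption: a single-part entry is a country with no city.
    city = parts[0] if len(parts) > 1 else None
    # Assumption: a state is present only when there are exactly three parts.
    state = parts[1] if len(parts) == 3 else None
    return city, state, country

# Expand the per-row tuples into three columns aligned on b's index.
parts = pd.DataFrame(b['Location'].map(split_location).tolist(),
                     index=b.index, columns=['City', 'State', 'Country'])
result = b.join(parts)
```

Rows without a state end up with a missing value in the State column, so the US and non-US entries can live in the same frame.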

Answer 1 (score: 0)

$ ipython
Python 3.6.8 |Anaconda custom (64-bit)| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: d = {'Location': {0: 'Warszawa, Poland',
   ...:   1: 'San Francisco, CA, United States',
   ...:   2: 'Los Angeles, CA, United States',
   ...:   3: 'Sunnyvale, CA, United States',
   ...:   4: 'Sunnyvale, CA, United States',
   ...:   5: 'San Francisco, CA, United States',
   ...:   6: 'Sunnyvale, CA, United States',
   ...:   7: 'Kraków, Poland',
   ...:   8: 'Shanghai, China',
   ...:   9: 'Mountain View, CA, United States',
   ...:   10: 'Boulder, CO, United States',
   ...:   11: 'Boulder, CO, United States',
   ...:   12: 'Xinyi District, Taiwan',
   ...:   13: 'Tel Aviv-Yafo, Israel',
   ...:   14: 'Wrocław, Poland',
   ...:   15: 'Singapore'}}

In [2]: import pandas as pd
   ...: df = pd.DataFrame.from_dict(d)
   ...: df
Out[2]:
                            Location
0                   Warszawa, Poland
1   San Francisco, CA, United States
2     Los Angeles, CA, United States
3       Sunnyvale, CA, United States
4       Sunnyvale, CA, United States
5   San Francisco, CA, United States
6       Sunnyvale, CA, United States
7                     Kraków, Poland
8                    Shanghai, China
9   Mountain View, CA, United States
10        Boulder, CO, United States
11        Boulder, CO, United States
12            Xinyi District, Taiwan
13             Tel Aviv-Yafo, Israel
14                   Wrocław, Poland
15                         Singapore

In [3]: df['Country'] = df['Location'].str.split(',').apply(lambda x: x[-1])
   ...: df
Out[3]:
                            Location         Country
0                   Warszawa, Poland          Poland
1   San Francisco, CA, United States   United States
2     Los Angeles, CA, United States   United States
3       Sunnyvale, CA, United States   United States
4       Sunnyvale, CA, United States   United States
5   San Francisco, CA, United States   United States
6       Sunnyvale, CA, United States   United States
7                     Kraków, Poland          Poland
8                    Shanghai, China           China
9   Mountain View, CA, United States   United States
10        Boulder, CO, United States   United States
11        Boulder, CO, United States   United States
12            Xinyi District, Taiwan          Taiwan
13             Tel Aviv-Yafo, Israel          Israel
14                   Wrocław, Poland          Poland
15                         Singapore       Singapore

In [4]: df['Country'].to_dict()
Out[4]:
{0: ' Poland',
 1: ' United States',
 2: ' United States',
 3: ' United States',
 4: ' United States',
 5: ' United States',
 6: ' United States',
 7: ' Poland',
 8: ' China',
 9: ' United States',
 10: ' United States',
 11: ' United States',
 12: ' Taiwan',
 13: ' Israel',
 14: ' Poland',
 15: 'Singapore'}
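
As Out[4] shows, splitting on a bare comma leaves a leading space on every value that followed a comma. A minimal sketch of one way to remove it, using the `.str` accessor to index into the split lists and then `str.strip()`. The small sample frame is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Location': ['Warszawa, Poland',
                                'Boulder, CO, United States',
                                'Singapore']})

# .str[-1] takes the last element of each split list; .str.strip() drops
# the leading space left behind by splitting on ','.
df['Country'] = df['Location'].str.split(',').str[-1].str.strip()

country_dict = df['Country'].to_dict()
# {0: 'Poland', 1: 'United States', 2: 'Singapore'}
```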

Answer 2 (score: 0)

This is a slight improvement that does the same job and can be written in one line of code.

b['City'] = b['Location'].str.split(',').apply(lambda x: x[0])
b['Country'] = b['Location'].str.split(',').apply(lambda x: x[-1])
b

Output:

    Location                            City             Country
0   Warszawa, Poland                    Warszawa          Poland
1   San Francisco, CA, United States    San Francisco     United States
2   Los Angeles, CA, United States      Los Angeles       United States
3   Sunnyvale, CA, United States        Sunnyvale         United States
4   Sunnyvale, CA, United States        Sunnyvale         United States
5   San Francisco, CA, United States    San Francisco     United States
6   Sunnyvale, CA, United States        Sunnyvale         United States
7   Kraków, Poland                      Kraków            Poland
8   Shanghai, China                     Shanghai          China

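Here is the one-line version, but I am having trouble getting the results into two separate columns. Something is wrong here and I can't find it.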

b['City', 'Country']= pd.DataFrame (b['Location'].str.split(',').apply(lambda x:( x[0], x[-1]))) 


    (City,  Country)
0   (Warszawa, Poland)
1   (San Francisco, United States)
2   (Los Angeles, United States)
3   (Sunnyvale, United States)
4   (Sunnyvale, United States)
5   (San Francisco, United States)
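
A likely cause, offered as a reading of the code rather than a confirmed diagnosis: `b['City', 'Country']` uses a tuple as a single column key, so pandas creates one column labelled `('City', 'Country')`, and the lambda returns one tuple per row, so that column holds tuples. One way to get two columns is to expand the tuples into a two-column DataFrame first and assign it using a list of column names. The sample frame below is illustrative.

```python
import pandas as pd

b = pd.DataFrame({'Location': ['Warszawa, Poland',
                               'San Francisco, CA, United States']})

# Each row becomes a (city, country) tuple; strip() removes the space after ','.
pairs = b['Location'].str.split(',').apply(
    lambda x: (x[0].strip(), x[-1].strip()))

# Expand the tuples into two columns, then assign with a list of names
# (double brackets) instead of a tuple key.
b[['City', 'Country']] = pd.DataFrame(pairs.tolist(), index=b.index,
                                      columns=['City', 'Country'])
```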