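Extract the full country name from a string and use it as a DataFrame column

Date: 2017-02-11 10:14:57

Tags: python pandas

I have data like the following. How can I convert it into a DataFrame? I need the country name (some country names contain a comma) as the first column, and each of the other values in its own column.

The input is a .txt file containing many countries: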

Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485

The output should be a DataFrame with the country name as the first column:

Czech Republic  22  22  22  21  21  21  21  21  19  18  16  14  13  12  11  11  10  9

Congo,Rep.      209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485

Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666

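3 answers:

Answer 0 (score: 1)

You can first use read_csv with a separator that does not occur in the values, such as |, so that each line is read into a Series as a single string (this works as long as the .txt file contains no |). Then use extract and strip to get the country names into one column, and split the remaining values by ,: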

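import pandas as pd
from io import StringIO  # replaces pandas.compat.StringIO, which newer pandas versions no longer provide

temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
# '|' never occurs in the data, so every line lands in a single column; [0] selects it as a Series
# (the squeeze=True argument used for this originally is not accepted by newer pandas versions)
s = pd.read_csv(StringIO(temp), sep="|", header=None)[0]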
print (s)
0    Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1    Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2    Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object

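# column 0: country name (letters, spaces, commas, dots); column 1: the comma-separated numbers
df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
# the greedy name group also swallows the comma before the first number, so strip it
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
#reset column names to 0,1,2...
df.columns = range(len(df.columns))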
print (df)
               0    1    2    3    4    5    6    7    8    9    10   11   12  \
0  Czech Republic   22   22   22   21   21   21   21   21   19   18   16   14   
1  Congo,Dem.Rep.  275  306  327  352  376  411  420  466  472  528  592  643   
2      Congo,Rep.  209  222  231  243  255  269  424  457  367  545  313  354   

    13   14   15   16   17   18  
0   13   12   11   11   10    9  
1  697  708  710  702  692  666  
2  402  509  477  482  511  485  

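If the country names are needed as the index, apply this to the df returned by str.extract and strip, in place of the set_index(...).reset_index() step above: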

df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
                 0    1    2    3    4    5    6    7    8    9    10   11  \
Czech Republic   22   22   22   21   21   21   21   21   19   18   16   14   
Congo,Dem.Rep.  275  306  327  352  376  411  420  466  472  528  592  643   
Congo,Rep.      209  222  231  243  255  269  424  457  367  545  313  354   

                 12   13   14   15   16   17  
Czech Republic   13   12   11   11   10    9  
Congo,Dem.Rep.  697  708  710  702  692  666  
Congo,Rep.      402  509  477  482  511  485  
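Note that str.split returns strings, so the value columns produced this way have object dtype rather than a numeric one. A minimal sketch of converting them, assuming every value is a plain integer string; the small frame below is a shortened, hypothetical stand-in for the df printed above:

```python
import pandas as pd

# Hypothetical, shortened stand-in for the string-valued frame produced by str.split
df = pd.DataFrame(
    [["22", "21", "9"], ["275", "306", "666"], ["209", "222", "485"]],
    index=["Czech Republic", "Congo,Dem.Rep.", "Congo,Rep."],
)

# Apply pd.to_numeric column by column; it raises if a value is not numeric
df_num = df.apply(pd.to_numeric)
print(df_num.dtypes)
```

After the conversion the columns can be used in arithmetic and aggregations such as df_num.sum(axis=1).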

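Another solution takes the regex from the other answer and passes it as the sep parameter. engine='python' has to be set explicitly, because otherwise pandas emits a warning (regex separators are not supported by the default C engine):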

import pandas as pd
from io import StringIO  # replaces pandas.compat.StringIO, which newer pandas versions no longer provide


temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
# split only on commas that are directly followed by a digit, so commas inside names are kept
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')

print (df)
               0    1    2    3    4    5    6    7    8    9    10   11   12  \
0  Czech Republic   22   22   22   21   21   21   21   21   19   18   16   14   
1  Congo,Dem.Rep.  275  306  327  352  376  411  420  466  472  528  592  643   
2      Congo,Rep.  209  222  231  243  255  269  424  457  367  545  313  354   

    13   14   15   16   17   18  
0   13   12   11   11   10    9  
1  697  708  710  702  692  666  
2  402  509  477  482  511  485  
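One caveat with the lookahead separator: it splits at every comma that is directly followed by a digit, so it relies on no country name containing such a comma. The sketch below repeats the call on shortened, hypothetical data and additionally labels the name column; the label "Country" is an assumption chosen for illustration, not something from the question.

```python
from io import StringIO

import pandas as pd

# Shortened, hypothetical sample in the same layout as the question's file
text = """Czech Republic,22,21,9
Congo,Dem.Rep.,275,306,666
Congo,Rep.,209,222,485"""

# ",(?=\d)" matches a comma only when a digit follows; the digit itself is not consumed
df = pd.read_csv(StringIO(text), sep=r",(?=\d)", header=None, engine="python")

# Give the name column a label ("Country" is an illustrative choice) and use it as the index
df = df.rename(columns={0: "Country"}).set_index("Country")
print(df)
```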

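Answer 1 (score: 1)

If you want the full output as quickly as possible, jezrael's answer works.

If you would rather understand what is going on with some simpler code, try the following: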

  1. Split the string into a list of lines:

    data = "Czech Republic..."
    lines = data.split('\n')
    rows = []
    
  2. Then iterate over these lines and append them to a list of lists:

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False
    
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        temp = line.split(',')
        if is_number(temp[1]):
            rows.append(temp)  # the second field is already a number, so the name has no comma
        else:
            # the name contains a comma: rejoin its two parts and keep the remaining fields
            rows.append([','.join(temp[:2])] + temp[2:])
    
  3. Then take this list of lists and read the pandas DataFrame documentation to learn how to pretty-print it. (Hint: turn the list of lists into a dict first.)
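As a rough illustration of step 3, the sketch below follows the hint and turns the list of lists into a dict before building the DataFrame. The rows literal is a shortened, hypothetical stand-in for the list built in step 2, and using DataFrame.from_dict with orient="index" is one possible reading of the hint, not necessarily the one the answer had in mind.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the `rows` list built in step 2
rows = [
    ["Czech Republic", "22", "22", "9"],
    ["Congo,Dem.Rep.", "275", "306", "666"],
    ["Congo,Rep.", "209", "222", "485"],
]

# Map each country name to its values, converting the strings to integers
data = {row[0]: [int(value) for value in row[1:]] for row in rows}

# orient="index" makes every dict key a row label instead of a column name
df = pd.DataFrame.from_dict(data, orient="index")
print(df)
```

Calling df.reset_index() afterwards moves the country names from the index into an ordinary first column, if that layout is preferred.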

Answer 2 (score: 1)

A solution using the re.split() function and a DataFrame with labeled columns:

import pandas as pd, re

s = '''
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
'''
data = []
for l in s.split('\n'):
    if l: data.append(re.split(r',(?=\d)', l))

# setting output options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)
df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0][1:]))))

print(df)

Output:

     Country name    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
0  Czech Republic   22   22   22   21   21   21   21   21   19   18   16   14   13   12   11   11   10    9
1  Congo,Dem.Rep.  275  306  327  352  376  411  420  466  472  528  592  643  697  708  710  702  692  666
2      Congo,Rep.  209  222  231  243  255  269  424  457  367  545  313  354  402  509  477  482  511  485