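I have the data below. How can I convert it into a DataFrame? I need the country name (some country names contain a comma) as the first column, and each of the other values in its own column.
The input is a txt file containing many countries:
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo, Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
The output should be a DataFrame with the country name as the first column: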
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
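Answer 0 (score: 1)
You can first use read_csv (a .txt file is no problem) with a separator that does not occur in the data, such as |, so that each line is read as a single value of a Series. Then use str.extract to separate the country name from the numbers, str.strip to remove the trailing , from the name, and str.split to split the remaining values on , into columns: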
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from pandas
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
# after testing, replace 'StringIO(temp)' with 'filename.txt'
# the squeeze=True argument was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(StringIO(temp), sep="|", header=None).squeeze("columns")
print (s)
0 Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1 Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2 Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object
df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
# reset the column names to 0, 1, 2, ...
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
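If the country names should be the index instead, run the following in place of the split step above (that is, on the result of str.extract and str.strip):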
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 \
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
12 13 14 15 16 17
Czech Republic 13 12 11 11 10 9
Congo,Dem.Rep. 697 708 710 702 692 666
Congo,Rep. 402 509 477 482 511 485
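One thing worth noting about the str.split approach is that every resulting cell is a string, not a number. A minimal sketch of one way to convert them, assuming every field really is numeric; the small frame below is a hypothetical stand-in for the split result, not the full data:

```python
import pandas as pd

# Hypothetical stand-in for the frame produced by str.split:
# every cell is a string, and the country names are the index.
df = pd.DataFrame(
    [["22", "22", "21"], ["275", "306", "327"]],
    index=["Czech Republic", "Congo,Dem.Rep."],
)

# Convert each column to a numeric dtype. With the default errors="raise",
# a field that is not a number raises instead of being silently ignored.
df_numeric = df.apply(pd.to_numeric)

print(df_numeric.dtypes)
print(df_numeric.sum(axis=1))
```

After the conversion, arithmetic such as row sums behaves numerically rather than concatenating strings.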
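Another solution uses the regular expression from the other answer: it can be passed as the sep parameter. Regular-expression separators are only supported by the Python parser, so engine='python' is needed to avoid a warning: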
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from pandas
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
# after testing, replace 'StringIO(temp)' with 'filename.txt'
# raw string so that \d reaches the regex engine unchanged
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
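To see why this separator keeps the commas inside country names, it may help to try the pattern on a single line. The sketch below uses only the standard re module; the sample line is shortened for illustration:

```python
import re

# Split only at commas that are immediately followed by a digit.
# The lookahead (?=\d) inspects the next character without consuming it,
# so the digit stays part of the following field.
pattern = re.compile(r",(?=\d)")

line = "Congo,Dem.Rep.,275,306,327"
fields = pattern.split(line)
print(fields)  # ['Congo,Dem.Rep.', '275', '306', '327']
```

The comma after "Congo" is followed by a letter, so it is not treated as a separator. The same logic means this approach would break if a country name ever contained a comma directly followed by a digit.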
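Answer 1 (score: 1)
jezrael's answer works.
If you would like simpler code that you can really follow step by step, try the following.
Split the string into a list of lines: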
data = "Czech Republic..."
lines = data.split('\n')
rows = []
Then loop over the lines and append each one to a list of lists:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

for line in lines:
    if not line.strip():
        continue  # skip blank lines
    temp = line.split(',')
    if is_number(temp[1]):
        # the second field is already a number, so the name contains no comma
        rows.append(temp)
    else:
        # the name contains a comma: rejoin the first two fields into one name
        rows.append([','.join(temp[:2])] + temp[2:])
Then take this list of lists and read the pandas DataFrame documentation to see how to pretty-print it. (Hint: convert the list of lists to a dict first.)
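As a possible shortcut to the hint, pd.DataFrame also accepts a list of lists directly, so the dict step is optional. A minimal sketch, assuming rows has the shape produced by the loop; the two rows below are shortened, hypothetical examples:

```python
import pandas as pd

# Hypothetical rows, shaped like the result of the loop:
# the country name first, then the values as strings.
rows = [
    ["Czech Republic", "22", "22", "21"],
    ["Congo,Dem.Rep.", "275", "306", "327"],
]

df = pd.DataFrame(rows)

# Use the country name (column 0) as the index and
# convert the remaining string columns to integers.
df = df.set_index(0).rename_axis(None).astype(int)
print(df)
```

Keeping the names in the index is just one choice; dropping the set_index call leaves them as an ordinary first column, in which case only the value columns should be converted.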
Answer 2 (score: 1)
A solution using the re.split() function and a DataFrame with labeled columns:
import pandas as pd, re
s = '''
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
'''
data = []
for l in s.split('\n'):
    if l:  # skip empty lines
        data.append(re.split(r',(?=\d)', l))
# setting output options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)
df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0][1:]))))
print(df)
Output:
Country name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
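Since the question's input is a txt file rather than a string, the same re.split() idea can be applied line by line while reading the file. This is only a sketch: the helper name read_country_file and the file name countries.txt are made up for illustration, and a temporary file stands in for the real input.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def read_country_file(path):
    """Read lines of 'name,value,value,...' where the name may contain commas."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                # split only at commas that are followed by a digit
                rows.append(re.split(r",(?=\d)", line))
    return pd.DataFrame(rows)


# Demonstration with a temporary file standing in for the real .txt input.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "countries.txt"
    path.write_text(
        "Czech Republic,22,21\nCongo, Dem.Rep.,275,306\n", encoding="utf-8"
    )
    df = read_country_file(path)

print(df)
```

The values come back as strings here, just as in the answer above, so a numeric conversion is still needed before doing arithmetic on them.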