Reading a URL into a pandas DataFrame with column names (Python 3)

Time: 2017-03-09 10:51:34

Tags: python pandas url

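I have read several questions on this topic, but nothing seems to work for me.

I want to retrieve the data from this page, "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat", and give the columns specific names.

My code is below. It does not let me assign names to the data columns, because everything ends up in a single column: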

import pandas as pd
import io
import requests
url="http://archive.ics.uci.edu/ml/machine-learningdatabases/statlog/heart/heart.dat"
s=requests.get(url).content
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
c=pd.read_csv(io.StringIO(s.decode('utf-8')), names=header_row)
print(c)

The output is:

     age  sex  chestpain  \
0    70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4...  NaN        NaN   
1    67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6...  NaN        NaN   
2    57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3...  NaN        NaN   
3    64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2...  NaN        NaN

What do I need to do to achieve this?

Thank you very much!

1 answer:

Answer 0 (score: 1)

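The link in your code is missing a hyphen; I have corrected it in this answer. Basically, you need to decode the string s as UTF-8, split it into lines (on \n) to get each row, and then split each row on whitespace to get the individual values. This gives you a nested-list representation of the dataset, which you can convert to a pandas DataFrame and assign column names to.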

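import pandas as pd
import requests

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
# splitlines() avoids the empty trailing row that split('\n') leaves behind;
# the filter also drops any other blank lines
s_rows = [line for line in s.splitlines() if line.strip()]
# split each row on whitespace to get the individual values
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
# split() yields strings, so convert the columns to numbers
c = pd.DataFrame(s_rows_cols, columns=header_row).astype(float)
print(c.head())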
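
As an alternative to splitting the text by hand, pandas can probably do the whole job if read_csv is told that the separator is whitespace rather than a comma, which is why the original attempt put everything in one column. The sketch below is a minimal illustration, not a tested download of the real file: the two sample rows are made up to mimic the whitespace-separated layout, and an in-memory io.StringIO stands in for the downloaded text.

```python
import io
import pandas as pd

header_row = ['age', 'sex', 'chestpain', 'restBP', 'chol', 'sugar', 'ecg',
              'maxhr', 'angina', 'dep', 'exercise', 'fluor', 'thal', 'diagnosis']

# Made-up rows in the same whitespace-separated layout
# (illustrative only, NOT real values from the dataset).
sample = (
    "50.0 1.0 3.0 120.0 200.0 0.0 1.0 150.0 0.0 1.5 2.0 0.0 3.0 1\n"
    "60.0 0.0 4.0 140.0 250.0 1.0 2.0 130.0 1.0 0.5 1.0 1.0 7.0 2\n"
)

# sep=r'\s+' treats any run of whitespace as the delimiter;
# header=None says the data has no header row of its own,
# so the names in header_row are used as column labels.
c = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=header_row)
print(c.head())
```

With the real data, io.StringIO(sample) would be replaced by io.StringIO(s) using the decoded text from the code above; read_csv also accepts a URL directly, assuming the server is reachable.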