Python pandas read_csv - loading a tgz-compressed dataset into a DataFrame

Date: 2017-07-09 16:57:59

Tags: python pandas

I am trying to load the "California Housing" dataset directly from its source URL into a pandas DataFrame. The URL points to a tgz file that contains two files: cal_housing.data and cal_housing.domain.

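Loading the file with pandas read_csv mostly works, but it produces one error that I don't understand and want to get rid of: the first value of the DataFrame (first row, first column) is replaced by the file name.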

This is what cal_housing.data looks like:

0 -122.230000,37.880000,41.000000,880.000000,129.000000,322.000000,126.000000,8.325200,452600.000000
1 -122.220000,37.860000,21.000000,7099.000000,1106.000000,2401.000000,1138.000000,8.301400,358500.000000
2 -122.240000,37.850000,52.000000,1467.000000,190.000000,496.000000,177.000000,7.257400,352100.000000
3 ...

This is what cal_housing.domain looks like:

0 longitude: continuous.
1 latitude: continuous.
2 housingMedianAge: continuous. 
3 totalRooms: continuous. 
4 totalBedrooms: continuous. 
5 population: continuous. 
6 households: continuous. 
7 medianIncome: continuous. 
8 medianHouseValue: continuous. 

This is what I did:

import pandas as pd
source = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
col_names = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
data = pd.read_csv(source, compression='gzip', header=None, names=col_names).dropna()
print(type(data))

This is what I get:

0      CaliforniaHousing/cal_housing.data     37.88              41.0   ...
1                             -122.220000     37.86              21.0   ...
2                             -122.240000     37.85              52.0   ...
...

Finally, this is what I want:

0      -122.230000     37.88              41.0   ...
1      -122.220000     37.86              21.0   ...
2      -122.240000     37.85              52.0   ...
...

2 answers:

Answer 0: (score: 1)

OK, after some experimenting I found a solution. It is more complicated than I had hoped, so if you find a better one, please feel free to post it.

import pandas as pd
import io
import tarfile
import urllib.request
source = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
col_names = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
# Open the HTTP response as a gzip-compressed tar stream ("r|gz" reads sequentially, without seeking)
tar = tarfile.open(fileobj=urllib.request.urlopen(source), mode="r|gz")
for member in tar:
    # Of the two files, only cal_housing.data has 'data' in its name
    if 'data' in member.name: 
        # Read the member's bytes and parse them as CSV
        content = tar.extractfile(member).read()
        data = pd.read_csv(io.BytesIO(content), encoding='utf8', header=None, names=col_names)
print(data)

This is what I get:

0      -122.230000     37.88              41.0   ...
1      -122.220000     37.86              21.0   ...
2      -122.240000     37.85              52.0   ...
...
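
The same idea can be wrapped in a small reusable function. The sketch below is only an illustration: the name read_csv_from_tgz and its suffix parameter are hypothetical (they are not part of pandas), and the archive is a tiny synthetic one built in memory so that the example runs without a network connection. It matches on the file-name suffix and returns at the first match.

```python
import io
import tarfile

import pandas as pd


def read_csv_from_tgz(fileobj, suffix, **read_csv_kwargs):
    """Parse the first regular file whose name ends with `suffix` as CSV.

    Hypothetical helper: `fileobj` is any binary file-like object holding a
    gzip-compressed tar archive (for example a urlopen() response).
    """
    # "r|gz" reads the archive as a forward-only stream, so the source
    # does not need to support seeking.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                content = tar.extractfile(member).read()
                return pd.read_csv(io.BytesIO(content), **read_csv_kwargs)
    raise FileNotFoundError(f"no archive member ends with {suffix!r}")


# Synthetic stand-in for cal_housing.tgz (two rows, three columns).
payload = b"-122.23,37.88,41.0\n-122.22,37.86,21.0\n"
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as tar:
    info = tarfile.TarInfo("CaliforniaHousing/cal_housing.data")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive.seek(0)

data = read_csv_from_tgz(
    archive,
    ".data",
    header=None,
    names=["longitude", "latitude", "housingMedianAge"],
)
print(data)
```

For the real dataset, the in-memory archive would be replaced by urllib.request.urlopen(source) and the full col_names list, as in the code above.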

Answer 1: (score: 0)

From https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

  

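compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default   'infer' For on-the-fly decompression of on-disk data. If 'infer', then   use gzip, bz2, zip or xz if filepath_or_buffer is a string ending in   '.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression   otherwise. If using 'zip', the ZIP file must contain only one data   file to be read in. Set to None for no decompression. New in version   0.18.1: support for 'zip' and 'xz' compression.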

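I don't see tarballs in that list. It looks like you are trying to use 'gzip' on a compressed tarball. I suggest extracting the files locally and reading the DataFrame from the individual csv file.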
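
The following sketch illustrates what is probably going on, using only the standard library. It builds a small synthetic tarball in memory (the payload is stand-in data, not the real dataset), strips only the gzip layer the way compression='gzip' would, and inspects what is left. The result is still a tar stream: it begins with a 512-byte header block whose first field is the member name, and the file contents follow right after it. A CSV parser fed these bytes would see the header glued to the front of the first data line, which is consistent with the file name appearing in the first cell.

```python
import gzip
import io
import tarfile

# Build a small gzip-compressed tarball in memory (synthetic stand-in data).
payload = b"-122.23,37.88,41.0\n-122.22,37.86,21.0\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="CaliforniaHousing/cal_housing.data")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Undo only the gzip layer; the tar container is left untouched.
raw_tar = gzip.decompress(buf.getvalue())

# A ustar header block is 512 bytes; its first 100 bytes hold the
# NUL-padded member name.
header = raw_tar[:512]
member_name = header[:100].rstrip(b"\0").decode("ascii")
print(member_name)

# The member's contents start immediately after the header block.
first_data = raw_tar[512:512 + len(payload)]
print(first_data == payload)
```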