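I am trying to load the "California Housing" dataset directly from its source URL into a pandas DataFrame. The URL points to a tgz file containing two files: cal_housing.data and cal_housing.domain.
Loading the file with pandas read_csv mostly works, but it produces an error that I don't understand and want to get rid of: the first value of the DataFrame (first row, first column) is replaced by the file name.
This is what cal_housing.data looks like: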
0 -122.230000,37.880000,41.000000,880.000000,129.000000,322.000000,126.000000,8.325200,452600.000000
1 -122.220000,37.860000,21.000000,7099.000000,1106.000000,2401.000000,1138.000000,8.301400,358500.000000
2 -122.240000,37.850000,52.000000,1467.000000,190.000000,496.000000,177.000000,7.257400,352100.000000
3 ...
This is what cal_housing.domain looks like:
0 longitude: continuous.
1 latitude: continuous.
2 housingMedianAge: continuous.
3 totalRooms: continuous.
4 totalBedrooms: continuous.
5 population: continuous.
6 households: continuous.
7 medianIncome: continuous.
8 medianHouseValue: continuous.
This is what I did:
import pandas as pd
source = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
col_names = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
data = pd.read_csv(source, compression='gzip', header=None, names=col_names).dropna()
print(type(data))
This is what I get:
0 CaliforniaHousing/cal_housing.data 37.88 41.0 ...
1 -122.220000 37.86 21.0 ...
2 -122.240000 37.85 52.0 ...
...
And finally, this is what I want:
0 -122.230000 37.88 41.0 ...
1 -122.220000 37.86 21.0 ...
2 -122.240000 37.85 52.0 ...
...
Answer 0 (score: 1)
OK, after some playing around I found a solution. It is more complicated than I had hoped, so if you find a better one, feel free to post it.
import pandas as pd
import io
import tarfile
import urllib.request
source = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
col_names = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
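# Open the gzipped tarball as a stream, straight from the HTTP response.
tar = tarfile.open(fileobj=urllib.request.urlopen(source), mode="r|gz")
for member in tar:
    # Only cal_housing.data matches; cal_housing.domain is skipped.
    if 'data' in member.name:
        content = tar.extractfile(member).read()
        data = pd.read_csv(io.BytesIO(content), encoding='utf8', header=None, names=col_names)
print(data)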
This is what I get:
0 -122.230000 37.88 41.0 ...
1 -122.220000 37.86 21.0 ...
2 -122.240000 37.85 52.0 ...
...
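As a possible variation on the loop above, the column names could be read from cal_housing.domain instead of being typed out by hand. This is only a sketch: the helper names read_members, parse_domain and load_frame are made up for illustration, and it assumes every line of the domain file has the form "name: type." without the leading numbers shown in the listing in the question.

```python
import io
import tarfile


def read_members(fileobj):
    """Return {basename: bytes} for every regular file in a gzipped tarball."""
    members = {}
    # "r|gz" reads the archive as a forward-only stream, so it also works on
    # non-seekable objects such as the response of urllib.request.urlopen.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # Drop the directory part, e.g. "CaliforniaHousing/".
                name = member.name.rsplit("/", 1)[-1]
                members[name] = tar.extractfile(member).read()
    return members


def parse_domain(raw):
    """Extract column names from lines such as 'longitude: continuous.'."""
    lines = raw.decode("utf8").splitlines()
    return [line.split(":", 1)[0].strip() for line in lines if ":" in line]


def load_frame(fileobj):
    """Build the DataFrame from both archive members (hypothetical helper)."""
    # Imported here so the two helpers above stay usable without pandas.
    import pandas as pd

    members = read_members(fileobj)
    names = parse_domain(members["cal_housing.domain"])
    return pd.read_csv(io.BytesIO(members["cal_housing.data"]),
                       header=None, names=names)
```

With the source URL from the code above, the call would be load_frame(urllib.request.urlopen(source)).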
Answer 1 (score: 0)
From https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html:
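compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'. For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression otherwise. If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression. New in version 0.18.1: support for 'zip' and 'xz' compression.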
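I don't see tarballs in that list. It looks like you are trying to use 'gzip' on a gzipped tarball. I suggest extracting the files locally and reading the DataFrame from the separate CSV file.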
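A minimal sketch of both points, using only the standard library. gunzip_only removes just the gzip layer, which is roughly what compression='gzip' does; what remains is a raw tar stream whose header starts with the member name, which is presumably why the file name ends up in the first cell. extract_member follows the local-extraction suggestion. Both function names are hypothetical, and the sketch assumes the .tgz has already been downloaded to disk.

```python
import gzip
import tarfile
from pathlib import Path


def gunzip_only(tgz_bytes):
    """Undo only the gzip layer; the result is still a tar archive."""
    return gzip.decompress(tgz_bytes)


def extract_member(tgz_path, suffix, dest_dir):
    """Write the first regular file whose name ends with `suffix` into
    dest_dir and return the path of the extracted file."""
    with tarfile.open(tgz_path, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                # Keep only the base name so nothing is written outside dest_dir.
                target = Path(dest_dir) / Path(member.name).name
                with tar.extractfile(member) as src:
                    target.write_bytes(src.read())
                return target
    raise FileNotFoundError(f"no member ending in {suffix!r} in {tgz_path}")
```

With the archive saved as cal_housing.tgz, the read would then be pd.read_csv(extract_member('cal_housing.tgz', '.data', '.'), header=None, names=col_names), with col_names as defined in the question.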