Importing and parsing a .data file

Time: 2018-12-18 13:31:25

Tags: python-3.x pandas python-requests

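I have a file that I am trying to import and save as a pandas df. At first glance its rows and columns look neatly ordered, but in the end I had to do a lot of work to build the pandas df. Could you check whether there is a faster way to handle this?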

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

This is what I did:

import requests
import pandas as pd

r = requests.get(url)

file = r.text    

step_1 = file.split('\n')

# Remove empty strings. Rebuilding the list avoids deleting items while
# iterating over indices, which raises IndexError once the list shrinks.
step_1 = [line for line in step_1 if line]

step_2 = [i.split('\t') for i in step_1]

cars_names = [i[1] for i in step_2]

step_3 = [i[0].split(' ') for i in step_2]

for e in range(len(step_3)):         # remove empty strings in each sublist
    step_3[e] = [item for item in step_3[e] if item != '']


mpg        = [i[0] for i in step_3]
cylinders  = [i[1] for i in step_3]
disp       = [i[2] for i in step_3]
horsepower = [i[3] for i in step_3]
weight     = [i[4] for i in step_3]
acce       = [i[5] for i in step_3]
year       = [i[6] for i in step_3]
origin     = [i[7] for i in step_3]


list_cols = [cars_names, mpg, cylinders, disp, horsepower, weight, acce, year, origin]

# list_labels written manually:
list_labels = ['car name', 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin']

zipped = list(zip(list_labels, list_cols))

data = dict(zipped)

df = pd.DataFrame(data)

1 answer:

Answer 0 (score: 0)

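Once \t is replaced with a space, the file can be read with read_csv. You do need to wrap the text, though, because the first argument of read_csv, filepath_or_buffer, expects a path or an object with a read() method (such as a file handle or StringIO). Your problem then reduces to the question "read_csv doesn't read the column names correctly on this file?".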

import requests
import pandas as pd
from io import StringIO

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
r = requests.get(url)

file = r.text.replace("\t"," ")

# list_labels written manually:
list_labels = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin','car name']

# Raw string so that \s is passed to the regex engine and not treated as a string escape.
df = pd.read_csv(StringIO(file), sep=r"\s+", header=None, names=list_labels)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
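
As a self-contained illustration of the approach above, here is a minimal sketch that runs without network access. The helper name parse_auto_mpg and the two SAMPLE rows are assumptions introduced for this example: the rows are only written in the file's layout (space-padded numbers, then a tab, then a quoted car name) and stand in for the downloaded text. The na_values="?" argument is also an addition, on the assumption that the file marks missing horsepower values with a question mark.

```python
from io import StringIO

import pandas as pd

COLUMNS = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model year', 'origin', 'car name']


def parse_auto_mpg(text):
    """Parse whitespace-separated auto-mpg style text into a DataFrame.

    Hypothetical helper, not part of the original answer.
    """
    return pd.read_csv(
        StringIO(text),   # wrap the string so read_csv gets an object with read()
        sep=r"\s+",       # any run of spaces or tabs separates fields
        header=None,      # the file has no header row
        names=COLUMNS,
        na_values="?",    # assumption: "?" marks a missing value
    )


# Illustrative rows in the file's layout; in real use, pass r.text instead.
SAMPLE = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

sample_df = parse_auto_mpg(SAMPLE)
print(sample_df.dtypes)
```

Because the car name is enclosed in double quotes, the spaces inside it do not split the field, and since \s+ already matches tabs, replacing \t beforehand should not be strictly necessary in this sketch. read_csv also accepts a URL as its first argument, so passing url directly in place of StringIO(text) is likely to work as well and would remove the requests step.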