Loading an encoded CSV from a URL into pandas

Time: 2019-04-05 13:52:00

Tags: python pandas csv python-requests urllib

I am struggling to load the following CSV from this site into pandas.

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv

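I have tried a few approaches, but so far I have not managed to get a usable CSV. The end goal is to turn it into a pandas DataFrame.

Can anyone point me in the right direction and explain why the attempts below do not work?

Using Python 3.7, Windows 10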

import requests
import urllib.request
import csv

csv_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv'

response = urllib.request.urlopen(csv_url)
cr = csv.reader(response)
for row in cr:
    print(row)
# csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

response = urllib.request.urlopen(csv_url)
response = response.read().decode()
cr = csv.reader(response)
for row in cr:
    print(row)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28452: invalid start byte

response = requests.get(csv_url).text
cr = csv.reader(response)
for row in cr:
    print(row)
# malformed, prints individual characters

3 answers:

Answer 0 (score: 3)

If you are using pandas >= 0.19.2, you can pass the CSV URL directly to read_csv.

import pandas as pd

url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
# Without an explicit encoding you get:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
c = pd.read_csv(url, encoding="latin1")

Otherwise, use io.StringIO, i.e.:

import pandas as pd
import requests
from io import StringIO

url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s = requests.get(url).content  # raw bytes, not yet decoded
c = pd.read_csv(StringIO(s.decode("latin1")))
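
As for why the attempts in the question fail: csv.reader expects an iterable of text lines. A bytes response raises the "iterator should return strings, not bytes" error, and a plain str is iterated one character at a time, which is why single characters get printed. Here is a minimal offline sketch of that behaviour. The sample bytes are made up for illustration and are not taken from the real file, and cp1252 is assumed as the encoding.

```python
import csv
import io

# Made-up stand-in for response.read(): cp1252-encoded bytes containing 0x96.
raw = b"Supplier,Amount\r\nAcme \x96 UK,25000\r\n"

# Decode with an explicit encoding instead of the UTF-8 default.
text = raw.decode("cp1252")

# Passing the str itself: csv.reader treats each character as its own "line".
per_char = list(csv.reader(text))

# Wrapping the str in StringIO gives csv.reader real lines to iterate over.
rows = list(csv.reader(io.StringIO(text, newline="")))

print(per_char[:3])
print(rows)
```

The same StringIO wrapper is what lets pd.read_csv accept the decoded text in the snippet above.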

Answer 1 (score: 2)

This is an encoding problem: the file appears to use a Windows-specific encoding rather than UTF-8.

df = pd.read_csv(url, encoding='cp1252')

That should work.
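
To see what the encoding choice changes, here is a small sketch that decodes the byte named in the error message (0x96) three ways. The sample bytes are invented for illustration. Both latin1 and cp1252 avoid the exception, but only cp1252 turns 0x96 into a visible en dash; latin1 maps it to an invisible control character.

```python
# Made-up sample containing the byte from the UnicodeDecodeError (0x96).
raw = b"Apr \x96 Jun"

# UTF-8: 0x96 is a continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

as_cp1252 = raw.decode("cp1252")  # 0x96 becomes U+2013 (en dash)
as_latin1 = raw.decode("latin1")  # 0x96 becomes U+0096 (a C1 control character)

print(utf8_ok)
print(repr(as_cp1252))
print(repr(as_latin1))
```

If the file was exported from a Windows tool, cp1252 is the more likely intent, though that is a guess based on the byte value rather than anything the server declares.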

Answer 2 (score: 1)

Change the encoding to cp1252:

import io

import pandas as pd
import requests

url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s = requests.get(url).content  # raw bytes
c = pd.read_csv(io.StringIO(s.decode("cp1252")))