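I am trying to load the following CSV from this site into pandas.
I have tried a few approaches, but so far none of them has produced a usable CSV. The end goal is a pandas DataFrame.
Can anyone point me in the right direction and explain why the attempts below do not work?
Using Python 3.7 on Windows 10.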
import requests
import urllib.request
import csv

csv_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv'

# Attempt 1: pass the HTTP response straight to csv.reader
response = urllib.request.urlopen(csv_url)
cr = csv.reader(response)
for row in cr:
    print(row)
# csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

# Attempt 2: read and decode the whole body first
response = urllib.request.urlopen(csv_url)
response = response.read().decode()
cr = csv.reader(response)
for row in cr:
    print(row)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28452: invalid start byte

# Attempt 3: let requests decode the body
response = requests.get(csv_url).text
cr = csv.reader(response)
for row in cr:
    print(row)
# malformed, prints individual characters
Answer 0 (score: 3)
If you are using pandas >= 0.19.2, you can pass the CSV URL directly to read_csv.
import pandas as pd

url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
# Without an explicit encoding you get:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
c = pd.read_csv(url, encoding="latin1")
Otherwise, use StringIO, i.e.:
import pandas as pd
import requests
from io import StringIO

url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s = requests.get(url).content
c = pd.read_csv(StringIO(s.decode("latin1")))
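The StringIO wrapper also helps explain the third attempt in the question. csv.reader iterates over whatever it is given, and iterating a plain str yields one character at a time, so every character is parsed as its own line. A minimal sketch, using made-up sample text rather than the real download:

```python
import csv
from io import StringIO

# Made-up sample text standing in for the decoded download.
text = "Supplier,Amount\r\nACME,25000\r\n"

# Iterating a str yields single characters, so csv.reader
# treats each character as a separate line.
rows_from_str = list(csv.reader(text))

# StringIO iterates line by line, which is what csv.reader expects.
rows_from_stringio = list(csv.reader(StringIO(text, newline="")))

print(rows_from_str[0])
print(rows_from_stringio)
```

Calling text.splitlines() and passing the resulting list to csv.reader would likely work as well, as long as no quoted field contains an embedded newline.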
Answer 1 (score: 2)
This is an encoding problem: the file appears to use a Windows-specific encoding.
df = pd.read_csv(url, encoding='cp1252')
should work.
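To illustrate why the codec matters, here is a small sketch on a made-up byte string (not the real file contents). Byte 0x96 is an en dash in cp1252, is not a valid start byte in UTF-8, and maps to an invisible control character under latin1. So latin1 silences the error, while cp1252 is the one that recovers the intended dash, assuming the file really was written as cp1252.

```python
# Hypothetical sample bytes; the real file is assumed to contain 0x96
# in a similar context.
raw = b"Supplier,Description\r\nACME,Jan \x96 Mar\r\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x96 is a continuation byte, so it cannot start a UTF-8 sequence.
    utf8_ok = False

as_cp1252 = raw.decode("cp1252")  # 0x96 becomes U+2013 (en dash)
as_latin1 = raw.decode("latin1")  # 0x96 becomes U+0096 (control character)

print(utf8_ok)
print("\u2013" in as_cp1252)
print("\x96" in as_latin1)
```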
Answer 2 (score: 1)
Change the encoding to cp1252:
import pandas as pd
import io
import requests

url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode("cp1252")))
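For completeness, the first attempt in the question fails because the HTTP response yields bytes while csv.reader wants text lines. One possible fix without pandas is to wrap the byte stream in io.TextIOWrapper with the right encoding. The sketch below uses io.BytesIO with made-up bytes as a stand-in for the object returned by urllib.request.urlopen(csv_url); that substitution is an assumption made so the example runs offline.

```python
import csv
import io

# BytesIO stands in for the HTTP response object (an assumption for
# this sketch); the bytes are made-up sample data, not the real file.
raw_stream = io.BytesIO(b"Supplier,Description\r\nACME,Jan \x96 Mar\r\n")

# Decode lazily as cp1252; newline="" lets the csv module handle line endings.
text_stream = io.TextIOWrapper(raw_stream, encoding="cp1252", newline="")

rows = list(csv.reader(text_stream))
print(rows)
```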