Use python requests to download CSV

Time: 2016-02-12 19:49:32

Tags: python csv python-requests

Here is my code:

import csv
import requests
with requests.Session() as s:
    s.post(url, data=payload)
    download = s.get('url that directly download a csv report')

This gives me access to the CSV file. I tried different methods to deal with the download:

This gives the whole CSV file in one string:

print download.content

This prints the first row and then raises the error _csv.Error: new-line character seen in unquoted field:

cr = csv.reader(download, dialect=csv.excel_tab)
for row in cr:
    print row

This prints one character per row and won't print the whole thing:

cr = csv.reader(download.content, dialect=csv.excel_tab)
for row in cr:
    print row

My question is: what's the most efficient way to read the CSV file in this situation, and how do I download it?

thanks

10 Answers:

Answer 0 (Score: 44)

This should help:

import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'


with requests.Session() as s:
    download = s.get(CSV_URL)

    decoded_content = download.content.decode('utf-8')

    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)

Output sample:

['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type', 'sale_date', 'price', 'latitude', 'longitude']
['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879']
['51 OMAHA CT', 'SACRAMENTO', '95823', 'CA', '3', '1', '1167', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68212', '38.478902', '-121.431028']
['2796 BRANCH ST', 'SACRAMENTO', '95815', 'CA', '2', '1', '796', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68880', '38.618305', '-121.443839']
['2805 JANETTE WAY', 'SACRAMENTO', '95815', 'CA', '2', '1', '852', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '69307', '38.616835', '-121.439146']
[...]

Related question with answer: https://stackoverflow.com/a/33079644/295246


Edit: Other answers are useful if you need to download large files (i.e. stream=True).
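The decode-and-splitlines step above can be exercised without any network call; here is a minimal sketch with made-up bytes standing in for download.content:

```python
import csv

# Made-up bytes standing in for download.content (hypothetical data).
content = b'street,city\n3526 HIGH ST,SACRAMENTO\n'

decoded_content = content.decode('utf-8')
rows = list(csv.reader(decoded_content.splitlines(), delimiter=','))
print(rows)  # [['street', 'city'], ['3526 HIGH ST', 'SACRAMENTO']]
```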

Answer 1 (Score: 17)

To simplify these answers, and increase performance when downloading a large file, the below may work a bit more efficiently.

import requests
from contextlib import closing
import csv

url = "http://download-and-process-csv-efficiently/python.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
    for row in reader:
        print row   

By setting stream=True in the GET request, when we pass r.iter_lines() to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader.

This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
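On Python 3, r.iter_lines() yields bytes, so the lazy pipeline needs a decoding generator in the middle. A sketch with a plain iterator standing in for the response (no network; the data is made up):

```python
import csv

# Stand-in for r.iter_lines(): an iterator of byte lines (made-up data).
byte_lines = iter([b'a,b,c', b'1,2,3'])

# Decode lazily, one line at a time, so csv.reader still sees a generator.
text_lines = (line.decode('utf-8') for line in byte_lines)

rows = list(csv.reader(text_lines, delimiter=',', quotechar='"'))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```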

Answer 2 (Score: 7)

You can also use DictReader to iterate dictionaries of {'columnname': 'value', ...}:

import csv
import requests

response = requests.get('http://example.test/foo.csv')
# Decode each line so this also works on Python 3, where iter_lines() yields bytes.
reader = csv.DictReader(line.decode('utf-8') for line in response.iter_lines())
for record in reader:
    print(record)
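A self-contained illustration of the {'columnname': 'value', ...} mapping that DictReader yields (the sample rows are invented; io.StringIO stands in for the decoded response):

```python
import csv
import io

# Invented sample standing in for the downloaded CSV text.
sample = io.StringIO('street,city\n3526 HIGH ST,SACRAMENTO\n')

records = list(csv.DictReader(sample))
print(records[0]['street'])  # 3526 HIGH ST
print(records[0]['city'])    # SACRAMENTO
```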

Answer 3 (Score: 5)

I like the answers from The Aelfinn and aheld. I can improve them only by shortening a bit more, removing superfluous pieces, using a real data source, making them 2.x & 3.x-compatible, and maintaining the high level of memory-efficiency seen elsewhere:

import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with requests.get(CSV_URL, stream=True) as r:
    lines = (line.decode('utf-8') for line in r.iter_lines())
    for row in csv.reader(lines):
        print(row)

Too bad the 3.x solution is less flexible CSV-wise, because the iterator must emit Unicode strings (whereas requests emits bytes); the 2.x-only version (for row in csv.reader(r.iter_lines()):) is more Pythonic (shorter and easier to read). Anyhow, note that the 2.x/3.x solution above won't handle the situation the OP describes, where a NEWLINE appears unquoted in the data read.
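The newline caveat also cuts the other way for quoted newlines: splitting the text into lines first destroys newlines embedded in quoted fields, while feeding the whole text through io.StringIO lets the csv module handle them. A sketch with invented data:

```python
import csv
import io

data = 'name,notes\nalice,"line one\nline two"\n'

# Whole-text parsing keeps the quoted newline inside the field.
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['alice', 'line one\nline two']

# splitlines() discards the newline characters, so the embedded
# newline inside the quoted field is silently lost.
broken = list(csv.reader(data.splitlines()))
print('\n' in broken[1][1])  # False
```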

As for the OP's question about downloading (rather than processing) the actual CSV file, here is another script that does that; it is 2.x & 3.x-compatible, minimal, readable, and memory-efficient:

import os
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with open(os.path.split(CSV_URL)[1], 'wb') as f, \
        requests.get(CSV_URL, stream=True) as r:
    # Use iter_content, not iter_lines: iter_lines strips the newline
    # characters, which would corrupt the saved file.
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

Answer 4 (Score: 3)

From a little searching, I understand that the file should be opened in universal-newline mode, which you cannot do directly with the response content (I guess).

To finish the task, you can either save the downloaded content to a temporary file, or process it in memory.

Save as file:

import csv
import requests

download = requests.get('url that directly download a csv report')

# Save the downloaded content to a temporary file...
with open('report.csv', 'wb') as f:
    f.write(download.content)

# ...then reopen it in universal-newline mode for the csv module.
with open('report.csv', 'rU') as f:
    for row in csv.reader(f, dialect=csv.excel_tab):
        print(row)

In memory:

(To be updated)

Answer 5 (Score: 2)

If the file is very large, you can update the accepted answer to use the iter_lines method:

import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with requests.Session() as s:
    download = s.get(CSV_URL, stream=True)

    # iter_lines() yields bytes here, so decode each line explicitly.
    line_iterator = (x.decode('utf-8') for x in download.iter_lines())

    cr = csv.reader(line_iterator, delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)

Answer 6 (Score: 1)

The following approach worked well for me. I also did not need to use the csv.reader() or csv.writer() functions, which makes the code cleaner. The code is compatible with Python 2 and Python 3.

from six.moves import urllib

DOWNLOAD_URL = "https://raw.githubusercontent.com/gjreda/gregreda.com/master/content/notebooks/data/city-of-chicago-salaries.csv"
DOWNLOAD_PATH = "datasets/city-of-chicago-salaries.csv"
urllib.request.urlretrieve(DOWNLOAD_URL, DOWNLOAD_PATH)

Note: six is a package that helps in writing code compatible with both Python 2 and Python 3. For additional details on six, see: What does from six.moves import urllib do in Python?

Answer 7 (Score: 0)

I use the following code (I use Python 3):

import csv
import io
import requests

url = "http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv"
r = requests.get(url)
r.encoding = 'utf-8'  # useful if encoding is not sent (or not sent properly) by the server
csvio = io.StringIO(r.text, newline="")
data = []
for row in csv.DictReader(csvio):
    data.append(row)

Answer 8 (Score: 0)

Python 3 supported code:

import codecs
import csv
import requests
from contextlib import closing

with closing(requests.get(PHISHTANK_URL, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'), delimiter=',', quotechar='"')
    for record in reader:
        print(record)

Answer 9 (Score: 0)

This worked well for me:

import requests
from csv import DictReader

f = requests.get('https://somedomain.com/file').content.decode('utf-8')
reader = DictReader(f.split('\n'))
csv_dict_list = list(reader)
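One small note on f.split('\n'): it leaves a trailing empty string that splitlines() avoids, although DictReader skips blank rows either way. A quick comparison (data invented):

```python
text = 'a,b\n1,2\n'

print(text.split('\n'))   # ['a,b', '1,2', ''] - trailing empty entry
print(text.splitlines())  # ['a,b', '1,2']
```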