Run HTTP requests in parallel and asynchronously with PySpark

Date: 2018-11-20 15:45:21

Tags: http parallel-processing pyspark python-requests

I have a text file containing millions of URLs, and I have to run a POST request for each of them. I tried doing it on my own machine, but it takes far too long, so I would like to use a Spark cluster instead.

I wrote this PySpark code:

from pyspark.sql.types import StringType
import requests

url = ["http://myurltoping.com"]
list_urls = url * 1000  # the final code will just read my text file
# createDataFrame with StringType yields a single column named "value"
list_urls_df = spark.createDataFrame(list_urls, StringType())

print('number of partitions: {}'.format(list_urls_df.rdd.getNumPartitions()))

def execute_requests(list_of_url):
    # Runs once per partition on the executors; posts to each URL sequentially
    final_iterator = []
    for url in list_of_url:
        r = requests.post(url.value)  # each element is a Row, so read its "value" field
        final_iterator.append((r.status_code, r.text))
    return iter(final_iterator)

# mapPartitions is lazy; nothing runs until an action such as collect() is called
processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)

But it still takes a very long time. How can I make the function execute_requests more efficient, for example by launching the requests in each partition asynchronously?

Thanks!

1 answer:

Answer 0 (score: 0)

Using the Python package grequests (installable with pip install grequests) might be an easy solution for your problem without using Spark.

The documentation (found at https://github.com/kennethreitz/grequests) gives a simple example:

import grequests

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]

Create a set of unsent Requests:

>>> rs = (grequests.get(u) for u in urls)

Send them all at the same time:

>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
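
Note that grequests mirrors the requests API, so the same pattern should work for your POST case with grequests.post, and the size argument of grequests.map caps how many requests are in flight at once (a minimal sketch; the pool size of 20 is an arbitrary choice):

rs = (grequests.post(u) for u in urls)
responses = grequests.map(rs, size=20)  # at most 20 concurrent requests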

I found out that using gevent within a foreach on a Spark DataFrame results in some weird errors and does not work. It seems as if Spark itself also relies on gevent, which is what grequests uses...
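
If grequests cannot run inside Spark because of that gevent conflict, a thread pool from the standard library sidesteps gevent entirely. Below is a minimal sketch of execute_requests rewritten with concurrent.futures; the pool size of 20 is an arbitrary assumption, and requests must be installed on every executor:

from concurrent.futures import ThreadPoolExecutor
import requests

def execute_requests(list_of_url):
    # Materialize this partition's URLs so the pool can work through them
    urls = [row.value for row in list_of_url]

    def post(url):
        r = requests.post(url)
        return (r.status_code, r.text)

    # Threads work well here because the task is network-bound, not CPU-bound,
    # so the GIL is not a bottleneck
    with ThreadPoolExecutor(max_workers=20) as pool:
        return iter(list(pool.map(post, urls)))

processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)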