Run HTTP requests in parallel and asynchronously with PySpark

Date: 2018-11-20 15:45:21

Tags: http parallel-processing pyspark python-requests

I have a text file containing millions of URLs, and I have to run a POST request for each of them. I tried doing it on my own machine, but it takes far too long, so I would like to use a Spark cluster instead.

I wrote this PySpark code:

from pyspark.sql.types import StringType
import requests

url = ["http://myurltoping.com"]
list_urls = url * 1000  # the final code will just read my text file
# createDataFrame with StringType yields a single column named "value"
list_urls_df = spark.createDataFrame(list_urls, StringType())

print('number of partitions: {}'.format(list_urls_df.rdd.getNumPartitions()))

def execute_requests(list_of_url):
    # Runs once per partition on the executors; posts to each URL sequentially
    final_iterator = []
    for url in list_of_url:
        r = requests.post(url.value)  # each element is a Row, so read its "value" field
        final_iterator.append((r.status_code, r.text))
    return iter(final_iterator)

# mapPartitions is lazy; nothing runs until an action such as collect() is called
processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)

But it still takes a very long time. How can I make the function execute_requests more efficient, for example by launching the requests in each partition asynchronously?

Thanks!

1 answer:

Answer 0 (score: 0)

Using the Python package grequests (installable with pip install grequests) might be an easy solution for your problem without using Spark.

The documentation (found at https://github.com/kennethreitz/grequests) gives a simple example:

import grequests

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]

Create a set of unsent Requests:

>>> rs = (grequests.get(u) for u in urls)

Send them all at the same time:

>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
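
Note that grequests mirrors the requests API, so the same pattern should work for your POST case with grequests.post, and the size argument of grequests.map caps how many requests are in flight at once (a minimal sketch; the pool size of 20 is an arbitrary choice):

rs = (grequests.post(u) for u in urls)
responses = grequests.map(rs, size=20)  # at most 20 concurrent requests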

I found out that using gevent within a foreach on a Spark DataFrame results in some weird errors and does not work. It seems as if Spark itself also relies on gevent, which is what grequests uses...
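
If grequests cannot run inside Spark because of that gevent conflict, a thread pool from the standard library sidesteps gevent entirely. Below is a minimal sketch of execute_requests rewritten with concurrent.futures; the pool size of 20 is an arbitrary assumption, and requests must be installed on every executor:

from concurrent.futures import ThreadPoolExecutor
import requests

def execute_requests(list_of_url):
    # Materialize this partition's URLs so the pool can work through them
    urls = [row.value for row in list_of_url]

    def post(url):
        r = requests.post(url)
        return (r.status_code, r.text)

    # Threads work well here because the task is network-bound, not CPU-bound,
    # so the GIL is not a bottleneck
    with ThreadPoolExecutor(max_workers=20) as pool:
        return iter(list(pool.map(post, urls)))

processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)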