I have a text file containing several million URLs, and I have to run a POST request for each of them. I tried doing this on my own machine, but it takes far too long, so I would like to use a Spark cluster instead.
I wrote this PySpark code:
from pyspark.sql.types import StringType
import requests

url = ["http://myurltoping.com"]
list_urls = url * 1000  # the final code will just import my text file
list_urls_df = spark.createDataFrame(list_urls, StringType())
print('number of partitions: {}'.format(list_urls_df.rdd.getNumPartitions()))

def execute_requests(list_of_url):
    # posts the URLs of one partition sequentially and collects the results
    final_iterator = []
    for url in list_of_url:
        r = requests.post(url.value)
        final_iterator.append((r.status_code, r.text))
    return iter(final_iterator)

processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)
But it still takes a lot of time. How can I make the function execute_requests more efficient, for example by launching the requests in each partition asynchronously?
Thanks!
Answer 0 (score: 0)
Using the python package grequests (installable with pip install grequests) might be an easy solution for your problem without using spark.
The documentation (available at https://github.com/kennethreitz/grequests) gives a simple example:
import grequests

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]
Create a set of unsent Requests:
>>> rs = (grequests.get(u) for u in urls)
Send them all at the same time:
>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
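Transferred to your use case, a minimal sketch might look like the following (assuming the URLs sit one per line in a plain text file; the file name urls.txt and the size limit are placeholders to adjust):

import grequests

# hypothetical input file: one URL per line
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# build unsent POST requests and send them concurrently;
# size caps the number of requests in flight at once
rs = (grequests.post(u) for u in urls)
responses = grequests.map(rs, size=100)

# failed requests come back as None
results = [(r.status_code, r.text) for r in responses if r is not None]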
I found out that using gevent within a foreach on a spark Dataframe results in some weird errors and does not work. It seems as if spark also relies on gevent, which is used by grequests...
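If the work has to stay on the Spark cluster, one gevent-free alternative is to run a small thread pool inside each partition using the standard library's concurrent.futures. This is only a sketch of that idea (max_workers=32 is an arbitrary value to tune), not tested against your setup:

from concurrent.futures import ThreadPoolExecutor
import requests

def execute_requests(list_of_url):
    def post(url):
        r = requests.post(url.value)
        return (r.status_code, r.text)
    # requests releases the GIL while waiting on the network,
    # so a thread pool overlaps the latency of many calls
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(post, list_of_url))

processed_urls = list_urls_df.rdd.mapPartitions(execute_requests)

Since this uses only threads and blocking I/O, it avoids monkey-patching the interpreter the way gevent does, which should sidestep the conflicts with Spark described above.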