Parsing thousands of pages with BeautifulSoup

Date: 2017-09-08 13:16:44

Tags: python-3.x web-scraping beautifulsoup python-requests grequests

I have a script that parses a list containing several thousand URLs. My problem is that working through the whole list takes a very long time.

Each URL request takes about 4 seconds before the page is loaded and can be parsed. Is there a way to fetch and parse a large number of URLs faster?

My code looks like this:

from bs4 import BeautifulSoup   
import requests                 

#read url-list
with open('urls.txt') as f:
    content = f.readlines()
# remove whitespace characters
content = [line.strip('\n') for line in content]

#loop through the URL list and get the information
for i in range(5):
    try:
        for url in content:

            #get the page and parse it
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")

            #just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except requests.exceptions.RequestException as e:
        #skip URLs that fail to load
        print(e)

Edit: How can I handle asynchronous requests with hooks in this example? I tried the approach described in Asynchronous Requests with Python requests:

from bs4 import BeautifulSoup   
import grequests

def parser(response):
    for url in urls:

        #get information
        link = requests.get(response)
        data = link.text
        soup = BeautifulSoup(data, "html5lib")

        #just example scraping
        name = soup.find_all('h1', {'class': 'name'})

#read urls.txt and store in list variable
with open('urls.txt') as f:
    urls = f.readlines()
# you may also want to remove whitespace characters 
urls = [line.strip('\n') for line in urls]

# A list to hold our things to do via async
async_list = []

for u in urls:
    # The "hooks = {..." part is where you define what you want to do
    # 
    # Note the lack of parentheses following do_something, this is
    # because the response will be used as the first argument automatically
    rs = grequests.get(u, hooks = {'response' : parser})

    # Add the task to our list of things to do via async
    async_list.append(rs)

# Do our list of things to do via async
grequests.map(async_list, size=5)

This doesn't work for me. I don't even get any errors in the console; it just runs for a very long time and then stops.
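
One thing I notice when re-reading it: the parser hook above loops over the whole urls list and calls requests.get again for every response it receives, so the callbacks end up making thousands of extra blocking requests on top of the asynchronous ones, which would explain why it runs for so long. As far as I understand the hook mechanism, the callback only has to process the single response it is handed; a minimal sketch of that (same urls.txt and the same h1 example as above, not tested against my real pages):

from bs4 import BeautifulSoup
import grequests

def parser(response, *args, **kwargs):
    # the hook already receives the fetched page; parse it directly
    soup = BeautifulSoup(response.text, "html5lib")
    names = soup.find_all('h1', {'class': 'name'})
    print(response.url, [n.get_text(strip=True) for n in names])

# read urls.txt and strip whitespace
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# build the async requests with the parser attached as a response hook
reqs = (grequests.get(u, hooks={'response': parser}) for u in urls)

# run them with at most 5 concurrent requests
grequests.map(reqs, size=5)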

1 Answer:

Answer 0 (score: 0)

In case anyone is curious about this question: I decided to restart my project from scratch and use Scrapy instead of BeautifulSoup.

Scrapy is a complete web-scraping framework. It has built-in support for handling thousands of requests concurrently, and it can throttle your requests so that you are "friendlier" to the target site.
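
Just to sketch the idea (the spider name, file name and the h1 selector below mirror the example from the question rather than my actual project):

import scrapy

class NameSpider(scrapy.Spider):
    name = "names"

    # stay friendly to the target site: limit concurrency and add a small delay
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.25,
    }

    def start_requests(self):
        # read the same urls.txt as in the question
        with open("urls.txt") as f:
            for url in (line.strip() for line in f if line.strip()):
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # just example scraping: all <h1 class="name"> texts
        yield {"url": response.url,
               "names": response.css("h1.name::text").extract()}

You can run it without setting up a full project via scrapy runspider name_spider.py -o names.json, and adjust the settings (or turn on AUTOTHROTTLE_ENABLED) depending on how hard you want to hit the site.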

I hope this helps someone. For me it was the better choice for this project.