Finding specific URLs in a list of URLs with Python

Date: 2015-09-06 16:01:53

Tags: python web-crawler

I want to check whether certain links are present in a list of URLs by crawling each page. I wrote the following program and it works. However, I am stuck in two places:

  1. How can I read the links from a text file instead of hard-coding them in a list?
  2. The crawler takes nearly 4 minutes to crawl 100 web pages. Is there any way to make it faster?

    from bs4 import BeautifulSoup
    import urllib2
    import threading
    import time

    start = time.time()

    # Links I want to find
    urls = ["example.com/one", "example.com/two", "example.com/three"]

    # Links I want to find the above links in...
    url_list = ["example.com/1000", "example.com/1001", "example.com/1002",
                "example.com/1003", "example.com/1004"]

    print_lock = threading.Lock()
    #with open("links.txt") as f:
    #    url_list = [url.strip() for url in f.readlines()]

    def fetch_url(page):
        with print_lock:
            print "Crawled" + " " + page
        try:
            html_page = urllib2.urlopen(page)
            soup = BeautifulSoup(html_page)
            links = soup.findAll(href=True)   # every tag with an href attribute
        except urllib2.HTTPError:
            return                            # skip pages that fail to load
        for link in links:
            href = link.get("href")
            for target in urls:
                if target in href:
                    with print_lock:
                        print "Found" + " " + target + " " + "in" + " " + page

    # One thread per page, so the downloads overlap instead of running serially.
    threads = [threading.Thread(target=fetch_url, args=(page,)) for page in url_list]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    print "Entire job took:", time.time() - start
    

1 Answer:

Answer 0 (score: 0)

If you want to read the links from a text file, use the code you commented out, as in the sketch below.
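A minimal version of that, assuming links.txt holds one URL per line:

    # Assuming links.txt contains one URL per line; blank lines are skipped.
    with open("links.txt") as f:
        url_list = [line.strip() for line in f if line.strip()]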

As for the "performance" problem: your code blocks on the urlopen read until the site's content has been returned. Ideally you want to run these requests in parallel, for example by using threads to implement a parallel solution; a thread-pool sketch follows.
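As an illustration (not from the original answer), here is a thread-pool sketch using multiprocessing.dummy, which ships with Python 2; the fetch helper and the pool size are placeholders:

    from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool
    import urllib2

    def fetch(page):
        # Download one page and return it together with its URL.
        return page, urllib2.urlopen(page).read()

    pool = ThreadPool(10)                 # 10 worker threads
    results = pool.map(fetch, url_list)   # runs the downloads concurrently
    pool.close()
    pool.join()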

Here's an example using a different approach, with gevent (non-standard):
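A minimal sketch of such a gevent version (gevent is a third-party package; the fetch helper is a placeholder name):

    import gevent
    from gevent import monkey
    monkey.patch_all()   # make blocking socket I/O cooperative

    import urllib2

    def fetch(page):
        # Download one page inside a greenlet.
        return page, urllib2.urlopen(page).read()

    # Spawn one greenlet per page; they all wait on the network concurrently.
    jobs = [gevent.spawn(fetch, page) for page in url_list]
    gevent.joinall(jobs, timeout=10)
    pages = [job.value for job in jobs if job.value is not None]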