Finding specific URLs in a list of URLs with Python

Date: 2015-09-06 16:01:53

Tags: python web-crawler

I want to check whether certain links are present in a list of URLs by crawling each page. I wrote the following program and it works. However, I am stuck in two places:

  1. How can I read the links from a text file instead of hard-coding them in a list?
  2. The crawler takes nearly 4 minutes to crawl 100 web pages. Is there any way to make it faster?

    from bs4 import BeautifulSoup
    import urllib2
    import threading
    import time

    start = time.time()

    # Links I want to find
    urls = ["example.com/one", "example.com/two", "example.com/three"]

    # Links I want to find the above links in...
    url_list = ["example.com/1000", "example.com/1001", "example.com/1002",
                "example.com/1003", "example.com/1004"]

    print_lock = threading.Lock()
    #with open("links.txt") as f:
    #    url_list = [url.strip() for url in f.readlines()]

    def fetch_url(page):
        with print_lock:
            print "Crawled" + " " + page
        try:
            html_page = urllib2.urlopen(page)
            soup = BeautifulSoup(html_page)
            links = soup.findAll(href=True)   # every tag with an href attribute
        except urllib2.HTTPError:
            return                            # skip pages that fail to load
        for link in links:
            href = link.get("href")
            for target in urls:
                if target in href:
                    with print_lock:
                        print "Found" + " " + target + " " + "in" + " " + page

    # One thread per page, so the downloads overlap instead of running serially.
    threads = [threading.Thread(target=fetch_url, args=(page,)) for page in url_list]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    print "Entire job took:", time.time() - start
    

1 Answer:

Answer 0 (score: 0)

If you want to read the links from a text file, use the code you commented out, as in the sketch below.
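A minimal version of that, assuming links.txt holds one URL per line:

    # Assuming links.txt contains one URL per line; blank lines are skipped.
    with open("links.txt") as f:
        url_list = [line.strip() for line in f if line.strip()]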

As for the "performance" problem: your code blocks on the urlopen read until the site's content has been returned. Ideally you want to run these requests in parallel, for example by using threads to implement a parallel solution; a thread-pool sketch follows.
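As an illustration (not from the original answer), here is a thread-pool sketch using multiprocessing.dummy, which ships with Python 2; the fetch helper and the pool size are placeholders:

    from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool
    import urllib2

    def fetch(page):
        # Download one page and return it together with its URL.
        return page, urllib2.urlopen(page).read()

    pool = ThreadPool(10)                 # 10 worker threads
    results = pool.map(fetch, url_list)   # runs the downloads concurrently
    pool.close()
    pool.join()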

Here's an example using a different approach, with gevent (non-standard):
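A minimal sketch of such a gevent version (gevent is a third-party package; the fetch helper is a placeholder name):

    import gevent
    from gevent import monkey
    monkey.patch_all()   # make blocking socket I/O cooperative

    import urllib2

    def fetch(page):
        # Download one page inside a greenlet.
        return page, urllib2.urlopen(page).read()

    # Spawn one greenlet per page; they all wait on the network concurrently.
    jobs = [gevent.spawn(fetch, page) for page in url_list]
    gevent.joinall(jobs, timeout=10)
    pages = [job.value for job in jobs if job.value is not None]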