Is this actually using threading to scrape URLs?

Asked: 2014-11-26 14:34:35

Tags: python multithreading beautifulsoup python-multithreading

Thanks in advance for your help. I'm new to Python and trying to figure out how to use the threading module to scrape URLs from the New York Daily News site. I put together the following, and the script does scrape, but it doesn't seem any faster than before, so I'm not sure threading is actually happening. If it is, can you tell me? Is there anything I can add so I can tell for myself? Any other tips on threading?

Thanks.

from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import os
import io
import threading

def fetch_url():
    for i in xrange(15500, 6100, -1):
        page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
        soup = BeautifulSoup(page.read())
        snippet = soup.find_all('h2')
        for h2 in snippet:
            for link in h2.find_all('a'):
                logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
        print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = threading.Thread(target=fetch_url())
    threads.start()

2 Answers:

Answer 0 (score: 2)

Here is a naive implementation (which will quickly get you blacklisted by nydailynews.com):

def fetch_url(i, logfile):
    page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('h2')
    for h2 in snippet:
        for link in h2.find_all('a'):
            logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
    print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = []
    for i in xrange(15500, 6100, -1):
        t = threading.Thread(target=fetch_url, args=(i, logfile))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

Note that fetch_url now takes the number to substitute into the URL as an argument, and each possible value of that argument is run in its own separate thread.

I strongly recommend splitting the work into smaller batches and running one batch at a time.
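The batching idea can be sketched as follows. This is a minimal illustration, not the answerer's code: fetch_url here is a stand-in that only records which page it handled, so the example runs without touching the network (in a real scraper it would do the urlopen/BeautifulSoup work).

```python
import threading

def fetch_url(i, results):
    # Stand-in for the real scraper: just record which page was handled.
    # list.append is atomic in CPython, so no extra lock is needed here.
    results.append(i)

def run_in_batches(page_numbers, batch_size=10):
    """Run fetch_url over page_numbers, at most batch_size threads at a time."""
    results = []
    for start in range(0, len(page_numbers), batch_size):
        batch = page_numbers[start:start + batch_size]
        threads = [threading.Thread(target=fetch_url, args=(i, results))
                   for i in batch]
        for t in threads:
            t.start()
        # Wait for the whole batch to finish before launching the next one,
        # so at most batch_size requests are ever in flight.
        for t in threads:
            t.join()
    return results

pages = run_in_batches(list(range(20, 0, -1)), batch_size=5)
```

This caps concurrency at batch_size instead of opening thousands of simultaneous connections; a thread pool (e.g. concurrent.futures in Python 3) achieves the same with less bookkeeping.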

Answer 1 (score: 1)

No, you are not using threads. threads = threading.Thread(target=fetch_url()) calls fetch_url() in the main thread, waits for it to finish, and passes its return value (None) to the threading.Thread constructor.
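The difference between target=fetch_url() and target=fetch_url can be demonstrated without any network code; the work function and the "worker" thread name below are made up for illustration:

```python
import threading

calls = []

def work():
    # Record the name of the thread this function actually runs in.
    calls.append(threading.current_thread().name)

# Wrong: work() is called immediately, in the main thread. Its return
# value (None) becomes the Thread's target, so the thread itself does nothing.
t_wrong = threading.Thread(target=work())
t_wrong.start()
t_wrong.join()

# Right: pass the function object itself; work() runs in the new thread.
t_right = threading.Thread(target=work, name="worker")
t_right.start()
t_right.join()

# calls is now ["MainThread", "worker"]: the first call happened in the
# main thread, only the second in a separate one.
```

Dropping the parentheses in the original script (target=fetch_url) is the fix, though as Answer 0 shows, the function also needs to handle one page per call for the threads to actually run in parallel.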