Thanks in advance for your help. I'm new to Python and trying to figure out how to use the threading module to scrape URLs from the New York Daily News site. I put the following together, and the script is scraping, but it doesn't seem any faster than before, so I'm not sure the threading is actually happening. Can you tell me whether it is? Is there anything I can write in so that I can tell? Any other tips on threading?
Thank you.
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import os
import io
import threading

def fetch_url():
    for i in xrange(15500, 6100, -1):
        page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
        soup = BeautifulSoup(page.read())
        snippet = soup.find_all('h2')
        for h2 in snippet:
            for link in h2.find_all('a'):
                logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
        print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = threading.Thread(target=fetch_url())
    threads.start()
Answer 0 (score: 2)
Here is a naive implementation (which will quickly get you blacklisted from nydailynews.com):
def fetch_url(i, logfile):
    page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('h2')
    for h2 in snippet:
        for link in h2.find_all('a'):
            logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
    print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = []
    for i in xrange(15500, 6100, -1):
        t = threading.Thread(target=fetch_url, args=(i, logfile))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
Note that fetch_url now takes the number to substitute into the URL as an argument, and each possible value of that argument is started in its own separate thread.
I would strongly suggest dividing the job into smaller batches and running one batch at a time.
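The batching idea can be sketched like this (shown in Python 3 syntax for illustration, since the original code is Python 2; the network fetch is replaced with a stub `fetch_url` that just records its page number, and `run_in_batches` and the batch size of 50 are hypothetical names/values, not from the original):

```python
import threading

def fetch_url(i, results):
    # Stub standing in for the real page fetch: just record the page number.
    results.append(i)

def run_in_batches(page_numbers, batch_size):
    # Process pages in fixed-size batches so at most batch_size
    # threads are alive at any one time.
    results = []
    for start in range(0, len(page_numbers), batch_size):
        batch = page_numbers[start:start + batch_size]
        threads = [threading.Thread(target=fetch_url, args=(i, results))
                   for i in batch]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait for the whole batch before starting the next one
    return results

pages = list(range(15500, 6100, -1))
done = run_in_batches(pages, 50)
```

This way a crash or ban partway through loses at most one batch of work, and the server never sees thousands of simultaneous connections.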
Answer 1 (score: 1)
No, you are not using threads. threads = threading.Thread(target=fetch_url()) calls fetch_url() in the main thread, waits for it to finish, and passes its return value (None) to the threading.Thread constructor.
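The difference can be demonstrated with a small self-contained example (Python 3 syntax for illustration; `work` and `results` are hypothetical names, not from the original code):

```python
import threading

results = []

def work(n):
    # Runs in whichever thread invokes it; record which one that was.
    results.append((n, threading.current_thread().name))

# Wrong: target=work(5) would call work(5) immediately in the main thread
# and hand the Thread constructor its return value (None) as the target.
# Right: pass the function object itself, and its arguments separately.
t = threading.Thread(target=work, args=(5,))
t.start()
t.join()
```

After t.join(), results shows that work ran with n=5 in a worker thread rather than in "MainThread", which is exactly the behavior the broken target=fetch_url() version fails to achieve.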