It would be great if someone could help me multithread this script and write the output to a text file. I'm new to coding, so please help me out.
#!/usr/bin/python
from tornado import ioloop, httpclient
from BeautifulSoup import BeautifulSoup
from mechanize import Browser
import requests
import urllib2
import socket
import sys
def handle_request(response):
    print response.code
    global i
    i -= 1
    if i == 0:
        ioloop.IOLoop.instance().stop()

i = 0
http_client = httpclient.AsyncHTTPClient()
for url in open('urls.txt'):
    try:
        br = Browser()
        br.set_handle_robots(False)
        res = br.open(url, None, 2.5)
        data = res.get_data()
        soup = BeautifulSoup(data)
        title = soup.find('title')
        if soup.title is not None:
            print url, title.renderContents(), '\n'
        i += 1
    except urllib2.URLError, e:
        print "Oops, timed out?", '\n'
    except socket.error, e:
        print "Oops, timed out?", '\n'
    except socket.timeout:
        print "Oops, timed out?", '\n'

print 'Processing of list completed, Cheers!!'

try:
    ioloop.IOLoop.instance().start()
except KeyboardInterrupt:
    ioloop.IOLoop.instance().stop()
I'm trying to grep the HTTP titles of a list of hosts.
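Neither suggestion below covers the second half of the question, writing the results to a text file. Here is a minimal sketch of that part, assuming Python 3's standard-library concurrent.futures for the threading; the file names urls.txt and titles.txt, the 10-worker pool size, and the helper names are my assumptions, not anything from the thread:

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def extract_title(html):
    """Return the contents of the first <title> tag, or None if absent."""
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

def fetch_title(url):
    """Fetch one URL; return (url, title) or (url, error message)."""
    try:
        html = urlopen(url, timeout=2.5).read().decode('utf-8', 'replace')
        return url, extract_title(html)
    except Exception as exc:
        return url, 'Oops, timed out? (%s)' % exc

def main():
    # Read the URL list, fetch titles on 10 threads, write results to titles.txt.
    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=10) as pool, open('titles.txt', 'w') as out:
        for url, title in pool.map(fetch_title, urls):
            out.write('%s: %s\n' % (url, title))

# main()  # uncomment to run against a real urls.txt
```

pool.map returns results in input order, so the output file lines correspond line-for-line to urls.txt even though the fetches run concurrently.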
Answer (score: 2)
The basic idea you have already implemented is a non-blocking HTTP client.
from tornado import ioloop, httpclient

def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body

http_client = httpclient.AsyncHTTPClient()
for url in ["http://google.com", "http://twitter.com"]:
    http_client.fetch(url, handle_request)
ioloop.IOLoop.instance().start()  # start the event loop so the callbacks run
You can iterate over your URLs, and the callback will be invoked as soon as the response for a particular URL becomes available.
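The `i` counter in the question's code is reaching for a standard companion pattern here: count outstanding requests and stop the IOLoop once the last callback has fired. A sketch of that pattern, where `Countdown` is a hypothetical helper name of my own, not a tornado API:

```python
class Countdown(object):
    """Call on_done() once hit() has been called `count` times."""

    def __init__(self, count, on_done):
        self.remaining = count
        self.on_done = on_done

    def hit(self):
        # Each finished request calls hit(); the last one triggers on_done.
        self.remaining -= 1
        if self.remaining == 0:
            self.on_done()

# Hypothetical wiring with tornado (not executed here):
#   done = Countdown(len(urls), lambda: ioloop.IOLoop.instance().stop())
#   def handle_request(response):
#       print response.code
#       done.hit()
#   for url in urls:
#       http_client.fetch(url, handle_request)
#   ioloop.IOLoop.instance().start()
```

This keeps the counter initialization (`len(urls)`) separate from the callback, which is the bug in the question's version, where `i` is reset inside the handler.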
I would not mix mechanize, ioloop, ... if it is not necessary.
Apart from that, I would recommend grequests. It is a lightweight tool that satisfies your requirements.
import grequests
from bs4 import BeautifulSoup

urls = ['http://google.com', 'http://www.python.org/']
rs = (grequests.get(u) for u in urls)
res = grequests.map(rs)

for r in res:
    soup = BeautifulSoup(r.text)
    print "%s: %s" % (r.url, soup.title.text)