Question

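I am trying to scrape articles from the Wall Street Journal. This involves logging in with mechanize and scraping with BeautifulSoup. I was hoping someone could take a look at my code and explain to me why it isn't working.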

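I am using Python 2.7 on a 2012 MacBook Pro running the latest software. I'm new to Python, so please explain it to me like I'm 5. Any advice would be deeply appreciated. Thanks in advance.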

from bs4 import BeautifulSoup
import cookielib
import mechanize

#Browser
br = mechanize.Browser()

#Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj) 

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False) 

# User-Agent 
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.set_debug_http(True) # Print HTTP headers.
# Want more debugging messages?
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# The site we will navigate into, handling its session
br.open('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=id-wsj')

# Select the first (index zero) form
br.select_form(nr=0)

# User credentials
br.form['username'] = 'username'
br.form['password'] = 'password' 

# Login
br.submit()

#br.open("http://online.wsj.com/home-page")
br.open("http://online.wsj.com/news/articles/SB10001424052702304626304579506924089231470?mod=WSJ_hp_LEFTTopStories&mg=reno64-wsj&url=http%3A%2F%2Fonline.wsj.com%2Farticle%2FSB10001424052702304626304579506924089231470.html%3Fmod%3DWSJ_hp_LEFTTopStories&cb=logged0.9458705162058179")
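# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(br.response().read(), 'html.parser')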
title = soup.find('h1')
print title
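
One way to narrow this down is to check which forms and field names are actually present in the HTML the server returns, since `br.select_form(nr=0)` and `br.form['username']` only work if the first form on the page really is the login form with those field names. The following is a minimal sketch, not a fix: `FormLister`, `list_forms`, and the `sample` page are hypothetical names and made-up markup for illustration, not the real WSJ login page. In the script above, the HTML would come from `br.response().read()` right after `br.open(...)`.

```python
# Standard-library HTML parser; the module name differs between Python 2 and 3.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class FormLister(HTMLParser):
    """Collect every <form> and the names of the fields inside it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []        # one dict per <form>, in document order
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self._current = {'action': attrs.get('action'), 'fields': []}
            self.forms.append(self._current)
        elif tag in ('input', 'select', 'textarea') and self._current is not None:
            name = attrs.get('name')
            if name:  # unnamed controls (e.g. a bare submit button) are skipped
                self._current['fields'].append(name)

    def handle_endtag(self, tag):
        if tag == 'form':
            self._current = None


def list_forms(html):
    """Return a list of {'action': ..., 'fields': [...]} dicts for the page."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms


# Hypothetical sample page (an assumption, not the real WSJ markup):
# the login form is the second form, so nr=0 would select the wrong one.
sample = '''
<html><body>
<form action="/search"><input type="text" name="q"></form>
<form action="/login">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
</body></html>
'''

for index, form in enumerate(list_forms(sample)):
    print(index, form['action'], form['fields'])
```

If the list comes back empty for the real page, or the login form is not at index 0, or its fields are not named `username` and `password`, then the form selection step is failing before any scraping happens. An empty list may also mean the form is built by JavaScript in the browser, which mechanize does not execute.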

Scraping with Mechanize and BS4

0 answers: