Getting different results in web scraping

Asked: 2013-11-11 23:36:44

Tags: python python-2.7 web-scraping beautifulsoup web-crawler

I am trying to scrape a web page with the following code:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

# Fetch the archive page with mechanize
br = mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)

# Collect every link inside <li data-section="Chennai"> items
for tag_li in soup.findAll('li', attrs={"data-section": "Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]

        # Open each linked article and concatenate its paragraph text
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()

        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text

        print articletext

With this code I do not get any printed output. However, when I use attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"}, I get the desired output. Can anyone help me?

1 answer:

Answer 0 (score: 1)

Read this site's terms of service before scraping it.

If you use Firebug, or Inspect Element in Chrome, you may see content that is not there when you fetch the page with Mechanize or urllib2.

For example, view the source of the page you posted (right-click → View Page Source in Chrome) and search for data-section: you will not find any tag for Chennai. I am not 100% sure, but I would say that content is populated by JavaScript etc., which requires the capabilities of a browser.
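To make the diagnosis concrete, here is a small illustrative check of whether the *raw* HTML (what Mechanize or urllib2 receives) actually contains the attribute being selected on. The two HTML snippets are made up for demonstration; the real page would be fetched and compared the same way:

```python
# Hypothetical snippets: what the server sends vs. what the browser renders.
raw_html = """
<ul>
  <li data-section="Business"><a href="/biz/1">Story one</a></li>
</ul>
"""  # server response, before any JavaScript runs

rendered_html = """
<ul>
  <li data-section="Business"><a href="/biz/1">Story one</a></li>
  <li data-section="Chennai"><a href="/chn/1">Story two</a></li>
</ul>
"""  # what the browser shows after scripts populate the page

def has_section(html, name):
    """Return True if a data-section attribute with this value appears."""
    return 'data-section="%s"' % name in html

print(has_section(raw_html, "Business"))     # True  -> findAll matches
print(has_section(raw_html, "Chennai"))      # False -> explains the empty output
print(has_section(rendered_html, "Chennai")) # True  -> only after JS runs
```

If the marker is missing from the raw response but present in the browser's rendered DOM, no amount of BeautifulSoup parsing will find it; you need a tool that executes JavaScript.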

If I were you, I would use Selenium to open the page and grab the page source from there; the HTML collected that way will be much closer to what you see in your browser.

Cited here

from selenium import webdriver
from bs4 import BeautifulSoup
import time    

driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here, sleep til page fully loaded.
time.sleep(10)

soup = BeautifulSoup(driver.page_source)
print len(soup.findAll(...))
# or you can work directly in selenium      
...

driver.close()

My output was 8.
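A fixed time.sleep(10) either wastes time or is too short on a slow connection. Selenium's own tool for this is WebDriverWait with an expected condition; the generic pattern is "poll until the content appears or a timeout expires". A minimal sketch of that pattern, using a hypothetical wait_for helper (driver and the lambda are assumptions showing how it would be called):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value, or raise after
    `timeout` seconds. A generic stand-in for Selenium's WebDriverWait;
    with a real driver, `condition` could be e.g.
        lambda: 'data-section="Chennai"' in driver.page_source
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise RuntimeError("condition not met within %.1f seconds" % timeout)
```

This way the script proceeds as soon as the JavaScript-populated markup shows up, instead of always sleeping the full ten seconds.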