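What should I do when I try to scrape an article but all kinds of ads keep showing up? Specifically, the ones that pop up in the middle of the screen asking you to log in or sign up, which you have to close manually before you can read.
As a result, my scraper cannot extract anything. Any suggestions on how to code a "close the ad before scraping" step with pyquery?
Edit: I am now working with Selenium to try to get rid of the pop-ups. Any advice would be much appreciated.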
import mechanize
import time
import urllib2
import pdb
import lxml.html
import re
from pyquery import PyQuery as pq

def open_url(url):
    print 'open url:', url
    try:
        br = mechanize.Browser()
        br.set_handle_equiv(True)
        br.set_handle_redirect(True)
        br.set_handle_referer(True)
        br.set_handle_robots(False)
        br.addheaders = [('user-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3')]
        response = br.open(url)
        html = response.get_data()
        return html
    except:
        print u"!!!! url can not be open by mechanize either!!! \n"

def extract_text_pyquery(html):
    p = pq(html)
    article_whole = p.find(".entry-content")
    p_tag = article_whole('p')
    print len(p_tag)
    print p_tag
    for i in range(0, len(p_tag)):
        text = p_tag.eq(i).text()
        print text
    entire = p.find(".grid_12")
    author = entire.find('p')
    print len(author)
    print "By:", author.text()
    images = p.find('#main_photo')
    link = images('img')
    print len(link)
    for i in range(len(link)):
        url = pq(link[i])
        result = url.attr('src').find('smedia')
        if result > 0:
            print url.attr('src')

if __name__ == '__main__':
    #print '----------------------------------------------------------------'
    url_list = ['http://www.newsobserver.com/2014/10/17/4240490/obama-weighs-ebola-czar-texas.html?sp=/99/100/&ihp=1',
                ]
    html = open_url(url_list[0])
    # dissect_article(html)
    extract_text_pyquery(html)

Answer 0 (score: 0)
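If you plan to keep scraping that particular site, you can check for an element with id="continue_link" and pull the href from it. Then load that page and scrape it.
For example, the page behind the URL in your url_list contains this element: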
<a href="http://www.bnd.com/2014/10/10/3447693_rude-high-school-football-players.html?rh=1" id="continue_link" class="wp_bold_link wp_color_link wp_goto_link">Skip this ad</a>
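You can then navigate directly to that link without going through any kind of ad gateway. I'm more familiar with BeautifulSoup than with what you are using, but it seems you could do something like this at the top of extract_text_pyquery:
    p = pq(html)
    continue_link = p.find("#continue_link")
    if continue_link:
        # Follow the "Skip this ad" link and scrape the real article page instead.
        html = open_url(continue_link.attr('href'))
        extract_text_pyquery(html)
        return
    # <rest of code if there is no continue link>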
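
To check the idea without installing anything, here is a minimal sketch that pulls the href out of the id="continue_link" element shown above. It uses only the standard library's HTML parser rather than pyquery, so it is a swapped-in technique, not what the question uses. The names ContinueLinkFinder and find_continue_link are hypothetical helpers made up for this illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question


class ContinueLinkFinder(HTMLParser):
    """Hypothetical helper: remember the href of the first <a id="continue_link">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only the first matching anchor is recorded.
        if self.href is None and tag == 'a' and attrs.get('id') == 'continue_link':
            self.href = attrs.get('href')


def find_continue_link(html):
    """Return the "Skip this ad" target URL, or None if there is no such link."""
    finder = ContinueLinkFinder()
    finder.feed(html)
    return finder.href


# The element quoted earlier in this answer, used as sample input.
sample = ('<a href="http://www.bnd.com/2014/10/10/3447693_rude-high-school-football-players.html?rh=1" '
          'id="continue_link" class="wp_bold_link wp_color_link wp_goto_link">Skip this ad</a>')

print(find_continue_link(sample))
print(find_continue_link('<p>No ad gateway here</p>'))
```

If find_continue_link returns a URL, pass it to open_url and scrape the page that comes back. If it returns None, the page you already have is presumably the article itself.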