Ad-block detector blocks urllib.request.urlopen()

Time: 2016-12-02 04:21:14

Tags: python web-scraping adblock

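I am using Python to scrape newspaper websites and collect the actual story as plain text after stripping out the various HTML tags and so on.

My simple code is as follows: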

import urllib.request
from bs4 import BeautifulSoup

#targetURL = 'http://indianexpress.com/article/india/mamata-banerjee-army-deployment-nh-2-in-west-bengal-military-coup-4405871'
targetURL = "http://timesofindia.indiatimes.com/india/Congress-Twitter-hacking-Police-form-cyber-team-launch-probe/articleshow/55737598.cms"
#targetURL = 'http://www.telegraphindia.com/1161201/jsp/nation/story_122343.jsp#.WEDzfXV948o'

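# Download the raw HTML of the article page
with urllib.request.urlopen(targetURL) as response:
    html = response.read()
soup = BeautifulSoup(html, 'lxml')

# Print the text of every <p> tag in the page
for el in soup.find_all("p"):
    print(el.text)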

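When I fetch the indianexpress.com URL or the telegraphindia.com URL, the code works fine and I always get the story as plain text, with no junk.

However, the timesofindia.indiatimes.com site has an ad-block detector, and in that case the output is as follows: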

We have noticed that you have an ad blocker enabled which restricts ads served on the site.
Please disable to continue reading.

How can I get past this ad-block detector and retrieve the page? Any suggestions would be appreciated.

1 Answer:

Answer 0: (score: 0)

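It looks like the actual content you are trying to extract is not inside <p> tags. The ad-blocker warning, however, is inside such a tag. That warning text is always part of the HTML document, but it is only shown to the user when the ads fail to load. In other words, the page is not blocking the request; the loop over <p> tags simply picks up the hidden warning and misses the story.

Try extracting the contents of the <arttextxml> tag instead.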
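
A minimal sketch of that suggestion, assuming the story text still sits inside an <arttextxml> element (the site's markup may have changed since this answer was written). The helper name extract_story and the sample HTML are hypothetical and only illustrate the lookup. The built-in "html.parser" is used here so the example runs without lxml; the 'lxml' parser from the question should work the same way.

```python
from bs4 import BeautifulSoup


def extract_story(html, tag_name="arttextxml"):
    """Return the text inside the first <tag_name> element, or None if absent.

    extract_story is a hypothetical helper; tag_name defaults to the tag
    named in the answer above.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(tag_name)
    if container is None:
        return None
    # separator=" " keeps text from adjacent child tags from running together
    return container.get_text(separator=" ", strip=True)


# Hypothetical, simplified page layout for illustration only:
# the warning lives in a <p> tag, the story in <arttextxml>.
sample_html = """
<html><body>
<p>We have noticed that you have an ad blocker enabled.</p>
<arttextxml><div>Police formed a cyber team.</div><div>A probe was launched.</div></arttextxml>
</body></html>
"""

story = extract_story(sample_html)
print(story)
```

With the code from the question, the same helper could be called as extract_story(html) after html = response.read(). It returns None when the tag is missing, which makes it easy to fall back to the <p>-based loop for the other sites.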