Question

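I am using Python to scrape newspaper websites and collect the actual story as plain text after stripping out the various HTML tags and so on.

My simple code is as follows: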

import urllib.request
from bs4 import BeautifulSoup

#targetURL = 'http://indianexpress.com/article/india/mamata-banerjee-army-deployment-nh-2-in-west-bengal-military-coup-4405871'
targetURL = "http://timesofindia.indiatimes.com/india/Congress-Twitter-hacking-Police-form-cyber-team-launch-probe/articleshow/55737598.cms"
#targetURL = 'http://www.telegraphindia.com/1161201/jsp/nation/story_122343.jsp#.WEDzfXV948o'

# Download the raw HTML of the article page
with urllib.request.urlopen(targetURL) as url:
    html = url.read()
soup = BeautifulSoup(html, 'lxml')

# Print the text of every <p> element
for el in soup.find_all("p"):
    print(el.text)

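When I fetch the indianexpress.com URL or the telegraphindia.com URL, the code works fine, and I always get the story as plain text with no junk.

The timesofindia.com site, however, has an ad-blocker detector, and in that case the output is: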

We have noticed that you have an ad blocker enabled which restricts ads served on the site.
Please disable to continue reading.

How can I get past this ad-blocker detector and retrieve the page? Any suggestions would be appreciated.

Answer 1

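It looks like the actual content you are trying to extract is not inside <p> tags. The ad-blocker warning, however, is inside such tags. That warning text is always part of the HTML document, but it is only shown to the user when the ads fail to load.

Try extracting the content of the <arttextxml> tag instead.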
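
A minimal sketch of that suggestion, assuming the story text really does sit inside an <arttextxml> element on the page. The helper names extract_story and fetch_story and the sample HTML are hypothetical, and html.parser is used here only so the example does not depend on lxml:

```python
import urllib.request
from bs4 import BeautifulSoup


def extract_story(html):
    """Return the article text from <arttextxml> if present, else from <p> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the site wraps the story in a custom <arttextxml> element.
    container = soup.find("arttextxml")
    if container is not None:
        return container.get_text(" ", strip=True)
    # Fallback for sites that keep the story in ordinary paragraphs.
    return "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))


def fetch_story(url):
    """Download a page and extract its story text."""
    with urllib.request.urlopen(url) as response:
        return extract_story(response.read())


# Offline sample (made up) that mimics the situation described above:
# the warning lives in a <p>, the story lives in <arttextxml>.
sample = (
    "<html><body>"
    "<p>Please disable to continue reading.</p>"
    "<arttextxml><div>Story text here.</div></arttextxml>"
    "</body></html>"
)
print(extract_story(sample))
```

With this approach the ad-blocker warning is skipped because only the story container is read; calling fetch_story(targetURL) would apply the same extraction to the live page, provided its markup still matches the assumption.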
