Question

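I am trying to use lxml to scrape articles from a news website I subscribe to.

I am logged in to the site in every browser on my computer (does that even matter?), but whenever I try to get any text from a particular article using the following: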

import requests
from lxml import html

page = requests.get("http://www.SomeWebsite.com/blah/blah/blah.html")

tree = html.fromstring(page.text)

article = tree.xpath('//div/p/text()')

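I get the following response:

['You have already viewed your free articles. If you would like to view more, please click the button below.']

Any ideas or suggestions on how to get around this?

Disclaimer: I am new to Python and web scraping.

Edit: solved using the Selenium library.

Solution posted below.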

Answer 1

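So basically you want to scrape a website and display its content in a better way on your own site.

I would suggest Kimono, a web scraping service that gives you an API for fetching the data in a suitable model.

Check it out; it should do the job.

If not, you can write your own scraper in PHP (PHP Simple HTML DOM Parser) or in JavaScript, which has libraries for this as well.

Sorry, I don't know Python, but with Kimono's API you can do this from Python too.

Hope it helps!

Happy coding!!!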

Answer 2

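So the website I was trying to scrape rejected every POST request I sent (I tried Python, R, and PHP), and I found that I could only load the news articles with an actual browser.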
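
For context, a request-based login of the kind this site rejected usually looks something like the minimal sketch below. It also shows why being logged in from a desktop browser makes no difference: requests keeps its own cookies and never sees the browser's. The form field names (`username`, `password`), the credentials, and the helper `fetch_with_login` are hypothetical, and the login URL is assumed to be the one used in the Selenium code in this answer. A real site may also require a CSRF token or reject scripted logins outright, as happened here.

```python
"""Sketch: log in with a requests-style session, then fetch an article.

The form field names below are hypothetical; inspect the real login form
to find the right ones.
"""

LOGIN_URL = "http://www.SomeWebsite.com/login"
ARTICLE_URL = "http://www.SomeWebsite.com/blah/blah/blah.html"


def fetch_with_login(session, login_url, article_url, username, password):
    """Post credentials, then fetch the article with the same session.

    `session` is expected to behave like requests.Session: cookies set by
    the login response are sent automatically on the later GET.
    """
    # Hypothetical field names; many sites also expect a CSRF token.
    payload = {"username": username, "password": password}
    login_response = session.post(login_url, data=payload)
    login_response.raise_for_status()  # fail early on an HTTP error status

    page = session.get(article_url)
    page.raise_for_status()
    return page.text


def main():
    # Imported here so the helper above can be exercised without requests.
    import requests

    with requests.Session() as session:
        html_text = fetch_with_login(
            session, LOGIN_URL, ARTICLE_URL, "my_user", "my_password"
        )
    print(html_text[:200])
```

Calling `main()` runs the sketch against the placeholder URLs. If the returned text is still the paywall message, the site is probably not accepting the scripted login, and driving a real browser is the fallback.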

Thanks to @duhaime, I used Selenium to accomplish this. Here is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By

# I used Firefox, but you could use Chrome or IE
browser = webdriver.Firefox()

browser.get('http://www.SomeWebsite.com/login')
# I needed to stop the script here to actually log in.
# I tried to use an existing profile w/ my username & password but the website
# rejected my profile info and locked me out of the account
input('Log in in the browser window, then press Enter to continue...')

browser.get('http://www.SomeWebsite.com/blah/blah/blah.html')

# find_element(By.ID, ...) replaces find_element_by_id, which recent Selenium releases no longer provide
article_text = browser.find_element(By.ID, "TheElementYouNeed").text
# This grabs all the text from the article at this particular 'id' element

Selenium bindings documentation: http://selenium-python.readthedocs.org/en/latest/installation.html#introduction
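
One way to avoid stopping the script by hand during login is to poll the browser until the login appears to be finished. The sketch below is only an illustration under an assumption: that the site redirects away from the login URL after a successful login. The helpers `wait_until` and `left_login_page` are hypothetical names, not part of Selenium; the only Selenium feature relied on is the driver's `current_url` property.

```python
import time


def wait_until(predicate, timeout=120.0, poll_interval=1.0):
    """Call predicate() repeatedly until it returns True or time runs out.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def left_login_page(browser, login_url):
    """Assumed heuristic: the site leaves login_url once the login succeeds."""
    return browser.current_url != login_url


# Usage with the browser object from the code above (not executed here):
#
#   login_url = 'http://www.SomeWebsite.com/login'
#   browser.get(login_url)
#   if wait_until(lambda: left_login_page(browser, login_url)):
#       browser.get('http://www.SomeWebsite.com/blah/blah/blah.html')
```

Selenium also ships its own `WebDriverWait` class for this kind of polling; the plain-Python version is used here only to keep the waiting logic visible.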

Scraping a website with login information using Python
