Question

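I am using Goose to read the title and body text of an article from a URL. However, this does not work with Twitter URLs, I guess because the HTML tag structure is different. Is there a way to read the tweet text from such a link?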

One example of a tweet (shortened link) is the following:

https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1

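Note: I know how to read tweets through the Twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.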

Answer 1

Scrape it yourself

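Open the URL of the tweet, pass the page to an HTML parser of your choice, and extract the content with the XPath you are interested in.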

Scraping is discussed at http://docs.python-guide.org/en/latest/scenarios/scrape/

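If the structure of the site is always the same, you can get the XPath by right-clicking the desired element, selecting "Inspect", right-clicking the highlighted line in the Inspector, and selecting "Copy" > "Copy XPath". Otherwise, choose attributes that identify exactly the object you want.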

In your case:

//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()

gives you the author's name, and

//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()

gives you the content of the tweet.
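To see what these two expressions select without touching the network, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption: it only imitates the class names targeted above and is not Twitter's real markup.

```python
from lxml import html

# Hypothetical, hand-written page: it only imitates the class names
# targeted by the XPath expressions and is NOT Twitter's actual markup.
SAMPLE = """
<html><body>
  <div class="permalink-inner permalink-tweet-container">
    <strong class="fullname show-popup">Example Author</strong>
    <p class="TweetTextSize tweet-text">Hello <a href="#">link</a> world</p>
  </div>
  <div class="replies">
    <p class="tweet-text">A reply outside the permalink container</p>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# contains(@class, ...) matches even when the element carries several classes.
author = tree.xpath(
    '//div[contains(@class, "permalink-tweet-container")]'
    '//strong[contains(@class, "fullname")]/text()'
)

# //text() collects every text node under the <p>, including text inside
# nested tags such as <a>, so the tweet comes back in several pieces.
text_nodes = tree.xpath(
    '//div[contains(@class, "permalink-tweet-container")]'
    '//p[contains(@class, "tweet-text")]//text()'
)

print(author)
print(text_nodes)
```

Only the paragraph inside the permalink container is matched, and the nested link is why the text arrives as a list of fragments rather than a single string.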

Full working example:

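from lxml import html
import requests

# Fetch the tweet's permalink page; no API credentials are involved
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
# Parse the raw HTML bytes into an element tree
tree = html.fromstring(page.content)
# Collect every text node inside the tweet's "tweet-text" paragraph
text_nodes = tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
print(text_nodes)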

Result:

['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
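The result is a list of separate text nodes because the link inside the tweet is split across several child elements. A minimal sketch of stitching them back into one string, assuming plain concatenation followed by whitespace normalization is acceptable. The helper name join_text_nodes is made up for illustration, and the sample list is a shortened stand-in with a placeholder URL, not the real output.

```python
def join_text_nodes(nodes):
    """Concatenate XPath text nodes and collapse runs of whitespace."""
    joined = "".join(nodes)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including newlines and the non-breaking space (\xa0).
    return " ".join(joined.split())


# Shortened, hypothetical stand-in for the list returned by tree.xpath(...)
sample = [
    "Breaking:\nsome headline...\n\n",
    "https://www.",
    "example.com/story",
    "\xa0",
    "\u2026",
    "pic.twitter.com/UiGEZq7Eq6",
]

print(join_text_nodes(sample))
```

Joining with an empty string first keeps the URL fragments glued together. Joining with spaces would instead break the link apart.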
