Question

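I need to extract the text of each embedded tweet on a web page separately. The code below works, but I need to get rid of the leading and trailing lines, Skip Twitter post by... and End Twitter post by..., as well as the date and Report, so that only the tweet remains. I can't even see where these lines come from or which tags they use. Any help is much appreciated!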

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all( 'div', {'class': 'social-embed'})]
tweets = '\n'.join(article_soup)
print(tweets)

Answer 1

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(article_soup)
print(tweets)

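If you also want the author of each tweet, it is a bit trickier, because the author has no tag of its own. So I used Python to strip the tweet text and the date from the text of each blockquote, which leaves the author, as follows: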

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
articles_soup = soup.find_all('blockquote', {'class': 'twitter-tweet'})
tweets = []
for article_soup in articles_soup:
    tweet = article_soup.find('p').get_text()
    # The last <a href='...'></a> is the date, others are part of the tweet
    date = article_soup.find_all('a')[-1].get_text()
    full_text = article_soup.get_text()
    # Whatever sits between the tweet text and the date is the author;
    # slicing by explicit end index also works when the date is empty
    tweet_author = full_text[len(tweet):len(full_text) - len(date)].strip()
    tweets.append((tweet_author, tweet))
print(tweets)

Note 1: If you only want part of tweet_author, you can take the first element of each tuple and extract the part you need from it.
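As a rough sketch of that last step, the snippet below splits the author string into a display name and a handle. It assumes the leftover text looks like a leading dash followed by "Name (@handle)", which is how embedded tweets are commonly rendered but is not confirmed for this page; `split_author` and `AUTHOR_PATTERN` are hypothetical names introduced only for illustration.

```python
import re

# Assumed shape: optional leading punctuation (such as a dash), a display
# name, then the handle in parentheses, e.g. "Jane Doe (@janedoe)".
AUTHOR_PATTERN = re.compile(r'^\W*(?P<name>.*?)\s*\(@(?P<handle>\w+)\)\s*$')

def split_author(tweet_author):
    """Return (display_name, handle); fall back to (tweet_author, None)
    when the string does not match the assumed shape."""
    match = AUTHOR_PATTERN.match(tweet_author)
    if match is None:
        return tweet_author, None
    return match.group('name'), match.group('handle')
```

Applied to the list built above, something like `[split_author(author) for author, _ in tweets]` would give one (name, handle) pair per tweet. Strings that do not fit the assumed shape are returned unchanged, so nothing is silently lost.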

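Note 2: The code in the question does not always return the tweets. The problem lies with the HTML page, because some elements are sometimes missing from the response. A quick workaround is to call requests.get again; I suggest you look into this issue. Once I received a response that contained the tweets from the original question, I found the tags and got the tweets you were expecting, each on its own line.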
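One possible way to automate the "run it again" workaround is a small retry helper. This is only a sketch: `fetch_with_retry`, its parameters, and the default of three attempts with a one-second pause are assumptions made for illustration, not part of requests or BeautifulSoup.

```python
import time

def fetch_with_retry(fetch, is_complete, attempts=3, delay=1.0):
    """Call fetch() until is_complete(result) is true or attempts run out.

    Returns the first complete result, or None if every attempt
    came back incomplete.
    """
    for attempt in range(attempts):
        result = fetch()
        if is_complete(result):
            return result
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before asking again
    return None
```

With the code above, `fetch` could be `lambda: requests.get('https://www.bbc.co.uk/news/uk-44496876')` and `is_complete` a function that parses `r.content` with BeautifulSoup and checks that `find_all('p', {'dir': 'ltr'})` is non-empty. Passing both in as callables keeps the retry logic independent of how the page is fetched or checked.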

Extracting text from embedded tweets with Python and BeautifulSoup
