Question

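I need to extract tweets that are embedded in text articles. The problem with the pages I'm testing is that the tweets load in only about 5 out of 10 runs. So I need to use Selenium to wait for the page to load, but I can't get it to work. I followed the steps from their official website: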

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'

# Start a headless Chrome session
options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path='/Users/ME/Downloads/chromedriver', chrome_options=options)
driver.implicitly_wait(15)
driver.get(url)

# Parse the rendered page and collect the text of every <p dir="ltr">
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "lxml")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)

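I can't use the option of waiting for a specific element to appear, because I'm scanning different pages and not all of them have embedded tweets. So, to check whether Selenium actually makes a difference, I ran the script above alongside a script that doesn't use Selenium and compared their results: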

import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'

# Fetch the raw HTML without a browser (no JavaScript is executed)
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
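
One thing worth noting about the Selenium script: `implicitly_wait` only affects element lookups such as `find_element`, so reading `page_source` right after `get()` is not delayed by it. Given the constraint that some pages have no tweets at all, one possible direction is a bounded poll that returns as soon as something is found and otherwise gives up with an empty result after a timeout. The sketch below is only an illustration of that idea: `poll_until` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a real page lookup so the example runs without a browser.

```python
import time


def poll_until(fetch, timeout=15.0, interval=0.5):
    """Call fetch() until it returns something non-empty or the timeout elapses.

    Returns the first non-empty result, or an empty list if nothing
    appeared in time (for example, a page with no embedded tweets).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            return []
        time.sleep(interval)


# Hypothetical stand-in for a page whose tweets appear only on the third look.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    return ["tweet text"] if calls["n"] >= 3 else []


found = poll_until(fake_fetch, timeout=2.0, interval=0.01)
missing = poll_until(lambda: [], timeout=0.05, interval=0.01)
```

With Selenium, `fetch` could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, 'p[dir="ltr"]')` (with `By` imported from `selenium.webdriver.common.by`), reading `.text` from each returned element. Whether the tweets on a given page end up in the main document or inside an iframe is not verified here, so the selector is an assumption carried over from the scripts above.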

I would really appreciate any help from this wonderful community!

Scraping embedded tweets from a web page using Selenium and BeautifulSoup

0 answers: