Question

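I am working on a Python exercise that asks me to get the latest news from the Google News website by web scraping and print it to the console. To do this, I just used the Beautiful Soup library to retrieve the news. This is my code: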

import bs4
from bs4 import BeautifulSoup
import urllib.request

news_url = "https://news.google.com/news/rss";
URLObject = urllib.request.urlopen(news_url);
xml_page = URLObject.read();
URLObject.close();

soup_page = BeautifulSoup(xml_page,"html.parser");
news_list = soup_page.findAll("item");

for news in news_list:
  print(news.title.text);
  print(news.link.text);
  print(news.pubDate.text);
  print("-"*60);

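But it gives me errors and does not print 'link' and 'pubDate'. After some research, I saw some answers on Stack Overflow saying that, because the site uses Javascript, the Selenium package should be used in addition to Beautiful Soup. Although I don't understand how Selenium works, I updated the code as follows: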

from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request

driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver");
driver.maximize_window();
driver.get("https://news.google.com/news/rss");
content = driver.page_source.encode("utf-8").strip();
soup = BeautifulSoup(content, "html.parser");
news_list = soup.findAll("item");

print(news_list);

for news in news_list:
  print(news.title.text);
  print(news.link.text);
  print(news.pubDate.text);
  print("-"*60);

However, when I run it, a blank browser page opens and this is printed to the console:

 raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
  (Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)

Answer 1

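I just tried it, and the code below works for me. The items = line is horrible, apologies in advance. But for now it works...

EDIT: I just updated the snippet; you can use ElementTree.iter('tag') to iterate over all nodes with that tag: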

import urllib.request
import xml.etree.ElementTree

news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()

# Parse XML page
e = xml.etree.ElementTree.fromstring(xml_page)

# Get the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
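The same ElementTree approach can be tried offline, which helps when the feed is unreachable. The following is only a minimal sketch: SAMPLE_RSS is made-up data (not the real Google News feed) and parse_items is a hypothetical helper name. It uses findtext with a default so that an item missing a tag yields an empty string instead of raising AttributeError.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only so the sketch runs without network access.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def parse_items(xml_bytes):
    """Return one dict per <item>, with '' for any missing child tag."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["title"])
    print(entry["link"])
    print(entry["pubDate"], "\n")
```

To use it on the real feed, pass the bytes read from urllib.request.urlopen to parse_items instead of SAMPLE_RSS.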

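EDIT2: Discussion of personal preferences for scraping libraries
Personally, for interactive/dynamic pages where I have to do stuff (click here, fill in a form, get results...), I use selenium, and usually I don't need bs4, because you can use selenium directly to find and parse the specific nodes of the page you are looking for.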

I use bs4 together with requests (rather than urllib.request) to parse more static web pages in projects where I don't want to have a whole webdriver installed.

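There is nothing wrong with using urllib.request, but requests (see its docs) is one of the best Python packages (in my opinion) and a great example of how to create a simple yet powerful API.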

Answer 2

Just use BeautifulSoup together with requests.

from bs4 import BeautifulSoup
import requests

r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')

# do whatever you need with news_list
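A likely reason the 'xml' parser helps where "html.parser" did not: HTML parsers lower-case tag names, so pubDate is stored as pubdate and news.pubDate finds nothing. I also believe link is treated as an empty HTML element, which would explain the blank link text, but the sketch below only demonstrates the case-folding part. It uses the standard library's HTMLParser directly; TagCollector, seen_tags and the sample fragment are hypothetical names and data for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records start-tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def seen_tags(markup):
    """Feed markup to the HTML parser and return the tag names it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# Hypothetical RSS fragment, not taken from the real feed.
sample = "<item><title>Example</title><pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate></item>"
print(seen_tags(sample))  # ['item', 'title', 'pubdate']
```

An XML parser keeps the original case, which is why the lookup for pubDate works there.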

Web scraping with Python, Beautiful Soup and Selenium not working
