Question

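I am using Selenium for web scraping and want to switch to Beautiful Soup, but I am new to the library. I want to get every company name and publication time, then move on to the next page.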

Here is my current code using Selenium:

driver.get('http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml')
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="sibian"]/tbody/tr/td/table[2]/tbody/tr/td[2]/a')]
    for link in links:
        driver.get(link)
        driver.implicitly_wait(10)
        windows = driver.window_handles
        driver.switch_to.window(windows[-1])              
        time = driver.find_element_by_xpath('//*[@class="con_bj"]/table[3]/tbody/tr/td/publishtime').text
        company = driver.find_element_by_xpath('//*[@class="title_A"]').text
        driver.back()
    if len(links) < 20:
        break

I tried to do the same thing with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

html='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num=link.find('a').get('href')
    print(num)

But I get nothing back and am stuck at the first step.

Can you help?

Answer 1

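You are not making a request. You are treating BeautifulSoup as an HTTP request library, but it is only a parser: in your code it receives the URL string itself and parses that text, not the page behind it. Think of driver.get() as the counterpart of requests.get() (yes, I know they are not the same, but it makes this easier to understand). You need to do something like this: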

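from bs4 import BeautifulSoup
import requests

html_link = 'http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
# Fetch the page first; BeautifulSoup only parses the HTML it is given.
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for cell in soup.find_all('td'):
    anchor = cell.find('a')
    # Skip cells without a link; find() returns None for them.
    if anchor is not None:
        print(anchor.get('href'))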

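This will let you debug your code further. It may still not work, because some sites require specific headers, such as User-Agent, or reject automated requests outright. Requests is a very easy library to use (subjective, of course), and there is plenty of support for it on this site. To save you some trouble, I will tell you up front: if the site requires JavaScript, Selenium or some variant is the best choice.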
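
To make the header point concrete, here is a minimal sketch of the fetch-then-parse split. Several things in it are assumptions rather than facts about the site: the User-Agent value is a made-up placeholder, the function names are only illustrative, and the title_A class and publishtime tag are taken from the XPaths in the question, so they may not match what the server returns without JavaScript. Pagination is left out because the next-page URL is not shown in the question.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LIST_URL = "http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml"
# Assumption: a browser-like header; the value this site expects is unknown.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}


def fetch_html(url):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> nested inside a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    # Relative hrefs are resolved against the URL of the page they came from.
    return [urljoin(base_url, a["href"]) for a in soup.select("td a[href]")]


def parse_detail(html):
    """Return (company, publish_time); either value is None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: selectors mirror the XPaths used in the Selenium code.
    title = soup.find(class_="title_A")
    published = soup.find("publishtime")
    return (
        title.get_text(strip=True) if title else None,
        published.get_text(strip=True) if published else None,
    )


def main():
    """Fetch the list page, then print the fields from each linked page."""
    for link in extract_links(fetch_html(LIST_URL), LIST_URL):
        print(link, parse_detail(fetch_html(link)))
```

Calling main() runs the whole flow against the live site. Keeping the parsing functions separate from fetch_html means you can test them on saved HTML first, which helps tell a wrong selector apart from a blocked request. Note that "td a[href]" matches every link in any table cell, so navigation links may need filtering out.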
