Question

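I am a beginner Python programmer and I am trying to write a web crawler as practice. I have run into a problem that I cannot find a suitable solution for. I am trying to get a link's location/address from a page where the link has no class, so I don't know how to filter for that specific link. It is probably easier to show you.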
The page I am trying to get the link from.
As you can see, I am trying to get the contents of the href attribute of the "Historical prices" link. Here is my Python code:

import requests
from bs4 import BeautifulSoup

def find_historicalprices_link(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find_all('li', 'fjfe-nav-sub')
    href = str(link.get('href'))
    find_spreadsheet(href)

def find_spreadsheet(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find('a', {'class' : 'nowrap'})
    href = str(link.get('href'))
    download_spreadsheet(href)

def download_spreadsheet(url):
    response = requests.get(url)
    text = response.text
    lines = text.split("\\n")
    filename = r'google.csv'
    file = open(filename, 'w')
    for line in lines:
        file.write(line + "\n")
    file.close()

find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')

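In the function "find_spreadsheet(url)" I can easily filter the link by looking for the class named "nowrap". Unfortunately, the Historical prices link has no such class, and right now my script just gives me the following error: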

AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

How can I make sure that my crawler only takes the href from "Historical prices"? Thanks in advance.

Update:
I found a way to do this. By looking only for a link with specific text attached to it, I can find the href I need. Solution:
soup.find('a', string="Historical prices")
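For illustration, here is a minimal sketch of how that lookup could fit into a helper. The sample markup, the helper name find_link_by_text, and the base URL are assumptions made for this example and are not taken from the real page. The sketch also assumes the href might be relative, so it resolves it against the page URL with urljoin before it would be passed to requests.get.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<ul>
  <li class="fjfe-nav-sub"><a href="/finance/news?q=NASDAQ%3AGOOGL">News</a></li>
  <li class="fjfe-nav-sub"><a href="/finance/historical?q=NASDAQ%3AGOOGL">Historical prices</a></li>
</ul>
"""


def find_link_by_text(html, link_text, base_url):
    """Return the absolute URL of the first <a> whose text equals link_text."""
    soup = BeautifulSoup(html, 'html.parser')
    # find() returns a single Tag, or None when nothing matches.
    link = soup.find('a', string=link_text)
    if link is None or not link.get('href'):
        return None
    # Resolve a relative href against the URL of the page it came from.
    return urljoin(base_url, link['href'])


result = find_link_by_text(
    SAMPLE_HTML, 'Historical prices', 'https://www.google.com/finance'
)
print(result)
```

Returning None when no link matches lets the caller decide what to do, instead of failing later on a missing attribute.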

Answer 1

Does the following code snippet help? I think you can solve your problem with it:

from bs4 import BeautifulSoup

html = """<a href='http://www.google.com'>Something else</a>
          <a href='http://www.yahoo.com'>Historical prices</a>"""

soup = BeautifulSoup(html, "html5lib")

urls = soup.find_all("a")

print(urls)

print([a["href"] for a in urls if a.text == "Historical prices"])
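As a side note on the AttributeError quoted in the question: find_all() returns a list-like ResultSet, while find() returns a single Tag (or None), so .get('href') only works on the individual tags. A small sketch of the difference, reusing the sample markup above with the built-in 'html.parser' (chosen here only to avoid the extra html5lib dependency):

```python
from bs4 import BeautifulSoup

html = """<a href='http://www.google.com'>Something else</a>
          <a href='http://www.yahoo.com'>Historical prices</a>"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every match as a list-like ResultSet.
all_links = soup.find_all('a')
# find() returns only the first match as a Tag (or None if there is none).
first_link = soup.find('a')

# Attributes are read from individual tags, so iterate over the ResultSet.
hrefs = [a.get('href') for a in all_links]
first_href = first_link.get('href')

print(hrefs)
print(first_href)
```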

Python - How do I find a link on a web page that has no class?
