Question

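I've written a script in Python using BeautifulSoup to parse a certain address from a webpage. However, when I run the script below, I get AttributeError: 'NavigableString' object has no attribute 'text' at the line starting with address = [item.find_next_sibling().get_text(strip=True). I can get rid of the problem with the alternative shown further down, but I would like to stick to the approach I'm currently using. What can I do?

This is my attempt: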

import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#content-container dt"):

        #the error appears in the following line

        address = [item.find_next_sibling().get_text(strip=True) for item in items if "correspondence address" in item.text.lower()][0]
        print(address)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)

I can get rid of the error by doing the following, but I would like to stick to the way I tried in my script:

items = soup.select("#content-container dt")
address = [item.find_next_sibling().get_text(strip=True) for item in items if "correspondence address" in item.text.lower()][0]
print(address)

Edit:

This is not an answer, but it is how I have been experimenting (I'm still not sure how to apply .find_previous_sibling()):

import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#content-container dt"):
        address = [item for item in items.strings if "correspondence address" in item.lower()]
        print(address)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)

It produces the following (without the NavigableString error):

[]
['Correspondence address']
[]
[]

Answer 1

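items is not a list of nodes but a single node, so you should not iterate over it here (for item in items). Iterating over a tag yields its children, which for these dt elements are NavigableString objects rather than tags. Just replace the list comprehension with the following: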

for items in soup.select("#content-container dt"):
    if "correspondence address" in items.text.lower():
        address = items.find_next_sibling().get_text(strip=True)
        print(address)

Answer 2

You can change the BeautifulSoup selector to look up the correspondence address directly by its ID, #correspondence-address-value-1.

import requests
from bs4 import BeautifulSoup


URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    addresses = [a.text for a in soup.select("#correspondence-address-value-1")]
    print(addresses)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)

Result

13:32 $ python test.py
['21 Maes Y Llan, Conwy, Wales, LL32 8NB']
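
If only one address is expected, a variant of the ID lookup with select_one may be a little more defensive, since it returns None when the element is missing. This is a sketch under assumptions: the sample HTML and the helper address_by_id are hypothetical, and only the ID itself comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page may nest this element differently.
HTML = (
    "<dl><dt>Correspondence address</dt>"
    '<dd id="correspondence-address-value-1"> 21 Maes Y Llan, Conwy, Wales, LL32 8NB </dd></dl>'
)


def address_by_id(html, element_id="correspondence-address-value-1"):
    """Return the stripped text of the element with the given ID, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(f"#{element_id}")
    return node.get_text(strip=True) if node is not None else None


print(address_by_id(HTML))
print(address_by_id("<p>no address here</p>"))
```

Relying on an ID is short, but it breaks if the site renames the element, whereas matching on the dt label text depends only on the visible wording.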
