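I've written a script in Python, in combination with BeautifulSoup, to parse a certain address from a webpage. However, when I run the script below, I get AttributeError: 'NavigableString' object has no attribute 'text' when it reaches the line address = [item.find_next_sibling().get_text(strip=True). If I use the alternative lines shown after the script, the error goes away. However, I would like to stick to the approach I'm currently using. What can I do?
This is my attempt: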
import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#content-container dt"):
        #the error appears in the following line
        address = [item.find_next_sibling().get_text(strip=True) for item in items if "correspondence address" in item.text.lower()][0]
        print(address)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)
I can get rid of the error by doing the following, but I would like to stick to the approach I tried in my script:
items = soup.select("#content-container dt")
address = [item.find_next_sibling().get_text(strip=True) for item in items if "correspondence address" in item.text.lower()][0]
print(address)
Edit:
This is not an answer, but this is how I tried to play around with it (still not sure how to apply .find_previous_sibling()):
import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#content-container dt"):
        address = [item for item in items.strings if "correspondence address" in item.lower()]
        print(address)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)
It produces the following (without the NavigableString issue):
[]
['Correspondence address']
[]
[]
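Answer 0 (score: 1)
items is not a list of nodes but a single node, so you shouldn't use it as an iterator here (for item in items). Iterating over a single tag walks its children, which for a plain dt element are NavigableString objects rather than tags. Just replace the list comprehension with the following: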
for items in soup.select("#content-container dt"):
    if "correspondence address" in items.text.lower():
        address = items.find_next_sibling().get_text(strip=True)
        print(address)
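To see why the comprehension in the question misbehaves, it can help to reproduce the situation on a small inline document. The sketch below is only an illustration: the HTML is made up (the dl/dt/dd layout and the sample text are assumptions about the real page), find_address is a hypothetical helper name, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
HTML = """
<div id="content-container">
  <dl>
    <dt>Name</dt><dd>Jane Doe</dd>
    <dt>Correspondence address</dt><dd> 1 Example Street, Town </dd>
  </dl>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
dts = soup.select("#content-container dt")

# Each element returned by select() is a single Tag. Iterating over a Tag
# yields its children, which here is just the text node inside the <dt>.
child_types = [type(child).__name__ for child in dts[0]]


def find_address(nodes):
    """Return the text of the <dd> that follows the matching <dt>, or None."""
    for dt in nodes:
        if "correspondence address" in dt.get_text().lower():
            dd = dt.find_next_sibling("dd")
            return dd.get_text(strip=True) if dd is not None else None
    return None


address = find_address(dts)
print(child_types, address)
```

Returning None when nothing matches avoids the IndexError that indexing an empty list with [0] would raise on a page without a correspondence address.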
Answer 1 (score: 0)
You can change the BeautifulSoup selector to look up the correspondence address directly by its ID, #correspondence-address-value-1.
import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    addresses = [a.text for a in soup.select("#correspondence-address-value-1")]
    print(addresses)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)
Result
13:32 $ python test.py
['21 Maes Y Llan, Conwy, Wales, LL32 8NB']
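Since an id is meant to be unique within a page, select_one may be a slightly better fit than select here: it returns the single matching tag, or None, instead of a list. A minimal sketch on made-up inline HTML follows; the id is the one used in this answer, while the surrounding markup, the sample address, and the correspondence_address helper name are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id comes from the answer above.
HTML = """
<dl>
  <dt>Correspondence address</dt>
  <dd id="correspondence-address-value-1">
    21 Example Road, Sampletown, AB1 2CD
  </dd>
</dl>
"""


def correspondence_address(html):
    """Return the stripped address text, or None if the id is not present."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("#correspondence-address-value-1")
    return node.get_text(strip=True) if node is not None else None


print(correspondence_address(HTML))
print(correspondence_address("<p>no address here</p>"))
```

Checking for None before calling get_text keeps the function from raising when the page layout changes or the element is missing.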