Question

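I am using BeautifulSoup to extract various elements from a website. I've run into a case I can't figure out: I want to extract the text of a link, but the link text is split across 3 lines by line breaks. For example: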

<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456" 
<br> 
"Everywhere, USA 12345"
</a>
</span>

When I use find_all("span",{"class":"location-address"})[0].text, I get something like "123 Main StSuite 456 Everywhere, USA 12345". I would like a more natural result, with the three lines kept separate.

Answer 1

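You could try getting find_all("span",{"class":"location-address"})[0].contents instead of find_all("span",{"class":"location-address"})[0].text. It should return the HTML content nested inside the tag (as a list of child nodes) rather than the flattened text. You can then replace <br /> with \n or do whatever else you need.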
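
As a rough illustration of the "replace <br /> with \n" idea, here is a minimal sketch. It is not necessarily what the answerer had in mind: the helper name address_lines_via_br is hypothetical, the built-in "html.parser" is assumed as the parser, and the sample markup is the snippet from the question with a closing </span> added.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (closing </span> added for well-formedness).
HTML = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>"""


def address_lines_via_br(html):
    """Hypothetical helper: turn each <br> into a newline, then split the text."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find_all("span", {"class": "location-address"})[0].find("a")
    # Swap every <br> tag for a literal newline text node.
    for br in link.find_all("br"):
        br.replace_with("\n")
    # Split on newlines, then trim whitespace and the literal quote characters.
    pieces = [p.strip().strip('"').strip() for p in link.get_text().split("\n")]
    # Drop the empty pieces left over from blank lines.
    return [p for p in pieces if p]


if __name__ == "__main__":
    for line in address_lines_via_br(HTML):
        print(line)
```

Returning a list rather than a single string leaves the choice of separator (newline, comma, and so on) to the caller.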

Answer 2

If you have only one tag with class=location-address, just use the find() method.

>>> from bs4 import BeautifulSoup
>>> html = """<span class="location-address">
... <a href="https://www.google.com/maps" target="_blank">
... "123 Main St"
... <br>
... "Suite 456"
... <br>
... "Everywhere, USA 12345"
... </a>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.find('span', class_='location-address').find_next('a').get_text(strip=True).replace('"', '')
'123 Main StSuite 456Everywhere, USA 12345'

But if you have multiple "span" tags with the given class, use the find_all() method. You can do the following:

>>> for span in soup.find_all('span', class_='location-address'):
...     span.find('a').get_text(strip=True).replace('"', '')
... 
'123 Main StSuite 456Everywhere, USA 12345'

Or use a CSS selector:

>>> for a in soup.select('span.location-address > a'):
...     a.get_text(strip=True).replace('"', '')
... 
'123 Main StSuite 456Everywhere, USA 12345'

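
The calls above join the three text fragments with no separator, so the line breaks the question asks about are still lost. One possible way to keep them, sketched here under a few assumptions, is to pass a separator to get_text() or to iterate over stripped_strings. The helper names address_text and address_lines are hypothetical, and the built-in "html.parser" is assumed in place of lxml so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (closing </span> added for well-formedness).
HTML = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>"""


def address_text(html):
    """Hypothetical helper: return the link text with one line per fragment."""
    soup = BeautifulSoup(html, "html.parser")
    a = soup.select_one("span.location-address > a")
    # separator="\n" joins the stripped text fragments with newlines
    # instead of concatenating them directly.
    return a.get_text(separator="\n", strip=True).replace('"', "")


def address_lines(html):
    """Hypothetical helper: return the fragments of every matching link as a list."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for a in soup.select("span.location-address > a"):
        # stripped_strings yields each text node with surrounding whitespace
        # removed and skips nodes that are whitespace only.
        for text in a.stripped_strings:
            lines.append(text.strip('"').strip())
    return lines


if __name__ == "__main__":
    print(address_text(HTML))
    print(address_lines(HTML))
```

address_text() suits a single address block, while address_lines() flattens every matching link on the page into one list; grouping per link would need a nested list instead.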

Extracting an element's text line by line
