Using BeautifulSoup, I'm able to scrape a web page with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.acbbroker.it/soci_dettaglio.php?r=3")
page
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find(id="paginainterna-content")
test_items = test.find_all(class_="entry-content")
tonight = test_items[0]
names = []
for x in tonight.find_all('a', itemprop="url"):
    names.append(str(x))
print(names)
but I'm not able to clean the results and get only the text inside each <a> tag (dropping the href and the other attributes as well).
Here is a small sample of my result:
'<a href="http://www.google.com/maps/place/45.45249938964844,9.210599899291992" itemprop="url" target="_blank">A&B; Insurance e Reinsurance Brokers Srl</a>', '<a href="http://www.google.com/maps/place/45.647499084472656,8.774800300598145" itemprop="url" target="_blank">A.B.A. BROKERS SRL</a>', '<a href="http://www.google.com/maps/place/45.46730041503906,9.148480415344238" itemprop="url" target="_blank">ABC SRL BROKER E CONSULENTI DI ASSI.NE</a>', '<a href="http://www.google.com/maps/place/45.47710037231445,9.269220352172852" itemprop="url" target="_blank">AEGIS INTERMEDIA SAS</a>',
What is the proper way to handle this kind of data and obtain a clean result?
Thank you
Answer 0 (score: 2)
If you only want the text from each tag, use the get_text() method:
for x in tonight.find_all('a', itemprop="url"):
    names.append(x.get_text())
print(names)
Better still, use a list comprehension, which is usually the fastest way to build the list:
names = [x.get_text() for x in tonight.find_all('a', itemprop='url')]
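To try the get_text() approach without hitting the network, here is a minimal sketch. It assumes a hand-written HTML fragment shaped like the links in the question; the fragment, the example.com URLs, and the strip=True argument are illustrative additions, not part of the original page or answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the links shown in the question.
html = """
<div class="entry-content">
  <a href="http://example.com/a" itemprop="url" target="_blank">A.B.A. BROKERS SRL</a>
  <a href="http://example.com/b" itemprop="url" target="_blank"> AEGIS INTERMEDIA SAS </a>
  <a href="http://example.com/c">link without itemprop</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text inside each tag, with no markup or attributes.
# strip=True also trims leading and trailing whitespace from that text.
names = [a.get_text(strip=True) for a in soup.find_all("a", itemprop="url")]
print(names)
```

The third link is skipped because it has no itemprop="url" attribute, so only the two company names end up in names.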
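Answer 1 (score: 1)
I'm not sure exactly what output you want, but you can get each tag's text by changing the append line to the following (get_text() already returns a string, so the str() call is optional):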
names.append(str(x.get_text()))