I'm new to data science, and I'm trying to scrape a real-estate website to build a dataset of listings. The problem I'm running into is that different elements (number of rooms, surface area and bathrooms) share the same li class and span class, so I get the first element (rooms) for the other two as well.
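To make the issue concrete, here is a minimal sketch with made-up markup (not the site's real HTML) showing that find() always returns the first match when several elements share a class:

from bs4 import BeautifulSoup

# Simplified stand-in markup: three fields share the same span class
html = ('<li><span class="text-bold">2</span> locali</li>'
        '<li><span class="text-bold">65</span> m2</li>'
        '<li><span class="text-bold">1</span> bagno</li>')
soup = BeautifulSoup(html, "html.parser")
# find() stops at the first match, so every lookup yields the rooms value
print(soup.find("span", {"class": "text-bold"}).text)  # prints 2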
I tried to implement this solution, but I get this error:

"'str' object has no attribute 'find_next'"

Website: https://www.immobiliare.it/vendita-case/milano/

Code:
import requests
from bs4 import BeautifulSoup
import pandas

base_url = "https://www.immobiliare.it/vendita-case/milano/"
r = requests.get(base_url)
c = r.content
soup = BeautifulSoup(c, "html.parser")

# To extract the first and last page numbers
paging = soup.find("div",{"id":"listing-pagination"}).find("ul",{"class":"pagination pagination__number"}).find_all("a")
start_page = paging[0].text
last_page = paging[len(paging)-1].text

# Empty list to append content
web_content_list = []

for page_number in range(int(start_page),2):
    # To form the url based on page numbers
    print(page_number)
    url = base_url + "?pag=" + str(page_number)
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # Extract info
    listing_content = soup.find_all("div",{"class":"listing-item_body--content"})
    for item in listing_content:
        # Store info to a dictionary
        web_content_dict = {}
        web_content_dict["Title"] = item.find("p",{"class":"titolo text-primary"}).find("a").get("title")
        web_content_dict["Price"] = item.find("li",{"class":"lif__item lif__princing"})
        web_content_dict["Rooms"] = item.find("span",{"class":"text-bold"}).text
        web_content_dict["Surface"] = web_content_dict["Rooms"].find_next("span").text
        # Store dictionary into a list
        web_content_list.append(web_content_dict)

# Make a dataframe with the list
df = pandas.DataFrame(web_content_list)
# Write dataframe to a csv file
df.to_csv("Output.csv")
print("Done")
I'd also rather not use Selenium. Thanks for your help.
Answer 0 (score: 1)
Your error comes from calling find_next on a string: web_content_dict["Rooms"] holds the result of .text, which is a plain str, not a Tag. As a quick fix, you can find the <span> for the rooms, then call find_next on it once for the surface and again for the number of toilets. For example (I also use get_text(strip=True) to strip surrounding whitespace):
for item in listing_content:
    # Store info to a dictionary
    web_content_dict = {}
    web_content_dict["Title"] = item.find("p",{"class":"titolo text-primary"}).find("a").get("title")
    web_content_dict["Price"] = item.find("li",{"class":"lif__item lif__pricing"}).get_text(strip=True)
    web_content_dict["Rooms"] = item.find("span",{"class":"text-bold"}).get_text(strip=True)
    web_content_dict["Surface"] = item.find("span",{"class":"text-bold"}).find_next("span").get_text(strip=True)
    web_content_dict["Toilets"] = item.find("span",{"class":"text-bold"}).find_next("span").find_next("span").get_text(strip=True)
When I print the variable web_content_dict, it looks like this:
{'Title': "Bilocale via Fra' Giovanni Pantaleo 3, Bovisa, Milano", 'Price': '€ 187.000', 'Rooms': '2', 'Surface': '65', 'Toilets': '1'}
{'Title': 'Trilocale via Monte Rosa 15, Amendola - Buonarroti, Milano', 'Price': '€ 730.000', 'Rooms': '3', 'Surface': '140', 'Toilets': '2'}
{'Title': 'Trilocale via San Senatore, 2, Missori, Milano', 'Price': '€ 665.000', 'Rooms': '3', 'Surface': '109', 'Toilets': '2'}
{'Title': 'Quadrilocale viale Duilio 6, Sempione, Milano', 'Price': '€ 1.150.000', 'Rooms': '4', 'Surface': '165', 'Toilets': '2'}
{'Title': "Appartamento piazza Sant'agostino, 6, Corso Genova, Milano", 'Price': '€ 1.650.000', 'Rooms': '5', 'Surface': '275', 'Toilets': '3+'}
{'Title': 'Trilocale via Val Gardena 25, Precotto, Milano', 'Price': '€ 170.000', 'Rooms': '3', 'Surface': '91', 'Toilets': '1'}
{'Title': 'Appartamento corso Di Porta Nuova, Turati, Milano', 'Price': '€ 1.130.000', 'Rooms': '5+', 'Surface': '210', 'Toilets': '3'}
{'Title': 'Trilocale via Francesco Albani 58, Monte Rosa - Lotto, Milano', 'Price': '€ 380.000', 'Rooms': '3', 'Surface': '90', 'Toilets': '1'}
{'Title': 'Bilocale via Antonio Cesari 47, Niguarda, Milano', 'Price': '€ 115.000', 'Rooms': '2', 'Surface': '46', 'Toilets': '1'}
{'Title': 'Trilocale via mazzucotelli 15, Quartiere Forlanini, Milano', 'Price': '€ 215.000', 'Rooms': '3', 'Surface': '91', 'Toilets': '2'}
{'Title': 'Bilocale via Livorno, Palestro, Milano', 'Price': '€ 520.000', 'Rooms': '2', 'Surface': '57', 'Toilets': '1'}
{'Title': 'Bilocale via Maspero 28, Molise - Cuoco, Milano', 'Price': '€ 290.000', 'Rooms': '2', 'Surface': '70', 'Toilets': '1'}
{'Title': 'Trilocale largo Gemito, 3, Casoretto, Milano', 'Price': '€ 308.000', 'Rooms': '3', 'Surface': '93', 'Toilets': '1'}
{'Title': 'Quadrilocale via Pietro Paleocapa, Cadorna - Castello, Milano', 'Price': '€ 1.300.000', 'Rooms': '4', 'Surface': '180', 'Toilets': '3'}
{'Title': 'Bilocale via Renato Fucini, Città Studi, Milano', 'Price': '€ 511.000', 'Rooms': '2', 'Surface': '85', 'Toilets': '1'}
{'Title': 'Quadrilocale via Lucca, Bisceglie, Milano', 'Price': '€ 275.000', 'Rooms': '4', 'Surface': '100', 'Toilets': '1'}
{'Title': 'Trilocale via RIZZARDI 45, Trenno, Milano', 'Price': '€ 485.000', 'Rooms': '3', 'Surface': '127', 'Toilets': '1'}
{'Title': 'Bilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 220.000', 'Rooms': '2', 'Surface': '50', 'Toilets': '1'}
{'Title': 'Quadrilocale via Cadore, Cadore, Milano', 'Price': '€ 1.060.000', 'Rooms': '4', 'Surface': '210', 'Toilets': '2'}
{'Title': 'Bilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 195.000', 'Rooms': '2', 'Surface': '42', 'Toilets': '1'}
{'Title': 'Bilocale buono stato, primo piano, Brera, Milano', 'Price': '€ 800.000', 'Rooms': '2', 'Surface': '87', 'Toilets': '2'}
{'Title': 'Trilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 540.000', 'Rooms': '3', 'Surface': '120', 'Toilets': '2'}
{'Title': 'Bilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 350.000', 'Rooms': '2', 'Surface': '81', 'Toilets': '1'}
{'Title': 'Bilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 265.000', 'Rooms': '2', 'Surface': '50', 'Toilets': '1'}
{'Title': 'Appartamento via Antonio Pianella, 4, San Siro, Milano', 'Price': '€ 649.000', 'Rooms': '5+', 'Surface': '150', 'Toilets': '3'}
Answer 1 (score: 0)
I tried to improve the script by using find_all with tag indexing and a try/except, but perhaps you could also use bs4's .next_siblings attribute (see the sketch after the script below):
import requests
from bs4 import BeautifulSoup
import pandas

base_url = "https://www.immobiliare.it/vendita-case/milano/"
r = requests.get(base_url)
c = r.content
soup = BeautifulSoup(c, "html.parser")

# To extract the first and last page numbers
paging = soup.find("div",{"id":"listing-pagination"}).find("ul",{"class":"pagination pagination__number"}).find_all("a")
start_page = paging[0].text
last_page = paging[len(paging)-1].text

# Empty list to append content
web_content_list = []

for page_number in range(int(start_page),2):
    # To form the url based on page numbers
    print(page_number)
    url = base_url + "?pag=" + str(page_number)
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # Extract info
    listing_content = soup.find_all("div",{"class":"listing-item_body--content"})
    for item in listing_content:
        # Store info to a dictionary
        web_content_dict = {}
        web_content_dict["Title"] = item.find("p",{"class":"titolo text-primary"}).find("a").get("title")
        # Class name corrected to "lif__pricing" (as in the answer above) and extracted as text
        web_content_dict["Price"] = item.find("li",{"class":"lif__item lif__pricing"}).get_text(strip=True)
        web_content_dict["Rooms"] = item.find_all("li",{"class":"lif__item"})[1].find("span",{"class":"text-bold"}).get_text(strip=True)
        web_content_dict["Surface"] = item.find_all("li",{"class":"lif__item"})[2].find("span",{"class":"text-bold"}).get_text(strip=True)
        web_content_dict["Bath"] = item.find_all("li",{"class":"lif__item"})[3].find("span",{"class":"text-bold"}).get_text(strip=True)
        try:
            web_content_dict["Floor"] = item.find_all("li",{"class":"lif__item"})[4].find("abbr",{"class":"text-bold"}).get_text(strip=True)
        except IndexError:
            web_content_dict["Floor"] = 1
        # Store dictionary into a list
        web_content_list.append(web_content_dict)

# Make a dataframe with the list
df = pandas.DataFrame(web_content_list)
print(df)
# Write dataframe to a csv file
df.to_csv("Output.csv")
print("Done")