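I am trying to scrape data from this city real estate website: https://www.cityrealty.com/nyc/roosevelt-island/rivercross-531-main-street/closing-history/57182 I don't know what I'm doing; at this point I'm just trying things at random. Please help! So far I have imported the libraries and fetched the page: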
import requests
from bs4 import BeautifulSoup
# Fetch the page first, so the response exists before it is parsed
r = requests.get('https://www.cityrealty.com/nyc/roosevelt-island/rivercross-531-main-street/closing-history/57182')
print(len(r.text))
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title)
print(soup.title.string)
Now I need to extract the data.
I have tried something like this:
results = []
for k in soup.find_all('tr'):
    cells = []
    # Walk the next six siblings of each row and collect their text
    for count in range(0, 6):
        k = k.next_sibling
        if k is None:
            break
        cells.append(k.string)
    results.append(cells)
print('Number of results', len(results))
for row in results:
    print(row)
But this did not return anything. How do I extract the data from the page? Thanks!
Answer 0 (score: 1)
You can select every div that has the class tr with soup.findAll("div", {"class":"tr"}). This returns all the div containers with that class.
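As a rough illustration of the difference between matching a tag name and matching a class name, here is a minimal sketch. The HTML string is made up for the example and only assumes that the page builds its table from div elements with the class tr; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a "table" built from div elements, not <tr> tags.
html = """
<div class="table">
  <div class="tr header"><span>Unit</span></div>
  <div class="tr"><span>1916</span></div>
  <div class="tr"><span>1612</span></div>
  <p class="tr">not a div</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching by tag name finds nothing, because no <tr> tags exist here.
no_tr_tags = soup.find_all("tr")

# Three equivalent ways to match div elements whose class list contains "tr".
by_dict = soup.find_all("div", {"class": "tr"})
by_keyword = soup.find_all("div", class_="tr")
by_css = soup.select("div.tr")

print(len(no_tr_tags), len(by_dict), len(by_keyword), len(by_css))
```

If the real page is structured this way, it would explain why a search for 'tr' tags comes back empty while the class-based search finds the rows.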
Note that these divs also carry the data in their HTML attributes, such as data-unit, data-size, data-price, ..., which makes scraping the values easier.
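To show how attribute values can be read from a tag, here is a small sketch on a made-up HTML string. The markup is an assumption that only mirrors the data-unit and data-price attributes mentioned above; the output key names unit and price are chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row without data attributes, two data rows.
html = """
<div class="tr header">Unit / Price</div>
<div class="tr" data-unit="1916" data-price="1175000">...</div>
<div class="tr" data-unit="1612" data-price="1150000">...</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.find_all("div", class_="tr"):
    if not div.has_attr("data-unit"):
        continue  # skip rows that carry no data attributes
    rows.append({
        "unit": div["data-unit"],          # attribute values are strings
        "price": int(div["data-price"]),   # convert where a number is needed
    })

first_data_row_attrs = soup.find_all("div", class_="tr")[1].attrs
print(rows)
print(first_data_row_attrs)
```

The attrs dictionary of a tag also contains its class (as a list), so a table built directly from attrs ends up with an extra class column that usually needs to be dropped.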
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
r = requests.get('https://www.cityrealty.com/nyc/roosevelt-island/rivercross-531-main-street/closing-history/57182')
soup = BeautifulSoup(r.text, "html.parser")
# Keep only the div.tr rows that carry data-* attributes
data = [
    t.attrs
    for t in soup.findAll("div", {"class":"tr"})
    if t.has_attr("data-unit")
]
df = pd.DataFrame(data)
# attrs also includes the tag's class list, which is not part of the data
del df['class']
print(df)
Output:
    data-unit  data-size  data-sizeft  data-price  data-priceft  data-priceask   data-date  data-total
0        1916          3         1777     1175000           661        1250000  1587700800          84
1        1612          2         1364     1150000           843        1250000  1580274000          84
2         411          1          972      620000           638         640000  1580101200          84
3        1003          3         1777     1131000           636        1245000  1577077200          84
4        1411          1            -      682000             -              -  1576731600          84
..        ...        ...          ...         ...           ...            ...         ...         ...
79       1403                       -       52877             -              -  1138683600          84
80       1315                       -       54921             -              -  1135141200          84
81        123                       -       52241             -              -  1093406400          84
82       1915                       -       51037             -              -  1058932800          84
83       1819                       -       53642             -              -  1049688000          84
[84 rows x 8 columns]
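All of these values arrive as strings, and missing values show up as "-". As a possible follow-up, the sketch below converts a few columns to usable types. It uses two rows copied from the output above and assumes that data-date holds Unix timestamps in seconds, which the magnitude of the values suggests but the page does not confirm.

```python
import pandas as pd

# Two rows copied from the output above, kept as strings like the scraped attributes.
df = pd.DataFrame({
    "data-unit": ["1916", "1411"],
    "data-sizeft": ["1777", "-"],
    "data-price": ["1175000", "682000"],
    "data-date": ["1587700800", "1576731600"],
})

clean = df.copy()
# errors="coerce" turns non-numeric placeholders such as "-" into NaN.
for col in ["data-sizeft", "data-price"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Assumption: data-date is a Unix timestamp in seconds (UTC).
clean["data-date"] = pd.to_datetime(pd.to_numeric(clean["data-date"]), unit="s")

print(clean.dtypes)
print(clean)
```

Coercing instead of raising keeps rows with missing square footage in the table, so they can still be filtered or counted later.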