Question

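I am building a web scraper for housing prices in the US. An example of the data I am using can be found here. I am trying to extract the data for a specific zip code (studio: $1420, 1 bedroom: $1560).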

This is the part of the HTML I want to extract:

<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>

When I try BeautifulSoup4, this is what I do:

import urllib.request as urllib2
from bs4 import BeautifulSoup

# specify the url
quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'

# query the website and return the html to the variable ‘page’
page = urllib2.urlopen(quote_page)


soup = BeautifulSoup(page, 'html.parser')
price = soup.find('tspan', attrs={'class': 'highcharts-text-outline'})

print(price)

But this returns nothing. How should I change the code to extract the value correctly?

Answer 1

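You are trying to parse dynamic content with the urllib library, which cannot do this: it only downloads the raw HTML and does not run the JavaScript that draws the chart, so the tspan element is never in the page you parse. You need a browser automation tool such as selenium to handle it. Here is how to do it with selenium: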

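from contextlib import closing

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

# closing() makes sure the browser window is closed when the block exits
with closing(Chrome()) as driver:
    quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'
    driver.get(quote_page)
    # Selenium 4 removed find_element_by_class_name; use find_element(By.CLASS_NAME, ...)
    price = driver.find_element(By.CLASS_NAME, 'highcharts-text-outline').text
    print(price)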

Output:

$1420
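
Once the browser has rendered the page, the original BeautifulSoup approach should work again if you feed it the rendered HTML (for example driver.page_source) instead of the raw urllib response. Below is a minimal sketch of that parsing step, not a tested scraper for this site. The first tspan is copied from the question; the svg/text wrappers, the second tspan and the helper name extract_prices are assumptions made for illustration. It also assumes every label with that class is a dollar amount, which may not hold on the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source. The first <tspan> comes from the question;
# the <svg>/<text> wrappers and the second <tspan> are assumed for illustration.
rendered_html = """
<svg>
  <text><tspan x="5" y="16" class="highcharts-text-outline">$1420</tspan></text>
  <text><tspan x="5" y="16" class="highcharts-text-outline">$1560</tspan></text>
</svg>
"""


def extract_prices(html):
    """Return every outlined chart label in html as an integer dollar amount."""
    soup = BeautifulSoup(html, 'html.parser')
    labels = soup.find_all('tspan', class_='highcharts-text-outline')
    # Strip the dollar sign and any thousands separators before converting.
    return [
        int(label.get_text(strip=True).lstrip('$').replace(',', ''))
        for label in labels
    ]


prices = extract_prices(rendered_html)
print(prices)
```

find_all is used instead of find because the question asks for more than one value (studio and 1 bedroom), and find only returns the first match.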

Answer 2

You can use the Period.Between property:

PeriodUnits.Days

Output:

text

Answer 3

Try this:

PUB

Extracting nested data in HTML using Beautiful Soup
