Question

I am scraping a page, and I need to get the number of employees from markup like this:

<h5>Number of Employees</h5>
<p>
            20
</p>

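I need to get the number "20". The problem is that this number is not always under the same kind of heading: sometimes it is under an "h4", and the page contains several other "h5" headings as well. So I need to find the heading that contains the text "Number of Employees" and extract the number from the paragraph that follows it.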

Here is the link to the page:

http://www.bbb.org/chicago/business-reviews/paving-contractors/lester-s-material-service-inc-in-grayslake-il-72000434/

Answer 1

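Well, the simplest way is to find the element that contains the text "Number of Employees" and then take the paragraph right after it, assuming that the paragraph always immediately follows.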

Here is a quick and dirty piece of code that does this and prints the number:

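# Assumes `soup` is a BeautifulSoup object already built from the page HTML.
parent = soup.find("div", id="business-additional-info-text")
for child in parent.children:
    # True for the heading whose text is "Number of Employees".
    if "Number of Employees" in child:
        # Take the first <p> after the heading and strip the surrounding whitespace.
        print(child.find_next("p").contents[0].strip())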
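Since the question says the label may sit in an "h4" or an "h5", the same idea can be made independent of the heading level. The following is only a minimal, self-contained sketch: `SAMPLE_HTML` is made-up markup modeled on the snippet in the question (the real page may be structured differently), and `employee_count` is a hypothetical helper name.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample modeled on the question's snippet; not the real page.
SAMPLE_HTML = """
<div id="business-additional-info-text">
  <h5>Some Other Heading</h5>
  <p>unrelated text</p>
  <h4>Number of Employees</h4>
  <p>
              20
  </p>
</div>
"""


def employee_count(page_source):
    """Return the text of the first <p> after the matching heading, or None."""
    soup = BeautifulSoup(page_source, "html.parser")
    # Look at every heading level (h1-h6), so h4 vs. h5 does not matter.
    for heading in soup.find_all(re.compile(r"^h[1-6]$")):
        if "Number of Employees" in heading.get_text():
            paragraph = heading.find_next("p")
            if paragraph is not None:
                # strip=True removes the whitespace around the number.
                return paragraph.get_text(strip=True)
    return None


print(employee_count(SAMPLE_HTML))
```

The helper returns a string, so wrap the result in `int(...)` if a numeric value is needed, and check for `None` first in case the page has no such heading.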

Answer 2

'normalize-space(//*[self::h4 or self::h5][contains(., "Number of Employees")]/following-sibling::p[1]/text())'
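The answer gives only the XPath expression, so here is one possible way to evaluate it, sketched with `lxml` (an assumption; the answer does not say which library it had in mind). `SAMPLE_HTML` is made-up markup modeled on the question's snippet, and `employee_count_xpath` is a hypothetical helper name.

```python
from lxml import html

# Made-up sample modeled on the question's snippet; not the real page.
SAMPLE_HTML = """
<html><body>
<div id="business-additional-info-text">
  <h4>Some Other Heading</h4>
  <p>unrelated text</p>
  <h5>Number of Employees</h5>
  <p>
              20
  </p>
</div>
</body></html>
"""

# The expression from the answer, split across lines for readability.
EMPLOYEES_XPATH = (
    "normalize-space("
    '//*[self::h4 or self::h5][contains(., "Number of Employees")]'
    "/following-sibling::p[1]/text()"
    ")"
)


def employee_count_xpath(page_source):
    """Evaluate the XPath and return the number as a string ('' if not found)."""
    tree = html.fromstring(page_source)
    # normalize-space() makes xpath() return a single string, not a node list.
    return tree.xpath(EMPLOYEES_XPATH)


print(employee_count_xpath(SAMPLE_HTML))
```

Because `normalize-space()` is applied to an empty node set when nothing matches, the helper returns an empty string in that case rather than raising an error.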

