I am working with HTML like the following:
<td class="hidden-xs BuildingUnit-price" data-sort-value="625000">
<span class="price">$625,000 </span>
</td>
<td class="hidden-xs BuildingUnit-bedrooms" data-sort-value="4.0">
4 rooms, 2 beds
</td>
<td class="hidden-xs BuildingUnit-bathrooms">
5 baths
</td>
<td class="hidden-xs" data-sort-value="1">
1 bath
</td>
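I wrote the script below to identify td tags with the class "hidden-xs" so that I can extract the number of bathrooms for a real estate listing, but it also matches tags whose class is "hidden-xs BuildingUnit-price". How can I fix this?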
# Extract the number of baths
import re
lst_baths = list()
baths = soup.find_all("td", class_=["hidden-xs"])
bath_lines = [td.get_text().strip() for td in baths]
pattern = re.compile(r'(\d{1})\D*(bath|baths)$')
for bath in bath_lines:
    match = pattern.match(bath)
    if match:
        lst_baths.append(bath.split()[0])
For example, as currently written, my code picks up the "5 baths" line, but I only want it to pick up the "1 bath" line.
Answer 0 (score: 0)
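I found a way to test the class of each match: compare each tag's class list to exactly ['hidden-xs'], so that tags carrying additional classes are skipped: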
# Extract the baths
import re
lst_baths = list()
temp_lst = list()
baths = soup.find_all("td", class_=["hidden-xs"])
for item in baths:
    # Keep only tags whose class list is exactly ['hidden-xs']
    if item['class'] == ['hidden-xs']:
        temp_lst.append(item)
bath_lines = [td.get_text().strip() for td in temp_lst]
pattern = re.compile(r'(\d{1})\D*(bath|baths)$')
for bath in bath_lines:
    match = pattern.match(bath)
    if match:
        lst_baths.append(bath.split()[0])
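
The same exact-class check can probably be folded into the search itself by passing a function to find_all, which avoids the temporary list. The following is only a minimal sketch under a few assumptions: the HTML from the question is wrapped in a table row so that it parses on its own, the helper name is_plain_hidden_xs is made up for illustration, and the pattern is simplified to (\d+)\s*baths?$, which is not the pattern used above and also accepts multi-digit counts.

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question, wrapped in a table row (an assumption
# made so the snippet is self-contained).
html = """
<table><tr>
<td class="hidden-xs BuildingUnit-price" data-sort-value="625000">
<span class="price">$625,000 </span>
</td>
<td class="hidden-xs BuildingUnit-bedrooms" data-sort-value="4.0">
4 rooms, 2 beds
</td>
<td class="hidden-xs BuildingUnit-bathrooms">
5 baths
</td>
<td class="hidden-xs" data-sort-value="1">
1 bath
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def is_plain_hidden_xs(tag):
    """Hypothetical helper: True only for <td> tags whose class list is exactly ['hidden-xs']."""
    return tag.name == "td" and tag.get("class") == ["hidden-xs"]


# Simplified pattern (an assumption, not the original one): digits, optional
# whitespace, then "bath" or "baths" at the end of the text.
pattern = re.compile(r"(\d+)\s*baths?$")

lst_baths = []
for td in soup.find_all(is_plain_hidden_xs):
    match = pattern.match(td.get_text().strip())
    if match:
        # Take the number from the capture group instead of splitting the text
        lst_baths.append(match.group(1))

print(lst_baths)  # ['1']
```

Because the function receives the whole tag, it can compare the full class list, whereas class_=["hidden-xs"] matches any tag that has hidden-xs among its classes.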