I recently built a web scraper and I am trying to scrape a date from a website. Here is the relevant HTML snippet:
<dl class="dl-horizontal">
<dd>
::before
September 22, 1966
::after
</dd>
</dl>
When I do this:
dob = soup.find_all("dd")
I get the following (edited to hide some personal information):
[<dd>Clevenger</dd>, <dd>XXX-XX-XXXX <div class="adtl">You should <a href="https://www.example.com">click here</a> to find out if blah blah.</div></dd>, <dd><a href="javascript:void(0)" id="geo">47.579909, -117.479347</a></dd>, <dd>111-111-1111</dd>, <dd>1</dd>, <dd>September 22, 1966</dd>, <dd>52 years old</dd>, <dd>Virgo</dd>]
All I want is the date: September 22, 1966
How do I get just that?
Edit: changed find to find_all. The XPath of the element is:
//*[@id="details"]/div[2]/div[2]/div[1]/div[2]/dl[6]/dd[1]
CSS selector:
div#details > div:nth-of-type(2) > div:nth-of-type(2) > div > div:nth-of-type(2) > dl:nth-of-type(6) > dd
Answer 0 (score: 0)
Try to find out the class name (or really any other attribute) of that tag. It would be much easier if you could write something like this:
dob = soup.select("dd[class=date]")
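As a rough illustration of that idea, here is a minimal sketch. It assumes the date's <dd> carries a class named "date", which is hypothetical and does not appear in the HTML from the question; swap in whatever attribute the real page has.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class "date" on the <dd> is an assumption,
# it is not present in the HTML shown in the question.
html = """
<dl class="dl-horizontal">
  <dd class="date">September 22, 1966</dd>
  <dd>52 years old</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching tag, or None if nothing matches
dob_tag = soup.select_one("dd.date")
dob_text = dob_tag.get_text(strip=True) if dob_tag is not None else None
print(dob_text)
```

Checking for None before calling get_text avoids an AttributeError on pages where the element is missing.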
If that is not possible, consider using a regex to find out which <dd> tag contains the date:
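import re

# Alternation of full month names. Square brackets would build a character
# class that matches a single character, not a whole month name.
months = r'(?:January|February|March|April|May|June|July|August|September|October|November|December)'
pattern = re.compile(months + r'\s\d{1,2},\s\d{4}')

# dob is the list returned by soup.find_all("dd")
for elem in dob:
    text = elem.get_text()
    if pattern.search(text) is not None:
        print('matching!')
    else:
        print('not a match!')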
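To actually pull the date out rather than only report a match, something along these lines might work. This is a sketch using only the standard library: the dd_texts list stands in for the text of each <dd> (values taken from the output in the question), and extract_date is a hypothetical helper name, not part of BeautifulSoup.

```python
import re
from datetime import date

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Month name, day, and year are captured as three separate groups
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{1,2}),\s+(\d{4})\b"
)

def extract_date(texts):
    """Return the first 'Month D, YYYY' date found in texts, or None."""
    for text in texts:
        match = DATE_RE.search(text)
        if match:
            month_name, day, year = match.groups()
            month = MONTH_NAMES.index(month_name) + 1
            return date(int(year), month, int(day))
    return None

# Stand-in for [elem.get_text() for elem in soup.find_all("dd")]
dd_texts = ["Clevenger", "111-111-1111", "1",
            "September 22, 1966", "52 years old", "Virgo"]

birth_date = extract_date(dd_texts)
print(birth_date)  # 1966-09-22
```

Looking the month up in a list instead of calling strptime with %B keeps the parsing independent of the machine's locale settings.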