Scraping a specific part of an HTML element's ID with BeautifulSoup

Date: 2020-04-08 08:03:26

Tags: python python-3.x web-scraping beautifulsoup

I am trying to scrape the ID (1217428) from the following HTML without picking up the rest of the id attribute, but I don't know how to isolate just the part I need.

<td class="pb-15 text-center">
<a href="#" id="1217428_1_10/6/2020 12:00:00 AM" class="slotBooking">
    8:15 AM ✔ 
</a>
</td>

So far I have come up with this:

lesson_id = [] # I wish to fit the lesson id in this list
soup = bs(html, "html.parser")
slots = soup.find(attrs={"class" : "pb-15 text-center"})
tag = slots.find("a")
ID = tag.attrs["id"]
print(ID)

But this only gives me the following output:

1217428_1_10/6/2020 12:00:00 AM

Is there any way to edit my code so that the output is:

1217428

I also tried using a regular expression, but this code raises an error:

lesson_id = []
soup = bs(html, "html.parser")
slots = soup.find(attrs={"class" : "pb-15 text-center"})
tag = slots.find("a")
ID = tag.attrs["id"]
lesson_id.append(ID(re.findall("\d{7}")))

2 Answers:

Answer 0 (score: 1)

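You can simply split the string on the first underscore, like this:

id_list = ID.split('_', 1)
# id_list is now ['1217428', '1_10/6/2020 12:00:00 AM']
id_part = id_list[0]  # '1217428' (named id_part to avoid shadowing the built-in id)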

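You can also use a regular expression:

import re

# re.search returns the first run of digits found in the string
match = re.search(r'\d+', ID)
id_part = match.group()  # '1217428'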
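
Both approaches above assume the id always starts with digits followed by an underscore. As a minimal sketch under that assumption, they could be wrapped in a small helper that returns None instead of failing when an id does not have that shape. The name extract_lesson_id and the anchored pattern are illustrative choices, not part of the original answer:

```python
import re

# Assumption: ids look like "<digits>_<anything>", as in the question's example.
LESSON_ID_RE = re.compile(r"^(\d+)_")


def extract_lesson_id(raw_id):
    """Return the digits before the first underscore, or None if absent."""
    match = LESSON_ID_RE.match(raw_id)
    return match.group(1) if match else None


print(extract_lesson_id("1217428_1_10/6/2020 12:00:00 AM"))  # 1217428
print(extract_lesson_id("no-digits-here"))  # None
```

Anchoring the pattern at the start avoids accidentally matching digits from the date part when the leading number is missing.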

Answer 1 (score: 1)

I think you can solve your problem by splitting the ID on "_" and taking the first part (this is what I understood from your example above):

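from bs4 import BeautifulSoup as bs

lesson_id = []  # list that collects the lesson ids
soup = bs(html, "html.parser")  # html holds the page source
slots = soup.find(attrs={"class": "pb-15 text-center"})
tag = slots.find("a")
ID = tag.attrs["id"]
if ID:
    ID = ID.split("_")[0]  # keep only the part before the first underscore
    lesson_id.append(ID)
print(ID)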
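
If the page contains several booking slots, the same idea might be extended to collect every lesson id at once. The following is a sketch only: the second cell (id 1217429, 9:15 AM) is made up for illustration, and it assumes every slot link carries the slotBooking class seen in the question.

```python
from bs4 import BeautifulSoup

# Sample markup: the first cell is from the question, the second is hypothetical.
html = """
<td class="pb-15 text-center">
<a href="#" id="1217428_1_10/6/2020 12:00:00 AM" class="slotBooking">
    8:15 AM
</a>
</td>
<td class="pb-15 text-center">
<a href="#" id="1217429_1_10/6/2020 12:00:00 AM" class="slotBooking">
    9:15 AM
</a>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

lesson_ids = []
# find_all returns every matching <a>, whereas find stops at the first one.
for anchor in soup.find_all("a", class_="slotBooking"):
    raw_id = anchor.get("id")  # None if the attribute is missing
    if raw_id:
        lesson_ids.append(raw_id.split("_", 1)[0])

print(lesson_ids)  # ['1217428', '1217429']
```

Using anchor.get("id") rather than anchor.attrs["id"] skips links without an id instead of raising a KeyError.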