Parsing a div by its "class" attribute

Time: 2016-05-09 22:45:33

Tags: python html beautifulsoup

Using the BeautifulSoup module in Python, I am trying to parse the web page fragment below.

<div class="span-body"><div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div></div>

I am trying to get the script below to return 2016-05-08T1231Z, which is in the second div, the one with the timestamp updated class.

with open("index.html", 'rb') as source_file:
    soup = BeautifulSoup(source_file.read()) # Read the source file and get BeautifulSoup to work with it.
div_1 = soup.find("div", {"class": "span-body"}).contents[0] # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"}) # Parse the second div.
print div_2

div_1 returns what I want it to return (the second div), but div_2 does not; instead it just gives me an empty list.

How can I fix this?

2 Answers:

Answer 0: (score: 0)

There are two options. First, you can simply remove contents[0]:

div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"}) 

This returns a list containing one element:

[<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>]
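To see why the original script printed an empty list, here is a minimal sketch that assumes the markup from the question is held in a string rather than read from index.html; the names outer, inner, from_inner and from_outer are only illustrative. Calling a tag like a function is shorthand for find_all(), which searches the tag's descendants and not the tag itself, so searching from the inner div finds nothing.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to match index.html).
html = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div", {"class": "span-body"})
inner = outer.contents[0]  # First child of the outer div: the timestamp div itself.

# tag(...) is shorthand for tag.find_all(...), which only looks at descendants.
from_inner = inner("div", {"class": "timestamp updated"})  # inner contains no nested div
from_outer = outer("div", {"class": "timestamp updated"})  # inner is a descendant of outer

print(len(from_inner))
print(len(from_outer))
```

The first search starts inside the timestamp div, which only holds text, while the second starts one level up and therefore finds it.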

Second, you can just use find():

div_1 = soup.find("div", {"class": "span-body"})
div_2 = div_1.find("div", {'class': 'timestamp updated'})
print(div_2)

Result:

<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>

If you don't need the intermediate div_1, why not go straight to div_2?

div_2 = soup.find("div", {'class': 'timestamp updated'})

Edit from the comments: to get the value of the title attribute, you can index the tag:

div_2['title']
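Putting the pieces together, the following is a small sketch of reading the title attribute, again assuming the question's markup is available as a string. Besides indexing, it shows the tag's get() method, which returns None when the attribute is absent, and a CSS selector via select_one(), which matches the two classes regardless of their order; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to match index.html).
html = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

div_2 = soup.find("div", {"class": "timestamp updated"})

title = div_2["title"]              # Raises KeyError if the attribute is missing.
title_or_none = div_2.get("title")  # Returns None instead of raising.

# A CSS selector matches the classes in any order.
via_css = soup.select_one("div.updated.timestamp")

print(title)
print(via_css["title"])
```

The dictionary form {"class": "timestamp updated"} compares against the class string as written in the markup, so the selector form may be the safer choice if the class order can vary.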

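Answer 1: (score: 0)

To find what you want from div_1, you need to use the find function again. You can also remove contents[0], because find does not return a list; it returns a single tag.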

from bs4 import BeautifulSoup

with open("index.html", "rb") as source_file:
    soup = BeautifulSoup(source_file.read(), "html.parser")  # Read the source file and parse it.
div_1 = soup.find("div", {"class": "span-body"})  # Find the first (outer) div.
div_2 = div_1.find("div", {"class": "timestamp updated"})  # Find the second (inner) div.
print(div_2)
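Because find() returns a single tag, or None when nothing matches, chained calls can fail on pages that lack the expected markup. The sketch below guards against that; timestamp_title is a hypothetical helper written for illustration, and the markup is again assumed to be the fragment from the question.

```python
from bs4 import BeautifulSoup


def timestamp_title(soup):
    """Return the title of the timestamp div, or None if the markup differs."""
    outer = soup.find("div", {"class": "span-body"})
    if outer is None:
        return None
    inner = outer.find("div", {"class": "timestamp updated"})
    if inner is None:
        return None
    return inner.get("title")


# Sample markup copied from the question (assumed to match index.html).
html = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)

print(timestamp_title(BeautifulSoup(html, "html.parser")))
print(timestamp_title(BeautifulSoup("<p>no divs here</p>", "html.parser")))
```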