Question

Using the BeautifulSoup module in Python, I am trying to parse the web page fragment below.

<div class="span-body"><div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div></div>

I am trying to get the script below to return 2016-05-08T1231Z, which is in the second div, the one with the class timestamp updated.

from bs4 import BeautifulSoup

with open("index.html", "rb") as source_file:
    # Read the source file and get BeautifulSoup to work with it.
    soup = BeautifulSoup(source_file.read(), "html.parser")

div_1 = soup.find("div", {"class": "span-body"}).contents[0]  # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"})  # Parse the second div.
print(div_2)

div_1 returns what I want it to return (the second div), but div_2 does not; it only gives me an empty list.

How can I fix this?

Answer 1

There are two options. Either simply remove contents[0]:

div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"})

This returns a list with one element:

[<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>]

Or just use find():

div_1 = soup.find("div", {"class": "span-body"})
div_2 = div_1.find("div", {'class': 'timestamp updated'})
print(div_2)

Result:

<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>
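To see why the original script printed an empty list, here is a minimal sketch that parses the question's markup from an inline string (the names html, outer, inner, from_inner, from_outer and match are made up for illustration, and the stdlib "html.parser" backend is assumed). Calling a tag is shorthand for find_all(), which only searches that tag's descendants, and contents[0] is already the timestamp div, which has no div nested inside it.

```python
from bs4 import BeautifulSoup

# The markup from the question, with no whitespace between the two divs.
html = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div", {"class": "span-body"})
inner = outer.contents[0]  # this is already the "timestamp updated" div

# Calling a tag is shorthand for find_all(), which looks at descendants only.
from_inner = inner("div", {"class": "timestamp updated"})  # no div nested inside
from_outer = outer("div", {"class": "timestamp updated"})  # finds the inner div

print(len(from_inner), len(from_outer))  # 0 1

# find() returns a single Tag (or None) instead of a list.
match = outer.find("div", {"class": "timestamp updated"})
print(match["title"])  # 2016-05-08T1231Z
```

Note that if the real file has a newline between the two div tags, contents[0] would be that whitespace string rather than the inner div, which is one more reason to avoid indexing into contents here.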

If you don't need the intermediate div_1, why not go straight to div_2?

div_2 = soup.find("div", {'class': 'timestamp updated'})
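The direct lookup can be written in a few equivalent ways. The sketch below is only an illustration on an inline copy of the markup (variable names are hypothetical, and select_one() assumes a reasonably recent Beautiful Soup 4). It also shows one caveat: a class string containing a space is compared against the exact attribute value, so the class order matters.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact string value of the class attribute, as in the answer above.
by_attrs = soup.find("div", {"class": "timestamp updated"})

# Matching a single class is enough; class_ avoids the reserved word "class".
by_one_class = soup.find("div", class_="timestamp")

# CSS selector: a div that carries both classes, in any order.
by_selector = soup.select_one("div.timestamp.updated")

# The same two classes in a different order do not match as one string.
reordered = soup.find("div", {"class": "updated timestamp"})

print(by_attrs is not None, by_one_class is not None, by_selector is not None)
print(reordered)  # None
```

If the order of the classes in the page is not guaranteed, the single-class or CSS-selector form is probably the safer choice.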

Edit from the comments: to get the value of the title attribute, index the tag by the attribute name:

div_2['title']
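As a small sketch of attribute access (the attribute name data-missing is invented for the example), a tag behaves like a dictionary: subscripting raises KeyError when the attribute is absent, while get() returns None instead.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)
soup = BeautifulSoup(html, "html.parser")
div_2 = soup.find("div", {"class": "timestamp updated"})

title = div_2["title"]              # raises KeyError if the attribute is absent
missing = div_2.get("data-missing")  # hypothetical attribute; get() returns None
classes = div_2["class"]            # multi-valued attribute, returned as a list

print(title)    # 2016-05-08T1231Z
print(missing)  # None
print(classes)  # ['timestamp', 'updated']
```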

Answer 2

To find what you want inside div_1, you need to use the find function again. You can also remove contents[0], because find does not return a list.

from bs4 import BeautifulSoup

with open("index.html", "rb") as source_file:
    # Read the source file and get BeautifulSoup to work with it.
    soup = BeautifulSoup(source_file.read(), "html.parser")

div_1 = soup.find("div", {"class": "span-body"})  # Parse the first div.
div_2 = div_1.find("div", {"class": "timestamp updated"})  # Parse the second div.
print(div_2)
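Because find() returns None when nothing matches, chaining two find() calls raises AttributeError on pages that lack the outer div. One possible way to guard against that is sketched below; extract_timestamp is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_timestamp(html: str) -> Optional[str]:
    """Return the title of the nested timestamp div, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div_1 = soup.find("div", {"class": "span-body"})
    if div_1 is None:  # outer div not found
        return None
    div_2 = div_1.find("div", {"class": "timestamp updated"})
    if div_2 is None:  # inner div not found
        return None
    return div_2.get("title")  # None if the div has no title attribute


sample = (
    '<div class="span-body">'
    '<div class="timestamp updated" title="2016-05-08T1231Z">'
    'May 8, 12:31 PM EDT</div></div>'
)
print(extract_timestamp(sample))                # 2016-05-08T1231Z
print(extract_timestamp("<p>no divs here</p>"))  # None
```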

Parsing a div by its "class" attribute
