Question

I am scraping fixed content from a particular website. The content sits in nested divs, as shown below:

<div class="table-info">
  <div>
    <span>Time</span>
        <div class="overflow-hidden">
            <strong>Full</strong>
        </div>
  </div>
  <div>
    <span>Branch</span>
        <div class="overflow-hidden">
            <strong>IT</strong>
        </div>
  </div>
  <div>
    <span>Type</span>
        <div class="overflow-hidden">
            <strong>Standard</strong>
        </div>
  </div>
  <div>
    <span>contact</span>
        <div class="overflow-hidden">
            <strong>my location</strong>
        </div>
 </div>
</div>

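I want to retrieve only the content of the strong tag inside the div 'overflow-hidden' that belongs to the span whose string value is Branch. The code I am using is: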

from bs4 import BeautifulSoup
import urllib2 
url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)
type = soup.find('div',attrs={"class":"table-info"}).findAll('span')
print type

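I have scraped all the span content within the main div 'table-info' so that I can use a conditional statement to retrieve the required content. But if I try to scrape the div content within the span like this: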

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')
print type

I get the error:

AttributeError: 'list' object has no attribute 'find'

Can anyone give me some idea of how to retrieve the content of the div within the span? Thank you. I am using Python 2.7.

Answer 1

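It looks like you want the content of the second div inside the div "table-info". However, you are trying to reach it through a tag that is unrelated to what you are trying to access.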

 type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')

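raises an error because findAll returns a list of results, and a list has no find method. Even on a single span the search would come back empty, since the div is a sibling of the span rather than a child of it.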
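A minimal sketch of why the chained call fails, written for Python 3 with a trimmed copy of the markup inlined instead of fetched (both are assumptions; the question uses Python 2.7 and a live URL). The names HTML, chained_call_failed, and div_inside_span are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: inlined, not fetched).
HTML = """
<div class="table-info">
  <div>
    <span>Branch</span>
    <div class="overflow-hidden"><strong>IT</strong></div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
spans = soup.find("div", attrs={"class": "table-info"}).find_all("span")

# find_all returns a list-like result set, so .find is not available on it.
try:
    spans.find("div")
    chained_call_failed = False
except AttributeError:
    chained_call_failed = True

# Each element of the result is a Tag, which does have .find.
# The span holds only text, so searching inside it finds nothing.
div_inside_span = spans[0].find("div")

print(chained_call_failed)  # True
print(div_inside_span)      # None
```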

Better to try this:

from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
# findAll('div') is recursive, so index 2 is the wrapper div around "Branch"
divs = soup.find('div', attrs={"class": "table-info"}).findAll('div')
print divs[2].find('strong').string
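For reference, a self-contained sketch of the same positional lookup, assuming Python 3 and the question's markup inlined as a string rather than downloaded. It also shows a variant that searches direct children only, which may be easier to reason about; the variable names are illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined (assumption: not fetched from a URL).
HTML = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
  <div><span>contact</span><div class="overflow-hidden"><strong>my location</strong></div></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("div", attrs={"class": "table-info"})

# Recursive search: wrapper divs and "overflow-hidden" divs are interleaved,
# so index 2 is the wrapper around the Branch row.
all_divs = table.find_all("div")
branch_recursive = all_divs[2].find("strong").string

# Direct children only: one wrapper div per row, so Branch is row 1.
rows = table.find_all("div", recursive=False)
branch_direct = rows[1].find("strong").string

print(len(all_divs), len(rows))         # 8 4
print(branch_recursive, branch_direct)  # IT IT
```

Either index breaks if the site reorders or adds rows, so a positional lookup is only safe while the layout stays fixed.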

Answer 2

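findAll returns a list of BS elements, and find is defined on a BS object, not on a list of BS objects, hence the error. The initial part of your code is fine; do this: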

from bs4 import BeautifulSoup
import urllib2 

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)

table = soup.find('div',attrs={"class":"table-info"})
spans = table.findAll('span')
branch_span = spans[1]  # the second span is "Branch"
# Do your manipulation with the branch_span

OR

from bs4 import BeautifulSoup
import urllib2 

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)

table = soup.find('div',attrs={"class":"table-info"})
spans = table.findAll('span')

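for span in spans:
    if span.text.lower() == 'branch':
        # Do your manipulation; for example, read the value from the div that follows the span
        print span.find_next_sibling('div').find('strong').string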
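One way to fill in the "manipulation" step, sketched under the assumptions of Python 3 and the question's markup inlined as a string: look each value up by its span label instead of by position. The helper read_table_info is hypothetical and not part of the original answer; find_next_sibling is used because the value div follows the span rather than sitting inside it.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined (assumption: not fetched from a URL).
HTML = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
  <div><span>contact</span><div class="overflow-hidden"><strong>my location</strong></div></div>
</div>
"""


def read_table_info(html):
    """Map each lowercased span label to the text of the <strong> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", attrs={"class": "table-info"})
    values = {}
    for span in table.find_all("span"):
        # The value div is the span's next sibling, not its child.
        value_div = span.find_next_sibling("div")
        strong = value_div.find("strong") if value_div else None
        if strong is not None:
            label = span.get_text(strip=True).lower()
            values[label] = strong.get_text(strip=True)
    return values


info = read_table_info(HTML)
print(info["branch"])  # IT
```

Keying on the label keeps the lookup working if the rows are reordered, at the cost of depending on the label text staying the same.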

Nested tags web scraping python
