Web scraping nested tags in Python

Time: 2014-04-01 05:00:37

Tags: python html web-scraping beautifulsoup

I am scraping fixed content from a particular website. The content sits inside nested divs, like this:

<div class="table-info">
  <div>
    <span>Time</span>
    <div class="overflow-hidden">
      <strong>Full</strong>
    </div>
  </div>
  <div>
    <span>Branch</span>
    <div class="overflow-hidden">
      <strong>IT</strong>
    </div>
  </div>
  <div>
    <span>Type</span>
    <div class="overflow-hidden">
      <strong>Standard</strong>
    </div>
  </div>
  <div>
    <span>contact</span>
    <div class="overflow-hidden">
      <strong>my location</strong>
    </div>
  </div>
</div>

I want to retrieve only the strong content inside the 'overflow-hidden' div that belongs to the span whose text is Branch. The code I am using is:

from bs4 import BeautifulSoup
import urllib2 
url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)
type = soup.find('div',attrs={"class":"table-info"}).findAll('span')
print type

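This scrapes all the span contents inside the main div 'table-info', so that I can use a conditional statement to pick out the content I need. But if I try to scrape the div content within the span like this: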

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')
print type

I get the error:

AttributeError: 'list' object has no attribute 'find'

Can anyone give me an idea of how to retrieve the div content that goes with the span? Thanks. I am using Python 2.7.

2 Answers:

Answer 0: (score: 1)

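It looks like you want the content of the second div inside the div "table-info". However, you are trying to reach it through a tag that is unrelated to the content you want to access.

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')

raises an error because findAll returns a list of results, and a list has no find method. Even on a single span the search would come back empty, because the div is a sibling of the span and not nested inside it.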
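The failure can be reproduced offline with a small sketch. It assumes Python 3 with the bs4 package installed, uses the built-in "html.parser", and inlines a trimmed copy of the question's markup instead of fetching the placeholder URL. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, inlined so no network access is needed.
HTML = """
<div class="table-info">
  <div>
    <span>Branch</span>
    <div class="overflow-hidden"><strong>IT</strong></div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
spans = soup.find("div", attrs={"class": "table-info"}).find_all("span")

# find_all returns a list-like collection, so calling .find on it fails.
try:
    spans.find("div")
    error_raised = False
except AttributeError:
    error_raised = True

# On a single span the search finds nothing: the div is a sibling, not a child.
div_inside_span = spans[0].find("div")

print(error_raised, div_inside_span)
```

Both observations point the same way: the lookup has to start from the enclosing div, not from the span.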

Try this instead:

from bs4 import BeautifulSoup
import urllib2 
url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)
type = soup.find('div',attrs={"class":"table-info"}).findAll('div')
print type[2].find('strong').string
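Why index 2 lands on the Branch row may not be obvious, so here is a minimal sketch of what findAll('div') collects. It assumes Python 3 with bs4 installed, the built-in "html.parser", and the question's markup pasted in as a string. The recursive=False variant is an alternative I am suggesting, not part of the code above.

```python
from bs4 import BeautifulSoup

# The question's markup, inlined so the sketch runs without a network request.
HTML = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
  <div><span>contact</span><div class="overflow-hidden"><strong>my location</strong></div></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("div", attrs={"class": "table-info"})

# The default search is recursive: row wrappers and "overflow-hidden" divs
# alternate in document order, so index 2 is the wrapper of the Branch row.
all_divs = table.find_all("div")
value_by_flat_index = all_divs[2].find("strong").string

# recursive=False keeps only the direct children (one wrapper per row),
# so Branch is simply row 1.
rows = table.find_all("div", recursive=False)
value_by_row_index = rows[1].find("strong").string

print(len(all_divs), len(rows), value_by_flat_index, value_by_row_index)
```

Either index breaks if the site reorders its rows, so matching on the span text is the more robust choice when the layout may change.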

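Answer 1: (score: 0)

findAll returns a list of BS elements, and find is defined on a single BS object, not on a list of them; hence the error. The initial part of your code is fine. Do this: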

from bs4 import BeautifulSoup
import urllib2 

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)

table = soup.find('div',attrs={"class":"table-info"})
spans = table.findAll('span')
branch_span = spans[1]
# Do your manipulation with branch_span

OR

from bs4 import BeautifulSoup
import urllib2 

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)

table = soup.find('div',attrs={"class":"table-info"})
spans = table.findAll('span')

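for span in spans:
    if span.text.strip().lower() == 'branch':
        # The <strong> sits in the div next to the span, so search from the parent
        print span.parent.find('strong').string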
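The loop can also be wrapped into a small helper. This is only a sketch under some assumptions: Python 3 with bs4 installed, the built-in "html.parser", and the question's markup inlined as a string rather than downloaded. The function name value_for_label is hypothetical.

```python
from bs4 import BeautifulSoup

# The question's markup, inlined so the sketch runs without a network request.
HTML = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
  <div><span>contact</span><div class="overflow-hidden"><strong>my location</strong></div></div>
</div>
"""


def value_for_label(soup, label):
    """Return the <strong> text of the row whose <span> matches label, or None."""
    table = soup.find("div", attrs={"class": "table-info"})
    if table is None:
        return None
    for span in table.find_all("span"):
        # Compare case-insensitively and ignore surrounding whitespace.
        if span.get_text(strip=True).lower() == label.lower():
            # The value lives in a sibling div, so search from the span's parent row.
            strong = span.parent.find("strong")
            return strong.get_text(strip=True) if strong is not None else None
    return None


soup = BeautifulSoup(HTML, "html.parser")
branch = value_for_label(soup, "Branch")
print(branch)
```

Returning None for a missing label or a missing table keeps the caller from hitting an AttributeError when the page layout differs from what is shown in the question.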