BeautifulSoup: <ul> elements don't show up in the .children list. Is the parser at fault?

Time: 2019-02-23 21:54:55

Tags: python html parsing web-scraping beautifulsoup

This is my code:

import bs4 as bs
from urllib.request import urlopen

page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/#1/").read()

soup = bs.BeautifulSoup(page, "lxml")

div_lista_locacao = soup.select("div#lista-locacao")[0]

ul_tags = list(div_lista_locacao.children)

print("ul_tags = ",ul_tags)

(As you can see, I print a list containing the children of div_lista_locacao.)

Output:

ul_tags =  ['\n']

(It shows only a newline character, even though the div has real children, as shown below.)

This is the HTML of my source:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" style="" class=" js flexbox flexboxlegacy canvas canvastext webgl no-touch geolocation postmessage no-websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths"
  lang="pt">
<head></head>
<body id="topo_geral" itemscope="" itemtype="http://schema.org     
   /WebPage">
  <div id="container-hero" class="container-fluid"></div>
  <div id="resultado" class="container-fluid page-container">
    <!-- DESKTOP -->
    <div id="banner-resultado" class="col col-xs-12 col-sm-12 col-
       md-12col-lg-12 text-center hide"></div>
    <div class="row hidden-xs hidden-sm">
      <div class="col col-xs-12 col-sm-12 col-md-3 col-lg-3 filtro-  
         resultado"></div>
      <div class="col col-xs-12 col-sm-12 col-md-9 col-lg-9 box-
         resultado-hidden-xs hidden-sm"></div>
      <button id="btn-ordenacao-por-valor" data-ordenar="asc" class="btnbtn-valor btn-branco"></button>
      <ul class="nav nav-tabs" role="tablist" id="myTab"></ul>
      <div class="tab-content">
        <div role="tabpanel" class="tab-pane active" id="locacao">
          #Currently manipulating this tag beneath. This is the "div_lista_locacao" variable.
          <div id="lista-locacao" class="col col-xs-12 col-sm-12 col-
            md-12 col-lg-12 nopadmar">
            ##Need to iterate between these 'ul' tags beneath and parse the text internally.
            ## But they won't show up in the .children list.
              <ul class="ul-resultado paginacao paginacao_numero_1" style="display: block;"></ul>
              <ul class="ul-resultado paginacao paginacao_numero_2" style="display: block;"></ul>
              <ul class="ul-resultado paginacao paginacao_numero_3" style="display: none;"></ul>
          </div>
        </div>
      </div>
    </div>
  </div>
</body>

</html>

##I can reply with the contents inside the 'ul' tags if requested. 
##But I just thought it wouldn't be necessary for this particular question.

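I am parsing with "lxml", but I have already tried switching to "html.parser", "html5lib" and "xml". They all give pretty much the same result.

So, is it the parser? Is it the library I use to download the web page? Is it not downloading this part? Or is it a BS bug? IDK.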

3 answers:

Answer 0: (score: 1)

As already mentioned in the answer by @facelessuser, the content is loaded dynamically with JavaScript.

The good news is that you can make the same AJAX request from Python and get the JSON response, which contains all the data you need. Here I am just printing out the price.

import bs4 as bs
from urllib.request import urlopen
import json
page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/?pagina=1&busca=%7B%22valorMinimo%22%3Anull%2C%22valorMaximo%22%3Anull%2C%22quartos%22%3Anull%2C%22suites%22%3Anull%2C%22banhos%22%3Anull%2C%22vagas%22%3Anull%2C%22idadeMinima%22%3Anull%2C%22areaMinima%22%3Anull%2C%22areaMaxima%22%3Anull%2C%22bairros%22%3A%5B%22santo-antonio%22%5D%2C%22ordenar%22%3Anull%7D&outrasPags=true&quantidadeDeRegistro=20&first=false").read()
properties=json.loads(page)['lista']
for item in properties:
    print(item['valorLocacaoFormat'])

Output

R$ 1.490,00
R$ 2.300,00
R$ 1.480,00
R$ 1.600,00
R$ 1.700,00
R$ 2.100,00
R$ 1.600,00
...
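
The prices arrive as formatted strings such as "R$ 1.490,00", so they need converting before you can sort or compare them. The following is a minimal sketch of that conversion. The payload is a hand-written sample that only mimics the shape used above (the `lista` and `valorLocacaoFormat` keys come from the answer; everything else, including the helper name `parse_brl`, is an assumption for illustration), so it runs without touching the network.

```python
import json
from decimal import Decimal

# Hand-written sample mimicking the response shape; not real server output.
sample_response = json.dumps({
    "lista": [
        {"valorLocacaoFormat": "R$ 1.490,00"},
        {"valorLocacaoFormat": "R$ 2.300,00"},
        {"valorLocacaoFormat": "R$ 1.480,00"},
    ]
})


def parse_brl(text):
    """Convert a Brazilian-formatted price such as 'R$ 1.490,00' to a Decimal."""
    digits = text.replace("R$", "").strip()
    # '.' is the thousands separator and ',' the decimal separator.
    return Decimal(digits.replace(".", "").replace(",", "."))


properties = json.loads(sample_response)["lista"]
prices = [parse_brl(item["valorLocacaoFormat"]) for item in properties]
print(min(prices), max(prices))
```

Decimal is used instead of float so that currency values are not subject to binary rounding.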

Note: To find the AJAX URL that I am using, open the Network tab in your browser's developer tools and go to the page URL. You can see the XHR request being made.
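
The long query string in that URL is easier to work with once decoded: the `busca` parameter is a URL-encoded JSON object. Here is a rough sketch that rebuilds the same query string from readable parameters. The parameter names are copied from the URL above; the function name `build_listing_url` is hypothetical, and the idea that changing `pagina` returns other result pages is an untested assumption about the site.

```python
import json
from urllib.parse import urlencode

BASE_URL = ("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/"
            "bairros/santo-antonio/apartamento/")


def build_listing_url(page=1, per_page=20):
    """Rebuild the AJAX URL from readable parameters (names taken from the URL above)."""
    # The `busca` parameter is a JSON object; key order follows the original URL.
    search = {
        "valorMinimo": None, "valorMaximo": None, "quartos": None,
        "suites": None, "banhos": None, "vagas": None,
        "idadeMinima": None, "areaMinima": None, "areaMaxima": None,
        "bairros": ["santo-antonio"], "ordenar": None,
    }
    query = urlencode({
        "pagina": page,
        # Compact separators keep the encoded JSON free of spaces.
        "busca": json.dumps(search, separators=(",", ":")),
        "outrasPags": "true",
        "quantidadeDeRegistro": per_page,
        "first": "false",
    })
    return BASE_URL + "?" + query


url = build_listing_url(page=1)
print(url)
```

The resulting string can be passed to urlopen exactly as in the code above.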


Answer 1: (score: 0)

I think the ul content is loaded dynamically via JavaScript after the page loads. Running your script and printing out the selected div, I get:

[<div class="col col-xs-12 col-sm-12 col-md-12 col-lg-12 nopadmar" id="lista-locacao"> </div>]

As you can see, there are no ul elements in that div to select. You will probably need something like selenium to fetch the dynamic content and then select the div once you have the full HTML. A plain HTTP request (requests or urllib) is not enough on its own, because JavaScript has to run first to load the ul elements.
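
One way to check this kind of diagnosis without a browser is to count the ul tags inside the container in the raw HTML the server returns. Below is a minimal sketch using only the standard library. The two HTML strings are hand-written stand-ins for the downloaded page and the browser-rendered page, and the names `UlCounter` and `count_uls` are hypothetical helpers, not part of BeautifulSoup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class UlCounter(HTMLParser):
    """Count <ul> tags nested inside the element with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0       # 0 means we are outside the container
        self.ul_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "ul":
                self.ul_count += 1
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1   # entered the container

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1


def count_uls(html, container_id):
    parser = UlCounter(container_id)
    parser.feed(html)
    parser.close()
    return parser.ul_count


# Stand-in for what urlopen downloads: the div is present but empty.
raw_html = '<div id="lista-locacao" class="nopadmar">\n</div>'
# Stand-in for what the browser shows after JavaScript has run.
rendered_html = ('<div id="lista-locacao"><ul class="ul-resultado"></ul>'
                 '<ul class="ul-resultado"></ul></div>')

print(count_uls(raw_html, "lista-locacao"))
print(count_uls(rendered_html, "lista-locacao"))
```

If the count for the downloaded page is zero while the browser shows list items, the parser is not at fault: the elements were never in the response.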

Answer 2: (score: 0)

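As @facelessuser and @Bitto said, the content is loaded dynamically with JavaScript. If you go to the page, click view-source and search for your id, you won't see any ul.

In this case, selenium is a more robust way to get elements that are rendered by JavaScript.

If you don't have a driver installed, you can get one at http://chromedriver.chromium.org/getting-started

Full code: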

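from selenium import webdriver

# Note: this uses the Selenium 3 API. Selenium 4 replaces executable_path with a
# Service object and find_elements_by_css_selector with find_elements(By.CSS_SELECTOR, ...).
options = webdriver.ChromeOptions()
# Path to the local chromedriver binary; adjust it for your machine.
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'/Users/omertekbiyik/PycharmProjects/bitirme/chromedriver')
driver.get('https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/#1/')

# Select the container div after the browser has loaded the page.
x = driver.find_elements_by_css_selector("div[id='lista-locacao']")

for a in x:
    print(a.text)

driver.close()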

Output:

Apartamento para alugar de 3 quartos
Santo Antônio - Rua Engenheiro Zoroastro Torres, 149
More na região nobre do Santo Antônio! Local tranquilo com comércio próximo, esquina com Av. Prudente de Moraes. Prédio familiar com 08 andares e 02 elevadores, 02 aptos por andar,
3
quartos
2
suítes
3
banhos
2
vagas
R$ 1.490
condomínio: R$ 1100
código: 724362
96 m²
Apartamento para alugar de 3 quartos
Santo Antônio - Rua Paulo Afonso, 587
ALUGUE SEM FIADOR pelo melhor preço: 1 + 11 parcelas de R$ 292,50**Mediante aprovação de ficha cadastral do locatário pela seguradoraO seu próximo lar na melhor localização do bair
3
quartos
1
suíte
2
banhos
2
vagas
R$ 2.300
condomínio: R$ 1452
código: 677116
175 m²
... (and so on for all the ul tags)

You can also see all the HTML inside the div, e.g.:

for a in x:
    print(a.get_attribute('innerHTML'))