Unable to retrieve links and sub-links

Date: 2017-02-28 14:19:15

Tags: python elasticsearch web-scraping beautifulsoup

I'm new to Python and Beautiful Soup. I need to scrape all the links from a web page so I can index them with Elasticsearch. I'm using the code below to get all the links/sub-links on the page, but I can't retrieve any of them.

from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2  # Python 3
except ImportError:
    import urllib2                    # Python 2

urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html")

urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('a'):
    print(links.get('href'))

I can't retrieve any links/sub-links: the print() call produces no output at all.

Any pointers would be appreciated.

1 Answer:

Answer 0 (score: 1)

The data you want is loaded by an AJAX call.

Replace

http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html

with

http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment

and change the element type passed to find_all from a to node, since the TOC fragment is XML whose entries are <node> elements:

from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2  # Python 3
except ImportError:
    import urllib2                    # Python 2

urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment")

urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    print(links.get('href'))

Which outputs:

../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
../topic/com.vmware.vcf.ovdeploy.doc_21/GUID-F2DCF1B2-4EF6-444E-80BA-8F529A6D0725.html
../topic/com.vmware.vcf.admin.doc_211/GUID-D5A44DAA-866D-47C9-B1FB-BF9761F97E36.html
../topic/com.vmware.ICbase/PDF/ic_pdf.html

Note that every time you click an item in the left-hand panel, the page fires another AJAX call to populate that item's list. For example:

http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/com.vmware.evosddc.via.doc_211/toc.xml

Take the URL fragment in this example, com.vmware.evosddc.via.doc_211, as a reference: you need to pull that piece out of the first output in order to build the request for the second output, and so on down the tree.

Example:

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    # Strip the "../topic/" prefix, then keep only the document id
    # (e.g. com.vmware.evosddc.via.doc_211) before the first "/".
    child_url = links.get('href').replace("../topic/", "")
    # Fetch that document's own TOC fragment.
    child = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/" + child_url[0:child_url.index("/")])
    print(child.read())

Which outputs:

<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
      path="0"
      title="VIA User&apos;s Guide"
      id="/com.vmware.evosddc.via.doc_211/toc.xml"
      href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
      image="toc_closed">
</node>

...
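To collect every link and sub-link in one go, the same pattern can be applied recursively: read a TOC fragment, record each node's href, then request the fragment identified by that node's id attribute. Below is a minimal Python 3 sketch of that idea; it assumes the nested fragments use the same <node> structure shown above, and the names fetch_fragment and collect_links are just illustrative.

from bs4 import BeautifulSoup
import urllib.request

BASE = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"

def fetch_fragment(url):
    # Download one TOC fragment and parse it; every entry is a <node> element.
    with urllib.request.urlopen(url) as resp:
        return BeautifulSoup(resp.read(), "html.parser")

def collect_links(toc_url, seen=None):
    # Walk the TOC tree depth-first and return every href encountered.
    seen = set() if seen is None else seen
    links = []
    for node in fetch_fragment(toc_url).find_all("node"):
        href = node.get("href")
        if href:
            links.append(href)
        toc_id = node.get("id")  # e.g. /com.vmware.evosddc.via.doc_211/toc.xml
        if toc_id and toc_id not in seen:
            seen.add(toc_id)
            links.extend(collect_links(BASE + "?toc=" + toc_id, seen))
    return links

for link in collect_links(BASE):
    print(link)

The hrefs come back relative to the advanced/ path (../topic/...), so they should resolve to URLs under http://pubs.vmware.com/sddc-mgr-12/topic/ before being handed to whatever indexes them into Elasticsearch.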