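I'm new to Python and Beautiful Soup. I need to scrape all the links from a web page so that I can index them with Elasticsearch. I'm using the code below to get all the links/sub-links inside the page, but it doesn't retrieve any of them.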
from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html")
urlHtml = urlFile.read()
urlFile.close()
soup = BeautifulSoup(urlHtml,"html.parser")
urlAll = soup.find_all("a")
for links in soup.find_all('a'):
    print(links.get('href'))
No links/sub-links are retrieved: print() produces no output.
Please give me some pointers.
Answer 0 (score: 1)
The data you want is loaded via an ajax call.
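One way to see why the original request returns no links: everything after the # in a URL is a fragment, which stays on the client and is never sent to the server. So urlopen only ever fetches index.jsp, and the content is filled in afterwards by the page's JavaScript, which urllib does not run. A minimal sketch using the standard library's urldefrag, assuming Python 3:

```python
from urllib.parse import urldefrag

original = ("http://pubs.vmware.com/sddc-mgr-12/index.jsp"
            "#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html")

# urldefrag splits a URL into the part that is requested and the fragment.
requested, fragment = urldefrag(original)

print(requested)  # the only part the server ever sees
print(fragment)   # handled client-side by the page's JavaScript
```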
Replace
http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
with
http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment
and change the element type passed to find_all to node:
from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment")
urlHtml = urlFile.read()
urlFile.close()
soup = BeautifulSoup(urlHtml,"html.parser")
for links in soup.find_all('node'):
    print(links.get('href'))
Which outputs:
../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
../topic/com.vmware.vcf.ovdeploy.doc_21/GUID-F2DCF1B2-4EF6-444E-80BA-8F529A6D0725.html
../topic/com.vmware.vcf.admin.doc_211/GUID-D5A44DAA-866D-47C9-B1FB-BF9761F97E36.html
../topic/com.vmware.ICbase/PDF/ic_pdf.html
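The hrefs printed above are relative to the tocfragment URL, so they need to be resolved before they can be fetched or indexed. A small sketch with urljoin; BASE and to_absolute are names made up for illustration, and it assumes Python 3 and that the resolved addresses are where the server actually serves the topic pages:

```python
from urllib.parse import urljoin

# URL the relative hrefs were read from (taken from the answer above).
BASE = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"


def to_absolute(href, base=BASE):
    """Resolve a relative href against the URL it was fetched from."""
    return urljoin(base, href)


hrefs = [
    "../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html",
    "../topic/com.vmware.ICbase/PDF/ic_pdf.html",
]

absolute = [to_absolute(h) for h in hrefs]
for url in absolute:
    print(url)
```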
Note that every time you click an item in the left panel, it fires an ajax call to populate the list. For example:
http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/com.vmware.evosddc.via.doc_211/toc.xml
Note this particular part of the URL, for example: com.vmware.evosddc.via.doc_211
You need to take that part from the first output in order to request the second output, and so on.
Example:
soup = BeautifulSoup(urlHtml,"html.parser")
for links in soup.find_all('node'):
    child_url = links.get('href').replace("../topic/", "")
    child = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/" + child_url[0:child_url.index("/")])
    print(child.read())
    #print(links.get('href'))
Which outputs:
<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
path="0"
title="VIA User's Guide"
id="/com.vmware.evosddc.via.doc_211/toc.xml"
href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
image="toc_closed">
</node>
...
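The id attribute in this response has the same shape as the toc parameter in the example URL earlier, so one possible next step is to read each node's id and href and build the URL for the next level from it. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup and runs on a trimmed copy of the output above. TOC_URL, SAMPLE and parse_nodes are hypothetical names, the closing </tree_data> tag is an assumption (the real output is elided), and whether every node's id is accepted as a toc value has not been verified:

```python
import xml.etree.ElementTree as ET

TOC_URL = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"

# Trimmed copy of the response shown above, closed off so that it parses.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
path="0"
title="VIA User's Guide"
id="/com.vmware.evosddc.via.doc_211/toc.xml"
href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
image="toc_closed">
</node>
</tree_data>"""


def parse_nodes(xml_bytes):
    """Return (title, href, child_toc_url) for every <node> in a response."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for node in root.iter("node"):
        toc_id = node.get("id")
        # Assumption: a node's id can be passed back as the toc parameter.
        child = TOC_URL + "?toc=" + toc_id if toc_id else None
        rows.append((node.get("title"), node.get("href"), child))
    return rows


rows = parse_nodes(SAMPLE)
for title, href, child in rows:
    print(title, href, child)
```

In a real crawl, the bytes returned by child.read() would take the place of SAMPLE, and each child_toc_url would be fetched in turn until no new nodes appear.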