I want to scrape product information from this website: http://megabuy.vn/Default.aspx.
My plan is to scrape the site following its structure: first collect all the links to the top-level categories, then drill down into the subcategories, and finally into each individual product.
However, I cannot scrape the top-level category links (such as: ... etc.).
I think the problem is that these links sit under JavaScript tags.
Here is my code:
from bs4 import BeautifulSoup
import requests
import re

def web_scrape(url):
    web_connect = requests.get(url)
    text = web_connect.text
    soup = BeautifulSoup(text, "html.parser")
    return soup

homepage = web_scrape("http://megabuy.vn/Default.aspx")
listgianhang = homepage.findAll("a", class_=re.compile("ContentPlaceholder"))
len(listgianhang)
The result I get is 0.
Answer (score: 0):
import requests, bs4, re

r = requests.get('http://megabuy.vn/Default.aspx')
soup = bs4.BeautifulSoup(r.text, 'lxml')
# The category menu sits inside the element with this id
table = soup.find(id='ctl00_ContentPlaceHolder1_TopMenu1_dlMenu')
# Calling the tag is shorthand for find_all(); keep only absolute links
for a in table('a', href=re.compile(r'^http')):
    link = a.get('href')
    text = a.text
    print(link, text)
The reason you cannot fetch the tags by class is that the class attributes are generated by JavaScript after the page loads. Running the code above prints the category links:
http://megabuy.vn/gian-hang/thiet-bi-van-phong THIẾT BỊ VĂN PHÒNG
http://megabuy.vn/gian-hang/may-fax Máy Fax
http://megabuy.vn/gian-hang/may-fax/hsx/Panasonic Panasonic
http://megabuy.vn/gian-hang/may-chieu-man-chieu-phu-kien Máy chiếu Màn chiếu Phụ kiện
http://megabuy.vn/gian-hang/may-chieu-projector Máy chiếu projector
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/Optoma Optoma
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/Sony Sony
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/ViewSonic ViewSonic
http://megabuy.vn/gian-hang/may-chieu-man-chieu-phu-kien Xem thêm
http://megabuy.vn/gian-hang/may-photocopy Máy photocopy
http://megabuy.vn/gian-hang/may-photocopy- Máy photocopy
http://megabuy.vn/gian-hang/may-photocopy-/hsx/Canon Canon
http://megabuy.vn/gian-hang/may-photocopy-/hsx/Ricoh Ricoh
The raw HTML returned by the server simply does not contain that class attribute.
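If you want to confirm this yourself, here is a minimal diagnostic sketch (not part of the original answer, and assuming the same menu id used above): fetch the page with requests and print the attribute dictionaries of the first few a tags inside the menu. They should show href but no class.

import requests, bs4

# Diagnostic sketch: show which attributes the raw (pre-JavaScript) HTML
# actually puts on the menu links. The menu id below is the one used in
# the answer above and may change if the site is updated.
r = requests.get('http://megabuy.vn/Default.aspx')
soup = bs4.BeautifulSoup(r.text, 'html.parser')
menu = soup.find(id='ctl00_ContentPlaceHolder1_TopMenu1_dlMenu')
if menu is None:
    print('Menu element not found - the page layout may have changed.')
else:
    for a in menu.find_all('a', limit=5):
        # a.attrs holds only the attributes present in the raw HTML;
        # expect to see 'href' here, but no 'class'.
        print(a.attrs)

Because the classes only appear after client-side JavaScript runs, selecting by class would need a browser-driven tool such as Selenium; matching on the stable element id, as the answer does, works with plain requests.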