How can I scrape tags that are generated by JavaScript?

Posted: 2016-12-31 05:04:19

Tags: beautifulsoup screen-scraping findall

I want to scrape product information from this website: http://megabuy.vn/Default.aspx

My plan is to scrape the site following its structure: first collect all the links to the general categories, then drill down into the subcategories, and finally into each specific product.

But I cannot scrape the links to the general categories, such as:

  • thiet bi van phong
  • 可能是小屋
  • do da dung nha bep

and so on.
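The category → subcategory → product walk described above boils down to pulling real links out of each page and following them. A minimal sketch of the link-extraction step, run against a constructed HTML snippet (the snippet and URLs are illustrative assumptions; a real crawl would fetch each page with requests):

```python
import re
from bs4 import BeautifulSoup

# Illustrative stand-in for a category menu page.
SAMPLE_MENU = """
<div id="menu">
  <a href="http://megabuy.vn/gian-hang/thiet-bi-van-phong">Thiet bi van phong</a>
  <a href="http://megabuy.vn/gian-hang/do-gia-dung-nha-bep">Do gia dung nha bep</a>
  <a href="javascript:void(0)">Login</a>
</div>
"""

def category_links(html):
    """Return the hrefs of all real (http) links in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=re.compile(r"^http"))]

links = category_links(SAMPLE_MENU)
print(links)  # the javascript: pseudo-link is filtered out
```

The same function would then be applied to each category page to find subcategory links, and to each subcategory page to find product links.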

I think the problem is that these links sit under a JavaScript tag.

Here is my code:

from bs4 import BeautifulSoup
import requests
import re

def web_scrape(url):
    web_connect = requests.get(url)
    text = web_connect.text
    soup = BeautifulSoup(text, "html.parser")
    return soup

homepage = web_scrape("http://megabuy.vn/Default.aspx")
listgianhang = homepage.findAll("a", class_=re.compile("ContentPlaceholder"))
len(listgianhang)

The result I get is 0.

1 answer:

Answer 0 (score: 0)



import requests, bs4, re

r = requests.get('http://megabuy.vn/Default.aspx')

soup = bs4.BeautifulSoup(r.text, 'lxml')
# The menu table has a stable id in the raw HTML, so select it by id
# instead of by class.
table = soup.find(id='ctl00_ContentPlaceHolder1_TopMenu1_dlMenu')
for a in table('a', href=re.compile(r'^http')):  # keep only real http links
    link = a.get('href')
    text = a.text
    print(link, text)
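Since the id-based lookup works on the raw HTML, the same idea extends to deduplicating repeated menu entries. A sketch against a constructed snippet (only the table id comes from the script above; the rows are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

# Constructed snippet reusing the menu id; a real run would fetch the
# page with requests as in the script above.
html = """
<table id="ctl00_ContentPlaceHolder1_TopMenu1_dlMenu">
  <a href="http://megabuy.vn/gian-hang/may-fax">May Fax</a>
  <a href="http://megabuy.vn/gian-hang/may-fax">May Fax</a>
  <a href="http://megabuy.vn/gian-hang/may-photocopy">May photocopy</a>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="ctl00_ContentPlaceHolder1_TopMenu1_dlMenu")

# A dict keyed by href drops duplicate links while keeping insertion order.
links = {a["href"]: a.text for a in table("a", href=re.compile(r"^http"))}
print(links)
```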

The reason you cannot get the tags by class is that the class attributes are added by JavaScript in the browser; they are not present in the raw HTML that requests downloads. Running that script prints:

http://megabuy.vn/gian-hang/thiet-bi-van-phong THIẾT BỊ VĂN PHÒNG
http://megabuy.vn/gian-hang/may-fax  Máy Fax
http://megabuy.vn/gian-hang/may-fax/hsx/Panasonic Panasonic
http://megabuy.vn/gian-hang/may-chieu-man-chieu-phu-kien  Máy chiếu Màn chiếu Phụ kiện
http://megabuy.vn/gian-hang/may-chieu-projector  Máy chiếu projector
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/Optoma Optoma
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/Sony Sony
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/ViewSonic ViewSonic
http://megabuy.vn/gian-hang/may-chieu-man-chieu-phu-kien  Xem thêm
http://megabuy.vn/gian-hang/may-photocopy  Máy photocopy
http://megabuy.vn/gian-hang/may-photocopy-  Máy photocopy 
http://megabuy.vn/gian-hang/may-photocopy-/hsx/Canon Canon
http://megabuy.vn/gian-hang/may-photocopy-/hsx/Ricoh Ricoh

The actual HTML source does not contain the class attribute.
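This mismatch is easy to demonstrate offline. In this sketch the two snippets are illustrative stand-ins for the raw server HTML (what requests sees) and the browser DOM (what DevTools shows after JavaScript has run):

```python
import re
from bs4 import BeautifulSoup

# Stand-ins: the raw server HTML vs. the browser DOM after JavaScript
# has decorated the tag with a class (both snippets are assumptions).
raw_html = '<a href="http://megabuy.vn/gian-hang/may-fax">May Fax</a>'
dom_html = '<a class="ContentPlaceholder1" href="http://megabuy.vn/gian-hang/may-fax">May Fax</a>'

pattern = re.compile("ContentPlaceholder")
raw_matches = BeautifulSoup(raw_html, "html.parser").find_all("a", class_=pattern)
dom_matches = BeautifulSoup(dom_html, "html.parser").find_all("a", class_=pattern)

print(len(raw_matches), len(dom_matches))  # 0 1

# The raw tag simply has no 'class' key at all:
a = BeautifulSoup(raw_html, "html.parser").a
print("class" in a.attrs)  # False
```

This is why the class-based `findAll` in the question returns an empty list, while selecting by an id that exists in the raw HTML works.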