Python: How do I extract an HTML element with a "data-bind" attribute?

Time: 2017-07-19 16:51:14

Tags: python html data-binding web-scraping data-extraction

I am trying to extract data from a website. The element's text is hidden: when I use "View Source", the heading text is not shown.

<h4 data-bind="Text: Name"></h4>

But when I inspect the element, I can see the text.

<h4 data-bind="Text: Name">STM1F-1S-HC</h4>

The code I am using is:

def getlink(link):
    try:
        f = urllib.request.urlopen(link)
        soup0 = BeautifulSoup(f)
    except Exception as e:
        print (e)
        soup0 = 'abc'
    for row2 in soup0.findAll("h4",{"data-bind":"text: Name"}):
        Name = row2.text
        print(Name)

#code to find all links to the products for further processing.
i=1
global i
for row in r1.findAll('a', { "class" : "col-xs-12 col-sm-6" }):
    link = 'https://www.truemfg.com/USA-Foodservice/'+row['href']
    print(link)
    getlink(link)
print(productcount)

The output is:

https://www.truemfg.com/USA-Foodservice/Products/Traditional-Reach-Ins
C:\Users\Santosh\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 193 of the file C:\Users\Santosh\Anaconda3\lib\runpy.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

https://www.truemfg.com/USA-Foodservice/Products/Specification-Series

https://www.truemfg.com/USA-Foodservice/Products/Food-Prep-Tables

https://www.truemfg.com/USA-Foodservice/Products/Undercounters

https://www.truemfg.com/USA-Foodservice/Products/Worktops

https://www.truemfg.com/USA-Foodservice/Products/Chef-Bases

https://www.truemfg.com/USA-Foodservice/Products/Milk-Coolers

https://www.truemfg.com/USA-Foodservice/Products/Glass-Door-Merchandisers

https://www.truemfg.com/USA-Foodservice/Products/Air-Curtains

https://www.truemfg.com/USA-Foodservice/Products/Display-Cases

https://www.truemfg.com/USA-Foodservice/Products/Underbar-Refrigeration

As you can see, no names are printed.

Could someone let me know how to get the names printed?

Thanks, Santosh

1 answer:

Answer 0: (score: 0)

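The content you need is generated dynamically by an XHR request, so it is not present in the HTML that urllib downloads. You can request the data directly with the code below and avoid parsing HTML altogether: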

import requests

# JSON endpoint that the product page queries via XHR
url = 'https://prodtrueservices.azurewebsites.net/api/products/productline/403/1?skip=0&take=200&unit=Imperial'
r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error

# Parse the JSON once and print the name of every product
for product in r.json()['Products']:
    print(product['Name'])

This should let you get all the names.
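If it helps, the same idea can be split into small functions so that the name extraction can be checked without hitting the network. This is only a sketch under some assumptions: the helper names (`build_url`, `extract_names`, `fetch_names`) are made up for illustration, it uses the standard library's `urllib.request` and `json` in place of `requests`, and it assumes that `skip` and `take` act as paging parameters, which their names suggest but which I have not confirmed. The response shape (a `Products` list whose entries have a `Name` key) is taken from the code above.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint from the answer above, with the query string split out.
BASE_URL = 'https://prodtrueservices.azurewebsites.net/api/products/productline/403/1'


def build_url(skip=0, take=200, unit='Imperial'):
    """Return the API URL with the given query parameters (hypothetical helper)."""
    return BASE_URL + '?' + urlencode({'skip': skip, 'take': take, 'unit': unit})


def extract_names(payload):
    """Return the 'Name' of each entry in payload['Products'].

    Entries without a 'Name' key are skipped, and a payload without
    'Products' yields an empty list.
    """
    return [p['Name'] for p in payload.get('Products', []) if 'Name' in p]


def fetch_names(skip=0, take=200):
    """Download one response and return its product names (needs network access)."""
    with urllib.request.urlopen(build_url(skip, take)) as resp:
        return extract_names(json.load(resp))
```

Calling `fetch_names()` and printing each item of the returned list should give the same output as the loop above, provided the endpoint still responds in the same way.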