Question

My goal is to scrape the names of the top-selling products on Ali-Express.

I'm using the Requests library and Beautiful Soup to do this.

from bs4 import BeautifulSoup as bs
import requests as req
import pprint as pp

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # Uncomment to verify that the page was fetched
all_items = soup.find_all('li', class_='top10-item')

pp.pprint(all_items)

# Output: []

However, this returns an empty list, which suggests that soup.find_all() did not find any tags matching those criteria.

Inspect Element in Chrome displays the list items as such

But in the page source, the (ul class="top10-items") section contains a script that seems to iterate over each list item (I'm not familiar with HTML).

        <div class="container">
            <div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
            <ul class="top10-items loading" id="bestselling-top10">
            </ul>
            <script class="X-template-top10" type="text/mustache-template">
    {{#topList}}
    <li class="top10-item">
        <div class="rank-orders">
            <span class="rank">{{rank}}</span>
            <span class="orders">{{productOrderNum}}</span>
        </div>
        <div class="img-wrap">
            <a href="{{productDetailUrl}}" target="_blank">
                <img src="{{productImgUrl}}" alt="{{productName}}">
            </a>
        </div>
        <a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
        <p class="item-price">
            <span class="price">US ${{productMinPrice}}</span>
            <span class="uint">/ {{productUnitType}}</span>
        </p>
    </li>
    {{/topList}}</script>
        </div>
    </div>

So this probably explains why soup.find_all() can't find the "li" tags.
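A minimal sketch of that explanation, using a trimmed copy of the markup above in place of a live request (the `html` string and the variable names are assumptions made for illustration). With 'html.parser', the contents of a script tag appear to be kept as raw text rather than parsed into tags, so the "li" only becomes visible once the script text is parsed on its own:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: stands in for the live page).
html = """
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
    <a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
</li>
{{/topList}}</script>
"""

soup = BeautifulSoup(html, "html.parser")

# The <li> lives inside the <script> text, so the outer document has no matching tags.
outer_items = soup.find_all("li", class_="top10-item")

# Take the template text out of the script tag and parse it as a separate document.
template_tag = soup.find("script", class_="X-template-top10")
template_soup = BeautifulSoup(str(template_tag.string), "html.parser")
inner_items = template_soup.find_all("li", class_="top10-item")

# Collect the text of each item's description link.
names = [li.find("a", class_="item-desc").get_text(strip=True) for li in inner_items]

print(len(outer_items))  # number of <li> tags visible in the outer document
print(names)             # text found inside the template's <li>
```

Note that what comes back is the template placeholder, not a real product name. The template seems to be filled in by JavaScript in the browser, so the actual names would have to come from whatever data that script loads, which the source shown here does not reveal.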

My question is: how can I extract the item names from the script using Beautiful Soup?

How to scrape tags that appear inside a script

0 Answers: