Question

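I am trying to crawl a list of websites hosted on Shopify. Examples include dudesgadget and 2ubest. Each of these Shopify stores designs and structures its web elements differently and has its own domain. They look like independent websites, but they are not.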

The main problem is that I am trying to grab the product details. I will summarize a few of the different structures:

2ubest

<html>
    <body>
        <div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
            <main class="wrapper main-content" role="main">
                <div class="grid">
                    <div class="grid__item">
                        <div id="shopify-section-product-template" class="shopify-section">
                            <script id="ProductJson-product-template" type="application/json">
                                //Things I am looking for
                            </script>
                        </div>
                    </div>
                </div>
            </main>
        </div>
    </body>
</html>

littleplayland

<html>
    <body id="adjustable-ergonomic-laptop-stand" class="template-product">
        <script>
            //Things I am looking for
        </script>
    </body>
</html>

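There are several others, and I found a pattern among them:

What I am looking for is always inside <body>.
What I am looking for is inside a <script>.
The only thing I am unsure of is the distance from <body> to <script>.

My solution is: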

def parse(self, response):
    body = response.xpath("//body")
    for script in body.xpath("//script/text()").extract():
        # Manipulate the script with js2xml here
        pass

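I was able to extract the data for littleplayland, dailysteals, and many other sites where the distance from <body> to <script> is very short, but not for sites such as 2ubest, where many other HTML elements sit between <body> and what I am looking for. Is there a solution that ignores all the HTML elements in between and looks only for <script> tags?

I need a general solution that, if possible, works across all of these stores (2ubest included), since they all share the characteristics I mentioned above.

This means the solution should not filter on <div>, because every site has a different number of <div> elements.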

Answer 1

Here is how to get the scripts out of the HTML using Scrapy:

scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()

for script in theScripts:
    #Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")

Here is a full example using the HTML from your question (I added an extra script :)):

import scrapy

text="""<html>
    <body>
        <div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
            <main class="wrapper main-content" role="main">
                <div class="grid">
                    <div class="grid__item">
                        <div id="shopify-section-product-template" class="shopify-section">
                            <script id="ProductJson-product-template" type="application/json">
                                //Things I am looking for
                            </script>
                        </div>
                        <script id="script 2">I am another script</script>
                    </div>
                </div>
            </main>
        </div>
    </body>
</html>"""

scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()

for script in theScripts:
    #Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")
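Since the 2ubest script in the question is declared as type="application/json", its text can probably be decoded with the json module directly, without js2xml. The sketch below illustrates that idea using only the standard library's html.parser as a stand-in for the Scrapy selector; with Scrapy, the equivalent filter would be the XPath //script[@type="application/json"]/text(). The class name JsonScriptCollector and the product fields (title, price) are made up for illustration and are not taken from any real store.

```python
import json
from html.parser import HTMLParser


class JsonScriptCollector(HTMLParser):
    """Collect the text of <script type="application/json"> tags at any depth."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are within a matching <script>
        self.payloads = []    # raw text of each matching script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._inside = True
            self.payloads.append("")

    def handle_data(self, data):
        if self._inside:
            self.payloads[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


# Hypothetical page: the JSON content is invented sample data.
SAMPLE = """<html><body><div><main><div>
<script id="ProductJson-product-template" type="application/json">
{"title": "Laptop Stand", "price": 1999}
</script>
<script>var unrelated = 1;</script>
</div></main></div></body></html>"""

collector = JsonScriptCollector()
collector.feed(SAMPLE)
collector.close()

# Each collected payload is plain JSON, so json.loads is enough here.
products = [json.loads(payload) for payload in collector.payloads]
print(products)
```

The number of wrapping <div> elements never enters the logic: the collector only reacts to the <script> tag itself, so the same code should apply to stores with different layouts.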

Answer 2

Try this:

//body//script/text()
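The double slash is the descendant shorthand, so the expression matches a <script> at any depth below <body>, however many <div> elements sit in between. A minimal sketch of that behaviour follows, using the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector (an assumption made so the example runs without Scrapy; ElementTree only accepts well-formed markup and supports a subset of XPath). The two sample pages and the helper script_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages: one nests the script deeply, the other not at all.
DEEP = """<html><body><div><main><div><div>
<script>deep payload</script>
</div></div></main></div></body></html>"""

SHALLOW = """<html><body><script>shallow payload</script></body></html>"""


def script_texts(markup):
    """Return the text of every <script> below <body>, at any depth."""
    root = ET.fromstring(markup)  # root is the <html> element
    body = root.find("body")
    # ".//script" searches all descendants of <body>, which mirrors
    # "//body//script" in full XPath.
    return [(script.text or "").strip() for script in body.findall(".//script")]


deep_result = script_texts(DEEP)
shallow_result = script_texts(SHALLOW)
print(deep_result)
print(shallow_result)
```

Both pages go through the same expression, and the nesting depth makes no difference to what is matched.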

Scrapy: scraping elements nested inside an unknown number of <div> elements
