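Trouble parsing adjacent blocks of tags with scalpel

Time: 2019-02-06 11:29:54

Tags: haskell web-scraping

I'm having trouble capturing blocks of tags with scalpel.

Given the following HTML snippet, stored in testS :: String: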

<body>
  <h2>Apple</h2>
  <p>I Like Apple</p>
  <p>Do you like Apple?</p>

  <h2>Banana</h2>
  <p>I Like Banana</p>
  <p>Do you like Banana?</p>

  <h2>Carrot</h2>
  <p>I Like Carrot</p>
  <p>Do you like Carrot?</p>
</body>

I want to parse each group of one h2 and two p tags into a single Block record:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import Text.HTML.Scalpel

data Block = B String String String
  deriving Show

block :: Scraper String Block
block = do
  h  <- text $ "h2"
  pa <- text $ "p"
  pb <- text $ "p"
  return $ B h pa pb

blocks :: Scraper String [Block]
blocks = chroot "body" $ replicateM 3 block

However, the scrape result is not what I want. It looks like it keeps capturing the first block over and over and never consumes it.

λ> traverse (mapM_ print) $ scrapeStringLike testS blocks
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"

Expected output:

B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"

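How can I make this work?

2 Answers:

Answer 0 (score: 1)

First, I apologize for proposing a solution without testing it or knowing scalpel (such hubris). Let me make it up to you; here is my completely rewritten attempt.

To start with, this monstrosity works: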

blocks :: Scraper String [Block]
blocks = chroot "body" $ do
  hs <- texts "h2"
  ps <- texts "p"
  return $ combine hs ps
  where
    combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
    combine _ _ = []

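I call it a monstrosity because it erases the structure of the document with the two texts calls and then recreates it through combine, relying on an assumed order. In practice this is probably not a big deal, though, because most pages are structured by grouping tags inside <div> elements.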
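To make the "assumed order" problem concrete, here is a minimal sketch that uses plain lists only (no scalpel). The input lists are made-up sample data standing in for what the two texts calls would return, and Block is redefined so the snippet stands alone.

```haskell
-- Standalone sketch: plain lists stand in for the results of
-- texts "h2" and texts "p" (hypothetical sample data).
data Block = B String String String
  deriving (Show, Eq)

-- Same pairing logic as the local combine above:
-- one heading is paired with the next two paragraphs.
combine :: [String] -> [String] -> [Block]
combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
combine _ _ = []

main :: IO ()
main = do
  -- Well-formed input: every heading has exactly two paragraphs.
  print $ combine ["Apple", "Banana"] ["a1", "a2", "b1", "b2"]
  -- "Apple" is missing its second paragraph, so it borrows "b1"
  -- from "Banana", and the "Banana" block is dropped entirely.
  print $ combine ["Apple", "Banana"] ["a1", "b1", "b2"]
```

With the second input the result is a single, wrongly paired block, and nothing signals that the document was malformed. That silent misalignment is the cost of flattening the structure first.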

So, if we had a different page instead:

testS' :: String
testS'= unlines [ "<body>",
              "<div>",
              "  <h2>Apple</h2>",
              "  <p>I Like Apple</p>",
              "  <p>Do you like Apple?</p>",
              "</div>",
              "",
              "<div>",
              "  <h2>Banana</h2>",
              "  <p>I Like Banana</p>",
              "  <p>Do you like Banana?</p>",
              "",
              "</div>",
              "<div>",
              "  <h2>Carrot</h2>",
              "  <p>I Like Carrot</p>",
              "  <p>Do you like Carrot?</p>",
              "</div>",
              "</body>"
              ]

then we could parse it with:

block' :: Scraper String Block
block' = do
  h  <- text $ "h2"
  [pa,pb] <- texts $ "p"
  return $ B h pa pb

blocks' :: Scraper String [Block]
blocks' = chroots ("body" // "div") $ block'

yielding:

B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"

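Edit: regarding >>= and combine

My combine above is a local definition in a where clause. What you see is what you get. It has nothing to do with the function used inside >>=, which, by the way, is also a locally defined function with a slightly different name (combined). Even if the two had the same name it would not matter, because each is in scope only within its own enclosing function.

As for >>=, going only by the observed behavior, each scraper starts from the beginning of the currently selected tags. So in your block definition, chroot "body" selects all the tags inside the body, text "h2" matches the first <h2>, and the following two text "p" calls both match the first <p>. The bind therefore acts like an "and": given a scalpel context holding a bunch of tags, match one <h2>, and one <p>, and (redundantly) one <p> again. Note that in the <div>-based parse I was able to use texts (note the "s") to get the two <p> tags I expected.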
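The difference between repeating text and calling texts once can be checked with a small program. This is only a sketch: the sample string and the names sample, twice, and allPs are made up for illustration, and it assumes the scalpel package is installed.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical one-section sample, not the question's full document.
sample :: String
sample = "<body><h2>Apple</h2><p>I Like Apple</p><p>Do you like Apple?</p></body>"

-- Two consecutive text "p" calls: each one starts from the same
-- context, so both return the first <p>.
twice :: Scraper String (String, String)
twice = chroot "body" $ do
  a <- text "p"
  b <- text "p"
  return (a, b)

-- texts "p" collects every matching <p> in the context instead.
allPs :: Scraper String [String]
allPs = chroot "body" $ texts "p"

main :: IO ()
main = do
  print $ scrapeStringLike sample twice
  print $ scrapeStringLike sample allPs
```

The first line printed should repeat "I Like Apple" in both positions, matching the behavior reported in the question, while the second should list both paragraphs.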

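Finally, this behavior made sense to me once I saw that scalpel is built on tag soup (and, along the way, why they call it soup). Each scrape is like dipping a spoon into an unordered soup of tags. The selector makes the soup, and the scraper is the spoon. Hope that helps.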

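Answer 1 (score: 1)

This is now supported as of scalpel version 0.6.0 through SerialScrapers. SerialScrapers let you focus on one child of the current root node at a time, and they expose APIs for moving the focus and for running Scrapers on the currently focused node.

Adapting the example code from the documentation to your HTML gives: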

-- Copyright 2019 Google LLC.
-- SPDX-License-Identifier: Apache-2.0

-- Chroot to the body tag and start a SerialScraper context with inSerial.
-- This will allow for focusing each child of body.
--
-- Many applies the subsequent logic repeatedly until it no longer matches 
-- and returns the results as a list.
chroot "body" $ inSerial $ many $ do
   -- Move the focus forward until text can be extracted from an h2 tag.
   title <- seekNext $ text "h2"
   -- Create a new SerialScraper context that contains just the tags between
   -- the current focus and the next h2 tag. Then until the end of this new 
   -- context, move the focus forward to the next p tag and extract its text.
   ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
   return (title, ps)

which would return:

[
  ("Apple", ["I Like Apple", "Do you like Apple?"]),
  ("Banana", ["I Like Banana", "Do you like Banana?"]),
  ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
]
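The scraper above returns (title, paragraphs) pairs rather than the Block record from the question. One possible way to bridge the two is sketched below using only base; toBlock is a hypothetical helper, not part of scalpel, and the scraped list is made-up sample data that includes one deliberately malformed section.

```haskell
data Block = B String String String
  deriving (Show, Eq)

-- Hypothetical helper: accept a section only if it has exactly
-- two paragraphs, otherwise report failure with Nothing.
toBlock :: (String, [String]) -> Maybe Block
toBlock (title, [pa, pb]) = Just (B title pa pb)
toBlock _                 = Nothing

main :: IO ()
main = do
  let scraped =
        [ ("Apple", ["I Like Apple", "Do you like Apple?"])
        , ("Banana", ["I Like Banana"])  -- made-up malformed section
        ]
  -- traverse fails the whole conversion if any section is malformed.
  print $ traverse toBlock scraped
  -- With only the well-formed section, the conversion succeeds.
  print $ traverse toBlock (take 1 scraped)
```

Using traverse makes a malformed section fail loudly. If skipping bad sections is preferable, mapMaybe from Data.Maybe could be used in its place.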