I'm having trouble capturing tag blocks with scalpel.
The following HTML snippet is stored in testS :: String:
<body>
<h2>Apple</h2>
<p>I Like Apple</p>
<p>Do you like Apple?</p>
<h2>Banana</h2>
<p>I Like Banana</p>
<p>Do you like Banana?</p>
<h2>Carrot</h2>
<p>I Like Carrot</p>
<p>Do you like Carrot?</p>
</body>
I want to parse each block of one h2 and two p tags into a single Block record.
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import Text.HTML.Scalpel

data Block = B String String String
  deriving Show

block :: Scraper String Block
block = do
  h  <- text $ "h2"
  pa <- text $ "p"
  pb <- text $ "p"
  return $ B h pa pb

blocks :: Scraper String [Block]
blocks = chroot "body" $ replicateM 3 block
But the scraped result is not what I want; it seems to keep capturing the first block over and over, never consuming it.
λ> traverse (mapM_ print) $ scrapeStringLike testS blocks
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
Expected output:
B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"
How can I make this work?
Answer 0 (score: 1)
First, my apologies for an earlier proposal that was untested and ignorant of scalpel (i.e., hubris). Let me make it up to you; here is my attempt at a complete rewrite.
First, this monstrosity works:
blocks :: Scraper String [Block]
blocks = chroot "body" $ do
  hs <- texts "h2"
  ps <- texts "p"
  return $ combine hs ps
  where
    combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
    combine _ _ = []
I call it a monstrosity because the two texts calls erase the document's structure, and combine then recreates it under an assumed ordering. In practice this isn't a big problem, though, since most pages group their tags into <div> blocks.
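That fragility is easy to see in plain Haskell, without running scalpel at all. Here is a minimal sketch (pairUp is my own standalone stand-in for a combine-style zipper, and the sample lists are made up):

```haskell
-- pairUp: a standalone stand-in for a local `combine`. It pairs each
-- heading with the next two paragraphs from a flat list, trusting that
-- both lists arrive in the right order and shape.
pairUp :: [String] -> [String] -> [(String, String, String)]
pairUp (h:hs) (p:p':ps) = (h, p, p') : pairUp hs ps
pairUp _ _ = []

main :: IO ()
main = do
  -- Well-formed input: each heading has exactly two paragraphs.
  print (pairUp ["Apple", "Banana"] ["a1", "a2", "b1", "b2"])
  -- → [("Apple","a1","a2"),("Banana","b1","b2")]

  -- If one block is missing a <p>, the pairing silently shifts,
  -- attaching the next block's paragraph to the wrong heading:
  print (pairUp ["Apple", "Banana"] ["a1", "b1", "b2"])
  -- → [("Apple","a1","b1")]
```

The second call is exactly the failure mode the "assumed ordering" invites: nothing ties a paragraph to its heading once the structure has been flattened.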
So if we use this alternative page instead:
testS' :: String
testS' = unlines [ "<body>",
"<div>",
" <h2>Apple</h2>",
" <p>I Like Apple</p>",
" <p>Do you like Apple?</p>",
"</div>",
"",
"<div>",
" <h2>Banana</h2>",
" <p>I Like Banana</p>",
" <p>Do you like Banana?</p>",
"",
"</div>",
"<div>",
" <h2>Carrot</h2>",
" <p>I Like Carrot</p>",
" <p>Do you like Carrot?</p>",
"</div>",
"</body>"
]
then we can parse it with:
block' :: Scraper String Block
block' = do
  h <- text $ "h2"
  [pa, pb] <- texts $ "p"
  return $ B h pa pb

blocks' :: Scraper String [Block]
blocks' = chroots ("body" // "div") $ block'
yielding:
B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"
Edit: re >>= and combine
My combine above is a local definition in a where clause; what you see is what you get. It is unrelated to the function used inside >>=, which, incidentally, is also a locally defined function with a slightly different name (combined). But even if the two shared the same name it wouldn't matter, since each is visible only within the scope of its own function.
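The scoping point can be checked with a tiny example of my own (the names here are invented): two where-bound helpers may even share a name without interfering, because each is visible only inside its enclosing definition.

```haskell
-- Two top-level functions, each with its own where-bound `step`.
-- The two `step`s never clash: each is scoped to its own definition.
incTwice :: Int -> Int
incTwice x = step (step x)
  where step = (+ 1)

doubleTwice :: Int -> Int
doubleTwice x = step (step x)
  where step = (* 2)

main :: IO ()
main = print (incTwice 3, doubleTwice 3)
-- → (5,12)
```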
As for >>=, going purely by the observed behavior: every scrape starts from the beginning of the currently selected tags. So in your block definition, chroot "body" returns all the tags in the body, text "h2" matches the first <h2>, and both of the following text "p" calls match the first <p>. The bind therefore behaves like an "and": in a scalpel context holding a pile of tags, match an <h2>, and match a <p>, and (redundantly) match a <p>. Note that in the <div>-based parse I was able to use texts (note the "s") to get both <p>s I expected.
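That observed semantics can be modeled in pure Haskell. This is my own toy model, not scalpel's actual implementation: if each primitive reads from the full tag list rather than consuming it, running the single-match variant twice returns the same first hit, while the plural variant returns them all.

```haskell
-- A toy model of the observed semantics, not scalpel's real machinery.
-- Each "scrape" reads from the full tag list; nothing is consumed.
type Tag = (String, String)  -- (tag name, inner text)

-- All inner texts of tags with the given name, in document order.
allTexts :: String -> [Tag] -> [String]
allTexts name tags = [t | (n, t) <- tags, n == name]

-- The first such text, if any. Crucially, it always scans from the
-- start of the list, just like the behavior observed above.
firstText :: String -> [Tag] -> Maybe String
firstText name tags = case allTexts name tags of
  (t:_) -> Just t
  []    -> Nothing

bodyTags :: [Tag]
bodyTags =
  [ ("h2", "Apple")
  , ("p", "I Like Apple")
  , ("p", "Do you like Apple?")
  ]

main :: IO ()
main = do
  -- Two successive single-match scrapes both see the same first <p>:
  print (firstText "p" bodyTags, firstText "p" bodyTags)
  -- A plural scrape returns every <p>:
  print (allTexts "p" bodyTags)
```

In this model, asking twice for "p" can only ever yield "I Like Apple" twice, which is exactly the repeated-first-block symptom in the question.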
Finally, this behavior made sense to me once I saw that scalpel is built on tagsoup. (Which, incidentally, explains why they call it soup.) Each of these scrapes is like dipping a spoon into an unordered soup of tags: the selector makes the soup, and the scraper is the spoon. Hope that helps.
Answer 1 (score: 1)
As of scalpel version 0.6.0, this is supported through the use of SerialScrapers. A SerialScraper lets you focus on one child of the current root at a time and exposes an API for moving the focus and for executing Scrapers on the currently focused node.
Adapting the example code in the documentation to your HTML:
-- Copyright 2019 Google LLC.
-- SPDX-License-Identifier: Apache-2.0

-- Chroot to the body tag and start a SerialScraper context with inSerial.
-- This will allow for focusing each child of body.
--
-- Many applies the subsequent logic repeatedly until it no longer matches
-- and returns the results as a list.
chroot "body" $ inSerial $ many $ do
  -- Move the focus forward until text can be extracted from an h2 tag.
  title <- seekNext $ text "h2"

  -- Create a new SerialScraper context that contains just the tags between
  -- the current focus and the next h2 tag. Then until the end of this new
  -- context, move the focus forward to the next p tag and extract its text.
  ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
  return (title, ps)
which returns:
[
  ("Apple",  ["I Like Apple",  "Do you like Apple?"]),
  ("Banana", ["I Like Banana", "Do you like Banana?"]),
  ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
]