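I recently built a website that needs to retrieve talk titles from the TED website.
So far, the problem occurs only with this talk: Francis Collins: We need better drugs -- now
In the page source, I see: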
<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>
Now, in ghci, I tried this:
λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]
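Either way, the extracted title is missing some characters on the left and picks up some unintended characters on the right. It seems to be related to the -- in the talk title. However, matching the same pattern against a string literal works as expected: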
λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Fortunately, Text.Regex.Posix does not have this problem:
λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Answer 0 (score: 4)
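My suggestion: don't use regular expressions to parse HTML. Use a proper HTML parser instead. Below is an example that uses the html-conduit parser and the xml-conduit cursor library (with http-conduit for the download).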
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid (mconcat)
import Network.HTTP.Conduit (simpleHttp)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (attributeIs, content, element,
                        fromDocument, ($//), (&//), (>=>))

main :: IO ()
main = do
    -- Download the page as a lazy ByteString.
    lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    -- Parse the HTML and place a cursor at the document root.
    let doc = parseLBS lbs
        cursor = fromDocument doc
    -- Text inside the <title> element.
    print $ mconcat $ cursor $// element "title" &// content
    -- Text inside <span id="altHeadline">.
    print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content
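To try the cursor queries without a network request, here is a minimal sketch that runs the same two queries against an in-memory snippet. The snippet `sampleHtml` and the helper names `titleText` and `headlineText` are made up for illustration. The markup only imitates the two lines quoted in the question, so the real page may well be structured differently. It assumes the html-conduit, xml-conduit, text and bytestring packages are available.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as L
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attributeIs, content, element,
                        fromDocument, ($//), (&//), (>=>))

-- Hypothetical stand-in for the downloaded page (not the real TED markup).
sampleHtml :: L.ByteString
sampleHtml = L.concat
    [ "<html><head>"
    , "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
    , "</head><body><h1>"
    , "<span id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>"
    , "</h1></body></html>"
    ]

-- Concatenate all text found below any <title> element.
titleText :: Cursor -> T.Text
titleText cursor = T.concat $ cursor $// element "title" &// content

-- Concatenate all text found below any <span> whose id is "altHeadline".
headlineText :: Cursor -> T.Text
headlineText cursor =
    T.concat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content

main :: IO ()
main = do
    let cursor = fromDocument (parseLBS sampleHtml)
    TIO.putStrLn (titleText cursor)
    TIO.putStrLn (headlineText cursor)
```

Keeping the queries as pure functions over a `Cursor` means the download step can be swapped out, for example with `simpleHttp` as in the example above, without touching the extraction logic.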