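I have been using SwiftSoup to scrape body text from a number of websites in Swift, but for some sites, such as CNN or The Hill (for example https://www.cnn.com/2019/07/25/us/colorado-missing-girl-remains-found-after-34-years/index.html or https://thehill.com/homenews/media/454838-cnn-announces-climate-town-hall-with-2020-democrats), it scrapes the wrong text.
So far, SwiftSoup is the only thing I have tried for scraping these sites.
This is the SwiftSoup code I am using at the moment: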
    import SwiftSoup

    func htmlToText(html: String) -> String
    {
        var text = ""
        do
        {
            // Collect every <p> element in the document.
            let els: Elements = try SwiftSoup.parse(html).select("p")
            // Drop <time> elements nested inside the paragraphs.
            _ = try els.select("time").remove()
            // Mark the start of each paragraph with a "/n" placeholder.
            let pTag: Elements = try els.prepend("/n")
            text = try pTag.text()
        }
        catch Exception.Error(_, let message)
        {
            print(message)
        }
        catch
        {
            print("error")
        }
        // Turn the placeholders into real line breaks.
        if text.contains("/n")
        {
            text = text.replacingOccurrences(of: "/n", with: "\n")
        }
        text = text.replacingOccurrences(of: "Advertisement", with: "")
        return text
    }
However, the final result contains only a small part of the article.
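
One thing that may be worth checking is whether these pages actually keep the article body in `<p>` tags. The sketch below is only an illustration under that assumption: the sample markup, the `article-paragraph` class name, and the `extractText` helper are all hypothetical and are not taken from the CNN or The Hill pages. It compares what a plain `p` selector returns with what a broader selector group returns on the same HTML.

```swift
import SwiftSoup

// Hypothetical sample markup: only the first paragraph uses <p>;
// the others sit in <div> elements with a made-up class name.
let sampleHTML = """
<html><body>
<p>First paragraph.</p>
<div class="article-paragraph">Second paragraph.</div>
<div class="article-paragraph">Third paragraph.</div>
</body></html>
"""

// Hypothetical helper: returns the text of every element matching
// `selector`, one element per line.
func extractText(html: String, selector: String) -> String {
    do {
        let doc = try SwiftSoup.parse(html)
        let els = try doc.select(selector)
        // Join the per-element text with real newlines instead of
        // inserting a placeholder and replacing it afterwards.
        return try els.array().map { try $0.text() }.joined(separator: "\n")
    } catch {
        print("parse error: \(error)")
        return ""
    }
}

// Only matches the single <p> element.
print(extractText(html: sampleHTML, selector: "p"))
// A selector group also matches the <div>-based paragraphs.
print(extractText(html: sampleHTML, selector: "p, div.article-paragraph"))
```

If the real pages behave like this sample, the `p` selector in `htmlToText` would see only a fraction of the article, which would match the symptom. The actual container element and class would have to be read from each site's HTML source.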