Question

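I have developed a system that crawls HTML content from external websites for analysis. On my side (the server), I use DOMDocument / DOMLoad to load the content and XPath to filter out the tags I need (e.g. h1/h2/h3 tags, the body text, and so on).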
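For context, the loading and filtering step looks roughly like the following. This is a minimal sketch, not the actual crawler: the `$html` string is a made-up sample standing in for a fetched page, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><h1>Main title</h1><p>Intro</p><h2>Sub title</h2></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect all h1/h2/h3 elements in document order as [tag name, text] pairs.
$headings = [];
foreach ($xpath->query('//h1 | //h2 | //h3') as $node) {
    $headings[] = [$node->nodeName, trim($node->textContent)];
}

print_r($headings);
```

The union expression returns the nodes in document order, so the heading hierarchy of the page is preserved in `$headings`.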

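In short, I'd say it does basically what Google AdSense or other analytics tools do: crawl the content -> gather data from the database and ... -> send equivalent content (i.e. ads) back to the website.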

This all works fine.

Now here is the problem I'm facing:

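Most of the websites I crawl are blogs. So the content I need to analyze is not just one page as a whole; the search should work on a per-article basis, e.g. when a single page contains 10 or more articles on many different topics. Right now, my crawler simply grabs the whole page and searches its content for keywords.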
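To illustrate what per-article analysis would mean, here is a rough sketch that counts keyword hits separately for each container instead of for the whole page. It assumes the containers can already be selected with an XPath query (here `//article`, which is exactly the part that is unsolved in general); the function name `keywordHitsPerArticle` and the sample markup are hypothetical.

```php
<?php
// Count keyword occurrences per article container rather than per page.
// $containerQuery is an assumption: it presumes the containers are already known.
function keywordHitsPerArticle(string $html, array $keywords, string $containerQuery = '//article'): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // HTML5 tags such as <article> trigger parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query($containerQuery) as $index => $container) {
        $text = strtolower($container->textContent);
        $hits = [];
        foreach ($keywords as $keyword) {
            // Simple case-insensitive substring count; no stemming or word boundaries.
            $hits[$keyword] = substr_count($text, strtolower($keyword));
        }
        $result[$index] = $hits;
    }
    return $result;
}

// Made-up page with two articles on unrelated topics.
$html = '<html><body>'
      . '<article><h2>Cooking pasta</h2><p>Pasta needs salt. Good pasta is al dente.</p></article>'
      . '<article><h2>Cycling</h2><p>A bike tour through the Alps.</p></article>'
      . '</body></html>';

$hits = keywordHitsPerArticle($html, ['pasta', 'bike']);
print_r($hits);
```

With this shape, each article gets its own keyword profile, so matching content could be chosen per article rather than per page.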

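Now I'm wondering: is there a best practice for filtering out the article containers within a website? I know it has been done before, and it really depends on the system (e.g. WordPress, Joomla, Drupal, etc.). However, there is no guarantee that certain class names or boundaries can be used for the classification (i.e. word = 'post' for WordPress, or that "" most likely marks the end of an article). For HTML5 there is the article tag, or crawling on the basis of RSS (which is not my preferred way).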

I thought of something like this:

<html>
<body>

---- article1
text about article1

---- article 2
text about article2


---- article 3
text about article3

</body>
</html>

- Pseudocode style:

while($content['body']) { // crawling
   if(html5) -> $articles = get content for <article> tags
   elseif(found(rss)) -> crawl on a base of rss, not preferred
   elseif(found "class=post") -> $articles = get content for this container
   elseif(found "class=article") -> $articles = get content for this container
   elseif(found "</div></div></div>") -> article end -> $articles = get content for the container above...
   //etc.
}
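
The cascade above could be sketched in runnable PHP roughly as follows. This is only one possible shape under stated assumptions: the helper names `classTokenQuery` and `findArticleNodes` are hypothetical, the RSS branch and the closing-div heuristic are left out, and the class names `post` and `article` are guesses that will not hold for every site.

```php
<?php
// Build an XPath 1.0 query matching elements whose class attribute
// contains $name as a whole token (so "post" does not match "postscript").
function classTokenQuery(string $name): string
{
    return sprintf('//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $name);
}

// Try each strategy in order and return the first one that finds any nodes.
function findArticleNodes(DOMXPath $xpath): array
{
    $strategies = [
        'html5-article' => '//article',
        'class-post'    => classTokenQuery('post'),
        'class-article' => classTokenQuery('article'),
    ];

    foreach ($strategies as $label => $query) {
        $nodes = iterator_to_array($xpath->query($query), false);
        if ($nodes !== []) {
            return ['strategy' => $label, 'nodes' => $nodes];
        }
    }
    return ['strategy' => null, 'nodes' => []];
}

// Made-up blog page without HTML5 <article> tags.
$html = '<html><body>'
      . '<div class="post hentry"><h2>First</h2><p>text about article1</p></div>'
      . '<div class="post"><h2>Second</h2><p>text about article2</p></div>'
      . '<div class="postscript">footer note</div>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$found = findArticleNodes(new DOMXPath($doc));
echo $found['strategy'], ': ', count($found['nodes']), " containers\n";
```

Keeping the strategies in an ordered list makes it easy to add system-specific selectors later, but it does not answer the underlying question of which selectors are reliable.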

Any suggestions or comments are much appreciated! Thanks.

What is the best practice for extracting articles/posts from external websites?

0 answers: