Question

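Recently I have been trying to extract information from HTML source using Java. The basic requirement is to get the main content area of the page. For example, here is some HTML source: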

<html>
  <head>
    <title>
      Chinese characters --中文
    </title>
  </head>

  <body>
    <div>
      this is some area including Chinese characters, like a menu, which I don't need
    </div>
    <div>
      this is some area including Chinese characters, like ads, which I don't need
    </div>
    <div>
      this is the main content, including the content I need. Almost all of the content consists of many Chinese characters, like: 好好学习，天天向上。 我爱stackoverflow.谢谢你的帮助，非常感谢！
    </div>
    <div>
      this is the footer area, also including Chinese characters, but I don't need it
    </div>
  </body>
</html>

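This HTML source is simple; real pages vary widely and are much more complex. I want to use Java to extract the div (or other element) that contains the main content. The result I want is: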

<div>
  this is the main content, including the content I need. Almost all of the content consists of many Chinese characters, like: 好好学习，天天向上。 我爱stackoverflow.谢谢你的帮助，非常感谢！
</div>

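There are thousands of divs with different content, and their ids are unknown or differ from page to page. The divs also come in many different forms, for example some contain p tags. Is there a way to locate the main content by looking at where the Chinese characters appear or how they are distributed?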

Answer 1

I can't say I'm confident I understand the question, but it sounds like you want to grab a particular div from an HTML page using Java?

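I had to do this to pull some data out of a legacy system in order to test a new one. Take a look at http://htmlunit.sourceforge.net/. Basically it lets you load the page you want as if it were in a browser (so even if you normally have to fill in a form to reach that page, you can do that), and then scrape different parts of the page in a number of ways: you can get a collection of all the divs and pick, say, the third one, or select the div with the right CSS class, or just use XPath.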
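As a rough sketch of the "collect all the divs and pick one" idea, combined with the Chinese-character heuristic the question asks about: the code below scores each div by how many Han characters its text contains and returns the highest-scoring one. This is not HtmlUnit code. It uses only the JDK's built-in XML parser, so it assumes the markup is well-formed (real-world HTML usually is not, which is where a tolerant parser such as HtmlUnit would come in). The class name MainContentFinder and its methods are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of HtmlUnit or any other library.
final class MainContentFinder {

    /** Counts the code points that belong to the Han script (Chinese characters). */
    static int countHan(String text) {
        return (int) text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    /**
     * Returns the trimmed text of the div with the most Han characters,
     * or "" if no div contains any. Assumes well-formed (XHTML-like) input.
     */
    static String findMainContent(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList divs = doc.getElementsByTagName("div");
            String best = "";
            int bestScore = 0;
            for (int i = 0; i < divs.getLength(); i++) {
                // getTextContent() includes the text of nested elements such as <p>.
                String text = divs.item(i).getTextContent().trim();
                int score = countHan(text);
                if (score > bestScore) {
                    bestScore = score;
                    best = text;
                }
            }
            return best;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div>menu 中文</div>"
                + "<div>main content: 好好学习，天天向上。</div>"
                + "<div>footer</div>"
                + "</body></html>";
        System.out.println(findMainContent(page));
    }
}
```

A raw count is a crude score: because getTextContent() includes nested text, a wrapper div that encloses the whole page would always win, and a long menu could outscore a short article. Restricting the search to innermost divs or scoring by the ratio of text to markup are possible refinements, but they are guesses that would need tuning against real pages.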

Answer 2

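I can't say I know exactly what you want, but a good starting point might be Apache's HttpComponents package. It has plenty of tools for making HTTP requests and reading the response into a string buffer (which I think is what you're after).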

Take a look here:

http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html#d5e43

Also, the HttpComponents home page has Chinese translations of most of the tutorials, in case that's useful to you :D

http://hc.apache.org/
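For illustration, here is a minimal sketch of the "make an HTTP request and get the page as a string" step. It does not use HttpComponents; it swaps in the HTTP client built into the JDK (Java 11 or later) so the example has no external dependencies. Since the pages in question are mostly Chinese, the sketch decodes the body with the charset named in the Content-Type header and falls back to UTF-8, which is an assumption rather than a rule. PageFetcher, charsetOf and fetch are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on the JDK's java.net.http client (Java 11+).
final class PageFetcher {

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Picks the charset named in a Content-Type value; falls back to UTF-8 (an assumption). */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback below.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Downloads the page and decodes it with the charset the server declared. */
    static String fetch(String url) {
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            String contentType = response.headers().firstValue("Content-Type").orElse(null);
            return new String(response.body(), charsetOf(contentType));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }
}
```

Reading the body as bytes and decoding it explicitly matters here because many Chinese sites are served as GBK or GB2312 rather than UTF-8; decoding with the wrong charset would garble exactly the characters the question wants to find. Pages that declare their charset only in a meta tag are not handled by this sketch.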

Finding the text area that contains the article content in HTML
