Question

File input = new File("1727209867.htm");
Document doc = Jsoup.parse(input, "UTF-8","http://www.facebook.com/people/Alison-Vella/1727209867");

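I am trying to parse this HTML file, which I saved and am using on my local system. However, the parser does not parse all of the HTML, so I cannot get the information I need. With this code, only about 6k chars are parsed, but the HTML file actually contains 60k chars.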

Answer 1

jsoup cannot do this on its own, but there is a workaround:

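import java.io.*;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final File input = new File("example.html");
final int maxLength = 6000; // Maximum number of chars to read

StringBuilder sb = new StringBuilder(maxLength); // Init the buffer with the required size
// Use a Reader so that multi-byte UTF-8 sequences are decoded into chars
// (InputStream.read() returns raw bytes); try-with-resources closes the file
try (Reader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(input), StandardCharsets.UTF_8))) {
    int c; // Current char, or -1 at end of file
    // Check the limit first so that no char is read and then discarded
    while (sb.length() < maxLength && (c = reader.read()) != -1) {
        sb.append((char) c); // Save the char into the buffer
    }
}

Document doc = Jsoup.parse(sb.toString()); // Parse the HTML from the buffer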

Explanation

Read the file char by char into a buffer until the limit is reached
Parse the text from the buffer with jsoup

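Problem: this does not take care of closing tags and the like. Once the limit is reached, reading stops exactly at that position, even in the middle of a tag.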

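(Possible) solutions:

Ignore this and stop exactly where you are, parse what you have, and "fix" or drop the dangling HTML
When you reach the limit, keep reading until you reach the next closing tag or > char
When you reach the limit, keep reading until you reach the next block tag
When you reach the limit, keep reading until you reach a specific tag or comment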
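
The second option (keep reading until the next > char) could look roughly like the sketch below. It uses only the JDK, so the returned string would still have to be passed to Jsoup.parse(String) afterwards. The class name BoundedHtmlReader, the method readUpToTagEnd, and the softLimit / hardLimit parameters are hypothetical names chosen for this illustration, and the hard limit is an added assumption to keep the read bounded when no > follows.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BoundedHtmlReader {

    /**
     * Reads at least softLimit chars (if available), then continues until the
     * next '>' (inclusive). Never returns more than hardLimit chars.
     */
    public static String readUpToTagEnd(Reader reader, int softLimit, int hardLimit)
            throws IOException {
        StringBuilder sb = new StringBuilder(softLimit);
        int c; // Current char, or -1 at end of input
        while (sb.length() < hardLimit && (c = reader.read()) != -1) {
            sb.append((char) c);
            // Past the soft limit, stop as soon as a tag has just been closed
            if (sb.length() >= softLimit && c == '>') {
                break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<div><p>Hello</p><p>World</p></div>";
        // The soft limit of 10 falls inside "Hello"; reading continues to the end of </p>
        String cut = readUpToTagEnd(new StringReader(html), 10, 100);
        System.out.println(cut);
    }
}
```

This is a plain character scan, not an HTML-aware one: a > inside an attribute value, a script, or a comment also ends the read, so the result can still contain unclosed elements that the parser has to cope with.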
