Question

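I am working with large text held in Text objects from the Hadoop (0.20.203.0) Java library. I need to extract XML content from one of them without converting the whole object to a Java String (using .toString()).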

Could someone give an example of how to do this?

From reading the documentation (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html), I assume I need to use the .decode() function.

Text t = "....<content>secret</content>...."
int start = t.find("<content>");
int end = t.find("</content>", start);
t.decode(String.getBytes(), start+7, end);

I don't understand how to use the function's first parameter.

Answer 1

Your code looks roughly right. The first parameter of decode is the byte array from which you want to create the String.

From the documentation:

public static String decode(byte[] utf8, int start, int length)

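The parameter name utf8 only indicates that the byte buffer is expected to be UTF-8 encoded (which is what Text uses by default). So your code would be: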

Text.decode(t.getBytes(), start + 9, end - start - 9);  // 9 = "<content>".length(); the third argument is a length, not an end offset

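The call goes through the class because decode is a static method. Also, judging from the source of Text, this should not noticeably increase the memory footprint: getBytes() returns a reference to the byte array backing the Text object rather than a copy, so only the decoded slice is allocated as a new String. Note that only the first getLength() bytes of that array hold valid data.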
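To make the (bytes, start, length) contract concrete without needing Hadoop on the classpath, here is a minimal JDK-only sketch. It is a stand-in for Text.decode, not its actual implementation, and the class and method names are hypothetical. It decodes a slice of a shared UTF-8 array in place and shows that start is a byte offset, which differs from a character offset as soon as multi-byte characters precede the tag.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// JDK-only stand-in for Text.decode(byte[], int, int); names are hypothetical.
class SliceDecodeSketch {
    // Decodes `length` bytes of `utf8`, beginning at byte offset `start`.
    static String decodeSlice(byte[] utf8, int start, int length)
            throws CharacterCodingException {
        // wrap() is a view over the existing array; the array itself is not copied.
        ByteBuffer slice = ByteBuffer.wrap(utf8, start, length);
        // A fresh decoder reports malformed input by throwing instead of substituting.
        return StandardCharsets.UTF_8.newDecoder().decode(slice).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // \u00e9 occupies two bytes in UTF-8, so the tag starts at byte 2, not 1.
        byte[] buffer = "\u00e9<content>secret</content>".getBytes(StandardCharsets.UTF_8);
        int tagStart = 2;
        int tagLength = "<content>".length(); // 9; ASCII, so chars == bytes
        System.out.println(decodeSlice(buffer, tagStart + tagLength, 6)); // secret
    }
}
```

With a real Text object, the offsets would come from t.find(...), which already reports positions in bytes.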

Answer 2

For what it's worth, I found a solution to the specific problem of extracting the content between two XML tags:

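// find() returns byte offsets into the Text's UTF-8 buffer, or -1 if there is no match.
int start = t.find("<content>", 0);
int end = t.find("</content>", start);
// The tag is pure ASCII, so its char length equals its byte length.
int advance = "<content>".length();

String content = null;
try {
  // Arguments: backing bytes, offset of the first content byte, number of bytes to decode.
  content = Text.decode(t.getBytes(), start+advance, end-start-advance);
} catch (IOException e) {
  System.out.println("IOException was " + e.getMessage());
}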

The last parameter is the length of the content to extract, not its end position (that was the error in the original post).
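One case the snippet above does not cover is a missing tag: find returns -1 when there is no match, and the resulting offsets would then be meaningless. The sketch below shows one way to guard against that. It is JDK-only so it can run without Hadoop, and the class, method, and parameter names are hypothetical. The limit parameter plays the role of t.getLength(), since the array returned by getBytes() may be longer than the valid data.

```java
import java.nio.charset.StandardCharsets;

// JDK-only illustration; with Hadoop, t.find(...) would replace indexOf(...).
class TagExtractSketch {
    // First index >= from at which needle occurs within haystack[0, limit), or -1.
    static int indexOf(byte[] haystack, int limit, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i + needle.length <= limit; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Returns the text between the first open tag and the next close tag,
    // or null if either tag is absent from the first `limit` bytes.
    static String between(byte[] utf8, int limit, String open, String close) {
        byte[] openBytes = open.getBytes(StandardCharsets.UTF_8);
        byte[] closeBytes = close.getBytes(StandardCharsets.UTF_8);
        int start = indexOf(utf8, limit, openBytes, 0);
        if (start < 0) return null;
        int contentStart = start + openBytes.length;
        int end = indexOf(utf8, limit, closeBytes, contentStart);
        if (end < 0) return null;
        // Third argument is a length (end - contentStart), not an end offset.
        // Unlike Text.decode, this constructor replaces malformed bytes instead of throwing.
        return new String(utf8, contentStart, end - contentStart, StandardCharsets.UTF_8);
    }
}
```

Returning null keeps the sketch short; an Optional<String> or an exception may suit real code better.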

Extracting content from a Hadoop Text object
