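I am trying to use jsoup to extract text from XML while keeping some of the tags, because they are useful. How can I do this?
Perhaps something like iterating over the document, pulling out a component by its tag, then iterating over that component and extracting the nested tags I need. But I can't work out how to do it.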
for (Element item : doc.select("sentence")) {
    for (Element component : item.children()) {
        // get the tag of the sentence and the words of the
        // sentence, as described below
    }
}
I have an XML document tagged like this:
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>The</word>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>3</CharacterOffsetEnd>
</token>
<token id="2">
<word>newspaper</word>
<CharacterOffsetBegin>4</CharacterOffsetBegin>
<CharacterOffsetEnd>13</CharacterOffsetEnd>
</token>
<token id="3">
<word>cartoons</word>
<CharacterOffsetBegin>14</CharacterOffsetBegin>
<CharacterOffsetEnd>22</CharacterOffsetEnd>
</token>
<token id="4">
<word>here</word>
<CharacterOffsetBegin>23</CharacterOffsetBegin>
<CharacterOffsetEnd>27</CharacterOffsetEnd>
</token>
<token id="5">
<word>often</word>
<CharacterOffsetBegin>28</CharacterOffsetBegin>
<CharacterOffsetEnd>33</CharacterOffsetEnd>
</token>
<token id="6">
<word>portray</word>
<CharacterOffsetBegin>34</CharacterOffsetBegin>
<CharacterOffsetEnd>41</CharacterOffsetEnd>
</token>
<token id="7">
<word>Per-Kristian</word>
<CharacterOffsetBegin>42</CharacterOffsetBegin>
<CharacterOffsetEnd>54</CharacterOffsetEnd>
</token>
<token id="8">
<word>Foss</word>
<CharacterOffsetBegin>55</CharacterOffsetBegin>
<CharacterOffsetEnd>59</CharacterOffsetEnd>
</token>
<token id="9">
<word>,</word>
<CharacterOffsetBegin>59</CharacterOffsetBegin>
<CharacterOffsetEnd>60</CharacterOffsetEnd>
</token>
<token id="10">
<word>the</word>
<CharacterOffsetBegin>61</CharacterOffsetBegin>
<CharacterOffsetEnd>64</CharacterOffsetEnd>
</token>
<token id="11">
<word>finance</word>
<CharacterOffsetBegin>65</CharacterOffsetBegin>
<CharacterOffsetEnd>72</CharacterOffsetEnd>
</token>
<token id="12">
<word>minister</word>
<CharacterOffsetBegin>73</CharacterOffsetBegin>
<CharacterOffsetEnd>81</CharacterOffsetEnd>
</token>
<token id="13">
<word>of</word>
<CharacterOffsetBegin>82</CharacterOffsetBegin>
<CharacterOffsetEnd>84</CharacterOffsetEnd>
</token>
<token id="14">
<word>Norway</word>
<CharacterOffsetBegin>85</CharacterOffsetBegin>
<CharacterOffsetEnd>91</CharacterOffsetEnd>
</token>
<token id="15">
<word>,</word>
<CharacterOffsetBegin>91</CharacterOffsetBegin>
<CharacterOffsetEnd>92</CharacterOffsetEnd>
</token>
<token id="16">
<word>buoyed</word>
<CharacterOffsetBegin>93</CharacterOffsetBegin>
<CharacterOffsetEnd>99</CharacterOffsetEnd>
</token>
<token id="17">
<word>by</word>
<CharacterOffsetBegin>100</CharacterOffsetBegin>
<CharacterOffsetEnd>102</CharacterOffsetEnd>
</token>
<token id="18">
<word>a</word>
<CharacterOffsetBegin>103</CharacterOffsetBegin>
<CharacterOffsetEnd>104</CharacterOffsetEnd>
</token>
<token id="19">
<word>spouting</word>
<CharacterOffsetBegin>105</CharacterOffsetBegin>
<CharacterOffsetEnd>113</CharacterOffsetEnd>
</token>
<token id="20">
<word>geyser</word>
<CharacterOffsetBegin>114</CharacterOffsetBegin>
<CharacterOffsetEnd>120</CharacterOffsetEnd>
</token>
<token id="21">
<word>of</word>
<CharacterOffsetBegin>121</CharacterOffsetBegin>
<CharacterOffsetEnd>123</CharacterOffsetEnd>
</token>
<token id="22">
<word>oil</word>
<CharacterOffsetBegin>124</CharacterOffsetBegin>
<CharacterOffsetEnd>127</CharacterOffsetEnd>
</token>
<token id="23">
<word>.</word>
<CharacterOffsetBegin>127</CharacterOffsetBegin>
<CharacterOffsetEnd>128</CharacterOffsetEnd>
</token>
</tokens>
</sentence>
The desired output is:
<sentence id="1">
The newspaper cartoons here often portray Per-Kristian Foss, the finance minister of Norway, buoyed by a spouting geyser of oil.
</sentence>
and so on for the rest of the document, which may contain many sentences or just one.
So far I have tried:
String sentence = doc.select("sentence").text();
But all I get is this mess:
The 0 3 newspaper 4 13 cartoons 14 22 here 23 27 often 28 33 portray 34 41 Per-Kristian 42
Answer 0 (score: 0)
Here you go:
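// doc is the parsed Document from the question
for (Element item : doc.select("sentence")) {
    // Reuse the id attribute from the source instead of a separate counter
    System.out.println("<sentence id=\"" + item.attr("id") + "\">");
    // Select only the <word> elements so the offset values are skipped;
    // text() joins the selected words with single spaces
    String words = item.select("word").text();
    System.out.println(words);
    System.out.println("</sentence>\n");
}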
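One caveat: joining the words with spaces leaves punctuation detached ("Foss ," rather than "Foss,"), which is not quite the desired output. A minimal sketch of one way around that, assuming the CharacterOffsetBegin and CharacterOffsetEnd values are reliable: add a space only where the offsets show a gap between two tokens. SentenceRebuilder and Token are hypothetical names, not part of jsoup, and the tokens are typed in by hand here so the block runs without any dependency.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jsoup: rebuilds a sentence from tokens,
// using their character offsets to decide where spaces belong.
final class SentenceRebuilder {

    // One <token>: its <word> text plus the two offset values.
    static final class Token {
        final String word;
        final int begin;
        final int end;

        Token(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    // Insert a space only where the source text had a gap between tokens.
    static String rebuild(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        int previousEnd = -1; // -1 marks "no token seen yet"
        for (Token t : tokens) {
            if (previousEnd >= 0 && t.begin > previousEnd) {
                sb.append(' ');
            }
            sb.append(t.word);
            previousEnd = t.end;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tokens 7 to 10 from the question, typed in by hand.
        List<Token> tokens = Arrays.asList(
                new Token("Per-Kristian", 42, 54),
                new Token("Foss", 55, 59),
                new Token(",", 59, 60),
                new Token("the", 61, 64));
        System.out.println(rebuild(tokens)); // prints: Per-Kristian Foss, the
    }
}
```

With jsoup, the Token list would presumably be filled inside the sentence loop by iterating item.select("token") and reading each token's word and offset children, parsing the offsets with Integer.parseInt.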