Question

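I am trying to write a script in Node.js that imports a text document. The text contains three news articles, each with a number of metadata tags. For each article, I need to get the contents of two of those tags and put them in an array or a JSON file.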

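For example, one of the tag pairs is <text></text>, which contains the full text of the article. Another is <docid></docid>, which contains each article's unique number. Ideally my array would end up looking like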

articles = [[docid1, text1], [docid2, text2], [docid3, text3]]

or possibly a JSON file styled like

{"article1" : {"docid" : "docid1", "text" : "text1"},
 "article2" : {"docid" : "docid2", "text" : "text2"}
}

Using substring and search(), I can get the contents of the first article with the following code:

var substring = string.substring(string.search("<text>"), string.search("</text>"))

But I need more than the first article: I need the contents of every instance of a <text> </text> tag pair.

Can I use search() to find multiple results and fill an array with them?

The text is formatted as follows. It resembles HTML, but I don't think it is proper HTML:

<doc>
<docid> 1 </docid>
<date>
January 1, 2000 
</date>
<headline>
SOMETHING HAS HAPPENED IN THE WORLD 
</headline>
<byline>
By Andy N. Onymous. 
</byline>
<text>
Blah blah this is text blah blah lorum ipsum dolor sit amet. 
</text>
</doc>

Answer 1

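I managed to figure it out! I had to use the substring method in a for loop to get each document's docid and text and put them in an array. It may not be the cleanest approach, but streams give me nightmares, and this works for me! The code is: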

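var fs = require('fs');

// Read the whole file into a single string.
var collection = fs.readFileSync('collection.txt').toString();

// Splitting on </doc> leaves one trailing chunk after the last article, so skip it.
var articles = collection.split('</doc>');
var articleCount = articles.length - 1;
var articleArray = [];

for (var i = 0; i < articleCount; i++) {
    var article = articles[i];
    // Skip past each opening tag (7 characters for <docid>, 6 for <text>) and stop at the closing tag.
    var docid = article.substring(article.indexOf('<docid>') + 7, article.indexOf('</docid>'));
    var text = article.substring(article.indexOf('<text>') + 6, article.indexOf('</text>'));
    articleArray[i] = [docid, text];
}

// Print the second article as a quick check.
console.log(articleArray[1]);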
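
For anyone who would rather not count characters by hand, the same idea can be sketched with a global regular expression and String.prototype.matchAll (available in Node 12 and later). This is only a sketch under some assumptions: extractArticles and the inline sample string are made-up names for illustration, the tags are assumed to be written exactly as <doc>, <docid> and <text> with no attributes, and whitespace around the values is trimmed, which the code above does not do.

```javascript
// Hypothetical helper: collects the contents of <docid> and <text> from every <doc> block.
function extractArticles(source) {
  // The g flag lets matchAll walk every <doc>...</doc> block; [\s\S] also matches newlines.
  const docPattern = /<doc>([\s\S]*?)<\/doc>/g;
  const articles = [];
  for (const docMatch of source.matchAll(docPattern)) {
    const body = docMatch[1];
    const docid = body.match(/<docid>([\s\S]*?)<\/docid>/);
    const text = body.match(/<text>([\s\S]*?)<\/text>/);
    // Skip blocks that lack either tag instead of guessing at their contents.
    if (docid && text) {
      articles.push({ docid: docid[1].trim(), text: text[1].trim() });
    }
  }
  return articles;
}

// Made-up sample in the same shape as the collection shown in the question.
const sample = [
  '<doc>',
  '<docid> 1 </docid>',
  '<text>',
  'First article text.',
  '</text>',
  '</doc>',
  '<doc>',
  '<docid> 2 </docid>',
  '<text>',
  'Second article text.',
  '</text>',
  '</doc>',
].join('\n');

// JSON.stringify turns the array of objects into text that could be written to a .json file.
console.log(JSON.stringify(extractArticles(sample), null, 2));
```

To run it against the real file, the sample string could be swapped for the result of fs.readFileSync('collection.txt', 'utf8'). A regular expression like this is fine for a flat, predictable format such as the one in the question, but it would likely break on nested or malformed tags.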

How do I get every instance of a tag in Node?
