How to extract keywords or tags from file content

Time: 2013-07-02 05:12:16

Tags: java tags metadata keyword apache-tika

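I have files in several formats (HTML, PDF, DOC, EPUB). Using Apache Tika and Java, I have already extracted their metadata and stored it in MongoDB. My goal now is to extract keywords or tags from each file's content and add them to one of the metadata fields. Is this possible with Apache Tika? If not, can you suggest another way to do it?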

My MongoDB documents (example):

{"Filename": "PHP Book.pdf", "Author": "John", "Description": "This is my PHP Book"}
{"Filename": "Java Book.html", "Author": "Paul", "Description": "This is my JAVA Book"}
{"Filename": ".NET Book.doc", "Author": "James", "Description": "This is my .NET Book"}

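Now I want to add another field that holds the tags or keywords taken from the content, so the documents would look like this (example):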

{"Filename": "PHP Book.pdf", "Author": "John", "Description": "This is my PHP Book", "keywords": ["PHP", "PDF", "BOOK"]}
{"Filename": "Java Book.html", "Author": "Paul", "Description": "This is my JAVA Book", "keywords": ["JAVA", "html", "BOOK"]}
{"Filename": ".NET Book.doc", "Author": "James", "Description": "This is my .NET Book", "keywords": [".NET", "doc"]}
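
For illustration only, here is a minimal sketch of one possible approach: plain term-frequency counting over text that has already been extracted (for example with Tika's `Tika.parseToString`). It does not use Tika for the keyword step. The class name `KeywordSketch`, the method `extractKeywords`, and the tiny stop-word list are all hypothetical, and the tokenizer keeps only letters and digits, so ".NET" would come out as "net".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Apache Tika or any other library.
public class KeywordSketch {

    // Tiny illustrative stop-word list (an assumption); a real list would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "an", "and", "the", "is", "are", "of", "to", "in",
            "this", "that", "my", "for", "with", "on", "it"));

    /** Returns up to maxKeywords of the most frequent tokens, lowercased. */
    public static List<String> extractKeywords(String text, int maxKeywords) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // Skip empty or one-character tokens and stop words.
            if (token.length() < 2 || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> keywords = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (keywords.size() >= maxKeywords) {
                break;
            }
            keywords.add(entry.getKey());
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a file; the sample sentence is made up.
        String text = "PHP is a language. This PHP book covers PHP arrays and book examples.";
        System.out.println(extractKeywords(text, 3)); // prints [php, book, arrays]
    }
}
```

The returned list could then be written to a "keywords" field on the matching MongoDB document. Raw frequency tends to favor common words, so a longer stop-word list or TF-IDF weighting across the whole collection would likely give better tags.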

Thanks

0 answers:

No answers