Question

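I have been exploring the use of pre-trained MITIE models for named entity extraction. Is there any way I can look at their actual NER model rather than using a pre-trained one? Is the model available as open source?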

Answer 1

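Setup:

For starters, download the English Language Model, which contains a file called total_word_feature_extractor.dat built from a huge dump of text.

After that, download or clone the MITIE-master project from their official Git repository.

If you are running Windows, download CMake.

If you are running an x64-based Windows, install Visual Studio 2015 Community Edition for the C++ compiler.

After downloading the above, extract all of them into one folder.

From Start > All Apps > Visual Studio, open the Developer Command Prompt for VS 2015 and navigate to the tools folder; you will see 5 subfolders inside.

The next step is to build the ner_conll, ner_stream, train_freebase_relation_detector and wordrep packages by running the following CMake commands in the Visual Studio Developer Command Prompt.

Something like this: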

For ner_conll:

cd "C:\Users\xyz\Documents\MITIE-master\tools\ner_conll"

i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install

For ner_stream:

cd "C:\Users\xyz\Documents\MITIE-master\tools\ner_stream"

i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install

For train_freebase_relation_detector:

cd "C:\Users\xyz\Documents\MITIE-master\tools\train_freebase_relation_detector"

i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install

For wordrep:

cd "C:\Users\xyz\Documents\MITIE-master\tools\wordrep"

i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install

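After you build them you will get some 150-160 warnings; don't worry about them.

Now, navigate to "C:\Users\xyz\Documents\MITIE-master\examples\cpp\train_ner"

Using Visual Studio Code, make a JSON file "data.json" that annotates the text manually, something like this: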

{
  "AnnotatedTextList": [
    {
      "text": "I want to travel from New Delhi to Bangalore tomorrow.",
      "entities": [
        { "type": "FromCity", "startPos": 5, "length": 2 },
        { "type": "ToCity", "startPos": 8, "length": 1 },
        { "type": "TimeOfTravel", "startPos": 9, "length": 1 }
      ]
    }
  ]
}
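Note that startPos and length count space-separated words, not characters, because the training code in this answer splits each utterance on spaces. Here is a minimal sketch of how a (startPos, length) pair maps back to words; tokenize and entity_text are hypothetical helper names used only for illustration and are not part of MITIE.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split an utterance on whitespace, the same granularity the annotations use.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) tokens.push_back(word);
    return tokens;
}

// Join the tokens covered by [startPos, startPos + length) back into a phrase.
std::string entity_text(const std::vector<std::string>& tokens,
                        std::size_t startPos, std::size_t length) {
    assert(startPos + length <= tokens.size());  // span must lie inside the utterance
    std::string phrase;
    for (std::size_t i = startPos; i < startPos + length; ++i) {
        if (!phrase.empty()) phrase += ' ';
        phrase += tokens[i];
    }
    return phrase;
}
```

For the sample utterance, entity_text(tokens, 5, 2) gives "New Delhi" and entity_text(tokens, 9, 1) gives "tomorrow." with the trailing period, since splitting on spaces alone leaves punctuation attached to the word.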

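You can add more utterances and annotate them; the more training data you have, the better the prediction accuracy.

This annotated JSON can also be created with front-end tools such as jQuery or Angular, but for brevity I created it by hand.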
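Whatever tool produces the annotations, it ultimately has to emit word-based startPos and length values. As a rough sketch of how that lookup could be automated, the helper below searches a tokenized utterance for a phrase and reports its span. Span and find_phrase are hypothetical names, not part of MITIE or RapidJSON, and the matching is a plain exact comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Span {
    std::size_t startPos;  // index of the first token of the entity
    std::size_t length;    // number of tokens the entity covers
};

// Find the first occurrence of `phrase` in `tokens`.
// Returns true and fills `out` on success, false when the phrase is absent.
bool find_phrase(const std::vector<std::string>& tokens,
                 const std::vector<std::string>& phrase,
                 Span& out) {
    assert(!phrase.empty());  // an empty phrase has no meaningful span
    if (phrase.size() > tokens.size()) return false;
    for (std::size_t start = 0; start + phrase.size() <= tokens.size(); ++start) {
        bool match = true;
        for (std::size_t k = 0; k < phrase.size(); ++k) {
            if (tokens[start + k] != phrase[k]) { match = false; break; }
        }
        if (match) {
            out.startPos = start;
            out.length = phrase.size();
            return true;
        }
    }
    return false;
}
```

Only the first match is reported, so in the sample sentence a search for "to" returns position 2 rather than 7; a real annotation tool would need the user's selection to disambiguate.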

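Now we need to parse our annotated JSON file and pass its contents to the add_entity method of ner_training_instance.

But C++ has no built-in reflection for deserializing JSON, which is why you can use the RapidJSON parser library. Download the package from their Git page and place it under "C:\Users\xyz\Documents\MITIE-master\mitielib\include\mitie".

Now we have to customize the train_ner_example.cpp file so that it parses our annotated custom-entity JSON and passes it to MITIE for training.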

#include "mitie/rapidjson/document.h"
#include "mitie/ner_trainer.h"

#include <cassert>
#include <fstream>
#include <iostream>
#include <list>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <tuple>
#include <vector>

// One entry per annotated entity: (type, startPos, length).
typedef std::list<std::tuple<std::string, int, int>> EntityList;

// Read the whole JSON file into a string.
std::string ReadJSONFile(const std::string& filePath)
{
    std::cout << "path: " << filePath << std::endl;
    std::ifstream file(filePath);
    if (!file)
        throw std::runtime_error("Unable to open " + filePath);
    std::stringstream buffer;
    buffer << file.rdbuf();
    return buffer.str();
}

// Helper function to tokenize a string on any of several delimiter characters,
// such as ",.;:-" or whitespace.
std::vector<std::string> SplitStringIntoMultipleParameters(const std::string& input, const std::string& delimiter)
{
    std::stringstream stringStream(input);
    std::string line;
    std::vector<std::string> tokenizedStringVector;
    while (std::getline(stringStream, line))
    {
        std::size_t prev = 0, pos;
        while ((pos = line.find_first_of(delimiter, prev)) != std::string::npos)
        {
            if (pos > prev)
                tokenizedStringVector.push_back(line.substr(prev, pos - prev));
            prev = pos + 1;
        }
        if (prev < line.length())
            tokenizedStringVector.push_back(line.substr(prev));
    }
    return tokenizedStringVector;
}

// Parse the JSON and store it in C++ containers: utterance text -> its entities.
std::map<std::string, EntityList> FindUtteranceTuple(const std::string& stringifiedJSONFromFile)
{
    rapidjson::Document document;
    document.Parse(stringifiedJSONFromFile.c_str());
    if (document.HasParseError())
        throw std::runtime_error("The annotation file is not valid JSON");

    const rapidjson::Value& a = document["AnnotatedTextList"];
    assert(a.IsArray());

    std::map<std::string, EntityList> annotatedUtterancesMap;
    for (rapidjson::SizeType outerIndex = 0; outerIndex < a.Size(); outerIndex++)
    {
        assert(a[outerIndex].IsObject());
        assert(a[outerIndex]["entities"].IsArray());
        const rapidjson::Value& entitiesArray = a[outerIndex]["entities"];

        EntityList entitiesTuple;
        for (rapidjson::SizeType innerIndex = 0; innerIndex < entitiesArray.Size(); innerIndex++)
        {
            entitiesTuple.push_back(std::make_tuple(
                std::string(entitiesArray[innerIndex]["type"].GetString()),
                entitiesArray[innerIndex]["startPos"].GetInt(),
                entitiesArray[innerIndex]["length"].GetInt()));
        }
        annotatedUtterancesMap.insert(std::make_pair(std::string(a[outerIndex]["text"].GetString()), entitiesTuple));
    }
    return annotatedUtterancesMap;
}

int main(int argc, char** argv)
{
    try
    {
        if (argc != 3)
        {
            std::cout << "You must give the path to the MITIE English total_word_feature_extractor.dat file" << std::endl;
            std::cout << "followed by the path to the annotated JSON file." << std::endl;
            std::cout << "So run this program with a command like: " << std::endl;
            std::cout << "./train_ner_example ../../../MITIE-models/english/total_word_feature_extractor.dat data.json" << std::endl;
            return 1;
        }

        const std::string stringifiedJSONFromFile = ReadJSONFile(argv[2]);
        const std::map<std::string, EntityList> annotatedUtterancesMap = FindUtteranceTuple(stringifiedJSONFromFile);

        mitie::ner_trainer trainer(argv[1]);
        for (const auto& item : annotatedUtterancesMap)
        {
            // Split on spaces only, so startPos and length count space-separated words.
            const std::vector<std::string> tokenizedUtterance = SplitStringIntoMultipleParameters(item.first, " ");
            mitie::ner_training_instance currentInstance(tokenizedUtterance);
            for (const auto& entity : item.second)
                currentInstance.add_entity(std::get<1>(entity), std::get<2>(entity), std::get<0>(entity));
            trainer.add(currentInstance);
        }

        trainer.set_num_threads(4);
        mitie::named_entity_extractor ner = trainer.train();
        dlib::serialize("new_ner_model.dat") << "mitie::named_entity_extractor" << ner;

        const std::vector<std::string> tagstr = ner.get_tag_name_strings();
        std::cout << "The tagger supports " << tagstr.size() << " tags:" << std::endl;
        for (unsigned int i = 0; i < tagstr.size(); ++i)
            std::cout << "\t" << tagstr[i] << std::endl;
        return 0;
    }
    catch (std::exception& e)
    {
        std::cerr << "Failed because: " << e.what() << std::endl;
        return 1;
    }
}

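The ner_training_instance is constructed from the tokenized utterance, which is a vector of strings. Its add_entity method then takes 3 parameters: the index of the entity's first word in the sentence, the number of words the entity spans, and the custom entity type name.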
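Hand-written annotations can easily contain a span that runs past the end of the sentence or overlaps another entity, so it may be worth checking each one before calling add_entity. As far as I know MITIE treats both conditions as preconditions (ner_training_instance exposes overlaps_any_entity for the overlap check), but treat that as an assumption and confirm it in ner_trainer.h. The sketch below is a standalone version of those checks; Span, overlaps, can_add and record_span are hypothetical names, not MITIE APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Span {
    std::size_t startPos;  // index of the first token
    std::size_t length;    // number of tokens covered
};

// True when the two half-open token ranges share at least one token.
bool overlaps(const Span& a, const Span& b) {
    return a.startPos < b.startPos + b.length && b.startPos < a.startPos + a.length;
}

// True when `candidate` is non-empty, fits inside an utterance of `tokenCount`
// tokens, and does not overlap any span that was already accepted.
bool can_add(std::size_t tokenCount, const std::vector<Span>& accepted, const Span& candidate) {
    if (candidate.length == 0) return false;
    if (candidate.startPos + candidate.length > tokenCount) return false;
    for (const Span& s : accepted) {
        if (overlaps(s, candidate)) return false;
    }
    return true;
}

// Remember a span, treating an invalid one as a programming error.
void record_span(std::size_t tokenCount, std::vector<Span>& accepted, const Span& candidate) {
    assert(can_add(tokenCount, accepted, candidate));
    accepted.push_back(candidate);
}
```

In the training loop, one option would be to call can_add for every entity read from data.json and skip or report the ones that fail, instead of passing them straight to add_entity.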

Now we have to build train_ner_example.cpp using the following commands in the Developer Command Prompt for Visual Studio.

1) cd "C:\Users\xyz\Documents\MITIE-master\examples\cpp\train_ner"
2) mkdir build
3) cd build
4) cmake -G "Visual Studio 14 2015 Win64" ..
5) cmake --build . --config Release --target install
6) cd Release
7) train_ner_example "C:\\Users\\xyz\\Documents\\MITIE-master\\MITIE-models\\english\\total_word_feature_extractor.dat" "C:\\Users\\xyz\\Documents\\MITIE-master\\examples\\cpp\\train_ner\\data.json"

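After the above runs successfully, we get a new_ner_model.dat file, which is the serialized, trained model for our utterances.

Now, that .dat file can be passed to RASA or used standalone.

Passing it to RASA:

Create a config.json file as follows: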

{
  "project": "demo",
  "path": "C:\\Users\\xyz\\Desktop\\RASA\\models",
  "response_log": "C:\\Users\\xyz\\Desktop\\RASA\\logs",
  "pipeline": ["nlp_mitie", "tokenizer_mitie", "ner_mitie", "ner_synonyms", "intent_entity_featurizer_regex", "intent_classifier_mitie"],
  "data": "C:\\Users\\xyz\\Desktop\\RASA\\data\\examples\\rasa.json",
  "mitie_file": "C:\\Users\\xyz\\Documents\\MITIE-master\\examples\\cpp\\train_ner\\Release\\new_ner_model.dat",
  "fixed_model_name": "demo",
  "cors_origins": ["*"],
  "aws_endpoint_url": null,
  "token": null,
  "num_threads": 2,
  "port": 5000
}

MITIE NER model
