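I am trying to add custom tags using the Stanford NLP .NET classes. I set up the pipeline with regexner and added an external mapping file for the custom tags. The problem is that, although the regular expressions were validated externally, my custom phone-number tag is not being applied, while the very simple "degree" tag is applied just fine. (I copied the degree rule from another Stanford example just to make sure the regex file was being loaded and was working.)
Question: Is there an error or incompatibility in the regular expressions listed below? If so, what should I do about it? Or is there a file-format problem with the external file I supplied? Or is this a bug in the Stanford NLP library?
Here is the C# setup I am using for a simple test: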
using System;
using System.IO;
using System.Reflection;
using Console = System.Console;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using edu.stanford.nlp.sentiment;
using edu.stanford.nlp.ie.crf;
using java.util;
public MainWindow()
{
    InitializeComponent();
    string codeBase = Assembly.GetExecutingAssembly().CodeBase;
    UriBuilder uri = new UriBuilder(codeBase);
    string appPath = System.IO.Path.GetDirectoryName(Uri.UnescapeDataString(uri.Path));
    // Path to the folder with models extracted from `stanford-corenlp-3.9.1-models.jar`
    var libraryRoot = appPath + @"/models/";
    var modelRoot = libraryRoot + @"stanford-corenlp-3.9.1-models/";
    const string text = "I have a Bachelor of Science degree.\nPhone: 856.821.9331\n856-821-1234\n856.821.1234\n8568211234";
    Test3(libraryRoot, modelRoot, text);
}
public void Test3(string libraryRoot, string modelRoot, string text)
{
    // Annotation pipeline configuration
    var props = new java.util.Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, coref, regexner");
    props.setProperty("ner.useSUTime", "0");
    props.put("regexner.mapping", $"{libraryRoot}regexner-general.txt");
    var curDir = Environment.CurrentDirectory;
    Directory.SetCurrentDirectory(modelRoot);
    var pipeline = new StanfordCoreNLP(props);
    Directory.SetCurrentDirectory(curDir);
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    var sentences = document.get(typeof(CoreAnnotations.SentencesAnnotation));
    if (sentences == null)
        return;
    foreach (Annotation sentence in sentences as ArrayList)
    {
        Console.WriteLine(sentence);
        var tokens = sentence.get(typeof(CoreAnnotations.TokensAnnotation));
        foreach (CoreLabel token in tokens as ArrayList)
        {
            string word = token.get(typeof(CoreAnnotations.TextAnnotation)).ToString();
            string position = token.get(typeof(CoreAnnotations.PartOfSpeechAnnotation)).ToString();
            string ne = token.get(typeof(CoreAnnotations.NamedEntityTagAnnotation)).ToString();
            Console.WriteLine($"=> word={word}\tposition={position}\tNE={ne}");
        }
    }
}
The external regex file I put together looks like this (and yes, the columns are separated by tabs):
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$ PHONE MISC,NUMBER 4.0
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d) PHONE MISC,NUMBER 3.0
[0-9]{3}\W[0-9]{3}-[0-9]{4} PHONE MISC,NUMBER 2.0
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE 2.0
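To help rule out the file-format question, here is a minimal sketch that counts the tab-separated columns on each line of a mapping file. The class and method names (`MappingFileCheck`, `ColumnCounts`) are made up for illustration, and the sample lines are built in code rather than read from my actual file; in a real check they would come from `File.ReadAllLines` on the mapping file.

```csharp
using System;
using System.Linq;

public static class MappingFileCheck
{
    // Returns the number of tab-separated columns on each non-empty line.
    public static int[] ColumnCounts(string[] lines)
    {
        return lines
            .Where(line => line.Trim().Length > 0)
            .Select(line => line.Split('\t').Length)
            .ToArray();
    }

    public static void Main()
    {
        // Assumption: stand-in data. For a real check, replace this array with
        // System.IO.File.ReadAllLines(path) on the actual mapping file.
        string[] lines =
        {
            string.Join("\t", @"[0-9]{3}\W[0-9]{3}-[0-9]{4}", "PHONE", "MISC,NUMBER", "2.0"),
            string.Join("\t", "Bachelor of (Arts|Laws|Science|Engineering|Divinity)", "DEGREE", "2.0"),
            // A line whose tabs were silently turned into spaces by an editor.
            @"[0-9]{3}\W[0-9]{3}-[0-9]{4} PHONE MISC,NUMBER 2.0",
        };

        int[] counts = ColumnCounts(lines);
        for (int i = 0; i < counts.Length; i++)
        {
            Console.WriteLine($"line {i + 1}: {counts[i]} column(s)");
        }
    }
}
```

A line that reports a single column has lost its tabs. As far as I understand the RegexNER documentation, the columns are pattern, NER type, optional overwritable types, and optional priority, so on a three-column line such as the DEGREE rule the trailing `2.0` would land in the overwritable-types column rather than being read as a priority.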
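Running the test with this mapping file produces the following output. Note that the degree is recognized correctly, but none of the phone numbers are. Each regexner phone rule should cover at least one of the phone numbers (I tested this externally).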
I have a Bachelor of Science degree.
=> word=I position=PRP NE=O
=> word=have position=VBP NE=O
=> word=a position=DT NE=O
=> word=Bachelor position=NN NE=DEGREE
=> word=of position=IN NE=DEGREE
=> word=Science position=NNP NE=DEGREE
=> word=degree position=NN NE=O
=> word=. position=. NE=O
Phone: 856.821.9331
856-821-1234
856.821.1234
8568211234
=> word=Phone position=NN NE=O
=> word=: position=: NE=O
=> word=856.821.9331 position=CD NE=NUMBER
=> word=856-821-1234 position=CD NE=NUMBER
=> word=856.821.1234 position=CD NE=NUMBER
=> word=8568211234 position=CD NE=NUMBER
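For reference, the external check of the phone rules can be reproduced with a sketch along these lines. It tests each rule against each phone token from the output above and requires the rule to match the whole token. Two assumptions to flag: it uses .NET's `System.Text.RegularExpressions` rather than the Java regex engine that CoreNLP runs on, and the whole-token requirement reflects my reading of how RegexNER applies a pattern, not something I have confirmed in the library. The names `PhoneRuleCheck` and `MatchesWholeToken` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneRuleCheck
{
    // First column of each PHONE rule, copied from the mapping file.
    static readonly string[] Rules =
    {
        @"^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$",
        @"^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)",
        @"[0-9]{3}\W[0-9]{3}-[0-9]{4}",
    };

    // True when the rule matches the token from its first to its last character.
    public static bool MatchesWholeToken(string rule, string token)
    {
        return Regex.IsMatch(token, @"\A(?:" + rule + @")\z");
    }

    public static void Main()
    {
        // The four phone tokens exactly as the tokenizer reported them.
        string[] tokens = { "856.821.9331", "856-821-1234", "856.821.1234", "8568211234" };

        foreach (string token in tokens)
        {
            for (int i = 0; i < Rules.Length; i++)
            {
                bool hit = MatchesWholeToken(Rules[i], token);
                Console.WriteLine($"{token}\trule {i + 1}\t{(hit ? "match" : "no match")}");
            }
        }
    }
}
```

Under those assumptions every rule matches at least one of the four tokens, which is why I expected at least some PHONE tags in the pipeline output.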