Stanford RegexNER: is there a problem with my regular expressions?

Time: 2018-09-26 16:01:54

Tags: c# regex stanford-nlp

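I am trying to add custom tags using the Stanford NLP .NET classes. I set up a pipeline with regexner and supplied an external mapping file for the custom tags. The problem is that, although the regular expressions validate when tested outside the pipeline, my custom phone number tag is never applied, while a very simple "degree" tag is applied correctly. (I copied the degree rule from another Stanford example just to confirm that the regex file is loaded and working.)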

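Question: Is there an error or incompatibility in the regular expressions listed below, and if so, what can be done about it? Or is there a file format problem in the external file I supply? Or is this a bug in the Stanford NLP library?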

This is the C# setup I am using for a simple test:

using System;
using System.IO;
using System.Reflection;
using Console = System.Console;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using edu.stanford.nlp.sentiment;
using edu.stanford.nlp.ie.crf;
using java.util;

    public MainWindow()
    {
        InitializeComponent();

        string codeBase = Assembly.GetExecutingAssembly().CodeBase;
        UriBuilder uri = new UriBuilder(codeBase);
        string appPath = System.IO.Path.GetDirectoryName(Uri.UnescapeDataString(uri.Path));

        // Path to the folder with models extracted from `stanford-corenlp-3.9.1-models.jar`
        var libraryRoot = appPath + @"/models/";
        var modelRoot = libraryRoot + @"stanford-corenlp-3.9.1-models/";

        const string text = "I have a Bachelor of Science degree.\nPhone: 856.821.9331\n856-821-1234\n856.821.1234\n8568211234";
        Test3(libraryRoot, modelRoot, text);

    }
    public void Test3(string libraryRoot, string modelRoot, string text)
    {
        // Annotation pipeline configuration
        var props = new java.util.Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, coref, regexner");
        props.setProperty("ner.useSUTime", "0");
        props.put("regexner.mapping", $"{libraryRoot}regexner-general.txt");

        var curDir = Environment.CurrentDirectory;
        Directory.SetCurrentDirectory(modelRoot);
        var pipeline = new StanfordCoreNLP(props);
        Directory.SetCurrentDirectory(curDir);

        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        var sentences = document.get(typeof(CoreAnnotations.SentencesAnnotation));
        if (sentences == null)
            return;

        foreach (Annotation sentence in sentences as ArrayList)
        {
            Console.WriteLine(sentence);
            var tokens = sentence.get(typeof(CoreAnnotations.TokensAnnotation));
            foreach (CoreLabel token in tokens as ArrayList)
            {
                string word = token.get(typeof(CoreAnnotations.TextAnnotation)).ToString();
                string position = token.get(typeof(CoreAnnotations.PartOfSpeechAnnotation)).ToString();
                string ne = token.get(typeof(CoreAnnotations.NamedEntityTagAnnotation)).ToString();
                Console.WriteLine($"=>  word={word}\tposition={position}\tNE={ne}");
            }
        }
    }

The external regex file I put together looks like this (and yes, the columns are separated by tabs):

^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$   PHONE   MISC,NUMBER 4.0
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d) PHONE   MISC,NUMBER 3.0
[0-9]{3}\W[0-9]{3}-[0-9]{4} PHONE   MISC,NUMBER 2.0
Bachelor of (Arts|Laws|Science|Engineering|Divinity)    DEGREE      2.0
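
Tab characters are easy to lose when a file is pasted or re-saved by an editor, so it may be worth dumping how each mapping line actually splits on tabs. The following is a minimal sketch, not part of Stanford NLP: the class name `MappingFileCheck` and the method `Describe` are hypothetical, and the sketch assumes the RegexNER layout of pattern, tag, optional overwritable tags, and optional priority, each separated by a single tab.

```csharp
using System;
using System.Globalization;

// Hypothetical diagnostic helper; not part of Stanford CoreNLP.
public static class MappingFileCheck
{
    // Splits one mapping line on tabs and reports what each column holds.
    public static string Describe(string line)
    {
        string[] cols = line.Split('\t');
        if (cols.Length < 2 || cols.Length > 4)
            return $"unexpected column count: {cols.Length}";

        string pattern = cols[0];
        string tag = cols[1];
        // Columns 3 and 4 are assumed optional; an empty string means "absent".
        string overwritable = cols.Length > 2 ? cols[2] : "";
        string priority = cols.Length > 3 ? cols[3] : "";

        if (priority.Length > 0 &&
            !double.TryParse(priority, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
            return $"priority is not a number: '{priority}'";

        return $"pattern=[{pattern}] tag=[{tag}] overwritable=[{overwritable}] priority=[{priority}]";
    }

    public static void Main()
    {
        // Tabs are written explicitly as \t so they cannot be lost in transit.
        Console.WriteLine(Describe("[0-9]{3}\\W[0-9]{3}-[0-9]{4}\tPHONE\tMISC,NUMBER\t2.0"));
        // Same rule with spaces instead of tabs: the whole line is one column.
        Console.WriteLine(Describe("[0-9]{3}\\W[0-9]{3}-[0-9]{4} PHONE MISC,NUMBER 2.0"));
    }
}
```

Running `Describe` over every line returned by `File.ReadAllLines` on the real mapping file would show whether a phone rule has collapsed into a single column, which is one way a rule could be silently misread.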

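With this mapping file, the test program produces the output below. Note that the degree is recognized correctly, but none of the phone numbers are. Each regexner phone rule should cover at least one of the phone numbers (I tested this externally).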

I have a Bachelor of Science degree.
=>  word=I  position=PRP    NE=O
=>  word=have   position=VBP    NE=O
=>  word=a  position=DT NE=O
=>  word=Bachelor   position=NN NE=DEGREE
=>  word=of position=IN NE=DEGREE
=>  word=Science    position=NNP    NE=DEGREE
=>  word=degree position=NN NE=O
=>  word=.  position=.  NE=O
Phone: 856.821.9331
856-821-1234
856.821.1234
8568211234
=>  word=Phone  position=NN NE=O
=>  word=:  position=:  NE=O
=>  word=856.821.9331   position=CD NE=NUMBER
=>  word=856-821-1234   position=CD NE=NUMBER
=>  word=856.821.1234   position=CD NE=NUMBER
=>  word=8568211234 position=CD NE=NUMBER
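
One way to make the "tested externally" claim reproducible is to check each phone rule against each token from the output above, one token at a time. The sketch below is hypothetical helper code (`TokenRuleCheck` and `MatchesWholeToken` are not Stanford NLP names). It assumes that a rule without spaces has to match one whole token, and it uses the .NET regex engine as a stand-in for the Java engine that CoreNLP runs on, so results could differ for constructs the two engines treat differently.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical diagnostic helper; not part of Stanford CoreNLP.
public static class TokenRuleCheck
{
    // True only when the regex consumes the whole token, start to end.
    public static bool MatchesWholeToken(string regex, string token)
    {
        return Regex.IsMatch(token, @"\A(?:" + regex + @")\z");
    }

    public static void Main()
    {
        // The three phone rules from the mapping file, copied verbatim.
        string[] rules =
        {
            @"^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$",
            @"^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)",
            @"[0-9]{3}\W[0-9]{3}-[0-9]{4}",
        };
        // The tokens the pipeline reported as NE=NUMBER.
        string[] tokens = { "856.821.9331", "856-821-1234", "856.821.1234", "8568211234" };

        foreach (string token in tokens)
            for (int i = 0; i < rules.Length; i++)
                Console.WriteLine($"{token}\trule {i + 1}\t{MatchesWholeToken(rules[i], token)}");
    }
}
```

Under that assumption, a token that no rule matches here cannot be tagged by these rules. If every token is matched by at least one rule, the expressions themselves are less likely to be at fault than the mapping file format or the annotator configuration.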

0 answers:

No answers yet.