Question

Update

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_4/*/*/*/*.txt; do
    [[ $f == *.xml ]] && continue # skip output files
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "$f" -outputDirectory .  
done

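This seems to work better, but I get an IOException "File name too long" error. What is that, and how do I fix it?

I guess the other command in the documentation doesn't work.

I was trying to process my corpus with Stanford CoreNLP using this script, but I keep getting the error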

Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP

This is the script:

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
    [[ $f == *.xml ]] && continue # skip output files
    java -mx600m -cp $dir/Code/CoreNLP/stanford-corenlp-full-2015-01-29/stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g /Users/matthew/Workbench/Code/CoreNLP/stanford-corenlp-full-2015-01-29/edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file "$f" -outputDirectory $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/. 
done

It is very similar to the one I use for Stanford NER, which looks like this:

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
    [[ $f == *_NER.txt ]] && continue # skip output files
    g="${f%.txt}_NER.txt"
    java -mx600m -cp $dir/Code/StanfordNER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier $dir/Code/StanfordNER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"
done

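I can't figure out why I keep getting this error; it seems I have specified all the paths correctly.

I get that there is the option -filelist: "-filelist parameter [which] points to a file whose content lists all files to be processed (one per line)."

But I don't know how exactly that would work in my case, since my directory structure looks like $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt, and there are many files in it to process.

Also, is it possible to specify -outputDirectory dynamically? The docs say "You may specify an alternate output directory with the flag", but it appears that it would be set once and then be static, which would be a nightmare in my case.

I thought maybe I could just write some code to do this, but that doesn't work either. This is what I tried: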

public static void main(String[] args) throws Exception 
{

    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/01/1638802_output.txt"));
    try 
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();

        while (line != null) 
        {

            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        String everything = sb.toString();
        //System.out.println(everything);

        Annotation doc = new Annotation(everything);

        StanfordCoreNLP pipeline;

        // creates a StanfordCoreNLP object, with POS tagging, lemmatization,
        // NER, parsing, and coreference resolution
        Properties props = new Properties();

        // configure pipeline
        props.put(
                  "annotators", 
                  "tokenize, ssplit"
                  );

        pipeline = new StanfordCoreNLP(props);

        pipeline.annotate(doc);

        System.out.println( doc );

    }
    finally 
    {
        br.close();
    }

}

Answer 1

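By far the best way to process a lot of files with Stanford CoreNLP is to arrange to load the system once and then process a bunch of files with it, since loading all the various models takes 15 seconds or more, depending on your computer, before any actual document processing is done. What you have in your update doesn't do that, because the CoreNLP run is inside the for loop. A good solution is to use the for loop to build a file list and then to run CoreNLP once on that file list. The file list is just a text file with one filename per line, so you can make it any way you want (with a script, an editor macro, or by typing it in yourself), and you can and should check that its contents look correct before running CoreNLP. For your example, based on your update, the following should work: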

dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
    echo "$f" >> filelist.txt
done
# Here you can check that filelist.txt contains the files you want
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist.txt
# By default output files are written to the current directory, so you don't need to specify -outputDirectory .
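
The advice to check the file list before running CoreNLP can be partly automated. The following is a minimal sketch, not part of CoreNLP: `check_filelist` is a helper name made up here, and it only verifies that every line of the list names a readable regular file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CoreNLP tool): sanity-check a file list.
# Prints each entry that is not a readable regular file to stderr and
# returns non-zero if any such entry was found.
check_filelist() {
    local list=$1 path missing=0
    while IFS= read -r path; do
        if [ ! -f "$path" ] || [ ! -r "$path" ]; then
            printf 'missing or unreadable: %s\n' "$path" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    printf '%s entries, %s problems\n' "$(grep -c '' "$list")" "$missing"
    [ "$missing" -eq 0 ]
}
```

Running `check_filelist filelist.txt` before the `java` command gives a quick count of entries and flags any path that does not exist, for example one left over from an earlier run that appended to the same list.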

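Other notes on your earlier attempts:

-mx600m isn't a reasonable way to run the full CoreNLP pipeline (right through parsing and coref). The sum of all its models is just too large. -mx2g is fine.
The best way above doesn't fully extend to the NER case. Stanford NER doesn't take a -filelist option, and if you use -textFiles, the files are concatenated and become one output file, which you may well not want. At present, for NER, you may well need to run it inside the for loop, as in your script for that.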
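I haven't quite worked out how you are getting the error Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP, but it is happening because you are putting a string like that (a filename, perhaps with slashes rather than periods) where the java command expects a class name. In that position there should be only edu.stanford.nlp.pipeline.StanfordCoreNLP, as in your updated script or mine.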
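You can't have a dynamic outputDirectory within a single call to CoreNLP. You could get the effect I think you want, reasonably efficiently, by making one call to CoreNLP per directory using two nested for loops. The outer for loop would iterate over the directories; the inner one would build a file list from all the files in that directory, which would then be processed in one call to CoreNLP and written to the appropriate output directory, based on the input directory in the outer for loop. Someone with more time, or better at shell scripting than me, could try to write that out....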
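You can certainly also write your own code to call CoreNLP, but then you are responsible for scanning the input directories and writing to appropriate output files yourself. What you have looks basically okay, except that the line System.out.println( doc ); won't do anything useful: it just prints out the text you began with. You need something like:
```
// "outputFileName.xml" is a placeholder; pick a name based on the input file
PrintWriter xmlOut = new PrintWriter("outputFileName.xml");
pipeline.xmlPrint(doc, xmlOut);
```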
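
The per-directory approach described in the notes above (one CoreNLP call per directory, each with its own file list and output directory) might look roughly like this. It is a sketch under assumptions rather than a tested recipe: `corpus`, `corenlp_home`, `make_filelist`, and `process_corpus` are names chosen here, the default paths are copied from the question, and the three-level `*/*/*/` layout is taken from the question's glob.

```shell
#!/usr/bin/env bash
# Assumed locations, taken from the question; override them as needed.
corpus=${corpus:-/Users/matthew/Workbench/Data/NYTimes/NYTimesCorpus_3}
corenlp_home=${corenlp_home:-/Users/matthew/Workbench/Code/CoreNLP/stanford-corenlp-full-2015-01-29}

# Inner loop: write the .txt files of directory $1 to list file $2, one per line.
make_filelist() {
    local d=$1 list=$2 f
    : > "$list"                       # start from an empty list
    for f in "$d"/*.txt; do
        [ -e "$f" ] || continue       # the glob matched nothing
        printf '%s\n' "$f" >> "$list"
    done
}

# Outer loop: one CoreNLP run per leaf directory, output written next to the input.
process_corpus() {
    local d list
    for d in "$corpus"/*/*/*/; do
        [ -d "$d" ] || continue
        d=${d%/}                      # drop the trailing slash
        list=$(mktemp)
        make_filelist "$d" "$list"
        if [ -s "$list" ]; then       # skip directories without .txt files
            java -cp "$corenlp_home/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP \
                -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
                -filelist "$list" -outputDirectory "$d"
        fi
        rm -f "$list"
    done
}
```

Calling `process_corpus` then loads the models once per directory instead of once per file. That is still slower than a single run over one big file list, so it is only worth it if the output really has to sit beside the input files.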
