Question

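I am currently using Java and the IntelliJ IDE to run the Stanford POS tagger. I set it up following this tutorial: (http://new.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/). It runs fine, but it only outputs about two paragraphs of text, even though I give it far more than that (my input file is 774 KB of text).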

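At the bottom of the tutorial, there is a note about memory problems:

It turns out that the problem is that Eclipse allocates only 256 MB of memory by default. Right-click the project -> Run As -> Run Configurations -> go to the Arguments tab -> under VM arguments type -Xmx2048m. This will set the allocated memory to 2 GB, and all the tagger files should run now.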

I have configured IntelliJ to use 4 GB of memory per this answer: How to increase IDE memory limit in IntelliJ IDEA on Mac?

However, it did not change the amount of output text in the slightest.
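
One way to check whether a memory setting actually reaches the tagging program is to print the maximum heap from inside that program. The sketch below is only a diagnostic, not part of the tutorial; the class name HeapCheck and the helper toMiB are made up for illustration. It relies on the fact that an IDE's own memory limit and the heap of the JVM it launches for your program are separate settings, so a 4 GB IDE limit does not by itself mean the program gets 4 GB.

```java
public class HeapCheck {

    // Hypothetical helper: converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will attempt to use,
        // which reflects the -Xmx value the program was started with.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap for this JVM: " + toMiB(maxHeapBytes) + " MiB");
    }
}
```

If the printed value is far below what was configured, the -Xmx option probably needs to go into the run configuration's VM options rather than the IDE's own settings. If the value is already large, memory is unlikely to be the cause of the truncated output.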

What else could be causing this?

(Link to the POS tagger's original website: https://nlp.stanford.edu/software/tagger.shtml)

EDIT:

I have pasted my main class below. TaggedWord is a class that helps me parse and organize the relevant data retrieved from the tagger.

package com.company;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Main {

    public static void main(String[] args) {

        File infile = new File("C:\\Users\\TEST\\Desktop\\input.txt");
        File outfile = new File("C:\\Users\\TEST\\Desktop\\output.txt");
        MaxentTagger tagger = new MaxentTagger("tagger/english-left3words-distsim.tagger");
        FileWriter fw;
        BufferedWriter bw;
        List<TaggedWord> taggedWords;

        try {
            //read in entire text file to String
            String fileContents = new Scanner(infile).useDelimiter("\\Z").next();

            //erase contents of outfile from previous run
            PrintWriter pw = new PrintWriter(outfile);
            pw.close();

            //tag file contents with parts of speech
            String fileContentsTagged = tagger.tagString(fileContents);

            taggedWords = processTaggedWords(fileContentsTagged);

            fw = new FileWriter(outfile, true); //true = append
            bw = new BufferedWriter(fw);

            String uasiContent = "";
            boolean firstWord = true;
            for (TaggedWord tw : taggedWords) {
                String englishWord = tw.getEng_word();
                String uasiWord = translate(englishWord);
                if (!tw.isPunctuation()) {
                    uasiContent += uasiWord + " ";
                }
                else {
                    //remove last space
                    uasiContent = uasiContent.substring(0, uasiContent.length() - 1);
                    uasiContent += uasiWord + " ";
                }
            }
            bw.write(uasiContent);
            bw.close();
        }
        catch (FileNotFoundException e1) {
            System.out.println("File not found.");
        }
        catch (IOException e) {
            System.out.print("Error writing to file.");
        }
    }  //end main

EDIT 2:

I have now changed the code that reads the file into a String to use a while loop, but it still gives the same result:

        //read in entire text file to String
        String fileContents = "";
        Scanner sc = new Scanner(infile).useDelimiter("\\Z");
        while (sc.hasNext()) {
            fileContents += sc.next();
        }

Answer 1

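It is only called once, when your Scanner reads the beginning of the input file. To continue, you need to declare the Scanner separately and then iterate with a while loop on its hasNext() method. For information on declaring and iterating a Scanner, see the documentation and the example here.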
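
The suggestion above can be sketched roughly as follows. This is a minimal illustration rather than the asker's actual code: the class name ReadWholeFile and the method readAll are hypothetical, and UTF-8 is an assumed encoding for the input file. It iterates line by line with hasNextLine() and also checks Scanner.ioException(), because a Scanner does not throw when the underlying read fails; it simply stops returning input, which can look like a silently truncated file.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class ReadWholeFile {

    // Hypothetical helper: reads every line of the file and joins them with '\n'.
    static String readAll(File infile) throws IOException {
        StringBuilder contents = new StringBuilder();
        // "UTF-8" is an assumption; use the encoding the file was saved in.
        try (Scanner sc = new Scanner(infile, "UTF-8")) {
            while (sc.hasNextLine()) {
                contents.append(sc.nextLine()).append('\n');
            }
            // Scanner swallows read errors; surface the last one, if any.
            IOException readError = sc.ioException();
            if (readError != null) {
                throw readError;
            }
        }
        return contents.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("Usage: java ReadWholeFile <input file>");
            return;
        }
        String text = readAll(new File(args[0]));
        System.out.println("Read " + text.length() + " characters");
    }
}
```

Comparing the printed character count with the size of the input file gives a quick indication of whether the text is being lost while reading or later, during tagging or writing. Using a StringBuilder also avoids the repeated String copying that += causes on a file of several hundred kilobytes.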
