Question

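I have some code that reads in two text files (one containing words to remove, the other containing data collected from Twitter). In my program, I wrap the Twitter usernames in delimiters so that I can remove them later (along with the stop words).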

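My code (below) removes the stop words from the data correctly, but I'm stuck on how to remove the string between the two delimiters. I have a feeling that the built-in indexOf() function might be best suited to it, but I'm not sure how to fit it into my current code. Here is a sample test case that removes the delimiters, the Twitter handle, and the stop words: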

Input:

--/--RedorDead :--/-- Tottenham are the worst team in existence

Output:

Tottenham worst team existence

My code:

    Scanner stopWordsFile = new Scanner(new File("stopwords_twitter.txt"));
    Scanner textFile = new Scanner(new File("Test.txt"));

    // Create a set for the stop words (a set as it doesn't allow duplicates)
    Set<String> stopWords = new HashSet<String>();
    // For each word in the file
    while (stopWordsFile.hasNext()) {
        stopWords.add(stopWordsFile.next().trim().toLowerCase());
    }

    // Creates an empty list for the test.txt file
    ArrayList<String> words = new ArrayList<String>();
    // For each word in the file
    while (textFile.hasNext()) {
        words.add(textFile.next().trim().toLowerCase());
    }

    // Create an empty list (a list because it allows duplicates) 
    ArrayList<String> listOfWords = new ArrayList<String>();

    // Iterate over the list "words" 
    for(String word : words) {
        // If the word isn't a stop word, add to listOfWords list
        if (!stopWords.contains(word)) {
            listOfWords.add(word);
        }
    }

    stopWordsFile.close();
    textFile.close();

    for (String str : listOfWords) {
        System.out.print(str + " ");
    }

Answer 1

Use a regex replace with a reluctant quantifier:

str = str.replaceAll("--/--.*?--/--\\s*", "");

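The *? in the expression is a reluctant quantifier, which means it matches as little as possible. As a result, the match stops at the first closing delimiter instead of running on to the last one when the input contains several delimiter pairs.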

I added \s* at the end so that trailing whitespace after the closing delimiter is removed as well (which your example seems to require).
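To make the difference concrete, here is a small sketch that runs the reluctant pattern next to a greedy variant on a line with two delimiter pairs. The class name, method names, and sample handles (UserA, UserB) are made up for illustration.

```java
class QuantifierDemo {
    // Reluctant: .*? stops at the nearest closing delimiter.
    static String stripReluctant(String s) {
        return s.replaceAll("--/--.*?--/--\\s*", "");
    }

    // Greedy: .* runs on to the last delimiter in the string.
    static String stripGreedy(String s) {
        return s.replaceAll("--/--.*--/--\\s*", "");
    }

    public static void main(String[] args) {
        // Hypothetical input with two username blocks on one line
        String input = "--/--UserA :--/-- first tweet --/--UserB :--/-- second tweet";
        System.out.println(stripReluctant(input)); // first tweet second tweet
        System.out.println(stripGreedy(input));    // second tweet
    }
}
```

The greedy version swallows the first tweet because its single match spans from the first opening delimiter to the last closing one.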

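To use this approach, you must read the text file one line at a time rather than one word at a time: strip the usernames from each line, then split the line into words: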

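while (textFile.hasNextLine()) {
    // Strip the delimited username from the whole line, then split into words
    String line = textFile.nextLine().trim().toLowerCase().replaceAll("--/--.*?--/--\\s*", "");
    for (String word : line.split("\\s+")) {
        if (!word.isEmpty()) { // a blank line splits into one empty string
            words.add(word);
        }
    }
}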
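Putting the pieces together, the following is a minimal end-to-end sketch, not a drop-in replacement for the asker's program. To stay self-contained it reads from an in-memory string instead of Test.txt, and the three-word stop list is a stand-in for stopwords_twitter.txt; the class and method names are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

class TweetCleaner {
    // Removes "--/--...--/--" spans line by line, lowercases the rest,
    // and keeps only the words that are not in the stop-word set.
    static List<String> clean(Scanner text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        while (text.hasNextLine()) {
            String line = text.nextLine().trim().toLowerCase()
                    .replaceAll("--/--.*?--/--\\s*", "");
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty() && !stopWords.contains(word)) {
                    kept.add(word);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop words standing in for stopwords_twitter.txt
        Set<String> stopWords = new HashSet<>(Arrays.asList("are", "the", "in"));
        // In-memory text standing in for Test.txt
        Scanner text = new Scanner(
                "--/--RedorDead :--/-- Tottenham are the worst team in existence");
        System.out.println(String.join(" ", clean(text, stopWords)));
    }
}
```

Because the line is lowercased before filtering, the result comes out as "tottenham worst team existence" rather than with the capital T shown in the question's expected output.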

Answer 2

public static String remove(String str) {
    return str.replaceAll("\\s*--/--.*?--/--", "").trim();
}

Input: "--/--RedorDead :--/-- Tottenham are the worst team in existence --/--RedorDead :--/-- Tottenham are the worst team in existence"

Output: "Tottenham are the worst team in existence Tottenham are the worst team in existence"

Demo at regex101.com

Removing a string between two delimiters
