How do I split a text file stored in an ArrayList on every type of punctuation?

Asked: 2019-03-21 15:31:26

Tags: java arraylist

Here is my code so far:

import java.util.*;
import java.io.*;

public class Alice {

    public static void main(String[] args) throws IOException {

        /*
         * To put the text document into an ArrayList
         */
        Scanner newScanner = new Scanner(new File("ALICES ADVENTURES IN WONDERLAND.txt"));

        ArrayList<String> list = new ArrayList<String>();

        while (newScanner.hasNext()) {
            list.add(newScanner.next());
        }
        newScanner.close();
    }
}

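I'm confused about how to split the document on all punctuation marks while still being able to perform String operations on the words in the text. Please help.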

The input is the full text of Alice's Adventures in Wonderland, and I need the output to look like this:

"this book is available for use, etc."

Basically, all the words are separated and all punctuation is removed from the document.

2 Answers:

Answer 0 (score: 0)

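import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> list = new ArrayList<>();
// \w+ matches runs of letters, digits and underscores, so punctuation never reaches the list
Pattern wordPattern = Pattern.compile("\\w+");
try (BufferedReader reader = new BufferedReader(new FileReader("ALICES ADVENTURES IN WONDERLAND.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Matcher matcher = wordPattern.matcher(line);
        while (matcher.find()) {
            list.add(matcher.group());
        }
    }
}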
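
To try the approach above without the book file, the same loop can be wrapped in a small class that works on an in-memory string. This is only a minimal sketch: the class name `WordExtractor` and the helper `extractWords` are hypothetical and not part of the original answer. Note that `\w+` treats an apostrophe as a separator, so "Alice's" comes out as two tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Same pattern as in the answer: runs of letters, digits and underscores
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Hypothetical helper: collects every match of WORD in the given text
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Alice, s, Adventures, in, Wonderland]
        System.out.println(extractWords("Alice's Adventures, in Wonderland!"));
    }
}
```

Each element of the returned list is an ordinary `String`, so calls such as `toLowerCase()` or `length()` can be applied to the words afterwards.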

Answer 1 (score: 0)

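You can use the regular-expression character class \p{Punct} as the delimiter. Note that the pattern used below is \p{Punct}. with a trailing dot, i.e. one punctuation character followed by any single character, so each delimiter also consumes the character right after the punctuation mark (usually a space). That is why "industry's" loses its "s" in the result and why the final period, which has nothing after it, is kept. The code below produces the output shown under "Result".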

Code

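import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Delimiter: one punctuation character followed by any single character
String regex = "\\p{Punct}.";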
String phrase = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";
Scanner scanner = new Scanner(phrase);
scanner.useDelimiter(Pattern.compile(regex));

List<String> list = new ArrayList<String>(); // <- Try also as much as possible to work with interfaces

while (scanner.hasNext()) {
    list.add(scanner.next());
}

list.forEach(System.out::println);
scanner.close();

Result

Lorem Ipsum is simply dummy text of the printing and typesetting industry
Lorem Ipsum has been the industry
 standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type specimen book
It has survived not only five centuries
but also the leap into electronic typesetting
remaining essentially unchanged
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
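
The result above is split into phrases rather than single words. For the goal stated in the question (every word separated, all punctuation removed), one possible alternative is to split on runs of punctuation and whitespace instead. The following is a minimal sketch under that assumption; the class name `PunctuationSplitter` and the helper `splitWords` are hypothetical. Without the `UNICODE_CHARACTER_CLASS` flag, `\p{Punct}` covers ASCII punctuation only, so typographic quotes in the input would need a broader class such as `\p{P}`.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationSplitter {
    // Hypothetical helper: splits text into words, dropping punctuation and whitespace
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        // One or more punctuation or whitespace characters act as a single separator
        for (String token : text.split("[\\p{Punct}\\s]+")) {
            // split() yields a leading empty string when the text starts with a separator
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String phrase = "Lorem Ipsum has been the industry's standard dummy text, ever since the 1500s.";
        splitWords(phrase).forEach(System.out::println);
    }
}
```

As with `\w+`, the apostrophe counts as punctuation here, so "industry's" becomes "industry" and "s"; whether that is acceptable depends on how the words are used afterwards.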