Question

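What I want to do is basically this:

1. Read a file;
2. Remove all punctuation and convert all letters to lowercase;
3. Split each word into 4-letter sequences (if a word is shorter than 4 characters, keep it whole).

Example:

Input: Hello, my identification is Mister Dude.

Output: hell, ello, my, iden, dent, enti, ntif, tifi, ific, fica, icat, cati, atio, tion, is, mist, iste, ster, dude.

It would be great if I could get each 4-letter sequence as a separate value in an array.

What I have managed so far: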

public String[] OpenFile() throws IOException {
    FileReader fr = new FileReader(path);
    BufferedReader textReader = new BufferedReader(fr);
    int numberOfLines = readLines();
    String[] textData = new String[numberOfLines];
    int i;

    for (i = 0; i < numberOfLines; i++) {
        textData[i] = textReader.readLine();
        textData[i] = textData[i].replaceAll("[^A-Za-ząčęėįšųūž]+", " ").toLowerCase();
    }
    textReader.close();

    return textData;
}

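textData[i] is each line of text that I need to split. I have tried several approaches, such as .toCharArray and 2D arrays, but I can't seem to manage the part that arranges the letters into sequences. How can I accomplish task number 3?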

Answer 1

Basically, for each word, you need to iterate over the possible positions at which a four-letter sequence can start:

public static List<String> sequences (String line) {
    List<String> result = new LinkedList<>();
    String[] words = line.split(" ");
    for (String word : words) {
        if (word.length() <= 4) {
            result.add(word);
        } else {
            for (int i = 0; i <= word.length() - 4; ++i) {
                result.add(word.substring(i, i + 4));
            }
        }
    }

    return result;
}
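
Since the question asks for the result as an array, the approach above can be adapted along the following lines. This is only a sketch: the class name `SequenceDemo` and the method `allSequences` are made up for illustration, and it assumes `textData` holds lines that were already cleaned and lowercased as in the question's `OpenFile()`. It splits on `\\s+` and skips empty tokens, because the question's `replaceAll` can leave a leading space that would otherwise produce an empty "word".

```java
import java.util.ArrayList;
import java.util.List;

final class SequenceDemo {
    private static final int LENGTH = 4; // size of each sequence

    // Collects the sequences of every word on every line into one array.
    static String[] allSequences(String[] textData) {
        List<String> result = new ArrayList<>();
        for (String line : textData) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank line yields a single empty token
                }
                if (word.length() <= LENGTH) {
                    result.add(word); // short words are kept whole
                } else {
                    // slide a window of LENGTH characters over the word
                    for (int i = 0; i <= word.length() - LENGTH; i++) {
                        result.add(word.substring(i, i + LENGTH));
                    }
                }
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] lines = { "hello my identification is mister dude " };
        System.out.println(String.join(", ", allSequences(lines)));
    }
}
```

An `ArrayList` is used here only because the list is appended to and then copied once; the `LinkedList` in the answer would work just as well.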

Answer 2

Tested on ideone.com:

public static void main (String[] args) {
    String text = "Hello, my identification is Mister Dude.";
    String[] words = text.replaceAll("[^\\w ]+", "").toLowerCase().split(" ");
    for (String word : words) {
        if (word.length() <= 4) {
            System.out.println(word);
        } 
        else {
            for (int i = 0; i <= word.length() - 4; i++) {
                System.out.println(word.substring(i, i + 4));
            }
        }
    }
}
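
One caveat, given that the question's own pattern lists letters such as ą and ž: in Java, `\w` matches only ASCII letters, digits and the underscore unless the `UNICODE_CHARACTER_CLASS` flag is set, so the `replaceAll` above would delete those letters along with the punctuation. A minimal sketch of an alternative, assuming that any Unicode letter should be kept, uses `\p{L}` instead. The class and method names are hypothetical, and the sample string is written with Unicode escapes only so that the source stays ASCII.

```java
import java.util.Locale;

final class UnicodeClean {
    // Removes everything that is neither a letter nor whitespace,
    // lowercases the rest, and splits it into words.
    static String[] cleanWords(String text) {
        String cleaned = text.replaceAll("[^\\p{L}\\s]+", "")
                .toLowerCase(Locale.ROOT)
                .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // "\u0160uo, \u017E\u0105sis!" is a two-word sample with non-ASCII letters
        for (String word : cleanWords("\u0160uo, \u017E\u0105sis!")) {
            System.out.println(word);
        }
    }
}
```

The resulting words can then be fed to the same sliding-window loop as in the answer.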

Answer 3

An example to get you started:

    List<String> result = new ArrayList<String>();
    for (int i = 0; i < textData.length; i++) {
        String[] currLine = textData[i].split("\\s+");
        for (String word : currLine) {
            if (word.length() > 4) {
                for (int j = 0; j <= word.length() - 4; j++) {
                    result.add(word.substring(j, j + 4));
                }
            } else {
                result.add(word);
            }
        }
    }

I haven't tested it, so please check it and let me know whether it works.

Answer 4

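First, you need to split the text on whitespace and punctuation. Note the split on line 3 of the snippet, which splits on any combination of spaces and punctuation.

In my example I have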

    String text = "Hello, my identification is Mister Dude.";

    String[] textArray = text.split("\\W+");
    List<String> result = new ArrayList<>();
    for (String word : textArray) {
        result.addAll(Arrays.asList(split(word.toLowerCase(), 4)));
    }

And then the method

private static String[] split(String word, int letters) {
    if (word == null || word.length() == 0) {
        return new String[] {};
    } else if (word.length() <= letters) {
        return new String[] { word };
    } else {
        int quantity = (word.length() - letters) + 1;
        String[] val = new String[quantity];
        int a = 0;
        while (a + letters <= word.length()) {
            val[a] = word.substring(a, a + letters);
            a++;
        }
        return val;
    }
}

This outputs the following

[hell, ello, my, iden, dent, enti, ntif, tifi, ific, fica, icat, cati, atio, tion, is, mist, iste, ster, dude]
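
For comparison, the same split-then-window idea can be written with streams on Java 8 or later. This is a sketch rather than a replacement for the method above: the names `StreamNGrams`, `windows` and `ngrams` are invented for illustration, and the empty-token filter is an added assumption to cope with text that starts with punctuation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class StreamNGrams {
    // All n-letter windows of a word, or the word itself if it is short.
    static Stream<String> windows(String word, int n) {
        if (word.length() <= n) {
            return Stream.of(word);
        }
        return IntStream.rangeClosed(0, word.length() - n)
                .mapToObj(i -> word.substring(i, i + n));
    }

    // Splits on non-word characters, lowercases, and flattens the windows.
    static List<String> ngrams(String text, int n) {
        return Arrays.stream(text.split("\\W+"))
                .filter(w -> !w.isEmpty()) // leading punctuation yields ""
                .map(String::toLowerCase)
                .flatMap(w -> windows(w, n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Hello, my identification is Mister Dude.", 4));
    }
}
```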

Split a string into all possible consecutive 4-letter sequences
