Question

Consider the following string:

String text="un’accogliente villa del.";

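I have the start index of the word "accogliente", which is 5, but it was precomputed against the UTF-8 encoding of the text.

I need the word's actual index in the Java string, 3, as output. That is, I want to get 3 from 5. What is the best way to compute it?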

Answer 1

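import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

String text = "un’accogliente villa del."; // Unicode text
text = Normalizer.normalize(text, Form.NFC); // Normalize to the composed form

// The same text in three encodings; "accogliente" starts at a different index in some of them.
byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // Index 5 in UTF-8; 1-byte code units
char[] chars = text.toCharArray();                    // Index 3 in UTF-16; 2-byte code units (what indexOf uses)
int[] codePoints = text.codePoints().toArray();       // Index 3 in UTF-32; 4 bytes per code point

int charIndex = text.indexOf("accogliente");          // 3
int codePointIndex = (int) text.substring(0, charIndex).codePoints().count(); // 3
int byteIndex = text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length; // 5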

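UTF-32 code units are the Unicode code points themselves: the U+XXXX numbers assigned to every symbol, whose values may need more (or fewer) than 4 hexadecimal digits.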
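To make the difference between the three counts concrete, here is a small sketch. The class and method names are made up for illustration; it only relies on standard `String` and `Character` methods.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    // Number of UTF-16 code units (Java chars).
    static int utf16Length(String s) {
        return s.length();
    }

    // Number of Unicode code points (what UTF-32 would count).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String quote = "\u2019";                              // U+2019, 4 hex digits
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600, 5 hex digits

        // U+2019: 1 char, 1 code point, 3 UTF-8 bytes
        System.out.println(utf16Length(quote) + " " + codePointLength(quote) + " " + utf8Length(quote));
        // U+1F600: 2 chars (a surrogate pair), 1 code point, 4 UTF-8 bytes
        System.out.println(utf16Length(emoji) + " " + codePointLength(emoji) + " " + utf8Length(emoji));
    }
}
```

The three bytes taken by the ’ are why the UTF-8 index of "accogliente" is 5 while its char index is 3.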

Text normalization is needed because é can be a single code point or two code points: e followed by a zero-width combining ´.
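A short sketch of why this matters for indexes. The class name and the `nfc` helper are hypothetical; `java.text.Normalizer` is the standard API.

```java
import java.text.Normalizer;

public class NormalizationDemo {
    // Hypothetical helper: normalize to the composed form (NFC).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // é as a single code point
        String decomposed = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

        System.out.println(composed.length());                // 1
        System.out.println(decomposed.length());              // 2
        System.out.println(composed.equals(decomposed));      // false
        System.out.println(nfc(decomposed).equals(composed)); // true
    }
}
```

Note that normalizing can shift every index after the affected character, so a precomputed byte index is presumably only meaningful if it was computed against the same normalized form of the text.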

To go from a UTF-8 byte index to a UTF-16 char index:

int charIndex = new String(text.getBytes(StandardCharsets.UTF_8),
                           0, byteIndex, StandardCharsets.UTF_8).length();
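If re-encoding the whole string is a concern, an alternative is to walk the code points and count the UTF-8 bytes each one would take. This is only a sketch: the class and method names are made up, it assumes well-formed text (no unpaired surrogates), and it returns -1 when the byte index does not land on a character boundary.

```java
public class Utf8IndexDemo {
    /**
     * Converts a UTF-8 byte offset into a UTF-16 char index.
     * Returns -1 if byteIndex is out of range or falls inside a multi-byte character.
     */
    static int byteIndexToCharIndex(String text, int byteIndex) {
        int bytes = 0; // UTF-8 bytes consumed so far
        int i = 0;     // current UTF-16 char index
        while (i < text.length() && bytes < byteIndex) {
            int cp = text.codePointAt(i);
            // UTF-8 length of this code point: 1, 2, 3 or 4 bytes
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            i += Character.charCount(cp); // 1 char, or 2 for a surrogate pair
        }
        return bytes == byteIndex ? i : -1;
    }

    public static void main(String[] args) {
        String text = "un\u2019accogliente villa del."; // \u2019 is the ’ from the question
        System.out.println(byteIndexToCharIndex(text, 5)); // 3
        System.out.println(byteIndexToCharIndex(text, 3)); // -1, inside the 3-byte ’
    }
}
```

Unlike decoding a byte prefix, this makes a misaligned index visible instead of silently decoding a partial character.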

Answer 2

The code below returns 3 as its output. Am I missing something in your question?

String text="un’accogliente villa del.";
text.indexOf("accogliente");

Answer 3

Assuming the character at this startIndex can only be a letter (an ASCII one), you could do the following:

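import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "un’accogliente villa del.";
char c = text.charAt(5); // the letter found at the given index
// Decompose accented characters, then replace every non-ASCII char with a space
String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
normalized = normalized.replaceAll("[^\\p{ASCII}]", " ");

// First run of letters containing c, followed by whitespace or end of input
Pattern p = Pattern.compile("\\p{L}*?" + c + "\\p{L}*?(?:$|\\s)");
Matcher m = p.matcher(normalized);

if (m.find()) {
    System.out.println(m.start(0)); // prints 3 for this input
}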

Java: finding a string index from a UTF-8-encoded index

3 answers: