Question

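Currently I get an exception telling me that the whole line contains an invalid ISO 8859-1 character, but I want to detect exactly which character it is.

I could check each character in the string, but that would be very inefficient.

The aim is to report to the user that they typed an invalid character, such as €.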

Input:


Output:

import pandas as pd
import numpy as np

# build the multi-indexed DataFrame
index1 = list('ABC')
index2 = np.arange(5)
mli = pd.MultiIndex.from_product([index1, index2])
df = pd.DataFrame(index=mli, columns=list('xyz'), data=np.random.randint(0, 10, (15, 3)))

>>> df
    Out[9]:
    ...:      x  y  z
    ...: A 0  7  9  1
    ...:   1  9  0  9
    ...:   2  2  0  7
    ...:   3  5  2  5
    ...:   4  3  4  5
    ...: B 0  3  9  9
    ...:   1  7  0  2
    ...:   2  6  9  2
    ...:   3  8  1  0
    ...:   4  7  7  5
    ...: C 0  9  1  1
    ...:   1  7  5  7
    ...:   2  5  6  9
    ...:   3  5  0  1
    ...:   4  3  4  0

# slice out all values in column "x" that have an index value=='A' in the first level of the index (i.e. level=0)
>>>df.xs(level=0, key='A').x
    ...: Out[10]:
    ...: 0    7
    ...: 1    9
    ...: 2    2
    ...: 3    5
    ...: 4    3
    ...: Name: x, dtype: int64
    ...:

Is there a fast, efficient way to achieve this?

Snippet of the actual method:

Hello fri€nd

Answer 1

You could try using Apache Tika to detect the encoding of the string.

Example:

CharsetDetector detector = new CharsetDetector();
detector.setText(string.getBytes());
detector.detect();

Then you can convert the string from its original charset to any other:

detector.getString(yourStr.getBytes(), "utf-8");

Answer 2

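"I could check each character in the string, but that would be very inefficient"

What do you think canEncode is doing? There is no way to check all the characters without checking all the characters.

If your String is very long, you might see a benefit from using a parallel stream: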

final OptionalInt firstInvalidChar = line.chars()
    .parallel()
    .filter(ch -> !Charset.forName("ISO-8859-1").newEncoder().canEncode((char) ch))
    .findFirst();

if (firstInvalidChar.isPresent()) {
    throw new EncodingException(
        "The first invalid char is: " + (char) firstInvalidChar.getAsInt()
    );
}

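If Charset were thread-safe, you could see some performance gain by creating a single instance rather than one per character, but since it is an abstract factory and the documentation says nothing on the matter, we have to assume it is not.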
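As a rough sketch of the single-instance idea: as far as I can tell, the Javadoc for Charset describes its methods as safe for concurrent use, while CharsetEncoder instances are documented as not thread-safe (worth checking for your JDK). One way to sidestep the question is to keep the scan sequential and reuse one encoder for the whole line. The class and method names below are hypothetical, made up for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.OptionalInt;

final class FirstInvalidChar {

    // Hypothetical helper: returns the index of the first char that
    // ISO-8859-1 cannot encode, or an empty OptionalInt if every char is fine.
    static OptionalInt indexOfFirstInvalid(String line) {
        // One encoder for the whole scan; it never leaves this thread.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        for (int i = 0; i < line.length(); i++) {
            if (!encoder.canEncode(line.charAt(i))) {
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign, written as an escape so the source file
        // encoding does not matter.
        String line = "Hello fri\u20ACnd";
        OptionalInt index = indexOfFirstInvalid(line);
        if (index.isPresent()) {
            int i = index.getAsInt();
            System.out.println("Invalid char '" + line.charAt(i) + "' at index " + i);
        }
    }
}
```

This still visits every character up to the first invalid one, which is unavoidable, but it avoids allocating a new encoder per character.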

Answer 3

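You basically have two options for tracking down the encoding error in your snippet:

Use canEncode(char c)
Try configuring the encoder to throw an UnmappableCharacterException, which carries an inputLength and will tell you where the bad character is. This is triggered by setting a CodingErrorAction on the CharsetEncoder, but I don't believe it works for all encodings.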
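The second option can be sketched roughly as follows. Note that this is a variation rather than exactly what is described above: instead of catching UnmappableCharacterException, it calls the lower-level encode(CharBuffer, ByteBuffer, boolean) and inspects the returned CoderResult, relying on the documented behaviour that on an error result the input buffer is left positioned at the erroneous input. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class UnmappableLocator {

    // Hypothetical helper: returns the index of the first char that cannot be
    // encoded as ISO-8859-1, or -1 if the whole string encodes cleanly.
    static int firstBadIndex(String line) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        CharBuffer in = CharBuffer.wrap(line);
        // ISO-8859-1 emits at most one byte per char, so this buffer
        // is large enough for the whole input.
        ByteBuffer out = ByteBuffer.allocate(line.length());

        CoderResult result = encoder.encode(in, out, true);
        // With REPORT, encoding stops at the first problem and the input
        // buffer's position points at the offending char.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign from the question's example.
        String line = "Hello fri\u20ACnd";
        int index = firstBadIndex(line);
        if (index >= 0) {
            System.out.println("Invalid char '" + line.charAt(index) + "' at index " + index);
        }
    }
}
```

The encoder does the scan in a single pass here, so there is no per-character call from your own code, though internally it still has to look at each character.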

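If your input is also ISO-8859-1 and your processing is fairly simple, you could work with a byte[] internally instead of a String and avoid this narrowing conversion entirely.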

Fastest way to detect which character differs from a specific encoding

3 answers: