如何将Windows-1251文本转换为可读的内容?

时间:2015-07-30 09:39:45

标签: java encoding jericho-html-parser

我有一个字符串,由Jericho HTML解析器返回并包含一些俄语文本。根据{{​​1}}和相应HTML文件的标题,编码为Windows-1251。

如何将此字符串转换为可读的内容?

我试过了:

source.getEncoding()

变量import java.io.UnsupportedEncodingException; public class Program { public void run() throws UnsupportedEncodingException { final String windows1251String = getWindows1251String(); System.out.println("String (Windows-1251): " + windows1251String); final String readableString = convertString(windows1251String); System.out.println("String (converted): " + readableString); } private String convertString(String windows1251String) throws UnsupportedEncodingException { return new String(windows1251String.getBytes(), "UTF-8"); } private String getWindows1251String() { final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; return new String(bytes); } public static void main(final String[] args) throws UnsupportedEncodingException { final Program program = new Program(); program.run(); } } 包含调试器中显示的数据,它是bytes的结果。我只是在这里复制并粘贴了这个数组。

这不起作用 - net.htmlparser.jericho.Element.getContent().toString().getBytes()包含垃圾。

我该怎么办呢,我即确保正确解码Windows-1251字符串?

更新1(2015年7月30日12:45 MSK):将readableString中的通话中的编码更改为convertString时,没有任何更改。请参见下面的屏幕截图。

Screenshot

更新2:另一次尝试:

Second screenshot

更新3(2015年7月30日14:38):我需要解码的文本对应于下面显示的下拉列表中的文字。

Expected result

更新4(2015年7月30日14:41):编码检测器(代码见下文)表示编码不是Windows-1251,而是Windows-1251

UTF-8

3 个答案:

答案 0 :(得分:3)

(根据更新,我删除了原来的答案并重新开始)

出现的文字

пїЅпїЅпїЅпїЅпїЅпїЅ

是对这些字节值的准确解码

-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67

(两端填充32,即空格。)

所以要么

1)文字是垃圾或

2)文本应该看起来像那样或

3)编码不是Windows-1215

这一行显然是错误的

return new String(windows1251String.getBytes(), "UTF-8");

从字符串中提取字节并从中构造新字符串不是"转换"编码之间。输入String和输出String都在内部使用UTF-16编码(你通常不需要知道或关心它)。其他编码发挥作用的唯一时间是文本数据存储在字符串对象的OUTSIDE中 - 即在您的初始字节数组中。构造String然后完成转换时发生转换。没有从一种String类型到另一种String类型的转换 - 它们都是相同的。

这个事实

return new String(bytes);

与此相同

return new String(bytes, "Windows-1251");

表明Windows-1251是平台的默认编码。 (您的时区进一步支持MSK)

答案 1 :(得分:2)

我通过修改从网站上读取文本的代码来解决这个问题。

private String readContent(final String urlAsString) {
    final StringBuilder content = new StringBuilder();
    BufferedReader reader = null;
    InputStream inputStream = null;
    try {
        final URL url = new URL(urlAsString);
        inputStream = url.openStream();
        reader =
            new BufferedReader(new InputStreamReader(inputStream);

        String inputLine;
        while ((inputLine = reader.readLine()) != null) {
            content.append(inputLine);
        }
    } catch (final IOException exception) {
        exception.printStackTrace();
    } finally {
        IOUtils.closeQuietly(reader);
        IOUtils.closeQuietly(inputStream);
    }
    return content.toString();
}

我改变了行

new BufferedReader(new InputStreamReader(inputStream);

new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));

然后它奏效了。

答案 2 :(得分:1)

只是为了确保您100%了解java如何处理charbyte

byte[] input = new byte[1];

// values > 127 become negative when you put them in an array.
input[0] = (byte)239; // the array contains value -17 now.

// but all 255 values are preserved. 
// But if you cast them to integers, you should use their unsigned value.
// (casting alone isn't enough).
int output = input[0] & 0xFF; // output is 239 again

// you shouldn't cast directly from a single-byte to a char.
// because: char is 16-bit ; but you only want to use 1 byte ; unfortunately your negative values will be applied in the 2nd byte, and break it.
char corrupted = (char) input[0]; // char-code: 65519 (2 bytes are used)
char corrupted = (char) ((int)input[0]); // char-code: 66519 (2 bytes are used)

// just casting to an integer/character is ok for values < 0x7F though
// values < 0x7F are always positive, even when casted to byte
// AND the first 7-bits in any ascii-encodings (e.g. windows-1251) are identical.
byte simple = (byte) 'a';
char chr = (char) ascii_LT_7F; // will result in 'a' again

// But it's still more reliable to use the & 0xFF conversion.
// Because it ensures that your character can never be greater than char code 255 (a single byte), even when the byte is unexpectedly negative (> 0x7F).
char chr = (char) ((byte)simple & 0xFF); // also results in 'a'

// for value 239 (which is 0xEF) it's impossible though.
// a java char is 16-bit encoded internally, following the unicode character set.
// characters 0x00 to 0x7F are identical in most encodings.
// but e.g. 0xEF in windows-1251 does not match 0xEF in UTF-16.
// so, this is a bad idea.
char corrupted = (char) (input[0] & 0xFF);

// And that's something you can only fix by using encodings.
// It's good practice to use encodings really just ALWAYS.
// the encoding indicates what your bytes[] are encoded in NOW.
// your bytes will be converted to 16-bit characters.
String text = new String(bytes, "from-encoding");

// if you want to change that text back to bytes, use an encoding !!
// this time the encoding specifies is the TARGET-ENCODING.
byte[] bytes = text.getBytes("to-encoding");

我希望这会有所帮助。

至于显示的值: 我可以确认字节[]是否正确显示。我在Windows-1251代码页中检查了它们。 (byte -17 = int 239 = 0xEF =char'п')

换句话说,您的字节值不正确,或者它是不同的源编码。