Question

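***** Edited to include the full code

I am trying to parse some Japanese text using Python (version 3.5.3) and the MeCab library on macOS.

I have a txt file with the following text: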

石の上に三年

I set my preferences in TextEdit to save using UTF-8, so I believe the system is saving the file correctly as UTF-8.

I am getting the following error:

Traceback (most recent call last):
  File "japanese.py", line 29, in <module>
    words = extractMetadataFromTXT(fileName)
  File "japanese.py", line 14, in extractMetadataFromTXT
    md = extractWordsJP(data)
  File "japanese.py", line 22, in extractWordsJP
    components.append(parsed.surface)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

Below is the complete code. Nothing is left out.

import MeCab
import nltk
from nltk import *
from nltk.corpus import knbc

mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
wordsList = knbc.words()
fdist = nltk.FreqDist(w.lower() for w in wordsList)

def extractMetadataFromTXT(filePath):
    with open(filePath, 'r', encoding='utf-8') as f:
        data = f.read()
        print(data)
    md = extractWordsJP(data)
    print(md)
    return md

def extractWordsJP(wordsJP):
    components = []
    parsed = mt.parseToNode(wordsJP)
    while parsed:
        components.append(parsed.surface)
        parsed = parsed.next
    return components

if __name__ == "__main__":
    fileName = "simple_japanese.txt"
    words = extractMetadataFromTXT(fileName)
    print(words)

Does anyone know why I am getting this error message?

Fun fact: sometimes it works. :0

Thanks in advance,

Israel

Answer 1

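The error occurs because invalid UTF-8 is being fed to a UTF-8 decoder. This can be caused by splitting the text on bytes rather than on characters, or by mistakenly decoding another encoding such as JIS or EUC as if it were UTF-8. In Python it is generally sound to stick to unicode strings, although your system may change how it decodes text files if something sets the locale. Even with proper unicode strings, splitting is a nontrivial problem, because some code points modify the characters next to them, such as combining accents. Fortunately, Japanese does not have much of that (unless someone encoded "po" as "ho" plus a combining ring, and so on).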
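
To make the two failure modes above concrete, here is a minimal sketch using only the Python standard library. The helper name `try_decode_utf8` is hypothetical and the sample string is the one from the question. It illustrates how invalid UTF-8 can arise in general; it does not claim to show what MeCab does internally.

```python
def try_decode_utf8(raw):
    """Return the decoded text, or the decoder's reason if decoding fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


text = "石の上に三年"
utf8_bytes = text.encode("utf-8")

# Case 1: slicing bytes instead of characters. Every character in this
# string takes three bytes in UTF-8, so dropping one byte starts the data
# in the middle of a character.
sliced_by_bytes = utf8_bytes[1:]

# Case 2: bytes written in another encoding (EUC-JP) read back as UTF-8.
euc_bytes = text.encode("euc_jp")

print(try_decode_utf8(utf8_bytes))       # round-trips cleanly
print(try_decode_utf8(sliced_by_bytes))  # fails: starts on a continuation byte
print(try_decode_utf8(euc_bytes))        # fails: the bytes are not UTF-8
print(text[1:])                          # slicing the str itself is safe
```

Slicing the `str` works on whole characters, which is why it is usually safer to decode once at the boundary and keep everything as unicode strings afterwards.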

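One potential problem: MeCab's web page states (according to Google Translate) "Unless otherwise specified, euc is used." If MeCab segments words on the assumption that it is reading EUC, it will break UTF-8.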
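
One way to check which encoding a MeCab installation expects is to ask the tagger about the dictionaries it has loaded. The sketch below assumes the Python binding exposes `dictionary_info()` returning records with `filename`, `charset`, and `next` fields, which is my understanding of the SWIG-generated bindings; verify this against your installed version. The helper names `dictionary_charsets` and `non_utf8_dictionaries` are hypothetical, and both accept any tagger-like object.

```python
def dictionary_charsets(tagger):
    """Collect (filename, charset) for every dictionary the tagger has loaded.

    Assumes tagger.dictionary_info() returns a linked list of records with
    filename, charset, and next attributes (next is None at the end).
    """
    charsets = []
    info = tagger.dictionary_info()
    while info:
        charsets.append((info.filename, info.charset))
        info = info.next
    return charsets


def non_utf8_dictionaries(tagger):
    """Return the dictionaries whose charset is not some spelling of UTF-8."""
    return [
        (name, charset)
        for name, charset in dictionary_charsets(tagger)
        if charset.lower().replace("-", "") != "utf8"
    ]
```

With MeCab installed, a non-empty result from `non_utf8_dictionaries(MeCab.Tagger())` would suggest that the default dictionary was built for EUC-JP or another encoding, in which case feeding it UTF-8 text is likely to produce broken output.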

Answer 2

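Solution:

Apparently the problem was with MeCab, not with the Python code itself. When you install MeCab from scratch using make, it sometimes does not install correctly, yet it does not report any errors.

I am not sure why; if you want to dig deeper and find out exactly what is going on, that would be great. All I know is that I uninstalled it, reinstalled it with brew, and it worked.

The same thing happened on other Macs in the office. I use brew on OS X, so here is the command I used to install it correctly: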

brew install mecab mecab-ipadic git curl xz

Also, to install it on Linux, use the following commands:

sudo apt-get install mecab libmecab-dev mecab-ipadic
sudo apt-get install mecab-ipadic-utf8
sudo apt-get install python-mecab

Hopefully this helps people who try to tokenize Japanese text in the future.

Answer 3

Specify the encoding when opening the file:

with open(file, 'r', encoding='utf-8') as f:
    data = f.read()

...

By the way, use a context manager when opening files, as in this example.
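
Since the traceback in the question is raised at `parsed.surface` rather than at `f.read()`, it can help to rule the file out explicitly. A minimal sketch that reads the raw bytes and reports where strict UTF-8 decoding first fails; the helper name `first_invalid_utf8_offset` is hypothetical.

```python
def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as f:  # read raw bytes, with no implicit decoding
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None
```

If `first_invalid_utf8_offset("simple_japanese.txt")` returns `None`, the file itself is valid UTF-8 and the bad bytes are being introduced later in the pipeline.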
