Python: UnicodeDecodeError: 'utf8' codec can't decode byte

Time: 2012-08-11 23:32:41

Tags: python encoding utf-8 scikit-learn

I am reading a bunch of RTF files into Python strings. For some of the texts, I get this error:

Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid start byte
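
For readers who want to reproduce the failure outside scikit-learn, here is a minimal sketch. It assumes Python 3 and uses a made-up sample string (`raw` is not from the original files). Byte 0x92 is a UTF-8 continuation byte, so it cannot start a character. In Windows-1252, which Word and RTF tools commonly emit, it is the right single quotation mark.

```python
# Hypothetical sample: text containing byte 0x92, as Windows-1252 output often does.
raw = b"It\x92s a test"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that could not be decoded.
    bad_byte = raw[exc.start:exc.start + 1]
    print("cannot decode", bad_byte, "at position", exc.start)

# Under Windows-1252, 0x92 is U+2019 (right single quotation mark).
decoded = raw.decode("cp1252")
print(decoded)
```

The same `exc.start` attribute corresponds to the "position 462" reported in the traceback, so it can be used to inspect the offending bytes in a real document.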

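I have tried:

  1. Copying and pasting the text of the file into a new file
  2. Saving the rtf file as a txt file
  3. Opening the txt file in Notepad++, choosing "convert to utf-8", and also setting the encoding to utf-8
  4. Opening the file with Microsoft Word and saving it as a new file

Nothing works. Any ideas?

It is probably not related, but here is the code in case you are wondering: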

    f = open(dir+location, "r")
    doc = Rtf15Reader.read(f)
    t = PlaintextWriter.write(doc).getvalue()
    texts.append(t)
    f.close()
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    X = vectorizer.fit_transform(texts)     
    

4 answers:

Answer 0: (score: 10)

This will solve your problem:

import codecs

f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()

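From that point on, txt is a unicode object, and you can use it anywhere in your code.

If you want to generate a UTF-8 file after processing, encode the text when writing it out (here f must be a file opened for writing, not the one opened for reading above):

f.write(txt.encode('utf-8'))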
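
One caveat: if the file really contains a byte such as 0x92, reading it as UTF-8 will raise the same error. Here is a rough sketch of one way around that, assuming Python 3 and assuming the files are either UTF-8 or Windows-1252. The helper `read_text` and the temporary sample file are made up for illustration.

```python
import io
import os
import tempfile


def read_text(path, encodings=("utf-8", "cp1252")):
    """Return the file's text, trying each candidate encoding in order."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %s" % (encodings, path))


# Demonstration with a temporary file holding a Windows-1252 apostrophe.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"don\x92t")
text = read_text(sample_path)
os.remove(sample_path)
```

Trying UTF-8 first is deliberate: valid UTF-8 is rarely produced by accident, whereas a single-byte encoding such as cp1252 will accept almost anything.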

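Answer 1: (score: 6)

As I said on the mailing list, the easiest approach is probably to use the charset_error option and set it to ignore. If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer. See the docs.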
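
The traceback shows the vectorizer calling `doc.decode(self.charset, self.charset_error)`, so the option maps directly onto the error handler of the byte string's `decode` method. The sketch below assumes Python 3 and made-up sample bytes, and compares the standard handlers. As far as I know, later scikit-learn releases renamed these options to `encoding` and `decode_error`, so check the documentation for your version.

```python
# Hypothetical sample bytes with one byte that is invalid in UTF-8.
raw = b"caf\x92 menu"

strict_failed = False
try:
    raw.decode("utf-8", "strict")   # the default handler: raises
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it
```

Note that "ignore" discards data without warning, so it hides the problem rather than decoding the character correctly.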

Answer 2: (score: 1)

You can dump the rows of a csv file into a json file without any encoding errors like this:

json.dump(row,jsonfile, encoding="ISO-8859-1")
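
The `encoding` argument above is specific to Python 2. In Python 3, `json.dump` does not take it, so the bytes have to be decoded first. Here is a minimal sketch of the Python 3 equivalent, assuming the row arrives as ISO-8859-1 byte strings (`raw_row` is made-up data):

```python
import io
import json

# Hypothetical row, as it might look when read from a file in binary mode.
raw_row = [b"caf\xe9", b"price"]

# Decode every field explicitly, then let json work on text only.
row = [field.decode("iso-8859-1") for field in raw_row]

buffer = io.StringIO()
json.dump(row, buffer, ensure_ascii=False)
dumped = buffer.getvalue()
```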

Answer 3: (score: 0)

Keep this line:

encoding='latin-1' worked for me.
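
A likely reason this works is that latin-1 assigns a character to every one of the 256 byte values, so decoding can never fail. It does not necessarily produce the intended character, though. The short sketch below assumes Python 3 and compares it with Windows-1252 for the byte in the question.

```python
raw = b"\x92"

as_latin1 = raw.decode("latin-1")  # U+0092, an invisible C1 control character
as_cp1252 = raw.decode("cp1252")   # U+2019, right single quotation mark

# latin-1 maps all 256 byte values, so a decode/encode round trip is lossless.
all_bytes = bytes(bytearray(range(256)))
round_trip = all_bytes.decode("latin-1").encode("latin-1")
```

If the files came from Word or another Windows tool, `cp1252` is probably the more faithful choice, although it can still raise on the few byte values it leaves undefined.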