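How to display Unicode characters in Python

Time: 2016-12-28 06:36:43

Tags: python python-2.7 unicode python-unicode

I have a text file that contains accented characters such as 'č', 'š', 'ž'. When I read this file with a Python program and put its contents into a Python list, the accented characters are lost: Python replaces them with other characters. For example, 'č' is replaced with '_'. Does anyone know how to preserve accented characters in a Python program when reading them from a file? My code: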

import sqlite3 #to work with relational DB

conn = sqlite3.connect('contacts.sqlite') #connect to db 
cur = conn.cursor() #db connection handle

cur.execute("DROP TABLE IF EXISTS contacts")

cur.execute("CREATE TABLE contacts (id INTEGER, name TEXT, surname  TEXT, email TEXT)")

fname = "acos_ibm_notes_contacts - test.csv"
fh = open(fname) #file handle
print " "
print "Reading", fname
print " "

#--------------------------------------------------
#First build a Python list with new contacts data: name, surname and email address

lst = list() #temporary list to hold content of the file
new_contact_list = list() #this list will contain contacts data: name, surname and email address
count = 0 # to count number of contacts
id = 1 #will be used to add contacts id into the DB
for line in fh: #for every line in the file handle
    new_contact = list()
    name = ''
    surname = ''
    mail = ''
    #split line into tokens at each '"' character and put tokens into  the temporary list
    lst = line.split('"')
    if lst[1] == ',': continue #if there is no first name, move to next line
    elif lst[1] != ',': #if 1st element of list is not empty
        name = lst[1] #this is the name
        if name[-1] == ',': #If last character in name is ','
            name = name[:-1] #delete it
        new_contact.append({'Name':name}) #add first name to new list of contacts
        if lst[5] != ',': #if there is a last name in the contact data
            surname = lst[5] #assign 5th element of the list to surname
            if surname[0] == ',': #If first character in surname is ','
                surname = surname[1:] #delete it
            if surname[-1] == ',': #If last character in surname is ','
                surname = surname[:-1] #delete it
            if ',' in surname: #if surname and mail are merged in same list element
                sur_mail = surname.split(',') #split them at the ','
                surname = sur_mail[0]
                mail = sur_mail[1]
            new_contact.append({'Surname':surname}) #add last name to new list of contacts
            new_contact.append({'Mail':mail}) #add mail address to new list of contacts
        new_contact_list.append(new_contact)
    count = count + 1

fh.close()
#--------------------------------------------------
# Second: populate the DB with data from the new_contact_list

row = cur.fetchone()
id = 1
for i in range(count):
    entry = new_contact_list[i] #every row in the list has data about 1 contact - put it into variable
    name_dict = entry[0] #First element is a dictionary with name data
    surname_dict = entry[1] #Second element is a dictionary with surname data
    mail_dict = entry[2] #Third element is a dictionary with mail data
    name = name_dict['Name']
    surname = surname_dict['Surname']
    mail = mail_dict['Mail']
    cur.execute("INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail))
    id = id + 1               

conn.commit() # Commit outstanding changes to disk 

-----------------------------------

Here is a simplified version of the program without the DB, which only prints to the screen:

import io
fh = io.open("notes_contacts.csv", encoding="utf_16_le") #file handle

lst = list() #temporary list to hold content of the file
new_contact_list = list() #this list will contain the contact name, surname and email address
count = 0 # to count number of contacts
id = 1 #will be used to add contacts id into the DB
for line in fh: #for every line in the file handle
    print "Line from file:\n", line # print it for debugging purposes
    new_contact = list()
    name = ''
    surname = ''
    mail = ''
    #split line into tokens at each '"' character and put tokens into  the temporary list
    lst = line.split('"')
    if lst[1] == ',': continue #if there is no first name, move to next line
    elif lst[1] != ',': #if 1st element of list is not empty
        name = lst[1] #this is the name
        print "Name in variable:", name # print it for debugging purposes
        if name[-1] == ',': #If last character in name is ','
            name = name[:-1] #delete it
        new_contact.append({'Name':name}) #add first name to new list of contacts
        if lst[5] != ',': #if there is a last name in the contact data
            surname = lst[5] #assign 5th element of the list to surname
            print "Surname in variable:", surname # print it for debugging purposes
            if surname[0] == ',': #If first character in surname is ','
                surname = surname[1:] #delete it
            if surname[-1] == ',': #If last character in surname is ','
                surname = surname[:-1] #delete it
            if ',' in surname: #if surname and mail are merged in same list element
                sur_mail = surname.split(',') #split them at the ','
                surname = sur_mail[0]
                mail = sur_mail[1]
            new_contact.append({'Surname':surname}) #add last name to new list of contacts
            new_contact.append({'Mail':mail}) #add mail address to new list of contacts
        new_contact_list.append(new_contact)
        print "New contact within the list:", new_contact # print it for debugging purposes

fh.close()

This is the content of the file notes_contacts.csv, which has only one line:

Aco,"",Vidovič,aco.vidovic@si.ibm.com,+38613208872,"",+38640456872,"","","","","","","","",""

2 answers:

Answer 0 (score: 0)

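In Python 2.7, the built-in open() returns byte strings and does no decoding. Instead, open the file in text mode so that the text is decoded as it is read, as Python 3 does. You do not strictly have to decode the text at the moment you read the file, but doing so saves you from worrying about encodings later in your code.

Add at the top: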

import io

Change:

 fh = io.open(fname, encoding='utf_16_le')

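Note: you always need to pass encoding, because Python cannot reliably guess the encoding on its own.

Now, every read() returns the text already converted to a Unicode string.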
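A minimal sketch of the difference, written to run on both Python 2.7 and Python 3. The file name sample_utf16.csv is hypothetical, and the snippet creates that file itself only so that it is self-contained:

```python
import io
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.csv")
with io.open(path, "w", encoding="utf_16_le") as out:
    out.write(u"Aco,Vidovi\u010d\n")

# Binary mode: raw bytes, nothing is decoded.
with io.open(path, "rb") as raw:
    raw_bytes = raw.read()

# Text mode with an explicit encoding: every read returns Unicode text.
with io.open(path, encoding="utf_16_le") as fh:
    text = fh.read()

print(repr(raw_bytes[:6]))
print(repr(text))
```

The byte string still has to be decoded with the right codec before 'č' means anything, while the text-mode read already contains it as a single character.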

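The sqlite3 module accepts TEXT values as either Unicode or UTF-8 encoded str. Since you have already decoded the text to Unicode, you do not need to do anything else.

To make sure SQLite does not try to encode the body of the SQL command back to an ASCII string, turn the SQL command into a Unicode string by prefixing the string literal with u.

For example: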

cur.execute(u"INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail))
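As a quick check of the round trip, here is a small sketch that stores one contact in a throwaway in-memory database and reads the surname back. The ":memory:" database and the hard-coded contact tuple are assumptions for illustration only; in the real program the values would come from the decoded file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.execute(u"CREATE TABLE contacts (id INTEGER, name TEXT, surname TEXT, email TEXT)")

# Values are already Unicode strings, as they would be after
# reading the file with io.open(..., encoding=...).
contact = (1, u"Aco", u"Vidovi\u010d", u"aco.vidovic@si.ibm.com")
cur.execute(u"INSERT INTO contacts VALUES (?, ?, ?, ?)", contact)
conn.commit()

# Read the surname back to confirm the accented character survived.
cur.execute(u"SELECT surname FROM contacts WHERE id = ?", (1,))
stored_surname = cur.fetchone()[0]
conn.close()

print(repr(stored_surname))
```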

Python 3 helps you avoid some of these quirks; there you would only need the following to make it work:

fh = open(fname, encoding='utf_16_le')

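Since your data looks like standard Excel-dialect CSV, you can use the csv module to split it. DictReader lets you pass the column names, which makes parsing the fields very easy. Unfortunately, the csv module in Python 2.7 is not Unicode-safe, so you need to use the Py3 backport: https://github.com/ryanhiebert/backports.csv

Your code can be simplified to: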

from backports import csv
import io

csv_fh = io.open('contacts.csv', encoding='utf_16_le')

field_names = [u'first_name', u'middle_name', u'surname', u'email',
               u'phone_office', u'fax', u'phone_mobile', u'inside_leg_measurement']

csv_reader = csv.DictReader(csv_fh, fieldnames=field_names)

for row in csv_reader:
    if not row['first_name']: continue

    print u"First Name: {first_name}, " \
          u"Surname: {surname} " \
          u"Email: {email}".format(first_name=row['first_name'],
                                   surname=row['surname'],
                                   email=row['email'])
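To try the same idea without installing anything, here is a sketch that assumes Python 3, where the standard csv module handles Unicode text directly (on Python 2.7 the import would stay `from backports import csv`, as above). It parses the single line from the question out of an in-memory buffer instead of a file. The shortened field list and the restkey name 'extra' are choices made for this illustration only.

```python
import csv
import io

# The single data line from the question, held in memory instead of a file.
sample = (u'Aco,"",Vidovi\u010d,aco.vidovic@si.ibm.com,+38613208872,"",'
          u'+38640456872,"","","","","","","","",""\n')

field_names = ['first_name', 'middle_name', 'surname', 'email',
               'phone_office', 'fax', 'phone_mobile']

# Columns beyond field_names are collected in a list under the restkey name.
reader = csv.DictReader(io.StringIO(sample), fieldnames=field_names,
                        restkey='extra')

# Skip rows without a first name, as the original loop does.
rows = [row for row in reader if row['first_name']]

for row in rows:
    print(u"First Name: {first_name}, Surname: {surname}, "
          u"Email: {email}".format(**row))
```

Because the csv module understands the quoting rules, the empty quoted fields no longer shift the columns the way splitting on '"' does.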

Answer 1 (score: -3)

Try putting # coding=utf-8 on the first line of your program.