Question

I have three UTF-8 strings:

hello, world
hello, 世界
hello, 世rld

I want only the first 10 ASCII-character widths of each, so that the brackets line up in a column:

[hello, wor]
[hello, 世 ]
[hello, 世r]

In the console:

width('世界')==width('worl')
width('世 ')==width('wor')  #a white space behind '世'

A Chinese character takes three bytes, but it is only 2 ASCII characters wide when displayed in the console:

>>> bytes("hello, 世界", encoding='utf-8')
b'hello, \xe4\xb8\x96\xe7\x95\x8c'

Python's format() doesn't help when UTF-8 characters are mixed in:

>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
...    print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]

It's not pretty:

 -----------Songs-----------
|    1: 蝴蝶                  |
|    2: 心之城                 |
|    3: 支持你的爱人              |
|    4: 根生的种子               |
|    5: 鸽子歌(CUCURRUCUCU PALO|
|    6: 林地之间                |
|    7: 蓝光                  |
|    8: 在你眼里                |
|    9: 肖邦离别曲               |
|   10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河                |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝                |
| X 14: 卡农                  |
| X 15: 舒伯特小夜曲(SERENADE)    |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
 ---------------------------

So, I wonder whether there is a standard way to do this kind of UTF-8 padding?

Answer 1

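When trying to line up ASCII text with Chinese in a fixed-width font, it helps that there is a set of full-width versions of the printable ASCII characters. Below I build a translation table from ASCII to the full-width versions: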

# coding: utf8

# full width versions (SPACE is non-contiguous with ! through ~)
SPACE = '\N{IDEOGRAPHIC SPACE}'
EXCLA = '\N{FULLWIDTH EXCLAMATION MARK}'
TILDE = '\N{FULLWIDTH TILDE}'

# strings of ASCII and full-width characters (same order)
# (range() excludes its end point, so add 1 to include '~' and its full-width form)
west = ''.join(chr(i) for i in range(ord(' '),ord('~')+1))
east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA),ord(TILDE)+1))

# build the translation table
full = str.maketrans(west,east)

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

# Replace the ASCII characters with full width, and create a song list.
data = data.translate(full).rstrip().split('\n')

# translate each printable line.
print(' ----------Songs-----------'.translate(full))
for i,song in enumerate(data):
    line = '|{:4}: {:20.20}|'.format(i+1,song)
    print(line.translate(full))
print(' --------------------------'.translate(full))

Output

　－－－－－－－－－－Ｓｏｎｇｓ－－－－－－－－－－－
｜　　　１：　蝴蝶（Ａ　ｓｏｎｇ）　　　　　　　　　　｜
｜　　　２：　心之城（Ａｎｏｔｈｅｒ　ｓｏｎｇ）　　　｜
｜　　　３：　支持你的爱人（Ｙｅｔ　ａｎｏｔｈｅｒ　ｓ｜
｜　　　４：　根生的种子　　　　　　　　　　　　　　　｜
｜　　　５：　鸽子歌（Ｃｕｃｕｒｒｕｃｕｃｕ　ｐａｌｏ｜
｜　　　６：　林地之间　　　　　　　　　　　　　　　　｜
｜　　　７：　蓝光　　　　　　　　　　　　　　　　　　｜
｜　　　８：　在你眼里　　　　　　　　　　　　　　　　｜
｜　　　９：　肖邦离别曲　　　　　　　　　　　　　　　｜
｜　　１０：　西行（魔戒王者再临主题曲）（Ｉｎｔｏ　ｓ｜
｜　　１１：　深陷爱河　　　　　　　　　　　　　　　　｜
｜　　１２：　钟爱大地　　　　　　　　　　　　　　　　｜
｜　　１３：　时光流逝　　　　　　　　　　　　　　　　｜
｜　　１４：　卡农　　　　　　　　　　　　　　　　　　｜
｜　　１５：　舒伯特小夜曲（ＳＥＲＥＮＡＤＥ）　　　　｜
｜　　１６：　甜蜜的摇篮曲（Ｓｗｅｅｔ　Ｌｕｌｌａｂｙ｜
　－－－－－－－－－－－－－－－－－－－－－－－－－－

It's not overly pretty, but it lines up.
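As a compact variant of the same idea, here is a minimal sketch that relies on the full-width forms of `!` through `~` sitting at a fixed offset from their ASCII counterparts. The helper name `to_fullwidth` is hypothetical, not part of the answer above. The final check uses `unicodedata.east_asian_width` to confirm that every character in the converted line is classified as wide or full-width, which is why the lines can line up.

```python
import unicodedata

# Full-width '!'..'~' are a fixed distance away from ASCII '!'..'~'.
OFFSET = ord('\N{FULLWIDTH EXCLAMATION MARK}') - ord('!')


def to_fullwidth(text):
    """Hypothetical helper: map printable ASCII to full-width equivalents."""
    out = []
    for ch in text:
        if ch == ' ':
            # The full-width space is not contiguous with the other forms.
            out.append('\N{IDEOGRAPHIC SPACE}')
        elif '!' <= ch <= '~':
            out.append(chr(ord(ch) + OFFSET))
        else:
            # Leave everything else (e.g. Chinese characters) untouched.
            out.append(ch)
    return ''.join(out)


sample = to_fullwidth('1: 蝴蝶(A song)')
print(sample)
# True if every character is East Asian Wide or Full-width.
print(all(unicodedata.east_asian_width(c) in ('W', 'F') for c in sample))
```

This only helps with fonts that draw every wide character at the same width, so the result still depends on the terminal.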

Answer 2

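First off, it looks like you're using Python 3, so I'll respond accordingly.

Maybe I'm misunderstanding your question, but it looks like the output you are getting is exactly what you want, except that Chinese characters are wider in your font.

So UTF-8 is a red herring: we aren't talking about bytes, we're talking about characters. You are using Python 3, so all strings are Unicode. The underlying byte representation (where each Chinese character is represented by three bytes) is irrelevant.

You want each string clipped or padded to exactly 10 characters, and that part is working correctly: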

>>> len('hello, wor')
10
>>> len('hello, 世界 ')
10
>>> len('hello, 世rl')
10

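The only problem is that you are viewing it in what appears to be a monospaced font, but actually isn't. Most monospaced fonts have this problem. All the normal Latin characters have exactly the same width in this font, but the Chinese characters are slightly wider. Therefore, the three characters "世界 " take up more horizontal space than the three characters "wor". There's not much you can do about this, aside from either a) getting a font that is truly monospaced, or b) calculating precisely how wide each character is in the font and adding a number of spaces that takes you to approximately the same horizontal position (this will never be exact).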
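To make the distinction between characters, bytes, and display columns concrete, here is a small sketch that prints all three for the padded strings above. The helper `columns` is hypothetical, and it assumes a terminal that draws East Asian Wide and Full-width characters in exactly two columns. Real fonts may differ, as noted above.

```python
import unicodedata


def columns(text):
    """Hypothetical helper: estimated terminal columns (wide chars count 2)."""
    return sum(2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
               for c in text)


for s in ['hello, wor', 'hello, 世界 ', 'hello, 世rl']:
    # characters, UTF-8 bytes, estimated display columns
    print(len(s), len(s.encode('utf-8')), columns(s))
# prints:
# 10 10 10
# 10 14 12
# 10 12 11
```

All three strings have the same `len()`, so `format()` did its job. Only the estimated column count differs.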

Answer 3

There seems to be no official support for this, but a built-in package may help:

>>> import unicodedata
>>> print(unicodedata.east_asian_width('中'))
W

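The returned value represents the category of the code point. Specifically,

W - East Asian Wide
F - East Asian Full-width (of narrow)
Na - East Asian Narrow
H - East Asian Half-width (of wide)
A - East Asian Ambiguous
N - Not East Asian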

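This answer to a similar question provides a quick solution. Note, however, that the displayed result depends on the exact monospaced font in use. The default fonts used by ipython and pydev don't work well, while the Windows console is OK.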
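Building on the width categories listed above, here is a minimal sketch of clipping and padding by display columns rather than by character count. The names `char_width` and `fit` are hypothetical. The sketch assumes that W and F characters occupy two columns and everything else one, and it ignores combining characters and the Ambiguous category.

```python
import unicodedata


def char_width(ch):
    """Assumed column width: 2 for Wide/Full-width, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1


def fit(text, width):
    """Clip text to at most `width` columns, then pad with spaces."""
    out = []
    used = 0
    for ch in text:
        w = char_width(ch)
        if used + w > width:
            # The next character would overflow, so stop here.
            break
        out.append(ch)
        used += w
    return ''.join(out) + ' ' * (width - used)


for s in ['hello, world', 'hello, 世界', 'hello, 世rld']:
    print('[' + fit(s, 10) + ']')
# prints:
# [hello, wor]
# [hello, 世 ]
# [hello, 世r]
```

When a wide character would straddle the limit, it is dropped and replaced by a single space, which matches the layout asked for in the question.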

Answer 4

Take a look at kitchen. I think it might have what you want.

Answer 5

If you are working with English and Chinese characters, this snippet may help you.

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)'''

width = 80

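def get_aligned_string(string,width):
    # Pad with spaces to `width` characters.
    string = "{:{width}}".format(string,width=width)
    # Each Chinese character is 3 bytes in UTF-8, so cutting to `width` bytes
    # drops 2 trailing padding spaces per Chinese character (short text only).
    bts = bytes(string,'utf-8')
    string = str(bts[0:width],encoding='utf-8',errors='backslashreplace')
    # Add back half of what was dropped: each Chinese character then costs
    # one space of padding, matching its 2-column display width.
    new_width = len(string) + int((width - len(string))/2)
    if new_width!=0:
        string = '{:{width}}'.format(str(string),width=new_width)
    return string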

for i,line in enumerate(data.split('\n')):
    song = get_aligned_string(line,width)
    line = '|{:4}: {:}|'.format(i+1,song)
    print(line)

Output

Answer 6

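Here is a script based on unicodedata that detects East Asian characters and normalizes them to NFC form, so that half-width and full-width characters are counted exactly. Normalization is needed for Korean on macOS, because macOS uses NFD form, in which Korean syllables are decomposed into individual jamo (letters), and Python treats each of those as a separate character. (For example, "가" is decomposed into two characters and "각" into three, and so on, but each should be counted as a single full-width character.)

It enumerates all files in the given root_path and shows whether each file name is in NFC or NFD form.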

#! /usr/bin/env python3
import unicodedata
from pathlib import Path


def len_ea(string: str) -> int:
    nfc_string = unicodedata.normalize('NFC', string)
    return sum((2 if unicodedata.east_asian_width(c) in 'WF' else 1) for c in nfc_string)


def align_string(string: str, width: int):
    nfc_string = unicodedata.normalize('NFC', string)
    num_wide_chars = sum(1 for c in nfc_string if unicodedata.east_asian_width(c) in 'WF')
    width = width - num_wide_chars
    return '{:{width}}'.format(nfc_string, width=width)


def show_filename_encodings(root_path: Path):
    outputs = []
    for p in root_path.glob("*"):
        nfc_name = unicodedata.normalize('NFC', p.name)
        nfd_name = unicodedata.normalize('NFD', p.name)
        if p.name == nfc_name:
            enc = "\033[94mNFC\033[0m"
        elif p.name == nfd_name:
            enc = "\033[91mNFD\033[0m"
        else:
            # Neither fully composed nor fully decomposed.
            enc = "mixed"
        outputs.append((p.name, nfc_name, nfd_name, enc))

    # Take the NFC string to check the maximum length
    colw = max(len_ea(o[1]) for o in outputs) + 2
    for name, nfc_name, nfd_name, enc in outputs:
        print(f"{align_string(nfc_name, colw)}: {enc}")
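The decomposition described in this answer can be checked directly with a short sketch. The helper name `nfc_nfd_lengths` is hypothetical, and the two syllables are written as escapes so that the editor's own normalization cannot change them.

```python
import unicodedata


def nfc_nfd_lengths(text):
    """Hypothetical helper: character counts in NFC and NFD form."""
    nfc = unicodedata.normalize('NFC', text)
    nfd = unicodedata.normalize('NFD', text)
    return len(nfc), len(nfd)


# '\uac00' is the syllable "가", '\uac01' is the syllable "각".
for syllable in ['\uac00', '\uac01']:
    print(syllable, nfc_nfd_lengths(syllable))
# prints:
# 가 (1, 2)
# 각 (1, 3)
```

Counting widths on the NFC form therefore gives one wide character per syllable, whichever form the file system returned.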

How to control padding of Unicode strings containing East Asian characters

6 answers:
