Efficiently converting strings to unicode in Python 2.7

Asked: 2016-12-07 16:16:27

Tags: python lda python-unicode isnumeric

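I am following an LDA tutorial and have run into a problem: the tutorial was written for Python 3 and I am working in 2.7 (the tutorial claims to work in both). As far as I understand, in Python 2.x I need to convert the strings to unicode before I can call token.isnumeric(). Because of my lack of experience and knowledge, I am not sure how to do this well in the script below. Does anyone have a solution?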

import os
from nltk.tokenize import RegexpTokenizer

data_dir = 'nipstxt/'
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]
docs = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Note: in Python 2.7, open() returns byte strings (str); nothing is decoded here.
        with open(data_dir + yr_dir + '/' + filen) as fid:
            txt = fid.read()
        docs.append(txt)

tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

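# Drop tokens that are purely numeric (this call fails on Python 2.7 byte strings).
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Drop tokens that are only one character long.
docs = [[token for token in doc if len(token) > 1] for doc in docs]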
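
The failure can be reproduced in isolation. The following is a minimal sketch, assuming the tokens are byte strings (the type of a Python 2.7 `str`); the literal values are made up for illustration, and the `b''` and `u''` prefixes make the snippet behave the same way on Python 2.7 and Python 3.

```python
# Byte strings do not have an isnumeric() method; only unicode strings do.
byte_token = b'123'   # same type as a plain str token in Python 2.7
text_token = u'123'   # unicode string

has_method = hasattr(byte_token, 'isnumeric')  # byte strings lack the method
is_numeric = text_token.isnumeric()            # unicode strings provide it
```

Calling `byte_token.isnumeric()` directly would therefore raise an `AttributeError`, which is why the tokens need to be decoded first.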

1 Answer:

Answer 0 (score: 0)

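The general way to convert a byte string to a Unicode string is to call decode. If you know the string contains only ASCII characters (as a number does), you do not have to pass an argument: in Python 2 the encoding defaults to ascii.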

docs = [[token for token in doc if not token.decode().isnumeric()] for doc in docs]
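
As a small illustration of the line above, here is a sketch run on made-up documents of byte-string tokens. The sample tokens are assumptions, not data from the question, and the codec is passed explicitly so that the snippet behaves the same on Python 2.7 and Python 3.

```python
# Hypothetical tokenized documents; b'' literals are byte strings in both
# Python 2.7 and Python 3.
docs = [[b'lda', b'2016', b'topic', b'42'], [b'7', b'model']]

# Decode each token to unicode only for the isnumeric() check; the tokens
# kept in the result are still the original byte strings.
docs = [[token for token in doc if not token.decode('ascii').isnumeric()]
        for doc in docs]
```

Note that the decoded value is used only for the test, so the filtered list keeps the original byte strings rather than unicode copies.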

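If there is a chance that the string contains non-ASCII characters, you can have those characters replaced with a special character that does not count as numeric.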

docs = [[token for token in doc if not token.decode(errors='replace').isnumeric()] for doc in docs]
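
To see what `errors='replace'` does, consider tokens containing a byte that is not valid ASCII. This is a sketch with made-up tokens; the codec name is spelled out so the result is the same on Python 2.7 and Python 3.

```python
# b'\xe9' is not valid ASCII, so a strict decode would raise
# UnicodeDecodeError. With errors='replace' each undecodable byte becomes
# U+FFFD (the Unicode replacement character), which is not numeric.
tokens = [b'caf\xe9', b'123', b'\xe9']

decoded = [t.decode('ascii', errors='replace') for t in tokens]

# Keep the original tokens whose decoded form is not purely numeric.
kept = [t for t, d in zip(tokens, decoded) if not d.isnumeric()]
```

Here only `b'123'` is dropped; the two tokens with the undecodable byte are kept because the replacement character does not count as a number.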