How do I convert 2 DataFrame columns to ASCII?

Time: 2017-09-18 11:54:59

Tags: python-3.x pandas numpy ascii

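I have a DataFrame with 2 columns of strings, imported from a TSV file. Both columns need to be converted to ASCII. (This is because I want to pass the text through a pipeline of CountVectorizer and TfidfTransformer from scikit-learn.)

I have gone through dozens of posts, on Stack Overflow and elsewhere, but could not figure this one out. My code is below, including some of the things I have tried.

Any suggestions for making this work?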

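import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# also tried adding encoding="utf-8" here; that does not work either
df = pd.read_csv(questions, usecols=[3, 4, 5], nrows=10, header=0, sep="\t")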

y = df["is_duplicate"].values
X = df.drop("is_duplicate", axis=1).values

for col in X:
    X = X.encode('utf-8') # does not work

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
random_state = 21, stratify = y)

def flat_list(my_list):
    return [str(item) for sublist in my_list for item in sublist]

def transform_data(trans_obj_list, dataset_splits):
    X_train = dataset_splits[0].astype(str)
    X_train = flat_list(X_train)

    for trfs in trans_obj_list:
        transformed_vector = trfs().fit(X_train)
        for x in range(0, len(dataset_splits)):
            dataset_splits[x] = flat_list(dataset_splits[x].astype(str))

    return dataset_splits

new_X_train, new_X_test = transform_data([CountVectorizer,TfidfTransformer],
[X_train, X_test])

2 answers:

Answer 0 (score: 0)

You need to call X.str.encode(..) instead of X.encode(..), as shown below:

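# X must still be a DataFrame here (drop the trailing .values),
# because .str is only available on a pandas Series
for col in X:
    X[col] = X[col].str.encode('utf-8')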
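
Two caveats may be worth keeping in mind with this approach: `.str` is the pandas Series accessor, so it is not available on the NumPy array that `.values` returns, and `.encode` produces `bytes` rather than `str`. If the goal is plain ASCII text, one option is to encode and then decode again. The following is only a minimal sketch of that idea in pure Python; the helper name `to_ascii`, the NFKD normalization step and the sample strings are assumptions for illustration and are not part of the original question.

```python
import unicodedata


def to_ascii(text, errors="ignore"):
    """Return an ASCII-only str version of `text` (hypothetical helper)."""
    # NFKD splits an accented character into its base letter plus a
    # combining mark, e.g. "\u00e9" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFKD", str(text))
    # Encode to ASCII bytes, dropping whatever cannot be represented,
    # then decode so the result is a str again rather than bytes.
    return decomposed.encode("ascii", errors=errors).decode("ascii")


# Made-up sample questions containing non-ASCII characters.
questions = ["What is a caf\u00e9?", "Is na\u00efve Bayes good?", "plain ascii"]
cleaned = [to_ascii(q) for q in questions]
print(cleaned)
```

With pandas, the same helper could presumably be applied per column with something like `df[col] = df[col].map(to_ascii)`, or approximated without the normalization step by chaining `df[col].str.encode('ascii', errors='ignore').str.decode('ascii')`.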

Answer 1 (score: 0)

I found the answer to my problem in this question: How do I use encode (Python 3) to fix non-ascii code for CSV import in Pandas?

file_obj = open(file_name, encoding="utf-8")
master = pd.read_csv(file_obj)

I just used "ascii" in place of "utf-8".
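
One thing to watch for when opening a file with encoding="ascii": strict decoding raises UnicodeDecodeError as soon as the file contains a non-ASCII byte. Passing an `errors` argument to `open` is one way around that. The sketch below illustrates the behaviour using only the standard library, with `csv.reader` standing in for `pd.read_csv`; the temporary file, its contents and the choice of errors="ignore" are assumptions made for the example.

```python
import csv
import os
import tempfile

# Write a small tab-separated file containing one non-ASCII character.
tmp = tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8", newline=""
)
tmp.write("question1\tquestion2\nWhat is a caf\u00e9?\tplain text\n")
tmp.close()
file_name = tmp.name

# Strict ASCII decoding fails once a non-ASCII byte is read.
try:
    with open(file_name, encoding="ascii") as file_obj:
        file_obj.read()
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" silently drops the bytes that cannot be decoded.
with open(file_name, encoding="ascii", errors="ignore", newline="") as file_obj:
    rows = list(csv.reader(file_obj, delimiter="\t"))

os.remove(file_name)
print(strict_ok)
print(rows)
```

A file object opened this way could be handed to `pd.read_csv` in the same manner as in the snippet above. Note that dropping bytes loses information, so errors="replace" may be preferable when it matters where characters were removed.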