Question

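Some of my columns contain text categorical values. For example, a column "did_do_something" can be "true" or "false", and a column "browser_type" can be "chrome" or "safari". Other columns contain numeric category "enums", for example "version_type", whose values may be "1", "2", "3", "4". Finally, there are plain numeric columns such as "age" that hold just a numeric value and should stay unchanged.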

I checked the pandas documentation at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

Specifically this parameter:

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None, then all the columns with object or category dtype will be converted.
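A minimal sketch of what that parameter changes, assuming made-up sample data (the column names follow the examples above, the values are invented): with the default, a numeric "enum" column such as version_type is left alone, while listing it in columns forces it to be encoded.

```python
import pandas as pd

# Made-up sample data; column names follow the examples in the question.
df = pd.DataFrame({
    "browser_type": ["chrome", "safari", "chrome"],
    "version_type": [1, 2, 3],  # numeric "enum"
    "age": [31, 45, 27],        # plain numeric, should stay unchanged
})

# Default (columns=None): only text/category columns are encoded,
# so version_type stays a single numeric column.
default_encoded = pd.get_dummies(df)

# Explicit list: version_type is encoded as well, age is left as-is.
explicit_encoded = pd.get_dummies(df, columns=["browser_type", "version_type"])

print(list(default_encoded.columns))
print(list(explicit_encoded.columns))
```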

My dummy processing looks like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data_csv_file = 'data/data.csv'

data = pd.read_table(data_csv_file,delimiter = ",").dropna()

# this is the column containing the label for the row
label_column = 'converted_pixel'

# these columns SHOULD NOT be encoded since they are real numeric values
numeric_columns = ['campaign_frequency','user_age_days']

# all the other columns which are not label or numeric should be dummy encoded
dummy_columns = [a for a in data.columns if a != label_column and a not in numeric_columns]

# create the new processed data frame with the dummy columns
processed_dummy_data = pd.get_dummies(data,columns = dummy_columns)

The processed data frame ends up with roughly 1000 columns, up from the original 21.

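My question is: given a vector (a row) from the original data frame, how do I get its dummy encoding so that it matches the resulting dummy data frame?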

Since the dummy data frame is so large, it is not reasonable to do this by hand.

I am looking for an API like:

dummy_encoded_vector = get_dummy_encoding(vector_from_original_dataframe_encoding, processed_dummy_data)

Answer 1

You can use str.contains to select the dummy columns that were generated from a given original column:

df=pd.DataFrame({'A':list('abcde'),'B':list('abcde')})

s=pd.get_dummies(df)

yourcol='A'

s.loc[:,s.columns.str.contains(yourcol+'_')]

Out[117]: 
   A_a  A_b  A_c  A_d  A_e
0    1    0    0    0    0
1    0    1    0    0    0
2    0    0    1    0    0
3    0    0    0    1    0
4    0    0    0    0    1
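To encode a new row so that it lines up with an already-built dummy frame, one possible approach is to run get_dummies on the new row and then reindex it to the existing dummy columns, filling the missing ones with 0. The sketch below assumes made-up sample data. get_dummy_encoding is the hypothetical name from the question, not a pandas function, and its signature here takes the list of encoded columns as an extra argument. dtype=int is used only to get 0/1 output regardless of pandas version.

```python
import pandas as pd

def get_dummy_encoding(new_rows, dummy_columns, encoded_columns):
    """Hypothetical helper: encode new_rows to match an existing dummy frame."""
    encoded = pd.get_dummies(new_rows, columns=dummy_columns, dtype=int)
    # Align to the existing layout: dummy columns absent from new_rows are
    # added as 0, categories never seen before are dropped, order is matched.
    return encoded.reindex(columns=encoded_columns, fill_value=0)

# Made-up "original" data and its dummy-encoded version.
train = pd.DataFrame({
    "browser_type": ["chrome", "safari", "chrome"],
    "version_type": [1, 2, 3],
    "age": [31, 45, 27],
})
dummy_columns = ["browser_type", "version_type"]
processed = pd.get_dummies(train, columns=dummy_columns, dtype=int)

# A new row in the original (un-encoded) layout.
new_row = pd.DataFrame({"browser_type": ["safari"], "version_type": [2], "age": [50]})
vector = get_dummy_encoding(new_row, dummy_columns, processed.columns)
print(vector)
```

A category that never appeared in the original data has no column in the dummy frame, so with this approach it is silently dropped and the row gets all zeros for that feature.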
