我一直试图解决这个问题,虽然我在How can i vectorize list using sklearn DictVectorizer找到了类似的问题,但解决方案过于简单。
我想在Logistic回归模型中加入一些特征来预测中文'或者'非中国人。我有一个raw_name,我将提取以获得两个功能1)只是姓氏,2)是姓氏的子串列表,例如,' Chan'会给[' ch',' ha'' an']。但似乎Dictvectorizer并没有将列表类型作为字典的一部分。从上面的链接,我尝试创建一个函数list_to_dict,并成功返回一些dict元素,
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
但我不知道如何在应用dictvectorizer之前将其合并到my_dict = ...中。
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled.reset_index(drop=True)
# Assign X and y variables
X = df.raw_name.values
y = df.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
try:
last_name = nameString.rsplit(None, 1)[-1]
if len(last_name) > 1: # not accept name with only 1 character
return last_name
else: return None
except: return None
def feature_twoLetters(nameString):
placeHolder = []
try:
for i in range(0, len(nameString)):
x = nameString[i:i+2]
if len(x) == 2:
placeHolder.append(x)
return placeHolder
except: return []
def list_to_dict(substring_list):
try:
substring_dict = {}
for i in substring_list:
substring_dict['substring='+str(i)] = True
return substring_dict
except: return None
list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
输出:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
示例数据:
Raw_name chineseScan
Jack Anderson non-chinese
Po Lee chinese
答案 0 :(得分:2)
如果我已经正确理解你想要一种方法来编码列表值,以便拥有DictVectorizer可以使用的特征字典。 (一年太晚但是)根据具体情况可以使用这样的东西:
my_dict_list = []
for i in X:
# create a new feature dictionary
feat_dict = {}
# add the features that are straight forward
feat_dict['last-name'] = feature_full_last_name(i)
feat_dict['dummy'] = 1
# for the features that have a list of values iterate over the values and
# create a custom feature for each value
for two_letters in feature_twoLetters(feature_full_last_name(i)):
# make sure the naming is unique enough so that no other feature
# unrelated to this will have the same name/ key
feat_dict['two-letter-substrings-' + two_letters] = True
# save it to the feature dictionary list that will be used in Dict vectorizer
my_dict_list.append(feat_dict)
print my_dict_list
from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x
输出:
[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
如果您不想创建与列表中的值一样多的功能,那么您可以做的另一件事(但我不建议)是这样的:
# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
但第一个意味着您不能拥有任何重复值,并且可能两者都没有很好的功能,特别是如果您需要微调和详细的功能。此外,它们减少了两行具有两个字母组合的相同组合的可能性,因此分类可能不会很好。
输出:
[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 1.]]