I'm working on a machine learning competition whose goal is to predict the type of (or motivation for) the shopping trips customers make to a supermarket, given information about each trip. I have a CSV file in the following format:
TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
999,5,Friday,68113152929,-1,FINANCIAL SERVICES,1000
30,7,Friday,60538815980,1,SHOES,8931
30,7,Friday,7410811099,1,PERSONAL CARE,4504
26,8,Friday,2238403510,2,PAINT AND ACCESSORIES,3565
26,8,Friday,2006613744,2,PAINT AND ACCESSORIES,1017
My first step is to turn this data into feature vectors. To do that, I convert each categorical variable into dummy variables, so that each vector is one unique sample. The difficulty in building the vectors is that samples are not confined to single rows; the data for one sample is spread across several rows. For example, the excerpt above has 5 rows but only 3 samples (VisitNumbers 5, 7 and 8). Here are the feature vectors for those samples:
'Friday', 68113152929, 60538815980, 7410811099, 2238403510, 2006613744, 'FINANCIAL SERVICES', 'SHOES', 'PERSONAL CARE', 'PAINT AND ACCESSORIES', 1000, 8931, 4504, 3565, 1017, 'Returned'
[ 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1.]
[ 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 2. 2. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
Note that I added a 'Returned' feature at the end (it is 1 when any of the Upcs or purchased items in the visit has a negative scan count). There is also a "target" vector with the corresponding labels:
[999, 30, 26]
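For illustration, the same per-visit encoding can be sketched with pandas built-ins instead of hand-rolled loops; this is a hedged sketch on the five sample rows above (only the Weekday/Department dummies, the Returned flag and the targets are shown; the `get_dummies`/`groupby` combination is my illustration, not the code in the question):

```python
import pandas as pd
from io import StringIO

csv = """TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
999,5,Friday,68113152929,-1,FINANCIAL SERVICES,1000
30,7,Friday,60538815980,1,SHOES,8931
30,7,Friday,7410811099,1,PERSONAL CARE,4504
26,8,Friday,2238403510,2,PAINT AND ACCESSORIES,3565
26,8,Friday,2006613744,2,PAINT AND ACCESSORIES,1017"""
df = pd.read_csv(StringIO(csv))

# One-hot encode the categorical columns row by row...
dummies = pd.get_dummies(df[['Weekday', 'DepartmentDescription']])
# ...then collapse the rows of each visit into a single sample
per_visit = pd.concat([df[['VisitNumber']], dummies], axis=1).groupby('VisitNumber').max()

# 'Returned' flag: 1 when any scan count in the visit is negative
per_visit['Returned'] = (df.groupby('VisitNumber')['ScanCount'].min() < 0).astype(int)

# One target label per visit
targets = df.groupby('VisitNumber')['TripType'].first()
print(list(targets))                # [999, 30, 26]
print(list(per_visit['Returned']))  # [1, 0, 0]
```

Because the grouping and dummy encoding happen inside pandas rather than in Python-level loops, this scales very differently than per-row iteration.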
My problem is generating these vectors efficiently. I wrote this part of the program fairly quickly, testing the code on only a small slice (100-1000 rows) of the total data I have (~700k rows). When I finished the rest of the program (the learning and prediction parts) and came back to the full data set, the "vectorization" part seemed to take far too long. Do you have any suggestions for approaching this (getting feature vectors out of the CSV file) with better performance?
Here is the code I am using right now, if you want to see what I'm currently doing. Jump to the "Iterate by row" comment to go straight to the vectorization part:
import time
from itertools import izip_longest  # use itertools.zip_longest on Python 3

import pandas as pd

# Reshape the smaller vector by padding zeros at the front, then add the two
# vectors elementwise; where both vectors have a nonzero value, append 1 instead
def vector_add(P, Q):
    a = []
    for x, y in izip_longest(reversed(P), reversed(Q), fillvalue=0):
        if x == 0 or y == 0:
            a.append(x + y)
        else:
            a.append(1)
    return a[::-1]

df = pd.read_csv('exp-train')

# Get the unique values of each feature
visitnums = df.drop_duplicates(subset='VisitNumber')['VisitNumber']
days = df.drop_duplicates(subset='Weekday')['Weekday']
upcs = df.drop_duplicates(subset='Upc')['Upc']
departments = df.drop_duplicates(subset='DepartmentDescription')['DepartmentDescription']
finenums = df.drop_duplicates(subset='FinelineNumber')['FinelineNumber']

# List to contain all feature vectors
lines = []

# Build the header row and put it in the big list
top_line = ['VisitType']
for day in days:
    top_line.append(day)
for upc in upcs:
    top_line.append(upc)
for department in departments:
    top_line.append(department)
for finenum in finenums:
    top_line.append(finenum)
top_line.append('Returned')
lines.append(top_line)

# Iterate by row
# The back variable tracks the previous VisitNumber to detect duplicate samples
back = 'no'
line = []
returned = 0
for i, row in enumerate(df.itertuples()):
    # line2 holds the partial vector for a duplicate sample
    line2 = []
    if not back == row[2]:
        if not back == 'no':
            line.append(returned)
            returned = 0
            lines.append(line)
            line = []
        line.append(row[1])
        for day in days:
            if day == row[3]:
                line.append(1)
            else:
                line.append(0)
        for upc in upcs:
            if upc == row[4]:
                if int(row[5]) < 0:
                    returned = 1
                    line.append(0)
                else:
                    line.append(int(row[5]))
            else:
                line.append(0)
        for department in departments:
            if department == row[6]:
                line.append(1)
            else:
                line.append(0)
        for finenum in finenums:
            if finenum == row[7]:
                line.append(1)
            else:
                line.append(0)
    else:
        for upc in upcs:
            if upc == row[4]:
                if int(row[5]) < 0:
                    returned = 1
                    line2.append(0)
                else:
                    line2.append(int(row[5]))
            else:
                line2.append(0)
        for department in departments:
            if department == row[6]:
                line2.append(1)
            else:
                line2.append(0)
        for finenum in finenums:
            if finenum == row[7]:
                line2.append(1)
            else:
                line2.append(0)
        # Merge the duplicate sample by combining line and line2 into line
        line = vector_add(line, line2)
    back = row[2]
    if i == (len(df.index) - 1):
        line.append(returned)
        returned = 0
        lines.append(line)

a = time.time()
If there is a better or faster way to do this, please let me know.
Answer 0 (score: 2)
If I understand you correctly, you can simply create a formula like this:
import pandas as pd
import numpy as np
df = pd.read_csv('exp-train')
from patsy import dmatrices
#here the ~ sign is an = sign
#The C() lets our algorithm know that those variables are categorical
formula_ml = 'TripType ~ VisitNumber + C(Weekday) + Upc + ScanCount + C(DepartmentDescription)+ FinelineNumber'
#assign the variables
Y_train, X_train = dmatrices(formula_ml, data=df, return_type='dataframe')
Y_train= np.asarray(Y_train).ravel()
You can choose which features to use for your machine learning algorithm by changing the formula.
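The design matrices that `dmatrices` returns can be fed straight into a scikit-learn estimator; a minimal sketch (the `RandomForestClassifier` and the tiny hand-made arrays are illustrative assumptions on my part, not something this answer prescribes):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for the X_train / Y_train produced by the dmatrices call above:
# any numeric 2-D design matrix and 1-D label vector have the same shape contract.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y_train = np.array([999, 30, 26])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, Y_train)

# One predicted TripType per visit
predictions = clf.predict(X_train)
print(predictions.shape)  # (3,)
```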
Answer 1 (score: 0)
Pure Python code can be very slow - that is why numpy and friends are written in C, Fortran and Cython.
For example, an integer in pure Python takes 12 bytes rather than 8 to store, and building up `list()`s of integers with `append` is bound to be slow and costly.
To speed things up, try pushing the per-row work into vectorized numpy/pandas operations instead of Python-level loops.
Also, use the Python profiler to identify where your hotspots are.
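A minimal sketch of such profiling with the standard library's `cProfile`/`pstats` (the `build_vectors` function below is a stand-in workload, not the asker's actual code):

```python
import cProfile
import io
import pstats

def build_vectors(n):
    # Stand-in workload: build n small dummy vectors with pure-Python appends
    lines = []
    for i in range(n):
        line = []
        for j in range(16):
            line.append(1 if j == i % 16 else 0)
        lines.append(line)
    return lines

profiler = cProfile.Profile()
profiler.enable()
result = build_vectors(10000)
profiler.disable()

# Print the 5 most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

The report shows per-function call counts and times, so a loop like the one in the question would surface immediately at the top of the listing.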