Memory Error when using sklearn packages

Time: 2016-06-17 21:54:59

Tags: python memory scikit-learn data-fitting

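Following my earlier question, http://stackoverflow.com/questions/37844596/avoid-memory-error-when-dealing-with-large-arrays, I was able to get past the Memory Error caused by array operations by splitting the arrays into rows; thank you for your responses. The problem now is that a Memory Error is thrown when fitting the data with the sklearn packages; for example, when calling km.fit(arr_3d[i]) in the code below.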

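The array is 3D and I am looping over it, so why do I get this error, and how can I solve it? Note that it does not happen every time; sometimes the code runs fine without any error, and I am not sure why.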

The full code is:

def home(request):
    if request.method=="POST":
        img = UploadForm(request.POST, request.FILES)
        no_clus = int(request.POST.get('num_clusters', 10))

        if img.is_valid():

            paramFile =io.TextIOWrapper(request.FILES['File'].file)
            portfolio1 = csv.DictReader(paramFile)
            users = []
            users = [row["BASE_NAME"] for row in portfolio1]


            my_list = users
            vectorizer = CountVectorizer()
            dtm = vectorizer.fit_transform(my_list)

            lsa = TruncatedSVD(n_components=100)
            dtm_lsa = lsa.fit_transform(dtm)
            dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
            product= (np.dot(dtm_lsa, dtm_lsa.T))
            dist1 = (1 - product)
            k = len(my_list) ### length is 5362 
            data2 = np.asarray(dist1)
            arr_3d = data2.reshape((1, k, k))

            print(arr_3d) ### shown below
            print(len(arr_3d))
            no_cluster = number_cluster(request,len(my_list))
            print(no_cluster)
            for i in range(len(arr_3d)):
                #km = AgglomerativeClustering(n_clusters=no_clus, linkage='ward')
                #km = km.fit(arr_3d[i])
              #  km = KMeans(n_clusters=no_cluster, init='k-means++')
                km = AgglomerativeClustering(n_clusters=no_cluster, linkage='complete')
                km = km.fit(arr_3d[i])
                #km = AgglomerativeClustering(n_clusters=no_cluster, linkage='average').fit(arr_3d[i])
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='complete').fit(arr_3d[i])
                # km = MeanShift()
                # km = KMeans(n_clusters=no_clus, init='k-means++')
                # km = MeanShift()
                #  km = km.fit(arr_3d[i])
                # print km
                labels = km.labels_

            csvfile = settings.MEDIA_ROOT +'\\'+ 'images\\export.csv'

            csv_input = pd.read_csv(csvfile, encoding='latin-1')
            csv_input['cluster_ID'] = labels
            csv_input['BASE_NAME'] = my_list
            csv_input.to_csv(settings.MEDIA_ROOT +'/'+ 'output.csv', index=False)
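
The distance-matrix step in the code above builds `product` and then a second full-size array `dist1 = 1 - product`, so two k x k arrays are alive at the same time. A minimal sketch of one way to build the same matrix with a single allocation is shown here. It assumes the rows are already unit-normalized (as `Normalizer` produces), and the function name `cosine_distance_inplace` is hypothetical and not part of the original code.

```python
import numpy as np


def cosine_distance_inplace(unit_rows, dtype=np.float64):
    """Return 1 - X @ X.T using one (n, n) buffer instead of two.

    Assumes every row of `unit_rows` already has unit L2 norm.
    """
    x = np.ascontiguousarray(unit_rows, dtype=dtype)
    n = x.shape[0]
    dist = np.empty((n, n), dtype=dtype)
    # Write the Gram matrix straight into the preallocated buffer.
    np.dot(x, x.T, out=dist)
    # Turn similarities into distances in place (no second n x n array).
    np.subtract(1.0, dist, out=dist)
    return dist


# Small illustrative input; the real data would be dtm_lsa.
rng = np.random.default_rng(0)
rows = rng.normal(size=(5, 3))
rows /= np.linalg.norm(rows, axis=1, keepdims=True)
dist = cosine_distance_inplace(rows)
print(dist.shape)
```

Passing `dtype=np.float32` would halve the size of the buffer at the cost of precision; whether that is acceptable for the clustering step is a judgment call that depends on the data.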

arr_3d is:

 [[[  0.00000000e+00   9.87752905e-01   1.00070800e+00 ...,   8.93937985e-01
     1.00352321e+00   1.00481892e+00]
  [  9.87752905e-01  -2.22044605e-16   1.00107768e+00 ...,   9.80156085e-01
     1.00047940e+00   1.00059883e+00]
  [  1.00070800e+00   1.00107768e+00  -6.66133815e-16 ...,   9.97548342e-01
     9.99890765e-01   1.00143594e+00]
  ..., 
  [  8.93937985e-01   9.80156085e-01   9.97548342e-01 ...,  -2.22044605e-16
     2.34431311e-01   9.87267801e-01]
  [  1.00352321e+00   1.00047940e+00   9.99890765e-01 ...,   2.34431311e-01
    -2.22044605e-16   1.00152421e+00]
  [  1.00481892e+00   1.00059883e+00   1.00143594e+00 ...,   9.87267801e-01
     1.00152421e+00   3.33066907e-16]]]
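
As a rough sanity check on the sizes involved, the memory taken by one k x k matrix like the one printed above can be estimated with plain arithmetic. This sketch assumes 8-byte float64 entries (NumPy's default for this kind of computation) and counts only the arrays visible in the posted code; whatever the estimator allocates internally during `fit` comes on top and is not modelled here. The helper name `square_matrix_bytes` is hypothetical.

```python
def square_matrix_bytes(k, itemsize=8):
    """Bytes needed for a dense k x k array with `itemsize`-byte entries."""
    return k * k * itemsize


k = 5362  # len(my_list) in the question
one_copy = square_matrix_bytes(k)

print(f"one {k} x {k} float64 matrix: {one_copy / 2**20:.1f} MiB")
# `product` and `dist1` are both full-size arrays held at the same time.
print(f"product and dist1 together: {2 * one_copy / 2**20:.1f} MiB")
```

One copy comes to roughly 219 MiB, so the two arrays the view holds explicitly already account for more than 400 MiB before clustering starts. On a memory-constrained or 32-bit Python process, that could plausibly explain an error that shows up only on some runs, depending on what else is allocated at the time.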

0 answers:

No answers