Improving the performance of a pandas double loop

Time: 2019-09-10 13:25:55

Tags: pandas performance

I have a DataFrame made up of numeric and categorical fields:

import numpy as np
import pandas as pd
df2=pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8], 'col3':['cat','cat','dog','bird']})
df2

and I use the following code to compute how similar each pair of rows is:

#calculate distance matrix comparing how similar two rows are
vals=[]
for i in range(len(df2)):
    for j in range(len(df2)):
        if(j<=i): continue
        a=df2.iloc[i,:]
        b=df2.iloc[j,:]

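        # positional access via .iloc; a[0] on a label-indexed Series is deprecated
        d0=(a.iloc[0]-b.iloc[0])**2
        d1=(a.iloc[1]-b.iloc[1])**2
        d2=np.where(a.iloc[2]==b.iloc[2],0,10)**2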

        row_values=(i,j, (d0 + d1 +d2)**0.5)
        vals.append(row_values)
new_df = pd.DataFrame(vals, columns =['Row1','Row2','Difference'])
new_df

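This works fine for a small DataFrame, but when I apply it to a similar DataFrame with 10k rows and 10 columns, the computation takes a very long time.

Are there any suggestions for speeding up this code?

I start with: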

    col1    col2    col3
0   1   5   cat
1   2   6   cat
2   3   7   dog
3   4   8   bird

and end up with:

    Row1    Row2    Difference
0   0   1   1.414214
1   0   2   10.392305
2   0   3   10.862780
3   1   2   10.099505
4   1   3   10.392305
5   2   3   10.099505

I am computing the distance between every pair of rows.

1 answer:

Answer 0: (score: 0)

This is a distance-matrix problem, so we can use scipy's distance_matrix together with NumPy broadcasting. Note, however, that this only works when your data is not too large, because the full n-by-n matrix is held in memory.

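import numpy as np
from scipy.spatial import distance_matrix

# squared Euclidean distance over the numeric columns
d01 = distance_matrix(df2[['col1','col2']].values, df2[['col1','col2']].values)**2

# category distance: True where the two rows have different categories
x = df2['col3'].values[:,None] != df2['col3'].values

# the matrix (a category mismatch contributes 10**2 = 100)
dist_mat = np.sqrt(d01 + x*100)

# we only care about the distances with row < col (upper triangle)
np.triu(dist_mat)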
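
To get from the matrix back to the Row1/Row2/Difference table shown in the question, one option is to index the strict upper triangle with np.triu_indices. The following is only a sketch of that idea, not part of the original answer; the names num, cat, mismatch, rows and cols are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

df2 = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [5, 6, 7, 8],
    'col3': ['cat', 'cat', 'dog', 'bird'],
})

# squared Euclidean distance over the numeric columns
num = df2[['col1', 'col2']].to_numpy(dtype=float)
d01 = distance_matrix(num, num) ** 2

# boolean matrix: True where the categories of two rows differ
cat = df2['col3'].to_numpy()
mismatch = cat[:, None] != cat

# a category mismatch contributes 10**2 = 100, as in the question's loop
dist_mat = np.sqrt(d01 + mismatch * 100)

# index pairs (i, j) with i < j, in the same order as the nested loop
rows, cols = np.triu_indices(len(df2), k=1)
new_df = pd.DataFrame({
    'Row1': rows,
    'Row2': cols,
    'Difference': dist_mat[rows, cols],
})
print(new_df)
```

The memory caveat still applies: for 10,000 rows the full matrix has 10^8 float64 entries, roughly 800 MB, before any intermediate arrays are counted.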