I have a data frame made up of numeric and categorical fields:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': ['cat', 'cat', 'dog', 'bird']})
df2
and I use the following code to calculate how similar each pair of rows is:
# calculate a distance matrix comparing how similar two rows are
vals = []
for i in range(len(df2)):
    for j in range(len(df2)):
        if j <= i:
            continue
        a = df2.iloc[i, :]
        b = df2.iloc[j, :]
        d0 = (a.iloc[0] - b.iloc[0])**2
        d1 = (a.iloc[1] - b.iloc[1])**2
        # a category mismatch counts as a distance of 10
        d2 = np.where(a.iloc[2] == b.iloc[2], 0, 10)**2
        row_values = (i, j, (d0 + d1 + d2)**0.5)
        vals.append(row_values)
new_df = pd.DataFrame(vals, columns=['Row1', 'Row2', 'Difference'])
new_df
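This works well for a small data frame, but when I apply it to something like a data frame with 10k rows and 10 columns, the computation takes a very long time.
Are there any suggestions for how to speed up this code?
I start from: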
   col1  col2  col3
0     1     5   cat
1     2     6   cat
2     3     7   dog
3     4     8  bird
and end up with:
   Row1  Row2  Difference
0     0     1    1.414214
1     0     2   10.392305
2     0     3   10.862780
3     1     2   10.099505
4     1     3   10.392305
5     2     3   10.099505
I am calculating the distance between each pair of rows.
Answer 0 (score: 0)
This is a distance matrix problem, so we can use broadcasting and distance_matrix. Note, however, that this only works when your data is not too large.
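import numpy as np
from scipy.spatial import distance_matrix
# normal distance: squared Euclidean distance over the numeric columns
num = df2[['col1', 'col2']].values
d01 = distance_matrix(num, num)**2
# category distance: True where col3 differs between two rows
x = df2['col3'].values[:, None] != df2['col3'].values
# the matrix: a category mismatch adds 10**2 = 100 to the squared distance
dist_mat = np.sqrt(d01 + x*100)
# we only care about the distances with row < col, so zero out the rest
np.triu(dist_mat, k=1)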
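To get back the long Row1/Row2/Difference table from the question, one possible sketch is below. It is a variant rather than the code above: it builds the squared distances with plain NumPy broadcasting, one column at a time, instead of distance_matrix. The helper name pairwise_table and its penalty parameter are made up for illustration. The penalty of 10 for a category mismatch is taken from the question's loop.

```python
import numpy as np
import pandas as pd

def pairwise_table(df, num_cols, cat_cols, penalty=10):
    """Hypothetical helper: long-form table of distances for all pairs i < j."""
    n = len(df)
    sq = np.zeros((n, n))
    # numeric columns: squared difference via broadcasting, one column at a time
    for c in num_cols:
        v = df[c].to_numpy(dtype=float)
        sq += (v[:, None] - v) ** 2
    # categorical columns: a mismatch adds penalty**2 to the squared distance
    for c in cat_cols:
        v = df[c].to_numpy()
        sq += (v[:, None] != v) * penalty ** 2
    # indices of the strict upper triangle, in the same order as the nested loop
    rows, cols = np.triu_indices(n, k=1)
    return pd.DataFrame({'Row1': rows,
                         'Row2': cols,
                         'Difference': np.sqrt(sq[rows, cols])})

df2 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': [5, 6, 7, 8],
                    'col3': ['cat', 'cat', 'dog', 'bird']})
new_df = pairwise_table(df2, ['col1', 'col2'], ['col3'])
print(new_df)
```

Keep in mind that the full n x n float64 matrix takes about 800 MB for 10,000 rows, before counting temporaries, and the resulting table has n*(n-1)/2 rows, so this only helps while the matrix still fits in memory.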