Numpy efficient way to get the number of unique intersected intervals

Time: 2016-02-23 11:06:29

Tags: python algorithm numpy pandas

I have a dataframe of intervals like this:

start end
1     10
3     7
8     10

For each value in another dataframe, I need to find the number of intervals it intersects:

value
2
5
9

The result should be:

1
2
2

The second part of the question is trickier. My dataframe of intervals also contains a type column:

start end type
1     10  1
3     7   1
8     10  2

I need to know how many unique (by type) intervals each value intersects. The result should be:

1
1
2 

I guess the first part can be done with numpy.searchsorted, but what about the second part?

2 Answers:

Answer 0: (score: 1)

Let's call your first dataframe df. For a given value, the intersected intervals can be found with:

mask = (df['start'] <= value) & (df['end'] >= value)

The following returns the number of intersected intervals:

mask.sum()

The following returns the number of intersected types:

len(df['type'][mask].unique())

Now you can apply a lambda function that wraps these expressions to the series of values.
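The final step, applying the per-value logic to the whole series of values, could look roughly like the sketch below. The names values, n_intervals and n_types are illustrative assumptions and do not come from the answer; the intervals are treated as closed, as in the mask above.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})
values = pd.Series([2, 5, 9])  # hypothetical name for the series of values

# Number of intervals [start, end] that contain each value
n_intervals = values.apply(
    lambda v: ((df['start'] <= v) & (df['end'] >= v)).sum())

# Number of distinct types among the intervals that contain each value
n_types = values.apply(
    lambda v: len(df['type'][(df['start'] <= v) & (df['end'] >= v)].unique()))

print(n_intervals.tolist())  # [1, 2, 2]
print(n_types.tolist())      # [1, 1, 2]
```

This builds one boolean mask per value, so the cost grows with the number of values times the number of intervals. That is fine for small inputs but may be slow for large ones.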

Answer 1: (score: 1)

DSM shows a great way to deal with intervals using Pandas. Following that pattern, we can combine the start and end values into a single column of idxs, with a second column (change) that equals 1 when the idx corresponds to a start and -1 when the idx corresponds to an end.

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])

#    type  change  idx
# 3     1       1    1
# 4     1       1    3
# 1     1      -1    7
# 5     2       1    8
# 0     1      -1   10
# 2     2      -1   10

Now, since we want to keep track of the intervals by type, we can use event.pivot to place each type in its own column. Taking the cumsum counts the number of intervals that cover each idx:

event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)

# type  1  2
# idx
# 1     1  0
# 3     2  0
# 7     1  0
# 8     1  1
# 10    0  0

For each type, we only care about which values are covered, not how many times they are covered. So let's compute event > 0 to find the covered values:

event = event > 0

# type      1      2
# idx
# 1      True  False
# 3      True  False
# 7      True  False
# 8      True   True
# 10    False  False

Now we can use searchsorted to find the desired result:

other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values

Putting it all together:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})

event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)

yields

   value  result
0      2       1
1      5       1
2      9       2

To check the correctness of the calculation, let's look at

other = pd.DataFrame({'value': np.arange(13)})

Then

idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)

yields

    value  result
0       0       0
1       1       0   <-- The half-open interval (1, 10] does not include 1
2       2       1
3       3       1
4       4       1
5       5       1
6       6       1
7       7       1
8       8       1   <-- The half-open interval (8, 10] does not include 8
9       9       2
10     10       2
11     11       0
12     12       0

Note that this method of computation treats the intervals as half-open, (start, end]. If you want half-open intervals of the form [start, end) instead, the searchsorted call needs to be changed accordingly.
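One way to get the [start, end) behaviour is sketched below. It assumes the only change needed is passing side='right' to searchsorted; this is an assumption, since the exact snippet from the answer is not reproduced in this copy. Everything else repeats the pipeline above so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Example data from the question
df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})

# Same event table as in the answer: one row per boundary, one column per type
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start': 1, 'end': -1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0

other = pd.DataFrame({'value': np.arange(13)})

# Assumption: side='right' makes a value equal to a boundary pick up the row
# at that boundary, so starts are included and ends are excluded.
idx = event.index.searchsorted(other['value'].to_numpy(), side='right') - 1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)
```

With this change, value 1 and value 8 are counted as covered by the intervals that start there, while value 10 is no longer covered by the intervals that end there. Values below the first boundary get index -1, which selects the last row; that row is all False here because every interval has closed by then.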