I have a dataframe of intervals like this:
start end
1 10
3 7
8 10
For each value in another dataframe, I need to find the number of intervals it intersects:
value
2
5
9
The result should be:
1
2
2
The second part of the problem is trickier.
My dataframe of intervals also contains a type column:
start end type
1 10 1
3 7 1
8 10 2
For each value, I need to know how many distinct types occur among the intervals it intersects. The result should be:
1
1
2
I think the first part can be done with numpy.searchsorted, but what about the second part?
Answer 0 (score: 1)
Let's call your first dataframe df. For a given value, the intervals it intersects can be found with a boolean mask:
mask = (df['start'] <= value) & (df['end'] >= value)
The following returns the number of intersecting intervals:
mask.sum()
The following returns the number of distinct types among the intersecting intervals:
len(df['type'][mask].unique())
Now you can apply a lambda function built from these expressions to the series of values.
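The apply step described above could look roughly like the following. This is a minimal sketch, not the answerer's exact code: the frame name other, the helper names count_intervals and count_types, and the result column names are assumptions made for illustration, and the intervals are treated as closed on both ends, as in the mask above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})
other = pd.DataFrame({'value': [2, 5, 9]})  # 'other' is an assumed name


def count_intervals(value):
    """Number of closed intervals [start, end] that contain value."""
    mask = (df['start'] <= value) & (df['end'] >= value)
    return int(mask.sum())


def count_types(value):
    """Number of distinct types among the intervals that contain value."""
    mask = (df['start'] <= value) & (df['end'] >= value)
    return len(df['type'][mask].unique())


# Apply each helper to every value in the series.
other['n_intervals'] = other['value'].apply(count_intervals)
other['n_types'] = other['value'].apply(count_types)
print(other)
```

Each call scans the whole interval frame, so the cost grows with the number of values times the number of intervals. That is fine for small inputs; for large ones a sorted-event approach is likely to scale better.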
Answer 1 (score: 1)
DSM shows a great way to deal with intervals using Pandas. Following that pattern, we can combine the start and end values into a single column of idx values, with a second column (change) that equals 1 when the idx corresponds to a start and -1 when the idx corresponds to an end.
df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
#    type  change  idx
# 3     1       1    1
# 4     1       1    3
# 1     1      -1    7
# 5     2       1    8
# 0     1      -1   10
# 2     2      -1   10
Now, since we want to keep track of the type of each interval, we can use event.pivot to put each type in its own column. Taking the cumsum then counts, for each type, the number of intervals that cover idx:
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
# type  1  2
# idx
# 1     1  0
# 3     2  0
# 7     1  0
# 8     1  1
# 10    0  0
For each type, we only care whether a value is covered, not how many times it is covered. So let's compute event > 0 to find the values that are covered:
event = event > 0
# type      1      2
# idx
# 1      True  False
# 3      True  False
# 7      True  False
# 8      True   True
# 10    False  False
Now we can use searchsorted to find the desired result:
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
Putting it all together:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)
yields
   value  result
0      2       1
1      5       1
2      9       2
To check the correctness of the calculation, let's look at
other = pd.DataFrame({'value': np.arange(13)})
Then
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)
yields
    value  result
0       0       0
1       1       0   <-- The half-open interval (1, 10] does not include 1
2       2       1
3       3       1
4       4       1
5       5       1
6       6       1
7       7       1
8       8       1   <-- The half-open interval (8, 10] does not include 8
9       9       2
10     10       2
11     11       0
12     12       0
Note that this method of calculation treats the intervals as half-open, (start, end]. If you would rather use half-open intervals of the form [start, end), the searchsorted lookup has to be adjusted accordingly.
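One way to get [start, end) semantics is to pass side='right' to searchsorted, so that a value equal to an idx is looked up in the row at that idx instead of the row before it. The sketch below is an assumption about how the adjustment could be made, not necessarily the answer's original snippet. The name covered is introduced here for the boolean table that the answer calls event.

```python
import numpy as np
import pandas as pd

# Sample intervals from the question.
df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})

# One row per interval endpoint: +1 at a start, -1 at an end.
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start': 1, 'end': -1})
event = event.sort_values(by=['idx'])

# One column per type; the running sum counts open intervals at each idx.
counts = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
covered = counts > 0  # 'covered' is an assumed name for the boolean table

other = pd.DataFrame({'value': np.arange(13)})

# side='right' maps value == idx to the row AT idx, so a start is included
# and an end is excluded, which gives [start, end) semantics.
idx = covered.index.searchsorted(other['value'], side='right') - 1
other['result'] = covered.iloc[idx].sum(axis=1).values
print(other)
```

With this lookup, value 1 now counts as covered by the interval starting at 1, and value 10 no longer counts as covered. Values below the smallest idx produce an index of -1, which selects the last row. That row is all False as long as every interval that starts also ends, so those values still get a result of 0.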