Question

My data frame has three columns relevant to this question: :ID, :Position, and :Probability. Each row is unique, but multiple rows may share the same ID. What I want is to get all rows with a particular value of Position that share an ID with any row whose Probability is above some value at a different Position.

For example, say I have the following DataFrame (df):

1020692×8 DataFrames.DataFrame
│ Row     │ ID  │ Position      │ Probability │
├─────────┼─────┼───────────────┼─────────────┤
│ 1       │ 425 │ "first"       │ 0.02        │
│ 2       │ 425 │ "last"        │ 0.03        │
│ 3       │ 425 │ "penultimate" │ 0.02        │
│ 4       │ 425 │ "other"       │ 0.04        │
│ 5       │ 421 │ "first"       │ 0.44        │
│ 6       │ 421 │ "last"        │ 0.85        │
│ 7       │ 421 │ "second"      │ 0.59        │
│ 8       │ 421 │ "other"       │ 1.0         │
⋮

If I set a threshold of 0.8, I want to end up with all rows where :Position == "first", provided that :ID also has a row with :Position == "last" && :Probability > 0.8. In other words, I want row 5, because row 6 has :Probability > 0.8, but not row 1, because row 2 does not.

The row checked against the threshold does not always come right after the row I want to keep. Not every "first" row has a "last" row to check, but there is at most one.

My attempt was to build a vector of all IDs with Probability > 0.8 at the "last" position, and then subset the data frame using in(). So...

firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = lasts[lasts[:Probability] .> 0.8, :ID]

final = firsts[[in(i, meetsthreshold) for i in firsts[:ID]], :]

I tested this with very short vectors of IDs and it works, but on the real data (length(meetsthreshold) >> 100k) it is extremely slow. I think what I want is basically a set intersection, and if I do that with the IDs alone (e.g. intersect(Set(firsts[:ID]), Set(meetsthreshold))), it is basically instantaneous. Is there a way to do a set intersection with data frames so that I actually get the rows?

Answer 1

I feel like an idiot. The solution is simply to search a Set instead of a Vector. For example:

firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = Set(lasts[lasts[:Probability] .> 0.8, :ID])

final = firsts[Vector{Bool}([in(i, meetsthreshold) for i in firsts[:ID]]), :]

It ran in about 1 second.
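The speedup is most likely down to how in() behaves on each container: on a Vector it scans element by element, while on a Set it is a hash lookup, so the cost of building the row mask no longer grows with the number of IDs that meet the threshold. Below is a minimal sketch of the same idea in plain Julia, using ordinary vectors in place of the DataFrame columns so that it runs without any packages. The names ids, pos, prob, hit_vector, hit_set, and kept_rows are made up for illustration, and the data is just the eight sample rows from the question.

```julia
# Toy columns mirroring the eight sample rows from the question (Base Julia only).
ids  = [425, 425, 425, 425, 421, 421, 421, 421]
pos  = ["first", "last", "penultimate", "other", "first", "last", "second", "other"]
prob = [0.02, 0.03, 0.02, 0.04, 0.44, 0.85, 0.59, 1.0]

threshold = 0.8

# IDs whose "last" row exceeds the threshold, kept once as a Vector and once as a Set.
hit_vector = ids[(pos .== "last") .& (prob .> threshold)]
hit_set    = Set(hit_vector)

# Row mask: "first" rows whose ID is among the hits.
# in(x, ::Vector) is a linear scan; in(x, ::Set) is a hash lookup.
mask_vector = (pos .== "first") .& [id in hit_vector for id in ids]
mask_set    = (pos .== "first") .& [id in hit_set for id in ids]

# Both masks select the same rows; only the lookup cost differs.
kept_rows = findall(mask_set)
println(kept_rows)
```

On this sample the only row kept is row 5, which matches the outcome the question asks for. With one lookup per row, the Vector version does work proportional to (rows × hits), whereas the Set version is roughly proportional to the number of rows.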

Efficient set intersection to get rows in a DataFrame
