Count how many DataFrame rows contain value A, how many contain value B, and how many contain both A and B

Time: 2019-11-25 15:06:58

Tags: string pandas dataframe count delimiter-separated-values

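I have a DataFrame "dfTags" with 140,000 rows (all lowercase). The number of comma-separated values in the "tags" column ranges from 1 to 71. However, the tags column is a plain string; pandas does not treat it as an array or list: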

index tags
0     a, b, c, aa, bb, 2019
1     a, d, 18, gb
2     aa, a, dd, fb, la
3     aa, d, ddaa, b, k, l
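A minimal sketch of turning the string column into real Python sets, assuming the delimiter is always exactly ", " (comma plus one space) as in the sample rows. The column name "tag_set" is hypothetical and not part of the original data.

```python
import pandas as pd

# Sample rows copied from the question.
dfTags = pd.DataFrame(
    {"tags": ["a, b, c, aa, bb, 2019",
              "a, d, 18, gb",
              "aa, a, dd, fb, la",
              "aa, d, ddaa, b, k, l"]}
)

# Split each string on ", " and convert the resulting list to a set,
# so that membership tests compare whole tags instead of substrings.
# "tag_set" is a hypothetical column name (assumption).
dfTags["tag_set"] = dfTags["tags"].str.split(", ").apply(set)

print(dfTags["tag_set"].iloc[1])
```

The column still has dtype "object", but each cell now holds a set, so `"a" in cell` is an exact-tag test.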

and a set "tagTuples" containing 850,000 sorted tuples (all lowercase), for example:

(a, b), (b, c), (aa, c), (aa, bb), (2019, bb), (a, d), (18, d), (18, gb), (a, aa), (a, dd), (dd, fb), (fb, la), (aa, d), (d, ddaa), ...

I used a set because I removed every tag that occurs only once and then simply added each tuple I created, which removes duplicates automatically.
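How the tuples were generated is not shown. The sketch below assumes each tuple is a sorted pair of neighbouring tags within one row, which is consistent with the sample tuples above but remains a guess. The helper name `adjacent_pairs` is hypothetical, and the step that drops tags occurring only once is left out.

```python
rows = [
    "a, b, c, aa, bb, 2019",
    "a, d, 18, gb",
    "aa, a, dd, fb, la",
    "aa, d, ddaa, b, k, l",
]

def adjacent_pairs(tag_string):
    """Return the sorted pairs of neighbouring tags in one row (hypothetical helper)."""
    tags = tag_string.split(", ")
    # zip(tags, tags[1:]) pairs every tag with its right-hand neighbour;
    # sorting each pair makes (b, a) and (a, b) the same tuple.
    return {tuple(sorted(pair)) for pair in zip(tags, tags[1:])}

tagTuples = set()
for row in rows:
    # The set union drops duplicate tuples automatically.
    tagTuples |= adjacent_pairs(row)

print(sorted(tagTuples))
```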

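For each tuple in "tagTuples" I need:

  • Example: (a, b)
  • How many rows in the "tags" column contain "a"? (3)
  • How many of the rows containing "a" also contain "b"? (1)
  • = 1/3 => 0.33
  • How many rows in the "tags" column contain "b"? (2)
  • How many of the rows containing "b" also contain "a"? (1)
  • = 1/2 => 0.5

    This gives an edge weight between a <> b of (0.33 + 0.5) * 100 = 83% (a modified Jaccard index)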
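The arithmetic for the example pair (a, b) can be reproduced on the four sample rows. This is only a worked check of the numbers above, assuming tags are separated by ", "; the variable names are illustrative.

```python
rows = [
    "a, b, c, aa, bb, 2019",
    "a, d, 18, gb",
    "aa, a, dd, fb, la",
    "aa, d, ddaa, b, k, l",
]

# Split every row into a set of whole tags.
tag_sets = [set(row.split(", ")) for row in rows]

n_a = sum("a" in tags for tags in tag_sets)                    # rows containing "a"
n_b = sum("b" in tags for tags in tag_sets)                    # rows containing "b"
n_ab = sum("a" in tags and "b" in tags for tags in tag_sets)   # rows containing both

# Sum of the two conditional shares, scaled to a percentage.
weight = (n_ab / n_a + n_ab / n_b) * 100
print(n_a, n_b, n_ab, round(weight, 2))  # 3 2 1 83.33
```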

Each result should then be pushed into the DataFrame dfTagTuple

dfTagTuple = pd.DataFrame(columns=["Source", "Target", "Weight"])

where Source = tuple[0], Target = tuple[1], Weight = edge weight

This way I get an edge with an edge weight between each pair of tags, which I can visualize in Gephi to build a tag network.

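But the dtype of the tags column is "object", because pandas has no array type. So when I check whether row["tags"] contains "a", how can I run that check for a tuple without also counting "aa" / "ddaa" / "la"?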
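The substring problem can be shown on the sample rows. The sketch below compares a plain substring test with two exact-tag tests, assuming the delimiter is always ", ". The function name `contains_tag` is hypothetical.

```python
import re

rows = [
    "a, b, c, aa, bb, 2019",
    "a, d, 18, gb",
    "aa, a, dd, fb, la",
    "aa, d, ddaa, b, k, l",
]

# Substring test: also matches "aa", "ddaa" and "la", so row 3 is counted by mistake.
substring_hits = sum("a" in row for row in rows)

# Exact test: split first, then compare whole tags.
token_hits = sum("a" in row.split(", ") for row in rows)

def contains_tag(row, tag):
    """Regex variant: the tag must be bounded by the string edge or ', ' (hypothetical helper)."""
    pattern = r"(?:^|, )" + re.escape(tag) + r"(?:, |$)"
    return re.search(pattern, row) is not None

regex_hits = sum(contains_tag(row, "a") for row in rows)

print(substring_hits, token_hits, regex_hits)  # 4 3 3
```

Splitting once and reusing the result is usually simpler than running a regular expression per tag, especially with many tuples.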

How can I perform these 4 checks and get the final result for each tuple (0.833...) in a performant way?

import pandas as pd

def calc_distance(tagLeft, tagRight):
    # how many rows contain tagLeft (e.g. "a") in tags?
    onlyTagLeft = ??
    # how many rows contain tagRight (e.g. "b") in tags?
    onlyTagRight = ??
    # how many rows contain both tagLeft and tagRight in tags?
    bothTags = ??
    edgeWeight = ((bothTags / onlyTagLeft) + (bothTags / onlyTagRight)) * 100
    # print(tagLeft, "#", tagRight, edgeWeight)
    print("{}: {}, {}: {}, bothTags: {}, weight: {}".format(tagLeft, onlyTagLeft, tagRight, onlyTagRight, bothTags,
                                                            edgeWeight))

df = pd.DataFrame([["a, b, c, aa, bb, 2019"], ["a, d, 18, gb"], ["aa, a, dd, fb, la"], ["aa, d, ddaa, b, k, l"]], columns=["tags"])
tagSet = {('aa', 'd'), ('a', 'aa'), ('a', 'd'), ('a', 'b')}

for tagTuple in tagSet:
    calc_distance(tagTuple[0], tagTuple[1])

1 Answer:

Answer 0 (score: 0)

This is not a complete answer, but for each of the tagTuples (tt) it gives you how many times the first element of tt occurs and how many times both elements occur together; from those you can do the calculation.
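A minimal sketch of that kind of counting, not necessarily the approach this answer had in mind. It assumes tags are separated by ", " and uses an inverted index (tag to set of row labels) so that each tuple only needs one set intersection. The names `rows_by_tag` and `calc_weight` are hypothetical.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame(
    [["a, b, c, aa, bb, 2019"], ["a, d, 18, gb"],
     ["aa, a, dd, fb, la"], ["aa, d, ddaa, b, k, l"]],
    columns=["tags"],
)
tagSet = {("aa", "d"), ("a", "aa"), ("a", "d"), ("a", "b")}

# Inverted index: tag -> set of row labels whose tag list contains exactly that tag.
rows_by_tag = defaultdict(set)
for idx, tag_string in df["tags"].items():
    for tag in tag_string.split(", "):
        rows_by_tag[tag].add(idx)

def calc_weight(tagLeft, tagRight):
    """Edge weight as defined in the question (hypothetical helper)."""
    left = rows_by_tag.get(tagLeft, set())
    right = rows_by_tag.get(tagRight, set())
    if not left or not right:
        return 0.0  # assumption: a tag that never occurs contributes no edge
    both = len(left & right)  # rows containing both tags
    return (both / len(left) + both / len(right)) * 100

# Build all rows first, then create the DataFrame once instead of appending per tuple.
records = [(left, right, calc_weight(left, right)) for left, right in sorted(tagSet)]
dfTagTuple = pd.DataFrame(records, columns=["Source", "Target", "Weight"])
print(dfTagTuple)
```

The index is built in a single pass over the rows, so the per-tuple cost no longer depends on scanning the whole DataFrame. Note that the weight is a sum of two shares and can therefore range from 0 to 200, as for (a, aa) in the sample data.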

Hope that helps.