Question

I have a data frame:

ID    message
1     request body: <?xml version="2.0",<code> dwfkjn34241
2     request body: <?xml version="2.0",<code> jnwg3425
3     request body: <?xml version="2.0", <PlatCode>, <code> qwefn2
4     received an error
5     <MarkCheckMSG>
6     received an error

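I want to extract groups of values in a column based on common words. Although the first three rows of the message column differ slightly, they can be treated as one group. The fourth and sixth rows belong to another group. How can I group these values using word and structural similarity within the message column? What would be a good approach? The data frame above is only an example, so I'm more interested in an approach that fits the general idea of the problem than in a regex-based solution.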

Answer 1

Maybe try k-medoids clustering with a string distance metric?

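library(cluster)
library(stringdist)

# Cluster the strings in x with k-medoids (PAM) for every k from k_from upward
# and return the fit with the highest average silhouette width.
# k_from should be at least 2, since the silhouette is undefined for one cluster.
find_medoids <- function(x, k_from, method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)) {
  # Pairwise string distances, labelled by the strings themselves
  diss <- stringdist::stringdistmatrix(x, x, method = method, weight = weight)
  dimnames(diss) <- list(x, x)
  # pam() requires k < n, so cap the largest k at length(x) - 1
  k_max <- min(length(unique(x)), length(x) - 1L)
  trials <- lapply(
    seq(from = k_from, to = k_max),
    function(i) cluster::pam(diss, i, diss = TRUE)
  )
  sel <- which.max(vapply(trials, `[[`, numeric(1L), c("silinfo", "avg.width")))
  trials[[sel]]
}

# Look up the cluster number of each string in x by name
map_cluster <- function(x, med_obj) {
  unname(med_obj$clustering[x])
}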
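
For readers who want to see the idea without the R packages, here is a minimal Python sketch of k-medoids over string distances, using only the standard library. It is not the answer's code: the distance is an assumption (one minus the `difflib` similarity ratio rather than a `stringdist` metric), the helper names `distance` and `k_medoids` are made up for illustration, the medoids are found by brute force instead of PAM's swap heuristic, and the number of clusters is fixed at 3 instead of being chosen by silhouette width.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The six messages from the question.
messages = [
    'request body: <?xml version="2.0",<code> dwfkjn34241',
    'request body: <?xml version="2.0",<code> jnwg3425',
    'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
    "received an error",
    "<MarkCheckMSG>",
    "received an error",
]


def distance(a, b):
    # Assumption: difflib's similarity ratio stands in for a string distance.
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def k_medoids(items, k):
    # Brute force: try every set of k medoids and keep the cheapest one.
    # Only feasible for very small inputs; PAM uses a swap heuristic instead.
    n = len(items)
    dist = [[distance(a, b) for b in items] for a in items]
    best_cost, best_medoids = float("inf"), None
    for candidate in combinations(range(n), k):
        # Cost of a candidate: each item's distance to its nearest medoid.
        cost = sum(min(dist[i][m] for m in candidate) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, candidate
    # Label each item with the index of its nearest medoid.
    labels = [min(best_medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return best_medoids, labels


medoids, labels = k_medoids(messages, 3)
for message, label in zip(messages, labels):
    print(label, message)
```

Each label is the index of the message acting as that cluster's medoid, so messages that print the same label were grouped together.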

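For real data you may need to tune some parameters, such as the string distance method. Note that find_medoids as defined above defaults to method = "osa" (optimal string alignment); the example this answer refers to used cosine distance, which stringdist selects with method = "cosine".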
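
To give a feel for what the choice of distance changes, the following Python sketch computes a cosine distance over character q-gram counts, which is roughly the idea behind a q-gram based cosine metric. It is a stand-alone approximation for illustration, not the `stringdist` implementation, and the function names `qgrams` and `cosine_distance` are hypothetical.

```python
import math
from collections import Counter


def qgrams(text, q=1):
    # Count every substring of length q (empty if text is shorter than q).
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def cosine_distance(a, b, q=1):
    # 1 - cosine similarity between the q-gram count vectors of a and b.
    pa, pb = qgrams(a, q), qgrams(b, q)
    dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
    norm_a = math.sqrt(sum(v * v for v in pa.values()))
    norm_b = math.sqrt(sum(v * v for v in pb.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat strings with no q-grams as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


m1 = 'request body: <?xml version="2.0",<code> dwfkjn34241'
m2 = 'request body: <?xml version="2.0",<code> jnwg3425'
m4 = "received an error"

for q in (1, 2):
    print(q, cosine_distance(m1, m2, q), cosine_distance(m1, m4, q))
```

With q = 1 the profiles are just letter counts, so unrelated messages can still look fairly similar when they share common letters; a larger q generally makes the measure stricter because longer substrings have to match. In stringdist the corresponding setting is the q argument.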
