Question

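I have a large dataset of about 20 million observations. I want to compute the Jaccard index between TitleAbstract.x1 and TitleAbstract.y1 for each row.

Here is a sample of 2 observations: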

    structure(list(Patent = c(6326004L, 6514936L), TitleAbstract.x = c("mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof.", 
"mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof."
), cited = c(4261928L, 4261928L), TitleAbstract.y = c("antiviral methods using fragments human rhinovirus receptor (icam-1) ", 
"antiviral methods using human rhinovirus receptor (icam-1) method substantially inhibiting initiation spread infection rhinovirus coxsackie virus host cells expressing major human rhinovirus receptor (icam-1), comprising step contacting virus soluble polypeptide comprising hrv binding site domains ii icam-1; polypeptide capable binding virus reducing infectivity thereof; contact conditions permit virus bind polypeptide."
), Jaccard = c(0, 0.00909090909090909)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f9c8f801778>, sorted = "cited", .Names = c("Patent", 
"TitleAbstract.x", "cited", "TitleAbstract.y", "Jaccard"))

In a previous post, I computed the Jaccard index with a homemade formula. I wrapped it in a function and ran it with mapply, but I get the error 'this is not a function'.

Jaccard_Index <- function(x,y)
{
  return(mapply(length(intersect(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+")))) / length(union(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+"))))))
}

mapply(Jaccard_Index,df$TitleAbstract.x1,df$TitleAbstract.y1)

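I tried replacing x and y with TitleAbstract.x1 and TitleAbstract.y1, but I still get the same error.

This may be a newbie question, but could someone help me write the correct function?

I also have two more questions:

Q2: How can I use parallel & mcapply to speed up this process?

Q3: What are R's limitations in terms of memory and speed, and would you recommend another approach for long-running, memory-hungry processes (e.g. Python via bash)?

Edit

I have uploaded the correct dataset; to keep it from being truncated I had to update RStudio.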

Answer 1

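I simplified your dataset. You could use stringdist() from the package of the same name, but it does not compute a word-level Jaccard index, so I fixed Jaccard_Index() instead. In your version, the mapply() call inside the function receives the result of the division as its first argument rather than a function, which is what triggers the "not a function" error; the body also ignores x and y and works on the whole df columns. The version here uses mapply(), but if you want to parallelize it, just replace it with mcmapply().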

df <- data.frame(
Patent=1:3, 
TitleAbstract.x1=c(
"methods testing oligonucleotide arrays methods testing oligonucleotide",
"isolation cellular material microscopic visualization method microdissection",
"support method determining analyte method producing support method producing"), 
TitleAbstract.y1=c(
"support method determining analyte method producing support method producing",
"method utilizing convex geometry laser capture microdissection process",
"methods testing oligonucleotide arrays methods testing oligonucleotide"),
stringsAsFactors=FALSE)


Jaccard_Index <- function(x, y) {
    if (length(x) == 1) {
        x <- strsplit(x, "\\s+")[[1]]
    }
    if (length(y) == 1) {
        y <- strsplit(y, "\\s+")[[1]]
    }
    length(intersect(x, y)) / length(union(x, y))
}

# Splitting the strings once, outside the loop, appears to be quicker
df$TitleAbstract.x1 <- strsplit(df$TitleAbstract.x1, "\\s+")
df$TitleAbstract.y1 <- strsplit(df$TitleAbstract.y1, "\\s+")

mapply(Jaccard_Index, df$TitleAbstract.x1, df$TitleAbstract.y1, USE.NAMES=FALSE)
# [1] 0.0000000 0.1538462 0.0000000
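
For the parallel variant mentioned above, here is a minimal sketch using mcmapply() from the parallel package. The helper name jaccard_tokens, the toy strings, and the choice of 2 cores are assumptions made for illustration, not part of the original data. Forking is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)  # ships with R; provides mcmapply()

# Hypothetical helper: word-level Jaccard index of two token vectors
jaccard_tokens <- function(x, y) {
    length(intersect(x, y)) / length(union(x, y))
}

# Toy data, made up for this sketch
x <- c("methods testing oligonucleotide arrays",
       "isolation cellular material method microdissection")
y <- c("support method determining analyte",
       "method utilizing laser capture microdissection process")

# Split once, before the parallel call
x_tok <- strsplit(x, "\\s+")
y_tok <- strsplit(y, "\\s+")

# mc.cores > 1 is not supported on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mcmapply(jaccard_tokens, x_tok, y_tok,
                USE.NAMES = FALSE, mc.cores = n_cores)
print(res)
```

Whether this is actually faster depends on the data: each pair is cheap to compute, so with 20 million rows it may be worth timing mapply() against mcmapply() on a subset first.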
