Jaccard distance between tweets

Time: 2016-04-01 19:01:10

Tags: json r twitter set stringdist

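I am currently trying to measure the Jaccard distance between tweets in a dataset.

The dataset is available here:

http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json

I have tried a couple of ways of measuring the distance. Here is what I have so far.

I saved the linked dataset to a file named Tweets.json: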
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")),collapse=",")))

Then I copied json_alldata into tweet.features and dropped the geo column:

# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL

Here is what the first two tweets look like:

> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

The first thing I tried was the stringdist function from the stringdist package:
install.packages("stringdist")
library(stringdist)

#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")

When I run this, I get:

[1] 0.1621622

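However, I'm not sure this is correct. By my count, A intersect B = 23 and A union B = 25. The Jaccard distance is (A intersect B) / (A union B), right? So by my calculation, the Jaccard distance should be 0.92?

So I thought I could do this with sets: just count the elements in the intersection and in the union, then divide.

This is what I tried: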

# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])

When I try the intersection, the output is just list():

 Intersection <- intersect(A1, A2)
 list()

When I try the union, I get:

union(A1,A2)

[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"

[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

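This doesn't seem to break the tweets up into sets of words.

I assumed I could simply divide the intersection by the union, but I think I need the program to count the words in each set first and then do the calculation.

Needless to say, I'm a bit stuck, and I'm not sure whether I'm on the right track.

Any help would be appreciated. Thanks.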

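1 Answer:

Answer 0: (score: 3)

intersect and union expect vectors (as.set does not exist in base R). I assume you want to compare words, so you can use strsplit, though how you split is up to you. An example follows: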

tweet.features <- list(tweet1="RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
                       tweet2=          "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")

jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}

jaccard_i(tweet.features[[1]], tweet.features[[2]])

$i
[1] 20

$u
[1] 23

$j
[1] 0.8695652

Is this what you were looking for?
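As for the 0.1621622 from stringdist: as far as I know, method = "jaccard" there compares character q-grams (q = 1 by default) rather than words, and it returns a distance, that is, 1 minus the similarity. A base-R sketch under that assumption reproduces the number (char_set is a helper name made up for this illustration, not part of stringdist):

```r
# The two tweets from the question.
tw1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tw2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

# Hypothetical helper: the set of distinct characters (1-grams) in a string.
# Assumption: this mirrors what stringdist compares when q = 1.
char_set <- function(s) unique(strsplit(s, "")[[1]])

a <- char_set(tw1)
b <- char_set(tw2)

# Jaccard distance = 1 - |A intersect B| / |A union B|
char_jaccard_dist <- 1 - length(intersect(a, b)) / length(union(a, b))
print(char_jaccard_dist)
# [1] 0.1621622
```

The two tweets share 31 of 37 distinct characters, so the character-level distance is 6/37. The word-level figure above (j = 20/23) is a similarity; the matching distance would be 1 - 20/23.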

In jaccard_i, strsplit splits on every space or period. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
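One possible refinement, as a rough sketch only: lowercase the text, split on runs of whitespace, strip trailing punctuation, and return a distance (1 minus the similarity). The names tokenize and jaccard_distance and the choice of which punctuation to strip are assumptions made for illustration; adjust them to whatever counts as a "word" in your data.

```r
# The two tweets from the question.
tw1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tw2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

# Hypothetical tokenizer: lowercase, split on whitespace,
# drop trailing punctuation such as "." or ":", discard empty tokens.
tokenize <- function(tweet) {
  words <- unlist(strsplit(tolower(tweet), "[[:space:]]+"))
  words <- gsub("[.,:;!?]+$", "", words)
  unique(words[nzchar(words)])
}

# Word-level Jaccard distance: 1 - |A intersect B| / |A union B|
jaccard_distance <- function(tw1, tw2) {
  a <- tokenize(tw1)
  b <- tokenize(tw2)
  1 - length(intersect(a, b)) / length(union(a, b))
}

print(jaccard_distance(tw1, tw2))
```

With this tokenizer, "victims." and "victims" count as the same word and no empty token is produced, so the two tweets differ only in the retweeted handle.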