I am working on a Twitter dataset, but I haven't figured out how to subset the data based on a list of hashtags.
DF:
rowID Hashtags
1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 onlarkonusurakpartiyapar
5 anfal,halabja,kurdistan,kobani
6 onlarkonusurakpartiyapar
7 kurdistan
Hashtags is a character list.
hashtag_list:
"onlarkonusurakpartiyapar" "kurdistan"
I tried this code, but it doesn't work for me:
new_df=df[df$Hashtags %in% hashtag_list,]
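It only returns the subset for the "onlarkonusurakpartiyapar" hashtag. I know it looks simple, but I couldn't figure it out even after going through all the posts on the site. Thanks for your help.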
Answer 0: (score: 1)
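Here is a way to adapt your approach: treat the pieces of each entry separated by "," as distinct hashtags, and count a row as a match if any of those hashtags is in the list.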
df <- data.frame(
  rowID = 1:8,
  Hashtags = c(
    "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
    "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
    "kurdish,mahabad,justiceforfarinaz,kurdistan",
    "onlarkonusurakpartiyapar",
    "anfal,halabja,kurdistan,kobani",
    "onlarkonusurakpartiyapar",
    "kurdistan",
    "this,willnot,befound"
  ),
  stringsAsFactors = FALSE
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# Split each entry on "," and flag the row if any of its tags is in hashtag_list
find_ht <- function(hashtags, hashtag_list) {
  sapply(strsplit(hashtags, split = ","), function(x) any(x %in% hashtag_list))
}
find_ht(hashtags = df$Hashtags, hashtag_list = hashtag_list)
which returns...
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
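To see how the function arrives at one TRUE/FALSE per row, the intermediate steps can be laid out on two sample strings. This is only a minimal sketch: the names tags, per_tag and per_row are made up for illustration, and vapply is used in place of sapply purely to make the return type explicit.

```r
hashtags <- c("kurdish,mahabad,kurdistan", "this,willnot,befound")
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# strsplit() returns a list holding one character vector of tags per row
tags <- strsplit(hashtags, split = ",", fixed = TRUE)

# %in% tests every individual tag of a row against hashtag_list
per_tag <- lapply(tags, function(x) x %in% hashtag_list)

# any() collapses the per-tag results into a single logical per row
per_row <- vapply(per_tag, any, logical(1))
per_row
# [1]  TRUE FALSE
```

Only the third tag of the first string is in hashtag_list, which is enough for any() to mark that row as a match.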
To perform the subset, you just need...
sub.index <- find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
df[sub.index,]
which returns
rowID Hashtags
1 1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 4 onlarkonusurakpartiyapar
5 5 anfal,halabja,kurdistan,kobani
6 6 onlarkonusurakpartiyapar
7 7 kurdistan
Alternatively, if you want the indices, run which(sub.index). To subset only rowID, run df[sub.index,"rowID"]. In this case, both return [1] 1 2 3 4 5 6 7
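For comparison, the sketch below shows why applying %in% to the whole column misses rows, next to a possible regex-based alternative. The names whole, pattern and by_regex are hypothetical, and the regex version assumes the hashtags contain no regex metacharacters (true of the plain lowercase tags shown here, but an assumption in general).

```r
hashtags <- c("onlarkonusurakpartiyapar,halkinbasbakanitokatta",
              "kurdistan",
              "this,willnot,befound")
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# %in% compares each entry as one whole string, so only an entry that is
# exactly equal to a tag matches
whole <- hashtags %in% hashtag_list
whole
# [1] FALSE  TRUE FALSE

# Regex alternative: a tag must be bounded by the start/end of the string
# or by a comma, so a longer tag that merely contains it does not match
pattern <- paste0("(^|,)(", paste(hashtag_list, collapse = "|"), ")(,|$)")
by_regex <- grepl(pattern, hashtags)
by_regex
# [1]  TRUE  TRUE FALSE
```

The strsplit approach is arguably the safer default, since it needs no escaping; the regex form is just a compact option when the tags are known to be simple.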