Subsetting based on a list

Time: 2015-08-01 20:52:08

Tags: r list subset tweets

I'm working with a Twitter dataset, but I haven't figured out how to subset the data based on a list of hashtags.

DF:

rowID                Hashtags
 1                   ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
 2                   onlarkonusurakpartiyapar,halkinbasbakanitokatta
 3                   kurdish,mahabad,justiceforfarinaz,kurdistan
 4                   onlarkonusurakpartiyapar
 5                   anfal,halabja,kurdistan,kobani
 6                   onlarkonusurakpartiyapar
 7                   kurdistan

Hashtags is a character list.

hashtag_list:

"onlarkonusurakpartiyapar" "kurdistan"

I tried this code, but it doesn't work for me:

new_df=df[df$Hashtags %in% hashtag_list,]

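It only gives me the subset for the "onlarkonusurakpartiyapar" hashtag. I know it looks simple, but even after going through all the posts on the site I couldn't figure it out. Thanks for your help.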

1 answer:

Answer 0: (score: 1)

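Here is a way to modify your approach: split each string on "," so that the pieces become separate hashtags, and treat a row as a match if any of those hashtags appears in the list.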
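To see why the split matters, here is a minimal sketch on a toy vector. The names `tags`, `wanted`, `whole_match`, and `split_match` are made up for illustration and are not part of the question's data. Applied to comma-joined strings, `%in%` compares each whole string against the list, so a string holding several hashtags never matches.

```r
# Toy data: each element is a comma-separated string of hashtags.
tags   <- c("a,b", "b", "c")
wanted <- c("b")

# Whole-string comparison: "a,b" is not equal to "b", so it is missed.
whole_match <- tags %in% wanted
print(whole_match)   # [1] FALSE  TRUE FALSE

# Split first, then test whether any individual hashtag is wanted.
split_match <- sapply(strsplit(tags, split = ","),
                      function(parts) any(parts %in% wanted))
print(split_match)   # [1]  TRUE  TRUE FALSE
```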

Your data

df <- data.frame(
    rowID=1:8, 
    Hashtags=c(
        "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar", 
        "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
        "kurdish,mahabad,justiceforfarinaz,kurdistan",
        "onlarkonusurakpartiyapar",
        "anfal,halabja,kurdistan,kobani",
        "onlarkonusurakpartiyapar",
        "kurdistan",
        "this,willnot,befound"
    ), 
    stringsAsFactors=FALSE
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

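Solution

# For each row, split the comma-separated string into individual hashtags
# and return TRUE if any of them appears in hashtag_list.
find_ht <- function(hashtags, hashtag_list){
    sapply(strsplit(hashtags, split=","), function(x) any(x %in% hashtag_list))
}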

Usage

find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)

Returns...

[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Edit

To perform the subset, you just need...

sub.index <- find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
df[sub.index,]

Returns

 rowID                                                     Hashtags
1     1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2     2              onlarkonusurakpartiyapar,halkinbasbakanitokatta
3     3                  kurdish,mahabad,justiceforfarinaz,kurdistan
4     4                                     onlarkonusurakpartiyapar
5     5                               anfal,halabja,kurdistan,kobani
6     6                                     onlarkonusurakpartiyapar
7     7                                                    kurdistan

Alternatively, if you want the indices, run which(sub.index). To extract only the rowID values, run df[sub.index,"rowID"]. In this case, both return [1] 1 2 3 4 5 6 7
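The question describes Hashtags as a character list. If the column really is a list column, with one character vector per row rather than one comma-separated string, `strsplit` is not needed. The following is only a sketch under that assumption: `df_list`, `find_ht_list`, and `keep` are hypothetical names, and the three rows are invented sample data.

```r
# Hypothetical data frame where Hashtags is a list column:
# each row holds a character vector of hashtags.
df_list <- data.frame(rowID = 1:3)
df_list$Hashtags <- list(
    c("anfal", "kurdistan"),
    "onlarkonusurakpartiyapar",
    c("this", "willnot", "befound")
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# vapply guarantees a logical vector, even for zero-row input.
find_ht_list <- function(hashtags, hashtag_list) {
    vapply(hashtags, function(x) any(x %in% hashtag_list), logical(1))
}

keep <- find_ht_list(df_list$Hashtags, hashtag_list)
print(keep)                     # [1]  TRUE  TRUE FALSE
print(df_list[keep, "rowID"])   # [1] 1 2
```

The logical vector is used for subsetting in the same way as sub.index above.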