Question

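I have a data frame containing a list of tweets scraped with the twitteR library, and I want to end up with a list of tweets annotated with their hashtags, one row per hashtag.

For example, I start with: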

tmp=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","no hashtags"),dummy=c('random','other','column'))
> tmp
                       tweets  dummy
1 this tweet with #onehashtag random
2         #two hashtags #here  other
3                 no hashtags column

and would like to produce:

result=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","#two hashtags #here","no hashtags"),dummy=c('random','other','other','column'),tag=c('#onehashtag','#two','#here',NA))
> result
                       tweets  dummy        tag
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other        #two
3         #two hashtags #here  other       #here
4                 no hashtags column        <NA>

I can use a regular expression:

library(stringr)
str_extract_all("#two hashtags #here","#[a-zA-Z0-9]+")

to extract the tags from a tweet into a list, perhaps with something like:

tmp$tags=sapply(tmp$tweets,function(x) str_extract_all(x,'#[a-zA-Z0-9]+'))
> tmp
                       tweets  dummy        tags
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other #two, #here
3                 no hashtags column

But I'm missing a trick somewhere and can't see how to use this list as the basis for creating the duplicated rows...

Answer 1

First, let's get the matches:

matches <- gregexpr("#[a-zA-Z0-9]+",tmp$tweets)
matches
[[1]]
[1] 17
attr(,"match.length")
[1] 11

[[2]]
[1]  1 15
attr(,"match.length")
[1] 4 5

[[3]]
[1] -1
attr(,"match.length")
[1] -1

Now we can use this to pull the right number of rows from the original data.frame:

rep(seq(matches),times=sapply(matches,length))
[1] 1 2 2 3
tmp2 <- tmp[rep(seq(matches),times=sapply(matches,length)),]
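
The tweet without a hashtag survives this step because gregexpr reports a failed match as a single -1, so that element has length 1 rather than 0. A minimal base-R sketch of the counting, where the vector tweets is an illustrative stand-in for tmp$tweets and n_per_tweet and row_index are hypothetical names:

```r
# Illustrative stand-in for tmp$tweets
tweets <- c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags")
matches <- gregexpr("#[a-zA-Z0-9]+", tweets)

# One entry per match; a tweet with no match still yields one entry (-1)
n_per_tweet <- sapply(matches, length)

# Repeat each row index once per entry
row_index <- rep(seq_along(matches), times = n_per_tweet)

print(n_per_tweet)
print(row_index)
```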

Now use the matches to get the start and end positions:

starts <- unlist(matches)
ends <- starts + unlist(sapply(matches,function(x) attr(x,"match.length"))) - 1

Extract the tags with substr:

tmp2$tag <- substr(tmp2$tweets,starts,ends)
tmp2
                         tweets  dummy         tag
1   this tweet with #onehashtag random #onehashtag
2           #two hashtags #here  other        #two
2.1         #two hashtags #here  other       #here
3                   no hashtags column
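
Note that the row without a hashtag ends up with an empty string in tag, whereas the question asked for NA. One possible follow-up, sketched here under the assumption that the tweets column is a character vector, is to mark the failed matches (start position -1) as NA after the extraction:

```r
tmp <- data.frame(
  tweets = c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags"),
  dummy = c("random", "other", "column"),
  stringsAsFactors = FALSE
)

matches <- gregexpr("#[a-zA-Z0-9]+", tmp$tweets)
tmp2 <- tmp[rep(seq_along(matches), times = lengths(matches)), ]

starts <- unlist(matches)
ends <- starts + unlist(lapply(matches, attr, "match.length")) - 1
tmp2$tag <- substr(tmp2$tweets, starts, ends)

# A failed match has start -1, for which substr() returned "";
# replace those entries with NA
tmp2$tag[starts == -1] <- NA

print(tmp2)
```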

Answer 2

Rows with tags and rows without tags behave differently, so your code will be easier to understand if you handle the two cases separately.

Use str_extract_all as before to get the tags.

tags <- str_extract_all(tmp$tweets, '#[a-zA-Z0-9]+')

(You could also use the regex character class alnum to match all alphanumeric characters: '#[[:alnum:]]+'.)
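
As a small illustration of what that class covers, the base-R sketch below runs the pattern on a made-up string (x and found are hypothetical names). The underscore is not part of alnum, so a hashtag such as #snake_case is cut short at the underscore.

```r
# [:alnum:] covers letters and digits but not the underscore
x <- "#snake_case and #tag2"
found <- regmatches(x, gregexpr("#[[:alnum:]]+", x))[[1]]
print(found)
```

If underscores should count as part of a tag, the pattern would need to be widened, for example to '#[[:alnum:]_]+'.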

Use rep to build an index that repeats each row once per tag.

index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))

Expand tmp using this index, then add the tags column.

tagged <- tmp[index, ]
tagged$tags <- unlist(tags)

Rows with no tags should appear once (not zero times) and have NA in the tags column.

has_no_tag <- sapply(tags, function(x) length(x) == 0L)
not_tagged <- tmp[has_no_tag, ]
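# rep() keeps this working even when every tweet has a tag (zero rows here)
not_tagged$tags <- rep(NA_character_, nrow(not_tagged))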

Combine the two.

all_data <- rbind(tagged, not_tagged)
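
Because rbind appends the untagged rows at the end, the combined result is not necessarily in the original tweet order. If that order matters, one option is to carry the source row number along and sort on it. The sketch below uses base R only (regmatches in place of str_extract_all), reorders the example data so the effect is visible, and introduces a hypothetical helper column source_row:

```r
# Example data with the untagged tweet first, to make the reordering visible
tmp <- data.frame(
  tweets = c("no hashtags", "#two hashtags #here", "this tweet with #onehashtag"),
  dummy = c("column", "other", "random"),
  stringsAsFactors = FALSE
)

tags <- regmatches(tmp$tweets, gregexpr("#[a-zA-Z0-9]+", tmp$tweets))
n_tags <- lengths(tags)

# Tagged rows: one copy of the source row per tag
index <- rep.int(seq_len(nrow(tmp)), n_tags)
tagged <- tmp[index, ]
tagged$tags <- unlist(tags)
tagged$source_row <- index

# Untagged rows: kept once, with NA as the tag
untagged_rows <- which(n_tags == 0L)
not_tagged <- tmp[untagged_rows, ]
not_tagged$tags <- rep(NA_character_, length(untagged_rows))
not_tagged$source_row <- untagged_rows

all_data <- rbind(tagged, not_tagged)

# order() leaves ties in place, so tags from one tweet keep their order
all_data <- all_data[order(all_data$source_row), ]
rownames(all_data) <- NULL
print(all_data)
```

The source_row column can be dropped afterwards with all_data$source_row <- NULL if it is not needed.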

Reshaping a data.frame with additional rows based on multiple items unpacked from a single column
