Question

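I am hoping to use R to clean up some text strings from a database. The database stores text with HTML tags. Unfortunately, due to limitations of the database, each string is split into multiple fragments. I think I can figure out how to remove the HTML tags using regular expressions and help from other posts, but I don't expect those solutions to work unless I first concatenate the fragments back together (an opening and closing HTML tag can be spread across records in the data frame). Here is some sample data:

Existing data frame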

Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."

Desired output data frame

Record_nbr  fragment    Comments
1   3   "The quick brown fox jumped over the lazy dog."
2   1   "New Record."

Data:

dat <- read.table(text='Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."', header=TRUE)

Answer 1

I'm assuming you don't actually want to keep the fragment column. In that case, you can use this quick one-liner:

aggregate(Comments ~ Record_nbr, data=dat, function(x) paste(x, collapse=" "))

Answer 2

Here is one of many possible approaches:

## ensure order
dat <- with(dat, dat[order(Record_nbr, fragment), ])

do.call(rbind, lapply(split(dat, dat$Record_nbr), function(x) {
    data.frame(
        x[1, 1, drop=FALSE], 
        fragment = max(x[, 2]), 
        Comments = paste(x$Comments, collapse=" ")
    )
}))

##   Record_nbr fragment                                      Comments
## 1          1        3 The quick brown fox jumped over the lazy dog.
## 2          2        1                                   New Record.

Answer 3

Using dplyr:

library(dplyr)
dat %>% 
group_by(Record_nbr) %>% 
summarize(fragment= n(), Comments=paste(Comments, collapse= " "))

#  Record_nbr fragment                                      Comments
#1          1        3 The quick brown fox jumped over the lazy dog.
#2          2        1                                   New Record.

Answer 4

Also consider the base `aggregate` function, which may be quicker:

aggregate(dat,  by=list(dat$Record_nbr), paste, collapse=" ")

##   Group.1 Record_nbr fragment                                      Comments
## 1       1      1 1 1    1 2 3 The quick brown fox jumped over the lazy dog.
## 2       2          2        1                                   New Record.

Edit: you may need to play with the function inputs to get the exact result you want.
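One way to adjust the inputs could look like the sketch below. It assumes that `fragment` in the result should hold the highest fragment number per record, and it aggregates each column separately before merging. The names `comments`, `frags`, and `result` are only illustrative.

```r
# Sample data from the question
dat <- read.table(text='Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."', header=TRUE)

# Collapse only the Comments column, one string per record
comments <- aggregate(Comments ~ Record_nbr, data = dat, FUN = paste, collapse = " ")

# Assumption: keep the largest fragment number for each record
frags <- aggregate(fragment ~ Record_nbr, data = dat, FUN = max)

# Combine the two summaries on the record key
result <- merge(frags, comments, by = "Record_nbr")
result
```

Aggregating the columns separately avoids pasting `Record_nbr` and `fragment` together, which is what produces the `1 1 1` and `1 2 3` values in the output above.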

Answer 5

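The fragment column doesn't seem to be of any use once the fragments are joined? Perhaps:

aggregate(dat[3], dat[1], paste, collapse = " ")
#   Record_nbr                                      Comments
# 1          1 The quick brown fox jumped over the lazy dog.
# 2          2                                   New Record.

which is equivalent to

aggregate(Comments~Record_nbr, data = dat, paste, collapse = " ")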
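The question also mentions removing the HTML tags once the fragments are joined. A minimal sketch of that follow-up step is below. The data frame `html_dat` is made up for illustration (the sample data in the question contains no tags), and the pattern `<[^>]+>` is a deliberately simple assumption that would not cope with comments, scripts, or a `>` inside an attribute value.

```r
# Hypothetical data: tags are split across fragments of record 1
html_dat <- data.frame(
  Record_nbr = c(1, 1, 1, 2),
  fragment   = c(1, 2, 3, 1),
  Comments   = c("The <b>quick brown",
                 "fox</b> jumped <i",
                 ">over</i> the lazy dog.",
                 "<p>New Record.</p>"),
  stringsAsFactors = FALSE
)

# Make sure fragments are in order before joining them
html_dat <- html_dat[order(html_dat$Record_nbr, html_dat$fragment), ]

# Join the fragments of each record into one string
joined <- aggregate(Comments ~ Record_nbr, data = html_dat, FUN = paste, collapse = " ")

# Strip anything that looks like a tag, now that each tag is in one piece
joined$Comments <- gsub("<[^>]+>", "", joined$Comments)
joined
```

Joining first matters here because the `<i>` tag only becomes matchable after fragments 2 and 3 of record 1 are pasted together. For messy real-world HTML, a proper parser is likely a safer choice than a regular expression.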

R - Concatenate strings from a data frame and remove HTML tags
