I am trying to remove rows from a data frame that contain fewer than 5 words. For example:
mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
head(mydf)
NO ARTICLE
1 34 The New York Times reports a lot of words here.
2 12 Greenwire reports a lot of words.
3 31 Only three words.
4 2 The Financial Times reports a lot of words.
5 9 Greenwire short.
6 13 The New York Times reports a lot of words again.
I want to remove the rows with 5 or fewer words. How can I do this?
Answer 0 (score: 5)
Here are two ways:
mydf[sapply(gregexpr("\\W+", mydf$ARTICLE), length) >4,]
# NO ARTICLE
# 1 34 The New York Times reports a lot of words here.
# 2 12 Greenwire reports a lot of words.
# 4 2 The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.
mydf[sapply(strsplit(as.character(mydf$ARTICLE)," "),length)>5,]
# NO ARTICLE
# 1 34 The New York Times reports a lot of words here.
# 2 12 Greenwire reports a lot of words.
# 4 2 The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.
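The first builds, for each row, a vector of the start positions of every run of non-word characters (the gaps between words, plus any trailing punctuation) and takes the length of that vector.
The second splits the ARTICLE column into vectors of its component words and takes the length of each vector. This is probably the better approach.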
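To see what each approach actually counts, here is a minimal sketch that rebuilds the sample data by hand (an assumption standing in for the Excel file) and prints both counts side by side. The names sep_counts, word_counts and kept are made up for illustration. On this particular data the two counts come out the same, because every sentence ends in a period that "\\W+" also matches, so `> 4` and `> 5` are really two different cutoffs that happen to keep the same rows here.

```r
# Sample data rebuilt by hand from the question (stands in for the Excel file)
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Approach 1: number of runs of non-word characters in each row
sep_counts <- sapply(gregexpr("\\W+", mydf$ARTICLE), length)

# Approach 2: number of space-separated tokens in each row
word_counts <- sapply(strsplit(as.character(mydf$ARTICLE), " "), length)

# Compare the two counts row by row
print(data.frame(NO = mydf$NO, sep_counts, word_counts))

# Keep rows with more than 5 space-separated tokens
kept <- mydf[word_counts > 5, ]
print(kept)
```

If the text did not end in punctuation, the first count would be one lower than the second, so it is worth checking the intermediate counts on your own data before picking a threshold.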
Answer 1 (score: 4)
The word count (wc) function in the qdap package can also help with this:
library(qdap)  # load first: read.transcript, qcv and wc all come from qdap
dat <- read.transcript(text="34 The New York Times reports a lot of words here.
12 Greenwire reports a lot of words.
31 Only three words.
2 The Financial Times reports a lot of words.
9 Greenwire short.
13 The New York Times reports a lot of words again.",
col.names = qcv(NO, ARTICLE), sep=" ")
dat[wc(dat$ARTICLE) > 4, ]
## NO ARTICLE
## 1 34 The New York Times reports a lot of words here.
## 2 12 Greenwire reports a lot of words.
## 4 2 The Financial Times reports a lot of words.
## 6 13 The New York Times reports a lot of words again.
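If installing qdap is not an option, a similar filter can be sketched in base R. This is only a rough sketch, not how wc itself works: count_words and keep_min_words are hypothetical helper names, and "word" here just means a whitespace-separated token. It makes the cutoff an explicit argument, since the question states it two ways, and it tolerates extra spaces and missing values.

```r
# Hypothetical helper: count whitespace-separated words in each string
count_words <- function(x) {
  x <- trimws(as.character(x))        # factors become text; drop outer whitespace
  n <- lengths(strsplit(x, "\\s+"))   # split on runs of whitespace
  n[is.na(x) | x == ""] <- 0L         # missing or empty text counts as zero words
  n
}

# Hypothetical helper: keep rows whose text column has at least min_words words
keep_min_words <- function(df, column, min_words = 5) {
  df[count_words(df[[column]]) >= min_words, , drop = FALSE]
}

# Small made-up sample, including messy spacing and a missing value
dat <- data.frame(
  NO = c(34, 31, 9, 7),
  ARTICLE = c("The New York Times reports a lot of words here.",
              "Only three words.",
              "  Greenwire   short. ",
              NA),
  stringsAsFactors = FALSE
)

print(count_words(dat$ARTICLE))
print(keep_min_words(dat, "ARTICLE", min_words = 5))
```

Use min_words = 5 to drop rows with fewer than 5 words, or min_words = 6 to drop rows with 5 or fewer.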