How do I get the first 10 words of a string in R?

Time: 2014-01-12 22:13:12

Tags: r csv

I have a string in R:

x <- "The length of the word is going to be of nice use to me"

I want the first 10 words of the string above.

Also, suppose I have a CSV file in the following format:

Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston

I want to take only the first 10 words from the Keyword column of each row and write the result to a CSV file. Please help.

4 answers:

Answer 0 (score: 18)

A regular expression (regex) answer using \w (word character) and its negation \W:

gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
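  1. ^ start of the string (zero-width)
  2. ((\\w+\\W+){9}\\w+) ten words separated by non-word characters
    1. (\\w+\\W+){9} a word followed by non-word characters, nine times
      1. \\w+ one or more word characters (i.e. a word)
      2. \\W+ one or more non-word characters (e.g. a space)
      3. {9} nine repetitions
    2. \\w+ the tenth word
  3. .* anything else, including any following words
  4. $ end of the string (zero-width)
  5. \\1 once the pattern matches, replace the match with the first captured group (the ten words)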
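
As a quick check of the breakdown above, here is a minimal sketch that applies the same pattern. The variable names and the short test string are illustrative assumptions, not part of the original answer. It also shows one behaviour worth knowing: when the input has fewer than ten words the pattern does not match, so the string comes back unchanged.

```r
x <- "The length of the word is going to be of nice use to me"

# The pattern is anchored with ^ and $, so it can match at most once per
# string; sub() is therefore enough here (gsub() gives the same result).
pattern <- "^((\\w+\\W+){9}\\w+).*$"

first_ten <- sub(pattern, "\\1", x)
print(first_ten)

# Fewer than ten words: no match, so the input is returned as-is.
short <- "only four words here"
short_result <- sub(pattern, "\\1", short)
print(short_result)

# sub() is vectorised, so a whole character column can be processed at once.
both <- sub(pattern, "\\1", c(x, short))
print(both)
```

Because the call is vectorised, the same line could be applied directly to a data frame column without an explicit loop.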

Answer 1 (score: 6)

How about using the word function from Hadley Wickham's stringr package?

word(string = x, start = 1, end = 10, sep = fixed(" "))

Answer 2 (score: 3)

Here is a small function that splits the string into words, takes the first ten, and pastes them back together.

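string_fun <- function(x) {
  # Split on runs of whitespace; head() avoids NA padding when there are fewer than ten words
  ul <- head(unlist(strsplit(x, split = "\\s+")), 10)
  paste(ul, collapse = " ")
}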

string_fun(x)

df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
                 This is an experimental basis program string is or are in,Seattle
                 Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)

df <- as.data.frame(df)

Using apply (the function does nothing with the second column as long as the first holds at least ten words):

df$Keyword <- apply(df[,1:2], 1, string_fun)

Edit: this is probably a more general way to use the function.

df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))

print(df)
#                      Keyword                            City.Column.Header.
# 1    The length of the string should not be more than            New York
# 2  The Keyword should be of specific length is or are         Los Angeles
# 3  This is an experimental basis program string is or             Seattle
# 4      Please help me with getting only the first ten              Boston
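
The question also asks for the result to be written back to a CSV file, which the code above stops short of. The following is a minimal sketch of the full round trip, under some assumptions: the helper name first_n_words, the simplified City header, and the temporary file paths are all illustrative and stand in for the asker's real file. It trims surrounding whitespace before splitting so that indented lines do not produce an empty first word.

```r
# Hypothetical helper: keep at most n whitespace-separated words.
first_n_words <- function(s, n = 10) {
  words <- strsplit(trimws(s), "\\s+")[[1]]
  paste(head(words, n), collapse = " ")
}

# Stand-in for the asker's input file, written to a temporary location.
in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Keyword,City",
  "The length of the string should not be more than 10 is or are in,New York",
  "Please help me with getting only the first ten,Boston"
), in_file)

df <- read.csv(in_file, stringsAsFactors = FALSE)

# vapply() guarantees exactly one character value per row.
df$Keyword <- vapply(df$Keyword, first_n_words, character(1), USE.NAMES = FALSE)

# row.names = FALSE keeps R's row numbers out of the output file.
write.csv(df, out_file, row.names = FALSE)
print(readLines(out_file))
```

With a real file, in_file and out_file would simply be replaced by the actual paths.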

Answer 3 (score: 2)

x <- "The length of the word is going to be of nice use to me"
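# strsplit() returns a list, so [[1]] extracts the word vector before head() takes the first ten
head(strsplit(x, split = " ")[[1]], 10)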