How to split a sentence in half in R

Time: 2019-02-27 12:02:13

Tags: r string

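I have a vector of strings, and I would like to split each string roughly in half, at the nearest space.

For example, with the following data: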

test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
      "qsdf",
      "mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
      "qsddddddddddddddddddddddddddddddd",
      "qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)

I would like to get something like this:

                              first                                       sec
1          qsdf mqsldkfop mqsdfmlk                lksdfp pqpdfm mqsdfmj mlk
2                              qsdf                                    
3                        mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd                                        
5        qsdfmlk mlk mkljlmkjlmkjml                        lmj mjjmjmjm lkj

Any solution that does not cut exactly in half but instead "keeps the first part at no more than X characters" would also be fine.

2 answers:

Answer 0: (score: 3)

First, we split each string on spaces.

a <- strsplit(test$init, " ")

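Then, for each vector of words, we find the last word at which the cumulative character count does not exceed half of the vector's total character count: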

b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
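
To see why this expression picks the last word of the first half, it may help to trace it on a single sentence. The following is only an illustrative sketch: the variable names (`words`, `lens`, `running`, `fits`, `split_at`) are made up here and are not part of the answer. Note that the counts ignore the spaces between words.

```r
# Trace the which.max(cumsum(...)) idiom on the last sentence of test$init.
words <- strsplit("qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj", " ")[[1]]

lens    <- nchar(words)               # letters per word: 7 3 14 3 8 3
running <- cumsum(lens)               # running total:    7 10 24 27 35 38
fits    <- running <= sum(lens) / 2   # TRUE while still within the first half

# cumsum(fits) stops growing after the last TRUE, and which.max() returns the
# first position of the maximum, which is the index of that last TRUE.
split_at <- which.max(cumsum(fits))

split_at                    # 2
words[seq_len(split_at)]    # "qsdfmlk" "mlk"
```

If no word fits within the first half (for example when the first word is very long), `cumsum(fits)` is all zeros and `which.max()` falls back to index 1.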

After that, we paste the two halves back together, using NA for the second half when the vector has length 1 (a single word).

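# Build c(first, second) for each sentence; single-word sentences get NA as the second half.
combined <- Map(function(x, y) {
  if (length(x) == 1) {
    # Test the word count rather than y: y is also 1 when a multi-word
    # sentence should be split after its first word.
    return(c(x, NA))
  } else {
    return(c(paste(x[1:y], collapse = " "),
             paste(x[(y + 1):length(x)], collapse = " ")))
  }
}, a, b)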

Finally, we rbind the combined strings and rename the columns.

newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")

Result:

> newdf
                              first                                  second
1           qsdf mqsldkfop mqsdfmlk               lksdfp pqpdfm mqsdfmj mlk
2                              qsdf                                    <NA>
3                        mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd                                    <NA>
5                       qsdfmlk mlk         mkljlmkjlmkjml lmj mjjmjmjm lkj
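
This approach always cuts at or before the midpoint, which is why row 5 is split after "mlk". If the cut should instead fall on the space closest to the middle of the string, something like the following might work. It is a minimal base R sketch, not part of the original answer: `split_nearest_space` is a hypothetical helper name, and strings without a space are returned whole with NA as the second half.

```r
# Hypothetical helper: split each string at the space nearest to its midpoint.
split_nearest_space <- function(x) {
  halves <- lapply(x, function(s) {
    spaces <- gregexpr(" ", s, fixed = TRUE)[[1]]
    if (spaces[1] == -1) {
      return(c(s, NA_character_))   # no space: nothing to split on
    }
    # Pick the space whose position is closest to the middle of the string
    # (the first such space wins in case of a tie).
    cut <- spaces[which.min(abs(spaces - nchar(s) / 2))]
    c(substr(s, 1, cut - 1), substr(s, cut + 1, nchar(s)))
  })
  out <- as.data.frame(do.call(rbind, halves), stringsAsFactors = FALSE)
  names(out) <- c("first", "second")
  out
}

res <- split_nearest_space(c("qsdf", "qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"))
res
```

With this variant the second string is cut after "mkljlmkjlmkjml", because that space (position 27 of 43) is closer to the middle than the one after "mlk" (position 12).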

Answer 1: (score: 2)

You can use the nbreak function from a package I wrote:

devtools::install_github("igorkf/breaker")
library(tidyverse)

test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = FALSE)

# Count the number of words in each row:
nwords <- str_count(test$init, " ") + 1

# Position (in words) at which to break each row; ceiling() gives the first
# half the extra word when the count is odd:
break_here <- ceiling(nwords / 2)

test
#                        init
# 1     Phrase with four words
# 2 That phrase has five words

# map2_chr applies a two-argument function row by row:
# .x is the string from "init" and .y is the matching "break_here" value.
test %>%
  mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = FALSE))) %>%
  separate(init, c("first", "second"), sep = "\n")
#             first     second
# 1     Phrase with four words
# 2 That phrase has five words
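
If installing a package from GitHub is not an option, the same word-count split can be approximated in base R. This is only a sketch under that assumption and does not reproduce how breaker works internally: `split_by_words` is a hypothetical name, and the first half receives the extra word when the word count is odd.

```r
# Hypothetical base R helper: put the first ceiling(n / 2) words in `first`
# and the remaining words in `second`.
split_by_words <- function(x) {
  halves <- lapply(strsplit(x, " ", fixed = TRUE), function(w) {
    k <- ceiling(length(w) / 2)
    c(paste(head(w, k), collapse = " "),
      paste(tail(w, length(w) - k), collapse = " "))
  })
  out <- as.data.frame(do.call(rbind, halves), stringsAsFactors = FALSE)
  names(out) <- c("first", "second")
  out
}

res <- split_by_words(c("Phrase with four words", "That phrase has five words"))
res
```

A single-word string ends up with an empty string in `second` rather than NA, so that case would need separate handling if NA is preferred.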