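Combining fragmented sentences in an R data frame

Time: 2015-11-15 11:29:00

Tags: regex r data-cleansing srt

I have a data frame in which parts of whole sentences are spread, in some cases, across multiple rows.

For example, head(mydataframe) returns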

#  1 Do you have any idea what
#  2  they were arguing about?
#  3          Do--Do you speak
#  4                  English?
#  5                     yeah.
#  6            No, I'm sorry.

Assume that a sentence can be terminated by "." or "?" or "!" or "...".

Is there any R library function capable of outputting the following:

#  1 Do you have any idea what they were arguing about?
#  2          Do--Do you speak English?
#  3                     yeah.
#  4            No, I'm sorry.

2 answers:

Answer 0: (score: 4)

This works for all sentences ending in ".", "...", "?" or "!":
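# Join all fragments into one string, separated by single spaces
x <- paste0(foo$txt, collapse = " ")
# Split after a terminator (?, . or !) that is followed by whitespace, then trim each piece
trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl=TRUE)))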

Hat tip to @AvinashRaj for the pointers on the lookbehind.

Which gives:

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah..."                                           
#[4] "No, I'm sorry." 
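
To see why the lookahead matters, here is a small sketch on a made-up string (`x2` is an assumption for illustration, not part of the OP's data). With the lookbehind alone, the split already fires after the first dot of "...", so the ellipsis gets broken up. Requiring whitespace after the terminator keeps it in one piece.

```r
# Hypothetical test string containing an ellipsis
x2 <- "Wait... What? No!"

# Lookbehind only: the first piece ends after the first dot ("Wait.")
lb_only <- strsplit(x2, "(?<=[?.!])", perl = TRUE)[[1]]

# Lookbehind plus lookahead: split only where whitespace follows the terminator
lb_la <- trimws(strsplit(x2, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]])
lb_la
#[1] "Wait..." "What?"   "No!"
```

The trailing "No!" has no whitespace after it, so no split happens there and the last piece is simply the remainder of the string.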

Data

I modified the toy dataset to include a case where a string ends with ... (as requested by the OP):

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)
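
The question asks for a data frame rather than a character vector. A minimal sketch of that last step, assuming the same `foo` as above and a result named `out` (the name and the `num`/`txt` column layout of the result are my assumptions), could look like this:

```r
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

# Join the fragments, split on terminators followed by whitespace, trim the pieces
x <- paste0(foo$txt, collapse = " ")
sentences <- trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))

# Rebuild a data frame with one complete sentence per row
out <- data.frame(num = seq_along(sentences),
                  txt = sentences,
                  stringsAsFactors = FALSE)
out
```

Any other columns of the original data frame (for example subtitle timestamps) are not carried over by this sketch. They would need their own aggregation rule.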

Answer 1: (score: 3)

Here is what I got. I am sure there are better ways of doing this. Here I used base functions. I created a sample data frame called foo. First, I created a single string containing all the text in txt. toString() adds commas, so I removed them in the first gsub(). Then, I handled white space (more than 2 spaces) in the second gsub(). Then, I split the string by the delimiters you specified. With credit to Tyler Rinker, I managed to keep the delimiters in strsplit(). The final job was to remove the white space in sentence-initial positions. Then, unlist the list.

Edit: Steven Beaupré made a revision of my code. This is the way to go!

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

library(magrittr)

toString(foo$txt) %>%
  gsub(pattern = ",", replacement = "", x = .) %>%
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
  lapply(., function(x) {gsub(pattern = "^ ", replacement = "", x = x) }) %>%
  unlist

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah."
#[4] "No I'm sorry."
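
Note that stripping every comma also removes commas that belong to the text: "No, I'm sorry." comes out as "No I'm sorry." in the output above. A minimal base R sketch that avoids this, assuming it is acceptable to join the rows with paste() instead of toString() (my substitution, not the answerer's code), might be:

```r
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

# paste() with collapse = " " joins the rows without inserting commas
joined <- paste(foo$txt, collapse = " ")

# Split after each terminator; the lookbehind keeps the delimiter in the piece
pieces <- strsplit(joined, "(?<=[?.!])", perl = TRUE)[[1]]

# Drop the single leading space left at the start of each later piece
sentences <- sub("^ ", "", pieces)
sentences
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah."
#[4] "No, I'm sorry."
```

Like the pipeline above, this variant splits after every single terminator, so it is only suited to data without "..." endings.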