I have a messy text file whose head looks like this:
2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted>
2016-07-01 02:59:34 <name redacted>
2016-07-01 03:02:48 <name > British security is a little more rigorous...
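It goes on like this for a while; it is a large file. I am doing natural language processing, and I suspect the file will be hard to annotate with a coreNLP library or package as it stands. So I would like to know how to strip out at least the dates, if not both the dates and the names.
I think I need the names, though, because eventually I want to be able to say that this person spoke 50 times, that person spoke 75 times, and so on. But I may be getting ahead of myself.
Does this require a regular expression? I am working in R.
I have not tried anything yet because I do not know where to start. How would I write code in R that reads the text selectively, keeping only the phrases and sentences that hang together meaningfully?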
Answer 0 (score: 1)
This may not need a regular expression, but if you want to use one, this expression is a simple way to do it:
(.*)(\s<name.*)
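If this is not the expression you want, you can modify it at regex101.com. You can add more boundaries as needed.
You can also visualize your expression at jex.im. The JavaScript snippet below runs the expression against the sample lines: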
const regex = /(.*)(\s<name.*)/gm;
const str = `2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted>
2016-07-01 02:59:34 <name redacted>
2016-07-01 03:02:48 <name > British security is a little more rigorous...`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
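Since the question is about R, here is a minimal sketch of the same expression used from base R. It assumes the lines are already in a character vector; the variable names (`lines`, `timestamps`, `rest`, `messages`) and the second pattern that removes the `<...>` tag are illustrative additions, not part of the answer above.

```r
# A few sample lines copied from the question
lines <- c(
  "2016-07-01 02:50:35 <name redacted> hey",
  "2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh",
  "2016-07-01 02:58:56 <name redacted>"
)

# Same expression as above; backslashes are doubled inside an R string
pattern <- "(.*)(\\s<name.*)"

# Keep group 1 (the timestamp) or group 2 (the name tag plus the message)
timestamps <- sub(pattern, "\\1", lines)
rest       <- trimws(sub(pattern, "\\2", lines))

# Assumed variant: drop the timestamp and the <...> tag in one step,
# leaving only the message text (empty for lines with no message)
messages <- sub("^\\S+ \\S+ <[^>]*>\\s*", "", lines)
```

For a real file, `lines` would come from `readLines()` on the file path instead of being typed in.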
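Answer 1 (score: 0)
Base R regular expressions, used with the gsub function, can extract each piece of information. Suppose the file looks like this example: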
2016-07-01 02:50:35 <name1 surname1> hey
2016-07-01 02:51:26 <name1 surname1> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name1 surname1> thinking about
2016-07-01 02:52:07 <name2 surname2> nothing crappy
2016-07-01 02:52:20 <name2 surname2> plane went by pretty fast
2016-07-01 02:54:08 <name2 surname2> no idea
2016-07-01 02:54:17 <name2 surname2> just know it's london
2016-07-01 02:56:44 <name1 surname1> you are probably asleep
2016-07-01 02:58:45 <name1 surname1> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name2 surname2> x
2016-07-01 02:59:34 <name1 surname2> y
2016-07-01 03:02:48 <name2 > British security is a little more rigorous...
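Now, in the R console, read the file as plain text and process it with the regular expression. The second argument of gsub is the replacement; its backreferences select which captured group of the pattern to keep.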
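your_data <- readLines(your_text_file)             # Read the file as plain text; your_text_file is the path to your file
pattern <- "(.*) <(\\S*) (\\S*)>(.*)"              # Groups: 1 = date and time, 2 and 3 = name parts, 4 = message
times <- gsub(pattern, "\\1", your_data)           # Get the date and time
person_name <- gsub(pattern, "\\2 \\3", your_data) # Get the name
message <- gsub(pattern, "\\4", your_data)         # Get the message (keeps its leading space)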
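The question also asks how many times each person spoke. One way to get there from the extracted names is `table()`; the sketch below is an assumption about what is wanted, and it uses five lines taken from the example file in place of a real `readLines()` call. `counts` and `counts_sorted` are illustrative names.

```r
# Five lines taken from the example file above
your_data <- c(
  "2016-07-01 02:50:35 <name1 surname1> hey",
  "2016-07-01 02:51:26 <name1 surname1> waiting for plane to Edinburgh",
  "2016-07-01 02:52:07 <name2 surname2> nothing crappy",
  "2016-07-01 02:56:44 <name1 surname1> you are probably asleep",
  "2016-07-01 02:58:56 <name2 surname2> x"
)

# Same pattern as in the answer: groups 2 and 3 hold the two name parts
pattern <- "(.*) <(\\S*) (\\S*)>(.*)"
person_name <- gsub(pattern, "\\2 \\3", your_data)

# Count messages per person, then order from most to least active
counts <- table(person_name)
counts_sorted <- sort(counts, decreasing = TRUE)
print(counts_sorted)
```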
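Answer 2 (score: 0)
Using the text you pasted as an example, we can do the following. Note that your description of how the text behaves when copied and pasted suggests to me that it actually contains newline characters (\n), but without a reproducible example it is hard to tell.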
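Split the single long string into lines by splitting at the boundary just before each date. If people often type dates in their messages, you can extend the pattern to include the time and the name. If people type all of that in their messages it gets complicated, but hopefully it will affect only a few messages. The problem goes away if the text really does contain line breaks as described.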
Put the lines into a data frame column and split on the whitespace before < or after > to separate the datetime, the name, and the message.
library(tidyverse)
text <- "2016-07-01 23:59:27 <John Doe> We're both signing off at the same time2016-07-02 00:00:04 <John Doe> :-)2016-07-02 00:00:28 <John Doe> I live you supercalagraa...phragrlous...esp..dociois2016-07-02 00:12:23 <Jane Doe> I love you :)2016-07-02 08:57:33"
text %>%
  str_split("(?=\\d{4}-\\d{2}-\\d{2})") %>%
  pluck(1) %>%
  enframe(name = NULL, value = "message") %>%
  separate(message, c("datetime", "name", "message"), sep = "\\s(?=<)|(?<=>)\\s", extra = "merge")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1,
#> 6].
#> # A tibble: 6 x 3
#> datetime name message
#> <chr> <chr> <chr>
#> 1 "" <NA> <NA>
#> 2 2016-07-01 23:59:… <John Do… We're both signing off at the same time
#> 3 2016-07-02 00:00:… <John Do… :-)
#> 4 2016-07-02 00:00:… <John Do… I live you supercalagraa...phragrlous...esp…
#> 5 2016-07-02 00:12:… <Jane Do… I love you :)
#> 6 2016-07-02 08:57:… <NA> <NA>
Created on 2019-05-16 by the reprex package (v0.2.1)
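If loading the tidyverse is not an option, the same idea can be sketched in base R. This is an alternative under stated assumptions, not the method above: instead of splitting on a lookahead, it inserts a newline before every full timestamp and then splits on that newline. The shortened `text`, the `stamp` and `fields` patterns, and the `chat` data frame are all illustrative.

```r
# A shortened version of the run-together string used above
text <- "2016-07-01 23:59:27 <John Doe> We're both signing off at the same time2016-07-02 00:00:04 <John Doe> :-)2016-07-02 00:12:23 <Jane Doe> I love you :)"

# Put a newline in front of every "YYYY-MM-DD HH:MM:SS" stamp, then split on it
stamp <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})"
with_breaks <- gsub(stamp, "\n\\1", text)
lines <- strsplit(with_breaks, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(lines)]  # drop the empty piece before the first stamp

# Pull datetime, name, and message out of each line
fields <- "^(\\S+ \\S+) <([^>]*)>\\s*(.*)$"
chat <- data.frame(
  datetime = sub(fields, "\\1", lines),
  name     = sub(fields, "\\2", lines),
  message  = sub(fields, "\\3", lines),
  stringsAsFactors = FALSE
)
```

Matching the full timestamp rather than the date alone makes a stray date typed inside a message less likely to trigger a split.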
Answer 3 (score: 0)
With some help, I was able to figure it out.
> a <- readLines("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> e <- data.frame(date = character(),
+                 time = character(),
+                 name = character(),
+                 text = character(),
+                 stringsAsFactors = TRUE)
> f <- strcapture(d, c, e)
> f <- f[-c(1), ]
The first row was all NA, so the last step removes it with -c(1).
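`strcapture()` returns a row of NA for every line that does not match the pattern, so more than the first row can end up empty (for example, lines with no message text, since `(.+)` needs at least one character). A hedged sketch of a more general cleanup is to keep only complete rows. The sample `lines` below, including the header line, are made up for illustration, and `proto` and `parsed` are illustrative names.

```r
# Made-up sample: one non-message line followed by two messages
lines <- c(
  "Conversation export header",
  "2016-07-01 02:50:35 <John Doe> hey",
  "2016-07-01 02:58:56 <Jane Doe> I love you :)"
)

# Same capture pattern as `d` above
pattern <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"

# Empty prototype that names and types the output columns
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = FALSE)

parsed <- strcapture(pattern, lines, proto)

# Drop every row that failed to match, wherever it occurs
parsed <- parsed[complete.cases(parsed), ]

# Messages per person
print(table(parsed$name))
```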