Question

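I have been through every regex post I could find, but I can't seem to get this to work for me.

Example of a line (some words redacted or changed):

df$text: "CommonWord #79 - EVENT type for 1200 seconds [Objects] xxx.xxx.xxx.xxx/## xxx.xxx.xxx.xxx/## Port: ##"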

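I want to extract the number after the # and put it in a new column. I tried:

df$number <- sub("\\#([0-9]{2,4}).*", "\\1", df$text)

The result is "CommonWord 79". I can't seem to find the right regex to remove the first word.
For the next regex I want to pull "EVENT type" into another column. Both "EVENT" and "type" can change, so I need to pull the text after the "-" and before "for".
The last two regexes I need are for the IP address with its subnet mask, and then the port number (digits only). I need all of these in new columns.

Sorry for the long question. I've been banging my head against this one.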

Solved part 1, the event type, and the port:

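# number after "#"
df$number <- sub(".*\\#(\\d{1,4}).*", "\\1", df$text)
# two words after the "-" (the event and its type)
df$attackType <- sub(".*\\-.(\\w+\\s\\w+).*","\\1", df$text)
# digits after the last ":" (the port)
df$port <- as.numeric(sub(".*\\:\\s*(\\d+).*","\\1", df$text))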

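I'm still having trouble with the IP address: only one digit of the first group of numbers comes back. For example, the actual IP is 127.0.0.1/28 but I get 7.0.0.1/28. Once I figure out how to get the IP address/mask, I need to work out how to find multiple results in the text. The regex is verbose - I hope to optimize it later: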

df$IPs <- sub(".*(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/\\d{2,}).*","\\1", df$text)
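A note on the truncated IP: the leading `.*` is greedy, so it swallows as much of the line as it can and leaves only a single digit for the first `\\d{1,3}`. One way around that, and a way to pick up several addresses per line, is to skip `sub()` and pull the matches directly with `gregexpr()` and `regmatches()`. The sketch below is only an illustration: the sample line, the second address `10.0.0.5/24`, and the names `ip_pattern`, `ip_list`, and `ip_flat` are assumptions, not part of the original data.

```r
# Sample line; the second address is made up for illustration
text <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 127.0.0.1/28 10.0.0.5/24 Port: 80"

# IPv4-looking address followed by "/" and a 1-2 digit mask.
# No leading ".*", so nothing can eat the first octet.
ip_pattern <- "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}/\\d{1,2}"

# gregexpr() finds every match; regmatches() returns them as a list,
# one character vector per input string
ip_list <- regmatches(text, gregexpr(ip_pattern, text, perl = TRUE))
ips <- ip_list[[1]]

# For a data frame column, one option is to collapse each row's matches
# into a single space-separated string
ip_flat <- sapply(ip_list, paste, collapse = " ")
```

With a data frame, the same two calls can be applied to `df$text` in place of `text`; rows without an address end up as empty strings in `ip_flat`.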

Answer 1

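Are those x's supposed to represent digits? Some real values would help, especially since IP addresses don't strictly follow that pattern.

Anyway, I've added a few things to search for. I like using the rex package together with stringr::str_view_all to test regex patterns. The matches are highlighted in the Viewer pane.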

text <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/## xxx.xxx.xxx.xxx/## Port: 80"
library(stringr)
library(rex)

# show matches where at least one digit follows #
str_view_all(text, rex(at_least(digit, 1) %if_prev_is% "#"))

# show matches where characters are after - and before 'for'
str_view_all(text, rex((prints %if_prev_is% "-") %if_next_is% "for"))

# show matches where each x group in your IP text is 1-3 digits, ending with /
str_view_all(text, rex(between(digit, 1, 3), dot, 
                       between(digit, 1, 3), dot, 
                       between(digit, 1, 3), dot, 
                       between(digit, 1, 3), "/"))

# show matches where digits follow 'Port:'
str_view_all(text, rex(digits %if_prev_is% "Port: "))
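The calls above only highlight matches. To actually fill new columns, the same lookaround ideas can be written as plain PCRE patterns in base R. This is a rough sketch rather than a tested solution for the real data: `extract_first` is a hypothetical helper, the mask `/28` in the sample line is made up, and the patterns assume the line always has the shape shown in the question.

```r
df <- data.frame(
  text = "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/28 Port: 80",
  stringsAsFactors = FALSE
)

# Hypothetical helper: first match of a PCRE pattern per element,
# NA where the pattern does not match
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# digits preceded by "#"
df$number <- as.numeric(extract_first(df$text, "(?<=#)\\d+"))
# text after "- " and before " for"
df$event <- extract_first(df$text, "(?<=- ).+?(?= for)")
# digits preceded by "Port: "
df$port <- as.numeric(extract_first(df$text, "(?<=Port: )\\d+"))
```

Returning NA for non-matching rows is a design choice here: it keeps every new column the same length as `df$text` and makes unexpected lines easy to spot.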

Answer 2

You just need to add ".*" before the # to match any characters in front of the number:

sub(".*\\#([0-9]{2,4}).*", "\\1", x)

# create the new column

df$new_col <- as.numeric(sub(".*\\#([0-9]{2,4}).*", "\\1", df$text))
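To see why the leading ".*" matters: `sub()` replaces only the part of the string that the pattern matches, so anything outside the match is kept. A small side-by-side sketch, using a shortened version of the example line (the variable names are just for illustration):

```r
x <- "CommonWord #79 - EVENT type for 1200 seconds"

# Without ".*" the match starts at "#", so "CommonWord " is left in place
without_prefix <- sub("\\#([0-9]{2,4}).*", "\\1", x)

# With ".*" the match covers the whole string, so only the captured digits remain
with_prefix <- sub(".*\\#([0-9]{2,4}).*", "\\1", x)

# When nothing matches, sub() returns the input unchanged
unchanged <- sub(".*\\#([0-9]{2,4}).*", "\\1", "no number here")
```

Because an unmatched line comes back unchanged, wrapping the call in `as.numeric()` turns such rows into NA (with a warning), which is worth checking for after creating the column.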

Regular expression to extract text from a data frame and insert it into new columns
