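I need to find the best way to match an exact pattern combining symbols, letters, and digits in an unstructured data set stored in a CSV or text file.
I need to extract exactly the pattern "BR1*********" (BR1 followed by exactly 9 digits), which sits in the middle of the lines starting with :61:, and the pattern "?54***" (?54 followed by exactly 3 digits), which is always at the end of the :61: line.
Both patterns recur, but with different digit combinations.
So far I have tried grep and grepl without success. I always get the whole line back whenever the pattern roughly matches, rather than only the exact run of symbols and digits.
Here is a small part of the data set: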
:11:hgttu6576575?//80&&80980jhkhkhlkhkh gjdggfjsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi
jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160
:11:hgggdgf79878yiuhlkhkh gjdggfhuihiuhuiou89 ioiojsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi
jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190
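Answer 0 (score: 3)
strapplyc can be used to extract these parts. Here we extract the whole string; if you only want the numeric part, put parentheses around the numeric part of the pattern, e.g. pat1 <- "BR(1\\d{9})"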
library(gsubfn)
pat1 <- "BR1\\d{9}"
pat2 <- "[?]54\\d{3}$"
strapplyc(lines, pat1, simplify = c)
## [1] "BR1678899458" "BR1234885765"
strapplyc(lines, pat2, simplify = c)
## [1] "?54160" "?54190"
Or both at once:
strapplyc(lines, paste(pat1, pat2, sep = "|"), simplify = c)
## [1] "BR1678899458" "?54160" "BR1234885765" "?54190"
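If you want the line numbers (i.e. 1 for the first line, 2 for the second, and so on) rather than the values themselves, use grep with the same patterns.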
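To make the difference between locating lines and extracting matches concrete, here is a minimal base R sketch. The two-element vector x is made up for illustration and is not the asker's file; it also shows why grep with value = TRUE hands back the whole line.

```r
# Two illustrative lines (an assumption, not the real file)
x <- c(":27:jhkjhuiy867tjhfsh",
       ":61:kjljlkfjsdlBR1678899458iyuyugug7787")

pat <- "BR1[0-9]{9}"

idx   <- grep(pat, x)                    # indices of the lines that match
hit   <- grepl(pat, x)                   # one TRUE/FALSE per line
whole <- grep(pat, x, value = TRUE)      # the entire matching line
only  <- regmatches(x, regexpr(pat, x))  # just the matched substring

print(idx)
print(only)
```

grep and grepl only answer "which lines match"; something like regmatches (or strapplyc above) is needed to pull out the matched text itself.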
Added: If there are only a few thousand lines, reading the file in should not be a problem:
lines <- readLines("File.txt")
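If the file really is too large to read in, you could use read.csv.sql from the sqldf package. In one line of code it sets up an SQLite database, reads the file into it, and then extracts a subset of the lines into R. We have assumed that the character passed as sep (? in the code below) does not occur in the file; if it does, as it does in the sample data above, use some other separator that is not in the file: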
library(sqldf)
lines <- read.csv.sql("File.txt", header = FALSE, sep = "?",
sql = "select * from file where V1 like '%BR1%' or V1 like '%54%'")
# now use strapplyc as above
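As an alternative that stays in base R (this is a swapped-in technique, not part of the sqldf approach above), a large file can be scanned in chunks through a connection so that only a few lines are held in memory at once. This is a rough sketch: the temporary file, the tiny chunk_size, and the variable names are all assumptions for illustration.

```r
# Write a small temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c(":61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi",
             "jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160",
             ":27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj"), tmp)

pat <- "BR1[0-9]{9}|[?]54[0-9]{3}$"
chunk_size <- 2L        # hypothetical; a real run would use a much larger value
found <- character(0)

con <- file(tmp, open = "r")
repeat {
  chunk <- readLines(con, n = chunk_size)   # next block of lines
  if (length(chunk) == 0L) break            # end of file reached
  found <- c(found, unlist(regmatches(chunk, gregexpr(pat, chunk))))
}
close(con)
unlink(tmp)

print(found)
```

Because each line is matched on its own, splitting the file into chunks does not change which matches are found.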
Answer 1 (score: 2)
Try
library(stringr)
unlist(str_extract_all(lines, "(BR1\\d{9})|(\\?54\\d{3})"))
#[1] "BR1678899458" "?54160" "BR1234885765" "?54190"
If it is a huge file, you can use stringi, which will be faster:
library(stringi)
na.omit(unlist(stri_extract_all_regex(lines, "(BR1\\d{9})|(\\?54\\d{3})")))
#[1] "BR1678899458" "?54160" "BR1234885765" "?54190"
lines <- readLines(textConnection(':11:hgttu6576575?//80&&80980jhkhkhlkhkh gjdggfjsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160
:11:hgggdgf79878yiuhlkhkh gjdggfhuihiuhuiou89 ioiojsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190'))
Answer 2 (score: 2)
dat <- readLines(textConnection(":11:hgttu6576575?//80&&80980jhkhkhlkhkh gjdggfjsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi
jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160
:11:hgggdgf79878yiuhlkhkh gjdggfhuihiuhuiou89 ioiojsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi
jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190"))
library(stringr)
unlist(str_match_all(dat, "(BR1[[:digit:]]{9})|(\\?54[[:digit:]]{3})"))
## [1] "BR1678899458" "BR1678899458" "" "?54160"
## [5] "" "?54160" "BR1234885765" "BR1234885765"
## [9] "" "?54190" "" "?54190"
If we knew more about the format you need, we could help you better.
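The repeated values and empty strings in the output above appear because str_match_all returns, for each match, the full match followed by one column per capture group. As one guess at a tidier output format, the sketch below pairs each BR1 code with its ?54 code in a data frame using base R only. It assumes every record contains exactly one of each, in the same order; the column names br_code and q_code are made up.

```r
# Shortened copy of the sample data (only the lines that carry the codes)
dat <- c(":61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi",
         "jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160",
         ":61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi",
         "jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190")

# Extract each pattern separately; gregexpr returns full matches only
br  <- unlist(regmatches(dat, gregexpr("BR1[[:digit:]]{9}", dat)))
q54 <- unlist(regmatches(dat, gregexpr("[?]54[[:digit:]]{3}$", dat)))

# Pairing by position only makes sense if the counts agree (assumption)
stopifnot(length(br) == length(q54))

out <- data.frame(br_code = br, q_code = q54, stringsAsFactors = FALSE)
print(out)
```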
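Answer 3 (score: 2)
My reading of the natural-language description is that only 2 lines meet the requirements. For example, line 4 does have a "?54nnn$" pattern, but that line does not start with ":61:":
dat=readLines(textConnection(":11:hgttu6576575?//80&&80980jhkhkhlkhkh gjdggfjsdf?kjhkuhsfk778798978**&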
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi
jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160
:11:hgggdgf79878yiuhlkhkh gjdggfhuihiuhuiou89 ioiojsdf?kjhkuhsfk778798978**&
:27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
:61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi
jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190"))
> grep("^:61:.+(BR1\\d{9}|[?]54\\d{3}$)", dat)
#[1] 3 7
Modifying the test case to check whether my suggested pattern does what I think the question asks for:
> dat=readLines(textConnection(":11:hgttu6576575?//80&&80980jhkhkhlkhkh gjdggfjsdf?kjhkuhsfk778798978**&
+ :27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
+ :61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi
+ :61:jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160
+ :11:hgggdgf79878yiuhlkhkh gjdggfhuihiuhuiou89 ioiojsdf?kjhkuhsfk778798978**&
+ :27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj
+ :61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi
+ jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190")
+ )
> grep("^:61:.+(BR1\\d{9}|[?]54\\d{3}$)", dat)
[1] 3 4 7
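If the line without a tag is really a wrapped continuation of the preceding :61: line (an assumption about the file format that the question does not confirm), one option is to glue continuation lines back onto their record before matching. A minimal base R sketch under that assumption, with made-up variable names:

```r
dat <- c(":11:hgttu6576575?//80&&80980jhkhkhlkhkh gjdggfjsdf?kjhkuhsfk778798978**&",
         ":27:jhkjhuiy867tjhfsh/.>?kjklh8ggdhkotrdkhofkhodkgj",
         ":61:kjljlkfjsdlBR1678899458iyuyugug7787?>?///uhhiuyi",
         "jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54160",
         ":61:kjljlkfjsdlBR1234885765iyuyugug7787?>?///uhhiuyi",
         "jhkhkjhiy878697y8hukjlu97 ??///khiuy8oujhuhijk?54190")

# Assumption: a line starting with ":NN:" opens a new record,
# and any other line continues the record before it
starts    <- grepl("^:[0-9]{2}:", dat)
record_id <- cumsum(starts)

# Paste the lines of each record back into a single string
records <- unname(vapply(split(dat, record_id), paste,
                         character(1), collapse = " "))

# Keep only the :61: records, then extract both codes from each
rec61 <- records[grepl("^:61:", records)]
br    <- regmatches(rec61, regexpr("BR1[0-9]{9}", rec61))
q54   <- regmatches(rec61, regexpr("[?]54[0-9]{3}$", rec61))

print(br)
print(q54)
```

With the records rejoined, the ?54 code really is at the end of a :61: record, so the $ anchor and the :61: prefix can both be enforced.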
Answer 4 (score: 2)
You can use base R to handle the extraction of these matches.
> unlist(regmatches(lines, gregexpr('BR1\\d{9}|\\?54\\d{3}', lines)))
# [1] "BR1678899458" "?54160" "BR1234885765" "?54190"
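One caveat about "exactly 9 digits": a pattern such as BR1\\d{9} also matches the first 9 digits of a longer run. If that matters for the real data, a negative lookahead with perl = TRUE can rule it out. The two test strings below are made up for illustration.

```r
x <- c("aaBR1678899458bb",     # BR1 followed by exactly 9 digits
       "aaBR16788994580bb")    # BR1 followed by 10 digits

# Without a boundary check, the 10-digit case still yields a 9-digit match
loose <- unlist(regmatches(x, gregexpr("BR1\\d{9}", x)))

# (?!\\d) requires that no further digit follows; lookaheads need perl = TRUE
strict <- unlist(regmatches(x, gregexpr("BR1\\d{9}(?!\\d)", x, perl = TRUE)))

print(loose)
print(strict)
```

For the ?54 pattern the same effect is already achieved by the $ anchor used in Answer 0, since nothing may follow the three digits.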