I have the following input sentence:
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635] ASSIGNOR"
I want to extract the words between AND (which should be included in the output) and ASSIGNOR (which should be excluded from the output).
Expected output:
"AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
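The real strings contain many more words before and after ASSIGNEE. I only want to capture the middle part shown above.
This is my attempt so far, which does not produce the desired output: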
sub(".*ASSIGNEE.* *(AND.*?) *ASSIGNOR.*", "\\1", B)
# [1] "AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
Thanks.
Answer 0 (score: 1)
Using stringr and regex:
library(stringr)
str_extract(B, regex("(?=AND)(?s)(.*$)"))
# [1] "AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635] ASSIGNOR"
For a regex reference, see Regular Expression Reference: Special Groups.
If you want the words between AND and ASSIGNOR, you can modify the regex as follows:
str_extract(B, regex("(?=AND)(.*?)(?=ASSIGNOR)"))
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND ASSIGNOR"
# "AND "
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND The Man in the iron mask other more strings ASSIGNOR"
# "AND The Man in the iron mask other more strings "
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ASSIGNOR ALI [NRIC NO. 918273-16-1635] ASSIGNOR and another ASSIGNOR"
# "AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN "
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635] ASSIGNOR and another ASSIGNOR"
#"AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635] "
This should work now.
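For several strings at once, the same idea can be wrapped in a small function. The following is only a sketch in base R, so that it runs without stringr: `extract_between` is a hypothetical helper name, the literal `AND` replaces the `(?=AND)` lookahead (it should match the same text here), and `trimws()` is an added step that drops the trailing space left before ASSIGNOR.

```r
# Hypothetical helper: returns the text from the first AND up to (but not
# including) the first following ASSIGNOR, or NA when there is no match.
extract_between <- function(x) {
  # perl = TRUE is needed for the (?=...) lookahead
  m <- regexpr("AND.*?(?=ASSIGNOR)", x, perl = TRUE)
  hit <- !is.na(m) & m != -1          # elements that actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches() returns matches only
  trimws(out)                         # drop the space before ASSIGNOR
}

inputs <- c(
  "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND The Man in the iron mask ASSIGNOR",
  "a string without the keywords"
)
extract_between(inputs)
```

Returning NA for non-matching elements keeps the result the same length as the input, which mirrors what `str_extract` does.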
Answer 1 (score: 0)
I think you can use a regex like this:
and{1}.*
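This regex finds the first "and", matches it, and continues to the end of the line. If you want the whole string, you can do the following (there may be a better way):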
and{1}[^]*
You can test your regex on this site: https://regexr.com/
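The two patterns above are written in a JavaScript-style flavour; in particular, `[^]` is, as far as I know, not accepted by R's regex engines. A rough R equivalent might look like the sketch below. It assumes `perl = TRUE` (so that `.` stops at a newline unless `(?s)` is set) and uses a made-up two-line string `x` for illustration. Note that the question's input is upper case, so a lower-case `and` would also need `ignore.case = TRUE` there.

```r
# Made-up sample text spanning two lines
x <- "foo and bar\nbaz"

# From the first "and" to the end of the current line
first_line <- regmatches(x, regexpr("and.*", x, perl = TRUE))

# From the first "and" to the end of the whole string:
# (?s) lets "." match newline characters as well
whole <- regmatches(x, regexpr("(?s)and.*", x, perl = TRUE))

# Case-insensitive variant for upper-case input such as the question's B
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND NUR AMIRA ASSIGNOR"
upper <- regmatches(B, regexpr("and.*", B, ignore.case = TRUE, perl = TRUE))
```

This still keeps the trailing ASSIGNOR, so it only addresses the starting point of the match, not the end.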
Answer 2 (score: 0)
Answer 3 (score: -1)
You can use regexec / regmatches to extract the required string with a base R solution:
rx <- "\\b(AND.*?)\\s*ASSIGNOR\\b"
x <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635] ASSIGNOR"
regmatches(x, regexec(rx, x))[[1]][2]
## => [1] "AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
The same regex, with ASSIGNOR moved into a lookahead, can be used with a PCRE regex:
regmatches(x, regexpr("\\bAND.*?(?=\\s*ASSIGNOR\\b)", x, perl=TRUE))
# => [1] "AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
The same regex, with ASSIGNOR moved into a lookahead, can also be used with the stringr::str_extract function, which uses the ICU regex library:
library(stringr)
stringr::str_extract(x, "\\bAND.*?(?=\\s*ASSIGNOR\\b)")
# => [1] "AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
Explanation
\b - a word boundary
(AND.*?) - capturing group 1: AND, followed by any 0+ characters (other than line break characters in PCRE and ICU regex), as few as possible, up to the first occurrence of the subsequent subpattern
\s* - 0+ whitespace characters
ASSIGNOR\b - the whole word ASSIGNOR
In the PCRE and ICU versions, the capturing parentheses are not needed: (?=...) is a positive lookahead that matches the text without adding it to the match (that is, without consuming it).
See the regex demo.
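The effect of greedy versus lazy matching can also be seen by revisiting the attempt from the question. The sketch below assumes `perl = TRUE` for both calls so that the behaviour is that of PCRE; the second pattern is an illustrative variation, not part of the original answer. In the first call, the greedy `.*` after ASSIGNEE consumes as much as it can and backtracks only to the last AND. In the second, `.*?` stops at the first AND instead.

```r
B <- "ASSIGNEE/BANK (FORMERLY KNOWN AS BANK SETIA) AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635] ASSIGNOR"

# Greedy ".*" before the group: the capture starts at the LAST "AND"
greedy <- sub(".*ASSIGNEE.* *(AND.*?) *ASSIGNOR.*", "\\1", B, perl = TRUE)

# Lazy ".*?" before the group: the capture starts at the FIRST "AND"
lazy <- sub(".*ASSIGNEE.*?(AND.*?) *ASSIGNOR.*", "\\1", B, perl = TRUE)

greedy
# [1] "AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
lazy
# [1] "AND NUR AMIRA BINTI RAMZI [NRIC NO. 918267-16-6252] AND HAFIZUDDIN BIN ALI [NRIC NO. 918273-16-1635]"
```

Unlike the extraction approaches, `sub()` returns the input unchanged when the pattern does not match, so the lookahead-based versions are probably the safer choice for mixed data.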