Extract text that contains characters and numbers in a string

Posted: 2018-02-21 11:15:44

Tags: r string

I want to extract all the words that contain numbers, for example "US6184521-B1" and "US3967255-A", from the following string:

US6184521-B1 -- US3967255-A   DELPHIAN FOUNDATION (DELP-Non-standard);  Q2 CORP (QTWO-Non-standard)   OLIVER S M,  PROUD R A,  PARSONS S J;  US3973118-A   LAMONTAGNE J A (LAMO-Individual)   LAMONTAGNE J A;  US4303855-A   IBM CORP (IBMC)   BAPST U H,  GFELLER F,  VETTIGER P;  US4394572-A   BIOX TECH INC (BIOX-Non-standard)   WILBER S;  US4407290-A   BIOX TECH INC (BIOX-Non-standard);  BOC GROUP PLC (BRTO)   WILBER S A;  US4633087-A   TREBOR INDS INC (TREB-Non-standard)   ROSENTHAL G K,  STEPHENS J D,  ROSENTHAL R D;  US4678921-A   NIPPONDENSO CO LTD (NPDE)   NAKAMURA T,  SATO S,  HATTORI T,  NABETA T,  KATO M;  US4864126-A   HEWLETT-PACKARD CO (HEWP)   WALTERS M D,  PERYESZI J,  PETRILLA J F,  PERNYESZI J;  US4865038-A   NOVAMETRIX MED SYST INC (NOVA-Non-standard)   RICH D,  THOMAS S;  US4907594-A   NICOLAY GMBH (NICO-Non-standard)   MUZ E;  US4939375-A   HEWLETT-PACKARD CO (HEWP)   WALTERS M D,  PERNYESZI J,  PETRILLA J F;  US5036437-A   LECTRON PRODUCTS IN (LECT-Non-standard)   MACKS H R;  US5209230-A   NELLCOR INC (NELL-Non-standard)   SWEDLOW D B,  WARING J,  DELONZO R;  US5237994-A   SQUARE ONE TECHNOLOGY (SQUA-Non-standard)   GOLDBERGER D S;  US5239169-A   MICROSCAN SYSTEMS INC (MICR-Non-standard)   THOMAS J E;  US5325192-A   TEKTRONIX INC (TEKT)   ALLEN D W;  US5373102-A   US SEC OF ARMY (USSA)   DAVENPORT W E,  EHRLICH J J,  TAYLOR T S;  US5561295-A   LITTON SYSTEMS INC (LITO)   PREIS M K,  JACKSEN N F;  US5629517-A   XEROX CORP (XERO)   JACKSON W B,  BIEGELSEN D K,  STREET R A,  WEISFIELD R L;  US5752914-A   NELLCOR PURITAN BENNETT INC (MLCW)   DELONZOR R,  NAMY A;  US5786592-A   HOEK INSTR AB (HOEK-Non-standard)   HOEK B

This should be similar to what is shown here, but I want to extract the letters along with the numbers. How can I achieve this in R?

2 answers:

Answer 0 (score: 2)

Try this:

  test <- c("aa1", "aaa")
  test[grepl("[0-9]", test)]
[1] "aa1"
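
One detail worth checking in a pattern like this: the character class `[1-9]` leaves out the digit 0, so a word whose only digit is zero would be skipped, whereas `[0-9]` (or `[[:digit:]]`) covers every decimal digit. A minimal sketch of the difference, using a made-up vector `tokens` that is not part of the question's data:

```r
# Illustrative tokens; "A0" contains only the digit zero.
tokens <- c("aa1", "aaa", "A0")

# [1-9] skips "A0" because 0 is outside the range 1-9.
nonzero_only <- tokens[grepl("[1-9]", tokens)]

# [0-9] matches any decimal digit, including 0.
any_digit <- tokens[grepl("[0-9]", tokens)]

print(nonzero_only)
print(any_digit)
```

For the data in the question both classes happen to select the same words, because every identifier there also contains a nonzero digit.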

Using your data:

input<-"US6184521-B1 -- US3967255-A   DELPHIAN FOUNDATION (DELP-Non-standard);  Q2 CORP (QTWO-Non-standard)   OLIVER S M,  PROUD R A,  PARSONS S J;  US3973118-A   LAMONTAGNE J A (LAMO-Individual)   LAMONTAGNE J A;  US4303855-A   IBM CORP (IBMC)   BAPST U H,  GFELLER F,  VETTIGER P;  US4394572-A   BIOX TECH INC (BIOX-Non-standard)   WILBER S;  US4407290-A   BIOX TECH INC (BIOX-Non-standard);  BOC GROUP PLC (BRTO)   WILBER S A;  US4633087-A   TREBOR INDS INC (TREB-Non-standard)   ROSENTHAL G K,  STEPHENS J D,  ROSENTHAL R D;  US4678921-A   NIPPONDENSO CO LTD (NPDE)   NAKAMURA T,  SATO S,  HATTORI T,  NABETA T,  KATO M;  US4864126-A   HEWLETT-PACKARD CO (HEWP)   WALTERS M D,  PERYESZI J,  PETRILLA J F,  PERNYESZI J;  US4865038-A   NOVAMETRIX MED SYST INC (NOVA-Non-standard)   RICH D,  THOMAS S;  US4907594-A   NICOLAY GMBH (NICO-Non-standard)   MUZ E;  US4939375-A   HEWLETT-PACKARD CO (HEWP)   WALTERS M D,  PERNYESZI J,  PETRILLA J F;  US5036437-A   LECTRON PRODUCTS IN (LECT-Non-standard)   MACKS H R;  US5209230-A   NELLCOR INC (NELL-Non-standard)   SWEDLOW D B,  WARING J,  DELONZO R;  US5237994-A   SQUARE ONE TECHNOLOGY (SQUA-Non-standard)   GOLDBERGER D S;  US5239169-A   MICROSCAN SYSTEMS INC (MICR-Non-standard)   THOMAS J E;  US5325192-A   TEKTRONIX INC (TEKT)   ALLEN D W;  US5373102-A   US SEC OF ARMY (USSA)   DAVENPORT W E,  EHRLICH J J,  TAYLOR T S;  US5561295-A   LITTON SYSTEMS INC (LITO)   PREIS M K,  JACKSEN N F;  US5629517-A   XEROX CORP (XERO)   JACKSON W B,  BIEGELSEN D K,  STREET R A,  WEISFIELD R L;  US5752914-A   NELLCOR PURITAN BENNETT INC (MLCW)   DELONZOR R,  NAMY A;  US5786592-A   HOEK INSTR AB (HOEK-Non-standard)   HOEK B"
  input <- unlist(strsplit(input, split = " +"))

Your output:

input[grepl("[0-9]", input)]
 [1] "US6184521-B1" "US3967255-A"  "Q2"           "US3973118-A"  "US4303855-A"  "US4394572-A"  "US4407290-A" 
 [8] "US4633087-A"  "US4678921-A"  "US4864126-A"  "US4865038-A"  "US4907594-A"  "US4939375-A"  "US5036437-A" 
[15] "US5209230-A"  "US5237994-A"  "US5239169-A"  "US5325192-A"  "US5373102-A"  "US5561295-A"  "US5629517-A" 
[22] "US5752914-A"  "US5786592-A"
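
The result above also contains "Q2", because it has a digit too. If only the patent-style identifiers are wanted, a stricter pattern can be used. The sketch below assumes, purely from the sample data, that an identifier is two capital letters, a run of digits, a hyphen, and a capital letter optionally followed by digits; `patent_like` is a hypothetical name and the pattern may need adjusting for other data.

```r
# A few tokens taken from the output above, used as sample input.
tokens <- c("US6184521-B1", "US3967255-A", "Q2", "US3973118-A")

# Assumed shape: two capitals, digits, hyphen, one capital, optional digits
# (e.g. "-A" or "-B1"). Anchors ^ and $ force the whole token to match.
patent_like <- "^[[:upper:]]{2}[0-9]+-[[:upper:]][0-9]*$"

patents <- grep(patent_like, tokens, value = TRUE)
print(patents)
```

Here "Q2" is dropped because it has no hyphenated suffix.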

Answer 1 (score: 1)

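A simple grep will do it. Note that the argument value is set to TRUE; its default is FALSE, in which case grep returns the positions of the matching elements rather than the elements themselves.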

grep("[[:digit:]]", s, value = TRUE)
# [1] "US6184521-B1" "US3967255-A"  "Q2"           "US3973118-A"  "US4303855-A" 
# [6] "US4394572-A"  "US4407290-A"  "US4633087-A"  "US4678921-A"  "US4864126-A" 
#[11] "US4865038-A"  "US4907594-A"  "US4939375-A"  "US5036437-A"  "US5209230-A" 
#[16] "US5237994-A"  "US5239169-A"  "US5325192-A"  "US5373102-A"  "US5561295-A" 
#[21] "US5629517-A"  "US5752914-A"  "US5786592-A"
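
To illustrate what the value argument changes, here is a small sketch on a made-up vector `x` (not the question's data): without value = TRUE, grep returns integer positions, which can then be used for indexing.

```r
# Illustrative input; the second element has no digit.
x <- c("aa1", "aaa", "b0")

# Default (value = FALSE): integer positions of the matching elements.
idx <- grep("[[:digit:]]", x)

# value = TRUE: the matching elements themselves.
vals <- grep("[[:digit:]]", x, value = TRUE)

print(idx)
print(vals)
```

Indexing with the positions, `x[idx]`, gives the same elements as `vals`.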

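Data.
The following uses scan to read the data you provided. It splits the string on whitespace, so your strings may be split differently. But this is only meant for testing the code above.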

s <- 
scan(what = character(),
text = "US6184521-B1 -- US3967255-A   DELPHIAN FOUNDATION (DELP-Non-standard);
  Q2 CORP (QTWO-Non-standard)   OLIVER S M,  PROUD R A,  PARSONS S J;  
US3973118-A   LAMONTAGNE J A (LAMO-Individual)   LAMONTAGNE J A;  US4303855-A   
IBM CORP (IBMC)   BAPST U H,  GFELLER F,  VETTIGER P;  US4394572-A   BIOX TECH INC 
(BIOX-Non-standard)   WILBER S;  US4407290-A   BIOX TECH INC (BIOX-Non-standard);  
BOC GROUP PLC (BRTO)   WILBER S A;  US4633087-A   TREBOR INDS INC (TREB-Non-standard)   
ROSENTHAL G K,  STEPHENS J D,  ROSENTHAL R D;  US4678921-A   NIPPONDENSO CO LTD 
(NPDE)   NAKAMURA T,  SATO S,  HATTORI T,  NABETA T,  KATO M;  US4864126-A   
HEWLETT-PACKARD CO (HEWP)   WALTERS M D,  PERYESZI J,  PETRILLA J F,  PERNYESZI 
J;  US4865038-A   NOVAMETRIX MED SYST INC (NOVA-Non-standard)   RICH D,  
THOMAS S;  US4907594-A   NICOLAY GMBH (NICO-Non-standard)   MUZ E;  
US4939375-A   HEWLETT-PACKARD CO (HEWP)   WALTERS M D,  PERNYESZI J,  
PETRILLA J F;  US5036437-A   LECTRON PRODUCTS IN (LECT-Non-standard)   
MACKS H R;  US5209230-A   NELLCOR INC (NELL-Non-standard)   SWEDLOW D B,  
WARING J,  DELONZO R;  US5237994-A   SQUARE ONE TECHNOLOGY (SQUA-Non-standard)   
GOLDBERGER D S;  US5239169-A   MICROSCAN SYSTEMS INC (MICR-Non-standard)   
THOMAS J E;  US5325192-A   TEKTRONIX INC (TEKT)   ALLEN D W;  US5373102-A   
US SEC OF ARMY (USSA)   DAVENPORT W E,  EHRLICH J J,  TAYLOR T S;  
US5561295-A   LITTON SYSTEMS INC (LITO)   PREIS M K,  JACKSEN N F;  
US5629517-A   XEROX CORP (XERO)   JACKSON W B,  BIEGELSEN D K,  STREET R A,  
WEISFIELD R L;  US5752914-A   NELLCOR PURITAN BENNETT INC (MLCW)   
DELONZOR R,  NAMY A;  US5786592-A   HOEK INSTR AB (HOEK-Non-standard)   
HOEK B")
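
If the text is kept as one long string instead of being split first, the words containing a digit can also be pulled out directly with regmatches and gregexpr. This is a sketch of an alternative approach, not the method used in the answers; the shortened string `x` and the names `pattern` and `found` are illustrative assumptions.

```r
# A shortened piece of the question's string, kept as a single element.
x <- "US6184521-B1 -- US3967255-A   DELPHIAN FOUNDATION (DELP-Non-standard);  Q2 CORP"

# A run of non-space characters that contains at least one digit.
pattern <- "[^[:space:]]*[0-9][^[:space:]]*"

# gregexpr finds every match; regmatches extracts the matched substrings.
found <- regmatches(x, gregexpr(pattern, x))[[1]]
print(found)
```

Because the pattern takes every non-space character around the digit, punctuation attached to a word (such as a trailing semicolon) would be included in the match.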