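I'm working with a database that has more than 40 variables. Each case has a unique identifier for its property. Some of these identifiers have been entered into the address variables.
The identifiers can only take a fixed set of formats, which are listed further down.
I'm not sure how to strip the unique identifier out of the address text that contains it without creating a lookup table and using left_join. A lookup table would need constant updating, which would make it very cumbersome.
I haven't found an example of this kind of thing, though I may have missed something.
My data looks like this: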
The clean data would end up with the unique identifier in the Aa_reference column, and observations that already have data in the correct variable would not be overwritten with NA.
The identifier formats are:
NA123456 - First letter constant - N, 1 Letter A-K, Numbers 1-9
SA123456 - First 2 letters constant - SA, 6 Numbers 0-9
MABC1234 - First letter constant - M, 3 Letters A-Z, 4 Numbers 0-9
QABC1234 - First letter constant - Q, 3 Letters A-Z, 4 Numbers 0-9
WABC1234 - First letter constant - W, 3 Letters A-Z, 4 Numbers 1-9
TABC1234 - First letter constant - T, 3 Letters A-Z, 4 Numbers 1-9
3ABCD123 - First number constant - 3, 3 Letters A-Z, 3 Numbers 1-9
Thanks in advance for your help.
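Answer 0 (score: 1)
Use a regular expression with stringr::str_extract_all().
I assumed your digits should be 0-9 rather than 1-9. If not, change every [0-9] to [1-9].
Also, if you need a specific number of repeated letters/digits (say n), change + to {n}, as in the first pattern in vec.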
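To make the difference between + and {n} concrete, here is a minimal base-R sketch. It assumes the N-type identifier has exactly six digits (as in the example NA123456) and adds word boundaries so that longer strings are rejected; the helper extract_first() and the test strings are made up for illustration.

```r
# Hypothetical test strings: a valid id, one that is too short, one that is too long
x <- c("PIC: NH630488", "PIC NH63", "PIC NH6304889")

loose <- "N[A-K][0-9]+"           # one or more digits
exact <- "\\bN[A-K][0-9]{6}\\b"   # exactly six digits, delimited by word boundaries

# Return the first match per element, or NA when there is none
extract_first <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(text, m)
  out
}

extract_first(loose, x)  # "NH630488" "NH63" "NH6304889"
extract_first(exact, x)  # "NH630488" NA NA
```

The loose pattern accepts any run of digits, so malformed identifiers slip through; the exact pattern only returns values that have the expected length.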
library( data.table )
library( stringr )
# NA123456 - First letter constant - N, Letter A-K, Numbers 1-9
# SA123456 - First 2 letters constant - SA, Numbers 1-9
# MABC1234 - First letter constant - M, Letters A-Z, Numbers 1-9
# QABC1234 - First letter constant - Q, Letters A-Z, Numbers 1-9
# WABC1234 - First letter constant - W, Letters A-Z, Numbers 1-9
# TABC1234 - First letter constant - T, Letters A-Z, Numbers 1-9
# 3ABCD123 - First number constant - 3, Letters A-Z, Numbers 1-9
#create a vector with all regex-patterns
#I assumed 1-9 should be 0-9 ?? <-- !!
vec <- c( "N[A-K]{1}[0-9]+",
"SA[0-9]+",
"M[A-Z]+[0-9]+",
"Q[A-Z]+[0-9]+",
"W[A-Z]+[0-9]+",
"T[A-Z]+[0-9]+",
"3[A-Z]+[0-9]+" )
#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )
#extract all matches from the column 'Address' and store them (as a list column) in Aa_reference
DT[, Aa_reference := stringr::str_extract_all( Address, pattern )]
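Note that str_extract_all() returns a list with one character vector per row (empty when nothing matches), so Aa_reference becomes a list column. If a plain character column is preferred, one possible way to flatten such a list is sketched below in base R. The list matches and the helper first_or_na() are hypothetical stand-ins for the real column.

```r
# Stand-in for the result of str_extract_all(): no match, one match, two matches
matches <- list(character(0), "WABN0262", c("QKDR0078", "QKEV0169"))

# Keep the first match of each element; use NA when there was no match
first_or_na <- function(m) if (length(m) == 0) NA_character_ else m[[1]]

ids <- vapply(matches, first_or_na, character(1))
ids  # NA "WABN0262" "QKDR0078"
```

When only the first identifier per row is needed, stringr::str_extract() gives the same shape directly, since it returns the first match or NA.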
Output
#     Property                         Address                Aa_reference
#  1: PIC: 3WABG086                    260 SPRINGHURST ROAD
#  2: PIC: 35PSR217                    1350 RIVER ROAD
#  3: PIC# NH244157                    1038 QUONDONG ROAD
#  4: PIC: 3GMUF425                    70 DIGBY ROAD
#  5: PIC# 3GMUF425                    70 DIGBY ROAD
#  6: PIC QTIWW0626                    REMOLEA
#  7: PIC#EBWSE235                     BOX 191
#  8: PIC #3WLKM019                    198 MONTGOMERY ROAD
#  9: PIC # 3BWMM021                   149 ANDERSONS ROAD
# 10: PIC: 3WCGN034                    WERRIBEE
# 11: GARANGULA PIC: NH630488          PO BOX 84
# 12: GARANGULA PIC: NH630488          PO BOX 84
# 13: PIC: 3GMTL320                    2980 GLENELG HIGHWAY
# 14: GREENSLOPES PIC: MJKE0261        914 WEST KENTISH ROAD
# 15: PIC: WFZB3246                    859 PFEIFFER ROAD
# 16: PIC: WFAY3549                    34605 ALBANY HIGHWAY
# 17: PIC: 3CEXK044                    2244 LAVERS HILL ROAD
# 18: PIC: QGWW0462                    ELDERFIELD
# 19: PIC: 3WCGN034                    WERRIBEE
# 20: KAYA DORPER & WHITE DORPER STUD  PIC: WABN0262          WABN0262
# 21: SPOTTSWOOD                       PIC QKDR0078           QKDR0078
# 22: COOMBOONA HOLSTEINS              PIC 3SPSR217           3SPSR217
# 23: ROSEVALE                         PIC: QKEV0169          QKEV0169
# 24: <NA>                             PIC 3EGON009           3EGON009
# 25: <NA>                             PIC WFKPO316           WFKPO316
# 26: IVADENE                          PIC 3WANP0T1           3WANP0
# 27: <NA>                             PIC ND225813           ND225813
# 28: HEAVENLY VALLEY FARMS            PIC #NF538645          NF538645
# 29: C/- CED WISE AB CENTRE           PIC: QCST0158          QCST0158
# 30: GARANGULA                        PIC # NH630488         NH630488
#     Property                         Address                Aa_reference
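In the output above, only Address is searched, so rows whose identifier sits in Property stay empty. The sketch below shows one way to search both columns and fill Aa_reference only where it is still NA. It is written in base R so it runs without extra packages; the small data frame, its fourth row with a pre-filled Aa_reference, and the helper extract_first() are assumptions made for illustration.

```r
# Small illustrative data frame; row 4 already has a (hypothetical) reference
df <- data.frame(
  Property     = c("PIC: 3WABG086", "SPOTTSWOOD", NA, "GARANGULA"),
  Address      = c("260 SPRINGHURST ROAD", "PIC QKDR0078", "PIC ND225813", "PO BOX 84"),
  Aa_reference = c(NA, NA, NA, "NH630488"),
  stringsAsFactors = FALSE
)

# Same patterns as in the answer, joined into one OR-pattern
vec <- c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+", "Q[A-Z]+[0-9]+",
         "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+", "3[A-Z]+[0-9]+")
pattern <- paste(vec, collapse = "|")

# Return the first match per element, or NA when there is none (NA input stays NA)
extract_first <- function(pattern, text) {
  m <- regexpr(pattern, text)
  out <- rep(NA_character_, length(text))
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(text, m)
  out
}

from_property <- extract_first(pattern, df$Property)
from_address  <- extract_first(pattern, df$Address)

# Prefer the Property match, fall back to the Address match
found <- ifelse(is.na(from_property), from_address, from_property)

# Only fill rows that have no reference yet, so existing values are never overwritten
df$Aa_reference <- ifelse(is.na(df$Aa_reference), found, df$Aa_reference)
df$Aa_reference  # "3WABG086" "QKDR0078" "ND225813" "NH630488"
```

The final ifelse() is what protects existing values: a row with no match in either column keeps whatever was already in Aa_reference.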
Sample data used
DT <- fread('
Property | Address | Aa_reference
PIC: 3WABG086| 260 SPRINGHURST ROAD| NA
PIC: 35PSR217| 1350 RIVER ROAD | NA
PIC# NH244157| 1038 QUONDONG ROAD |NA
PIC: 3GMUF425| 70 DIGBY ROAD| NA
PIC# 3GMUF425| 70 DIGBY ROAD | NA
PIC QTIWW0626 | REMOLEA | NA
PIC#EBWSE235 | BOX 191 | NA
PIC #3WLKM019 | 198 MONTGOMERY ROAD| NA
PIC # 3BWMM021 | 149 ANDERSONS ROAD | NA
PIC: 3WCGN034 | WERRIBEE | NA
GARANGULA PIC: NH630488| PO BOX 84 |NA
GARANGULA PIC: NH630488 | PO BOX 84| NA
PIC: 3GMTL320| 2980 GLENELG HIGHWAY| NA
GREENSLOPES PIC: MJKE0261| 914 WEST KENTISH ROAD| NA
PIC: WFZB3246 | 859 PFEIFFER ROAD| NA
PIC: WFAY3549| 34605 ALBANY HIGHWAY| NA
PIC: 3CEXK044 | 2244 LAVERS HILL ROAD| NA
PIC: QGWW0462 | ELDERFIELD| NA
PIC: 3WCGN034 | WERRIBEE| NA
KAYA DORPER & WHITE DORPER STUD| PIC: WABN0262| NA
SPOTTSWOOD| PIC QKDR0078 | NA
COOMBOONA HOLSTEINS| PIC 3SPSR217 | NA
ROSEVALE | PIC: QKEV0169 | NA
NA| PIC 3EGON009 | NA
NA | PIC WFKPO316 | NA
IVADENE| PIC 3WANP0T1 | NA
NA | PIC ND225813 | NA
HEAVENLY VALLEY FARMS| PIC #NF538645 | NA
C/- CED WISE AB CENTRE| PIC: QCST0158 |NA
GARANGULA| PIC # NH630488 |NA
', sep = "|")
Answer 1 (score: 0)
This is what finally worked:
library( dplyr )
library( tidyr )
library( stringr )
vec <- c( "N[A-K]{1}[0-9]+",
"SA[0-9]+",
"M[A-Z]+[0-9]+",
"Q[A-Z]+[0-9]+",
"W[A-Z]+[0-9]+",
"T[A-Z]+[0-9]+",
"3[A-Z]+[0-9]+" )
#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )
#extract the first identifier from each column; str_extract() gives NA when nothing matches
df <- df %>%
mutate( id1 = str_extract( Property, pattern ),
id2 = str_extract( Address, pattern ) ) %>%
#combine both columns into id3, dropping the NA side
unite( id3, id1, id2, remove = TRUE, sep = " ", na.rm = TRUE ) %>%
#keep a single identifier per row; rows without any identifier become NA
mutate( id3 = str_extract( id3, pattern ) )
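Because the patterns use + rather than fixed lengths, a match can stop in the middle of a malformed token: in the first answer's output, "PIC 3WANP0T1" yields "3WANP0". A rough way to flag such rows for manual review is sketched below in base R. The vectors are copied from that output, and the check itself is only an assumption about what counts as suspicious.

```r
# Three rows taken from the first answer's output
address   <- c("PIC 3WANP0T1", "PIC: QKEV0169", "PIC #NF538645")
extracted <- c("3WANP0", "QKEV0169", "NF538645")

# A match is suspicious when the text continues with another letter or digit
# right after it, i.e. the pattern stopped in the middle of a token
is_partial <- mapply(
  function(id, text) grepl(paste0(id, "[A-Z0-9]"), text),
  extracted, address,
  USE.NAMES = FALSE
)

is_partial           # TRUE FALSE FALSE
address[is_partial]  # "PIC 3WANP0T1"
```

This works here because the identifiers contain only letters and digits, so they can be pasted into a regular expression without escaping.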