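I want to split these addresses into their components (street number, street name, city, state, and ZIP code) so that I can eventually check which ones are the same. Can anyone suggest a basic approach for doing this in R?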
Company Address
1. A 1 NE 1 Street Miami,FL 33132
2. B 1 1st Street Miami,FL 33132
3. C 1 NE 1st St Miami,FL 33132
4. D 1 1st Street Miami,FL 33134
5. E 100 Biscayne Blvd. Miami,FL 33132
6. F 100 Biscayne Blvd Miami ,FL 33132
7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132
8. H 100 Biscayne Blvd. Suite 604 Miami,FL 33132
9. I 100 N. Biscayne Blvd. Miami,FL 33132
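Answer 0 (score: 4)
Try read.pattern in the gsubfn package. If Lines is in a file, replace text = Lines with a string giving the file name. The approach may be fairly fragile, and you will probably need to adjust the regular expression once you have more data to test it on.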
Lines <- "Company Address
1. A 1 NE 1 Street Miami,FL 33132
2. B 1 1st Street Miami,FL 33132
3. C 1 NE 1st St Miami,FL 33132
4. D 1 1st Street Miami,FL 33134
5. E 100 Biscayne Blvd. Miami,FL 33132
6. F 100 Biscayne Blvd Miami ,FL 33132
7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132
8. H 100 Biscayne Blvd. Suite 604 Miami,FL 33132
9. I 100 N. Biscayne Blvd. Miami,FL 33132"
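library(gsubfn)

# Each parenthesized group in the pattern becomes one column of the result.
DF <- read.pattern(text = Lines,
                   pattern = "\\S+ \\S+ *(\\d+) (.*) (\\S+) ?,(\\S+) (\\d+)$",
                   skip = 1,      # skip the "Company Address" header line
                   as.is = TRUE,  # keep strings as character, not factors
                   col.names = c("No", "Street", "City", "State", "Zip"))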
which gives:
> DF
No Street City State Zip
1 1 NE 1 Street Miami FL 33132
2 1 1st Street Miami FL 33132
3 1 NE 1st St Miami FL 33132
4 1 1st Street Miami FL 33134
5 100 Biscayne Blvd. Miami FL 33132
6 100 Biscayne Blvd Miami FL 33132
7 100 Biscayne Boulevard Suite 604 Miami FL 33132
8 100 Biscayne Blvd. Suite 604 Miami FL 33132
9 100 N. Biscayne Blvd. Miami FL 33132
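Once the fields are split as in DF, one possible way to approach the question's final goal (finding addresses that are the same) is to build a normalized key per row and look for repeated keys. The following is only a rough sketch in base R: normalize_street and DF_demo are hypothetical names introduced here, and the abbreviation rules (Street to st, Boulevard to blvd) are assumptions that would need extending for real data.

```r
# Hypothetical helper: reduce a street string to a crude comparison key.
normalize_street <- function(x) {
  x <- tolower(x)
  x <- gsub("[.,]", "", x)                                 # drop periods and commas
  x <- gsub("\\bboulevard\\b", "blvd", x, perl = TRUE)     # assumed abbreviation rule
  x <- gsub("\\bstreet\\b", "st", x, perl = TRUE)          # assumed abbreviation rule
  x <- gsub("\\s+", " ", x)                                # collapse repeated spaces
  trimws(x)
}

# Small stand-in for a few rows of DF (same column names as in the answer).
DF_demo <- data.frame(
  No     = c(1L, 1L, 100L, 100L, 100L),
  Street = c("1st Street", "1st Street", "Biscayne Blvd.",
             "Biscayne Blvd", "Biscayne Boulevard Suite 604"),
  City   = "Miami",
  State  = "FL",
  Zip    = c(33132L, 33134L, 33132L, 33132L, 33132L),
  stringsAsFactors = FALSE
)

# One key per row; rows sharing a key are treated as the same address.
key <- with(DF_demo, paste(No, normalize_street(Street), tolower(City),
                           State, Zip, sep = "|"))

# Flag every row whose key occurs more than once.
DF_demo$dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
```

Under these rules only the two plain "100 Biscayne Blvd" rows share a key. Whether a suite number or a directional prefix such as "NE" or "N." should count as a difference is a judgment call that this sketch does not settle.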
Here is the regular expression visualized:
\S+ \S+ *(\d+) (.*) (\S+) ?,(\S+) (\d+)$
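For readers who prefer not to depend on gsubfn, the same pattern can be applied with base R's regexec and regmatches. This is a minimal sketch, not part of the original answer; the variable names addr_lines, parts, and DF2 are made up for illustration, and only three of the sample lines are used.

```r
# Three of the sample lines, without the header row.
addr_lines <- c("1. A 1 NE 1 Street Miami,FL 33132",
                "6. F 100 Biscayne Blvd Miami ,FL 33132",
                "7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132")

# Same regular expression as in the answer above.
pattern <- "\\S+ \\S+ *(\\d+) (.*) (\\S+) ?,(\\S+) (\\d+)$"

# regexec() returns the full match plus each capture group per line;
# regmatches() turns those positions into the matched strings.
m <- regmatches(addr_lines, regexec(pattern, addr_lines, perl = TRUE))

# Drop the full match (first element) and keep the five capture groups.
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

DF2 <- data.frame(parts, stringsAsFactors = FALSE)
names(DF2) <- c("No", "Street", "City", "State", "Zip")
```

All columns of DF2 are character here. A line that does not match the pattern would yield an empty element in m, so real data would need a check for that before calling rbind.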
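Answer 1 (score: 0)
You can also use the stringr package for this. The function to use is str_extract, which can extract the city name based on a given database of city names.
To extract the street number, you can use gsub with the pattern "^[[:digit:]]".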
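A short sketch of what this answer seems to describe, assuming the company prefix has already been removed from each address. The known_cities vector is a hypothetical stand-in for the "database" of city names, and the street number is pulled out with base R's sub and a capture group, which is one possible reading of the gsub suggestion.

```r
library(stringr)

# Addresses with the company column already stripped (assumption).
addr <- c("1 NE 1 Street Miami,FL 33132",
          "100 Biscayne Blvd. Miami,FL 33132")

# Hypothetical list of known city names standing in for the database.
known_cities <- c("Miami", "Orlando", "Tampa")

# str_extract() returns the first match of the alternation in each string,
# or NA when none of the known cities appears.
city <- str_extract(addr, paste(known_cities, collapse = "|"))

# Keep only the leading run of digits as the street number.
street_no <- sub("^([[:digit:]]+).*$", "\\1", addr)
```

This lookup would pick up a city name that also appears inside a street name, so it is best treated as a starting point rather than a complete parser.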