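I have an old client database (.csv) of addresses. The biggest problem is that the addresses are inconsistent: when I split them into columns, the municipality ends up in the area column, or in the city column, and so on.
For example: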
(header) Country, Municipality, City, Detailed address(street name, number, floor, ap.)
**(proper) Count.xxxxxx, Mun.xxxxx, City.xxxx**
(case 1) Count.xxxxxx, City.xxxx, Mun.xxxxx
(case 2) Count.xxxxxx, City.xxxx, -Mun.xxxxx
(case 3) City.xxxx, Count.xxxxxx, Mun.xxxxx
(case 4) Mun.xxxxx, City.xxxx, Count.xxxxxx
(case 5) Mun.xxxxx, Count.xxxxxx, City.xxxx
"xxxx" = various names, which may also contain digits, spaces and ".".
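I am trying to reorder them into the following format: Count., Mun., City.
But everything I have seen and tried is more about sorting and filtering.
I need help reordering the fields so that the database is consistent and every value sits in its proper column.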
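A more complex example:
Country, Area, Municipality, City, Detailed address: street/boulevard, number, entrance, floor, ap. (a detailed address looks like Boul.Bulgaria 100 entr.A fl.4 ap.256)
As you can imagine, not all fields are filled in, and sometimes fields are not separated by "," (but that is a problem I have to live with... it should not be more than 65k rows...)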
Count.xxxxx, Area.xx xxx, Munic.xxxxx, Cit.xxxxx, Addr.xxxxx
Area.xxxxx, Munic.xxxxx, Cit.xxxxx, Addr.xxxxx
Munic.xxxxx, Cit.xxxxx, Addr.xx xxx, Count.xxxxx
Count.xxxxx, Munic.xxxxx, Cit.xxxxx, Addr.xxxxx
Munic.xxxxx, Vill.xxxxx
Area.xxxxx, Addr.xxxxx
Munic.xxxxx, Cit.xxxxx
Cit.xxxxx, Munic.xx xxx, Addr.xxx xx
Another thing is that the settlement may be either a city or a village (ct./vill.).
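Answer 0: (score: 2)
It sounds like you just need to grab the county, city and municipality from each row. You can use grep to pick out the right element of each row: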
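# For every row, keep the cell whose text contains the given prefix
# (this assumes each row holds exactly one cell per prefix)
data.frame(County = apply(dat, 1, grep, pattern="Count\\.", value=TRUE),
           City = apply(dat, 1, grep, pattern="City\\.", value=TRUE),
           Mun = apply(dat, 1, grep, pattern="Mun\\.", value=TRUE))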
# County City Mun
# 1 Count.1 City.1 Mun.4
# 2 Count.3 City.2 Mun.7
# 3 Count.2 City.5 Mun.8
# 4 Count.2 City.2 Mun.1
# 5 Count.10 City.2 Mun.6
# 6 Count.1 City.1 Mun.4
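The question mentions that some fields are empty and that some cells carry a stray leading "-" (case 2). The grep call above assumes exactly one match per row, so here is a minimal sketch of a more forgiving variant. The helper name pick_field and the sample dat2 are made up for illustration; they are not part of the original data.

```r
# Hypothetical helper: first cell of `row` that starts with `prefix`, or NA.
pick_field <- function(row, prefix) {
  # Tolerate stray "-" or spaces in front of the prefix (as in case 2).
  hit <- grep(paste0("^[- ]*", prefix, "\\."), row, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Strip the stray leading characters from the first match.
  sub("^[- ]+", "", hit[1])
}

# Made-up sample: row 2 has a "-Mun." cell, row 3 has no municipality at all.
dat2 <- data.frame(A = c("Count.1", "City.2", "City.5"),
                   B = c("City.1", "-Mun.7", "Count.2"),
                   C = c("Mun.4", "Count.3", NA),
                   stringsAsFactors = FALSE)

rows <- as.matrix(dat2)  # character matrix, one address per row
fixed <- data.frame(
  County = apply(rows, 1, pick_field, prefix = "Count"),
  Mun    = apply(rows, 1, pick_field, prefix = "Mun"),
  City   = apply(rows, 1, pick_field, prefix = "City"),
  stringsAsFactors = FALSE)
print(fixed)
```

Because the helper always returns a single string or NA, every column keeps one entry per row even when a field is missing.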
Data:
(dat = data.frame(A=c("Count.1", "Count.3", "City.5", "City.2", "Mun.6", "Mun.4"),
B=c("City.1", "Mun.7", "Count.2", "Mun.1", "Count.10", "City.1"),
C=c("Mun.4", "City.2", "Mun.8", "Count.2", "City.2", "Count.1"),
stringsAsFactors=FALSE))
# A B C
# 1 Count.1 City.1 Mun.4
# 2 Count.3 Mun.7 City.2
# 3 City.5 Count.2 Mun.8
# 4 City.2 Mun.1 Count.2
# 5 Mun.6 Count.10 City.2
# 6 Mun.4 City.1 Count.1
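For the more complex example in the question, where fields can be missing and are not always separated by ",", splitting on commas first may not be reliable. One possible approach, sketched below, is to work on the raw lines and pull out each field by its prefix, stopping at the next known prefix. The prefix list is taken from the question's sample rows; the function name extract_field, the column names and the sample values in raw are assumptions for illustration only.

```r
# Prefixes as they appear in the question's sample rows (column names are assumed).
prefixes <- c(Country = "Count", Area = "Area", Municipality = "Munic",
              City = "Cit", Village = "Vill", Address = "Addr")
any_prefix <- paste0("\\b(?:", paste(prefixes, collapse = "|"), ")\\.")

# Hypothetical helper: return "<prefix>.<value>" for each line, or NA if absent.
extract_field <- function(lines, prefix) {
  # Lazy match that stops before the next known prefix (with or without a
  # comma in between) or at the end of the line.
  pat <- paste0("\\b", prefix, "\\..*?(?=[ ,]*(?:", any_prefix, "|$))")
  m <- regexpr(pat, lines, perl = TRUE)
  out <- rep(NA_character_, length(lines))
  out[which(m > 0)] <- regmatches(lines, m)
  out
}

# Made-up raw lines; the second one is missing a comma between two fields.
raw <- c("Count.xx, Area.yy zz, Munic.mm, Cit.cc, Addr.Boul.Bulgaria 100 entr.A fl.4 ap.256",
         "Munic.m1, Vill.v1 Area.a1",
         "Cit.c2, Munic.m 2, Addr.s 5")

# One column per prefix, in the desired order.
clean <- as.data.frame(lapply(prefixes, function(p) extract_field(raw, p)),
                       stringsAsFactors = FALSE)
print(clean)
```

This only works as long as no value itself contains one of the prefixes followed by a dot, so it is worth spot-checking the result against the original rows.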