Fixing characters in a large vector

Time: 2017-01-17 09:43:43

Tags: r

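I am trying to fix characters in a large character vector. The broken sequences match the regular expression <U\\+[0-9a-fA-F]{4}> (for example, S<U+00E3>). The vector has 358,626 elements; I have provided a random sample of 100 values below.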
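To make the pattern concrete, here is a minimal sketch of how it can be used to find the affected elements and list the escapes they contain. The names `pattern`, `sample_addresses`, `has_escape` and `escapes` are hypothetical and used only for illustration.

```r
# Regex from the question: "<U+" followed by four hex digits and ">".
pattern <- "<U\\+[0-9a-fA-F]{4}>"

# Hypothetical three-element sample (one affected, one clean, one empty).
sample_addresses <- c("S<U+00E3>o Gon<U+00E7>alo Rio de Janeiro Brazil",
                      "Phoenix Arizona United States",
                      "")

# TRUE for elements that contain at least one escape.
has_escape <- grepl(pattern, sample_addresses)

# One character vector of matched escapes per element (empty if none).
escapes <- regmatches(sample_addresses, gregexpr(pattern, sample_addresses))
```

Running `table(has_escape)` on the full vector would show how many of the 358,626 values actually need fixing.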

Expected result:

"Thessalon<U+00ED>ki Thessaloniki Greece" => "Thessaloníki Thessaloniki Greece"
"Phoenix Arizona United States" => "Phoenix Arizona United States"
""  => ""
NA => NA

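Fortunately, @MRFlick wrote a nice function called trueunicode that converts these sequences back to "normal" characters. However, trueunicode fails on strings that do not contain the pattern. I tried to work around this by applying trueunicode only to the values that contain the pattern, as follows: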

sapply(addresses, function(x) ifelse(grepl("<U\\+[0-9a-fA-F]{4}>", x), trueunicode(x), x))

Unfortunately, for some reason trueunicode still fails somewhere, returning:

Error in (function (cp) : too many bits

addresses <- c("San Francisco California United States", "Encinitas CA United States", 
"Malvern Pennsylvania United States", "New York NY United States", 
"Temecula CA United States", "San Francisco CA United States", 
"Pittsburgh Pennsylvania United States", "Istanbul Turkey", "Rochester New York United States", 
"Atlanta GA United States", "Cochin Kerala India", "Sydney New South Wales Australia", 
"Los Angeles CA United States", "Vancouver British Columbia Canada", 
"Rio De Janeiro Rio de Janeiro Brazil", "Washington District of Columbia United States", 
"Seattle Washington United States", "Phoenix Arizona United States", 
"Kwun Tong Kowloon Hong Kong", "Milwaukee Wisconsin United States", 
"Dublin Dublin Ireland", "London England United Kingdom", "Broomfield Colorado United States", 
"Bandung Indonesia", "London England United Kingdom", "Washington United States", 
"Ramat Gan Tel Aviv Israel", "Sydney New South Wales Australia", 
"Houston TX United States", "Salida CO United States", "Bethesda Maryland United States", 
"San Jose California United States", "S<U+00E3>o Gon<U+00E7>alo Rio de Janeiro Brazil", 
"Richmond Virginia United States", "Davao City Davao City Philippines", 
"Bucharest Bucuresti Romania", "Providencia Chile", "Cape Coral Florida United States", 
"Glenrothes Fife United Kingdom", "New York New York United States", 
"Brooklyn NY United States", "New York New York United States", 
"Vienna Wien Austria", "Addison TX United States", "Tel Aviv Tel Aviv Israel", 
"Hilton New York United States", "Tiangu<U+00E1> Ceara Brazil", 
"Hamburg Hamburg Germany", "Thessalon<U+00ED>ki Thessaloniki Greece", 
"New York New York United States", "Vancouver British Columbia Canada", 
"Lagos Lagos Nigeria", "Karachi Sindh Pakistan", "Santa Barbara CA United States", 
"Mumbai Maharashtra India", "Burlington Massachusetts United States", 
"Oslo Oslo Norway", "Jakarta Jakarta Raya Indonesia", "Madrid Madrid Spain", 
"Singapore", "San Mateo California United States", "St. Petersburg Florida United States", 
"Cincinnati Ohio United States", "San Francisco CA United States", 
"Gaithersburg Maryland United States", "Watford Hertford United Kingdom", 
"Austin Texas United States", "Gent Oost-Vlaanderen Belgium", 
"Canton Massachusetts United States", "Berkeley California United States", 
"Carlsbad California United States", "St. Petersburg Florida United States", 
"Bangalore Karnataka India", "Nyon Vaud Switzerland", "Arlington Virginia United States", 
"Palo Alto California United States", "London England United Kingdom", 
"Sydney New South Wales Australia", "Mumbai Maharashtra India", 
"Austin Texas United States", "Larnaca Cyprus", "Melbourn Cambridgeshire United Kingdom", 
"Chicago Illinois United States", "Houston Texas United States", 
"Paris France", "New York New York United States", "Auburn Hills Michigan United States", 
"New Delhi Delhi India", "Bangalore Karnataka India", "Redwood City California United States", 
"Mississauga Ontario Canada", "New York New York United States", 
"Sydney New South Wales Australia", "St Louis MO United States", 
"Rotterdam The Netherlands", "Delta British Columbia Canada", 
"Erlangen Bayern Germany", "Ashburn Virginia United States", 
"Pasadena California United States", "Palo Alto CA United States"
)
Edit: following @PanagiotisKanavos's comment, I removed the mention of a Unicode error.

1 answer:

Answer 0 (score: 0)

Calling the function on one value at a time, rather than on the whole vector, did the job.

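# Process one element at a time instead of passing the whole vector.
for (i in seq_along(addresses)) {
  y <- addresses[i]
  # Only call trueunicode() when the element contains an escape.
  addresses[i] <- ifelse(grepl("<U\\+[0-9a-fA-F]{4}>", y), trueunicode(y), y)
}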

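However, this is more of a workaround than a clean solution. I also noticed something odd in the result: the pattern "<U\\+[0-9a-fA-F]{4}>" is still visible in RStudio's data frame "View", even though grepl no longer detects it.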
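For comparison, here is a minimal sketch of a vectorised alternative that avoids the element-by-element loop. It does not use trueunicode. Instead it swaps in base R's `intToUtf8()` and a fixed-string `gsub()` per distinct escape. The function name `decode_u_escapes` is hypothetical, and the sketch assumes every escape names a valid, non-surrogate code point.

```r
# Hypothetical helper, not the trueunicode function from the question.
decode_u_escapes <- function(x) {
  pattern <- "<U\\+[0-9a-fA-F]{4}>"

  # Collect the distinct escapes from the elements that contain any.
  hits  <- x[grepl(pattern, x)]
  codes <- unique(unlist(regmatches(hits, gregexpr(pattern, hits))))

  for (code in codes) {
    # Characters 4 to 7 of "<U+00ED>" are the four hex digits.
    ch <- intToUtf8(strtoi(substr(code, 4, 7), base = 16L))
    # Literal (fixed) replacement across the whole vector at once.
    x <- gsub(code, ch, x, fixed = TRUE)
  }
  x
}

fixed <- decode_u_escapes(c("Thessalon<U+00ED>ki Thessaloniki Greece",
                            "Phoenix Arizona United States",
                            "",
                            NA))
```

Elements without an escape, empty strings and NA pass through unchanged, so no `ifelse` guard is needed. As for the escapes that remain visible in the View, one possible explanation (an assumption, not verified against the asker's setup) is that on Windows R can display characters the native encoding cannot represent as <U+XXXX>; in that case the string holds the real character and grepl correctly finds no literal escape.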