Replacing rogue double quotes in a vector in R

Asked: 2017-02-08 07:28:19

Tags: r regex

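My CSV file is corrupted: it has long text fields that contain double quotes and commas. I have been able to clean it up to some extent, and I now have the tab-separated fields as a character vector of whole lines (each element is one line).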

head(temp, 2)
[1] "\"org_order\"\t\"organizations.api_path\"\t\"permalink\"\t\"api_path\"\t\"web_path\"\t\"name\"\t\"also_known_as\"\t\"short_description\"\t\"description\"\t\"profile_image_url\"\t\"primary_role\"\t\"role_company\"\t\"role_investor\"\t\"role_group\"\t\"role_school\"\t\"founded_on\"\t\"founded_on_trust_code\"\t\"is_closed\"\t\"closed_on\"\t\"closed_on_trust_code\"\t\"num_employees_min\"\t\"num_employees_max\"\t\"stock_exchange\"\t\"stock_symbol\"\t\"total_funding_usd\"\t\"number_of_investments\"\t\"homepage_url\"\t\"created_at\"\t\"updated_at\""                                                                                                                                                                                                                                                                                                                                                                                                                                                               
[2] "1\t\"organizations/care1st-health-plan-arizona\"\t\"care1st-health-plan-arizona\"\t\"organizations/care1st-health-plan-arizona\"\t\"organization/care1st-health-plan-arizona\"\t\"Care1st Health Plan Arizona\"\t\"\"\t\"Care1st Health Plan Arizona provides high quality health care services.\"\t\"Care1st is a health plan providing support and services to meet the health care needs of eligible members enrolled in KidsCare, AHCCCS, and DDD.\"\t\"http://public.crunchbase.com/t_api_images/v1475743278/m2teurxnhkwacygzdn2m.png\"\t\"company\"\t\"\"\t\"\"\t\"\"\t\"\"\t\"2003-01-01\"\t\"4\"\t\"FALSE\"\t\"\"\t\"0\"\t\"251\"\t\"500\"\t\"\"\t\"\"\t\"0\"\t\"0\"\t\"\"\t\"1475743348\"\t\"1475899305\""  

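I then write temp to a file and read it back (I found this much faster than textConnection). However, read.table("temp", sep = "\t", quote = "\"", encoding = "UTF-8", colClasses = "character") chokes on some lines and gives me messages such as:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 66951 did not have 29 elements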

I think this is caused by rogue double quotes, as in the following line (the rogue quote comes right after the text 'TripAdvisor de la sant?).

temp[66951]
[1] "67654\t\"organizations/docotop\"\t\"docotop\"\t\"organizations/docotop\"\t\"organization/docotop\"\t\"DOCOTOP\"\t\"\"\t\"Le 'TripAdvisor de la sant?\" est arriv?. Docotop permet de trouver le meilleur professionnel de sant?gr?e ?la communaut?de patients\"\t\"\"\t\"http://public.crunchbase.com/t_api_images/v1455271104/ry9lhcfezcmemoifp92h.png\"\t\"company\"\t\"TRUE\"\t\"\"\t\"\"\t\"\"\t\"2015-11-17\"\t\"7\"\t\"\"\t\"\"\t\"0\"\t\"1\"\t\"10\"\t\"EURONEXT\"\t\"\"\t\"0\"\t\"0\"\t\"http://docotop.com/\"\t\"1455271299\"\t\"1473443321\""
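
One quick way to shortlist lines like this one is to count the double quotes per line: a well-formed record has an even number, so an odd count signals at least one stray quote. The sketch below uses a small made-up vector, temp_demo, rather than the real data, and it will miss any line that happens to contain an even number of rogue quotes.

```r
# temp_demo is a hypothetical stand-in for temp; the second record has a stray quote.
temp_demo <- c(
  "1\t\"alpha\"\t\"plain text\"",
  "2\t\"beta\"\t\"Le 'guide\" est ici\""
)

# Count the literal double quotes in each line.
n_quotes <- lengths(regmatches(temp_demo, gregexpr("\"", temp_demo, fixed = TRUE)))

# Lines with an odd count cannot have balanced quoting.
suspect <- which(n_quotes %% 2 == 1)
print(suspect)
```

On the real vector, temp[suspect] would then list candidate lines to inspect by hand.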

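I would like to replace the rogue double quotes with single quotes while leaving the expected quotes intact. A quote is expected immediately before or after a delimiter (tab), at the start of a line (first line only), and at the end of a line. I wrote the following regex with lookarounds for tabs and for the start and end of the line, but it does not work: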

temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)

Edit: I tried @akrun's solution, but got:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 181 did not have 29 elements

The offending line (which did not cause an error before):

temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others.    PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"

1 answer:

Answer 0 (score: 1)

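Your regex (?<![^\t])"(?![\t$]) matches a " that is not preceded by any character other than a tab (so the " must be preceded by a tab or sit at the start of the string) and that is not followed by a tab or a literal $ character. In practice it hits the expected opening quotes rather than the rogue ones.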

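This is because ^ and $ lose their anchor meaning inside a character class: a leading ^ negates the class, and $ is a literal dollar sign.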
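
A minimal sketch of what the original pattern does, using a made-up two-field string (x is an assumption for illustration, not part of the data above). It rewrites the opening quotes, which are exactly the ones that should stay, and leaves the closing quotes alone.

```r
# Hypothetical two-field record containing only well-formed quotes.
x <- "\"a\"\t\"b\""

# The original pattern: a quote at the string start or after a tab,
# not followed by a tab or a literal dollar sign.
broken <- gsub("(?<![^\t])\"(?![\t$])", "'", x, perl = TRUE)
print(broken)

# Inside brackets, $ is an ordinary character, not an end-of-string anchor.
dollar_is_literal <- grepl("[$]", "a$b", perl = TRUE)
print(dollar_is_literal)
```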

Replace the character classes with alternation groups:

gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)

The (?<!\t|^) lookbehind requires that the " is not at the start of the string and is not preceded by a tab.

The (?!\t|$) lookahead requires that the " is not at the end of the string ($) and is not followed by a tab character.
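
To see the corrected pattern at work, here is a small sketch on a made-up three-column sample (temp_demo and its column names are assumptions for illustration). Only the quote inside the text field is replaced; the quotes next to tabs, at the start of a line, and at the end of a line are kept, including those of empty "" fields. read.table(text = ...) is used here for brevity, whereas the question writes the vector to a file for speed.

```r
# Hypothetical sample: a header, one record with a stray quote, one clean record.
temp_demo <- c(
  "\"id\"\t\"name\"\t\"note\"",
  "1\t\"\"\t\"Le 'guide\" est ici\"",
  "2\t\"beta\"\t\"\""
)

# Replace only quotes that are neither next to a tab nor at a line boundary.
cleaned <- gsub("(?<!\t|^)\"(?!\t|$)", "'", temp_demo, perl = TRUE)

# The cleaned lines now parse into three character columns.
parsed <- read.table(text = cleaned, sep = "\t", quote = "\"",
                     header = TRUE, colClasses = "character")
print(parsed$note)
```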