Cleaning up a txt file in R

Date: 2016-03-05 04:07:44

Tags: regex r

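I found the following taxonomy script on asdfree. The current script (asdfree original script) merges all specialties into a single column. The problem is that it ignores the hierarchy of the specialties.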

The following code shows that the file actually contains several levels:

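library(downloader)

# download the taxonomy table to a temporary file and read it line by line
tf <- tempfile()
download("https://raw.githubusercontent.com/ajdamico/asdfree/master/National%20Plan%20and%20Provider%20Enumeration%20System/taxonomy%20id%20table.txt", tf)
z <- readLines(tf)

# count the tab characters on each line (gregexpr returns -1 when there is no match)
hmt <- gregexpr("\t", z)
l <- unlist(lapply(hmt, function(x) length(x[x > 0])))

# split the lines by tab count
specialty_groups <- z[l == 1]
specialty_individual <- z[l == 2]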
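
As a quick check of the tab-counting idea, here is a minimal sketch on a few made-up lines. The layout of `sample_lines` is an assumption for illustration and is not copied from the real file. `lengths()` with `regmatches()` gives the same total count as `l`, while the leading-tab count is one plausible way the nesting depth could be encoded.

```r
# Hypothetical lines; the real file's layout is an assumption here.
sample_lines <- c(
  "Allopathic & Osteopathic Physicians",
  "\tAllergy & Immunology\t207K00000X",
  "\t\tAllergy\t207KA0200X"
)

# Total tabs per line (the same quantity as `l` in the code above).
n_tabs <- lengths(regmatches(sample_lines, gregexpr("\t", sample_lines)))

# Leading tabs only: line length minus length after stripping leading tabs.
n_leading <- nchar(sample_lines) - nchar(sub("^\t+", "", sample_lines))

# How many lines sit at each total tab count.
print(table(n_tabs))
```

If the two counts differ by a constant on the real file, either one can be used to separate the levels.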

The problem is that Allergy & Immunology (in the first row below) is in the wrong place: it really should go in the last column.

6      2                Allergy & Immunology  207K00000X Allopathic & Osteopathic Physicians                 <NA>
7      3                             Allergy  207KA0200X Allopathic & Osteopathic Physicians Allergy & Immunology
8      3    Clinical & Laboratory Immunology  207KI0005X Allopathic & Osteopathic Physicians Allergy & Immunology
9      2                      Anesthesiology  207L00000X Allopathic & Osteopathic Physicians                 <NA>

In other words, the data should look like this:

LEVEL_1                              LEVEL_2              LEVEL_3                            TAXONOMY
Allopathic & Osteopathic Physicians  Allergy & Immunology                                    207K00000X
Allopathic & Osteopathic Physicians  Allergy & Immunology Allergy                            207KA0200X
Allopathic & Osteopathic Physicians  Allergy & Immunology Clinical & Laboratory Immunology   207KI0005X

How can I achieve this in R using regular expressions?
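
One possible direction, as a minimal sketch and not a tested solution for the real file: treat the number of leading tabs as the depth, pull the taxonomy code off the end of each line with a regular expression, and carry the most recent LEVEL_1 and LEVEL_2 names forward. The sample lines, the tab-separated layout, and the assumption that codes are ten characters ending in `X` are all illustrative guesses; the variable names are hypothetical.

```r
# Hypothetical sample; assumes leading tabs encode depth and the code ends the line.
z <- c(
  "Allopathic & Osteopathic Physicians",
  "\tAllergy & Immunology\t207K00000X",
  "\t\tAllergy\t207KA0200X",
  "\t\tClinical & Laboratory Immunology\t207KI0005X",
  "\tAnesthesiology\t207L00000X"
)

# Depth = number of leading tabs (0 = LEVEL_1, 1 = LEVEL_2, 2 = LEVEL_3).
depth <- nchar(z) - nchar(sub("^\t+", "", z))

# Taxonomy code: assumed to be the last ten characters, ending in "X".
has_code <- grepl("[0-9A-Z]{9}X$", z)
taxonomy <- ifelse(has_code, sub("^.*([0-9A-Z]{9}X)$", "\\1", z), NA_character_)

# Name: the line without the trailing code, separators, and surrounding whitespace.
name <- trimws(sub("[\t -]*[0-9A-Z]{9}X$", "", z))

# Walk the lines, remembering the current LEVEL_1 and LEVEL_2 names.
level_1 <- level_2 <- level_3 <- rep(NA_character_, length(z))
cur1 <- cur2 <- NA_character_
for (i in seq_along(z)) {
  if (depth[i] == 0) {
    cur1 <- name[i]
    cur2 <- NA_character_   # a new top-level group resets LEVEL_2
  } else if (depth[i] == 1) {
    cur2 <- name[i]
  }
  level_1[i] <- cur1
  level_2[i] <- cur2
  if (depth[i] >= 2) level_3[i] <- name[i]
}

out <- data.frame(LEVEL_1 = level_1, LEVEL_2 = level_2, LEVEL_3 = level_3,
                  TAXONOMY = taxonomy, stringsAsFactors = FALSE)

# Keep only the lines that carry a taxonomy code.
out <- out[!is.na(out$TAXONOMY), ]
print(out)
```

The explicit loop keeps the carry-forward logic easy to follow; if the real file separates names and codes differently, only the two regular expressions should need adjusting.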

0 answers:

No answers yet