Extracting strings of irregular length: an unknown number of authors from citations

Time: 2015-03-17 10:50:14

Tags: r extraction strsplit

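I downloaded the citations of 500 articles from Web of Science as a text file. Only the author column (AU) was read into R. That variable contains Author1 to AuthorN, separated by semicolons: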

  

Anselin, L; Fujita, M; Thisse, JF

I would like to extract Author1, Author2, Author3 ... AuthorN into separate columns. In my file there are up to 10 authors per record. In this example there are at most 7:

    # Sample of data
    data <- c("Anselin, L; Varga, A; Acs, Z",
    "Acs, ZJ; Anselin, L; Varga, A",
    "Anselin, L",
    "Fujita, M; Thisse, JF",
    "Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A; van der Straaten, J; Maltby, E; van Ierland, EC",
    "Talen, E; Anselin, L",
    "Irwin, EG; Bockstael, NE",
    "Leggett, CG; Bockstael, NE",
    "Guimaraes, P; Figueiredo, O; Woodward, D",
    "Halpern, Benjamin S.; McLeod, Karen L.; Rosenberg, Andrew A.; Crowder, Larry B.")

I have tried several approaches:

    # Method3 - Read table: fails, not the same number of elements per line
    Meth3 <- read.table(textConnection(data), sep=";", stringsAsFactors=FALSE)

    # Method2 - Separate into different columns: repeats the names
    Meth2 <- do.call(rbind,
                     strsplit(gsub(";",
                                   "\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", data),
                              "NONSENSESPLIT"))

    # Method5 - Split row entries, make an identifier and recombine them later: struggling to recombine
    Meth5 <- strsplit(data, ";")
    i <- 0
    id <- unlist( sapply( Meth5, function(r) rep(i<<-i+1, length(r) ) ) )
    x <- unlist(Meth5, recursive = FALSE )

    x <- list(do.call(rbind,
                      strsplit(gsub(";",
                                    "\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", x),
                               "NONSENSESPLIT")))
    require(data.table)
    data.table( ID=id, do.call(rbind,x))

    # Method6 - Identifies only the first author
    Meth6 <- gsub("[^a-zA-Z0-9 ]","",strsplit(data,"\\; ")[[1]][[1]])

Any suggestions on how to organize and identify Author1 ... AuthorN are welcome.
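
For reference, the split-and-recombine idea behind Method5 can be sketched in base R without a global counter. This is only an illustrative sketch, not part of the original attempts: the names `parts`, `long`, `wide` and `n_max` are made up here, the sample is shortened to three records, and `lengths()` assumes R 3.2.0 or later.

```r
# Shortened sample: authors separated by ";", a varying number per record
data <- c("Anselin, L; Varga, A; Acs, Z",
          "Anselin, L",
          "Fujita, M; Thisse, JF")

# Split each record on ";" together with any whitespace around it
parts <- strsplit(data, "\\s*;\\s*")

# Long format: one row per author, with the record ID and the author position
long <- data.frame(ID       = rep(seq_along(parts), lengths(parts)),
                   Position = sequence(lengths(parts)),
                   Author   = unlist(parts),
                   stringsAsFactors = FALSE)

# Wide format: indexing past the end of a vector yields NA, which pads
# every record up to the length of the longest one
n_max <- max(lengths(parts))
wide  <- do.call(rbind, lapply(parts, function(p) p[seq_len(n_max)]))
colnames(wide) <- paste0("Author", seq_len(n_max))
```

`long` keeps the identifier that Method5 was trying to build, while `wide` is a character matrix with one row per record and one column per author position.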

1 answer:

Answer 0 (score: 4):

read.csv supports this:

read.csv(text=data,header=FALSE,sep=";")
                     V1                   V2                    V3                 V4                   V5         V6               V7
1            Anselin, L             Varga, A                Acs, Z                                                                    
2               Acs, ZJ           Anselin, L              Varga, A                                                                    
3            Anselin, L                                                                                                               
4             Fujita, M           Thisse, JF                                                                                          
5            Turner, RK  van den Bergh, JCJM         Soderqvist, T      Barendregt, A  van der Straaten, J  Maltby, E  van Ierland, EC
6              Talen, E           Anselin, L                                                                                          
7             Irwin, EG        Bockstael, NE                                                                                          
8           Leggett, CG        Bockstael, NE                                                                                          
9          Guimaraes, P        Figueiredo, O           Woodward, D                                                                    
10 Halpern, Benjamin S.     McLeod, Karen L.  Rosenberg, Andrew A.  Crowder, Larry B.
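
One caveat that may matter for the full 500-record file: `read.csv` pads short rows because its default is `fill = TRUE`, but `read.table` and `read.csv` determine the number of columns from the first five lines of input unless `col.names` is supplied. In the sample above the longest record happens to be line 5. The sketch below assumes the longest record can appear anywhere, so it counts the fields first with `count.fields`; the sample is shortened, and the names `n_fields`, `max_authors` and `authors` are illustrative.

```r
# Shortened sample in which the longest record is on line 7, beyond the
# first five lines that read.csv inspects to decide the column count
data <- c("Anselin, L; Varga, A; Acs, Z",
          "Anselin, L",
          "Fujita, M; Thisse, JF",
          "Talen, E; Anselin, L",
          "Irwin, EG; Bockstael, NE",
          "Leggett, CG; Bockstael, NE",
          "Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A")

# Count the fields of every record, using the same quote and comment
# settings that read.csv uses by default
con <- textConnection(data)
n_fields <- count.fields(con, sep = ";", quote = "\"", comment.char = "")
close(con)
max_authors <- max(n_fields)

# Supplying col.names fixes the number of columns; strip.white removes the
# space after each ";" and na.strings = "" turns padded fields into NA
authors <- read.csv(text = data, header = FALSE, sep = ";",
                    col.names = paste0("Author", seq_len(max_authors)),
                    strip.white = TRUE, na.strings = "",
                    stringsAsFactors = FALSE)
```

With an actual file, the same two calls should work with the file path in place of the text connection and the `text` argument.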