R unnest_tokens and compute the position (start and end) of each token

Time: 2018-01-05 18:35:42

Tags: r string nlp emr tidytext

After using unnest_tokens, how can I get the position (start and end) of every token? Here is a simple example -

df <- data.frame(id = 1,
               doc=c("Patient:   [** Name **], [** Name **] Acct.#:         
[** Medical_Record_Number **]        MR #:     [** Medical_Record_Number **]
Location: [** Location **] "))

Using tidytext -

Tokenize by whitespace:

library(tidytext)
library(dplyr)   # for the %>% pipe

tokens_df <- df %>% 
  unnest_tokens(tokens, doc, token = stringr::str_split, 
                pattern = "\\s",
                to_lower = F, drop = F)

How do I get the positions of all the tokens?

id  tokens    start  end
 1  Patient:  1      8
 1            9      9
 1  [**       12     14
 1  Name      16     19

2 answers:

Answer 0 (score: 0)

Here is a non-tidy way to solve the problem.

library(stringr)    # str_extract_all(), str_locate_all()
library(magrittr)   # %>%

regex = "([^\\s]+)"
df_i  = str_extract_all(df$doc, regex)   # list of token vectors, one per document
df_ii = str_locate_all(df$doc, regex)    # list of start/end matrices, one per document

output1 = Map(function(x, y, z){
  # Guard against documents that contain no tokens at all
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, token = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind, .) %>%
  merge(df, .)
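
As a quick sanity check (my addition, not part of the answer above), every extracted token should match the substring of the document at its reported positions, since the merge keeps the doc column next to start and end:

# Minimal verification sketch: recover each token from its start/end positions
with(output1, all(substring(as.character(doc), start, end) == as.character(token),
                  na.rm = TRUE))
#> should return TRUE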

Answer 1 (score: 0)

I think the first answerer here has the right idea: the best approach is string handling, rather than tokenization and NLP, if tokens split on whitespace and character positions are the output you want.

If you do want to use tidy data principles and end up with a data frame, try something like this:

library(tidyverse)

df <- data_frame(id = 1,
                 doc = c("Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **] "))

df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)

#> # A tibble: 11 x 4
#>       id tokens                start   end
#>    <dbl> <chr>                 <int> <int>
#>  1  1.00 Patient:                  1     8
#>  2  1.00 [**                      12    14
#>  3  1.00 Name                     16    19
#>  4  1.00 **],                     21    24
#>  5  1.00 [**                      26    28
#>  6  1.00 Name                     30    33
#>  7  1.00 **]                      35    37
#>  8  1.00 Acct.#:                  39    45
#>  9  1.00 [**                      50    52
#> 10  1.00 Medical_Record_Number    54    74
#> 11  1.00 **]                      76    78

This will work for multiple documents with id columns for each string, and because of the way the regex is constructed, it removes the actual whitespace from the output.
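
One caveat worth adding (mine, not the answerer's): the multi-column unnest(tokens, locations) call above relies on the pre-1.0 tidyr interface, and data_frame() has since been renamed tibble(). On tidyr >= 1.0.0 the columns to unnest go through the cols argument; a minimal sketch of the same pipeline under that assumption:

df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         # as.data.frame() turns each start/end matrix into columns unnest() can expand
         locations = map(str_locate_all(doc, "([^\\s]+)"), as.data.frame)) %>%
  select(-doc) %>%
  unnest(cols = c(tokens, locations))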

EDITED: In a comment, the original poster asked for an approach that allows tokenizing by sentence while also tracking the position of each word. The code below does that, in the sense that we get the start and end position of each token within each sentence. Could you use a combination of the id column with the sentenceID, start, and end columns to find what you are looking for?

library(tidyverse)
library(tidytext)

james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)

d <- data_frame(txt = james)

d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)

#> # A tibble: 112 x 4
#>    sentenceID tokens   start   end
#>         <int> <chr>    <int> <int>
#>  1          1 the          1     3
#>  2          1 question     5    12
#>  3          1 thus        14    17
#>  4          1 becomes     19    25
#>  5          1 a           27    27
#>  6          1 verbal      29    34
#>  7          1 one         36    38
#>  8          1 again;      40    45
#>  9          1 and         47    49
#> 10          1 our         51    53
#> # ... with 102 more rows

Notice that these are not quite "tokenized" in the usual sense of unnest_tokens; they still have the closing punctuation, such as commas and periods, attached to each word. It looked like you wanted that from your original question.
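
If you did want that punctuation stripped while still tracking positions, one option (a sketch of my own, not from the answer; the [[:alnum:]_']+ pattern is an assumption about what should count as a word) is to extract with a narrower regex, so that str_locate_all reports the positions of the bare words:

d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         # extract only letters, digits, underscores, and apostrophes
         tokens = str_extract_all(sentence, "[[:alnum:]_']+"),
         locations = str_locate_all(sentence, "[[:alnum:]_']+"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)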