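I'm new to R.
I am using tidytext::unnest_tokens
as follows:
tidy_drugs <- drugstext.raw %>%
  unnest_tokens(sentence, Section, token = "sentences")
This gives me a data.frame in which every sentence becomes its own row.
I would also like to get the start and end position of each sentence within the long text.
Below is a sample of the long text. It comes from a drug label.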
6.1 Clinical Trial Experience
Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting.
The desired result is a data frame with three columns.
Answer 0 (score: 1)
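You can do this with str_locate from stringr. It is often fiddly, because newlines and special characters can break the regular expression you search with. Here we first remove the newlines from the input text with str_replace_all, then unnest the tokens, making sure to keep the original text (drop = FALSE) and to prevent the case from being changed (to_lower = FALSE). Then we build a new regex column in which the special characters (here (, ) and .) are replaced with properly escaped versions, and use str_locate to add the start and end position of each sentence.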
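The manual escaping described above covers only (, ) and .; any other regex metacharacter in the text (for example [ or ?) would need the same treatment. As a possible variation, stringr's fixed() asks str_locate to match the pattern literally, so no escaping is needed. A minimal sketch on a made-up two-sentence string (the text and variable names are illustrative, not taken from the question):

```r
library(stringr)

# Hypothetical input: the parentheses and periods are regex metacharacters.
section <- "Rates vary (see Study 1). Fatigue was common (Study 2)."
sentences <- c("Rates vary (see Study 1).", "Fatigue was common (Study 2).")

# fixed() makes str_locate() compare the pattern character by character
# instead of interpreting it as a regular expression.
loc <- str_locate(section, fixed(sentences))

positions <- data.frame(
  sentence = sentences,
  start = loc[, "start"],
  end = loc[, "end"],
  stringsAsFactors = FALSE
)
print(positions)
```

Like the regex version, this finds only the first occurrence of each sentence, so a sentence that appears twice would get the same positions both times.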
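I don't get the same numbers as you, but I copied the text from your post, which does not always preserve every character, and your final end number is smaller than its start number anyway.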
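Apart from copy-and-paste differences, the reported positions also depend on how line breaks are handled. Deleting a newline shortens the text, so every later position shifts left by one relative to the original; replacing it with a single space keeps the length, and therefore the offsets, unchanged. A small sketch with a hypothetical two-line string:

```r
library(stringr)

# Hypothetical two-line input (not the drug label text).
original <- "First sentence.\nSecond sentence."

deleted <- str_replace_all(original, "\\n", "")   # one character shorter
spaced  <- str_replace_all(original, "\\n", " ")  # same length as original

target <- fixed("Second sentence.")
start_original <- str_locate(original, target)[, "start"]
start_deleted  <- str_locate(deleted, target)[, "start"]
start_spaced   <- str_locate(spaced, target)[, "start"]

# Positions in `spaced` line up with `original`; positions in `deleted` do not.
print(c(original = start_original, deleted = start_deleted, spaced = start_spaced))
```

Which variant is appropriate depends on whether the positions should refer to the original file or to the cleaned text.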
library(tidyverse)
library(tidytext)
raw_text <- tibble(section = "6.1 Clinical Trial Experience
Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting."
)
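tidy_text <- raw_text %>%
  # Strip newlines so that no sentence spans a line break.
  # (Use " " as the replacement if the lines carry no surrounding whitespace,
  # otherwise adjacent sentences are glued together.)
  mutate(section = str_replace_all(section, "\\n", "")) %>%
  unnest_tokens(
    output = sentence,
    input = section,
    token = "sentences",
    drop = FALSE,      # keep the full text next to each sentence
    to_lower = FALSE   # keep the original case so the sentence still matches
  ) %>%
  # Escape the regex metacharacters that occur in this text: ( ) and .
  mutate(
    regex = str_replace_all(sentence, "\\(", "\\\\("),
    regex = str_replace_all(regex, "\\)", "\\\\)"),
    regex = str_replace_all(regex, "\\.", "\\\\.")
  ) %>%
  # Locate the first match of each sentence within the full text
  mutate(
    start = str_locate(section, regex)[, 1],
    end = str_locate(section, regex)[, 2]
  ) %>%
  select(sentence, start, end) %>%
  print()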
#> # A tibble: 3 x 3
#>   sentence                                                     start   end
#>   <chr>                                                        <int> <int>
#> 1 6.1 Clinical Trial Experience Because clinical trials are ~      1   290
#> 2 The data below reflect exposure to ARDECRETRIS as monothera~   310   626
#> 3 In Studies 1 and 2, the most common adverse reactions were ~   646   762
Created on 2018-02-23 by the reprex package (v0.2.0).
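Because copied text does not always preserve every character, it may be worth checking the positions by cutting each span back out of the full text and comparing it with the sentence. The sketch below does this with str_sub on a made-up string; the third "sentence" deliberately contains a double space that is not in the text, to show what a failed match looks like. All names and data here are hypothetical.

```r
library(stringr)

# Hypothetical input; the last sentence has an extra space and will not be found.
section <- "Nausea was reported. Cough was rare."
sentences <- c("Nausea was reported.", "Cough was rare.", "Cough  was rare.")

# Literal (non-regex) search; unmatched patterns yield NA positions.
loc <- str_locate(section, fixed(sentences))

# Round trip: extract each located span and compare it with the sentence.
recovered <- str_sub(section, loc[, "start"], loc[, "end"])
ok <- !is.na(loc[, "start"]) & recovered == sentences

check <- data.frame(
  sentence = sentences,
  start = loc[, "start"],
  end = loc[, "end"],
  ok = ok,
  stringsAsFactors = FALSE
)
print(check)
```

Rows where ok is FALSE point to sentences whose whitespace or characters differ from the text being searched.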