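There may already be posts on this topic, but I am not sure which terms to search for. I am trying to import data from a txt file with the following format (the first two lines are not meaningful):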
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Ahituv, Nadav
Zhu, Yiwen
Visel, Axel
Holt, Amy
Afzal, Veena
Pennacchio, Len A.
Rubin, Edward M.
TI Deletion of ultraconserved elements yields viable mice
SO PLOS BIOLOGY
VL 5
IS 9
BP 1906
EP 1911
AR e234
DI 10.1371/journal.pbio.0050234
PD SEP 2007
PY 2007
RI Visel, Axel/A-9398-2009; Ahituv, Nadav/; Pennacchio, Len/
OI Visel, Axel/0000-0002-4130-7784; Ahituv, Nadav/0000-0002-7434-8144;
Pennacchio, Len/0000-0002-8748-3732
SN 1544-9173
UT WOS:000249552300010
PM 17803355
ER
PT J
AU Ahmadiyeh, Nasim
Pomerantz, Mark M.
Grisanzio, Chiara
Herman, Paula
Jia, Li
Almendro, Vanessa
He, Housheng Hansen
Brown, Myles
Liu, X. Shirley
Davis, Matt
Caswell, Jennifer L.
Beckwith, Christine A.
Hills, Adam
MacConaill, Laura
Coetzee, Gerhard A.
Regan, Meredith M.
Freedman, Matthew L.
TI 8q24 prostate, breast, and colon cancer risk loci show tissue-specific
long-range interaction with MYC
SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF
AMERICA
VL 107
IS 21
BP 9742
EP 9746
DI 10.1073/pnas.0910668107
PD MAY 25 2010
PY 2010
RI Davis, Matt/F-9045-2012; He, Housheng/G-9614-2011; he, housheng hansen/; Caswell-Jin, Jennifer/; Brown, Myles/
OI he, housheng hansen/0000-0003-2898-3363; Caswell-Jin,
Jennifer/0000-0002-5711-8355; Brown, Myles/0000-0002-8213-1658
SN 0027-8424
UT WOS:000278054700049
PM 20453196
ER
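Because some categories (e.g. AU) hold multiple objects, I think I need to import the data as a list. Every category tag is 2 characters followed by a space, but some categories span multiple lines, and the continuation lines carry no tag. For some multi-line categories, such as AU, I want the data imported as a vector. For others, such as TI or SO, I want the lines joined into a single object of class character in the list.
I would like the entries to look like this: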
print(<portion of list that corresponds to AU for first reference>)
[AU]
[[1]] "Ahituv, Nadav" "Zhu, Yiwen" "Visel, Axel" "Holt, Amy" "Afzal, Veena"
[[6]] "Pennacchio, Len A." "Rubin, Edward M."
print(<portion of list that corresponds to TI and SO for second reference>)
[TI]
[[1]] "8q24 prostate, breast, and colon cancer risk loci show tissue-specific long-range interaction with MYC"
[SO]
[[1]] "PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA"
I tried using scan() with the following code:
scan("savedrecs_spitz refs.txt", what = "character", sep = "\n")
However, this reads the file into a single character vector, with each line of the txt file as a separate element:
[1] "FN Clarivate Analytics Web of Science" "VR 1.0"
[3] "PT J" "AU Ahituv, Nadav"
[5] " Zhu, Yiwen" " Visel, Axel"
Should I be using a different function to read this data?
Answer 0: (score: 0)
Is this what you are looking for?
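# Read the file into a character vector, one element per line
dt <- scan("savedrecs_spitz refs.txt", what = "character", sep = "\n")
mgrep <- function(dt) {
  # Tagged lines start with a two-character field tag ("AU ", "TI ", "ER", ...)
  v <- grep("^[A-Z][A-Z0-9]( |$)", dt)
  ret <- vector("list", length(v))
  for (i in seq_along(v)) {
    # A field runs from its tag line to the line before the next tag
    end <- if (i == length(v)) length(dt) else v[i + 1] - 1
    field <- dt[v[i]:end]
    field[1] <- substring(field[1], 4)  # drop the tag and the space after it
    ret[[i]] <- trimws(field)           # strip the indent of continuation lines
  }
  names(ret) <- substring(dt[v], 1, 2)
  ret
}
mgrep(dt)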
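P.S. Note that some special characters may be read incorrectly, for example in the "FN" line, and such lines will not be handled properly inside the function.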
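The grouping the question asks for (authors as a vector, multi-line titles joined into one string, one list per reference) could be sketched roughly as follows. This is only a sketch under assumptions: `parse_wos` and `collapse_tags` are made-up names, the sample lines are abbreviated from the question, continuation lines are assumed to be indented, and the first line is assumed to carry a tag.

```r
# Abbreviated sample in the tagged format from the question (continuation lines indented)
lines <- c(
  "FN Clarivate Analytics Web of Science",
  "VR 1.0",
  "PT J",
  "AU Ahituv, Nadav",
  "   Zhu, Yiwen",
  "TI Deletion of ultraconserved elements yields viable mice",
  "SO PLOS BIOLOGY",
  "ER",
  "PT J",
  "AU Ahmadiyeh, Nasim",
  "   Pomerantz, Mark M.",
  "TI 8q24 prostate, breast, and colon cancer risk loci show tissue-specific",
  "   long-range interaction with MYC",
  "SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF",
  "   AMERICA",
  "ER"
)

# parse_wos is a hypothetical helper, not a function from any package
parse_wos <- function(lines, collapse_tags = c("TI", "SO")) {
  lines <- lines[nzchar(trimws(lines))]                 # drop empty lines
  is_tag <- grepl("^[A-Z][A-Z0-9]( |$)", lines)         # lines that start a field
  stopifnot(is_tag[1])                                  # assumption: file starts with a tag
  # Carry each tag forward onto the continuation lines that follow it
  tag <- substring(lines, 1, 2)[is_tag][cumsum(is_tag)]
  value <- trimws(ifelse(is_tag, substring(lines, 4), lines))
  # Record number = how many "ER" lines appear before the current line
  record <- cumsum(c(0, head(tag == "ER", -1)))
  # File header (FN, VR) and terminators (ER, EF) are not part of a reference
  keep <- !(tag %in% c("FN", "VR", "ER", "EF"))
  rows <- data.frame(tag = tag[keep], value = value[keep],
                     stringsAsFactors = FALSE)
  records <- split(rows, record[keep])
  unname(lapply(records, function(rec) {
    # One character vector per tag, in order of first appearance
    fields <- split(rec$value, factor(rec$tag, levels = unique(rec$tag)))
    for (tg in intersect(collapse_tags, names(fields))) {
      fields[[tg]] <- paste(fields[[tg]], collapse = " ")  # join wrapped lines
    }
    fields
  }))
}

refs <- parse_wos(lines)
print(refs[[1]]$AU)
print(refs[[2]]$TI)
```

With a real export, `lines` would come from `scan(..., sep = "\n")` or `readLines()` instead of the inline vector. Tags listed in `collapse_tags` become a single string; every other tag keeps one element per line.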
Answer 1: (score: 0)
I think I have solved your problem, although I store the data in a data.frame.
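text <- scan("text.txt", sep = "\n", what = "character")
# Keep tagged lines (upper-case start) and continuation lines (leading blank)
textLoop <- grep("^[[:upper:]]|^[[:blank:]]", text, value = TRUE)
for (i in seq_along(textLoop)) {
  # A continuation line inherits the tag of the line before it
  if (i > 1 && grepl("^[[:blank:]]", textLoop[i])) {
    partOne <- substring(textLoop[i - 1], 1, 2)
    textLoop[i] <- paste(partOne, trimws(textLoop[i]))
  }
}
# One row per line: the two-character tag and the text that follows it
textDf <- data.frame(partOne = substring(textLoop, 1, 2),
                     partTwo = substring(textLoop, 4),
                     stringsAsFactors = FALSE)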
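A long data.frame like this does not yet say which reference each row belongs to. One possible next step, sketched below, is to number the records by counting "ER" lines and then pull out fields per record. The small `textDf` here is a hand-built stand-in with trimmed `partTwo` values, and the names `record`, `authors` and `titles` are illustrative assumptions.

```r
# Hand-built stand-in for textDf (columns partOne = tag, partTwo = text)
textDf <- data.frame(
  partOne = c("PT", "AU", "AU", "TI", "ER",
              "PT", "AU", "TI", "TI", "ER"),
  partTwo = c("J", "Ahituv, Nadav", "Zhu, Yiwen",
              "Deletion of ultraconserved elements yields viable mice", "",
              "J", "Ahmadiyeh, Nasim",
              "8q24 prostate, breast, and colon cancer risk loci show tissue-specific",
              "long-range interaction with MYC", ""),
  stringsAsFactors = FALSE
)

# A new record starts on the row after each "ER"
isEnd <- textDf$partOne == "ER"
textDf$record <- cumsum(c(0, head(isEnd, -1))) + 1

# Authors: one character vector per record
au <- textDf[textDf$partOne == "AU", ]
authors <- split(au$partTwo, au$record)

# Titles: wrapped lines joined into one string per record
ti <- textDf[textDf$partOne == "TI", ]
titles <- vapply(split(ti$partTwo, ti$record), paste, character(1),
                 collapse = " ")

print(authors[["1"]])
print(titles[["2"]])
```

The same `split()` plus `paste(collapse = " ")` pattern applies to SO or any other field that wraps across lines.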