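So my situation is this: I have a list of files from a physical chemistry dataset that I created through many calculations, and I would like to run a foreach or while loop over the column named Files in my data frame, which is called CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.
My file names look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", etc.
I want to run a while or foreach loop over my "Files" column that checks whether the word "GLU" or "ASP" is present and, if I find "GLU" or "ASP" in a file name, prints it to a list.
So for the files above, the printed order would be "GLU", "ASP", "GLU", "ASP". Again, my files are not sorted in any particular way, and this goes on through all 1273 of my file entries. I could then save this list, put it into my data frame under the column header "Residues", and do some useful exploratory data analysis.
Note: ASP stands for the amino acid aspartic acid, and GLU for the amino acid glutamic acid.
I know I can do a regular expression search with grep to get the entries in the "Files" column, like this.
Searching for "ASP":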
> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-198-A_ASP-197-A.log:"
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"
[4] "1EU1A_TRP-32-A_ASP-33-A.log:"
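As you can see, I get some matches. In fact, I get 683 matches. But that is not good enough. I need to know where each match occurs, row by row, not just that matches exist.
Of course, I can do the same for "GLU":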
> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-16-A_GLU-9-A.log:"
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"
I get a whole bunch of matches!
I tried a for loop. Of course it failed!
> for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("ASP")}
else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("GLU")}}
All this does is print:
[1] "ASP"
[1] "ASP"
[1] "ASP"
...
Even when there is a "GLU"!
I mean, I can write basic arithmetic loops that are of no use to anyone:
> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16
Anyway, I checked the warnings to see what went wrong:
> warnings()
Warning messages:
1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
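As you can see, I get the same warning over and over. I suppose that makes sense, since this is a loop. But why does it happen, and why can't I grep inside a loop?
The data frame I want to parse looks like this: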
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437
where commas separate the columns.
This is the result I want:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms", "Residue",
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896, "GLU",
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204, "ASP",
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897, "GLU",
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956, "ASP",
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145, "GLU",
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058, "GLU",
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436, "GLU",
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437, "GLU",
...
Any help is appreciated! Thanks!
Answer 0 (score: 2)
We can split the dataset into a list of data.frames, using the substring derived with sub as the grouping key:
lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))
Data:
df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:",
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:",
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:",
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197,
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055,
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1",
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896,
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145,
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X",
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))
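The same sub() expression can also add the residue as a column directly, which is closer to the layout the question asks for. This is only a minimal sketch on a three-row stand-in for df1 (the shortened data frame and the counts variable are illustrative, not part of the original answer):

```r
# Three-row stand-in for df1, holding only the Files column.
df1 <- data.frame(
  Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
            "1AH7A_TRP-198-A_ASP-197-A.log:",
            "1BGFA_TRP-43-A_GLU-44-A.log:"),
  stringsAsFactors = FALSE
)

# ".*_" is greedy, so it consumes everything up to the last underscore;
# the capture group then holds the three capital letters before the next hyphen.
df1$Residue <- sub(".*_([A-Z]{3})-.*", "\\1", df1$Files)

# Count how often each residue occurs.
counts <- table(df1$Residue)

print(df1)
print(counts)
```

Because sub() is vectorised, no explicit loop over the rows is needed.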
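Answer 1 (score: 1)
I'm not sure I fully understood your question, but suppose your data is in a data frame named "dat" (with rows containing GLU and ASP). Use the approach below to build a field that holds either "ASP" or "GLU".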
library(stringr)
newvar <- NULL
newvar$GLU <- str_extract(dat$Files,"(GLU)")
newvar$ASP <- str_extract(dat$Files,"(ASP)")
newvar1 <- data.frame(newvar)
newvar1
library(tidyr)
newvar1[is.na(newvar1)] = ""
new <- unite(newvar1, new, GLU:ASP, sep='')
dat$new <- new$new  # unite() returns a data frame, so take its new column
The field named new will then contain your GLU and ASP values.
Result:
dat
X Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1 1AH7A_TRP-16-A_GLU-9-A.log: -8.497878 CD1 4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log: -7.926482 CD1 3.543075 ASP
3 3 1BGFA_TRP-43-A_GLU-44-A.log: -6.735078 CD1 4.171795 GLU
4 4 1CXQA_TRP-61-A_ASP-82-A.log: -9.398872 CD1 5.298973 ASP
5 5 1D8WA_TRP-17-A_GLU-14-A.log: -9.747203 CD1 3.693986 GLU
6 6 1D8WA_TRP-17-A_GLU-18-A.log: -11.323520 CD1 3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log: -7.468913 CD1 5.411084 GLU
8 8 1E58A_TRP-15-A_GLU-18-A.log: -6.598308 CD1 4.797902 GLU
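If adding stringr and tidyr is not desirable, a single alternation pattern in base R may give the same column in one pass. The sketch below is an assumption-laden alternative, not the answer's method; the third file name is a hypothetical non-matching entry added only to show how missing matches are handled:

```r
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1XYZA_TRP-1-A_LYS-2-A.log:")   # hypothetical entry with no GLU/ASP

# regexpr() gives the position of the first match per element, or -1 if none.
m <- regexpr("GLU|ASP", files)

# regmatches() drops the non-matching elements, so fill a vector of NAs
# only at the positions where a match was found.
residue <- rep(NA_character_, length(files))
residue[m > 0] <- regmatches(files, m)

print(residue)
```

Keeping the NA entries means the result has one value per row and can be assigned straight to a data frame column.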
Answer 2 (score: 1)
After a long time I figured out a solution to my problem:
library(stringr)  # provides str_split_fixed()
# Save my column as a vector because factors are making the world burn:
Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)
# Split the Files into three parts along the two underscores, and save it back to my vector, preserving the third cut around the underscore.
Files <- str_split_fixed(Files, "_", 3)[,3]
Result:
[1] "GLU-9-A.log:"
"ASP-197-A.log:"
etc ...
# Split those results along the hyphens, and take what's next to the first hyphen or the first cut:
Residues <- str_split_fixed(Files, "-", 3)[,1]
> Residues
[1] "GLU" "ASP" "GLU", ...
# Add the Residues vector to my data.frame as the Residue column:
CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residue <- Residues
I guess the grep function is overrated. I had to look hard for this function.
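For comparison, the same two-step split can be sketched with base R's strsplit(), assuming every file name contains exactly two underscores as in the examples above (the files vector here is a short stand-in for the real column):

```r
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:")

# Step 1: split on underscores and keep the third piece, e.g. "GLU-9-A.log:".
third <- vapply(strsplit(files, "_", fixed = TRUE),
                function(p) p[3], character(1))

# Step 2: split that piece on hyphens and keep the first part, e.g. "GLU".
residues <- vapply(strsplit(third, "-", fixed = TRUE),
                   function(p) p[1], character(1))

print(residues)
```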
Answer 3 (score: 0)
Suppose you saved the data you are trying to parse in the file glu_vs_asp.csv.
Below is an example of how to create two data frames, one for GLU and one for ASP:
# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)
# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]
dt_asp <- dt[grep("ASP", dt$Files),]
To create a data frame containing both GLU and ASP, you could try the following:
dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]
The commands grep("ASP", dt$Files) and grep("GLU", dt$Files) give you the indices of the rows that contain 'ASP' and 'GLU', respectively, in the Files column.
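Because grep() returns a vector of matching positions for the whole column, using it directly as an if() condition is what produces the "condition has length > 1" warning shown in the question (from R 4.2.0 onward this is an error rather than a warning). A minimal sketch of a per-row alternative, assuming the same file name pattern, uses grepl(), which returns one TRUE/FALSE per element, together with ifelse():

```r
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:")

# grep() returns the positions of the matches, not one value per element.
asp_idx <- grep("ASP", files)

# grepl() returns one logical per element, so ifelse() can label each row.
residue <- ifelse(grepl("ASP", files), "ASP",
                  ifelse(grepl("GLU", files), "GLU", NA_character_))

print(asp_idx)
print(residue)
```

The resulting vector has one entry per file name, so it can be assigned to a new column such as dt$Residue without a loop.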