在Neo4j中创造代谢途径

时间:2018-04-05 22:13:13

标签: neo4j cypher

我试图使用这些数据在Neo4j中创建此问题底部图像中显示的糖酵解途径:

glycolysis_bioentities.csv

name
α-D-glucose
glucose 6-phosphate
fructose 6-phosphate
"fructose 1,6-bisphosphate"
dihydroxyacetone phosphate
D-glyceraldehyde 3-phosphate
"1,3-bisphosphoglycerate"
3-phosphoglycerate
2-phosphoglycerate
phosphoenolpyruvate
pyruvate
hexokinase
glucose-6-phosphatase
phosphoglucose isomerase
phosphofructokinase
"fructose-bisphosphate aldolase, class I"
triosephosphate isomerase (TIM)
glyceraldehyde-3-phosphate dehydrogenase
phosphoglycerate kinase
phosphoglycerate mutase
enolase
pyruvate kinase

glycolysis_relations.csv

source,relation,target
α-D-glucose,substrate_of,hexokinase
hexokinase,yields,glucose 6-phosphate
glucose 6-phosphate,substrate_of,glucose-6-phosphatase
glucose-6-phosphatase,yields,α-D-glucose
glucose 6-phosphate,substrate_of,phosphoglucose isomerase
phosphoglucose isomerase,yields,fructose 6-phosphate
fructose 6-phosphate,substrate_of,phosphofructokinase
phosphofructokinase,yields,"fructose 1,6-bisphosphate"
"fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
"fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
"1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
phosphoglycerate kinase,yields,3-phosphoglycerate
3-phosphoglycerate,substrate_of,phosphoglycerate mutase
phosphoglycerate mutase,yields,2-phosphoglycerate
2-phosphoglycerate,substrate_of,enolase
enolase,yields,phosphoenolpyruvate
phosphoenolpyruvate,substrate_of,pyruvate kinase
pyruvate kinase,yields,pyruvate

到目前为止,这就是我所拥有的

enter image description here

...使用此密码(传递给Cyclicypher-shell):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {source: row.source})
MERGE (r:Glycolysis {relation: row.relation})
MERGE (t:Glycolysis {target: row.target})
FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
  MERGE (s)-[r:substrate_of]->(t)
)
FOREACH (x in case row.relation when "yields" then [1] else [] end |
  MERGE (s)-[r:yields]->(t)
  );

我想创建一个完全连接的路径,所有节点都有字幕。建议?

enter image description here

3 个答案:

答案 0 :(得分:3)

[增订]

存在多个问题和可能的改进:

  1. 应删除第二个MERGE,因为它会创建孤立的节点。不应将关系类型调整为Glycolysis节点,并且此类节点永远不会连接到任何其他节点。
  2. 第一个和第三个MERGE子句必须对源节点和目标节点使用相同的属性名称(例如,name),否则相同的化学品最终会有2个节点(具有不同的属性键) )。这就是为什么你最终得到了没有所有预期连接的节点。
  3. APOC程序apoc.cypher.doIt可用于简化与动态名称的MERGE关系。
  4. 此用例不需要glycolysis_bioentities.csv
  5. 通过上述更改,您最终得到类似的内容,这将生成与输入数据匹配的连接图:

    LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
    MERGE (s:Glycolysis {name: row.source})
    MERGE (t:Glycolysis {name: row.target})
    WITH s, t, row
    CALL apoc.cypher.doIt(
      'MERGE (s)-[r:' + row.relation + ']->(t)',
      {s:s, t:t}) YIELD value
    RETURN 1;
    

答案 1 :(得分:3)

@ cybersam的答案非常好,提供了最优雅的解决方案(再一次:谢谢!) - 请提供接受的答案。

由于这个问题/答案/话题很可能引起其他人的兴趣,我想提一下我的代码(基于这个SO线程,How to specify relationship type in CSV?,并根据@cybersam提供的提示进行修改)有效,并显示结果:

解决方案1(我原来的帖子,已更新):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {name:row.source})
MERGE (t:Glycolysis {name:row.target})
FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
  MERGE (s)-[r:substrate_of]->(t)
)
FOREACH (x in case row.relation when "yields" then [1] else [] end |
  MERGE (s)-[r:yields]->(t)
  );

解决方案2(cybersam' s,更新):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Metabolism:Glycolysis {name: row.source})
MERGE (t:Metabolism:Glycolysis {name: row.target})
WITH s, t, row
  // "Bug" -- additional duplicate relations with each iteration of this statement/script:
  // CALL apoc.create.relationship(s, row.relation, {}, t) YIELD rel
  // Solution: 
  // https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/271
  // https://stackoverflow.com/questions/47808421/neo4j-load-csv-to-create-dynamic-relationship-types
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
RETURN COUNT(*);

两种解决方案都生成相同的图表,如下所示。 :-D

neo4j_glycolytc_pathway

答案 2 :(得分:0)

如果允许的话,我想再发布一个后续答案 - 我的理由是目前在Neo4j中重建代谢途径的情况很少,以下将提供完整的摘要 StackOverflow标题/主题,“在Neo4j中创建代谢途径”。

与上面的糖酵解途径一样,我在Neo4j中重新创建 TCA 柠檬酸循环 | Kreb's cycle )途径:

TCA cycle

[TCA周期图像来源:https://metabolicpathways.stanford.edu/]

在创建我的TCA途径图期间出现的一个问题是,其中一个节点(酶,“aconitase”)被使用了两次,因此在图创建期间data = [ 11 76 25 44 55 78; 11 75 25 44 55 75; 11 75 25 44 55 75; 11 75 25 44 55 75; 11 75 25 44 55 75; 11 0 25 44 55 0; 11 0 25 44 55 0; 11 0 25 44 55 88; 11 0 25 44 55 0; 11 0 25 44 55 0; ]; for row = 1:5:size(data, 1) fprintf('Row %d - %d\n', row, row+4); indices = find(data(row:row+4,:) > 75); if ~isempty(indices) [~, cols] = ind2sub([5 size(data, 2)], indices); col_min = min(cols); col_max = max(cols); fprintf('Column: %d and %d\n', col_min, col_max); end end 合并了公共节点{{ 1}}作为单个实体,产生这种布局,

aconitase - 'incorrect' layout

......根据需要不是这个,

aconitase - 'correct' layout

我对该问题的解决方案是使用节点属性创建“TCA图”,暂时差异标记受影响的源节点和目标节点(稍后在正确创建图形后删除这些标记)。

我还添加了MERGE标签,以便我可以根据需要选择单个路径(aconitase | :Metabolism)或完整代谢网络(:Glycolysis

最后,我需要通过它们的公共节点:TCA连接两个路径(:Metabolism | :Glycolysis),我可以通过APOC过程(此处,附加)到我的:TCA(Cypher)剧本的末尾。

以下是我的CSV数据文件,* .cql Cypher脚本,脚本执行以及结果图。

<强> glycolysis.csv:

pyruvate

<强> tca.csv:

glycolysis.cql
“tsv.csv”中的“tag1”和“tag”2用于唯一的源节点和目标节点,当它们通过“tca.cql”脚本创建时:

<强> tca.cql:

source,relation,target
α-D-glucose,substrate_of,hexokinase
hexokinase,yields,glucose 6-phosphate
glucose 6-phosphate,substrate_of,glucose-6-phosphatase
glucose-6-phosphatase,yields,α-D-glucose
glucose 6-phosphate,substrate_of,phosphoglucose isomerase
phosphoglucose isomerase,yields,fructose 6-phosphate
fructose 6-phosphate,substrate_of,phosphofructokinase
phosphofructokinase,yields,"fructose 1,6-bisphosphate"
"fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
"fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
"1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
phosphoglycerate kinase,yields,3-phosphoglycerate
3-phosphoglycerate,substrate_of,phosphoglycerate mutase
phosphoglycerate mutase,yields,2-phosphoglycerate
2-phosphoglycerate,substrate_of,enolase
enolase,yields,phosphoenolpyruvate
phosphoenolpyruvate,substrate_of,pyruvate kinase
pyruvate kinase,yields,pyruvate

<强> glycolysis.cql:

source,relation,target,tag1,tag2
pyruvate,substrate_of,pyruvate dehydrogenase,,
pyruvate dehydrogenase,yields,acetyl CoA,,
acetyl CoA,substrate_of,citrate synthase,,
oxaloacetate,substrate_of,citrate synthase,,
citrate synthase,yields,citrate,,
citrate,substrate_of,aconitase,,1
aconitase,yields,cis-aconitate,1,
cis-aconitate,substrate_of,aconitase,,2
aconitase,yields,isocitrate,2,
isocitrate,substrate_of,isocitrate dehydrogenase,,
isocitrate dehydrogenase,yields,α-ketoglutarate,,
α-ketoglutarate,substrate_of,α-ketoglutarate dehydrogenase,,
α-ketoglutarate dehydrogenase,yields,succinyl-CoA,,
succinyl-CoA,substrate_of,succinyl-CoA synthetase,,
succinyl-CoA synthetase,yields,succinate,,
succinate,substrate_of,succinate dehydrogenase,,
succinate dehydrogenase,yields,fumarate,,
fumarate,substrate_of,fumarase,,
fumarase,yields,S-malate,,
S-malate,substrate_of,malate dehydrogenase,,
malate dehydrogenase,yields,oxaloacetate,,

脚本执行:

// CREATE INDICES:
CREATE INDEX ON :Metabolism(name);
CREATE INDEX ON :TCA(name);

// CREATE GRAPH:
// USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/tca.csv" AS row
MERGE (s:Metabolism:TCA {name: row.source, tag:COALESCE(row.tag1, '')})
MERGE (t:Metabolism:TCA {name: row.target, tag:COALESCE(row.tag2, '')})
WITH s, t, row
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
  REMOVE s.tag, t.tag
RETURN COUNT(*);

Neo4j图表(// CREATE INDICES: CREATE INDEX ON :Metabolism(name); CREATE INDEX ON :Glycolysis(name); // CREATE GRAPH: //USING PERIODIC COMMIT 5000 LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/glycolysis.csv" AS row MERGE (s:Metabolism:Glycolysis {name: row.source}) MERGE (t:Metabolism:Glycolysis {name: row.target}) WITH s, t, row CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel RETURN COUNT(*); // MERGE COMMON NODE (GLYCOLYSIS: PYRUVATE; TCA: PYRUVATE): // As presented, run "tca.cql" first, then "glycolysis.cql" MATCH (g:Glycolysis), (t:TCA) WHERE g.name = t.name CALL apoc.refactor.mergeNodes([g,t]) YIELD node RETURN node; 查看):

Neo4j Browser: Glycolysis + TCA metabolic psathways