Consider that I have a graph with the following nodes and relationships:
(:GPE)-[:contains]->(:PE)-[:has]->(:E)
with the following properties:
GPE: joinProp1, outProp1, randomProps
PE : joinProp2, outProp2, randomProps
E : joinProp3, outProp3, randomProps
Now consider that I have a csv in the following format:
joinCol1, joinCol2, joinCol3, outCol1, outCol2, outCol3, randomProps
Now consider that I have a million rows in this csv file. Additionally, I have a million instances of each of (:GPE), (:PE), and (:E) in the graph. I want to merge the graph and the csv into a new csv. To do that, I want to map/join each row of the csv as follows (pseudo-cypher):
MATCH (gpe:GPE {joinProp1:joinCol1})-[:contains]->(pe:PE {joinProp2:joinCol2})-[:has]->(e:E {joinProp3:joinCol3}) RETURN gpe.outProp1, pe.outProp2, e.outProp3
So the output csv format would be:
joinCol1, joinCol2, joinCol3, outCol1, outCol2, outCol3, outProp1, outProp2, outProp3
If I create indexes on all the joinProps and use parameterized cypher (given that I implement this simple logic using the Java API), what is a rough minimum execution time estimate (minutes or hours) for this task? I just want a ballpark figure. We implemented a similar (probably unoptimized) task and it took several hours to complete. The challenge is to cut down the execution time. What can I do to optimize it and bring the execution time down to minutes? Any quick optimization pointers/links? Would using some approach other than the Java API give a performance improvement?
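For reference, this is roughly the shape of the row-at-a-time lookup I mean (a sketch only; the join values, credentials, and class name are illustrative, not from our actual implementation):

import java.util.HashMap;
import java.util.Map;
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;

public class PerRowLookup {
    public static void main(String[] args) {
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
        Session session = driver.session();
        // illustrative join values; in the real task these come from one csv row
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("j1", "joinCol1-value");
        params.put("j2", "joinCol2-value");
        params.put("j3", "joinCol3-value");
        // one parameterized cypher call per csv row
        StatementResult res = session.run(
                "MATCH (gpe:GPE {joinProp1:{j1}})-[:contains]->(pe:PE {joinProp2:{j2}})-[:has]->(e:E {joinProp3:{j3}})"
                + " RETURN gpe.outProp1 AS p1, pe.outProp2 AS p2, e.outProp3 AS p3", params);
        while (res.hasNext()) {
            Record r = res.next();
            System.out.println(r.get("p1") + "," + r.get("p2") + "," + r.get("p3"));
        }
        session.close();
        driver.close();
    }
}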
Answer 0 (score: 0)
I tried a few things which improved the performance considerably.
Some neo4j performance guidelines that worked for my scenario:
1. Batching: avoid making a cypher call (over the bolt API) for every csv row. Iterate over a fixed number of csv rows at a time to build a list of maps, where each map represents one csv row. Then pass this list of maps as a parameter to a single cypher statement, UNWIND the list inside cypher, and perform the required operations. Repeat for the next set of csv rows (see the Cypher sketch after this list).
2. Do not return node/relationship objects from cypher to the Java side. Instead, try to return a list of maps that already has the shape of the final output. When we return a list of nodes/relationships, we have to iterate over them again on the Java side to merge their properties with the csv columns into the final output rows (or maps).
3. Pass the csv column values into cypher: to achieve point 2, send the csv column values (the ones to be merged with the graph properties) into cypher. Do the MATCH inside cypher and build the output map there by combining the matched nodes' properties with the input csv columns.
4. Index the node/relationship properties being matched on (Official docs).
5. Use parameterized cypher (API Example, Official docs).
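For illustration, here is the batching pattern from point 1 as a standalone Cypher statement (a sketch reusing the entity names from my experiment below; {rows} is the list-of-maps parameter and the returned keys are illustrative):

UNWIND {rows} AS rowMap
MATCH (gppe:GPPocEntity {id: rowMap.csvid1})-[:has]->(ppe:PPocEntity {id: rowMap.csvid2})-[:contains]->(pe:PocEntity {id: rowMap.csvid3})
RETURN {id1: gppe.id, gppeprop1: gppe.gppe_out_prop_1, outcol1: rowMap.outcol1} AS outRow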
I did a quick and dirty experiment, explained below.
My input csv looks like this:
inputDataCsv.csv
csv-id1,csv-id2,csv-id3,outcol1,outcol2,outcol3
gppe0,ppe1,pe2,outcol1-val-0,outcol2-val-1,outcol3-val-2
gppe3,ppe4,pe5,outcol1-val-3,outcol2-val-4,outcol3-val-5
gppe6,ppe7,pe8,outcol1-val-6,outcol2-val-7,outcol3-val-8
...
To create the graph, I first created csvs of the following form:
gppe.csv (entities)
gppe0,gppe_out_prop_1_val_0,gppe_out_prop_2_val_0,gppe_prop_X_val_0
gppe3,gppe_out_prop_1_val_3,gppe_out_prop_2_val_3,gppe_prop_X_val_3
gppe6,gppe_out_prop_1_val_6,gppe_out_prop_2_val_6,gppe_prop_X_val_6
...
ppe.csv (entities)
ppe1,ppe_out_prop_1_val_1,ppe_out_prop_2_val_1,ppe_prop_X_val_1
ppe4,ppe_out_prop_1_val_4,ppe_out_prop_2_val_4,ppe_prop_X_val_4
ppe7,ppe_out_prop_1_val_7,ppe_out_prop_2_val_7,ppe_prop_X_val_7
...
pe.csv (entities)
pe2,pe_out_prop_1_val_2,pe_out_prop_2_val_2,pe_prop_X_val_2
pe5,pe_out_prop_1_val_5,pe_out_prop_2_val_5,pe_prop_X_val_5
pe8,pe_out_prop_1_val_8,pe_out_prop_2_val_8,pe_prop_X_val_8
...
gppeHasPpe.csv (relationships)
gppe0,ppe1
gppe3,ppe4
gppe6,ppe7
...
ppeContainsPe.csv (relationships)
ppe1,pe2
ppe4,pe5
ppe7,pe8
...
I loaded these into neo4j as follows:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///gppe.csv' AS line
CREATE (:GPPocEntity {id: line[0], gppe_out_prop_1: line[1], gppe_out_prop_2: line[2], gppe_out_prop_X: line[3]})

USING PERIODIC COMMIT
LOAD CSV FROM 'file:///ppe.csv' AS line
CREATE (:PPocEntity {id: line[0], ppe_out_prop_1: line[1], ppe_out_prop_2: line[2], ppe_out_prop_X: line[3]})

USING PERIODIC COMMIT
LOAD CSV FROM 'file:///pe.csv' AS line
CREATE (:PocEntity {id: line[0], pe_out_prop_1: line[1], pe_out_prop_2: line[2], pe_out_prop_X: line[3]})

USING PERIODIC COMMIT
LOAD CSV FROM 'file:///gppeHasPpe.csv' AS line
MATCH (gppe:GPPocEntity {id: line[0]})
MATCH (ppe:PPocEntity {id: line[1]})
MERGE (gppe)-[:has]->(ppe)

USING PERIODIC COMMIT
LOAD CSV FROM 'file:///ppeContainsPe.csv' AS line
MATCH (ppe:PPocEntity {id: line[0]})
MATCH (pe:PocEntity {id: line[1]})
MERGE (ppe)-[:contains]->(pe)
Next, I created indexes on the lookup properties:
CREATE INDEX ON :GPPocEntity(id)
CREATE INDEX ON :PPocEntity(id)
CREATE INDEX ON :PocEntity(id)
Below is a utility class that reads the column/property mapping configuration (the CsvReader class that turns the input csv into a list of maps is sketched after the merge class):
package csv2csv;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Reads the JSON configuration describing how csv columns map to graph
// node properties and which columns should appear in the output csv.
public class Config {

    String configFilePath;
    Map csvColumnToGraphNodeMapping;
    List<Map> mappingsGraphRelations;
    Map<String, Map<String, String>> mappingsGraphRelationsMap = new HashMap<String, Map<String, String>>();
    List<String> outputColumnsFromCsv;
    Map outputColumnsFromGraph;

    public Config(String pConfigFilePath) {
        configFilePath = pConfigFilePath;
        JSONParser parser = new JSONParser();
        try {
            Object obj = parser.parse(new FileReader(configFilePath));
            JSONObject jsonObject = (JSONObject) obj;
            // csv column -> graph node property mappings
            csvColumnToGraphNodeMapping = (HashMap) ((HashMap) jsonObject.get("csvColumn-graphNodeProperty-mapping"))
                    .get("mappings");
            // relationships between the mapped entities
            mappingsGraphRelations = (ArrayList) ((HashMap) jsonObject.get("csvColumn-graphNodeProperty-mapping"))
                    .get("mappings-graph-relations");
            // index the relation mappings by "<start>-<end>" for quick lookup
            for (Map m : mappingsGraphRelations) {
                mappingsGraphRelationsMap.put("" + m.get("start-entity") + "-" + m.get("end-entity"), m);
            }
            // columns copied to the output csv straight from the input csv
            outputColumnsFromCsv = (ArrayList) ((HashMap) jsonObject.get("output-csv-columns"))
                    .get("columns-from-input-csv");
            // columns pulled from the matched graph nodes
            outputColumnsFromGraph = (HashMap) ((HashMap) jsonObject.get("output-csv-columns"))
                    .get("columns-from-graph");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
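The constructor above expects a JSON config file roughly shaped as follows. This sample is reconstructed from the keys the code reads; the values, and the "relation" field, are illustrative assumptions:

{
  "csvColumn-graphNodeProperty-mapping": {
    "mappings": {
      "csv-id1": "GPPocEntity.id",
      "csv-id2": "PPocEntity.id",
      "csv-id3": "PocEntity.id"
    },
    "mappings-graph-relations": [
      { "start-entity": "GPPocEntity", "end-entity": "PPocEntity", "relation": "has" },
      { "start-entity": "PPocEntity", "end-entity": "PocEntity", "relation": "contains" }
    ]
  },
  "output-csv-columns": {
    "columns-from-input-csv": ["outcol1", "outcol2", "outcol3"],
    "columns-from-graph": {
      "GPPocEntity": ["gppe_out_prop_1", "gppe_out_prop_2"],
      "PPocEntity": ["ppe_out_prop_1", "ppe_out_prop_2"],
      "PocEntity": ["pe_out_prop_1", "pe_out_prop_2"]
    }
  }
}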
The class below performs the merge and writes the output csv:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.neo4j.driver.v1.Value;

public class Csv2CsvUtil2 {

    static String inCsvFilePath = "D:\\Mahesh\\work\\files\\inputDataCsv.csv";
    static String outCsvFilePath = "D:\\Mahesh\\work\\files\\csvout.csv";

    private static final int BATCH_SIZE = 10000;

    private final static Driver driver = GraphDatabase.driver(
            "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));

    public static void main(String[] args) throws FileNotFoundException, IOException {
        merge();
        driver.close();
    }

    private static void merge() throws FileNotFoundException, IOException {
        List<Map<String, String>> csvRowMapList = new CsvReader(inCsvFilePath).getMapListFromCsv();
        PrintWriter pw = new PrintWriter(new File(outCsvFilePath));
        StringBuilder sb = new StringBuilder();
        List<Map<String, String>> inputMapList = new ArrayList<Map<String, String>>();
        Map<String, Object> inputMapListMap = new HashMap<String, Object>();

        // One parameterized statement per batch (guideline 1): UNWIND the list
        // of row maps, match the path, and return a ready-made output map per
        // row (guidelines 2 and 3) instead of node/relationship objects.
        String cypher = "WITH {inputMapList} AS inputMapList"
                + " UNWIND inputMapList AS rowMap"
                + " MATCH (gppe:GPPocEntity {id:rowMap.csvid1})-[:has]->(ppe:PPocEntity {id:rowMap.csvid2})-[:contains]->(pe:PocEntity {id:rowMap.csvid3})"
                + " RETURN {id1:gppe.id,id2:ppe.id,id3:pe.id"
                + ",gppeprop1: gppe.gppe_out_prop_1,gppeprop2: gppe.gppe_out_prop_2"
                + ",ppeprop1: ppe.ppe_out_prop_1,ppeprop2: ppe.ppe_out_prop_2"
                + ",peprop1: pe.pe_out_prop_1,peprop2: pe.pe_out_prop_2"
                + ",outcol1:rowMap.outcol1,outcol2:rowMap.outcol2,outcol3:rowMap.outcol3}";

        Session session = driver.session();
        try {
            for (Map<String, String> csvRow : csvRowMapList) {
                Map<String, String> rowMap = new HashMap<String, String>();
                rowMap.put("csvid1", csvRow.get("csv-id1"));
                rowMap.put("csvid2", csvRow.get("csv-id2"));
                rowMap.put("csvid3", csvRow.get("csv-id3"));
                rowMap.put("outcol1", csvRow.get("outcol1"));
                rowMap.put("outcol2", csvRow.get("outcol2"));
                rowMap.put("outcol3", csvRow.get("outcol3"));
                inputMapList.add(rowMap);
                if (inputMapList.size() == BATCH_SIZE) { // run a full batch
                    inputMapListMap.put("inputMapList", inputMapList);
                    runBatch(session, cypher, inputMapListMap, sb);
                    inputMapList.clear();
                }
            }
            if (!inputMapList.isEmpty()) { // flush the remaining rows that did
                                           // not fill up a final batch
                inputMapListMap.put("inputMapList", inputMapList);
                runBatch(session, cypher, inputMapListMap, sb);
            }
        } finally {
            session.close();
        }
        pw.write(sb.toString());
        pw.close();
    }

    // Runs one batch and appends each returned output map as a csv line.
    private static void runBatch(Session session, String cypher,
            Map<String, Object> params, StringBuilder sb) {
        StatementResult stmtRes = session.run(cypher, params);
        for (Record record : stmtRes.list()) {
            Value row = record.get(0);
            sb.append(row.get("id1").asString()
                    + "," + row.get("id2").asString()
                    + "," + row.get("id3").asString()
                    + "," + row.get("gppeprop1").asString()
                    + "," + row.get("gppeprop2").asString()
                    + "," + row.get("ppeprop1").asString()
                    + "," + row.get("ppeprop2").asString()
                    + "," + row.get("peprop1").asString()
                    + "," + row.get("peprop2").asString()
                    + "," + row.get("outcol1").asString()
                    + "," + row.get("outcol2").asString()
                    + "," + row.get("outcol3").asString()
                    + "\n");
        }
    }
}
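The CsvReader class used in merge() is not part of the original listing. A minimal sketch, assuming a comma-separated file whose first line is the header (no quoting or escaping handled):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal csv reader: returns one map per data row, keyed by header column names.
public class CsvReader {

    private final String csvFilePath;

    public CsvReader(String csvFilePath) {
        this.csvFilePath = csvFilePath;
    }

    public List<Map<String, String>> getMapListFromCsv() throws IOException {
        List<Map<String, String>> rows = new ArrayList<Map<String, String>>();
        BufferedReader br = new BufferedReader(new FileReader(csvFilePath));
        try {
            String headerLine = br.readLine();
            if (headerLine == null) {
                return rows; // empty file
            }
            String[] headers = headerLine.split(",");
            String line;
            while ((line = br.readLine()) != null) {
                String[] cols = line.split(",", -1);
                Map<String, String> row = new HashMap<String, String>();
                for (int c = 0; c < headers.length && c < cols.length; c++) {
                    row.put(headers[c].trim(), cols[c].trim());
                }
                rows.add(row);
            }
        } finally {
            br.close();
        }
        return rows;
    }
}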
Running in non-batch mode is about 10 times slower. Not following guidelines 2 to 4 makes it slower still. I would love for someone to confirm whether all of the above is correct, point out any mistakes I have made, tell me if I am missing anything, and suggest whether this can be improved further.