OrientDB ETL with CSV, no headers, and multiple join fields

Time: 2015-07-02 14:06:53

Tags: etl orientdb

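I am trying to load a couple of CSV files into OrientDB. They were extracted from a MySQL database that holds the Unified Medical Language System (NIH UMLS) data. The two files contain vertices: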

"C0484850"  "A18164418" "Troponin T.cardiac [Mass/volume] in Venous blood"  "Y" "Clinical Attribute"
"C0484850"  "A18241423" "Troponin T.cardiac:MCnc:Pt:BldV:Qn"    "Y" "Clinical Attribute"
"C0484850"  "A18861342" "Troponin T.cardiac:Mass Concentration:Point in time:Blood venous:Quantitative" "Y" "Clinical Attribute"
"C0484851"  "A18280127" "Troponin T.cardiac [Mass/volume] in Serum or Plasma"   "Y" "Clinical Attribute"
"C0484851"  "A18357585" "Troponin T.cardiac:MCnc:Pt:Ser/Plas:Qn"    "Y" "Clinical Attribute"
"C0484851"  "A18816754" "Troponin T.cardiac:Mass Concentration:Point in time:Serum/Plasma:Quantitative" "Y" "Clinical Attribute"

and relationships:

"C0484850"  "A18164418" "has_common_name"   "C0484850"  "A18241423"
"C0484850"  "A18241423" "class_of"  "C0201682"  "A18205079"
"C0484850"  "A18241423" "component_of"  "C3538889"  "A18284809"
"C0484850"  "A18241423" "property_of"   "C0560150"  "A18367132"
"C0484850"  "A18241423" "scale_of"  "C1442116"  "A18405933"
"C0484850"  "A18241423" "system_of" "C1442207"  "A18136032"
"C0484850"  "A18241423" "time_aspect_of"    "C1442880"  "A18406936"
"C0484850"  "A18241423" "fragments_for_synonyms_of" "C2603360"  "A18401194"

I find the OrientDB documentation on extractors for CSV rather lacking.

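  1. For the "row" extractor there is only a single example and no complete documentation. My files have no header row, so how do I name the vertex fields (cui, aui, description, pref, syn) when using the "row" extractor? I am guessing there is a syntax along the lines of id:row 2, but I cannot find it.
  2. The edges are joined on the 2nd and 5th fields, which are unnamed in the vertex data. The edge property is also unnamed.
  3. For silly reasons I cannot pull directly from MySQL right now, but if there are better examples than the ones on the official site, I would be interested in seeing them.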

1 answer:

Answer 0: (score: 1)

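Use the csv extractor (see: http://orientdb.com/docs/2.2.x/Extractor.html). Set "columnsOnFirstLine" to false, and set "columns" to an explicit list of the column names, in the order in which they appear in the CSV file.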
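
As a rough illustration of that advice, an ETL configuration for the vertex file might look something like the sketch below. The column names are the ones given in the question. Everything else is an assumption made for the example: the file path, the tab separator (the sample rows look tab-delimited, but that is a guess), the vertex class name `Concept`, and the database URL. JSON has no comment syntax, so those assumptions are listed here rather than inline.

```json
{
  "source": { "file": { "path": "/tmp/umls_vertices.csv" } },
  "extractor": {
    "csv": {
      "separator": "\t",
      "columnsOnFirstLine": false,
      "columns": ["cui", "aui", "description", "pref", "syn"]
    }
  },
  "transformers": [
    { "vertex": { "class": "Concept" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/databases/umls",
      "dbType": "graph",
      "classes": [
        { "name": "Concept", "extends": "V" }
      ]
    }
  }
}
```

With "columnsOnFirstLine" set to false, the first line of the file is read as data and the fields take their names from "columns", so each vertex should end up with cui, aui, description, pref and syn properties. The relationship file could presumably be handled the same way, by listing five column names for it, so that the 2nd and 5th fields have names an edge transformer can join on.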