我有700万个生物多样性记录的csv,其中分类学级别为列。例如:
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
我想在D3中创建一个可视化文件,但是数据格式必须是网络,其中每个列的不同值都是上一个特定值列的子级。我需要从csv转到类似这样的内容:
{
name: 'Animalia',
children: [{
name: 'Chordata',
children: [{
name: 'Mammalia',
children: [{
name: 'Primates',
children: 'Hominidae'
}, {
name: 'Carnivora',
children: 'Canidae'
}]
}]
}]
}
我还没有想到不使用上千个for循环就如何做到这一点的想法。有没有人建议如何在python或javascript上创建此网络?
答案 0 :(得分:16)
要创建所需的确切嵌套对象,我们将使用纯JavaScript和名为d3.stratify
的D3方法的混合。但是,请记住,要计算的行数为700万(请参见下面的 post scriptum )。
非常重要的一点是,对于此建议的解决方案,您必须将王国分开在不同的数据数组中(例如,使用Array.prototype.filter
)。出现此限制是因为我们需要一个根节点,并且在Linnaean分类法中,王国之间没有任何关系(除非您创建“ Domain” 作为最高等级,它将成为所有真核生物的根,但是那么对于古细菌和细菌,您将遇到同样的问题。
因此,假设您有一个仅一个王国的CSV(我又添加了一些行):
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus
基于该CSV,我们将在此处创建一个名为tableOfRelationships
的数组,顾名思义,该数组具有等级之间的关系:
const data = d3.csvParse(csv);
const taxonomicRanks = data.columns.filter(d => d !== "RecordID");
const tableOfRelationships = [];
data.forEach(row => {
taxonomicRanks.forEach((d, i) => {
if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
name: row[d],
parent: row[taxonomicRanks[i - 1]] || null
})
})
});
对于上面的数据,这是tableOfRelationships
:
+---------+----------------------+---------------+
| (Index) | name | parent |
+---------+----------------------+---------------+
| 0 | "Animalia" | null |
| 1 | "Chordata" | "Animalia" |
| 2 | "Mammalia" | "Chordata" |
| 3 | "Primates" | "Mammalia" |
| 4 | "Hominidae" | "Primates" |
| 5 | "Homo" | "Hominidae" |
| 6 | "Homo sapiens" | "Homo" |
| 7 | "Carnivora" | "Mammalia" |
| 8 | "Canidae" | "Carnivora" |
| 9 | "Canis" | "Canidae" |
| 10 | "Canis latrans" | "Canis" |
| 11 | "Cetacea" | "Mammalia" |
| 12 | "Delphinidae" | "Cetacea" |
| 13 | "Tursiops" | "Delphinidae" |
| 14 | "Tursiops truncatus" | "Tursiops" |
| 15 | "Pan" | "Hominidae" |
| 16 | "Pan paniscus" | "Pan" |
+---------+----------------------+---------------+
以null
作为Animalia
的父对象:这就是为什么我告诉您需要按王国将数据集分开的原因,因此在null
中只能有一个d3.stratify()
值。整个桌子。
最后,基于该表,我们使用const stratify = d3.stratify()
.id(function(d) { return d.name; })
.parentId(function(d) { return d.parent; });
const hierarchicalData = stratify(tableOfRelationships);
创建层次结构:
children
这是演示。打开浏览器的控制台(该任务的片段不是很好),然后检查该对象的多个级别(const csv = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus`;
const data = d3.csvParse(csv);
const taxonomicRanks = data.columns.filter(d => d !== "RecordID");
const tableOfRelationships = [];
data.forEach(row => {
taxonomicRanks.forEach((d, i) => {
if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
name: row[d],
parent: row[taxonomicRanks[i - 1]] || null
})
})
});
const stratify = d3.stratify()
.id(function(d) {
return d.name;
})
.parentId(function(d) {
return d.parent;
});
const hierarchicalData = stratify(tableOfRelationships);
console.log(hierarchicalData);
):
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
using (var reader = new StreamReader("path\\to\\file.csv"))
using (var csv = new CsvReader(reader))
{
// Do any configuration to `CsvReader` before creating CsvDataReader.
using (var dr = new CsvDataReader(csv))
{
var dt = new DataTable();
dt.Load(dr);
}
}
PS :我不知道您将创建哪种数据,但您实际上应该避免使用分类排名。整个Linnaean分类法已过时,我们不再使用等级:由于系统发育系统是在60年代中期开发的,因此我们仅使用分类单位,而没有任何分类等级(此处为进化生物学老师)。另外,我对这700万行非常好奇,因为我们已经描述了100万种以上!
答案 1 :(得分:8)
使用python和python-benedict
库(您在Github上是开源的)可以很轻松地完成所需的工作:
安装pip install python-benedict
from benedict import benedict as bdict
# data source can be a filepath or an url
data_source = """
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""
data_input = bdict.from_csv(data_source)
data_output = bdict()
ancestors_hierarchy = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
for value in data_input['values']:
data_output['.'.join([value[ancestor] for ancestor in ancestors_hierarchy])] = bdict()
print(data_output.dump())
# if this output is ok for your needs, you don't need the following code
keypaths = sorted(data_output.keypaths(), key=lambda item: len(item.split('.')), reverse=True)
data_output['children'] = []
def transform_data(d, key, value):
if isinstance(value, dict):
value.update({ 'name':key, 'children':[] })
data_output.traverse(transform_data)
for keypath in keypaths:
target_keypath = '.'.join(keypath.split('.')[:-1] + ['children'])
data_output[target_keypath].append(data_output.pop(keypath))
print(data_output.dump())
第一个打印输出将是:
{
"Animalia": {
"Chordata": {
"Mammalia": {
"Carnivora": {
"Canidae": {
"Canis": {
"Canis": {}
}
}
},
"Primates": {
"Hominidae": {
"Homo": {
"Homo sapiens": {}
}
}
}
}
}
},
"Plantae": {
"nan": {
"Magnoliopsida": {
"Brassicales": {
"Brassicaceae": {
"Arabidopsis": {
"Arabidopsis thaliana": {}
}
}
},
"Fabales": {
"Fabaceae": {
"Phaseoulus": {
"Phaseolus vulgaris": {}
}
}
}
}
}
}
}
第二个打印输出为:
{
"children": [
{
"name": "Animalia",
"children": [
{
"name": "Chordata",
"children": [
{
"name": "Mammalia",
"children": [
{
"name": "Carnivora",
"children": [
{
"name": "Canidae",
"children": [
{
"name": "Canis",
"children": [
{
"name": "Canis",
"children": []
}
]
}
]
}
]
},
{
"name": "Primates",
"children": [
{
"name": "Hominidae",
"children": [
{
"name": "Homo",
"children": [
{
"name": "Homo sapiens",
"children": []
}
]
}
]
}
]
}
]
}
]
}
]
},
{
"name": "Plantae",
"children": [
{
"name": "nan",
"children": [
{
"name": "Magnoliopsida",
"children": [
{
"name": "Brassicales",
"children": [
{
"name": "Brassicaceae",
"children": [
{
"name": "Arabidopsis",
"children": [
{
"name": "Arabidopsis thaliana",
"children": []
}
]
}
]
}
]
},
{
"name": "Fabales",
"children": [
{
"name": "Fabaceae",
"children": [
{
"name": "Phaseoulus",
"children": [
{
"name": "Phaseolus vulgaris",
"children": []
}
]
}
]
}
]
}
]
}
]
}
]
}
]
}
答案 2 :(得分:5)
var log = console.log;
var data = `
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris`;
//make array of rows with array of values
data = data.split("\n").map(v=>v.split(","));
//init tree
var tree = {};
data.forEach(row=>{
//set current = root of tree for every row
var cur = tree;
var id = false;
row.forEach((value,i)=>{
if (i == 0) {
//set id and skip value
id = value;
return;
}
//If branch not exists create.
//If last value - write id
if (!cur[value]) cur[value] = (i == row.length - 1) ? id : {};
//Move link down on hierarhy
cur = cur[value];
});
});
log("Tree:");
log(JSON.stringify(tree, null, " "));
//Now you have hierarhy in tree and can do anything with it.
var toStruct = function(obj) {
let ret = [];
for (let key in obj) {
let child = obj[key];
let rec = {};
rec.name = key;
if (typeof child == "object") rec.children = toStruct(child);
ret.push(rec);
}
return ret;
}
var struct = toStruct(tree);
console.log("Struct:");
console.log(struct);
答案 3 :(得分:5)
这似乎很简单,所以也许我不了解您的问题。
所需的数据结构是字典,键/值对的嵌套集。您的顶级王国词典为您的每个王国都有一个键,其值是门词典。一个门词典(用于一个王国)的每个门名称都有一个键,每个键的值都是一个类字典,依此类推。
为了简化编码,您的属词典将为每个物种提供一个键,但该物种的值将为空词典。
这应该是您想要的;不需要奇怪的库。
import csv
def read_data(filename):
tree = {}
with open(filename) as f:
f.readline() # skip the column headers line of the file
for animal_cols in csv.reader(f):
spot = tree
for name in animal_cols[1:]: # each name, skipping the record number
if name in spot: # The parent is already in the tree
spot = spot[name]
else:
spot[name] = {} # creates a new entry in the tree
spot = spot[name]
return tree
要进行测试,我使用了您的数据和标准库中的pprint
。
from pprint import pprint
pprint(read_data('data.txt'))
获取
{'Animalia': {'Chordata': {'Mammalia': {'Carnivora': {'Canidae': {'Canis': {'Canis': {}}}},
'Primates': {'Hominidae': {'Homo': {'Homo sapiens': {}}}}}}},
'Plantae': {'nan': {'Magnoliopsida': {'Brassicales': {'Brassicaceae': {'Arabidopsis': {'Arabidopsis thaliana': {}}}},
'Fabales': {'Fabaceae': {'Phaseoulus': {'Phaseolus vulgaris': {}}}}}}}}
再次阅读您的问题,您可能想要一张成对的大表(“来自更一般的组的链接”,“到更特定的组的链接”)。也就是说,“ Animalia”链接到“ Animalia:Chordata”,“ Animalia:Chordata”链接到“ Animalia:Chordata:Mammalia”等。不幸的是,数据中的“ nan”表示您需要在每个链接处使用全名。父母,孩子)对就是您想要的,就这样走树:
def walk_children(tree, parent=''):
for child in tree.keys():
full_name = parent + ':' + child
yield (parent, full_name)
yield from walk_children(tree[child], full_name)
tree = read_data('data.txt')
for (parent, child) in walk_children(tree):
print(f'parent="{parent}" child="{child}"')
给予:
parent="" child=":Animalia"
parent=":Animalia" child=":Animalia:Chordata"
parent=":Animalia:Chordata" child=":Animalia:Chordata:Mammalia"
parent=":Animalia:Chordata:Mammalia" child=":Animalia:Chordata:Mammalia:Primates"
parent=":Animalia:Chordata:Mammalia:Primates" child=":Animalia:Chordata:Mammalia:Primates:Hominidae"
parent=":Animalia:Chordata:Mammalia:Primates:Hominidae" child=":Animalia:Chordata:Mammalia:Primates:Hominidae:Homo"
parent=":Animalia:Chordata:Mammalia:Primates:Hominidae:Homo" child=":Animalia:Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens"
parent=":Animalia:Chordata:Mammalia" child=":Animalia:Chordata:Mammalia:Carnivora"
parent=":Animalia:Chordata:Mammalia:Carnivora" child=":Animalia:Chordata:Mammalia:Carnivora:Canidae"
parent=":Animalia:Chordata:Mammalia:Carnivora:Canidae" child=":Animalia:Chordata:Mammalia:Carnivora:Canidae:Canis"
parent=":Animalia:Chordata:Mammalia:Carnivora:Canidae:Canis" child=":Animalia:Chordata:Mammalia:Carnivora:Canidae:Canis:Canis"
parent="" child=":Plantae"
parent=":Plantae" child=":Plantae:nan"
parent=":Plantae:nan" child=":Plantae:nan:Magnoliopsida"
parent=":Plantae:nan:Magnoliopsida" child=":Plantae:nan:Magnoliopsida:Brassicales"
parent=":Plantae:nan:Magnoliopsida:Brassicales" child=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae"
parent=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae" child=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae:Arabidopsis"
parent=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae:Arabidopsis" child=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae:Arabidopsis:Arabidopsis thaliana"
parent=":Plantae:nan:Magnoliopsida" child=":Plantae:nan:Magnoliopsida:Fabales"
parent=":Plantae:nan:Magnoliopsida:Fabales" child=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae"
parent=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae" child=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae:Phaseoulus"
parent=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae:Phaseoulus" child=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae:Phaseoulus:Phaseolus vulgaris"
答案 4 :(得分:3)
在Python中,对树进行编码的一种方法是使用dict
,其中键代表节点,而关联的值是节点的父节点:
{'Homo sapiens': 'Homo',
'Canis': 'Canidae',
'Arabidopsis thaliana': 'Arabidopsis',
'Phaseolus vulgaris': 'Phaseoulus',
'Homo': 'Hominidae',
'Arabidopsis': 'Brassicaceae',
'Phaseoulus': 'Fabaceae',
'Hominidae': 'Primates',
'Canidae': 'Carnivora',
'Brassicaceae': 'Brassicales',
'Fabaceae': 'Fabales',
'Primates': 'Mammalia',
'Carnivora': 'Mammalia',
'Brassicales': 'Magnoliopsida',
'Fabales': 'Magnoliopsida',
'Mammalia': 'Chordata',
'Magnoliopsida': 'nan',
'Chordata': 'Animalia',
'nan': 'Plantae',
'Animalia': None,
'Plantae': None}
这样做的好处是,您可以确保节点是唯一的,因为dicts
不能有重复的密钥。
如果您想编码更通用的有向图(即,节点可以有多个父级),则可以使用值列表并具有代表子级(我想是父级):
{'Homo': ['Homo sapiens', 'ManBearPig'],
'Ursus': ['Ursus arctos', 'ManBearPig'],
'Sus': ['ManBearPig']}
您可以对JS中的对象执行类似的操作,必要时用Array代替列表。
这是我用来创建上述第一个字典的Python代码:
import csv
ROWS = []
# Load file: tbl.csv
with open('tbl.csv', 'r') as in_file:
csvreader = csv.reader(in_file)
# Ignore leading row numbers
ROWS = [row[1:] for row in csvreader]
# Drop header row
del ROWS[0]
# Build dict
mytree = {row[i]: row[i-1] for row in ROWS for i in range(len(row)-1, 0, -1)}
# Add top-level nodes
mytree = {**mytree, **{row[0]: None for row in ROWS}}
答案 5 :(得分:2)
将数据转换为层次结构的最简单方法可能是利用D3内置的nesting运算符d3.nest()
:
嵌套可将数组中的元素分组为分层树结构;
通过nest.key()
注册关键功能,您可以轻松指定层次结构。就像Gerardo在他的answer中布置的一样,您可以在解析CSV后使用数据数组上公开的.columns
属性来自动生成这些关键函数。整个代码可以归纳为以下几行:
const nester = d3.nest(); // Create a nest operator
const [, ...taxonomicRanks] = data.columns; // Get rid of the RecordID property
taxonomicRanks.forEach(r => nester.key(d => d[r])); // Register key functions
const nest = nester.entries(data); // Calculate hierarchy
但是请注意,由于对象是{ key, values }
而不是{ name, children }
,因此所得的层次结构与问题中要求的结构并不完全相似;顺便说一句,杰拉多的答案也是如此。不过,这对两个答案都没有影响,因为d3.hierarchy()
可以通过指定 children accessor 函数来阻塞结果:
d3.hierarchy(nest, d => d.values) // Second argument is the children accessor
以下演示将所有部分放在一起:
const csv = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus`;
const data = d3.csvParse(csv);
const nester = d3.nest();
const [, ...taxonomicRanks] = data.columns;
taxonomicRanks.forEach(r => nester.key(d => d[r]));
const nest = nester.entries(data);
console.log(nest);
const hierarchy = d3.hierarchy(nest, d => d.values);
console.log(hierarchy);
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.12.0/d3.js"></script>
您可能还想看看d3.nest() key and values conversion to name and children,以防您需要确切地拥有您发布的结构。
答案 6 :(得分:1)
一个有趣的挑战。试试这个javascript代码。为了简单起见,我使用Lodash的集合。
import { set } from 'lodash'
const csvString = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris`
// First create a quick lookup map
const result = csvString
.split('\n') // Split for Rows
.slice(1) // Remove headers
.reduce((acc, row) => {
const path = row
.split(',') // Split for columns
.filter(item => item !== 'nan') // OPTIONAL: Filter 'nan'
.slice(1) // Remove record id
const species = path.pop() // Pull out species (last entry)
set(acc, path, species)
return acc
}, {})
console.log(JSON.stringify(result, null, 2))
// Then convert to the name-children structure by recursively calling this function
const convert = (obj) => {
// If we're at the end of our chain, end the chain (children is empty)
if (typeof obj === 'string') {
return [{
name: obj,
children: [],
}]
}
// Else loop through each entry and add them as children
return Object.entries(obj)
.reduce((acc, [key, value]) => acc.concat({
name: key,
children: convert(value), // Recursive call
}), [])
}
const result2 = convert(result)
console.log(JSON.stringify(result2, null, 2))
这将产生与您想要的最终结果(相似)。
[
{
"name": "Animalia",
"children": [
{
"name": "Chordata",
"children": [
{
"name": "Mammalia",
"children": [
{
"name": "Primates",
"children": [
{
"name": "Hominidae",
"children": [
{
"name": "Homo",
"children": [
{
"name": "Homo sapiens",
"children": []
}
]
}
]
}
]
},
{
"name": "Carnivora",
"children": [
{
"name": "Canidae",
"children": [
{
"name": "Canis",
"children": [
{
"name": "Canis",
"children": []
}
]
}
]
}
]
}
]
}
]
}
]
},
{
"name": "Plantae",
"children": [
{
"name": "Magnoliopsida",
"children": [
{
"name": "Brassicales",
"children": [
{
"name": "Brassicaceae",
"children": [
{
"name": "Arabidopsis",
"children": [
{
"name": "Arabidopsis thaliana",
"children": []
}
]
}
]
}
]
},
{
"name": "Fabales",
"children": [
{
"name": "Fabaceae",
"children": [
{
"name": "Phaseoulus",
"children": [
{
"name": "Phaseolus vulgaris",
"children": []
}
]
}
]
}
]
}
]
}
]
}
]
答案 7 :(得分:1)
实际上,@ Charles Merriam的解决方案非常优雅。
如果要使结果与问题相同,请尝试以下操作。
from io import StringIO
import csv
CSV_CONTENTS = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""
def recursive(dict_data):
lst = []
for key, val in dict_data.items():
children = recursive(val)
lst.append(dict(name=key, children=children))
return lst
def main():
with StringIO() as io_f:
io_f.write(CSV_CONTENTS)
io_f.seek(0)
io_f.readline() # skip the column headers line of the file
result_tree = {}
for row_data in csv.reader(io_f):
cur_dict = result_tree # cursor, back to root
for item in row_data[1:]: # each item, skip the record number
if item not in cur_dict:
cur_dict[item] = {} # create new dict
cur_dict = cur_dict[item]
else:
cur_dict = cur_dict[item]
# change answer format
result_list = []
for cur_kingdom_name in result_tree:
result_list.append(dict(name=cur_kingdom_name, children=recursive(result_tree[cur_kingdom_name])))
# Optional
import json
from os import startfile
output_file = 'result.json'
with open(output_file, 'w') as f:
json.dump(result_list, f)
startfile(output_file)
if __name__ == '__main__':
main()