Question

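I am using EggLib in Python. I am trying to compute Jost's D for each SNP found in a VCF file.

Data

The data, in VCF format, is here. The dataset is small: 2 populations of 100 individuals each and 6 SNPs (all on chromosome 1).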

Each individual is named Pp.Ii, where p is the index of the population it belongs to and i is the index of the individual.
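As a small illustration of this naming scheme, the sketch below splits a sample name into its population and individual indexes. The helper name `parse_sample_name` and the example name are hypothetical, and whether the indexes in the actual file start at 0 or 1 is not stated in the question.

```python
import re

# Hypothetical helper (not part of EggLib): names are assumed to look
# exactly like "P<population index>.I<individual index>".
NAME_PATTERN = re.compile(r"^P(\d+)\.I(\d+)$")

def parse_sample_name(name):
    """Return (population_index, individual_index) for a name like 'P1.I42'."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError("unexpected sample name: %r" % (name,))
    return int(match.group(1)), int(match.group(2))

print(parse_sample_name("P0.I17"))
```

Grouping the parsed names by population index gives the population membership needed for the structure dictionary.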

Code

My difficulty concerns the specification of the population structure. Here is my attempt:

### Read the vcf file ###
vcf = egglib.io.VcfParser("MyData.vcf") 

### Create the `Structure` object ###
# Dictionary for a given cluster. There is only one cluster.
dcluster = {}            
# Loop through each population 
for popIndex in [0,1]:  
    # Dictionary for a given population. There are two populations.
    dpop = {}
    # Loop through each individual
    for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):
        # A single list to define an individual
        dpop[IndIndex] = [IndIndex*2, IndIndex*2 + 1]
    dcluster[popIndex] = dpop

struct = {0: dcluster}

### Define the population structure ###
Structure = egglib.stats.make_structure(struct, None) 

### Configure the 'ComputeStats' object ###
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj') # Jost's D

### Isolate a SNP ###
vcf.next()
site = egglib.stats.site_from_vcf(vcf)

### Calculate Jost's D ###
cs.process_site(site, struct=Structure)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/egglib/stats/_cstats.py", line 431, in process_site
    self._frq.process_site(site, struct=struct)
  File "/Library/Python/2.7/site-packages/egglib/stats/_freq.py", line 159, in process_site
    if sum(struct) != site._obj.get_ning(): raise ValueError, 'invalid structure (sample size is required to match)'
ValueError: invalid structure (sample size is required to match)

The documentation says here:

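[The Structure object] is a tuple containing two items, each being a dict. The first one represents the ingroup and the second represents the outgroup.

The ingroup dictionary is itself a dictionary holding more dictionaries, one for each cluster of populations. Each cluster dictionary is a dictionary of populations, populations being themselves represented by a dictionary. A population dictionary is, again, a dictionary of individuals. Fortunately, individuals are represented by lists.

An individual list contains the index of all samples belonging to this individual. For haploid data, individuals will be one-item lists. In other cases, all individual lists are required to have the same number of items (consistent ploidy). Note that, if the ploidy is more than one, nothing enforces that samples of a given individual are grouped within the original data.

The keys of the ingroup dictionary are the labels identifying each cluster. Within a cluster dictionary, the keys are population labels. Finally, within a population dictionary, the keys are individual labels.

The second dictionary represents the outgroup. Its structure is simpler: it has individual labels as keys, and lists of corresponding sample indexes as values. The outgroup dictionary is similar to any ingroup population dictionary. The ploidy is required to match over all ingroup and outgroup individuals.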

But I do not understand it. The example provided is for the fasta format, and I do not see how to extend the logic to the VCF format.
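For reference, the nesting described in the quoted documentation can be written out as plain Python dictionaries, independently of the file format. This is only a sketch: the labels, the sample indexes, and the helper `ploidy_of` are made up for illustration and are not part of EggLib.

```python
# Ingroup: {cluster label: {population label: {individual label: [sample indexes]}}}
ingroup = {
    "cluster0": {
        "pop0": {"ind0": [0, 1], "ind1": [2, 3]},
        "pop1": {"ind2": [4, 5], "ind3": [6, 7]},
    }
}

# Outgroup: {individual label: [sample indexes]}
outgroup = {"out0": [8, 9]}

def ploidy_of(ingroup, outgroup):
    """Return the common ploidy, or raise ValueError if individuals disagree."""
    sizes = set()
    for cluster in ingroup.values():
        for population in cluster.values():
            for samples in population.values():
                sizes.add(len(samples))
    for samples in outgroup.values():
        sizes.add(len(samples))
    if len(sizes) != 1:
        raise ValueError("inconsistent ploidy: %s" % sorted(sizes))
    return sizes.pop()

print(ploidy_of(ingroup, outgroup))  # 2: every individual lists two samples
```

The check mirrors the documented requirement that every ingroup and outgroup individual lists the same number of sample indexes.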

Answer 1

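There are two errors.

First error

The function make_structure returns the Structure object but does not store it in stats. You therefore have to save its return value and pass it to process_site: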

Structure = egglib.stats.make_structure(struct, None)

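Second error

The Structure object must describe haploid individuals, that is, one sample index per individual. Therefore, create the dictionary as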

dcluster = {}            
for popIndex in [0,1]:  
    dpop = {}            
    for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):     
        dpop[IndIndex] = [IndIndex]
    dcluster[popIndex] = dpop

struct = {0: dcluster}
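The error message says that the sample size is required to match. A rough sanity check, independent of EggLib, is to count how many sample indexes a layout refers to and compare that with the number of samples the site is expected to hold. The sketch below assumes the site holds 200 samples (one per individual), which is what the fix above implies but is not verified here; `build_ingroup` and `count_samples` are hypothetical helpers.

```python
def build_ingroup(n_pops, n_per_pop, ploidy):
    """Build {cluster: {population: {individual: [sample indexes]}}} with one cluster."""
    cluster = {}
    for pop in range(n_pops):
        individuals = {}
        for ind in range(pop * n_per_pop, (pop + 1) * n_per_pop):
            # ploidy=2 gives [2*ind, 2*ind + 1]; ploidy=1 gives [ind]
            individuals[ind] = [ind * ploidy + k for k in range(ploidy)]
        cluster[pop] = individuals
    return {0: cluster}

def count_samples(ingroup):
    """Count all sample indexes referenced by the ingroup dictionary."""
    return sum(
        len(samples)
        for cluster in ingroup.values()
        for population in cluster.values()
        for samples in population.values()
    )

question_layout = build_ingroup(2, 100, ploidy=2)  # layout used in the question
answer_layout = build_ingroup(2, 100, ploidy=1)    # layout used in this answer

expected_samples = 200  # assumption: one sample per individual in the site
print(count_samples(question_layout) == expected_samples)  # False (400 indexes)
print(count_samples(answer_layout) == expected_samples)    # True (200 indexes)
```

Under that assumption, the layout from the question refers to twice as many samples as the site contains, which is consistent with the ValueError in the traceback.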
