Easy way to create an xarray Dataset from metadata + values?

Time: 2017-10-17 23:12:38

Tags: python data-science python-xarray xarray

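I'm working with single-cell RNA-sequencing data, which these days means sparse values for 10k-100k samples (cells) x 20k features (genes), plus a lot of metadata, e.g. the tissue of origin ("brain" vs. "liver"). The metadata is ~10-100 columns, which I store as a pandas.DataFrame. Right now I build xarray.Datasets by dict-ifying the metadata and adding the columns as coordinates. This seems clunky and error-prone, since I keep copying the snippet between notebooks. Is there an easier way?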

cell_metadata_dict = cell_metadata.to_dict(orient='list')
coords = {k: ('cell', v) for k, v in cell_metadata_dict.items()}
coords.update(dict(gene=counts.columns, cell=counts.index))

ds = xr.Dataset(
    {'counts': (['cell', 'gene'], counts),
    },
    coords=coords)

Edit:

To show some example data, here is the output of cell_metadata.head().to_csv():

cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F

counts.iloc[:5, :20].to_csv()

cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65

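Re: pandas.DataFrame.to_xarray(): this is very slow, and it seems strange to me to encode numeric and categorical data as a 100-level MultiIndex. That, and every time I've tried to use a MultiIndex, it has ended with me saying "oh, this is why I don't use MultiIndex" and going back to keeping separate metadata and counts dataframes.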

1 answer:

Answer 0 (score: 2)

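Xarray uses the pandas index/column labels as default metadata. When all variables share the same dimensions, you can do the conversion in a single function call, but when different variables have different dimensions, you need to convert each one from pandas separately and then combine them on the xarray side. For example: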

import pandas as pd
import io
import xarray

# read your data
cell_metadata = pd.read_csv(io.StringIO(u"""\
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F"""))
counts = pd.read_csv(io.StringIO(u"""\
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65"""))

# build the output
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene'])
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray())
print(xarray_counts)

This gives you a nice, tidy xarray.DataArray of the counts:

<xarray.DataArray (cell: 5, gene: 20)>
array([[308, 289,  81,   0,   4,  88,  52,   0,   0, 104,  65,   0,   1,   0,
          9,   8,  12, 283,  12,  37],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [375, 325,  70,   0,   2,  72,  36,  13,   0,  60, 105,   0,  13,   0,
          0,  29,  15, 264,   0,  65]])
Coordinates:
  * cell                          (cell) object 'A1-MAA100140-3_57_F-1-1' ...
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 428699 324428 381310 393498 717
    Number of input reads         (cell) int64 502312 360285 431800 446705 918
    EXP_ID                        (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
    TAXON                         (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
    WELL_MAPPING                  (cell) object 'MAA100140' 'MAA100140' ...
    Lysis Plate Batch             (cell) float64 nan nan nan nan nan
    dNTP.batch                    (cell) float64 nan nan nan nan nan
    oligodT.order.no              (cell) float64 nan nan nan nan nan
    plate.type                    (cell) object 'Biorad 96well' ...
    preparation.site              (cell) object 'Stanford' 'Stanford' ...
    date.prepared                 (cell) float64 nan nan nan nan nan
    date.sorted                   (cell) int64 170720 170720 170720 170720 ...
    tissue                        (cell) object 'Liver' 'Liver' 'Liver' ...
    subtissue                     (cell) object 'Hepatocytes' 'Hepatocytes' ...
    mouse.id                      (cell) object '3_57_F' '3_57_F' '3_57_F' ...
    FACS.selection                (cell) float64 nan nan nan nan nan
    nozzle.size                   (cell) float64 nan nan nan nan nan
    FACS.instument                (cell) float64 nan nan nan nan nan
    Experiment ID                 (cell) float64 nan nan nan nan nan
    Columns sorted                (cell) float64 nan nan nan nan nan
    Double check                  (cell) float64 nan nan nan nan nan
    Plate                         (cell) float64 nan nan nan nan nan
    Location                      (cell) float64 nan nan nan nan nan
    Comments                      (cell) float64 nan nan nan nan nan
    mouse.age                     (cell) int64 3 3 3 3 3
    mouse.number                  (cell) int64 57 57 57 57 57
    mouse.sex                     (cell) object 'F' 'F' 'F' 'F' 'F'
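
As a rough illustration of what the per-cell coordinates make possible, the sketch below filters and groups a count array by a metadata coordinate. The tiny array and its shortened cell IDs are made up for the example and are not the data above; only the `tissue` coordinate name is borrowed from it.

```python
import numpy as np
import xarray as xr

# Toy stand-in for xarray_counts: 3 cells x 2 genes, with one metadata
# coordinate ("tissue") attached along the "cell" dimension.
toy_counts = xr.DataArray(
    np.array([[308, 0], [0, 0], [375, 13]]),
    dims=["cell", "gene"],
    coords={
        "cell": ["A1", "A10", "B2"],  # hypothetical, shortened cell IDs
        "gene": ["0610005C13Rik", "0610008F07Rik"],
        "tissue": ("cell", ["Liver", "Liver", "Brain"]),
    },
)

# Keep only the liver cells; drop=True removes the non-matching rows.
liver = toy_counts.where(toy_counts["tissue"] == "Liver", drop=True)

# Average counts per gene within each tissue.
mean_by_tissue = toy_counts.groupby("tissue").mean()
```

Note that `where` may upcast integer counts to floats, because it fills masked entries with NaN before dropping them.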

If you want a Dataset, put the DataArray object into the Dataset constructor, e.g.,

# shouldn't really need to use .data_vars here, that might be an xarray bug
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'),
...                                            dims=['cell', 'gene'])},
...                coords=cell_metadata.set_index('cell').to_xarray().data_vars)
<xarray.Dataset>
Dimensions:                       (cell: 5, gene: 20)
Coordinates:
  * cell                          (cell) object 'A1-MAA100140-3_57_F-1-1' ...
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 428699 324428 381310 393498 717
    Number of input reads         (cell) int64 502312 360285 431800 446705 918
    EXP_ID                        (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
    TAXON                         (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
    WELL_MAPPING                  (cell) object 'MAA100140' 'MAA100140' ...
    Lysis Plate Batch             (cell) float64 nan nan nan nan nan
    dNTP.batch                    (cell) float64 nan nan nan nan nan
    oligodT.order.no              (cell) float64 nan nan nan nan nan
    plate.type                    (cell) object 'Biorad 96well' ...
    preparation.site              (cell) object 'Stanford' 'Stanford' ...
    date.prepared                 (cell) float64 nan nan nan nan nan
    date.sorted                   (cell) int64 170720 170720 170720 170720 ...
    tissue                        (cell) object 'Liver' 'Liver' 'Liver' ...
    subtissue                     (cell) object 'Hepatocytes' 'Hepatocytes' ...
    mouse.id                      (cell) object '3_57_F' '3_57_F' '3_57_F' ...
    FACS.selection                (cell) float64 nan nan nan nan nan
    nozzle.size                   (cell) float64 nan nan nan nan nan
    FACS.instument                (cell) float64 nan nan nan nan nan
    Experiment ID                 (cell) float64 nan nan nan nan nan
    Columns sorted                (cell) float64 nan nan nan nan nan
    Double check                  (cell) float64 nan nan nan nan nan
    Plate                         (cell) float64 nan nan nan nan nan
    Location                      (cell) float64 nan nan nan nan nan
    Comments                      (cell) float64 nan nan nan nan nan
    mouse.age                     (cell) int64 3 3 3 3 3
    mouse.number                  (cell) int64 57 57 57 57 57
    mouse.sex                     (cell) object 'F' 'F' 'F' 'F' 'F'
Data variables:
    counts                        (cell, gene) int64 308 289 81 0 4 88 52 0 ...
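
Since the question mentions copying snippets between notebooks, one option is to wrap the construction in a small helper. The following is only a sketch, not part of the original answer: the function name `counts_to_dataset` is hypothetical, and it assumes both data frames are already indexed by cell ID (e.g. after `set_index('cell')`).

```python
import pandas as pd
import xarray as xr


def counts_to_dataset(counts, cell_metadata, name="counts"):
    """Combine a cells x genes table and per-cell metadata into one Dataset.

    Hypothetical helper: assumes both frames are indexed by cell ID.
    """
    # Put the metadata rows in the same order as the count rows; cells
    # missing from the metadata end up with NaN in every metadata column.
    metadata = cell_metadata.reindex(counts.index)

    # Each metadata column becomes a non-index coordinate along "cell".
    coords = {col: ("cell", metadata[col].to_numpy()) for col in metadata.columns}

    # The DataArray takes its "cell" and "gene" labels from the frame's
    # index and columns.
    data = xr.DataArray(counts, dims=["cell", "gene"])
    return xr.Dataset({name: data}, coords=coords)


# Small made-up frames, just to show the call.
example_counts = pd.DataFrame(
    [[1, 0], [2, 3]], index=["c1", "c2"], columns=["g1", "g2"]
)
example_metadata = pd.DataFrame(
    {"tissue": ["Liver", "Brain"], "mouse.age": [3, 3]}, index=["c2", "c1"]
)
example_ds = counts_to_dataset(example_counts, example_metadata)
```

Building the coordinates as plain `(dimension, values)` pairs avoids going through `to_xarray()` at all, which may matter if that conversion is the slow step.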