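Convert a numpy array of strings to numpy characters in Python

Time: 2018-03-29 23:30:04

Tags: python numpy jupyter-notebook

I am reading data from a URL and trying to convert it to numbers for further analysis in Jupyter. It is a gene sequence in which each letter is encoded as 4 binary digits: A -> 0001, C -> 0010, G -> 0100 and T -> 1000. For example, I want to go from CGGT to 0010010001001000. So far I have been able to strip the whitespace and convert the data to strings, but I cannot get from string to char and from char to number. I am using numpy arrays and made these attempts, to no avail.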

charGenes = [var for var in genes if var]

charGenes = np.char.array(genes)

Here is the rest of the code:

import pandas as pd
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/splice-junction-gene-sequences/splice.data"
file = pd.read_csv(url, delimiter=',', header=None,dtype='str')

X = file[2]
y = file[0]

myGenes = np.array(X)
stringGenes = myGenes.astype(str)

spaceGenes = stringGenes.reshape( stringGenes.size, 1)

genes = np.char.strip(spaceGenes)
genes

This is the output:

array([['CCAGCTGCATCACAGGAGGCCAGCGAGCAGGTCTGTTCCAAGGGCCTTCGAGCCAGTCTG'],
   ['AGACCCGCCGGGAGGCGGAGGACCTGCAGGGTGAGCCCCACCGCCCCTCCGTGCCCCCGC'],
   ['GAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATG'],
   ...,
   ['TCTCGGGGGCGGCCGGCGCGGCGGGGAGCGGTCCCCGGCCGCGGCCCCGACGTGTGTGTC'],
   ['ATTCTACTTAGTAAACATAATTTCTTGTGCTAGATAACCAAATTAAGAAAACCAAAACAA'],
   ['AGGCTGCCTATCAGAAGGTGGTGGCTGGTGTGGCTGCTGCTCTGGCTCACAAGTACCATT']],
  dtype='<U79')

Any pointers would be much appreciated!

2 answers:

Answer 0 (score: 2)

Numpy has a char.replace method (see the docs). All you need to do is:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

To convert it to an int array:

genes = genes.astype(int)

You can then use bitwise operations on the array.
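A quick sketch of how many digits such an integer can hold, assuming the result is stored as a signed 64-bit integer (an assumption about the platform's default `int`). The helper `fits_in_int64` is a made-up name for illustration and is not part of numpy:

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max  # largest signed 64-bit value

def fits_in_int64(digits: str) -> bool:
    """Return True if the decimal string can be stored in a signed 64-bit int."""
    return int(digits) <= INT64_MAX

# Worst case for this encoding is a run of T's, i.e. every digit is 8.
max_bases = 0
while fits_in_int64("8" * (max_bases + 1)):
    max_bases += 1

print(max_bases)                  # longest all-T sequence that still fits
print(fits_in_int64("1" * 60))    # a 60-base sequence, as in the question's data
```

Under that assumption, the 60-letter sequences from the question are far too long to be stored as a single integer each.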

As pointed out in the comments, the length of the resulting sequence is limited. A way around this:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

>>> genes
array([['12481248'],
       ['12481248']], dtype='|S8')

Insert commas between the digits:

genes = np.char.join(',', genes)

>>> genes
array([['1,2,4,8,1,2,4,8'],
       ['1,2,4,8,1,2,4,8']], dtype='|S15')

Split on the commas and convert back to a plain np.char.array:

genes = np.char.array(np.char.split(genes, ','))

>>> genes
chararray([[['1', '2', '4', '8', '1', '2', '4', '8']],

           [['1', '2', '4', '8', '1', '2', '4', '8']]], dtype='|S1')

Convert to an int array:

genes = np.array(genes, dtype=int)

>>> genes
array([[[1, 2, 4, 8, 1, 2, 4, 8]],

       [[1, 2, 4, 8, 1, 2, 4, 8]]])

Remove the intermediate dimension of size 1:

genes = genes.reshape(list(genes.shape[:-2]) + [genes.shape[-1]])

>>> genes
array([[1, 2, 4, 8, 1, 2, 4, 8],
       [1, 2, 4, 8, 1, 2, 4, 8]])
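To get from integer codes like these to the 4-bit form asked for in the question, one option is `np.unpackbits`. This is a minimal sketch and not part of the answer above; it assumes the codes fit in `uint8`, and the small array `codes` (for CGGT and ACGT) is made up for illustration. It also shows one bitwise test on the codes.

```python
import numpy as np

# Hypothetical input: one code per base (A=1, C=2, G=4, T=8), one row per sequence.
codes = np.array([[2, 4, 4, 8],    # CGGT
                  [1, 2, 4, 8]],   # ACGT
                 dtype=np.uint8)

# unpackbits expands each uint8 into 8 bits, most significant bit first.
# The codes only use the low 4 bits, so keep the last 4 columns.
bits = np.unpackbits(codes[..., np.newaxis], axis=-1)[..., 4:]

# Merge the per-base axis: shape (n_sequences, 4 * n_bases).
flat_bits = bits.reshape(codes.shape[0], -1)

# Optional: render each row as a string of 0s and 1s.
as_text = ["".join(map(str, row)) for row in flat_bits]
print(as_text[0])

# Bitwise test on the codes themselves: mark purines (A or G).
PURINE_MASK = 1 | 4
is_purine = (codes & PURINE_MASK) != 0
print(is_purine)
```

Because every base maps to a distinct bit, a mask such as `1 | 4` can test for a set of bases in a single operation.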

Answer 1 (score: 2)

Here is an approach using a lookup table:

>>> alphabet = np.array(list('ACGT'))
>>> alphabet
array(['A', 'C', 'G', 'T'], dtype='<U1')

To use a lookup table we need to reinterpret the letters as indices, which is done by view casting:

>>> alph_as_num = alphabet.view(np.int32)
>>> alph_as_num
array([65, 67, 71, 84], dtype=int32)
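As a sanity check on the view cast: each character of a numpy unicode string is stored as one 4-byte code point, so reinterpreting the buffer as `int32` should give the same numbers as Python's `ord`. A small sketch; the `np.shares_memory` check and the short `seq` array are added here for illustration only:

```python
import numpy as np

alphabet = np.array(list('ACGT'))
alph_as_num = alphabet.view(np.int32)

# Same values as ord() gives for each letter.
via_ord = [ord(ch) for ch in 'ACGT']
print(alph_as_num.tolist() == via_ord)

# The view reuses the original buffer rather than copying it.
print(np.shares_memory(alphabet, alph_as_num))

# A longer string is viewed as one int32 per character.
seq = np.array(['CGGT'])
print(seq.view(np.int32))
```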

We can now build the lookup table. It needs 85 slots, of which we actually use only 4, namely 65, 67, 71 and 84. As for the output format, we are free to choose whatever suits us best:

Example one - output as bytestring:

>>> lookup_1 = np.zeros((alph_as_num.max()+1), dtype='S4')
>>> lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]

Example two - output as uint8:

>>> lookup_2 = np.zeros((alph_as_num.max()+1), dtype=np.uint8)
>>> lookup_2[alph_as_num] = 1 << np.arange(4)

Example three - output as four uint8 per letter:

>>> lookup_3 = np.zeros((alph_as_num.max()+1, 4), dtype=np.uint8)
>>> lookup_3[alph_as_num[::-1]] = np.identity(4)

Now let's apply this to a 100-letter sequence:

>>> seq
array(['CATTTCTCCACCATTTTGGTTTTTCATTGATCCGTTAGGTGGAGCCGGACTATGTCTACCGAAAGATGCACCTGCGCCGGGTCTGGTCTATCTCTTAATG'],
      dtype='<U100')

The translation is compact and fast because it relies only on

  • numpy's built-in advanced indexing, which gives us very fast lookup (much faster than Python dictionaries, for example)

  • view casting, which is essentially free because it merely reinterprets the data buffer (no copying or conversion needed)

Example one - bytestrings:

>>> lookup_1[seq.view(np.int32)]
array([b'0010', b'0001', b'1000', b'1000', b'1000', b'0010', b'1000',
       b'0010', b'0010', b'0001', b'0010', b'0010', b'0001', b'1000',
       b'1000', b'1000', b'1000', b'0100', b'0100', b'1000', b'1000',
       b'1000', b'1000', b'1000', b'0010', b'0001', b'1000', b'1000',
       b'0100', b'0001', b'1000', b'0010', b'0010', b'0100', b'1000',
       b'1000', b'0001', b'0100', b'0100', b'1000', b'0100', b'0100',
       b'0001', b'0100', b'0010', b'0010', b'0100', b'0100', b'0001',
       b'0010', b'1000', b'0001', b'1000', b'0100', b'1000', b'0010',
       b'1000', b'0001', b'0010', b'0010', b'0100', b'0001', b'0001',
       b'0001', b'0100', b'0001', b'1000', b'0100', b'0010', b'0001',
       b'0010', b'0010', b'1000', b'0100', b'0010', b'0100', b'0010',
       b'0010', b'0100', b'0100', b'0100', b'1000', b'0010', b'1000',
       b'0100', b'0100', b'1000', b'0010', b'1000', b'0001', b'1000',
       b'0010', b'1000', b'0010', b'1000', b'1000', b'0001', b'0001',
       b'1000', b'0100'], dtype='|S4')

As a bonus, these can also be viewed as one long sequence:

>>> lookup_1[seq.view(np.int32)].view('S400')
array([b'0010000110001000100000101000001000100001001000100001100010001000100001000100100010001000100010000010000110001000010000011000001000100100100010000001010001001000010001000001010000100010010001000001001010000001100001001000001010000001001000100100000100010001010000011000010000100001001000101000010000100100001000100100010001001000001010000100010010000010100000011000001010000010100010000001000110000100'], dtype='|S400')

Example two - uint8:

>>> lookup_2[seq.view(np.int32)]
array([2, 1, 8, 8, 8, 2, 8, 2, 2, 1, 2, 2, 1, 8, 8, 8, 8, 4, 4, 8, 8, 8,
       8, 8, 2, 1, 8, 8, 4, 1, 8, 2, 2, 4, 8, 8, 1, 4, 4, 8, 4, 4, 1, 4,
       2, 2, 4, 4, 1, 2, 8, 1, 8, 4, 8, 2, 8, 1, 2, 2, 4, 1, 1, 1, 4, 1,
       8, 4, 2, 1, 2, 2, 8, 4, 2, 4, 2, 2, 4, 4, 4, 8, 2, 8, 4, 4, 8, 2,
       8, 1, 8, 2, 8, 2, 8, 8, 1, 1, 8, 4], dtype=uint8)

Example three - four uint8 per letter; but let's use a different, multi-line seq:
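A minimal sketch of how the third lookup table might be applied to a multi-row input. The two-row array `seq2` and its contents are made up for illustration and are not the answer's data. It assumes every row has the same length, so the view can be reshaped to one row per sequence.

```python
import numpy as np

alph_as_num = np.array(list('ACGT')).view(np.int32)

# One row of four uint8 per letter: A -> 0001, C -> 0010, G -> 0100, T -> 1000.
lookup_3 = np.zeros((alph_as_num.max() + 1, 4), dtype=np.uint8)
lookup_3[alph_as_num[::-1]] = np.identity(4)

# Hypothetical two-row input; all rows have the same length.
seq2 = np.array(['CGGT', 'ACGT'])

# One int32 per character, reshaped to (n_rows, n_letters).
idx = seq2.view(np.int32).reshape(len(seq2), -1)

encoded = lookup_3[idx]                   # shape (n_rows, n_letters, 4)
flat = encoded.reshape(len(seq2), -1)     # shape (n_rows, 4 * n_letters)
print(flat[0])
```

Rows of unequal length would be padded with zero code points in the unicode buffer, which this table maps to 0000, so that case would need separate handling.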