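Reading the ProteinNet TFRecords dataset

Time: 2019-02-13 15:34:32

Tags: python tensorflow dataset tfrecord

I have been trying to read and use the ProteinNet dataset, with little success. The paper describing the dataset is here, and the GitHub repository is here.

The data is huge (9 GB uncompressed for version 11 in TFRecords format), so for now I just want to use a visualization tool to understand it better, but the reader from the GitHub repository (parser.py) uses deprecated TensorFlow functions. Here it is: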

__author__ = "Mohammed AlQuraishi"
__copyright__ = "Copyright 2018, Harvard Medical School"
__license__ = "MIT"

import tensorflow as tf
NUM_AAS = 20
NUM_DIMENSIONS = 3

def masking_matrix(mask, name=None):

    with tf.name_scope(name, 'masking_matrix', [mask]) as scope:
        mask = tf.convert_to_tensor(mask, name='mask')

        mask = tf.expand_dims(mask, 0)
        base = tf.ones([tf.size(mask), tf.size(mask)])
        matrix_mask = base * mask * tf.transpose(mask)

        return matrix_mask


def read_protein(filename_queue, max_length, num_evo_entries=21, name=None):
    """ Reads and parses a ProteinNet TF Record. 

        Primary sequences are mapped onto 20-dimensional one-hot vectors.
        Evolutionary sequences are mapped onto num_evo_entries-dimensional real-valued vectors.
        Secondary structures are mapped onto ints indicating one of 8 class labels.
        Tertiary coordinates are flattened so that there are 3 times as many coordinates as 
        residues.

        Evolutionary, secondary, and tertiary entries are optional.

    Args:
        filename_queue: TF queue for reading files
        max_length:     Maximum length of sequence (number of residues) [MAX_LENGTH]. Not a 
                        TF tensor and is thus a fixed value.

    Returns:
        id: string identifier of record
        one_hot_primary: AA sequence as one-hot vectors
        evolutionary: PSSM sequence as vectors
        secondary: DSSP sequence as int class labels
        tertiary: 3D coordinates of structure
        matrix_mask: Masking matrix to zero out pairwise distances in the masked regions
        pri_length: Length of amino acid sequence
        keep: True if primary length is less than or equal to max_length
    """

    with tf.name_scope(name, 'read_protein', []) as scope:
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)

        context, features = tf.parse_single_sequence_example(serialized_example,
                                context_features={'id': tf.FixedLenFeature((1,), tf.string)},
                                sequence_features={
                                    'primary':      tf.FixedLenSequenceFeature((1,),               tf.int64),
                                    'evolutionary': tf.FixedLenSequenceFeature((num_evo_entries,), tf.float32, allow_missing=True),
                                    'secondary':    tf.FixedLenSequenceFeature((1,),               tf.int64,   allow_missing=True),
                                    'tertiary':     tf.FixedLenSequenceFeature((NUM_DIMENSIONS,),  tf.float32, allow_missing=True),
                                    'mask':         tf.FixedLenSequenceFeature((1,),               tf.float32, allow_missing=True)})
        id_ = context['id'][0]
        primary =   tf.to_int32(features['primary'][:, 0])
        evolutionary =          features['evolutionary']
        secondary = tf.to_int32(features['secondary'][:, 0])
        tertiary =              features['tertiary']
        mask =                  features['mask'][:, 0]

        pri_length = tf.size(primary)
        keep = pri_length <= max_length

        one_hot_primary = tf.one_hot(primary, NUM_AAS)

        # Generate tertiary masking matrix--if mask is missing then assume all residues are present
        mask = tf.cond(tf.not_equal(tf.size(mask), 0), lambda: mask, lambda: tf.ones([pri_length]))
        ter_mask = masking_matrix(mask, name='ter_mask')

        return id_, one_hot_primary, evolutionary, secondary, tertiary, ter_mask, pri_length, keep

The deprecated function is:

  

tf.TFRecordReader()

which apparently should be replaced with

  

tf.data.TFRecordDataset(filenames)

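However, because I know little about TFRecords and the documentation on the records is scarce, I have not managed to read anything from the dataset.

How can I update the read_protein() function so that it works, and how do I convert the TFRecords into ordinary tensors? I am completely unfamiliar with this type of file.

I can provide a sample of the dataset if needed, since I understand that a 9 GB download is a lot to ask.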

1 answer:

Answer 0 (score: 1)

You can access the individual serialized examples by iterating over the records in the file.

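In the for loop below, each serialized record is parsed into an example, from which the 'primary' feature can then be accessed: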

for str_rec in tf.python_io.tf_record_iterator('filename.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(str_rec)
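
A minimal sketch of reading the 'primary' feature inside such a loop. The parser.py quoted in the question calls parse_single_sequence_example, which suggests the records are tf.train.SequenceExample protos rather than tf.train.Example, so the sketch assumes that. It also assumes TensorFlow 2.x with eager execution and iterates with tf.data.TFRecordDataset instead of tf.python_io.tf_record_iterator. The helper name iter_primary is made up for illustration.

```python
import tensorflow as tf


def iter_primary(path):
    """Yield (record id, primary sequence as a list of ints) for each record."""
    # Assumption: every record is a tf.train.SequenceExample, as implied by
    # the parse_single_sequence_example call in parser.py.
    for raw in tf.data.TFRecordDataset(path):
        example = tf.train.SequenceExample()
        example.ParseFromString(raw.numpy())
        # 'id' lives in the context; 'primary' is a list with one int64 per residue.
        record_id = example.context.feature['id'].bytes_list.value[0].decode()
        steps = example.feature_lists.feature_list['primary'].feature
        yield record_id, [step.int64_list.value[0] for step in steps]
```

Printing a few (id, primary) pairs this way is a cheap way to check which keys a file really contains before writing a full parser.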

In general, you need to understand how the data was encoded into the TFRecords. For more information, see https://www.tensorflow.org/tutorials/load_data/tf_records
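
Once the encoding is known, one possible way to port read_protein() to tf.data is sketched below. It assumes TensorFlow 2.x and reuses the feature specification from the parser.py quoted in the question. The names parse_protein and protein_dataset are illustrative and not part of ProteinNet, and the function returns a dict instead of the original tuple purely for readability.

```python
import tensorflow as tf

NUM_AAS = 20
NUM_DIMENSIONS = 3


def parse_protein(serialized, max_length, num_evo_entries=21):
    """Parse one serialized ProteinNet record into a dict of tensors."""
    context, features = tf.io.parse_single_sequence_example(
        serialized,
        context_features={'id': tf.io.FixedLenFeature((1,), tf.string)},
        sequence_features={
            'primary': tf.io.FixedLenSequenceFeature((1,), tf.int64),
            'evolutionary': tf.io.FixedLenSequenceFeature(
                (num_evo_entries,), tf.float32, allow_missing=True),
            'secondary': tf.io.FixedLenSequenceFeature(
                (1,), tf.int64, allow_missing=True),
            'tertiary': tf.io.FixedLenSequenceFeature(
                (NUM_DIMENSIONS,), tf.float32, allow_missing=True),
            'mask': tf.io.FixedLenSequenceFeature(
                (1,), tf.float32, allow_missing=True),
        })
    primary = tf.cast(features['primary'][:, 0], tf.int32)
    pri_length = tf.size(primary)
    mask = features['mask'][:, 0]
    # If the mask is missing, assume all residues are present.
    mask = tf.cond(tf.size(mask) > 0,
                   lambda: mask,
                   lambda: tf.ones([pri_length]))
    return {
        'id': context['id'][0],
        'primary': tf.one_hot(primary, NUM_AAS),
        'evolutionary': features['evolutionary'],
        'secondary': tf.cast(features['secondary'][:, 0], tf.int32),
        'tertiary': features['tertiary'],
        # Outer product of the mask with itself, as in masking_matrix().
        'ter_mask': mask[:, None] * mask[None, :],
        'pri_length': pri_length,
        'keep': pri_length <= max_length,
    }


def protein_dataset(filenames, max_length):
    """Build a tf.data pipeline that yields one parsed protein per element."""
    dataset = tf.data.TFRecordDataset(filenames)
    return dataset.map(lambda s: parse_protein(s, max_length))
```

In eager mode the result can be iterated directly, for example for protein in protein_dataset('filename.tfrecords', 500).take(1), and every value in the dict is an ordinary tensor whose .numpy() method returns a NumPy array. Sequences longer than max_length can be dropped with dataset.filter(lambda p: p['keep']).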