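I have been trying to read and use the ProteinNet dataset, with little success. The paper describing the dataset is here, and the GitHub repository is here.
The data is huge (9 GB uncompressed in TFRecords version 11), so for now I just want to use a visualization tool to understand it better. However, the reader in the GitHub repository (parser.py) uses deprecated TensorFlow functions. Here it is: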
__author__ = "Mohammed AlQuraishi"
__copyright__ = "Copyright 2018, Harvard Medical School"
__license__ = "MIT"

import tensorflow as tf

NUM_AAS = 20
NUM_DIMENSIONS = 3

def masking_matrix(mask, name=None):
    with tf.name_scope(name, 'masking_matrix', [mask]) as scope:
        mask = tf.convert_to_tensor(mask, name='mask')
        mask = tf.expand_dims(mask, 0)
        base = tf.ones([tf.size(mask), tf.size(mask)])
        matrix_mask = base * mask * tf.transpose(mask)
        return matrix_mask

def read_protein(filename_queue, max_length, num_evo_entries=21, name=None):
    """ Reads and parses a ProteinNet TF Record.

        Primary sequences are mapped onto 20-dimensional one-hot vectors.
        Evolutionary sequences are mapped onto num_evo_entries-dimensional real-valued vectors.
        Secondary structures are mapped onto ints indicating one of 8 class labels.
        Tertiary coordinates are flattened so that there are 3 times as many coordinates as
        residues.

        Evolutionary, secondary, and tertiary entries are optional.

    Args:
        filename_queue: TF queue for reading files
        max_length:     Maximum length of sequence (number of residues) [MAX_LENGTH]. Not a
                        TF tensor and is thus a fixed value.

    Returns:
        id:              string identifier of record
        one_hot_primary: AA sequence as one-hot vectors
        evolutionary:    PSSM sequence as vectors
        secondary:       DSSP sequence as int class labels
        tertiary:        3D coordinates of structure
        matrix_mask:     Masking matrix to zero out pairwise distances in the masked regions
        pri_length:      Length of amino acid sequence
        keep:            True if primary length is less than or equal to max_length
    """
    with tf.name_scope(name, 'read_protein', []) as scope:
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)
        context, features = tf.parse_single_sequence_example(serialized_example,
            context_features={'id': tf.FixedLenFeature((1,), tf.string)},
            sequence_features={
                'primary':      tf.FixedLenSequenceFeature((1,), tf.int64),
                'evolutionary': tf.FixedLenSequenceFeature((num_evo_entries,), tf.float32, allow_missing=True),
                'secondary':    tf.FixedLenSequenceFeature((1,), tf.int64, allow_missing=True),
                'tertiary':     tf.FixedLenSequenceFeature((NUM_DIMENSIONS,), tf.float32, allow_missing=True),
                'mask':         tf.FixedLenSequenceFeature((1,), tf.float32, allow_missing=True)})
        id_ = context['id'][0]
        primary = tf.to_int32(features['primary'][:, 0])
        evolutionary = features['evolutionary']
        secondary = tf.to_int32(features['secondary'][:, 0])
        tertiary = features['tertiary']
        mask = features['mask'][:, 0]
        pri_length = tf.size(primary)
        keep = pri_length <= max_length
        one_hot_primary = tf.one_hot(primary, NUM_AAS)
        # Generate tertiary masking matrix--if mask is missing then assume all residues are present
        mask = tf.cond(tf.not_equal(tf.size(mask), 0), lambda: mask, lambda: tf.ones([pri_length]))
        ter_mask = masking_matrix(mask, name='ter_mask')
        return id_, one_hot_primary, evolutionary, secondary, tertiary, ter_mask, pri_length, keep
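The deprecated function is:
tf.TFRecordReader()
which should apparently be replaced with
tf.data.TFRecordDataset(filename)
However, because I know little about TFRecords and there is little beginner-level documentation, I have not been able to read anything from the dataset.
How can I update the read_protein() function so that it works, and how do I convert TFRecords into ordinary tensors? I am completely unfamiliar with this type of file.
I can provide a sample of the dataset if needed, since I understand that a 9 GB download is a lot to ask.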
Answer 0 (score: 1)
You can access the individual serialized examples with
for str_rec in tf.python_io.tf_record_iterator('filename.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(str_rec)
Then, inside the for loop, you can access the 'primary' feature of the parsed example.
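As a rough sketch of reading the 'primary' feature record by record, the snippet below iterates over the file with tf.data.TFRecordDataset in TensorFlow 2.x eager mode (tf.python_io.tf_record_iterator is a TensorFlow 1.x name, kept in 2.x as tf.compat.v1.io.tf_record_iterator). It assumes the records are tf.train.SequenceExample protos, because the parser in the question calls parse_single_sequence_example; if that holds, parsing them as tf.train.Example would likely not expose the per-residue feature lists. The function name read_primary_sequences and the limit parameter are made up for illustration.

```python
import tensorflow as tf


def read_primary_sequences(path, limit=3):
    """Return (record id, list of residue indices) for the first `limit` records.

    Hypothetical helper: assumes each record is a SequenceExample with an
    'id' context feature and a 'primary' feature list, as in the question's parser.
    """
    results = []
    # In eager mode the dataset yields one serialized record (a string tensor) at a time.
    for raw in tf.data.TFRecordDataset(path).take(limit):
        example = tf.train.SequenceExample.FromString(raw.numpy())
        record_id = example.context.feature['id'].bytes_list.value[0].decode()
        # Each step of the 'primary' feature list holds one int64 residue index.
        steps = example.feature_lists.feature_list['primary'].feature
        primary = [step.int64_list.value[0] for step in steps]
        results.append((record_id, primary))
    return results
```

Calling read_primary_sequences('filename.tfrecords') and printing the result is a cheap way to check that the field names match before building a full input pipeline.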
In general, you need to know how the data was encoded into the TFRecords. For more information, see https://www.tensorflow.org/tutorials/load_data/tf_records
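For the read_protein() part of the question, one possible TensorFlow 2.x port is sketched below. It keeps the feature specification from the parser quoted in the question and swaps the queue-based tf.TFRecordReader for tf.data.TFRecordDataset plus Dataset.map. The names parse_protein and load_proteins, and the choice to return a dict, are illustrative assumptions rather than part of the ProteinNet code, and the sketch has not been run against the real 9 GB files.

```python
import tensorflow as tf

NUM_AAS = 20
NUM_DIMENSIONS = 3


def parse_protein(serialized_example, max_length, num_evo_entries=21):
    """Parse one serialized record into a dict of tensors (hypothetical TF 2.x port)."""
    context, features = tf.io.parse_single_sequence_example(
        serialized_example,
        context_features={'id': tf.io.FixedLenFeature((1,), tf.string)},
        sequence_features={
            'primary': tf.io.FixedLenSequenceFeature((1,), tf.int64),
            'evolutionary': tf.io.FixedLenSequenceFeature(
                (num_evo_entries,), tf.float32, allow_missing=True),
            'secondary': tf.io.FixedLenSequenceFeature((1,), tf.int64, allow_missing=True),
            'tertiary': tf.io.FixedLenSequenceFeature(
                (NUM_DIMENSIONS,), tf.float32, allow_missing=True),
            'mask': tf.io.FixedLenSequenceFeature((1,), tf.float32, allow_missing=True),
        })
    primary = tf.cast(features['primary'][:, 0], tf.int32)  # replaces tf.to_int32
    mask = features['mask'][:, 0]
    pri_length = tf.size(primary)
    # If the mask is missing, assume all residues are present (same rule as the original).
    mask = tf.cond(tf.size(mask) > 0, lambda: mask, lambda: tf.ones([pri_length]))
    return {
        'id': context['id'][0],
        'primary': tf.one_hot(primary, NUM_AAS),
        'evolutionary': features['evolutionary'],
        'secondary': tf.cast(features['secondary'][:, 0], tf.int32),
        'tertiary': features['tertiary'],
        # Outer product of the mask with itself, equivalent to masking_matrix().
        'ter_mask': mask[:, tf.newaxis] * mask[tf.newaxis, :],
        'pri_length': pri_length,
        'keep': pri_length <= max_length,
    }


def load_proteins(filenames, max_length):
    """Build a dataset of parsed proteins, dropping those longer than max_length."""
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(lambda record: parse_protein(record, max_length))
    return dataset.filter(lambda protein: protein['keep'])
```

In eager mode, `for protein in load_proteins('filename.tfrecords', max_length=1000)` yields dicts of ordinary tensors, and calling `.numpy()` on any entry (for example `protein['tertiary'].numpy()`) gives a NumPy array that can be passed to a plotting tool.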