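I'm playing around with TensorFlow but am a bit confused about the input pipeline. The data I'm working with is a large CSV file with 307 columns, of which the first is a string representing a date and the rest are floats.
I'm running into some problems preprocessing the data. I want to add a couple of features that replace the date string but are derived from it (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, group the 96 values after that as another feature, and base my label on the remaining values in the CSV.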
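To make the target concrete, here is roughly what I would do for a single row in plain Python, outside of TensorFlow. This is only a sketch: the timestamp format `%Y-%m-%d %H:%M:%S` is an assumption, `split_row` is a made-up helper name, and the label logic is left out because I have not settled on it.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60

def split_row(row, time_format="%Y-%m-%d %H:%M:%S"):
    """Split one parsed CSV row (307 values) into features and raw label values.

    `time_format` is an assumed timestamp layout; adjust it to the real data.
    """
    timestamp = datetime.strptime(row[0], time_format)
    seconds = timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second
    # Map the time of day onto a circle so that 23:59 and 00:00 end up close together.
    angle = 2.0 * math.pi * seconds / SECONDS_PER_DAY
    features = {
        "timeSine": math.sin(angle),
        "timeCosine": math.cos(angle),
        "minuteFeatures": [float(v) for v in row[1:121]],      # 120 values
        "halfHourFeatures": [float(v) for v in row[121:217]],  # 96 values
    }
    label_values = [float(v) for v in row[217:]]  # remaining 90 values
    return features, label_values
```

What I can't figure out is how to express the same thing on tensors inside the pipeline.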
This is my code for generating the datasets so far:
import tensorflow as tf

# One default per column: a string for the date, then 306 floats.
defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]
        ## Do something to convert datetimeString to timeSine and timeCosine
        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }
        # Placeholder. I seem to need some Python logic here, but I'm
        # not sure how to apply that to data in tensor format.
        label = [1]
        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
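The train/test split above relies on hashing each raw line into a bucket. As a rough plain-Python illustration of the same idea (using `zlib.crc32` as a stand-in hash, which is an assumption on my part and not the hash TensorFlow uses, so the two would assign lines differently):

```python
import zlib

NUM_BUCKETS = 1000000

def in_training_set(line, train_fraction=0.8):
    """Deterministically assign a line to the training set via its hash bucket."""
    # crc32 is only a stand-in here; any stable hash of the line would do.
    bucket_id = zlib.crc32(line.encode("utf-8")) % NUM_BUCKETS
    return bucket_id < int(train_fraction * NUM_BUCKETS)

def in_test_set(line, train_fraction=0.8):
    """A line is in the test set exactly when it is not in the training set."""
    return not in_training_set(line, train_fraction)
```

Because the bucket depends only on the line's content, the same row lands in the same set on every run.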
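My question now is: how can I access the string in the datetimeString tensor to convert it to a datetime object? Or is this the wrong place to do that? I want to use the time and the day of the week as input features.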
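For the day of the week, the plain-Python equivalent of what I am after would look something like this. Again this is just a sketch: the timestamp format is assumed, and `weekday_features` is a hypothetical helper of my own, not something from TensorFlow.

```python
import math
from datetime import datetime

def weekday_features(datetime_string, time_format="%Y-%m-%d %H:%M:%S"):
    """Encode the day of the week as a point on the unit circle.

    `time_format` is an assumed layout for the first CSV column.
    """
    weekday = datetime.strptime(datetime_string, time_format).weekday()  # Monday == 0
    angle = 2.0 * math.pi * weekday / 7.0
    return math.sin(angle), math.cos(angle)
```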
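Secondly: pretty much the same goes for the label, which is based on the remaining values of the CSV. Can I use standard Python code for this in some way, or should I use basic TensorFlow ops to achieve what I want, if that is possible?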
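Finally, any comments on whether this is the right way to handle my inputs at all? TensorFlow is a bit confusing, with old tutorials spread around the internet that use deprecated ways of handling inputs.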