Preprocessing CSV data using the tensorflow Dataset API

Asked: 2018-01-11 16:30:54

Tags: python csv tensorflow tensorflow-datasets

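I'm playing around with tensorflow, but I'm a bit confused about the input pipeline. The data I'm working with is a large CSV file with 307 columns, where the first column is a string representing a date and the rest are floats.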

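I'm running into some problems preprocessing the data. I want to add a few features derived from the date string rather than use the string itself (specifically, a sine and a cosine representing the time). I also want to combine the next 120 values in the CSV row into one feature, the 96 values after that into another feature, and base my label on the remaining values in the CSV.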

This is my current code for generating the datasets:

import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0,306):
  defaults.append([1.0])

def dataset(train_fraction=0.8):
  path = "training_examples_shuffled.csv"

  # Define how the lines of the file should be parsed
  def decode_line(line):
    items = tf.decode_csv(line, record_defaults=defaults)

    datetimeString = items[0]
    minuteFeatures = items[1:121]
    halfHourFeatures = items[121:217]
    labelFeatures = items[217:]

    ## Do something to convert datetimeString to timeSine and timeCosine

    features_dict = {
      'timeSine': timeSine,
      'timeCosine': timeCosine,
      'minuteFeatures': minuteFeatures,
      'halfHourFeatures': halfHourFeatures
    }

    label = [1]  # Placeholder. I seem to need some Python logic here, but I'm
                 # not sure how to apply that to data in tensor format.

    return features_dict, label

  def in_training_set(line):
    """Returns a boolean tensor, true if the line is in the training set."""
    num_buckets = 1000000
    bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
    # Use the hash bucket id as a random number that's deterministic per example
    return bucket_id < int(train_fraction * num_buckets)

  def in_test_set(line):
    """Returns a boolean tensor, true if the line is in the test set."""
    return ~in_training_set(line)

  base_dataset = (tf.data
                  # Get the lines from the file.
                  .TextLineDataset(path))

  train = (base_dataset
           # Take only the training-set lines.
           .filter(in_training_set)
           # Decode each line into a (features_dict, label) pair.
           .map(decode_line))

  # Do the same for the test-set.
  test = (base_dataset.filter(in_test_set).map(decode_line))

  return train, test

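My question now is: how do I access the string in the datetimeString tensor so I can convert it to a datetime object? Or is this the wrong place to do that? I want to use the time of day and the day of the week as input features.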
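To make the intended time features concrete, here is a minimal plain-Python sketch of the computation, outside of tensorflow. The timestamp format `"%Y-%m-%d %H:%M:%S"` and the helper name `time_features` are assumptions for illustration; the question does not show what the date strings actually look like.

```python
import math
from datetime import datetime

# Assumed timestamp layout; the question does not show the real one.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 24 * 60 * 60


def time_features(datetime_string, fmt=ASSUMED_FORMAT):
    """Return (time_sine, time_cosine, day_of_week) for one timestamp string."""
    moment = datetime.strptime(datetime_string, fmt)
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    # Map the time of day onto a circle so 23:59 and 00:00 end up close together.
    angle = 2.0 * math.pi * seconds / SECONDS_PER_DAY
    # weekday() is 0 for Monday through 6 for Sunday.
    return math.sin(angle), math.cos(angle), moment.weekday()


time_sine, time_cosine, day_of_week = time_features("2018-01-11 06:00:00")
```

In TensorFlow 1.x, one way to run Python logic like this inside `Dataset.map` is to wrap it with `tf.py_func`. Whether that is preferable to precomputing these columns before writing the CSV depends on the pipeline, so treat this as one option rather than a recommendation.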

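Secondly: much the same applies to the labels, which are based on the remaining values of the CSV. Can I just use standard Python code in some way, or should I use basic tensorflow operations to achieve what I want, if that is possible?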
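For reference, the column layout described above can be sketched in plain Python as follows. The helper names `split_row` and `make_label` are hypothetical, and the label rule (a simple mean) is only a stand-in, since the question does not say how the label is derived from the remaining values.

```python
def split_row(values):
    """Split the 306 floats that follow the date column into three groups."""
    if len(values) != 306:
        raise ValueError("expected 306 values, got %d" % len(values))
    minute_features = values[0:120]       # CSV columns 1..120
    half_hour_features = values[120:216]  # CSV columns 121..216
    label_values = values[216:]           # CSV columns 217..306 (90 values)
    return minute_features, half_hour_features, label_values


def make_label(label_values):
    """Hypothetical label rule: the mean of the remaining values."""
    return sum(label_values) / len(label_values)


# Dummy row standing in for one parsed CSV line without its date column.
row = [float(i) for i in range(306)]
minute_features, half_hour_features, label_values = split_row(row)
label = make_label(label_values)
```

Inside `decode_line`, each slice of `items` is a Python list of scalar tensors, so `tf.stack` can turn a slice into a single tensor. Simple rules such as a mean or a threshold can then likely be written with ordinary tensorflow ops instead of Python code.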

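Finally, are there any comments on whether this is the right way to handle my input? Tensorflow is a bit confusing, with old tutorials that use deprecated ways of handling input scattered around the internet.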
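The train/test split in the code above relies on hashing each line into a bucket, so the same line always lands on the same side. A plain-Python analogue of that idea is sketched below. It uses `hashlib.md5` purely for illustration, which is an assumption and not the hash that `tf.string_to_hash_bucket_fast` uses, so the actual bucket assignments will differ.

```python
import hashlib

NUM_BUCKETS = 1000000


def in_training_set(line, train_fraction=0.8, num_buckets=NUM_BUCKETS):
    """Return True if the line's hash bucket falls in the training share."""
    digest = hashlib.md5(line.encode("utf-8")).hexdigest()
    # The hash acts as a random number that is deterministic per line.
    bucket_id = int(digest, 16) % num_buckets
    return bucket_id < int(train_fraction * num_buckets)


# Made-up lines standing in for CSV rows.
lines = ["row-%d" % i for i in range(1000)]
train_lines = [line for line in lines if in_training_set(line)]
test_lines = [line for line in lines if not in_training_set(line)]
```

Because membership depends only on the line's content, re-running the pipeline reproduces the same split, and exact duplicate rows can never end up on both sides.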

0 answers:

No answers yet