Reading NaN values from a .csv file with decode_csv()

Time: 2017-03-10 12:28:17

Tags: csv tensorflow

My .csv file contains integer values, with NA marking missing data.

Example file:

-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179

When I try to read it with
import tensorflow as tf

reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)

the second iteration of sess.run(_) later throws the following error:

InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA

Is there a way in TensorFlow to interpret the string "NA" as NaN or a similar value when reading a csv?

1 Answer:

Answer 0: (score: 0)

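I ran into the same problem recently. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, and then converting the result to float: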

  import tensorflow as tf  # TensorFlow 1.x queue-based input API

  # Set up reading from CSV files
  filename_queue = tf.train.string_input_producer([filename])
  reader = tf.TextLineReader()
  key, value = reader.read(filename_queue)
  NUM_COLUMNS = XX # Specify number of expected columns

  # Read every field as a string; empty fields default to "NA".
  record_defaults = [["NA"]] * NUM_COLUMNS
  # field_delim="\t" assumes a tab-separated file; use "," for comma-separated data.
  decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
  # Replace every occurrence of "NA" with "-1"
  no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"] * NUM_COLUMNS, decoded)
  # Convert to float, combine to a single tensor with stack.
  float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
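
The same replace-then-convert idea can be sketched without TensorFlow, using only the Python standard library. This is an illustration, not part of the pipeline above: `SAMPLE`, `parse_row`, and its `missing` and `fill` parameters are hypothetical names, and the data is assumed to be comma-separated as in the question. Integers have no NaN representation, so the fields are converted to float, and `fill` can be either NaN or a sentinel such as -1.

```python
import csv
import io
import math

# Hypothetical sample mirroring the example file in the question.
SAMPLE = "-9882,-9585,-9179\n-9883,-9587,NA\n-9882,-9585,-9179\n"


def parse_row(fields, missing="NA", fill=float("nan")):
    """Convert string fields to floats, substituting `fill` for `missing`."""
    return [fill if field.strip() == missing else float(field) for field in fields]


# Read every field as a string first, then convert row by row.
rows = [parse_row(fields) for fields in csv.reader(io.StringIO(SAMPLE))]

# The missing entry in the second row is now NaN instead of raising an error.
print(math.isnan(rows[1][2]))

# A sentinel can be used instead of NaN, as in the TensorFlow snippet above.
print(parse_row(["-9883", "-9587", "NA"], fill=-1.0))
```

Reading everything as strings first keeps the parser from rejecting "NA" before it can be replaced, which is the same reason the TensorFlow version uses string defaults.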

In the long run, though, I plan to switch to tfrecords, because reading csv is too slow for my needs.