How do I split Tensorflow datasets?

You may use Dataset.take() and Dataset.skip():

```python
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)

full_dataset = tf.data.TFRecordDataset(FLAGS.input_file)
# shuffle() requires a buffer_size argument; calling it with no
# arguments raises a TypeError
full_dataset = full_dataset.shuffle(buffer_size=DATASET_SIZE)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size)
```

For generality, I gave an example using a 70/15/15 train/val/test split, but if you don't …
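The snippet above assumes an on-disk TFRecord file. As a self-contained sketch of the same take/skip pattern, the example below substitutes a hypothetical in-memory `Dataset.range` for the TFRecord file; `DATASET_SIZE` and the split variables mirror the answer, while the seed and `reshuffle_each_iteration=False` are additions that keep the three splits disjoint across iterations:

```python
import tensorflow as tf

# Hypothetical in-memory dataset standing in for the TFRecord file
DATASET_SIZE = 100
full_dataset = tf.data.Dataset.range(DATASET_SIZE)

# A buffer at least as large as the dataset gives a full shuffle;
# fixing the seed and disabling reshuffling keeps the shuffle order
# identical each time the dataset is iterated, so take()/skip()
# produce non-overlapping splits.
full_dataset = full_dataset.shuffle(
    buffer_size=DATASET_SIZE, seed=42, reshuffle_each_iteration=False)

train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)

train_dataset = full_dataset.take(train_size)
remaining = full_dataset.skip(train_size)
val_dataset = remaining.take(val_size)
test_dataset = remaining.skip(val_size)

print(len(list(train_dataset)), len(list(val_dataset)), len(list(test_dataset)))
```

Note that every split re-reads (and re-shuffles) the source pipeline, which is why pinning the shuffle order matters here.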

Numpy to TFrecords: Is there a more simple way to handle batch inputs from tfrecords?

The whole process is simplified using the Dataset API. Here are both parts: (1) converting a numpy array to tfrecords and (2) reading the tfrecords back to generate batches.

1. Creation of tfrecords from a numpy array:

```python
def npy_to_tfrecords(...):
    # write records to a tfrecords file
    writer = tf.python_io.TFRecordWriter(output_file)
    # Loop through all the features you …
```
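Since the excerpt cuts off, here is a minimal end-to-end sketch of the same idea using the TF 2.x names (`tf.io.TFRecordWriter` replaces the `tf.python_io.TFRecordWriter` shown above); the feature key `"data"` and the temp-file path are assumptions for illustration:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def npy_to_tfrecords(array, output_file):
    # Write each row of the array as one tf.train.Example record
    with tf.io.TFRecordWriter(output_file) as writer:
        for row in array:
            feature = {
                "data": tf.train.Feature(
                    float_list=tf.train.FloatList(value=row.tolist()))
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

def parse_fn(serialized):
    # The fixed length [3] must match the row width used when writing
    features = {"data": tf.io.FixedLenFeature([3], tf.float32)}
    return tf.io.parse_single_example(serialized, features)["data"]

arr = np.arange(12, dtype=np.float32).reshape(4, 3)
output_file = os.path.join(tempfile.mkdtemp(), "example.tfrecords")
npy_to_tfrecords(arr, output_file)

# Read the records back and batch them with the Dataset API
dataset = tf.data.TFRecordDataset(output_file).map(parse_fn).batch(2)
shapes = [tuple(batch.shape) for batch in dataset]
print(shapes)
```

The batching, shuffling, and prefetching then all come for free as chained Dataset transformations instead of hand-rolled queue code.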

Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle

TL;DR Despite their similar names, these arguments have quite different meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element. The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument …
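The distinction can be seen directly: prefetch never reorders elements, while shuffle draws each element at random from a buffer of the given size, so a buffer of 1 degenerates to no shuffling at all. A small sketch (the element values are arbitrary):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# prefetch overlaps producing the next element with consuming the
# current one, but never changes the order of elements
prefetched = list(ds.prefetch(buffer_size=2).as_numpy_iterator())

# shuffle with buffer_size=1 draws each "random" element from a
# buffer holding exactly one element, so the order is unchanged
unshuffled = list(ds.shuffle(buffer_size=1).as_numpy_iterator())

print(prefetched == list(range(10)))
print(unshuffled == list(range(10)))
```

For a uniform shuffle, buffer_size must be at least the dataset size; smaller buffers trade randomness for memory.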