Data Batches

pysgmcmc.data_batches.generate_batches(x, y, x_placeholder, y_placeholder, batch_size=20, seed=None)[source]

Infinite generator of random minibatches for a dataset.

For general reference on (infinite) generators, see: https://www.python.org/dev/peps/pep-0255/
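As a minimal, standalone illustration of the infinite-generator pattern described in PEP 255 (not part of pysgmcmc):

```python
def counter(start=0):
    """Infinite generator: yields start, start + 1, ... forever."""
    value = start
    while True:
        yield value
        value += 1

gen = counter()
first_three = [next(gen) for _ in range(3)]  # [0, 1, 2]
```

Like `counter`, the batch generators below never raise `StopIteration`; callers simply pull as many batches as they need with `next`.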
Parameters:
  • x (np.ndarray (N, D)) – Training data points/features
  • y (np.ndarray (N, 1)) – Training data labels
  • x_placeholder (tensorflow.placeholder) – Placeholder for batches of data from x.
  • y_placeholder (tensorflow.placeholder) – Placeholder for batches of data from y.
  • batch_size (int, optional) – Number of datapoints to put into a batch. Defaults to 20.
  • seed (int, optional) – Random seed to use during batch generation. Defaults to None.
Yields:

batch_dict (dict) – A dictionary that maps x_placeholder and y_placeholder to batch_size sized minibatches of data (numpy.ndarrays) from the dataset x, y.

Examples

Simple batch extraction example:

>>> import numpy as np
>>> import tensorflow as tf
>>> N, D = 100, 3  # 100 datapoints with 3 features each
>>> x = np.asarray([np.random.uniform(-10, 10, D) for _ in range(N)])
>>> y = np.asarray([np.random.choice([0., 1.]) for _ in range(N)])
>>> x.shape, y.shape
((100, 3), (100,))
>>> x_placeholder, y_placeholder = tf.placeholder(dtype=tf.float64), tf.placeholder(dtype=tf.float64)
>>> batch_size = 20
>>> gen = generate_batches(x, y, x_placeholder, y_placeholder, batch_size)
>>> batch_dict = next(gen)  # extract a batch
>>> set(batch_dict.keys()) == set((x_placeholder, y_placeholder))
True
>>> batch_dict[x_placeholder].shape, batch_dict[y_placeholder].shape
((20, 3), (20, 1))

If the dataset contains fewer than batch_size datapoints, the batch size is reduced to the size of the dataset:

>>> import numpy as np
>>> import tensorflow as tf
>>> N, D = 10, 3  # 10 datapoints with 3 features each
>>> x = np.asarray([np.random.uniform(-10, 10, D) for _ in range(N)])
>>> y = np.asarray([np.random.choice([0., 1.]) for _ in range(N)])
>>> x.shape, y.shape
((10, 3), (10,))
>>> x_placeholder, y_placeholder = tf.placeholder(dtype=tf.float64), tf.placeholder(dtype=tf.float64)
>>> batch_size = 20
>>> gen = generate_batches(x, y, x_placeholder, y_placeholder, batch_size)
>>> batch_dict = next(gen)  # extract a batch
>>> set(batch_dict.keys()) == set((x_placeholder, y_placeholder))
True
>>> batch_dict[x_placeholder].shape, batch_dict[y_placeholder].shape
((10, 3), (10, 1))

In this case, the batches contain exactly all datapoints:

>>> np.allclose(batch_dict[x_placeholder], x), np.allclose(batch_dict[y_placeholder].reshape(N,), y)
(True, True)
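The behaviour shown above can be sketched in pure numpy. This is a hypothetical reimplementation for illustration only, not the actual pysgmcmc source; plain string keys stand in for the tensorflow placeholders:

```python
import numpy as np

def generate_batches_sketch(x, y, x_key, y_key, batch_size=20, seed=None):
    """Infinite generator of random minibatches, yielded as feed dicts.

    If the dataset has fewer than `batch_size` datapoints, the whole
    dataset is yielded as a single batch instead.
    """
    rng = np.random.RandomState(seed)
    n = x.shape[0]
    effective_batch_size = min(batch_size, n)
    while True:
        # Sample a batch of distinct row indices.
        indices = rng.choice(n, size=effective_batch_size, replace=False)
        yield {x_key: x[indices], y_key: y[indices].reshape(-1, 1)}

x = np.random.uniform(-10, 10, size=(100, 3))
y = np.random.choice([0., 1.], size=100)
gen = generate_batches_sketch(x, y, "x", "y", batch_size=20, seed=1)
batch = next(gen)
batch["x"].shape, batch["y"].shape  # ((20, 3), (20, 1))
```

Note how labels are reshaped to a column vector, matching the (20, 1) and (10, 1) label-batch shapes in the doctests above.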
pysgmcmc.data_batches.generate_shuffled_batches(x, y, x_placeholder, y_placeholder, batch_size=20, seed=None)[source]

Infinite generator of shuffled random minibatches for a dataset.

For general reference on (infinite) generators, see: https://www.python.org/dev/peps/pep-0255/
Parameters:
  • x (np.ndarray (N, D)) – Training data points/features
  • y (np.ndarray (N, 1)) – Training data labels
  • x_placeholder (tensorflow.placeholder) – Placeholder for batches of data from x.
  • y_placeholder (tensorflow.placeholder) – Placeholder for batches of data from y.
  • batch_size (int, optional) – Number of datapoints to put into a batch. Defaults to 20.
  • seed (int, optional) – Random seed to use during batch generation and shuffling. Defaults to None.
Yields:

batch_dict (dict) – A dictionary that maps x_placeholder and y_placeholder to batch_size sized minibatches of data (numpy.ndarrays) from the dataset x, y.

Examples

Simple shuffled batch extraction example:

>>> import numpy as np
>>> import tensorflow as tf
>>> N, D = 100, 3  # 100 datapoints with 3 features each
>>> x = np.asarray([np.random.uniform(-10, 10, D) for _ in range(N)])
>>> y = np.asarray([np.random.choice([0., 1.]) for _ in range(N)])
>>> x.shape, y.shape
((100, 3), (100,))
>>> x_placeholder, y_placeholder = tf.placeholder(dtype=tf.float64), tf.placeholder(dtype=tf.float64)
>>> batch_size = 20
>>> gen = generate_shuffled_batches(x, y, x_placeholder, y_placeholder, batch_size)
>>> batch_dict = next(gen)  # extract a batch
>>> set(batch_dict.keys()) == set((x_placeholder, y_placeholder))
True
>>> batch_dict[x_placeholder].shape, batch_dict[y_placeholder].shape
((20, 3), (20, 1))

Shuffling preserves the pairing of datapoints and labels: each row in a shuffled batch of x is still matched with its original label from y.
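Whether shuffling keeps each datapoint paired with its label can be checked with a numpy-only sketch. The shuffling logic here is hypothetical, standing in for generate_shuffled_batches, and string keys replace the tensorflow placeholders; the key point is that x and y share one permutation:

```python
import numpy as np

def shuffled_batches_sketch(x, y, x_key, y_key, batch_size=20, seed=None):
    """Infinite generator: reshuffle the dataset each epoch, then slice
    consecutive minibatches. x and y are indexed by the same permutation,
    so every datapoint stays paired with its own label."""
    rng = np.random.RandomState(seed)
    n = x.shape[0]
    batch_size = min(batch_size, n)
    while True:
        permutation = rng.permutation(n)
        for start in range(0, n - batch_size + 1, batch_size):
            idx = permutation[start:start + batch_size]
            yield {x_key: x[idx], y_key: y[idx].reshape(-1, 1)}

# Make each label recoverable from its datapoint: y[i] == x[i].sum()
x = np.random.uniform(-10, 10, size=(100, 3))
y = x.sum(axis=1)
batch = next(shuffled_batches_sketch(x, y, "x", "y", seed=0))
# After shuffling, every row's label still equals that row's feature sum.
np.allclose(batch["x"].sum(axis=1), batch["y"].ravel())  # True
```

If x and y were permuted independently, the final check would almost surely fail, which makes it a simple regression test for correct shuffling.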