How to prefetch data using a custom Python function in TensorFlow

Author: mobiledu2402852357 | 2023-09-08 14:52

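I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses it (for example, by adding a context window). In other words, one thread does the data preprocessing and another does the training. Is this possible in TensorFlow?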

Update: I now have a working example based on @mrry's answer.

import numpy as np
import tensorflow as tf
import threading

BATCH_SIZE = 5
TRAINING_ITERS = 4100

feature_input = tf.placeholder(tf.float32, shape=[128])
label_input = tf.placeholder(tf.float32, shape=[128])

q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
enqueue_op = q.enqueue([label_input, feature_input])

label_batch, feature_batch = q.dequeue_many(BATCH_SIZE)
c = tf.reshape(feature_batch, [BATCH_SIZE, 128]) + tf.reshape(label_batch, [BATCH_SIZE, 128])

sess = tf.Session()

def load_and_enqueue(sess, enqueue_op, coord):
  # Open in binary mode: np.fromfile reads raw bytes.
  with open('dummy_data/features.bin', 'rb') as feature_file, open('dummy_data/labels.bin', 'rb') as label_file:
    while not coord.should_stop():
      feature_array = np.fromfile(feature_file, np.float32, 128)
      if feature_array.shape[0] == 0:
        print('reach end of file, reset using seek(0,0)')
        feature_file.seek(0,0)
        label_file.seek(0,0)
        continue
      label_value = np.fromfile(label_file, np.float32, 128)

      sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                      label_input: label_value})

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess,enqueue_op, coord))
t.start()

for i in range(TRAINING_ITERS):
  result = sess.run(c)  # renamed from `sum` to avoid shadowing the built-in
  print('train_iter='+str(i))
  print(result)

coord.request_stop()
coord.join([t])
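
The example above reads `dummy_data/features.bin` and `dummy_data/labels.bin` as raw `float32` values, 128 per record. To try it out you need such files. Below is a minimal sketch for generating them with the standard library only. The helper name `write_dummy_data`, the record count and the random contents are assumptions for illustration, and it relies on the platform's C `float` being 4 bytes, which is the usual case.

```python
import array
import os
import random
import tempfile

VECTOR_LEN = 128  # matches shape=[128] in the example above


def write_dummy_data(directory, num_records, seed=0):
    """Write features.bin and labels.bin as raw native-endian C floats.

    Hypothetical helper: each file holds num_records * VECTOR_LEN values,
    the layout np.fromfile(f, np.float32, 128) expects when a C float is
    4 bytes wide.
    """
    rng = random.Random(seed)
    os.makedirs(directory, exist_ok=True)
    paths = []
    for name in ("features.bin", "labels.bin"):
        values = array.array(
            "f", (rng.random() for _ in range(num_records * VECTOR_LEN)))
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            values.tofile(f)  # raw bytes, no header
        paths.append(path)
    return paths


# Usage: write 50 records (an arbitrary count) into a scratch directory.
out_dir = os.path.join(tempfile.mkdtemp(), "dummy_data")
feature_path, label_path = write_dummy_data(out_dir, num_records=50)
```

Passing `"dummy_data"` as the directory instead of the scratch path would create the files where the example above looks for them.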

mrry.. 52

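This is a common use case, and most implementations use TensorFlow's queues to decouple the preprocessing code from the training code. There is a tutorial on how to use queues, but the main steps are as follows:

Define a queue, q, that will buffer the preprocessed data. TensorFlow supports the simple tf.FIFOQueue, which produces elements in the order they were enqueued, and the more advanced tf.RandomShuffleQueue, which produces elements in a random order. A queue element is a tuple of one or more tensors (which can have different types and shapes). All queues support single-element (enqueue, dequeue) and batch (enqueue_many, dequeue_many) operations, but to use the batch operations you must specify the shape of each tensor in a queue element when constructing the queue.

Build a subgraph that enqueues preprocessed elements into the queue. One way to do this is to define some tf.placeholder() ops for tensors corresponding to a single input example, then pass them to q.enqueue(). (If your preprocessing produces a batch at a time, you should use q.enqueue_many() instead.) You might also include TensorFlow ops in this subgraph.

Build a subgraph that performs training. This looks like a regular TensorFlow graph, but it gets its input by calling q.dequeue_many(BATCH_SIZE).

Start your session.

Create one or more threads that execute your preprocessing logic and then run the enqueue op, feeding in the preprocessed data. You may find the tf.train.Coordinator and tf.train.QueueRunner utility classes useful for this.

Run your training graph (optimizer, etc.) as normal.

EDIT: Here is a simple load_and_enqueue() function and code fragment to get you started: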

import threading

import numpy
import tensorflow as tf

# Features are length-100 vectors of floats
feature_input = tf.placeholder(tf.float32, shape=[100])
# Labels are scalar integers.
label_input = tf.placeholder(tf.int32, shape=[])

# Alternatively, could do:
# feature_batch_input = tf.placeholder(tf.float32, shape=[None, 100])
# label_batch_input = tf.placeholder(tf.int32, shape=[None])

q = tf.FIFOQueue(100, [tf.float32, tf.int32], shapes=[[100], []])
enqueue_op = q.enqueue([feature_input, label_input])

# For batch input, do:
# enqueue_op = q.enqueue_many([feature_batch_input, label_batch_input])

feature_batch, label_batch = q.dequeue_many(BATCH_SIZE)
# Build rest of model taking label_batch, feature_batch as input.
# [...]
train_op = ...

sess = tf.Session()

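def load_and_enqueue():
  with open(...) as feature_file, open(...) as label_file:
    while True:
      feature_array = numpy.fromfile(feature_file, numpy.float32, 100)
      # fromfile returns an empty array at end of file. `not feature_array`
      # would raise ValueError for a multi-element array, so test the size.
      if feature_array.size == 0:
        return
      # Read the label from the label file, not the feature file.
      label_value = numpy.fromfile(label_file, numpy.int32, 1)[0]

      sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                      label_input: label_value})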

# Start a thread to enqueue data asynchronously, and hide I/O latency.
t = threading.Thread(target=load_and_enqueue)
t.start()

for _ in range(TRAINING_EPOCHS):
  sess.run(train_op)
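
The steps above boil down to a bounded producer/consumer queue. As a rough, framework-free illustration of that pattern (not of the TensorFlow API itself), here is a sketch using only `queue` and `threading` from the standard library. The names `dequeue_many`, `run` and `stop_event`, and the "multiply by two" preprocessing, are stand-ins chosen for this sketch: `queue.Queue(maxsize=...)` plays the role of `tf.FIFOQueue`, and the `threading.Event` plays the role of `tf.train.Coordinator`.

```python
import queue
import threading


def load_and_enqueue(records, q, stop_event):
    """Producer: 'preprocess' each record and put it on the bounded queue."""
    for record in records:
        item = [x * 2.0 for x in record]  # stand-in for real preprocessing
        while not stop_event.is_set():
            try:
                q.put(item, timeout=0.1)  # waits while the queue is full
                break
            except queue.Full:
                continue  # re-check the stop flag, then retry
        if stop_event.is_set():
            return


def dequeue_many(q, batch_size):
    """Consumer helper: block until batch_size items have been taken."""
    return [q.get() for _ in range(batch_size)]


def run(records, batch_size, num_batches, capacity=4):
    """Consume num_batches batches while a background thread produces them.

    The caller must make sure batch_size * num_batches <= len(records),
    otherwise the consumer would wait forever for data that never arrives.
    """
    q = queue.Queue(maxsize=capacity)
    stop_event = threading.Event()
    t = threading.Thread(target=load_and_enqueue,
                         args=(records, q, stop_event))
    t.start()
    batches = [dequeue_many(q, batch_size) for _ in range(num_batches)]
    stop_event.set()  # ask the producer to stop, then wait for it
    t.join()
    return batches


batches = run([[float(i)] for i in range(10)], batch_size=2, num_batches=3)
```

The producer uses a timeout on `put` so that it can notice the stop request even while the queue is full. A producer that blocks indefinitely on a full queue cannot be joined cleanly.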

AlexP.. 7

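In other words, one thread does the data preprocessing and another does the training. Is this possible in TensorFlow?

Yes. mrry's solution works, but there is a simpler one.

Fetching data

tf.py_func wraps a Python function and uses it as a TensorFlow operator, so the data can be loaded on every sess.run() call. The problem with this approach is that the data is loaded during sess.run(), on the main thread.

A minimal example: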

def get_numpy_tensor():
  return np.array([[1,2],[3,4]], dtype=np.float32)
tensorflow_tensor = tf.py_func(get_numpy_tensor, [], tf.float32)

A more complex example:

def get_numpy_tensors():
  # Load data from the disk into numpy arrays.
  input = np.array([[1,2],[3,4]], dtype=np.float32)
  target = np.int32(1)
  return input, target
tensorflow_input, tensorflow_target = tf.py_func(get_numpy_tensors, [], [tf.float32, tf.int32])

tensorflow_input, tensorflow_target = 2*tensorflow_input, 2*tensorflow_target

sess = tf.InteractiveSession()
numpy_input, numpy_target = sess.run([tensorflow_input, tensorflow_target])
assert np.all(numpy_input==np.array([[2,4],[6,8]])) and numpy_target==2

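Prefetching data in another thread

To queue the data in another thread (so that sess.run() does not have to wait for it), we can apply tf.train.batch() to the operators produced by tf.py_func().

A minimal example: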

tensor_shape = get_numpy_tensor().shape
tensorflow_tensors = tf.train.batch([tensorflow_tensor], batch_size=32, shapes=[tensor_shape])
# Run `tf.train.start_queue_runners()` once session is created.

The shapes argument can be omitted if tensorflow_tensor already has its shape specified:

tensor_shape = get_numpy_tensor().shape
tensorflow_tensor.set_shape(tensor_shape)
tensorflow_tensors = tf.train.batch([tensorflow_tensor], batch_size=32)
# Run `tf.train.start_queue_runners()` once session is created.

A more complex example:

input_shape, target_shape = (2, 2), ()
def get_numpy_tensors():
  input = np.random.rand(*input_shape).astype(np.float32)
  target = np.random.randint(10, dtype=np.int32)
  print('f', end='')
  return input, target
tensorflow_input, tensorflow_target = tf.py_func(get_numpy_tensors, [], [tf.float32, tf.int32])
batch_size = 2
tensorflow_inputs, tensorflow_targets = tf.train.batch([tensorflow_input, tensorflow_target], batch_size, shapes=[input_shape, target_shape], capacity=2)
# `capacity=2` bounds the internal queue. The `tf.train.batch` documentation describes `capacity` as the maximum number of elements (here `[tensorflow_input, tensorflow_target]` pairs) in the queue.

tensorflow_inputs, tensorflow_targets = 2*tensorflow_inputs, 2*tensorflow_targets

sess = tf.InteractiveSession()
tf.train.start_queue_runners() # Internally, `tf.train.batch` uses a QueueRunner, so we need to ask tf to start it.
for _ in range(10):
  numpy_inputs, numpy_targets = sess.run([tensorflow_inputs, tensorflow_targets])
  assert numpy_inputs.shape==(batch_size, *input_shape) and numpy_targets.shape==(batch_size, *target_shape)
  print('r', end='')

# Prints something like `fffffrrffrfrffrffrffrffrffrffrf`; the exact interleaving depends on thread timing.

If get_numpy_tensor() returns a batch of tensors, then tf.train.batch(..., enqueue_many=True) will help.
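
To make the `enqueue_many=True` behaviour concrete: each incoming chunk is split along its first dimension into individual examples, and those examples are then regrouped into batches of `batch_size`. The pure-Python sketch below only mimics that regrouping with nested lists. The function name `rebatch` is made up for this illustration, and it is not how TensorFlow implements it.

```python
from collections import deque


def rebatch(chunks, batch_size):
    """Split chunks along their first axis and regroup into fixed-size batches.

    Each chunk is a list of examples (rows). Every row becomes one queue
    element, which is what enqueue_many does, and rows are handed out again
    in groups of batch_size. Leftover rows that never fill a batch are dropped.
    """
    buffer = deque()
    for chunk in chunks:
        buffer.extend(chunk)  # one queue element per row
        while len(buffer) >= batch_size:
            yield [buffer.popleft() for _ in range(batch_size)]


# Chunks of 3, 1 and 2 rows are regrouped into three batches of 2 rows each.
chunks = [[[1, 2], [3, 4], [5, 6]], [[7, 8]], [[9, 10], [11, 12]]]
rebatched = list(rebatch(chunks, batch_size=2))
```

Note that the chunk sizes do not have to match `batch_size`: the queue in between absorbs the difference.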
