I tried to implement a TensorFlow model with the TPUEstimator API, but it fails during training with this error:
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'CrossReplicaSum' with these attrs. Registered devices: [CPU], Registered kernels:

    [[Node: CrossReplicaSum_5 = CrossReplicaSum[T=DT_FLOAT](gradients/dense_2/BiasAdd_grad/tuple/control_dependency_1)]]
There was also a warning early on, but I'm not sure whether it's related:
WARNING:tensorflow:CrossShardOptimizer should be used within a tpu_shard_context, but got unset number_of_shards. Assuming 1.
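If I read the warning correctly, CrossShardOptimizer sums gradients across TPU cores with tf.contrib.tpu.cross_replica_sum, which emits the CrossReplicaSum op named in the error, and that op apparently has no CPU kernel. A minimal sketch of what I mean (my own assumption, not code from my project):

import tensorflow as tf

# Sketch only: cross_replica_sum creates a CrossReplicaSum node, which I believe
# is the op the InvalidArgumentError above complains about on a CPU-only machine.
x = tf.constant([1.0, 2.0, 3.0])
summed = tf.contrib.tpu.cross_replica_sum(x)

with tf.Session() as sess:
    print(sess.run(summed))  # I would expect the same "No OpKernel was registered" error here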
Here is the relevant part of the model function:
def model_fn(features, labels, mode, params):
    """A simple NN with two hidden layers of 10 nodes each."""
    input_layer = tf.feature_column.input_layer(features, params['feature_columns'])
    dense1 = tf.layers.dense(inputs=input_layer, units=10, activation=tf.nn.relu,
                             kernel_initializer=tf.glorot_uniform_initializer())
    dense2 = tf.layers.dense(inputs=dense1, units=10, activation=tf.nn.relu,
                             kernel_initializer=tf.glorot_uniform_initializer())
    logits = tf.layers.dense(inputs=dense2, units=4)

    reshaped_logits = tf.reshape(logits, [-1, 1, 4])
    onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=4)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=reshaped_logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.contrib.tpu.CrossShardOptimizer(
            tf.train.AdagradOptimizer(learning_rate=0.05))
        train_op = optimizer.minimize(
            loss=loss, global_step=tf.train.get_global_step())
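To clarify what I'm unsure about: I considered only wrapping the optimizer with CrossShardOptimizer when actually running on a TPU, along the lines of the sketch below, but I don't know whether that is the intended way to handle use_tpu=False (the 'use_tpu' entry in params is something I would have to pass in myself; it is an assumption, not part of my current code):

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdagradOptimizer(learning_rate=0.05)
        # Assumption: only add the TPU gradient aggregation when a TPU is in use;
        # 'use_tpu' is a value I would put into params myself.
        if params.get('use_tpu', False):
            optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
        train_op = optimizer.minimize(
            loss=loss, global_step=tf.train.get_global_step())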
I'm trying to run the TPUEstimator locally on CPU by setting the --use_tpu flag to False. The TPUEstimator is instantiated and train is called like this:
estimator_classifier = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    model_dir="/tmp/estimator_classifier_logs",
    config=tf.contrib.tpu.RunConfig(
        session_config=tf.ConfigProto(
            allow_soft_placement=True, log_device_placement=True),
        tpu_config=tf.contrib.tpu.TPUConfig()
    ),
    train_batch_size=DEFAULT_BATCH_SIZE,
    use_tpu=False,
    params={
        'feature_columns': feature_columns
    }
)

tensors_to_log = {"probabilities": "softmax_tensor"}
logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=50)

estimator_classifier.train(
    input_fn=data_factory.make_tpu_train_input_fn(train_x, train_y, DEFAULT_BATCH_SIZE),
    steps=DEFAULT_STEPS,
    hooks=[logging_hook]
)
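data_factory.make_tpu_train_input_fn isn't shown above. To keep the question self-contained, here is a rough stand-in for that kind of helper (a placeholder sketch, not my actual code; train_x is assumed to be a dict-like mapping of feature names to arrays):

def make_tpu_train_input_fn(train_x, train_y, batch_size):
    """Placeholder sketch of the helper, not the real data_factory implementation."""
    def input_fn(params=None):
        # TPUEstimator can pass its per-shard batch size via params['batch_size'];
        # fall back to the captured value otherwise.
        effective_batch_size = (params or {}).get('batch_size', batch_size)
        dataset = tf.data.Dataset.from_tensor_slices((dict(train_x), train_y))
        dataset = dataset.shuffle(1000).repeat()
        return dataset.batch(effective_batch_size)
    return input_fn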
What does this error mean, and how can I troubleshoot it?