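Active learning is a semi-supervised machine learning technique that lets you label less data by selecting the samples that matter most from the perspective of the learning process (the loss). It can have a huge impact on project cost when the amount of data is large and the labeling rate is high, for example in object detection and NLP-NER problems.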
# Note: this article uses the TensorFlow 1.x API (sessions and placeholders)
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

sess = tf.InteractiveSession()
# load 4000 MNIST samples for training and 400 for testing
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_full = x_train[:4000] / 255
y_full = y_train[:4000]
x_test = x_test[:400] / 255
y_test = y_test[:400]
x_full.shape, y_full.shape, x_test.shape, y_test.shape
((4000, 28, 28), (4000,), (400, 28, 28), (400,))
# build the computation graph
x = tf.placeholder(tf.float32, [None, 28, 28])
x_flat = tf.reshape(x, [-1, 28 * 28])
y_ = tf.placeholder(tf.int32, [None])
# dtype is passed by keyword: the second positional argument of tf.Variable is `trainable`
W = tf.Variable(tf.zeros([28 * 28, 10]), dtype=tf.float32)
b = tf.Variable(tf.zeros([10]), dtype=tf.float32)
y = tf.matmul(x_flat, W) + b
y_sm = tf.nn.softmax(y)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_, logits=y))
train = tf.train.AdamOptimizer(0.1).minimize(loss)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(y_, tf.cast(tf.argmax(y, 1), tf.int32)), tf.float32))
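As a framework, we use a TensorFlow computation graph that builds ten neurons, one per digit. W and b are the weights of the neurons. The softmax output y_sm gives the probability (confidence) of each digit. The loss is a typical softmax cross-entropy between the predictions and the labeled data. The optimizer is the popular Adam with a learning rate of 0.1, which is well above Adam's usual default of 0.001. Accuracy on the test dataset serves as the main metric.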
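To make the terms "confidence" and "softmax cross-entropy" concrete, here is a minimal pure-Python sketch for a single sample. It is independent of the TensorFlow graph in this article; the logits are made-up values, and the helper names `softmax` and `cross_entropy` are chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(probs[label])

# Hypothetical logits for one image over the ten digit classes.
logits = [0.0] * 10
logits[3] = 2.0

probs = softmax(logits)
confidence = max(probs)              # the largest softmax output
predicted = probs.index(confidence)  # the digit the model would pick
loss = cross_entropy(probs, label=3)
```

The higher the confidence in the correct class, the lower the loss; a uniform output over ten classes would give a confidence of only 0.1.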
def reset():
    '''Initialize data sets and session'''
    global x_labeled, y_labeled, x_unlabeled, y_unlabeled
    x_labeled = x_full[:0]
    y_labeled = y_full[:0]
    x_unlabeled = x_full
    y_unlabeled = y_full
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()

def fit():
    '''Train current labeled dataset until overfit.'''
    trial_count = 10
    acc = sess.run(accuracy, feed_dict={x:x_test, y_:y_test})
    weights = sess.run([W, b])
    while trial_count > 0:
        sess.run(train, feed_dict={x:x_labeled, y_:y_labeled})
        acc_new = sess.run(accuracy, feed_dict={x:x_test, y_:y_test})
        if acc_new <= acc:
            trial_count -= 1
        else:
            trial_count = 10
            weights = sess.run([W, b])
            acc = acc_new
    sess.run([W.assign(weights[0]), b.assign(weights[1])])
    acc = sess.run(accuracy, feed_dict={x:x_test, y_:y_test})
    print('Labels:', x_labeled.shape[0], ' Accuracy:', acc)

def label_manually(n):
    '''Human powered labeling (actually copying from the prelabeled MNIST dataset).'''
    global x_labeled, y_labeled, x_unlabeled, y_unlabeled
    x_labeled = np.concatenate([x_labeled, x_unlabeled[:n]])
    y_labeled = np.concatenate([y_labeled, y_unlabeled[:n]])
    x_unlabeled = x_unlabeled[n:]
    y_unlabeled = y_unlabeled[n:]
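- reset(): empties the labeled dataset, puts all data into the unlabeled dataset, and resets the session variables.
- fit(): runs a training that tries to reach the best accuracy. If the accuracy does not improve for ten consecutive attempts, training stops and the weights of the last best result are restored. We cannot simply train for a large number of epochs, because the model tends to overfit quickly or would need L2 regularization.
- label_manually(): emulates human data labeling. In fact, we take the labels from the MNIST dataset, which is already labeled.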
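The stopping rule that fit() uses can be sketched without TensorFlow. The version below is an illustration under assumptions, not the article's code: `fit_with_patience` and its callback parameters are hypothetical names, and the "model" is a toy whose accuracy follows a scripted curve.

```python
def fit_with_patience(train_step, evaluate, get_weights, set_weights, patience=10):
    """Train until `patience` consecutive steps fail to improve the score,
    then restore the best weights seen so far."""
    best_score = evaluate()
    best_weights = get_weights()
    remaining = patience
    while remaining > 0:
        train_step()
        score = evaluate()
        if score <= best_score:
            remaining -= 1            # no improvement: use up one attempt
        else:
            remaining = patience      # improvement: reset the counter
            best_score = score
            best_weights = get_weights()
    set_weights(best_weights)
    return best_score

# Toy stand-in for a model: accuracy peaks and then degrades (overfitting).
curve = [0.10, 0.30, 0.50, 0.45, 0.40, 0.35, 0.30]
state = {"step": 0}

def train_step():
    state["step"] = min(state["step"] + 1, len(curve) - 1)

best = fit_with_patience(
    train_step=train_step,
    evaluate=lambda: curve[state["step"]],
    get_weights=lambda: state["step"],
    set_weights=lambda w: state.update(step=w),
    patience=3,
)
```

With a patience of 3, the loop stops three steps after the peak and rolls the toy model back to the step where accuracy was 0.50.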
Ground Truth
# train on the full dataset of 4000
reset()
label_manually(4000)
fit()
Labels: 4000 Accuracy: 0.9225
# apply clustering
kmeans = tf.contrib.factorization.KMeansClustering(10, use_mini_batch=False)
kmeans.train(lambda: tf.train.limit_epochs(x_full.reshape(4000, 784).astype(np.float32), 10))
centers = kmeans.cluster_centers().reshape([10, 28, 28])
plt.imshow(np.concatenate([centers[i] for i in range(10)], axis=1))
# try to run on random 400
reset()
label_manually(400)
fit()
Labels: 400 Accuracy: 0.8375
Let's try labeling only 10% of the data (400 samples): we get 83.75% accuracy, far below the 92.25% of the ground truth.
# now try to run on 10
reset()
label_manually(10)
fit()
Labels: 10 Accuracy: 0.38
# pass the remaining 3990 unlabeled samples through the early model
res = sess.run(y_sm, feed_dict={x:x_unlabeled})
# find the least confident samples
pmax = np.amax(res, axis=1)
pidx = np.argsort(pmax)
# sort the unlabeled corpus by confidence (ascending)
x_unlabeled = x_unlabeled[pidx]
y_unlabeled = y_unlabeled[pidx]
plt.plot(pmax[pidx])
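Now we will label the same 10% of the data (400 samples) using active learning. To do that, we take one batch of 10 samples and train a very primitive model. Then we pass the rest of the data (3990 samples) through this model and evaluate the maximum softmax output. This shows the probability that the selected class is the correct answer (in other words, the confidence of the neural network). After sorting, the plot shows that the confidence ranges from 20% to 100%. The idea is to select the next batch for labeling precisely from the least confident samples.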
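The selection step described above can be illustrated in a few lines of plain Python. This is a minimal sketch, not the article's TensorFlow code: `least_confident` is a hypothetical helper, and the probabilities are made-up softmax outputs for four samples over three classes.

```python
def least_confident(probabilities, batch_size):
    """Return the indices of the `batch_size` samples whose top-class
    probability (confidence) is lowest."""
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return order[:batch_size]

# Hypothetical softmax outputs for four unlabeled samples.
probs = [
    [0.90, 0.05, 0.05],  # very confident
    [0.40, 0.35, 0.25],  # unsure
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],  # almost uniform: the least confident
]

to_label = least_confident(probs, batch_size=2)
```

Here `to_label` holds the indices of the two samples the model is least sure about; these are the ones that would be sent to a human for labeling next.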
# do the same in a loop for 400 samples
for i in range(39):
    label_manually(10)
    fit()
    res = sess.run(y_sm, feed_dict={x:x_unlabeled})
    pmax = np.amax(res, axis=1)
    pidx = np.argsort(pmax)
    x_unlabeled = x_unlabeled[pidx]
    y_unlabeled = y_unlabeled[pidx]
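# pass the remaining unlabeled data through the model and try to autolabel it
res = sess.run(y_sm, feed_dict={x:x_unlabeled})
y_autolabeled = res.argmax(axis=1)
# Assumption: keep the 400 hand-labeled samples so they can be restored after this
# experiment (the "Labels: 410" output further down implies that the later steps
# start again from 400 labels)
x_manual, y_manual = x_labeled, y_labeled
x_labeled = np.concatenate([x_labeled, x_unlabeled])
y_labeled = np.concatenate([y_labeled, y_autolabeled])
# train on 400 samples labeled by active learning and 3600 stochastically autolabeled samples
fit()
x_labeled, y_labeled = x_manual, y_manual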
# pass the remaining 3600 unlabeled samples through the model for automatic labeling
# and show the most confident samples
res = sess.run(y_sm, feed_dict={x:x_unlabeled})
y_autolabeled = res.argmax(axis=1)
pmax = np.amax(res, axis=1)
pidx = np.argsort(pmax)
# sort by confidence (ascending)
x_unlabeled = x_unlabeled[pidx]
y_autolabeled = y_autolabeled[pidx]
plt.plot(pmax[pidx])
# automatically label the 10 most confident samples and train on them
x_labeled = np.concatenate([x_labeled, x_unlabeled[-10:]])
y_labeled = np.concatenate([y_labeled, y_autolabeled[-10:]])
x_unlabeled = x_unlabeled[:-10]
fit()
Labels: 410 Accuracy: 0.8975
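Here we run the rest of the unlabeled data through the model evaluation, and we can still see that the confidence differs across the remaining samples. So the idea is to take a batch of the ten most confident samples and train the model on them.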
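As a framework-free illustration of this automatic labeling step, the sketch below picks the most confident samples and uses the predicted class as their label. `pseudo_label` is a hypothetical helper name and the probabilities are made-up values; the article's own code does the same thing with NumPy arrays.

```python
def pseudo_label(probabilities, batch_size):
    """Pick the `batch_size` most confident samples and label each one
    with the class the model predicts for it."""
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(confidence)),
                   key=lambda i: confidence[i], reverse=True)
    chosen = order[:batch_size]
    labels = [probabilities[i].index(confidence[i]) for i in chosen]
    return chosen, labels

# Hypothetical softmax outputs for four unlabeled samples over three classes.
probs = [
    [0.05, 0.90, 0.05],  # confident: class 1
    [0.40, 0.35, 0.25],
    [0.10, 0.30, 0.60],  # fairly confident: class 2
    [0.34, 0.33, 0.33],
]

chosen, labels = pseudo_label(probs, batch_size=2)
```

Note that these labels come from the model, not from a human, so any mistake the model makes on a confident sample is fed back into training.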
# run the rest of the unlabeled samples, starting from the most confident
for i in range(359):
    res = sess.run(y_sm, feed_dict={x:x_unlabeled})
    y_autolabeled = res.argmax(axis=1)
    pmax = np.amax(res, axis=1)
    pidx = np.argsort(pmax)
    x_unlabeled = x_unlabeled[pidx]
    y_autolabeled = y_autolabeled[pidx]
    x_labeled = np.concatenate([x_labeled, x_unlabeled[-10:]])
    y_labeled = np.concatenate([y_labeled, y_autolabeled[-10:]])
    x_unlabeled = x_unlabeled[:-10]
    fit()