Spark K-fold Cross Validation

Tags: apache-spark-mllib, classification, cross-validation, machine-learning

I’m having some trouble understanding Spark’s cross-validation. Every example I have seen uses it for parameter tuning, but I assumed it could also do plain k-fold cross-validation?

What I want to do is perform k-fold cross-validation with k = 5. I want to get the accuracy for each fold and then the average accuracy.
In scikit-learn this is how it would be done, where scores gives you the result for each fold, and then you can use scores.mean():

from sklearn.model_selection import cross_val_score

# sklearn expects (estimator, X, y); X = features, y = labels
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')

This is how I am doing it in Spark; the ParamGridBuilder is empty because I don’t want to tune any parameters:

// Empty grid: no hyperparameters are searched, so this is plain k-fold CV
val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
evaluator.setLabelCol("label")
evaluator.setPredictionCol("prediction")
// Note: the "precision" metric was removed in Spark 2.0; use "accuracy" there
evaluator.setMetricName("precision")

val crossval = new CrossValidator()
crossval.setEstimator(classifier)
crossval.setEvaluator(evaluator)
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(5)

val modelCV = crossval.fit(df4)
// One entry per ParamMap; with an empty grid this is a single value,
// the metric averaged over the 5 folds
val chk = modelCV.avgMetrics

Is this doing the same thing as the scikit-learn implementation? And why do the examples split off training/test data when doing cross-validation?

Also, how would I cross-validate a RandomForest model? I have been looking at this example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala

Best Answer

  1. What you're doing looks fine.
  2. Basically, yes, it works the same as sklearn's grid-search CV.
    For each entry in estimatorParamMaps (one set of parameters), the algorithm is cross-validated, so avgMetrics holds the cross-validation metric averaged over all folds, one value per parameter set. If you build an empty ParamGridBuilder (no parameter search), you get "regular" k-fold cross-validation, and avgMetrics will contain a single cross-validated accuracy value.
  3. Each CV iteration trains on K-1 folds and tests on the remaining fold. So why do most examples split the data into training/test sets before doing cross-validation? Because the test folds inside the CV are used for the parameter grid search, i.e. for model selection. An additional held-out "test dataset" is therefore needed to give an unbiased evaluation of the final, selected model. Read more here. A sketch putting this together follows below.
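To make point 3 concrete, here is a minimal sketch of 5-fold cross-validating a RandomForestClassifier in Spark 2.x with an empty parameter grid, plus a held-out test set for the final evaluation. The DataFrame data, the column names "features" and "label", and the 80/20 split are assumptions for illustration, not from the original post.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Assumed: `data` is a DataFrame with "features" (Vector) and "label" (Double) columns.
// Hold out a test set up front: the CV's internal test folds are consumed by
// model selection, so they cannot also provide the final, unbiased evaluation.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy") // Spark 2.0+; earlier versions used "precision"

// Empty grid => no hyperparameter search, just plain 5-fold CV
val paramGrid = new ParamGridBuilder().build()

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(training)

// avgMetrics has one entry per ParamMap; with an empty grid there is exactly one:
// the accuracy averaged over the 5 folds (the analogue of sklearn's scores.mean())
println(s"Mean 5-fold CV accuracy: ${cvModel.avgMetrics.head}")

// Unbiased estimate on the held-out test set, using the best (here: only) model
val testAccuracy = evaluator.evaluate(cvModel.transform(test))
println(s"Held-out test accuracy: $testAccuracy")

One caveat: unlike sklearn's cross_val_score, Spark's CrossValidator only exposes the fold-averaged metric via avgMetrics, not the individual per-fold scores.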