R – predict() returns nothing for type = "class", works fine with type = "raw"

Training data is read in from two files: one with the independent variables only (df.train) and one with the corresponding actual class values only (df.churn). These values are -1 and 1 only. I then remove all-NA columns and any duplicate columns that are found.

I assemble the two sets of data into a single dataframe with independent and class values, and run naiveBayes() without any errors.

Using the model produced by naiveBayes(), I run predict() and note that the output with type = "raw" looks like reasonable data: in most cases the probabilities are relatively close to 0 or 1. I show the first 6 rows below.

I'm looking for the actual predicted class values for input into prediction() with a view to getting an ROC plot and an AUC value. I run predict() again with type = "class", and this is where I get basically nothing at all.

    library(e1071)  # provides naiveBayes()

    df.train <- read.csv('~/projects/kdd_analysis/data/train_table.csv', header=TRUE, sep=',')
    df.churn <- read.csv('~/projects/kdd_analysis/data/sm_churn_labels.csv', header=TRUE, sep=',')
    # drop columns that are entirely NA
    df.train <- df.train[,colSums(is.na(df.train))<nrow(df.train)]
    # drop duplicated columns
    df.train <- df.train[!duplicated(lapply(df.train,c))]
    df.train_C <- cbind(df.train, df.churn)
    mod_C <- naiveBayes(V1~., df.train_C, laplace=0.01)
    pre_C <- predict(mod_C, df.train, type="raw", threshold=0.001)

I'm running predict() against the training data intentionally because I thought that would be interesting. Below, the values out of predict() seem 'reasonable' to me; that is, they at least don't seem like complete nonsense. I have not compared them to the actuals yet, and would expect to use the explicit class values given by predict() to do that.

    head(pre_C)
           -1            1
    [1,] 9.996934e-01 3.066321e-04
    [2,] 9.005501e-07 9.999991e-01
    [3,] 1.000000e+00 3.468739e-11
    [4,] 9.362914e-01 6.370858e-02
    [5,] 9.854649e-01 1.453510e-02
    [6,] 9.997680e-01 2.320003e-04

So, this is predict() run again against the identical model. I don't understand how it's possible for it to return nothing:

    > pre_C <- predict(mod_C, df.train, type="class", threshold=0.001)
    > pre_C
    factor(0)
    Levels:

Best Answer

The solution is to coerce the column of class variables to type factor:

    df.train_C$V1 <- factor(df.train_C$V1)

Then run the model and predict() as before. I changed nothing else, and this single modification fixed the issue. Courtesy of Andy Liaw on r-help.
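For anyone who wants to try the fix without the original files, here is a minimal sketch on made-up data. The data frame `toy` and its columns `x1` and `x2` are hypothetical stand-ins for the real training table; only `naiveBayes()`, `predict()` and the `V1` label column come from the question. It assumes the e1071 package is installed, and whether an unconverted numeric `V1` reproduces the empty result may depend on the e1071 version.

```r
library(e1071)  # provides naiveBayes()

set.seed(42)

# Hypothetical toy data: two numeric predictors and -1/1 labels,
# stored as numbers the way read.csv() would deliver them.
toy <- data.frame(
  x1 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  x2 = c(rnorm(20, mean = 1), rnorm(20, mean = -1)),
  V1 = rep(c(-1, 1), each = 20)
)

# The fix from the answer: make the response a factor before fitting.
toy$V1 <- factor(toy$V1)

mod <- naiveBayes(V1 ~ ., data = toy)

# Predict on the predictors only, as in the question.
newdata <- toy[, c("x1", "x2")]
pred_class <- predict(mod, newdata, type = "class")
pred_raw   <- predict(mod, newdata, type = "raw")

# Cross-tabulate predicted against actual classes.
print(table(predicted = pred_class, actual = toy$V1))
```

With the response stored as a factor, `pred_class` should hold one predicted label per row rather than `factor(0)`. If the end goal is an ROC curve, note that ROCR's `prediction()` is normally given continuous scores, so the class-1 column of the type = "raw" output may be the more useful input than the hard class labels.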
