MATLAB – Naive Bayes classifier and discriminant analysis accuracy is way off

bayesian, cluster-analysis, MATLAB, naive-bayes, statistics

So I have two methods of classification: discriminant analysis with the 'diaglinear' option (naive Bayes) and the pure naive Bayes classifier implemented in MATLAB. There are 23 classes in the entire dataset. The first method, discriminant analysis:

%% Classify using discriminant analysis ('diaglinear', i.e. naive Bayes)
training_data = Testdata; 
target_class = TestDataLabels;

%# classify the unseen 10% against the training set
[class, err] = classify(UnseenTestdata, training_data, target_class, 'diaglinear');

cmat1 = confusionmat(UnseenTestDataLabels, class);
acc1 = 100*sum(diag(cmat1))./sum(cmat1(:));
fprintf('Classifier1:\naccuracy = %.2f%%\n', acc1);
fprintf('Confusion Matrix:\n'), disp(cmat1)

This yields an accuracy of 81.49% from the confusion matrix, with an error rate (err) of 0.5040 (which I am not sure how to interpret).

The second method, the naive Bayes classifier:

%% Classify using the naive Bayes classifier
training_data = Testdata; 
target_class = TestDataLabels;
%# train model
nb = NaiveBayes.fit(training_data, target_class, 'Distribution', 'mn');

%# prediction
class1 = nb.predict(UnseenTestdata); 

%# performance
cmat1 = confusionmat(UnseenTestDataLabels, class1);
acc1 = 100*sum(diag(cmat1))./sum(cmat1(:));
fprintf('Classifier1:\naccuracy = %.2f%%\n', acc1);
fprintf('Confusion Matrix:\n'), disp(cmat1)

This yields an accuracy of 81.89%.

I have only done one round of cross-validation. I'm new to MATLAB and to supervised/unsupervised algorithms, so I did the cross-validation myself: I basically take a random 10% of the data and keep it aside for testing purposes, and it is a different random set each time. I could go through it several times and take the average accuracy, but the results will do for explanation purposes.
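
Roughly, the hold-out split looks like the sketch below (AllData and AllLabels are just placeholder names for the full dataset and its labels, not variables from my actual script):

%# sketch of the random 90/10 hold-out split; AllData/AllLabels are
%# placeholder names for the full dataset and its class labels
c = cvpartition(AllLabels, 'HoldOut', 0.1);     %# stratified 10% hold-out

Testdata             = AllData(training(c), :); %# 90% kept for training
TestDataLabels       = AllLabels(training(c));
UnseenTestdata       = AllData(test(c), :);     %# 10% held aside as unseen
UnseenTestDataLabels = AllLabels(test(c));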

So, to my question.

In my literature review of current methods, a lot of researchers are finding that a single classification algorithm combined with a clustering algorithm yields better accuracy. They do this by finding the optimal number of clusters for their data and then running each of the resulting clusters (whose members should be more alike than not) through a classification algorithm; that is, a process that uses the best parts of an unsupervised algorithm in conjunction with a supervised classification algorithm.
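
For the "optimal number of clusters" part, the sort of thing I mean is a silhouette-based search; the snippet below is only an illustration using evalclusters, not the exact procedure from those papers (X stands for whatever data is being clustered):

%# illustration only: choose k for k-means by the silhouette criterion
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:15);
bestK = eva.OptimalK;   %# number of clusters suggested by the criterion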

Now I'm using a dataset that has been used numerous times in the literature, and I'm attempting an approach not so dissimilar to others in my quest.

I first use simple k-means clustering, which surprisingly clusters my data quite well. The output looks like this:

[image: k-means clustering output]

Looking at the class labels in each cluster (K1, K2, …, K12):

%% output the class labels of each cluster
K1 = UnseenTestDataLabels(indX(clustIDX==1),:)

I find that 9 of the clusters predominantly contain a single class label, while 3 clusters contain multiple class labels, showing that k-means fits the data well.
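
The purity check itself is roughly the sketch below (grp2idx is used because the labels are strings; I'm ignoring the indX re-indexing from the snippet above):

%# sketch of the per-cluster purity check: fraction of the dominant
%# class label inside each cluster
[labelIdx, labelNames] = grp2idx(UnseenTestDataLabels);
for k = 1:12
    lk = labelIdx(clustIDX == k);                   %# labels falling in cluster k
    purity = 100 * sum(lk == mode(lk)) / numel(lk);
    fprintf('K%d: dominant class %s (%.1f%% of cluster)\n', ...
            k, labelNames{mode(lk)}, purity);
end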

The problem, however, is once I have the data for each cluster (cluster1, cluster2, …, cluster12):

%% output the real data of each cluster
cluster1 = UnseenTestdata(clustIDX==1,:)

And I put each cluster through the naive Bayes or discriminant analysis like so:

class1  = classify(cluster1, training_data, target_class, 'diaglinear');
class2  = classify(cluster2, training_data, target_class, 'diaglinear');
class3  = classify(cluster3, training_data, target_class, 'diaglinear');
class4  = classify(cluster4, training_data, target_class, 'diaglinear');
class5  = classify(cluster5, training_data, target_class, 'diaglinear');
class6  = classify(cluster6, training_data, target_class, 'diaglinear');
class7  = classify(cluster7, training_data, target_class, 'diaglinear');
class8  = classify(cluster8, training_data, target_class, 'diaglinear');
class9  = classify(cluster9, training_data, target_class, 'diaglinear');
class10  = classify(cluster10, training_data, target_class, 'diaglinear'); 
class11  = classify(cluster11, training_data, target_class, 'diaglinear');
class12  = classify(cluster12, training_data, target_class, 'diaglinear');
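
(The twelve calls above can also be written as one loop; a sketch, with clustIDX being the k-means assignment of each unseen observation as in the snippets above, and again ignoring the indX re-indexing of the labels:)

%# the same per-cluster classification written as a loop (sketch)
for k = 1:12
    idx = (clustIDX == k);
    predicted = classify(UnseenTestdata(idx,:), training_data, ...
                         target_class, 'diaglinear');
    cm  = confusionmat(UnseenTestDataLabels(idx), predicted);
    acc = 100 * sum(diag(cm)) / sum(cm(:));
    fprintf('Cluster %d: accuracy = %.2f%%\n', k, acc);
end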

The accuracy becomes horrifying: 50% of the clusters are classified with 0% accuracy. Each classified cluster (acc1, acc2, …, acc12) has its own corresponding confusion matrix, and you can see the accuracy of each cluster here:

[image: accuracy of each classified cluster]

So my problem/question is: where am I going wrong? I first thought maybe I had the data/labels mixed up for the clusters, but what I posted above looks correct; I can't see an issue with it.

Why does the exact same unseen 10% of data that was used in the first experiment yield such strange results once it is clustered? I mean, it should be noted that NB is a stable classifier that shouldn't overfit easily, and seeing as the training data is vast while each cluster to be classified is comparatively small, overfitting shouldn't happen?

EDIT:

As requested in the comments, I have included the cmat for the first test example, which gives an accuracy of 81.49% and an err of 0.5040:

[image: confusion matrix for the first example (81.49% accuracy)]

Also requested was a snippet of K, class and the related cmat; in this example (cluster4) the accuracy is 3.03%:

[image: K, class and cmat for cluster4 (3.03% accuracy)]

Seeing as there was a large number of classes (23 in total), I decided to reduce them as outlined in the 1999 KDD Cup; this just applies a bit of domain knowledge, as some of the attacks are more alike than others and come under one umbrella term.
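
The reduction itself is just a lookup from each of the 23 labels to one of the five umbrella categories; an illustrative sketch (only a few of the mappings are shown, and the trailing dots match how the labels appear in my data):

%# illustrative sketch of collapsing the 23 labels into the five
%# KDD Cup '99 umbrella categories (only a few mappings shown)
map = containers.Map();
map('smurf.')           = 'DoS';
map('neptune.')         = 'DoS';
map('ipsweep.')         = 'Probe';
map('satan.')           = 'Probe';
map('guess_passwd.')    = 'R2L';
map('buffer_overflow.') = 'U2R';
map('normal.')          = 'normal.';
%# ... remaining attack types mapped the same way ...

reducedLabels = cellfun(@(s) map(s), TestDataLabels, 'UniformOutput', false);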

I then trained the classifier with 444,000 records, holding back 10% for testing purposes.

The accuracy was worse, 73.39%, and the error rate was also worse, 0.4261.

[image: confusion matrix for the reduced-class experiment]

The unseen data broken down into its classes:

DoS: 39149
Probe: 405
R2L: 121
U2R: 6
normal.: 9721

The predicted labels (the outcome of the discriminant analysis):

DoS: 28135
Probe: 10776
R2L: 1102
U2R: 1140
normal.: 8249

The training data is made up of:

DoS: 352452
Probe: 3717
R2L: 1006
U2R: 49
normal.: 87395

I fear that if I reduce the training data to have a similar percentage of malicious activity, the classifier won't have enough predictive power to distinguish between classes; however, looking at some other literature, I have noticed that some researchers remove U2R, as there isn't enough data for successful classification.
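
One way of seeing how the imbalance shows up is the per-class recall from the confusion matrix; a sketch, where the class order is an assumption (confusionmat decides the real row order):

%# sketch: per-class recall (detection rate) from the confusion matrix,
%# assuming rows of cmat correspond to the true classes in this order
recall  = diag(cmat) ./ sum(cmat, 2);
classes = {'DoS', 'Probe', 'R2L', 'U2R', 'normal.'};
for i = 1:numel(classes)
    fprintf('%-8s recall = %.2f%%\n', classes{i}, 100 * recall(i));
end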

Methods I have tried so far are one-class classifiers, where I train the classifier to predict only one class (not effective); classifying individual clusters (worse accuracy yet); reducing the class labels (second best); and keeping the full 23 class labels (best accuracy).

Best Answer

As others have correctly pointed out, at least one problem here is on these lines:

class1  = classify(cluster1, training_data, target_class, 'diaglinear');
...

You are training the classifier using all the training_data, but evaluating it on only the sub-clusters. For clustering the data to have any effect, you need to train a different classifier within each of the sub-clusters. Sometimes this can be very difficult - e.g., there may be very few (or no!) examples in cluster C from class Y. That's inherent to trying to do joint clustering and learning.

The general framework for your problem is as follows:

Training data:
   Cluster into C clusters
   Within each cluster, develop a classifier

Testing data:
   Assign observation into one of the C clusters (either "hard", or "soft")
   Run the correct classifier (corresponding to that cluster)

This

class1  = classify(cluster1, training_data, target_class, 'diaglinear');

does not do that.
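
For concreteness, here is a minimal sketch of that framework in MATLAB. Traindata/TrainLabels are placeholder names for your training set, the test observations are assigned to clusters with knnsearch against the k-means centroids, and classify may well error out inside a cluster that contains too few examples of some class, which is exactly the difficulty mentioned above:

% Sketch of the cluster-then-classify framework (placeholder names;
% assumes the class labels are strings stored in a cell array).
C = 12;                                            % number of clusters
[trainClust, centroids] = kmeans(Traindata, C);    % cluster the TRAINING data
testClust = knnsearch(centroids, UnseenTestdata);  % "hard" assignment of each
                                                   % test row to its nearest centroid

predicted = cell(size(UnseenTestdata, 1), 1);
for k = 1:C
    tr = (trainClust == k);
    te = (testClust == k);
    if ~any(te), continue; end
    % a separate classifier trained only on the training data that fell
    % into cluster k, applied only to test data assigned to cluster k
    predicted(te) = classify(UnseenTestdata(te,:), ...
                             Traindata(tr,:), TrainLabels(tr), 'diaglinear');
end

cmat = confusionmat(UnseenTestDataLabels, predicted);
acc  = 100 * sum(diag(cmat)) / sum(cmat(:));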
