Python – scikit learn output metrics.classification_report into CSV/tab-delimited format

classificationcsvpythonscikit-learntext;

I'm doing a multiclass text classification in Scikit-Learn. The dataset is being trained using the Multinomial Naive Bayes classifier having hundreds of labels. Here's an extract from the Scikit Learn script for fitting the MNB model

from __future__ import print_function

# Read **`file.csv`** into a pandas DataFrame

import pandas as pd
path = 'data/file.csv'
merged = pd.read_csv(path, error_bad_lines=False, low_memory=False)

# define X and y using the original DataFrame
X = merged.text
y = merged.grid

# split X and y into training and testing sets;
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)

# make class predictions
y_pred_class = nb.predict(X_test_dtm)

# generate classification report
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred_class))

And a simplified output of the metrics.classification_report on command line screen looks like this:

             precision  recall   f1-score   support
     12       0.84      0.48      0.61      2843
     13       0.00      0.00      0.00        69
     15       1.00      0.19      0.32       232
     16       0.75      0.02      0.05       965
     33       1.00      0.04      0.07       155
      4       0.59      0.34      0.43      5600
     41       0.63      0.49      0.55      6218
     42       0.00      0.00      0.00       102
     49       0.00      0.00      0.00        11
      5       0.90      0.06      0.12      2010
     50       0.00      0.00      0.00         5
     51       0.96      0.07      0.13      1267
     58       1.00      0.01      0.02       180
     59       0.37      0.80      0.51      8127
      7       0.91      0.05      0.10       579
      8       0.50      0.56      0.53      7555      
    avg/total 0.59      0.48      0.45     35919

I was wondering if there was any way to get the report output into a standard csv file with regular column headers

When I send the command line output into a csv file or try to copy/paste the screen output into a spreadsheet – Openoffice Calc or Excel, It lumps the results in one column. Looking like this:

Best Answer

As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas Dataframe is by simply having the report returned as a dict:

report = classification_report(y_test, y_pred, output_dict=True)

and then construct a Dataframe and transpose it:

df = pandas.DataFrame(report).transpose()

From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).

See the documentation.

Server side

If you want something easy to re-use or automate, you can use Postgresql's built in COPY command. e.g.

Copy (Select * From foo) To '/tmp/test.csv' With CSV DELIMITER ',' HEADER;

This approach runs entirely on the remote server - it can't write to your local PC. It also needs to be run as a Postgres "superuser" (normally called "root") because Postgres can't stop it doing nasty things with that machine's local filesystem.

That doesn't actually mean you have to be connected as a superuser (automating that would be a security risk of a different kind), because you can use the SECURITY DEFINER option to CREATE FUNCTION to make a function which runs as though you were a superuser.

The crucial part is that your function is there to perform additional checks, not just by-pass the security - so you could write a function which exports the exact data you need, or you could write something which can accept various options as long as they meet a strict whitelist. You need to check two things:

Which files should the user be allowed to read/write on disk? This might be a particular directory, for instance, and the filename might have to have a suitable prefix or extension.
Which tables should the user be able to read/write in the database? This would normally be defined by GRANTs in the database, but the function is now running as a superuser, so tables which would normally be "out of bounds" will be fully accessible. You probably don’t want to let someone invoke your function and add rows on the end of your “users” table…

I've written a blog post expanding on this approach, including some examples of functions that export (or import) files and tables meeting strict conditions.

Client side

The other approach is to do the file handling on the client side, i.e. in your application or script. The Postgres server doesn't need to know what file you're copying to, it just spits out the data and the client puts it somewhere.

The underlying syntax for this is the COPY TO STDOUT command, and graphical tools like pgAdmin will wrap it for you in a nice dialog.

The psql command-line client has a special "meta-command" called \copy, which takes all the same options as the "real" COPY, but is run inside the client:

\copy (Select * From foo) To '/tmp/test.csv' With CSV

Note that there is no terminating ;, because meta-commands are terminated by newline, unlike SQL commands.

From the docs:

Do not confuse COPY with the psql instruction \copy. \copy invokes COPY FROM STDIN or COPY TO STDOUT, and then fetches/stores the data in a file accessible to the psql client. Thus, file accessibility and access rights depend on the client rather than the server when \copy is used.

Your application programming language may also have support for pushing or fetching the data, but you cannot generally use COPY FROM STDIN/TO STDOUT within a standard SQL statement, because there is no way of connecting the input/output stream. PHP's PostgreSQL handler (not PDO) includes very basic pg_copy_from and pg_copy_to functions which copy to/from a PHP array, which may not be efficient for large data sets.

Best Answer

Related Solutions

Mysql – How to output MySQL query results in CSV format

Sql – Save PL/pgSQL output from PostgreSQL to a CSV file

Server side

Client side

Related Topic