C# – Parsing Office Documents

asp.netcms-office

I`d like to be able to read the content of office documents (for a custom crawler).

The office version that need to be readable are from 2000 to 2007. I mainly want to be crawling words, excel and powerpoint documents.

I don`t want to retrieve the formatting, only the text in it.

The crawler is based on lucene.NET if that can be of some help and is in c#.

I already used iTextSharp for parsing PDF

Best Answer

Here's a nice little post on c-charpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop assemblies.

Basically, you get the "WholeStory" property out of the Word document, paste it to the clipboard, then pull it from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.

For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.

For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.

Related Solutions

C# – How to create an Excel (.XLS and .XLSX) file in C# without installing Microsoft Office

You can use a library called ExcelLibrary. It's a free, open source library posted on Google Code:

ExcelLibrary

This looks to be a port of the PHP ExcelWriter that you mentioned above. It will not write to the new .xlsx format yet, but they are working on adding that functionality in.

It's very simple, small and easy to use. Plus it has a DataSetHelper that lets you use DataSets and DataTables to easily work with Excel data.

ExcelLibrary seems to still only work for the older Excel format (.xls files), but may be adding support in the future for newer 2007/2010 formats.

You can also use EPPlus, which works only for Excel 2007/2010 format files (.xlsx files). There's also NPOI which works with both.

There are a few known bugs with each library as noted in the comments. In all, EPPlus seems to be the best choice as time goes on. It seems to be more actively updated and documented as well.

Also, as noted by @АртёмЦарионов below, EPPlus has support for Pivot Tables and ExcelLibrary may have some support (Pivot table issue in ExcelLibrary)

Here are a couple links for quick reference:
ExcelLibrary - GNU Lesser GPL
EPPlus - GNU (LGPL) - No longer maintained
EPPlus 5 - Polyform Noncommercial - Starting May 2020
NPOI - Apache License

Here some example code for ExcelLibrary:

Here is an example taking data from a database and creating a workbook from it. Note that the ExcelLibrary code is the single line at the bottom:

//Create the data set and table
DataSet ds = new DataSet("New_DataSet");
DataTable dt = new DataTable("New_DataTable");

//Set the locale for each
ds.Locale = System.Threading.Thread.CurrentThread.CurrentCulture;
dt.Locale = System.Threading.Thread.CurrentThread.CurrentCulture;

//Open a DB connection (in this example with OleDB)
OleDbConnection con = new OleDbConnection(dbConnectionString);
con.Open();

//Create a query and fill the data table with the data from the DB
string sql = "SELECT Whatever FROM MyDBTable;";
OleDbCommand cmd = new OleDbCommand(sql, con);
OleDbDataAdapter adptr = new OleDbDataAdapter();

adptr.SelectCommand = cmd;
adptr.Fill(dt);
con.Close();

//Add the table to the data set
ds.Tables.Add(dt);

//Here's the easy part. Create the Excel worksheet from the data set
ExcelLibrary.DataSetHelper.CreateWorkbook("MyExcelFile.xls", ds);

Creating the Excel file is as easy as that. You can also manually create Excel files, but the above functionality is what really impressed me.

Alternative to Office Interop for document generation

I just posted this as an answer to another question about automating Office, but I think it's a suitable response to this question too (especially since you are looking for a free or low cost solution).

I've had no end of problems (poor performance, hanging processes, crashing processes etc) using Microsoft Excel, Word and PowerPoint through interop in a web service to print Office documents to PDF format. I too have faced problems that I suspect are because of invisible dialog boxes (maybe a file is corrupt, read-only recommended has been set, file is password protected, or whatever).

I know there are tools available that don't use Office, but they are very expensive. My solution was to switch to automating OpenOffice. OpenOffice seems to be much more stable, and I've left hanging processes and the like behind.

So, while I suppose I am saying "don't automate Microsoft Office", I'm not suggesting that you abandon automation altogether; just that I've had much more success automating OpenOffice than Microsoft Office.

Best Answer

Related Solutions

C# – How to create an Excel (.XLS and .XLSX) file in C# without installing Microsoft Office

Alternative to Office Interop for document generation

Related Topic