C# – Parsing Office Documents

asp.netcms-office

I'd like to be able to read the content of Office documents (for a custom crawler).

The Office versions that need to be readable range from 2000 to 2007. I mainly want to crawl Word, Excel and PowerPoint documents.

I don't want to retrieve the formatting, only the text.

The crawler is based on Lucene.NET, if that helps, and is written in C#.

I already use iTextSharp for parsing PDFs.

Best Answer

Here's a nice little post on c-sharpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop Assemblies.

Basically, you call the "WholeStory" method on the document's selection to select everything, copy that selection to the clipboard, then pull it back from the clipboard as plain text. The clipboard step is presumably done to strip out formatting.
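The clipboard technique described here can be sketched roughly as follows. This is not the code from the linked post, only a minimal illustration under some assumptions: the project references Microsoft.Office.Interop.Word and System.Windows.Forms, Word is installed on the machine, the compiler is C# 4 or later (so the optional COM arguments can be omitted), and the caller runs on an STA thread. The class and method names are made up for the example.

```csharp
using System;
using System.Windows.Forms;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper; the name is illustrative only.
static class WordTextExtractor
{
    // Assumption: called from an STA thread (e.g. Main marked [STAThread]),
    // because the Windows Forms clipboard requires it.
    public static string ExtractText(string path)
    {
        var app = new Word.Application();
        try
        {
            // Open read-only so the crawler never modifies the file.
            Word.Document doc = app.Documents.Open(FileName: path, ReadOnly: true);
            try
            {
                // Select the whole document and copy it to the clipboard.
                doc.ActiveWindow.Selection.WholeStory();
                doc.ActiveWindow.Selection.Copy();

                // Asking the clipboard for plain text drops the formatting.
                return Clipboard.ContainsText() ? Clipboard.GetText() : string.Empty;
            }
            finally
            {
                // Cast to the underlying interface to avoid the
                // method/event name ambiguity on Close.
                ((Word._Document)doc).Close(SaveChanges: false);
            }
        }
        finally
        {
            ((Word._Application)app).Quit(SaveChanges: false);
        }
    }
}
```

Note that this overwrites whatever the user had on the clipboard. If that matters, reading `doc.Content.Text` also returns unformatted text and avoids the clipboard entirely.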

For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
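A minimal sketch of that slide and shape loop, assuming references to Microsoft.Office.Interop.PowerPoint and the shared Office library (Microsoft.Office.Core), and PowerPoint installed locally. The helper name is hypothetical. The checks on `HasTextFrame` and `HasText` are an addition of this sketch, since shapes such as pictures carry no text.

```csharp
using System.Text;
using Microsoft.Office.Core;
using PowerPoint = Microsoft.Office.Interop.PowerPoint;

// Hypothetical helper; the name is illustrative only.
static class PowerPointTextExtractor
{
    public static string ExtractText(string path)
    {
        var app = new PowerPoint.Application();
        try
        {
            // Open read-only and without a visible window.
            PowerPoint.Presentation presentation = app.Presentations.Open(
                path,
                MsoTriState.msoTrue,    // ReadOnly
                MsoTriState.msoFalse,   // Untitled
                MsoTriState.msoFalse);  // WithWindow
            try
            {
                var text = new StringBuilder();
                foreach (PowerPoint.Slide slide in presentation.Slides)
                {
                    foreach (PowerPoint.Shape shape in slide.Shapes)
                    {
                        // Skip shapes that cannot hold text or are empty.
                        if (shape.HasTextFrame == MsoTriState.msoTrue &&
                            shape.TextFrame.HasText == MsoTriState.msoTrue)
                        {
                            text.AppendLine(shape.TextFrame.TextRange.Text);
                        }
                    }
                }
                return text.ToString();
            }
            finally
            {
                presentation.Close();
            }
        }
        finally
        {
            app.Quit();
        }
    }
}
```

This only walks the top-level shapes on each slide. Text inside grouped shapes, tables or speaker notes would need extra handling.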

For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.
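As a rough illustration of the ADO.NET route (not the code from the linked post), the sketch below opens a workbook through the Jet OLE DB provider and concatenates every cell value it finds. It assumes an .xls file and a 32-bit process, since Jet 4.0 is 32-bit only. For .xlsx files the ACE provider ("Microsoft.ACE.OLEDB.12.0" with "Excel 12.0 Xml") would be needed instead, and it must be installed separately. The helper name is hypothetical.

```csharp
using System;
using System.Data;
using System.Data.OleDb;
using System.Text;

// Hypothetical helper; the name is illustrative only.
static class ExcelTextExtractor
{
    public static string ExtractText(string path)
    {
        // HDR=NO treats the first row as data; IMEX=1 asks the driver
        // to read mixed-type columns as text.
        string connectionString =
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path +
            ";Extended Properties=\"Excel 8.0;HDR=NO;IMEX=1\"";

        var text = new StringBuilder();
        using (var connection = new OleDbConnection(connectionString))
        {
            connection.Open();

            // Each worksheet shows up as a "table" (e.g. "Sheet1$");
            // named ranges may be listed here as well.
            DataTable sheets = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
            foreach (DataRow sheet in sheets.Rows)
            {
                string sheetName = (string)sheet["TABLE_NAME"];
                string query = "SELECT * FROM [" + sheetName + "]";

                using (var command = new OleDbCommand(query, connection))
                using (OleDbDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        for (int i = 0; i < reader.FieldCount; i++)
                        {
                            // Skip empty cells; separate values with a space.
                            if (!reader.IsDBNull(i))
                            {
                                text.Append(reader.GetValue(i)).Append(' ');
                            }
                        }
                        text.AppendLine();
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Because no Excel instance is started, this tends to be lighter than automating Excel through interop, which suits a crawler that processes many files.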
