Excel – Using LocationTextExtractionStrategy in iTextSharp for text coordinates

excel itextsharp pdf

My goal is to extract data from a PDF, which may be laid out as a table, into an Excel file.

Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as a plain-text string, read from left to right.

How can I proceed so that calling

PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())

returns a string in which the text retains its position on the page?

For instance, if the first line in the PDF is right-aligned, the resulting string should be padded with spaces so that the content stays right-aligned.

Please suggest how I might achieve this.

Best Answer

It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.
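Because a "table" is only positioned text, keeping the alignment in a plain string comes down to mapping each chunk's x coordinate to a character column and padding with spaces. Here is a minimal sketch of that idea in plain C#, with no PDF library involved. The names PositionedText and GridLayout are hypothetical, and the unitsPerChar and lineTolerance defaults are assumptions you would need to tune for your font size and line spacing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical record: one piece of text and the point where it starts.
public sealed class PositionedText
{
    public float X { get; }
    public float Y { get; }
    public string Text { get; }

    public PositionedText(float x, float y, string text)
    {
        X = x; Y = y; Text = text;
    }
}

public static class GridLayout
{
    // Renders chunks as plain text, padding with spaces so each chunk starts
    // near column X / unitsPerChar. Chunks whose Y differs from the first
    // chunk of a line by less than lineTolerance share that line.
    public static string Render(IEnumerable<PositionedText> chunks,
                                float unitsPerChar = 6f,
                                float lineTolerance = 2f)
    {
        var lines = new List<List<PositionedText>>();
        // PDF y coordinates grow upward, so sort descending to go top to bottom.
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y - chunk.Y) < lineTolerance)
                last.Add(chunk);
            else
                lines.Add(new List<PositionedText> { chunk });
        }

        var output = new StringBuilder();
        foreach (var line in lines)
        {
            var row = new StringBuilder();
            foreach (var chunk in line.OrderBy(c => c.X))
            {
                int column = Math.Max(0, (int)Math.Round(chunk.X / unitsPerChar));
                // If the target column is already taken, keep one separating space.
                if (row.Length > 0 && column <= row.Length) column = row.Length + 1;
                row.Append(' ', column - row.Length);
                row.Append(chunk.Text);
            }
            output.AppendLine(row.ToString());
        }
        return output.ToString();
    }
}
```

With the default of 6 units per character, chunks starting at x = 120 on two different lines both begin at column 20, so a column of values stays lined up under its header. A fixed character width is a simplification: with proportional fonts the columns only line up approximately, which is usually enough for splitting into Excel cells.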

With that in mind, you need to write your own text extraction strategy, either by implementing ITextExtractionStrategy or by subclassing an existing one such as LocationTextExtractionStrategy, and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some more complex things that you can do.
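As a rough illustration of the subclassing approach, the sketch below records where each chunk of text starts while leaving the default behaviour intact. It assumes iTextSharp 5.x, where RenderText is virtual on LocationTextExtractionStrategy. The names ChunkPosition, PositionRecordingStrategy and PageReader are made up for this example, and the baseline start point is just one reasonable choice of coordinate.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical container for one chunk of text and its starting point.
public class ChunkPosition
{
    public float X;
    public float Y;
    public string Text;
}

// Hypothetical strategy: same output as LocationTextExtractionStrategy,
// plus a list of the coordinates of every chunk the parser reports.
public class PositionRecordingStrategy : LocationTextExtractionStrategy
{
    public readonly List<ChunkPosition> Chunks = new List<ChunkPosition>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class keep building its normal text result.
        base.RenderText(renderInfo);

        // Start of the baseline, in the page's user-space units.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Chunks.Add(new ChunkPosition
        {
            X = start[Vector.I1],
            Y = start[Vector.I2],
            Text = renderInfo.GetText()
        });
    }
}

public static class PageReader
{
    // Returns the positioned chunks of one page (page numbers start at 1).
    public static List<ChunkPosition> ReadPage(string path, int pageNumber)
    {
        var reader = new PdfReader(path);
        try
        {
            var strategy = new PositionRecordingStrategy();
            PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
            return strategy.Chunks;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Once you have the coordinates, you can group chunks with similar Y values into rows and use the X values to decide which column, or how many padding spaces, each chunk gets. Keep in mind that a chunk may be a whole phrase or a single character, depending on how the PDF was produced.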
