I am having a problem reading some data from a PDF file.
My file is structured and contains tables and plain text. The standard parser reads data from separate columns as one line. For example:
    Some Table Header
    Data Col1a    Data Col2a    Data Col3a
    Data Col1b    Data Col2b    Data Col3b
                  Data Col2c
with this code
    PdfReader reader = new PdfReader(pdfName);
    List<String> text = new List<String>();
    String page;
    List<String> pageStrings;
    string[] separators = { "\n", "\r\n" };
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        page = PdfTextExtractor.GetTextFromPage(reader, i);
        pageStrings = new List<string>(page.Split(separators, StringSplitOptions.RemoveEmptyEntries));
        text.AddRange(pageStrings);
    }
    reader.Close();
    return text;
will be concatenated into strings:
    Some Table Header
    Data Col1a Data Col2a Data Col3a
    Data Col1b Data Col2b Data Col3b
    Data Col2c
I'd like to get concatenated strings that reflect the data blocks instead. For the example above, I'd like to get these strings:
    Some Table Header
    Data Col1a Data Col1b
    Data Col2a Data Col2b Data Col2c
    Data Col3a Data Col3b
Does anyone have an idea how to tune iTextSharp so that the PDF parser behaves this way?
Maybe someone has an appropriate code sample?
The sample PDF file is here
Best Answer
The OP's sample file contains multiple sections like this one:
And the OP mentioned in a comment:
Using PDFBox (v1.8.10, the current release version) in this method:
returns for the section shown above
This is not really a neat column-wise extraction but certain blocks of information (like address blocks) remain together.
Getting the same output with iText(Sharp) is actually very easy: one merely has to explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy, which is used by default, i.e. one has to replace the call PdfTextExtractor.GetTextFromPage(reader, i) in the question's code by PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()).

With the exception of one space character per dataset (iText(Sharp) extracts "Destination: Pick-up:" instead of "Destination:Pick-up:"), the results are identical.

Concerning your conclusion from PDFBox extracting the text as it does:
Actually, this order of extraction merely means that the operations for drawing the string segments in the PDF page content stream occur in this very order. As the order of those operations is arbitrary according to the PDF specification, any update of the software generating those PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract merely an unintelligible soup of characters.

PS: If one sets the PDFBox PDFTextStripper property SortByPosition to true (by calling setSortByPosition(true) on the stripper), then PDFBox extracts the text just like iText(Sharp) does with the (default) LocationTextExtractionStrategy.

The OP indicated interest in a block structure inherent in the content stream. The most obvious structure like that in a generic PDF would be the text objects (in which multiple strings may be drawn).
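As an aside, the difference between content stream order and position-sorted order can be made concrete with a small model that uses neither iText nor PDFBox. The sketch below assumes a hypothetical Chunk type (a string plus the x/y position at which it is drawn) and a made-up producer that draws the question's sample table column by column; neither is part of any library. Emitting the chunks in stream order keeps each column together, while sorting by position (top to bottom, then left to right) produces the row-wise lines shown in the question. Real strategies also deal with rotation, tolerances and spacing, all of which this model ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ExtractionOrderDemo {
    /** Hypothetical model of one drawn string; not an iText or PDFBox class. */
    static final class Chunk {
        final String text;
        final double x; // horizontal start position
        final double y; // baseline; larger y is higher on the page in PDF coordinates
        Chunk(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    /** The question's sample table, drawn column by column (an assumed producer behavior). */
    static List<Chunk> sampleTable() {
        return Arrays.asList(
            new Chunk("Some Table Header", 50, 700),
            new Chunk("Data Col1a", 50, 680), new Chunk("Data Col1b", 50, 660),
            new Chunk("Data Col2a", 200, 680), new Chunk("Data Col2b", 200, 660),
            new Chunk("Data Col2c", 200, 640),
            new Chunk("Data Col3a", 350, 680), new Chunk("Data Col3b", 350, 660));
    }

    /** Stream order: return the strings exactly in the order they were drawn. */
    static List<String> inStreamOrder(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) out.add(c.text);
        return out;
    }

    /** Position order: sort top to bottom, then left to right; equal-y chunks form one line. */
    static List<String> byPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y).thenComparingDouble(c -> c.x));
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        Double lastY = null;
        for (Chunk c : sorted) {
            if (lastY != null && c.y != lastY) { // baseline changed: start a new line
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) line.append(' ');
            line.append(c.text);
            lastY = c.y;
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sampleTable()));
        System.out.println(byPosition(sampleTable()));
    }
}
```

Had the assumed producer drawn the same strings row by row, inStreamOrder would return them row by row as well, which is why relying on stream order is fragile.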
In the case at hand the SimpleTextExtractionStrategy is used. It can easily be extended to also include markers corresponding to the start and end of text objects in its output. In Java this can be done by using an anonymous class like this:

(TextExtraction.java method extractSimple)

(This Java code should easily be translatable into C#. The playing around with an empty boolean may look funny; it is necessary, though, because the base class assumes certain additional properties to be set as soon as some chunk has been appended to the extracted content.)

Using this extended strategy one gets for the section shown above:
As this keeps addresses in the same block, this might help during extraction.
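The idea of wrapping each text object in markers can also be illustrated without iText. The following is a minimal, self-contained Java model and not the answer's code: the TextListener interface is a hypothetical stand-in whose callback names mirror those of the iText 5 RenderListener (beginTextBlock, renderText, endTextBlock), strings are passed as plain String values rather than TextRenderInfo objects, the marker text is an arbitrary choice, and the input strings are made up. Because this model has no base class with internal state, it does not need the empty flag mentioned above.

```java
import java.util.Arrays;
import java.util.List;

class BlockMarkerDemo {
    /** Hypothetical stand-in for a render listener; only the callback names follow iText 5. */
    interface TextListener {
        void beginTextBlock();
        void renderText(String text);
        void endTextBlock();
    }

    /** Collects text in stream order and wraps every text object in block markers. */
    static final class BlockMarkingCollector implements TextListener {
        private final StringBuilder result = new StringBuilder();
        private boolean firstInBlock = true;

        @Override
        public void beginTextBlock() {
            result.append("<BLOCK>");
            firstInBlock = true;
        }

        @Override
        public void renderText(String text) {
            if (!firstInBlock) result.append(' '); // separate strings within one text object
            result.append(text);
            firstInBlock = false;
        }

        @Override
        public void endTextBlock() {
            result.append("</BLOCK>\n");
        }

        String getResultantText() {
            return result.toString();
        }
    }

    /** Replays a list of text objects (each a list of drawn strings) through the collector. */
    static String extract(List<List<String>> textObjects) {
        BlockMarkingCollector collector = new BlockMarkingCollector();
        for (List<String> textObject : textObjects) {
            collector.beginTextBlock();
            for (String s : textObject) collector.renderText(s);
            collector.endTextBlock();
        }
        return collector.getResultantText();
    }

    public static void main(String[] args) {
        // Made-up input: two text objects, each holding one block of related strings.
        System.out.print(extract(Arrays.asList(
            Arrays.asList("Pick-up:", "Address line A"),
            Arrays.asList("Destination:", "Address line B"))));
    }
}
```

Splitting the result on the closing marker then gives one string per text object, provided the producing software really draws each logical block within a single text object.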