C# – Is it possible to get structural elements from a PDF file using iTextSharp

c#, c#-4.0, itextsharp, pdf

We are using iTextSharp in a C# WinForms application to parse PDF files. With iTextSharp, I can easily extract the text from a PDF file. However, if the file contains, say, an image surrounded by two lines of text, I cannot extract any information about the image.

My requirement is:

Get structural elements of the PDF file
Determine whether each element is text, an image, a table, or something else

For example, the structural elements are similar to the following:

text :paragraph1
text :paragraph2
Image:Image
text :paragraph3
Table:table info
text :Paragraph4

If I can obtain information in a format like this, I can easily understand the text, image, table, header or footer information.

So, is it possible to get this kind of information using iTextSharp? If yes, please enlighten me on this. Otherwise, could you please suggest some other tools capable of meeting this requirement?

Thanks to all,

Saravanan

Best Answer

I had a similar need a while ago and used this function (from "Extract images using iTextSharp"):

// Requires: using iTextSharp.text.pdf;
// Returns the first image XObject found in a page (or form) dictionary, or null.
private static PdfObject FindImageInPDFDictionary(PdfDictionary pg)
{
    PdfDictionary res =
        (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    if (res == null)
    {
        return null; // no resources dictionary at this level
    }

    PdfDictionary xobj =
        (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobj == null)
    {
        return null; // no XObjects, so no images here
    }

    foreach (PdfName name in xobj.Keys)
    {
        PdfObject obj = xobj.Get(name);
        if (!obj.IsIndirect())
        {
            continue;
        }

        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
        PdfName type =
            (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

        // image directly in this resources dictionary
        if (PdfName.IMAGE.Equals(type))
        {
            return obj;
        }

        // image inside a form or a group: keep scanning if nothing is found there
        if (PdfName.FORM.Equals(type) || PdfName.GROUP.Equals(type))
        {
            PdfObject nested = FindImageInPDFDictionary(tg);
            if (nested != null)
            {
                return nested;
            }
        }
    }

    return null;
}

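As the foreach (PdfName name in xobj.Keys) loop shows, you can walk every XObject in a page's resources and handle each one according to its subtype. Keep in mind that this only covers XObjects (images and forms); text is drawn by the page content stream, and the function does not tell you where an image sits relative to the surrounding text. So I'm not sure it covers the "verticality" (ordering) part of your need.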
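For the ordering part, one option is a render listener instead of a resource walk. The following is a minimal sketch, assuming iTextSharp 5.x, where PdfReaderContentParser and IRenderListener live in iTextSharp.text.pdf.parser; the ElementListener and PageElements names are made up for illustration. It records text chunks and images in the order the content stream draws them, which is often, but not always, the reading order. Tables are not reported, because an untagged PDF draws them as ordinary text and lines.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: collects one entry per text chunk or image, in drawing order.
public class ElementListener : IRenderListener
{
    public readonly List<string> Elements = new List<string>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // A chunk is usually a fragment of a line, not a whole paragraph.
        Elements.Add("text :" + renderInfo.GetText());
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        // GetRef() is null for inline images, which have no indirect reference.
        PdfIndirectReference imageRef = renderInfo.GetRef();
        Elements.Add("Image:" + (imageRef == null ? "inline" : imageRef.ToString()));
    }
}

public static class PageElements
{
    // Hypothetical helper: lists the elements of one page (page numbers are 1-based).
    public static List<string> List(PdfReader reader, int pageNumber)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        ElementListener listener = parser.ProcessContent(pageNumber, new ElementListener());
        return listener.Elements;
    }
}
```

Calling PageElements.List(reader, page) for each page from 1 to reader.NumberOfPages gives a per-page list. Grouping chunks into paragraphs, or detecting tables, would need extra heuristics on top of this.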

Hope this helps.

Related Solutions

C# – Get int value from enum in C#

Just cast the enum, e.g.

int something = (int) Question.Role;

The above will work for the vast majority of enums you see in the wild, as the default underlying type for an enum is int.

However, as cecilphillip points out, enums can have different underlying types. If an enum is declared as a uint, long, or ulong, it should be cast to the type of the enum; e.g. for

enum StarsInMilkyWay:long {Sun = 1, V645Centauri = 2 .. Wolf424B = 2147483649};

you should use

long something = (long)StarsInMilkyWay.Wolf424B;
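To check which cast is safe, the underlying type can be inspected at run time. A small sketch, using made-up Role and BigId enums and a hypothetical EnumDemo helper; Enum.GetUnderlyingType and Convert.ToInt64 are standard .NET calls:

```csharp
using System;

enum Role { Admin = 1, User = 2 }          // hypothetical; underlying type defaults to int
enum BigId : long { Large = 2147483649 }   // hypothetical; value does not fit in an int

static class EnumDemo
{
    // Reports the integral type that backs the given enum value.
    public static Type UnderlyingType(Enum value)
    {
        return Enum.GetUnderlyingType(value.GetType());
    }

    // Converts any enum value to long without knowing its underlying type.
    // Throws OverflowException for ulong-backed values above long.MaxValue.
    public static long ToInt64(Enum value)
    {
        return Convert.ToInt64(value);
    }
}
```

Here EnumDemo.UnderlyingType(Role.User) is typeof(int), EnumDemo.UnderlyingType(BigId.Large) is typeof(long), and EnumDemo.ToInt64(BigId.Large) is 2147483649.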

C# – Get property value from string using reflection

 public static object GetPropValue(object src, string propName)
 {
     return src.GetType().GetProperty(propName).GetValue(src, null);
 }

Of course, you will want to add validation and whatnot, but that is the gist of it.
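As a rough idea of what that validation might look like, here is a sketch of a Try-style variant. The PropertyReader class and TryGetPropValue method are names invented for this example; only public instance properties are considered, and indexers are skipped.

```csharp
using System;
using System.Reflection;

static class PropertyReader
{
    // Hypothetical helper: returns false instead of throwing when the source is null,
    // the name is empty, or no readable, non-indexed public property has that name.
    public static bool TryGetPropValue(object src, string propName, out object value)
    {
        value = null;
        if (src == null || string.IsNullOrEmpty(propName))
        {
            return false;
        }

        // Note: GetProperty can still throw AmbiguousMatchException
        // if several public properties share the same name.
        PropertyInfo prop = src.GetType().GetProperty(propName);
        if (prop == null || !prop.CanRead || prop.GetIndexParameters().Length != 0)
        {
            return false;
        }

        value = prop.GetValue(src, null);
        return true;
    }
}
```

For instance, PropertyReader.TryGetPropValue("abc", "Length", out value) succeeds and sets value to 3, while asking for a property that does not exist simply returns false.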
