Java XML – Handling XML Elements with Both CDATA and Regular Data Using SimpleXML

java, xml

I'm struggling to understand how I can deserialize the following response from an RSS feed. I need the text (blah blah blah etc.) as well as the embedded image source, http://host/path/to/picture.jpg.

<description>blah blah blah blah&lt;br /&gt;
<![CDATA[<img src=http://host/path/to/picture.jpg>]]>&lt;br /&gt;
blah blah blah blah&lt;br /&gt;&lt;br /&gt;
</description>

Here's my model class (or rather what I want it to be) – I've shortened it for brevity:

public static class Item {

    ...

    @Element(name="description", required=false)
    String descriptionContent;

    String imageLink;
    ...

}

From the docs I know I can set data=true on the @Element annotation, but from my reading that only works when the entire content of the element is CDATA, not just part of it.

I'm doing this on Android using Retrofit and the SimpleXMLConverter, but I think that is beside the point.

Best Answer

I figured this out eventually. Hopefully this will help others in the future.

What was needed was a way to intercept the deserialization process and run some custom code to extract the data. SimpleXML supports several strategies for serialization and deserialization. I chose the AnnotationStrategy, whereby I annotate the field in my POJO model class with the @Convert annotation, which points to a converter class.

....

@Element(name="description", required=false)
@Convert(DescriptionConverter.class)
Description description;

...
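The Description type used for this field is not shown in the answer. As a minimal sketch, assuming it is nothing more than a holder for the two values the question asks for, it could look like the following. The field and getter names are hypothetical; only the two-argument constructor is implied by the converter.

```java
// Hypothetical sketch: the real RssFeed class is not shown in the answer.
class RssFeed {

    /** Holds the plain text and the image link extracted from <description>. */
    static final class Description {
        private final String text;       // description text with markup removed
        private final String imageLink;  // value of the img src, or null if absent

        Description(String text, String imageLink) {
            this.text = text;
            this.imageLink = imageLink;
        }

        String getText() {
            return text;
        }

        String getImageLink() {
            return imageLink;
        }
    }
}
```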

And here's what the converter looks like:

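import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.simpleframework.xml.convert.Converter;
import org.simpleframework.xml.stream.InputNode;
import org.simpleframework.xml.stream.OutputNode;

public class DescriptionConverter implements Converter<RssFeed.Description> {

    @Override
    public RssFeed.Description read(InputNode node) throws Exception {
        // Captures the (unquoted) src value of an <img> tag.
        final String IMG_SRC_REG_EX = "<img src=([^>]+)>";
        // Matches any opening or closing tag, e.g. <br /> or </p>.
        final String HTML_TAG_REG_EX = "</?[^>]+>";

        // Full text of the element, regular data and CDATA combined.
        String nodeText = node.getValue();

        Pattern imageLinkPattern = Pattern.compile(IMG_SRC_REG_EX);
        Matcher matcher = imageLinkPattern.matcher(nodeText);

        // If there are several <img> tags, the last one wins.
        String link = null;
        while (matcher.find()) {
            link = matcher.group(1);
        }

        // Drop the image tag, then any remaining markup, to leave plain text.
        String text = nodeText.replaceFirst(IMG_SRC_REG_EX, "")
                              .replaceAll(HTML_TAG_REG_EX, "");

        return new RssFeed.Description(text, link);
    }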

    @Override
    public void write(OutputNode node, RssFeed.Description value) throws Exception {
            ...
    }

}
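To see what the two regular expressions do on their own, here is a minimal, self-contained sketch that runs without SimpleXML. It assumes node.getValue() hands back the element text with the entities decoded and the CDATA wrapper removed, which is not verified here. The class name, method names and sample string are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of SimpleXML or of the answer's code.
class DescriptionParsingSketch {

    // Same expressions as in the converter, compiled once.
    private static final Pattern IMG_SRC = Pattern.compile("<img src=([^>]+)>");
    private static final Pattern HTML_TAG = Pattern.compile("</?[^>]+>");

    /** Returns the src of the last <img> tag, or null if there is none. */
    static String extractImageLink(String nodeText) {
        Matcher matcher = IMG_SRC.matcher(nodeText);
        String link = null;
        while (matcher.find()) {
            link = matcher.group(1);
        }
        return link;
    }

    /** Removes every tag, including <img>, and returns the remaining text. */
    static String stripTags(String nodeText) {
        return HTML_TAG.matcher(nodeText).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed shape of the text for the <description> in the question.
        String nodeText = "blah blah blah blah<br />\n"
                + "<img src=http://host/path/to/picture.jpg><br />\n"
                + "blah blah blah blah<br /><br />\n";

        System.out.println(extractImageLink(nodeText));
        System.out.println(stripTags(nodeText));
    }
}
```

Note that the tag pattern also matches the <img> tag itself, so a single replaceAll is enough in this sketch. The pattern for the image link only handles an unquoted src that is the first attribute, as in the feed above.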

You still need to tell Simple to use this strategy, though; otherwise it will ignore the @Convert annotation. Given that I am using Retrofit and the SimpleXMLConverter, here is what my implementation looks like:

private static final Retrofit.Builder builder = new Retrofit.Builder()
            .baseUrl(API_BASE_URL)
            .addConverterFactory(SimpleXmlConverterFactory.create(new Persister(new AnnotationStrategy())));