Java – How to iterate over plain text segments with the Jericho HTML parser


For a Jericho Element, I am trying to find out how to loop over all child nodes, whether an element or plain text.

Now there is Element.getNodeIterator(), but this iterates over ALL descendants within the Element, not just its direct children.

I need the equivalent of Element.getChildSegments(). Any ideas?

Thanks

Best Answer

All plain text segments not within any child elements, correct?

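import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import net.htmlparser.jericho.*;

public static Iterator<Segment> directPlainTextChildren(Element elem) {
    final Iterator<Segment> it = elem.getContent().getNodeIterator();
    final List<Segment> results = new LinkedList<Segment>();
    final List<Element> children = elem.getChildElements();
    nodes:
    while (it.hasNext()) {
        Segment cur = it.next();
        // Only plain text: skip tags and character references.
        if (!(cur instanceof Tag) && !(cur instanceof CharacterReference)) {
            // Skip text that lies inside a child element. The label is needed:
            // a bare "continue" would only advance the inner for loop.
            for (Element child : children)
                if (child.encloses(cur)) continue nodes;
            results.add(cur);
        }
    }
    return results.iterator();
}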

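An element should have few direct children, and the containment check (Segment.encloses(Segment) in Jericho's API) is just a simple bounds comparison, so the nested loop costs roughly (number of nodes) x (number of direct children) cheap comparisons. The performance should be adequate.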
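To make the bounds check concrete, here is a minimal library-free sketch of the same filter. It does not use Jericho: `Span`, `outsideChildren` and `DirectChildFilter` are hypothetical names that only stand in for a segment and the filtering loop, and the offsets are character positions in the string `<p>Hello <b>big</b> world</p>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DirectChildFilter {

    /** Hypothetical stand-in for a segment: a half-open [begin, end) range of source offsets. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** Pure bounds check: true if other lies entirely within this span. */
        boolean encloses(Span other) {
            return begin <= other.begin && other.end <= end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Keeps only the nodes that are not enclosed by any child span. */
    static List<Span> outsideChildren(List<Span> nodes, List<Span> children) {
        List<Span> result = new ArrayList<>();
        for (Span node : nodes) {
            boolean insideChild = false;
            for (Span child : children) {
                if (child.encloses(node)) {
                    insideChild = true;
                    break;
                }
            }
            if (!insideChild) {
                result.add(node);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Text nodes inside <p>: "Hello " [3,9), "big" [12,15), " world" [19,25)
        List<Span> textNodes = Arrays.asList(new Span(3, 9), new Span(12, 15), new Span(19, 25));
        // The only direct child element, <b>big</b>, covers [9,19)
        List<Span> children = Collections.singletonList(new Span(9, 19));
        System.out.println(outsideChildren(textNodes, children));
    }
}
```

Only the two top-level text ranges survive the filter; "big" is dropped because it falls inside the range of the `<b>` element. Each check is two integer comparisons, which is why the nested loop stays cheap when an element has few direct children.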

Edit: If you wanted to iterate over all direct child segments (child elements as well as plain text and character references), it would look like this:

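import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import net.htmlparser.jericho.*;

public static Iterator<Segment> getChildSegments(Element elem) {
    final Iterator<Segment> it = elem.getContent().getNodeIterator();
    final List<Segment> results = new LinkedList<Segment>();
    final List<Element> children = elem.getChildElements();
    nodes:
    while (it.hasNext()) {
        Segment cur = it.next();
        for (Element child : children) {
            if (child.encloses(cur)) {
                // Add each direct child element once, when its start tag is
                // reached, and skip every other node nested inside it.
                if (cur.getBegin() == child.getBegin()) results.add(child);
                continue nodes;
            }
        }
        // Not inside any child: keep plain text and character references,
        // and drop stray tags that belong to no child element.
        if (!(cur instanceof Tag)) results.add(cur);
    }
    return results.iterator();
}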