Android – Anything faster than Jsoup for HTML scraping?

android jsoup

So I'm building an app that displays an imageboard from a website I go to in a more user-friendly interface. There are a lot of problems with it at the moment, but the biggest one right now is fetching the images to display them.

The way I have it right now, the images are displayed in a GridView of size 12, mirroring the number of images on each page of the imageboard. I'm using Jsoup to scrape the page for the thumbnail image URLs to display in the GridView, as well as getting the URLs for the full size images to display when a user clicks on the thumbnail.

The problem right now is that it takes anywhere from 8-12 seconds on average for Jsoup to get the HTML page to scrape. I find this unacceptable, and I was wondering whether there is any way to make it faster or whether this is an inherent bottleneck that I can't do anything about.

Here's the code I'm using to fetch the page to scrape:

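try {
    // Fetch and parse the page in one blocking call.
    Document doc = Jsoup.connect(url).get();
    // Thumbnails are the <img> tags whose src contains "/alt2/".
    Elements links = doc.select("img[src*=/alt2/]");
    for (Element link : links) {
        thumbURL = link.attr("src");
        // Derive the full-size URL: drop the "/alt2/" segment and the trailing "s".
        linkURL = thumbURL.replace("/alt2/", "/").replace("s.jpg", ".jpg");
        imgSrc.add(new Pair<String, String>(thumbURL, linkURL));
    }
}
catch (IOException e) {
    e.printStackTrace();
}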

Best Answer

I used Jsoup for a TLFN scraper and had no issues with speed. You should narrow down the bottleneck first. I presume it's your scraping that is causing the speed issue, but try tracing your selector and your network traffic separately to see which is to blame. If the selector is to blame, consider another approach to querying and benchmark the results.

For faster, rough testing, you can always run Jsoup from a normal Java project. Once you feel you have improved it, put it back on a device and see whether the performance improves similarly.
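To trace the stages separately, a small timer along these lines can run in a plain Java project. This is only a sketch: `StageTimer` is a hypothetical helper, not part of Jsoup or Android, and the stages in `main` are stand-ins for the real fetch and select calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: records how long each named stage takes.
public class StageTimer {
    private final Map<String, Long> millisByStage = new LinkedHashMap<>();

    // Runs the work, stores its wall-clock duration under the stage name,
    // and returns the work's result.
    public <T> T time(String stage, Callable<T> work) {
        long start = System.nanoTime();
        try {
            return work.call();
        } catch (Exception e) {
            // Wrapped to keep call sites short; a real app would handle IOException itself.
            throw new IllegalStateException("stage failed: " + stage, e);
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            millisByStage.merge(stage, elapsedMs, Long::sum);
        }
    }

    public Map<String, Long> results() {
        return millisByStage;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Stand-ins for the real stages (assumption: replace with the Jsoup calls).
        String html = timer.time("fetch", () -> { Thread.sleep(30); return "<html></html>"; });
        Integer matches = timer.time("select", () -> html.length());
        System.out.println(matches + " " + timer.results());
    }
}
```

With Jsoup, the fetch stage could be `Jsoup.connect(url).execute().body()`, the parse stage `Jsoup.parse(html, url)`, and the select stage `doc.select("img[src*=/alt2/]")`. If fetch dominates, the delay is in the network or the server rather than in the parsing or the selector.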

EDIT

Not that this is your issue, but be aware that using iterators can trigger quite a bit of garbage collection. Typically this is not a concern, but if you use them in many places with much repetition, some devices can take a noticeable performance hit.

Not great:

for (Element link : links)

Better:

int i;
Element tempLink;
for (i = 0; i < links.size(); i++) {
    tempLink = links.get(i);
}

EDIT 2

If the image URLs start with /alt2/, you may be able to use ^= instead of *=, which could make the search faster. Additionally, depending on the amount of HTML, you may be wasting a lot of time looking in completely the wrong place for these images. Check whether the images are wrapped in an identifiable container, such as <div class="posts">. If you can narrow down the amount of HTML to sift through, it may improve performance.
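Before switching operators, it is worth checking what the src values actually look like. The sketch below models the two operators with plain String methods rather than Jsoup itself. The class name, method names, and sample URLs are made up for illustration. `toFullSize` repeats the replacement from the question.

```java
public class SrcMatch {
    // Simplified model of img[src^=prefix]: src must begin with the prefix.
    public static boolean prefixMatch(String src, String prefix) {
        return src.startsWith(prefix);
    }

    // Simplified model of img[src*=part]: src may contain the part anywhere.
    public static boolean substringMatch(String src, String part) {
        return src.contains(part);
    }

    // Same thumbnail-to-full-size conversion as in the question.
    public static String toFullSize(String thumbUrl) {
        return thumbUrl.replace("/alt2/", "/").replace("s.jpg", ".jpg");
    }

    public static void main(String[] args) {
        // Made-up sample values: one relative src, one absolute src.
        String relative = "/alt2/123s.jpg";
        String absolute = "http://example.com/alt2/123s.jpg";

        System.out.println(prefixMatch(relative, "/alt2/"));
        System.out.println(prefixMatch(absolute, "/alt2/"));
        System.out.println(substringMatch(absolute, "/alt2/"));
        System.out.println(toFullSize(absolute));
    }
}
```

If the page uses absolute URLs, the prefix check fails where the substring check succeeds, so ^= would silently return no images. In selector terms, scoping to the container would look like `div.posts img[src*=/alt2/]`, where `div.posts` is the hypothetical container mentioned above.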
