How to scrape logos from websites

html-parsing screen-scraping

First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me (css_parser, nokogiri, etc.); I'm using Ruby to do the scraping.

This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.

The two solutions I've begun to create are these:

  1. Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
  2. The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e. H1 text that is replaced with the logo image via CSS). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words "header" or "logo" in the file names.
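To make the second approach concrete, here is a rough sketch of the url() scan in plain Ruby. It assumes the stylesheet text has already been downloaded; `logo_candidates`, `LOGO_HINT`, and `CSS_URL` are names made up for this illustration, and the regex is a simplification rather than a real CSS parser.

```ruby
require "uri"

# Hypothetical heuristic: file names that mention "logo" or "header".
LOGO_HINT = /logo|header/i
# Simplified matcher for url(...) with optional quotes (not a full CSS parser).
CSS_URL = /url\(\s*["']?([^"')\s]+)["']?\s*\)/i

# Returns absolute URLs of images referenced in css_text whose file name
# matches LOGO_HINT. stylesheet_url is used to resolve relative references.
def logo_candidates(css_text, stylesheet_url)
  css_text.scan(CSS_URL).flatten.each_with_object([]) do |ref, found|
    next if ref.start_with?("data:") # inline images have no file name

    begin
      absolute = URI.join(stylesheet_url, ref)
    rescue URI::Error
      next # skip references that cannot be resolved
    end

    name = File.basename(absolute.path.to_s)
    url = absolute.to_s
    found << url if name.match?(LOGO_HINT) && !found.include?(url)
  end
end

css = <<~CSS
  h1#brand { background: url("../img/site-logo.png") no-repeat; }
  body { background: url(bg.gif); }
  .top { background-image: url('/images/Header.jpg'); }
CSS

puts logo_candidates(css, "https://example.com/css/main.css")
```

The candidates come back in stylesheet order, so any ranking (for example, preferring "logo" over "header") would still have to be layered on top.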

Solution two is problematic because of the many idiosyncrasies of the people who write CSS for websites. Some use "header" instead of "logo" in the file name. Sometimes the file name is random and says nothing about a logo. Other times, the match is simply the wrong image.

I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.

So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties 🙂

Thanks!

Best Answer

Check this API by Clearbit. It's super simple to use:

Just send a query to: https://logo.clearbit.com/[enter-domain-here]

For example: https://logo.clearbit.com/www.stackoverflow.com

and get back the logo image!
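Since the question mentions Ruby, here is a minimal sketch of calling that endpoint with the standard library. It assumes the endpoint behaves as described above (domain appended to the base URL, image bytes in the response body); `logo_url_for` and `fetch_logo` are hypothetical helper names, not part of any Clearbit client.

```ruby
require "uri"
require "net/http"

# Base URL taken from the answer above.
LOGO_ENDPOINT = "https://logo.clearbit.com/"

# Builds the lookup URL from a bare domain or a full website address.
def logo_url_for(site)
  # URI.parse only reports a host when a scheme is present, so add one if missing.
  site = "http://#{site}" unless site.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  host = URI.parse(site).host
  raise ArgumentError, "no host in #{site.inspect}" if host.nil? || host.empty?

  LOGO_ENDPOINT + host.downcase
end

# Returns the raw image bytes, or nil on any non-2xx response.
# Redirects are not followed in this sketch.
def fetch_logo(site)
  response = Net::HTTP.get_response(URI(logo_url_for(site)))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end

# Usage (performs a network request):
#   image = fetch_logo("www.stackoverflow.com")
#   File.binwrite("logo.png", image) if image
```

Returning nil on failure leaves room to fall back to the CSS or image-search approaches from the question when no logo is found.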

More about it here
