Extracting URLs from an Emacs buffer

elisp

How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?

Input:

<html>
 <a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
 <h1>Emacs Lisp</h1>
 <a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>

Output:

http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News

I've seen the re-search-forward function mentioned several times during my search. Here's what I think I need to do, based on what I've read so far.

(defun extra-urls (file)
 ...
 (setq buffer (...
 (while
        (re-search-forward "http://" nil t)
        (when (match-string 0)
...
))

Best Answer

If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:

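(defun getlinks ()
  "Reduce the current buffer to one url|description line per link."
  ;; 1. Rewrite every line that contains a link as LINK:url|description.
  ;;    [^>]* (rather than [^>]+) also accepts <a href="..."> with no other attributes.
  (goto-char (point-min))
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  ;; 2. Blank out every line that does not start with LINK.
  (goto-char (point-min))
  (replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
  ;; 3. Collapse runs of newlines (both strings contain a literal line break).
  (goto-char (point-min))
  (replace-regexp "
+" "
")
  ;; 4. Strip the LINK: marker from the remaining lines.
  (goto-char (point-min))
  (replace-regexp "^LINK:\\(.*\\)$" "\\1"))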

It replaces each line that contains a link with LINK:url|description, blanks every line that doesn't start with LINK, collapses the resulting empty lines, and finally removes the "LINK:" prefix.

Detailed HOWTO: (1) Make sure the links in your example HTML file are written as <a href rather than <href, (2) copy the above function into the Emacs *scratch* buffer, (3) hit C-x C-e after the final ")" to evaluate the definition, (4) open your example HTML file, (5) with that buffer current, execute the function with M-: (getlinks).

Note that the line breaks inside the third replace-regexp are significant: they are literal newline characters within the two strings, so don't indent those two lines.
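
If you would rather not modify the buffer, or your file may have several links on one line, the re-search-forward loop sketched in the question can be completed roughly as follows. This is only a minimal sketch, not the method used above: the function name my-extract-links is made up for illustration, and the regexp assumes double-quoted href values and link text that contains no nested tags.

```elisp
;; Hypothetical helper; the name and the regexp are illustrative assumptions.
(defun my-extract-links (file)
  "Return a list of \"url|description\" strings for the links in FILE."
  (with-temp-buffer
    (insert-file-contents file)
    (goto-char (point-min))
    (let ((case-fold-search t)          ; also match <A HREF=...>
          (links '()))
      ;; Group 1 is the href value, group 2 the text up to the closing </a>.
      (while (re-search-forward
              "<a[ \t\n]+[^>]*href=\"\\([^\"]+\\)\"[^>]*>\\([^<]*\\)</a>"
              nil t)
        (push (concat (match-string 1) "|" (match-string 2)) links))
      ;; push builds the list in reverse, so restore document order.
      (nreverse links))))
```

Evaluating (mapconcat #'identity (my-extract-links "test.html") "\n") with M-: then gives one url|description line per link; "test.html" here stands in for wherever your file actually lives. Because the file is read into a temporary buffer, the original buffer and file are left untouched.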