I've got a business requirement to run through a list of URLs and identify the ones that return an error. I've written a simple script that fetches only the headers for each URL, since I don't care about the content. I just want to know if there's an error fetching the content. In some cases, my script gets a 503 error for a URL that does return content. Here's one example.
$ curl --head https://www.eia.gov/consumption/
HTTP/1.1 503 Service Unavailable
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 175
Expires: Fri, 05 Jan 2018 21:32:47 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Fri, 05 Jan 2018 21:32:47 GMT
Connection: keep-alive
Running the same curl command without the "--head" part returns a page of HTML, and it's not an error page. It's relevant content. So, that 503 error is misleading.
Is this a misconfigured web server returning an incorrect response header or am I missing something?
The real question is this: is there a reliable way to determine whether a URL returns valid content or an error? The presence of HTML is useful in this case, but I wouldn't count on getting HTML back to mean there is no error. A 404 is the classic case: I get a page of HTML, but the status code tells me that the page wasn't found.
Best Answer
The --head option makes curl send an actual HTTP HEAD request. Some servers might not honor this, or might not route it the same way as the HTTP GET request a browser would send. Using the -i option instead prints the response headers but still sends a GET request. This also returns the entire body of the response. You can cut the output down to the first line, which contains only the protocol version and response status, by running curl with -s -i and piping it through head -n 1.

(The -s option for curl suppresses the download progress display that is triggered by piping curl to another process. The -n option on head is the number of lines to return.)

How to determine success depends on your definition of "valid". HTTP defines the 200 range as success and the 300 range as redirection, so neither indicates an error. If you want to detect success on that basis, you can pipe the status line through grep with a regular expression that matches any status code starting with 2 or 3. Make sure you don't try to match on the HTTP protocol version, as it may not always be the same.
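As a minimal sketch of the two steps described above, the commands could be wrapped in small shell functions. The function names status_line and is_success are made up for illustration, and the pattern assumes the status line has the form "HTTP/<version> <code> ...". The URL in the commented usage is the one from the question.

```shell
#!/bin/sh

# Print only the status line of a GET response, e.g. "HTTP/1.1 200 OK".
# -s hides the progress meter, -i includes the response headers in the output,
# and head -n 1 keeps just the first line.
status_line() {
    curl -s -i "$1" | head -n 1
}

# Exit with status 0 when the given status line carries a 2xx or 3xx code.
# The pattern accepts any protocol version (HTTP/1.1, HTTP/2, ...) and only
# inspects the three-digit code that follows it.
is_success() {
    printf '%s\n' "$1" | grep -Eq '^HTTP/[0-9.]+ [23][0-9][0-9]([^0-9]|$)'
}

# Example usage (performs a real request, so it is left commented out):
#   if is_success "$(status_line https://www.eia.gov/consumption/)"; then
#       echo "looks fine"
#   else
#       echo "error"
#   fi
```

Because curl is run without -L here, a redirect is reported as its own 3xx status instead of being followed.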
Once you have the line returned by curl and head, there are endless possibilities for processing, formatting, and returning the results, depending on what you actually need.
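For the list-of-URLs case from the question, one possible way to process the results is sketched below. It is only an illustration: the function names fetch_status_line and check_urls, the tab-separated output format, and the 10-second timeout are all assumptions, not part of the original answer.

```shell
#!/bin/sh

# Hypothetical helper: fetch the status line for one URL with a GET request.
# --max-time caps the whole transfer (10 seconds is an arbitrary choice);
# tr strips the trailing carriage return that HTTP header lines carry.
fetch_status_line() {
    curl -s -i --max-time 10 "$1" </dev/null | head -n 1 | tr -d '\r'
}

# Read URLs from standard input, one per line, and print a tab-separated
# verdict, URL, and status line for each of them.
check_urls() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue    # skip blank lines
        line=$(fetch_status_line "$url")
        if printf '%s\n' "$line" | grep -Eq '^HTTP/[0-9.]+ [23][0-9][0-9]'; then
            printf 'OK\t%s\t%s\n' "$url" "$line"
        else
            # An empty line means curl got no response at all.
            printf 'ERROR\t%s\t%s\n' "$url" "${line:-no response}"
        fi
    done
}
```

With the URLs stored one per line in a file (urls.txt is a placeholder name), this could be run as: check_urls < urls.txt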