How to search the Internet Archive’s Wayback Machine by URL prefix or pattern

wayback-machine

I know it is possible to search by domain, as mentioned at "How to search Internet Archive for all pages from a particular domain", but this is not always feasible for huge domains like Twitter, where there are far too many results for a given date.

For example I spent a long time looking for: https://web.archive.org/web/20191005025317if_/https://twitter.com/dmorey/status/1180312072027947008

I knew the user's URL, https://twitter.com/dmorey, and the approximate date from news reports, but not the Tweet URL, since the Tweet had been deleted. What I would like is to search for any URL under https://twitter.com/dmorey

I tried adding an asterisk to the URL, https://twitter.com/dmorey/*, as in http://web.archive.org/web/20190615000000*/https://twitter.com/dmorey/* but that returns no results.

How I ended up finding the page: http://archive.is has the asterisk feature, which pointed me to the desired exact URL.

I also asked them on Twitter at: https://twitter.com/cirosantilli/status/1312690953875083265

Best Answer

I found a way. It is not amazing, but it would have worked. You can construct a search query URL of the form:

http://web.archive.org/cdx/search/cdx?url=twitter.com/dmorey/status&matchType=prefix&limit=1000&from=20191001&to=20191007

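Here matchType=prefix makes it match every archived URL that starts with the given url value, and from and to restrict the results to captures within the given date range (here 1 to 7 October 2019).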
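If you need to run this kind of query for several users or date ranges, the URL can be assembled programmatically. The following is only a minimal sketch: the helper name build_cdx_query is my own invention, and it uses only the parameters already shown above (url, matchType, limit, from, to).

```python
from urllib.parse import urlencode

# Endpoint taken from the query URL shown above.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix, date_from, date_to, limit=1000):
    """Build a CDX prefix-search URL (hypothetical helper, not part of any library)."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",  # match every URL starting with url_prefix
        "limit": limit,         # cap the number of returned lines
        "from": date_from,      # start of the capture date range
        "to": date_to,          # end of the capture date range
    }
    # safe="/" keeps the slashes in the prefix readable instead of %2F
    return CDX_ENDPOINT + "?" + urlencode(params, safe="/")


query = build_cdx_query("twitter.com/dmorey/status", "20191001", "20191007")
print(query)
```

The printed string can be opened in a browser or fetched with any HTTP client.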

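This returns a plain-text list with one line per archived capture. According to the README linked below, the default fields are the URL key, timestamp, original URL, MIME type, HTTP status code, digest and length. One of the lines was the capture I wanted: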

com,twitter)/dmorey/status/1180312072027947008 20191005025317 https://twitter.com/dmorey/status/1180312072027947008 text/html 200 L7ZPXVHCUPLD64RKA5SSIXAQ3I6Q56GM 27230
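To get from such a line back to a viewable page, you can combine the timestamp and the original URL into a normal Wayback Machine link. Here is a rough sketch, assuming the seven space-separated fields that appear in the line above; the Capture type and both helper names are hypothetical, made up for this example.

```python
from typing import NamedTuple


class Capture(NamedTuple):
    """One line of CDX output, assuming the seven fields seen above."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line):
    # Fields are separated by single spaces; split() handles that.
    return Capture(*line.split())


def replay_url(capture):
    # Wayback Machine links have the form /web/<timestamp>/<original URL>
    return f"https://web.archive.org/web/{capture.timestamp}/{capture.original}"


line = (
    "com,twitter)/dmorey/status/1180312072027947008 20191005025317 "
    "https://twitter.com/dmorey/status/1180312072027947008 "
    "text/html 200 L7ZPXVHCUPLD64RKA5SSIXAQ3I6Q56GM 27230"
)
capture = parse_cdx_line(line)
print(replay_url(capture))
```

For the line above, this gives the same capture as the link in the question, minus the if_ suffix after the timestamp.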

I found this by Googling my way to https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md which documents those search parameters.