Below are the declared variables for the 3 request objects I use in my macros. The libraries they need and their late-binding equivalents are listed in comments:
Dim XMLHTTP As New MSXML2.XMLHTTP 'Microsoft XML, v6.0 'Set XMLHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
Dim ServerXMLHTTP As New MSXML2.ServerXMLHTTP 'Microsoft XML, v6.0 'Set ServerXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.6.0")
Dim http As New WinHttpRequest 'Microsoft WinHttp Services, version 5.1 'Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
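For reference, all three objects expose the same basic Open / send / responseText pattern; a minimal synchronous GET looks like this (a sketch with a placeholder URL):

Sub BasicGetSketch()
    Dim XMLHTTP As New MSXML2.XMLHTTP ' swap in ServerXMLHTTP or WinHttpRequest the same way
    Dim html As String

    XMLHTTP.Open "GET", "http://www.example.com/", False ' False = synchronous
    XMLHTTP.send
    If XMLHTTP.Status = 200 Then html = XMLHTTP.responseText
    Debug.Print Len(html) & " chars received"
End Sub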
I have a few old web-scraping macros which used Internet Explorer automation. I wanted to clean up the code and speed it up with these requests.
Unfortunately, I have noticed that MSXML2.ServerXMLHTTP and WinHttpRequest are slower on an online store's 20-product test (34 and 35 sec) than IE automation with pictures and active scripting off (24 sec)! MSXML2.XMLHTTP executes in 18 sec. I have seen situations where one of these 3 requests is 2-3 times faster or slower than the others, so I always test which one performs best, but never before had any request lost to IE automation.
The main page with results is below. All results are on one page, 1500+ of them, so the request takes some time (6500 pages if pasted into MS Word):
www.justbats.com/products/bat type~baseball/?sortBy=TotalSales Descending&page=1&size=2400
Then I open individual links from the main results page:
http://www.justbats.com/product/2017-marucci-cat-7-bbcor-baseball-bat--mcbc7/24317/
I would like to know if these 3 requests are all the options I have to get data from websites without browser automation. Also, how can browser automation possibly beat some of these requests?
UPDATE
I have tested the main results page with the procedure provided in the answer by Robin Mackenzie, clearing the IE cache before running it. At least on this particular page, caching seemed to bring no clear gain, as subsequent requests yielded similar results. IE had active scripting disabled and no images loading.
IE automation method, Document length: 7593346 chars, Processed in: 8 seconds
WinHTTP method, Document length: 7824059 chars, Processed in: 29 seconds
XML HTTP method, Document length: 7830217 chars, Processed in: 4 seconds
Server XML HTTP method, Document length: 7823958 chars, Processed in: 26 seconds
URL download file method, Document length: 7830346 chars, Processed in: 7 seconds
Very surprising to me is the difference in the number of characters returned by these methods.
Best Answer
In addition to the methods you've mentioned, there are 2 other methods you can think about:
the CreateDocumentFromUrl method of the MSHTML.HTMLDocument object
URLDownloadToFileA
There are some other Windows APIs that I am ignoring, such as InternetOpen, InternetOpenUrl etc., as any potential performance gain would be outweighed by the complexity of guessing the response length, buffering the response, and so forth.
CreateDocumentFromUrl
With the CreateDocumentFromUrl method there is a problem with your sample website: it attempts to create an HTMLDocument in a frame, which is not allowed and fails with access-denied errors. So we should not use this method.
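For reference, the call pattern that triggers this looks roughly as follows (a sketch with a placeholder URL):

Sub CreateDocFromUrlSketch()
    Dim factory As New MSHTML.HTMLDocument
    Dim doc As MSHTML.HTMLDocument

    ' createDocumentFromUrl loads the page asynchronously into a new document
    Set doc = factory.createDocumentFromUrl("http://www.example.com/", vbNullString)
    Do Until doc.readyState = "complete" ' wait for the load to finish
        DoEvents
    Loop
    Debug.Print Len(doc.body.innerHTML) & " chars received"
End Sub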
URLDownloadToFileA
I thought you needed the PHP equivalent of file_get_contents and found this method. It is easily used (check this link) and out-performs the other methods on a large request (e.g. try it when you go for >2000 baseball bats). The XMLHTTP method also uses the URLMon library, so I guess this way just cuts out a bit of middle-man logic; the obvious downside is that you have to do some file-system handling.
With URLDownloadToFileA it takes me about 1-2 seconds to download your sample URL versus 4-5 seconds with the XMLHTTP method (full code below).
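A sketch of the declaration and a minimal wrapper (the placeholder URL and temp-file path are my own choices; 64-bit Office needs PtrSafe and LongPtr in the declaration):

Private Declare Function URLDownloadToFileA Lib "urlmon" _
    (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
     ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub DownloadToFileSketch()
    Dim target As String
    target = Environ$("TEMP") & "\page.html" ' arbitrary temp file

    ' returns 0 (S_OK) on success
    If URLDownloadToFileA(0&, "http://www.example.com/", target, 0&, 0&) = 0 Then
        Debug.Print "Saved to " & target
    Else
        Debug.Print "Download failed"
    End If
End Sub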
Code
This includes all the methods discussed, i.e. IE automation, WinHttpRequest, XMLHTTP, ServerXMLHTTP, CreateDocumentFromUrl and URLDownloadToFile.
You need all these references in the project: Microsoft XML, v6.0; Microsoft WinHttp Services, version 5.1; Microsoft Internet Controls; Microsoft HTML Object Library.
Here it is:
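What follows is a condensed sketch of such a harness rather than a full listing: the URL is a placeholder, the Report helper is my own, and each method is timed against the same page so the output matches the format of the UPDATE above.

' 32-bit declaration; 64-bit Office needs PtrSafe and LongPtr
Private Declare Function URLDownloadToFileA Lib "urlmon" _
    (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
     ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub CompareMethods()
    Const URL As String = "http://www.example.com/" ' substitute the results-page URL
    Dim t As Single, html As String

    ' IE automation
    t = Timer
    Dim ie As New SHDocVw.InternetExplorer
    ie.navigate URL
    Do Until ie.readyState = READYSTATE_COMPLETE
        DoEvents
    Loop
    html = ie.document.body.innerHTML
    ie.Quit
    Report "IE automation method", html, t

    ' WinHTTP
    t = Timer
    Dim req As New WinHttpRequest
    req.Open "GET", URL, False
    req.send
    Report "WinHTTP method", req.responseText, t

    ' XML HTTP
    t = Timer
    Dim xhr As New MSXML2.XMLHTTP
    xhr.Open "GET", URL, False
    xhr.send
    Report "XML HTTP method", xhr.responseText, t

    ' Server XML HTTP
    t = Timer
    Dim sxhr As New MSXML2.ServerXMLHTTP
    sxhr.Open "GET", URL, False
    sxhr.send
    Report "Server XML HTTP method", sxhr.responseText, t

    ' URL download file
    t = Timer
    Dim target As String, f As Integer
    target = Environ$("TEMP") & "\page.html"
    If URLDownloadToFileA(0&, URL, target, 0&, 0&) = 0 Then
        f = FreeFile
        Open target For Binary As #f
        html = Space$(LOF(f)) ' size the buffer to the file, then read it back
        Get #f, , html
        Close #f
    End If
    Report "URL download file method", html, t
End Sub

Private Sub Report(method As String, doc As String, started As Single)
    Debug.Print method & ", Document length: " & Len(doc) & _
                " chars, Processed in: " & Format$(Timer - started, "0") & " seconds"
End Sub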