I need to collect information from webpages using Python from a Linux terminal. It works well, but some pages (not all of them) return errors when I call requests.get because they inspect the client's user agent and don't know how to answer my request (I'm not a browser or a mobile application, just a script in a Linux terminal).
Setting the "User-Agent" header didn't work either. I tried several ways of sending it to make my request look like it comes from a Mozilla browser:
user_agent = {'User-Agent': 'Mozilla/5.0'}
or
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4'}
or many other combinations.
On some servers, when I use this line:
page = requests.get(url, headers=user_agent)
I get a bad request, because these servers try to send me a webpage tailored for desktop or mobile browsers and fail to identify my client.
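Before concluding the URL is invalid, it can help to look at what the server actually sends back. A minimal diagnostic sketch (the URL and helper name here are placeholders, not from the original question):

```python
import requests

def diagnose(url: str, user_agent: str = "Mozilla/5.0") -> int:
    """Fetch a page and print what the server returns, to see why it rejects us."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    print(response.status_code)   # e.g. 400 or 403 when the user agent is rejected
    print(response.text[:200])    # the error body often says why the request was blocked
    return response.status_code
```

Seeing a 403 or a block page here, rather than a connection error, is a strong hint that the server is filtering on the user agent rather than failing to resolve the URL.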
Am I doing something wrong by sending a User-Agent this way? I tried my code in a Python notebook and it worked perfectly, since there I am (of course) sending the request from a browser.
Best Answer
You are using a very old user agent, and some sites will indeed block you because of it.
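A sketch of the same request with a more recent user agent string (the exact string below is an example matching a current desktop Chrome release; any up-to-date browser string should behave similarly, and you should refresh it periodically as browsers update):

```python
import requests

# A modern desktop browser user agent (example string; keep it up to date)
user_agent = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def fetch(url: str) -> str:
    """Fetch a page with browser-like headers; raise on HTTP 4xx/5xx."""
    response = requests.get(url, headers=user_agent, timeout=10)
    response.raise_for_status()
    return response.text
```

The Firefox 1.0.4 string from 2005 in the question is two decades old; servers that fingerprint clients will often reject it outright or serve a broken fallback page.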