Python – How to split a url string up into separate parts in Python

Tags: parsing, python, url

I decided to learn Python tonight 🙂
I know C pretty well (I wrote an OS in it), so I'm not new to programming and most of Python seems pretty easy, but I don't know how to solve this problem.
Let's say I have this address:

http://example.com/random/folder/path.html
How can I create two strings from this? One should contain the "base" name of the server, which in this example would be
http://example.com/
and the other should contain everything except the final filename, which in this example would be
http://example.com/random/folder/
I know I could just find the third slash and the last slash respectively, but maybe you know a better way :]
It would also be nice to keep the trailing slash in both cases, but I don't mind if it's missing, since it can be added easily.
Does anyone have a good, fast, effective solution for this? Or is "my" solution of finding the slashes the only one?

Thanks!

Best Answer

The urlparse module in Python 2.x (or urllib.parse in Python 3.x) is the way to do it.

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'

If you want to do more work on the path portion of the URL, you can use the posixpath module:

>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'

After that, you can use posixpath.join to glue the parts together.
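
For the two strings asked for in the question, here is a minimal sketch of how the parsed pieces might be put back together. The helper name split_url is made up for illustration, and it uses urlunparse from urllib.parse (rather than posixpath.join) so that the scheme and host are reattached in one step; it also drops any query string or fragment, which is an assumption about what you want.

```python
from posixpath import dirname
from urllib.parse import urlparse, urlunparse


def split_url(url):
    """Return (base_url, folder_url), both with a trailing slash.

    split_url is a hypothetical helper, not part of the standard library.
    """
    parts = urlparse(url)

    # Server "base": scheme and host with a bare "/" path.
    base = urlunparse((parts.scheme, parts.netloc, "/", "", "", ""))

    # Directory part of the path, with the trailing slash added back
    # (dirname strips it unless the directory is the root "/").
    folder_path = dirname(parts.path)
    if not folder_path.endswith("/"):
        folder_path += "/"
    folder = urlunparse((parts.scheme, parts.netloc, folder_path, "", "", ""))

    return base, folder


base, folder = split_url("http://example.com/random/folder/path.html")
print(base)    # http://example.com/
print(folder)  # http://example.com/random/folder/
```

Because posixpath always treats "/" as the separator, this should behave the same on Windows as anywhere else.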

EDIT: I totally forgot that Windows users would choke on the path separator in os.path, since os.path uses backslashes there. I read the posixpath module docs, and they make a special reference to URL manipulation, so all's good.