Python – How to split a url string up into separate parts in Python

parsingpythonurl

I decided that I'll learn python tonight 🙂
I know C pretty well (wrote an OS in it) so I'm not a noob in programming so everything in python seems pretty easy, but I don't know how to solve this problem :
let's say I have this address:

http://example.com/random/folder/path.html
Now how can I create two strings from this, one containing the "base" name of the server, so in this example it would be
http://example.com/
and another containing the thing without the last filename, so in this example it would be
http://example.com/random/folder/
.
Also I of course know the possibility to just find the 3rd and last slash respectively but maybe you know a better way :]
Also it would be cool to have the trailing slash in both cases but I don't care since it can be added easily.
So anyone has a good, fast, effective solution for this? Or is there only "my" solution, finding the slashes?

Thanks!

Best Answer

The urlparse module in python 2.x (or urllib.parse in python 3.x) would be the way to do it.

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
>>>

If you wanted to do more work on the path of the file under the url, you can use the posixpath module :

>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'

After that, you can use posixpath.join to glue the parts together.

EDIT: I totally forgot that windows users will choke on the path separator in os.path. I read the posixpath module docs, and it has a special reference to URL manipulation, so all's good.

Related Solutions

Python – How to execute a program or call a system command

Use the subprocess module in the standard library:

import subprocess
subprocess.run(["ls", "-l"])

The advantage of subprocess.run over os.system is that it is more flexible (you can get the stdout, stderr, the "real" status code, better error handling, etc...).

Even the documentation for os.system recommends using subprocess instead:

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

On Python 3.4 and earlier, use subprocess.call instead of .run:

subprocess.call(["ls", "-l"])

Python – How to safely create a nested directory in Python

On Python ≥ 3.5, use pathlib.Path.mkdir:

from pathlib import Path
Path("/my/directory").mkdir(parents=True, exist_ok=True)

For older versions of Python, I see two answers with good qualities, each with a small flaw, so I will give my take on it:

Try os.path.exists, and consider os.makedirs for the creation.

import os
if not os.path.exists(directory):
    os.makedirs(directory)

As noted in comments and elsewhere, there's a race condition – if the directory is created between the os.path.exists and the os.makedirs calls, the os.makedirs will fail with an OSError. Unfortunately, blanket-catching OSError and continuing is not foolproof, as it will ignore a failure to create the directory due to other factors, such as insufficient permissions, full disk, etc.

One option would be to trap the OSError and examine the embedded error code (see Is there a cross-platform way of getting information from Python’s OSError):

import os, errno

try:
    os.makedirs(directory)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise

Alternatively, there could be a second os.path.exists, but suppose another created the directory after the first check, then removed it before the second one – we could still be fooled.

Depending on the application, the danger of concurrent operations may be more or less than the danger posed by other factors such as file permissions. The developer would have to know more about the particular application being developed and its expected environment before choosing an implementation.

Modern versions of Python improve this code quite a bit, both by exposing FileExistsError (in 3.3+)...

try:
    os.makedirs("path/to/directory")
except FileExistsError:
    # directory already exists
    pass

...and by allowing a keyword argument to os.makedirs called exist_ok (in 3.2+).

os.makedirs("path/to/directory", exist_ok=True)  # succeeds even if directory exists.

Best Answer

Related Solutions

Python – How to execute a program or call a system command

Python – How to safely create a nested directory in Python

Related Topic