PHP – RegEx Match URL Pattern

Tags: domain-name, PHP, regex, regular expressions

I'm trying to come up with a regex pattern that will match any domains in this format:

example.com

but not this:

subdomain.example.com

Currently it needs to only cover the main TLDs (com, net, org), but I'd like it to be able to handle others (like co.uk, com.br, etc.) for flexibility.

So far I've got this, but it definitely needs some work:

^[^w].*\.[a-z]{3}.*$

Could a regex ninja help me out?

EDIT:
The regex will be used in PHP, and the string to match never begins with a protocol because of how the script is set up. I'd have to dig further into the script to explain why, but I believe it just grabs the host name from the PHP $_SERVER variable.

EDIT 2:
Perhaps this would work: match anything but a period, up to something like .xyz, .xyz.ab, or .xyz.abc:
^[^.]+(\.[^.]{3}|\.[^.]{2,3}\.[^.]{2,3}).*$

EDIT 3:
I've got the nearly completed pattern:
updated below (PHP requires delimiters, here / and /, at the beginning and end of the pattern)
Can anyone poke holes in the implementation? It appears to be working as expected.

EDIT 4:
This is where I'm currently at: updated below
It matches nearly what I want, but it requires the / that begins the file path, so example.com does not match while example.com/test does. I can't get it to match example.com without also matching the ".exa" in "www.example.com".

EDIT 5:
Ok, we've got a winner: /^[^.]+((\.[^.\/]{1,3}\b){1,2}).*$/

Matches:
example.com
example.co.uk
example.com/test.php?a=b
example.co.uk/test.php?a=b
123.com
1234.com
www.123.com (known gap: a URL with a subdomain still matches when its domain label is shorter than 4 characters)

Doesn't match:
www.example.com
www.example.co.uk
www.example.com/test.php?a=b
www.example.co.uk/test.php?a=b
test.example.com/test.php?a=b
test.example.co.uk/test.php?a=b
www.1234.com
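
As a quick check, the final pattern can be run against the two lists above. This is only a minimal sketch: the constant and function names (BARE_DOMAIN_PATTERN, isBareDomain) are illustrative and not part of the original script.

```php
<?php
// The pattern from EDIT 5, unchanged.
const BARE_DOMAIN_PATTERN = '/^[^.]+((\.[^.\/]{1,3}\b){1,2}).*$/';

// Hypothetical helper: true when the input passes the EDIT 5 pattern.
function isBareDomain(string $input): bool
{
    return preg_match(BARE_DOMAIN_PATTERN, $input) === 1;
}

// Inputs copied from the "Matches" list.
$shouldMatch = [
    'example.com',
    'example.co.uk',
    'example.com/test.php?a=b',
    'example.co.uk/test.php?a=b',
    '123.com',
    '1234.com',
    'www.123.com', // the known gap: short domain label after a subdomain
];

// Inputs copied from the "Doesn't match" list.
$shouldNotMatch = [
    'www.example.com',
    'www.example.co.uk',
    'www.example.com/test.php?a=b',
    'www.example.co.uk/test.php?a=b',
    'test.example.com/test.php?a=b',
    'test.example.co.uk/test.php?a=b',
    'www.1234.com',
];

// Print one line per input so any surprise is easy to spot.
foreach (array_merge($shouldMatch, $shouldNotMatch) as $input) {
    printf("%-35s %s\n", $input, isBareDomain($input) ? 'match' : 'no match');
}
```

The subdomain cases are rejected because \b fails between two word characters: after ".exa" in "www.example.com" the next character is "m", so there is no word boundary and the pattern cannot continue.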

Best Answer

What language are you using?

In general, it sounds like you want something that matches the basic parts of a domain while ruling out any period other than the one that delineates the .tld.

#http://[^.]+\.(com|net|org)#i

If you don't want to match the protocol, maybe something like this:

#[^. ]+\.(com|net|org)#i
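
Note that the pattern without the protocol is not anchored, so preg_match can find "example.com" inside "www.example.com". The sketch below compares it with an anchored variant. It assumes the input is a host name with no path; the ^ and $ anchors and the names UNANCHORED, ANCHORED and hostMatches are additions for illustration, not part of the pattern as posted.

```php
<?php
// The pattern as posted, and a variant anchored at both ends (an assumption).
const UNANCHORED = '#[^. ]+\.(com|net|org)#i';
const ANCHORED   = '#^[^. ]+\.(com|net|org)$#i';

// Hypothetical helper: true when $pattern matches somewhere in $host.
function hostMatches(string $pattern, string $host): bool
{
    return preg_match($pattern, $host) === 1;
}

foreach (['example.com', 'www.example.com'] as $host) {
    printf(
        "%-18s unanchored: %-5s anchored: %s\n",
        $host,
        hostMatches(UNANCHORED, $host) ? 'yes' : 'no',
        hostMatches(ANCHORED, $host) ? 'yes' : 'no'
    );
}
```

With the anchors in place, the whole string has to be one label followed by one of the listed TLDs, so the subdomain case is rejected.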

Your desire to handle multi-part TLDs will really complicate this: you will need to maintain a manual list of all the ones you want to match. The only alternative is to do DNS lookups to determine the listing type. There really isn't another way to separate the subdomain from the domain with a regular expression, because by rights domains are actually just subdomains of some TLD (top-level domain).
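
A manual list could be turned into a pattern along these lines. This is a rough sketch, not a definitive implementation: the suffix list is a small illustrative sample, the names ALLOWED_SUFFIXES, buildBareDomainPattern and hasNoSubdomain are hypothetical, and allowing an optional path after the domain is an assumption based on the inputs in the question.

```php
<?php
// Hand-maintained list of accepted suffixes (illustrative, not exhaustive).
const ALLOWED_SUFFIXES = ['com', 'net', 'org', 'co.uk', 'com.br'];

// Build an anchored pattern: one label, a dot, one listed suffix, optional path.
function buildBareDomainPattern(array $suffixes): string
{
    // Escape each suffix so the dot in "co.uk" is matched literally.
    $escaped = array_map(
        function (string $suffix): string {
            return preg_quote($suffix, '#');
        },
        $suffixes
    );

    return '#^[^./]+\.(?:' . implode('|', $escaped) . ')(?:/.*)?$#i';
}

// Hypothetical helper: true when $input is a single label plus a listed suffix.
function hasNoSubdomain(string $input, array $suffixes = ALLOWED_SUFFIXES): bool
{
    return preg_match(buildBareDomainPattern($suffixes), $input) === 1;
}

foreach (['example.co.uk', 'www.example.co.uk', 'www.123.com'] as $input) {
    printf("%-20s %s\n", $input, hasNoSubdomain($input) ? 'match' : 'no match');
}
```

Because the suffix is matched explicitly rather than guessed from its length, www.123.com is rejected here, at the cost of keeping the list up to date.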

Edit: To match TLDs on the assumption that each part has fewer than four characters, you can play around with something like the pattern below. You will have to work out what constitutes the start and end of a match. Are you requiring the presence of a protocol? Is this in a paragraph where somebody could type a period out of context? If you give more details on the parameters, we might be able to provide a more precise solution.

[^.]+((\.[^.]{0,3})+)
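
One way to settle where the match starts and ends is to cut the input at the first / and anchor the pattern to what is left. The sketch below assumes the input is a host name optionally followed by a path, with no protocol. The function name is hypothetical, and the pattern is a variation on the one above: {1,3} instead of {0,3} so that empty labels are rejected, and at most two short labels after the first.

```php
<?php
// Hypothetical helper: true when the host part has no subdomain,
// judged only by label length (each suffix part is 1 to 3 characters).
function hostHasNoSubdomain(string $input): bool
{
    // Keep only the part before the first slash (the host name).
    $host = explode('/', $input, 2)[0];

    // One label, then one or two short dot-separated labels, and nothing else.
    return preg_match('/^[^.]+(\.[^.]{1,3}){1,2}$/', $host) === 1;
}

foreach (['example.com', 'example.co.uk/test.php?a=b', 'www.example.com'] as $input) {
    printf("%-30s %s\n", $input, hostHasNoSubdomain($input) ? 'match' : 'no match');
}
```

The length heuristic has the same limits as before: www.123.com still passes, and a TLD longer than three characters, such as .info, is rejected.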