Using regular expressions is probably the best way. You can see a bunch of tests here (taken from Chromium):
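function validateEmail(email) {
    // Local part: dot-separated atoms or a quoted string.
    // Domain: a bracketed IPv4 literal or a dotted host name ending in 2+ letters.
    const re = /^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
    return re.test(String(email).toLowerCase());
}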
Here's an example of a regular expression that accepts Unicode:
const re = /^(([^<>()[\]\.,;:\s@\"]+(\.[^<>()[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})$/i;
But keep in mind that one should not rely only upon JavaScript validation. JavaScript can easily be disabled. This should be validated on the server side as well.
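On the server side, the same check can be repeated in whatever language the backend uses. Here is a minimal sketch in Java, assuming the first pattern above is the one you want to mirror. The class and method names are made up for illustration, and the square brackets inside the character classes are escaped because Java treats an unescaped [ inside a class as the start of a nested class:

```java
import java.util.regex.Pattern;

final class ServerSideEmailCheck {
    // Characters that may not appear in an unquoted local-part atom
    // (mirrors the character class of the JavaScript pattern above).
    private static final String ATOM = "[^<>()\\[\\]\\\\.,;:\\s@\"]+";

    private static final Pattern EMAIL = Pattern.compile(
        "((" + ATOM + "(\\." + ATOM + ")*)|(\".+\"))"                    // local part
        + "@"
        + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"  // [IPv4 literal]
        + "|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))");                     // or dotted host name

    // Syntax check only: it says nothing about whether the mailbox exists.
    static boolean looksLikeEmail(String input) {
        return input != null && EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com")); // true
        System.out.println(looksLikeEmail("user@example"));     // false
    }
}
```

matches() requires the whole input to match, so the ^ and $ anchors of the JavaScript version are not needed here.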
Here's an example of the above in action:
function validateEmail(email) {
    const re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
    return re.test(email);
}

function validate() {
    const $result = $("#result");
    const email = $("#email").val();
    $result.text("");

    if (validateEmail(email)) {
        $result.text(email + " is valid :)");
        $result.css("color", "green");
    } else {
        $result.text(email + " is not valid :(");
        $result.css("color", "red");
    }
    return false;
}

$("#email").on("input", validate);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<label for=email>Enter an email address:</label>
<input id="email">
<h2 id="result"></h2>
The fully RFC 822 compliant regex is inefficient and obscure because of its length. Fortunately, RFC 822 was superseded twice and the current specification for email addresses is RFC 5322. RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use.
One RFC 5322 compliant regex can be found at the top of the page at http://emailregex.com/ but it uses the IP address pattern that is floating around the internet with a bug that allows 00 for any of the unsigned byte decimal values in a dot-delimited address, which is illegal. The rest of it appears to be consistent with the RFC 5322 grammar and passes several tests using grep -Po, including cases with domain names, IP addresses, bad ones, and account names with and without quotes.
Correcting the 00 bug in the IP pattern, we obtain a working and fairly fast regex. (Scrape the rendered version, not the markdown, for actual code.)
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Here is a diagram of the finite state machine for the regex above, which is clearer than the regex itself.
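The effect of the 00 fix can be checked on the octet sub-pattern alone. In this small Java sketch, FIXED_OCTET is copied from the regex above, while LOOSE_OCTET is only an assumption about which widely copied pattern is meant (it is not quoted in this answer). The class name is illustrative.

```java
import java.util.regex.Pattern;

final class OctetCheck {
    // Octet alternative taken from the corrected regex above.
    static final Pattern FIXED_OCTET =
        Pattern.compile("(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])");

    // Assumed form of the commonly copied octet pattern (not shown in the text).
    static final Pattern LOOSE_OCTET =
        Pattern.compile("(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)");

    public static void main(String[] args) {
        // Print how each pattern treats a few candidate octets.
        for (String s : new String[] {"0", "00", "99", "199", "255", "256"}) {
            System.out.println(s
                + " fixed=" + FIXED_OCTET.matcher(s).matches()
                + " loose=" + LOOSE_OCTET.matcher(s).matches());
        }
    }
}
```

With the fixed alternative, a leading zero can only appear as the single digit 0, so 00 is rejected while 0 through 255 still pass.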
The more sophisticated patterns in Perl and PCRE (regex library used e.g. in PHP) can correctly parse RFC 5322 without a hitch. Python and C# can do that too, but they use a different syntax from those first two. However, if you are forced to use one of the many less powerful pattern-matching languages, then it’s best to use a real parser.
It's also important to understand that validating an address per the RFC tells you absolutely nothing about whether that address actually exists at the supplied domain, or whether the person entering the address is its true owner. People sign others up to mailing lists this way all the time. Fixing that requires a fancier kind of validation that involves sending that address a message that includes a confirmation token, which the recipient must then enter on the same web page where the address was entered.
Confirmation tokens are the only way to know you got the address of the person entering it. This is why most mailing lists now use that mechanism to confirm sign-ups. After all, anybody can put down president@whitehouse.gov, and that will even parse as legal, but it isn't likely to be the person at the other end.
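The token round trip can be sketched in a few lines of Java. This is only an illustration under simple assumptions: tokens are kept in memory, never expire, and the actual sending of the message is left out. EmailConfirmations, issueToken and confirm are hypothetical names.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class EmailConfirmations {
    private final SecureRandom random = new SecureRandom();
    // token -> address awaiting confirmation (in-memory only, an assumption of this sketch)
    private final Map<String, String> pending = new ConcurrentHashMap<>();

    // Creates an unguessable token for the address; the caller would mail it out.
    String issueToken(String email) {
        byte[] bytes = new byte[24];
        random.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        pending.put(token, email);
        return token;
    }

    // Returns the confirmed address, or empty if the token is unknown or already used.
    Optional<String> confirm(String token) {
        if (token == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(pending.remove(token));
    }

    public static void main(String[] args) {
        EmailConfirmations confirmations = new EmailConfirmations();
        String token = confirmations.issueToken("user@example.com");
        System.out.println(confirmations.confirm(token).isPresent()); // true
        System.out.println(confirmations.confirm(token).isPresent()); // false: single use
    }
}
```

A real deployment would also want an expiry time and persistent storage, which are omitted here.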
For PHP, you should not use the pattern given in Validate an E-Mail Address with PHP, the Right Way from which I quote:
There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard.
That is no better than all the other non-RFC patterns. It isn’t smart enough to handle even RFC 822, let alone RFC 5322. This one, however, is.
If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone that their perfectly valid e-mail address is invalid (a false positive) because your regular expression can't handle it is just rude from the user's perspective. A state engine for the purpose can both validate and even correct e-mail addresses that would otherwise be considered invalid, because it disassembles the e-mail address according to each RFC. This allows for a potentially more pleasing experience, such as:
The specified e-mail address 'myemail@address,com' is invalid. Did you mean 'myemail@address.com'?
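A full state engine is beyond a short example, but the "did you mean" idea can be shown with a much simpler heuristic: if the address fails a loose shape check, retry with commas in the domain part replaced by dots. The Java sketch below is that heuristic only, not an RFC parser. The LOOSE pattern and all names are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.regex.Pattern;

final class EmailSuggestion {
    // Deliberately loose shape check (an assumption of this sketch, not an RFC grammar).
    private static final Pattern LOOSE =
        Pattern.compile("[^\\s@,]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // Returns a corrected address if swapping ',' for '.' after the last '@' fixes the input.
    static Optional<String> suggest(String input) {
        if (input == null || LOOSE.matcher(input).matches()) {
            return Optional.empty(); // nothing to check, or nothing to correct
        }
        int at = input.lastIndexOf('@');
        if (at < 0) {
            return Optional.empty();
        }
        String fixed = input.substring(0, at) + input.substring(at).replace(',', '.');
        return LOOSE.matcher(fixed).matches() ? Optional.of(fixed) : Optional.empty();
    }

    public static void main(String[] args) {
        suggest("myemail@address,com")
            .ifPresent(s -> System.out.println("Did you mean '" + s + "'?"));
    }
}
```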
See also Validating Email Addresses, including the comments. Or Comparing E-mail Address Validating Regular Expressions.
Best Answer
The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match aⁿbⁿ. It uses a positive lookahead for assertion, and one nested reference for "counting".
Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this respect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.
The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.
Step 1: Lookahead for assertion
Let's start with a simpler problem: we want to match a+ at the beginning of a string, but only if it's followed immediately by b+. We can use ^ to anchor our match, and since we only want to match the a+ without the b+, we can use a lookahead assertion (?=…).
Here is our pattern with a simple test harness:
The output is (as seen on ideone.com):
This is exactly the output we want: we match a+, only if it's at the beginning of the string, and only if it's immediately followed by b+.
Lesson: You can use patterns in lookarounds to make assertions.
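For readers following along in Java rather than PHP, here is a minimal sketch of this step. The test strings and the class name are chosen for illustration and are not the original harness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadAssertion {
    // a+ at the start of the input, but only if b+ follows.
    // The lookahead checks for b+ without consuming it.
    static final Pattern STEP1 = Pattern.compile("^a+(?=b+)");

    // Returns the matched text, or null when the pattern does not match.
    static String match(String input) {
        Matcher m = STEP1.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aaa", "aaab", "aaaxb", "xaaab", "b", "abbb"}) {
            System.out.println(s + " -> " + match(s));
        }
    }
}
```

Only aaab and abbb match here, and in both cases the matched text contains the a run alone.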
Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)
Now let's say that even though we don't want the b+ to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's use the x modifier for free-spacing so we can make our regex more readable.
Building on our previous PHP snippet, we now have the following pattern:
The output is now (as seen on ideone.com):
Note that e.g. aaa|b is the result of join-ing what each group captured with '|'. In this case, group 0 (i.e. what the pattern matched) captured aaa, and group 1 captured b.
Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.
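The same step as a Java sketch, using the embedded flag (?x) in place of PHP's x modifier. The helper joins group 0 and group 1 with '|' to mimic the output format described above; the names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaptureInLookahead {
    // (?x) turns on free-spacing, so whitespace inside the pattern is ignored.
    static final Pattern STEP2 = Pattern.compile("(?x) ^ a+ (?= (b+) )");

    // Returns "group0|group1", or null when there is no match.
    static String groups(String input) {
        Matcher m = STEP2.matcher(input);
        return m.find() ? m.group() + "|" + m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aaa", "aaab", "abbb"}) {
            System.out.println(s + " -> " + groups(s));
        }
    }
}
```

Group 1 keeps what the lookahead saw even though the overall match stops before the first b.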
Step 3: Refactoring the lookahead into the "loop"
Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the + repetition "loop". This is fine so far because we just wanted to assert that there's a b+ following our a+, but what we really want to do eventually is assert that for each a that we match inside the "loop", there's a corresponding b to go with it.
Let's not worry about the counting mechanism for now and just do the refactoring as follows: first, refactor a+ to (?: a )+ (note that (?:…) is a non-capturing group); then move the lookahead inside that group. Since the lookahead now runs after each a, it has to skip the remaining a* before we can "see" the b+, so modify the pattern accordingly.
So we now have the following:
The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the + "loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.
Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.
Step 4: This is the step where we start counting
Here's what we're going to do: we'll rewrite group 1 such that at the end of the first iteration of the +, when the first a is matched, it should capture b; at the end of the second iteration, when another a is matched, it should capture bb; at the end of the third iteration, it should capture bbb; and so on. If there aren't enough b to capture into group 1, then the assertion simply fails.
So group 1, which is now (b+), will have to be rewritten to something like (\1 b). That is, we try to "add" a b to what group 1 captured in the previous iteration.
There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.
There are many ways around this, but for now let's just make the self-reference matching optional, i.e. \1?. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.
The output is now (as seen on ideone.com):
A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough bs, and somehow it counted wrong! We'll examine why this happened in the next step.
Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.
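Here is a Java sketch of the pattern at this stage, with the lookahead inside the loop and the optional self-reference in group 1. It relies on java.util.regex treating a backreference to a group that has not captured yet as a failed match. The inputs and names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SelfReferenceCount {
    // Each iteration matches one a, then looks past the remaining a's and tries to
    // capture "previous group 1 plus one more b". \1? makes the self-reference optional.
    static final Pattern STEP4 = Pattern.compile("(?x) ^ (?: a (?= a* (\\1? b) ) )+");

    // Returns "group0|group1", or null when there is no match.
    static String groups(String input) {
        Matcher m = STEP4.matcher(input);
        return m.find() ? m.group() + "|" + m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aabb", "aaabbbbb", "aaab", "aaaaabbb"}) {
            System.out.println(s + " -> " + groups(s));
        }
    }
}
```

The first two inputs count correctly (aa|bb and aaa|bbb). The last two show the miscount: aaab gives aaa|b and aaaaabbb gives aaaaa|bb.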
Step 4½: Understanding what went wrong
The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough b's. Let's closely examine what happens at every iteration of our pattern with aaaaabbb as input.
A-ha! On our 4th iteration, we could still match \1, but we couldn't match \1b! Since we allow the self-reference matching to be optional with \1?, the engine backtracks and takes the "no thanks" option, which then allows us to match and capture just b!
Do note, however, that except on the very first iteration, you could always match just the self-reference \1. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we captured bbb last time, we're guaranteed that there will still be bbb, but there may or may not be bbbb this time).
Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.
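One way to watch the counter reset is to cap the loop at exactly k iterations and look at group 1 each time. The Java sketch below does that. The {k} trick and the names are additions for illustration, not part of the original walkthrough.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IterationTrace {
    // Runs the Step 4 loop for exactly k iterations and returns what group 1 holds
    // afterwards, or null if k iterations are not possible.
    static String groupAfter(String input, int k) {
        Pattern p = Pattern.compile("(?x) ^ (?: a (?= a* (\\1? b) ) ){" + k + "}");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 6; k++) {
            System.out.println("after " + k + " iteration(s): " + groupAfter("aaaaabbb", k));
        }
    }
}
```

For aaaaabbb the captures run b, bb, bbb, then drop back to b on the 4th iteration and reach bb on the 5th. A 6th iteration is impossible because there is no sixth a.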
Step 5: Self-possession to the rescue!
The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply ?, use ?+ (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).
In very informal terms, this is what ?+, ? and ?? say:
In our setup, \1 will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus, \1?+ would accomplish exactly what we want.
Now the output is (as seen on ideone.com):
Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!
Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.
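Here is a Java sketch of the difference. The first part contrasts the three flavours of the optional quantifier on a tiny pattern chosen for illustration. The second part is the Step 4 pattern with \1? changed to \1?+. All names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveOptional {
    // Same loop as before, but the self-reference is optional AND possessive.
    static final Pattern STEP5 = Pattern.compile("(?x) ^ (?: a (?= a* (\\1?+ b) ) )+");

    // Returns "group0|group1", or null when there is no match.
    static String groups(String input) {
        Matcher m = STEP5.matcher(input);
        return m.find() ? m.group() + "|" + m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy and reluctant can give the b back (or never take it); possessive cannot.
        System.out.println(Pattern.matches("ab?b", "ab"));   // true
        System.out.println(Pattern.matches("ab??b", "ab"));  // true
        System.out.println(Pattern.matches("ab?+b", "ab"));  // false

        for (String s : new String[] {"aabb", "aaab", "aaaaabbb"}) {
            System.out.println(s + " -> " + groups(s));
        }
    }
}
```

With the possessive version, aaab now yields a|b and aaaaabbb yields aaa|bbb: the loop stops as soon as a matching b is missing instead of resetting the counter.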
Step 6: Finishing touches
So what we have right now is a pattern that matches a repeatedly, and for every a that was matched, there is a corresponding b captured in group 1. The + terminates when there are no more a, or if the assertion failed because there isn't a corresponding b for an a.
To finish the job, we simply need to append to our pattern \1 $. This is now a back reference to what group 1 matched, followed by the end of the line anchor. The anchor ensures that there aren't any extra b's in the string; in other words, that in fact we have aⁿbⁿ.
Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:
It finds 4 matches: ab, aabb, aaabbb, and the a⁵⁰⁰⁰b⁵⁰⁰⁰. It takes only 0.06s to run on ideone.com.
Step 7: The Java test
So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.
The pattern works as expected (as seen on ideone.com).
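Here is a sketch of what such a Java test might look like. It assumes Matcher.matches() supplies the anchoring, so ^ and $ are dropped from the pattern. The class and method names are illustrative, and the inputs are kept short.

```java
import java.util.regex.Pattern;

final class AnBnRegex {
    // Count b's in group 1 while consuming a's, then require exactly that run of b's.
    static final Pattern AN_BN = Pattern.compile("(?x) (?: a (?= a* (\\1?+ b) ) )+ \\1");

    // True only when the whole input is n a's followed by n b's, n >= 1.
    static boolean isAnBn(String input) {
        return AN_BN.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "ab", "aab", "abb", "aabb", "aaabbb", "ba"}) {
            System.out.println("[" + s + "] matches: " + isAnBn(s));
        }
    }
}
```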
And now we come to the conclusion...
It needs to be said that the a* in the lookahead, and indeed the "main + loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).
It should also be said that while it's neat that there's a regex pattern that will match aⁿbⁿ, this is not always the "best" solution in practice. A much better solution is to simply match ^(a+)(b+)$, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.
In PHP, it may look something like this (as seen in ideone.com):
The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.
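The length-comparison approach described above, sketched in Java rather than PHP (the names are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnBnByLength {
    private static final Pattern SHAPE = Pattern.compile("^(a+)(b+)$");

    // The regex checks the shape; the host language compares the two run lengths.
    static boolean isAnBn(String input) {
        Matcher m = SHAPE.matcher(input);
        return m.matches() && m.group(1).length() == m.group(2).length();
    }

    public static void main(String[] args) {
        System.out.println(isAnBn("aabb")); // true
        System.out.println(isAnBn("aab"));  // false
    }
}
```

This keeps the pattern trivial to read and leaves the counting to ordinary code.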
As mentioned at the top, while this article is necessarily tagged [regex] for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.
Bonus material! PCRE recursive pattern!
Since we did bring up PHP, it needs to be said that PCRE supports recursive pattern and subroutines. Thus, the following pattern works for preg_match (as seen on ideone.com):
Currently Java's regex does not support recursive pattern.
Even more bonus material! Matching aⁿbⁿcⁿ !!
So we've seen how to match aⁿbⁿ which is non-regular, but still context-free, but can we also match aⁿbⁿcⁿ, which isn't even context-free?
The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).
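One pattern that works with the same building blocks is sketched below in Java: a second self-referencing group counts the c's the same way group 1 counts the b's. Treat it as an illustration under the assumptions in the comments, not necessarily the linked listing. The names are made up.

```java
import java.util.regex.Pattern;

final class AnBnCnRegex {
    // For every a consumed, the lookahead grows group 1 by one b and group 2 by one c.
    // After the loop, the rest of the input must be exactly group 1 followed by group 2.
    static final Pattern AN_BN_CN = Pattern.compile(
        "(?x) (?: a (?= a* (\\1?+ b) b* (\\2?+ c) ) )+ \\1 \\2");

    // Relies on matches() for anchoring, so no ^ or $ is needed.
    static boolean isAnBnCn(String input) {
        return AN_BN_CN.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "aabbcc", "aabbc", "aabbbcc"}) {
            System.out.println(s + " matches: " + isAnBnCn(s));
        }
    }
}
```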