Regular expressions and multiple writing systems

Types of writing systems:

  • Alphabet
  • Abjad
  • Abugida
  • Syllabary
  • Logography

In a regular expression we have to specify which characters we want to accept.

We use something like a-zA-Z0-9 to say that we accept all alphanumeric characters.

How can we write regular expressions that validate writing systems other than the Latin alphabet?
(How can I write a regular expression that will validate Chinese, Indian, Greek, Russian, or some other script?)

UPDATE:

I'm using the ASP.NET regex engine.

If you don't mind, could you provide me some examples?

Thanks

Best Answer

What regex engine are you using? Both Java and .NET support Unicode categories and named blocks. A named block such as Greek is written \p{InGreek} in Java and \p{IsGreek} in .NET.
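Since the question is about .NET, here is a minimal sketch of the named-block approach in C#. The class name and sample strings are only illustrative. It uses \p{IsGreek}, the .NET name for the block U+0370 to U+03FF, and the sample text is written with \u escapes so that it does not depend on the source file's encoding.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedBlockDemo
{
    static void Main()
    {
        // .NET names Unicode blocks with an "Is" prefix: \p{IsGreek}.
        // Anchored with ^ and $ so the whole input must be Greek.
        var greekOnly = new Regex(@"^\p{IsGreek}+$");

        // "\u03B1\u03B2\u03B3" is the Greek text alpha, beta, gamma.
        Console.WriteLine(greekOnly.IsMatch("\u03B1\u03B2\u03B3")); // True
        Console.WriteLine(greekOnly.IsMatch("abc"));                // False
    }
}
```

Other blocks follow the same pattern, for example \p{IsCyrillic} or \p{IsHebrew}.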

Another solution, which is perhaps more generic, is to use Unicode ranges. This page contains a list of several well-known Unicode ranges. For instance, to match a Tibetan character you would use [\u0F00-\u0FFF]. To match either a Tibetan character or a basic Latin letter, you could use [A-Za-z\u0F00-\u0FFF], and so on.

If you want to match several languages, you can use the page I mentioned to look up each language's Unicode range and combine them. For example, the range [\u0370-\u06FF] covers Greek, Cyrillic (used for Russian and other Slavic languages), Armenian, Hebrew and Arabic. If you need more, just add ranges until all the languages you need are covered.
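As a rough illustration of the range approach in C#, the sketch below builds the two character classes discussed above. The class and variable names are made up for the example. Keep in mind that a block range also contains digits, punctuation and unassigned code points, so it is looser than "letters of that script".

```csharp
using System;
using System.Text.RegularExpressions;

class RangeDemo
{
    static void Main()
    {
        // Basic Latin letters plus the Tibetan block (U+0F00 to U+0FFF).
        var latinOrTibetan = new Regex(@"^[A-Za-z\u0F00-\u0FFF]+$");

        // One contiguous range running from the Greek block to the Arabic block.
        var greekToArabic = new Regex(@"^[\u0370-\u06FF]+$");

        // "abc" followed by U+0F40 (TIBETAN LETTER KA).
        Console.WriteLine(latinOrTibetan.IsMatch("abc\u0F40"));        // True

        // Three Cyrillic letters (U+041F, U+0440, U+0438).
        Console.WriteLine(greekToArabic.IsMatch("\u041F\u0440\u0438")); // True

        // Plain Latin letters are outside U+0370 to U+06FF.
        Console.WriteLine(greekToArabic.IsMatch("abc"));               // False
    }
}
```

Non-contiguous ranges can be listed in the same class, for example [\u0370-\u03FF\u0590-\u05FF] for Greek and Hebrew only.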


EDIT: Based on your comments, you can just use the following expression:

@"\p{L}{4,10}"

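\p{L} matches a letter from any language (some other engines, such as Perl, also accept the long form \p{Letter}, but .NET only accepts the short form), so the above expression matches a run of 4 to 10 letters from any language. To validate a whole input rather than find such a run inside it, anchor the expression: ^\p{L}{4,10}$.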
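Here is a small sketch of how that expression behaves in C#, with and without anchors. The class name and test strings are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class LetterDemo
{
    static void Main()
    {
        // Unanchored: succeeds if 4 to 10 letters occur anywhere in the input.
        var containsLetters = new Regex(@"\p{L}{4,10}");

        // Anchored: the whole input must consist of 4 to 10 letters.
        var onlyLetters = new Regex(@"^\p{L}{4,10}$");

        // Six Greek letters.
        Console.WriteLine(onlyLetters.IsMatch("\u0395\u03BB\u03BB\u03AC\u03B4\u03B1")); // True

        // Four Chinese characters.
        Console.WriteLine(onlyLetters.IsMatch("\u4E2D\u6587\u5B57\u7B26"));             // True

        // Too short, and containing digits, respectively.
        Console.WriteLine(onlyLetters.IsMatch("abc"));      // False
        Console.WriteLine(onlyLetters.IsMatch("abcd1234")); // False

        // The unanchored version accepts input with digits around the letters.
        Console.WriteLine(containsLetters.IsMatch("12abcd34")); // True
    }
}
```

One caveat: \p{L} does not match combining marks, which many Indic scripts (for example the vowel signs of Devanagari) rely on. If such text has to pass validation, a class like [\p{L}\p{M}] is probably closer to what you want.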