Python Regular Expressions – Flavor and Usage

pythonregular expressions

So my copy of the classic book, Mastering Regular Expressions, just arrived, and I'm scanning through it. The cover (third edition) says, "for Perl, PHP, Java, .NET, Ruby, and More!" Well, it does have a full chapter for each of Perl, PHP, Java, and .NET – but no chapter on Ruby, or, more importantly to me, on Python.

Although the index lists a few points of Python's differences (such as how it handles newlines), there's not even a table of them, much less an entire chapter.

Which of the four covered languages, is Python's regex engine most similar to? More importantly, how similar, and what are the major differences?

Note – I don't mean how regexes are used in Python (versus, say Perl) – I know and use the re module. I'm only interested in the syntax differences of the actual regexes.

Best Answer

This site has a table comparing regex features for a wide range of languages and platforms (make sure to scroll to the bottom). There's also a page specific to Python, with more info about the re module (though, for a more complete regex library in Python, you should also look at the newer regex module).

Related Solutions

How to Learn to Write Pythonic Code Effectively

I've found that a most people have their own interpretations of what being "Pythonic" really means. From Wikipedia:

A common neologism in the Python community is pythonic, which can have a wide range of meanings related to program style. To say that code is pythonic is to say that it uses Python idioms well, that it is natural or shows fluency in the language. Likewise, to say of an interface or language feature that it is pythonic is to say that it works well with Python idioms, that its use meshes well with the rest of the language.

In contrast, a mark of unpythonic code is that it attempts to write C++ (or Lisp, Perl, or Java) code in Python—that is, provides a rough transcription rather than an idiomatic translation of forms from another language. The concept of pythonicity is tightly bound to Python's minimalist philosophy of readability and avoiding the "there's more than one way to do it" approach. Unreadable code or incomprehensible idioms are unpythonic.

I've found that more times than not, more "pythonic" examples are actually derived from people trying to be clever with Python idioms and (again, more times than not) rendering their code virtually unreadable (which is not Pythonic).

As long as you're sticking to Python's idioms and avoiding trying to use C++ (or other language) styles in Python, then you're being Pythonic.

As pointed out by WorldEngineer, PEP8 is a good standard to follow (and if you use VIM, there are plugins available for PEP8 linting).

Really though, at the end of the day, if your solution works and isn't absolutely horribly un-maintainable and slow, who cares? More times than not, your job is to get a task done, not write the most elegant, pythonic code possible.

Another side note (just my opinion, feel free to downvote because of it ;)): I've also found the Python community to be filled with a ton of ego (not that most communities aren't, it's just a little more prevalent in communities such as C and Python). So, combining the ego with misconstrued interpretations of being "pythonic" will tend to yield a whole lot of baseless negativity. Take what you read from others with a grain of salt. Stick to official standards and documentation and you'll be fine.

Regular Expressions – How Do Regular Expressions Actually Work?

How does it work?

Take a look at automata theory

In short, each regular expression has an equivalent finite automaton and can be compiled and optimized to a finite automaton. The involved algorithms can be found in many compiler books. These algorithms are used by unix programs like awk and grep.

However, most modern programming languages (Perl, Python, Ruby, Java (and JVM based languages), C#) do not use this approach. They use a recursive backtracking approach, which compiles a regular expression into a tree or a sequence of constructs representing various sub-chunks of the regular expression. Most modern "regular expression" syntaxes offer backreferences which are outside the group of regular languages (they have no representation in finite automata), which are trivially implementable in recursive backtracking approach.

The optimization does usually yield a more efficient state machine. For example: consider aaaab|aaaac|aaaad, a normal programmer can get the simple but less efficient search implementation (comparing three strings separately) right in ten minutes; but realizing it is equivalent to aaaa[bcd], a better search can be done by searching first four 'a' then test the 5th character against [b,c,d]. The process of optimization was one of my compiler home work many years ago so I assume it is also in most modern regular expression engines.

On the other hand, state machines do have some advantage when they are accepting strings because they use more space compared to a "trivial implementation". Consider a program to un-escape quotation on SQL strings, that is: 1) starts and ends with single quotation marks; 2) single quotation marks are escaped by two consecutive single quotations. So: input ['a'''] should yield output [a']. With a state machine, the consecutive single quotation marks are handled by two states. These two states serve the purpose of remembering the input history such that each input character is processed exactly only once, as the following illustrated:

...
S1->'->S2
S1->*->S1, output *, * can be any other character 
S2->'->S1, output '
S2->*->END, end the current string

So, in my opinion, regular expression may be slower in some trivial cases, but usually faster than a manually crafted search algorithm, given the fact that the optimization cannot be reliably done by human.

(Even in trivial cases like searching a string, a smart engine can recognize the single path in the state map and reduce that part to a simple string comparison and avoid managing states.)

A particular engine from a framework/library may be slow because the engine does a bunch of other things a programmer usually don't need. Example: the Regex class in .NET create a bunch of objects including Match, Groups and Captures.

Best Answer

Related Solutions

How to Learn to Write Pythonic Code Effectively

Regular Expressions – How Do Regular Expressions Actually Work?

Related Topic