Python – Regex add character to matched string

nlp python regex

I have a long string which is a paragraph, however there is no white space after periods. For example:

para = "I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women.It is in black and white but saves the colour for one shocking shot.At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.Avoid."

I am trying to use re.sub to solve this problem, but the output is not what I expected.

This is what I did:

re.sub("(?<=\.).", " \1", para)

I am matching the first character of each sentence, and I want to put a white space before it. My match pattern is (?<=\.)., which (supposedly) matches any character that appears right after a period. I learned from other Stack Overflow questions that \1 refers to the last matched pattern, so I wrote my replacement pattern as " \1": a space followed by the previously matched string.

Here is the output:

"I saw this film about 20 years ago and remember it as being particularly nasty. \x01I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women. \x01t is in black and white but saves the colour for one shocking shot. \x01t the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. \x01void. \x01

Instead of matching any character preceded by a period and adding a space before it, re.sub replaced the matched character with \x01. Why? How do I add a character before a matched string?

Best Answer

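The (?<=a)b is a positive lookbehind. It matches b only when it follows a, and the a is neither consumed nor captured. Your expression therefore contains no capturing group at all, so there is no group 1 for \1 to refer to. On top of that, " \1" is not a raw string, so Python itself turns \1 into the character \x01 before re.sub ever sees it, which is exactly what shows up in your output.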

Your current approach has another flaw: it would add a space after a . even when one is already there.
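Both points can be checked directly. The following is a minimal sketch, assuming a shortened sample string in place of the full paragraph; the names para, raised and fixed are only for illustration. It shows that "\1" in a normal string literal is the character \x01, that the raw-string version fails because the pattern has no group 1, and that \g<0> (the whole match) inserts the space but doubles it where one already exists.

```python
import re

# Shortened sample for illustration, not the full paragraph from the question.
para = "It is nasty. I believe it.It is in black and white.Avoid."

# In a normal (non-raw) string literal, "\1" is an octal escape for chr(1).
assert " \1" == " \x01"

# With a raw string, re sees a real backreference, but the pattern has no
# capturing group, so re.sub raises re.error.
try:
    re.sub(r"(?<=\.).", r" \1", para)
    raised = False
except re.error:
    raised = True

# \g<0> refers to the whole match, so a space is inserted before the matched
# character, even when that character is itself a space.
fixed = re.sub(r"(?<=\.).", r" \g<0>", para)
print(fixed)  # note the double space after "nasty."
```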

To add the missing space after a ., I suggest a different strategy: replace every . that is followed by a character other than a space or another . with a . and a space:

re.sub(r'\.(?=[^ .])', '. ', para)
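As a quick check of that substitution, here is a small sketch, again assuming a shortened sample string; the helper name add_space_after_periods is hypothetical and only wraps the call above.

```python
import re

def add_space_after_periods(text: str) -> str:
    # Match a period only when the next character is neither a space nor
    # another period; the lookahead does not consume that character.
    return re.sub(r"\.(?=[^ .])", ". ", text)

# Shortened sample for illustration, not the full paragraph from the question.
sample = "It is nasty. I believe it.It is in black and white.Avoid."
result = add_space_after_periods(sample)
print(result)
```

Periods that already have a space after them, the final period, and all but the last period of an ellipsis are left alone. One caveat: a period inside a number such as 3.14 would also get a space, so the pattern may need tightening for text that contains decimals.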