Perl on Linux – Greedy Regular Expressions and the Question Marks that Tame Them

In 1987, Larry Wall released a scripting language to ease the chore of generating reports for system administrators on Unix. The name is usually expanded as the Practical Extraction and Report Language, since that was its function. And it does do that. In a Unix-based OS, where nearly everything is output as text, Perl caught on partly because it is so easy to use but mostly because of its powerful and convenient regular expression capabilities.

Like most things in our modern society, Perl’s regular expressions are greedy by default. For example:

$_ = "I am an oh so pretty sentence!";
print "$1\n" if /\s(.*)\s/;

Now if we ponder the regex used here, it will match anything that is surrounded by two whitespace characters (which is what \s stands for). That means the captured text could be any of these:

  • am
  • am an
  • am an oh
  • am an oh so
  • etc…

Greedy means that Perl will match the biggest one it can find, so it captures ‘am an oh so pretty’, since that is the longest segment surrounded by whitespace characters. But what if we want to tone that down and get the first, smallest match instead? Funny you should ask. Asking requires a question mark, and so does taming Perl’s regex greed.
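To make the choice concrete, here is a small sketch that builds the candidate list by hand and compares it with what the greedy pattern captures. The variable names (@words, @candidates, $greedy) are made up for this illustration, and the candidates are only those that begin right after the first space, as in the list above.

```perl
use strict;
use warnings;

my $sentence = "I am an oh so pretty sentence!";

# Build every candidate that starts after the first space
# and ends just before a later space.
my @words = split ' ', $sentence;
my @candidates;
for my $end (1 .. $#words - 1) {
    push @candidates, join ' ', @words[1 .. $end];
}

# Let the greedy regex make its own pick.
my ($greedy) = $sentence =~ /\s(.*)\s/;

print "candidate: $_\n" for @candidates;
print "greedy picks: $greedy\n";
```

The greedy capture should line up with the last, longest entry in the candidate list.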

We insert the ? just after the quantifier we want to curb. In our previous example, we insert it just after the .*, like so:

$_ = "I am an oh so pretty sentence!";
print "$1\n" if /\s(.*?)\s/;

And then we only capture “am”, since the tamed .*? stops at the first whitespace character that lets the match succeed. So start reining in those regular expressions.
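For a side-by-side look, the sketch below runs the greedy and tamed patterns against the same sentence, then repeats the comparison on a second string with quoted words. The second string and all variable names are assumptions added for illustration only.

```perl
use strict;
use warnings;

my $sentence = "I am an oh so pretty sentence!";

# Same pattern, with and without the taming question mark.
my ($greedy) = $sentence =~ /\s(.*)\s/;
my ($tamed)  = $sentence =~ /\s(.*?)\s/;

print "greedy: '$greedy'\n";
print "tamed:  '$tamed'\n";

# A second, made-up string: grab what sits between double quotes.
my $text = 'say "hi" and "bye"';

my ($quoted_greedy) = $text =~ /"(.*)"/;
my ($quoted_tamed)  = $text =~ /"(.*?)"/;

print "greedy: '$quoted_greedy'\n";
print "tamed:  '$quoted_tamed'\n";
```

In both cases the greedy version runs on to the last possible delimiter, while the tamed version stops at the first one.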
