More on perl

Back in 2008, I did a fairly major upgrade of my website, and after trying sed, I found that perl provided a better way of modifying html code across multiple files.

I wrote a web page about it, and it has been most useful since then. This page is a sequel to that one, which should be read first.

Here is a link to it.

Updating HTML code with perl

In a more recent upgrade of various sections of my website, I came across some difficulties, and had to dig a bit deeper into perl. This web page documents how I used perl to solve them.

From the previous page about perl

By the end of the previous web page about perl, I had a working script that did text substitution across multiple lines.


        #!/bin/bash

        old='< some-text >.*? < more-text >'

        new='< new-text >'

        perl -0777 -pi -e "s/$old/$new/gis;" *.htm

Here is a line by line description of what it does.

⇒ bash shell script

The first line is of course just to define the script as a bash shell script.

The interesting thing that I didn't realise until I had almost finished writing this page is that this means that the next two lines are creating bash shell script variables, not perl variables - because at that point the script doesn't know that perl is going to be used.

The impact of this is that bash, not perl, fills in the variables - the perl command is inside soft quotes ( ie - the double " ), so bash replaces $old and $new with their contents before perl starts, and perl only ever sees the finished text. One implication is that a / inside either variable would be read by perl as one of the s/// separators.

The variables themselves are defined with hard quotes ( ie - the single ' ), so bash stores their contents exactly as typed, without trying to interpret any of the characters.
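To see what perl is actually handed, the finished command can be built and printed without running perl at all. This is only a sketch - the variable name "program" is made up for the illustration, and is not part of the original script.

```shell
#!/bin/bash

# Hard quotes: bash stores the text exactly as typed.
old='< some-text >.*? < more-text >'
new='< new-text >'

# Soft quotes: bash swaps $old and $new for their contents
# before perl would ever see the command.
program="s/$old/$new/gis;"

echo "$program"
# prints: s/< some-text >.*? < more-text >/< new-text >/gis;
```

The printed line is the text that perl would receive after the -e option.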

⇒ defining the regex

The second line creates the variable "old" which contains the text which should be removed.

The first two characters of .*? are respectively a metacharacter and a quantifier - the third one is a sort of quantifier - they all define how wildcarding is to work.

There are 7 metacharacters within the perl world -


 \  ^  .  $  |  ()  []

There are four quantifiers within the perl world -


 *  +  ?  {}

Some websites about perl don't call these four characters quantifiers, they include them in the list of metacharacters.

Again, some websites describe things differently, and also define another four quantifiers -


 *?  +?  ??  {}?

which tell perl to match as few characters as possible, rather than as many as possible.

Somewhat confusingly, some other websites refer to these characters as control codes -


 ?  *  +  |  ^  $  ()

All of which doesn't really help when you are trying to work out how perl works.

Anyway, in the above script -

. - this metacharacter matches any single character except a newline character
* - this is a quantifier which tells perl to match the preceding item zero or more times
? - don't do greedy matching, use as few characters as possible to obtain a match

NB - just to confuse the issue - in this case ? follows the * quantifier, so it is being used to specify non-greedy matching.

If it is used straight after the . then it is a quantifier, and tells the . to match zero or one times only

⇒ the replacement text

The third line creates the variable "new" which contains the text that is to be inserted into the place or places from which the old text has been removed.

⇒ the perl s/// operator

The fourth line is the perl instruction.

the first s is an operator which tells perl that this is a substitution
$old is the variable containing the text that is to be removed
$new is the variable containing the text that is to be added in the space left by the removal of $old
the letters gis are modifiers, which direct some aspects of the substitution - the s/// operator can have several modifiers -


      /i /m /s /x /o /g /e

In the above script

g - tells perl to replace every match of the regex in the variable $old, not just the first one
i - tells perl to ignore the case of letters
s - tells perl that the . metacharacter should match newline characters as well - this is what allows the text matched by $old to run across multiple lines

Now the above script worked fine for some of the html coding I was trying to do, but I met some problems.

What follows are some of the solutions I eventually dug up from the internet.

Unknown amount of white space

One of the problems I had was in replacing several lines of text, with different files having different amounts of white space in them.

The solution to this involved the use of a perl feature called Character Classes, where a small number of specific characters can be used to represent a number of characters of a similar type.

The specific characters are


      \d \s \w \D \S \W

The \s character will represent any white space characters, and can be used as a shorthand for tabs, spaces, and new lines.

So the regex could be for example


      old='text-1\s+text-2\s+text-3'

and this would match any block of text that contained "text-1" then "text-2" then "text-3", with one or more white space characters ( of any kind ) between them.

Adding text to the end of a file

The particular problem here was that each file had different text in it, so it was not possible to specify a particular string after which the new text should be added.

Now simple logic would suggest that if I just used wildcards such as


      old='.*'

then tried


      new='.* newtext'

then the substitution would just add "newtext" after all the existing text.

However it doesn't work - the "newtext" is added just as instructed, however all the original text is deleted.

After digging around, it appears that the way that perl does substitution is

1. use the regex to find the places in which to put the new text
2. remove the old text identified by the regex
3. put in the new text

The trouble is - after step 2 - it doesn't know what it has removed, it just has an empty space. Also, only the first part of s/// is treated as a regex - the second part is just plain text, so a .* in the $new variable is inserted as two literal characters, and doesn't stand for the text that was matched.

The solution is to change the fourth line of the script at the top of the page, so that the s/// operator now becomes -


       perl -0777 -pi -e "s/($old)/$new/gis;" *.htm

That is - the bit that calls the $old variable is placed inside parentheses.

Now perl adds another bit to the procedure -

1. use the regex to find the places in which to put the new text
2. create a variable called $1, and in it place all the text in the places it has just identified
3. remove the old text identified by the regex
4. put in the new text

So the $new variable can now be written as


      new='$1 newtext'

and in line 4, perl puts back all the original text ( from the variable $1 ) plus the new text.

So the final script now becomes something like


        #!/bin/bash

        old='^.*$'

        new='$1 newtext'

        perl -0777 -pi -e "s/($old)/$new/gis;" *.htm

As far as I can see, perl creates the variable $1 after it has done a successful match.

Also, perl can create more than one of these numbered variables - there is one for each pair of parentheses in the regex, counted from left to right, and they don't stop at $9 - $10, $11 and so on also work if the regex has that many pairs of parentheses.

Changing file extensions

One of the things I needed to do was to change the file extensions on multiple files from .htm to .php.

I started to dig around perl to see how to do it, but was quickly diverted to a much simpler method - using the "rename" command, which is a separate program that can be run from a bash shell script.

All it needed was


      rename .htm .php *.htm

Job done !
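One thing to watch is that there is more than one version of rename about - on some systems ( Debian and Ubuntu for example ) rename is itself a perl script, and expects a perl expression rather than the two pieces of text used above. If in doubt, a plain bash loop with mv does the same job. The following sketch runs it in a scratch directory, and the file names are invented.

```shell
#!/bin/bash

# A scratch directory with two empty test files in it.
workdir=$(mktemp -d)
touch "$workdir/index.htm" "$workdir/about.htm"

for f in "$workdir"/*.htm; do
    # ${f%.htm} is the file name with the trailing .htm cut off.
    mv "$f" "${f%.htm}.php"
done

renamed=$(cd "$workdir" && echo *)
rm -r "$workdir"

echo "$renamed"
# prints: about.php index.php
```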

website design by ron-t

