Updating HTML code with perl

Introduction

In my previous web page about updating HTML code with sed, I described a problem I had with sed: I couldn't get it to replace text strings that contained newline characters.

After raking around on the internet some more, it looked as if a possible way round this is to use perl instead of sed. Eventually I got it to work, so here is some information about using perl to replace a text string containing newline characters in multiple files.

What is perl ?

perl is another tool for text manipulation - it is, like sed, a very comprehensive tool, and what I describe in this web page is only a microscopic part of the things that perl can do. Like sed, whole books are written about perl, and there are quite a few of them.

perl was written to be a more comprehensive text manipulation tool, and has many of its roots in the more popular parts of other tools such as sed and awk, and also shell scripting.

perl can be used as a command line tool, either at the command prompt or in bash shell scripts, and I think in other types of script as well.

Alternatively, you can set up perl scripts - ie, the first line is


        #!/usr/bin/perl

At the time of writing, the current version available from the perl website at "www.perl.org" is version 5.10, and this is the version that comes with SuSE Linux 11.

Regular expressions and pattern matching

Many of the web pages about perl also talk about regular expressions ( or regex ) and about pattern matching, instead of referring to text strings - "regular expressions" is a whole subject in its own right, and yes, whole books are written about them as well.

Regular expressions and pattern matching refer to the ability of perl to match ( or find ) a specified string of text, singly or multiple times, within a file, or within a variable, or within another string of text - perhaps the output of another process - and then to perform some action when it finds the string.

The regex can contain various different items, as well as just plain text - such as

various wildcards
characters which represent ASCII control characters
instructions - about repetitions
groups of text strings which are alternatives
classes of characters

Particular characters are used to inform perl that these are instructions, rather than part of the text string. Since I am not writing another book on perl, I will not try to enumerate them all here. But I will be using some of them later in this web page.

Because of these instructional characters, if these same characters appear in the text string, perl will see them as instructions, rather than characters in the text string. Therefore they need to be escaped with the backslash character.

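There seems to be a bit of variation in different web pages about which characters perl sees as instructions, rather than as text - maybe it depends on which version of perl the author of the web page was referring to. So this list may not be correct or complete. In particular, "/" only needs escaping because it is the delimiter of the "s" command, and "&" is special in sed replacement text rather than in perl patterns, although escaping it does no harm.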


        |  ^  $  *  +  ?  \  .  {  }  [  ]  (  )  /  &

Text replacement with perl

The form of the command line instruction for text replacement with perl is not too far distant from the command line instruction for text replacement with sed.

The basic format of the command written at the command line is :-


        perl -p -e 's/< old-text >/< new-text >/< modifiers >' filename

The modifiers are single characters that indicate to perl various conditions for the text replacement. More on these later.

Like sed, perl with the "-p" switch reads the file a line at a time and pushes each line, after any replacement, to standard output. Without "-p", perl does not read the file at all. Also like sed, we can instruct perl to do in-file text replacement, by adding the "-i" switch, so the result doesn't go to standard output or need redirecting into a file; the replacement is done within each file.

Also like sed, we can use wildcards for the filename.

So to tell perl to do in-file text string replacement on multiple files, the command line now looks like


        perl -pi -e 's/< old-text >/< new-text >/< modifiers >' *.htm

Again, like sed, we can use a shell script, and also use variables to define the text strings.


        #!/bin/bash

        old="< old-text >"

        new="< new-text >"

        perl -pi -e "s/$old/$new/< modifiers >" *.htm

We have to use soft quotes in the last line, otherwise the shell will not substitute the variables - the double quote marks are soft quotes.

Instructional characters

In order to achieve matching of newline characters, we need to look at some of the various instructional characters that were introduced higher up the page. There are two of them that are of particular relevance

"." - the dot - which means 'match a single occurrence of any character'
"*" - the star - which means 'match either zero or any number of occurrences of the previous character'

We can combine them, and include ".*" in the regex, ( ie, the text string to be matched ), so that perl will try to match any number of any characters that occur within that part of the regex.

So we could have as our regex variable


        old="< some-text >.* < more-text >"

However there is a snag with this - the "." wildcard character will match any character - except the newline character. On top of that, perl with "-p" reads its input one line at a time, so it never has the text from two lines in front of it together.

So now we need to go and examine the modifiers that were mentioned further up the page.

As mentioned above, the modifiers act to modify the way the "s" command works. There are several modifiers available, the three that are of particular interest in this case are

g - which means match all occurrences of the text string, not just the first one
i - which means that perl should match the text whichever case the text is written in
s - which means that the "." wildcard should match newline characters as well

This last one, the "s" modifier, means that newline characters are seen by the "." wildcard as just text characters, so ".*" can run on from one line to the next. That looks like just what we need.

So now we can modify our script to include the matching of newline characters, and while we are at it, we can make it case insensitive as well :-


        #!/bin/bash

        old="< some-text >.* < more-text >"

        new="< new-text >"

        perl -pi -e "s/$old/$new/gis;" *.htm

So now we are getting somewhere, except ......

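It doesn't work !

After more hours of searching on the internet, I found the answer in a short posting on an obscure forum about perl: the combination of "-pi" and the "s" modifier isn't enough. The "-p" switch still hands the file to the "s" command one line at a time, so there is never a newline in the middle of the text for the "." to match. The solution is to put "-0777" in front of the "-pi", which tells perl to read each file in whole. So now our script looks like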


        #!/bin/bash

        old='< some-text >.* < more-text >'

        new='< new-text >'

        perl -0777 -pi -e "s/$old/$new/gis;" *.htm

and this does work. I have no idea who the author of the post was, he was named as a guest - I can't now find the forum again, and at the time I couldn't get Google to find any other reference to this "-0777"

But it works, and what's more, this script will

do in-file text replacement in multiple files
do a match whatever the case of the text
match any number of Unix style LF newline characters ( or zero of them )
match any number of Windows style CR+LF newline characters ( or zero of them )

More on this "-0777"

Since I couldn't find anything on the internet about it, I did some playing around to see if I could find out some reason behind it.

One of the things that I discovered was that it is not just the combination of the "-pi" and the "s" modifier that causes the problem, you have to be trying to match with the wild cards across a line break as well - ie, the script


        #!/bin/bash

        old='< old-text >'

        new='< new-text >'

        perl -pi -e "s/$old/$new/gis;" *.htm

works okay, but the script


        #!/bin/bash

        old='< some-text >.* < more-text >'

        new='< new-text >'

        perl -pi -e "s/$old/$new/gis;" *.htm

doesn't work when the text to be matched is spread over more than one line, and we have to add the "-0777".

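Another thing that I found is that it doesn't have to be "-0777", the string "-0744" works as well. That made me wonder whether there is something related to file rights - does perl have some kind of in-built CHMOD function ? It turns out there isn't: the digits after "-0" are an octal number that sets perl's input record separator, ie, the character that perl treats as the end of each chunk of input. Any value of 0400 or above tells perl to read each file in whole, and 0777 is just the value conventionally used for this. This is described under "-0" in perl's "perlrun" documentation.

Using "ls -Fla" at the command prompt doesn't show any change to the file rights, which fits with this.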

I also had a look at the hex strings for the files before and after running the script, but there didn't seem to be anything unexpected happening there either.

It also didn't make any difference if the files had come from a Unix environment or from a Windows environment, with either ANSI character coding, or UTF8 character coding.

Greedy matching

I found a bit of a snag with my script, as perl produced a very unexpected result if the input text string occurs more than once in any file. It appears that perl is doing a match based on

the first occurrence of the < some-text > text string in the file
the last occurrence of the text string < more-text > in the file
any or all of the text between them

It also occurs on a line by line basis, if "-0777" is not being used.

So I went digging on the internet again, and found that the answer lay in "Greedy matching" - which is a characteristic which perl has, that by default, it will look for the longest section it can find in the input file or text string which matches the search string.

This behaviour can be altered so that perl will look for the shortest section which matches the search string, by adding a "?" character into the search string after the wildcard / repetition characters. So now our script needs to look like :-


        #!/bin/bash

        old='< some-text >.*? < more-text >'

        new='< new-text >'

        perl -0777 -pi -e "s/$old/$new/gis;" *.htm

I think I've finally got to the end !

website design by ron-t
