In my previous web page about updating HTML code with sed, I described a problem I had with using sed, in that I couldn't get sed to replace text strings that contained newline characters.
After raking around on the internet some more, it looked as if a possible way round this is to use perl instead of sed. Eventually I got it to work, so here is some information about using perl to replace a text string containing newline characters in multiple files.
perl is another tool for text manipulation - it is, like sed, a very comprehensive tool, and what I describe in this web page is only a microscopic part of the things that perl can do. Like sed, whole books are written about perl, and there are quite a few of them.
perl was written to be a more comprehensive text manipulation tool, and has many of its roots in the more popular parts of other tools such as sed and awk, and also shell scripting.
perl can be used as a command line tool, either at the command prompt, or in bash shell scripts - or I think in other types of script as well.
Alternatively, you can set up perl scripts - ie, the first line is
#!/usr/bin/perl
The current version available from the perl website at "www.perl.org" is version 5.10, and this is the version that comes with SuSE Linux 11.
Many of the web pages about perl also talk about regular expressions ( or regex ) and about pattern matching, instead of referring to text strings - "regular expressions" is a whole subject in its own right, and yes, whole books are written about them as well.
Regular expressions and pattern matching refer to the ability of perl to match ( or find ) a specified string of text, singly or multiple times, within a file, or within a variable, or within another string of text - perhaps the output of another process - and then to perform some action when it finds the string.
The regex can contain various different items, as well as just plain text - such as
Particular characters are used to inform perl that these are instructions, rather than part of the text string. Since I am not writing another book on perl, I will not try to enumerate them all here. But I will be using some of them later in this web page.
Because of these instructional characters, if these same characters appear in the text string, perl will see them as instructions, rather than characters in the text string. Therefore they need to be escaped with the backslash character.
There seems to be a bit of variation in different web pages about which characters perl sees as instructions, rather than as text - maybe it depends on which version of perl the author of the web page was referring to. So this list may not be correct or complete.
| ^ $ * + ? \ . { } [ ] ( ) / &
The form of the command line instruction for text replacement with perl is not too far distant from the command line instruction for text replacement with sed.
The basic format of the command written at the command line is :-
perl -p -e 's/< old-text >/< new-text >/< modifiers >' filename
The "-p" switch tells perl to read the file a line at a time, apply the command to each line, and print the result - without it, perl would not read the file at all. The modifiers are single characters that indicate to perl various conditions for the text replacement. More on these later.
Like sed, perl run this way will push the replacement text to standard output. Also like sed, we can instruct perl to do in-file text replacement, by adding the "i" switch, so the result doesn't go to standard output or need redirecting into a file, the replacement is done within each file.
Also like sed, we can use wildcards for the filename.
So to tell perl to do in-file text string replacement on multiple files, the command line now looks like
perl -pi -e 's/< old-text >/< new-text >/< modifiers >' *.htm
Again, like sed, we can use a shell script, and also use variables to define the text strings.
#!/bin/bash
old="< old-text >"
new="< new-text >"
perl -pi -e "s/$old/$new/< modifiers >" *.htm
We have to use soft quotes in the last line, otherwise the substitution for the variables will not work - double quotes are soft quotes, and single quotes are hard quotes.
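In order to achieve matching of newline characters, we need to look at some of the various instructional characters that were introduced higher up the page. There are two of them that are of particular relevance - the "." character, which matches any single character, and the "*" character, which matches the item before it any number of times, including none.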
We can combine them, and include ".*" in the regex, ( ie, the text string to be matched ), so that perl will try to match any number of any characters that occur within that part of the regex.
So we could have as our regex variable
old="< some-text >.* < more-text >"
However there is a snag with this - the "." wildcard character will match any character - except the newline character. perl, used in this way, is a line based tool that works through a file one line at a time, so it doesn't expect to replace newline characters.
So now we need to go and examine the modifiers that were mentioned further up the page.
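As mentioned above, the modifiers act to modify the way the "s" command works. There are several modifiers available, the three that are of particular interest in this case are the "g" modifier ( replace every occurrence, not just the first ), the "i" modifier ( ignore upper and lower case ), and the "s" modifier ( treat the text as a single line ).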
This last one, the "s" modifier, means that newline characters are seen by perl as just text characters that the "." wildcard can match, rather than as the end of the line that perl is examining. We gain two benefits from this -
So now we can modify our script to include the matching of newline characters, and while we are at it, we can make it case insensitive as well :-
#!/bin/bash
old="< some-text >.* < more-text >"
new="< new-text >"
perl -pi -e "s/$old/$new/gis;" *.htm
So now we are getting somewhere, except ......
After more hours of searching on the internet, I found the answer - in a short posting on an obscure forum about perl, I learnt that the combination of "-pi" and the "s" modifier doesn't work, and that the solution is to put "-0777" in front of the "-pi". So now our script looks like
#!/bin/bash
old='< some-text >.* < more-text >'
new='< new-text >'
perl -0777 -pi -e "s/$old/$new/gis;" *.htm
and this does work. I have no idea who the author of the post was, he was named as a guest - I can't now find the forum again, and I can't get Google to find any other reference about this "-0777"
But it works, and what's more, this script will
Since I couldn't find anything on the internet about it, I did some playing around to see if I could find out some reason behind it.
One of the things that I discovered was that it is not just the combination of the "-pi" and the "s" modifier that causes the problem, you have to be trying to match with the wild cards as well - ie, the script
#!/bin/bash
old='< old-text >'
new='< new-text >'
perl -pi -e "s/$old/$new/gis;" *.htm
works okay, but the script
#!/bin/bash
old='< some-text >.* < more-text >'
new='< new-text >'
perl -pi -e "s/$old/$new/gis;" *.htm
doesn't work, and we have to add the "-0777".
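Another thing that I found is that it doesn't have to be "-0777", the string "-0744" works as well. That made me wonder if there was something related to file rights, and whether perl has some kind of in-built CHMOD function, but it turns out not to be the case. According to the perl documentation for its command line switches, the number after "-0" is an octal value that sets the input record separator, ie the character that perl treats as the end of each chunk of input. Any value of 0400 or above tells perl to read each file in whole, and 0777 is just the conventional choice.
Consistent with this, using "ls -Fla" at the command prompt doesn't show any change to the file rights.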
I also had a look at the hex strings for the files before and after running the script, but there didn't seem to be anything unexpected happening there either.
It also didn't make any difference if the files had come from a Unix environment or from a Windows environment, with either ANSI character coding, or UTF8 character coding.
I found a bit of a snag with my script, as perl produced a very unexpected result if the input text string occurs more than once in any file. It appears that perl is doing a match based on
It also occurs on a line by line basis, if the "s" modifier is not being used.
So I went digging on the internet again, and found that the answer lay in "Greedy matching" - a characteristic of perl whereby, by default, it will look for the longest section it can find in the input file or text string which matches the search string.
This behaviour can be altered so that perl will look for the shortest section which matches the search string, by adding a "?" character into the search string after the wildcard / repetition characters. So now our script needs to look like :-
#!/bin/bash
old='< some-text >.*? < more-text >'
new='< new-text >'
perl -0777 -pi -e "s/$old/$new/gis;" *.htm
I think I've finally got to the end !