Updating HTML code with sed

Introduction

I have recently had to do a significant coding update to a substantial part of my website - some 119 pages needed changing.

I am using SUSE Linux 11 as my working desktop at the moment, rather than MS Windows, so I went surfing on the web to look for some suitable software that would enable me to do text substitution on multiple files, but didn't find any. There are several bits of Freeware and Shareware software for text substitution on MS Windows, but I found none for Linux. However, I did find some references to using sed for text substitution in a Unix / Linux environment. It is included in SUSE 11, so I started to look into it.

What is sed ?

sed is a command line tool for text manipulation - it is quite a complex application that requires very exact instructions. It has approximately 23 different commands, each with its own set of parameters, and whole books have been written about sed.

I have used only one of these commands, so have barely scratched the surface of sed.

The "s" command

The one command I have used is the "s" command, and this is a command that performs text substitution - it looks for a specified string of text, and replaces it with another specified string of text.

The basic format of the command written at the command line is :-


        sed -e "s/< old-text >/< new-text >/X" file-name

where "X" is some parameter indicating how many times on each line the substitution should be performed.

In my case, I always used "g", to indicate that sed should replace all instances of < old-text > on each line.
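
As a quick check of what the trailing flag does, the sketch below runs the "s" command on a made-up line of text piped in from printf, so no files are touched. The sample text and variable names are only for illustration, and the numeric flag is a standard sed option that is not otherwise used in this article.

```shell
# Sample input is hypothetical; nothing is read from or written to disk.
line='cat cat cat'

# No flag: only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/cat/dog/')

# "g": every match on each line is replaced.
every_match=$(printf '%s\n' "$line" | sed -e 's/cat/dog/g')

# A number: only that occurrence on each line is replaced.
second_only=$(printf '%s\n' "$line" | sed -e 's/cat/dog/2')

printf '%s\n' "$first_only" "$every_match" "$second_only"
```

Trying a substitution on piped text like this is a cheap way to confirm the pattern before letting it loose on real files.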

I found several different suggestions on the web as to how to tell sed to perform the substitutions on all files in a folder. By default, sed writes its result to standard output and leaves the original file unchanged, so if you want the result in a file, the output has to be redirected into one.

If you want to do the substitution on multiple files, then you have to add some mechanism to move the contents of files around to different files.
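
One common form of such a mechanism is a shell loop that redirects the output for each file into a temporary file and then moves it over the original. The following is only a sketch of that idea, not one of the suggestions referred to above; the scratch directory, file names and sample tags are all made up for the demonstration.

```shell
# Work in a scratch directory so no real pages are touched.
workdir=$(mktemp -d)
printf '<b>one</b>\n' > "$workdir/a.htm"
printf '<b>two</b>\n' > "$workdir/b.htm"

for page in "$workdir"/*.htm; do
    # Write the edited text to a temporary file, then replace the original
    # only if sed finished without an error.
    sed -e 's/<b>/<strong>/g' "$page" > "$page.tmp" && mv "$page.tmp" "$page"
done

cat "$workdir/a.htm" "$workdir/b.htm"
```

Redirecting straight back into the file being read (sed ... page.htm > page.htm) does not work, because the shell empties the file before sed gets to read it. That is why the temporary file is needed.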

I never got any of these suggestions to work, and eventually found that there is a better way anyway - you get sed to rewrite the file it is working on, rather than push the output to another file. The instruction for this is :-


        sed -i "s/< old-text >/< new-text >/X" file-name

Wildcards work, so it becomes :-


        sed -i "s/< old-text >/< new-text >/X" *.htm

With the global option, ie, every instance of < old-text > is replaced, the instruction becomes :-


        sed -i "s/< old-text >/< new-text >/g" *.htm

You can use this at the command line prompt, and it works fine.
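
Because "-i" overwrites the files, it may be worth keeping a copy of each original. The sketch below assumes a sed that accepts a suffix attached to "-i" (GNU sed does), and uses a scratch directory and a made-up page so that nothing real is changed.

```shell
# Scratch directory and sample page are hypothetical.
workdir=$(mktemp -d)
printf '<i>note</i> and <i>more</i>\n' > "$workdir/page.htm"

# -i.bak edits page.htm in place and keeps the original as page.htm.bak.
sed -i.bak -e 's/<i>/<em>/g' "$workdir"/*.htm

cat "$workdir/page.htm"      # the edited file
cat "$workdir/page.htm.bak"  # the untouched original
```

If the result is wrong, the ".bak" copies can simply be renamed back.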

Instructional characters

If you are recoding HTML files, then there is a problem that you will meet - sed has a number of characters that it recognises as instructions, rather than as a character in the text string.

The two that I know about, and are relevant if you are changing HTML tags, are :-

The forward slash - "/" - used in closing tags - eg - </p>
The ampersand - "&" - used in entities such as a non-breaking space - eg - &nbsp;

If the text strings you are taking out or putting in contain these characters, then you have to turn off the instructional meaning of the character by preceding it with a backslash - "\".

So if you want to add a blank line before the closing tag for a table cell, the sed instruction needs to look like :-


        sed -i "s/<\/td>/<p>\&nbsp;<\/p><\/td>/g" *.htm

instead of


        sed -i "s/</td>/<p>&nbsp;</p></td>/g" *.htm

All of this is now getting quite messy, and difficult to read when you are typing it into the command line. And that is a very short text string. It is much easier to put the sed command into a shell script, and run the script from the command line.
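
One way to cut down on the backslashes is to pick a different separator for the "s" command. sed takes whatever character follows the "s" as the separator, so with "|" the forward slashes in closing tags need no escaping. The ampersand still does, because in the replacement text it stands for "the text that was matched". The sketch below compares the forms on a made-up table cell; single quotes are used so that the shell leaves the backslashes alone.

```shell
# Hypothetical sample text; nothing is written to disk.
cell='<td>text</td>'

# Slash as separator: every "/" in the tags has to be escaped.
with_slashes=$(printf '%s\n' "$cell" | sed -e 's/<\/td>/<p>\&nbsp;<\/p><\/td>/g')

# "|" as separator: only the ampersand still needs a backslash.
with_pipes=$(printf '%s\n' "$cell" | sed -e 's|</td>|<p>\&nbsp;</p></td>|g')

# Ampersand left unescaped: sed inserts the matched text in its place.
unescaped=$(printf '%s\n' "$cell" | sed -e 's|</td>|<p>&nbsp;</p></td>|g')

printf '%s\n' "$with_slashes" "$with_pipes" "$unescaped"
```

The first two commands give the same result. The third shows what goes wrong when the ampersand is not escaped.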

Using a shell script instead

Using a shell script instead of typing the sed command into the command line shell is much easier in many cases, and the shell script can be quite minimal. A minimal script can be only two lines :-


        #!/bin/bash

        sed -i "s/< old-text >/< new-text >/g" *.htm

would be a working script.

Once you've written the script, don't forget to set the file attributes so it is executable.
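
From the command line, the usual way to set that attribute is chmod. A minimal sketch, assuming a script called update-html.sh (a made-up name) sitting in a scratch directory:

```shell
# Scratch directory and script name are hypothetical.
workdir=$(mktemp -d)
script="$workdir/update-html.sh"

# Write a two-line script (the command here only prints a message).
printf '%s\n' '#!/bin/bash' 'echo "script ran"' > "$script"

chmod +x "$script"      # mark the file as executable
result=$("$script")     # run it by giving its path
echo "$result"
```

A script in the current folder is run as ./update-html.sh, since the current folder is not normally on the search path.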

Using a shell script has two distinct advantages - the first of these is that you can write your script in a text editor, such as Gedit. And when you have Gedit open, you can open a second file, and "copy-and-paste" between the two windows. So you don't have to laboriously write out long text strings.

The other advantage is that you can now use variables, instead of writing the text strings directly into the sed command. So now the script looks like :-


        #!/bin/bash

        old="< old-text >"

        new="< new-text >"

        sed -i "s/$old/$new/g" *.htm

This makes things a lot easier to read if the text strings are quite long, and you have to add backslashes to kill instructional characters.

Be careful how you quote things if you are using variables. If you use single dibs to quote the "s" command - in other words, it looks like :-


        sed -i 's/$old/$new/g' *.htm

then the substitution doesn't work, and sed uses the variable names as the text string. So you do need to use double dibs.
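
The difference can be seen on a single made-up line of text. In this sketch the values of "old" and "new" and the sample text are only for illustration:

```shell
old='blue'
new='green'
text='blue and $old'   # hypothetical input containing the literal text $old

# Single quotes: the shell passes $old and $new to sed untouched, so sed
# searches for the literal text "$old" and writes the literal text "$new".
single=$(printf '%s\n' "$text" | sed -e 's/$old/$new/g')

# Double quotes: the shell fills in the values first, so sed is given
# s/blue/green/g.
double=$(printf '%s\n' "$text" | sed -e "s/$old/$new/g")

printf '%s\n' "$single" "$double"
```

With single quotes only the literal "$old" in the text is changed; with double quotes the word "blue" is changed, which is what was wanted.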

I have had problems when using variables if the text strings contain quotes - substitution results were sometimes a bit wrong. It might help to use single dibs to quote the text strings, ie, :-


        #!/bin/bash

        old='< old-text >'

        new='< new-text >'

        sed -i "s/$old/$new/g" *.htm
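
Adding the backslashes by hand can also be avoided by letting sed escape the two strings first. The following is a sketch of a commonly used approach rather than anything built into sed. The variable names and sample tags are made up, and it only deals with the characters listed in the comments (it does not handle backslashes or line breaks inside the strings).

```shell
# The strings are single-quoted so the shell passes them on unchanged.
old='<td align="left">a.b</td>'
new='<td align="left">a &amp; b</td>'

# Escape characters that are special in a sed search pattern: ] / $ * . ^ [
old_escaped=$(printf '%s\n' "$old" | sed -e 's|[]/$*.^[]|\\&|g')

# Escape characters that are special in a sed replacement: / and &
new_escaped=$(printf '%s\n' "$new" | sed -e 's|[/&]|\\&|g')

# Hypothetical input line, piped in so that no file is changed.
result=$(printf '%s\n' '<tr><td align="left">a.b</td></tr>' |
    sed -e "s/$old_escaped/$new_escaped/g")
echo "$result"
```

The double quotes inside the HTML cause no trouble here, because they arrive inside variables and the shell does not look at them again.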

I changed several text strings in my 119 files using sed in scripts like this in just seconds, so it has been very useful in my case. However, it hasn't solved all my problems.

An insurmountable problem ?

I have found a problem with sed when using it to replace coding tags in HTML files, and I don't know if there is a solution. sed is a line editor - it does things line by line. If the text string it is asked to find runs across a newline character, it falls over. I don't know whether my programming skills are the problem, or whether it is a limitation of sed.

So I haven't been able to replace text strings where I typed a succession of tags on different lines, which is something I used to do quite a lot.
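
For what it is worth, one idiom that is often suggested for this is to make sed collect the whole input into its working buffer before doing the substitution, so that the line breaks become ordinary characters that "\n" can match. The sketch below assumes GNU sed (for "\n" in the search pattern) and uses a made-up three-line sample. It was not part of the attempts described in this article, and it reads the entire file into memory, so it is an illustration rather than a recommendation.

```shell
# Hypothetical sample: an opening row tag and its first cell on separate lines.
sample=$(printf '<tr>\n<td>cell</td>\n</tr>')

# :a      defines a label called "a"
# N       appends the next input line to the working buffer
# $!ba    jumps back to "a" unless the last line has been read
# s|...|  then runs once over the whole text, line breaks included
joined=$(printf '%s\n' "$sample" |
    sed -e ':a' -e 'N' -e '$!ba' -e 's|<tr>\n<td>|<tr><td>|g')

echo "$joined"
```

If the files still have DOS line endings, each line break is a carriage return followed by a line feed, so a pattern containing only "\n" would not match at those points.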

I spent quite a lot of time on this, and investigating what the text string for a new line looks like. I used the command line tool "od" to look at the hexadecimal code in my files,


        od -h < file-name >

and found that my files were using both a carriage return character and a line feed character, the hex codes for them being 0D and 0A respectively.

My files were all written on Windows based PCs, so they contain the DOS standard way of using CR + LF to mark new lines. The Unix method to mark new lines is to use the LF character only.
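
The two conventions can be compared byte by byte with od. The sketch below uses "-An -tx1" to print one hex byte at a time without the offset column (a different od option from the "-h" used above), and uses tr to strip the carriage returns as a rough stand-in for a conversion tool. The two-line sample text is made up, and note that tr removes every CR, not only those in front of an LF.

```shell
# Hex bytes of a two-line sample with DOS (CR + LF) line endings.
dos_bytes=$(printf 'one\r\ntwo\r\n' | od -An -tx1 | tr -d ' \n')

# The same sample after deleting the carriage returns: LF only.
unix_bytes=$(printf 'one\r\ntwo\r\n' | tr -d '\r' | od -An -tx1 | tr -d ' \n')

echo "$dos_bytes"
echo "$unix_bytes"
```

In the first result each line ends in 0d 0a, and in the second in 0a alone.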

I used dos2unix, a tool which is included on SUSE 11 as /usr/bin/dos2unix - this tool converts the DOS way of marking new lines to the Unix way. I think it also changes the end-of-file characters, but don't quote me on that. However that didn't help.

I tried lots of different combinations of quoting with single or double dibs, and adding a backslash to the end of lines, however nothing worked. Either I got error messages from sed, or sed did nothing.

Eventually I decided it was going to be easier just to change that bit of the files manually, which turned out to be quite useful, as some of the files had been coded differently anyway.

So although sed was very useful in making quite a few changes to my files, my inability to program sed to accommodate text strings which include new lines was a bit of a limitation.

I subsequently found out that sed is really for manipulation of streaming text rather than text in files - for text manipulation in files, the answer lies in perl, which is much more powerful.

website design by ron-t
