How to replace paired square brackets with other syntax with sed?

It took a little doing, but here:

sed -i.bkup  's/\[\([^]]*\)\]/\\macro{\1}/g' test.txt

Let’s see if I can explain this regular expression:

  1. The \[ matches a literal opening square bracket. Since [ is a special character in regular expressions, the backslash tells sed to match the literal character.
  2. The \(...\) is a capture group. It captures the part of the regular expression I want. I can have many capture groups, and in sed I can reference them as \1, \2, etc.
  3. Inside the capture group \(...\), I have [^]]*.
    1. The [^...] syntax means any character except the ones listed.
    2. The [^]] means any character except a closing square bracket. A ] placed right after the ^ is taken literally instead of ending the list.
    3. The * means zero or more of the preceding. That means I am capturing zero or more characters that are not closing square brackets.
  4. The \] matches the closing square bracket.
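
To see the difference between what is matched and what is captured, here is a small sketch that annotates each match instead of replacing it. The <match=... capture=...> output format and the variable names are just made up for illustration; & in the replacement stands for the whole match and \1 for the capture group.

```shell
# Annotate every bracketed word: & is the full match, \1 is the captured text.
line='this is [some] more [text]'
annotated=$(printf '%s\n' "$line" | sed 's/\[\([^]]*\)\]/<match=& capture=\1>/g')
printf '%s\n' "$annotated"
# prints: this is <match=[some] capture=some> more <match=[text] capture=text>
```

The brackets are part of the match, so they get replaced, but they are outside the capture group, so \1 holds only the text between them.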

Let’s look at the line: this is [some] more [text]

  • In #1 above, I match the first open square bracket in front of the word some. However, it’s not in a capture group. This is the first character I’m going to substitute.
  • I now start a capture group. Following 3.2 and 3.3 above, I capture as many characters as possible that are not closing square brackets, starting with the letter s in some. This means I am matching [some, but only capturing some.
  • In #4, I have ended my capture group. So far I’ve matched [some for substitution purposes, and now I match the closing square bracket. That means I’m matching [some]. Note that regular expressions are normally greedy. I’ll explain below why this is important.
  • Now I can look at the replacement string. This is much easier. It’s \\macro{\1}. The \1 is replaced by my capture group. The \\ is just a backslash. Thus, I’ll replace [some] with \macro{some}.
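
If you want to check this without editing a file in place, a quick sketch is to pipe the sample line through the same expression (the variable names here are only for illustration):

```shell
# Apply the substitution to the sample line; no -i, so nothing on disk changes.
line='this is [some] more [text]'
result=$(printf '%s\n' "$line" | sed 's/\[\([^]]*\)\]/\\macro{\1}/g')
printf '%s\n' "$result"
# prints: this is \macro{some} more \macro{text}
```

Because of the g flag, both bracketed words on the line are replaced, each one separately.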

It would be much easier if I could be guaranteed a single set of square brackets in each line. Then I could have done this:

sed -i.bkup 's/\[\(.*\)\]/\\macro{\1}/g' test.txt

The capture group is now saying anything between two square brackets. However, the problem is that regular expressions are greedy. That means I would have captured from the s in some all the way to the final t in text. The x’s below show the capture group. The [ and ] show the square brackets I’m matching on:

 this is [some] more [text]
         [xxxxxxxxxxxxxxxx]
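
Here is a small sketch of what the greedy version actually does to the sample line (again piping instead of using -i; the variable names are just for illustration):

```shell
# Greedy .* runs from the first [ to the LAST ] on the line.
line='this is [some] more [text]'
greedy=$(printf '%s\n' "$line" | sed 's/\[\(.*\)\]/\\macro{\1}/g')
printf '%s\n' "$greedy"
# prints: this is \macro{some] more [text}
```

Only one replacement happens, and the inner ] and [ end up inside the macro argument, which is not what we want.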

This became more complex because I had to match characters that have special meaning in regular expressions, so we see a lot of backslashing. Plus, I had to account for regular expression greediness, which gave us the nice-looking bracket expression [^]]* to match anything that is not a closing bracket. Add the square brackets before and after to get \[[^]]*\], and don’t forget the \(...\) capture group: \[\([^]]*\)\]. And you get one big mess of a regular expression.
