sed regex behaviour changed?

Hi. I have had a script running on my server for months. Essentially it downloads a web page every night and strips out the data I want, so a lot of filtering happens. Replacing the source data with a simple echo command, this is what I have been doing:

echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'

and for months it stripped out the leading and trailing garbage to give " 3,818.91".

As of last Thursday my system started returning “----><----> 3,818.91”, which caused a cascade of errors in my system.

I can get it to work as I want again by moving the ‘-’ to the first position in the bracket list:

echo "----><----> 3,818.91 <----><---->" | sed 's/[-<>!\"]*//' | sed 's/ <.*//'

I also tried escaping the ‘-’ with a ‘\’ but that made no difference.

Clearly it is interpreting the ‘-’ as a range of characters. Does anyone have any idea what changed? Should it always have failed, and was I exploiting a bug? Any comments or suggestions welcome.

This is on Leap 15.2 (the same as you say you have):

henk@boven:~> echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'
 3,818.91
henk@boven:~>

I updated the system last thuesday, just looked with YaST Online Update and there is no patch for sed waiting, thus this must be up-to-date IMHO.

OK now I am utterly confused.

If I log in as root:

echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'
 3,818.91

If I log in as me:

echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'
----><----> 3,818.91

Both root and I are using bash as our shells (checked with “ps -p $$”).

Hm, strange indeed. As you have seen from my prompt, I was user henk.

That is why we always ask people to include the line with the prompt and the command, the output, and the line with the new prompt when posting from the terminal. Output alone often hides a lot of information that potential helpers need.

So better SHOW

henk@boven:~> ps -p $$
  PID TTY          TIME CMD
 2942 pts/0    00:00:00 bash
henk@boven:~> 

than telling stories.

Which is the only reliable way to include a literal “-” in a character list: put it first (or last). Your original expression is interpreted as the range from “!” to “"” inclusive. I have no idea why it worked, and if it did, it was by accident (or mistake).

Compare environment variables and locale settings. It is possible that for one user the character range between “!” and “"” includes “-”, and for another it does not.
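To illustrate the point about hyphen placement, here is a minimal sketch using the sample line from this thread. It assumes the intent is only to strip leading "-", "<", ">", "!" and double-quote characters; the backslash from the original expression is dropped on that assumption, since inside brackets it is an ordinary character and not an escape. The variable names are made up for the example.

```shell
# Sample line taken from the thread.
input='----><----> 3,818.91 <----><---->'

# '-' placed first in the bracket list: treated as a literal, not a range operator.
first=$(printf '%s\n' "$input" | sed 's/[-<>!"]*//' | sed 's/ <.*//')

# '-' placed last in the bracket list: also treated as a literal.
last=$(printf '%s\n' "$input" | sed 's/[<>!"-]*//' | sed 's/ <.*//')

# Brackets in the output make the leading space visible.
printf '[%s]\n[%s]\n' "$first" "$last"
```

Both lines should print "[ 3,818.91]". Because neither expression contains a range, the result should not depend on LANG or LC_COLLATE.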

Many thanks … but still very strange mixed behaviour!

OK, I have been checking the environment variables, and here is a TTY transcript where I can change the behaviour:

julian@Cumulus:~> echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'
----><----> 3,818.91
julian@Cumulus:~> echo $LANG
en_GB.UTF-8
julian@Cumulus:~> LANG=POSIX
julian@Cumulus:~> echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'
 3,818.91
julian@Cumulus:~> LANG=en_GB.UTF-8
julian@Cumulus:~> echo "----><----> 3,818.91 <----><---->" | sed 's/[<>!-\"]*//' | sed 's/ <.*//'
----><----> 3,818.91
julian@Cumulus:~> 

Changing the LANG environment variable changes the behaviour …
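Given that the locale decides the outcome, one possible workaround for a nightly script is to pin the locale for the sed call alone instead of changing LANG for the whole session. This is only a sketch: it relies on LC_ALL taking precedence over LANG and the other LC_* variables, and the variable names are made up for the example.

```shell
# Sample line taken from the thread.
input='----><----> 3,818.91 <----><---->'

# LC_ALL=C applies to this one sed process only; the rest of the shell keeps its locale.
# In the C locale the range runs from '!' (0x21) to '\' (0x5C) in byte order,
# so it covers '-', '<' and '>', but also digits, ',' and '.'.
pinned=$(printf '%s\n' "$input" | LC_ALL=C sed 's/[<>!-\"]*//' | sed 's/ <.*//')

# Brackets in the output make the leading space visible.
printf '[%s]\n' "$pinned"
```

Note that this only gives the wanted result because a space (0x20, just outside the range) sits in front of the number. Without that space the wide range would start eating the digits too, so rewriting the bracket list without a range is probably the safer fix.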

Very strange. It is all in the ASCII range.

That was wrong. “\” does not have special meaning in brackets, which means the range is from “!” to “\”.

Which is absolutely irrelevant, because the range is determined according to the collating order of the specified locale.

bor@10:~> printf '!\n-\n\\\n' | LC_COLLATE=POSIX sort
!
-
\
bor@10:~> printf '!\n-\n\\\n' | LC_COLLATE=en_GB.UTF-8 sort
-
!
\
bor@10:~> 

So in the POSIX locale this range includes “-”, while in en_GB.UTF-8 it does not.
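The same check can be made directly against the regex engine instead of sort. The sketch below uses a hypothetical helper, hyphen_in_range, that asks grep whether a lone "-" matches the range from "!" to "\" under a given locale. It assumes the named locale is installed; if it is not, the tools typically fall back to the C locale, and some implementations may reject the range outright, in which case the helper prints "no".

```shell
# Hypothetical helper: prints "yes" if '-' falls inside the range [!-\]
# under the locale named in $1, and "no" otherwise.
hyphen_in_range() {
    # Inside a bracket expression the backslash is an ordinary character,
    # so [!-\] is the range from '!' to '\'.
    if printf '%s\n' '-' | LC_ALL="$1" grep -q '^[!-\]$'; then
        echo yes
    else
        echo no
    fi
}

hyphen_in_range C
```

In the C locale this prints "yes". Running it with en_GB.UTF-8 as the argument shows how the range behaves on that particular system.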

As far as I know, that is how “sed” has always worked.

@avidjaar. Thanks for the explanation. When thinking about collating sequences I always restricted that to the alphabet, e.g. where the é sorts in French. I never realised that the order of non-alphanumeric characters could differ between locales as well.