RegExp

Name

RegExp — enable regular expressions

Synopsis

RegExp = Y / N

Description

Globally enables regular expressions, rather than partial matches, for the options that support them. Currently these are UrlReplace, UrlExclude, UrlInclude, UrlSkip, ExcludeWords, IndexWords, IncludePages and ExcludePages. The contents of UrlExcludeFile, UrlIncludeFile and UrlSkipFile, which populate UrlExclude, UrlInclude and UrlSkip respectively, are also subject to this directive.
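
The difference between the two matching modes can be sketched outside Alkaline. The snippet below is only an illustration in Python: it assumes that a "partial match" means plain substring containment and uses Python's re module as a stand-in for Alkaline's regular expression engine, which may behave differently. The names partial_match and regexp_match are hypothetical.

```python
import re

URL = "http://www.foo.com/bar/index.html"

def partial_match(value, url):
    # Assumption: a "partial match" is plain substring containment.
    return value in url

def regexp_match(value, url):
    # Python's re module stands in for Alkaline's engine here.
    return re.search(value, url) is not None

# A literal fragment behaves the same way in both modes.
print(partial_match("/bar/", URL), regexp_match("/bar/", URL))
# A bracket expression is only meaningful when treated as a regular expression.
print(partial_match("/ba[rz]/", URL), regexp_match("/ba[rz]/", URL))
```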

Regular expressions can contain the following special control characters:

Table 1. RegExp Control Characters
^ Beginning of the string. The expression ^A matches an A only at the beginning of the string.
^ A caret (^) immediately following a left bracket ([) has a different meaning: it excludes the remaining characters within the brackets from matching the target character. The expression [^0-9] indicates that the target character must not be a digit.
$ The dollar sign ($) matches the end of the string. The expression abc$ matches the sub-string abc only if it is at the end of the string.
| The alternation character (|) allows the expression on either side of it to match the target string. The expression a|b matches a as well as b.
. The dot (.) matches any character.
* The asterisk (*) indicates that the character to its left should match 0 or more times.
+ The plus sign (+) is similar to the asterisk, but the character to its left must match at least once.
? The question mark (?) matches the character to its left 0 or 1 times.
() Parentheses affect the order of pattern evaluation.
[ ] Brackets ([ and ]) enclosing a set of characters indicate that any of the enclosed characters may match the target character.
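
The control characters in Table 1 can be tried out with a small script. This is a minimal sketch that uses Python's re module as a stand-in, since these characters have the same meaning there; Alkaline's own engine is not used, and the helper name check is hypothetical.

```python
import re

def check(pattern, text):
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected result), one or two cases per control character.
CASES = [
    (r"^A", "Abc", True),        # ^ : beginning of the string
    (r"^A", "bAc", False),
    (r"[^0-9]", "x", True),      # [^...] : any character not listed
    (r"[^0-9]", "7", False),
    (r"abc$", "xxabc", True),    # $ : end of the string
    (r"abc$", "abcxx", False),
    (r"^(a|b)$", "b", True),     # | : alternation, () : grouping
    (r"^a.c$", "axc", True),     # . : any character
    (r"^ab*c$", "ac", True),     # * : zero or more
    (r"^ab+c$", "ac", False),    # + : one or more
    (r"^ab?c$", "abbc", False),  # ? : zero or one
    (r"^[abc]$", "b", True),     # [...] : any enclosed character
]

for pattern, text, expected in CASES:
    assert check(pattern, text) == expected, (pattern, text)
print("all cases behave as described")
```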

Besides affecting the evaluation order of the regular expression, parentheses also define tagged expressions, which act as a temporary memory. This memory can be used when replacing the matched text with a replace expression. In a replace expression, the & character represents the entire sub-string that was matched. So, if the sub-string that matched the regular expression is abcd, a replace expression of xyz&xyz changes it to xyzabcdxyz. The same replace expression can also be written as xyz\0xyz, where \0 is the tagged expression representing the entire matched sub-string. Other tagged expressions are written \1, \2, etc. Note that although tagged expression 0 is always defined, tagged expressions 1, 2, etc. are only defined if the regular expression used in the search has enough sets of parentheses. Here are a few examples:

Table 2. RegExp Examples
String                       Search                                         Replace   Result
Mr.                          (Mr)(\.)                                       \1s\2     Mrs.
abc                          (a)b(c)                                        &-\1-\2   abc-a-c
bcd                          (a|b)c*d                                       &-\1      bcd-b
abcde                        (.*)c(.*)                                      &-\1-\2   abcde-ab-de
cde                          (ab|cd)e                                       &-\1      cde-cd
foo bar STOP: lkasdfkjakjlf  ([0-9,A-Z,a-z,\ ]*)(STOP:)([0-9,A-Z,a-z,\ ]*)  \1\2      foo bar STOP:
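
The replacements in Table 2 can be approximated outside Alkaline. The sketch below uses Python's re module as a stand-in, not Alkaline's engine. Python writes the whole match as \g<0> rather than & or \0, so a hypothetical helper, to_python_template, translates the replace expression first. It assumes & is never meant literally, because escaping rules are not described here. The function name rx_replace is also hypothetical.

```python
import re

def to_python_template(replace_expr):
    # '&' and '\0' both stand for the entire matched sub-string.
    # Assumption: '&' is never meant as a literal ampersand.
    return replace_expr.replace("&", r"\g<0>").replace(r"\0", r"\g<0>")

def rx_replace(string, search, replace_expr):
    # Replace the first match only; \1, \2, ... keep their meaning in Python.
    return re.sub(search, to_python_template(replace_expr), string, count=1)

# (string, search, replace, expected result), taken from Table 2.
ROWS = [
    ("Mr.", r"(Mr)(\.)", r"\1s\2", "Mrs."),
    ("abc", r"(a)b(c)", r"&-\1-\2", "abc-a-c"),
    ("bcd", r"(a|b)c*d", r"&-\1", "bcd-b"),
    ("abcde", r"(.*)c(.*)", r"&-\1-\2", "abcde-ab-de"),
    ("cde", r"(ab|cd)e", r"&-\1", "cde-cd"),
    ("foo bar STOP: lkasdfkjakjlf",
     r"([0-9,A-Z,a-z,\ ]*)(STOP:)([0-9,A-Z,a-z,\ ]*)", r"\1\2",
     "foo bar STOP:"),
]

for string, search, replace_expr, expected in ROWS:
    assert rx_replace(string, search, replace_expr) == expected, string
print("all rows reproduce the Result column")
```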

Alkaline provides command line parameters, such as rxmatch and rxrepl, for testing regular expressions. For more information, refer to the Testing Regular Expressions section.

Example of Global RegExp Option

Exclude the entire /bar section of http://www.foo.com and both words Foo and foo. Also replace www with ns in all URLs.
RegExp=Y
UrlExclude=http://www.foo.com/bar/.*
ExcludeWords=foo/words.regexp
UrlReplace (.*)(www)(.*)=\1ns\3

The words.regexp file contains:
[Ff]oo
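
The effect of the expressions in this example can be previewed with a short script. This is only a sketch: Python's re module stands in for Alkaline's engine, the function names are hypothetical, and it assumes that a word is excluded only when the whole word matches, which is not stated in this section. Note that the dots in www.foo.com are unescaped, so they match any character.

```python
import re

URL_EXCLUDE = r"http://www.foo.com/bar/.*"
URL_REPLACE = (r"(.*)(www)(.*)", r"\1ns\3")
EXCLUDE_WORDS = [r"[Ff]oo"]  # contents of words.regexp

def is_excluded_url(url):
    return re.search(URL_EXCLUDE, url) is not None

def rewrite_url(url):
    search, replace = URL_REPLACE
    return re.sub(search, replace, url, count=1)

def is_excluded_word(word):
    # Assumption: the expression must match the whole word.
    return any(re.fullmatch(p, word) for p in EXCLUDE_WORDS)

print(is_excluded_url("http://www.foo.com/bar/index.html"))  # under /bar
print(is_excluded_url("http://www.foo.com/baz/index.html"))  # not under /bar
print(rewrite_url("http://www.foo.com/index.html"))
print([w for w in ["Foo", "foo", "food", "FOO"] if is_excluded_word(w)])
```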

Example of Scoped RegExp Option

Exclude the entire /bar section of http://www.foo.com. The list of words is not a list of regular expressions. Also replace www with ns in all URLs.
RegExp UrlExclude=http://www.foo.com/bar/.*
RegExp UrlReplace (.*)(www)(.*)=\1ns\3
ExcludeWords=foo/words.txt

Notes

This feature was added in version 1.3 (02-Jul-2000). Regular expression replacements were added in version 1.3 (12-Jul-2000). Support for scoped RegExp [Option] expressions was added in version 1.6.0824.0.

If you use an ExcludeWords directive with RegExp=Y, the default English dictionary supplied in the distribution cannot be used: it contains a .* expression, which would exclude every possible word.
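
Why a .* entry is a problem can be seen in a few lines. This sketch uses Python's re module as a stand-in for Alkaline's engine; the function name excluded and the sample words are hypothetical.

```python
import re

def excluded(word, patterns):
    # A word is excluded when any expression matches it in full.
    return any(re.fullmatch(p, word) for p in patterns)

SAMPLE_WORDS = ["alkaline", "search", "engine", "x"]

# With a .* entry in the list, every word is excluded.
print([w for w in SAMPLE_WORDS if excluded(w, [r".*"])])
# With a narrower expression, only the intended words are excluded.
print([w for w in SAMPLE_WORDS if excluded(w, [r"[Ee]ngine"])])
```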