Globally enable regular expressions, rather than partial matches, for options that support them. Currently these are UrlReplace, UrlExclude, UrlInclude, UrlSkip, ExcludeWords, IndexWords, IncludePages and ExcludePages. The contents of UrlExcludeFile, UrlIncludeFile and UrlSkipFile, which populate UrlExclude, UrlInclude and UrlSkip respectively, also comply with this directive.
Regular expressions can contain the following special control characters:
Table 1. RegExp Control Characters
| ^ | Beginning of the string. The expression ^A will match an A only at the beginning of the string. |
| ^ | The caret (^) immediately following the left-bracket ([) has a different meaning. It is used to exclude the remaining characters within brackets from matching the target string. The expression [^0-9] indicates that the target character should not be a digit. |
| $ | The dollar sign ($) will match the end of the string. The expression abc$ will match the sub-string abc only if it is at the end of the string. |
| | | The alternation character (|) allows the expression on either side of it to match the target string. The expression a|b will match a as well as b. |
| . | The dot (.) will match any character. |
| * | The asterisk (*) indicates that the character to the left of the asterisk in the expression should match 0 or more times. |
| + | The plus (+) is similar to the asterisk, but there must be at least one match of the character to the left of the + sign in the expression. |
| ? | The question mark (?) matches the character to its left 0 or 1 times. |
| () | Parentheses affect the order of pattern evaluation. |
| [ ] | Brackets ([ and ]) enclosing a set of characters indicate that any of the enclosed characters may match the target character. |
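The operators in Table 1 can be tried out in any engine with a similar syntax. The sketch below uses Python's `re` module as a stand-in. It is not Alkaline's own engine, so treat it as an illustration of the operators rather than of Alkaline's exact behavior. The sample strings are made up for the illustration.

```python
import re

# Python's re module stands in for Alkaline's engine here (an assumption).
# Each entry is (pattern, sample text, whether a match is expected).
checks = [
    (r"^A",     "Apple",  True),   # ^ anchors at the beginning of the string
    (r"^A",     "bAnana", False),
    (r"[^0-9]", "7",      False),  # [^...] excludes the listed characters
    (r"[^0-9]", "x",      True),
    (r"abc$",   "xxabc",  True),   # $ anchors at the end of the string
    (r"abc$",   "abcxx",  False),
    (r"a|b",    "b",      True),   # alternation: either side may match
    (r"ab*c",   "ac",     True),   # * allows zero occurrences of b
    (r"ab+c",   "ac",     False),  # + requires at least one b
    (r"ab?c",   "abbc",   False),  # ? allows at most one b
    (r"[abc]x", "cx",     True),   # any one of the bracketed characters
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), got in zip(checks, results):
    print(f"{pattern!r:12} on {text!r:10} -> {got} (expected {expected})")
```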
Parentheses, besides affecting the evaluation order of the regular expression, also serve as tagged expressions, which act like a temporary memory. This memory can be used when replacing the matched text with a replace expression. In a replace expression, the & character stands for the sub-string that was matched. So, if the sub-string that matched the regular expression is abcd, a replace expression of xyz&xyz will change it to xyzabcdxyz. The same replace expression can also be written as xyz\0xyz, where \0 is the tagged expression representing the entire sub-string that was matched. Similarly, other tagged expressions are represented by \1, \2, etc. Note that although tagged expression 0 is always defined, tagged expressions 1, 2, etc. are only defined if the regular expression used in the search has enough sets of parentheses. Here are a few examples:
Table 2. RegExp Examples
| String | Search | Replace | Result |
|---|---|---|---|
| Mr. | (Mr)(\.) | \1s\2 | Mrs. |
| abc | (a)b(c) | &-\1-\2 | abc-a-c |
| bcd | (a|b)c*d | &-\1 | bcd-b |
| abcde | (.*)c(.*) | &-\1-\2 | abcde-ab-de |
| cde | (ab|cd)e | &-\1 | cde-cd |
| foo bar STOP: lkasdfkjakjlf | ([0-9,A-Z,a-z,\ ]*)(STOP:)([0-9,A-Z,a-z,\ ]*) | \1\2 | foo bar STOP: |
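The rows of Table 2 can be reproduced outside Alkaline with a small helper. The sketch below is only an approximation: `alkaline_sub` is a hypothetical name, Python's `re` module stands in for Alkaline's engine, and replacing only the first match is an assumption. Python itself does not understand `&` or `\0` in a replacement, so the helper expands those references by hand.

```python
import re

def alkaline_sub(text, search, replace):
    # Hypothetical helper: expands &, \0, \1, \2, ... in `replace` the way the
    # text above describes, using Python's re module for the matching.
    def expand(match):
        def ref(token):
            if token.group(0) == "&":
                return match.group(0)          # & is the whole matched sub-string
            # \0 is the whole match, \1.. are the parenthesized groups;
            # a group that took no part in the match expands to nothing.
            return match.group(int(token.group(1))) or ""
        return re.sub(r"&|\\([0-9])", ref, replace)
    # Assumption: only the first match is replaced.
    return re.sub(search, expand, text, count=1)

# (string, search, replace, expected result) taken from Table 2.
rows = [
    ("Mr.",   r"(Mr)(\.)",   r"\1s\2",   "Mrs."),
    ("abc",   r"(a)b(c)",    r"&-\1-\2", "abc-a-c"),
    ("bcd",   r"(a|b)c*d",   r"&-\1",    "bcd-b"),
    ("abcde", r"(.*)c(.*)",  r"&-\1-\2", "abcde-ab-de"),
    ("cde",   r"(ab|cd)e",   r"&-\1",    "cde-cd"),
    ("foo bar STOP: lkasdfkjakjlf",
     r"([0-9,A-Z,a-z,\ ]*)(STOP:)([0-9,A-Z,a-z,\ ]*)", r"\1\2", "foo bar STOP:"),
]

for text, search, replace, expected in rows:
    print(repr(text), "->", repr(alkaline_sub(text, search, replace)))
```

Referring to a tagged expression that the search pattern does not define (for example `\3` with only two sets of parentheses) raises an `IndexError` in this sketch. How Alkaline handles that case is not covered here.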
Alkaline has command line parameters, such as rxmatch and rxrepl, for testing regular expressions. For more information, refer to the Testing Regular Expressions section.
Exclude the entire /bar section of http://www.foo.com and both words, Foo and foo. Also, replace www with ns in all URLs.
    RegExp=Y
    UrlExclude=http://www.foo.com/bar/.*
    ExcludeWords=foo/words.regexp
    UrlReplace (.*)(www)(.*)=\1ns\3
The words.regexp file contains:
    [Ff]oo
Exclude the entire /bar section of http://www.foo.com. The list of words is a plain word list, not a list of regular expressions, because here RegExp is applied only to the individual options it prefixes. Also, replace www with ns in all URLs.
    RegExp UrlExclude=http://www.foo.com/bar/.*
    RegExp UrlReplace (.*)(www)(.*)=\1ns\3
    ExcludeWords=foo/words.txt
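To see what the UrlExclude and UrlReplace expressions from these examples do to individual URLs, they can be approximated in Python. This is a sketch under several assumptions: Python's `re` module stands in for Alkaline's engine, patterns are allowed to match anywhere in the URL, exclusion is checked before replacement, and the two sample paths are made up. The function name `process` is hypothetical.

```python
import re

# Patterns copied from the examples above. The dots are unescaped, so they
# match any character, not just a literal dot.
URL_EXCLUDE = r"http://www.foo.com/bar/.*"
URL_REPLACE = (r"(.*)(www)(.*)", r"\1ns\3")

def process(url):
    # Hypothetical helper: return the rewritten URL, or None if it is excluded.
    # Checking exclusion before replacement is an assumption of this sketch.
    if re.search(URL_EXCLUDE, url):
        return None
    search, replace = URL_REPLACE
    return re.sub(search, replace, url)

urls = [
    "http://www.foo.com/index.html",     # kept, www rewritten to ns
    "http://www.foo.com/bar/page.html",  # falls under /bar, excluded
]
kept = [process(u) for u in urls]
for url, result in zip(urls, kept):
    print(url, "->", result)
```

Because the first `(.*)` is greedy, the expression rewrites the last occurrence of www in a URL that contains more than one.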
This feature was added in version 1.3 (02-Jul-2000). Regular expression replacements were added in version 1.3 (12-Jul-2000). Support for RegExp [Option] expressions was added in version 1.6.0824.0.
If you use an ExcludeWords directive with RegExp=Y, the default English dictionary supplied in the distribution cannot be used. It contains a .* expression which will exclude all possible word combinations.
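The effect of a stray .* entry can be illustrated with a few lines of Python. This is a sketch, not Alkaline's implementation: `excluded` is a hypothetical helper, Python's `re` module stands in for Alkaline's engine, and the sample words are made up.

```python
import re

def excluded(word, patterns):
    # Assumption: a word is excluded when any pattern matches it in full.
    return any(re.fullmatch(p, word) for p in patterns)

words = ["foo", "Foo", "search", "engine"]

# A targeted list excludes only the intended words.
targeted = [w for w in words if excluded(w, ["[Ff]oo"])]

# Adding .* makes every word match, so everything is excluded.
catch_all = [w for w in words if excluded(w, ["[Ff]oo", ".*"])]

print(targeted)
print(catch_all)
```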