Utorkaranje Snikolu: Regular Expressions

Broadly, there are two kinds of regular expression implementations: NFA (regex-directed) and DFA (text-directed). The most popular implementations are regex-directed, among other reasons because they allow lazy matching and backreferences. This really isn't going to work... In short, this is my page for reminding myself of the syntax. Because implementations differ, these definitions are only approximate, and every expression should be tested first.

[^ ] - negated character class: matches any single character not listed inside
\d - digit, i.e. [0-9]; \D=[^\d] - negated \d
\w - word characters, usually [A-Za-z0-9_] but maybe more (e.g. non-ASCII letters in Unicode-aware engines); \W=[^\w]
\s - whitespace characters, e.g. \s=[ \t\r\n]; \S=[^\s]
. - matches any character except a newline. What counts as a newline depends on the tool: Unix tools use .=[^\n], Windows tools often use .=[^\r\n], Mac ... who knows? New lines used to be marked with \r, but OS X is Unix, so I guess it's just \n now.
{min,max} - specifies the number of repetitions of the preceding element (a character, a class or a ( ) group), greedy
+ - repeat 1 or more times, greedy; +={1,}
* - repeat 0 or more times, greedy; *={0,}
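
To try the classes and quantifiers above, here is a minimal sketch using Python's re module. The sample strings and variable names are made up for illustration, and other engines may treat \w or . differently, so this only shows Python's behavior.

```python
import re

# \d matches one digit; + repeats it one or more times (greedy).
numbers = re.findall(r"\d+", "Order 66: 3 apples, 12 pears")

# In Python 3 str patterns, \w covers letters, digits and underscore.
words = re.findall(r"\w+", "pears_ok 42!")

# [^ ] negates the class: runs of characters that are neither digits nor whitespace.
non_digit_runs = re.findall(r"[^\d\s]+", "a1 b22")

# {min,max} bounds the repetition count; being greedy, it takes up to max.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

# . stops at "\n" by default; re.DOTALL lets it match newlines too.
dot_default = re.findall(r".+", "ab\ncd")
dot_all = re.findall(r".+", "ab\ncd", flags=re.DOTALL)

print(numbers, words, non_digit_runs, bounded, dot_default, dot_all)
```

Note that "4444" yields only "444": the greedy {2,3} takes three digits and the single digit left over is too short to match again.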
Anchors:

^ - beginning of the line, $ - end of the line

\A - beginning of the string, \Z - end of the string (the string may span multiple lines)
\b - word boundary
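
The anchors can be compared side by side with a short Python sketch. It assumes Python's re module, where ^ and $ only become per-line anchors when re.MULTILINE is passed; the sample text is made up, and other engines have different defaults.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the start of the whole string.
start_default = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
start_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \A and \Z always refer to the whole string, even in multiline mode.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)
string_end = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

# \b matches between a word character and a non-word character,
# so "lines" and "inline" are skipped here.
whole_words = re.findall(r"\bline\b", "line lines inline line")

print(start_default, start_multiline, end_multiline)
print(string_start, string_end, whole_words)
```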

(...)? - makes the group inside () optional, greedy
? - placed after a quantifier, turns it from greedy into lazy, e.g. +?, *?, ??
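
The difference between greedy and lazy matching is easiest to see on a concrete string. The following is a small sketch in Python; the HTML snippet and the month pattern are made-up examples, not taken from any particular source.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last </b> it can find.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy .*? stops at the first </b> that lets the match succeed.
lazy = re.findall(r"<b>.*?</b>", html)

# +? takes as few repetitions as possible, here a single digit.
lazy_plus = re.match(r"\d+?", "123").group()

# ? prefers to match the optional item; ?? prefers to skip it.
optional_greedy = re.match(r"a?", "a").group()
optional_lazy = re.match(r"a??", "a").group()

# (...)? makes the whole group optional.
month_short = re.fullmatch(r"Nov(ember)?", "Nov")
month_long = re.fullmatch(r"Nov(ember)?", "November")

print(greedy, lazy, lazy_plus, repr(optional_greedy), repr(optional_lazy))
print(month_short.group(1), month_long.group(1))
```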
Backreferences:

( ) - capturing group; stores what it matched so it can be referenced later
(?: ) - non-capturing group; groups like ( ) but stores no reference
\1, \2 ... - the usual access method for backreferences; e.g. for capturing an HTML tag together with its closing tag one can use something like: <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
Python named capturing group: (?P<name>group), referenced by its number (e.g. \1 if it is the first group) or with (?P=name)
Atomic grouping: (?>expression) - once the expression has matched, the engine will not backtrack into it; the group does not capture.
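
Here is a hedged Python sketch of the grouping constructs above. The tag pattern assumes the backreference form that closes with </\1>, and the sample strings are invented. Atomic groups are only accepted by Python's re module from version 3.11, so that part sits behind a version check.

```python
import re
import sys

# \1 must repeat whatever group 1 captured, so the closing tag has to match the opening one.
tag = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)
m = tag.search('<p class="x">hi <b>there</b></p>')
tag_name = m.group(1)
tag_text = m.group(0)

# A named group can be referenced by name inside the pattern and in the result.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it is is fine")
doubled_word = doubled.group("word")

# (?: ) groups without capturing, so only (c) shows up in groups().
only_captures = re.search(r"(?:ab)+(c)", "ababc").groups()

# Atomic group: after (?>bc|b) has taken "bc", the engine may not go back
# and try "b", so the trailing c has nothing left to match.
atomic = plain = None
if sys.version_info >= (3, 11):
    atomic = re.search(r"a(?>bc|b)c", "abc")
    plain = re.search(r"a(?:bc|b)c", "abc")

print(tag_name, tag_text, doubled_word, only_captures, atomic, plain)
```

The ordinary group in the last comparison does match "abc", because it is allowed to backtrack from "bc" to "b".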


Saturday, October 15, 2005
