- [^ ] - negates classes inside
- \d - digit, i.e. [0-9]; \D=[^\d] - negated \d
- \w - word characters, usually [A-Za-z] but maybe more, e.g. '_\d'; \W=[^\w]
- \s - whitespace characters, e.g. \s=[ \t\r\n]; \S=[^\s]
- . - matches all except new line. So on Unix, .=[^\n], Windows .=[^\r\n], Mac ... who knows? new line used to be marked with \r, but OS X is Unix, so I guess it's just \n now.
- {min,max} - specifies number of repetitions for the previous expression (char or ())
- + - repeat 1 or more times, greedy; +={1,}
- * - repeat 0 or more times, greedy; *={0,}
- Anchors:
- ^ - beginning, $ - end of the line
- \A - beginning, \Z - end of the string (possibly multiple lines)
- \b - word boundary
- (...)? - optional inside (), greedy
- ? - can turn greedy into lazy, e.g. +?, *?, ??
- Backreferences:
- ( ) - provide the reference anchor
- (?: ) - removes the reference from ( )
- \1, \2 ... - the usual access method for backreferences; e.g. for capturing HTML tag one can use something like: <([A-Z][A-Z0-9]*)[^>]*>.*?
- Python capturing group: (?P<
name>group), referenced with \1 or (?P=name) - Atomic grouping: (?>expression) there can not be backreferences inside expression.
Saturday, October 15, 2005
Regularni izrazi
Generalno postoje dve implementacije regularnih izraza, NFA (regex usmereni) i DFA (tekstualno usmereni). Najpopularnije implementacije su regex usmerene izmedju ostalog i zato sto dozvoljavaju lenja poredjenja i reference unazad. Ovo stvarno nece da radi... Ukratko, ovo je moja stranica da se podsetim sintakse. Zbog toga sto postoje razlicite implementacije, ove definicije su onako, od prilike i izraze treba prvo probati.
Subscribe to:
Posts (Atom)