Saturday, October 15, 2005

Regular expressions

Broadly, there are two kinds of regular expression engines: NFA (regex-directed) and DFA (text-directed). The most popular implementations are regex-directed, partly because they allow lazy matching and backreferences. This really isn't going to work... In short, this is my page for reminding myself of the syntax. Because the implementations differ, these definitions are only approximate, and any expression should be tried out first.
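Since behaviour differs between engines, a quick way to try an expression is to run it against a few sample strings. Below is a minimal sketch using Python's `re` module, which is a regex-directed (backtracking) engine. The helper name `try_pattern` and the sample string are made up for illustration.

```python
import re

def try_pattern(pattern, samples, flags=0):
    """Return {sample: first matched text, or None} for each sample string."""
    compiled = re.compile(pattern, flags)
    results = {}
    for text in samples:
        match = compiled.search(text)
        results[text] = match.group(0) if match else None
    return results

# Greedy vs. lazy quantifier on the same input.
greedy = try_pattern(r"<.+>", ["<b>bold</b>"])
lazy = try_pattern(r"<.+?>", ["<b>bold</b>"])
print(greedy)
print(lazy)
```

The greedy version swallows the whole string, while the lazy one stops at the first closing bracket.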

  • [^ ] - negated character class: matches any single character not listed inside
  • \d - digit, i.e. [0-9]; \D=[^\d] - negated \d
  • \w - word characters, usually [A-Za-z0-9_], sometimes more (e.g. Unicode letters); \W=[^\w]
  • \s - whitespace characters, e.g. \s=[ \t\r\n]; \S=[^\s]
  • . - matches any character except a new line. In most engines .=[^\n] regardless of the operating system, though some also exclude \r. Classic Mac OS marked new lines with \r, but OS X is Unix and uses \n.
  • {min,max} - specifies the number of repetitions for the previous expression (a character, a class or a ( ) group)
  • + - repeat 1 or more times, greedy; +={1,}
  • * - repeat 0 or more times, greedy; *={0,}
  • Anchors:
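    • ^ - beginning, $ - end of the line (in multi-line mode; otherwise they usually anchor to the whole string)
    • \A - beginning, \Z - end of the string (even if it contains multiple lines); in some flavors \Z also matches just before a final new line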
    • \b - word boundary
  • (...)? - makes the group inside () optional, greedy; ?={0,1}
  • ? - placed after a quantifier, turns it from greedy into lazy, e.g. +?, *?, ??
  • Backreferences:
    • ( ) - capturing group: remembers what it matched so it can be referenced later
    • (?: ) - non-capturing group: groups like ( ) but creates no reference
    • \1, \2 ... - the usual access method for backreferences; e.g. to match an HTML element together with its closing tag one can use something like: <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
    • Python named capturing group: (?P<name>group), referenced with its number (e.g. \1) or with (?P=name)
    • Atomic grouping: (?>expression) - once expression has matched, the engine will not backtrack into it; the group itself is non-capturing.
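To check several of the items above in one go, here is a rough sketch in Python's `re` module. The sample strings and variable names are invented for illustration. The tag pattern is written with a closing `</\1>` so that the backreference has something to do. Atomic groups are left out because `re` only accepts `(?>...)` from Python 3.11 on.

```python
import re

# Negated class and {min,max} repetition.
fields = re.findall(r"[^\s,]+", "a, b,c")
numbers = re.findall(r"\d{2,4}", "7 42 2005 123456")

# ^ anchors at every line only with re.MULTILINE; \A always means start of string.
text = "one two\nthree four"
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# Optional group: (?:ember)? is greedy, so the longer form wins when present.
months = re.findall(r"Nov(?:ember)?", "Nov November")

# (?: ) groups without capturing, so only (c) is numbered.
only_capture = re.search(r"(?:ab)+(c)", "ababc").groups()

# \1 must repeat whatever group 1 matched (the tag name).
tag = re.search(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", "<B>bold</B> <I>it</I>")

# Named group, referenced by name inside the pattern.
doubled = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine")

print(fields, numbers, line_starts, string_start, whole_words, months)
print(only_capture, tag.group(0), tag.group(1), doubled.group("word"))
```

Other engines may disagree on details such as what \w covers or how \Z treats a trailing new line, so the same checks are worth repeating in whatever engine is actually being used.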