Patterns

Unicon Pattern data

SNOBOL patterns

Unicon version 13 alpha adds SNOBOL-inspired pattern matching. New functions and operators emulate the very powerful and well-studied SNOBOL pattern matching features, and they augment String scanning quite nicely. These features introduce a new datatype, pattern.

Details are in Technical Report UTR18a, http://unicon.org/utr/utr18.pdf.

SNOBOL is still relevant to many developers and SNOBOL4 implementations have been made freely available, thanks in large part to Catspaw Inc.

There is also a very comprehensive tutorial hosted at http://www.snobol4.org.

Chapter 4 of the tutorial is about Pattern Matching.

http://www.snobol4.org/docs/burks/tutorial/ch4.htm

The following is a conversion of the small program listed in section 4.7 of that page, changed to loop over a list of test strings and to output the results:

#
# snobols.icn, SNOBOL like patterns
#
procedure main()

   # From http://www.snobol4.org/docs/burks/tutorial/ch4.htm
   # SNOBOL code
   # (('B' | 'F' | 'N') . FIRST 'EA' ('R' | 'T') . LAST) . WORD
   #
   # matches 'BEAR', 'FEAR', 'NEAR', 'BEAT', 'FEAT', or 'NEAT',
   # assigning the first letter matched to FIRST,
   # the last letter to LAST, and the entire result to WORD. 

   # Unicon version, with test strings, and addition of cursor 
   # position capture. BEAD expected to fail.
   every str := !["BEAR", "FEAR", "NEAR",
                  "BEAT", "FEAT", "NEAT", "BEAD"] do {
       writes("subject: ", str, " ")
       if str ?? .> p1 || (("B" .| "F" .| "N") -> first || "EA" ||
                 .> p2 || ("R" .| "T") -> last) -> word then
           write("first: ", first, ";", p1, ", last: ", last, ";", p2,
                 ", word: ", word)
       else
           write("did not match")
   }
end

examples/snobols.icn

subject: BEAR first: B;1, last: R;4, word: BEAR
subject: FEAR first: F;1, last: R;4, word: FEAR
subject: NEAR first: N;1, last: R;4, word: NEAR
subject: BEAT first: B;1, last: T;4, word: BEAT
subject: FEAT first: F;1, last: T;4, word: FEAT
subject: NEAT first: N;1, last: T;4, word: NEAT
subject: BEAD did not match

Clinton Jeffery, along with Sudarshan Gaikaiwari and John Goettsche, carefully designed this feature set to be an almost one-to-one correspondence to SNOBOL patterns. It provides a highly viable path for porting old, beloved SNOBOL programs to Unicon.

Unicon currently lacks the full eval potential of SNOBOL, but it ameliorates that downside somewhat by allowing invocation of functions and methods, along with variable and field references, inside patterns.

Internals

To see a little bit of how the implementation actually works, let’s take a look at the preprocessor output. The listing below has runs of blank lines squeezed out by cat -s, and is reflowed by fmt. This is only for human curiosity; the listing below is not the exact version sent to the compiler.

prompt$ unicon -s -E snobols.icn | cat -s | fmt
#line 0 "/tmp/uni89582078" #line 0 "snobols.icn"

procedure main();

   every str := !["BEAR", "FEAR", "NEAR",
                  "BEAT", "FEAT", "NEAT", "BEAD"] do {
       writes("subject: ", str, " "); if( "" ? pattern_match(
       str,pattern_setcur(        "p1",p1 ) || pattern_assign_onmatch(
       ( pattern_assign_onmatch( ( pattern_alternate(
       pattern_alternate( "B",  "F" ) ,    "N" ) ),    "first",first )
       ||"EA"||pattern_setcur(
                    "p2",p2 ) || pattern_assign_onmatch( (
                    pattern_alternate( "R",  "T" ) ),    "last",last )
                    ),"word",word ) ))  then
           write("first: ", first, ";", p1, ", last: ", last, ";", p2,
                 ", word: ", word)
       else
           write("did not match")
   };
end

Nice. The SNOBOL-style operators are actually rewritten by the preprocessor into calls to a new class of functions.

I talked with Clinton about this, and for now, those functions are for compiler internal use only. Much smarter, and cleaner, to use the operators.

Regular expressions

When SNOBOL patterns were added to Unicon, regular expression features were also added. This means Unicon has the power of String Scanning, SNOBOL patterns, and regular expressions available, and all three features can be freely mixed in string manipulation expressions. Raising the bar.

Regular expression literals are surrounded by angle brackets, not quotes. Pattern matching uses a ?? operator. As of early Unicon release 13, regular expressions are limited to basic regex patterns.

#
# hello-regex
#
procedure main()
    L := ["Hello?", "Hello, world", "helloworld", "Hello World!", "World"]
    every write(!L ?? <[hH]ello","?[ \t]*[wW]orld"!"?>)
end

This program writes the matched text for each subject that includes some form of Hello, world. In the example, the first and last elements of the list do not match, so nothing is written for them. The regular expression looks for Hello with or without a capital H, an optional comma, any number of spaces or tabs (including zero), followed by World (or world), with an optional exclamation mark.

Hello, world
helloworld
Hello World!

Pattern operators

  • ?? - a variant form of string scanning: s ?? p matches the subject s against a pattern p, rather than evaluating a general Unicon expression as with ? scanning. The match is unanchored.
  • =p - anchored match of pattern, p.
  • .| - a pattern alternation. Accepts Unicon expressions as an operand.
  • -> - conditional assignment, performed only if the overall match succeeds.
  • => - immediate assignment, performed as soon as the operand matches (regardless of an actual successful match result).
  • .> - cursor position assignment.
  • <r> - a regular expression literal is surrounded by angle brackets (chevrons).
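
As a minimal sketch of several of these operators working together, the listing below reuses the pattern from snobols.icn on subjects where the word does not start at the first character, which exercises the unanchored behaviour of ??. The file name, the procedure name find_word, the variable start, and the test subjects are assumptions made for illustration; they are not part of the original program.

```unicon
#
# unanchored.icn, a sketch of ?? matching anywhere in the subject
# (find_word and the test subjects are hypothetical, for illustration)
#
procedure find_word(str)
   # .> start records the cursor position where the match begins;
   # -> assigns first, last and word only if the whole pattern matches
   if str ?? .> start || (("B" .| "F" .| "N") -> first || "EA" ||
             ("R" .| "T") -> last) -> word then
      return "start: " || start || ", first: " || first ||
             ", last: " || last || ", word: " || word
   # falling off the end of the procedure fails, signalling no match
end

procedure main()
   every str := !["A BEAR", "NO FEAT", "BEAD"] do
      write("subject: ", str, " ", find_word(str) | "did not match")
end
```

Wrapping the match in a procedure that fails when the pattern fails lets the caller fall back with ordinary Unicon alternation, as main does with | "did not match".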

Regex syntax

Regular expressions can include the following components:

  • r - an ordinary symbol r matches itself.
  • r1 r2 - juxtaposition is concatenation.
  • r1 | r2 - regular expression alternation (not a generator).
  • r* - match zero or more occurrences of r.
  • r+ - match one or more occurrences of r.
  • r? - match zero or one occurrences of r.
  • r{n} - braces surround an integer count, match n occurrences of r.
  • "lit" - match the literal string, with the usual escapes allowed.
  • 'lit' - cset literal matching any one character of the set, escapes allowed.
  • [chars] - cset literal with dash range syntax.
  • . - match any character except newline.
  • (r) - parentheses are used for grouping.
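
A small sketch of two of these components, a cset with dash range syntax and the + repetition, might look like the following. It follows the same shape as the hello-regex program; the file name, the procedure name first_number, and the test subjects are assumptions for illustration only, and the listing has not been checked against a particular Unicon release.

```unicon
#
# regex-digits.icn, a sketch using a cset range and the + repetition
# (first_number and the test subjects are hypothetical, for illustration)
#
procedure first_number(s)
   # [0-9] is a cset with dash range syntax; + asks for one or more.
   # The unanchored ?? match fails if the subject holds no digits,
   # and that failure is passed on to the caller by return.
   return (s ?? <[0-9]+>)
end

procedure main()
   every s := !["order 42", "Unicon 13", "no digits"] do
      write(s, ": ", first_number(s) | "no match")
end
```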
