Patterns¶
Unicon Pattern data¶
SNOBOL patterns¶
Unicon version 13 alpha has SNOBOL-inspired pattern matching. New
functions and operators were added to emulate the very powerful and
well-studied SNOBOL pattern matching features. This augments String
scanning quite nicely. These features introduce a new datatype, pattern.
Details are in Technical Report UTR18a, http://unicon.org/utr/utr18.pdf.
SNOBOL is still relevant to many developers and SNOBOL4 implementations have been made freely available, thanks in large part to Catspaw Inc.
There is also a very comprehensive tutorial hosted at http://www.snobol4.org.
Chapter 4 of the tutorial is about Pattern Matching.
http://www.snobol4.org/docs/burks/tutorial/ch4.htm
This is a conversion of the small program listed in section 4.7 of that page, with some changes to add a test pass and to output results:
#
# snobols.icn, SNOBOL like patterns
#
procedure main()
    # From http://www.snobol4.org/docs/burks/tutorial/ch4.htm
    # SNOBOL code
    #   (('B' | 'F' | 'N') . FIRST 'EA' ('R' | 'T') . LAST) . WORD
    #
    # matches 'BEAR', 'FEAR', 'NEAR', 'BEAT', 'FEAT', or 'NEAT',
    # assigning the first letter matched to FIRST,
    # the last letter to LAST, and the entire result to WORD.

    # Unicon version, with test strings, and addition of cursor
    # position capture. BEAD expected to fail.
    every str := !["BEAR", "FEAR", "NEAR",
                   "BEAT", "FEAT", "NEAT", "BEAD"] do {
        writes("subject: ", str, " ")
        if str ?? .> p1 || (("B" .| "F" .| "N") -> first || "EA" ||
                  .> p2 || ("R" .| "T") -> last) -> word then
            write("first: ", first, ";", p1, ", last: ", last, ";", p2,
                  ", word: ", word)
        else
            write("did not match")
    }
end
subject: BEAR first: B;1, last: R;4, word: BEAR
subject: FEAR first: F;1, last: R;4, word: FEAR
subject: NEAR first: N;1, last: R;4, word: NEAR
subject: BEAT first: B;1, last: T;4, word: BEAT
subject: FEAT first: F;1, last: T;4, word: FEAT
subject: NEAT first: N;1, last: T;4, word: NEAT
subject: BEAD did not match
Clinton Jeffery, along with Sudarshan Gaikaiwari and John Goettsche,
carefully designed this feature set to be an almost one-to-one
correspondence to SNOBOL patterns. It provides a highly viable path for
porting old, beloved SNOBOL programs to Unicon.
Unicon currently lacks the full eval potential of SNOBOL but
ameliorates that downside, somewhat, by allowing invocation of functions and
methods along with variable and field references inside patterns.
Internals¶
To see a little bit of how the implementation actually works, let’s take a
look at the preprocessor output. The listing below has extra blank lines
squeezed out (cat -s) and is reformatted (fmt). This is only for human
curiosity; the listing below is not the version sent to the compiler.
prompt$ unicon -s -E snobols.icn | cat -s | fmt
#line 0 "/tmp/uni89582078" #line 0 "snobols.icn"
procedure main();
every str := !["BEAR", "FEAR", "NEAR",
"BEAT", "FEAT", "NEAT", "BEAD"] do {
writes("subject: ", str, " "); if( "" ? pattern_match(
str,pattern_setcur( "p1",p1 ) || pattern_assign_onmatch(
( pattern_assign_onmatch( ( pattern_alternate(
pattern_alternate( "B", "F" ) , "N" ) ), "first",first )
||"EA"||pattern_setcur(
"p2",p2 ) || pattern_assign_onmatch( (
pattern_alternate( "R", "T" ) ), "last",last )
),"word",word ) )) then
write("first: ", first, ";", p1, ", last: ", last, ";", p2,
", word: ", word)
else
write("did not match")
};
end
Nice. The SNOBOL operators are actually a new class of functions.
I talked with Clinton about this, and for now, those functions are for compiler internal use only. Much smarter, and cleaner, to use the operators.
Regular expressions¶
When SNOBOL patterns were added to Unicon, regular expression features were
also added. This means Unicon has the power of String Scanning,
SNOBOL patterns and regular expressions available. And all three
features can be freely mixed in string manipulation expressions. Raising the
bar.
Regular expression literals are surrounded by angle brackets, not quotes.
Pattern matching uses a ?? operator. As of early Unicon release 13,
regular expressions are limited to basic regex patterns.
#
# hello-regex
#
procedure main()
    L := ["Hello?", "Hello, world", "helloworld", "Hello World!", "World"]
    every write(!L ?? <[hH]ello","?[ \t]*[wW]orld"!"?>)
end
Displays a message when the subject includes some form of Hello, world.
In the example, the first and last elements of the string list do not match.
The regular expression looks for Hello with or without a capital H, an
optional comma, any number of spaces or tabs (including zero), followed by
World (or world), with an optional exclamation mark.
Hello, world
helloworld
Hello World!
Pattern operators¶
??
- a variant form of string scanning, s ?? p matching a pattern, not a general Unicon expression as with ? scanning. Unanchored.
=p
- anchored match of pattern, p.
.|
- a pattern alternation. Accepts Unicon expressions as an operand.
->
- conditional assignment.
=>
- immediate assignment (regardless of an actual successful match result).
.>
- cursor position assignment.
<r>
- a regular expression literal is surrounded in angle brackets (chevrons).
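As a further illustration of the operators listed above, here is a minimal sketch in the style of snobols.icn. The file name, the test strings, and the variable names start and color are made up for this example, and it has not been checked against a particular Unicon build. It relies on ?? being unanchored to find a color word anywhere in the subject, records the cursor with .> and captures the matched alternative with ->.

```unicon
#
# colors.icn, hypothetical sketch: unanchored match with captures
#
procedure main()
    every s := !["a red door", "the blue sky", "grey walls"] do {
        writes("subject: ", s, " ")
        # .> records the cursor position where the alternation begins;
        # -> assigns the matched alternative when the match succeeds
        if s ?? .> start || ("red" .| "blue" .| "green") -> color then
            write("color: ", color, ", cursor: ", start)
        else
            write("did not match")
    }
end
```

The last subject contains none of the alternatives, so the else branch is expected to run for it.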
Regex syntax¶
Regular expressions can include the following components:
r
- ordinary symbol that matches to r.
r1 r2
- juxtaposition is concatenation.
r1 | r2
- regular expression alternate (not a generator).
r*
- match zero or more occurrences of r.
r+
- match one or more occurrences of r.
r?
- match zero or one occurrences of r.
r{n}
- braces surround an integer count, match n occurrences of r.
"lit"
- match the literal string, with the usual escapes allowed.
'lit'
- cset literal matching any one character of the set, escapes allowed.
[chars]
- cset literal with dash range syntax.
.
- match any character except newline.
(r)
- parentheses are used for grouping.
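Several of these components can be combined in one literal. The following is a minimal sketch, not taken from the Unicon distribution: the file name and test strings are invented, and it assumes the {n} count and the quoted literal form behave as listed above in the Unicon build at hand. It looks for three digits, a literal dash, and four digits anywhere in the subject.

```unicon
#
# phone-regex.icn, hypothetical sketch of a regular expression literal
#
procedure main()
    L := ["call 555-1234 now", "5551234", "ext. 12-3456"]
    # [0-9]{3} is three digits, "-" is a quoted literal dash,
    # [0-9]{4} is four digits
    every s := !L do
        if s ?? <[0-9]{3}"-"[0-9]{4}> then
            write(s, ": matched")
        else
            write(s, ": did not match")
end
```

Only the first subject contains the full digits-dash-digits shape. The other two are there to exercise the failure branch.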