
Chapter 16 - Miscellaneous Routines

UNIX Systems Programming for SVR4
David A. Curry
 Copyright © 1996 O'Reilly & Associates, Inc.

Pattern Matching
Most of the UNIX shells and text editors allow the user to use wildcard characters to match a large set of items. For example, a* matches all filenames that begin with a in the shell, * being a wildcard. More sophisticated ways to abbreviate a search string are also supported. For example, ^whi[lnt]e.*sleeping$ matches all lines that begin with while, whine, or white and end in sleeping in a text editor.
The code that performs this type of matching is fairly complex, and would be difficult to reproduce each time a program needed these facilities. For this reason, library routines that implement these functions are provided.
Shell Pattern Matching
Pattern matching in the shell, also called globbing, is used primarily to generate lists of filenames. In a shell pattern, the following characters have special meaning:
*
Matches any string, including the null string.
?
Matches any single character.
[]
Matches any one of the enclosed characters. Two characters separated by - match any one character lexically between the two characters (e.g., [a-z] matches any of the characters a through z). If the first character after the [ is !, then this matches any character except one of the enclosed characters.
These special characters, also called metacharacters, may be escaped with a backslash; e.g., \? matches the actual question mark character.
The gmatch function is used to perform shell pattern matching in a program. This function is contained in the libgen library; link with -lgen to use it:
    #include <libgen.h>
    int gmatch(const char *str, const char *pattern);
The gmatch function returns non-zero if the shell pattern in pattern matches the string contained in str; it returns 0 if they do not match. The additional pattern matching characters provided by the C shell, most notably {}, are not supported by gmatch.
The gmatch function is not available in HP-UX 10.x. However, a similar function, fnmatch, is available. You can use fnmatch to emulate gmatch as follows:
    #include <fnmatch.h>

    int
    gmatch(const char *str, const char *pattern)
    {
        /* fnmatch returns 0 on a match; gmatch returns non-zero. */
        return(!fnmatch(pattern, str, 0));
    }
Example 16-11 shows a program that uses gmatch to search a file given as its second argument for lines that match the pattern given as its first argument. Note that the pattern must be enclosed in quotes to prevent the shell from processing it.
Example 16-11:  gmatch
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int
main(int argc, char **argv)
{
    FILE *fp;
    char line[BUFSIZ];
    char *pattern, *filename;
    /*
     * Check arguments.
     */
    if (argc != 3) {
        fprintf(stderr, "Usage: %s pattern file\n", *argv);
        exit(1);
    }
    pattern = *++argv;
    filename = *++argv;
    /*
     * Open the file.
     */
    if ((fp = fopen(filename, "r")) == NULL) {
        perror(filename);
        exit(1);
    }
    /*
     * Read lines from the file.
     */
    while (fgets(line, sizeof(line), fp) != NULL) {
        /*
         * Strip the newline, if there is one.
         */
        line[strcspn(line, "\n")] = '\0';
        /*
         * If it matches, print it.
         */
        if (gmatch(line, pattern) != 0)
            puts(line);
    }
    fclose(fp);
    exit(0);
}
    % gmatch 'A????d' /usr/dict/words
    Aeneid
    Alfred
    Arnold
    Atwood
    % gmatch 'z*[ty]' /usr/dict/words
    zealot
    zest
    zesty
    zippy
    zloty
    zoology
Regular Expressions
A regular expression specifies a set of strings, through the use of special characters. Most text editors support regular expressions in some form; the grep family of commands also supports them. The canonical definition of a regular expression is provided by the ed text editor, which was the first UNIX text editor to implement them.
In ed, a regular expression is defined as follows:
 A single character (except a special character; see below) is a one-character regular expression that matches itself.
 A backslash preceding a special character causes that character to lose its special meaning.
 A period (.) is a one-character regular expression that matches any single character.
 A string of characters enclosed in square brackets ([ and ]) is a one-character regular expression that matches any single character in the string, unless the first character of the string is a circumflex (^), in which case the string is a regular expression that matches any single character not in the string. The circumflex has special meaning only when it is the first character in the string.
Within the string, a dash (-) may be used to specify a range of characters; e.g., [0-9] matches the same thing as [0123456789]. If the dash is the first character (following the circumflex) or last character in the string, it loses its special meaning.
The right square bracket (]) may be included in the string only if it is the first character of the string.
The other special characters have no special meaning within square brackets.
 Regular expressions may be concatenated to form larger regular expressions.
 A regular expression preceded by a circumflex (^) is constrained to match at the beginning of a line.
 A regular expression followed by a dollar sign ($) is constrained to match at the end of a line.
 A regular expression both preceded by a circumflex and followed by a dollar sign is constrained to match an entire line.
 A regular expression followed by an asterisk (*) matches zero or more occurrences of the regular expression. For example, ab*c matches ac, abc, abbc, and so forth. When a choice exists, the longest leftmost match will be chosen.
 A regular expression contained between \( and \) matches the same string that the unenclosed regular expression matches.
 The regular expression \n matches the same string that the nth regular expression enclosed in \( and \) in the same regular expression matches. For example, \(abc\)\1 matches the string abcabc.
 A regular expression followed by \{m\} matches exactly m occurrences of that regular expression. A regular expression followed by \{m,\} matches at least m occurrences of that regular expression. A regular expression followed by \{m,n\} matches at least m and no more than n occurrences of that regular expression.
This notation was originally introduced in PWB UNIX, and from there made its way into System V. Versions of UNIX that do not have PWB UNIX as an ancestor (i.e., Berkeley-based versions) do not support this notation.
 A regular expression preceded by \< is constrained to match at the beginning of a line or to follow a character that is not a digit, underscore, or letter.
A regular expression followed by \> is constrained to match at the end of a line or to precede a character that is not a digit, underscore, or letter.
This allows a regular expression to be constrained to match words.
This notation was introduced in the ex and vi editors. Versions of ed prior to the one in SVR4 do not support this notation.
The basic functions provided for using regular expressions in programs are regcmp and regex:
    #include <libgen.h>
    char *regcmp(const char *str1, /* const char *str2, */ ... /* , NULL */);
    char *regex(const char *re, const char *str, /* char *ret0, */ ...);
    extern char *__loc1;
The regcmp function compiles the regular expression (consisting of its concatenated arguments) and returns a pointer to the compiled form. The memory to hold the compiled form is allocated with malloc; it is the user's responsibility to free this memory when it is no longer needed. If one of the arguments contains an error, regcmp returns NULL.
The regex function applies the compiled regular expression re to the string in str. Additional arguments may be given to receive values back (see below). If the pattern matches, a pointer to the next unmatched character in str is returned, and the external character pointer __loc1 will point to the place where the match begins. If the pattern does not match, regex returns NULL.
HP-UX 10.x requires you to link with the -lPW library to use these functions.
The regular expressions used by regcmp and regex are somewhat different from those described above:
 The dollar sign ($) matches the end of the string; \n matches a newline.
 A regular expression followed by a plus sign (+) matches one or more occurrences of the regular expression.
 The curly-brace notation does not use backslashes to escape the curly braces. For example, while ed uses \{m\}, regcmp and regex use {m}.
 The parenthesis notation from ed (\(...\)) has been replaced with the following:
(...)$n
The part of the string that matches the regular expression will be returned. The value will be stored in the string pointed to by the (n+1)th argument following str in the call to regex. Up to ten strings may be returned this way.
(...)
Parentheses are used for grouping. The operators *, +, and {} can operate on a single character or on a regular expression contained in parentheses.
SVR4 provides a second set of functions for implementing regular expressions, called compile, advance, and step. These functions implement regular expressions just as they exist in ed and grep, but their usage is complicated and, because they are not available in other versions of the operating system, they are not portable. For more information on them, consult the regexpr(5) manual page.
Example 16-12 shows another version of the file-searching program from Example 16-11; this one uses regular expressions, much like the grep command. Notice again that the pattern must be enclosed in quotes to prevent the shell from trying to interpret it.
Example 16-12:  regexp
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int
main(int argc, char **argv)
{
    FILE *fp;
    char line[BUFSIZ];
    char *re, *pattern, *filename;
    /*
     * Check arguments.
     */
    if (argc != 3) {
        fprintf(stderr, "Usage: %s pattern file\n", *argv);
        exit(1);
    }
    pattern = *++argv;
    filename = *++argv;
    /*
     * Compile the regular expression.
     */
    if ((re = regcmp(pattern, NULL)) == NULL) {
        fprintf(stderr, "bad regular expression.\n");
        exit(1);
    }
    /*
     * Open the file.
     */
    if ((fp = fopen(filename, "r")) == NULL) {
        perror(filename);
        exit(1);
    }
    /*
     * Read lines from the file.
     */
    while (fgets(line, sizeof(line), fp) != NULL) {
        /*
         * Strip the newline, if there is one.
         */
        line[strcspn(line, "\n")] = '\0';
        /*
         * If it matches, print it.
         */
        if (regex(re, line) != NULL)
            puts(line);
    }
    fclose(fp);
    exit(0);
}
    % regexp 'A....d' /usr/dict/words
    Aeneid
    Alameda
    Alfred
    Alfredo
    Amerada
    Aphrodite
    Arnold
    Atwood
    Avogadro
    % regexp '^A....d$' /usr/dict/words
    Aeneid
    Alfred
    Arnold
    Atwood
    % regexp 'b(an){2,}' /usr/dict/words
    banana
Porting Notes
The regcmp and regex functions are available on System V-based systems only. BSD-based systems provide a slightly different set of functions:
    char *re_comp(const char *re);
    int re_exec(const char *str);
The re_comp function compiles the regular expression contained in re and stores the result internally. If the expression is compiled successfully, re_comp returns NULL; otherwise it returns a pointer to an error message describing the problem. The re_exec function compares the string str to the last compiled regular expression and returns 1 if they match, 0 if they don't, and -1 if an error occurs (such as calling re_exec before calling re_comp).
The BSD functions are more user-friendly than their System V counterparts in that they accept standard ed regular expressions. The System V functions allow you to use multiple regular expressions simultaneously without having to recompile them, and they allow the program to obtain the parts of the string that matched the regular expression.
If portability is a concern, it is necessary to write code that is compatible with either set of regular expression functions. The lack of support for simultaneous use of multiple regular expressions in the BSD functions can make this difficult, however. Another approach is to obtain a free or public-domain implementation of regular-expression functions and simply include those with the program.
Henry Spencer offers a wonderful public domain implementation of the regular expression functions included in Research UNIX Version 8; his package includes not only the compile and match functions, but also a function to perform substitutions in strings much like a text editor does. The package is available from ftp://ftp.cs.toronto.edu/pub/regexp.shar.Z. The GNU Project also provides a fairly robust implementation of the regular expression functions; their implementation is covered by the GNU General Public License. The package is available from ftp://prep.ai.mit.edu/pub/gnu/regex-0.12.tar.gz.
