Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site noscvax.UUCP
Path: utzoo!watmath!clyde!akgua!sdcsvax!noscvax!kemp
From: kemp@noscvax.UUCP (Stephen P. Kemp)
Newsgroups: net.unix
Subject: treatise on regular expressions (long)
Message-ID: <638@noscvax.UUCP>
Date: Fri, 28-Sep-84 12:17:04 EDT
Article-I.D.: noscvax.638
Posted: Fri Sep 28 12:17:04 1984
Date-Received: Sat, 29-Sep-84 09:59:45 EDT
Distribution: net
Organization: Naval Ocean Systems Center
Lines: 174

The following appeared in the Naval Ocean Systems Center newsletter
COMPUTING HIGHLIGHTS.  I thought it would be useful to USENETers.
If you have comments, please mail them to ME and NOT to Mike.

--------------------------- * ------------------------------

by Mike Bloomberg

               *** Regular expressions in Unix ***
                      Theme and Variations

A regular expression can be considered to be a string of characters
in which certain characters have special meanings, as defined below.
These special characters enable patterns to be defined in an
efficient and general manner.

The grep family (consisting of grep, egrep for expressions, and
fgrep for fixed strings), as well as awk, ed and sed, make heavy use
of regular expressions.

The greps search the input text file for the defined regular
expression and report any occurrences found.

Awk is a pattern processing language and is a generalization of
grep.  Awk uses regular expressions, which are enclosed in slashes
(/), to locate the lines in the text file to perform actions upon.

Ed and sed are line editors.  Ed interacts with the user, while sed
(stream editor) works in a "non-interactive" mode.  Both processors
use regular expressions as context addresses.  A context address is
the position of the next line in the text file that contains a
character string matching the regular expression.

These special characters are:

-----------
.    (period) matches any single character (wildcard character)

     EXAMPLE: the expression un.x matches any four-character string
              that begins with un and ends with x; the third
              character can be ANY character.  So unax, unbx, ...
              unzx, un3x, un&x, un;x etc. will all match.
-----------
[]   any one of the characters or range of characters within
     these square brackets will match.

     EXAMPLE: un[a3z]x will match unax or un3x or unzx only.
     EXAMPLE: un[a-z]x will match unax, unbx ... unzx.
              However, un3x will NOT match.

     NOTE: placing a ^ (caret) as the first character inside the
           brackets will match the COMPLEMENT of those characters.

     EXAMPLE: un[^abcde]x will match unfx ... unzx
                                     un1x ... un9x
                                     un!x un$x etc.
-----------
()   Used for grouping of an expression.  The expression enclosed
     in the parentheses can be operated on by such operators as
     the * or + operators.

     EXAMPLE: (unix)+ would match unix or unixunix or
              unixunixunix etc.

     NB: Only for egrep,awk.  Ed and sed group with \( and \)
         instead (see below).
-----------
|    an "or" conditional.  Matches EITHER of the expressions to the
     left or right of the | (vertical bar) symbol.

     EXAMPLE: unix|grep will find any line that has either (or
              both) unix OR grep in it.

     NB: Only for egrep,awk
-----------
^    (caret) placed at the beginning of the expression, means to
     match the expression ONLY if it is at the beginning of the
     line (column 1).  Otherwise, the ^ is taken as a literal.

     EXAMPLE: ^unix will match only if the line begins with the
              word unix
-----------
$    (dollar sign) placed at the end of the expression means to
     match the expression ONLY if it is at the end of the line.
     Otherwise, the $ is taken as a literal.

     EXAMPLE: unix$ will match only if the line ends with the
              word unix.
-----------
\    disables the special characters.

     EXAMPLE: \.  would look for a period in the text.
              \\  would look for a backslash in the text.

          ============================
          = The following symbols    =
          = apply to the character   =
          = immediately PRECEDING    =
          = the symbol.              =
          ============================

*    (asterisk) matches any number (including zero) of occurrences
     of the character immediately preceding.

     EXAMPLE: un*x would match ux, unx, unnx, unnnx etc.
-----------
+    Similar to * but matches one or more occurrences.

     EXAMPLE: un+x would match unx, unnx, unnnx, unnnnx etc.
-----------
\<   matches an expression that follows anything but a letter,
     digit or underscore.  Normally used to find expressions at
     the front of the word.

     EXAMPLE: \<abc will find all words beginning with the
              letters abc
-----------
\>   matches an expression that precedes anything but a letter,
     digit or underscore.  Normally used to find expressions at
     the end of the word.

     EXAMPLE: abc\> will find all words ending with the letters abc

     NB: Only for grep,egrep
-----------
\( and \)
     Enclosing an expression with a "\(" on the left and a "\)" on
     the right makes it referable later in the expression by the
     syntax "\n".  "n" is the numeric order of the enclosed
     expression.

     EXAMPLE: \(abc\)def\1 will match the string abcdefabc
     EXAMPLE: \(abc\)\(def\)ghi\2\1 would match abcdefghidefabc

     NB: Only for ed,sed

======================

For more documentation about regular expressions, see the "Unix
Programmer's Manual, Seventh Edition, November 1980, Computer
Science Division, Univ. of California at Berkeley", which contains
a hardcopy version of the "man" descriptions of these processors.

Articles of Interest:

     1. Tutorial Introduction to the Unix Text Editor (Ed)
     2. Advanced Editing on Unix (Ed)
     3. SED - non-interactive text editor.
     4. AWK - A pattern scanning and processing language

--------------------------- * ------------------------------

Steve Kemp                   {ihnp4, decvax, akgua, dcdwest, ucbvax}!sdcsvax!noscvax!kemp
Computer Sciences Corp.      kemp@nosc
Naval Ocean Systems Center
San Diego, CA
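
P.S.  The EXAMPLE lines in the article can be checked mechanically.
What follows is only a rough sketch, and it rests on one large
assumption: it uses the re module of Python, which is not one of the
tools the article covers.  For the operators shown, Python's syntax
is close to egrep's, but it differs from grep, ed and sed: groups
and back-references are written ( ) and \1 rather than \( \) and
\1, and word boundaries are written \b rather than \< and \>.  The
names EXAMPLES and check_examples are made up for illustration.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not).
# The patterns come from the article's EXAMPLE lines, translated into
# Python's re dialect where that dialect differs (see the comments).
EXAMPLES = [
    (r"un.x",        ["unax", "un3x", "un;x"],    ["unx"]),
    (r"un[a3z]x",    ["unax", "un3x", "unzx"],    ["unbx"]),
    (r"un[a-z]x",    ["unax", "unzx"],            ["un3x"]),
    (r"un[^abcde]x", ["unfx", "un1x", "un!x"],    ["unax"]),
    (r"(unix)+",     ["unix", "unixunix"],        ["uni"]),
    (r"unix|grep",   ["I use unix", "grep it"],   ["awk"]),
    (r"^unix",       ["unix is here"],            ["my unix"]),
    (r"unix$",       ["my unix"],                 ["unix is here"]),
    (r"\.",          ["a.b"],                     ["ab"]),
    (r"un*x",        ["ux", "unx", "unnnx"],      ["uax"]),
    (r"un+x",        ["unx", "unnx"],             ["ux"]),
    # Python has no \< or \>; \b (word boundary) is the nearest equivalent.
    (r"\babc",       ["an abcdef"],               ["xabcdef"]),
    (r"abc\b",       ["xyzabc end"],              ["abcd"]),
    # ed/sed write these as \(abc\)def\1 and \(abc\)\(def\)ghi\2\1.
    (r"(abc)def\1",         ["abcdefabc"],        ["abcdefabd"]),
    (r"(abc)(def)ghi\2\1",  ["abcdefghidefabc"],  ["abcdefghiabcdef"]),
]


def check_examples(examples=EXAMPLES):
    """Return (pattern, text, expected, actual) for every mismatch found."""
    failures = []
    for pattern, should_match, should_not in examples:
        cases = [(t, True) for t in should_match]
        cases += [(t, False) for t in should_not]
        for text, expected in cases:
            # re.search looks for the pattern anywhere in the text,
            # which is how the greps scan each line.
            actual = re.search(pattern, text) is not None
            if actual != expected:
                failures.append((pattern, text, expected, actual))
    return failures


if __name__ == "__main__":
    for failure in check_examples():
        print("MISMATCH:", failure)
```

Run as a script, it prints a MISMATCH line for every example that
does not behave as described and is silent otherwise.  Results from
Python say nothing definite about what a particular grep, ed or sed
will accept; try the patterns in the tool itself as well.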