Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!ucbcad!ucbvax!jade!eris!mwm
From: mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer)
Newsgroups: comp.unix.questions
Subject: Re: awk or sed question
Message-ID: <4249@jade.BERKELEY.EDU>
Date: Sat, 4-Jul-87 06:30:36 EDT
Article-I.D.: jade.4249
Posted: Sat Jul  4 06:30:36 1987
Date-Received: Sun, 5-Jul-87 08:37:51 EDT
References: <4780@columbia.UUCP> <3892@burdvax.PRC.Unisys.COM>
Sender: usenet@jade.BERKELEY.EDU
Reply-To: mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer)
Distribution: world
Organization: Missionaria Phonibalonica
Lines: 167

agw@broadway.columbia.edu (Art Werschulz) says:
[A request for a sed or awk tool to break 80-character lines at whitespace.]

Some problems just aren't amenable to tackling with sed/awk. I think
this is one of them. It may be doable with sed, but I'm not sure how.
Any awk script to do this will be almost as complicated as a C
program to do the same thing.

For example:

In article  someone writes:
 N; n -= LEN) {
<           while (substr($0,LEN+i-1,1) != " ") {
<               LEN -= 1
<           }
<           if (i==1) {
<		printf "%s\\\n", substr($0, i, LEN)
<	   } else {
<		printf ">     %s\\\n", substr($0, i, LEN)
<	   }
<           i += LEN;
<        }
<        printf ">     %s\\\n", substr($0,i)
<     }
<} '


This is what I mean. First, converting tabs directly to 8 spaces has
*got* to be wrong. Secondly, this fails on files with lines longer
than awk's internal buffer for records (minor, and usually acceptable).
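
To make the tab complaint concrete: a tab advances to the next tab
stop, so its width depends on the column it starts in. What follows is
only a sketch, assuming tab stops every 8 columns and a single line
with no embedded newline. The names expand_tabs and TABSTOP are made up
for illustration and are not part of either program in this thread.

```c
#include <stddef.h>

#define TABSTOP	8	/* assumed tab width; not taken from the post */

/*
 * Copy src to dst, replacing each tab with enough spaces to reach the
 * next multiple of TABSTOP. Returns the expanded length, or (size_t) -1
 * if dst (dstsize bytes) cannot hold the result plus its terminator.
 */
size_t
expand_tabs(const char *src, char *dst, size_t dstsize)
{
	size_t col = 0 ;

	if (dstsize == 0) return (size_t) -1 ;
	for ( ; *src != '\0'; src += 1) {
		if (*src == '\t') {
			/* Spaces needed to reach the next tab stop */
			size_t pad = TABSTOP - col % TABSTOP ;
			if (col + pad >= dstsize) return (size_t) -1 ;
			while (pad-- > 0) dst[col++] = ' ' ;
			}
		else {
			if (col + 1 >= dstsize) return (size_t) -1 ;
			dst[col++] = *src ;
			}
		}
	dst[col] = '\0' ;
	return col ;
}
```

With this, the tab in "ab<TAB>c" becomes six spaces rather than eight,
because it starts in column 2 and only has to reach column 8.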

The loose problem spec doesn't help much, of course. But that just
means the problem is a "real-life" problem, and not a classroom
exercise. The C code to solve the problem has some differences (no
tags on folded lines, and the whitespace where the fold is doesn't
get printed). It's also a pure filter, but allows for user-specified
fold columns, instead of wiring it to 80.

The main loop of the C code is 26 lines, not counting comments. The
awk script is 19 lines. The C code would shrink to 22 lines by using
printfs instead of fputs/putchar, and formatting if/else the same way
the awk script is.

Since (as far as I'm concerned) sed and awk are for quickly building
programs that would be difficult in C, the small difference between the
two programs - which hopefully indicates a small difference in
construction time - shows that this is a problem for which awk isn't
really suited.

On the other hand, some simple test cases (the first n integers on a
single line, separated by a single space) show the C version can handle
n = 10000 in about the same sys and user times (as reported by
/bin/time on a Sun 3/50 running SunOS 3.3) as the sed/awk version for
n = 100. The sed/awk version drops core for n >= 1000, and the C
version takes less than 1/10th of a second of sys and user time for n
<= 1000, so I didn't do direct comparisons.
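
For anyone wanting to repeat the timing, test input of that shape is
easy to generate. This is a sketch only: it assumes "the first n
integers" means 1 through n, and write_ints is a name invented here,
not something from the original test setup.

```c
#include <stdio.h>

/*
 * Write the integers 1..n on a single line, separated by single
 * spaces and ended with a newline. Returns 0 on success, -1 on a
 * write error.
 */
int
write_ints(FILE *out, int n)
{
	int i ;

	for (i = 1; i <= n; i += 1) {
		/* Separator goes before every number except the first */
		if (i > 1 && putc(' ', out) == EOF) return -1 ;
		if (fprintf(out, "%d", i) < 0) return -1 ;
		}
	if (putc('\n', out) == EOF) return -1 ;
	return 0 ;
}
```

Calling write_ints(stdout, 10000) from a small driver and redirecting
the output to a file gives one very long line to feed to either
version.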

The shell script to emulate the awk/sed script user interfaces, and
the more complex script to combine the two, is left as an exercise for
the reader.

	

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * MAXFOLD is the largest fold column we're willing to accept. All others
 * rejected.
 */
#define	MAXFOLD	160

int
main(int argc, char **argv) {
	register int	foldc = 80 ;
	char		buffer[MAXFOLD + 2] ;
	register char	*fold_point, *leftovers ;

	/* Argument processing */
	if (argc > 2) {
		fprintf(stderr, "usage: %s [n]\n", argv[0]) ;
		exit(1) ;
		}
	if (argc == 2) foldc = atoi(argv[1]) ;
	if (foldc <= 0 || foldc > MAXFOLD) {
		fprintf(stderr,
			"%s: only fold columns between 1 and %d supported\n",
			argv[0], MAXFOLD) ;
		exit(1) ;
		}
	/*
	 * The plan is to treat each line + leftovers from last read as
	 * a new line. leftovers points just past the end of the leftovers.
	 * Initially set to the beginning of the buffer, it's set up
	 * correctly each time through the loop.
	 *
	 * We need to get one more character than the maximum fold, as
	 * the first character past the fold column might be whitespace,
	 * and that's a legit fold point. Since fgets reads at most n-1
	 * characters (n is the second argument), we need to ask for foldc+2
	 * characters, minus however much leftovers there are from last loop.
	 */
	leftovers = buffer ;
	while (fgets(leftovers, foldc+2-(leftovers-buffer), stdin) != NULL) {
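		/*
		 * If we got a complete line, print it. A buffer that isn't
		 * full and has no newline can only be the unterminated tail
		 * of the input, so print that as is too.
		 */
		if (buffer[strlen(buffer) - 1] == '\n'
		    || (int) strlen(buffer) <= foldc) {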
			fputs(buffer, stdout) ;
			leftovers = buffer ;
			}
		/*
		 * Got a long line. Find the fold point, print up to the fold,
		 * then shuffle the remaining characters forward and try again.
		 */
		else {
			fold_point = buffer + foldc ;
			while (*fold_point != ' ' && *fold_point != '\t'
			    && fold_point > buffer)
				fold_point -= 1 ;
			/* Test for lines with no whitespace */
			if (fold_point == buffer) {
				fputs(buffer, stdout) ;
				putchar('\n') ;
				leftovers = buffer ;
				}
			else {
				/* Dump up to fold point */
				*fold_point = '\0' ;
				fputs(buffer, stdout) ;
				putchar('\n') ;
				/* Now, deal with the leftovers */
				fold_point += 1 ;
				strcpy(buffer, fold_point) ;
				leftovers = &buffer[strlen(buffer)] ;
				}
			}
		}
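	/* Input ended in the middle of a long line: flush what's left */
	if (leftovers != buffer) fputs(buffer, stdout) ;
	exit(0) ;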
	}
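
The fold-point search in the else branch can also be pulled out and
tested on its own. The following is a sketch rather than a drop-in
replacement: find_fold is a hypothetical name, and unlike the loop
above it starts from the end of the string when the line is shorter
than foldc, so it never looks past the terminator.

```c
#include <string.h>

/*
 * Return the index of the rightmost space or tab at or before column
 * foldc in line, or 0 if there is none. As in the program above, 0
 * also covers a line whose only whitespace is its first character.
 */
size_t
find_fold(const char *line, size_t foldc)
{
	size_t i = strlen(line) ;

	/* Never start the search beyond the fold column */
	if (i > foldc) i = foldc ;
	while (i > 0 && line[i] != ' ' && line[i] != '\t')
		i -= 1 ;
	return i ;
}
```

For example, find_fold("hello world", 8) gives 5, the index of the
space, while a line with no whitespace at all gives 0 and has to be
broken at the fold column instead.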


--
I'm gonna lasso you with my rubberband lazer,		Mike Meyer
Pull you closer to me, and look right to the moon.	mwm@berkeley.edu
Ride side by side when worlds collide,			ucbvax!mwm
And slip into the Martian tide.				mwm@ucbjade.BITNET