Xref: utzoo comp.sys.ibm.pc:35401 comp.sources.wanted:8860
Path: utzoo!attcan!uunet!lll-winken!sun-barr!apple!agate!hilbert!raymond
From: raymond@hilbert.berkeley.edu (Raymond Chen)
Newsgroups: comp.sys.ibm.pc,comp.sources.wanted
Subject: Re: Hyphenation code wanted
Message-ID: <1989Sep27.235236.22920@agate.berkeley.edu>
Date: 27 Sep 89 23:52:36 GMT
References: <1333@ole.UUCP> <888@friar-taac.UUCP>
Sender: usenet@agate.berkeley.edu (USENET Administrator;;;;ZU44)
Reply-To: raymond@hilbert.UUCP (Raymond Chen)
Distribution: na
Organization: Math Dept., UC Berkeley
Lines: 25

In article <1333@ole.UUCP> ray@ole.UUCP (Ray Berry) writes:
|    I am looking for c src code for rule-driven hyphenation of english
|words.  Does anyone have something they could e-mail?  Donations, pointers-
|all are encouraged/appreciated.  Thank you.
|-- 
|Ray Berry  kb7ht  uucp: ...ole!ray CIS: 73407,3152 /* "inquire within" */
|Seattle Silicon Corp. 3075 112th Ave NE. Bellevue WA 98004 (206) 828-4422

If you're after perfection, look at Appendix H of Knuth's TeXbook.  It
describes the hyphenation algorithm used by the TeX program (which is
in turn based on Frank Liang's Stanford Ph.D. thesis).  The algorithm
itself is really simple.  It misses only 14 of the commonly used words
in the English language (4 of them being "present", "presents", "project",
and "projects", which can be hyphenated in two different ways, depending on
the context).  The TeX Users Group (TUG) has a list of all known words
which the algorithm fails to hyphenate correctly.  (Trust me, the words
on the list are words you'd never use.  How often do you have to
hyphenate "Grothendieck"?)  In most cases, the only error in the
algorithm is that it misses hyphenation points.  It rarely places a
hyphen where there shouldn't be one.
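
To give a feel for how simple the algorithm is, here is a rough sketch in
Python (easy enough to port to C).  Treat it as an illustration under
assumptions, not as TeX's actual code: the names parse_pattern, hyphenate,
left_min and right_min are made up for this example, and the nine patterns
are only the handful that Appendix H uses to work through the word
"hyphenation", as best I recall them.  The real hyphen.tex has thousands.
Each pattern is a run of letters with digits marking the gaps between
them; for every gap in the word you keep the largest digit that any
matching pattern puts there, and an odd result means a hyphen is allowed.

```python
# Illustrative subset only (assumed from the Appendix H worked example);
# the real hyphen.tex contains several thousand patterns.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na",
            "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pat):
    """Split e.g. 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]).

    scores[k] is the digit attached to the gap just before letter k;
    the final entry is the gap after the last letter.
    """
    letters = ""
    scores = [0]
    for ch in pat:
        if ch.isdigit():
            scores[-1] = int(ch)
        else:
            letters += ch
            scores.append(0)
    return letters, scores


# Map from the bare letters of a pattern to its gap scores.
TABLE = dict(parse_pattern(p) for p in PATTERNS)


def hyphenate(word, left_min=2, right_min=3):
    """Return the pieces of `word` between permitted hyphens.

    left_min/right_min are assumed minimum fragment lengths at the
    start and end of the word.
    """
    padded = "." + word.lower() + "."   # '.' marks the word boundaries
    gaps = [0] * (len(padded) + 1)      # gaps[j] = gap before padded[j]

    # Try every substring; keep the maximum score seen at each gap.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            scores = TABLE.get(padded[start:end])
            if scores:
                for offset, score in enumerate(scores):
                    pos = start + offset
                    gaps[pos] = max(gaps[pos], score)

    # An odd score permits a break; an even score forbids one.
    pieces = []
    last = 0
    for i in range(left_min, len(word) - right_min + 1):
        if gaps[i + 1] % 2 == 1:        # gap before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


if __name__ == "__main__":
    print("-".join(hyphenate("hyphenation")))
```

With only these nine patterns loaded, anything other than the example
word will mostly come back unhyphenated.  That is the same failure mode
as described above: with too few patterns you miss break points rather
than invent bad ones.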

Disclaimer:  This is from memory.  I hope you get the idea of what
	I'm saying (i.e., read Appendix H, and get a copy of the hyphen.tex
	file from somebody).  Any errors in this article are unintentional
	and were made in good faith.