Xref: utzoo comp.sys.ibm.pc:35401 comp.sources.wanted:8860
Path: utzoo!attcan!uunet!lll-winken!sun-barr!apple!agate!hilbert!raymond
From: raymond@hilbert.berkeley.edu (Raymond Chen)
Newsgroups: comp.sys.ibm.pc,comp.sources.wanted
Subject: Re: Hyphenation code wanted
Message-ID: <1989Sep27.235236.22920@agate.berkeley.edu>
Date: 27 Sep 89 23:52:36 GMT
References: <1333@ole.UUCP> <888@friar-taac.UUCP>
Sender: usenet@agate.berkeley.edu (USENET Administrator;;;;ZU44)
Reply-To: raymond@hilbert.UUCP (Raymond Chen)
Distribution: na
Organization: Math Dept., UC Berkeley
Lines: 25

In article <1333@ole.UUCP> ray@ole.UUCP (Ray Berry) writes:
|   I am looking for c src code for rule-driven hyphenation of english
|words.  Does anyone have something they could e-mail?  Donations, pointers-
|all are encouraged/appreciated.  Thank you.
|--
|Ray Berry  kb7ht  uucp: ...ole!ray  CIS: 73407,3152  /* "inquire within" */
|Seattle Silicon Corp. 3075 112th Ave NE. Bellevue WA 98004 (206) 828-4422

If you're after perfection, look at appendix H of Knuth's TeXbook.
It describes the hyphenation algorithm used by the TeX program
(which is in turn based on a Stanford Ph.D. thesis).  The algorithm
itself is really simple.  It misses only 14 of the commonly-used words
in the English language (4 of them being "present" "presents" "project"
and "projects", which can be hyphenated in two different ways, depending
on the context).

The TeX Users' Group (TUG) has a list of all known words which the
algorithm fails to hyphenate correctly.  (Trust me, the words on the
list are words you'd never use.  How often do you have to hyphenate
"Grothendieck"?)

In most cases, the only error in the algorithm is that it misses
hyphenation points.  It rarely places a hyphen where there shouldn't
be one.

Disclaimer:  This is from memory.  I hope you get the idea of what
I'm saying (i.e., read Appendix H, and get a copy of the hyphen.tex
file from somebody).  Any errors in this article are unintentional
and were made in good faith.
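
To give a feel for why the algorithm is "really simple", here is a rough
sketch of pattern-driven hyphenation in the spirit of Appendix H.  It is
in Python rather than C to keep it short, and it is not TeX's actual code.
The names (TOY_PATTERNS, parse_pattern, build_table, hyphenate) and the
left_min/right_min defaults are my own assumptions for illustration, and
the pattern list is a handful of entries picked to make one word work,
not the contents of hyphen.tex.  The idea: each pattern is a letter
string with digits between the letters; every pattern that occurs in the
word (with "." marking the word's ends) contributes its digits, the
largest digit wins at each gap, and an odd result allows a break there.

```python
import re

# Toy pattern set, for illustration only (NOT the real hyphen.tex).
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
                "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]).

    values[k] is the digit sitting in front of letter k (0 if none);
    the final entry is the digit after the last letter.
    """
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values


def build_table(patterns):
    """Map each pattern's letters to its list of inter-letter digits."""
    return dict(parse_pattern(p) for p in patterns)


def hyphenate(word, table, left_min=2, right_min=3):
    """Return the pieces of `word` between permitted break points."""
    text = "." + word.lower() + "."
    # scores[k] belongs to the gap just before text[k].
    scores = [0] * (len(text) + 1)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            values = table.get(text[start:end])
            if values is not None:
                for offset, value in enumerate(values):
                    k = start + offset
                    scores[k] = max(scores[k], value)
    pieces, last = [], 0
    # A break before word[i] corresponds to the gap before text[i + 1].
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:      # odd value: break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


if __name__ == "__main__":
    table = build_table(TOY_PATTERNS)
    print("-".join(hyphenate("hyphenation", table)))
```

With the toy table above this prints hy-phen-ation.  A word that matches
no pattern simply comes back unbroken, which fits the behaviour described
in the post: the usual failure is a missed break, not a wrong one.  A real
implementation would load the full pattern list and keep a small exception
table for the words the patterns get wrong.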