Path: utzoo!attcan!uunet!pyrdc!pyrnj!esquire!yost
From: yost@esquire.UUCP (David A. Yost)
Newsgroups: news.admin
Subject: Re: News delivery problems - old news again
Message-ID: <1355@esquire.UUCP>
Date: 14 Aug 89 19:35:10 GMT
References: <43675@bbn.COM> <651@vector.Dallas.TX.US> <505@logicon.arpa>
Reply-To: yost@esquire.UUCP (David A. Yost)
Organization: DP&W, New York, NY
Lines: 173

In article <43675@bbn.COM> denbeste@BBN.COM (Steven Den Beste) writes:
>Today we received a large number of news articles dated July 22, which we have
>received before. I located 50 or so of them and analyzed the paths by which
>they arrived here.

Today we have 1,883 duplicate messages in the comp groups alone.
We may have some other problem.

Enclosed is a shell script that will find duplicate
messages and list their pathnames.

Implementing it required another, more generally useful
utility (numdups), which I had wished for for a long time
and finally wrote.

Here is quickie documentation on both scripts,
followed by a shar of the two scripts.

 --dave yost

-----------------

Usage: find newsdir ... -type f -print | newsdups [ -i file ] [ -o file ]

Takes news file pathnames on standard input and outputs
pathnames of later-numbered duplicate messages in each group.

If the -o file argument is used, newsdups also outputs a record of
the message IDs seen in this pass into the specified file for use
as the -i file argument to a future run of newsdups, at which time
newsdups will assume that those message IDs already exist.
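
As a rough illustration of the idea (a sketch, not the enclosed
script), the pipeline below builds a throwaway spool directory and
reports the later-numbered copy of a repeated message ID. The spool
layout, article numbers, message IDs, and the name find_dups are all
made up for the example, and the awk seen[] idiom stands in for the
numdups step that newsdups actually uses. It assumes mktemp is
available.

```shell
#!/bin/sh
# Hypothetical spool: three articles in one group, two sharing a Message-ID.
spool=$(mktemp -d)
mkdir -p "$spool/comp/misc"
printf 'Message-ID: <1@a.UUCP>\n\nbody\n' > "$spool/comp/misc/101"
printf 'Message-ID: <2@b.UUCP>\n\nbody\n' > "$spool/comp/misc/102"
printf 'Message-ID: <1@a.UUCP>\n\nbody\n' > "$spool/comp/misc/117"

# find_dups is an illustrative name, not part of the posted scripts.
find_dups() {
    find "$1" -type f -print \
    | xargs grep -i '^message-id:' /dev/null \
    | sed 's,/\([^/:]*\):[^:]*: , \1 ,' \
    | awk '{ printf "%s %07d %s\n", $1, $2, $3 }' \
    | sort \
    | awk 'seen[$1 " " $3]++ { printf "%s/%d\n", $1, $2 }'
    # grep ... /dev/null forces "file:" prefixes even for a single file.
    # sed splits "dir/NNN:Message-ID: <id>" into "dir NNN <id>".
    # Zero-padding NNN makes the sort put lower article numbers first,
    # so the last awk prints only the later copies within each group.
}

dups=$(find_dups "$spool")
echo "$dups"
rm -rf "$spool"
```

With the sample data above this should print the single pathname
ending in comp/misc/117.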

-----------------

Usage: numdups awk-field-number-list file ...

For each line, prints a number n followed by a space followed
by the original text on the line.  The number identifies the
line as the nth line containing the specified fields.
If multiple field numbers are specified, they must be separated
by spaces, and the awk-field-number-list argument must be quoted.
Fields are numbered as in awk (0 = whole line, 1 = field 1,
2 = field 2, etc.).
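
To make the numbering concrete, here is a small stand-in for numdups
written as a shell function. It is a sketch under assumptions, not
the enclosed script: the name numdups_sketch and the sample lines are
invented, and it relies on the -v option of newer (POSIX) awks, which
older awks may lack.

```shell
#!/bin/sh
# numdups_sketch 'field list' [file ...]
# Prefix each line with a count of how many lines so far have had the
# same values in the listed fields.
numdups_sketch() {
    fields=$1; shift
    awk -v fields="$fields" '
    BEGIN { n = split(fields, f, " ") }
    {
        # Build the key from the chosen fields, space-separated.
        key = ""
        for (i = 1; i <= n; i++) key = key $(f[i]) " "
        printf "%d %s\n", ++counts[key], $0
    }' "$@"
}

# Sample input in the form newsdups feeds it: group, number, message ID.
result=$(printf '%s\n' \
    'comp/misc 0000101 <1@a.UUCP>' \
    'comp/misc 0000102 <2@b.UUCP>' \
    'comp/misc 0000117 <1@a.UUCP>' | numdups_sketch '1 3')
echo "$result"
```

Keyed on fields 1 and 3, the first two lines come out numbered 1 and
the third comes out numbered 2, since it repeats the group and
message ID of the first.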

-----------------

#!/bin/sh

unlink=NO
case $1 in
-u)
    unlink=YES
    shift
esac


echo x newsdups
case $unlink in
YES)
    rm -f newsdups
esac
sed 's/^X//' >newsdups <<'*-*-END-of-newsdups-*-*'
X#!/bin/sh
X
X# See "Usage:" below
X#
X# Requires the nonstandard shell script 'numdups'
X#
X# 890814 D Yost, Davis Polk & Wardwell
X#
X# Why the tmp1 file instead of a pipe?  Otherwise it doesn't work.
X
X# Accept zero, one, or both of "-i file" and "-o file".
Xcase $# in
X0 | 2 | 4) ;;
X*) echo 1>&2 "
XUsage: find newsdir ... -type f -print | newsdups [ -i file ] [ -o file ]
X
XTakes news file pathnames on standard input and outputs
Xpathnames of later-numbered duplicate messages in each group.
X
XIf the -o file argument is used, newsdups also outputs a record of
Xthe message IDs seen in this pass into the specified file for use
Xas the -i file argument to a future run of newsdups, at which time
Xnewsdups will assume that those message IDs already exist.
X"
X    exit 2
Xesac
X
Xtmp1=/tmp/newsdups1.$$
Xtmp2=/tmp/newsdups2.$$
Xtmp3=/tmp/newsdups3.$$
X# Single quotes, so that $? is read when the trap fires, not when it is set.
Xtrap 'status=$? ; rm -f $tmp1 $tmp2 $tmp3 ; exit $status' \
X     0 1 2 3 4 5 6 7 8 10 12 13 15 24 25 29
X
Xifile=
Xofile=$tmp2
X
Xwhile test $# -gt 0
Xdo
X    case "$1" in
X    -i) ifile=$2 ; shift ; shift ;;
X    -o) ofile=$2 ; shift ; shift ;;
X    *)  echo 1>&2 "newsdups: unknown argument: $1" ; exit 2 ;;
X    esac
Xdone
X
X# /dev/null makes grep print the filename even when xargs hands it one file.
Xxargs grep -i '^message-id:' /dev/null > $tmp1
Xsed < $tmp1 's,/\([^/:]*\):[^:]*: , \1 ,' \
X| awk '{printf "%s %07d %s\n", $1, $2, $3}' \
X| sort > $ofile
X
Xcase "$ifile" in
X"") cat $ofile
X    ;;
X*)  numdups '1 3' $ifile \
X    | awk '$1 == 1 { print $2, $3, $4 }' > $tmp3
X    cat $tmp3 $ofile
X    ;;
Xesac \
X| numdups '1 3' \
X| awk '$1 != 1 { printf "%s/%d\n", $2, $3 }'
X
Xexit
*-*-END-of-newsdups-*-*

echo x numdups
case $unlink in
YES)
    rm -f numdups
esac
sed 's/^X//' >numdups <<'*-*-END-of-numdups-*-*'
X#!/bin/sh
X
Xcase $# in
X0)  echo 1>&2 "
XUsage: numdups awk-field-number-list file ...
X
XFor each line, prints a number n followed by a space followed
Xby the original text on the line.  The number identifies the
Xline as the nth line containing the specified fields.
XIf multiple field numbers are specified, they must be separated
Xby spaces, and the awk-field-number-list argument must be quoted.
XFields are numbered as in awk (0 = whole line, 1 = field 1,
X2 = field 2, etc.).
X"
X    exit 2
Xesac
X
Xfield=$1 ; shift
X
Xtmp=/tmp/numdups.$$
X# Single quotes, so that $? is read when the trap fires, not when it is set.
Xtrap 'status=$? ; rm -f $tmp ; exit $status' \
X     0 1 2 3 4 5 6 7 8 10 12 13 15 24 25 29
X
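X# Write a one-off awk command into $tmp, then run it.  For the field
X# list '1 3' the key is built as sprintf("%s %s ", $1, $3).  The
X# parentheses around the sprintf arguments are required by newer awks.
X(
Xecho -n "awk '{
X    tmpstr = sprintf("'"'
X
Xfor X in $field
Xdo
X    echo -n "%s "
Xdone
X
Xecho -n '"'
X
Xfor X in $field
Xdo
X    echo -n ', $'"$X"
Xdone
X
Xecho ')
X    ++counts[tmpstr]
X    printf "%d %s\n", counts[tmpstr], $0
X}'"' $*" ) > $tmp
Xsh $tmp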
*-*-END-of-numdups-*-*
exit