Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site fisher.UUCP
Path: utzoo!decvax!bellcore!petrus!scherzo!allegra!princeton!astrovax!fisher!david
From: david@fisher.UUCP (David Rubin)
Newsgroups: net.sport.baseball
Subject: Re: Lineup dependency
Message-ID: <757@fisher.UUCP>
Date: Mon, 16-Sep-85 22:36:20 EDT
Article-I.D.: fisher.757
Posted: Mon Sep 16 22:36:20 1985
Date-Received: Tue, 17-Sep-85 08:39:36 EDT
References: <444@philabs.UUCP>
Distribution: na
Organization: Princeton University.Mathematics
Lines: 143

You just KNEW I wasn't going to let Paul's demonstration pass without
challenge, didn't you??

First, let me say right off that while I disagree with most of what
Paul wrote, if I countered all his points,
	(a) this article would be another monster, and
	(b) general principles would be lost among specifics.

Many of Paul's arguments are anecdotal in nature: he brings up a case
which he believes supports his position, and concludes that, since his
explanation is CONSISTENT with his own observations, it must be TRUE.
As an example, he credits McGee's year to Coleman; he is satisfied
that since his explanation makes sense,

	(1) he may disregard alternate explanations of the event, and
	(2) he need not further investigate.


I shall limit myself, therefore, to the general comment (call it
Rubin's Law of Empirics, if you will) that HAVING A PLAUSIBLE
EXPLANATION FOR AN ANTICIPATED EFFECT IS NOT EVIDENCE THAT THAT EFFECT
HAS ACTUALLY OCCURRED.  All of Paul's explanations mean little,
therefore, until he establishes that what his explanations explain has
indeed happened!  Only in the case of Mattingly does he attempt to
actually demonstrate that a lineup effect exists, and I will therefore
concentrate on it.  Elsewhere, he merely shows lineup effects are
consistent with his selected observations, without showing either that
other explanations are inconsistent or that the observations would be
inexplicable without lineup effects.  Unfortunately, he places far
more interpretive weight on his statistics than they can bear; odd,
considering his previously expressed worry that I, as a Statistician,
was likely to be taken in by a spurious (and superficial) correlation.

>              Mattingly's stats       Yankees record
>               BA     Slugging         W    L     Pct.
>Batting 2nd   .402      .715          27    8    .771
>Batting 3rd   .303      .495          42   40    .525
>   or 4th
>We can see that not only are personal stats highly dependent
>on the other people in the lineup, but also dependent on the order
>in which those people bat.

			CONFOUNDMENT?

We can only say this if we know the ONLY thing that is varying is the
lineup.  It may be that Mattingly has batted second ONLY against
right-handed pitching or that some OTHER factor is responsible for the
difference.  In other words, a simple breakdown such as this is
worthless (possibly even worse: it may be misleading) unless we also
know that the circumstances of the two categories (batting 2nd vs.
batting 3rd or 4th) are otherwise similar; otherwise, it may be some
other factor (such as lefty-righty, home-away, grass-turf, day-night,
etc.), strongly correlated with the categories, that is driving the
discrepancy (Statisticians refer to this confusion of one cause with
another as "confounding").
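To make the point concrete, here is a minimal sketch in Python.  The
counts are INVENTED purely for illustration (they are not Mattingly's
real splits), and the names `splits` and `average` are hypothetical.
They are arranged so that lineup slot makes no difference at all
against either kind of pitcher, yet the combined averages differ by
over a hundred points, simply because the hitter in this made-up
example rarely bats second against left-handers.

```python
# INVENTED counts for illustration only: (at bats, hits) by lineup slot
# and pitcher handedness. These are NOT real splits for any player.
splits = {
    ("2nd", "RHP"): (90, 36),
    ("2nd", "LHP"): (10, 2),
    ("3rd/4th", "RHP"): (100, 40),
    ("3rd/4th", "LHP"): (200, 40),
}


def average(slot, hand=None):
    """Batting average for a lineup slot, optionally against one pitcher hand."""
    rows = [counts for (s, h), counts in splits.items()
            if s == slot and hand in (None, h)]
    at_bats = sum(ab for ab, _ in rows)
    hits = sum(hit for _, hit in rows)
    return hits / at_bats


for slot in ("2nd", "3rd/4th"):
    print(f"{slot:8s} vs RHP {average(slot, 'RHP'):.3f}   "
          f"vs LHP {average(slot, 'LHP'):.3f}   "
          f"overall {average(slot):.3f}")
```

Within each column the two slots are identical; only the overall
figures differ, and that difference is entirely the work of the
lefty-righty mix.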

			AMOUNT OF DATA?

Moreover, even if Paul COULD assure us that the two categories were
otherwise similar, he does not have nearly enough data.  Examine, in
particular, the data for batting
second: it is based on 35 games, i.e. about 100-150 at bats.  Most
fans will not put much store in a player's average after 35 games
(early May), and for good reason: the player has not yet accumulated
enough at bats for us to form any reasonable opinion as to his likely
seasonal productivity.  We are talking about guessing whether a player
is hitting .300 or .400 based on that many at bats: it would not be at
all unusual for the difference (10 to 15 hits) to be due to a "hot" or
"cold" streak (what Statisticians conveniently label "random", but we
may understand as being that which is beyond our knowledge).  We would
need to have many more at bats (perhaps in a couple more seasons we
will) before we could say that the difference is due to the position
in the lineup rather than a propitious hot streak.  To put it another
way, if a lifetime .300 hitter were to have a .400 average on May 5th,
would you tentatively conclude (until further info was available) that
the man would bat .400 for the season?  Of course not.  You would
correctly conclude that he is more likely to hit .300 from June
through September than .400.  He may just have had a good April...
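For anyone who wants to put rough numbers on this, the following is a
minimal sketch in Python.  It ASSUMES each at bat is an independent
trial with a fixed hit probability (a simplification), and the
function names are hypothetical.  It computes the standard error of a
batting average over a given number of at bats, and the exact binomial
chance that a true .300 hitter shows .400 or better over that stretch.
At 100-150 at bats the standard error is around 40 points of batting
average.

```python
from math import ceil, comb, sqrt


def standard_error(p, at_bats):
    """Standard deviation of an observed average over `at_bats` independent
    at bats when the true hit probability is `p` (assumed model)."""
    return sqrt(p * (1 - p) / at_bats)


def prob_average_at_least(p, at_bats, target):
    """Exact binomial probability that a true-`p` hitter posts an average
    of `target` or better over `at_bats` at bats."""
    # Fewest hits that reach the target; the tiny offset guards against
    # floating-point noise in target * at_bats.
    need = ceil(target * at_bats - 1e-9)
    return sum(comb(at_bats, k) * p**k * (1 - p)**(at_bats - k)
               for k in range(need, at_bats + 1))


for ab in (100, 125, 150):
    se = standard_error(0.300, ab)
    tail = prob_average_at_least(0.300, ab, 0.400)
    print(f"{ab} AB: standard error {se:.3f}, P(avg >= .400) = {tail:.4f}")
```

The tail probability is for one hitter and one stretch named in
advance; it says nothing about a hitter or a split picked out after
the fact.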

			LIMITED APPLICATION?

Even if a lineup effect were established for Mattingly, it would hold
only for Don Mattingly with the current Yankees: to apply it to, say,
Tony Pena, it would have to be demonstrated for a wide variety of
players on
a wide variety of teams.  Still, it would be quite a surprise to me if
anyone could get even that far.

			TSN BIAS!!!!

Finally, the selection is biased.  The Sporting News didn't say,
"Let's check on Mattingly's stats and publish regardless", as they
would have to if we were to have any hope that Mattingly was somehow
typical; they certainly perused all the available stats and published
the one(s) they considered most "interesting" or "newsworthy".  We can
be certain, therefore, that the discrepancy in Mattingly's stats is
unusually large.  If that is the greatest discrepancy available among
the 300 or so regular players, Mattingly's extra 10-15 hits and 20-30
extra bases in his 150 at bats, then I am very unimpressed: such
discrepancies would probably be as large in a similarly sized sample
broken down into phases of the moon.  Make no mistake: it is The
Sporting News's job to publish discrepancies such as this because they are
among the largest, as their readers demand the unusual, not the
typical.
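The "phases of the moon" claim can be checked by simulation.  The
following Python sketch rests on ASSUMPTIONS of my choosing: 300
hitters who all have the same true average (.270) and no lineup
effect whatsoever, each with a season split arbitrarily into 150 and
350 at bats.  The function name, the seed, and all parameter values
are hypothetical.  It then does what a newspaper would do: it looks
over every player and reports the biggest gap it can find.

```python
import random


def simulate_gaps(players=300, small_ab=150, large_ab=350,
                  true_avg=0.270, seed=1985):
    """Gap (small-sample average minus large-sample average) for each of
    `players` identical hitters whose split is pure chance."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    gaps = []
    for _ in range(players):
        hits_small = sum(rng.random() < true_avg for _ in range(small_ab))
        hits_large = sum(rng.random() < true_avg for _ in range(large_ab))
        gaps.append(hits_small / small_ab - hits_large / large_ab)
    return gaps


gaps = simulate_gaps()
typical = sorted(abs(g) for g in gaps)[len(gaps) // 2]  # median absolute gap
print(f"median |gap| among {len(gaps)} players: {typical:.3f}")
print(f"largest gap, the one that gets published: {max(gaps):.3f}")
```

Comparing the largest gap with the median shows how much the act of
selecting the most striking case inflates it, even when nothing but
chance is at work.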

			PENA-CARTER

Yes, this came up again, and I have to point out that
	
	(1) My arguments were based entirely on pre-1985 stats, and
	(2) Pre-1985, Pena's team was about as productive as Carter's,

so that Paul's argument (again) about Hernandez and Strawberry being
responsible (again) for Carter's stats is irrelevant (again).  And
even if we WERE to consider them, why does Paul believe that Carter
has his stats inflated by Hernandez, Strawberry, and Foster when NONE
of those three show any substantial increase in production over last
year?  I suppose Paul believes Carter has a special dispensation: in
moving from the Expos to the Mets, he gains by being surrounded by
Keith, Darryl, and George, while those three do NOT gain from Gary's
presence.  The fact is, the production of all four has remained about
the same over the past two years, an argument AGAINST lineup effects.
My apologies for not being able to resist the anecdotal argument.

			CONCLUSION

For Paul to demonstrate lineup effects, he will need

	(1) More data (more players, more at bats),
	(2) Better data (some effort to exclude other factors;
	      however, it may suffice to simply have more data, so
	      that we may reasonably expect to have other factors
	      balance out), and
	(3) Unbiased data (players selected because, a priori, we
	      think their records will be most illuminating; a
	      posteriori selection, a la TSN, is invalid).

Neither Paul nor I has the time or the resources to do this.  Some people
do, and are supposedly doing it (the folks at SABR...).  They have, so
far, according to Pete Palmer, found "no evidence" of lineup effects.
This does not "disprove" lineup effects; however, it detracts from
human understanding to accept as truth all that is not disproven.

					David Rubin
			{allegra|astrovax|princeton}!fisher!david