Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site fisher.UUCP
Path: utzoo!decvax!bellcore!petrus!scherzo!allegra!princeton!astrovax!fisher!david
From: david@fisher.UUCP (David Rubin)
Newsgroups: net.sport.baseball
Subject: Re: Lineup dependency
Message-ID: <757@fisher.UUCP>
Date: Mon, 16-Sep-85 22:36:20 EDT
Article-I.D.: fisher.757
Posted: Mon Sep 16 22:36:20 1985
Date-Received: Tue, 17-Sep-85 08:39:36 EDT
References: <444@philabs.UUCP>
Distribution: na
Organization: Princeton University.Mathematics
Lines: 143

You just KNEW I wasn't going to let Paul's demonstration pass without
challenge, didn't you?

First, let me say right off that while I disagree with most of what
Paul wrote, if I countered all his points, (a) this article would be
another monster, and (b) general principles would be lost among
specifics.

Many of Paul's arguments are anecdotal in nature: he brings up a case
which he believes supports his position, and concludes that, since his
explanation is CONSISTENT with his own observations, it must be TRUE.
As an example, he credits McGee's year to Coleman; he is satisfied
that since his explanation makes sense, (1) he may disregard
alternative explanations of the event, and (2) he need not investigate
further. I shall limit myself, therefore, to the general comment (call
it Rubin's Law of Empirics, if you will) that

    HAVING A PLAUSIBLE EXPLANATION FOR AN ANTICIPATED EFFECT IS NOT
    EVIDENCE THAT THAT EFFECT HAS ACTUALLY OCCURRED.

All of Paul's explanations mean little, therefore, until he
establishes that what his explanations explain has indeed happened!

Only in the case of Mattingly does he attempt to actually demonstrate
that a lineup effect exists, and I will therefore concentrate on it.
Elsewhere, he merely shows lineup effects are consistent with his
selected observations, without showing either that other explanations
are inconsistent or that the observations would be inexplicable
without lineup effects.
Unfortunately, he places far more interpretive weight on his
statistics than they can bear; odd, considering his previously
expressed worry that I, as a Statistician, was likely to be taken in
by a spurious (and superficial) correlation.

>               Mattingly's stats     Yankees record
>                BA     Slugging      W    L    Pct.
>Batting 2nd    .402     .715        27    8    .771
>Batting 3rd    .303     .495        42   40    .525
>  or 4th
>We can see that not only are personal stats highly dependent
>on the other people in the lineup, but also dependent on the order
>in which those people bat.

CONFOUNDMENT?

We can only say this if we know the ONLY thing that is varying is the
lineup. It may be that Mattingly has batted second ONLY against
right-handed pitching or that some OTHER factor is responsible for the
difference. In other words, a simple breakdown such as this is
worthless (possibly even worse: it may be misleading) unless we also
know that the circumstances of the two categories (batting 2nd vs.
batting 3rd or 4th) are otherwise similar; otherwise, it may be some
other factor (such as lefty-righty, home-away, grass-turf, day-night,
etc.), strongly correlated with the categories, that is driving the
discrepancy (Statisticians refer to this confusion of one cause with
another as "confounding").

AMOUNT OF DATA?

Moreover, even if Paul COULD assure us that this was so, he does not
have nearly enough data. Examine, in particular, the data for batting
second: it is based on 35 games, i.e. about 100-150 at bats. Most fans
will not set much store by a player's average after 35 games (early
May), and for good reason: the player has not yet accumulated enough
at bats for us to form any reasonable opinion as to his likely
seasonal productivity.
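
As a rough check on how little 100-150 at bats can pin down, here is a
minimal sketch that computes the standard error of an observed batting
average. It assumes a simple binomial model (every at bat is an
independent trial with the same hit probability), which is an
idealization and not something taken from the article; the function
name and the .300 "true" average are likewise illustrative choices.

```python
import math


def batting_average_se(true_avg, at_bats):
    """Standard error of an observed average under a binomial model.

    Assumption: each at bat is an independent trial with hit
    probability `true_avg`. Real at bats are not this tidy.
    """
    return math.sqrt(true_avg * (1.0 - true_avg) / at_bats)


# Sample sizes spanning the 100-150 at bats mentioned for 35 games.
for at_bats in (100, 125, 150):
    se = batting_average_se(0.300, at_bats)
    low, high = 0.300 - 2 * se, 0.300 + 2 * se
    print(f"{at_bats} AB: SE {se:.3f}, rough 2-SE range {low:.3f} to {high:.3f}")
```

Under these assumptions the standard error at 125 at bats is about
.041, so observed averages anywhere from roughly .22 to .38 are
unremarkable for a true .300 hitter.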
We are talking about guessing whether a player is hitting .300 or .400
based on that many at bats: it would not be at all unusual for the
difference (10 to 15 hits) to be due to a "hot" or "cold" streak (what
Statisticians conveniently label "random", but we may understand as
being that which is beyond our knowledge). We would need to have many
more at bats (perhaps in a couple more seasons we will) before we
could say that the difference is due to the position in the lineup
rather than a propitious hot streak.

To put it another way, if a lifetime .300 hitter were to have a .400
average on May 5th, would you tentatively conclude (until further info
was available) that the man would bat .400 for the season? Of course
not. You would correctly conclude that he is more likely to hit .300
from June through September than .400. He may just have had a good
April...

LIMITED APPLICATION?

Even if it were established for Mattingly, it would hold only for Don
Mattingly with the current Yankees: to apply it to, say, Tony Pena, it
would have to be demonstrated for a wide variety of players on a wide
variety of teams. Still, it would be quite a surprise to me if anyone
could get even that far.

TSN BIAS!!!!

Finally, the selection is biased. The Sporting News didn't say, "Let's
check on Mattingly's stats and publish regardless", as they would have
to if we were to have any hope that Mattingly was somehow typical;
they certainly perused all the available stats and published the
one(s) they considered most "interesting" or "newsworthy". We can be
certain that the discrepancy in Mattingly's stats is therefore
unusually large. If that is the greatest discrepancy available among
the 300 or so regular players, Mattingly's extra 10-15 hits and 20-30
extra bases in his 150 at bats, then I am very unimpressed: such
discrepancies would probably be as large in a similarly sized sample
broken down into phases of the moon.
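
The selection point can be illustrated with a small simulation. This
is only a sketch under stated assumptions, none of which come from the
article's data: 300 players who are ALL true .300 hitters, each given
an arbitrary split of 125 at bats in one category and 300 in the other
(sizes picked to resemble 35 and 82 games), with independent at bats.
The function names and the seed are hypothetical. It reports the
largest gap between the two split averages that pure chance produces
when we go looking for the most striking player.

```python
import random


def simulated_average(at_bats, true_avg, rng):
    """Observed average over `at_bats` independent trials (assumed model)."""
    hits = sum(1 for _ in range(at_bats) if rng.random() < true_avg)
    return hits / at_bats


def largest_gap(n_players, ab_small, ab_large, true_avg, rng):
    """Largest (small-split minus large-split) average among identical hitters.

    Every player has the same true average, so any gap is chance alone;
    taking the maximum mimics publishing only the most striking case.
    """
    gaps = []
    for _ in range(n_players):
        small = simulated_average(ab_small, true_avg, rng)
        large = simulated_average(ab_large, true_avg, rng)
        gaps.append(small - large)
    return max(gaps)


rng = random.Random(1985)  # arbitrary seed, for repeatability only
gap = largest_gap(300, 125, 300, 0.300, rng)
print(f"largest chance gap among 300 identical hitters: {gap:.3f}")
```

For comparison, the gap in the quoted Mattingly table is .402 - .303 =
.099. Under this model the standard deviation of one player's gap is
about .049, so the most extreme of 300 such players will typically sit
well beyond a single standard deviation.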
Make no mistake: it is The Sporting News's job to publish
discrepancies such as this because they are among the largest, as
their readers demand the unusual, not the typical.

PENA-CARTER

Yes, this came up again, and I have to point out that (1) my arguments
were based entirely on pre-1985, and (2) pre-1985, Pena's team was
about as productive as Carter's, so that Paul's argument (again) about
Hernandez and Strawberry being responsible (again) for Carter's stats
is irrelevant (again).

And even if we WERE to consider them, why does Paul believe that
Carter has his stats inflated by Hernandez, Strawberry, and Foster
when NONE of those three shows any substantial increase in production
over last year? I suppose Paul believes Carter has a special
dispensation: in moving from the Expos to the Mets, he gains by being
surrounded by Keith, Darryl, and George, while those three do NOT gain
from Gary's presence. The fact is, the production of all four has
remained about the same over the past two years, an argument AGAINST
lineup effects.

My apologies for not being able to resist the anecdotal argument.

CONCLUSION

For Paul to demonstrate lineup effects, he will need (1) More data
(more players, more at bats), (2) Better data (some effort to exclude
other factors; however, it may suffice to simply have more data, so
that we may reasonably expect to have other factors balance out), and
(3) Unbiased data (players selected because, a priori, we think their
records will be most illuminating; a posteriori selection, a la TSN,
is invalid).

Neither Paul nor I have the time or resources to do this. Some people
do, and are supposedly doing it (the folks at SABR...). They have, so
far, according to Pete Palmer, found "no evidence" of lineup effects.
This does not "disprove" lineup effects; however, it detracts from
human understanding to accept as truth all that is not disproven.

					David Rubin
			{allegra|astrovax|princeton}!fisher!david