Sabermetricians: Help!

I like sports statistics. They are a fantastic resource for fans. Properly used, stats help us understand what's happening and deepen our appreciation of the game. Statistics are very seldom the end of the argument, but they are often a good beginning.

Baseball, over the last 10 or 20 years, has seen a revolution of advanced statistics. For all that we still talk about batting average and RBI and pitching wins, we also have DIPS and WAR and OPS+ and UZR. Joe Posnanski recently wrote a clever piece trying to explain a little about UZR (Ultimate Zone Rating), a highly-regarded advanced fielding stat. The sabermetric community needs more of that. There are still far too many fans who are intimidated by VORP or don't understand BABIP, and until that changes, most fans and analysts will still rely on batting average and pitching wins.

Fifty homers, 200 hits, .300 average, 20 wins ... everyone knows what these numbers mean. The new stats have the potential to tell us far more, and to help us compare players in different parks and different eras and at different positions. If you're a baseball fan, that's got to be exciting. But everyone knows what batting average and ERA are measuring, and why those are useful stats. Most fans don't understand the new stats, and they're understandably wary of them. It's in everyone's best interest for that to change. Folks who rely on the traditional stats can use OPS+ and FIP to improve their understanding of the game. Stat geeks can stop wringing their hands about whether Tim Raines will ever get the respect he deserves.

I'm not a sabermetrician. I don't have a degree in math or statistics. And honestly, I don't know where some of these new statistics come from. I want to like these statistics. Combining defensive performance into a single number like UZR does, that's an immensely useful tool for fans and managers. Combining batting performance, like OPS+, or batting and fielding, like WAR — if we can have faith in those numbers, it opens up realms of objective analysis our parents never dreamed of. But right now, most fans don't have faith in the numbers. It sometimes seems like they're being shoved down our throats by those who are unwilling or unable to explain how they work and why we should trust them. Some sabermetricians come across like emo kids and indie snobs, jealously guarding their wisdom from mainstream acceptance, lest their "special" knowledge become commonplace.

Here are four questions I think the statheads should answer if they want the rest of us to switch from RBI to WAR.

Why is WAR better than Win Shares?

All but the most cynical of fans acknowledge that Bill James has contributed important things to baseball ("I made baseball as much fun as doing your taxes!"), and his work is more widely known than that of guys with weird names like Tom Tango and Voros McCracken. About a decade ago, James introduced a statistic called Win Shares, an all-in-one number that rates a player's performance as a single integer. Calculating Win Shares is not easy, but there are websites that will do it for you, and the results are easy to understand. A Win Share is equal to one-third of a win for your team, and 30 Win Shares is an MVP-quality season. Lots of people read Bill James, and lots of people know about Win Shares.

But most statheads today prefer Wins Above Replacement (WAR). James, when he unleashed Win Shares upon the public, made a pretty compelling case for the system, its analysis of fielding statistics in particular. James writes about the "false normalization of fielding statistics", so that a bad team's fielding often appears to be better than a good team's fielding. If we're measuring range by stats like putouts, both the good team and the bad will only have 27 per game. If we're using assists, bad teams may actually have more, because they allow more baserunners.

There are multiple calculations for WAR, but the most prominent are from Baseball-Reference and FanGraphs. Baseball-Reference uses a statistic called Total Zone for its defensive ratings in WAR. In 2010, according to Total Zone, the best fielding team in the American League, by far, was the Seattle Mariners, who had the worst record in the league. Second-best was the A's, who were a .500 team. Third was Cleveland, which lost 93 games. Only one playoff team, the Tampa Bay Rays, actually was better than league average. The Central champion Twins, according to this system, were very nearly the worst fielding team in the AL. I'm not saying that's wrong, but it seems strange. The Mariners lost 101 games despite having the best fielders in the league? The Twins and White Sox had the best records in the Central despite being the worst fielders? Really?

What makes WAR a better all-in-one stat than Win Shares? I don't mean to imply that it isn't better — it probably is — but I've never, ever, seen anyone explain why.

Baseball-Reference and FanGraphs WAR sometimes differ hugely for the same player. What gives?

Ichiro Suzuki was selected as AL MVP in 2001, his first year in the majors. Baseball-Reference credited him with 7.6 Wins Above Replacement, FanGraphs with only 6.1. That's a 25% difference, a huge gap. In 2005, Baseball-Reference saw Ichiro with 4.7 WAR, FanGraphs with only 3.2. That's almost a 50% difference. The next year, the tables turned: BR 4.2, FG 5.5. It's not just Ichiro. We could play this same game with Alex Rodriguez (2000, '02, '03, '10) or Derek Jeter ('98, '02, '10) or dozens of other players. Here's Chase Utley:


Utley has played six full seasons (2005-10). These two systems show a difference of at least one full win in four of those six years, and the difference is 0.9 in one of the others. Are Baseball-Reference and FanGraphs even watching the same game? Last year, FanGraphs saw Josh Hamilton as the runaway AL MVP, with 8.0 WAR, no one else over 7. Baseball-Reference preferred Evan Longoria (7.7), with Robinson Cano a distant second (6.1) and Hamilton tied for third (6.0).

When these systems produce such wildly disparate results, how can we take them both seriously? Obviously someone is wrong. Why should we trust these numbers?
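The Ichiro percentages quoted earlier in this section can be checked with a few lines of Python. This is only a sketch: the helper name `pct_gap` is made up for illustration, and it assumes the gap is measured against the smaller of the two WAR figures, which is how the 25% and 50% figures appear to have been derived.

```python
def pct_gap(a: float, b: float) -> float:
    """Gap between two WAR values as a percentage of the smaller one.

    Assumes both values are positive; a WAR at or below zero would make
    the percentage meaningless.
    """
    low, high = sorted((a, b))
    return (high - low) / low * 100


# (season, Baseball-Reference WAR, FanGraphs WAR) for Ichiro, as quoted in the text
ichiro = [(2001, 7.6, 6.1), (2005, 4.7, 3.2), (2006, 4.2, 5.5)]

for season, br, fg in ichiro:
    print(f"{season}: BR {br}, FG {fg}, gap {pct_gap(br, fg):.0f}%")
```

The three seasons come out to roughly 25%, 47%, and 31%, so the disagreement is large in whichever direction it runs.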

Is there a middle ground between fielding percentage and UZR?

It seems to me that all our fielding statistics are either overly simplistic (fielding percentage, errors, putouts, assists) or too arcane (UZR, Win Shares, Dewan plus/minus). Sometimes you have to sacrifice precision for simplicity and accessibility. That's part of why OPS is becoming widely accepted and wOBA is not. The formula for OPS is easy to understand, and you don't need a math degree to calculate it. OPS is not perfect. It underrates the importance of OBP, it's not park-adjusted, it doesn't include stolen bases, and so on. It's still a pretty good stat. The day television announcers tell us everyone's OPS when they come to bat, I'm throwing a party.

OPS is accessible — a lot of people use OPS, and it's not just statheads. All you do is add OBP and slugging percentage. Addition: nothing hard about that. And while on-base percentage and slugging aren't the most popular stats out there, they're widely recognized even by fans who prefer traditional statistics. Add them together and you have OPS, a single number that usually gives a fair indication of a hitter's efficiency.
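To show how little arithmetic is involved, here is a minimal Python sketch of OPS using the standard definitions of on-base percentage and slugging. The function names and the stat line at the bottom are invented for illustration and do not belong to any real player.

```python
def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP: times on base divided by the plate appearances that count toward it."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


def slugging_pct(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Hypothetical season: 550 AB, 165 hits (110 1B, 30 2B, 5 3B, 20 HR), 60 BB, 5 HBP, 5 SF
obp = on_base_pct(hits=165, walks=60, hit_by_pitch=5, at_bats=550, sac_flies=5)
slg = slugging_pct(singles=110, doubles=30, triples=5, home_runs=20, at_bats=550)
ops = obp + slg  # the whole "formula" is one addition

print(f"OBP {obp:.3f} + SLG {slg:.3f} = OPS {ops:.3f}")
```

Everything the calculation needs is already on the back of a baseball card, which is the accessibility argument in a nutshell.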

Fielding needs a stat like that, something people can figure out in their living rooms. Plus/minus and UZR are accurate and useful because they're derived from an incredible amount of film study, an amount no fan could hope to do. We need stats like that, too, but surely there's some kind of middle ground. Is there a defensive equivalent to OPS? Something that may not be perfectly accurate, but is easy to understand and can help us evaluate fielders in a way that makes sense?

What pitching statistic(s) should we use?

A few years ago, David Gassko at Hardball Times wrote about a potentially revolutionary statistic he called Pitching Runs Created. It's a direct counterpart to the more widely known Runs Created stat for batters, which in its simplest form is OBP x TB (on-base percentage times total bases). The heart of Gassko's research was that a run saved is worth more than a run scored. This is critical knowledge when we try to balance offensive and defensive contributions, or compare batters to pitchers.
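For readers who have not met Runs Created, the simplest form mentioned above can be sketched in Python. This assumes the basic version, with on-base percentage simplified to (H + BB) / (AB + BB). The stat line is hypothetical, and nothing here reproduces Gassko's Pitching Runs Created, whose formula is not given in this article.

```python
def basic_runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: simplified on-base percentage times total bases."""
    simple_obp = (hits + walks) / (at_bats + walks)
    return simple_obp * total_bases


# Hypothetical season: 550 AB, 165 H, 60 BB, 265 total bases
rc = basic_runs_created(hits=165, walks=60, total_bases=265, at_bats=550)
print(f"Runs Created: {rc:.1f}")
```

With these made-up numbers the hitter comes out at just under 98 runs created.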

John Dewan mentioned PRC (which will always be People's Republic of China to me, but whatever) last season in his stat of the week at ACTA Sports, but it doesn't seem to get much attention. Is there something wrong with the premise? Has it been left by the wayside because there are better stats? Has it been incorporated into other systems?

ERA+, FIP, xFIP, FIP-, PRC, WPA, BR-WAR, FG-WAR, innings pitched ... I don't even know where these intersect with each other any more. Which ones are park-adjusted? Which are defense-independent? Which is the best average? What about linear weights, a WAR-type stat to reward someone who throws a lot of innings?

It's (relatively) easy to find people explaining why (for instance) ERA+ is better than ERA (it's adjusted for park, league, and era) or FIP is more predictive than ERA (it subtracts fielding performance and focuses on what the pitcher can control) or WAR is more telling than Wins (a great pitcher on a bad team — like Felix Hernandez — won't win many games, no matter how well he pitches, because his team never scores). But I'd love to see someone explain how ERA+ compares to xFIP, and why PRC is better than WAR, or vice versa.

* * *

Informed fans often gripe about baseball's honors: All-Star selections, Gold Glove choices, MVP and Hall of Fame voting. Some people are simply closed-minded, and will never be open to appreciating modern statistics. But many others would, if the stats people would take a break from talking amongst themselves to help the rest of us understand why their findings are useful. Why is WAR better than Win Shares? How can we reconcile the huge disparities in WAR? Are there useful fielding stats that don't require months of film study? How should we evaluate pitchers? Publicly answering these questions won't quiet clowns like Bruce Jenkins, or put Raines in Cooperstown, but it would be a step in the right direction.

Sabermetricians: help!

Comments and Conversation

April 27, 2011

Ron Johnson:

Hi Brad, There’s a thread running on this right now at Hopefully worth checking out.

First point: FanGraphs’ WAR is different under the hood from BBRef’s. They just happen to share a common name. Confusing to be sure.

Comparing WAR to Win shares:

a) The offensive component of Win Shares is runs created per out. It’s frankly a mediocre offensive metric. WAR starts with linear weights and also includes support for baserunning, reaching on error, and grounding into double plays (with an adjustment for opportunities).

All in all, both methods have a measurable standard error, but WAR’s is probably 2-3 runs lower per player. That said, if you’re really interested in evaluating players to any kind of precision, it’s best to look at as many good offensive metrics as you can. All metrics have an error, and you’ll tend to get a better picture by looking at many metrics.

The defensive component is broadly similar. Basically it’s an attempt to adjust range factor for context (including staff composition). Both come up with results that broadly match contemporary reputations. WAR explicitly looks at OF arm which is nice.

All in all I’d trust WAR’s results a tad more. The system was designed after Win Shares, and Sean Smith (WAR’s designer) discussed design decisions with people who had been working on defense before putting the system into practice.

Win Shares clearly over-values playing time. You can do Loss Shares now, and calculate a wins above replacement based on any assumption of replacement level, but that’s work you have to do yourself (or you can look up Win Shares Above Base).

You may not like the way replacement level is set in WAR, but if you want you can recalculate WAR with your own definition of replacement level.

The differences between BBRef’s WAR and FanGraphs’ WAR are basically:

How the offensive components are calculated — including park adjustments.

Different defensive systems.

Different assumptions about what replacement level is.

None of this truly troubles me. With the best will in the world you can’t evaluate a player’s offensive contribution more precisely than to within 5 runs (that’s on average about 1/2 a win in WAR) and the standard error for defensive stats is still in the 7 run range as best I can tell (and given that the range is only +20 to -20 that’s not great). I’m still not comfortable giving more than a letter grade on defense — with the understanding that I’m going to be off the mark around 5% of the time.

April 27, 2011

Ron Johnson:

Me again.

To answer the specific point about RBI: it’s important to understand that they’re hugely dependent on the contribution of other players.
I came up with a simple RBI estimator (Tom Ruane came up with something far more precise, but the essential point is the same: RBIs are a function of power and opportunity).

ABROB is at-bats with runners on base. The formula for estimated RBI is ABROB * (SLG * 1.09 - BA * 0.66)

Yes, that’s correct. Given equal opportunities and a typical distribution of baserunners, if two players have the same SLG, the guy with the lower BA will tend to drive in a few more runs. That’s because there are more runners on first. (Plus a single with a runner on second is nothing close to a guaranteed RBI. First of all there are infield singles, and even then a runner on second scores less than 70% of the time on a single to a corner OF.)
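The estimator above is easy to try out. Here is a minimal Python sketch using the coefficients exactly as given in the comment. The function name and the two hitters are hypothetical, chosen only to show the effect of batting average when slugging and opportunities are held equal.

```python
def estimated_rbi(abrob, slg, ba):
    """Ron Johnson's simple estimator: ABROB * (SLG * 1.09 - BA * 0.66).

    abrob is at-bats with runners on base.
    """
    return abrob * (slg * 1.09 - ba * 0.66)


# Two made-up hitters: same opportunities, same slugging, different batting averages
high_avg = estimated_rbi(abrob=250, slg=0.500, ba=0.300)
low_avg = estimated_rbi(abrob=250, slg=0.500, ba=0.260)

print(f"BA .300: {high_avg:.1f} estimated RBI")
print(f"BA .260: {low_avg:.1f} estimated RBI")
```

With these inputs the .260 hitter is estimated at about 93 RBI and the .300 hitter at about 87, which is the counterintuitive result described above.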

Anyhow, all of this is a long-winded way to say that RBIs are mostly a function of power and opportunity and a player can’t control his opportunities so what you’re really left with is slugging percentage.

And while that’s important, on-base percentage is even more so. As a simple formula for player evaluation, OBP*1.7+SLG works quite well.
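As a rough illustration of that last formula, the Python sketch below compares it with plain OPS for two hypothetical hitters who have identical OPS. The names and numbers are invented. Only the 1.7 weight comes from the comment.

```python
def weighted_ops(obp, slg, obp_weight=1.7):
    """OBP * 1.7 + SLG, the simple evaluation formula suggested in the comment."""
    return obp * obp_weight + slg


# Two made-up hitters with the same plain OPS
hitters = {
    "on-base guy": (0.400, 0.450),
    "slugger": (0.350, 0.500),
}

for name, (obp, slg) in hitters.items():
    print(f"{name}: OPS {obp + slg:.3f}, weighted {weighted_ops(obp, slg):.3f}")
```

Plain OPS rates the two hitters the same, while the weighted version prefers the one who reaches base more often.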
