Accuracy of three major weather forecasting services (randalolson.com)
80 points by rhiever on June 21, 2014 | hide | past | favorite | 28 comments


This curve is not enough to evaluate the value of a weather forecasting service: if it rains on, say, 30% of days in a specific area, you could forecast a 30% chance of rain every day and have good "accuracy". And yet that would be of no practical value.

I think a better metric would probably be something from information theory like mutual information, but I'm not sure which one exactly.


If someone just predicted 30% every day, you'd notice the lack of samples for any other predicted probability, and no one would judge such a discontinuous calibration graph as well-calibrated.

Calibration graphs can still be gamed - you can predict dishonestly to "dent" the graph. For example, if your 70% forecasts are coming true only 60% of the time, predict 70% even when you're 100% sure an event will happen.
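To make the bucketing behind a calibration graph concrete, here's a minimal sketch (all numbers made up): group forecasts by predicted probability and compare each group's prediction to the observed frequency. The constant-30% forecaster shows up as a single lonely point.

```python
from collections import defaultdict

def calibration_table(predictions, outcomes):
    """Group forecasts by predicted probability and compare each
    group's predicted value to the observed frequency of rain."""
    buckets = defaultdict(list)
    for p, rained in zip(predictions, outcomes):
        buckets[p].append(rained)
    return {p: sum(obs) / len(obs) for p, obs in sorted(buckets.items())}

# A forecaster who says 30% every day looks "calibrated" at that one
# point, but the table exposes the missing samples everywhere else.
preds = [0.3] * 10
rain  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # rains 3 of 10 days
print(calibration_table(preds, rain))   # {0.3: 0.3}
```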


Deviance or log-likelihood are good measures (and when properly normalized give you one of the cross-entropies). Roughly, you compute -sum(ifelse(rained_i, log(p_i), log(1-p_i))). So you get a high (bad) penalty score if you give wrong probabilities, and a nice score of 0 if you put 1's on days that rained and 0's elsewhere.
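A minimal Python sketch of that penalty (the outcome/probability lists are hypothetical):

```python
import math

def deviance(rained, probs):
    """Negative log-likelihood: penalize each day by -log of the
    probability that was assigned to what actually happened."""
    return -sum(math.log(p) if r else math.log(1 - p)
                for r, p in zip(rained, probs))

# Perfect forecasts (1's on rainy days, 0's elsewhere) score 0;
# anything less confident or less correct scores worse (higher).
print(deviance([True, False], [1.0, 0.0]))  # 0, the best possible score
print(deviance([True, False], [0.9, 0.1]))  # ≈ 0.21
```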


Right. The curve demonstrates "calibration". A separate question is daily accuracy, or what might be called "sharpness".

One way to measure accuracy is to use a "scoring rule" each day. The scoring rule takes the prediction (a probability of rain), e.g. p=0.3, and the actual outcome, e.g. "rain" or "no rain", and returns a numerical score, where higher is better. You'd expect a reasonable scoring rule to have properties like, the highest score is for predicting p=1 and it rains (or p=0 and it doesn't), the lowest score is for p=1 and it doesn't (or p=0 and it does), and so on.

The log scoring rule is a popular choice: If it rains, your score is log(p); if it doesn't, your score is log(1-p). (Don't let it bother you that these are negative numbers for p in (0,1). Closer to zero is still better.) Notice that your expected score for predicting p is precisely p log(p) + (1-p) log(1-p), which is the negative of the entropy of the distribution, that is, -H(Bernoulli(p)).
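A quick numeric check of that identity (nothing here beyond the algebra above; p = 0.3 is an arbitrary example):

```python
import math

def expected_log_score(p):
    """Expected log score when rain truly occurs with probability p
    and you honestly predict p: p*log(p) + (1-p)*log(1-p)."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def bernoulli_entropy(p):
    """Entropy H(Bernoulli(p)) in nats."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

p = 0.3
print(expected_log_score(p))   # ≈ -0.611
print(-bernoulli_entropy(p))   # the same number: score = -H(Bernoulli(p))
```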

Now, what we could do to measure accuracy is to average, over all the days, the log scoring rule applied to your prediction and the outcome. A perfect score would be if you predicted 1 every time it rained and 0 every time it didn't, which would give you a score of zero; if you're not perfect, you'll have a negative score, with more negative being worse.

So the average score will be (the negative of) the average entropy of your predictions, which makes sense as entropy corresponds to "uncertainty". (This is assuming your predictions are calibrated. If you are uncalibrated, i.e. you make predictions with little uncertainty but they are wrong, then you will pay a penalty in proportion.)

We can connect this to your idea of mutual information, if we had a more detailed setup. For instance, suppose that each day, nature picks a true probability of rain p'; then you observe some information related to p' and make a prediction p; then it rains with probability p'. Here, the best you could possibly do is to observe all the information and predict p'. Then your average score will be the negative of the average entropy, -H(Bernoulli(p')). If you make poorer predictions, the theory of proper scoring rules tells us your average score falls below that optimum by the average KL divergence between Bernoulli(p') and Bernoulli(p). The mutual information between the rain and your information shows up as the gap between the fully informed forecaster and the "climatology" forecaster who always predicts the base rate. (If we use some scoring rule other than the log scoring rule, the KL divergence is replaced with a "Bregman divergence".)

Unfortunately, we don't know this true p' each day, so we can't actually use mutual information to evaluate forecasts, but we can still use the entropy, or log scoring rule.
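The penalty for imperfect prediction in that setup can be made concrete: under the log scoring rule, the expected score for predicting p when the true probability is p' falls below the optimum by exactly KL(Bernoulli(p') || Bernoulli(p)). A numeric sanity check with arbitrary numbers:

```python
import math

def expected_score(p_true, p_pred):
    """Expected log score when rain happens with probability p_true
    but the forecaster predicts p_pred."""
    return p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred)

def kl_bernoulli(p, q):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p_true, p_pred = 0.3, 0.5
gap = expected_score(p_true, p_true) - expected_score(p_true, p_pred)
print(gap)                          # how much the honest forecast beats 0.5
print(kl_bernoulli(p_true, p_pred)) # the same number: the KL divergence
```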

For a survey on proper scoring rules (not very fun for the layperson I'm afraid), see http://www.eecs.harvard.edu/cs286r/courses/fall12/papers/Gne...


> if it rains, say 30% of the days in a specific area, you could forecast a 30% chance of rain every day and have good "accuracy". And yet that would be of no practical value.

That's not exactly true. There is practical value to knowing the long-term probability of rain. Assuming that carrying an umbrella itself imposes a "cost" (such that it's not acceptable to simply carry one every single day), you're statistically better off randomly carrying an umbrella 30% of the time than carrying an umbrella any other percent of the time (including 0 or 100).


I think you're right, and in that sense, maybe "accuracy" isn't the best word. "Reliability" might be the better word here.


The plot in the article is an example of a "reliability diagram" frequently used in weather forecast verification. See, e.g., http://www.bom.gov.au/wmo/lrfvs/reliability.shtml. Reliability is considered separate from accuracy in meteorology -- the former evaluates success conditioned on what was forecasted, while the latter is an unconditional evaluation of success or failure.

Other facets of forecast "goodness" exist and are often considered in meteorology. A seminal paper on the subject was penned by Allan Murphy, who identified three types of "goodness" (consistency, quality, and value) and ten subsets of quality (including reliability and accuracy). See http://www.glerl.noaa.gov/seagrant/ClimateChangeWhiteboard/R.... [PDF warning]

A popular companion to the reliability diagram is the Relative Operating Characteristics (ROC) curve. Here different forecast probability thresholds are tested to calculate the likelihood of success if the event occurred, and the likelihood of error if the event did not occur. This evaluates what Murphy calls discrimination (forecast quality conditioned on what was observed), which complements reliability. See, e.g., http://www.bom.gov.au/wmo/lrfvs/roc.shtml.
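A rough sketch of how those ROC points are computed, using the standard names for the two rates (POD, probability of detection, and POFD, probability of false detection); the forecast data here is entirely made up:

```python
def roc_points(probs, observed, thresholds):
    """For each probability threshold, turn the probabilistic forecast
    into a yes/no forecast and compute (POFD, POD) for the ROC curve."""
    points = []
    for t in thresholds:
        hits   = sum(1 for p, o in zip(probs, observed) if p >= t and o)
        misses = sum(1 for p, o in zip(probs, observed) if p < t and o)
        fas    = sum(1 for p, o in zip(probs, observed) if p >= t and not o)
        cns    = sum(1 for p, o in zip(probs, observed) if p < t and not o)
        pod  = hits / (hits + misses) if hits + misses else 0.0
        pofd = fas / (fas + cns) if fas + cns else 0.0
        points.append((pofd, pod))
    return points

# Hypothetical forecast probabilities and rain outcomes:
probs    = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
observed = [0,   0,   1,    1,   1,   0]
print(roc_points(probs, observed, [0.3, 0.5, 0.7]))
```

Sweeping the threshold from 0 to 1 traces the curve from (1, 1) down to (0, 0); a forecast with no skill sits on the diagonal.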

Curiously, accuracy tends to take a back seat in forecast verification to other aspects of quality, particularly in rare-event situations. This trend began in the mid-1880s with the "Finley Affair", a series of published articles debating how to evaluate tornado forecasts issued by the US Army Signal Corps. Murphy published a fascinating literature review on the subject and showed that many of the skill scores and debates born during the Finley Affair are still active today. See http://www.nssl.noaa.gov/users/brooks/public_html/feda/paper.... [PDF warning]


If you're interested in the science (and some of the problems we have in the US) read UW Professor Cliff Mass' blog - http://cliffmass.blogspot.com/


It would be interesting to repeat the comparison for many specific locations, then plot the measured variance geographically as a "heat map" or landscape of error [1]. Are there patterns visible that could be attributed to local geography, population density, or other factors?

[1] It could be done, for weather.gov [2], using only data available from the web site [3].

[2] I don't really care about the other forecast sources.

[3] To trust rainfall observations obtained from weather.gov in order to compare them to predictions made by weather.gov seems vaguely wrong but there is no other comparable source of observations. They are physical measurements, after all.

[4] Some geographic areas probably have a coarser net of observation points. In some places, e.g., San Diego, the weather is inherently easier to predict. Some local forecast offices may be more skilled than others.


How wide an area is used to test against the predictions? I notice he calls out local weathermen; a thirty percent chance of rain given out by the TV guys for the metro Atlanta area is either dead accurate or dead wrong depending on what part of metro Atlanta I am in.

It would be really cool to see weather forecasts done up like topographic maps, with +/- symbols denoting changes in precip and the like.


A better model might be to baseline on NOAA predictions, then work with the variability of commercial predictions against that.


I have enjoyed http://darkskyapp.com/

In my totally unscientific opinion, weather.com has slowly gotten worse over the past several years. I think it's become a lot harder to monetize weather, and the weather service has really stepped up its game on providing public outlets that are digestible by the general public.

My guess is that in the early days, what weather.com and The Weather Channel were really doing was translating the difficult-to-understand NWS messages and helping the general public quickly answer: "Do I need a raincoat today?". That said, over time, as the NWS has stepped up its public interfaces, that value-add has been sliding backwards and getting harder to maintain.

That's my $0.02CPM worth.. :-)


I liked DarkSky too (I actually use Weather Line [1], which is powered by them; I switched a while ago for the better UI), but I've seen it take a noticeable dip in accuracy over the last 2 years or so.

On the PC I use the site Forecast.io. I could never use Weather.com or any of those other sites; too many ads and other junk making them hard to use. I want to know tomorrow's weather, not the pollen count forecast for this summer.

[1] http://weatherlineapp.com


Looks interesting I'll check it out!


"The further you get from the government’s original data, and the more consumer facing the forecasts, the worse this bias becomes. Forecasts “add value” by subtracting accuracy."

As interesting as this is, isn't it obvious that the further you get from the primary source, and the more consumer facing a report is, the less accurate it's going to be? This is the only opportunity for interested parties in the info-chain to massage the data to suit their own ends (Not necessarily nefarious, just selfish). If it's going to happen, this is where it will happen.

After all, we take the news we see on TV with a cellar of salt, so why would we believe the weather?


Actually, a meteorologist isn't just massaging the data. Models often have known biases - e.g., underpredicting rain when there is an offshore wind, or failing to model complex local topography and getting the mixing wrong. Forecasters use their local knowledge to tweak the forecast, and achieve significant accuracy gains. But people are expensive, so they are only affordable for high-impact cases, like aviation weather.


For a while, I've been thinking about doing exactly this sort of analysis. Thanks for putting in the effort!

Does anyone have experience getting a feed of the raw NWS forecast data for many points in a large region (e.g. a state or the whole country)? I was thinking the other day that it would be great to have a web site that showed the forecasted chance of precipitation across a region, to answer questions like "Where in the Colorado high country should I go camping this weekend?"
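One starting point: the NWS exposes a public API at api.weather.gov, where point metadata lives under the documented /points/{lat},{lon} endpoint (the response includes the gridpoint forecast URLs for that spot). A minimal sketch of enumerating query URLs for a region (the coordinates are just examples):

```python
def nws_points_url(lat, lon):
    """Build the api.weather.gov metadata URL for a lat/lon point; the
    JSON response links to the gridpoint forecast endpoints there."""
    return f"https://api.weather.gov/points/{lat:.4f},{lon:.4f}"

# Covering a region would just loop this over a lat/lon grid
# (mind the API's rate limits and User-Agent requirements):
print(nws_points_url(39.7392, -104.9903))  # Denver
```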



This is a quite cynical take on how weather forecasting works, written by someone who quite clearly does not know a single weather forecaster.

First of all, there are only two agencies on the planet that do the number crunching to work out reasonable forecast data encompassing the whole globe: the NWS and the UK Met Office. As well as a lot of big computers, these agencies need source data. This data - observations - comes from airports and plenty of other places where things like wind speed, precipitation, temperature and so on are actually measured. At times the observations are wrong - imagine the baking tarmac of that big airport and how it differs from the tranquil yet noisy houses close to a nearby river.

The NWS differs from the Met Office in that they don't charge for the GRIB data. The tax payer has paid for it already in the USA so they don't have to pay for it again. Hence the proliferation of things like The Weather Channel that use NWS rather than Met Office data.

One thing that outsiders to weather forecasting do not realise is what it is that weather forecasters actually do. They imagine them to be very scientific - which they are - but they don't realise that they are essentially in the 'betting shop' business. To take an automotive example: if you had perfect knowledge of every car entering tomorrow's F1 race, and perfect knowledge of the well-being of every single driver, mechanic and tea lady involved in the event, could you actually predict which of the 22 drivers is going to win? Will it be the guy on pole? The guy who has won most of the races so far? The guy who consistently comes second? Or some random outsider?

The GRIB data is far from perfect knowledge; it is a forecast of what is going to happen, and the accuracy depends on the time window going into the future. The data is fully three-dimensional - think of it as lots of onion layers going around the whole planet. Data points are on a grid - what happens if your town is next to some huge mountain, with 'your' data point on that grid being several thousand feet higher than where your town is? The GRIB data for your town is not actually for your town; it is for the mountain. A meteorologist will have rules of thumb plus the science to arrive at a more accurate guess than the GRIB gives - this is interpretation of the data, not some sixth sense. However, it is still a gamble/guess.

As well as the GRIB data there are things like satellite images - from lots of different flavours of satellite - plus there is radar data. This can all be layered on top of GRIB data and pretty maps to create an interpreted forecast. The 'wet bias' is more likely to be a rookie meteorology mistake than a devious ploy to get viewers watching. Look at any satellite image and see the low-level haze from things like jet plane 'contrails', plus coastal fog etc. There is an awful lot of it on satellite images, and it is very easy to be permanently predicting rain from seeing such cloudy greyness. Hence this is more likely to happen at the local weather station. On The Weather Channel, where they have excellent interpretation tools for their forecasters, this is less likely to happen - not so much because of the tools but because of the forecasters: they are more experienced gamblers.

The other thing to remember with weather forecasting is that today's predictions can be checked against tomorrow's observations. Things can be consistently wrong for a given town/area due to the way the GRIB data works (i.e. it does not factor in local topography), and it can take a while before this error in the model is discovered and fixed. There may not be observation data available for smaller towns, so some errors might never be fixed.

The weather prediction industry is fairly ripe for disruption. The tools that meteorologists use have traditionally required big workstations to run; nowadays a Google Earth type of app would suffice, if someone could be bothered to write it.

Amongst themselves, meteorologists know a lot more about the current factors influencing the big picture of the weather. For instance, the storms that start off on the west coast of Africa, cross the Atlantic and 'bounce back' to the UK, losing energy on the way to end up as mere rain. Clearly such weather patterns take weeks to do their thing; however, for a gardener in the UK it would be good to know if rain was on its way over the next few weeks. Yet the demands of the forecasting format mean that the forecaster has to tie that down to 'rain expected teatime next Tuesday' (or whenever). Returning to the 'app' idea, it would be great for everyone if they could explore the raw data and have these bigger events pointed out by an expert, so that the raw data can be interpreted in a meaningful way. Instead we have banal 'insights' such as this article (which probably did not intend to be banal or naive, but that is the way things sometimes happen despite trying hard).


You missed the biggest player in numerical weather forecasting -- the European Centre for Medium-Range Weather Forecasts (ECMWF). Everyone agrees their models are the world's best. Regrettably, the American models are pretty far behind: http://cliffmass.blogspot.com/2014/04/the-us-slips-to-fourth...

The ECMWF forecasts are indeed quite expensive, but all the major players in weather forecasting, including The Weather Channel and the National Weather Service, buy them.


Very interesting! It has been a while since I worked in weather, and as far as I was concerned there was the U.S. 'free' data and the British 'paid for' data. ECMWF does ring a bell, though; I think we did have selected access to some of their forecast products, but it has been a while. The centre is very much based in the UK - in Reading, specifically, which is where anyone with a meteorology degree in the UK studies.

I think that a history of weather forecasting would be quite fascinating as it is tied in to the growth of the British Empire, the needs of Britain 'ruling the waves', the development of the telegraph, the development of aviation and, of course, satellites and supercomputing. For many decades meteorology has been in complete denial about climate change, this too would make an interesting chapter to the story.


I say, old chap, I'm afraid you are being rather anglo-centric and what not. I have it on good authority they have weather in the heathen lands to the Far East. Why, the Japanese built an 'Earth Simulator' with 10TB of RAM back in the last century.

And apparently the Indians launch their own weather satellites into space. Gosh.


Do you know where/if I can find that data/chart on the NWS site? The author says the NWS deserves credit for not holding anything back, but for some reason I cannot find that data on the NWS web sites.


I'm not sure... you should ask on Cliff's blog! The NWS/NOAA websites are rather ad-hoc and disorganized.


Not just this, but the forecasters themselves are (of course!) interested in their own accuracy & predictive power:

http://www.metoffice.gov.uk/about-us/who/accuracy/forecasts


Around the turn of the century I knew a South African who'd spent some years in the UK. She said that in the UK, they'd predict "rain tomorrow at 3, for 15 minutes", and the next day there'd be rain at 3 for 15 minutes. In comparison, in Cape Town you could actually see the inbound thunderstorms across the sea, and they'd still get the forecast wrong.


I'd love to see this done with temperatures, other kinds of precipitation, etc. I think I'll have to make it a weekend project...


Anecdotally, I can corroborate the observation: weather.gov slightly underpredicts rain; a "30% chance" is associated with rain more than half the time.



