Tony O’Hagan Responds (Not on Behalf of RMS)

April 27th, 2009

Posted by: Roger Pielke, Jr.

[UPDATE: Prof. O'Hagan explains in the comments that his response reflects his personal views and not those of RMS, so I have altered the title.]

Last week I argued that the RMS elicitation of expert views on hurricane landfalls over the next five years gave a result no different than if a bunch of monkeys had engaged in the elicitation. Professor Tony O'Hagan, who conducted the elicitation on behalf of RMS, responds in the comments to that thread. Below I have reproduced his comments along with my rejoinders, provided in bold:

I am the statistician who conducted the expert elicitation that Dr Pielke derides. I feel that I must answer his unbalanced criticism of the procedure that I adopted in collaboration with RMS. Like Dr Pielke, I was engaged by RMS as an expert to help them with the assessment of hurricane risks. My skills are in the area of probability and statistics, but in particular I have expertise in the process of elicitation of expert judgements. I am frequently dismayed by the way that some scientists seem unprepared to acknowledge the expertise of specialists in other fields from their own, and seem willing to speak out on topics for which they themselves have no specific training. During the elicitation exercise it was essential for me to trust the undoubted expertise that he and the other participants had in the science of hurricanes, and I wish that he had the courtesy to trust mine.

PIELKE RESPONSE: Prof. O’Hagan is apparently unaware that I am trained in the social sciences, with social science methodology as one of my major fields.

Let me now address Dr Pielke’s specific criticisms.

First, he says that the results obtained were indistinguishable from the results of randomly allocating weights between the various models, and he implies that this is inevitable. The latter implication is completely unjustified. I was not involved in the 2006 elicitation which Dr Pielke uses for his numerical illustration, but I can comment on the two most recent exercises. The experts were given freedom to allocate weights, and did so individually in quite non-random ways. In aggregate, they did not weight the models at all equally. The fact that the result came out in the middle of the range of separate model predictions in 2006 was therefore far from inevitable.

PIELKE RESPONSE: I am happy to see Prof. O’Hagan acknowledge that the results were indistinguishable from a random allocation. On this point we agree. Prof. O’Hagan can further clarify the situation by releasing the values for the five-year predictions from the 39 models used in the 2008 elicitation. I ask him to submit these in the comments to this thread.

The elicitation exercise was designed to elicit the views of a range of experts. They were encouraged to share their views but to make their own judgements of weights. Dr Pielke says that the more experts we have, the more likely it is that the elicited average will come out in the middle, which is again fallacious. The result depends on the prevailing opinions in the community of experts from whom the participants were drawn. The experts who took part were not chosen by me or by RMS but by another expert panel. If, from amongst the models that RMS proposed, all the ones which would give high hurricane landfalling rates were rejected (and so given very low weights) by the experts, then the result would have ended up below the centre of the range of model predictions. The fact that it comes somewhere in the middle is suggestive, if it suggests anything at all, of RMS having done a good job in proposing models that reflected the range of scientific opinion in the field.

PIELKE RESPONSE: Again, I am happy to see Prof. O'Hagan acknowledge that the experts did little more than confirm the distribution of models presented by RMS. I will reiterate that a group of experts with a wide range of views, such as found in the tropical cyclone community, will inevitably provide a result indistinguishable from a set of random views, just as a panel of monkeys allocating random weights would have done, as I argued in my earlier post. This point is simply a logical one. If the community had a consensus, presumably such an elicitation would be unnecessary.

I think the above also answers Dr Pielke’s criticism of RMS’s potential conflict of interest. I agree that this potential is real. RMS is a commercial organisation and their clients are hugely money-focused. Nevertheless, as I have explained, the outcome of the elicitation exercise is driven by the judgements of the hurricane experts like Dr Pielke. Any attempt by RMS to bias the outcome by proposing biased models should fail if the experts are doing their job. If Dr Pielke is convinced, as he appears to be, that no model can improve on using the long-term average strike rate, then he could have allocated all of his weight to this model. That he did not do so is not the fault of RMS or of me.

PIELKE RESPONSE: It is telling that Prof. O'Hagan sees fit to attempt to reveal publicly my individual allocations in the exercise after the participants in the elicitation were assured by him and RMS that any individual information would remain confidential. I am sure that there are other "confidential" details about the elicitation that many people would be interested to hear about. I remain perfectly comfortable with my allocation in the process despite the fact that I could have been replaced with a monkey to no real effect on the outcome. My views on one to five year predictions are expressed in a paper (currently under review) that I am willing to share with anyone interested (pielke@colorado.edu).

This brings me back to the question of expertise. The elicitation was carefully designed to use to the full the expertise of the participants. We did not ask them to predict hurricane landfalling, which is in part a statistical exercise. What we asked them to do was to use their scientific skill and judgement to say which models were best founded in science, and so would give predictions that were most plausible to the scientific community. I believe that this shows full appreciation by RMS and myself of the expertise of Dr Pielke and his colleagues. For myself, the expertise that Dr Pielke seems to discount completely is based on familiarity with the findings of a huge and diverse literature, on practical experience eliciting judgements from experts in various fields, and on working with other experts in elicitation. In particular, I have collaborated extensively with psychologists and other social scientists. I don’t know how much Dr Pielke knows of such things, but to complain that what I do is “plain old bad social science” is an insult that I refute utterly.

PIELKE RESPONSE: I would agree that Prof. O’Hagan does not know how much I am aware of such things.

Dr Pielke is no doubt highly-respected in his field, but should stick to what he knows best instead of casting unfounded slurs on the work of experts in other fields.

PIELKE RESPONSE: Academics sometimes like to conflate a professional critique with a personal “slur,” perhaps to change the subject. I have the highest respect for RMS as the leading catastrophe modeling firm with an important role in the industry. It is the importance of RMS to business and policy that merits the close attention to what they are doing.

In this case, I judge the elicitation methodology to be significantly flawed in important respects. This perspective is no “slur,” just reality. Prof. O’Hagan can help to further clarify the situation by focusing on the critique rather than expressions of outrage. He might start by releasing the results from the 39 models used in the 2008 elicitation. I ask that he publish these in the comments to this thread.

45 Responses to “Tony O’Hagan Responds (Not on Behalf of RMS)”

  1. Tony O'Hagan Says:

    My response was NOT made on behalf of RMS. It was my own personal reaction to the unfair claims in your original post, Roger.

    I apologise for inadvertently indicating something about your own participation in the elicitation exercise, although my intention was only to emphasise my point that the outcome is driven by the judgements of you and your fellow scientists, not by the choice of models. It’s worth pointing out also that all the experts were given the opportunity, well in advance of the elicitation meeting, to propose other models or to object to some of RMS’s proposed modelling.

    I was not aware that you have social science training, but this does not reduce my annoyance at your unjustified claims.

    Finally, of course, publication of the results from the 2008 elicitation is a matter for RMS, not for me.

  2. Roger Pielke, Jr. Says:

    -1-Prof. O’Hagan

    Thanks for the clarification that these are your personal views and not those of RMS. I have added an update to the top of the post noting this.

    While I appreciate that you may be annoyed at seeing a critique of your work, it does not change the fact that I have demonstrated quantitatively that a random set of weightings produces the same result as the experts' weightings in the 2006 and 2007 elicitations (BTW this is also the case for the results for Cat 1-2 storms for 2006 and 2007).

    Should RMS wish to reveal the output from the 39 models used in 2008, then we can see if this is also the case for 2008. As the person responsible for the elicitation, presumably you have this information and can easily test the proposition on your own, even if releasing the information publicly is not your call.

    Again, thanks for sharing your views.

  3. Craig Loehle Says:

    In order for an elicitation to be useful, it must be demonstrated that the experts are capable of delivering a non-random judgement. This is not always the case. Elicitation of experts on likely stock market behavior does not often yield skillful forecasts. In the case at hand, the question is elicitation of opinions about model skill for a set of models. It is a second order question, where expert opinion is even less likely to be right. Just because one has a methodology for elicitation does not mean that the results will be meaningful. In statistics, the most fundamental question is the null expectation: what happens in the random case. For flipping coins or for samples from a normal distribution, we can obtain expected frequencies and then use these to test our experimental outcome. In the case here, Pielke is exactly right that the null expectation is random selection of models and the score from that random selection. The fact that the experts do not differ from the null does NOT prove that they don't agree, but it does prove that you can't say there is a consensus outside of the null (i.e., you can't reject ignorance of model skill as an explanation), just as I could not reject a fair coin if 1000 flips came out at a proportion of 0.501 heads. This has nothing to do with ethics or Dr. O'Hagan's credentials except insofar as it is not proper to claim a consensus among experts is X when you can't prove it was not a random result.

  4. Craig Loehle Says:

    How might we determine that the experts DID agree and that the result was NOT the outcome of chance? We could use a binomial model with n experts and m models to obtain a distribution of frequencies likely to obtain by chance. If the experts were significantly more concentrated in their votes than chance (e.g. all of them picked only 2 models), then we might say that the result was NOT random. Otherwise, I am afraid that you must conclude that the result was indeterminate (i.e., we can't say whether it is a valid vote or a random outcome).

    And yes, I do know what I am talking about. Google me.

  5. Maurice Garoutte Says:

    Since this thread talks about social science it seems a good place to mention social discourse about science. I always consider the quality of the argument as relative to the quality of the science. This thread is a gentle example of using a fallacious argument (ad hominem here) instead of arguing the facts of the case. Craig in post 4 gives an example of how to refute or confirm Roger's critique. Since Mr. O'Hagan did not refute the facts of Roger's argument, a logical conclusion is that the ad hominem attacks were the best defense available.

    Once when searching realclimate for pages about Steve McIntyre and the hockey stick I found four paragraphs of straw men set afire before any mention of math. By that fifth paragraph I was not inclined to give any credibility to the science of the argument.

    Since fallacious arguments are not a staple here please see http://www.don-lindsay-archive.org/skeptic/arguments.html for the form of many arguments used by the faithful when logic and facts just won’t work.

  6. Tony O'Hagan Says:

    I will respond one more time to this thread.

    The argument about random weights is superficial. For a start, Roger does not specify what he means by random weights. There are infinitely many probability distributions on the (n-1)-dimensional simplex that could be used to draw a set of n weights summing to 1. You could restrict this by supposing that Roger intends all n weights to have the same marginal distribution, but that still leaves an infinite number of choices. There is an interesting literature in statistics about this. Any choice of a distribution will be arbitrary, but we cannot apply Craig Loehle’s suggestion of a significance test without making such an arbitrary choice. (Incidentally, Craig, I suggest you try implementing your “binomial model”.)

    For the record, there is some discussion of the results of the 2007 exercise in the article by O’Hagan, Ward and Coughlin entitled “How many more Katrinas? Predicting the number of hurricanes striking the USA” in Significance, December 2008, 162-165. (Significance is the Royal Statistical Society’s general interest magazine.) There were 7 experts who assigned weights to 20 models, so if we average over all models and experts the average weight given to any one model by any one expert is 0.05. However, across the 7 experts the average weights for individual models ranged from 0.0029 to 0.1274. As I’ve explained above, it isn’t possible to formulate a unique test for a hypothesis of randomness, but I think that according to any sensible criterion this would indicate clearly that the experts were not guessing and that there is some degree of consensus in the community about which models were scientifically credible.

    I hope this resolves the question somewhat, and that readers of this thread can now see more clearly why it is superficial to criticise the elicitation simply on the grounds that a “random” set of weights would have given the same average prediction. This can of course happen without the experts’ weights being at all random. I said as much in my original remarks and commented, “The experts were given freedom to allocate weights, and did so individually in quite non-random ways. In aggregate, they did not weight the models at all equally.” I trust that the above figures and the cited article demonstrate this if I was not believed in the first place.

    I’m sorry if Maurice Garoutte sees the debate as of poor quality and my criticisms as ad hominem. The latter complaint is to some extent justified, but my original response did also contain cogent criticism which was either ignored or not understood. I “did not refute the facts of Roger’s argument”, Maurice, but I did refute the logic.

    At the risk of repeating my offence, I will end by saying that I have spent forty years learning my trade as a statistician but unfortunately my discipline is one that people in other fields seem to think they can pontificate about on the basis of a fraction of that training.

  7. Roger Pielke, Jr. Says:

    -6-Prof. O’Hagan

    You write: “For a start, Roger does not specify what he means by random weights. ”

    Sure I did. As I wrote in the original post:

    “I created a panel of 5 “monkeys” by allocating weights randomly across the 20 models for each of my participating monkeys.”

    How did I do it? I used the random number function in Excel — RAND(). (No monkeys were harmed in the experiment;-)
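
    (For anyone who wants to reproduce the idea outside Excel, here is a minimal Python sketch of such a "monkey panel". The five-year landfall rates assigned to the 20 models are made-up placeholders, since the actual model values from the elicitation have not been released; only the weighting mechanics follow the description above.)

    import numpy as np

    rng = np.random.default_rng(0)
    n_monkeys, n_models = 5, 20

    # Hypothetical five-year landfall rates for the 20 models (placeholder values only).
    model_rates = rng.uniform(1.5, 2.5, size=n_models)

    # Each monkey draws a RAND()-style uniform weight per model, then normalizes to sum to 1.
    weights = rng.random((n_monkeys, n_models))
    weights /= weights.sum(axis=1, keepdims=True)

    # Panel prediction: average of each monkey's weighted-average landfall rate.
    panel_prediction = (weights @ model_rates).mean()
    print(round(panel_prediction, 3))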

    I have demonstrated that the results of such a panel of random weighting monkeys lead to landfall predictions indistinguishable from those of the experts. I have of course not argued that the experts themselves submitted random weightings. The point however is that the elicitation process does not distinguish between a panel of monkeys and a panel of world experts.

    Prof. O’Hagan agrees: “. . . it is superficial to criticise the elicitation simply on the grounds that a “random” set of weights would have given the same average prediction. This can of course happen without the experts’ weights being at all random.”

    Precisely my point.

  8. Mark Bahner Says:

    Some random comments:

    Dr. O’Hagan, you state, “During the elicitation exercise it was essential for me to trust the undoubted expertise that he and the other participants had in the science of hurricanes, and I wish that he had the courtesy to trust mine.”

    But there’s no need to “trust” anyone’s expertise. Even if you were another Einstein or Newton, there’d be no need to “trust” your expertise. Even Einstein and Newton were either right or wrong.

    2) Craig Loehle, you state, “If the experts were significantly more concentrated in their votes than chance (e.g. all of them picked only 2 models), then we might say that the result was NOT random. Otherwise, I am afraid that you must conclude that the result was indeterminate (ie, we can’t say whether it is a valid vote or a random outcome).”

    Indeed! Good stuff! :-)

    3) Roger and Craig Loehle, how do you respond to Tony O’Hagan’s statements:

    “There were 7 experts who assigned weights to 20 models, so if we average over all models and experts the average weight given to any one model by any one expert is 0.05. However, across the 7 experts the average weights for individual models ranged from 0.0029 to 0.1274. As I’ve explained above, it isn’t possible to formulate a unique test for a hypothesis of randomness, but I think that according to any sensible criterion this would indicate clearly that the experts were not guessing and that there is some degree of consensus in the community about which models were scientifically credible.”

    ?

  9. Roger Pielke, Jr. Says:

    -8-Mark

    I’ll respond with an excerpt from Prof O’Hagan’s paper that he cites:

    "The combination of the experts simply averages the seven density curves. The overall mean prediction for landfalling hurricanes, of any category, is 1.985, with a standard deviation of 0.210. Another measure of the agreement of the experts is that variability between experts contributed only 13% of the total combined variance. However, it should be emphasised that the experts weighted the various models quite differently, and it is possible that such differences of opinion over the underlying science would, in another year, have led to larger discrepancies in their numerical predictions."

    FYI, Monkeys = 2.01

  10. lucia Says:

    Professor O’Hagan-

    Do you have a table showing the weights for each model given by the seven experts? That is, a table of 7*20 entries showing:

    model : 1 2 3 4 …. 20
    E1: 0.1 0.5 0.2 0 ….
    E2: 0.05 0.3 0.1 0.02 ….

    With E1 meaning expert 1, models 1-20 being the models, and the entries being the weights assigned by expert 1.

    I don't need the models or experts named, just the table. I want to check something to see if Roger's expert monkey theory holds up in other tests.

  11. solman Says:

    Professor O’Hagan,

    The question at hand is this:

    Do elicitations, as carried out by yourself and others on behalf of RMS, provide any additional skill beyond a naive combination of those models selected by RMS for the elicitation?

    Nobody is suggesting that the experts involved assigned their weights randomly.

    I don't believe that anybody doubts that an obviously fallacious model (e.g. one linking Atlantic cyclone activity to the performance of the New York Mets) would receive low weights from the participants.

    But, as you seem to admit, the results of the expert elicitations seem to be indistinguishable from the results of a non-expert elicitation.

    If 50% of the models with above average predictions were randomly removed from the elicitation, would there be a statistically significant difference between the experts' results and a random weighting?

    Isn’t it the responsibility of RMS to demonstrate the value of this exercise by conducting such an experiment?

  12. Craig Loehle Says:

    Running the following Mathematica code (which has 7 experts randomly spread their votes among 20 models)
    wttot = Table[0, {i, 1, 20}]; li = Sort[Table[Random[], {i, 1, 19}]]; Do[wt = {li[[1]]}; Do[AppendTo[wt, li[[i]] - li[[i - 1]]], {i, 2, Length[li]}]; AppendTo[wt, 1 - li[[19]]]; wttot = wttot + wt, {k, 1, 7}]; wttot = wttot/7.
    a bunch of times, I find that a maximum weight of 0.12 on a single model is at the low end of the results (that is, most runs by chance give a single model more weight than this). This means that the "experts" in this elicitation appear to have overdispersion of choice among the models compared to random (they agree less than chance). This could result from selecting a range of experts with varying views of the models.
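
    (For readers without Mathematica, a rough Python sketch of the same stick-breaking idea is below. It is an illustration rather than a translation of Craig's exact code: it draws a fresh set of 19 break points for each of the 7 simulated experts, which is how the description "7 experts randomly spread their votes among 20 models" reads.)

    import numpy as np

    rng = np.random.default_rng(1)
    n_experts, n_models = 7, 20

    def stick_breaking_weights():
        # 20 weights summing to 1: the spacings between 19 sorted Uniform(0,1) draws.
        cuts = np.sort(rng.random(n_models - 1))
        return np.diff(np.concatenate(([0.0], cuts, [1.0])))

    # Average the 7 simulated experts' weights and inspect the extremes.
    avg_weights = np.mean([stick_breaking_weights() for _ in range(n_experts)], axis=0)
    print(avg_weights.max(), avg_weights.min())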

  13. Craig Loehle Says:

    Per my above note, there are other ways to allocate an individual's weights to be "random": maybe people only focus on a few models as best. My code above was a totally random allocation of weights.

  14. Mark Bahner Says:

    Hi Roger,

    I’m probably a bit handicapped by not knowing exactly what’s going on. (Or caring.) ;-)

    But I think the crux of the matter is how those experts ranked the probabilities of the various models, not that the experts returned a value of 1.985 hurricanes and the monkeys (really chimpanzees…and it’s good to know you didn’t harm any of them! ;-) ) returned a value of 2.01.

    So I agree with where both Craig Loehle and Lucia seem to be: to get to the crux of the matter, it would be good to get the 7*20 table of the rankings of the models by the experts.

    Let’s say, in the most extreme (and completely implausible) case, that the experts ranked model #17 as having a probability of 1.0, and all the other 19 models as having a probability of 0.0. Well, no monkeys (or Excel random number generators) could do that.

    P.S. The above doesn’t deal with another criticism that I think Craig Loehle had that I share, i.e., “In the case at hand, the question is elicitation of opinions about model skill for a set of models. It is a second order question, where expert opinion is even less likely to be right.”

    If the question is, “How many more Katrinas? Predicting the number of hurricanes striking the USA” I don’t see the point in asking experts about models, with the experts being blind to the results of the models.

    If I was an expert, and found out (after the fact!) that my preferred model(s) had predicted ~20 landfalling hurricanes, rather than ~2, I would want a “do over” for my portion.

    I don’t understand why the elicitation wasn’t the more straightforward question of the number of hurricanes of various strengths.

    P.P.S. Oh! Craig Loehle's results with Mathematica seem to indicate that a maximum weight of ~0.12 for a single model is no remarkable finding (i.e., not remarkable in the experts' agreement on a particular model). This seems (to me, an amateur who isn't even particularly interested enough to delve deeply into the subject) to be a pretty powerful criticism of the results. That is, a maximum weight of 0.1274 doesn't seem spectacularly different from what might occur from chance.

  15. lucia Says:

    Craig,
    I suspect if RMS included a huge range of models, including some predicting hurricane landfall based on the ancient Maya calendar or mentions in The Farmer's Almanac, we would discover the experts' rankings would tell us something about which models were best or worst. In contrast, monkeys would not.

    However, if RMS provides a bunch of models all of which have adherents among the experts, there will be no clear favorite among the models in the group of 20.

    If Tony would provide the 7 * 20 table, we could do a direct test of the hypothesis that the models have identical rankings. While I have no reason to doubt Tony's statement that there is no unique test for randomness, tests do exist of the hypothesis that the individual models would achieve different ratings if Tony were to repeat the experiment with some other set of randomly chosen experts.

    Maybe Tony has done these, but he hasn’t mentioned them here.

  16. Tony O'Hagan Says:

    OK, I really didn’t want to enter this discussion again because it is clear that Roger just fails to understand my argument (whether wilfully or not I don’t know). However, other contributors raise technical questions that have some interest. You must understand that I cannot release data that belong to RMS.

    First, Craig’s algorithm generates data from a distribution known to statisticians as the Dirichlet distribution with parameters a vector of ones. It is also sometimes known as the stick-breaking distribution. It has uniform density on the simplex, but otherwise has no special rationale as a distribution for this problem. (Roger’s description of his way of making random weights suggests his weights would not even sum to one. Perhaps he did the same as Craig without feeling the need to explain himself properly.)

    So this is an arbitrary choice, but let’s suppose that we really believe this is the way to test for randomness in this problem. I don’t use Mathematica but Craig’s code looks like it’s doing the right things. I repeated his exercise using my own preferred computing platform. I generated 7 independent sets of Dirichlet(1) distributed weights (using the same method of sorting random numbers) over 20 models and averaged them. I did this a million times and in only 10266 of these iterations (about one percent) did the maximum of the averaged weights equal or exceed 0.1274. In not one of these million iterations did the minimum of the averaged weights fall below 0.0029. I believe my code is right (and the result agrees with what intuition and algebra I can bring to bear), but it’s obviously quite different from what Craig got. Perhaps someone else would like to repeat this exercise.
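
    (A Python sketch of the kind of Monte Carlo check described above, using numpy's Dirichlet sampler. This is not Prof. O'Hagan's code, and it uses 100,000 iterations rather than a million to keep the run short; the question asked is the same: how often does the maximum averaged weight reach 0.1274, and how often does the minimum fall to 0.0029?)

    import numpy as np

    rng = np.random.default_rng(2)
    n_iter, n_experts, n_models = 100_000, 7, 20

    # Draw 7 independent symmetric Dirichlet(1) weight vectors per iteration and average them.
    w = rng.dirichlet(np.ones(n_models), size=(n_iter, n_experts))
    avg = w.mean(axis=1)

    # How often do the simulated averages reach the extremes reported for the real experts?
    print("fraction with max >= 0.1274:", np.mean(avg.max(axis=1) >= 0.1274))
    print("fraction with min <= 0.0029:", np.mean(avg.min(axis=1) <= 0.0029))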

  17. jasg Says:

    This result reflects quite a lot of modeling and statistical exercises in that, regardless of how sound the method is (and we're all sure of Prof O'Hagan's expertise there), the exercise hinges on the initial assumptions: in this case, on what constitutes an expert.

    An expert on hurricane predictions is not the same as an expert on how hurricanes work; it is someone who has a good track record of predicting them. Since there is no such beast, the participants were not actually experts.

    Hence what Prof O'Hagan apparently isn't understanding is that Roger was not questioning the method, or the expertise of the statistician, but simply the initial assumption that it was a valid exercise in the first place. Put away your tools Lucia and Craig because it's utterly pointless.

    But shouldn't the main concerns of the insurance industry be about construction standards and the number of people moving into harm's way? Aren't those the real issues?

  18. lucia Says:

    Tony–
    I understand that if the data are proprietary, you can't release them. It's a bit of a shame that the journal articles did not discuss any test of whether the mean weight assigned to each model was actually identical across models. It could easily have been done and mentioned in a sentence, and it is a worthwhile question irrespective of the total number of landfalls predicted.

    However, given the information you provide, I ginned up an Excel spreadsheet to come up with weights "monkeys" would give. I did this:

    Each "expert monkey" assigned a weight to each of 20 models using rand() in the uber-slick Excel spreadsheet. Since the sum will not equal 1, I then divided each of these weights by the sum over all 20 models. I then used the normalized weights for further analysis.

    I then created 7 expert monkeys and found the average weight for each model using the "average()" function in Excel.

    Then I found the max and the min over all 20 models. I ran 20 trials, cutting and pasting the min and max into cells manually, and compared them to the min and max weights of 0.0029 and 0.1274 that Tony reports from the elicitation. The actual values I got were:

    Trial   Min     Max     Below 0.0029   Above 0.1274
    1 0.031 0.073 0 0
    2 0.023 0.075 0 0
    3 0.031 0.069 0 0
    4 0.029 0.065 0 0
    5 0.023 0.080 0 0
    6 0.032 0.076 0 0
    7 0.034 0.070 0 0
    8 0.028 0.068 0 0
    9 0.037 0.070 0 0
    10 0.035 0.074 0 0
    11 0.038 0.074 0 0
    12 0.021 0.071 0 0
    13 0.026 0.072 0 0
    14 0.025 0.075 0 0
    15 0.039 0.075 0 0
    16 0.029 0.076 0 0
    17 0.027 0.077 0 0
    18 0.034 0.068 0 0
    19 0.032 0.058 0 0
    20 0.035 0.071 0 0

    Totals: 0 below 0.0029, 0 above 0.1274

    Though I only ran this 20 times (and while I greatly enjoy wasting time), eyeballing the results I concluded that the experts are not interchangeable with the species of monkey I programmed into Excel.

    So, I’m sorry to say, it appears to me that Roger and the other experts may behave differently from monkeys, at least when assigning weights to hurricane models.

    I'd still like the table of 7*20 weights. But the evidence Tony provides does suggest he is right when he says the distribution of weights indicates the experts had detectable preferences.

    My spiffy EXCEL spreadsheet can be made available to any who wish to inspect it.

  19. Craig Loehle Says:

    Thank you Dr. O'Hagan, that was a very relevant response. I don't think mine was exactly the Dirichlet distribution, but it is an open question what random distribution would reflect experts choosing models. Maybe they only focus on the 3 best (in their view) and give low weights to the others. Re: your estimate that the 0.12 figure was not random, it still may not reflect very STRONG agreement, just agreement a hair above chance. In any case, this is the proper type of question to ask.

  20. KevinUK Says:

    17 – jasg

    Stop spoiling all the fun! I'm going to wager right now that you aren't an expert in hurricane predictions. Am I right? If so, then I hope you are not insulted, as certain selectors of hurricane prediction experts appear to be, by being likened to a monkey. Personally I'm happy (and hopefully you are too) to be in the monkey group because, after all, as Roger has just shown, I'm just as skillful at predicting hurricanes as an expert.

    Now Tony O'H, who should Roger send an email to in order to obtain the 2008 elicitation data? Were the same 7 experts used? Next year will RMS save themselves some dosh and enlist the services of Roger's Excel monkeys instead? If so, then I have Excel and I am available for a somewhat lesser fee than any of your chosen experts (sorry Roger, but this is capitalism!). If it helps, I am a trained 'borrower of clients' watches in order to tell them the time', i.e. I am an ex safety and risk management consultant.

    KevinUK

  21. KevinUK Says:

    18- lucia,

    Can I have your spreadsheet please (on this occasion you can keep your watch, as I'm sure you already know the time), as I can see a sniff of some potential work here? With Gordon Brown as our PM, life is getting very hard here in the UK, so any possibility of work, even if it's remote, has to be explored.

    KevinUK

  22. lucia Says:

    KevinUK–
    Yes. You may have my spreadsheet.

    However, I tweaked the spreadsheet to create a new breed of monkeys. Examining the expert/monkeys' choices of weights, I noticed that my expert/monkeys tended to give weights that looked 'too clustered' compared to what I think inexpert people do when assigning weights. That is, my monkeys were random, but also not very opinionated.

    I wondered what a batch of opinionated monkeys would do. To create opinionated monkeys, I had each monkey pick initial weights using rand()*10. As before, I then summed and normalized each monkey's weights to sum to 1.

    This time, even though the monkeys were still giving weights randomly, I found the max/min table looks like this:

    Trial   Min     Max     Below 0.0029   Above 0.1274
    1 0.003 0.200 0 1
    2 0.001 0.201 1 1
    3 0.001 0.153 1 1
    4 0.000 0.130 1 1
    5 0.003 0.159 1 1
    6 0.001 0.144 1 1
    7 0.000 0.204 1 1
    8 0.008 0.135 0 1
    9 0.001 0.145 1 1
    10 0.000 0.155 1 1
    11 0.002 0.120 1 0
    12 0.001 0.262 1 1
    13 0.009 0.128 0 1
    14 0.004 0.129 0 1
    15 0.000 0.162 1 1
    16 0.001 0.195 1 1
    17 0.001 0.146 1 1
    18 0.000 0.167 1 1
    19 0.000 0.140 1 1
    20 0.000 0.122 1 0
    Totals: 16 below 0.0029, 18 above 0.1274

    So, in 16 of the 20 trials the minimum average weight was lower than the lowest one Tony O'Hagan mentioned in comments, and in 18 the maximum was higher than the highest.

    So, the weights underlying the choices might be similar to those assigned by opinionated monkeys.

    Clearly, knowing the highest and lowest weight is insufficient information to determine whether experts collectively actually prefer some models to others. Those maximum and minimum weights would result if the panelists individually have strong preferences but their preferences are totally uncorrelated with each other. That is to say: if there is no general preference within a group of opinionated experts.

    To do a better test I would need the 7*20 individual weights.
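
    (A small Python sketch of the two breeds of monkey Lucia describes. The "rand()*10" step is read here as raising the uniform draws to the 10th power before normalizing, as Prof. O'Hagan also presumes in a later comment, since simply multiplying every draw by 10 and then normalizing would leave the weights unchanged. The sketch only shows that the powered, "opinionated" weights are far more spread out.)

    import numpy as np

    rng = np.random.default_rng(3)
    n_draws, n_models = 50_000, 20

    u = rng.random((n_draws, n_models))
    plain = u / u.sum(axis=1, keepdims=True)                  # first breed: normalized uniforms
    powered = u**10 / (u**10).sum(axis=1, keepdims=True)      # "opinionated" breed: powered, then normalized

    # Average maximum weight per simulated panelist under each scheme.
    print("plain monkeys, mean max weight:      ", plain.max(axis=1).mean())
    print("opinionated monkeys, mean max weight:", powered.max(axis=1).mean())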

  23. Roger Pielke, Jr. Says:

    Lucia has (unsurprisingly) perfectly replicated my method.

    Prof O’Hagan’s analysis shows that scientists’ judgments about individual models can be shown to differ from a set of judgments made randomly. This is neither surprising nor relevant.

    The issue is the collective judgment of the panel, which is indistinguishable from the monkeys. The collective judgment matters because this is what RMS uses in its model (specifically, the average). All else is a red herring in this context.

  24. lucia Says:

    Roger–
    I tweaked the method to further test whether the range of weights is evidence "that there is some degree of consensus in the community about which models were scientifically credible," as Tony claims. Initially, I thought it was. But I changed my mind when I made the monkeys 'opinionated'.

    I am interested in testing this underlying claim made by Tony O'Hagan, as I think it is interesting to know whether or not there is any consensus about which models are credible. I know one method to test that, but I would need the 7*20 individual weights. Unfortunately, those appear to be proprietary.

    I'm going to post at my blog. (I get a kick out of the idea that you all could be replaced by opinionated monkeys.)

    (BTW: I’m sure the tweak would result in landfall estimates that, on average, match what you got. The standard deviation of landfall predictions from monkey-elicitation to monkey-elicitation would probably rise.)

  25. Martin Ringo Says:

    Let me pose a small question. Suppose we use Lucia's (randomly) opinionated or un-opinionated Excel monkeys, or Craig's Mathematica-generated monkeys if different from Lucia's, or some I-only-pick-one-model monkeys, or … well, as Tony O'Hagan has noted, there are infinitely many ways of expressing the distribution of weights. Further suppose we take a large sample from a finite (and reasonably small) selection of these ways, and then calculate the predicted number of hurricane landfalls (or whatever).

    Can we from these last statistics tell the difference of one method of simulating the weights from another? (We can, so long as we don't let the number of monkeys go to infinity too :-) ) And are any of the differences attributable to distributional modeling important vis-a-vis the fundamental difference between the sample average and the historical average? I suspect not. (Distribution theorists' unproved theorem, believed by all except a set of measure zero: as the sample size increases, all sampling distributions are statistically different.) So until we see a good demonstration, either theoretical or empirical, we should accept that we can't tell the difference between the distributions of the sample average when we can't tell the difference between the sample averages! :-)

  26. lucia Says:

    Martin

    "Can we from these last statistics tell the difference of one method of simulating the weights from another?"

    Do you mean from the final landfall statistics? By comparing one realization from method a to one realization from method b, with no additional information? I would assume we can't tell which method is better, nor whether the difference in numerical value is statistically significant.

    I think Tony was saying that we can detect the existence of a consensus among experts as to which models are credible based on the maximum and minimum weights in the distribution of weights they assigned. Do you think we can detect that consensus? Or not?

    I think we don’t have enough information.

  27. Martin Ringo Says:

    I am basically disagreeing with Tony O’Hagan on the basis of a thought experiment (although one you can actually do if you want to make the effort). I posed that there were differences in the distribution of the choices of weights for the models. The panel of experts would have one. A normalized (to sum to one) uniform distribution of individual weights would be another. And it is little problem to come up with a bunch of others without even dealing with parameter differences.

    However, the bottom-line statistic that Roger was interested in was the weighted average of the models' predictions (where the weights come from the average of weights from whatever distribution we use). What I was saying, in my convoluted way, was that when looking only at the predictions (and imagining we had a grand Monte Carlo of each of the distributions of weights), the differences will be small in comparison to the difference between the historical and expert averages. Hence my tautological quip at the end.

  28. jasg Says:

    KevinUK
    Despite being no expert, I did manage to predict that 2006 would be a quiet year while most "experts" were spreading panic after 2005. I also managed to predict, a year before Vecchi and Soden, that sea temperature differences would likely turn out to be more important than absolute temperatures in forming hurricanes. Using a little common sense, and noticing that the "experts" also predicted excess storminess under global cooling in the 70s, takes you a lot farther than randomness stats, dartboards or apes.

    One can be sure that all things are predicted to get worse under any scenario until you buy the miracle cure from the salesmen. The people to watch are those who are confident enough to say when they just don’t know something, because they are the intelligent ones who will one day figure it out.

  29. Mark Bahner Says:

    Hi Roger,

    You write, “The issue is the collective judgment of the panel, which is indistinguishable from the monkeys.”

    I don’t see why this is so. Suppose ALL seven experts assigned a probability of 1.0 to model 17, and zero to all other 19 models. And suppose that model 17 returned a value of 2.0 landfalls per year, which just happened to be the average value returned by all 20 models.

    Would you say in that case that because the end result was the same for the experts and the monkeys (or chimps), that the experts added nothing?

  30. Roger Pielke, Jr. Says:

    -29-Mark

    If the average of the expert judgments equals the average from the monkeys then, yes, the results are indistinguishable from a practical perspective even if the underlying PDFs can be shown to differ. The reason for this is that the PDFs are not used in any way, just the average.

    I have given an explanation of why, in general, the average results from experts and monkeys are not expected to differ.

  31. KevinUK Says:

    30 – Roger

    Isn’t this just another result of the application of the central limit theorem?

    28 – jasg

    “One can be sure that all things are predicted to get worse under any scenario until you buy the miracle cure from the salesmen.”

    I totally agree with you on this one. In the case of AGW, the snake (not crude, of course) oil salesmen are the Goracle, the IPCC and an assortment of cap and trade pushers, carbon offset indulgence sellers etc., all of whom stand to make a considerable amount of money out of 'scaring us to death'. While I mention that, may I wholeheartedly recommend Christopher Booker and Richard North's book 'Scared to Death' and of course (as well as CA) one of my favourite web sites, Numberwatch.co.uk.

    KevinUK

  32. Tony O'Hagan Says:

    Lucia’s (and apparently Roger’s, too) method of generating weights by normalising random numbers can be shown to be equivalent to Craig’s, i.e. both generate weights according to the symmetric Dirichlet(1) distribution (the Dirichlet with all parameters equal to 1). If we raise the random numbers to the power m before normalising (and I presume Lucia’s “*10” means raised to the power 10), then a little algebra shows that the distribution you are sampling from is the symmetric Dirichlet(1/m). The smaller the values of the parameters in the Dirichlet distribution, the more dispersed are the sampled values. So Lucia’s intuition is right (and I’m sure she already confirmed this empirically) – her opinionated monkeys do spread their weights much more widely.

    If we raise different weights to different powers before normalising we get asymmetric Dirichlet distributions. What a lot of different kinds of monkeys there are! And the family of Dirichlet monkeys is just a small subclass of the class of all monkey species. Hence my belief that any such choice is arbitrary.

    However, there is a way to get a more stringent test of randomness that takes account of how opinionated the various experts seem to be. This is by using the actual weights that they produced and then randomly (and independently) permuting the values in each column of the matrix. So we replace each expert by a monkey who randomly assigns the same set of weights as the real expert between the various models. We can call these expert-mimic monkeys. I have done this for the old 2007 data matrix and generated again a million random sets of expert-mimic monkeys. Before I give you the results, I should say that I don’t very much like this approach because it pretends that the expert can only ever use the same set of weights as he/she actually used in the elicitation. The resulting tests of randomness are actually a little *too* stringent. But let’s see what we get.

    Now the observed maximum average weight of 0.1274 and minimum of 0.0029 are not quite so rare. The former occurs in 17% of simulations and the latter in 6%. This is still indicative of non-randomness, but less strongly so. However, some other statistics are perhaps more powerful. The max and min only look at extremes and don’t allow for the spread between them. So I next considered the standard deviation of the average weights across the 20 models. The standard deviation actually found among the real experts was 0.0355, and this value was exceeded in only 1% of the million simulations. So again we have quite powerful evidence that the experts do discriminate between the models more than monkeys would.

    Now standard deviations are much more amenable to algebraic analysis than maxima and minima, so we can say that the reason why the actual experts had more variation in their average weights than monkeys is because there is correlation between the experts. And this correlation is on the whole positive. It’s simple to compute the correlation matrix between the actual experts, and although there are a sprinkling of negative correlations (showing that some pairs of experts do not appear to agree with one another), the great majority of correlations are indeed positive. In the million simulations, I can of course compute this correlation matrix, and several of the positive correlations are strongly significant. For one pair of experts, only just over one in a thousand sets of monkeys would produce a higher correlation. None of the negative correlations is significant in this formal sense but a couple come close.
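
    (A Python sketch of the "expert-mimic monkey" permutation test and the standard-deviation statistic described above. The 7x20 weight matrix is a made-up stand-in, since the real elicitation weights are proprietary; with placeholder data the p-value will usually be unremarkable, and the snippet only illustrates the mechanics of the test.)

    import numpy as np

    rng = np.random.default_rng(4)
    n_experts, n_models, n_sims = 7, 20, 10_000

    # Placeholder weight matrix (rows = experts, columns = models), each row summing to 1.
    expert_weights = rng.dirichlet(np.ones(n_models) * 0.5, size=n_experts)

    # Test statistic: standard deviation of the model-averaged weights.
    observed_sd = expert_weights.mean(axis=0).std()

    # Expert-mimic monkeys: each keeps its expert's set of weights but shuffles them across models.
    exceed = 0
    for _ in range(n_sims):
        shuffled = np.array([rng.permutation(row) for row in expert_weights])
        if shuffled.mean(axis=0).std() >= observed_sd:
            exceed += 1

    print("one-sided p-value:", exceed / n_sims)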

    So I continue to maintain that the experts are not just behaving like monkeys. They are doing exactly what one would hope them to do, and we see that while there is consensus between some there is apparent disagreement between others. They don’t just split into two camps, though, but show a range of partial agreements and intermediate positions.

    Now all of this will be dismissed by Roger, who continues to stress that in his opinion the only thing that matters is the final predicted landfalling rate. This really is going to be my last post on this site because I have other calls on my time, but I will make one last effort to get through to Roger. I'm not hopeful because Mark Bahner's perfectly sensible remarks didn't seem to have one iota of effect, but here goes.

    When we start the elicitation, we don’t know what the experts will say. It is clear that they are making scientific judgements on the basis of their knowledge, but we don’t know what the results will be. In particular, we don’t know if the final predicted landfalling rate will be the same (for practical purposes) as we’d get by using monkeys (or in the interests of reducing animal experimentation, simulated monkeys – of some Dirichlet species, for instance). In that situation, there is no doubt about what I would prefer to do. I’d do the expert elicitation because it is expected to provide a prediction based on science and expertise. I would not use simulated monkeys, even though it would be cheaper. The experts’ knowledge is something I would pay for, and RMS apparently thinks so too.

    Even if 9 times out of 10 the results were to be indistinguishable from a simulated monkey exercise, I’d still want to use the experts for that 1 time in 10 when they give a noticeably different prediction. To argue as Roger does, that the experts are just like monkeys because on one occasion they came up with a very similar final prediction, is just perverse. I’ve tried to say this before, and this is my last gasp effort. I have also countered Roger’s claim that “in general, … the average results between experts and monkeys will not be expected to be different.” If, as I fear, Roger will still not back down, then I give up on him. Roger, the last word is yours – I will stay out of this now.

  33. lucia Says:

    Tony–
    Even though you are gone, thanks for the discussion.

    I agree with you that with regard to the methodology, whether or not the distribution of weights given by the experts is random matters very much. That’s why I was hoping for the table of weights to fiddle with. (But your discussion is even better, as it means I don’t have to fiddle creating every type of monkey I can dream up.)

    Other things matter too. For example, it's worth knowing the range of forecasts possible based on the choices provided to the experts. For example, if, by some odd chance, every method RMS included in their set made nearly the same forecast, then the elicitation can't make much difference anyway. (Presumably, if all methods agreed, RMS wouldn't go to the expense of the elicitation because you know it won't make any difference.)

    That said, I think the range of possible predictions is better discussed in the paper than the issue of whether or not the experts' weights show any particular preference toward models.

    The other difficulty is more philosophical and arises from the circularity of the exercise. Here’s the difficulty:

    a) We know that reasonable researchers at RMS would choose 20 models all of which appear and are cited in the literature.
    b) We know the experts are, by definition, dominated by people who either create or use models and publish in the literature.
    c) By definition, if a particular method is preferred by “experts” that method and close variants of that method will appear more frequently in the literature.

    Given a-c is it much of a surprise that non-monkey scientists who are polled end up reproducing the average for all models?

    (FWIW: I think the elicitation still adds some value. After all, due to the nature of the peer review process, it sometimes happens that older models continue to be cited long after anyone thinks they are useful. Sometimes things are cited simply to explain how the newer models improve over older ones, or because authors know advocates of older models are likely to review a paper and will be unhappy if their model is not cited. Elicitation can at least downplay the contribution of those models to the forecast.)

  34. Mark Bahner Says:

    -30-Roger,

    You write, “If the average of the expert judgments equals the average from the monkeys then, yes, the results are indistinguishable from a practical perspective even if the underlying PDFs can be shown to differ. The reason for this is that the PDFs are not used in any way, just the average.”

    So your method for determining that the experts are adding value is…

    Oh, wait a second, I should have read Tony O’Hagan’s entire comments. He says, “To argue as Roger does, that the experts are just like monkeys because on one occasion they came up with a very similar final prediction, is just perverse.”

    Absolutely. Roger, is that what you’re really arguing? (I just want to check, because it’s hard to believe you’d make that argument.)

    Let's go further, and play with Tony O'Hagan's opinions a bit. He wrote, "Even if 9 times out of 10 the results were to be indistinguishable from a simulated monkey exercise, I'd still want to use the experts for that 1 time in 10 when they give a noticeably different prediction."

    I don’t think I’d go that far (depending on costs of soliciting the experts’ opinions, and benefits derived from their one different prediction).

    But Roger, suppose:

    1) 5 times out of 10 the experts returned predictions that were significantly different (higher or lower) than the monkeys (oops, chimps!),

    2) 9 times out of 10 the experts returned predictions that were significantly different (higher or lower) than the monkey-chimps?

    Would you still say the experts weren’t adding value?

    P.S. Tony O’Hagan wrote, “So I next considered the standard deviation of the average weights across the 20 models. The standard deviation actually found among the real experts was 0.0355, and this value was exceeded in only 1% of the million simulations. So again we have quite powerful evidence that the experts do discriminate between the models more than monkeys would.”

    Indeed.

    Does anyone disagree with that assessment? Lucia? Craig? Roger?

  35. Mark Bahner Says:

    Hi Lucia,

    "Other things matter too. For example, it's worth knowing the range of forecasts possible based on the choices provided to the experts. For example, if, by some odd chance, every method RMS included in their set made nearly the same forecast, then the elicitation can't make much difference anyway. (Presumably, if all methods agreed, RMS wouldn't go to the expense of the elicitation because you know it won't make any difference.)"

    A couple comments:

    1) Yes, it seems to me it’s extremely important to know the range of forecasts possible based on the choices of the experts. If the range is between 1.95 and 2.05 landfalls per year, what’s the difference?

    2) Regarding your parenthetical, “(Presumably, if all methods agreed, RMS wouldn’t go to the expense of the elicitation because you know it won’t make any difference.)”

    Well, I actually would be concerned if all methods agreed, especially if they agreed on a higher number than historical average. It would be a wonderfully clever way to “game” the system…RMS solicits the experts, knowing that their answer will come back in a way that’s favorable to RMS. The cost of the solicitation is probably small compared to the increased revenue and profit from a general perception that risks will be higher than historical averages.

    I do agree with Roger in that I think he’s “spot on” (as the British would say) about RMS even being in the business of publishing predictions. It’s inherently problematic; it’s inherently open to the appearance of conflict of interest.

  36. PaulM Says:

    Who were these 7 experts?
    And how were they chosen?
    Did they include Christopher Landsea, the hurricane expert with dozens of papers on hurricanes over 20 years, who resigned from the IPCC saying
    “All previous and current research in the area of hurricane variability has shown no reliable, long-term trend up in the frequency or intensity of tropical cyclones, either in the Atlantic or any other basin. … It is beyond me why my colleagues would utilize the media to push an unsupported agenda that recent hurricane activity has been due to global warming” ?

  37. SteveF Says:

    An interesting exchange. The discussion of (statistical) types of monkeys is also quite entertaining.

    I am sorry that Tony O’Hagan is gone, since his contribution was really positive and informative, once he got over his (apparent) anger at Roger’s comments; I fear Roger will not be hired again as an expert by RMS. Certainly Tony is correct that a panel of experts is better able to evaluate the technical merits of the models used to predict the number of hurricane landfalls than are an equal number of monkeys, even Lucia’s highly opinionated monkeys, although I can understand Lucia’s approach, since the majority of experts I have met seem to have very little skill and very strong opinions.

    But I think this all really misses the point: do the MODELS, however originally selected by RMS and no matter how they were weighted by experts, really have any significant skill at predicting hurricane landfalls?

    Perhaps if Tony (or RMS) could release a list of the models’ predicted landfall rates, without revealing the identities of the models or the experts’ weightings of those models, we could at least see if the range of possible predictions is reasonable. Assuming that all of the models are technically serious (i.e., not based on how the New York Mets play between April 1 and June 1), then does the range of predicted landfalls include the known average landfall rate for the last 50 years and the average number of landfalls for 5 year periods over the last 50 years? If most or all the predictions are above or below the historical rates, then it seems quite possible, if not likely, that the expert-weighted combination of the models will overstate or understate the risk of future landfalls, since the historical landfall rate over 50 years does not appear to have any statistically significant trend.

    After all, garbage in, garbage out.

  38. Roger Pielke, Jr. Says:

    -Mark-

    Prof O’Hagan is correct about many things, among them:

    1. The experts produce a PDF entirely distinguishable from that of monkeys

    2. In theory the process would allow a result from the experts different from that of monkeys

    However, we disagree about a few things:

    3. He writes: “Even if 9 times out of 10 the results were to be indistinguishable from a simulated monkey exercise, I’d still want to use the experts for that 1 time in 10 when they give a noticeably different prediction.”

    The problem is that every time the elicitation has been done in this manner it has resulted in average landfall rates indistinguishable from those of monkeys. If the elicitation failed to distinguish its results from those of monkeys 9 times out of 10 (i.e., a 90% chance of being no different from random), then yes, I think that would be problematic. So too, I think, would many people who rely on RMS.

    4. He writes: “[in Roger's] opinion the only thing that matters is the final predicted landfalling rate”

    Well, no. I think a lot more matters, but as far as I am aware RMS uses only the final predicted landfall rate as input to its model. So while there are many things of an academic sort that could be said about the (potential) value of the elicitation, my focus has been narrowly on its practical application by RMS.

    5. He writes: “To argue as Roger does, that the experts are just like monkeys because on one occasion they came up with a very similar final prediction, is just perverse.”

    This is not what I’ve argued. (Just look at me and you can see I am no monkey ;-) What I have argued is that the landfall rate produced by the elicitation of experts has been in every instance (not “one”) indistinguishable (not “similar”) from a landfall rate provided by monkeys. Perhaps there will be exceptions to this in the future. (A toy simulation of this comparison is sketched at the end of this comment.)

    Do 3, 4, and 5 above make me raise an eyebrow about the process? You bet.

    Thanks all for the exchange.
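
    A minimal Python sketch of the “monkey” comparison being argued about here, assuming made-up model predictions, a made-up elicited value, and simple uniform random weights (the real exercise involved more structure than this):

        # Draw random weights over the models many times and see where an
        # expert-weighted average would fall within that distribution.
        # The model predictions and the expert average are placeholders.
        import random

        model_rates = [0.68, 0.72, 0.75, 0.79, 0.84]   # hypothetical landfall rates/year
        expert_average = 0.76                           # placeholder elicited result
        trials = 100_000

        def monkey_average(rates):
            """One 'monkey' trial: random weights, normalized to sum to 1."""
            w = [random.random() for _ in rates]
            s = sum(w)
            return sum(wi / s * r for wi, r in zip(w, rates))

        sims = sorted(monkey_average(model_rates) for _ in range(trials))
        lo = sims[int(0.025 * trials)]
        hi = sims[int(0.975 * trials)]
        inside = lo <= expert_average <= hi
        print(f"95% of monkey averages fall in [{lo:.3f}, {hi:.3f}]")
        print("Expert average is", "indistinguishable from" if inside else "outside", "that band.")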

  39. Roger Pielke, Jr. Says:

    -37-SteveF

    The models used in 2006 and 2007 (a subset of those used in 2008) can be seen here:

    http://www.rms-research.com/references/Jewson.pdf

    No, I don’t imagine I’ll be invited back ;-)

  40. lucia Says:

    If they base forecasts on elicitations and not one has ever come up with a result different from what we would get if the panel consisted of monkeys, then that is worth noticing. Only after noticing can anyone investigate why this might occur. As I commented above, because the field of “hurricane studies” consists of a relatively small number of people and the experts are, to some extent, the originators or promoters of the models discussed in the literature, there is some circularity in the entire process.

    In broader fields, where you can separate those who develop models from those weighting them, elicitation would probably be very useful. But in a small field?

    What if the entire field consists of 20 experts who, between them, proposed the 20 models? The 20 experts might even be subdivided into “teams”, as in “William Gray’s students” vs. “The Florida Crowd” and “Unaffiliated”. Then RMS picks 7 experts out of the 20 and each just casts partisan votes. Will this look different than if 20 models exist and RMS gets to pick 7 experts out of 10,000 existing experts?

    I don’t know the answer to this. But it seems to me that this sort of thing could affect the value of elicitation. In particular, it’s possible that in small fields one will tend to see forecasts based on expert elicitation over models that cannot be differentiated from forecasts based on elicitations of monkeys. (A toy simulation along these lines is sketched at the end of this comment.)

    Mark Bahner–
    I agree that Tony has shown the weights assigned by the experts could not have been cast by 7 monkeys.
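
    A toy Python simulation of the small-field thought experiment above, with made-up model rates and made-up team behavior; it only illustrates that which 7 experts get picked matters far more when the field is small and partisan than when the pool is large:

        # 7 partisan experts drawn from a field of 20 (three "teams", each
        # favoring its own models) versus 7 experts drawn from a huge pool who
        # weight models independently. Everything here is a placeholder.
        import random

        model_rates = [0.65 + 0.01 * i for i in range(20)]   # 20 hypothetical models
        teams = {"Gray students": range(0, 7),
                 "Florida crowd": range(7, 14),
                 "Unaffiliated": range(14, 20)}

        def partisan_expert():
            """An expert who puts nearly all weight on his or her own team's models."""
            team = random.choice(list(teams))
            w = [5.0 if i in teams[team] else 0.1 for i in range(20)]
            s = sum(w)
            return [x / s for x in w]

        def independent_expert():
            """An expert from a large pool who weights models without team loyalty."""
            w = [random.random() for _ in range(20)]
            s = sum(w)
            return [x / s for x in w]

        def panel_average(expert_fn, n_experts=7):
            panels = [expert_fn() for _ in range(n_experts)]
            avg_w = [sum(p[i] for p in panels) / n_experts for i in range(20)]
            return sum(w * r for w, r in zip(avg_w, model_rates))

        small_field = [panel_average(partisan_expert) for _ in range(5000)]
        big_field = [panel_average(independent_expert) for _ in range(5000)]
        spread = lambda xs: max(xs) - min(xs)
        print(f"Spread of panel averages, partisan small field: {spread(small_field):.3f}")
        print(f"Spread of panel averages, large independent field: {spread(big_field):.3f}")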

  41. KevinUK Says:

    37 – SteveF

    “I fear Roger will not be hired again as an expert by RMS. ”

    Thanks to Lucia’s sterling work I can now confirm that I have been given the gig! Cheers Lucia, it’s always good to take full credit for other people’s work, as Gavin (“I’m not interested in the Super Bowl”) Schmidt knows.

    Roger didn’t stand a chance this year as, sadly, he is an expert and not an opinionated monkey like me. Once I made the business case to RMS, it became a ‘no-brainer’ for them. After all, didn’t you all know that there is a credit crunch on? Why pay good cash to be told what you want to hear when peanuts will do? Sorry, Roger!

    KevinUK

  42. SteveF Says:

    39 – Roger:

    Thanks for the reference. It looks to me like a self-serving hodge-podge produced by RMS staffers, which is not so surprising. Every model prediction is above the long-term average, and every model prediction is above the medium-term average… There is no possibility the expert-weighted average prediction could be for anything except substantial increases in landfalls. Could a desire for profits at RMS somehow be involved in this research? I hope their customers call to complain 5 years from now about the accuracy of these predictions.

    Of course, it won’t be much of a conversation if RMS cuts costs and a monkey answers the phone.

  43. ¿Se puede distinguir entre un experto en huracanes y un mono? « PlazaMoyua.org Says:

    [...] My doubt comes from a computationally intensive calculation inspired by an old discussion that began with Inexpert Elicitation by RMS on Hurricanes, by Roger Pielke, and was continued in Tony O’Hagan Responds (Not on Behalf of RMS). [...]

  44. Mark Bahner Says:

    Hi Roger,

    It seems to me this whole exchange has been somewhat unfortunate; both you and Dr. O’Hagan strike me as honest men making honest points.

    Regarding your points:

    1) You question the value of the expert elicitation if 9 times out of 10 it produced a prediction indistinguishable from monkeys. You write that makes you “raise an eyebrow.” Well, I agree with that. There are only a few instances where I would say that it *would* be important to get that 1-out-of-10 difference. For example, suppose the question were whether a 1 km asteroid will strike the Earth: if the monkeys said every time that the probability is less than 1 in 10,000, while the experts agreed 9 times out of 10 but on the tenth occasion thought the probability was greater than 50/50, I’d definitely want to get the experts’ opinions all 10 times.

    2) You write, “What I have argued is that the landfall rate produced by the elicitation of experts has been in every instance (not “one”) indistinguishable (not “similar”) from a landfall rate provided by monkeys.”

    This is where I’m a bit handicapped by not caring very much about this whole subject. ;-) Exactly how many elicitations have there been? Two? Three? Ten? Scores? Hundreds? (I’m guessing not “hundreds”. ;-) )

    3) You write, “Perhaps there will be exceptions to this in the future.”

    Well, there’s a hugely important question. So it seems to me that Lucia’s question about the width of the distribution of model predictions is important. If the distribution of model predictions is very narrow, then one wouldn’t expect there ever to be a significant difference. (A small numerical sketch of that point follows this comment.) And if there have been scores or hundreds of previous elicitations, and every one has returned a predicted landfall number that is essentially identical to what monkeys would do, it would be reasonable not to expect future exceptions, under the Albert Einstein insanity definition. (I realize the previous sentence may contain an error in logic, but I expect geek points for invoking “the Albert Einstein insanity definition.” ;-) )

    Best wishes,
    Mark
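
    A tiny Python illustration of the spread point, with made-up rates: any weighted average, expert or monkey, is stuck inside the range spanned by the individual models, so a narrow model spread caps how different the two can ever look.

        # No weighting scheme can move the combined prediction outside the range
        # of the individual models. Both sets of rates below are placeholders.
        narrow = [0.74, 0.75, 0.76, 0.77]
        wide = [0.40, 0.60, 0.90, 1.20]

        for label, rates in [("narrow", narrow), ("wide", wide)]:
            print(f"{label} model set: any weighted average lies in "
                  f"[{min(rates):.2f}, {max(rates):.2f}] "
                  f"(max possible disagreement {max(rates) - min(rates):.2f})")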

  45. An Elicitation of Expert Monkeys: (Hurricane related.) | The Blackboard Says:

    [...] My uncertainty about my inability to distinguish hurricane experts from monkeys is the fruit of a computationally intensive calculation inspired by a long conversation that began with Inexpert Elicitation by RMS on Hurricanes, by Roger Pielke Jr. and continued in Tony O’Hagan Responds (Not on Behalf of RMS). [...]