Talk:Statistics
WikiProject Mathematics: Unassessed, Top-priority.
Statistics was a good article, but it was removed from the list as it no longer met the good article criteria at the time. There are suggestions below for improving the article. If you can improve it, please do; it may then be renominated.
This page is for discussion of the article about statistics. Comments and questions about the special page about Wikipedia site statistics (number of pages, edits, etc.) should be directed to Wikipedia talk:Special pages.
Miscellaneous
I was taught statistics starting with the definition "a statistic is a function of data" as the first sentence of the Part 1B Stats course at Cambridge. I think the definition was useful, and so it should be included. BozMo (talk). Done
On special:Statistics, what are 'junk pages'?
They seem to equal total pages minus (non-talk comma pages + talk pages).
How many of these are #REDIRECTs?
--Damian Yerrick
Why is the Main Page article counter different than the one in Statistics? --Chuck Smith
It's been some number of years since I studied statistics, but the terms used throughout the article did ring some bells very quietly in the back of my mind. The singular exception was ANOVA, so I followed the link to seek an explanation: Analysis of variance. That was familiar! I was then surprised by the number of hits that Google gave me about ANOVA (197,000). Still, I believe that the full expression is far more meaningful than the acronym, and I don't think that we should be encouraging the use of cute but meaningless acronyms. Eclecticology, Thursday, May 2, 2002
The three topics of statistics -- experimental design, description/exploration and inference -- are excellently described. The ongoing discussion between data miners and modellers (eg. Statistical Modeling: The Two Cultures, Leo Breiman and discussants, Statistical Science 2001;16:199-231) might deserve some more attention. Johannes Hüsing
I wonder if we can improve on the phrase "uncertain observations"? It's not the observations that are uncertain; it's what they entail about the population from which they came, the uncertainty resulting from the random way in which the observations came from the population. Michael Hardy 20:00 17 Jul 2003 (UTC)
- Well, unless you're talking about measurement error, in which case the observations are uncertain. Anyway, I agree that the article needs a major rewrite. Oh, I guess that's not what you said... - dcljr 00:15, 9 Aug 2004 (UTC)
- Even with measurement error, it's not the observations that are uncertain. You know what number your measuring instrument gave you; what you're uncertain about is what it should have given you. Michael Hardy 01:09, 9 Aug 2004 (UTC)
- Hmm. A subtle distinction, indeed. But whatever. As a statistician yourself, surely you can provide us with a better introductory paragraph than the current version.... (See also item "What is statistics?" below.) - dcljr 05:46, 10 Aug 2004 (UTC)
Suggest update to US National Statistical Services to FedStats
Under "National Statistical Services", it appears that for each country, that country's main national statistics site is listed, except for the United States. For the US, the American Statistical Association is listed, which is primarily a professional association for statisticians. I would suggest that the FedStats web site, http://www.fedstats.gov, be listed as the web link for the US. The FedStats web site is the US government's gateway portal to its underlying Federal statistical system, with links to more than 100 agencies with statistical information.
Puzzled by definition
Why is human knowledge part of the definition -- is it really necessary?CSTAR 03:26, 10 May 2004 (UTC)
I wouldn't call it a science either. — Miguel 06:28, 2004 May 10 (UTC)
Why not? cf Nelder JA (1999). From statistics to statistical science. The Statistician 48(2), 257-269. Johannes
What is statistics?
I don't like the introductory paragraph. I haven't come up with anything better, but here's a "definition of statistics" I used when I taught the subject to undergraduates:
- [Statistics] is a logic and methodology for the measurement of uncertainty and for an examination of the consequences of that uncertainty in the planning and interpretation of experimentation or observation.
- — Stephen M. Stigler, The History of Statistics (Belknap/Harvard, 1986)
Of course, I followed it with a lot of explanation...
I propose interested parties list their own preferred definition of statistics (serious ones, I mean) here and maybe we can come up with a consensus on the best one. (And then monkeys... well, nevermind.)
- dcljr 05:46, 10 Aug 2004 (UTC)
- For me, statistics is a methodology for the collection, interpretation and presentation of information - I don't feel strongly about the words "methodology" or "information", but I don't like "uncertainty" in the primary definition. You can have statistics on the numbers of Olympic Gold Medal winners so far; they may be right or wrong, but I have yet to see anyone put error bands on them. To me "uncertainty" is part of the collection, interpretation and presentation in many cases, but not always a necessary part. --Henrygb 23:39, 12 Aug 2004 (UTC)
- Your discomfort with the word uncertainty seems to stem from the difference between descriptive statistics (your definition) and inferential statistics ("mine"). (continued below)
- In descriptive stats, you usually just take the data as given; whether it's the whole population or just a sample, you can summarize it graphically and numerically in much the same ways. My background is mathematical statistics, so I usually don't even think of the descriptive side when I think statistics. It's my own bias. Anyway, we should try to address both aspects. - dcljr 22:55, 31 Aug 2004 (UTC)
I came to statistics through management science, the applied branch of operations research, and econometrics, an applied branch of mathematical statistics, with a big dose of John Tukey's pragmatism. I wound up with a perspective that some find unusual. For one thing, management science gave me a decision-theoretic outlook. Part of that is reserving the word "uncertain" for situations that lack probability distributions. Data are raw materials; there's no information until you interpret descriptive or inferential statistics. I'm not sure what level to shoot for here, but here goes. I've done things like this with more examples and less technical material, but that takes more time or space, and I wanted to be brief.
Before you get to description, you have to know about the population the data represent (if any; most online polls, for example, represent no one except those who happened to participate). That includes some sampling theory. Then there's data entry and preparation, including quality checks, etc.
Assuming the data are numeric rather than categorical (counts of people belonging to various political parties, for example), the biggest challenge in description is to get people to pay attention to more than the median or mean. Box plots (aka box-and-whisker diagrams or plots) are critical for understanding data whose center is taken to be the median. The standard deviation is critical if you're assuming the normal distribution (I like to call it Gaussian, but that's a small point) and using the mean, etc. Otherwise, you're trapped into the talking-head focus on a single number that conveys very little useful information.
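The five-number summary behind a box plot is quick to compute; a minimal sketch with made-up numbers (not data from this discussion):

```python
# Five-number summary behind a box plot, for a hypothetical data set.
import statistics

data = [12, 15, 15, 16, 18, 22, 25, 29, 40]
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
print(min(data), q1, q2, q3, max(data))       # the five numbers a box plot shows
```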
Once I get past description, statistics is about figuring out how much risk you are willing to take. Sometimes that's a guesstimate (choosing between pizza places in a town you've never visited before), sometimes it's as precise as you can make it (choosing the person who will perform open heart surgery on a loved one or yourself). In formal inference, that value is alpha and the decision about whether to reject the applicable null hypothesis comes down to whether the estimated risk that rejecting the null is a Type-I error (the p-value) is larger or smaller than the risk you are willing to take. If p>alpha, there is too much risk of a Type-I error to reject the null given your ex-ante choice of alpha. If alpha>=p, the risk of a Type I error is small enough (according to your ex-ante choice) to reject the null.
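The reject/fail-to-reject rule described in the paragraph above reduces to a one-line comparison; here is an illustrative sketch (the function name and the numbers are hypothetical):

```python
# Decision rule: reject the null when the p-value does not exceed the
# risk (alpha) chosen before seeing the data.
def decide(p_value, alpha):
    if p_value <= alpha:
        return "reject null"          # Type I error risk is acceptable
    return "fail to reject null"      # too much risk of a Type I error

print(decide(0.03, 0.05))  # reject null
print(decide(0.03, 0.01))  # fail to reject null
```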
A single paragraph along those lines might be something like:
"Statistics is the art and science of seeking to understand a population and predict its future by collecting and using data that represent the population. Data collection includes sampling, data entry, and checking. Using data in statistics has two parts. Descriptive statistics includes estimates of most likely data values, their variation, and graphs. Inferential statistics looks for associations and causal relationships between variables that help to explain observed and predict future values."
That doesn't say anything about data mining, an approach that was taboo in my econometric youth. I haven't kept up with the subject, though, so I'm in no position to say anything about it here. If it's an outgrowth of resampling theory, for example, I'd be sympathetic even though that probably puts me outside mainstream econometrics, but I don't know enough to comment one way or another. --George Brower
Ah, now this paragraph (George's above) is, I think, mainly coming from a practical perspective of statistics as a set of procedures and "best practices" (i.e., what I would call applied statistics). (No offense, oversimplifying your viewpoint like that...) I come at statistics from a more theoretical standpoint (much to the chagrin of my students), emphasizing why those practices work and (ultimately, like in grad school) how to assess their efficacy and develop new and better ones. But my perspective is probably more suited to the mathematical statistics article (part of the reason I created it in the first place — in time I hope it will grow into something "useful").
I accept that this article should remain almost entirely "applied". At the very least we should allude to the following in the first paragraph:
- data collection (sampling, etc.)
- data summary (descriptive stats)
- data interpretation (inference, relationship)
A more detailed outline, which might be the basis of constructing the opening paragraphs (i.e., preferably above the table of contents):
- basics
- population
- sample
- randomness (uncertainty) and probability (frequentist/subjectivist viewpoints should probably be alluded to but not explained in any detail)
- focus
- applied statistics (description, inference, modeling)
- theoretical (math stat)
- data collection
- sampling
- experimental design
- data summary: descriptive statistics
- graphical
- numerical
- data interpretation: inferential statistics
- estimation
- prediction
- hypothesis testing
- relationships and modeling
- correlation
- regression/ANOVA
- time series
- data mining? (I don't know much about it either!)
Obviously, and not surprisingly given my previous admissions, this reads like a course syllabus. But it does stress what you can actually do with statistics. If we could somehow pack all that information (if only obliquely, and certainly not necessarily in that order) into the opening paragraphs without hopelessly confusing everyone, that would be great!
Subsequent sections can flesh out what it all means and point to "main articles" about each topic for more detail. (Still, obviously I'm envisioning a much lengthier article!)
I think we should also mention above the table of contents the use of "statistics" or "stats" as a synonym for "data" and why that's not quite right.
These are my thoughts at the moment, anyway...
- dcljr 22:55, 31 Aug 2004 (UTC)
My attempt at article lead section
I just discovered the term lead section for what I've been variously calling preamble, intro[duction], introductory paragraphs, and stuff above the table of contents. <g>
Anyway, I'm sure some people thought it would be impossible to include all that stuff (see my previous comment) in the lead, but here's my attempt. I got almost everything in there.
- Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from sample data. It is applicable to a wide variety of academic disciplines from the physical and social sciences to the humanities, as well as to business, government and industry.
- Once data is collected, either through a formal sampling procedure or by recording responses to treatments in an experimental setting (cf experimental design), or by repeatedly observing a process over time (time series), graphical and numerical summaries may be obtained using descriptive statistics.
- Randomness and uncertainty in the observations is modeled using probability in order ultimately to draw inferences about the larger population. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).
- The framework described above is sometimes referred to as applied statistics. In contrast, mathematical statistics (or simply statistical theory) is the subdiscipline of applied mathematics which uses probability theory and analysis to place statistical practice on a firm theoretical basis.
- The word statistics (or stats) is also used colloquially to refer to data collected on an entire population rather than a subset of it. Formally, however, statistics is almost always based on samples. In fact, the word statistic (singular) may be defined as a quantity calculated from sample observations.
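The last sentence of the draft, that a statistic (singular) is a quantity calculated from sample observations, is easy to illustrate (the sample values below are made up):

```python
# A "statistic" in the formal sense: a function of sample data.
import statistics

sample = [2.3, 1.9, 2.8, 2.1, 2.6]    # hypothetical sample observations
xbar = statistics.mean(sample)        # the sample mean is a statistic
s = statistics.stdev(sample)          # so is the sample standard deviation
print(xbar, s)
```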
I found that I just couldn't find a good way to stick in the frequentist/subjectivist thing. My concern about that was mainly to point out the difference between "classical" and "bayesian" approaches. Perhaps another short "non-sequitur" paragraph could deal with that. Also, I didn't say anything about ANOVA (which is closely related to hypothesis testing, regression and experimental design, so I didn't feel too bad about not mentioning it by name) or data mining (maybe it just doesn't belong in the lead). Oh, and not all the links lead to useful articles at this point. (continued below)
- I think there is no need to mention the frequentist/subjectivist split in an article on statistics. As far as "best practices" go, you can use whatever philosophy you like, or none at all, to come up with good statistical practice. In mathematical statistics, everyone must agree that, as mathematical theorems, frequentist and bayesian theorems are all "true". Finally, for a while I have held the opinion that frequentism as a philosophy of probability stems from the erroneous identification of the definition of probability on the one hand, and the measurement of a probability on the other hand. Whatever the meaning one ascribes to the word "probability", there is essentially only one way to determine it empirically, and that is to observe a large random sample and make inferences about it using statistics. — Miguel 07:53, 2 Sep 2004 (UTC)
- But the two probability interpretations do lead to (almost) completely different approaches to inference. It probably should be mentioned somewhere, just not in the lead. BTW, despite being educated almost entirely from the frequentist perspective, I'm always a little uncomfortable when relative frequency is presented in textbooks as the "definition" of probability. (IOW, I agree with you.) - dcljr 19:31, 2 Sep 2004 (UTC)
Comments? Suggestions? (...I ask with much trepidation) - dcljr 20:49, 1 Sep 2004 (UTC)
- Well it's better than what's there now. The reference to human knowledge in the first sentence of the current article is weird (I can't decide whether it's redundant or just wrong). Your additions will be the object of further modifications, but I suggest you blow away the current lead section.CSTAR 23:44, 1 Sep 2004 (UTC)
- Okay, I'll leave it here for a few days so others can comment. If there are no strong objections, I'll move it to the article. - dcljr 19:31, 2 Sep 2004 (UTC)
- Be bold in updating pages — Miguel 17:33, 3 Sep 2004 (UTC)
- In my opinion: I am happy with your first paragraph except for the word "sample"; the rest of your paragraphs should be in the contents; statistics is not "formally" about samples; nor is your distinction between mathematical statistics and applied statistics particularly clear. --Henrygb 01:04, 5 Sep 2004 (UTC)
- Is this a Bayesian/frequentist (/decision theory) thing? As I recall, all the classes I've taken and all (?) the textbooks I've seen talk about the subject in terms of samples — both applied and theoretical approaches. I guess I still don't understand what alternative you're proposing. (If not "uncertainty", if not "samples", then what?? Hmm... Are you the person who added the note about decision theory in the opening paragraph?) And when you say "formally", how formal are we talking? "Let X1, X2, ..., Xn be a random sample" formal? "Let X be a random vector with covariance matrix T" formal? "Let X be absolutely continuous with respect to Lebesgue measure μ" formal? Anyway, as I've already mentioned, I don't think this should be an article about statistical theory. Speaking of which, that's what I mean by mathematical statistics: the theory as opposed to the applications (applied = what you do with statistics; theory = why it works). I'm not sure how I could make that paragraph more clear. Suggestions? - dcljr 18:41, 7 Sep 2004 (UTC)
- No. I mean things both like "the population of the United Kingdom is about 59.5 million", and like "the difference between the mean and the median is less than or equal to one standard deviation", neither of which have anything to do with samples, but are about data. Statistics covers both of these, as well as sampling. --Henrygb 00:44, 11 Sep 2004 (UTC)
I'm responding to Henrygb's last comment above (at 00:44, 11 Sep 2004), but the indentation is getting a bit extreme, so it's back to the left margin... Okay. Your examples actually wouldn't (necessarily) be covered by the term "statistics" in my book (especially in an article that's trying to explain what statistics is, as opposed to other, similar disciplines/practices):
- "the population of the United Kingdom is about 59.5 million"
This figure is a "statistic" only in the colloquial sense of the word. It's presumably based on a census. That's not statistics (as in, "I have a degree in statistics"). In fact, you may be familiar with the controversy over using statistical methods in the U.S. census (see the Census article). It's not allowed under most people's interpretation of the relevant clause in the Constitution. (This only serves to illustrate the difference in the concepts; I'm not saying it's an airtight argument.) One could argue that graphical and numerical summaries of populations fall under the term "descriptive statistics", but no one objects to the use of those techniques to interpret census data. My point is, when the word "statistics" is used by statisticians (or by someone teaching the subject, etc.) it almost always means "inferential statistics", which uses information about a sample to infer something about a larger population. Of course, confusing the whole issue is the use of the word "statistics" by governments to refer to census data and summaries thereof (e.g., "Statistical Abstract of the United States" or the "Bureau of Labor Statistics"). The difference here is akin to the difference between the colloquial use of the term geography to refer to the "lay of the land" of an area, and the academic subject of geography, which studies many other things. In any case, the issue(s) you raise (and I've discussed) here should certainly not be ignored, but should be dealt with directly in the article.
- "the difference between the mean and the median is less than or equal to one standard deviation"
That statement can be made in probability; you don't need statistics at all for that one. Certainly statistics relies heavily on probability, but they are different fields (just as engineering and physics are very different fields, even though the former relies heavily on the concepts and methods of the latter). This is why a great many Wikipedia articles start out, "In probability and statistics..." and not just "In statistics...." I don't want to offend you, Henrygb, but may I ask what your academic background is, especially as it relates to statistics? As you can see above, at first I thought your objections were based on a philosophical difference among statisticians (Bayesians, etc.), then I thought maybe you were objecting at a deep mathematical/theoretical level. I'd like to know what exactly you're basing your views on. - dcljr 05:17, 13 Sep 2004 (UTC)
- A strange request, but I'll play. I have a mathematics degree from the University of Cambridge having concentrated on what was called "applicable mathematics" (i.e. numerical analysis, probability, statistics, mathematical economics, coding theory etc.). I am now a member of the (British) Government Statistical Service. Your turn.
I am saying statistics is about data and its handling, presentation and use for drawing inferences, and that the use of samples is only one part of that. What you describe as the "colloquial sense of the word" (which presumably also refers to topics like baseball statistics) is not only the origin of statistics but one of its major contemporary meanings. While random variables and distributions in probability have descriptive statistics, so too do data sets which are not random. Indeed I would suggest that what you think of as statistics is much more probability based than the broader concept I am considering. Look at the list of statistical topics and my guess is that the majority of the articles do not mention sampling. --Henrygb 00:13, 14 Sep 2004 (UTC)
- So... when you're doing inference and not using sampling, then you must be using either Bayesian analysis or some decision-theoretic approach, right? Not classical inference (t-test, ANOVA...). Anyway, nevermind. I give up. If others want to weigh in on this subject, please do. Henrygb, at my User page you can see both my statistics credentials (User:dcljr) and my (latest) revised lead section (User:dcljr/Statistics#Preamble —
I know you won't agree with one sentence in there). I haven't done anything to the article yet because I'd like to flesh out a little more of the main article text to complement the extensive lead section I'm proposing. Then others can have at it. - dcljr 06:15, 21 Sep 2004 (UTC)
- I removed the offending statement from my lead section draft in my last edit. - dcljr 06:36, 21 Sep 2004 (UTC)
Probability
I can't make heads or tails from this paragraph:
- However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10^−4 and a probability of 10^−9, despite the very practical difference between them. If you expect to cross the road about 10^5 or 10^6 times in your life, then reducing your risk of being run over per road crossing to 10^−9 will make you safe for your whole life, while a risk per road crossing of 10^−4 will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.
What is meant by 10-4 or 10-9? Is that meant to be scientific notation (ten to the -4th and 10 to the -9th)?
The example makes little sense either. Why 105 or 106 road crossings and not 100, say. And I don't think reducing the risk to 10-9 means it will make you safe for your whole life, rather than that it will be very unlikely that you will be run over.
Unfortunately, the only statistics I learnt was in high school, so I'm not certain how to improve this article myself.
--Martin Wisse 06:51, 2 Nov 2004 (UTC)
- You are ignoring (or not seeing) the superscripts. 10^−4 does indeed mean 10 to the power of −4, i.e. 0.0001 or a 1 in 10000 chance. --Henrygb 21:06, 29 Nov 2004 (UTC)
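The arithmetic in the quoted paragraph can be checked directly; assuming independent crossings (a simplification), the lifetime risks work out as claimed:

```python
# Probability of at least one accident over ~10^5 road crossings,
# assuming independent crossings with the quoted per-crossing risks.
crossings = 10 ** 5
lifetime_risk = {}
for per_crossing in (1e-4, 1e-9):
    lifetime_risk[per_crossing] = 1 - (1 - per_crossing) ** crossings
    print(per_crossing, lifetime_risk[per_crossing])
# With 10^-4 per crossing an accident is almost certain (~0.99995);
# with 10^-9 the lifetime risk stays around 10^-4.
```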
Could someone who feels half way competent to do so put some pointers to the philosophical foundations of probability and statistics. Statistical reasoning has always fascinated and amazed me with some breathtaking inferences, and it would be nice to know if there is a way into this stuff. --Publunch 18:08, 22 Dec 2004 (UTC)
Probabilities in Bayesian statistics
The following puzzles me
- Use of prior probabilities of 0 (or 1) causes problems in Bayesian statistics, since the posterior distribution is then forced to be 0 (or 1) as well. In other words, the data is not taken into account at all! As Lindley puts it, if a coherent Bayesian attaches a prior probability of zero to the hypothesis that the Moon is made of green cheese, then even whole armies of astronauts coming back bearing green cheese cannot convince him. Lindley advocates (…)
I haven't read Lindley's book, but I am a statistician and Bayesian statistics is my area, and I have no idea what the above is supposed to mean. As it stands it is just nonsense to me.
To keep it simple, let's assume a linear model and a normal (Gaussian) distribution. In this case, the posterior distribution is a weighted average of the prior distribution and the distribution of the observations. Before any observations are gathered, the posterior distribution is identical to the prior distribution. As more and more observations arrive, the posterior distribution will converge to the distribution of the observations. Infinitely many observations would result in a posterior distribution identical to the distribution of the observations, with no weight on the prior distribution at all. No matter what the prior distribution is, it will count less and less as more observations are taken into account. In particular, if we use a degenerate prior distribution with infinite variance (and zero density everywhere), the Bayesian approach gives the same result as a “frequentist” approach. The reason is that the prior distribution has zero density and contributes nothing in the weighted average of the prior distribution and the distribution of the observations, giving a posterior distribution always identical to the distribution of the observations. Anyway, I have no idea why probabilities of 0 or 1 should cause trouble in Bayesian statistics. –Peter J. Acklam 22:36, 18 Jan 2005 (UTC)
- I think you've misunderstood. The statement that if the prior probability of a proposition is 0 or 1, then so is the posterior, is correct; it's trivial mathematics. You're being really vague about your proposed model. You wrote:
- let's assume a linear model and a normal (Gaussian) distribution. In this case, the posterior distribution is
- Posterior distribution of what?? Often one talks about a N(μ, σ²) distribution of some quantity to be observed--call that X, and one speaks of prior and posterior distributions of μ (or of μ and σ, but let's keep it simple, and while we're at it assume σ = 1). That's the conditional distribution of X given μ. OK, simple case: the prior says that μ = 1 or 2, each with probability 1/2. Now keep repeating the experiment. The observations of i.i.d. copies of X are conditionally independent given μ. If μ is really equal to 1, then the posterior will, with probability 1, converge to a probability distribution that assigns probability 1 to μ = 1. The posterior distribution will not "converge to the distribution of the observations", since those will be normally distributed! Michael Hardy 02:50, 19 Jan 2005 (UTC)
- a weighted average of the prior distribution and the distribution of the observations. Before any observations are gathered, the posterior distribution is identical to the prior distribution. As more and more observations arrive, the posterior distribution will converge to the distribution of the observations.
- I wrote that way too late yesterday. What I had in mind was a case where you are estimating μ or σ². The posterior distribution does not, as you point out, converge to the distribution of the observations, but to a distribution based on the information in the observations. Anyway, never mind. –Peter J. Acklam 08:21, 19 Jan 2005 (UTC)
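For the conjugate normal case with known data variance, the "weighted average" intuition discussed above does hold for the posterior mean; a minimal sketch (all numbers hypothetical):

```python
# Normal prior, normal data with known variance: the posterior mean is a
# precision-weighted average of the prior mean and the sample mean, and
# the prior's weight shrinks as the sample size n grows.
def posterior_mean(prior_mean, prior_var, sample_mean, data_var, n):
    w_prior = 1 / prior_var      # precision of the prior
    w_data = n / data_var        # precision of the sample mean
    return (w_prior * prior_mean + w_data * sample_mean) / (w_prior + w_data)

print(posterior_mean(0.0, 1.0, 2.0, 1.0, 1))    # 1.0: prior and one datum split evenly
print(posterior_mean(0.0, 1.0, 2.0, 1.0, 100))  # ~1.98: the data dominate
```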
I wrote the paragraph quoting Lindley that Peter finds puzzling. Bayes' theorem can be expressed in the form
Posterior probability is proportional to Prior probability x Likelihood
If the prior is zero then so is the posterior, since zero times anything equals zero. A similar argument applies if the prior is 1. The likelihood is the part which mathematically models the information content of the data. In the case where the prior is zero it makes no difference what the likelihood is, since it just gets multiplied by zero to make zero. So by choosing a prior probability of zero (or one) you cut yourself off from the ability to take on board the information contained in the data. I hope this helps. Blaise 23:23, 19 September 2005 (UTC)
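Blaise's point can be checked with a two-hypothesis toy example (the numbers are made up): whatever the likelihoods, a prior of 0 or 1 is immovable.

```python
# Bayes' theorem for hypothesis H versus not-H:
# posterior is proportional to prior x likelihood, normalised by the evidence.
def posterior(prior_h, lik_given_h, lik_given_not_h):
    evidence = prior_h * lik_given_h + (1 - prior_h) * lik_given_not_h
    return prior_h * lik_given_h / evidence

print(posterior(0.5, 0.9, 0.1))  # 0.9: the data shift an open-minded prior
print(posterior(0.0, 0.9, 0.1))  # 0.0: a zero prior ignores the data
print(posterior(1.0, 0.9, 0.1))  # 1.0: a prior of 1 ignores them too
```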
One of the great things about Bayesian stats is that you can do things directly rather than indirectly, as in frequentist stats. For example, in frequentist stats you can't attach a probability to a hypothesis, so you can't talk about the probability of the hypothesis given the data. Instead, you have to mess around with the probability of the data given the hypothesis, which is not really what you want. Bayesian hypothesis testing is thus very simple in principle (though the sums may get hard in practice). You attach a prior probability to the hypothesis, multiply by the likelihood and divide by the probability of the evidence to normalise it. The point I was making was that probabilities of zero and one make poor choices as priors. (I am not saying that Bayesian stats generally has a problem with zero and one as probabilities, just when they are used as priors.) Your mention of linear models and so on suggests that you are thinking in terms of Bayesian methods of parameter estimation, whereas I was thinking in terms of Bayesian hypothesis testing. Blaise 15:28, 20 September 2005 (UTC)
Help needed
Hi there. Could somebody take a look at the trend article? Is the statistical term trend correct? If so, it needs expansion. Thanks. Oleg Alexandrov | talk 03:46, 24 Jan 2005 (UTC)
Removed some external links
I have massaged the External links section a bit and removed the following entries (others could stand to be culled IMHO, but I didn't do so):
- http://www.thenakedscientists.com/HTML/Columnists/robstanforthcolumn2.htm The Probability of Co-incidence
- Dedicated website (in Italian)
While the first may be an interesting article, it's not really directly relevant to statistics (it would belong at Probability, if anywhere); and I moved (2) to the statistics article at the Italian Wikipedia. - dcljr 23:11, 27 Jan 2005 (UTC)
virtual reality
Probability: What is the meaning of "In reality there is virtually nothing..."? In throwing a die, the event "the die has been thrown" has probability exactly 1. What is meant, I assume, is that one can never be absolutely sure of any future event. But then this "event" itself is absolutely sure!? 130.89.219.54 17:18, 31 Jan 2005 (UTC)
- ...yes? --justing magpie 14:58, 2 August 2006 (UTC)
Statistical Software - removal of SigmaXL link
You seem to have a problem with links to commercial sites, but in an inconsistent manner. Why then don't you remove STATA's link? The rules of Wikipedia do not forbid links to commercial sites.
- STATA seems to actually have helpful information on their page, and allows you to try certain things, while the SigmaXL Excel add-in only tells you to "Download a 30-Day trial". This, together with your repeated insistence, makes me think that you are looking for free advertising. If so, Wikipedia is not the place to go. Oleg Alexandrov 16:33, 6 Mar 2005 (UTC)
Someone who does not know the difference between stata.com and statsoft.com should not be editing the Statistical software page. As for my insistence, our product is a significant contribution to the market for powerful, easy-to-use and inexpensive statistical software. Therefore it deserves a mention alongside products like Minitab. I will remove the URL, but request that you keep the name up.
- This is meant to be a list of statistical packages in common use. I have previously heard of all the packages listed, except SigmaXL, StatPro and MacAnova. Some quick Google tests give 372,000 hits for Minitab, 363,000 hits for GNU Octave, over 6 million for Stata, and over 8 million for R (actually for statistical R). This compares to less than 500 for the StatPro add-in, about 700 for MacAnova, and 33,000 for SigmaXL. I have therefore removed StatPro and MacAnova from the list, and am tempted to remove SigmaXL unless someone can give evidence that it is as commonly used as some of the other remaining packages. -- Avenue 12:41, 21 Mar 2005 (UTC)
The 8 million Google references for Statistical R go down to 41,000 if you enter "R language", or 38,000 for "R project" statistical.
- A Google test for R is inevitably going to be subjective, and I admit that Statistical R will include some false hits, but I think those two search phrases are somewhat unnatural. There are 2.86 million results for R Statistical Software, and the first false hit was number 43 in the list, so I believe the true number of references to R would be measured in hundreds of thousands at least. -- Avenue 15:38, 31 Mar 2005 (UTC)
- That SigmaXL was put in by an employee of that company, could be in itself a good enough reason to remove the thing. Probably that employee meant well, but we would not want Wikipedia to be a medium of free advertising. Oleg Alexandrov 13:00, 21 Mar 2005 (UTC)
- I disagree; I believe their contribution should be judged on its merits. But the fact that, as an employee, they may have an interest in promoting their company's product means some skepticism is probably called for. -- Avenue 15:38, 31 Mar 2005 (UTC)
- You are right; just because it was added by an employee does not mean it should be deleted automatically. It is up to you whether to keep that link; I know nothing about statistics. I am just wary of people abusing the external links section. Oleg Alexandrov 16:08, 31 Mar 2005 (UTC)
No evidence that SigmaXL is in relatively common use has been provided. I also note that it only has the fourth highest Google pagerank of the add-ins listed here [1]. I will therefore delete it from the list. However I will also add Google's link. -- Avenue 13:32, 3 Apr 2005 (UTC)
Questions and Suggestions
I'm neither a statistician nor mathematician so bear with me through these comments.
First sentence.
Is "statistics" a science or is "statistical theory" the science and "statistics" the term for the information gathered? Is "human knowledge" compared to "inhuman knowledge" or "non-human knowledge"? Should the separate article "data" be merged with "statistics" and a redirect left at the "data" heading? The separate "information" article within Wikipedia is significant and easily stands alone but the "data" article seems subsidiary to "statistics".
The first sentence would be more clear to me, a layman, if it read as follows: "Statistics are the information (i.e. knowledge) created by the application of mathematics to data."
Rest of first paragraph:
"The branch of mathematics used is statistical theory. Within statistical theory, randomness and uncertainty are modelled by probability theory. Because one aim of statistics is to produce the "best" information from available data, some authors consider statistics a branch of decision theory. Statistical practice includes the planning, summarizing, and interpreting of observations, allowing for variability and uncertainty."
I think the separate articles "data" and "probability theory" should be merged with "statistics".
And I think there needs to be a discussion of the "statistical failure" in the exit polls during the 2004 U.S. Presidential elections to explain -- in layman's terms -- the importance of data collection, how errors arise, how the probability of error decreases as the sample size grows, etc.
Someone (either Mark Twain or Benjamin Disraeli) once said: "There are three kinds of lies: lies, damned lies, and statistics." I think there should be a discussion of "false statistics", information produced to prove a point rather than producing "correct" information.
Johnwhunt 18:49, 27 Mar 2005 (UTC)
- Statistics is a science; for example, there are Statistics Departments in many universities. These teach the science (and hopefully some of the art) of statistics, including statistical theory and applications. But I agree that our article should probably also mention the more concrete meaning, i.e. statistics = the plural of statistic.
- "Human" does seem redundant. I'll delete it and see if anyone complains.
- Data has different meanings in statistics and in computer science, with the latter usage becoming more widespread over time. I think the Data article is needed to distinguish them.
- Probability theory is a distinct area from statistics or even mathematical statistics, and deserves its own article.
- The Misuse of statistics article discusses misleading statistics, and is listed in the "See also" section here.
- There is a separate article on problems related to the 2004 exit polls: 2004 U.S. presidential election controversy, exit polls.
- -- Avenue 01:38, 28 Mar 2005 (UTC)
A quick question--the sentence
The implication of using probability theory is that statistical results can not provide definitive cause and effect relationships but can only show correlation relationships.
doesn't seem right to me. Being able to determine cause-and-effect relationships vs. correlation relationships has more to do with experimental design than with the practice of statistics itself, no? (For example, double-blind studies on drug efficacy would presumably use statistics to analyze the data, and those studies are certainly after proving causal relationships.) And the problem of providing "definitive" proof is a weakness of science in general, not statistics. I think this sentence should either be removed or replaced with something more to the effect of "The implication of using probability theory is that we can quantify how likely it is that a particular outcome occurred due to random chance rather than another factor."
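The suggested rewording can be made concrete with a small simulation that quantifies how likely an outcome is under chance alone (the 60-heads figure and trial count below are invented for illustration):

```python
# Simulated p-value: how often does pure chance produce a result at least
# as extreme as the one observed? Here: 60 or more heads in 100 fair tosses.
import random

random.seed(0)                 # fixed seed for reproducibility
heads_observed = 60
trials = 10_000

extreme = sum(
    sum(random.random() < 0.5 for _ in range(100)) >= heads_observed
    for _ in range(trials)
)
p_value = extreme / trials
print(p_value)  # around 0.03: unlikely, but possible, by chance alone
```

A small p-value says the outcome is surprising under chance; it says nothing by itself about *which* other factor caused it, which is the experimental-design point made above.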
Plus, could this article include something on statistical weights? Or should there be a different article for that entirely?
Origin
The Origin section could stand to be made more consistent and accurate (esp. by cross-referencing with other Wikis and other sources). ~ Dpr 05:46, 11 Jun 2005 (UTC)
- I've fixed it a little. Does anyone know why "most notably astronomy" is there? Was astronomy particularly important in driving the historical development of statistics? Joshuardavis 19:44, 19 February 2006 (UTC)
- On the sciences end (vice social sciences), astronomical measurements (for navigation) and biometry were driving forces. JJL 22:22, 19 February 2006 (UTC)
"Random Sample" and "Simple Random Sample"
We have :
It looks like they're talking about the same thing. Is there a statistician in the room? Flammifer
They bear the same relation as mammals and monkeys; i.e. a simple random sample is a random sample, but a random sample need not be a simple random sample. For example, cluster samples and stratified samples can be random samples, but are not simple random samples. Avenue 13:15, 23 September 2005 (UTC)
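Avenue's distinction can be sketched in Python (the data and helper names are invented for illustration). Both procedures below produce random samples, but only the first is a *simple* random sample: under stratification, some size-10 subsets (e.g. ten units all from stratum "A") can never be drawn.

```python
# Simple random sampling vs. stratified random sampling (illustrative).
import random

# A population of 100 units in two strata: 90 in "A", 10 in "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

def simple_random_sample(pop, n):
    # Every subset of size n is equally likely.
    return random.sample(pop, n)

def stratified_sample(pop, n_per_stratum):
    # Random within each stratum, but the stratum counts are fixed,
    # so e.g. a sample of ten "A" units is impossible.
    strata = {}
    for unit in pop:
        strata.setdefault(unit[0], []).append(unit)
    return [u for s in strata.values() for u in random.sample(s, n_per_stratum)]

print(len(simple_random_sample(population, 10)))  # 10
print(len(stratified_sample(population, 5)))      # 5 from each stratum -> 10
```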
Probability
I'm not real fond of part of the following paragraph:
The probability of an event is often defined as a number between one and zero. In reality however there is virtually nothing that has a probability of 1 or 0. You could say that the sun will certainly rise in the morning, but what if an extremely unlikely event destroys the sun? What if there is a nuclear war and the sky is covered in ash and smoke?
Based on this logic, one could say that the probability of something having the probability of 1 (or 0) is 0, and thus, by contradiction, the above statement currently in this section is incorrect. Just remove it and the statement following it.
- I agree. As a whole, this section adds little to the Statistics article. It seems to connect only to Bayesian statistics, which does not even appear in the article until this point. Essentially, it is a meandering, anecdotal discussion of the real-world applicability of the mathematical notion of probability. On the other hand, parts of this section might belong in Probability. The example of 10^−4 vs. 10^−9 explains quite well how people misunderstand risk. (But I think these numbers should be written 1/10000 and 1/1000000000, since the intended audience is ostensibly not number-savvy.) Joshuardavis 02:44, 4 February 2006 (UTC)
Where do I find stats about Wikipedia?
For example:
- The number of users,
- Pages with the most revisions,
- current number of pages,
- Number of revisions per day,
- and the like..
You would need this page: Wikipedia:Statistics. --shaile 03:18, 15 November 2005 (UTC)
Hidden assumption of probability theory
As written on the morning of Nov. 30, 2005, the text contained a strong but hidden assumption. The assumption was that probability theory was preserved. There are situations, though, under which probability theory is empirically invalidated. A failure to recognize the possibility of this happening has torpedoed much of the past research and literature in the engineering field of "nondestructive testing."
Researchers should be warned of the possibility of making this blunder. To accomplish this, I added a paragraph, which falls after the first paragraph. Terry Oldberg, http://www.oldberg.biz, [email protected]
- I have deleted this paragraph from the article, because the failure of probability theory assumptions is much rarer in typical statistical practice than other mistaken assumptions (such as independence or normality). We should give these more common problems much greater prominence than the ideas listed in the deleted paragraph.
- Also, the repeated self-citations give at least the appearance of vanity information, weakening the article. I have placed the deleted text below in case someone feels there is something here worth incorporating, although I would strongly suggest that articles such as Misuse of statistics or Probability theory would be better places to attempt that. Avenue 12:06, 1 December 2005 (UTC)
- In virtually every case, the methods of statistics assume probability theory. However, like any other theory, probability theory can be incorrect. Christensen and Reichert (1976), Oldberg and Christensen (1995) and Oldberg (2005) report observations of systems in which the Unit Measure axiom of probability theory is empirically invalidated. A result is that a number of statistical concepts either do not apply or apply only under restrictive circumstances; these concepts include probability, population, sample, sampling unit, signal and noise. It follows that blindly applying the methods of statistics without first checking for preservation of Unit Measure can lead to blunders. Oldberg and Christensen (1995) and Oldberg (2005) report that a blunder of this type plagues an entire field of engineering. The following presentation assumes probability theory.
- Christensen, R. and T. Reichert, 1976, "Unit Measure Violations in Pattern Recognition: Ambiguity and Irrelevancy," Pattern Recognition, Oct. 1976, pp. 239-245; Pergamon Press.
- Oldberg, T. and R. Christensen, 1995, "Erratic Measure," in NDE for the Energy Industry 1995, pp. 1-6; The American Society of Mechanical Engineers, New York, NY. Republished by ndt.net at http://www.ndt.net/article/v04n05/oldberg/oldberg.htm .
- Oldberg, T., 2005, "An Ethical Problem in the Statistics of Defect Detection Test Reliability", ndt.net, http://www.ndt.net/article/v10n05/oldberg/oldberg.htm.
- Regarding the suggestion that my posting might have been motivated by vanity, in the hope of pouring water on a possible flame war, I'll restrict my remarks to pointing out that, for seekers of truth, attacking one's opponent's errors is permissible. Attacking one's opponent is not.
- That 2 of the 3 works cited bear my name is a result of the fact that I am unaware of any other works on the topic of violations of the Unit Measure axiom of probability theory in the practice of statistics. If anyone is aware of additional works, I request that they supply citations to them.
- Failure to expose the assumptions supporting a conclusion is forbidden in technical writing. A failure of the statistics community to sufficiently expose this one has had dire consequences for the field of nondestructive testing. If you live near a nuclear reactor or refinery, cross bridges, work in a steel framed building, fly on aircraft or own stock in a company that owns any of these devices, your life and property are threatened by the simultaneous assumption of Unit Measure and empirical violation of it in nondestructive testing.
- The person who edited out the paragraph which I contributed states that false assumptions of probability theory occur more rarely in the practice of statistics than other false assumptions but supplies no citation to a study or studies supporting this assertion. On the other hand, I supplied citations to peer reviewed articles demonstrating violations of Unit Measure that lay unrecognized in the field of nondestructive testing for an extended period and to the distinct detriment of the people of the world. Decisions about the content of the Wikipedia article on statistics should be made on the basis of peer reviewed articles rather than anecdotes, where possible.
- For all of the above reasons, I submit that a warning of the assumption of probability theory and possibility of it being empirically violated is required in this article. Said warning should appear before the second paragraph, wherein the term "sample" appears. Samples do not exist under violations of Unit Measure.
- Terry Oldberg Dec. 1, 2005
- I have no wish to get into a flame war either. I said that your self-citations gave the appearance of vanity information. I did not mean to imply that you were motivated by vanity, and I apologise if I caused any offence. Perhaps a better way of phrasing my criticism would be that the references provided had overlapping authorship, not meeting Wikipedia's goal of multiple independent sources. References from reputable statistics journals would also carry more weight in the context of this article. (The 1976 reference might qualify, but I think not the others.)
- I do agree with you that so far our article does not give enough emphasis to the dangers of unfounded assumptions, but I disagree about which assumptions are most important. I have searched for "unit measure" in the online versions of two reference works on statistics, namely the Encyclopedia of Statistical Sciences (2nd Ed., Wiley, 2005) and StatSoft's Electronic Statistics Textbook. No documents matched, suggesting that this is not a topic of serious concern to the editors of either work. In contrast, there are two articles in Wiley's encyclopedia on testing for normality, or departures from it, and there were over 500 matches both for "normality" and for "independence". I think this strongly supports my belief that our article should cover these assumptions first. Avenue 15:11, 2 December 2005 (UTC)
- Avenue: I've added a brief, second paragraph, with a warning that a) statistics assumes probability theory, b) key elements of the terminology of statistics assume probability theory and c) probability theory can be violated empirically. There are no references to my own works or to any works at all, for that matter. I moved a detailed discussion to the topic of "Misuse of statistics", per your suggestion. Thank you for making it.
- Unit measure is a way of identifying an axiom that is described as Kolmogorov's second axiom in the current edition of Wikipedia. The phrase "unit measure" appears in one of the papers that I referenced in my original submission. I don't know whether there is a way of referencing this axiom that is conventional or, if so, what this way is. I have no information on the frequency with which violations of this axiom have appeared in scientific studies or the usage of statistical models. I can tell you that it has been observed in biomedical research (see the paper by Christensen and Reichert) and in the field of defect detection testing. The latter is where I encountered it more than 20 years ago, while serving in a role in which I directed much of the world's research on the safety inspections of the tubes of nuclear reactor steam generators. The methods of inspection violated unit measure but the scientific literature assumed unit measure.
- Violations of this axiom are ubiquitous in the literature of defect detection testing and they are buried by misusage of terms that imply the preservation of probability theory; this has been true for a period of more than 30 years. If a person you care about lives near a nuclear power reactor, refinery or chemical plant, flies on airplanes, crosses bridges, or relies on the reliability of any other kind of structure that functions under mechanical stress, this person's expectation of a healthy and prosperous life is diminished by this type of misuse of statistics.
- My claim that what I have said in the above paragraph is true is based, in part, on four peer-reviewed publications. The contrary claims of the United States Nuclear Regulatory Commission (whose studies are statistically flawed and which agency has subjected the people of the United States to unnecessary risk due to its own incompetence, if my claim is true) were rejected and mine accepted by a peer review panel, for a highly reputed engineering society, that included an academic statistician. By the way, the co-author of one of my papers is a theoretical physicist who has worked as a statistician for more than 35 years; he has published 7 books and a number of articles on theoretical and applied statistics. To my knowledge, nobody has refuted or limited my claims in the 16 years since I began to publish them or the 21 years since I made them orally to an engineering conference. This is true even though: a) a publication with a circulation of 1 million copies, Business Week magazine, published an article featuring one of my papers a decade ago; b) two of my papers have been published in a Web-based journal with an international circulation of 80,000 readers and discussed in an online forum for more than 6 years.
- In light of the above, it seems to me that caveats in Wikipedia's article on statistics are apropos.
- Terry Oldberg 06:36, 16 December 2005 (UTC)
Mr. Oldberg, could I ask you to conform to Wikipedia conventions (see Wikipedia:Manual of Style)? You've created some articles with gratuitous capitals in their titles. I moved Unit Measure to unit measure (with a lower-case initial m). You've started articles with dictionary-style definitions rather than complete sentences, and neglected to bold the title phrase at its first appearance. Sometimes you omit all links. To see what I have in mind, look at my edits to the articles you've worked on. Michael Hardy 19:56, 1 December 2005 (UTC)
- Mr. Hardy: Thank you for alerting me to this. Terry Oldberg
- Perhaps someone could expand unit measure? Presently it is very short. Punkmorten 11:58, 14 January 2006 (UTC)
In the comments above, Terry Oldberg asserts that "like any other theory, probability theory can be incorrect". Can we agree that probability theory, as a mathematical theory (distinct from a scientific theory -- see theory), cannot be incorrect (ignoring certain philosophical issues)? On the other hand, it can certainly be applied incorrectly to situations that do not satisfy the assumptions of the theory, and this is worth noting in the article. In my opinion, this warning belongs in the conceptual overview, not the intro, for two reasons:
- I want to keep the intro short. It is an intro to the article, not an intro to statistics. The article itself is an intro to statistics.
- Putting a "problem" in the intro might cause a casual reader to infer that statistics is unsound. I would like readers to gain a healthy skepticism for statistics, but not an unhealthy one. (In my mind this is analogous to the healthy skepticism that schoolkids should have of evolution or any other overwhelmingly validated scientific theory.)
For these reasons I removed the last of Mr. Oldberg's warnings today, but replaced them with a paragraph about misuse of statistics and statistical literacy. Respectfully, Joshuardavis 20:50, 20 February 2006 (UTC)
- Joshuardavis asks: "Can we agree that probability theory, as a mathematical theory (distinct from a scientific theory -- see theory), cannot be incorrect (ignoring certain philosophical issues)?" If his proposition is that probability theory follows from its premises, it would be impossible to disagree with him. However, whether his proposition is true is off the pertinent topic. The topic is whether the Statistics article should warn readers that: a) mathematical statistics assumes probability theory and b) an axiom of probability theory can be and has frequently been empirically invalidated in scientific studies that assumed statistics; these studies reached necessarily false conclusions. Should the Statistics article fail to warn of the possibility that mathematical statistics doesn't work when large numbers of the world's people are exposed to unnecessary risks from explosions of nuclear reactors and downings of aircraft from false assumptions of mathematical statistics? Comments?Terry Oldberg 05:23, 25 February 2006 (UTC)
- I do not believe that we need to warn readers about the issue discussed here. There are certainly many assumptions underlying statistical theory and practice, and some of these are widely recognized as being vital to much statistical work. For instance, many statistical procedures assume normality, and much has been written about failures of this assumption and how to detect and address them (e.g. the articles in Wiley's Encyclopedia of Statistical Sciences that I mentioned above). Failures of the "unit measure" assumption that Mr Oldberg raises do not seem to concern many statisticians, and we should not include warnings about them in our article unless wide concern about them can be demonstrated. -- Avenue 12:32, 25 February 2006 (UTC)
- Inspired by Terry Oldberg, I have expanded the "subtle but serious" discussion to emphasize how important/dangerous such errors in application of statistics can be. My feeling is that discussion of particular kinds of errors should be left entirely to the misuse of statistics article. I am not qualified to evaluate Avenue's assertion that the normality assumption is more worrisome than the unit measure assumption, but if it is then I hope someone will write about it for misuse of statistics; it's not there. Lastly, I suggest that language like "empirical invalidation of probability theory" should be replaced with language like "misapplication of statistics/probability (due to incorrect assumptions)", which places the blame where it's due. Joshuardavis 03:04, 27 February 2006 (UTC)
RFC
Hi. I was wondering if any of you guys can help us out over at Talk:Intelligent design. There's a line in the current ID article that says:
A Newsweek article reported The Discovery Institute's petition being signed by about 350 scientists, while the AAAS (the largest association of scientists in the U.S.) has 120,000 members, indicating that around 0.3 % of U.S. scientists give some support to ID. The international percentage is likely to be much smaller.
Some people (including me) there think there is some sort of sampling bias (I was thinking it is selection bias, but not being statisticians we're not really sure what exactly it is). Others are fairly adamant that it's just a simple matter of math. I was hoping someone could help us sort it out one way or the other. Thanks. You can just post here, I'll check back, or if you want leave a message on my talk page or post on Talk:Intelligent design under the heading "Support among scientists – this is bogus." --Ben 02:33, 23 December 2005 (UTC)
Forums for Statistics Help?
Are there any open forums for people who help each other with statistics? Can we add here some if anyone knows any? Thanks. Towsonu2003 04:22, 10 February 2006 (UTC)
- Wikipedia:Reference desk/Mathematics is very well staffed; the helpers there complain when a day goes by without a question. It's a great place to ask, and to have a conversation, but be sure to read and follow the rules at the top of the page. Asking homework questions is fine, but some people will scoff. --James S. 05:09, 10 February 2006 (UTC)
Reorg on 19 Feb 2006
Hello. I've been lurking for a while, reading the many ideas and suggestions, and now I've chosen to be bold and implement some of them. Here were my main goals:
- Make the intro concise. (If past tendencies prevail, it will be made long again soon enough.) I understand that people have strong feelings about what deserves to lie above the contents. I hope that the new conceptual overview section will satisfy most. It also brings in a lot of requested topics not treated here before.
- Remove the probability section. There were several "complaints" about it. Most of it was not relevant. I have preserved it here, in case anyone wants to put it into statistical literacy, etc.:
- Statistics makes extensive use of the concept of probability. The probability of an event is often defined as a number between one and zero. In reality however there is virtually nothing that has a probability of 1 or 0. You could say that the sun will certainly rise in the morning, but what if an extremely unlikely event destroys the sun? What if there is a nuclear war and the sky is covered in ash and smoke?
- We often round the probability of such things up or down because they are so likely or unlikely to occur, that it's easier to recognize them as a probability of one or zero.
- However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10^−4 and a probability of 10^−9, despite the very practical difference between them. If you expect to cross the road about 10^5 or 10^6 times in your life, then reducing your risk of being run over per road crossing to 10^−9 will make it unlikely that you will be run over while crossing the road for your whole life, while a risk per road crossing of 10^−4 will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.
- Use of prior probabilities of 0 (or 1) causes problems in Bayesian statistics, since the posterior probability is then forced to be 0 (or 1) as well. In other words, the data are not taken into account at all! As Dennis Lindley puts it, if a coherent Bayesian attaches a prior probability of zero to the hypothesis that the Moon is made of green cheese, then even whole armies of astronauts coming back bearing green cheese cannot convince him. Lindley advocates never using prior probabilities of 0 or 1. He calls it Cromwell's rule, from a letter Oliver Cromwell wrote to the synod of the Church of Scotland on August 5th, 1650 in which he said "I beseech you, in the bowels of Christ, consider it possible that you are mistaken."
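The road-crossing arithmetic in the removed text above can be checked directly. Assuming independent crossings with per-crossing risk p, the lifetime accident probability is 1 − (1 − p)^n:

```python
# Lifetime risk over n independent exposures, each with probability p:
# P(at least one accident) = 1 - (1 - p)**n.

def lifetime_risk(p_per_crossing, n_crossings):
    return 1 - (1 - p_per_crossing) ** n_crossings

print(lifetime_risk(1e-9, 10**6))  # ~0.001: very unlikely over a lifetime
print(lifetime_risk(1e-4, 10**6))  # ~1.0: an accident is near-certain
```

The independence assumption is a simplification, but it is enough to show the five-orders-of-magnitude gap in per-crossing risk translating into the difference between "negligible" and "near-certain" over a lifetime.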
And here is what I ripped out of the intro:
- Key concepts and terms of statistics assume probability theory; among the terms are: population, sample, sampling, sampling unit and probability. Warning: systems are known to science that violate probability theory empirically.
- Once data has been collected, either through a formal sampling procedure or by recording responses to treatments in an experimental setting (cf experimental design), or by repeatedly observing a process over time (time series), graphical and numerical summaries may be obtained using descriptive statistics.
- Patterns in the data are modeled to draw inferences about the larger population, using inferential statistics to account for randomness and uncertainty in the observations. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).
- The framework described above is sometimes referred to as applied statistics. In contrast, mathematical statistics (or simply statistical theory) is the subdiscipline of applied mathematics which uses probability theory and analysis to place statistical practice on a firm theoretical basis.
I know the links need to be cleaned up, and some of what I added in the conceptual section is probably wrong. I am not a statistician, so I won't be surprised when it is ripped apart. Some things, such as the "empirical violation of probability theory", I left in despite not understanding. Cheers, Joshuardavis 22:59, 19 February 2006 (UTC)
- Thanks for taking the initiative. The article has needed a revamp for a long time.
- Your first sentence said that statistics is a "branch of mathematics". Since I believe many statisticians feel that statistics is no longer simply a part of mathematics, I changed this. Nrcprm2026 changed it back, saying "give math some credit". But saying statistics is a branch of mathematics gives math too much credit. (Maybe a wayward child would be a better analogy?) It gives the computational and cognitive aspects of statistics short shrift.
- I'll quote the first sentence from Moore and Cobb (2000), Statistics and Mathematics: Tension and Cooperation, American Mathematical Monthly, pp. 615-630 pdf. (Moore is a past president of the American Statistical Association.)
It has become a truism, at least among statisticians, that while statistics is a mathematical science, it is not a subfield of mathematics.
- The rest of this paper's lead paragraph gives further support to my belief, as do the reasons listed here. -- Avenue 13:28, 20 February 2006 (UTC)
- In case you wanted a response: This is all fine by me. I did not write the passage you refer to (except for changing "refers to" to "is"). Your view of statistics distinct from math agrees with everything I have ever heard. In fact, I think I'll reword the Historical context section to make this more forceful. Joshuardavis 19:35, 20 February 2006 (UTC)
Presentation of data/Rates
An anonymous poster put up a section "Presentation of data" with a single subsection on "Rates", mainly death rates per 100,000 people. It is rather detailed and focused on death rates, and the definition of specific death rate is identical to that of the crude death rate. It has not been improved upon. I move that it be deleted, and thus the Presentation of data section with it. Do the knowledgeable editors around here think that such sections should exist? If so, what should go in them? Joshuardavis 16:45, 13 March 2006 (UTC)
- I agree that the current section on Presentation of Data has little or no value. Deletion is justified, in my opinion. I wouldn't object to some general coverage of data presentation techniques and problems, perhaps as a subsection of Statistical methods. For example, we could cover the pros and cons of tabular vs graphical display, an overview of common graph types and how to use them well (and/or show problems with their usage), and perhaps some more general discussion of how most data summaries hide more than they reveal. At least some of this would probably be too detailed for the main article on Statistics, but it could easily be shifted to more specific articles if necessary. -- Avenue 20:06, 13 March 2006 (UTC)
- I'm deleting it from the article and putting it right here:
- Rates are the percentages that are based on a particular population. Rates will be based on the same figure, usually per 100,000 populations.
- The most widely used rates are the death or mortality rate. Rates are divided into crude death rates and specific death rates.
- The crude death rates are the number of deaths that occur in a given year per 100,000 persons in the entire population. Crude death rates are used oftenly on an international level as a comparison between countries. These rates are also used by population specialists to determine growth rates.
- Crude death rate = (Number of deaths X 100,000) ÷ Total Population
- Specific death rates are the number of deaths that occur in a given year per 100,000 persons in the entire population.
- Specific death rate = (Number of deaths for a specific population X 100,000) ÷ Total Population
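For the record, the quoted formulas are simple enough to sanity-check in a few lines of code. This is just an illustrative sketch; the population and death counts below are made up:

```python
# Crude and specific death rates per 100,000 population, following the
# formulas quoted above. All figures are hypothetical.

def rate_per_100k(deaths, population):
    """(Number of deaths x 100,000) / total population."""
    return deaths * 100_000 / population

total_population = 5_000_000
deaths_all_causes = 42_500     # hypothetical count, all causes
deaths_one_cause = 9_000       # hypothetical count, one specific cause

crude_rate = rate_per_100k(deaths_all_causes, total_population)
specific_rate = rate_per_100k(deaths_one_cause, total_population)

print(crude_rate)    # 850.0 per 100,000
print(specific_rate) # 180.0 per 100,000
```

As the quoted text stands, the "specific" rate differs only in restricting the numerator to one cause of death, which matches the complaint above that the two definitions read as identical.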
wikistatistics
Are there any wikis for collecting raw statistical data?
- Do you mean a wiki in which users post raw data from experiments from various scientific disciplines? I know that some specific scientific subfields maintain community repositories of data, but a wiki doesn't seem appropriate to me for such a database. Joshua Davis 23:15, 11 April 2006 (UTC)
- I am looking for a community repository of data. A repository which follows a standard form and is open to peer review. A repository for ALL data, including demographics, surveys, studies, patterns, etc. Basically, a repository of all numerical data. The reason I am looking for it in wiki format is because wikis utilize vast human resources, and it is possible not only to collaborate extensively on popular topics, but also to find information on obscure topics. Do you know of any systems that utilize vast human resources in order to collect raw data on all topics? --146.244.137.197 23:54, 11 April 2006 (UTC)
Statistical graphics are poorly covered
Graphical methods are currently mentioned only at one point in the article, in this paragraph:
- Descriptive statistics deals with the description problem: Can the data be summarized in a useful way, either numerically or graphically, to yield insight about the population in question? Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.
This does not seem sufficient for such a pervasive aspect of statistical practice. Expanding on this would also provide plenty of scope for illustrations. I'll add something soon. -- Avenue 09:35, 3 May 2006 (UTC)
Disputing "Good Article" status
See Wikipedia:Good articles/Disputes#Statistics. --zenohockey 00:26, 14 May 2006 (UTC)
- Let's talk about what needs to be done, specifically, to polish off this article. The criticisms are essentially:
- Historical overview section needs expansion
- Criticisms section needs expansion
- There are too many lists of people/concepts
- We need mention of Bayesian methods
- Aren't lists appropriate, as the Statistics article serves as an introduction and a jumping-off point for anyone reading WP articles on statistics? Furthermore, what other topics should go in history/criticisms? How much should we say about Bayesian stuff here? Can someone make a concrete outline? Joshua Davis 16:57, 14 May 2006 (UTC)
- Wikilinks are needed, but preferably not in the form of several lists of bullet points. They can be embedded in prose instead, as in Mathematics#Overview_of_fields_of_mathematics for example. This has the advantage that someone adding such a link has to figure out something useful to say about it, which might help combat the profusion of links to obscure software packages, for instance. The sheer number of external links is also a problem.
- I'm sure there are other important topics besides Bayesian methods that aren't mentioned, but should be. I'll give this some thought. Some pruning would be useful too. -- Avenue 04:42, 15 May 2006 (UTC)
- I think whoever nominated this article for "Good Article" has to review his/her opinion. This article is a real disgrace to Wikipedia (only one image, very few and unorganised links, long text but no useful information, and of course, the lack of Bayes ...), neglecting a topic so important that it needs a portal...
- I am thinking of making a Portal:Statistics; however, I am not familiar with making new portals, so any help would be appreciated. --Lord Snoeckx 19:20, 17 May 2006 (UTC)
Criticism
We currently have a Criticism section with very little in it. What should go in such a section? Or should the section be deleted, with its contents left to misuse of statistics and statistical literacy? Joshua Davis 14:56, 15 May 2006 (UTC)
We should integrate the relationship between half-truths and the concept of statistics into the article.
For example:
Milo Schield, who has a PhD in Astrophysics, is a professor at Augsburg College who teaches statistical literacy, traditional statistics and critical thinking at the undergraduate and graduate level.[4] In his 2005 paper, Statistical Prevarication: Telling Half Truths Using Statistics[5], Schield notes,
All too often statistics are characterized as lies. But statistics are more likely to be half truths than lies.... statistical prevarication [is] the art of straddling both sides of an issue or idea... If statistics educators are to avoid a charge of statistical negligence, they should focus more on identifying and eliminating sources of statistical prevarication in their teaching and textbooks. And statistical educators should do more to help students become statistically literate in detecting statistical prevarication.
Let us assume a statistic is true; it still represents only one part of the whole picture. The fastest-growing sport may not be the most popular. [5]
— Preceding unsigned comment added by User:Caesarjbsquitti (talk • contribs)
- Statistics is like a hammer--a hammer can be used to kill, but we don't put warning labels on all hammers. Statistics, like any other tool, can be used incorrectly or for harm. The best thing to point out is how it can be and has been used misleadingly. I think this would be an interesting subsection, but short and to the point. You generalize in a way that makes it sound like everyone who uses statistics is a liar. Statistics are not always "generalizable," and that may contribute to misunderstandings. Furthermore, some statistics may be unknowingly incorrect due to error (that is, Type I and Type II error). We should be careful to discuss cases that were proven misleading rather than merely suspected. Overall, we should be careful not to make such sweeping generalizations about those who use statistics. Chris53516 20:42, 23 August 2006 (UTC)
Hypothesis tests versus confidence intervals
User:JJL recently changed the comment on confidence intervals (as an alternative to hypothesis tests) to read:
- One possibility is to avoid hypothesis testing and report confidence intervals instead, but this merely avoids drawing the final conclusion of the test, and statisticians do wish to draw conclusions.
I disagree strongly with this statement. Confidence intervals present the uncertainty about the size of the difference, while hypothesis tests typically condense this into a statement about whether the difference could reasonably be zero. But in many situations, simply having a non-zero difference is not the goal, and hypothesis tests miss the point. JJL's edit summary was "C.I. really is equiv. to hyp. test but w/o the final concl.; fine if no concl. needed, but try selling that to the FDA." Okay, so look at a drug example. Suppose we have a clinical trial on a new cancer drug. Assume it costs $100,000 more per patient than existing treatments, but this would be judged worthwhile if it decreases 5-year mortality by 10%. If the confidence interval for the mortality decrease is (2%,22%), we don't know enough to say whether the drug is worthwhile, but a hypothesis test would show there is a significant difference in mortality - not the right thing to conclude. Alternatively, if the confidence interval for the mortality decrease is (1%,9%), we do know enough to say that the drug is probably not worthwhile. However a hypothesis test would still show there is a significant difference in mortality, again leading us to the wrong conclusion.
Statisticians do not wish to draw wrong conclusions. This is why many of them have recommended confidence intervals over hypothesis tests. I can dredge out references if you want. -- Avenue 01:24, 23 May 2006 (UTC)
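To make the trade-off concrete in code: the sketch below computes a two-sided 95% confidence interval and the corresponding z-test decision, then applies the 10% worthwhileness threshold from the example. The point estimates and standard errors are invented so the intervals roughly match the (2%, 22%) and (1%, 9%) scenarios; a plain z-test stands in for whatever analysis a real trial would use.

```python
# Confidence interval vs. hypothesis test on the same estimate:
# both summarize the same calculation, but support different decisions.
# All numbers are invented for illustration.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def summarize(estimate, se, threshold=10.0, z=1.96):
    """95% CI, significance at the 5% level, and a 'worthwhile?' verdict."""
    lo, hi = estimate - z * se, estimate + z * se
    p_value = 2 * (1 - normal_cdf(abs(estimate) / se))
    significant = p_value < 0.05          # the hypothesis-test answer
    if lo >= threshold:
        verdict = "yes"                   # whole interval clears 10%
    elif hi < threshold:
        verdict = "no"                    # whole interval falls short of 10%
    else:
        verdict = "unclear"
    return (round(lo, 1), round(hi, 1)), significant, verdict

print(summarize(12.0, 5.1))  # CI ~(2.0, 22.0): significant, verdict unclear
print(summarize(5.0, 2.0))   # CI ~(1.1, 8.9): significant, yet verdict "no"
```

In both cases the test says "significant", yet only the interval shows whether the 10% threshold is plausibly met.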
- I wanted to make the point that in computing a C.I. one is doing essentially all the calculations done in performing a comparable-level hyp. test but then not condensing it to a yes-no answer. In this sense it is more a stylistic choice than a fundamentally different approach. I too have seen more and more people leave things as a C.I. rather than a reject/FTR (fail to reject), just as I've seen the rise of the emphasis on the p-value. Pointing this out is fine, but I think it's important to make it clear that we're talking about different ways of viewing the same calculation, not truly different approaches in the way that, say, a Bayesian approach is different. So, I am certainly fine with someone weakening my statement, but I'd like to see it remain clear that there is a close connection between these approaches. JJL 03:15, 23 May 2006 (UTC)
- I agree that mathematically p-values or confidence intervals are "only" stylistic choices, since (if one knows the point estimate and the null hypothesis) you can convert between them. But in practice, people make very different judgements based on these two presentations of the same information. I don't mind both aspects being discussed, but to me the practical advantages of CIs seem much more relevant to the Criticisms section than the mathematical equivalence of these approaches. -- Avenue 00:28, 24 May 2006 (UTC)
- I don't disagree with your overall point, but I think the current wording is too strong--preventing common errors made by hyp. testers sounds like POV to me (would they agree that they make all these 'common' errors?). I also disagree with widely recommended remedy. It's an increasingly popular approach, to my mind. Let's see if I can help us iterate to a happy medium. Edit away at what I write, of course! JJL 00:52, 24 May 2006 (UTC)
Section on Basic statistical techniques for beginners
A section called "Basic statistical techniques for beginners" was recently added. Some of this might be merged into the "Statistical techniques" section, although I think much of this is getting too detailed for this article. I also wouldn't choose ANOVA, regression, and chi-squared tests as the three best techniques to cover here. I've removed the section from the article; its text follows below. -- Avenue 00:38, 6 June 2006 (UTC)
- There are three basic statistical techniques: chi-square, which works well on counts; analysis of variance (anova), which works well on measurements from two or more groups; and regression, which builds the best simple equation for seeing how well one variable will convert into another. . . In the case of chi-square, the technique can be used to support the notion that there is an important, reliable difference in the observed counts in two or more groups, as in the number of members of group A who are found to have a given characteristic versus those in group B. In the case of anova, that technique can be used to support the notion that the means or averages in some groups of scores are importantly and reliably different from the means of other groups. In the case of (correlation)/regression, the technique helps show how much one variable can be changed to another by a simple formula. For instance, through the body mass index human body height might be found to translate into the weight of that body by a fairly simple and straightforward mathematical formula .
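As a concrete instance of the first technique described in the removed text, here is the Pearson chi-squared statistic for a hypothetical 2×2 table of counts, computed by hand (the counts are invented):

```python
# Pearson chi-squared statistic for a 2x2 table of counts: do groups A and B
# differ in how often a characteristic is observed? Counts are hypothetical.

table = [[30, 70],   # group A: has characteristic / does not
         [50, 50]]   # group B: has characteristic / does not

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

print(round(chi2, 3))  # 8.333, above the 5% critical value of 3.841 (1 d.f.)
```

A statistic this large supports the notion quoted above: a difference in counts this size between the groups is unlikely to be chance alone.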
I concur with the removal of this section. JJL 14:52, 6 June 2006 (UTC)
Hi all,
Unfortunately, after discussion on Wikipedia:Good articles/Disputes, this article has been delisted. The discussion relating to this article's delisting follows. If you feel the issues discussed below are no longer applicable, please feel free to renominate the article.
Cedars 08:10, 11 June 2006 (UTC)
Largely consists of lists of people and disciplines, few of which are explained even briefly in the text. The explanations seem adequate, where they appear, but the "Criticism" section, e.g., needs to be much larger. --zenohockey 00:24, 14 May 2006 (UTC)
- I agree, much that is important about statistics is missing from the article. For instance, Bayesian methods don't get a mention. Also contains only one image. --Avenue 04:51, 14 May 2006 (UTC)
- I think this article sucks, only one image, few links, and no useful information, it's a disgrace to Wikipedia to neglect such an important topic which is even worth a portal!--Lord Snoeckx 19:12, 17 May 2006 (UTC)
- Please keep things cool, people; let us not go around with terms like that. By doing so you are insulting both the editors of the article and the person that passed it, without actually discussing the issue at hand - the article. If you read the GA guidelines, you'll see that images aren't required, so that can't be used against it. "Few links"? You mean it isn't a link farm or it doesn't link to other articles? We should discuss the content and quality of this article. Play on, Highway Rainbow Sneakers 14:59, 26 May 2006 (UTC)
- I didn't understand the "few links" comment either, and I think "no useful information" is overstating things. The article has improved somewhat since its listing here, with the "Criticism" section being expanded. But the other criticisms remain. These problems mean that, in my opinion, the article fails to meet the good article criteria of being well written and having broad coverage of its subject. It also includes only one inline citation and one image; although these are not mandatory, more would certainly be desirable. -- Avenue 05:10, 6 June 2006 (UTC)
- Not to mention that making free images for a Statistics article is much easier than most. --SeizureDog 08:35, 11 June 2006 (UTC)
Probability
As I looked over this article, I thought that probability should be mentioned, because it is part of the mathematical foundation of statistics. Does anybody think probability needs to be mentioned in this article? I hesitated to add probability to this article myself. Perhaps there is another article about mathematical probability. *~Daniel~* ☎ 02:32, 31 July 2006 (UTC)
- Probability is mentioned as the basis for statistics in the historical overview (with moderate discussion) and in the conceptual overview (with brief mention as to its relevance). There are links to probability, probability theory, and mathematical statistics. What more do you think this article needs? Joshua Davis 18:25, 31 July 2006 (UTC)
- No, I don't think that this article needs anything more. But I just considered that mathematical statistics should be mentioned. Thanks anyway. *~Daniel~* ☎ 02:40, 6 August 2006 (UTC)
Historical overview
The historical overview section has grown splendidly in recent months. But this has left it somewhat a mess. Today I tried to separate out a few threads and place each in its own subsection, with the mess concentrated in the "Origins in probability" subsection. In order to make the article sensible while it's being revised, I have temporarily deleted this passage:
- Statistics eventually merged with the more mathematically oriented field of inverse probability, referring to the estimation of a parameter from experimental data in the experimental sciences (most notably astronomy).
I don't know enough to polish this section much further. Here's what I don't like:
- The discussion of Quetelet is more than half as long as the article Adolphe Quetelet. Let's move most of this material there, and leave a brief mention of his significance here.
- People keep wanting to write treatises on probability in this article. In particular, the discussion of aleatory vs. epistemic probability is long and not directly referenced elsewhere in this article. (And it's not historical, but conceptual.) Should it be moved to Probability?
- I separated out the paragraph "Other contributors were..." because it's unclear to me whether it's related to the paragraph before it or the one after it.
- The article is 33 KB long. If the Historical overview gets much longer, we might consider splitting it off into "History of statistics".
Joshua Davis 16:31, 5 August 2006 (UTC)
- Responding to myself here...I have removed the treatment of aleatory vs. epistemic probability, since it appears verbatim at Probability, and I have moved the treatment of Quetelet to Talk: Adolphe Quetelet, leaving just a short summary here. Joshua Davis 15:39, 15 August 2006 (UTC)
I moved the references out of external links (they weren't external LINKS at all), made them a bibliography for statistics, and added Dr. Joel Best's book to it. I also clarified the quote on "lies, damned lies, and statistics" to indicate whom it has reasonably been shown to be from -- Disraeli -- and footnoted it as such. That was my intention. User:Chris53516 asked that I explain it here. I hope it's fine. Bests. --- (Bob) Wikiklrsc 21:35, 24 August 2006 (UTC)
- Thanks for the updates. It looks much better. Chris53516 21:38, 24 August 2006 (UTC)
You're welcome, Chris. Thanks for your comments. Bests. --- (Bob) Wikiklrsc 12:50, 25 August 2006 (UTC)
Fallacy?
Statistics can be easily deemed a fallacy. If statistics say that kids whose parents don't talk to them about not smoking are more likely to smoke (you know the common argument), that is a fallacy. Yes, it may be a true statement, but it cannot be argued that the kids whose parents tell them not to smoke would not find smoking cool, or that the kids whose parents didn't tell them not to smoke may feel it is disgusting. Statistics as a field tends to treat all people as equal in all regards when that is clearly not true. Not everybody can throw 49 touchdown passes in an NFL season like Peyton Manning did in 2004 or be the leading goal scorer at the Soccer World Cup. I just figured this might be an idea to consider discussing in the article, even though it may be difficult to find a decent source. 205.166.61.142 00:31, 31 August 2006 (UTC)
- You make some sweeping generalizations. One of the purposes of statistics is to attempt to explain an outcome with the most explanatory variables. If a certain type of person is more likely to have a certain kind of outcome (for example, black men tend to have more cardiovascular problems), it is in the best interest of such research to treat everyone differently, not the same. Statistics such as the t-test and ANOVA often differentiate people more than treat them the same. I think your football analogy may be one of the fallacies you are talking about. Football statistics are descriptive statistics--they only describe those people to whom they apply (in your case, professional football players and nobody else). Inferential statistics, such as the t-test, often group people according to like kinds based on particular variables, like incidence rate of cardiovascular health problems. Chris53516 13:43, 31 August 2006 (UTC)
Let me add to that answer in case the poser of the question returns. Statistical methods are not (correctly) used to prove cause and effect or to make claims that something is always true. Statistics is more of an art of educated guessing where mathematical methods are used to make best decisions about what is most likely or what tends to be related. In fact, built into the methods of statistics are ways of determining how likely you are to make an error in your "educated guessing". Typically, someone using statistical methods correctly will say, "I am 99% sure that these two factors (such as not smoking and parents telling the child not to smoke) are related to each other." Then qualifiers will be added. Even in that case, a good statistician wouldn't claim that one factor causes the other. It could be that both items are caused by some third, unidentified, factor. But, of course, those types of misinterpretations of statistical results are made all the time. That doesn't mean, however, that cause and effect is not logically the best interpretation of the situation. Suppose, for example, that a large number of people get sick who mostly all ate spinach. We might make a best guess that spinach caused the illness. But really it might be something else, like a common salad dressing used by spinach lovers, or the fact that spinach stuck in their teeth chased away potential romantic relationships, leaving the spinach-eaters in a heart-sick condition which eventually led to real illness. Of course, those alternatives are ridiculous. I guess they COULD be true, but most people would go with the theory that the spinach was tainted. And even if the spinach was the problem, it could be that, for some, there was another unidentified cause. So, we are left with concluding, "Probably this is the cause most of the time." --Newideas07 21:48, 3 November 2006 (UTC)
Need Link to Reliability (statistics) page
This page needs links to the pages on Reliability (statistics) and Factor Analysis. I'm not sure if these should be put under Statistical Techniques or See Also. I'm also wondering if there should be a link to Cronbach's Alpha (which is one type of reliability estimate).
It seems to me that there are probably quite a few statistical techniques that are not linked from this page. Perhaps it would be helpful to create a hierarchical index of statistical techniques. I see that something like this can be done in the Table of contents. Kbarchard 22:24, 16 September 2006 (UTC)
- This page is not a list of statistical topics (which we link to in the "See also" section), and not every statistical technique or estimator needs to be listed here. The ones you mention seem a bit too specialised for a general article on statistics, but could be usefully added to articles like multivariate analysis and social statistics. -- Avenue 01:34, 18 September 2006 (UTC)
I wrote an article on Standardized coefficient, but I am no expert in statistics. If this could be quickly vetted by an editor more experienced with this field, we could have a statistical WP:DYK. -- Piotr Konieczny aka Prokonsul Piotrus | talk 20:25, 7 October 2006 (UTC)
What is the difference between F(x) and f(x)?
Can somebody please explain to me, with an example, the difference between F(x) and f(x) for a continuous random variable? As far as I understand, f(x) is the derivative of F(x); please correct me if I am wrong, but that is not sufficient for understanding the whole process. Many thanks. -Chetan. — Preceding unsigned comment added by Chetanpatel13 (talk • contribs)
- Those two should be interchangeable, as far as I know. By the way, use four ~ to sign with your user ID. Chris53516 17:07, 18 October 2006 (UTC)
- Chris, thanks for the response, BTW they are very different. Thanks for the tip and hopefully I am doing it right this time. -- Chetan M Patel 18:24, 18 October 2006 (UTC)
- How are they different? Please use 4 ~ to sign your name. It's easier than what you did. Chris53516 18:31, 18 October 2006 (UTC)
- f(x) is the probability density function (PDF), whereas F(x) is the cumulative distribution function (CDF). Chetan M Patel 18:58, 18 October 2006 (UTC)
- The names of the functions are a convention, widely used in statistics. Perhaps a better question is: what's the difference between a PDF and a CDF? It's probably easiest to understand if you know about integration. As we are working over a continuous domain, the chance of a random variable taking a particular real value, 0.123456789 say, is zero, so it only makes sense to talk of probabilities calculated over a range of values, and it's a convention to use the range giving the CDF. So yes, f(x) is the derivative of F(x). What is the meaning of the PDF? Well, if you consider a discrete probability distribution like the binomial distribution, then the PDF is just the probability of a particular number; here the probabilities of a particular number 0,1,2,3 occurring are non-zero. Furthermore, the PDF is useful for visualising the shape of a distribution: for the normal distribution it gives the familiar bell-shaped curve, while the CDF would be S-shaped and it's harder to see what's happening. --Salix alba (talk) 20:45, 18 October 2006 (UTC)
- Correction: that should be F(u) = ∫_{−∞}^{u} f(x) dx. The upper bound of integration must be u if F(u) is what you're evaluating. Michael Hardy 22:47, 18 October 2006 (UTC)
In case anyone wants a "Statistics for Dummies" explanation of all that: f(x) is the drawing of a curve that defines a certain probability density function (pattern). For example, a bell-shaped curve has an equation, f(x), and represents a situation in which falling in the middle of some range is most likely, with tapering probabilities as you go to the left or right. Most measurements of objects fall in this category. But probabilities of having x in some range are found by calculating the area under the curve. To find the area under the curve, you have to integrate f(x) to get F(x). Sometimes that is impossible or just really hard, and so approximation techniques are used instead, which is one reason why you usually get probabilities out of tables instead of using equations. There are other theoretical uses for the two functions. I'm not sure if that clarified things for anyone. --Newideas07 21:23, 3 November 2006 (UTC)
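The two answers above can be checked numerically: differentiating the CDF recovers the PDF. A small sketch for the standard normal distribution, using only the Python standard library (math.erf gives the normal CDF in closed form):

```python
# Verify numerically that the PDF f is the derivative of the CDF F,
# using the standard normal distribution.
from math import erf, exp, pi, sqrt

def F(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def f(x):
    """Standard normal PDF (the bell-shaped curve)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

h = 1e-6
for x in (-1.0, 0.0, 0.5, 2.0):
    slope = (F(x + h) - F(x - h)) / (2 * h)  # central-difference derivative
    assert abs(slope - f(x)) < 1e-4          # slope of F matches f at x
    print(f"x={x:5.1f}  f(x)={f(x):.6f}  F'(x)~{slope:.6f}")
```

The printed columns agree to several decimal places, which is the whole PDF/CDF relationship in miniature: F accumulates area, f is its local slope.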
Name of Etymology subsection
Etymology here is the study of the history of the word statistics, not the history of statistics itself. The first paragraph or so of the current Etymology subsection is etymology, but the later paragraphs go beyond etymology to actual history of statistics. That's why I think there are many better, broader titles for this subsection. Or maybe I am interpreting etymology too narrowly? Joshua Davis 15:11, 21 October 2006 (UTC)
- I think Etymology works, even if it does go beyond simple etymology. It's still related to the word's history. -- Chris53516 16:04, 22 October 2006 (UTC)