Science Network - Statistical thinking as I understand it - Wang Wei's blog

In his 1907 autobiography, the famous American novelist Mark Twain quoted a remark that he attributed to Benjamin Disraeli, the former British Prime Minister:

There are three kinds of lies: lies, damned lies, and statistics.

Thanks to Mark Twain's great popularity, the sentence has circulated widely ever since.

Everyone has studied mathematics for many years. One of the reasons why we should learn mathematics is of course that we will use some mathematics in our daily life and in our profession. In other words, mathematics can be regarded as a tool. A person who is proficient in mathematics often has characteristics such as strong logic and accurate calculations. What about statistics?

On the one hand, statistics is becoming more and more important. When people make decisions, they insist on having statistics and treat them as a talisman. On the other hand, there are people like Mark Twain who sneer at statistics. Even in academia, many people believe that statistics is just a branch of mathematics, while most statisticians believe, and have repeatedly emphasized, that statistics and mathematics are entirely different disciplines.

It may be easy for us to sense what it means to have a head for economics, a gift for literature, or an ear for music. But what does it mean to have a statistical mind, a talent for statistics, or statistical literacy? That is not easy to explain clearly. This article attempts to address these questions by explaining the statistical way of thinking.

1. Correctly understand the importance of statistical thinking

Let us first look at an example. In November 1985, the American scholar Gary Taylor found a poem (let's call it the "Taylor poem") in a library at Oxford University, England, which triggered a great debate among British and American Shakespeare scholars. The focus of the debate was whether this poem was written by Shakespeare.

Many experts believed that the "Taylor poem" was very different from Shakespeare's other works in word choice and rhyme. Two months into the debate, on January 24, 1986, Science magazine published an article, "Shakespeare's new poem: an ode to statistics", describing how two statisticians, Efron and Thisted, used statistical methods to assess whether the "Taylor poem" was written by Shakespeare.

The method of Efron and Thisted rests on this idea: every author has their own habits of word usage, and for rare words the differences between authors may be even greater. Shakespeare's known works contain 884,647 words in total, of which 31,534 are distinct. Among these distinct words, 14,376 appear only once from beginning to end, 4,343 appear only twice, and so on; the number of words appearing each given number of times was counted. The words that appear infrequently in the complete works are Shakespeare's rare words. Based on these data, and assuming that the 429-word "Taylor poem" was written by Shakespeare, they estimated how many words in the poem would never have appeared in the complete works (that is, new words), how many would have appeared there exactly once, exactly twice, and so on up to 99 times. The actual counts matched the estimates very well.
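The kind of estimate described above can be sketched in a few lines. The snippet below assumes an alternating-series estimator of the Good-Toulmin type for the expected number of new words, truncated to the two frequency counts quoted in the text. The function name and the truncation are illustrative assumptions, not the full procedure of Efron and Thisted.

```python
# Sketch: expected number of previously unseen words in a new text.
# Assumption: alternating-series estimator sum_x (-1)^(x+1) * n_x * t^x,
# where n_x is the number of distinct words used exactly x times so far
# and t is the size of the new text relative to the known corpus.

def expected_new_words(counts_by_frequency, new_text_size, corpus_size):
    t = new_text_size / corpus_size
    total = 0.0
    for x, n_x in sorted(counts_by_frequency.items()):
        total += (-1) ** (x + 1) * n_x * t ** x
    return total

# Only the two counts quoted in the text are used; higher-order terms
# contribute very little here because t is tiny.
shakespeare_counts = {1: 14376, 2: 4343}
estimate = expected_new_words(shakespeare_counts, 429, 884647)
print(f"expected new words in a 429-word poem: {estimate:.2f}")
```

With these inputs the estimate comes out at roughly seven new words, which is the kind of prediction that can then be compared with the count actually observed in the poem.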

But is this enough? Could it be that the poets of that era all had similar habits of word usage? To check, the two statisticians took one poem from each of three poets who were roughly contemporary with Shakespeare, together with four other poems by Shakespeare, and compared them with the Taylor poem. Three statistical tests showed that for the three poems by other poets, under the assumption that Shakespeare wrote them, the actual counts of rare words did not match the estimates. For the four poems by Shakespeare there were occasional discrepancies, but the results were generally acceptable. Efron and Thisted said their analysis could not prove conclusively that the "Taylor poem" was written by Shakespeare, but its use of rare words was surprisingly consistent with Shakespeare's body of work.

The literary debate quickly subsided once the statisticians had spoken. No wonder the article was called an ode to statistics. Using statistical methods to make decisions reflects an objective and reasonable way of thinking. Rather than arguing subjectively about whether the style is the same, it is better to decide by objective statistical methods. But how can we be objective enough? Efron and Thisted did not examine the "Taylor poem" alone; they also compared it with poems by several of Shakespeare's contemporaries, which was the safer course. If the poets of Shakespeare's period had all shared similar habits in using uncommon words, as a fashion of the time, the test would have had no value as evidence.

Statistics, just like our thinking, must be objective, otherwise we would be deceiving ourselves and others. Conversely, if our thinking is statistical, it will be highly objective.

William J. Sutherland, a professor at the University of Cambridge in the United Kingdom, and his co-authors published an article in Nature in 2013 titled "Twenty tips for interpreting scientific claims". After reading it, I found that the points it raises are all related to statistical thinking.

Statistics is one of the most important tools in modern scientific research. The famous British biologist Galton once said: "Statistics has an extraordinary ability to deal with complex problems. When scientific explorers meet obstacles along the way, only statistics can help them open a path." When using the conclusions of scientific research to support real-life decisions, you must have good statistical thinking in order to keep a clear view of those conclusions and to interpret the scientific truth behind them more accurately.

In the era of big data, information shortage has given way to information overload, and the crisis of information scarcity has given way to the difficulty of filtering information. Against this background, scientific method has become a required course for everyone. In a world that relies more and more on data, only by establishing correct statistical thinking can we process and analyze data effectively. The growing importance of statistics bears out the prediction of the British science fiction writer H.G. Wells: "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."

Statistics is widely used in various disciplines, from the natural sciences to the humanities and social sciences, and even in decision-making in industry, commerce and government. As a tool for understanding nature and society, statistics studies the quantitative relationships between objective phenomena and helps policymakers understand the role of scientific evidence in decision-making. As Fisher, the founder of modern statistics, said: "The unique aspect that brought human progress to the 20th century is statistics. The ubiquity of statistics and its application in opening up new fields of knowledge have far exceeded any technological or scientific invention of the 20th century."

Ma Yinchu once said: "Scholars cannot study without statistics, industrialists cannot practice without statistics, and politicians cannot govern without statistics." Statistical thinking is the mode of thinking displayed in the process of obtaining data, extracting information from data, and demonstrating the reliability of conclusions, and it plays a huge role in improving human cognition. Whether in scientific investigation that unravels the mysteries of nature, identifying the authors of early anonymous literary works, dating archaeological artifacts, resolving court disputes, or making the best decisions, statistical thinking plays an irreplaceable role.

Statistics is a form of understanding that moves from experience to reason, and it is a science that discovers regularities through chance. It is not just a method or technique; it also contains elements of a worldview, a way of looking at the myriad things of the world. This is what people mean when they talk about how something looks from a statistical point of view. Developing statistical thinking requires not only learning specific pieces of knowledge, but also connecting that knowledge into an organic, clear picture from a developmental perspective, so as to gain a sense of historical depth. As the German scholar Schlözer once said, "Statistics is dynamic history, and history is static statistics."

From a statistical point of view, the knowledge people obtain from experience or experiments contains various kinds of uncertainty, and statistics focuses on measuring the uncertainty contained in this knowledge. Once the uncertainty can be measured, people's knowledge expands and their understanding of the world moves forward. This process is repeated again and again; it is the process by which human knowledge accumulates. No wonder someone concluded:

In the ultimate analysis, all knowledge is history: the knowledge we have now is the summary and derivative of things discovered in the past;

In an abstract sense, all science is mathematics: all knowledge can be summarized as mathematical reasoning and operations;

On the basis of rationality, all judgments originate from statistics: all judgments are summaries of past patterns, that is, judgments of future trends based on past data and probability models.

2. What is statistical thinking and its common methods

First, let's look at what exactly statistics does.

Finding regularity in randomness is the basic idea of statistics and also its charm.

Simply put, the two core concepts of statistics are probability and error.

Most of the knowledge we learned in middle school deals with certainty. One is one, with no error. Once a proposition is proved, it is correct forever, without exception, unless someone finds a flaw in the proof. In statistics, randomness is everywhere. Statistics allows for error; indeed, the complete absence of error would make one suspect that something has been falsified. Statistics will also give a firm guarantee about a problem, but its guarantees are all stated in terms of probability. Moreover, the guaranteed probability is not 100%, and the guarantee itself comes with a margin of error. Statistics is full of "uncertainty". For example, the claim that there is a 95% probability that the volume of a certain drink is between 425 ml and 431 ml is a typical statistical guarantee. Statistics represents a way of seeing the world.

In a random world, the truth is often hard to pin down. Everything is a hypothesis, and it depends on which one you are willing to accept.

Acceptance here is like the moment at a wedding when the bride nods and says "I do": it does not mean that the groom is truly the most suitable person for her, only that she is willing to accept him for now. Similarly, in statistics, acceptance does not mean true, and rejection does not mean false. A statistician's judgment always comes with a stated error; it is statistical inference that allows for error.

Probability and error are the two pillars of statistical thinking, and almost all the key ideas in statistics develop from them.

There is a certain correspondence between the methods in statistics and people’s way of thinking. Below we will list some common ways of thinking in statistics.

(1) Be good at using data.

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." This is a famous line spoken by Sherlock Holmes in the novels.

Without compass and square you cannot draw circles and squares, without clay you cannot make bricks, and without data you cannot make decisions.

From a few clues at a murder scene, Holmes can infer that the suspect may be left-handed or may have passed through an orchard. Physiognomists and fortune tellers also rely on data. They collect the fortunes of many people with different faces, horoscopes and so on, and after "reading many people" they find it easy to predict a person's future from his face. Haven't those who are good at seeing through human nature also observed a great many people? Making decisions requires data, and every piece of data may carry useful information. To exercise your skills as a statistician, you must make good use of information. For statisticians, data is like rice to rats.

(2) You must have a mind that is good at capturing uncertainty

Inevitability and randomness are intertwined in the workings of the universe. For example, we know that Halley's comet approaches Earth about every 76 years (this is inevitability). Yet although we can know what will happen 76 years from now, will it rain tomorrow? That is not so certain (randomness). As another example, let go of a coin in your hand. We learned in middle school physics that, if air resistance is ignored, the time the coin takes to fall to the ground from a fixed height is a fixed value. But which side will face up after it lands? That is impossible to predict. This is uncertainty.

People know roughly what will happen in the future and when, but they cannot grasp it fully. In a random world, inevitability makes people willing to prepare well in advance, while uncertainty fills them with hope or fear for the future. In a world with only inevitability and no change, there would be no hope for the future and people would lose the motivation to work hard. In a world with only randomness and luck, people would lose the will to be positive and serious. Three parts fate, five parts hard work, and two parts luck: this is the great design of the Creator.

Since uncertainty exists, all we can do is try to understand it and, in many cases, try to reduce it. For this purpose, our predecessors worked out a number of laws that govern uncertainty in a random world. One example is the law of large numbers; another important law of randomness is the central limit theorem.
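Both laws can be watched at work in a small simulation. The sketch below assumes a fair coin and uses a fixed random seed; the function name and the sample sizes are arbitrary choices made for illustration.

```python
import random
import statistics

def sample_mean_of_flips(n_flips, rng):
    # Fraction of heads in n_flips tosses of a fair coin.
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Law of large numbers: the fraction of heads settles near 0.5 as n grows.
for n in (10, 1000, 100000):
    print(n, sample_mean_of_flips(n, rng))

# Central limit theorem: the means of many samples of size 100 cluster
# around 0.5 with a spread close to sqrt(0.25 / 100) = 0.05.
means = [sample_mean_of_flips(100, rng) for _ in range(2000)]
print("mean of sample means:", statistics.mean(means))
print("std of sample means:", statistics.stdev(means))
```

Plotting a histogram of `means` would show the roughly bell-shaped pattern that the central limit theorem predicts, even though each individual toss has only two outcomes.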

Making predictions and estimates in statistics is essentially generalizing from the part to the whole. Although only a part is observed, it is still possible to say something about the whole. This is the skill of statisticians.

(3) Have a mindset that trusts probability

The mathematician Pierre-Simon Laplace once said, "Most of the most important questions in life are just a matter of probability." In a random world, everyone is familiar with the word probability, but not many people truly understand what it means.

What exactly is the meaning of probability? In situations such as throwing dice or drawing lots, we usually interpret probability in terms of "equal likelihood." That is, there are 6 sides of the dice, and the probability of each side appearing is considered to be 1/6. This explanation is quite applicable in daily life. When there is no other information, it is often assumed that every possible outcome has an equal probability of occurring.

The second way is to explain probability in terms of relative frequency. For example, if a professional basketball player's shooting percentage in the past was 0.527, the probability that he makes his next shot is taken to be about 0.527. This common interpretation of probability is relatively objective. Its theoretical basis is the law of large numbers, and it applies to phenomena that can be observed repeatedly.

The last way is subjective probability. For example, the probability that the Brazilian team wins the World Cup, or the probability of successfully courting a certain girl, is a subjective probability. Such events are one-time events and cannot be observed repeatedly.

The above three interpretations of probability are sometimes used interchangeably, or verify each other.

There are also small-probability events. Something you at first thought impossible will almost surely happen if you observe enough times. Some people call this the law of truly large numbers. When a small chance meets a large sample, its occurrence should not be too surprising.
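The arithmetic behind this is short. The sketch below assumes independent trials that all have the same probability, and the one-in-a-million figure is only an illustrative value.

```python
def prob_at_least_once(p, n):
    # Probability that an event with per-trial probability p
    # occurs at least once in n independent trials.
    return 1.0 - (1.0 - p) ** n

one_in_a_million = 1e-6  # illustrative value, not taken from the text
for trials in (1_000, 1_000_000, 5_000_000):
    print(trials, prob_at_least_once(one_in_a_million, trials))
```

For a handful of trials the event stays very unlikely, but once the number of trials is several times larger than one over the probability, seeing it at least once becomes the expected outcome.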

In a random world, believe in probability rather than challenge probability.

(4) Have a reasonable estimation mentality

Once upon a time there was a child who sold fried dough sticks, and he always put the money he earned in the same basket as the dough sticks. One day he urgently needed to relieve himself, so he set the basket on a big stone and went off. When he came back a little later, it was like a bolt from the blue: all the money in the basket was gone. Crying, he ran to tell the county magistrate. After hearing the story, the magistrate had the stone brought in for questioning. Despite repeated threats, the stone did not say a word. The magistrate grew angry and ordered it beaten with sticks, but even when the sticks broke, the stone still said nothing. The onlookers burst out laughing. The magistrate became even angrier and fined each onlooker two copper coins, to be thrown into a basin of water. Suddenly, the magistrate pointed at one man and said, "You are the one who stole the money." The man protested his innocence, and everyone was puzzled. The magistrate explained: "The child sells fried dough sticks, so his money was all stained with oil. When the others threw their coins into the water, no oil floated up. Only when this man threw in his coins did oil float up. Clearly, this man stole the money." The man bowed his head and confessed, and everyone was convinced.

The magistrate's reasoning works on the same principle as questioning the naughtiest student first when a classroom window is broken: when choosing among several possibilities, give priority to the most likely one. Can this lead to mistakes? Of course it can. Does oil on a man's coins prove that he stole the money from the child selling fried dough sticks? If someone had received change from a fried dough stick seller, wouldn't his coins be stained with oil too?

However, this method that people often use when making choices is effective. From the perspective of statistical thinking, it is the famous method of maximum likelihood, which determines the estimated value based on the one with the highest probability of occurrence. This method has many good properties and often yields good estimators.

In the NBA, the American professional basketball league, every team wins some games and loses others, and it is hard to say which team is the strongest. In the regular season each team plays 82 games, and the 8 teams with the highest winning percentage in each conference advance to the playoffs. The winning percentage is the number of games won divided by the number of games played. To keep the games worth watching, the NBA uses a draft mechanism so that the teams do not differ too much in strength. Sometimes even the team ranked first for the season has a winning percentage below 60%. Using the winning percentage over many games in a season to decide which teams are stronger this year and which can enter the playoffs is common practice in professional sports. Likewise, when estimating the probability that a certain operation succeeds, or the probability of giving birth to triplets, this idea of estimating by relative frequency is often used.
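The relative-frequency estimate and the maximum likelihood idea meet in a simple case. The sketch below assumes that games are independent with a constant win probability (a binomial model) and uses a hypothetical season record; the function names and the grid search are illustrative choices, not a standard library routine.

```python
import math

def log_likelihood(p, wins, games):
    # Log of the binomial likelihood, dropping the constant binomial coefficient.
    return wins * math.log(p) + (games - wins) * math.log(1.0 - p)

def mle_by_grid(wins, games, steps=1000):
    # Try p = 1/steps, 2/steps, ..., (steps-1)/steps and keep the most likely one.
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, wins, games))

wins, games = 49, 82  # hypothetical season record, for illustration only
p_hat = mle_by_grid(wins, games)
print("grid maximum likelihood estimate:", p_hat)
print("winning percentage:", wins / games)
```

Under this model the value of p that makes the observed record most likely sits right next to the winning percentage, which is why dividing wins by games is a reasonable estimate of a team's strength.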

As statistics has developed, a great many estimation methods have come to compete with one another. These reasonable methods each have their own advantages and suit particular situations; no single method is always the best. For example, sometimes giving a range describes the situation more clearly. This is the well-known method of confidence interval estimation.
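A confidence interval for a mean can be sketched as follows. The measurements are hypothetical, and the interval uses a normal approximation; with only ten values a t-based interval would be somewhat wider, so this should be read as an illustration of the idea rather than a recommended procedure.

```python
import math
import statistics

def mean_confidence_interval(values, confidence=0.95):
    # Normal-approximation interval: mean +/- z * s / sqrt(n).
    n = len(values)
    center = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return center - z * std_err, center + z * std_err

# Hypothetical measured volumes (ml) of ten bottles; not real data.
volumes_ml = [427.1, 428.4, 426.9, 429.0, 428.2, 427.6, 428.8, 427.3, 428.5, 428.0]
low, high = mean_confidence_interval(volumes_ml)
print(f"approximate 95% confidence interval: {low:.2f} ml to {high:.2f} ml")
```

The output is a range rather than a single number, and the stated confidence level describes how often intervals built this way would cover the true mean over many repeated samples.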

(5) Have a hypothesis-testing mindset based on the presumption of innocence

People often seek fairness or impartiality. Take the simple example of dividing a cake between two people. If neither is willing to take the smaller piece, what is a good way to divide it? It should be a way in which neither feels shortchanged: one person cuts, and the other chooses first. It is best to draw lots to decide who cuts, because the one who chooses may feel that he gets more than half, while the one who cuts feels that he gets only half.

The principle of presumption of innocence is similar to the "I cut, you choose" principle. It is a way of reaching a verdict that both the prosecution and the defendant can regard as fairer.

In 1933, the Polish statistician Neyman and the British statistician Pearson proposed the famous Neyman-Pearson lemma, which established the principle of presumption of innocence in statistics. This is hypothesis testing.

The English word "hypothesis" derives from the ancient Greek hypotithenai and is also the word used for a scientific hypothesis. In mathematics, we often prove that a statement is true or false. But in a random world, many phenomena can only be regarded as hypotheses, and it depends on which one you are more willing to accept. Accepting a hypothesis does not mean that it is true, and rejecting it does not mean that it is false. After a statistical test, whichever hypothesis is accepted, it does not become a law. A hypothesis will always be a hypothesis.
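A small test shows what accepting a hypothesis means in practice. The sketch below assumes a hypothetical experiment of 60 heads in 100 coin flips and computes an exact two-sided binomial p-value; the function names, the data, and the 0.05 level are all illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def two_sided_p_value(heads, flips, p0=0.5):
    # Sum the probabilities of all outcomes that are no more likely
    # than the observed one when the null hypothesis p = p0 holds.
    observed = binomial_pmf(heads, flips, p0)
    return sum(
        binomial_pmf(k, flips, p0)
        for k in range(flips + 1)
        if binomial_pmf(k, flips, p0) <= observed * (1 + 1e-9)
    )

p_value = two_sided_p_value(60, 100)  # hypothetical data: 60 heads in 100 flips
alpha = 0.05
decision = "reject" if p_value < alpha else "do not reject"
print(f"p-value = {p_value:.4f}; {decision} the hypothesis of a fair coin at level {alpha}")
```

Here the fair-coin hypothesis plays the role of the defendant who is presumed innocent: the evidence falls just short of the chosen level, so the hypothesis is not rejected, yet nothing in the calculation shows that the coin really is fair.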

3. Conclusion

Mr. Chen Xiru said in the preface to his "A Brief History of Mathematical Statistics": "Statistics is not only a method or technology; it also contains elements of a worldview, a way of looking at everything in the world. We often talk about how something looks from a statistical point of view, but statistical thinking also has a process of development. You need not only to learn specific pieces of knowledge, but also to connect this knowledge into an organic, clear picture from a developmental perspective and gain a sense of historical depth."

Statistical thinking cannot be established overnight. If there is any secret to it, it is to study, practice, study again, practice again, and keep studying and practicing.
