Graduate Certificate in Data Journalism · Guide

Statistics for Data Journalists

The Statistics for Data Journalists course in the Graduate Certificate in Data Journalism relies on a core set of statistical terms. Each is explained below:

Updated 5 May 2026

1. Statistic: A single number that summarizes data from a sample. The corresponding quantity for a whole population is called a parameter. Statistics as a field has two main branches: descriptive and inferential.
   * Descriptive statistics summarize and describe the main features of the data, such as the mean, median, mode, and standard deviation.
   * Inferential statistics make inferences and predictions about a population based on a sample of data.
2. Data: Information that is collected and analyzed to answer research questions. Data can be quantitative or qualitative.
   * Quantitative data is numerical and can be further divided into discrete and continuous data.
   * Qualitative data is non-numerical (categorical) and can be further divided into nominal and ordinal data.
3. Population: The entire group of individuals, events, or things that the researcher is interested in studying.
   * For example, if a researcher is interested in studying the reading habits of all high school students in the United States, then the population is all high school students in the United States.
4. Sample: A subset of the population that is selected to participate in the study.
   * For example, if the researcher surveys 100 high school students to learn about the reading habits of all high school students in the United States, then the sample is the 100 students who were selected.
5. Variable: A characteristic or attribute that is being measured or observed in the data.
   * For example, in a study of reading habits, the variables might include the number of books read in the past year, the amount of time spent reading, and the genre of books preferred.
6. Measurement scale: The level of measurement used to measure a variable. There are four levels of measurement: nominal, ordinal, interval, and ratio.
   * Nominal scale: A categorical scale that assigns names or labels to objects or events, with no inherent order.
   * Ordinal scale: A categorical scale that ranks objects or events in order, without fixing the size of the gaps between ranks.
   * Interval scale: A numerical scale on which differences are meaningful but which has no true zero point, such as temperature in degrees Celsius.
   * Ratio scale: A numerical scale on which differences are meaningful and which has a true zero point, such as age or income.
7. Mean: The average value of a set of data. To calculate the mean, add up all the values and divide by the number of values.
   * For example, the mean of the numbers 2, 4, 6, and 8 is (2 + 4 + 6 + 8) / 4 = 5.
8. Median: The middle value of a set of data when the data is arranged in order of magnitude. When the number of values is even, the median is the average of the two middle values.
   * For example, the median of the numbers 2, 4, 6, and 8 is (4 + 6) / 2 = 5.
9. Mode: The most frequently occurring value in a set of data.
   * For example, the numbers 2, 4, 6, and 8 have no mode because every value occurs once, but in the data set 2, 4, 4, 6, and 8 the mode is 4.
10. Standard deviation: A measure of the spread of a set of data. It measures how much the individual values in a data set typically deviate from the mean.
   * For example, if the mean of a set of data is 5 and the standard deviation is 2, and the data are roughly normally distributed, then about 68% of the values fall between 3 and 7.
11. Correlation: A statistical relationship between two variables that measures the degree to which they move together, usually summarized by a correlation coefficient between -1 and 1.
   * A positive correlation means that as one variable increases, the other variable also tends to increase.
   * A negative correlation means that as one variable increases, the other variable tends to decrease.
12. Regression: A statistical technique used to model the relationship between a dependent variable and one or more independent variables.
   * Simple linear regression models the relationship between a dependent variable and one independent variable.
   * Multiple linear regression models the relationship between a dependent variable and two or more independent variables.
13. Confidence interval: A range of values, calculated from a sample, that is likely to contain the true population parameter with a certain level of confidence.
   * For example, if a 95% confidence interval for the mean of a population is (4, 6), then the interval was produced by a method that captures the true population mean in about 95% of repeated samples. It does not mean there is a 95% probability that the true mean lies in this particular interval.
14. Hypothesis testing: A statistical technique used to test a hypothesis or a claim about a population parameter.
   * A hypothesis test involves setting up a null hypothesis and an alternative hypothesis, collecting data, calculating a test statistic, and comparing the test statistic to a critical value to determine whether to reject or fail to reject the null hypothesis.
15. P-value: The probability of obtaining a test statistic as extreme or more extreme than the one observed, assuming that the null hypothesis is true.
   * A small p-value (conventionally less than 0.05) means the observed data would be unlikely if the null hypothesis were true, and is taken as evidence against the null hypothesis. It is not the probability that the null hypothesis is true.
16. Type I error: A mistake made in hypothesis testing when the null hypothesis is rejected even though it is actually true.
   * The probability of making a Type I error is denoted by alpha (α) and is typically set to 0.05.
17. Type II error: A mistake made in hypothesis testing when the null hypothesis is not rejected even though it is actually false.
   * The probability of making a Type II error is denoted by beta (β). Studies are commonly designed to keep β at or below 0.20.
18. Power: The probability of correctly rejecting the null hypothesis when the alternative hypothesis is true.
   * Power is equal to 1 minus the probability of making a Type II error (1 - β).
19. Sampling distribution: The distribution of a sample statistic, such as the mean or proportion, when repeated samples of the same size are drawn from the same population.
   * For many statistics, including the sample mean and sample proportion, the sampling distribution approaches a normal distribution as the sample size increases.
20. Central Limit Theorem: A fundamental theorem in statistics that states that the sampling distribution of the mean of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution, regardless of the shape of the population distribution.
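The descriptive measures defined above (mean, median, mode, and standard deviation) can be checked with a short Python sketch. It uses only the standard-library `statistics` module and the small data sets from the definitions. The variable names are illustrative. Note that `stdev` divides by n - 1 (sample standard deviation) while `pstdev` divides by n (population standard deviation); the guide does not say which one the course uses.

```python
import statistics

values = [2, 4, 6, 8]          # data set used in the definitions
with_repeat = [2, 4, 4, 6, 8]  # variant with one repeated value

mean_value = statistics.mean(values)          # (2 + 4 + 6 + 8) / 4
median_value = statistics.median(values)      # even count: average of 4 and 6
modes_unique = statistics.multimode(values)   # all values tie when nothing repeats
mode_repeat = statistics.mode(with_repeat)    # the single most frequent value

sample_sd = statistics.stdev(values)          # divides by n - 1
population_sd = statistics.pstdev(values)     # divides by n

print("mean:", mean_value)
print("median:", median_value)
print("modes (no repeats):", modes_unique)
print("mode (with repeat):", mode_repeat)
print("sample sd:", round(sample_sd, 3))
print("population sd:", round(population_sd, 3))
```

When every value occurs once, `multimode` returns all of them, which is how the "no mode" case shows up in code.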

To apply these concepts in practice, consider the following example:

Suppose a data journalist is interested in studying the relationship between the number of hours of exercise per week and the level of happiness among college students. The data journalist collects data from a sample of 100 college students and calculates the mean number of hours of exercise per week and the mean level of happiness. The data journalist also calculates the correlation between the two variables and performs a regression analysis to model the relationship between them.
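As a rough sketch of the correlation and regression step, the Python below computes a Pearson correlation coefficient and a least-squares line by hand. The eight data points are invented purely for illustration (they are not survey results, and the scenario above uses 100 students), and the function names `pearson_r` and `fit_line` are hypothetical helpers, not part of any course material.

```python
from math import sqrt

# Hypothetical illustrative data, NOT real survey results.
hours = [0, 1, 2, 3, 4, 5, 6, 7]       # hours of exercise per week
happiness = [4, 5, 5, 6, 6, 7, 8, 8]   # self-reported happiness score


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of spreads."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


r = pearson_r(hours, happiness)
slope, intercept = fit_line(hours, happiness)

print("mean hours:", mean(hours))
print("mean happiness:", mean(happiness))
print("correlation r:", round(r, 3))
print("fitted line: happiness = %.3f + %.3f * hours" % (intercept, slope))
```

The slope is read as the estimated change in the happiness score for each additional weekly hour of exercise in this made-up sample.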

The data journalist then tests the null hypothesis that there is no relationship between the number of hours of exercise per week and the level of happiness against the alternative hypothesis that the relationship is positive. The data journalist calculates a test statistic and compares it to a critical value to determine whether to reject or fail to reject the null hypothesis.
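One common way to carry out the test described above is a one-sided t-test on the correlation coefficient, using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below assumes that approach; the source does not name a specific test. The data are the same invented eight points (not real results), and the critical value 1.943 is the standard t-table entry for a one-sided test at alpha = 0.05 with 6 degrees of freedom. A sample of 100 students would need the table entry for 98 degrees of freedom instead.

```python
from math import sqrt

# Hypothetical illustrative data, NOT real survey results.
hours = [0, 1, 2, 3, 4, 5, 6, 7]
happiness = [4, 5, 5, 6, 6, 7, 8, 8]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(happiness) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, happiness))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in happiness)
r = sxy / sqrt(sxx * syy)

# Test statistic for H0: no linear relationship (correlation is zero).
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# One-sided critical value at alpha = 0.05, df = n - 2 = 6 (standard t table).
T_CRITICAL = 1.943

reject_null = t_stat > T_CRITICAL

print("t statistic:", round(t_stat, 2))
print("critical value:", T_CRITICAL)
print("reject null hypothesis:", reject_null)
```

Because the alternative hypothesis is a positive relationship, only large positive values of the test statistic count as evidence against the null hypothesis.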

The data journalist also calculates a 95% confidence interval for the slope of the regression line to show how precisely the data pin down the change in happiness associated with each additional hour of exercise. Finally, the data journalist interprets the results, keeping in mind that a correlation in survey data does not by itself show that exercise causes happiness, and writes a story that highlights the key findings.
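A confidence interval for the slope can be sketched with the textbook formula slope ± t* × SE(slope), where SE(slope) is the residual standard error divided by the square root of the sum of squared deviations of x. The code below is one possible implementation under the usual linear-regression assumptions, again on the invented eight-point data set (not real results). The multiplier 2.447 is the standard t-table value for a two-sided 95% interval with 6 degrees of freedom.

```python
from math import sqrt

# Hypothetical illustrative data, NOT real survey results.
hours = [0, 1, 2, 3, 4, 5, 6, 7]
happiness = [4, 5, 5, 6, 6, 7, 8, 8]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(happiness) / n

sxx = sum((x - mean_x) ** 2 for x in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, happiness))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares: how far the points sit from the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, happiness))

# Standard error of the slope, using n - 2 degrees of freedom.
se_slope = sqrt(sse / (n - 2)) / sqrt(sxx)

# Two-sided 95% t multiplier for df = 6 (standard t table).
T_MULTIPLIER = 2.447

lower = slope - T_MULTIPLIER * se_slope
upper = slope + T_MULTIPLIER * se_slope

print("slope:", round(slope, 3))
print("95%% CI for slope: (%.3f, %.3f)" % (lower, upper))
```

An interval that excludes zero is consistent with rejecting the null hypothesis of no relationship at the matching significance level.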

In summary, understanding key terms and vocabulary in statistics is essential for data journalists to analyze and interpret data effectively. By applying these concepts in practice, data journalists can uncover insights and tell compelling stories that engage and inform their audiences.

