With the first in a new series of articles, Stephen MacDonald returns to the pages of Pathology in Practice. Here, he begins with an introduction to statistical interpretation, focusing on improving the quality of the first statistical encounter with a dataset.
Laboratory professionals spend much of their working lives generating, reviewing, reporting and discussing numerical data. Yet many of the statistical mistakes made in routine laboratory work occur before any formal analysis has begun. A dataset is collected, a spreadsheet is opened, and familiar numbers are produced: a mean, a standard deviation, perhaps a P value later on. The problem is that these outputs are often generated before anyone has stopped to ask a more basic question: what sort of data are these, and what is the most honest way to describe them?
That question matters because laboratory data do not always behave like textbook examples. Some datasets are approximately symmetric and well-behaved, but many are not. They may be right-skewed, constrained by analytical limits, distorted by a small number of extreme observations, or made up of more than one underlying population. Even where a dataset looks fairly simple at first glance, the reason for its shape may not be simple at all. Biological heterogeneity, pre-analytical variation, analytical effects, and errors in data capture can all leave visible marks on the distribution. The reference interval literature has wrestled with this problem for decades, which is why it provides such a useful foundation for the present topic: it recognises that quantitative laboratory data often require careful descriptive and statistical treatment rather than default reliance on Gaussian assumptions.
This article therefore focuses on three connected themes: how laboratory data are distributed, how unusual values should be interpreted, and how confidence intervals help express uncertainty around the quantities we report. These are not advanced topics in the sense of requiring complex mathematics, but they are fundamental. If a dataset is described badly at the outset, everything that follows becomes harder to trust. If the centre and spread are summarised inappropriately, the conclusions may be distorted. If an extreme value is removed too quickly, an important clue may disappear. If a point estimate is reported with no indication of uncertainty, readers may place more confidence in it than it deserves.
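To make the point about centre, spread and uncertainty concrete, the short Python sketch below summarises a right-skewed dataset in two ways, with mean and standard deviation and with median and interquartile range, and then attaches a percentile bootstrap confidence interval to the median. It is an illustration under stated assumptions rather than a recommended procedure: the data are simulated from a log-normal distribution and are not real laboratory results, the function names `describe` and `bootstrap_ci_median` are hypothetical, and the percentile bootstrap is only one of several reasonable ways to obtain an interval.

```python
import random
import statistics


def describe(values):
    """Summarise a dataset with both mean/SD and median/IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


def bootstrap_ci_median(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median.

    Resamples the data with replacement, recomputes the median each time,
    and reads the interval off the sorted bootstrap medians.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = medians[int(alpha * n_resamples)]
    upper = medians[int((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Simulated right-skewed values (assumption: log-normal, 200 observations).
# These are NOT real laboratory results.
rng = random.Random(42)
sample = [rng.lognormvariate(1.0, 0.8) for _ in range(200)]

summary = describe(sample)
ci_low, ci_high = bootstrap_ci_median(sample)

print(f"mean = {summary['mean']:.2f}, SD = {summary['sd']:.2f}")
print(f"median = {summary['median']:.2f}, "
      f"IQR = {summary['q1']:.2f} to {summary['q3']:.2f}")
print(f"95% bootstrap CI for the median: {ci_low:.2f} to {ci_high:.2f}")
```

With skewed data of this kind, the mean is pulled towards the long right-hand tail and sits above the median, which is why the choice of summary deserves a moment's thought before any formal analysis begins.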