How To Generate Diagnostic Plots With Statsmodels For Statology

Leo Migdal

-Dec 4, 2025, 5:15 AM


Regression analysis helps us understand the relationship between variables. However, after fitting a model, we need to check whether it meets key assumptions. Diagnostic plots help us assess these assumptions visually. These plots check for patterns in residuals, normality, and influential points. In this article, we will learn how to create diagnostic plots using the statsmodels library in Python. Diagnostic plots are used to evaluate the validity of regression models by checking assumptions such as linearity, constant variance of the residuals, and normality of the residuals.

First, ensure you have the necessary libraries installed; they can be installed with pip. We will use NumPy, pandas, statsmodels, Matplotlib, and Seaborn. In real life, the relationship between the response and the predictor variables is seldom linear. Here, we use the outputs of statsmodels to visualise and identify potential problems that can arise from fitting a linear regression model to a non-linear relationship. The main aim is to reproduce the visualisations discussed in the Potential Problems section (Chapter 3.3.3) of An Introduction to Statistical Learning (ISLR) by James et al., Springer.

First, we load the Advertising data from Chapter 2 of ISLR and fit a linear model to it. The fitted model supplies the quantities from which the diagnostic plots are generated one by one, starting with a graphical tool for identifying non-linearity. Diagnostic plots are essential tools for evaluating the assumptions and performance of regression models. In the context of linear regression, these plots help identify potential issues such as non-linearity, non-constant variance, outliers, high leverage points, and collinearity.

The statsmodels library in Python provides several functions to generate these diagnostic plots, aiding in assessing model fit and validity. Two common ones are plot_partregress_grid() and plot_regress_exog(). Both work with a fitted regression results object. plot_partregress_grid() generates partial regression plots for all explanatory variables in the model. Each panel shows the relationship between the response and one explanatory variable after the effect of the other explanatory variables has been removed.

plot_partregress_grid() takes the fitted results object as its first argument. The plot_regress_exog() function generates a set of regression plots for one specific independent variable, including a plot of the residuals against that variable. This can help check the assumption of linearity with respect to a particular predictor. Building predictive models in Python is exciting, but how do you know if your model is truly reliable? This is where model diagnostics come in. They are crucial for validating assumptions and ensuring your model's findings are trustworthy.

In this post, we'll dive deep into performing model diagnostics using statsmodels, a powerful Python library. We'll cover essential checks for regression models, helping you build more robust and accurate predictions. Ignoring model diagnostics can lead to misleading conclusions and poor decision-making. Every statistical model, especially Ordinary Least Squares (OLS) regression, relies on certain assumptions about the data. Violating these assumptions can result in biased coefficients, incorrect standard errors, and ultimately, unreliable predictions. Proper diagnostics help you identify and address these issues proactively.

Before diving into the diagnostics, let's quickly review the core assumptions of OLS regression. Understanding these helps you interpret the diagnostic plots and tests. This page covers the statistical tests and diagnostics available in the statsmodels library. These tests help you validate model assumptions, detect specification issues, and evaluate goodness-of-fit. For information about model specification and fitting, see Regression and Discrete Choice Models. Regression diagnostics are tests and procedures used to evaluate the assumptions underlying regression models.

Heteroskedasticity tests check whether the variance of the errors is constant across observations. Autocorrelation tests check whether the residuals are correlated with their own lagged values. Normality tests check whether the residuals, or the data, follow a normal distribution. We can use a utility function to load any R dataset available from the great Rdatasets package. Influence plots show the (externally) studentized residuals vs. the leverage of each observation as measured by the hat matrix.

Externally studentized residuals are residuals that are scaled by their standard deviation, where \(\operatorname{var}(\hat{\epsilon}_i) = \hat{\sigma}_i^2 (1 - h_{ii})\). Here \(\hat{\sigma}_i^2\) is the estimate of the error variance obtained with observation \(i\) left out, which has \(n - p - 1\) degrees of freedom, where \(n\) is the number of observations and \(p\) is the number of regressors. \(h_{ii}\) is the \(i\)-th diagonal element of the hat matrix. The influence of each point can be visualized through the criterion keyword argument. Options are Cook’s distance and DFFITS, two measures of influence. Regression analysis is a common method used to predict continuous values, like sales or prices.

Building a regression model is easy, but the results are only reliable if the model’s assumptions are met. Diagnostics help identify issues like multicollinearity, heteroscedasticity, non-linearity, or outliers that can distort predictions. Python provides libraries such as statsmodels, scikit-learn, and matplotlib to visualize and assess these aspects. In this article, we’ll explore how to interpret common regression diagnostics in Python and use them to refine model performance. We’ll use the Diabetes dataset and fit a regression model. Diagnostics help verify whether the assumptions of linear regression are satisfied.

Let’s look at the main plots. The Residuals vs. Fitted Values plot shows whether residuals are randomly scattered around zero. Patterns suggest non-linearity or heteroscedasticity (unequal variance). This very simple case-study is designed to get you up-and-running quickly with statsmodels. Starting from raw data, we will show the steps needed to estimate a statistical model and to draw a diagnostic plot.

We will only use functions provided by statsmodels or its pandas and patsy dependencies. After installing statsmodels and its dependencies, we load a few modules and functions. pandas builds on numpy arrays to provide rich data structures and data analysis tools. The pandas.DataFrame function provides labelled arrays of (potentially heterogeneous) data, similar to the R “data.frame”. The pandas.read_csv function can be used to convert a comma-separated values file to a DataFrame object. patsy is a Python library for describing statistical models and building Design Matrices using R-like formulas.
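As a rough illustration of the R-like formula interface, the sketch below fits a model through statsmodels.formula.api on a synthetic DataFrame. The column names x, group and y are assumptions, and C(group) marks the column as categorical in the formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: synthetic data with one numeric and one categorical predictor.
rng = np.random.default_rng(6)
n = 120
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
offsets = df["group"].map({"a": 0.0, "b": 1.0, "c": -1.0})
df["y"] = 0.5 + 2.0 * df["x"] + offsets + rng.normal(scale=0.5, size=n)

# The formula adds the intercept and dummy-codes the categorical column.
results = smf.ols("y ~ x + C(group)", data=df).fit()
print(results.params)
```

The formula interface builds the design matrix from the DataFrame, so there is no need to add a constant column or create dummy variables by hand.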

This example uses the API interface. See Import Paths and Structure for information on the difference between importing the API interfaces (statsmodels.api and statsmodels.tsa.api) and directly importing from the module that defines the model.

This example file shows how to use a few of the statsmodels regression diagnostic tests in a real-life context. You can learn about more tests and find out more information about the tests here on the Regression Diagnostics page. Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of outputs is always included in the docstring and in the online statsmodels documentation. For presentation purposes, we use the zip(name,test) construct to pretty-print short descriptions in the examples below.

The kurtosis reported by these tests is the sample kurtosis, not the excess kurtosis. A sample from the normal distribution has kurtosis equal to 3. The Durbin-Watson (DW) statistic always ranges from 0 to 4. The closer it is to 2, the less autocorrelation there is in the sample. The Breusch–Godfrey test checks for serial correlation in the residuals up to a chosen number of lags.
