Statsmodels Fitting Models Using R Style Formulas Askpython

Leo Migdal

-Dec 4, 2025, 5:19 AM

statsmodels fitting models using r style formulas askpython

I’ve been working with statistical models in Python for years, and one feature that transformed how I approach regression analysis is statsmodels’ R-style formula syntax. Coming from R, I appreciated having a familiar, readable way to specify models without manually constructing design matrices. Let me show you how this works and why it matters for your statistical modeling workflow. Statsmodels allows users to fit statistical models using R-style formulas since version 0.5.0, using the patsy package internally to convert formulas and data into matrices for model fitting. The formula syntax provides an intuitive, readable way to specify relationships between variables. At its core, the formula interface uses string notation to describe your model.

Instead of creating arrays and matrices manually, you write something like “sales ~ advertising + price” and statsmodels handles the rest. The tilde (~) separates your dependent variable on the left from independent variables on the right, while the plus sign (+) adds variables to your model. The formula API lives in statsmodels.formula.api, which you import separately from the standard API. Lower case model functions like ols() accept formula and data arguments, while upper case versions take endog and exog design matrices. I prefer the formula approach because it keeps my code readable and reduces preprocessing steps. The standard api provides dataset loading and other utilities, while formula.api gives you access to formula-compatible model functions.

I always import both because statsmodels.formula.api doesn’t include everything you might need. Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs: Notice that we called statsmodels.formula.api in addition to the usual statsmodels.api.

In fact, statsmodels.api is used here only to load the dataset. The formula.api hosts many of the same functions found in api (e.g. OLS, GLM), but it also holds lower case counterparts for most of these models. In general, lower case models accept formula and df arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. df takes a pandas data frame.

dir(smf) will print a list of available models. Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs) To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations: Python is popular for statistical analysis because of the large number of libraries. One of the most common statistical calculations is linear regression.

statsmodels offers some powerful tools for regression and analysis of variance. Here's how to get started with linear models. statsmodels is a Python library for running common statistical tests. It's especially geared for regression analysis, particularly the kind you'd find in econometrics, but you don't have to be an economist to use it. It does have a learning curve. but once you get the hang of it, you'll find that it's a lot more flexible to use than the regression functions you'll find in a spreadsheet program like Excel.

It won't make the plot for you, though. If you want to generate the classic scatterplot with a regression line drawn over it, you'll want to use a library like Seaborn. One advantage of using statsmodels is that it's cross-checked with other statistical software packages like R, Stata, and SAS for accuracy, so this might be the package for you if you're in professional or... If you just want to determine the relation ship of a dependent variable (y), or the endogenous variable in econometric and statsmodels parlance, vs the exogenous, independent, or "x" variable, you can do this... Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy's default is also to include an intercept, so we automatically dropped one of the Region categories.

If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C( ) operator: We have already seen that "~" separates the left-hand side of the model from the right-hand side, and that "+" adds new columns to the design matrix. The "-" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: ":" adds a new column to the design matrix with the interaction of the other two columns. "*" will also include the individual columns that were multiplied together:

Even if a given statsmodels function does not support formulas, you can still use patsy's formula language to produce design matrices. Those matrices can then be fed to the fitting function as endog and exog arguments. There was an error while loading. Please reload this page. Statsmodels organizes its functionality into topic-based subpackages rather than dumping everything into a single namespace. Understanding this structure helps you find the right models quickly and import them efficiently.

The library provides two primary access points: statsmodels.api for general use and statsmodels.formula.api for R-style formula syntax. Beyond these, specialized subpackages contain models, tools, and functions organized by statistical domain. When you import statsmodels.api, you’re not loading the entire library. The API module collects the most commonly used classes and functions from various subpackages and presents them through a clean interface. These imports give you access to regression models, GLMs, time series tools, and statistical tests without navigating the full directory structure. The API makes the most useful items available within one or two attribute levels.

The API doesn’t include every function in the library. Specialized features often require direct imports from their subpackages. Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs:

You can import explicitly from statsmodels.formula.api Alternatively, you can just use the formula namespace of the main statsmodels.api. These names are just a convenient way to get access to each model’s from_formula classmethod. See, for instance All of the lower case models accept formula and data arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula.

data takes a pandas data frame or any other data structure that defines a __getitem__ for variable names like a structured array or a dictionary of variables. Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs: You can import explicitly from statsmodels.formula.api

Alternatively, you can just use the formula namespace of the main statsmodels.api. Or you can use the following conventioin These names are just a convenient way to get access to each model's from_formula classmethod. See, for instance There was an error while loading. Please reload this page.

I’ve built dozens of regression models over the years, and here’s what I’ve learned: the math behind linear regression is straightforward, but getting it right requires understanding what’s happening under the hood. That’s where statsmodels shines. Unlike scikit-learn, which optimizes for prediction, statsmodels gives you the statistical framework to understand relationships in your data. Let’s work through linear regression in Python using statsmodels, from basic implementation to diagnostics that actually matter. Statsmodels is a Python library that provides tools for estimating statistical models, including ordinary least squares (OLS), weighted least squares (WLS), and generalized least squares (GLS). Think of it as the statistical counterpart to scikit-learn.

Where scikit-learn focuses on prediction accuracy, statsmodels focuses on inference: understanding which variables matter, quantifying uncertainty, and validating assumptions. The library gives you detailed statistical output including p-values, confidence intervals, and diagnostic tests. This matters when you’re not just predicting house prices but explaining to stakeholders why square footage matters more than the number of bathrooms. Start with the simplest case: one predictor variable. Here’s a complete example using car data to predict fuel efficiency:

Statsmodels Fitting Models Using R Style Formulas Askpython

People Also Search

I’ve Been Working With Statistical Models In Python For Years,

Instead Of Creating Arrays And Matrices Manually, You Write Something

I Always Import Both Because Statsmodels.formula.api Doesn’t Include Everything You

In Fact, Statsmodels.api Is Used Here Only To Load The

Dir(smf) Will Print A List Of Available Models. Formula-compatible Models