Linear Regression in Python Using R's Formula Style

Leo Migdal

-Dec 4, 2025, 5:19 AM


In academic statistics, the dominant programming language is R, and it was my first language for implementing regression models. If you are comfortable with its formula syntax, I have some good news for you: you can use a similar syntax to run linear regression (and other generalized linear models) in Python. In this article, I will walk through an example of how to do this. In Python, the statsmodels package contains many useful modules and functions for statistical analysis. There are two broad ways to specify a model with it:

statsmodels.api uses a syntax that is based on matrices

statsmodels.formula.api uses a syntax that is based on formulas

To mirror the regression formulas in R, you need to use statsmodels.formula.api. Python is popular for statistical analysis because of its large number of libraries. One of the most common statistical calculations is linear regression. statsmodels offers some powerful tools for regression and analysis of variance. Here's how to get started with linear models.

statsmodels is a Python library for running common statistical tests. It's especially geared toward regression analysis, particularly the kind you'd find in econometrics, but you don't have to be an economist to use it. It does have a learning curve, but once you get the hang of it, you'll find that it's a lot more flexible than the regression functions in a spreadsheet program like Excel. It won't make the plot for you, though. If you want to generate the classic scatterplot with a regression line drawn over it, you'll want to use a library like Seaborn.

One advantage of using statsmodels is that it's cross-checked against other statistical software packages like R, Stata, and SAS for accuracy, so this might be the package for you if you're in professional or... If you just want to determine the relationship between a dependent variable (y), or the endogenous variable in econometric and statsmodels parlance, and the exogenous, independent, or "x" variable, you can do this...

I’ve been working with statistical models in Python for years, and one feature that transformed how I approach regression analysis is statsmodels’ R-style formula syntax. Coming from R, I appreciated having a familiar, readable way to specify models without manually constructing design matrices. Let me show you how this works and why it matters for your statistical modeling workflow. Statsmodels has allowed users to fit statistical models using R-style formulas since version 0.5.0, using the patsy package internally to convert formulas and data into matrices for model fitting.
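To see what "converting formulas and data into matrices" means, the sketch below calls patsy directly. It assumes a statsmodels installation that ships with patsy, and the tiny data frame is made up purely for illustration.

```python
import pandas as pd
from patsy import dmatrices

# Made-up data for illustration only.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [0.5, 1.5, 2.5, 3.5]})

# dmatrices splits the formula at "~" and returns the response and the
# design matrix; return_type="dataframe" keeps the column labels.
y, X = dmatrices("y ~ x", data=df, return_type="dataframe")

print(X)  # an Intercept column of ones plus the x column
print(y)  # the response as a one-column data frame
```

This is roughly the step that the formula interface performs for you before handing the matrices to the model.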

The formula syntax provides an intuitive, readable way to specify relationships between variables. At its core, the formula interface uses string notation to describe your model. Instead of creating arrays and matrices manually, you write something like “sales ~ advertising + price” and statsmodels handles the rest. The tilde (~) separates your dependent variable on the left from independent variables on the right, while the plus sign (+) adds variables to your model. The formula API lives in statsmodels.formula.api, which you import separately from the standard API. Lower case model functions like ols() accept formula and data arguments, while upper case versions take endog and exog design matrices.
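As a minimal sketch of the formula described above, the following fits "sales ~ advertising + price" on synthetic data. The numbers are generated for illustration and do not represent any real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: the true coefficients (3 and -1.2) are arbitrary choices.
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "advertising": rng.uniform(0, 10, n),
    "price": rng.uniform(5, 15, n),
})
df["sales"] = 20 + 3 * df["advertising"] - 1.2 * df["price"] + rng.normal(0, 1, n)

# Left of "~" is the dependent variable; "+" adds predictors on the right.
fit = smf.ols("sales ~ advertising + price", data=df).fit()
print(fit.summary())
```

The summary table is laid out much like the output of summary(lm(...)) in R, with one row per term in the formula.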

I prefer the formula approach because it keeps my code readable and reduces preprocessing steps. The standard API provides dataset loading and other utilities, while formula.api gives you access to formula-compatible model functions. I always import both because statsmodels.formula.api doesn’t include everything you might need.

I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and I remember it being a bit complicated. Luckily, it’s become really easy now, and I’ll show you just how easy. Before running this you will need to install the pandas, statsmodels and patsy packages.

If you’re using conda, you should be able to do this by running the following from the terminal (and then say yes when it asks you to confirm it):

Before we can do any regression, we need some data, so let’s read some data on cars:

You may have noticed from the code above that you can just give a URL to the read_csv function and it will download it and open it. Handy!

Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting.

The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs: Notice that we called statsmodels.formula.api in addition to the usual statsmodels.api. In fact, statsmodels.api is used here only to load the dataset. The formula.api hosts many of the same functions found in api (e.g. OLS, GLM), but it also holds lower case counterparts for most of these models.
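As a hedged illustration of a few things the formula language can express, the sketch below uses a categorical term, an inline transformation, and intercept removal. The data and the column names x, group, and y are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "x": rng.uniform(1, 10, n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
df["y"] = 2 + 0.8 * np.log(df["x"]) + 1.0 * (df["group"] == "b") + rng.normal(0, 0.3, n)

# C() marks a term as categorical (similar to factor() in R), and
# functions such as np.log can be applied inside the formula string.
fit = smf.ols("y ~ np.log(x) + C(group)", data=df).fit()
print(fit.params)

# "- 1" drops the intercept, as it does in R.
no_intercept = smf.ols("y ~ x - 1", data=df).fit()
print(no_intercept.params)
```

The categorical term is expanded into dummy columns with one level used as the reference, so the fitted parameters include entries such as C(group)[T.b].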

In general, lower case models accept formula and data arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. data takes a pandas data frame. dir(smf) will print a list of available models. Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)

To begin, we fit the linear model described on the Getting Started page.
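The generic call signature above can be sketched as follows. The data frame and the region column are hypothetical and only serve to show the formula, data, and subset arguments in use.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only.
rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.normal(size=80), "region": ["north", "south"] * 40})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(scale=0.2, size=80)

# List the public names exposed by the formula API.
models = [name for name in dir(smf) if not name.startswith("_")]
print(models)

# formula and data are the two required arguments.
full_fit = smf.ols(formula="y ~ x", data=df).fit()

# subset restricts the fit to the selected rows, here via a boolean mask.
north_fit = smf.ols(formula="y ~ x", data=df, subset=df["region"] == "north").fit()

print(int(full_fit.nobs), int(north_fit.nobs))
```

Passing subset is roughly equivalent to filtering the data frame yourself before the call; which one reads better is a matter of taste.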

Download the data, subset columns, and list-wise delete to remove missing observations:

You might also be interested in my page on doing Rank Correlations with Python and/or R. This page demonstrates three different ways to calculate a linear regression from Python:

In Python, Gary Strangman's library (available in the SciPy library) can be used to do a simple linear regression as follows:

>>> from scipy import stats
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)
>>> print("Gradient and intercept", gradient, intercept)
...

Typing help(stats.linregress) will tell you about the return values (gradient, y-axis intercept, r, two-tailed probability, and the standard error of the estimated gradient).
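For current SciPy versions, the same calculation can be written as a short script. This sketch reuses the x and y values from the snippet above and reads the results by attribute name instead of unpacking a tuple.

```python
import numpy as np
from scipy import stats

# Same values as in the snippet above.
x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]

# linregress returns a result object; the fields can be read by name.
result = stats.linregress(x, y)

print("Gradient and intercept:", result.slope, result.intercept)
print("r, p-value, std err:", result.rvalue, result.pvalue, result.stderr)
```

Tuple unpacking into five names still works as well, so existing code in that style does not need to change.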

We assume you have loaded the following packages: Below we load more as we introduce more.

In case of simple regression, the task is to find parameters \(\beta_0\) and \(\beta_1\) such that the mean squared error (MSE) is minimized, where MSE is defined as \[\begin{equation} MSE = \frac{1}{n}\sum_i (y_i -... \end{equation}\] Here \(n\) is the number of observations, \(x\) is our exogenous variable, \(y\) is the outcome variable, and \(i\) indexes the observations. Normally we want to use software to perform this optimization (moreover, this problem can be solved analytically), but it is instructive to attempt to solve the problem by hand.

Let us experiment with iris data and estimate the relationship between petal width and length of versicolor flowers.
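The by-hand idea can be sketched as below. The code assumes the usual least-squares form of the criterion, the mean of the squared residuals \((y_i - \beta_0 - \beta_1 x_i)^2\), and it uses synthetic numbers instead of the iris file, so the values are illustrative only.

```python
import numpy as np

# Synthetic stand-in data (not the iris measurements).
rng = np.random.default_rng(3)
x = rng.uniform(3, 5, 50)
y = -0.1 + 0.33 * x + rng.normal(0, 0.1, 50)

def mse(beta0, beta1, x, y):
    """Mean squared error of the line beta0 + beta1 * x (assumed standard form)."""
    return np.mean((y - beta0 - beta1 * x) ** 2)

# "By hand": try a few candidate parameter pairs and compare their MSE.
guesses = [(0.0, 0.0), (0.0, 0.3), (-0.1, 0.33)]
for b0, b1 in guesses:
    print(b0, b1, mse(b0, b1, x, y))

# Analytic solution for simple regression:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
beta1_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat, mse(beta0_hat, beta1_hat, x, y))
```

No hand-picked pair can achieve a lower MSE than the analytic solution, which is a useful check when experimenting with guesses.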

This is one of the most popular statistics and machine learning datasets; the version we use here originates from R datasets. You can download it from the Bitbucket repo of these notes. The dataset itself contains three species, and as their leaves may have different relationships, we keep only one of them, versicolor. Also, as the variable names in this file are not suitable for modeling below, we rename the variable called Petal.Length to plength, and Petal.Width to pwidth. First we load the data, and thereafter filter with rename chained at the end of the filtering operation:
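The filter-then-rename chain might look like the sketch below. Because the actual file is not reproduced here, the data frame is a small stand-in with R-style column names and made-up values; only the column names and the species labels follow the description above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in rows with R-style column names; the values are made up.
iris = pd.DataFrame({
    "Petal.Length": [1.4, 1.3, 4.7, 4.5, 4.9, 4.0, 4.6, 3.3, 6.0, 5.1],
    "Petal.Width": [0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 1.5, 1.0, 2.5, 1.9],
    "Species": ["setosa", "setosa",
                "versicolor", "versicolor", "versicolor",
                "versicolor", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# Keep only versicolor, then rename the columns in the same chain.
versicolor = (
    iris[iris["Species"] == "versicolor"]
    .rename(columns={"Petal.Length": "plength", "Petal.Width": "pwidth"})
)

# The renamed columns can be used directly in a formula.
fit = smf.ols("pwidth ~ plength", data=versicolor).fit()
print(fit.params)
```

The renaming matters because a dot inside a formula term is read as Python attribute access, so a name like Petal.Width would not be looked up as a column.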

Linear regression is a foundational statistical tool for modeling the relationship between a dependent variable and one or more independent variables. It’s widely used in data science and machine learning to predict outcomes and understand relationships between variables. In Python, implementing linear regression can be straightforward with the help of third-party libraries such as scikit-learn and statsmodels. By the end of this tutorial, you’ll understand that:

To implement linear regression in Python, you typically follow a five-step process: import necessary packages, provide and transform data, create and fit a regression model, evaluate the results, and make predictions. This approach allows you to perform both simple and multiple linear regressions, as well as polynomial regression, using Python’s robust ecosystem of scientific libraries.

I use R, but I love Python. However, let’s face it, basic linear regression in R is very straightforward. A few clear and intuitive lines of R code produce textbook output that is informative and complete. This post compares building and analyzing simple linear regression models in R and Python.
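The five-step process mentioned above can be sketched with the statsmodels formula interface as follows. This is one possible arrangement on synthetic data, not the code from any of the tutorials referenced here; scikit-learn would follow the same steps with a different API.

```python
# Step 1: import the packages.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Step 2: provide the data (synthetic, for illustration only).
rng = np.random.default_rng(11)
df = pd.DataFrame({"x": np.linspace(0, 10, 40)})
df["y"] = 4.0 + 1.5 * df["x"] + rng.normal(0, 1.0, 40)

# Step 3: create and fit the model.
results = smf.ols("y ~ x", data=df).fit()

# Step 4: evaluate the results.
print(results.rsquared)
print(results.params)

# Step 5: predict for new values; the new frame needs the same column names
# as the predictors in the formula.
new_data = pd.DataFrame({"x": [11.0, 12.0]})
predictions = results.predict(new_data)
print(predictions)
```

Because the model was specified with a formula, predict applies the same formula to the new data frame, so no design matrix has to be built by hand.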

Let’s look at the data set Earnings.txt from the Data and Story Library. DASL is a great resource for test data. Earnings.txt includes the price, SAT, ACT, and graduate earnings for over 700 US colleges. Exploring the connection between the cost of college and future earnings is interesting in its own right, and the post includes more models than needed for the R/Python comparison (I can’t resist), but the outputs are... Neutral.

I like getting standard deviation. Obviously, you’d do more EDA! Number formatting was not normalized. Is it worth paying more for a fancy college? Here are a few more model views to consider.
