R makes it easy to fit a linear model to your data. The hard part is knowing whether the model you’ve built is worth keeping and, if so, figuring out what to do next.
This post covers linear models in R: how to fit them, how to interpret the results of lm, and rules of thumb to help you sidestep the most common mistakes.
Building a linear model in R
R makes building linear models really easy. Things like dummy variables, categorical features, interactions, and multiple regression all come very naturally. The centerpiece for linear regression in R is the lm function.
lm comes with base R, so you don’t have to install any packages or import anything special. The documentation for lm is very extensive, so if you have any questions about using it, just type ?lm into the R console.
For our example linear model, I’m going to use data from the original, or at least one of the earliest, linear regression analyses. The dataset consists of the heights of children and their parents. The term “regression” comes from a 19th-century statistician’s observation that children’s heights tended to “regress” toward the population mean relative to their parents’ heights.
Fit the model to the data by creating a formula and passing it to the lm function. In our case we want to use the parents’ height to predict the child’s height, so the formula is (child ~ parent). In other words, we’re representing the relationship between parents’ heights (X) and children’s heights (y).
We then set the data argument to galton so lm knows which data frame the variables “child” and “parent” come from.
NOTE: Formulas in R take the form (y ~ x). To add more predictor variables, just use the + sign, e.g. (y ~ x + z).
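Putting the pieces above together, the fit looks like this. This is a sketch assuming the classic Galton dataset from the UsingR package, which has one row per child with columns child and parent (heights in inches):

```r
library(UsingR)  # install.packages("UsingR") if needed; provides galton
data(galton)

# child ~ parent: predict the child's height from the parent's height
fit <- lm(child ~ parent, data = galton)
fit  # prints the formula call and the fitted coefficients
```

A multiple-regression version would simply extend the formula, e.g. lm(y ~ x + z, data = df).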
We fit a model to our data. That’s great! But the important question is, is it any good?
There are lots of ways to evaluate model fit, and R consolidates some of the most popular into the summary function. You can invoke summary on any model you’ve fit with lm and get a set of metrics indicating the quality of the fit.
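As a sketch (again assuming the galton data from the UsingR package), calling summary on the fitted model prints the headline diagnostics:

```r
library(UsingR)  # assumes the UsingR package for the galton data
data(galton)

fit <- lm(child ~ parent, data = galton)

# summary() reports, among other things:
#   - the coefficient table (estimates, std. errors, t-values, p-values)
#   - the residual standard error
#   - R-squared and adjusted R-squared
#   - the overall F-statistic
summary(fit)
```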
So if you’re like I was at first, your reaction was probably something like “Whoa this is cool…what does it mean?”
People often wonder how they can include categorical variables in their regression models. With R this is extremely easy. Just include the categorical variable in your regression formula and R will take care of the rest. R calls categorical variables factors. A factor has a set of levels, or possible values. These levels will show up as variables in the model summary.
One very important thing to note is that one of your levels will not appear in the output. When fitting a regression with a categorical variable, one level must be left out as the baseline; otherwise the dummy variables would be perfectly collinear with the intercept. This is often referred to as the dummy variable trap. In our model, Africa is left out of the summary, but it is still accounted for: it is the reference level, and the other coefficients are measured relative to it.
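As a sketch of what that looks like in practice, here I’m assuming the gapminder data frame from the gapminder package, whose continent column is a factor with Africa as its first level:

```r
library(gapminder)  # assumes the gapminder package is installed

# levels() shows the possible values of a factor; Africa comes first
levels(gapminder$continent)

# Include the factor directly in the formula; R expands it into
# dummy variables automatically
fit <- lm(lifeExp ~ gdpPercap + continent, data = gapminder)
summary(fit)

# The summary shows coefficients for continentAmericas, continentAsia,
# continentEurope, and continentOceania, but not Africa: Africa is the
# baseline level absorbed into the intercept, so each continent
# coefficient is an offset relative to Africa.
```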