Linear Regression

Overview and Refresher

Week 2 Practice in Lab

Step 1: Set the working directory for reading in files.

setwd('/insert full path to your folder’) Remember, if you can’t get the setwd (set working directory) function to work, try ?setwd to get some additional help.

Step 2: Read csv file from on-line source.

You can read the data set from on-line as the cvs file in R

Ads <- read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv")
#read.csv("/home/jsshingl1/data/Ads")
summary(Ads)

##        X                TV             radio          newspaper     
##  Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75  
##  Median :100.50   Median :149.75   Median :22.900   Median : 25.75  
##  Mean   :100.50   Mean   :147.04   Mean   :23.264   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00  
##      sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00

Step 3a: Estimating the coefficients for simple linear regression

Now we will try to estimate the coefficients of the linear regression line. In order to estimate the coefficients we will use “lm” function. This function is already in the default R setting.

In Ads dataset, we have five variables: X, TV, radio, newspaper, and sales. X is a label of each data point. TV is a budget for TV ads. radio is a budget for ads in ratio. newspaper is a budget for ads in newspapers. They are all in thousands of dollars. We, here, want to ask the following question: Is there a relationship between advertising budget and sales? To answer this question we will try to fit the data with a simple linear regression model. First we will try to see any linear relation between sales and TV ads. Here sales as y variable, response and TV as x variable, explanatory.

lm.fit<-lm(sales~TV, data=Ads)
summary(lm.fit)

## 
## Call:
## lm(formula = sales ~ TV, data = Ads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

If you want to print just estimates of coefficients…

lm.fit<-lm(sales~TV, data=Ads)
coef(lm.fit)

## (Intercept)          TV 
##  7.03259355  0.04753664

If you want to print the confidence intervals for each estimate, type:

confint(lm.fit)

##                  2.5 %     97.5 %
## (Intercept) 6.12971927 7.93546783
## TV          0.04223072 0.05284256

Step 3b: Now let’s plot the regression line with the data points.

The sales are stored in the fourth column of the dataset “Ads” which is a two dimensional array.

head(Ads[,4])

## [1] 69.2 45.1 69.3 58.5 58.4 75.0

#or Ads$sales

Here Ads[,5] means the fifth column of the matrix “Ads”. If you want to access the fifth rows type Ads[5,] instead. Make sure the index starts from 1. NOT 0. In Python it starts 0. These numbers are set as a response variable.

TV is stored in the first column of the matrix “Ads”.

head(Ads$TV)

## [1] 230.1  44.5  17.2 151.5 180.8   8.7

These values are set as an explanatory variable.

In order to plot Ads$TV as y axis and Ads$sales as x axis using plot() function, type:

plot(Ads$TV,Ads$sales)

Then we want to add a regression line in the plot. You will use abline() function

lm.fit<-lm(sales~TV, data=Ads)
plot(Ads$TV,Ads$sales)
abline(lm.fit)

If you want to draw the residual lines as you have seen in the lecture you can type:

lm.fit<-lm(sales~TV, data=Ads)
plot(Ads$TV,Ads$sales)
abline(lm.fit)
pre <- predict(lm.fit) 
segments(Ads$TV,Ads$sales, Ads$TV, pre, col="red")

Step 4: Estimating the coefficients for multiple linear regression

Now we will fit the data to multiple regression model. We will set TV, radio, and newspaper as explanatory variables, X1, X2, X3, and sales as a response variable, Y. To use lm() function for multiple linear regression, you have to set the model. To define we will do type “sales ~ TV+radio+newspaper”.

lm.fit<-lm(sales~TV+radio+newspaper, data=Ads)
summary(lm.fit)

## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = Ads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

For a short cut you can do

lm.fit<-lm(sales~., data=Ads)
summary(lm.fit)

## 
## Call:
## lm(formula = sales ~ ., data = Ads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8105 -0.9008  0.2641  1.1783  2.8336 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0052094  0.3942082   7.623 1.06e-12 ***
## X           -0.0005798  0.0020992  -0.276    0.783    
## TV           0.0457759  0.0013988  32.725  < 2e-16 ***
## radio        0.1883832  0.0086480  21.784  < 2e-16 ***
## newspaper   -0.0012433  0.0059319  -0.210    0.834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.689 on 195 degrees of freedom
## Multiple R-squared:  0.8973, Adjusted R-squared:  0.8951 
## F-statistic: 425.7 on 4 and 195 DF,  p-value: < 2.2e-16

What do you think the difference between “lm.fit<-lm(sales~TV+radio+newspaper, data=Ads)” and “lm.fit<-lm(sales~., data=Ads)”? In lm() function, if we type as “sales~.” then we are assigning ALL variables except sales as explanatory variables.

What is your answers to the following questions?

  • Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?

  • Do all the predictors help to explain Y, or is only a subset of the predictors useful?

  • How well does the model fit the data?

  • What does the coefficient for the newspaper variable suggest?

Before moving on, perform the following three actions to save your work and prepare for the practical application: Save your workspace as “LectureExampleData1.RData” (By typing save.image(“LectureExampleData1.RData”) function)