Include Required Libraries

library(ggplot2)
library(dplyr)
library(statsr)
library(leaps)
library(grid)
library(gridExtra)
library(GGally)
library(knitr)

Load Movies Dataset

load("movies.Rdata")

Part 1: About the data (additional analysis of the data apart from the given reg_model_project)

According to the information provided with the data, the sample was obtained randomly, so the results of the statistical analysis should generalize, with caution, to the target population. Moreover, because the movies dataset has 32 features, it can yield insights that help businesses with revenue generation, budget management, spending, etc.

Part 2: Displaying the data as a table using the knitr library

dfm_all_data <- movies[(1:32)]
kable(head(dfm_all_data))
| title | title_type | genre | runtime | mpaa_rating | studio | thtr_rel_year | thtr_rel_month | thtr_rel_day | dvd_rel_year | dvd_rel_month | dvd_rel_day | imdb_rating | imdb_num_votes | critics_rating | critics_score | audience_rating | audience_score | best_pic_nom | best_pic_win | best_actor_win | best_actress_win | best_dir_win | top200_box | director | actor1 | actor2 | actor3 | actor4 | actor5 | imdb_url | rt_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Filly Brown | Feature Film | Drama | 80 | R | Indomina Media Inc. | 2013 | 4 | 19 | 2013 | 7 | 30 | 5.5 | 899 | Rotten | 45 | Upright | 73 | no | no | no | no | no | no | Michael D. Olmos | Gina Rodriguez | Jenni Rivera | Lou Diamond Phillips | Emilio Rivera | Joseph Julian Soria | http://www.imdb.com/title/tt1869425/ | //www.rottentomatoes.com/m/filly_brown_2012/ |
| The Dish | Feature Film | Drama | 101 | PG-13 | Warner Bros. Pictures | 2001 | 3 | 14 | 2001 | 8 | 28 | 7.3 | 12285 | Certified Fresh | 96 | Upright | 81 | no | no | no | no | no | no | Rob Sitch | Sam Neill | Kevin Harrington | Patrick Warburton | Tom Long | Genevieve Mooy | http://www.imdb.com/title/tt0205873/ | //www.rottentomatoes.com/m/dish/ |
| Waiting for Guffman | Feature Film | Comedy | 84 | R | Sony Pictures Classics | 1996 | 8 | 21 | 2001 | 8 | 21 | 7.6 | 22381 | Certified Fresh | 91 | Upright | 91 | no | no | no | no | no | no | Christopher Guest | Christopher Guest | Catherine O'Hara | Parker Posey | Eugene Levy | Bob Balaban | http://www.imdb.com/title/tt0118111/ | //www.rottentomatoes.com/m/waiting_for_guffman/ |
| The Age of Innocence | Feature Film | Drama | 139 | PG | Columbia Pictures | 1993 | 10 | 1 | 2001 | 11 | 6 | 7.2 | 35096 | Certified Fresh | 80 | Upright | 76 | no | no | yes | no | yes | no | Martin Scorsese | Daniel Day-Lewis | Michelle Pfeiffer | Winona Ryder | Richard E. Grant | Alec McCowen | http://www.imdb.com/title/tt0106226/ | //www.rottentomatoes.com/m/age_of_innocence/ |
| Malevolence | Feature Film | Horror | 90 | R | Anchor Bay Entertainment | 2004 | 9 | 10 | 2005 | 4 | 19 | 5.1 | 2386 | Rotten | 33 | Spilled | 27 | no | no | no | no | no | no | Stevan Mena | Samantha Dark | R. Brandon Johnson | Brandon Johnson | Heather Magee | Richard Glover | http://www.imdb.com/title/tt0388230/ | //www.rottentomatoes.com/m/10004684-malevolence/ |
| Old Partner | Documentary | Documentary | 78 | Unrated | Shcalo Media Group | 2009 | 1 | 15 | 2010 | 4 | 20 | 7.8 | 333 | Fresh | 91 | Upright | 86 | no | no | no | no | no | no | Chung-ryoul Lee | Choi Won-kyun | Lee Sam-soon | Moo | NA | NA | http://www.imdb.com/title/tt1334549/ | //www.rottentomatoes.com/m/old-partner/ |

Part 3: Research question

The research question is whether a subset of variables from the given movies dataset can be used to predict the audience score of a particular movie. Estimating a movie's popularity before its release can provide useful insight and inform business decisions. If the model predicts low popularity for a particular input, that insight can help reduce further costs, allocate assets more effectively to boost revenue, and guide future decisions.

Part 4: Exploratory data analysis

Here we examine each variable to judge its importance and how it may affect our ability to predict the audience score. Before building a prediction model, we need to identify a subset of variables from the dataset for our multiple linear regression model. The response variable is audience_score, and the explanatory variables should be a subset of the variables that might affect the response variable.
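As a rough illustration of what a multiple linear regression with one response and several explanatory variables looks like in R, here is a minimal sketch. It does not use movies.Rdata: the data frame `toy`, its values, and the relationship used to generate `audience_score` are all assumptions made up for this example, and only the column names are borrowed from the dataset.

```r
# Synthetic stand-in for the movies data: these values are made up for
# illustration and are not taken from movies.Rdata.
set.seed(42)
n <- 200
toy <- data.frame(
  imdb_rating   = round(runif(n, 2, 9), 1),
  critics_score = round(runif(n, 1, 100))
)

# Assumed data-generating process, chosen only so the fit has something to find.
toy$audience_score <- 10 * toy$imdb_rating +
  0.1 * toy$critics_score + rnorm(n, sd = 5)

# Multiple linear regression: one response, several explanatory variables.
fit <- lm(audience_score ~ imdb_rating + critics_score, data = toy)
coef(fit)

# Predict the response for a new, hypothetical movie.
new_movie <- data.frame(imdb_rating = 7.0, critics_score = 80)
pred <- predict(fit, newdata = new_movie)
pred
```

The fitted coefficients should land close to the values used to generate the toy data, which is a quick sanity check that the model formula is doing what we expect.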

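# Drop columns not used for modeling, by position: title and title_type (1:2),
# mpaa_rating through dvd_rel_day (5:12), critics_rating (15), top200_box (24),
# and the two URL columns (31:32)
dfm <- movies[-c(1:2, 5:12, 15, 24, 31:32)]
# Remove rows with missing values
dfm <- na.omit(dfm)
# Show the variables available for modeling
str(dfm)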
## Classes 'tbl_df', 'tbl' and 'data.frame':    634 obs. of  18 variables:
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 5 6 1 ...
##  $ runtime         : num  80 101 84 139 90 142 93 88 119 127 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ critics_score   : num  45 96 91 80 33 57 17 90 83 89 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 1 2 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 76 47 89 66 75 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  - attr(*, "na.action")= 'omit' Named int  6 25 131 172 175 198 223 236 334 386 ...
##   ..- attr(*, "names")= chr  "6" "25" "131" "172" ...
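Dropping columns by position works, but it silently breaks if the column order ever changes. A possible alternative is to drop columns by name and to check how many missing values each column has before removing rows. The sketch below uses a small invented data frame `toy` (only the column names come from the movies dataset), and `drop_cols` is a hypothetical helper variable.

```r
# Toy data frame with a few of the column names from the movies dataset;
# the rows are invented for illustration.
toy <- data.frame(
  title          = c("A", "B", "C", "D"),
  genre          = factor(c("Drama", "Comedy", "Drama", "Horror")),
  runtime        = c(80, NA, 84, 139),
  imdb_url       = c("u1", "u2", "u3", "u4"),
  audience_score = c(73, 81, 91, 76),
  stringsAsFactors = FALSE
)

# Drop columns by name rather than by position.
drop_cols <- c("title", "imdb_url")
kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Count missing values per column before removing incomplete rows.
na_per_col <- colSums(is.na(kept))
na_per_col

# complete.cases() keeps the same rows that na.omit() would keep.
clean <- kept[complete.cases(kept), , drop = FALSE]
nrow(clean)
```

Looking at the per-column NA counts first shows which variable is responsible for the rows that get dropped.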

Part 5: Visualizing the spread of the data with histograms

The histograms below show the distributions of some important features of the dataset.

hist(dfm$audience_score)

hist(dfm$imdb_rating)

hist(dfm$critics_score)

hist(dfm$runtime)

### The runtime histogram shows the values spread over a range, with some outliers
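One common way to make "some outliers" concrete is Tukey's rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. This is a swapped-in technique for illustration, not something the analysis above applies, and the `runtime` vector below is invented rather than taken from movies.Rdata.

```r
# Made-up runtimes in minutes; not taken from movies.Rdata.
runtime <- c(80, 101, 84, 139, 90, 142, 93, 88, 119, 127, 267, 39)

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q <- quantile(runtime, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- runtime[runtime < lower | runtime > upper]
outliers
```

Whether flagged values should be removed is a separate modeling decision; the rule only identifies candidates.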

Part 6: Visualizing what percentage of the data lies within each quantile

quantile(dfm$imdb_rating, c(0,0.25,0.5,0.75,0.9,1))
##   0%  25%  50%  75%  90% 100% 
## 1.90 5.90 6.55 7.30 7.70 9.00
quantile(dfm$imdb_num_votes, c(0,0.25,0.5,0.75,0.9,1))
##        0%       25%       50%       75%       90%      100% 
##    183.00   4907.25  15508.00  59934.00 153568.40 893008.00
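quantile() answers "which value sits at a given percentage of the data". The reverse question, "what percentage of the data sits at or below a given value", can be sketched with ecdf(). The `rating` vector here is invented for illustration and is not taken from movies.Rdata.

```r
# Made-up ratings for illustration; not taken from movies.Rdata.
rating <- c(1.9, 5.5, 5.9, 6.2, 6.5, 6.6, 7.2, 7.3, 7.6, 9.0)

# quantile() maps a probability to a value ...
q75 <- quantile(rating, 0.75)
q75

# ... and ecdf() goes the other way: it maps a value to the share of
# observations at or below it.
share_below <- ecdf(rating)
share_below(7.3)

# Share of observations falling in each quartile-based bin.
bins <- cut(rating,
            breaks = quantile(rating, c(0, 0.25, 0.5, 0.75, 1)),
            include.lowest = TRUE)
bin_share <- prop.table(table(bins))
bin_share
```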
# Pie chart of genres, labeled with the sample size of each genre
mytable <- table(dfm$genre)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable, labels = lbls,
    main="Pie Chart of Genres\n (with sample sizes)")

Observation

The Drama category contains 302 movies, followed by Comedy and Action & Adventure. However, the number of movies released in a category does not by itself tell us that drama movies are the most popular.
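To see why counts alone say little about popularity, it helps to compute the number of movies and the average audience score per genre side by side. The sketch below uses an invented data frame `toy`; the genres and scores are assumptions for illustration and are not taken from movies.Rdata.

```r
# Invented genres and scores for illustration; not taken from movies.Rdata.
toy <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Drama",
            "Comedy", "Comedy", "Documentary"),
  audience_score = c(60, 70, 50, 80, 55, 65, 90)
)

# How many movies fall in each genre, largest first.
genre_counts <- sort(table(toy$genre), decreasing = TRUE)
genre_counts

# Average audience score per genre, a separate question from the counts.
genre_means <- tapply(toy$audience_score, toy$genre, mean)
genre_means

# The most frequent genre need not be the highest-scoring one.
most_frequent <- names(genre_counts)[1]
highest_mean <- names(which.max(genre_means))
c(most_frequent = most_frequent, highest_mean = highest_mean)
```

In this toy data the most frequent genre and the genre with the highest mean score differ, which is exactly the situation the observation above warns about.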

summary(dfm$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.03   79.75   97.00
summary(dfm$imdb_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.550   6.473   7.300   9.000
summary(dfm$critics_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   60.50   57.17   82.00  100.00
ggplot(dfm, aes(x=factor(genre), y=audience_score))+
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

qplot(genre,audience_score,data=dfm,geom = c("boxplot","jitter"),
      fill=genre, main= "Genre", xlab = "genre", ylab = "audience_score")+
      theme(axis.text.x = element_text(angle = 60, hjust = 1))

ggplot(dfm, aes(x=factor(actor1), y=audience_score))+
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
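A boxplot keyed on actor1 draws one box per distinct actor, and most actors are likely to appear only once or twice, so the plot can be hard to read. One possible refinement, sketched here on invented data, is to keep only actors with at least a minimum number of movies before plotting. The data frame `toy` and the threshold `min_movies` are hypothetical, and the sketch uses base R boxplot() rather than ggplot2 so that it runs without extra packages.

```r
# Invented actors and scores for illustration; not taken from movies.Rdata.
toy <- data.frame(
  actor1 = c("A", "A", "A", "B", "B", "C", "D", "B"),
  audience_score = c(70, 75, 80, 40, 50, 90, 30, 45),
  stringsAsFactors = FALSE
)

# Keep only actors who appear at least `min_movies` times (a hypothetical
# threshold), so each box summarises more than a single point.
min_movies <- 2
counts <- table(toy$actor1)
frequent <- names(counts)[counts >= min_movies]
toy_frequent <- toy[toy$actor1 %in% frequent, ]
toy_frequent$actor1 <- factor(toy_frequent$actor1)

# Base R boxplot of score by the remaining actors.
boxplot(audience_score ~ actor1, data = toy_frequent,
        xlab = "actor1", ylab = "audience_score")
```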