library(ggplot2)
library(dplyr)
library(statsr)
library(leaps)
library(grid)
library(gridExtra)
library(GGally)
library(knitr)
load("movies.Rdata")
According to the information provided with the data, the sample was obtained randomly, so the results of the statistical analysis should generalize, with caution, to the target population. Moreover, the movies dataset has 32 features, which may yield insights that help businesses with revenue generation, budget management, and spending decisions.
dfm_all_data <- movies[(1:32)]
kable(head(dfm_all_data))
title | title_type | genre | runtime | mpaa_rating | studio | thtr_rel_year | thtr_rel_month | thtr_rel_day | dvd_rel_year | dvd_rel_month | dvd_rel_day | imdb_rating | imdb_num_votes | critics_rating | critics_score | audience_rating | audience_score | best_pic_nom | best_pic_win | best_actor_win | best_actress_win | best_dir_win | top200_box | director | actor1 | actor2 | actor3 | actor4 | actor5 | imdb_url | rt_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Filly Brown | Feature Film | Drama | 80 | R | Indomina Media Inc. | 2013 | 4 | 19 | 2013 | 7 | 30 | 5.5 | 899 | Rotten | 45 | Upright | 73 | no | no | no | no | no | no | Michael D. Olmos | Gina Rodriguez | Jenni Rivera | Lou Diamond Phillips | Emilio Rivera | Joseph Julian Soria | http://www.imdb.com/title/tt1869425/ | //www.rottentomatoes.com/m/filly_brown_2012/ |
The Dish | Feature Film | Drama | 101 | PG-13 | Warner Bros. Pictures | 2001 | 3 | 14 | 2001 | 8 | 28 | 7.3 | 12285 | Certified Fresh | 96 | Upright | 81 | no | no | no | no | no | no | Rob Sitch | Sam Neill | Kevin Harrington | Patrick Warburton | Tom Long | Genevieve Mooy | http://www.imdb.com/title/tt0205873/ | //www.rottentomatoes.com/m/dish/ |
Waiting for Guffman | Feature Film | Comedy | 84 | R | Sony Pictures Classics | 1996 | 8 | 21 | 2001 | 8 | 21 | 7.6 | 22381 | Certified Fresh | 91 | Upright | 91 | no | no | no | no | no | no | Christopher Guest | Christopher Guest | Catherine O’Hara | Parker Posey | Eugene Levy | Bob Balaban | http://www.imdb.com/title/tt0118111/ | //www.rottentomatoes.com/m/waiting_for_guffman/ |
The Age of Innocence | Feature Film | Drama | 139 | PG | Columbia Pictures | 1993 | 10 | 1 | 2001 | 11 | 6 | 7.2 | 35096 | Certified Fresh | 80 | Upright | 76 | no | no | yes | no | yes | no | Martin Scorsese | Daniel Day-Lewis | Michelle Pfeiffer | Winona Ryder | Richard E. Grant | Alec McCowen | http://www.imdb.com/title/tt0106226/ | //www.rottentomatoes.com/m/age_of_innocence/ |
Malevolence | Feature Film | Horror | 90 | R | Anchor Bay Entertainment | 2004 | 9 | 10 | 2005 | 4 | 19 | 5.1 | 2386 | Rotten | 33 | Spilled | 27 | no | no | no | no | no | no | Stevan Mena | Samantha Dark | R. Brandon Johnson | Brandon Johnson | Heather Magee | Richard Glover | http://www.imdb.com/title/tt0388230/ | //www.rottentomatoes.com/m/10004684-malevolence/ |
Old Partner | Documentary | Documentary | 78 | Unrated | Shcalo Media Group | 2009 | 1 | 15 | 2010 | 4 | 20 | 7.8 | 333 | Fresh | 91 | Upright | 86 | no | no | no | no | no | no | Chung-ryoul Lee | Choi Won-kyun | Lee Sam-soon | Moo | NA | NA | http://www.imdb.com/title/tt1334549/ | //www.rottentomatoes.com/m/old-partner/ |
The research question is whether a subset of variables in the movies dataset can be used to predict the audience score of a movie. Estimating a movie's popularity before its release can provide useful insight and inform business decisions. If the model predicts low popularity for a given movie, that insight can help limit further costs. More generally, such predictions can support a better allocation of resources to boost revenue and guide future decisions.
Here we examine each variable to assess its importance and how it may affect the prediction of the audience score. Before building a prediction model, we need to identify a subset of variables from the dataset for our multiple linear regression model. The response variable is audience_score, and the explanatory variables should be a subset of the variables that might affect it.
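# Drop title, title_type, mpaa_rating, studio, the release-date columns,
# critics_rating, top200_box, and the two URL columns (by position)
dfm <- movies[ -c(1:2, 5:12, 15, 24, 31:32)]
# Remove rows that contain missing values
dfm <- na.omit(dfm)
# Show the variables that remain available for modeling
str(dfm)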
## Classes 'tbl_df', 'tbl' and 'data.frame': 634 obs. of 18 variables:
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 5 6 1 ...
## $ runtime : num 80 101 84 139 90 142 93 88 119 127 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ critics_score : num 45 96 91 80 33 57 17 90 83 89 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 1 2 2 2 ...
## $ audience_score : num 73 81 91 76 27 76 47 89 66 75 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## - attr(*, "na.action")= 'omit' Named int 6 25 131 172 175 198 223 236 334 386 ...
## ..- attr(*, "names")= chr "6" "25" "131" "172" ...
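Dropping columns by position works only while the column order stays fixed. A minimal sketch of the same idea using column names is shown here on a small made-up data frame; `toy_movies`, `drop_cols`, and all values in it are hypothetical stand-ins, not rows from `movies`.

```r
# Hypothetical toy data standing in for `movies`; all values are made up.
toy_movies <- data.frame(
  title          = c("A", "B", "C"),
  title_type     = c("Feature Film", "Documentary", "Feature Film"),
  genre          = c("Drama", "Documentary", "Comedy"),
  runtime        = c(100, NA, 95),
  audience_score = c(70, 85, 55),
  imdb_url       = c("u1", "u2", "u3"),
  stringsAsFactors = FALSE
)

# Columns to exclude, listed by name instead of by position.
drop_cols <- c("title", "title_type", "imdb_url")

# Keep every column whose name is not in drop_cols.
toy_dfm <- toy_movies[, !(names(toy_movies) %in% drop_cols)]

# Drop rows with any missing value, as na.omit() does for dfm.
toy_dfm <- na.omit(toy_dfm)

str(toy_dfm)
```

With the real data, `drop_cols` would list the 14 names found at positions 1:2, 5:12, 15, 24, and 31:32 of the table header shown earlier.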
hist(dfm$audience_score)
hist(dfm$imdb_rating)
hist(dfm$critics_score)
hist(dfm$runtime)
### The runtime histogram shows the values spread over a range, with some outliers
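One common way to make "outliers" concrete is the 1.5 × IQR rule. The sketch below is an assumption about how one might check this, not the method used in this report. It uses the first ten runtime values shown in the `str(dfm)` output plus one made-up extreme value (267) so that the rule has something to flag.

```r
# First ten runtimes from the str(dfm) output, plus one made-up extreme value.
runtime <- c(80, 101, 84, 139, 90, 142, 93, 88, 119, 127, 267)

# First and third quartiles and the interquartile range.
q   <- unname(quantile(runtime, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers <- runtime[runtime < lower | runtime > upper]
outliers
```

Applied to the real column, `runtime` would be replaced by `dfm$runtime`.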
quantile(dfm$imdb_rating, c(0,0.25,0.5,0.75,0.9,1))
## 0% 25% 50% 75% 90% 100%
## 1.90 5.90 6.55 7.30 7.70 9.00
quantile(dfm$imdb_num_votes, c(0,0.25,0.5,0.75,0.9,1))
## 0% 25% 50% 75% 90% 100%
## 183.00 4907.25 15508.00 59934.00 153568.40 893008.00
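The quantiles of imdb_num_votes suggest a strongly right-skewed variable: the distance from the median to the maximum is far larger than the distance from the minimum to the median. A minimal sketch, using only the quantile values printed above, compares that asymmetry on the raw scale and on a log10 scale. The helper name `asymmetry` is hypothetical, and the log transformation is a possible option, not something this report applies.

```r
# Minimum, median, and maximum of imdb_num_votes, copied from the output above.
votes_min    <- 183
votes_median <- 15508
votes_max    <- 893008

# Ratio of the upper half-range to the lower half-range;
# a value near 1 indicates a roughly symmetric spread around the median.
asymmetry <- function(lo, mid, hi) (hi - mid) / (mid - lo)

raw_ratio <- asymmetry(votes_min, votes_median, votes_max)
log_ratio <- asymmetry(log10(votes_min), log10(votes_median), log10(votes_max))

raw_ratio  # far above 1: long right tail on the raw scale
log_ratio  # close to 1: much more symmetric on the log10 scale
```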
# Pie chart of genre, labelled with the number of movies in each genre
mytable <- table(dfm$genre)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable, labels = lbls,
    main="Pie Chart of Genres\n (with sample sizes)")
Drama is the largest category with 302 movies, followed by Comedy and Action & Adventure. However, the number of movies released in a genre says nothing about how popular those movies are, so we cannot conclude from these counts alone that drama movies are the most popular.
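To separate "how many movies" from "how popular", the count per genre can be placed next to a summary of the audience score per genre. The sketch below does this in base R on a made-up data frame; `toy`, `by_genre`, and every value in them are hypothetical and only illustrate that the largest genre need not have the highest typical score.

```r
# Hypothetical toy data; genres and scores are made up for illustration.
toy <- data.frame(
  genre          = c("Drama", "Drama", "Drama", "Comedy", "Comedy", "Documentary"),
  audience_score = c(60, 70, 50, 55, 65, 90)
)

# Number of movies per genre and median audience score per genre.
genre_counts  <- table(toy$genre)
genre_medians <- tapply(toy$audience_score, toy$genre, median)

# Combine both summaries, aligned by genre name.
by_genre <- data.frame(
  genre  = names(genre_counts),
  n      = as.integer(genre_counts),
  median = as.numeric(genre_medians[names(genre_counts)])
)

# Sort by median score, highest first.
by_genre[order(-by_genre$median), ]
```

On the real data, `toy` would be replaced by `dfm`.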
summary(dfm$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.03 79.75 97.00
summary(dfm$imdb_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.900 5.900 6.550 6.473 7.300 9.000
summary(dfm$critics_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.00 60.50 57.17 82.00 100.00
ggplot(dfm, aes(x=factor(genre), y=audience_score))+
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
qplot(genre,audience_score,data=dfm,geom = c("boxplot","jitter"),
      fill=genre, main= "Genre", xlab = "genre", ylab = "audience_score")+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
ggplot(dfm, aes(x=factor(actor1), y=audience_score))+
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
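A boxplot with one box per lead actor is hard to read when most actors appear only once or twice. One possible refinement, sketched here on made-up data, is to keep only actors with at least a chosen number of movies before plotting. The names `toy`, `min_movies`, and `toy_frequent` and the threshold of 2 are hypothetical choices, not part of the original analysis.

```r
# Hypothetical toy data; actors and scores are made up for illustration.
toy <- data.frame(
  actor1         = c("X", "X", "X", "Y", "Y", "Z"),
  audience_score = c(70, 75, 80, 40, 50, 90),
  stringsAsFactors = FALSE
)

# Assumed threshold: keep actors who lead at least this many movies.
min_movies <- 2

# Count movies per actor and keep the names that meet the threshold.
counts <- table(toy$actor1)
keep   <- names(counts)[counts >= min_movies]

# Subset the data to those actors only.
toy_frequent <- toy[toy$actor1 %in% keep, ]
toy_frequent
```

The filtered data frame can then be passed to the same `ggplot()` call in place of `dfm`.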