EDA and data mining: analysis of ratings and box office of Marvel and DC films

Posted by Stryker250 on Fri, 12 Nov 2021 09:37:07 +0100

Which is better, Marvel or DC? Whose films are rated higher? This article analyzes Marvel and DC films based on box office gross and ratings.

Marvel Cinematic Universe versus DC Universe: it's an endless debate, isn't it? Criticize a film from either side and its fans will go crazy. In this article, we compare Marvel and DC using data and let the numbers speak.

MCU and DC

A link to download the dataset is provided at the end of this article. You can write Python code in Jupyter Notebook, Google Colab, or any other preferred editor.

Code and analysis

import pandas as pd
import scipy.stats as stats

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns  # needed below for sns.heatmap and sns.lmplot
import plotly
import plotly.express as px
%matplotlib inline

Let's load the data and see what it looks like first.

df = pd.read_csv("/content/mdc.csv", encoding='latin-1')

The data includes each film's title, release year, genre, IMDb rating, IMDb gross, and so on.

Here's more information



Two summary functions give us an overview of the data: one reports the data type and non-null count of each column, the other counts the null values in each column. There are no null values in our dataset.
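
The two calls themselves are not shown above. A likely pair, and this is an assumption rather than the author's confirmed code, is `info()` and `isnull().sum()`. The sketch below runs them on a small made-up DataFrame (the rows are illustrative, and one rating is deliberately left missing so the null count is visible):

```python
import pandas as pd

# Toy stand-in for the article's DataFrame. Column names follow the article;
# the rows are invented for illustration only.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "entity": ["MARVEL", "DC", "DC"],
    "imdb_rating": [7.0, 6.5, None],
})

sample.info()                         # dtypes and non-null counts per column
null_counts = sample.isnull().sum()   # number of missing values per column
print(null_counts)
```

On the real dataset, every entry of `null_counts` should be zero if the claim above holds.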

Data correlation

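f, ax = plt.subplots(figsize=(14, 10))
# Correlate numeric columns only; text columns such as the title would raise an error in recent pandas
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, fmt=".2f", ax=ax)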

The heat map above shows the correlation between each pair of numeric columns.

If you want to look at only the Marvel films or only the DC films, you can filter the data like this:

df[df.entity == 'MARVEL'].tail(5)

df[df.entity == 'DC'].tail(5)

Who makes more movies?

fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df.entity.value_counts().plot(ax=ax, kind='pie')
ax.set_title("MARVEL VS DC (No. of Movies)")

The pie chart above shows clearly that the dataset contains more Marvel films than DC films. On volume, Marvel wins comfortably.
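
A pie chart hides the exact numbers. As a small sketch on made-up rows (not the article's data), `value_counts` can report both the raw counts and the shares behind such a chart:

```python
import pandas as pd

# Invented rows; only the column name "entity" comes from the article.
sample = pd.DataFrame({"entity": ["MARVEL", "MARVEL", "MARVEL", "DC"]})

counts = sample["entity"].value_counts()                 # films per entity
shares = sample["entity"].value_counts(normalize=True)   # fraction per entity
print(counts)
print(shares)
```

Running the same two lines on `df` would give the figures that the pie chart only shows visually.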

Film genres

fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df[df.entity == 'MARVEL'].genre.value_counts().plot(ax=ax, kind='pie')
ax.set_title("Marvel Movie Genre Type")

As you can see, most Marvel films fall into action, adventure, science fiction, comedy and fantasy. Now let's look at DC.

fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df[df.entity == 'DC'].genre.value_counts().plot(ax=ax, kind='pie')
ax.set_title("DC Movie Genre Type")

You can see that DC films are more diverse in genre than Marvel films. DC has tried more genres (not always successfully, ha). I think this is the best part of the DC universe.

Highest-rated Marvel and DC films on IMDb

dc_movies = df[df.entity == 'DC']
marvel_movies = df[df.entity == 'MARVEL']

# Average and highest IMDb rating of DC movies
avrg_dc_imdb = dc_movies['imdb_rating'].mean()
highest_dc_imdb = dc_movies['imdb_rating'].max()
print("Average: ",avrg_dc_imdb, "\n Highest: ",highest_dc_imdb)

# Average and highest IMDb rating of Marvel movies
avrg_marvel_imdb = marvel_movies['imdb_rating'].mean()
highest_marvel_imdb = marvel_movies['imdb_rating'].max()
print("Average: ",avrg_marvel_imdb, "\n Highest: ",highest_marvel_imdb)

The result is this

Average:  6.133333333333335 
 Highest:  9.0

Average:  6.794736842105261 
 Highest:  8.4

The average IMDb rating is about 6.13 for DC films and 6.79 for Marvel films. Marvel scores higher on average, but DC has the single highest-rated film in the dataset, at 9.0 against Marvel's best of 8.4.
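
The same averages and maxima can be computed for both entities in one step, together with the title of each top film. This is a sketch on invented rows; only the column names `title`, `entity` and `imdb_rating` are taken from the article:

```python
import pandas as pd

# Invented rows for illustration.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "entity": ["MARVEL", "MARVEL", "DC", "DC"],
    "imdb_rating": [7.0, 8.0, 9.0, 5.0],
})

# Mean and max rating per entity in a single table.
summary = sample.groupby("entity")["imdb_rating"].agg(["mean", "max"])

# idxmax returns the row label of the best-rated film in each group.
best_rows = sample.loc[sample.groupby("entity")["imdb_rating"].idxmax()]
best_titles = best_rows.set_index("entity")["title"]

print(summary)
print(best_titles)
```

Grouping by `entity` avoids keeping two separately filtered DataFrames in sync.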

IMDB score and box office

plt.scatter(data = marvel_movies, x = 'imdb_rating', y = 'imdb_gross')
plt.scatter(data = dc_movies, x = 'imdb_rating', y = 'imdb_gross')
plt.title('Marvel vs. DC in imdb ratings and gross')
plt.xlabel('IMDb Ratings')
plt.ylabel('IMDb Gross')
plt.legend(['Marvel', 'DC'])

In terms of gross, a few Marvel films are far ahead of anything from DC. Most Marvel films have IMDb scores between 6.7 and 8.2.

The scores of DC films are spread more evenly across the rating range. Their grosses are respectable, but well below Marvel's.

The film with the lowest score is presumably Green Lantern 😎
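
Rather than guessing from the scatter plot, the lowest-rated films can be looked up directly. A minimal sketch with `nsmallest`, again on invented rows:

```python
import pandas as pd

# Invented rows; column names follow the article.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "entity": ["MARVEL", "MARVEL", "DC", "DC"],
    "imdb_rating": [7.0, 8.0, 9.0, 5.0],
})

# The two lowest-rated films, lowest first.
lowest = sample.nsmallest(2, "imdb_rating")[["title", "entity", "imdb_rating"]]
print(lowest)
```

Applied to `df`, the first row of the result would confirm or refute the Green Lantern guess.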

Tomatometer vs IMDb score

Note: the Tomatometer is the score from Rotten Tomatoes.

imdb_vs_tm = sns.lmplot(data=df, x="imdb_rating", y="tomato_meter", hue="entity", height=7)
imdb_vs_tm.set_axis_labels("IMDb Ratings", "Tomato meter Score")
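
The regression lines drawn by `lmplot` can be backed by a number. The sketch below computes the Pearson correlation between the two scores for each entity using pandas alone. The rows are invented and chosen to be perfectly linear, so they say nothing about the real data:

```python
import pandas as pd

# Invented rows; column names follow the article.
sample = pd.DataFrame({
    "entity": ["MARVEL", "MARVEL", "MARVEL", "DC", "DC", "DC"],
    "imdb_rating": [6.0, 7.0, 8.0, 5.0, 6.0, 7.0],
    "tomato_meter": [60, 70, 80, 70, 60, 50],
})

# Pearson correlation between IMDb rating and Tomatometer, per entity.
corr_by_entity = {
    name: group["imdb_rating"].corr(group["tomato_meter"])
    for name, group in sample.groupby("entity")
}
print(corr_by_entity)
```

A value close to 1 means the two sites largely agree; a value near 0 means they do not.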

DC films ranked by IMDb rating

top_dc_movie = dc_movies.groupby('title').sum().sort_values('imdb_rating', ascending=False)
top_dc_movie = top_dc_movie.reset_index()
px.bar(x='title', y ="imdb_rating", data_frame=top_dc_movie)

The Dark Knight is the highest-rated DC film, with an IMDb score of 9. No Joker since Heath Ledger's has matched it. This film shows what the DC universe is capable of.

Marvel films ranked by IMDb rating

top_marvel_movie = marvel_movies.groupby('title').sum().sort_values('imdb_rating', ascending=False)
top_marvel_movie = top_marvel_movie.reset_index()
px.bar(x='title', y ="imdb_rating", data_frame=top_marvel_movie)

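Avengers: Endgame is Marvel's highest-rated film, with an IMDb score of 8.4. The chart, however, puts Fantastic Four on top. You don't believe that, do you 😂. This may be an error in the dataset, or a side effect of groupby('title').sum(), which adds together the ratings of films that share a title (Fantastic Four has been filmed more than once).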
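
Whether duplicate titles are really behind the odd ranking cannot be verified without the dataset, so treat the following as a sketch of the suspected mechanism on invented rows. It shows how `sum()` inflates a title that appears twice, and how `max()` avoids that:

```python
import pandas as pd

# Invented rows: one title appears twice (a remake), one appears once.
sample = pd.DataFrame({
    "title": ["Remade Film", "Remade Film", "Single Film"],
    "year": [2005, 2015, 2019],
    "imdb_rating": [5.7, 4.3, 8.4],
})

# Titles that occur more than once.
dupes = sample.loc[sample["title"].duplicated(keep=False), "title"].unique()

summed = sample.groupby("title")["imdb_rating"].sum()   # adds both remakes
best = sample.groupby("title")["imdb_rating"].max()     # keeps the better one

print(list(dupes))
print(summed.sort_values(ascending=False))
print(best.sort_values(ascending=False))
```

If `dupes` is non-empty on the real data, sorting the original rows by `imdb_rating`, or grouping by both `title` and `year`, would be safer choices than summing.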

Runtime

avrg_marvel_runtime = marvel_movies['runtime'].mean()
highest_marvel_runtime = marvel_movies['runtime'].max()

avrg_dc_runtime = dc_movies['runtime'].mean()
highest_dc_runtime = dc_movies['runtime'].max()

print("Marvel\nAverage: ",avrg_marvel_runtime, "\n Highest: ",highest_marvel_runtime)
print("DC\nAverage: ",avrg_dc_runtime, "\n Highest: ",highest_dc_runtime)

The output is as follows

Marvel
Average:  124.54385964912281 
 Highest:  244
DC
Average:  123.45454545454545 
 Highest:  164

The average runtime of Marvel and DC films is almost the same, at roughly 124 minutes. But there is a big difference in the longest runtime: 244 minutes for Marvel against 164 for DC.
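
A maximum that far from the average is worth inspecting before trusting it. One way to pull out the longest film per entity, sketched on invented rows (the `runtime` column name is from the article, the values are not):

```python
import pandas as pd

# Invented rows for illustration.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "entity": ["MARVEL", "MARVEL", "DC", "DC"],
    "runtime": [120, 244, 164, 110],
})

# Sort by runtime, then keep the first (longest) row of each entity.
longest = (
    sample.sort_values("runtime", ascending=False)
          .groupby("entity")
          .head(1)
          .set_index("entity")
)
print(longest)
```

Seeing the title next to the number makes it easy to tell a genuine long film from a data-entry error.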

Box office comparison based on IMDB

top_marvel_movie_gross = marvel_movies.groupby('title').sum().sort_values('imdb_gross', ascending=False)
top_marvel_movie_gross = top_marvel_movie_gross.reset_index()
px.bar(x='title', y ="imdb_gross", data_frame=top_marvel_movie_gross)

As expected, Avengers: Endgame tops the list, followed by Black Panther, Avengers: Infinity War and so on. Endgame's gross as listed on IMDb exceeds $800 million.

top_dc_movie_gross = dc_movies.groupby('title').sum().sort_values('imdb_gross', ascending=False)
top_dc_movie_gross = top_dc_movie_gross.reset_index()
px.bar(x='title', y ="imdb_gross", data_frame=top_dc_movie_gross)

The Dark Knight has the highest IMDb gross among DC films. In terms of revenue, DC is far behind Marvel.

Total revenue comparison between Marvel and DC

fig = px.line(df, x="year", y="imdb_gross", color='entity')
fig.show()

In the past few years, Marvel's IMDB box office revenue has far exceeded DC's.
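
Because several films can share a release year, a line drawn through the raw rows jumps up and down within a single year. One option, shown here as a sketch on invented rows, is to total the gross per year and entity before plotting:

```python
import pandas as pd

# Invented rows; column names follow the article, values are arbitrary.
sample = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019],
    "entity": ["MARVEL", "MARVEL", "DC", "MARVEL"],
    "imdb_gross": [700, 600, 300, 800],
})

# One row per year, one column per entity, missing combinations set to 0.
yearly = (
    sample.groupby(["year", "entity"])["imdb_gross"]
          .sum()
          .unstack(fill_value=0)
)
print(yearly)
```

The resulting table has one value per entity per year, which suits a line or bar chart better than the unaggregated rows.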

Release frequency of Marvel and DC films

fig2 = px.line(df, x='year', y='title', color='entity')
fig2.show()

From the figure above, we can see that after 2002 Marvel released films more frequently than DC. Perhaps this is one reason why DC's fan base is somewhat smaller than Marvel's.
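
Release frequency can also be read from a table of films per year. The sketch below uses `pd.crosstab` on invented rows; it is one possible way to count, not the method used for the figure above:

```python
import pandas as pd

# Invented rows; column names follow the article.
sample = pd.DataFrame({
    "year": [2002, 2003, 2003, 2003, 2005],
    "entity": ["MARVEL", "MARVEL", "MARVEL", "DC", "DC"],
})

# Number of films per year (rows) and entity (columns).
per_year = pd.crosstab(sample["year"], sample["entity"])
print(per_year)
```

Comparing the two columns year by year makes the difference in output explicit.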


This comparison cannot tell us which of the two studios is better. Both have been making good films and bringing joy to audiences for decades, and that is enough.

Finally, the Colab notebook for this article: https://colab.research.google.com/drive/12DPN0BFYz0Slq2J8m-ney10x8DHWG8zh?usp=sharing

If you can't see the figures clearly, you can also view them directly in Colab.

Author: Rohit Kumar Thakur

Topics: Data Analysis Data Mining eda dc