Oscar Gold, or Literal Gold? Walking the Tightrope Between Art and Commerce

Film has always walked a fine line between artistic integrity and commercial success. This project explores that tension using a very large dataset of films (every film currently in The Movie Database, TMDB) that includes financial information (budget, revenue) as well as audience reception (IMDb ratings, number of votes, etc.). My goal is to understand where these two metrics coincide, where they diverge, and what trends emerge over time.

Earlier this semester, using R, I found that

  • Film revenue has increased significantly over time, especially since the 1980s. The film industry has become much, much more of an industry.
  • High budgets strongly correlate with high revenue, suggesting that financial investment begets financial returns. You have to spend money to make money, it seems.
  • A histogram of IMDb ratings showed that most films cluster around the 6/10 to 7.5/10 range, meaning that critical success is much harder to come by consistently than financial success.
  • Running statistical tests, I found that films with an above-average IMDb rating earn significantly more revenue on average, but under a Wilcoxon rank-sum test (a rank-based test that is less sensitive to outliers), the difference is no longer statistically significant. This implies that blockbuster hits skew perceptions of success.

The hypothesis I want to test is that big-budget, higher-earning films will have more IMDb votes and higher IMDb ratings than lower-earning films.

This notebook builds on those findings and moves the analysis to Python for deeper exploration, covering:

  • Loading the dataset using Pandas.
  • Cleaning and preprocessing the data, ensuring everything is in an appropriate format.
  • Recreating/Updating visualizations from my preliminary R analysis, including
    • Rating distributions
    • Revenue vs. Budget
  • Performing additional statistical tests in Python, such as
    • Correlation metrics
    • Normality tests or regression models
  • Applying K-means clustering to numerical variables like revenue, budget, and IMDb rating to investigate whether films naturally group into "profit only," "acclaim only," or "profit and acclaim" categories.
  • Visualizing the clustering results and determining whether they support or contradict the hypothesized link between financial and critical success.
  • Reflecting on this clustering method, asking whether K-means is appropriate for film data and how it could be improved.
In [1]:
import pandas as pd
df = pd.read_csv("TMDB_all_movies.csv")

df.head()
Out[1]:
id title vote_average vote_count status release_date revenue runtime budget imdb_id ... spoken_languages cast director director_of_photography writers producers music_composer imdb_rating imdb_votes poster_path
0 2 Ariel 7.100 353.0 Released 1988-10-21 0.0 73.0 0.0 tt0094675 ... suomi Marja Packalén, Olli Varja, Matti Pellonpää, J... Aki Kaurismäki Timo Salminen Aki Kaurismäki Aki Kaurismäki NaN 7.4 9329.0 /ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1 3 Shadows in Paradise 7.291 413.0 Released 1986-10-17 0.0 74.0 0.0 tt0092149 ... suomi, English, svenska Riikka Kuosmanen, Bertta Pellonpää, Aki Kauris... Aki Kaurismäki Timo Salminen Aki Kaurismäki Mika Kaurismäki NaN 7.4 8166.0 /nj01hspawPof0mJmlgfjuLyJuRN.jpg
2 5 Four Rooms 5.869 2709.0 Released 1995-12-09 4257354.0 98.0 4000000.0 tt0113101 ... English Paul Skemp, Sammi Davis, Quinn Hellerman, Davi... Robert Rodriguez, Allison Anders, Quentin Tara... Andrzej Sekula, Rodrigo García, Guillermo Nava... Robert Rodriguez, Allison Anders, Quentin Tara... Lawrence Bender, Quentin Tarantino, Alexandre ... Combustible Edison 6.7 114887.0 /75aHn1NOYXh4M7L5shoeQ6NGykP.jpg
3 6 Judgment Night 6.500 354.0 Released 1993-10-15 12136938.0 109.0 21000000.0 tt0107286 ... English Michael Wiseman, Michael DeLorenzo, Everlast, ... Stephen Hopkins Peter Levy Jere Cunningham, Lewis Colick Gene Levy, Marilyn Vance, Lloyd Segan Alan Silvestri 6.6 20268.0 /3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4 8 Life in Loops (A Megacities RMX) 7.500 27.0 Released 2006-01-01 0.0 80.0 42000.0 tt0825671 ... English, हिन्दी, 日本語, Pусский, Español NaN Timo Novotny Wolfgang Thaler Michael Glawogger, Timo Novotny Ulrich Gehmacher, Timo Novotny NaN 8.1 285.0 /7ln81BRnPR2wqxuITZxEciCe1lc.jpg

5 rows × 28 columns

In [2]:
# Dropping columns that will not be used in analysis to declutter the dataset
df.drop(
    columns = ['imdb_id', 'original_language', 'original_title', 'overview',
               'tagline', 'spoken_languages', 'director_of_photography', 'music_composer',
               'poster_path'], inplace = True
)

df.head()
Out[2]:
id title vote_average vote_count status release_date revenue runtime budget popularity genres production_companies production_countries cast director writers producers imdb_rating imdb_votes
0 2 Ariel 7.100 353.0 Released 1988-10-21 0.0 73.0 0.0 1.0117 Comedy, Drama, Romance, Crime Villealfa Filmproductions Finland Marja Packalén, Olli Varja, Matti Pellonpää, J... Aki Kaurismäki Aki Kaurismäki Aki Kaurismäki 7.4 9329.0
1 3 Shadows in Paradise 7.291 413.0 Released 1986-10-17 0.0 74.0 0.0 0.6984 Comedy, Drama, Romance Villealfa Filmproductions Finland Riikka Kuosmanen, Bertta Pellonpää, Aki Kauris... Aki Kaurismäki Aki Kaurismäki Mika Kaurismäki 7.4 8166.0
2 5 Four Rooms 5.869 2709.0 Released 1995-12-09 4257354.0 98.0 4000000.0 2.6362 Comedy Miramax, A Band Apart United States of America Paul Skemp, Sammi Davis, Quinn Hellerman, Davi... Robert Rodriguez, Allison Anders, Quentin Tara... Robert Rodriguez, Allison Anders, Quentin Tara... Lawrence Bender, Quentin Tarantino, Alexandre ... 6.7 114887.0
3 6 Judgment Night 6.500 354.0 Released 1993-10-15 12136938.0 109.0 21000000.0 1.2895 Action, Crime, Thriller Largo Entertainment, JVC, Universal Pictures United States of America Michael Wiseman, Michael DeLorenzo, Everlast, ... Stephen Hopkins Jere Cunningham, Lewis Colick Gene Levy, Marilyn Vance, Lloyd Segan 6.6 20268.0
4 8 Life in Loops (A Megacities RMX) 7.500 27.0 Released 2006-01-01 0.0 80.0 42000.0 3.2030 Documentary inLoops Austria NaN Timo Novotny Michael Glawogger, Timo Novotny Ulrich Gehmacher, Timo Novotny 8.1 285.0
In [3]:
# Converting columns to appropriate data types
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
df['imdb_rating'] = pd.to_numeric(df['imdb_rating'], errors='coerce')
df['imdb_votes'] = pd.to_numeric(df['imdb_votes'], errors='coerce')

# Removing films with missing essential values
df = df.dropna(subset=['release_date', 'imdb_rating', 'imdb_votes', 'budget', 'revenue'])

# Filtering films to match the R analysis (post-1920 & > 5000 IMDb votes)
df_filtered = df[(df['release_date'] >= "1920-01-01") & (df['imdb_votes'] > 5000)]

df_filtered.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19219 entries, 0 to 1083723
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    19219 non-null  int64         
 1   title                 19219 non-null  object        
 2   vote_average          19219 non-null  float64       
 3   vote_count            19219 non-null  float64       
 4   status                19219 non-null  object        
 5   release_date          19219 non-null  datetime64[ns]
 6   revenue               19219 non-null  float64       
 7   runtime               19219 non-null  float64       
 8   budget                19219 non-null  float64       
 9   popularity            19219 non-null  float64       
 10  genres                19201 non-null  object        
 11  production_companies  18896 non-null  object        
 12  production_countries  19086 non-null  object        
 13  cast                  19166 non-null  object        
 14  director              19201 non-null  object        
 15  writers               18891 non-null  object        
 16  producers             18376 non-null  object        
 17  imdb_rating           19219 non-null  float64       
 18  imdb_votes            19219 non-null  float64       
dtypes: datetime64[ns](1), float64(8), int64(1), object(9)
memory usage: 2.9+ MB

The blocks above load the film dataset into a DataFrame and convert numeric and date fields into usable formats. I also filtered the data to include only films released after 1920 with more than 5,000 IMDb votes, the same criteria used in my preliminary analysis, to remove obscure or very early non-commercial films. This cleaned dataset is used for the visualizations, statistical tests, and clustering below.
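
One caveat worth noting: dropna() only removes true missing values, and TMDB frequently records an unknown budget or revenue as 0 (Ariel above shows 0.0 for both). A quick check like the sketch below, not part of the original R criteria and not run as a numbered cell here, would show how many placeholder zeros survive the filter; the visualization step later on excludes them explicitly.

# Counting placeholder zeros that dropna() does not catch
zero_budget = (df_filtered['budget'] == 0).sum()
zero_revenue = (df_filtered['revenue'] == 0).sum()
print(f"Films with budget recorded as 0:  {zero_budget}")
print(f"Films with revenue recorded as 0: {zero_revenue}")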

Visualizations

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
sns.lineplot(
    x=df_filtered['release_date'].dt.year,
    y=df_filtered['revenue'],
    estimator='sum',
    errorbar=None
)
plt.title('Total Revenue Over Time')
plt.xlabel('Year')
plt.ylabel('Total Revenue (USD)')
plt.grid(True)
plt.show()
[Figure: Total Revenue Over Time (line plot of summed revenue by release year)]
In [5]:
plt.figure(figsize=(8,5))
sns.histplot(
    df_filtered['imdb_rating'],
    bins=20,
    kde=False,
    color='skyblue',
    edgecolor='black'
)
plt.title('Distribution of IMDb Ratings')
plt.xlabel('IMDb Rating')
plt.ylabel('Number of Films')
plt.show()
[Figure: Distribution of IMDb Ratings (histogram)]
In [6]:
plt.figure(figsize=(8,5))

# Scatter + trendline
sns.regplot(
    x='budget',
    y='revenue',
    data=df_filtered,
    scatter_kws={'alpha':0.5},
    line_kws={'color':'orange'}
)

plt.xscale('log')
plt.yscale('log')

plt.title('Budget vs Revenue')
plt.xlabel('Budget (USD, log scale)')
plt.ylabel('Revenue (USD, log scale)')
plt.grid(True, which='both', ls='--', lw=0.5)
plt.show()
[Figure: Budget vs Revenue, log-log scatterplot with trendline]

Correlation and Statistical Measures

In [7]:
# Selecting relevant numerical columns
corr_cols = ['budget', 'revenue', 'imdb_rating', 'imdb_votes']

# Computing correlation matrices
pearson_corr = df_filtered[corr_cols].corr(method='pearson')
spearman_corr = df_filtered[corr_cols].corr(method='spearman')
kendall_corr = df_filtered[corr_cols].corr(method='kendall')

# Displaying results
print("Pearson Correlation:\n", pearson_corr, "\n")
print("Spearman Correlation:\n", spearman_corr, "\n")
print("Kendall Tau Correlation:\n", kendall_corr)
Pearson Correlation:
                budget   revenue  imdb_rating  imdb_votes
budget       1.000000  0.740239    -0.022827    0.497069
revenue      0.740239  1.000000     0.085193    0.628362
imdb_rating -0.022827  0.085193     1.000000    0.223450
imdb_votes   0.497069  0.628362     0.223450    1.000000 

Spearman Correlation:
                budget   revenue  imdb_rating  imdb_votes
budget       1.000000  0.741602    -0.103092    0.583164
revenue      0.741602  1.000000    -0.000279    0.634760
imdb_rating -0.103092 -0.000279     1.000000    0.164288
imdb_votes   0.583164  0.634760     0.164288    1.000000 

Kendall Tau Correlation:
                budget   revenue  imdb_rating  imdb_votes
budget       1.000000  0.614371    -0.075183    0.433624
revenue      0.614371  1.000000     0.000258    0.481712
imdb_rating -0.075183  0.000258     1.000000    0.112860
imdb_votes   0.433624  0.481712     0.112860    1.000000

These results show that budget and revenue have a strong positive correlation (Pearson r ≈ 0.74), consistent with the scatterplot above: higher-budget movies tend to earn higher revenue.

Budget and revenue, however, show essentially no correlation with IMDb rating, so spending more on a film, or earning more from it, does not necessarily mean critical success will follow. The two measures of success appear largely independent of each other.
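
As a robustness check (a sketch on the same data, not run as a numbered cell here), the budget-revenue relationship can be re-measured on a log scale, restricted to films with strictly positive values so the logarithm is defined. Spearman and Kendall are rank-based and unchanged by a monotonic transform, so this mainly tests whether the Pearson value is driven by a few huge blockbusters.

import numpy as np

# Pearson correlation of log-transformed budget and revenue (positive values only)
pos = df_filtered[(df_filtered['budget'] > 0) & (df_filtered['revenue'] > 0)]
log_pearson = np.log10(pos[['budget', 'revenue']]).corr(method='pearson')
print("Pearson correlation on log10 scale:\n", log_pearson)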

In [8]:
from scipy import stats
import numpy as np

# Splitting films into above vs below mean IMDb rating
mean_rating = df_filtered['imdb_rating'].mean()

above_avg = df_filtered[df_filtered['imdb_rating'] > mean_rating]['revenue']
below_avg = df_filtered[df_filtered['imdb_rating'] <= mean_rating]['revenue']

# T-test (Welch's t-test, does not assume equal variances)
t_stat, t_pval = stats.ttest_ind(above_avg, below_avg, equal_var=False)
print(f"T-test:\n  t-statistic = {t_stat:.3f}, p-value = {t_pval:.3e}")

# Wilcoxon rank-sum test (non-parametric)
wilcox_stat, wilcox_pval = stats.ranksums(above_avg, below_avg)
print(f"Wilcoxon rank-sum test:\n  statistic = {wilcox_stat:.3f}, p-value = {wilcox_pval:.3e}")
T-test:
  t-statistic = 7.775, p-value = 7.951e-15
Wilcoxon rank-sum test:
  statistic = -1.057, p-value = 2.906e-01

Welch's t-test, which compares group means, suggests that highly rated films make more money. However, the Wilcoxon rank-sum test, which compares the full rank distributions and is therefore robust to outliers, finds no significant difference. A few blockbuster outliers appear to drive the difference in means; the typical film's revenue is largely independent of its IMDb rating.
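
To make the outlier effect concrete, the group means and medians can be compared directly; a minimal sketch using the above_avg and below_avg splits defined above (a large mean-to-median gap indicates heavy right skew from a few blockbusters):

# Mean vs. median revenue for each rating group
revenue_summary = pd.DataFrame({
    'mean_revenue':   [above_avg.mean(), below_avg.mean()],
    'median_revenue': [above_avg.median(), below_avg.median()]
}, index=['above-average rating', 'below-average rating'])
print(revenue_summary)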

Clustering

To reiterate, the hypothesis being tested is that big-budget, higher-earning films (cluster 1) will have more IMDb votes and higher IMDb ratings than lower-earning films (clusters 0 and 2).

In [9]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Selecting features and dropping missing values
features = ['budget', 'revenue', 'imdb_rating', 'imdb_votes']
df_cluster = df_filtered[features].dropna().copy()  # Make a safe copy

# Standardizing numeric values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_cluster)
In [10]:
# Run K-Means with 3 clusters
kmeans_3 = KMeans(n_clusters=3, random_state=42, n_init=10)
df_cluster['cluster_3'] = kmeans_3.fit_predict(X_scaled)

# Adding cluster labels back to original filtered DataFrame
df_filtered = df_filtered.copy()
df_filtered.loc[df_cluster.index, 'cluster_3'] = df_cluster['cluster_3']
df_filtered['cluster_3'] = df_filtered['cluster_3'].astype('category')
In [11]:
# Silhouette score and cluster info
sil_score_3 = silhouette_score(X_scaled, df_cluster['cluster_3'])
print(f"Silhouette Score for 3 clusters: {sil_score_3:.4f}")

print("\nCluster Sizes:")
print(df_cluster['cluster_3'].value_counts())

print("\nCluster Centers (original scale):")
centers_3 = scaler.inverse_transform(kmeans_3.cluster_centers_)
centers_3_df = pd.DataFrame(centers_3, columns=features)
print(centers_3_df)
Silhouette Score for 3 clusters: 0.3921

Cluster Sizes:
cluster_3
2    11371
0     6969
1      879
Name: count, dtype: int64

Cluster Centers (original scale):
         budget       revenue  imdb_rating     imdb_votes
0  1.306817e+07  2.088164e+07     5.407644   27031.331278
1  1.144209e+08  4.406570e+08     7.037543  500213.948805
2  6.583830e+06  1.746079e+07     7.139623   44286.039676
In [12]:
# Visualizing clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df_filtered,
    x='budget',
    y='revenue',
    hue='cluster_3',
    palette='tab10',
    size='imdb_votes',
    sizes=(20, 200),
    alpha=0.6
)

plt.xscale('log')
plt.yscale('log')
plt.title("K-Means Clustering of Movies (3 Clusters)\nBudget vs Revenue (Log Scale)")
plt.xlabel("Budget (log scale)")
plt.ylabel("Revenue (log scale)")
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
[Figure: K-Means Clustering of Movies (3 Clusters), Budget vs Revenue on log scales]

Most films sit in roughly the 10^5 to 10^7 range for both budget and revenue, but a handful of low outliers (budgets or revenues near the bottom of the scale) stretch the axes and squeeze the main cloud of points into the upper-right corner of the plot. I filter the data slightly further to account for this.

In [13]:
# Keeping only movies within a reasonable budget and revenue range
budget_min, budget_max = 1e3, 1e9
revenue_min, revenue_max = 1e3, 1e9

df_viz = df_filtered[
    (df_filtered['budget'] >= budget_min) & (df_filtered['budget'] <= budget_max) &
    (df_filtered['revenue'] >= revenue_min) & (df_filtered['revenue'] <= revenue_max)
].copy()
In [14]:
plt.figure(figsize=(10, 6))

sns.scatterplot(
    data=df_viz,
    x='budget',
    y='revenue',
    hue='cluster_3',
    palette='tab10',
    size='imdb_votes',
    sizes=(20, 200),
    alpha=0.6
)

for i, row in centers_3_df.iterrows():
    plt.scatter(
        row['budget'], 
        row['revenue'], 
        marker='X',           # X marker for centroid
        s=250,                # marker size
        c='black',            # color
        edgecolor='white',    # optional: border for visibility
        label=f'Center {i}'
    )

plt.xscale('log')
plt.yscale('log')
plt.title("K-Means Clustering of Movies with Cluster Centers\nBudget vs Revenue (Log Scale)")
plt.xlabel("Budget (log)")
plt.ylabel("Revenue (log)")
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
[Figure: K-Means Clustering of Movies with Cluster Centers, Budget vs Revenue on log scales]

This clustering reveals three groups:

  • Cluster 0: Low-to-mid budget, lower-rated films
  • Cluster 1: High-budget blockbusters
  • Cluster 2: Low-budget, critically acclaimed films

Most movies are modest in both budget and performance, while blockbusters are rare but dominate financially. That said, smaller films can still achieve critical success. The silhouette score of ~0.39 indicates only moderate separation: the clusters overlap, but they still reflect recognizable industry patterns.
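
Because budgets, revenues, and vote counts are heavily right-skewed, per-cluster medians are a useful complement to the centroid means printed above: they describe the typical film in each cluster rather than one dragged upward by outliers. A minimal sketch (not run as a numbered cell) using the df_cluster frame built earlier:

# Median of each feature within each cluster (less outlier-sensitive than the centroids)
cluster_medians = df_cluster.groupby('cluster_3')[features].median()
print(cluster_medians)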

In [15]:
cluster_summary = df_viz.groupby('cluster_3', observed=False)[['imdb_votes']].mean()
cluster_summary
Out[15]:
imdb_votes
cluster_3
0.0 43018.768414
1.0 486196.318293
2.0 83840.986545
In [16]:
cluster_summary.plot(kind='bar', figsize=(10,6))
plt.title('Cluster Comparison: Average IMDb Votes')
plt.xlabel('Cluster')
plt.ylabel('Average IMDb Votes')
plt.tight_layout()
plt.show()
[Figure: bar chart of average IMDb votes per cluster]

This graph confirms part of the alternative hypothesis: films in the high-revenue blockbuster cluster have, on average, a far higher number of IMDb votes than films in the lower-revenue clusters.
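
To go beyond reading the bar chart, the vote-count differences between clusters could also be tested formally. Below is an illustrative sketch (not a cell that was run) using SciPy's Kruskal-Wallis test, a rank-based analogue of ANOVA that fits the non-parametric approach used earlier.

from scipy.stats import kruskal

# Comparing imdb_votes distributions across the three clusters
votes_by_cluster = [
    grp['imdb_votes'].values
    for _, grp in df_viz.groupby('cluster_3', observed=True)
]
h_stat, h_pval = kruskal(*votes_by_cluster)
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {h_pval:.3e}")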

In [17]:
from scipy.stats import pearsonr

for cluster in df_filtered['cluster_3'].unique():
    subset = df_filtered[df_filtered['cluster_3'] == cluster]
    r, p = pearsonr(subset['imdb_rating'], subset['revenue'])
    print(f"Cluster {cluster}: Rating vs Revenue correlation = {r:.3f}, p = {p:.4f}")
Cluster 2.0: Rating vs Revenue correlation = -0.048, p = 0.0000
Cluster 1.0: Rating vs Revenue correlation = 0.077, p = 0.0232
Cluster 0.0: Rating vs Revenue correlation = 0.133, p = 0.0000
  • The clustering results partially align with the alternative hypothesis, but overall the null hypothesis cannot be rejected: while the correlation between financial success and visibility/popularity is strong, the correlation between financial success and critical success is weak. This shows that strong critical acclaim can occur independently of revenue.
  • K-Means is suitable for detecting broad groupings based on numeric features, especially since these variables are continuous and were standardized before clustering. However, K-Means assumes roughly spherical clusters of similar variance, which may not match the skewed distribution of movie budgets and revenues. The moderate silhouette score, ~0.39, suggests overlap between clusters and indicates some limitations in separation.
  • Some potential improvements are:
    • Log-transform skewed features, like budget and revenue, to reduce the influence of extreme outliers and improve cluster shape (a rough sketch follows just after this list).
    • Try alternative clustering methods, such as DBSCAN or Gaussian mixture models, which can capture non-spherical clusters and better handle outliers.
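
As a rough illustration of the first improvement (a sketch only, reusing the features list, df_cluster frame, and scikit-learn imports from the clustering cells above), the monetary features can be log-transformed before standardizing and re-clustering; the resulting silhouette score gives one point of comparison against the ~0.39 obtained earlier.

# Log-transforming the skewed monetary features before standardizing and re-clustering
df_log = df_cluster[features].copy()
df_log[['budget', 'revenue']] = np.log1p(df_log[['budget', 'revenue']])  # log1p tolerates the zeros still present

X_log = StandardScaler().fit_transform(df_log)
kmeans_log = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_log = kmeans_log.fit_predict(X_log)
print(f"Silhouette score (log-transformed features): {silhouette_score(X_log, labels_log):.4f}")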

Overall, K-Means provides a useful, high-level segmentation of movies into typical performers, blockbusters, and critically acclaimed low-budget films, but further refinement could improve interpretability and better capture nuanced patterns in the dataset.