When asking 'what are movies made for?', the first answer that comes to mind is 'for entertainment!'. But movies are packed with diverse sources of information and are therefore rich objects of analysis. Often, only news articles or scientific publications are considered reliable sources of information, while cinema seems to belong to fiction: it does not convey reality but a representation of it. In this project, we focus on movies that portray History. Some try to represent historical events conscientiously, while others merely use them as a setting for their plots. Either way, by bringing the biggest events of the past century to the screen, these movies fuel the heritage of the world's memory, so that we never forget the events that shaped our existence. We are therefore interested in when and how movies have portrayed historical events. We will dive into the plot summaries to identify historical events and perform a multi-step analysis of how these events were handled over time, considering the genre of the movies, their plots, and the actors who played in them.
The data used for this project comes from the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenue, genre, and release date) and the character level (including gender and estimated age). Relying primarily on the plot summaries, the Gonios team identified important historical events that shaped our existence and assigned movies to each event based on the vocabulary found in their summaries. Specific features conveying each historical event were then analyzed to understand how these events are portrayed relative to one another, and how these features evolve over time.
Our dataset contains 81,741 movies, released between 1888 and 2016, with a median release year of 1985. Silent films predominated between 1903 and 1923, before being replaced by black-and-white movies by 1933.
Most of the movies are translated into multiple languages. Overall, 210 languages are represented, illustrating the diversity of cinema. English is the most common, with 46.7% of movies available in this language.
Many countries produce movies, and sometimes several of them are jointly involved in the process. Here, 146 countries are represented, with the United States at the top of the list.
The mean box office revenue is about 4.79 million dollars. This number actually varies between 10 thousand and 2.78 billion dollars across all movies and depends principally on the release year and the movie genre.
134,078 actors, women and men, are registered in our dataset, spanning 430 ethnicities, with Indians and African Americans being the most represented. The mean actor age at movie release is 37 years.
Movies can differ a lot in their runtime, which, like box office revenue, depends on the release year and the movie genre. In our dataset, the mean runtime is about 94.2 minutes.
To classify movies as depicting a particular event, we created ‘dictionaries’. These are lists of words that are specific to a historical event. These events are either defined by a finite period of time (e.g. World War I) or are movements that appeared at a certain time and then evolved in a continuous manner throughout the years and eventually reached an end.
The aim here is to parse the plot summary of each movie against these dictionaries in order to identify whether the movie is related to the event, either in its subject or in its setting. With that goal in mind, we chose events that seemed most recognizable in the summaries through their distinct lexical fields.
To associate a movie with an event, we first count the number of times each word in the dictionary occurs in its plot summary. We then assign the movie to the corresponding historical event if the word count exceeds a hand-picked threshold.
This threshold is designed to take the size of our dictionaries into account, in order to improve the specificity and sensitivity of our classification method. When a dictionary is large, more of its words can be found in each summary, leading to more summaries being classified under the corresponding historical event. To avoid this bias, we penalize larger dictionaries by applying a rescaled min-max normalization to their length; the resulting coefficient is added to a common baseline threshold. As a result, a movie's word count must reach a higher threshold for events with larger dictionaries.
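As an illustration, here is a minimal sketch of this classification rule. The baseline threshold, the normalization bounds and the tokenization used below are assumptions for the sake of the example, not our actual values:

```python
import re

# Hypothetical values: the actual baseline threshold and normalization bounds
# used in the project are not reproduced here.
BASELINE = 3
MIN_COEF, MAX_COEF = 0, 5

def thresholds(dictionaries):
    """Penalize larger dictionaries via a rescaled min-max normalization of their size."""
    sizes = {event: len(words) for event, words in dictionaries.items()}
    lo, hi = min(sizes.values()), max(sizes.values())
    span = (hi - lo) or 1  # avoid division by zero when all dictionaries have the same size
    return {event: BASELINE + MIN_COEF + (MAX_COEF - MIN_COEF) * (size - lo) / span
            for event, size in sizes.items()}

def classify(summary, dictionaries):
    """Assign a summary to every event whose dictionary word count exceeds its threshold."""
    tokens = re.findall(r"[a-z']+", summary.lower())
    th = thresholds(dictionaries)
    return [event
            for event, words in dictionaries.items()
            if sum(token in words for token in tokens) > th[event]]
```

Here `dictionaries` maps each event name to its set of (lowercased) dictionary words.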
We performed a sanity check by labelling a small portion of the data and analyzing the precision-recall curve and F1 scores of our dictionary-search technique. This enabled us to verify the accuracy of our classifier and to determine the optimal threshold value.
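A minimal sketch of what such a check can look like with scikit-learn; the labels and word counts below are hypothetical placeholder data, not our annotated sample:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labelled sample: hand labels (1 = portrays the event) and the raw
# dictionary word counts used as classification scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
word_counts = np.array([0, 1, 4, 2, 6, 3, 1, 5])

precision, recall, candidate_thresholds = precision_recall_curve(y_true, word_counts)

# F1 at each candidate threshold; the last precision/recall point has no threshold.
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = int(np.argmax(f1[:-1]))
print(f"best threshold: {candidate_thresholds[best]}, F1 = {f1[best]:.2f}")
```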
Evolution of the number of movies belonging to an event over the years
With the following subplots, we can investigate how historical events may have influenced the themes of the movies released and the interests of our societies.
For historical films narrating WWII, an important peak in frequency appears from 1940 to 1950, during the last few years of the war itself and the years following it. These are probably propaganda movies that were widely shown to the public during the war: the Nazis were known to use films to create an image of the "national community", whereas the Allied countries portrayed them as cruel enemies in their own movies.
Movies about the digital revolution and Industry 4.0 increased significantly from the 1990s onwards, the years that marked the beginning of a sequence of technological revolutions.
The number of movies telling the story of LGBTQ+ emancipation tripled from 1970 to 2000 and then doubled again in the following years. This coincides with the beginning of the LGBTQ+ movement and, later, the annual observance of LGBT History Month in the United States, which started in 1994.
For space race movies, a first peak appears between 1950 and the end of the 1970s, which could be due to the first satellites sent into orbit (1957), the first crewed space flight (1961) and Apollo 11's successful mission in 1969. The frequency of these movies then increases from the 1980s onwards, and a more pronounced peak appears between 2005 and 2010, the period when potentially habitable exoplanets were discovered.
Rock movies start appearing around 1930, which corresponds to the early origins of rock and roll music; the 1940-1950 period then corresponds to the emergence of rock and roll as a music genre in its own right.
The number of movies corresponding to the Nuclear historical period oscillates over time. The first occurrences of nuclear power in movies appear around 1950, shortly after the atomic bombings of Hiroshima and Nagasaki. The theme then evolved until 1986, where a second peak coincides with the Chernobyl disaster.
Lastly, movies about the Twin Towers, and more generally Islamist terrorism, mostly appear after 2005, in the years following the event, when the world was still recovering from the trauma and the US was actively fighting the Taliban in Afghanistan. However, we can observe some occurrences before 2001, which can be associated with movies about other terrorist attacks.
Finally, we saw that historical events and movies are correlated in time, so movie themes adapt to the socio-political environment.
Mean sentiment score by event
Next, we want to see the sentiment associated with the portrayed event. If we assume that the plot summary conveys the overall sentiment of the movie well, we can apply textual sentiment analysis to it. We used a popular Hugging Face sentiment analyser. The only problem is that these kinds of analysers do not work on long paragraphs. We therefore tokenized our summaries by sentence, fed each sentence to the sentiment analyser, and computed an average of the resulting grades weighted by the model's confidence score for each sentence. By doing so, we obtain a sentiment score for each movie, expressed as a number of stars ranging from 1 (bad) to 5 (good).
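A minimal sketch of this pipeline is shown below. Since the exact model is not named above, nlptown/bert-base-multilingual-uncased-sentiment is assumed here because it returns 1 to 5 star ratings:

```python
# Requires nltk.download("punkt") the first time for sentence tokenization.
from nltk.tokenize import sent_tokenize
from transformers import pipeline

analyser = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",  # assumed model
)

def summary_sentiment(summary: str) -> float:
    """Average the per-sentence star ratings, weighted by the model's confidence."""
    sentences = sent_tokenize(summary)
    results = analyser(sentences, truncation=True)
    stars = [int(r["label"].split()[0]) for r in results]  # labels look like "4 stars"
    weights = [r["score"] for r in results]                 # model confidence per sentence
    return sum(s * w for s, w in zip(stars, weights)) / sum(weights)
```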
The sentiment graph does not show any event portrayed exceptionally well or exceptionally poorly. All events fall around 3.5, slightly above the midpoint of 3. Segregation, LGBT, Rock and Comics may be slightly higher, but their uncertainty bars are too wide to draw any meaningful conclusions. This plot suggests that all events can be portrayed both positively and negatively, but on average are rated around 3.5 out of 5.
Matrix of cosine distance between lexical fields of events
Let's now analyse how the plot summaries are related to one another using NLP. The first and simplest approach is to analyse the plot summaries through a TF-IDF matrix.
To capture how different they are from each other, we construct the TF-IDF matrix only from summaries assigned to exactly one category. We also processed the summaries with standard NLP techniques to reduce the sparsity of the TF-IDF matrix, including Latent Semantic Analysis. The following heatmap shows the cosine distance between the median latent semantic representation of the summaries in each category. The idea is to compare the global lexical fields used in the plot summaries of each event.
Cosine distance ranges from 0 to 2 and measures how different our summary embeddings are: the larger the distance, the farther apart the summaries.
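As an illustration, here is a minimal sketch of this TF-IDF + LSA pipeline. The library choices and parameter values (such as the number of LSA components) are assumptions, not our exact settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_distances

def event_distance_matrix(summaries, labels, n_components=100):
    """Cosine distances between the median LSA representation of each event's summaries.

    `summaries` is a list of plot summaries (each assigned to exactly one event)
    and `labels` the corresponding event names.
    """
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(summaries)           # sparse TF-IDF matrix
    lsa = TruncatedSVD(n_components=n_components)
    Z = lsa.fit_transform(X)                     # dense latent semantic space
    events = sorted(set(labels))
    medians = np.vstack([
        np.median(Z[[i for i, l in enumerate(labels) if l == ev]], axis=0)
        for ev in events
    ])
    return events, cosine_distances(medians)     # values lie in [0, 2]
```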
The Numerical Revolution summaries stand out significantly from the other summaries. Some categories have more in common than others, such as Rock with Tech, AIDS or Comics, and Nuclear with Space. Although this analysis has potential, it may require heavier pre-processing before revealing anything conclusive; further clustering approaches could bring predominant topics to light. However, we will next explore a simpler and better tool to analyse the similarity between events.
Principal Component Analysis
Our technique to identify events should produce scores that are not strongly correlated with each other. This holds almost by construction: a WW2 movie should have a high WW2 score and a low score in all other categories. In practice, however, it does not necessarily hold, for two possible reasons:
The following graph aims to observe:
The tool we used is Principal Component Analysis (PCA), usually employed for dimensionality reduction. However, by taking movies as data points and the normalized event scores we defined earlier as features, we can interpret this method in a different way.
The following plot shows all the movies projected onto the three most significant principal components. The score axis of each event is also projected onto this space and scaled up to remain visible. What is interesting about applying PCA to data that is already not very correlated is that it highlights exactly the movies that interest us: movies that can portray several events. These movies help us understand the links between multiple events. Movies matching none of the events, or only one category, will lie respectively at the origin or along a single projected event axis. The structure built around the event scores is the thing to look out for, because it automatically groups events by direction according to their correlation.
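A minimal sketch of this projection is given below, assuming `scores` is a (movies x events) DataFrame of the normalized event scores; the name and the scaling factor for the event axes are illustrative:

```python
import pandas as pd
from sklearn.decomposition import PCA

def project_events(scores: pd.DataFrame, scale: float = 5.0):
    """Project movies (rows) and event-score axes (columns) onto the top 3 PCs."""
    pca = PCA(n_components=3)
    movie_coords = pca.fit_transform(scores.values)  # one 3D point per movie
    # Each event axis (a unit vector in feature space) expressed in PC coordinates,
    # scaled up so the arrows remain visible next to the point cloud.
    event_axes = pd.DataFrame(pca.components_.T * scale,
                              index=scores.columns,
                              columns=["PC1", "PC2", "PC3"])
    return movie_coords, event_axes
```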
The 3D PCA projection indeed highlights preferred directions towards the events themselves, as expected. Event axes that extend far from the origin indicate more spread, and this spread can be understood as the strength with which a movie portrays the event.
What is clearly more interesting are the groups of event scores pointing in the same direction. Features appearing in the same direction have a correlation link. This shows how the scores assigned to each summary translate into a relation between events. In practice, this arises from the fact that, for instance, plot summaries containing words from the WW1 dictionary also tend to contain words from the WW2 dictionary.
Finally, we can group the events by their shared themes. World War I and II fall in the same direction, representing war. Nuclear, Tower, Comics and Space are also close to each other, which makes sense given that Nuclear refers to explosions and violent acts, like tower terrorism; Comics often feature superheroes or villains with destructive powers; and Space movies are often unreal, like comics, while comics can also take place in space. AIDS, LGBT and Segregation all refer to some kind of minority, while Technology and Numerical Revolution show similarities in their concepts. Rock is the only event that stands apart from the rest.
This analysis helps us to understand how our developed technique and assigned scores are related to each other. It also highlights interesting links between events that we didn't expect. This can be used to better understand the underlying structure of the data and to improve our technique further.
LDA topic modeling
LDA is used to classify the text in a document into a particular topic. It does so by building a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions, which is where the name Latent Dirichlet Allocation comes from.
Since LDA detects the main topics in each plot summary, it could be interesting to output the list of keywords that classified a given plot into the dominant topic detected by the model.
If the topics detected correspond to the historical events used to select the movie plot summaries fed into the LDA, then the keywords used for topic detection should match each historical event's dictionary.
As the concept of LDA assumes, each document is made up of a mixture of different topics.
In our case, we choose 12 topics in total, one per historical event previously defined in our dictionaries.
Those topics then generate words based on their probability distribution.
Suppose a plot summary is made up of 70% topic 1, with the rest distributed over the other topics in various ways (0.2% topic 2, 0.05% topic 3, ...); we then want to find that dominant topic for each document, using the following function.
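The original function is not reproduced here; below is a minimal sketch of what such a dominant-topic extraction can look like, assuming a trained gensim LdaModel (`lda_model`) and a bag-of-words corpus (`corpus`), both hypothetical names:

```python
import pandas as pd
from gensim.models import LdaModel

def dominant_topics(lda_model: LdaModel, corpus) -> pd.DataFrame:
    """For each document, keep the topic with the highest probability and its keywords."""
    rows = []
    for doc_id, bow in enumerate(corpus):
        topic_probs = lda_model.get_document_topics(bow, minimum_probability=0.0)
        topic, prob = max(topic_probs, key=lambda tp: tp[1])       # dominant topic
        keywords = [word for word, _ in lda_model.show_topic(topic, topn=10)]
        rows.append({"doc_id": doc_id, "dominant_topic": topic,
                     "contribution": prob, "keywords": ", ".join(keywords)})
    return pd.DataFrame(rows)
```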
Here, the 12 dominant topics are plotted using PCA. Since the LDA model does not give a name to the group of words defining a topic, we need to interpret them and name them manually. Here are suggestions for the theme each topic represents:
Indeed, some of the topics detected by the LDA on these movies describe some of the historical events analyzed throughout this project.
For example, topic 4 contains vocabulary strongly relating to WWII such as "war", "world", "american", "peace", "soviet", "russian", "german", "united", "army", "destruction" or even "John" which might represent JFK or the famous WWII "Dear John" letters.
The nuclear event seems to emerge in topic 3, since it contains words such as "nuclear", "power", "world", "enemy", "bomb", "attack", "scientist", "destroy" and much more.
Space race and technology themes seem to have been combined in topic 5, showing words such as "www", "http", "moon", "martial" or "passengers".
However, topics 11 and 7 narrow down to space and extra-terrestrial themes, with vocabulary like "race", "launched", "pod", "android" (the robot with human appearance) or "gravity" for topic 11, and "earth", "planet", "space", "spaceship", "alien" or "orbit" for topic 7.
It seems that the classification of movies into historical events using manually defined dictionaries worked properly.
Conclusion
By creating dictionaries, we tried to see how historical events are represented in the world of cinema. First, we saw that most events appeared on screen for the first time as the historical event itself began. Their evolution in cinema over time followed the period of the historical event: WW2, for instance, is mostly represented in the 1940-1945 interval.
In contrast, LGBT movies continue to grow over time, as does the movement itself. We then saw that each dictionary was specific and well defined: the cosine distance matrix showed that the lexical fields of the events' summaries are distant from each other, confirming that the classified events are indeed distinct.
Lastly, the PCA and the LDA revealed similarities between our historical events, allowing us to assign them to one of three parent themes: sociological movements, war, and world disasters. Despite these connections, each historical event retained enough specificity to remain distinguishable.
Of course, our analysis remains sensitive to how we defined the dictionaries and thresholds. Even though we checked the accuracy score, other approaches are always possible; comparing our results with those of alternative classification methods would be interesting, to see how topic detection is affected.
In conclusion, we can state that, for any period of time, cinema is likely to be directly influenced by the socio-political context of that same period. For further research, one could investigate whether the perspective on an event shifts as its socio-political context evolves over time.
For an in-depth look at our work, please check the project's GitHub repository. There, you will have access to more plots, analyses and technical explanations of the methods we used.
This research has been conducted as part of the CS-401 Applied Data Analysis course, taught by Robert West, head of the dlab at École Polytechnique Fédérale de Lausanne.
The dataset of movie plot summaries and associated metadata was downloaded from the CMU Movie Summary Corpus, collected and provided by David Bamman, Brendan O'Connor, and Noah A. Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University. Original paper: David Bamman, Brendan O'Connor, and Noah A. Smith. Learning Latent Personas of Film Characters. ACL 2013, Sofia, Bulgaria, August 2013.