Extraction and Analysis of Star Wars Characters Network

Stefan
11 min read · Feb 12, 2021

“A long time ago in a galaxy far, far away…” If you are a Star Wars fan, you will recognize the phrase that opens most Star Wars films. Star Wars is popular around the world, and the franchise has kept growing over the years with new movies and TV series. Each new release opens the door for new characters to enter the Star Wars universe. Most of us know Luke Skywalker, Han Solo, Princess Leia, and the darkest of them all, Darth Vader, who are among the best-known Star Wars characters. Have you ever wondered how many times a character speaks in a film, or how many times one character mentions another by name? In this article, I extract data from a website that hosts a full transcript of Star Wars Episode V: The Empire Strikes Back and use it to build a network graph of communications between characters. I also explain why the nodes in that graph are significant. Finally, I apply two concepts to the graph: the PageRank algorithm and betweenness centrality.

I chose this data to help other students and readers get a better understanding of network graphs and related concepts. It also helps readers follow the conversations between characters and recognize quotes from the movie.

Steps to Extract Data From the Web, Manipulate the Data, and Create a Graph Network Using NetworkX

Step 1: Import Libraries

For this assignment I mostly used PyCharm to implement my program, install the necessary libraries, and debug any issues that occurred while running it.

To begin coding, I needed to import all the libraries required for this assignment. I will concentrate on NetworkX because it is the main component. NetworkX lets me create the nodes and edges of a network graph and then apply the algorithms and functions needed for further analysis.

Here is the list of imports that I used:

Step 2: Extract Data

To get my data, I use the request() method to make an HTTP request. BeautifulSoup then lets me parse the web content, using find() to select everything inside the element whose id is mw-content-text. I also used the Firefox inspector to find the best spot to start extracting. My goal is to get only the transcript of the characters' dialogue, leaving out unnecessary content such as the header and footer.
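A minimal sketch of the find() step, assuming the dialogue sits in <p> tags inside the mw-content-text element. The HTML string is a made-up stand-in for the downloaded page (only the id comes from the article), so no network request is made here.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the downloaded page; the real transcript page is
# far larger. Only the id "mw-content-text" is taken from the article.
SAMPLE_HTML = """
<html><body>
  <div id="header">Site navigation</div>
  <div id="mw-content-text">
    <p>LUKE (into comlink) Echo Three to Echo Seven. Han, old buddy, do you read me?</p>
    <p>LEIA Han, we need you!</p>
  </div>
  <div id="footer">Footer links</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first element matching the given id.
content = soup.find(id="mw-content-text")

# Keep only the paragraph text inside that element (assumed layout).
paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]

for text in paragraphs:
    print(text)
```

Because the search starts from the mw-content-text element, the header and footer text never reach the paragraphs list.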

Step 3: Manipulate Data

I created an array of popular characters whose conversations with each other I wanted to show. Each entry in the transcript starts with the character's name followed by their line, so I wrote an algorithm to count the total number of lines spoken by each character.

arr = np.array(["LUKE","HAN","VADER", "LEIA",  "THREEPIO", "YODA", "LANDO","BEN","PIETT","BOBA FETT","NEEDA","Chewie","Artoo","Princess Leia"])

Once I had all the lines spoken by each character, I removed the leading character name so that only the spoken line was left. I used a for loop to go through all the paragraphs. Inside the loop, an if/else statement checks whether the first word of the paragraph is a character name. A second if/else statement then checks whether a specific character (for example, Han) is mentioned in the line and whether the speaker is someone other than the character being searched for. This algorithm collects the key information, such as the array of characters who mention Han Solo.
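The loop described above could look roughly like the sketch below. It is an illustration under assumptions, not the exact code behind this article: the helper names split_speaker and speakers_mentioning are hypothetical, and the sample paragraphs are a handful of transcript lines plus one made-up HAN line to exercise the speaker check.

```python
import re

CHARACTERS = ["LUKE", "HAN", "VADER", "LEIA", "THREEPIO", "YODA", "LANDO",
              "BEN", "PIETT", "BOBA FETT", "NEEDA", "Chewie", "Artoo",
              "Princess Leia"]

def split_speaker(paragraph):
    """Return (speaker, line) if the paragraph starts with a known name, else None."""
    # Longest names first, so multi-word names are tried before shorter ones.
    for name in sorted(CHARACTERS, key=len, reverse=True):
        if paragraph.startswith(name + " "):
            return name, paragraph[len(name):].strip()
    return None

def speakers_mentioning(paragraphs, target):
    """List the speaker of every line that mentions target, excluding target."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)
    found = []
    for paragraph in paragraphs:
        parsed = split_speaker(paragraph)
        if parsed is None:
            continue  # does not start with a character name
        speaker, line = parsed
        if speaker.upper() != target.upper() and pattern.search(line):
            found.append(speaker)
    return found

sample_paragraphs = [
    "LUKE (into comlink) Echo Three to Echo Seven. Han, old buddy, do you read me?",
    "After a little static a familiar voice is heard.",
    "LEIA Han, we need you!",
    "HAN Han Solo, at your service.",  # made-up line: a speaker naming himself
    "THREEPIO It sounds like Han.",
]

print(speakers_mentioning(sample_paragraphs, "Han"))
```

The word-boundary pattern keeps words such as "than" or "hand" from counting as a mention of Han.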

Here is the total number of lines in which each character is mentioned by name by another character. Luke was referred to by the other characters 30 times, Han 29 times, Lando 21 times, Vader 17 times, and Leia 12 times.

The big question: how many times did each character mention Han? For this I used a dictionary to combine like terms from the array above and get a total per character.

The list below shows how many times each character mentions Han Solo's name throughout the film.

LUKE :: 4
LEIA :: 5
CHEWIE :: 12
THREEPIO :: 3
LANDO :: 5
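The tally above can be reproduced with a plain dictionary. This is a minimal sketch rather than the original code: the mentions list is typed in by hand from the 29 speaker labels in the example dialogue for Han, and names are upper-cased so that "Chewie" and "CHEWIE" count as the same key.

```python
# Speaker of each line that mentions Han, in transcript order
# (copied by hand from the example dialogue in this article).
mentions = (
    ["LUKE", "LEIA", "LEIA", "Chewie", "THREEPIO", "Chewie", "LEIA"]
    + ["Chewie"] * 7
    + ["LUKE", "Chewie", "LANDO", "LUKE", "LUKE", "LANDO", "LANDO", "Chewie",
       "LEIA", "LEIA", "THREEPIO", "LANDO", "THREEPIO", "Chewie", "LANDO"]
)

# Combine like terms: one dictionary key per speaker, counting occurrences.
totals = {}
for speaker in mentions:
    key = speaker.upper()
    totals[key] = totals.get(key, 0) + 1

for speaker, count in totals.items():
    print(f"{speaker} :: {count}")
```

collections.Counter from the standard library would do the same job in one line; the explicit loop is shown because it mirrors the "add like terms" idea.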

Here is an example of all the characters who mention Han Solo, with the corresponding dialogue. There should be a total of 29 entries. Unfortunately, some of them are narrator text from the transcript rather than spoken lines.

LUKE
(into comlink) Echo Three to Echo Seven. Han, old buddy, do you read me? After a little static a familiar voice is heard.

LEIA
Han!

LEIA
Han, we need you!

Chewie
is amused; he laughs in his manner. Han, enjoying himself, regards Chewie good-humoredly.

THREEPIO
(to Han and Leia) Oh! Wait for me!

Chewie
lets out a relieved shriek at seeing Han and Leia running toward the ship. The Wookiee runs out into the falling ice, lets out a howl, then runs up the ship’s ramp. Han and Leia run up the ramp after him, closely followed by Threepio.

LEIA
(over comlink) Han, get up here!

Chewie
barks in terror as a slightly smaller asteroid comes especially close — to close — and bounces off the Falcon with a loud crunch. Threepio’s hands cover his eyes. He manages a short peek at the cockpit window. Princess Leia sits stone-faced, staring at the action. Han gives her a quick look.

Chewie
barks “yes”. But Han thinks otherwise.

Chewie
brings his head back through the trap door in the ceiling and whines. Han glances back at Threepio, then speaks quietly to Chewie so only he can hear.

Chewie
barks through his face mask, and points toward the ship’s cockpit. A five-foot-long shape can be seen moving across the top of the Falcon. The leathery creature lets out a screech as Han blasts it with a laser bolt.

Chewie
barks and moves for the ship, followed closely by Leia and Han. The large wings of the Mynocks flap past them as they protect their faces and run up the platform.

Chewie
is very angry and starts to growl and bark at his friend and captain. Again, Han desperately pulls back on the throttle.

Chewie
barks over the intercom. Han quickly changes his readouts and stretches to look out the cockpit window.

LUKE
Han! Leia!

Chewie
growls as Han walks down the ramp. Lando and his men head across the bridge to meet the space pirate.

LANDO
Oh, not as well as I’d like. We’re a small outpost and not very self-sufficient. And I’ve had supply problems of every kind. I’ve had labor difficulties… (catches Han grinning at him) What’s so funny?

LUKE
But Han and Leia will die if I don’t.

LUKE
And sacrifice Han and Leia?

LANDO
I had no choice. They arrived right before you did. I’m sorry. HAN I’m sorry, too.

LANDO
That was never a condition of our agreement, nor was giving Han to this bounty hunter!

Chewie
helps Han to a platform and then turns as the door slides open revealing Leia. She, too, looks a little worse for wear. The troopers push her into the cell, and the door slides closed. She moves to Han, who is lying on the platform, and kneels next to him, gently stroking his head.

LEIA
What about Han?

LEIA
Do you think that after what you did to Han we’re going to trust you?

THREEPIO
It sounds like Han.

LANDO
There’s still a chance to save Han… I mean, at the East Platform…

THREEPIO
Turn around, you wooly…! (to Artoo) Hurry, hurry! We’re trying to save Han from the bounty hunter!

Chewie
works the controls as Leia sits in Han’s seat and Lando watches over their shoulders. As Chewie pulls back on the throttle, the ship begins to move.

LANDO
(into comlink) Princess, we’ll find Han. I promise.

Here is the full list of who mentions whom in their dialogue, with the number of mentions:

LUKE => Han :: 4
LEIA => Han :: 5
CHEWIE => Han :: 12
THREEPIO => Han :: 3
LANDO => Han :: 5
******************
THREEPIO => Luke :: 6
BEN => Luke :: 3
HAN => Luke :: 5
LEIA => Luke :: 5
Artoo => Luke :: 4
YODA => Luke :: 3
VADER => Luke :: 3
******************
THREEPIO => Leia :: 3
LANDO => Leia :: 4
LUKE => Leia :: 3
******************
PIETT => Vader :: 4
NEEDA => Vader :: 2
LUKE => Vader :: 2
YODA => Vader :: 2
LANDO => Vader :: 5
******************
HAN => Lando :: 9
LEIA => Lando :: 4
Chewie => Lando :: 6
******************

Step 4: Graphing Network

Nodes

Now that I have all the necessary information, I can start putting the data into a network graph. I used NetworkX as the main library. The first call I use is DiGraph(), which creates the directed graph that stores all the nodes and edges. Network graphs consist of nodes and edges. The nodes that I added to this graph are the characters who have mentioned another character's name. To add the characters, I used the add_node() method.
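A minimal sketch of the node step, assuming the 12 node names that appear in the betweenness results later in this article (with Piett spelled as in the character array):

```python
import networkx as nx

# The 12 characters used as nodes (names taken from this article's results).
characters = ["Han", "Chewie", "Threepio", "Ben", "Luke", "Leia",
              "Artoo", "Yoda", "Vader", "Lando", "Needa", "Piett"]

G = nx.DiGraph()  # directed graph: edges will have a direction
for name in characters:
    G.add_node(name)

print(G.number_of_nodes(), G.number_of_edges())
```

At this point the graph has nodes but no edges. G.add_nodes_from(characters) would add them all in one call.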

Edges

Each edge records one character mentioning another character's name, and I add the edges with the add_edge() method. For example, Han's name was mentioned by Luke, Leia, Threepio, Chewie, and Lando. Finally, I used the draw() method to draw the network graph.
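The edge step might be sketched as follows, using the "speaker => mentioned :: count" listing from Step 3 as weighted, directed edges. Treat this as an illustration under assumptions: the MENTIONS name, the draw_graph helper, and the layout settings are hypothetical choices, and the drawing function needs matplotlib, so it is defined but not called.

```python
import networkx as nx

# (speaker, mentioned, count) triples copied from the listing in Step 3.
MENTIONS = [
    ("Luke", "Han", 4), ("Leia", "Han", 5), ("Chewie", "Han", 12),
    ("Threepio", "Han", 3), ("Lando", "Han", 5),
    ("Threepio", "Luke", 6), ("Ben", "Luke", 3), ("Han", "Luke", 5),
    ("Leia", "Luke", 5), ("Artoo", "Luke", 4), ("Yoda", "Luke", 3),
    ("Vader", "Luke", 3),
    ("Threepio", "Leia", 3), ("Lando", "Leia", 4), ("Luke", "Leia", 3),
    ("Piett", "Vader", 4), ("Needa", "Vader", 2), ("Luke", "Vader", 2),
    ("Yoda", "Vader", 2), ("Lando", "Vader", 5),
    ("Han", "Lando", 9), ("Leia", "Lando", 4), ("Chewie", "Lando", 6),
]

G = nx.DiGraph()
for speaker, mentioned, count in MENTIONS:
    # The edge points from the speaker to the character being mentioned.
    G.add_edge(speaker, mentioned, weight=count)

def draw_graph(graph):
    """Draw the graph with its edge weights (requires matplotlib)."""
    import matplotlib.pyplot as plt
    pos = nx.spring_layout(graph, seed=1)
    nx.draw(graph, pos, with_labels=True, node_color="lightblue", arrows=True)
    labels = nx.get_edge_attributes(graph, "weight")
    nx.draw_networkx_edge_labels(graph, pos, edge_labels=labels)
    plt.show()

print(G.number_of_nodes(), G.number_of_edges())
```

add_edge() creates any node it has not seen yet, so the 12 characters appear in the graph even without separate add_node() calls.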

Step 5: Analyzing the Data

Here is my final network graph. The point of the graph is to show which characters mention another character's name and how many times they do so. As you can see, there are 12 nodes. The nodes represent the characters, and the edges represent the mentions between them. The number on each edge is called its weight, and it is the number of times one character mentions the other's name. You will notice arrows pointing in one direction; these are called directed edges. An arrow points to the character whose name was mentioned and starts from the character who mentioned it. There are also bidirectional connections, where two nodes point at each other. For example, look at Han and Lando toward the bottom right corner: Han mentions Lando's name 9 times, whereas Lando mentions Han's name 5 times.

The table below shows how many times a character (columns) mentions another character's name (rows). For example, Han mentioned Luke's name 5 times and Lando's name 9 times.

Step 6: Concepts (Page Rank and Betweenness)

PageRank is a way to rank web pages by their value and importance, based on their incoming links. Of course, this data has nothing to do with web pages, but we can still use the concept to show how it works between Star Wars characters because the graph is directed (see Figure 1). For this part I use 10 characters and the characters who mention their names. The formula starts with a matrix. The matrix is critical because it is what determines which node is the most important one.

For example, the first column (Han) represents Han's outgoing links to the other characters (see Figure 1). Three links branch off from Han, to Luke, Leia, and Lando, so in the matrix I enter 1/3 for each of those characters and 0 for the others, because there is no direct link to them. I do the same for the rest of the characters (Srivastava).

Once I have the matrix, I sum each row and divide by the number of columns, which is the same as multiplying the matrix by a starting vector that gives every character an equal rank. I then multiply the matrix by these new values to get updated ranks, and the most valuable node of them all turns out to be Han Solo.
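The matrix construction and repeated multiplication can be sketched in plain Python. Several things here are assumptions rather than the article's exact method: the links come from the Step 3 listing (which has no Han => Leia entry, so Han's column uses 1/2 instead of 1/3), the damping factor of 0.85 is the conventional PageRank default, and 50 iterations is an arbitrary cutoff. The scores it prints are therefore not expected to match the figures reported in this article.

```python
# Out-links among the ten characters, from the Step 3 listing (assumption:
# each "speaker => mentioned" entry is one unweighted link).
links = {
    "Han": ["Luke", "Lando"],
    "Chewie": ["Han", "Lando"],
    "Threepio": ["Han", "Luke", "Leia"],
    "Ben": ["Luke"],
    "Luke": ["Han", "Leia", "Vader"],
    "Leia": ["Han", "Luke", "Lando"],
    "Artoo": ["Luke"],
    "Yoda": ["Luke", "Vader"],
    "Vader": ["Luke"],
    "Lando": ["Han", "Leia", "Vader"],
}
names = list(links)
n = len(names)
index = {name: i for i, name in enumerate(names)}

# Column j holds 1/outdegree(j) in the rows of the characters that j mentions.
M = [[0.0] * n for _ in range(n)]
for source, targets in links.items():
    for target in targets:
        M[index[target]][index[source]] = 1.0 / len(targets)

def step(rank, damping):
    """One PageRank update: damped matrix-vector product plus a uniform share."""
    return [
        (1 - damping) / n
        + damping * sum(M[i][j] * rank[j] for j in range(n))
        for i in range(n)
    ]

rank = [1.0 / n] * n  # every character starts with an equal rank
for _ in range(50):   # arbitrary number of iterations for this sketch
    rank = step(rank, damping=0.85)

for name, score in sorted(zip(names, rank), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```

Calling step(rank, damping=1.0) once on the equal starting vector gives exactly the "sum each row and divide by the number of columns" values described above.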

[‘1:Han = 2’]
[‘2:Chewie = 1’]
[‘3:Threepio = 3’]
[‘4:Ben = 1’]
[‘5:Luke = 2’]
[‘6:Leia = 2’]
[‘7:Artoo = 1’]
[‘8:Yoda = 2’]
[‘9:Vader = 1’]
[‘10:Lando = 3’]

Betweenness centrality looks at the shortest paths between pairs of nodes and measures how often each node lies on them. NetworkX has a function that computes this for every node (Camp). As the results show, Luke and Han are the two main bridges between the other nodes. This is because so many of the other nodes are directed toward Luke or Han (see Figure 1).

{‘Han’: 0.09393939393939393, ‘Chewie’: 0.0, ‘Threepio’: 0.0, ‘Ben’: 0.0, ‘Luke’: 0.16363636363636364, ‘Leia’: 0.00303030303030303, ‘Artoo’: 0.0, ‘Yoda’: 0.0, ‘Vader’: 0.05757575757575758, ‘Lando’: 0.0, ‘Needa’: 0.0, ‘Peitt’: 0.0}
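A dictionary of this shape is what nx.betweenness_centrality() returns. The sketch below assumes an unweighted directed graph built from the Step 3 listing, which may not contain exactly the same edges as the graph in Figure 1, so the scores it prints can differ from the ones above.

```python
import networkx as nx

# Who mentions whom, from the Step 3 listing (assumed unweighted here).
links = {
    "Han": ["Luke", "Lando"],
    "Chewie": ["Han", "Lando"],
    "Threepio": ["Han", "Luke", "Leia"],
    "Ben": ["Luke"],
    "Luke": ["Han", "Leia", "Vader"],
    "Leia": ["Han", "Luke", "Lando"],
    "Artoo": ["Luke"],
    "Yoda": ["Luke", "Vader"],
    "Vader": ["Luke"],
    "Lando": ["Han", "Leia", "Vader"],
    "Needa": ["Vader"],
    "Piett": ["Vader"],
}

G = nx.DiGraph(links)  # a dict of lists is read as node -> successors

# Normalized by default: the share of shortest paths passing through each node.
centrality = nx.betweenness_centrality(G)

for name, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.4f}")
```

A character whom nobody mentions has no incoming edges, can never sit in the middle of a path, and so always scores 0.0.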

Bugs

When I first wrote a multi-nested for loop, I was grabbing the wrong data. I had to take it step by step, one loop at a time, to fetch the correct data. By doing a lot of debugging and inserting a lot of breakpoints, I was able to inspect each variable and see what data it was grabbing. In the end I was able to get each algorithm to produce the total number of mentions by each character.

Limitations

There were two significant limitations that I found during this assignment.

  1. The main content of the HTML included narrator text containing character names, which was unnecessary data for this assignment. Because of this narrator data, my totals were off by maybe 2 or 3.
  2. Characters sometimes refer to each other by nickname, such as Chewie rather than Chewbacca or Ben rather than Obi-Wan, and there are many different nicknames for each character. Sometimes a character is mentioned by last name only, such as Skywalker. Which one is meant? Luke? Anakin?
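One possible way to approach the nickname problem is an alias table that maps each nickname to a single canonical name and sets ambiguous names aside instead of guessing. This is only a sketch and was not part of the program described here; ALIASES, AMBIGUOUS, and resolve_mentions are hypothetical names, and the table covers just the examples mentioned above.

```python
import re

# Hypothetical alias table: lower-cased nickname -> canonical character name.
ALIASES = {
    "chewie": "Chewbacca",
    "chewbacca": "Chewbacca",
    "ben": "Obi-Wan",
    "obi-wan": "Obi-Wan",
}

# Names that could refer to more than one character are not resolved.
AMBIGUOUS = {"skywalker"}

def resolve_mentions(line):
    """Return (canonical names mentioned, ambiguous names left unresolved)."""
    resolved, unresolved = set(), set()
    # Words may contain hyphens so that "Obi-Wan" stays in one piece.
    for word in re.findall(r"[A-Za-z][A-Za-z-]*", line):
        key = word.lower()
        if key in ALIASES:
            resolved.add(ALIASES[key])
        elif key in AMBIGUOUS:
            unresolved.add(word)
    return resolved, unresolved

print(resolve_mentions("Chewie, get Ben and Skywalker!"))
```

Collecting the unresolved names separately makes it possible to review those lines by hand rather than assigning them to the wrong character.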

Conclusion

In conclusion, this assignment was very challenging but rewarding once I finally got the program to do what I needed. Overall, I showed how to extract data from a web page, implement several algorithms to find the Star Wars character names mentioned in other characters' lines, and apply network graph concepts using the NetworkX library. You can clearly see that Luke and Han were mentioned more than the rest of the cast, which is to be expected considering they are main characters. I'm curious what kind of results I could produce with the rest of the Star Wars trilogy. I was surprised by how well the program performed despite pulling in narrator text that added unnecessary data. This minor issue motivates me to look into what algorithm I could implement to remove that unwanted data.

References

  1. Quiñones, Vanessa Rivera. Graphs and PageRank in Python, faculty.math.illinois.edu/~riveraq2/teaching/simcamp16/PageRankwithPython.html.
  2. Camp, Data. NETWORK ANALYSIS IN PYTHON I, https://s3.amazonaws.com/assets.datacamp.com/production/course_3286/slides/ch2_slides.pdf
  3. Srivastava, Tavish. (April 12, 2015). PageRank explained in simple terms!. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2015/04/pagerank-explained-simple/