Understanding How Customers Feel About British Airways Flight Experience: A Topic Modelling and Sentiment Analysis Approach
Author: Simontagbor
Last updated: 2024-01-16
This project was completed as part of a job simulation as a junior data scientist at British Airways. I found this project very interesting and decided to share it with the community. I hope you find it useful. Please feel free to reach out to me if you have any questions or comments. The source code for this project can be found on my GitHub page.
Introduction
Based on online reviews collected from Airlinequality.com, I performed topic modeling and sentiment analysis to understand British Airways customers’ flight experiences. The project was done in Python with the following libraries: requests, BeautifulSoup, pandas, numpy, nltk, gensim, pyLDAvis, matplotlib, seaborn, wordcloud, textblob, vaderSentiment, scikit-learn, spacy, re, warnings, pickle, time, random, logging, datetime, IPython, jupyter, selenium, and webdriver.
Project Overview
For this project I completed the following tasks:
Retrieve 1000 online reviews about British Airways from Airlinequality.com
Clean and preprocess the data
Conduct data exploration and visualization
Perform topic modeling on the reviews
Perform sentiment analysis on the reviews
Task 1 - Data Retrieval
Using requests and BeautifulSoup libraries, I collected links to review pages, then collected text data on each review page. Afterward, the review texts were saved to a .csv file.
Scraping Data From Skytrax
This link shows all reviews related to British Airways.
Importing the libraries
# package to handle web scraping
import requests
from bs4 import BeautifulSoup

# packages to manipulate data
import pandas as pd
import numpy as np

# packages for sentiment analysis
from textblob import TextBlob

# packages to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis

# import wordcloud to create wordclouds
from wordcloud import WordCloud

# packages to build Topic model
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# packages to clean text
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
Collecting the links to the review pages and retrieving each review
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")
Scraping page 1
---> 100 total reviews
Scraping page 2
---> 200 total reviews
Scraping page 3
---> 300 total reviews
Scraping page 4
---> 400 total reviews
Scraping page 5
---> 500 total reviews
Scraping page 6
---> 600 total reviews
Scraping page 7
---> 700 total reviews
Scraping page 8
---> 800 total reviews
Scraping page 9
---> 900 total reviews
Scraping page 10
---> 1000 total reviews
The code above requests each of the 10 paginated listing pages (100 reviews per page) and extracts the text of every review found on it. Each review is appended to the reviews list.
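The extraction step can be tried without touching the network. The sketch below is a minimal illustration, not the project's exact code: the helper names build_page_url and extract_reviews and the sample_html fragment are made up for this example, while the URL pattern and the text_content class come from the loop above.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100):
    """Return the URL of one paginated listing page (hypothetical helper)."""
    return f"{BASE_URL}/page/{page}/?sortby=post_date%3ADesc&pagesize={page_size}"


def extract_reviews(html):
    """Return the text of every <div class="text_content"> in an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="text_content")]


# Made-up HTML fragment standing in for a downloaded listing page
sample_html = """
<div class="text_content">✅ Trip Verified | Flight left on time.</div>
<div class="other">navigation</div>
<div class="text_content">Not Verified | Seats were cramped.</div>
"""
sample_reviews = extract_reviews(sample_html)
print(build_page_url(1))
print(sample_reviews)
```

Keeping the parsing in its own function makes it easy to check the selector against a saved page before running the full 10-page loop.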
Task 2 - Data Preprocessing
Create a DataFrame from the reviews list and save it to a local CSV file
Finally, the raw data is saved to a local CSV file. This file is the input for the next task.
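The cell for this step is not shown in the post, so here is a minimal sketch of what it might look like. The column name reviews matches the DataFrame used later on; the helper save_reviews, the two example reviews, and the file name BA_reviews.csv are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def save_reviews(reviews, path):
    """Store raw review strings in a one-column DataFrame and write it to CSV."""
    df = pd.DataFrame({"reviews": reviews})
    df.to_csv(path, index=False)
    return df


# Two made-up reviews stand in for the scraped list
example_reviews = ["✅ Trip Verified | Great crew.", "Not Verified | Late departure."]

# Hypothetical file name, written to a temporary folder so the sketch is safe to run
example_path = os.path.join(tempfile.mkdtemp(), "BA_reviews.csv")
df_example = save_reviews(example_reviews, example_path)

# Reading the file back confirms that nothing was lost on the way to disk
reloaded = pd.read_csv(example_path)
print(reloaded.shape)
```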
Cleaning The Review Data
After inspecting the data, I noticed that there was some unnecessary text in each of the rows. For example, “✅ Trip Verified” can be removed from each row, as it is not relevant to what we want to investigate.
# list unwanted data
unwanted_text = ["✅ Trip Verified ", "❎ Not Verified ", "Not Verified ", "| ", " | ", "| ", " | ", "| ", "|"]

# punctuation to remove
punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

# create a list of stopwords
# nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

# add word rooter
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem

# process the file
def clean_review(review, bigrams=False, save_to_file=False):
    """Cleans a review by removing unwanted text, punctuation, and stopwords"""
    for text in unwanted_text:
        if text in review:
            # remove the unwanted text
            review = review.replace(text, "")
    # remove numbers
    review = re.sub('[0-9]+', '', review)
    # remove punctuation
    review = re.sub('[%s]' % re.escape(punctuation), '', review)
    # remove double spaces at beginning of review
    review = review.lstrip()
    # make all characters lowercase
    review = review.lower()
    if save_to_file:
        with open("data/cleaned_reviews.csv", "a") as f:
            f.write(review + "\n")
    # tokenize the text and remove stop words
    review_token_list = [word for word in review.split(" ") if word not in stop_words]
    # Apply word rooter
    # review_token_list = [word_rooter(word) for word in review_token_list]
    if bigrams:
        review_token_list = review_token_list + [review_token_list[i] + " " + review_token_list[i+1]
                                                 for i in range(len(review_token_list) - 1)]
    # join tokens together
    review = " ".join(review_token_list)
    return review
The clean_review function takes a string and returns a new string with the following changes:
It removes the “✅ Trip Verified” label.
It removes the “❎ Not Verified” and “Not Verified” labels.
It removes the “|” separator that follows these labels.
It removes punctuation and numbers.
It strips leading whitespace and converts the text to lowercase.
It removes stopwords: common words that carry little meaning for the analysis, such as “the”, “a”, “an”, and “and”.
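To see these steps on a single sentence, here is a simplified stand-in for clean_review that runs without NLTK. It is only a sketch: the stop word set is a tiny hand-picked sample rather than NLTK's English list, punctuation is removed with a generic regular expression, and the name clean_review_sketch is made up for this example.

```python
import re

# Tiny hand-picked stop word set (assumption); the post uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "and", "was", "on", "to", "of", "in"}
LABELS = ["✅ Trip Verified ", "Not Verified ", "|"]


def clean_review_sketch(review, bigrams=False):
    """Simplified version of the cleaning steps listed above."""
    for label in LABELS:
        review = review.replace(label, "")
    review = re.sub(r"[0-9]+", "", review)    # drop digits
    review = re.sub(r"[^\w\s]", "", review)   # drop punctuation
    tokens = [w for w in review.lower().split() if w not in STOP_WORDS]
    if bigrams:
        # append each pair of neighbouring tokens as one extra "word"
        tokens += [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return " ".join(tokens)


raw = "✅ Trip Verified | The food was great on flight 123!"
print(clean_review_sketch(raw))                # food great flight
print(clean_review_sketch(raw, bigrams=True))  # food great flight food great great flight
```

The bigrams option mirrors the flag of the same name in clean_review: it keeps the single words and adds every adjacent pair, which lets phrases such as “business class” show up as one term.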
Update the DataFrame with the cleaned reviews
# Create a new column in the DataFrame with the cleaned text
df["cleaned_reviews"] = df["reviews"].apply(clean_review)
df.head(5)
   reviews                                              cleaned_reviews
0  Not Verified | Extremely rude ground service....     extremely rude ground service nonrev flying lo...
1  ✅ Trip Verified | My son and I flew to Geneva...    son flew geneva last sunday skiing holiday les...
2  ✅ Trip Verified | For the price paid (bought ...    price paid bought sale decent experience altho...
3  ✅ Trip Verified | Flight left on time and arr...    flight left time arrived half hour earlier sch...
4  ✅ Trip Verified | Very Poor Business class pr...    poor business class product ba even close airl...
Task 3 - Conducting Topic Modelling
After cleaning the data, we can conduct topic modelling on the data.
Topic modelling is a technique that allows us to extract the main topics from a corpus of text. In this case, we will extract the main topics from the reviews that we have collected.
Retrieve Main Topics
I explored various approaches for conducting topic modeling on text data and, for this project, opted for the Latent Dirichlet Allocation (LDA) algorithm to uncover the primary themes within the reviews.
The decision to use LDA is rooted in its adherence to two fundamental assumptions:
Semantic Similarity: LDA assumes that documents with similar words are likely to share the same underlying topic. This aligns with the nature of reviews, where common vocabulary often indicates a shared focus.
Word Co-occurrence: LDA assumes that words frequently occurring together in documents signify a common theme. In the context of reviews, this implies that recurring groups of words are indicative of specific topics, such as food, seating, or staff.
In our case, a “document” corresponds to a review, and a “topic” represents the main theme of a review (e.g., food quality, seating comfort, staff behavior).
By applying the LDA algorithm, I aim to extract and categorize the main topics inherent in the reviews. Each review will be assigned to a specific topic. The resulting topics will be stored in a .csv file, facilitating further analysis. This selection was driven by LDA’s ability to capture semantic relationships and word co-occurrence patterns, making it a suitable choice for uncovering latent themes in textual data.
Vectorize The Reviews
Before we can use the LDA algorithm, we need to vectorize the reviews. Vectorization is the process of converting text data to numerical data. The LDA algorithm can only work with numerical data.
For example, the review "The food was great" could be converted to a vector of word counts such as [0, 0, 0, 1, 0, 0, 1, 0, 0, 0].
The vector has 10 elements, one for each word in the vocabulary built from all the reviews. A 1 marks a vocabulary word that appears once in this review (here “food” and “great”, since stop words such as “the” and “was” are dropped). A 0 marks a vocabulary word that does not appear in this review.
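The idea can be shown in a few lines of plain Python. This is a hand-rolled illustration of word-count vectors, not the CountVectorizer implementation; the three-review corpus and the name to_count_vector are made up for the example.

```python
from collections import Counter

# Three made-up, already cleaned reviews
corpus = ["food great", "seat uncomfortable", "food cold seat broken"]

# Vocabulary: every distinct word across all reviews, sorted alphabetically
vocabulary = sorted({word for doc in corpus for word in doc.split()})


def to_count_vector(doc):
    """Count how often each vocabulary word occurs in one review."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


print(vocabulary)                     # ['broken', 'cold', 'food', 'great', 'seat', 'uncomfortable']
print(to_count_vector("food great"))  # [0, 0, 1, 1, 0, 0]
```

Every review gets a vector of the same length, so the vectors can be stacked into one matrix with a row per review.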
# create the transform and fit it on the cleaned reviews
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
tf = vectorizer.fit_transform(df['cleaned_reviews']).toarray()

# get the feature names
tf_feature_names = vectorizer.get_feature_names_out()

# create document term matrix
df_dtm = pd.DataFrame(tf, columns=tf_feature_names)
term_frequency = df_dtm.sum(axis=0).tolist()
doc_lenghts = df_dtm.sum(axis=1).tolist()
The vectorizer will be used to create a document-term matrix from the reviews. The document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In this case, the document-term matrix will be a matrix of the frequency of words that occur in the reviews.
The tf variable is the document-term matrix as an array, with one row per review and one column per term. The tf_feature_names variable lists the terms (single words and two-word phrases) that those columns represent.
The term_frequency variable is a list of the total number of times each term occurs across all the reviews (the column sums of the matrix).
The doc_lenghts variable is a list of the length of each review, measured as the number of counted terms in it (the row sums of the matrix).
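A small made-up matrix makes the two sums concrete. The numbers, the column names, and the variable names below are invented for illustration; only the use of column sums and row sums follows the code above.

```python
import pandas as pd

# Made-up document-term matrix: 3 reviews (rows) by 4 terms (columns)
dtm = pd.DataFrame(
    [[2, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 4]],
    columns=["food", "great", "seat", "crew"],
)

term_totals = dtm.sum(axis=0)     # column sums: how often each term occurs overall
review_lengths = dtm.sum(axis=1)  # row sums: how many counted terms each review has

# Sorting the column sums gives the most frequent terms
top_terms = term_totals.sort_values(ascending=False).head(2)
print(top_terms)
print(review_lengths.tolist())
```

Sorting term_frequency in the same way is one route to a “top 20 words” list for the real data.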
Create LDA model object
The lda_model variable is an object that contains the LDA model. This object will be used to extract the main topics from the reviews.
number_of_topics = 10

try:
    # create the model
    lda_model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
    # fit the model on the term frequency matrix
    lda_model.fit(tf)
    print("Model created successfully")
except Exception as e:
    print("An error occurred: ", e)
Model created successfully
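LDA describes each review as a mixture of topics, so assigning a review to one topic usually means picking the topic with the largest share. The post does not show this step; the sketch below is one possible way to do it on a made-up four-review corpus, using the transform method of scikit-learn's LatentDirichletAllocation. The names toy_reviews, toy_lda, doc_topic, and dominant_topic are illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned reviews
toy_reviews = [
    "food cold meal bland food",
    "seat cramped legroom seat",
    "crew friendly helpful crew",
    "meal tasty food drinks",
]

counts = CountVectorizer().fit_transform(toy_reviews)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One row per review, one column per topic; each row sums to 1
doc_topic = toy_lda.transform(counts)

# Index of the topic with the largest share in each review
dominant_topic = doc_topic.argmax(axis=1)
print(np.round(doc_topic, 2))
print(dominant_topic)
```

On the real data, the same two lines applied to lda_model and tf would give one topic index per review, which could then be stored as a new DataFrame column and written to CSV.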
Visualising the Topics
To visualize the topics, I used the pyLDAvis library. The visualization below shows the topics and the most relevant words in each topic. The size of the bubbles represents the importance of the topics relative to each other. The closer the bubbles are to each other, the more similar the topics are.
Top five (5) words in each topic from 1000 reviews
# Create table to display topics
def display_topics(model, feature_names, no_top_words):
    """records the top words in each topic"""
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)] = ['{}'.format(feature_names[i])
                                                      for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)] = ['{:.1f}'.format(topic[i])
                                                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

df_topic_words = display_topics(lda_model, tf_feature_names, 10)
df_topic_words.head(5)
   Topic 0 words  Topic 0 weights  Topic 1 words  Topic 1 weights  Topic 2 words  Topic 2 weights  Topic 3 words  Topic 3 weights  Topic 4 words  Topic 4 weights
0  ba             68.4             flight         468.1            flight         121.2            flight         200.3            flight         320.9
1  flight         58.0             ba             191.2            ba             84.0             ba             117.2            ba             141.0
2  staff          38.7             service        143.1            service        48.6             british        71.6             service        122.1
3  luggage        32.3             time           110.2            time           42.8             airways        69.2             good           95.6
4  london         27.1             good           104.8            british        37.0             london         68.3             food           83.4

   Topic 5 words  Topic 5 weights  Topic 6 words  Topic 6 weights  Topic 7 words  Topic 7 weights  Topic 8 words  Topic 8 weights  Topic 9 words  Topic 9 weights
0  ba             64.1             flight         182.6            flight         244.6            ba             145.4            flight         103.3
1  flight         60.5             ba             103.1            ba             194.6            flight         142.3            ba             96.0
2  business       38.9             service        90.5             service        85.8             class          74.5             service        64.7
3  service        35.1             seat           78.4             time           80.7             food           67.2             airways        47.9
4  class          34.1             seats          66.5             told           71.4             business       59.8             british        46.9
Show Topic Distribution
To show the topic distribution across all the reviews, I used pyLDAvis to create an interactive visualization of the topics.
The distribution highlights the following:
Topic 1: People talked a lot about British Airways’ flights
Topic 2: People also talked a lot about London flights and services
Topic 3: People also talked about the food and drinks served on the flights
Topic 4: People also talked a lot about flight crew and services
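The cell that builds the interactive view is not included in the post. The sketch below shows one way it could be produced, assuming the generic pyLDAvis.prepare entry point, which takes topic-term distributions, document-topic distributions, document lengths, the vocabulary, and term frequencies. The helper names build_vis_inputs and show_topics and the toy data are made up; pyLDAvis is imported inside show_topics so the input-building part runs without it.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def build_vis_inputs(model, doc_term_matrix, feature_names):
    """Collect the five inputs that pyLDAvis.prepare expects."""
    doc_term_matrix = np.asarray(doc_term_matrix)
    # components_ holds unnormalised topic-term weights; scale each row to sum to 1
    topic_term = model.components_ / model.components_.sum(axis=1, keepdims=True)
    return {
        "topic_term_dists": topic_term,
        "doc_topic_dists": model.transform(doc_term_matrix),
        "doc_lengths": doc_term_matrix.sum(axis=1),
        "vocab": list(feature_names),
        "term_frequency": doc_term_matrix.sum(axis=0),
    }


def show_topics(model, doc_term_matrix, feature_names):
    """Render the interactive panel in a notebook (requires pyLDAvis)."""
    import pyLDAvis  # imported here so build_vis_inputs works without it
    panel = pyLDAvis.prepare(**build_vis_inputs(model, doc_term_matrix, feature_names))
    return pyLDAvis.display(panel)


# Made-up counts: 3 reviews by 3 terms
toy_counts = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 2]])
toy_model = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)
vis_inputs = build_vis_inputs(toy_model, toy_counts, ["food", "seat", "crew"])
print(vis_inputs["doc_lengths"], vis_inputs["term_frequency"])
```

With the variables from this post, the call would look like show_topics(lda_model, tf, tf_feature_names). Note that pyLDAvis numbers its topics from 1 and orders them by prevalence, so its Topic 1 is not necessarily scikit-learn's Topic 0.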
Task 4 - Conduct Sentiment Analysis on the Reviews Data
To gauge the sentiment of the reviews, I used the textblob and vaderSentiment libraries. The textblob library is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.
Prepare cleaned data for analysis
I called the clean_review function on each row of the DataFrame and saved the cleaned data to a CSV file.
The clean_review function is called with the save_to_file parameter set to True. This will save the cleaned data to a csv file for later use.
Define a function to conduct sentiment analysis on the reviews
For this analysis, I used the TextBlob package to compute a polarity score for each review. A review with a score above 0 is labelled positive, a score of exactly 0 is labelled neutral, and a score below 0 is labelled negative.
# import textblob for sentiment analysis
from textblob import TextBlob

def analyse_sentiment():
    """Label each review as positive, neutral or negative and store the labels in df"""
    try:
        # check if cleaned_reviews file exists
        with open("data/cleaned_reviews.csv", "r") as f:
            # process the file
            reviews = [line for line in f.readlines() if line.strip()]
        # get sentiment
        sentiments = [TextBlob(review).sentiment.polarity for review in reviews]
        sentiment_scores = ["positive" if sentiment > 0
                            else "neutral" if sentiment == 0
                            else "negative"
                            for sentiment in sentiments]
        # add sentiment to dataframe
        df["sentiment_score"] = sentiment_scores
        print("Sentiment analysis completed successfully")
    except FileNotFoundError:
        # create the file
        print("cleaned_reviews.csv not found: Creating file\n", "Please wait...")
        df["cleaned_reviews"] = df["reviews"].apply(lambda x: clean_review(x, save_to_file=True))
        print("File created successfully\n", "resuming sentiment analysis...")
        # restart the function
        analyse_sentiment()
    except Exception as e:
        print("An error occurred: ", e)
Conduct Analysis
# call the function
analyse_sentiment()
df.head(20)
df.to_csv("data/BA_sentiment_reviews.csv")
Sentiment analysis completed successfully
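Because clean_review appends to the CSV file, the number of lines in the file can drift away from the number of rows in df if the cleaning step is run more than once. As an alternative sketch (not the code used for the results here), the labels can be computed directly from the DataFrame column so that they always line up row by row. The names polarity_to_label, textblob_polarity, and label_reviews are made up; the scorer parameter is an addition that lets the function be tried with any scoring function.

```python
import pandas as pd


def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to the three labels used in this post."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def textblob_polarity(text):
    """Score one review with TextBlob (imported lazily; only needed for real scoring)."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def label_reviews(frame, column="cleaned_reviews", scorer=textblob_polarity):
    """Return a copy of frame with a sentiment_score column aligned row by row."""
    out = frame.copy()
    out["sentiment_score"] = out[column].map(scorer).map(polarity_to_label)
    return out
```

With the data from this post the call would be df = label_reviews(df), which uses TextBlob by default.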
Visualise the sentiment of the reviews using pie chart
I used the matplotlib package to render a pie chart visualisation of the sentiment of the reviews. The pie chart shows the percentage of positive, negative and neutral reviews.
# import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime

# Assuming df is your DataFrame and 'sentiment_score' is your sentiment column
sentiment_counts = df['sentiment_score'].value_counts().reset_index()

# Rename the columns for better understanding
sentiment_counts.columns = ['sentiment', 'count']

colors = ['#00205b', '#af1e2d', '#a1caf1']  # British Airways brand colors
explode = (0.1, 0, 0)  # explode 1st slice

# Define a custom autopct function to make the percentage labels bolder
def custom_autopct(pct):
    return ('%1.1f%%' % pct) if pct > 0 else ''

# Plot
plt.pie(sentiment_counts['count'], explode=explode, colors=colors,
        autopct=custom_autopct, shadow=True, startangle=140,
        textprops={'fontsize': 45, 'fontweight': 'bold',
                   'fontfamily': 'sans-serif', 'color': 'white'})

# make plot area big
fig = plt.gcf()
fig.set_size_inches(30, 15)
plt.axis('equal')

# Add a title with a bolder font
plt.title('How Customers Felt About British Airways Flights',
          fontdict={'fontsize': 55, 'fontweight': 'bold',
                    'fontfamily': 'sans-serif', 'color': '#00205b'})

# Add a legend
plt.legend(sentiment_counts['sentiment'], title="Categories", loc="lower right",
           fontsize=30, shadow=True,
           facecolor='white',  # Change background color
           edgecolor='black')  # Add an edge color

# Format the date and time
now = datetime.now()
formatted_now = now.strftime("%B %d, %Y %H:%M:%S %A")
side_note = (f'Note: This analysis is based on sentiment scores for 1000 reviews of '
             f'British Airways Services.\nOn this website[ Date:{formatted_now}]:\n'
             f'https://www.airlinequality.com/airline-reviews/british-airways')

# Add a side note
plt.text(-2.9, -1.5, side_note, style='italic', fontsize=19,
         bbox={'facecolor': 'red', 'alpha': 0.5, 'pad': 10})
plt.show()
The pie chart shows that the majority of the reviews are positive. This is a good sign for British Airways as it shows that the majority of their customers are happy with their service.
Analysing the Topics Associated with each type of sentiment
I used the wordcloud package to visualise the sentiment of the reviews. Using wordcloud allowed me to visualise the top words based on the sentiment of the reviews (positive, negative, neutral).
from random import choice
from wordcloud import WordCloud
from datetime import datetime

# Get today's date and time for sidenote
now = datetime.now()

# Format the date and time
formatted_now = now.strftime("%B %d, %Y %H:%M:%S %A")

# shorten line of code
side_note = (f'Note: This analysis is based on sentiment scores for 1000 reviews of '
             f'British Airways Services.\nOn this website[ Date:{formatted_now}]:\n'
             f'https://www.airlinequality.com/airline-reviews/british-airways')

# Define the brand colors of British Airways
colors = ['#00205b', '#af1e2d', '#a1caf1']

# Define a custom color function
def custom_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return choice(colors)

# retrieve reviews based on sentiment
positive_reviews = df[df['sentiment_score'] == 'positive']['cleaned_reviews']
negative_reviews = df[df['sentiment_score'] == 'negative']['cleaned_reviews']
neutral_reviews = df[df['sentiment_score'] == 'neutral']['cleaned_reviews']

# join all review sentimental words together
positive_words = ' '.join(positive_reviews)
negative_words = ' '.join(negative_reviews)
neutral_words = ' '.join(neutral_reviews)

# create wordclouds
positive_wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110,
                               color_func=custom_color_func).generate(positive_words)
negative_wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110,
                               color_func=custom_color_func).generate(negative_words)
neutral_wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110,
                              color_func=custom_color_func).generate(neutral_words)
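The code above builds the three word clouds but does not draw them. A minimal sketch of the display step is given below, assuming the usual approach of passing each cloud to matplotlib's imshow. The helper plot_wordclouds is a made-up name, and random images stand in for the WordCloud objects so the sketch runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_wordclouds(clouds):
    """Draw each image in `clouds` (a dict of title -> image) side by side."""
    fig, axes = plt.subplots(1, len(clouds), figsize=(8 * len(clouds), 5), squeeze=False)
    for ax, (title, cloud) in zip(axes[0], clouds.items()):
        ax.imshow(cloud, interpolation="bilinear")
        ax.set_title(title)
        ax.axis("off")  # word clouds have no meaningful axes
    return fig


# Random images stand in for the WordCloud objects (assumption for this sketch)
placeholders = {name: np.random.rand(50, 80, 3) for name in ("Positive", "Negative", "Neutral")}
demo_fig = plot_wordclouds(placeholders)
```

With the objects from this post, the call would be plot_wordclouds({"Positive": positive_wordcloud, "Negative": negative_wordcloud, "Neutral": neutral_wordcloud}), followed by plt.show().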
The wordcloud shows that the top words associated with neutral reviews are “ok”, “average”, “fine”, “okay”, “reasonable”, “standard”, “fair”, “normal”, “satisfactory”, “satisfied”, “decent”, “expected”, “u
Conclusion
The analysis shows that the majority of the reviews are positive.
However, the analysis of the reviews with negative sentiment suggests that British Airways should investigate the following areas further:
Flight services
Food and drinks
Flight crew
Business class
Further investigation into these areas will help British Airways to improve their services and increase customer satisfaction.