Understanding How Customers Feel About British Airways Flight Experience: A Topic Modelling and Sentiment Analysis Approach

Author: Simontagbor

Last updated: 2024-01-16



This project was completed as part of a job simulation as a junior data scientist at British Airways. I found the project very interesting and decided to share it with the community. I hope you find it useful, and please feel free to reach out to me if you have any questions or comments. The source code for this project can be found on my GitHub page.

Introduction

Based on online reviews collected from Airlinequality.com, I performed topic modelling and sentiment analysis to understand British Airways customers' flight experiences. The project was done in Python with the following libraries: requests, BeautifulSoup, pandas, numpy, nltk, gensim, pyLDAvis, matplotlib, seaborn, wordcloud, textblob, vaderSentiment, scikit-learn, spacy, re, warnings, pickle, time, random, logging, datetime, IPython, jupyter, selenium, and webdriver.

Project Overview

For this project I completed the following tasks:

Task 1 - Data Retrieval

Using the requests and BeautifulSoup libraries, I built the URL for each paginated review page, then collected the text of every review on each page. Afterward, the review texts were saved to a .csv file.

Scraping Data From Skytrax

All reviews related to British Airways are listed on Skytrax at the base URL used in the code below.

Importing the libraries

Show the code
# package to handle web scraping
import requests
from bs4 import BeautifulSoup
# packages to manipulate data
import pandas as pd
import numpy as np

# packages for sentiment analysis
from textblob import TextBlob

# packages to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis
# import wordcloud to create wordclouds
from wordcloud import WordCloud

# packages to build Topic model
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# packages to clean text
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

Collecting the links to the review pages and retrieving each review

Show the code
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")
Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews

The code above builds the URL for each paginated review page, retrieves the text of every review on that page, and appends it to the reviews list.

Task 2 - Data Preprocessing

Create a DataFrame from the reviews list and save it to a local .csv file

df = pd.DataFrame()
df["reviews"] = reviews
df.head()
reviews
0 Not Verified | I was excited to fly BA as I'd ...
1 Not Verified | I just want to warn everyone o...
2 Not Verified | Paid for business class travell...
3 ✅ Trip Verified | The plane was extremely dir...
4 Not Verified | Overall journey wasn’t bad howe...

Save the DataFrame to a local .csv file

df.to_csv("../BritishAirways-data-science/data/BA_reviews.csv")

Finally, our raw data is saved to a local .csv file. This data will be used in the next task.
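As a quick sanity check (a minimal sketch; the path simply mirrors the save step above), the saved file can be reloaded to confirm that all the reviews were written:

# reload the saved reviews to confirm the write succeeded
df_check = pd.read_csv("../BritishAirways-data-science/data/BA_reviews.csv", index_col=0)
print(df_check.shape)  # expect (1000, 1) for 10 pages of 100 reviews
df_check.head()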

Cleaning The Review Data

After inspecting the data, I noticed that there was some unnecessary text in each row. For example, "✅ Trip Verified" can be removed from each review, as it is not relevant to what we want to investigate.

Show the code
#list unwanted data
unwanted_text = ["✅ Trip Verified ", "❎ Not Verified ","Not Verified ", "|   ", " |  ", "|  ", " | ", "| ", "|"]
# punctuation to remove
punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
#  create a list of stopwords
# nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
# add word rooter
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem

# process the file
def clean_review(review, bigrams=False, save_to_file=False):
    """Cleans a review by removing unwanted text, punctuation, and stopwords"""
    for text in unwanted_text:
        if text in review:
            # remove the unwanted text
            review = review.replace(text, "")
    
    # remove numbers
    review = re.sub('[0-9]+', '', review)
    # remove punctuation
    review = re.sub('[%s]' % re.escape(punctuation), '', review)
    # remove double spaces at beginning of review
    review = review.lstrip()
    
    # make all characters lowercase
    review = review.lower()

    if save_to_file:
        with open("data/cleaned_reviews.csv", "a") as f:
            f.write(review + "\n")
    # tokenize the text and remove stop words
    review_token_list = [word for word in review.split(" ") if word not in stop_words]

    # Apply word rooter
    # review_token_list = [word_rooter(word) for word in review_token_list]

    if bigrams:
        review_token_list = review_token_list + [review_token_list[i] + " " + review_token_list[i+1] for i in range(len(review_token_list)-1)]
    
    # join tokens together
    review = " ".join(review_token_list)
    return review

The clean_review function takes in a review string and returns a cleaned string: the unwanted marker text, numbers, and punctuation are removed, the text is lowercased and stripped of leading whitespace, stopwords are filtered out, and, if bigrams=True, adjacent word pairs are appended to the token list.
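As a quick check, we can run the function on a made-up review string (the exact output depends on the NLTK stopword list installed):

# run the cleaner on a fabricated example review
sample = "✅ Trip Verified |  The food was great! Would fly BA again."
print(clean_review(sample))
# prints something like: food great fly ba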

Update the DataFrame with the cleaned reviews

# Create a new column in the DataFrame with the cleaned text
df["cleaned_reviews"] = df["reviews"].apply(clean_review)
df.head(5)
reviews cleaned_reviews
0 Not Verified | Extremely rude ground service.... extremely rude ground service nonrev flying lo...
1 ✅ Trip Verified | My son and I flew to Geneva... son flew geneva last sunday skiing holiday les...
2 ✅ Trip Verified | For the price paid (bought ... price paid bought sale decent experience altho...
3 ✅ Trip Verified | Flight left on time and arr... flight left time arrived half hour earlier sch...
4 ✅ Trip Verified | Very Poor Business class pr... poor business class product ba even close airl...

Task 3 - Conducting Topic Modelling

After cleaning the data, we can conduct topic modelling on the data.

Topic modelling is a technique that allows us to extract the main topics from a corpus of text. In this case, we will extract the main topics from the reviews that we have collected.

Retrieve Main Topics

I explored various approaches for conducting topic modelling on text data and, for this project, opted for the Latent Dirichlet Allocation (LDA) algorithm to uncover the primary themes within the reviews.

The decision to use LDA is rooted in its adherence to two fundamental assumptions:

  1. Semantic Similarity: LDA assumes that documents with similar words are likely to share the same underlying topic. This aligns with the nature of reviews, where common vocabulary often indicates a shared focus.
  2. Word Co-occurrence: LDA assumes that words frequently occurring together in documents signify a common theme. In the context of reviews, this implies that recurring groups of words are indicative of specific topics, such as food, seating, or staff.

In our case, a “document” corresponds to a review, and a “topic” represents the main theme of a review (e.g., food quality, seating comfort, staff behavior).

By applying the LDA algorithm, I aim to extract and categorize the main topics inherent in the reviews. Each review will be assigned to a specific topic. The resulting topics will be stored in a .csv file, facilitating further analysis. This selection was driven by LDA’s ability to capture semantic relationships and word co-occurrence patterns, making it a suitable choice for uncovering latent themes in textual data.
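As a rough sketch of that per-review assignment (using the lda_model and term-frequency matrix tf built in the steps below; the output filename is hypothetical), the dominant topic for each review could be recorded like this:

# doc_topics[i, k] = probability that review i belongs to topic k
doc_topics = lda_model.transform(tf)
# label each review with its most probable topic
df["topic"] = np.argmax(doc_topics, axis=1)
# save the labelled reviews for further analysis (hypothetical filename)
df.to_csv("../BritishAirways-data-science/data/BA_reviews_with_topics.csv")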

Vectorize The Reviews

Before we can use the LDA algorithm, we need to vectorize the reviews. Vectorization is the process of converting text data to numerical data. The LDA algorithm can only work with numerical data.

For example, with a vocabulary of 10 words, the review "The food was great" might be converted to a vector of counts such as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].

Each element of the vector corresponds to one word in the overall vocabulary, not just the words in this review. A 1 means the corresponding vocabulary word (here, "food") occurs once in the review; a 0 means that vocabulary word does not appear in it. (With stop words removed, "great" would receive a count of 1 in its own position as well.)
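To make this concrete, here is a tiny, self-contained sketch of the same idea on two made-up reviews (toy data, not the scraped set):

from sklearn.feature_extraction.text import CountVectorizer

# two fabricated reviews to illustrate vectorization
toy_reviews = ["the food was great",
               "the seat was uncomfortable and the food was cold"]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

print(toy_vectorizer.get_feature_names_out())
# ['cold' 'food' 'great' 'seat' 'uncomfortable']
print(toy_counts)
# [[0 1 1 0 0]
#  [1 1 0 1 1]]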


# create the transform and fit it on the cleaned reviews
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
tf = vectorizer.fit_transform(df['cleaned_reviews']).toarray()
# get the feature names
tf_feature_names = vectorizer.get_feature_names_out()

# create document term matrix
df_dtm = pd.DataFrame(tf, columns=tf_feature_names)
term_frequency = df_dtm.sum(axis=0).tolist()
doc_lengths = df_dtm.sum(axis=1).tolist()

The vectorizer creates a document-term matrix from the reviews: a matrix that records how often each term occurs in each document. Here, each row corresponds to a review and each column to a word or bigram from the vocabulary.

The tf variable holds this matrix of term counts, and tf_feature_names is the list of vocabulary terms that label its columns.

The term_frequency variable is a list of the total number of times each term appears across all the reviews.

The doc_lengths variable is a list of the total number of terms in each review, i.e. the sum of each row of the document-term matrix.
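For a quick look at the most common terms (a minimal sketch built on the variables defined above), the frequencies can be ranked with pandas:

# rank the vocabulary by total frequency and show the 20 most common terms
top_terms = pd.Series(term_frequency, index=tf_feature_names).nlargest(20)
print(top_terms)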

Create LDA model object

The lda_model variable holds the LDA model object, which will be used to extract the main topics from the reviews.

number_of_topics = 10
try:
    # create the model
    lda_model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
    # fit the model on the term frequency matrix
    lda_model.fit(tf)
    print("Model created Succesfully")
except Exception as e:
    print("An error occurred: ", e)
Model created successfully

Visualising the Topics

Before building an interactive view, it helps to inspect the topics directly. The function below records the most heavily weighted words in each topic; the table shows the top words and their weights for each of the ten topics extracted from the 1,000 reviews.

Top five (5) words per topic from 1,000 reviews

# Create table to display topics
def display_topics(model, feature_names, no_top_words):
    """records the top words in each topic"""
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)] = ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)] = ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

df_topic_words = display_topics(lda_model, tf_feature_names, 10)
df_topic_words.head(5)
Topic 0 words Topic 0 weights Topic 1 words Topic 1 weights Topic 2 words Topic 2 weights Topic 3 words Topic 3 weights Topic 4 words Topic 4 weights Topic 5 words Topic 5 weights Topic 6 words Topic 6 weights Topic 7 words Topic 7 weights Topic 8 words Topic 8 weights Topic 9 words Topic 9 weights
0 ba 68.4 flight 468.1 flight 121.2 flight 200.3 flight 320.9 ba 64.1 flight 182.6 flight 244.6 ba 145.4 flight 103.3
1 flight 58.0 ba 191.2 ba 84.0 ba 117.2 ba 141.0 flight 60.5 ba 103.1 ba 194.6 flight 142.3 ba 96.0
2 staff 38.7 service 143.1 service 48.6 british 71.6 service 122.1 business 38.9 service 90.5 service 85.8 class 74.5 service 64.7
3 luggage 32.3 time 110.2 time 42.8 airways 69.2 good 95.6 service 35.1 seat 78.4 time 80.7 food 67.2 airways 47.9
4 london 27.1 good 104.8 british 37.0 london 68.3 food 83.4 class 34.1 seats 66.5 told 71.4 business 59.8 british 46.9

Show Topic Distribution

To show the top topic distributions for all the reviews, I used PyLDAvis to create an interactive visualization of the topics. The visualization below shows the topics and the most relevant words in each topic. The size of the bubbles represents the importance of the topics relative to each other. The closer the bubbles are to each other, the more similar the topics are.

# compute the document-topic distribution
import warnings

# ignore warnings raised while preparing the visualization
warnings.filterwarnings('ignore')

# pyLDAvis expects each topic's word weights to be a probability distribution,
# so normalise the raw component weights to sum to 1 per topic
topic_term_dists = lda_model.components_ / lda_model.components_.sum(axis=1)[:, np.newaxis]
topic_distribution = lda_model.transform(df_dtm)

panel = pyLDAvis.prepare(topic_term_dists, topic_distribution, doc_lengths, tf_feature_names, term_frequency, mds='tsne')
pyLDAvis.display(panel)

The distribution shown above highlights the following: