
Flight Price Prediction


Project Overview

This project is divided into multiple deliveries, each focusing on different aspects of data mining, from data selection and exploration to migration, modeling, and dashboard presentation.


Our Team

Member 1: Shadia Jafaar (Data Engineer)
Shadia Jafaar is a seventh-semester systems engineering student at Universidad del Norte who graduated from Biffi La Salle school in 2020. His interests center on big data management, cloud computing, and technology sales. He is familiar with several programming languages but focuses mainly on Python and SQL for data manipulation, and he is passionate about sports and learning new things. Within the team he works as a data engineer and is one of the people in charge of data manipulation and cleaning.

Member 2: Yuli Meza Barros (Data Engineer)
Yuli Meza Barros is a 22-year-old ninth-semester systems engineering student at Universidad del Norte who completed her studies at Nuestra Señora de Lourdes School in 2019. With a strong interest in cybersecurity, she is proficient in Java and Python, and her fascination with data analytics gives her a versatile skill set. Within the team, Yuli plays a vital role as a data engineer, specializing in meticulous data manipulation and cleaning. Her diverse interests and skills contribute to the team's dynamic and collaborative environment.

Member 3: David Meza (Backend Engineer)
David Meza, an eighth-semester student at Universidad del Norte who graduated from INEDINSA, has a keen interest in backend development, databases, and distributed systems, with a passion for crafting robust solutions and learning from existing ones. Outside of the tech realm, David enjoys expressing his creativity through playing the piano and has a particular fondness for languages.

Member 4: Julián Coll (Frontend Engineer)
Julián Coll, a seventh-semester systems engineering student at Universidad del Norte, graduated from I.E.T.C. Francisco Javier Cisneros in 2020 and has some experience in backend development and database management. He is proficient in Python, SQL, and relational databases, and is dedicated to designing efficient systems that meet project goals. He is responsible for the visual part of the project, emphasizing the user experience.

Delivery 1: Data Selection

Due Date: 16/08/2024

Dataset Chosen

The dataset selected for the “Flight Price Prediction” project comprises 300,261 datapoints and 11 features. This data was scraped from the “Ease My Trip” website, covering flight bookings between India’s top 6 metro cities over a 50-day period in early 2022.

Dataset Description

Features

Data Quality and Coverage


Justification

The chosen dataset is well-suited for analyzing and predicting flight prices due to its comprehensive nature and relevance to the project goals.

Potential Insights

  1. Price Variation by Airline: Analyze how different airlines set their prices, revealing competitive pricing strategies.
  2. Impact of Booking Time: Investigate how the number of days left until departure affects ticket prices.
  3. Effect of Departure and Arrival Times: Examine how different flight times impact prices.
  4. Class-Based Price Differences: Compare pricing between economy and business class tickets.

Expanded Methodology

  1. Data Preprocessing: Clean the dataset by addressing missing values, encoding categorical variables, and scaling continuous variables if necessary.
  2. Exploratory Data Analysis (EDA): Utilize statistical summaries and visualizations to understand feature distributions and relationships.
  3. Statistical Analysis: Perform correlation analysis and ANOVA tests to understand feature impacts on pricing.
  4. Predictive Modeling: Develop linear regression models and explore advanced techniques like decision trees and ensemble methods.
  5. Model Evaluation: Use cross-validation and performance metrics (MAE, RMSE) to assess and refine models.
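
As an illustration of the evaluation step above, here is a minimal sketch of computing MAE and RMSE with scikit-learn; the y_test and y_pred variables are hypothetical placeholders for the held-out prices and the model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# y_test: true prices from the held-out set; y_pred: model predictions (both hypothetical here).
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```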

Relevance

The dataset aligns closely with the project goals and research questions. Here’s how each feature addresses the research questions:

Examples of Similar Studies


Delivery 2: Exploratory Data Analysis and Data Cleaning

Due Date: 13/09/2024

Exploratory Data Analysis (EDA)

Data Overview

The dataset used for flight price prediction consists of 300,153 observations and 12 variables. Below is a summary of the dataset’s key details:

Variable Description:

Column Data Type Description
Unnamed: 0 int64 Index column (not used in analysis)
airline object Airline operating the flight
flight object Flight identification
source_city object City from which the flight departs
departure_time object Scheduled flight departure time
stops object Number of stops during the flight
arrival_time object Scheduled flight arrival time
destination_city object Destination city of the flight
class object Ticket class (e.g., economy, business)
duration float64 Duration of the flight in hours
days_left int64 Days left until flight departure
price int64 Flight ticket price (in local currency)

Data Types:


Exogenous Repositories

The following external data could enhance the accuracy of flight price predictions:

1. Tourism Demand or Seasonality Data

This data can help explain price variations based on flight destinations.

2. Fuel Price Data

Fuel price data reflects the operational costs of airlines.

Incorporating these additional datasets will allow for more accurate price predictions and a better understanding of the factors driving flight price variations.


Data Visualization

First, let’s look at the proportion of occurrences for some interesting categorical variables:

Departure and Arrival Times

Airlines and Ticket Class

Source and Destination Cities

Next, let’s examine the distribution of the numerical variables in our dataset:

Duration and Days Left Distribution

Price Distribution

Now, let’s check for outliers in the numerical columns:

Duration, Days Left, and Price Boxplots

We can observe that there are outliers in the duration and price columns.
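
For reference, a minimal sketch of how boxplots like these can be produced with matplotlib, assuming the dataset has been loaded into a Pandas dataframe named df with the column names listed above:

```python
import matplotlib.pyplot as plt

# One boxplot per numeric column makes the outliers easy to spot.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, column in zip(axes, ["duration", "days_left", "price"]):
    ax.boxplot(df[column])
    ax.set_title(column)
plt.tight_layout()
plt.show()
```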


Data Cleaning

Null Values Check

We verified that there are no null values across any of the columns in the dataset. All columns were complete.

Duplicate Values Check

We confirmed that there are no duplicate records in the dataset, ensuring the integrity of the data.

Removing Unnecessary Columns

The Unnamed: 0 index column, which was irrelevant to the analysis, was removed for clarity.
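
A minimal sketch of the three steps above, assuming the dataset is loaded from a CSV file (the file name is an assumption):

```python
import pandas as pd

df = pd.read_csv("flight_price.csv")

print(df.isnull().sum())      # missing values per column (all zeros for this dataset)
print(df.duplicated().sum())  # number of fully duplicated rows (zero for this dataset)

# Drop the leftover index column exported with the CSV
df = df.drop(columns=["Unnamed: 0"])
```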

Unique Values in Columns

The dataset contains the following number of unique values in each column:

Encoding Categorical Data

We converted the categorical variable stops into numerical format to facilitate modeling.
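
A minimal sketch of this encoding, assuming the stops column uses the labels "zero", "one", and "two_or_more" (as in the public version of this dataset):

```python
# Map the ordinal stop labels to integers; the label names are an assumption.
stop_mapping = {"zero": 0, "one": 1, "two_or_more": 2}
df["encoded_stops"] = df["stops"].map(stop_mapping)
```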


Reordering Columns

The encoded_stops column was moved to a more logical position for easier analysis.

Reordered Columns

Statistical Summary of Numerical Columns

Below is the statistical description of the numeric columns in the dataset, showing key metrics such as mean, standard deviation, minimum, and maximum values.

Numeric Statistics


Data Imputation

In this dataset, no missing values were found, as confirmed by the earlier data exploration step. Therefore, no data imputation was necessary.

Common Imputation Techniques

If missing values were present, the following imputation techniques could have been considered:

Justification

Since the dataset is complete and free from missing data, none of these techniques were required in this project. This ensures that no bias was introduced through the imputation process.


Delivery 3: Data Migration to BigQuery and Model Preparation

Due Date: 18/10/2024

Data Migration:

In order to upload the Pandas dataframes used to train our model, we first had to create a new Google Cloud project. Within it, we also created a new dataset, which would contain the previously mentioned dataframes.

Once everything was set and ready, we used the pandas_gbq.to_gbq() function in Google Colab to upload all the files into the newly created dataset.

However, for this method to work, we first needed a way to access GCP from within Google Colab. So we created a service account, granted it the Owner role for now, and generated a new JSON key for it.

After using the key with the pandas_gbq.to_gbq() function, we were able to upload the necessary files to GCP.
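
A minimal sketch of such an upload; the key file path, project ID, and dataset/table names below are placeholders, not the project's actual values.

```python
import pandas_gbq
from google.oauth2 import service_account

# Authenticate with the service account's JSON key (placeholder path).
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json"
)

# Upload the cleaned dataframe to a BigQuery table (placeholder project and table names).
pandas_gbq.to_gbq(
    df,
    destination_table="flight_data.flights",
    project_id="my-gcp-project",
    credentials=credentials,
    if_exists="replace",  # overwrite the table if it already exists
)
```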

Chosen Model

Ridge regression is a type of linear regression that includes a regularization term to penalize the magnitude of the coefficients of the model. It adds an L2 regularization term to the cost function, which is the sum of the squared coefficients. This regularization term helps to prevent the model from overfitting by shrinking the coefficients, which means it reduces the model’s complexity without affecting its predictive power significantly.
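
For reference, the Ridge cost function can be written as follows (a textbook formulation, not taken from the project code), where alpha is the regularization parameter tuned later:

J(\beta) = \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta} \right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2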

Training Process

The first step in building a machine learning model is to divide the data into training and testing sets. This is done to evaluate the model's performance on unseen data and to assess the generalization ability of the model. For the Ridge model predicting flight prices, the data was split into a training set (80% of the data) and a test set (the remaining 20%).

Training set:

Used to train the model, where the model learns the relationship between features and the target variable (flight price).

Test set:

Used to evaluate how well the model generalizes to new data. It helps in assessing the model’s prediction accuracy and detecting overfitting or underfitting.

For this project, we used the train_test_split function, which randomly divides the original dataset into two parts, a training set and a test set, according to a specified proportion (in our case, 80% for training and 20% for testing).
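
A minimal sketch of this split, assuming the cleaned dataframe df from Delivery 2 and illustrative variable names:

```python
from sklearn.model_selection import train_test_split

# Features and target; "price" is the variable we want to predict.
X = df.drop(columns=["price"])
y = df["price"]

# 80% of the rows go to training, 20% to testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Note that the remaining categorical columns would still need to be encoded (for example with one-hot encoding) before fitting the Ridge model.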

Hyperparameter Tuning:

Machine learning models often include hyperparameters. Ridge regression includes a hyperparameter called the regularization parameter (alpha), which controls the complexity of the model. However, finding the optimal value is not trivial, so we use a grid search to achieve this. This method systematically tests a range of possible values for the hyperparameter.

First, we define a set of candidate values for alpha (e.g., [0.01, 0.1, 1, 10, 100]). The algorithm evaluates the model’s performance for each value and selects the one that maximizes the metric of interest (in this case, we chose R²), as shown in the sketch after the cross-validation subsection below.

Cross-Validation:

Cross-validation helps to ensure that the model’s performance is evaluated more robustly. Using cv=5 means that the data is split into 5 “folds”:

The data is divided into five equal parts.

In each iteration, four parts are used for training, and one part is used for validation.

This process repeats five times, with each fold serving as the validation set once.

After the five iterations, the results are averaged to get a more reliable estimate of the model’s performance.
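
Putting the grid search and the 5-fold cross-validation together, a minimal sketch of this tuning step, reusing the X_train and y_train variables from the split sketch above:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Candidate values for the regularization parameter, as listed above.
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation; R² is the metric being maximized.
grid_search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)  # alpha with the best mean R² across the folds
print(grid_search.best_score_)   # the corresponding mean R²
```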

Metrics and Results Analysis: