
NLP based Duplicate Bug Report Detection using Supervised ML Algorithms


Project Domain / Category
AI / Machine Learning / Prototype-based

Abstract

A bug report is a technical document that contains all the necessary information about the
bug and the conditions under which it can be reproduced. It is a guide for the developers and
the team engaged in fixing the bug. Bug reports are the primary means through which
developers triage and fix bugs. To achieve this effectively, bug reports need to clearly
describe those features that are important for the developers. However, previous studies
have found that reporters do not always provide such features.
Our objective in this project is to classify such bug reports using machine learning models on
the given dataset. Natural language processing (NLP) is the ability of a computer program to
understand human language as it is spoken and written, referred to as natural language.
Natural language processing uses artificial intelligence to take real-world input, process it,
and make sense of it in a way a computer can understand. NLP involves data preprocessing
(tokenization, stop word removal, lemmatization, etc.), which cleans the textual data so that a
machine can analyze it. To classify the duplicate bug reports, we use machine learning
algorithms such as Naïve Bayes, Support Vector Machine and Random Forest.
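
As a starting point, the sketch below shows these preprocessing steps on a sample bug report
summary. It assumes NLTK is used; spaCy or any comparable library would work just as well.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
# (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenization
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens
              if t.isalpha() and t not in stop_words]    # stop word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("Thunderbird crashes when opening large mail folders"))
# -> ['thunderbird', 'crash', 'opening', 'large', 'mail', 'folder']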

Pre-Requisites:

This project is easy and interesting, but it requires an in-depth study of machine learning and
natural language processing techniques. The following link may help you understand it better.

Dataset:
https://github.com/logpai/bugrepo/tree/master/Thunderbird
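
The sketch below is one way to load these reports, assuming they have been exported to a CSV
file. The file name (thunderbird.csv) and the column names (Title, Description,
Duplicated_issue) are assumptions and should be adjusted to whatever the files in the
repository actually contain.

import pandas as pd

# Assumed file and column names -- adjust to the actual dataset layout.
df = pd.read_csv("thunderbird.csv")

# Combine the title and description of each report into one text field.
df["text"] = df["Title"].fillna("") + " " + df["Description"].fillna("")

# Label a report as duplicate (1) if it references another issue, else 0.
df["is_duplicate"] = df["Duplicated_issue"].notna().astype(int)

print(df[["text", "is_duplicate"]].head())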

Functional Requirements:

The following are the functional requirements of the project:

  1. The system must set up the working environment, online or offline (if required).
  2. The system must apply data preprocessing techniques (tokenization, stop word
     removal, lemmatization, etc.).
  3. The system must build a corpus from the preprocessed bug reports (see the pipeline
     sketch after this list).
  4. The system must split the given dataset into training and testing sets.
  5. The system must train the specified models.
  6. The user must evaluate the mentioned models in the form of a confusion matrix,
     accuracy, precision and recall.
  7. The user must discuss the results of the given algorithms (Naïve Bayes, Support
     Vector Machine, Random Forest).
  8. The user must retrain a model if its accuracy is not good (less than 60%) by
     changing the training parameters (if required).
    …………
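
The following is a minimal end-to-end sketch of requirements 3 to 8 using scikit-learn. It
assumes the DataFrame df built in the loading sketch above, with a text column and a binary
is_duplicate label; both names are assumptions, and the 80/20 split and model parameters are
only reasonable defaults, not requirements of the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# 3. Build the corpus as a TF-IDF matrix over the bug report text.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(df["text"])
y = df["is_duplicate"]

# 4. Split the dataset into training and testing sets (80/20 split assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 5-7. Train and evaluate each of the specified models.
models = {
    "Naive Bayes": MultinomialNB(),
    "Support Vector Machine": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print("  Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("  Accuracy :", accuracy_score(y_test, y_pred))
    print("  Precision:", precision_score(y_test, y_pred))
    print("  Recall   :", recall_score(y_test, y_pred))

# 8. If a model's accuracy is below 60%, retrain it with different parameters,
#    e.g. a different max_features, n_estimators, or the SVM's C value.

Note that a TF-IDF matrix is used for all three models because MultinomialNB expects
non-negative features; other representations could be substituted for the SVM and Random
Forest if desired.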