Data Science Projects:

DS-20260312

A Multi-Class Predictive Model for Manufacturing Equipment Maintenance Systems

by Nikitha Gandra, Chukwuasia Madike, Naaram Srichandana, and Eve Thullen; THULLEN RESEARCHLAB, 02/01/2026

Unplanned equipment failures in manufacturing systems lead to production downtime, increased operational costs, and safety risks. While predictive maintenance techniques have advanced significantly, much of the existing work focuses on binary failure detection and provides limited insight into specific failure mechanisms. This paper presents a multi-class predictive modeling approach for manufacturing equipment maintenance systems that aims to identify distinct failure types using operational sensor data.

The study formulates failure type prediction as an imbalanced multi-class classification problem representative of real-world industrial environments, where failure events are rare compared to normal operation. Model performance is evaluated using imbalance-aware metrics to ensure reliable assessment across both dominant and minority failure classes. The results demonstrate that the proposed approach can effectively distinguish major mechanical and thermal failure types despite severe class imbalance. These findings highlight the importance of multi-class failure prediction for enabling more targeted maintenance decisions and improving the reliability of manufacturing equipment.
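
The abstract does not say which imbalance-aware metrics were used. As a minimal sketch, assuming macro-averaged F1 is one of them, the example below shows why plain accuracy misleads when failures are rare. The class names and label counts are hypothetical and are not taken from the paper.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true label was something else
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 95 normal records and 5 failures of two types.
y_true = ["none"] * 95 + ["thermal"] * 3 + ["mechanical"] * 2
# A degenerate model that never predicts a failure.
y_pred = ["none"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

In this toy case the degenerate model scores 95% accuracy but a macro F1 of roughly 0.32, because both failure classes contribute an F1 of zero.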

See Paper Details

Cite: [1] Nikitha Gandra, Chukwuasia Madike, Naaram Srichandana, Eve Thullen, "A Multi-Class Predictive Model for Manufacturing Equipment Maintenance Systems," International Multidisciplinary Research Journal Reviews (IMRJR), 2026, DOI 10.17148/IMRJR.2026.030305

DS-20250801

Forecasting Household Electricity Consumption

by Priyanka Rana and Krishan Kumar Sidh, University of the Cumberlands, 08/01/2025

Accurate forecasting of household electricity use is important for saving energy, running smart grids efficiently, and supporting sustainability. This study tested three machine learning models—Random Forest, XGBoost, and Long Short-Term Memory (LSTM)—to predict daily household electricity use in New York, United States. The dataset included contextual details such as household size and electricity price. Time-based features (month, day, weekday) were added, and minute-level data were aggregated into daily averages for stability.
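
The preprocessing described above, averaging minute-level readings per day and adding month, day, and weekday features, could be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name, the field names, and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def daily_features(readings):
    """Aggregate (iso_timestamp, value) pairs into one row per calendar day."""
    by_day = defaultdict(list)
    for stamp, value in readings:
        by_day[datetime.fromisoformat(stamp).date()].append(value)
    rows = []
    for day in sorted(by_day):
        values = by_day[day]
        rows.append({
            "date": day.isoformat(),
            "mean_value": sum(values) / len(values),  # daily average
            "month": day.month,
            "day": day.day,
            "weekday": day.weekday(),  # Monday is 0, Sunday is 6
        })
    return rows

# Hypothetical minute-level readings spanning two days.
readings = [
    ("2025-01-06T00:00:00", 1.0),
    ("2025-01-06T00:01:00", 3.0),
    ("2025-01-07T00:00:00", 2.0),
]
rows = daily_features(readings)
```

Averaging smooths out minute-to-minute noise, at the cost of discarding within-day patterns such as morning and evening peaks.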

Results showed that Random Forest and XGBoost performed better than LSTM in both accuracy and computational efficiency. After a highly correlated feature was removed, XGBoost was the best-performing model, with a root mean square error (RMSE) of 0.453, a mean absolute error (MAE) of 0.331, and an R² of 0.519. The most important predictors were kitchen appliance usage, laundry/cold storage usage, and water heating/cooling usage. These findings suggest that machine learning can be a valuable tool for household energy management. Future research should consider using higher-frequency data, integrating weather and occupancy information, and exploring hybrid modeling approaches to further improve prediction accuracy.
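
For readers unfamiliar with the three reported metrics, the sketch below computes RMSE, MAE, and R² from their standard definitions. The input values are hypothetical and unrelated to the study's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical actual and predicted values.
rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```

An R² of 0.519, as reported above, means the model accounts for about half of the variance in the target.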

Request Paper Details

DS-20250511

Advanced Time Series Forecasting Model for Gross Domestic Product (GDP) Prediction

by Akshatha Atmaram and Stone Barnard, University of the Cumberlands, 05/11/2025

Forecasting Gross Domestic Product (GDP) is a long-standing challenge in economic analysis, particularly for governments, analysts, and organizations that want to anticipate macroeconomic changes. Precise GDP forecasts are essential for policy making, investment planning, evaluations of economic stability, and long-term strategic decision-making. This study addresses the challenge with an AI-driven time series forecasting framework that uses historical data and macroeconomic variables to forecast GDP at the national level. The report analyzes advanced forecasting techniques aimed at improving the accuracy and efficiency of GDP predictions across diverse global economies. The study draws on a curated collection of open-source datasets, including annual GDP data, inflation rates, trade balances, and employment statistics, spanning multiple decades and regions. The primary objective is to design an automated data pipeline that can process varied and often inconsistent economic data formats, and to build a scalable deep learning model tailored for time-series forecasting.

The methodology centers on the use of Long Short-Term Memory (LSTM) neural networks, chosen for their ability to model long-term dependencies in sequential data. The model was trained using a 10-year historical input sequence, optimized through extensive hyperparameter tuning, and evaluated using established regression metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). Feature engineering techniques were applied to enhance performance, including the creation of lag variables, log transformations, and scaling operations. Baseline models such as ARIMA, Linear Regression, and Random Forest were implemented for comparison, providing a comprehensive performance benchmark. Through iterative experimentation and validation—including time-series cross-validation—the LSTM model demonstrated superior performance in forecasting GDP trends. The model outperformed traditional approaches in predictive accuracy, particularly in countries with stable historical data, and showed strong generalizability across varied economic contexts. Visual and quantitative evaluations revealed that the model effectively captured both linear growth and cyclical economic patterns, offering valuable insights into macroeconomic behavior.
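
Three of the steps named above, a log transform with scaling, 10-step input sequences, and time-series cross-validation, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the synthetic series, and the expanding-window split scheme are assumptions.

```python
import math

def log_min_max(series):
    """Log-transform positive values, then scale the result to the range [0, 1]."""
    logged = [math.log(v) for v in series]
    lo, hi = min(logged), max(logged)
    return [(v - lo) / (hi - lo) for v in logged]

def make_windows(series, window=10):
    """Pair each run of `window` consecutive values with the value that follows it."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]

def expanding_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices); each test block follows its training data."""
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        yield list(range(test_start)), list(range(test_start, test_end))

# Hypothetical annual series with steady 3% growth over 15 years.
gdp = [100.0 * 1.03 ** year for year in range(15)]
scaled = log_min_max(gdp)
windows = make_windows(scaled, window=10)
splits = list(expanding_splits(len(windows), n_folds=2, test_size=1))
```

Each split trains only on samples that precede the test block, which avoids leaking future values into training. In a real pipeline the scaling bounds would also be fit on the training portion alone; the sketch scales the whole series for brevity.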

Request Paper Details

DS-20250509

Hospital Readmissions Predictive Model

by Yvonna Donaldson and Prasun Pokharel, University of the Cumberlands, 05/09/2025

This project focuses on predicting 30-day hospital readmissions using machine learning models trained on both clinical and public health data. By combining CMS hospital readmission data with social determinants of health from the County Health Rankings and CDC PLACES datasets, the model captures a broader range of factors influencing patient outcomes. After preprocessing and feature engineering, logistic regression, decision tree, and random forest classifiers were trained and evaluated. The final logistic regression model, calibrated using isotonic regression and tuned for a threshold of 0.4, achieved 81% accuracy, 83% precision, 95% recall, and an F1-score of 89%. Key predictors included readmission rate, flu vaccination rate, obesity, stroke, and short sleep duration. These findings support the value of integrating behavioral health indicators into predictive healthcare models. The results highlight the model’s potential to support early intervention and reduce preventable readmissions, with future work focusing on external validation and deployment in clinical settings.
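
The effect of a 0.4 decision threshold can be illustrated with a small sketch. The labels and probabilities below are hypothetical, and the isotonic calibration step is not reproduced. The last line checks that the reported F1-score is consistent with the reported precision and recall.

```python
def classify(probabilities, threshold):
    """Label a case positive (1) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall_f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical readmission labels (1 = readmitted) and model probabilities.
y_true = [1, 1, 1, 0, 0]
probabilities = [0.90, 0.45, 0.35, 0.60, 0.10]

at_04 = precision_recall_f1(y_true, classify(probabilities, 0.4))
at_05 = precision_recall_f1(y_true, classify(probabilities, 0.5))

# F1 is the harmonic mean of precision and recall, using the figures reported above.
reported_f1 = 2 * 0.83 * 0.95 / (0.83 + 0.95)
```

In this toy data, lowering the threshold from 0.5 to 0.4 raises recall from one third to two thirds, which is the usual motivation for a lower threshold when missed readmissions are costly. The harmonic mean of 83% precision and 95% recall is about 0.886, matching the reported 89% after rounding.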

Request Paper Details

DS-20250512

Website and Domain Phishing Detection

by Akshada Bauskar and Enoch Johnson, University of the Cumberlands, 05/12/2025

Phishing continues to be one of the most prevalent forms of cybercrime, exploiting unsuspecting users by mimicking legitimate websites. This paper presents an end-to-end, machine learning-based phishing detection system that uses domain and structure-level URL features to classify websites as either phishing or legitimate. The UCI Phishing Websites Dataset is used for training and evaluation. A variety of models were assessed, with Gradient Boosting emerging as the top performer, achieving 93.5% accuracy and a 94.2% recall rate. The system is built with scalability in mind and incorporates MLOps tools like MLflow, Docker, and AWS for real-world deployment.
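
To make "domain and structure-level URL features" concrete, the sketch below derives a few such features with the Python standard library. The feature set is illustrative only: the names are hypothetical, and the abstract does not list the features the system actually uses.

```python
import ipaddress
from urllib.parse import urlparse

def url_features(url):
    """Return a dict of simple structural features for one URL (illustrative set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True   # a raw IP address in place of a domain name
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "hyphen_in_host": "-" in host,
        "dot_count_in_host": host.count("."),
        "uses_https": parsed.scheme == "https",
    }

# Hypothetical URLs for demonstration.
suspicious = url_features("http://192.168.0.1/login@secure")
ordinary = url_features("https://example.com/")
```

A feature dictionary like this would be converted to a numeric vector before being passed to a classifier such as the Gradient Boosting model mentioned above.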

Request Paper Details

DS-20241013

Advanced Time Series Forecasting for Optimizing Retail Sales

by Kunle Dare, University of the Cumberlands, 10/13/2024

Retail sales forecasting is a critical challenge in business management, especially for large organizations like Corporación Favorita, a leading grocery chain. Accurate demand predictions are vital for optimizing inventory management, reducing wastage, ensuring product availability, and enhancing customer satisfaction. This study addresses these challenges by developing a data-driven time series forecasting model tailored to predict store-level sales across thousands of items in multiple locations.

This report presents a comprehensive analysis of advanced time series forecasting techniques aimed at improving sales forecasting accuracy for Corporación Favorita stores. The study leverages a rich dataset from the Kaggle "Store Sales - Time Series Forecasting" competition, containing daily sales, promotions, holidays, oil prices, and other external factors.

The key objective of this project is to develop a robust forecasting solution that can handle the intricacies of retail sales, including seasonal patterns, economic factors, promotions, and store-level variations.