academic research – introduction

The Effect of Machine Learning Algorithms' Advancement Level on Their Performance When Solving Business Analytics Problems

October 2019

Yossi Hohashvili

Academic adviser: Adir Even, PhD, Ben-Gurion University of the Negev

Additional adviser: Ephraim Goldin, CEO and founder, G-stat

Abstract

One of the most pressing problems in Business Analytics (BA) is the automation of decision-making processes. This automation is highly relevant due to the advantages and disadvantages that may accrue from using a specific methodology. At a time when data, technology and analytics occupy a major place in the business sector, an organization's ability to keep pace with technological developments in the big data world is critical. For this reason, the management team must identify an optimal alternative for solving a BA problem. To make an optimal decision, a director must identify the type of problem and the tools and resources available to achieve the business purpose. An organization that cannot make correct and reasonable strategic decisions that take technological development into account may fail to meet its objectives.

In recent years, advanced Machine Learning (ML) algorithms such as Deep Learning (DL) have empirically proven superior to traditional ML and classical statistics algorithms for solving problems with unstructured data, such as image, text and sound processing. On the other hand, few studies address the superiority of DL over the algorithms mentioned above, or over other advanced ones such as XGBoost, when tackling problems with structured data, which characterizes business information to a large extent.

This research will examine the possibility of expanding the range of learning algorithms used for BA problems by using advanced algorithms to solve the complex classification and regression problems that characterize the BA world. We will do this by conducting an experiment that tests whether the advancement level of an ML algorithm affects its performance when solving BA problems. We believe the results of this research could greatly help the rational, profit-oriented managerial decision maker in solving BA problems.

Experimental design

Throughout this research, our unit of analysis is a dataset. We meticulously collected 60 datasets from Kaggle: 57 from competitions and 3 from open-source datasets available on the platform. All the datasets relate to the business sector; 30 are classification problems and the other 30 are regression problems.

To conduct the experiment, we used a factorial design with 3 variables: an independent variable (IV), which is a within-subjects variable, and two control variables, which are between-subjects variables.

The independent variable is the advancement level of an ML algorithm. This variable is ordinal, with three levels: classical statistics, classical ML and advanced ML. Each advancement level covers the preprocessing, processing and post-processing actions executed over the data as part of the learning model pipeline. Some actions are shared across levels and some differ, and some require more human intervention than others in the pre- and post-processing stages. For each level we chose a bundle of learning algorithms that represent it:

Classical statistics: Generalized linear models (GLM), naïve Bayes.

Classical ML: support vector machine (SVM), Random Forest (RF), K-nearest-neighbors (KNN), multilayer perceptron (MLP).

Advanced ML: Deep Learning (RNN-GRU/LSTM, 3-5 hidden layers), XGBoost.

The first of the two control variables is the richness of data, which accounts for the dimensions of the dataset. We use this variable to control for the confounding effect of dataset dimensions. It is an ordinal variable with 3 levels: small (<10MB), medium (10MB-10GB) and large (>10GB). The other control variable is the problem type of the dataset. It is a nominal variable with 2 levels: regression and classification. This variable distinguishes whether the goal of the learning algorithm is to predict a class or a continuous number.

The dependent variable (DV) is the performance of a learning algorithm being run over a dataset. Performance is measured with two variables: goodness of prediction and running time. Goodness of prediction is measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for classification problems, and by the normalized mean square error (NMSE) for regression problems. Running time is counted in minutes from the start of the ML pipeline until it finishes. Each ML pipeline starts from the same conditions, so that the measurement is fair and reflects only the ML pipeline.
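As a rough illustration of the two performance measures, the sketch below computes AUC with scikit-learn and a hand-written NMSE, and wraps a call in a wall-clock timer. The source does not say which normalization NMSE uses; dividing the MSE by the variance of the true targets is an assumption here, and the helper names nmse and timed_minutes are hypothetical. The input values are toy numbers, not experimental data.

```python
import time

import numpy as np
from sklearn.metrics import roc_auc_score


def nmse(y_true, y_pred):
    """MSE divided by the variance of the true targets (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return mse / np.var(y_true)


def timed_minutes(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock time in minutes)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) / 60.0


# Toy values, for illustration only.
auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
error = nmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
_, minutes = timed_minutes(sum, range(1000))
```

With this normalization, an NMSE of 1 corresponds to a model that does no better than always predicting the mean of the targets.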

To test the effect of each algorithm on performance, we assign each of the 30 datasets of a given problem type to one of the three levels of the richness of data variable. This gives 6 groups of datasets with 10 datasets in each group, as shown in the following table.

Problem type	Richness of data: <10MB	Richness of data: 10MB-10GB	Richness of data: >10GB
Regression	(1)	(2)	(3)
Classification	(4)	(5)	(6)
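The group numbering in the table above could be derived from a dataset's size and problem type along the following lines. This is only a sketch: the function names are hypothetical, binary units (1MB = 1024 * 1024 bytes) are assumed, and the 10MB and 10GB boundaries are assigned to the medium level, a detail the source leaves open.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def richness_level(size_bytes):
    """Return 1 (small), 2 (medium) or 3 (large) from the dataset size in bytes."""
    if size_bytes < 10 * MB:
        return 1
    if size_bytes <= 10 * GB:
        return 2
    return 3


def group_number(size_bytes, problem_type):
    """Map a dataset to groups 1-6: regression is 1-3, classification is 4-6."""
    offsets = {"regression": 0, "classification": 3}
    return offsets[problem_type] + richness_level(size_bytes)
```

For example, a 50MB classification dataset falls into group 5 under these assumptions.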

Link to the metadata of the datasets: https://docs.google.com/spreadsheets/d/1ZHdHq_Kvmnf1xzQfeLzVy_73L4znZ6AsG7OPgeOegeY/edit?usp=sharing

We will run each algorithm from each advancement level over all the groups, switching the algorithm's goal between classifier and regressor depending on the dataset's problem type. We will split each dataset into train, validation and test sets. For each algorithm, hyperparameter fine-tuning will be conducted using 3-fold cross-validation over 70% of the samples, randomly selected from the dataset, while the other 30% will be used for testing the algorithm's goodness of prediction. The running time will measure the ML pipeline learning over the 70% of samples randomly picked for training. The same dataset split will be shared across all the algorithms.
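One way the split, tuning and timing described above might look with scikit-learn is sketched below for a single classifier. It is not the experiment's actual pipeline: the data are synthetic, the hyperparameter grid is a deliberately tiny hypothetical one, and the fixed SEED is an assumed mechanism for giving every algorithm the same 70/30 split.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

SEED = 42  # assumed: one fixed seed so all algorithms share the same split

# Synthetic stand-in for one classification dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=SEED)

# 70% for tuning and training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

# Hypothetical grid, kept small so the sketch runs quickly.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid,
    cv=3,  # 3-fold cross-validation over the 70% portion
    scoring="roc_auc",
)

start = time.perf_counter()
search.fit(X_train, y_train)  # tunes, then refits the best model on the 70%
running_minutes = (time.perf_counter() - start) / 60.0

test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

In this sketch the cross-validation folds play the role of the validation set, and the timer covers both tuning and the final refit; whether the measured running time should include tuning is a design choice the text does not settle.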

To test the effect of the richness of data, we will take each dataset and compare performance over 20% of the features with performance over 80%, and over 20% of the samples with performance over 80%. To eliminate selection as a confounding variable, we will repeat each random selection of features 10 times, and do the same for the samples. From a single dataset we will therefore create 40 randomly sampled datasets: 10 datasets with 0.2 * n samples, 10 with 0.8 * n samples, 10 with 0.2 * m features and 10 with 0.8 * m features, where n is the number of samples and m is the number of features. Across the 60 original datasets this gives us 2,400 datasets, and each algorithm will run over all of them.
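The resampling scheme above can be sketched with NumPy as follows. The function name make_subsamples is hypothetical, sampling without replacement is an assumption, and X is taken to hold only the feature columns (for row subsets, the target vector would need to be indexed with the same row indices).

```python
import numpy as np


def make_subsamples(X, fractions=(0.2, 0.8), repeats=10, seed=0):
    """Return a list of (axis, fraction, subset) tuples drawn from a 2-D array X.

    axis is "samples" (rows are drawn) or "features" (columns are drawn).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    subsets = []
    for axis, size in (("samples", n), ("features", m)):
        for frac in fractions:
            k = max(1, int(round(frac * size)))
            for _ in range(repeats):
                # Draw k distinct indices; sorting keeps the original order.
                idx = np.sort(rng.choice(size, size=k, replace=False))
                subset = X[idx, :] if axis == "samples" else X[:, idx]
                subsets.append((axis, frac, subset))
    return subsets


# Toy data: n = 100 samples, m = 10 features.
X = np.arange(100 * 10).reshape(100, 10)
subsets = make_subsamples(X)
```

Each call yields 2 axes * 2 fractions * 10 repeats = 40 derived datasets, which matches the 60 * 40 = 2,400 total stated in the text.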

The experiment is implemented in Python using open-source packages, mainly SciPy, NumPy, pandas, scikit-learn, Keras, TensorFlow and XGBoost for the algorithms. Furthermore, the entire experiment will be conducted under the same hardware conditions, on an EC2 instance in AWS.

The data analysis will be conducted using MANOVA, because the DV is multivariate and the IVs have multiple levels.
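For readers unfamiliar with MANOVA in Python, the sketch below shows the general shape of such a test using the statsmodels package, which the source does not list and is an assumption here. The measurements are synthetic placeholders and not results of the experiment, the column names are hypothetical, and this plain one-way model treats observations as independent, so it does not capture the within-subjects structure of the actual design.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
levels = ["classical_stats", "classical_ml", "advanced_ml"]

# Synthetic placeholder measurements, NOT results of the experiment.
rows = []
for i, level in enumerate(levels):
    for _ in range(20):
        rows.append(
            {
                "level": level,
                "goodness": rng.normal(0.70 + 0.02 * i, 0.05),
                "minutes": rng.normal(5.0 + 2.0 * i, 1.0),
            }
        )
df = pd.DataFrame(rows)

# Two dependent variables on the left, the advancement level on the right.
fit = MANOVA.from_formula("goodness + minutes ~ C(level)", data=df)
result = fit.mv_test()

# Test statistics for the advancement-level term.
wilks = result.results["C(level)"]["stat"].loc["Wilks' lambda"]
```

Calling print(result) displays the full table of multivariate test statistics for each term in the model.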

Schedule

	#	Task	Start	End
P	1	Forming a research question and hypothesis	May 2018	June 2018
P	2	Literature review	July 2018	November 2018
P	3	Collecting datasets from Kaggle and others	October 2018	December 2018
P	4	Data preparation	January 2019	February 2019
P	5	Model development – Skeleton	August 2018	December 2018
P	6	Model development – Full Model	August 2019	September 2019
*	7	Model Evaluation	October 2019	November 2019
	8	Analysis and results	November 2019	December 2019
	9	Thesis writing and submission	December 2019	January 2020

In the following posts we will show our results as soon as we get them, and share the complete code that we developed as part of this endeavor.