FAQ on Machine Learning

What is the fundamental difference between Statistics and Machine Learning?

Statistics is a branch of mathematics. It is the practice or science of collecting and analyzing numerical data in large quantities, especially to infer proportions in a whole from those in a representative sample. Machine Learning, by contrast, is part of the Data Science domain. Data Science became a necessity after the rise of big data and the boom of the internet. It took a long time to understand how to deal with such massive data. Once storing, retrieving, and extracting meaning from that data became possible, the apparent next step was to use it to predict the future and drive business decisions accordingly. After that, Machine Learning emerged as a specific profession.

Why Machine Learning over traditional Statistical Analysis?

A natural question is why we need Machine Learning algorithms at all when they also rely on statistical modelling. Machine Learning is indeed based on statistical modelling, but there is a foundational difference between general statistical analysis and Machine Learning. I think the difference is the purpose. Traditional statistical analysis, using software such as SPSS, is for drawing inferences about the relationships between variables. Machine Learning goes further: it trains a model, tests its accuracy, and then gives predictions. Therefore, the ability to train itself and improve on the go becomes the most significant distinguishing factor.
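The train, test, and predict cycle mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the model is a one-variable least-squares line, and the helper names (fit_line, predict, mean_absolute_error) are hypothetical rather than taken from any particular library.

```python
import random

def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    intercept, slope = model
    return intercept + slope * x

def mean_absolute_error(model, xs, ys):
    return sum(abs(predict(model, x) - y) for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Train on the first 75% of the rows, test on the remaining 25%.
split = int(0.75 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

model = fit_line(train_x, train_y)                      # 1. train
test_mae = mean_absolute_error(model, test_x, test_y)   # 2. test on unseen data
new_prediction = predict(model, 4.0)                    # 3. predict a new case
```

A traditional analysis would typically stop at interpreting the fitted slope. The Machine Learning workflow adds the second and third steps: checking the model on data it has not seen and then using it to predict new cases.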

What are the advantages of Machine Learning models over traditional Statistical Analysis?

Data collection in traditional research has always been a challenge. For decades we have been dealing with field researchers' competency, inter-observer reliability, and socially desirable responses. Machine Learning is a breakthrough on this front. You get the data directly from the end-users. Machine Learning does not depend on research questionnaires; it uses actual operational data to make its predictions. Therefore, the scope for manipulating the data is greatly reduced. Another important point is the self-balancing nature of ML models: as new data is collected and the model is retrained on it, the model keeps adjusting itself.

Risks of using Machine Learning models in research

Machine Learning models are focused more on producing predictions than on guaranteeing the precision of any single result. The approach relies on improving on the go: in general, the more data a model gets, the more accurate it becomes. The research field, on the contrary, demands precision at the time a paper is published and its findings are made available to the public. Therefore, Machine Learning is mostly used for business cases, where it is possible to change the course of action over time. The research field still relies on traditional statistical analysis, where the interplay of variables and the data available at a given instant are used to make concrete statements.

Would Machine Learning be useful during research?

The reactive nature of research and of traditional statistical tools makes them less productive for taking informed decisions in the current dynamic and fast-changing environment. Machine Learning can play a role here during the research itself. For example, if a program intervention needs to be researched, its day-to-day updates can be fed to a Machine Learning model, which helps you make course corrections on the fly. Your interventions can be tested and monitored every day. As data is collected daily from various users, the model can give more accurate predictions. At the end of the experiment, the results can be tested again using traditional research methods for reassurance.
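As a rough sketch of what feeding day-to-day updates to a model could look like, the following Python example updates a small linear model one observation at a time and scores each day's data before learning from it. Everything here is an assumption made for illustration: the OnlineLinearModel class is hypothetical, the daily batches are simulated, and stochastic gradient descent is just one of several ways to update a model incrementally.

```python
import random

class OnlineLinearModel:
    """Linear model (intercept + slope * x) updated one observation at a time."""

    def __init__(self, learning_rate=0.2):
        self.intercept = 0.0
        self.slope = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.intercept + self.slope * x

    def update(self, x, y):
        # One stochastic-gradient step on the squared error of this observation.
        error = self.predict(x) - y
        self.intercept -= self.learning_rate * error
        self.slope -= self.learning_rate * error * x

rng = random.Random(1)
model = OnlineLinearModel()
daily_errors = []

for day in range(30):
    # Simulated daily batch (an assumption): 20 observations of y = 1 + 2x + noise.
    batch = []
    for _ in range(20):
        x = rng.random()
        batch.append((x, 1.0 + 2.0 * x + rng.gauss(0, 0.1)))

    # Score today's data with the model as it stood yesterday...
    errors = [abs(model.predict(x) - y) for x, y in batch]
    daily_errors.append(sum(errors) / len(errors))

    # ...then let the model learn from today's data.
    for x, y in batch:
        model.update(x, y)
```

Tracking daily_errors gives a day-by-day view of how well the model anticipates incoming data, which is the kind of signal that could prompt a course correction before the study ends.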

Can the accuracy of predictions in Machine Learning be improved?

Well, it's very much possible. Currently, the prevalent practice is to train the ML model on a training data set and then check its validity by giving it the independent variables from a separate test data set. If the predicted values of the dependent variable are close to the actual values recorded in the test data, the model is assumed to be working fine. It's an excellent shortcut for a fast-paced business environment. In research, more traditional measures such as confidence intervals or significance tests can be applied. These are well supported in ML modelling too. A data scientist's depth in statistics and the business requirements usually decide whether such methods are used. In practice, data scientists often start from assumptions on these fronts in order to meet deadlines.
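One way to attach a confidence interval to a hold-out result is the percentile bootstrap, sketched below in Python. This is a minimal illustration and not the only valid method: the bootstrap_ci helper is hypothetical, and the hold-out outcomes (42 correct predictions out of 50) are made-up numbers used only to show the mechanics.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the test results with replacement and recompute the mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up hold-out outcomes: 1 = prediction was correct, 0 = it was wrong.
correct = [1] * 42 + [0] * 8

accuracy = sum(correct) / len(correct)
low, high = bootstrap_ci(correct)
print(f"accuracy = {accuracy:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate shows how much the accuracy figure could move with a different test sample, which is closer to what a research audience expects than a single number.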

Is Data Science relevant in the Social Sciences as well?

Yes, certainly. Moreover, with its growing use and range of applications, Social Data Science is an emerging niche field of work. The social sciences collect a large amount of data, which fulfills the first requirement of Data Science. Another supporting factor is the dynamic nature of the social sciences. Social science research has rarely had wide-scale applicability because it is very much context-dependent. Data Science can deal with contextual changes because its models are built on responses from people belonging to the same context. You can build and train a model in any environment, and it will start predicting for that environment accordingly.

Does the nature of data collection change in Social Data Science?

Yes. It will change quite a bit. The data collection phase can no longer be monitored by a single researcher. Data will be collected directly from the end-users, so collection will be unsupervised in the sense that no researcher oversees it. Another important point is that the data will not necessarily come through research tools. It will be operational data that can be collected as part of routine monitoring. For example, if teachers are entering weekly Continuous Evaluation data into a system, that data will be a useful input for the Social Data Scientist to work on. Even though Data Science is based on extensive operational data, I would be happy to see Data Scientists also collect some data in a controlled environment and cross-check the results. It will give added confidence in the outcomes.