Algorithm Construction for Data Analytics
This paper lays out my perspective on confronting a brand-new dataset and my approach to finding a direction to take in algorithm construction.
To construct the algorithm, I follow five steps:
1. Understand and clean the dataset;
2. Understand what model would fit best;
3. Build the model;
4. Draw some conclusions and test the model;
5. Understand the meaning of the model and criticize it.
- Understand and clean the dataset:
Before doing anything to the dataset, I want to understand the exact meaning of each column and each row. Several probing questions: What does each variable mean?
What do I need to predict or conclude? Sometimes the target variable needs to be transformed into another format. For example, in a stock-selection process I might convert stock returns into rankings (the stock with the highest return is ranked 1st, the second highest 2nd, and so on) to dampen the effect of extreme returns or of returns that are nearly tied.
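The return-to-rank conversion above can be sketched in a few lines of pandas. The ticker names and return values here are made up for illustration; only the ranking idea comes from the text.

```python
import pandas as pd

# Hypothetical one-period returns for five stocks (names and numbers
# are illustrative, not real data).
returns = pd.Series({"AAA": 0.42, "BBB": 0.07, "CCC": 0.065,
                     "DDD": -0.03, "EEE": 0.41})

# Rank descending: the highest return gets rank 1.  Working with ranks
# instead of raw returns dampens the influence of extreme winners and of
# returns that are nearly tied.
ranks = returns.rank(ascending=False)
print(ranks.sort_values())
```

With these numbers, the 0.42 return ranks 1st and the -0.03 return ranks 5th, even though 0.42 and 0.41 are nearly indistinguishable in raw terms.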
Then I load the dataset into a programming language for further analysis. Here I assume the dataset is a CSV file: I use Python to open it and summarize it. This step checks whether the dataset contains empty values, N/As, infinities, and so on. If any show up, I need to find out what the missing values mean
and choose the best method to fill them in. For instance, in one of my previous analysis projects, some companies' market values were missing while the other companies' data was still important. I singled out those companies and used forward filling (`ffill`) to fill in the data, which avoids using "future information" (singling out the data is not strictly necessary; putting the file into a pivot-table format makes this step easier). Then I put the data back into the original format, or whatever format the analysis requires.
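The pivot-then-forward-fill step can be sketched as follows, assuming pandas and a long-format table with one row per (date, company). The dates, company labels, and market values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (date, company) with a
# market value.  All names and numbers are illustrative.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-31", "2024-01-31",
                            "2024-02-29", "2024-02-29",
                            "2024-03-31", "2024-03-31"]),
    "company": ["A", "B", "A", "B", "A", "B"],
    "market_value": [100.0, 200.0, 110.0, np.nan, 120.0, 250.0],
})

# Summarize and count the problems before touching anything.
print(df.isna().sum())

# Pivot so each company is a column, then forward-fill down each column:
# only past values are used, so no "future information" leaks in.
wide = df.pivot(index="date", columns="company", values="market_value").ffill()

# Melt back to the original long format for downstream analysis.
long_again = wide.reset_index().melt(id_vars="date",
                                     value_name="market_value")
```

Company B's missing February value is filled with its January value (200.0), never with the later March value, which is the whole point of forward filling in a time-ordered table.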
- Understand what model would fit best:
Before this step, I assume the dataset is ready for analysis. If not, I go back to the first step and do further data cleaning.
Based on my understanding of the variables and the target variable, I make a scatter plot of each variable against the target. Some variables may have a linear relationship with the target and others may have different relationships, so I may need to choose a model or transformation for each variable.
For instance, some variables may have an exponential relationship with the target and some an inverse relationship; for those I create a new column that takes the natural log of the data or its reciprocal. This process can generate new missing values or infinities. If they show up, I look back into the data and decide whether to exclude those rows or fill the N/As by some method. I also filter out the variables that show no relationship to the target.
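A minimal sketch of those transformations, including the new infinities they can create, might look like this (the column `x` and its values are invented):

```python
import numpy as np
import pandas as pd

# Illustrative feature column; the zero will break both transforms.
df = pd.DataFrame({"x": [1.0, 2.0, 4.0, 0.0, 8.0]})

# Log transform for an exponential-looking relationship: log(0) = -inf.
df["log_x"] = np.log(df["x"])

# Reciprocal for an inverse-looking relationship: 1/0 = inf.
df["inv_x"] = 1.0 / df["x"]

# Convert the new infinities to NaN so they can be inspected, filled,
# or dropped with the same machinery as ordinary missing data.
df = df.replace([np.inf, -np.inf], np.nan)
print(df[["log_x", "inv_x"]].isna().sum())
```

After the `replace` call, both derived columns carry exactly one NaN (from the zero row), which is the "new missing data" the text warns about.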
On the other hand, I also look at relationships between the variables themselves. Sometimes two variables correlate with each other; I put a side note on those variables for further checking, since they might be a cause of overfitting.
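One way to generate those side notes automatically is to scan the correlation matrix of the predictors for highly correlated pairs. The variable names, the synthetic data, and the 0.8 cutoff below are all illustrative choices, not part of the original method.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: "b" is deliberately built to track "a",
# while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 0.95 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Flag predictor pairs whose absolute correlation exceeds a heuristic
# threshold (0.8 here); each flagged pair earns a "side note".
corr = df.corr().abs()
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and corr.loc[i, j] > 0.8]
print(pairs)
```

With this construction, only the (`a`, `b`) pair is flagged, which is exactly the kind of variable pair worth re-examining before modeling.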
- Build the model:
This part is relatively more straightforward than the other steps. Once I have a sense of what model I am building and what the variables mean, I can put them into the programming language to build the model. In this process, I split the dataset into a training set and a testing set: the training set builds the model and the testing set confirms that the predictions hold up.
There are still some crucial points to consider. The first is formatting the input files to match what the code expects. The second is which model to use: candidates include Ordinary Least Squares, Logistic Regression, Poisson Regression, Random Forest, ARIMA, VARMA, and so on. Using the stock example described previously (stock returns ranked from highest to lowest), a Random Forest might fit better than Ordinary Least Squares,
so the choice of model must be rigorous. The third is taking care of in-sample versus out-of-sample data. Out-of-sample results are more convincing than in-sample results, but in-sample results can look alluring. If the direct result looks alluring, the dataset may involve some future information, which is described in the next part.
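The train/test split and the OLS-versus-Random-Forest comparison can be sketched with scikit-learn on synthetic data. The data-generating process below (a nonlinear target) is an assumption made to illustrate why a tree ensemble can beat a linear model; none of it comes from the original text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for stock features and a target with a nonlinear
# component (x0 squared), which a linear model cannot capture.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

# Hold out a test set so the comparison is out-of-sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)
print(scores)
```

On this nonlinear target the Random Forest's out-of-sample R-squared comes out well above the linear model's, mirroring the point that the ranked-return problem may favor a tree ensemble over OLS.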
- Draw some conclusions and test the model:
Usually, the p-value of each variable in Ordinary Least Squares, or the feature importances of a Random Forest, help decide which variables to keep and which to exclude. Then I construct another model using the updated variable list.
Next, I use R-squared, residual plots, and/or other tests to check the model for overfitting, underfitting, and heteroskedasticity. Throughout these steps I repeatedly update the variable list and re-test the model. If the model looks alluring but only a few variables are used, an earlier step may have leaked future information, or the model may be evaluated on in-sample data. Whenever an alluring model is constructed, I go back and check the code and the dataset for errors in logic and data. This step varies across industries.
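The importance-based variable pruning and refit can be sketched as below. The synthetic data (five features, only two of which matter) and the 0.05 importance cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic example: only features 0 and 1 drive the target; the other
# three are pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=400)

rf = RandomForestRegressor(random_state=0).fit(X, y)

# Keep features whose importance clears a heuristic threshold,
# then refit on the reduced variable list.
keep = np.where(rf.feature_importances_ > 0.05)[0]
rf2 = RandomForestRegressor(random_state=0).fit(X[:, keep], y)

# Residuals from the refit model: a visible pattern here (rather than
# flat noise) would hint at mis-specification or heteroskedasticity.
residuals = y - rf2.predict(X[:, keep])
print(sorted(keep.tolist()))
```

In this setup the two signal features survive the cut while the noise features are dropped, which is the update-the-variable-list loop the text describes.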
- Understand the meaning of the model and criticize it:
After the model has been carefully constructed, I look back at the variable list. I want to convince myself that each variable makes logical sense. If the model shows something surprising, such as revenue dominating the model instead of EBITDA, I still go back to check the code and the dataset. If there is no problem in the code and the dataset really does show the odd trend, a convincing explanation must be put forward, and I raise the issue with groupmates or others who may help. Finally, I write up a report on the model and any interesting findings.