Data analysis is the number one requirement for starting any data-driven project. When BSUPERIOR SYSTEM builds an application around data, we first need to make sure that the data is good enough and big enough. What does that mean? In this article, we will walk through the steps you may encounter in a typical meeting with programmers and software developers, and explain why we at BSUPERIOR SYSTEM recommend that your business start collecting data as soon as possible.
The first step is a good question. A well-posed question is an essential part of any project, and there are two main types:
- Prediction – a good example comes from the real-estate industry: we have the features of a property and we want to predict its price using the data we have collected. With prediction, we have to find the best way to get from point A (the data we have) to point B (the value we want to predict) with the smallest possible error.
- Classification – examples: identify whether a picture shows a man, a dog, or a cat; mark a written review as positive or negative; identify a client segment based on purchase history. In this case, we have a fixed set of labels we want to assign. The important thing here is to create mutually exclusive classes. Imagine classifying pictures as "cat" or "man" and discovering that 80% of the pictures show a man holding a cat. It is not impossible to handle, but it requires much more effort to implement.
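The difference between the two question types can be sketched in a few lines of code. This is only a toy illustration: the price coefficients and the sentiment keyword lists below are invented for demonstration, not real models.

```python
# Prediction: map features to a number on a continuous scale (a price).
def predict_price(sqft, bedrooms):
    # Hypothetical linear model: base price + per-square-foot + per-bedroom.
    return 50_000 + 120 * sqft + 15_000 * bedrooms

# Classification: map an input to one of a fixed set of labels.
POSITIVE_WORDS = {"great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "terrible", "hate"}

def classify_review(text):
    # Count positive vs. negative keywords and pick a label.
    words = set(text.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    return "positive" if score >= 0 else "negative"

print(predict_price(1000, 2))                         # → 200000 (a number)
print(classify_review("I love this great product"))   # → "positive" (a label)
```

The key contrast: prediction outputs a quantity you measure error against, while classification outputs one label out of a fixed, mutually exclusive set.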
To keep everything grounded, we will walk through my research project, "Identify Dividend Aristocrats". This is a classification problem: I want to show a machine learning model some data and have it figure out whether a firm can be considered a dividend aristocrat or not. Why would we be interested? Because dividend aristocrats are great to invest in (see the picture below).
Now that we know the problem, we can start looking for data to solve it. My idea was that it should be possible to make some progress with financial statements alone. We show the application a firm's financial statements and some basic valuation metrics, and it gives us an answer.
Of course, financial statements alone would not be sufficient to get highly accurate answers, but here are a couple of factors to consider: this is just the first step; we end up with a proof of concept; and a tool with low precision can still be useful if it has a high level of recall.
Precision and recall are two important terms, and they are fairly easy to understand:
- Precision – of all the cases where the application says "Yes" (this firm is a dividend aristocrat), what fraction are actually aristocrats. High precision means a "Yes" can be trusted.
- Recall – of all the firms that actually are dividend aristocrats, what fraction does the application correctly say "Yes" to. High recall means we miss very few aristocrats, so a "No" can be trusted.
It is tough to achieve high precision, since there are obviously more factors involved. But high recall would give us a tool we can use to filter unwanted firms out of the search: if the model says "No", we can safely discard that firm. Since we are looking at easy-to-access data (simple financial statements), it would be easy to get new data when we want to reuse the project in the future. And it is, of course, possible to add more factors (financial and non-financial) later to increase precision.
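A minimal sketch of how precision and recall are computed from labels we already know (1 = dividend aristocrat, 0 = not). The example labels below are invented purely for illustration:

```python
def precision_recall(y_true, y_pred):
    # tp: correct "Yes"; fp: wrong "Yes"; fn: missed aristocrats.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # how trustworthy a "Yes" is
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many aristocrats we caught
    return precision, recall

actual    = [1, 1, 1, 0, 0, 0, 0, 1]  # true labels (made-up sample)
predicted = [1, 1, 0, 1, 1, 0, 0, 1]  # model output (made-up sample)
print(precision_recall(actual, predicted))  # → (0.6, 0.75)
```

In this sample, the model catches 3 of the 4 real aristocrats (recall 0.75), but only 3 of its 5 "Yes" answers are correct (precision 0.6) — exactly the kind of low-precision, decent-recall tool that is still useful as a filter.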
Another big benefit is that we will have a baseline to compare our progress against, so we actually know whether the factors we add later are relevant.
Finally, we need to find the data. In this case, we are dealing with something publicly available (Kaggle). That is extremely important to remember: humankind is generating a massive amount of data, and 90% of it was generated in recent years (IBM). A lot of the time, we can enhance performance with something that is publicly available.
However, many companies get stuck at this last stage. Public data can be used, but something specific to your firm will always be required. So start thinking now: "What would be beneficial for my firm to predict?" and "What am I (or someone else) doing regularly that could be recorded to support that prediction?"
After that, a data engineer can start applying different machine learning models to the data. Often the only way to see how something performs is to implement it, so this process can be lengthy. In a future article, we will go over report samples and look in detail at how exactly we know whether the model we have is reliable.
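The "try several models and compare" loop can be sketched in a few lines. Everything here is hypothetical: the held-out sample, the two candidate decision rules, and their thresholds are invented for illustration; a real project would compare proper models (e.g. with a library such as scikit-learn) rather than hand-written rules.

```python
# Hypothetical held-out sample: (years_of_dividend_growth, payout_ratio, is_aristocrat)
holdout = [
    (30, 0.45, 1), (5, 0.80, 0), (28, 0.50, 1),
    (2, 0.10, 0), (40, 0.60, 1), (10, 0.95, 0),
]

# Two candidate "models" (simple decision rules) we want to compare.
models = {
    "growth_rule": lambda years, payout: 1 if years >= 25 else 0,
    "payout_rule": lambda years, payout: 1 if payout <= 0.7 else 0,
}

def accuracy(rule, data):
    # Fraction of held-out examples the rule labels correctly.
    hits = sum(1 for years, payout, label in data if rule(years, payout) == label)
    return hits / len(data)

scores = {name: accuracy(rule, holdout) for name, rule in models.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The point is the workflow, not the rules themselves: score every candidate on data the model has not seen, then keep the best one as the baseline the next round of features has to beat.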
In conclusion, there is no way around data collection. It is a lengthy process, and as more and more applications become AI-based and data-driven, firms that already have their data ready will have a significant advantage. Let us know your thoughts! Feel free to contact us and let's have a chat about what you want to predict.
Next week we will dig deeper into machine learning concepts and address deep learning and neural nets.