written 6.0 years ago
• Data preparation
A few hours of measurements later, we have gathered our training data. Now it’s time for the next step of machine learning: data preparation, where we load our data into a suitable place and prepare it for use in training. We’ll first put all our data together, and then randomize the ordering. We don’t want the order of our data to affect what the model learns, since the order in which the measurements were collected is not part of what the model is meant to determine.
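The combine-then-shuffle step described above could look roughly like the sketch below. It uses only Python’s standard library; the function name `combine_and_shuffle`, the row layout (a feature tuple plus a label), and the sample values are all made up for illustration and are not the original measurements.

```python
import random

def combine_and_shuffle(*batches, seed=0):
    """Concatenate several lists of rows and return them in random order."""
    # Put all the data together first...
    rows = [row for batch in batches for row in batch]
    # ...then randomize the ordering. A fixed seed (an assumption here)
    # makes the shuffle reproducible from run to run.
    rng = random.Random(seed)
    rng.shuffle(rows)
    return rows

# Hypothetical measurements: ((feature_a, feature_b), label)
batch_a = [((1.0, 2.0), "x"), ((1.1, 2.1), "x")]
batch_b = [((5.0, 6.0), "y"), ((5.2, 6.3), "y")]

data = combine_and_shuffle(batch_a, batch_b)
```

Shuffling a copy built from the batches, rather than the batches themselves, leaves the original measurement files untouched in case the preparation needs to be redone.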
This is also a good time to do any pertinent visualizations of your data, to help you see if there are any relevant relationships between different variables you can take advantage of, as well as show you if there are any data imbalances.
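One quick way to spot a data imbalance, even before drawing any plots, is to count how many examples each label has. The sketch below is only an illustration: `label_counts` and the sample labels are hypothetical, and the text bar chart stands in for whatever plotting tool you prefer.

```python
from collections import Counter

def label_counts(labels):
    """Return {label: (count, fraction_of_total)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Hypothetical labels: 8 examples of one class, 2 of another
labels = ["x"] * 8 + ["y"] * 2
summary = label_counts(labels)

# Crude text bar chart, one '#' per example
for label, (n, fraction) in sorted(summary.items()):
    print(f"{label:>3} {'#' * n} {n} ({fraction:.0%})")
```

If one label dominates like this, a model can score well by always guessing the majority class, which is why it helps to check the balance before training.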
We’ll also need to split the data into two parts. The first part, used to train our model, will be the majority of the dataset. The second part will be used to evaluate the trained model’s performance. We don’t want to evaluate on the same data the model was trained on, since the model could then simply memorize the “questions”, just as you wouldn’t reuse the questions from your math homework on the exam.
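A minimal sketch of such a split, assuming the rows have already been shuffled. The 80/20 ratio is an assumption chosen for illustration (the text only says the training part should be the majority), and `train_test_split` here is a hand-written helper, not a library function.

```python
def train_test_split(rows, train_fraction=0.8):
    """Split already-shuffled rows into a training part and an evaluation part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    # The first part trains the model; the rest is held back for evaluation.
    return rows[:cut], rows[cut:]

# Hypothetical dataset of 10 rows, stood in for by integers
rows = list(range(10))
train, test = train_test_split(rows, train_fraction=0.8)
```

Because the two slices never overlap, no row used for evaluation was seen during training.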
Sometimes the data we collect needs other forms of adjustment and manipulation, such as de-duplication, normalization, and error correction. These would all happen at the data preparation step.
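Two of these adjustments, de-duplication and normalization, can be sketched in a few lines. This is one possible approach under stated assumptions: min-max scaling to the range 0 to 1 is just one of several normalization schemes, and the helper names and sample rows are hypothetical.

```python
def dedupe(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical rows: (measurement, label), with one exact duplicate
raw = [(2.0, "x"), (4.0, "y"), (2.0, "x"), (6.0, "y")]
clean = dedupe(raw)
scaled = min_max_normalize([value for value, _ in clean])
```

Here `clean` keeps three rows and `scaled` holds their measurements rescaled to the range 0 to 1.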
• Choosing a model
The next step in our workflow is choosing a model. There are many models that researchers and data scientists have created over the years. Some are very well suited for image data, others for sequences (like text, or music), some for numerical data, others for text-based data.