Points to ponder for Data Science beginners

I was going through the posts on kdnuggets.com and found an article listing some basic steps for gaining expertise in machine learning. I am listing those points here again –

  • Feature Engineering – For most of the models you build, features are the basic building blocks. Features are simply the attributes in your data samples (training and test datasets) that you provide to the ML component. Feature engineering involves many techniques, enough that explaining them all would probably take a separate MOOC. Machine learning still needs a human to choose the right features for each use case, and this is where domain knowledge plays a key role. A lot of effort goes into collecting and correcting the feature data, choosing the right features for your model, introducing new features, filling in missing values, changing feature classes (types), and normalizing existing features across the entire dataset. Whenever you come across useful feature engineering scripts, save them in your code repository; you never know when you will need those snippets. Also keep in mind that too many features can hurt overall accuracy, add noise, and slow down your model build process.
  • Model tuning – You can easily get hold of a basic implementation of any ML algorithm in R or Python, but model tuning is another critical factor. Understanding each model's parameters and how to adjust them matters a great deal for getting better accuracy. This requires a good, in-depth understanding of the ML algorithm. There are lots of YouTube videos and white papers on individual models that you can rely on.
  • Avoid Overfitting – This is one of the most common errors that can affect your overall model accuracy. When you train your model on the training set, building a model that is too specific to that input data results in overfitting: it will not predict well on new test data or future data. When a model overfits, its accuracy on a new set of data drops significantly. Whenever you train a model, check whether overfitting has occurred, for example by comparing accuracy on the training data with accuracy on data the model has not seen.
  • Getting a handle on various ML techniques – You will learn the basic ML techniques in any MOOC, but many more techniques are available, and MOOCs are not designed to cover every new one. So you need to spend some quality time learning new ML techniques on your own: how they are implemented in R / Python / Java, what parameters are passed to the model, and so on.
  • Model Ensemble Techniques – This is a hot topic for increasing model accuracy. In short, it is about building models based on various ML techniques and samplings of data, then making your final prediction from a combination of the results of multiple models (you could average the results from several models, assign weights to the results from different models, take a majority vote, etc.). You can also look into stacking and blending, which are also ensemble techniques. Ensemble techniques have been one of the critical success factors in winning Kaggle competitions.
  • ML Implementation platforms – Multiple platforms such as R, Python, Java, and the Spark ML libraries are available for ML implementation. You can start with whichever platform you are comfortable with. Most of the common code repositories are based on Python and R. Computing requirements also play a role in training the model: multiple times, I have seen my machine crash over repeated iterations of model training due to memory constraints. There are several options to overcome computing limitations; for example, you can use cloud-based computing resources to train your model on large datasets. Try looking at the Azure ML platform. But for a starter, basic computing resources are good enough.
  • Visualization (Data/Model) – Visualizing the base data and the model is another critical area. It gives a holistic view of how well your data is organized, how your model performs, etc. R has lots of plotting functions to start with. Get familiar with the available plotting techniques and use them whenever necessary.
  • Work on real projects to gain a practitioner's view of deploying ML techniques. You can make use of the Kaggle platform, which already has lots of use cases to start with.
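
To make a few of the points above more concrete (filling in missing feature values, normalizing a feature, combining the results of several models, and checking for overfitting), here is a minimal Python sketch using only the standard library. The function names and the toy numbers are made up for illustration and are not from the KDnuggets article; in a real project you would most likely rely on an ML library rather than hand-written helpers like these.

```python
from collections import Counter
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale a feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def average_predictions(predictions, weights=None):
    """Average numeric predictions per sample; one inner list per model.

    If weights are given, there must be one weight per model.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, sample)) / total
        for sample in zip(*predictions)
    ]


def majority_vote(predictions):
    """Pick the most common class label per sample; one inner list per model."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical feature column with one missing value.
    ages = fill_missing_with_mean([20.0, None, 40.0])
    print(min_max_normalize(ages))

    # Hypothetical class predictions from three models on two samples.
    votes = majority_vote([["spam", "ham"], ["spam", "spam"], ["ham", "ham"]])
    print(votes)

    # A large gap between training and test accuracy hints at overfitting.
    train_acc = accuracy(["a", "b", "a", "b"], ["a", "b", "a", "b"])
    test_acc = accuracy(["a", "a", "a", "b"], ["a", "b", "b", "a"])
    print("train/test accuracy gap:", train_acc - test_acc)
```

Stacking and blending go a step further by training another model on top of the individual models' predictions, which this sketch does not attempt.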

To read the whole article, go to http://www.kdnuggets.com/2015/09/acquire-machine-learning-skills.html.

