As the title suggests, this article is aimed at new developers who want to be part of the digital revolution that is data science but have minimal knowledge of machine learning and Python.
What is Machine Learning?
Machine learning is a field of computer science that often uses statistical techniques to give computers the ability to “learn” from data without being explicitly programmed. It is an application of Artificial Intelligence (AI). In practice, this means we feed data into an algorithm and use the result to make predictions about what might happen in the future.
In 1997, Tom Mitchell gave a “well-posed” definition that has been proven to be more useful for the engineering types: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P).
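To make Mitchell's E, T and P concrete, here is a minimal sketch using scikit-learn. The traffic data is synthetic and made up purely for illustration (a noisy daily cycle), and the names make_traffic and errors are hypothetical. The task T is predicting cars per hour, the experience E is the number of past observations, and the performance measure P is the mean absolute error on unseen data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)


def make_traffic(n):
    """Synthetic (made-up) data: cars per hour as a noisy daily cycle."""
    hour = rng.uniform(0, 24, size=n)
    cars = 200 + 80 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 20, size=n)
    # Encode the hour as sine/cosine so a linear model can follow the cycle.
    features = np.column_stack([np.sin(2 * np.pi * hour / 24),
                                np.cos(2 * np.pi * hour / 24)])
    return features, cars


X_test, y_test = make_traffic(1000)   # held-out "future" traffic

errors = {}
for n_examples in (5, 50, 500):       # growing experience E
    X_train, y_train = make_traffic(n_examples)
    model = LinearRegression().fit(X_train, y_train)
    # Performance measure P: mean absolute error on the task T.
    errors[n_examples] = mean_absolute_error(y_test, model.predict(X_test))

for n_examples, mae in errors.items():
    print(f"E = {n_examples:3d} examples -> P (MAE) = {mae:.1f} cars/hour")
```

If the program is learning in Mitchell's sense, the error should generally shrink as the number of examples grows, although a single run with very few examples can be noisy.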
Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:
- Supervised machine learning – The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.
- Unsupervised machine learning – The program is given a bunch of data and must find patterns and relationships therein.
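The difference between the two is easiest to see in code. The sketch below is only an illustration, assuming scikit-learn is installed and using its bundled Iris data as a stand-in dataset; the choice of k-nearest neighbours and k-means is mine, not something the definitions above prescribe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the model sees the features X together with the labels y.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)
predicted_labels = classifier.predict(X[:3])

# Unsupervised: the model sees only X and has to group the rows itself.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print("Predicted labels:", predicted_labels)
print("Cluster ids of the first rows:", cluster_ids[:3])
```

Note that the cluster ids are arbitrary group numbers; nothing ties them to the real class labels, because the unsupervised model never saw those labels.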
Machine learning has a vast range of applications across domains such as:
- Healthcare (e.g., personalized treatments and medications, drug manufacturing)
- Finance (e.g., fraud detection)
- Retail (e.g., product recommendations, improved customer service)
- Travel (e.g., dynamic pricing, such as how Uber determines the price of your ride, and sentiment analysis, such as TripAdvisor collecting information from the photos and reviews travellers share on social media and trying to improve its service based on those reviews)
- Media (e.g., Facebook: from personalizing the news feed to rendering targeted ads, machine learning is at the heart of social media platforms, for their own benefit and that of their users)
Python is a complete language and platform that one can use both for research and for developing production systems. With so many libraries and modules to choose from, however, getting started can feel overwhelming.
So, let’s go through the step-by-step procedure a beginner can follow to start machine learning with Python.
- The first step is to learn Python.
Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. It was created by Guido van Rossum in the late 1980s and first released in 1991. The Python source code is available under the Python Software Foundation License, an open-source license that is compatible with the GNU General Public License (GPL).
You can use the following sources to build your Python skills:
Python has an amazing ecosystem of libraries that make machine learning easy to get started with. It is one of the most popular and in-demand languages in the job market today, which is why plenty of learning resources are available online. Learners should have little difficulty finding material.
- The next step is to install Anaconda from the given link,
Follow the installation instructions given on the site. The Anaconda distribution contains the packages required to explore machine learning.
- You have to learn basic machine learning skills.
If you want an overall idea of machine learning from scratch, you might want to follow this crash course by Google:
Andrew Ng’s Machine Learning course is also a great option for learners.
Once we are comfortable with Python and machine learning, we can move on to the key Python libraries.
- Pandas: Our first step is to read in the data and produce some quick, relevant summary statistics, for which we shall use the Pandas library. Pandas provides data structures and data analysis tools that make manipulating data in Python much quicker and more effective.
We’ll read our data from a CSV file into a Pandas dataframe using the read_csv function.
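As a small sketch of that first step, the snippet below reads CSV data into a dataframe and prints a few quick summaries. To keep it self-contained, the "file" is an in-memory string with made-up columns and values; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the columns and values are made up.
csv_text = """name,age,city
Asha,31,Pune
Ben,27,Leeds
Chen,45,Austin
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows
print(df.shape)         # (rows, columns)
print(df.describe())    # summary statistics for the numeric columns
```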
- NumPy: A matrix is a two-dimensional data structure with rows and columns. Matrices in Python can be used via the NumPy library.
With a matrix, we can’t easily access columns and rows by name, and every element has to have the same datatype. Hence we use dataframes, the most common data structure in Pandas. A dataframe is an extension of a matrix: it can have a different datatype in each column, its rows and columns can be accessed by name, and it has a lot of built-in features for analyzing data.
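The contrast can be sketched in a few lines. The values and column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy matrix: every element shares one datatype, access is by position.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
print(matrix[:, 0])        # first column, by index
print(matrix.dtype)        # a single datatype for the whole matrix

# A dataframe: columns have names and may have different datatypes.
frame = pd.DataFrame({"city": ["Pune", "Leeds"], "temp_c": [31.5, 12.0]})
print(frame["temp_c"])     # column, by name
print(frame.dtypes)        # one datatype per column
```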
- Matplotlib: It is the main plotting infrastructure in Python, and many other plotting libraries, such as seaborn and the Python ports of ggplot2 (itself an R library), are built on top of it. We import Matplotlib’s plotting functions with import matplotlib.pyplot as plt, and can then draw and show plots.
- Scikit-learn: The library is built upon the SciPy (Scientific Python) stack, which must be installed before you can use scikit-learn. This stack includes:
- NumPy: Base n-dimensional array package
- SciPy: Fundamental library for scientific computing
- Matplotlib: Comprehensive 2D/3D plotting
- IPython: Enhanced interactive console
- SymPy: Symbolic mathematics
- Pandas: Data structures and analysis
Extensions or modules for SciPy are conventionally named SciKits. As such, the module that provides learning algorithms is named scikit-learn.
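Before going further, it can help to confirm that Python and the main libraries of this stack are actually installed. The following is one possible check, not an official procedure; it looks the libraries up by their import names (note that scikit-learn is imported as sklearn) and reports any that are missing.

```python
import importlib
import sys

print("Python:", sys.version.split()[0])

# Import names of the libraries discussed above (sklearn is scikit-learn).
versions = {}
for name in ("numpy", "scipy", "matplotlib", "pandas", "sklearn"):
    try:
        module = importlib.import_module(name)
        versions[name] = module.__version__
    except ImportError:
        versions[name] = "not installed"

for name, version in versions.items():
    print(f"{name}: {version}")
```

With an Anaconda installation, all five would normally report a version number.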
Now that you have a grip on the basics of Python, its libraries, and machine learning algorithms, it’s best to start with a small project. Here are the steps to follow:
- Define a Problem
- Prepare the Data
- Evaluate the Algorithms
- Improve the Results
- Present the Results
To start with machine learning using Python, after installing Anaconda as described above, first check the version of Python you are using, and then:
- Import the libraries, such as sklearn, pandas, matplotlib, scipy, numpy
- Load the dataset: We load the data using pandas.
- Summarize the dataset, which includes:
Dimensions of the dataset: print(dataset.shape)
Statistical summary:
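The loading and summarizing steps might look like the sketch below. Since no particular dataset is named here, it assumes the Iris data bundled with scikit-learn as a stand-in; with your own data you would load a file with pandas.read_csv instead. Using describe() for the statistical summary is likewise my assumption of a reasonable choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the Iris data bundled with scikit-learn stands in for your own dataset.
iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)
dataset["species"] = iris.target_names[iris.target]

print(dataset.shape)                         # dimensions: (rows, columns)
print(dataset.head())                        # a peek at the first rows
print(dataset.describe())                    # count, mean, min, max, quartiles
print(dataset.groupby("species").size())     # number of rows per class
```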
- Visualization of the dataset: Data visualization comprises two kinds of plots: univariate and multivariate.
Univariate plots are used to understand each attribute better. In this case, we can create box plots and histograms.
Multivariate plots, in contrast, are used to understand the relationships between attributes. In this case, a scatter plot can show the correlation between attributes.
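Both kinds of plots can be drawn straight from a dataframe. The sketch below again assumes the Iris data as a stand-in dataset, and it saves the figures to image files (the file names are arbitrary) so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line to get interactive windows
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)

# Univariate: one box plot and one histogram per attribute.
dataset.plot(kind="box", subplots=True, layout=(2, 2),
             sharex=False, sharey=False, figsize=(8, 6))
plt.savefig("box_plots.png")

hist_axes = dataset.hist(figsize=(8, 6))
plt.savefig("histograms.png")

# Multivariate: scatter plots for every pair of attributes.
pair_axes = scatter_matrix(dataset, figsize=(8, 8))
plt.savefig("scatter_matrix.png")
```

In an interactive session you would typically call plt.show() instead of saving the figures.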
- Evaluate some algorithms:
First, separate a validation set from the dataset, say 20% of it, which the algorithms won’t be able to see or access.
Next, split the remaining data into two parts, training (80%) and test (20%). Then choose a scoring metric on which the models will be evaluated, for example accuracy.
Accuracy is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g., 95% accuracy).
With everything set up, we can build the models.
To get good accuracy, we fit several different models on the training set and measure the accuracy of each on the test set. The model with the highest accuracy is then considered the best fit for the given problem.
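The procedure above can be sketched as follows. The dataset (Iris) and the three candidate models are assumptions chosen for illustration; any classifiers could be compared the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back 20% as a validation set that no model sees during selection.
X_rest, X_validation, y_rest, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)

# Split the remainder into training (80%) and test (20%) parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=1, stratify=y_rest)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=1),
}

# Accuracy = correct predictions / total predictions, shown as a percentage.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = 100 * accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.1f}%")

best_name = max(scores, key=scores.get)
print("Best on the test part:", best_name)
```

On a dataset this small, a single split gives a fairly noisy estimate, so models with similar scores should not be ranked too confidently.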
- Make predictions: After choosing the best model, we want to know how well it performs on the validation set. We run the best model directly on the validation set and summarize the results as a final accuracy score. It’s always good practice to keep a validation set, as it helps us find out whether the model has overfitted the training set and is giving us overly optimistic results.
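A final check on the validation set might look like this. It assumes, purely for illustration, that k-nearest neighbours was the best model, and it refits that model on all non-validation data before predicting; both are my choices rather than fixed rules.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_rest, X_validation, y_rest, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)

# Assumption: k-nearest neighbours came out best during model selection.
best_model = KNeighborsClassifier()
best_model.fit(X_rest, y_rest)

predictions = best_model.predict(X_validation)
validation_accuracy = 100 * accuracy_score(y_validation, predictions)

print(f"Validation accuracy: {validation_accuracy:.1f}%")
print(confusion_matrix(y_validation, predictions))       # errors per class
print(classification_report(y_validation, predictions))  # precision, recall, F1
```

If the validation accuracy is much lower than the accuracy seen during model selection, that is a sign of overfitting.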