Welcome to the Data Analytics and Machine Learning using Python Bootcamp at MSRIT

MSRIT Bootcamp

The Data Analytics and Machine Learning using Python bootcamp is intended to give you a flavor of many aspects of Data Science and Machine Learning with the goal of inspiring you to go on your own learning journey.

In this bootcamp we will cover the following:

  • Introduction to Python programming
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • How to Keep Learning

Follow me on LinkedIn where I share updates and suggestions for early-career Data Science/Machine Learning professionals and students looking to gain a headstart in this field.

Essential Data Science Skills

Data Science Technical Skills

A data scientist may require many skills, but what sets them apart is their technical knowledge. Many technical skills and specialized tools will be essential for data scientists to be familiar with, but there exists a core set of technical knowledge that can be applied to a majority of problems across domains.

Data scientists use programming skills to apply techniques such as machine learning, artificial intelligence (AI) and data mining. It is essential for them to have a good grasp on the mathematics and statistics involved in these techniques. This allows a data scientist to know when to apply each technique.

In addition to understanding the fundamentals, data scientists should be familiar with the popular programming languages and tools used to implement these techniques.

Data Visualization

Data visualization is an essential skill to acquire for a data scientist. Visualization enables the data scientist to see patterns and guide their exploration of the data. Second, it allows them to tell a compelling story using data. These are both critical aspects of the data science workflow.

Programming/Software

Data scientists use a variety of programming languages and software packages to comprehensively and efficiently extract, clean, analyze, and visualize data. Though there are always new tools in the rapidly changing world of data science, a few have stood the test of time.

Below are a few important and popularly used tools that aspiring data scientists should familiarize themselves with to develop programming and software data scientist skills:

R

R was once confined almost exclusively to academia, but social networking services, financial institutions, and media outlets now use this programming language and software environment for statistical analysis, data visualization, and predictive modeling. R is open-source and has a long history of use for statistics and data analytics.

Python

Python, unlike R, was not primarily designed for data analysis. The pandas Python library was created to fill this gap. Python has gained in popularity with a very extensive ecosystem of tools and libraries for all aspects of the data science workflow, in addition to being used for software engineering tasks.

Tableau

Tableau provides a high-level interface for exploring and visualizing data in friendly and dynamic dashboards. These dashboards are essential at organizations that prioritize data-driven decision making.

Hadoop

Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop offers computing power, flexibility, fault tolerance and scalability. Hadoop is developed by the Apache Software Foundation and includes various tools such as the Hadoop Distributed File System and an implementation of the MapReduce programming model.

SQL

SQL, or Structured Query Language, is a special-purpose programming language for managing data held in relational database management systems. There are multiple implementations of the same general syntax, including MySQL, SQLite and PostgreSQL.

Some of what you can do with SQL—data insertion, queries, updating and deleting, schema creation and modification, and data access control—you can also accomplish with R, Python, or even Excel, but writing your own SQL code could be more efficient and yield reproducible scripts.

Apache Spark

Similar to Hadoop, Spark is a cluster computing framework that enables clusters of computers to process data in parallel. Spark is faster at many tasks than Hadoop due to its focus on enabling faster data access by storing data in RAM. It replaces Hadoop’s MapReduce implementation but still relies on the Hadoop Distributed File System.

Statistics/Mathematics

Today, it is software that runs all the necessary statistical tests, but a data scientist still needs the sensibility to know which test to run when and how to interpret the results.

A solid understanding of multivariable calculus and linear algebra, which form the basis of many data science and machine learning techniques, will allow a data scientist to build their understanding on strong foundations.

An understanding of statistical concepts will help data scientists develop the skills to understand the capabilities, but also the limitations and assumptions of these techniques. A data scientist should understand the assumptions that need to be met for each statistical test.

Data scientists don’t only use complex techniques like neural networks to derive insight. Even linear regression is a form of machine learning that can provide valuable insights. Simply plotting data on a chart and understanding what it means are basic but essential first steps in the data science process.

Mathematical concepts such as logarithmic and exponential relationships are common in real-world data. Understanding and applying both the fundamentals as well as advanced statistical techniques are skills that data scientists need to find meaning in data.

Though much of the mathematical heavy lifting is done by computers, understanding what makes this possible is essential. Data scientists are tasked with knowing what questions to pose, and how to make computers answer them.

Computer science is in many ways a field of mathematics. Therefore, the need for mathematical data scientist skills is clear.

Data Scientist Soft Skills

Data science requires a diverse set of skills. It is an interdisciplinary field that draws on aspects of science, math, computer science, business and communication. Data scientists may benefit from a diverse skill-set that enables them to both crunch the numbers and effectively influence decisions.

Because data scientists focus on using data to influence and inform real-world decisions, they should be able to bridge the gap between numbers and actions. This requires skilled communication and an understanding of the business implications of their recommendations. Data scientists should be able to work as part of a larger team, providing data-driven suggestions in a compelling form. This requires skills that go beyond the data, statistics and tools that data scientists use.

Communication

Data scientists should be able to report technical findings such that they are comprehensible to non-technical colleagues, whether corner-office executives or associates in the marketing department.

Make your data-driven story not just comprehensible but compelling.

One important data scientist skill is communication. In order to be effective as a data scientist, people need to be able to understand the data. Data scientists act as a bridge between complex, uninterpretable raw data and actual people. Though cleaning, processing and analyzing data are essential steps in the data science pipeline, this work is useless without effective communication.

Effective communication requires a few key components.

Visualization allows a data scientist to craft a compelling story from data. Whether the story describes a problem, proposes a solution or raises a question, it is essential that the data be presented in a way that leads the audience to reach the intended conclusions. In order for this to happen, data scientists should describe the data and process in a shared language, avoiding jargon and unnecessary complexity.

Business Acumen

Business awareness could now be considered a prerequisite for effective data science. A data scientist should develop an understanding of the field they are working in before they are able to understand the meaning of data. Though some metrics, like profit and conversions, exist across industries, many key performance indicators (KPIs) are highly specialized. This data makes up the industry’s business intelligence, which is used to understand where the business is and the historical trends that have taken it there.

The unique goals, requirements and limitations of each industry define every step that a data scientist takes. Without understanding the underlying aspects of the industry, it could be impossible to find meaningful insight or make useful recommendations.

A data scientist may be most effective when they truly understand the business they are advising. Though data can provide unique insights, it may not capture the full picture. This requires a data scientist to be aware of the processes and realities at play in their industry. Though they may share a job title, the precise goals and tasks of a data scientist will vary greatly by industry. To be successful, a data scientist should understand the industry that they are working in.

Data-Driven Problem Solving

Data-driven problem solving allows data to inform the entire data science process. By using a structured approach to identify and frame problems, the decision making process could be simplified. In data science, the vast quantity of data and tools creates nearly endless avenues to pursue. Managing these decisions is an essential job for a data scientist. Data science both informs and is informed by the data-driven problem solving process.

A data scientist is likely to know how to productively approach a problem. This means identifying a situation’s salient features, figuring out how to frame a question that will yield the desired answer, deciding what approximations make sense, and consulting the right co-workers at the appropriate junctures of the analytic process. All of that in addition to knowing which data science methods to apply to the problem at hand.

A data scientist’s job is to understand how to take raw data and derive meaning from it. This requires more than just an understanding of advanced statistics and machine learning. They also need to integrate their understanding of the problem domain, available information and their goals when deciding how to proceed.

Data science problems and solutions are never obvious. There are many possible paths to explore, and it is easy to become overwhelmed with the options. A structured approach to data-driven problem solving allows for a data scientist to track and manage progress and outcomes. Structured techniques such as Six Sigma are great tools to help data scientists and teams solve real-world data science problems.

Excerpted from Essential Data Science Skills

Tableau

Tableau is the data analytics tool that companies across the globe have embraced to communicate with data and develop a culture of data-driven decision making.

1 TABLEAU IS EASY

Data can be complicated. Tableau makes it easy. Tableau is a data visualization tool that takes data and presents it in a user-friendly format of charts and graphs.

There is no code writing required. You’ll easily master the end-to-end cycle of data analytics.

2 TABLEAU IS TREMENDOUSLY USEFUL

“Anyone who works in data should learn tools that help tell data stories with quality visual analytics.”

The smart data analysts, data scientists, and data engineers who have quickly adopted and started to use Tableau have gained a key competitive advantage in the recent data-related hiring frenzy.

Check out the visualizations developed by Tableau users.

3 TABLEAU DATA ANALYSTS ARE IN DEMAND

As more and more businesses discover the value of data, the demand for analysts is growing. One advantage of Tableau is that it is so visually pleasing and easy for busy executives — and even the tech-averse — to use and understand. Tableau presents complicated and sophisticated data in a simple visualization format. In other words, CEOs love it.

Think of Tableau as your secret weapon. Once you learn it, you can easily surface critical information to stakeholders in a visually compelling format.

“Tableau helps organizations leverage business intelligence to become more data-driven in their decision-making process.”

Get Started

The Tableau Public desktop app is a great way for you to get started.

Do the following to get started quickly:

  • Download and install Tableau Public on your desktop
  • Head over to Kaggle Datasets and find a CSV file of your liking and download it
  • Open Tableau and read the downloaded CSV File
  • Check out the tutorials here on how to use Tableau and build simple visualizations with it

Run Your Own Tableau Bootcamp

For students who are pursuing Analytics degree programs, a great way to conduct a cohort-based bootcamp is to adopt the following framework. The placement coordinator or student leaders can adopt the framework described below to run the competition. Industry leaders from your alumni group can be invited to judge the best dashboards developed.

  • Students will organize themselves into a maximum of 10 teams.

  • Each team will do the following:

    • choose a team name
    • choose a team mascot/logo
    • nominate a team manager, who will manage the operations of the team, set up its meetings, choose the final Tableau layout and dashboard name, and act as the team representative
    • nominate lead analyst 1, who will develop the dashboard
    • nominate lead analyst 2, who will develop the dashboard
    • nominate a business analyst, who will research candidate datasets, consult with the team on which dataset to choose for building the dashboard, and shape the final business narrative
    • nominate a visualization researcher, who will survey relevant public dashboards and discuss with the team which visualizations to choose for the dashboard

If the team has more than 5 members, the additional members will be nominated as lead analyst 3, lead analyst 4 etc.

The first meeting with the students will have each team manager present a slide where they will describe their team, the dataset chosen and what they have in mind for the final dashboard.

We will use this session to answer any questions they may have (not technical questions). This session can also be for just the managers to meet with us and discuss their approach and choice of dataset and vision for the final dashboard.

A week later we meet with the students for the final presentations. Each team will be given 10 minutes to go over the dashboard. The manager will present for 5 to 6 minutes. We will ask questions for the last 4 minutes or so of each presentation.

We will use the following to guide our assessments:

Did the team select an interesting story to tell with the data as it related to the topic and audience?

A story should have a clear beginning, middle, and end. Questions are useful to guide the audience with answers as are takeaways that drive the narrative from introduction to conclusion.

Did the team select appropriate visualizations (chart types) to present the data?

Charts and graphs should clearly show the data without bias. The chart type should be familiar to the audience. Avoid overly-complex charts that look fancy but don't clearly show insights.

Did the team apply effective design principles to the charts to clearly present the data?

Charts should be devoid of extraneous non-data elements such as shadows, borders, and use of color for decoration.

Did the team apply effective design principles to the presentation?

Presentation decks should be designed to support the presentation rather than be the presentation. Text should be limited and the use of visuals should be emphasized. All text should be readable (at least 36-point font) for the audience.

Did the group present visualizations that worked together to tell a coherent story?

The visualizations used should progressively reveal the insights or trends. Each visualization should highlight a single and different takeaway.

Students will be asked to upload their presentations into the Tableau Public Gallery.

The following dashboards were created by student teams from the Delhi University Statistics Department in 2021.

Insight Strategists

Dashboard

Team KRASS

Dashboard 1

Dashboard 2

Viz-Zards

Dashboard 1

Dashboard 2

Dashboard 3

Pentacle

Dashboard

Abraca-Data

Dashboard

Datatrons

Dashboard

Outliers

Dashboard

Data Demystifiers

Dashboard

Nirvachan

Dashboard

V Stay

Dashboard

SQL

SQL (Structured Query Language) is the main programming language used by database management systems.

We live in a data-driven world. Everything that we do online creates data. Every business transaction creates data: cash register sales, inventory changes, and upcoming shipments. According to Forbes, 90% of the data in the world was created in the last two years. All of this data provides invaluable information for businesses.

Over 97% of business organizations are investing in data.

SQL (Structured Query Language) is a programming language used to manage data in databases. SQL is the second-most popular programming language, and it is used by a majority of all developers. It has been the primary language of digital databases since the 1970s.

SQL is the most common method of accessing the immense amount of essential data stored in databases. Given how valuable this data has become to companies, people who know how to use SQL to access it have a considerable advantage in the job market.

Why learn SQL?

SQL as a job skill is relevant in every industry and one of the foundations of a career in data science: a hot career path Harvard named “the most promising career of the 21st century.” Learning SQL is recommended for anyone who wants to work in either coding or data. Some job positions that require SQL training are SQL Developer, Software Engineer, Data Scientist, Data Analyst, Database Administrator, and Big Data Architect.

Some of the primary uses for SQL include:

  • Powering numerous types of databases, including Microsoft SQL, MySQL, and PostgreSQL. These databases support innumerable companies of various sizes: Microsoft SQL alone is used by over 200,000 companies worldwide.
  • SQL can also be found in other types of technology, including iPhones and Androids. For example, SQLite is used on every smartphone, in every browser, on Skype calls, in music streaming services, and in some televisions.
  • Unlike other languages, SQL can also be used for non-programming purposes, such as marketers or finance workers querying for data.

With such widespread usage, developers who learn SQL will remain in demand for decades.

There are several reasons why SQL makes it so much easier for workers to learn other programming languages, including:

  • SQL’s structure offers a straightforward framework for data analysis
  • SQL is flexible and can be optimized by developers to run any query
  • The SQL language can be adapted and extended with new calculations
  • SQL has a secure future

Get Started with SQL

SQL is how you interact with a database. When interacting with a database, you connect to it using a client. A client is a software program that you run and into which you issue SQL commands.

You can also connect to databases from programming languages and issue SQL commands through their database interfaces.

Head over to this article, which shows how you can use DBeaver, a popular SQL client, to connect to a variety of databases such as PostgreSQL.

With DBeaver you can also connect to a CSV file and query it as though it were a database table. Additionally, you can connect to SQLite databases, which are embedded databases; this simply means the entire database is contained within a single file.
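
To make this concrete, below is a minimal sketch that uses Python's built-in sqlite3 module (rather than DBeaver) to create an embedded SQLite database in a single file, insert a couple of rows, and issue a SQL query against it. The file, table and column names are made up for illustration.

import sqlite3

# Connect to (and create, if missing) an embedded SQLite database stored in a single file
conn = sqlite3.connect("bootcamp.db")  # hypothetical file name
cur = conn.cursor()

# Create a small table and insert a couple of rows
cur.execute("CREATE TABLE IF NOT EXISTS launches (mission TEXT, launch_date TEXT)")
cur.executemany(
    "INSERT INTO launches (mission, launch_date) VALUES (?, ?)",
    [("PSLV-C51", "Feb 28 2021"), ("PSLV-C50", "Dec 17 2020")],
)
conn.commit()

# Issue a SQL query and fetch the results back into Python
for mission, launch_date in cur.execute("SELECT mission, launch_date FROM launches ORDER BY mission"):
    print(mission, launch_date)

conn.close()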

Business Case Study with SQL

The Android App Store draws millions of users every day to download apps for their smartphones. Apps belong to various categories such as Entertainment, Productivity, Children, and Gaming.

Apps have various ratings and reviews.

Imagine being able to analyze this information and draw insights that can help you learn more about the dynamics of this vibrant marketplace.

Below is a three-part series that takes you into this dataset and shows you how you can use SQL to answer interesting business questions.

Android Marketplace Analysis - Part 1

Android Marketplace Analysis - Part 2

Android Marketplace Analysis - Part 3
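
To give a flavor of the kind of question the series answers, here is a minimal, hypothetical sketch: it builds a tiny made-up apps table in an in-memory SQLite database and asks which category has the highest average rating. The table, columns and values are invented for illustration and do not come from the dataset used in the series.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Hypothetical schema and data, for illustration only
cur.execute("CREATE TABLE apps (name TEXT, category TEXT, rating REAL, reviews INTEGER)")
cur.executemany(
    "INSERT INTO apps VALUES (?, ?, ?, ?)",
    [
        ("App A", "Gaming", 4.5, 12000),
        ("App B", "Gaming", 4.1, 8000),
        ("App C", "Productivity", 4.7, 3000),
        ("App D", "Entertainment", 3.9, 15000),
    ],
)

# Business question: average rating per category, highest first
query = """
    SELECT category, ROUND(AVG(rating), 2) AS avg_rating, SUM(reviews) AS total_reviews
    FROM apps
    GROUP BY category
    ORDER BY avg_rating DESC
"""
for row in cur.execute(query):
    print(row)

conn.close()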

Python is the #1 programming language for Data Science and Machine Learning (DSML) professionals.

Cubbon Park, Bangalore

Learning and investing your time in Python is necessary if you want to become a DSML professional. The ecosystem around Python is very extensive and the community is very passionate and helpful to new users.

The video below walks through a report released by JetBrains where they surveyed the Big Data industry. As you can see, Python continues to be the #1 choice across the industry.

Now that you have decided to learn Python, you will need to learn how to:

  • Run and test your Python code quickly
  • Visualize data
  • Run models and see results
  • Develop Python scripts
  • Develop Python web applications
  • Develop software with Python
  • and many more..

To be able to do this variety of tasks, you can learn how to use Jupyter for quick prototyping and research, and you will need to learn how to use an Integrated Development Environment (IDE) for the software development aspects.

PyCharm

JetBrains makes a very cool Python IDE called PyCharm.

Check it out. The Community edition is free.

VS Code

Another popular IDE is VS Code.

VS Code has certainly captured the imagination of DSML professionals. VS Code is free and is developed by Microsoft.

VS Code isn't specific to Python programming; you can code in many programming languages with VS Code. With a very large marketplace of plugins and extensions, VS Code will give you broad capabilities as a DSML professional.

Python is a versatile language and is used for a variety of Software Engineering tasks in addition to being the #1 choice in DSML.

Style Guide

As you start writing code you will deal with basic challenges such as naming variables and functions, and how to write comments. You will wonder how to write clean and readable code. Style guides provide just that: guidance on how to write clean and readable code. They also create a form of standardization that improves knowledge sharing amongst peers and across the industry.

The Python foundation has published a style guide called PEP 8. You should read PEP 8 after you have read the Google Python style guide.

The one from Google is an easier read.

Remember, these style guides are not a one-time read. You have to keep coming back to them.

If you read them only when you are trying to solve the problem of naming a variable, you will not find an immediate answer. These style guides need to be read at leisure and with a certain regularity. Only during such readings will you "absorb" the suggestions and conventions so that they become part of your natural coding style.

Only then will the style guides work for you.
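
As a small taste of what these guides cover, here is a before-and-after sketch of the naming and commenting conventions that PEP 8 and the Google style guide encourage; the function itself is a made-up example.

# Hard to read: terse names, no documentation
def f(t, r):
    return t * r


# Closer to PEP 8 / Google style: snake_case names, a docstring, clear intent
def total_fare(ticket_price, rider_count):
    """Return the total fare for a group of riders."""
    return ticket_price * rider_count


print(total_fare(45.0, 3))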

Kaggle is a great platform that combines access to the variety of building blocks that make up an end-to-end Data Science and Machine Learning (DSML) workflow.

Datasets

On Kaggle a large collection of datasets is available. These datasets are of different kinds: tabular, image and text datasets are available for you to analyze and create models from. These datasets are released by companies and also by data enthusiasts.

Competitions

Competitions are how Kaggle first gained huge popularity. There are many types of competitions. The ones that have a cash prize are sponsored by large companies who wish to learn from the best Data professionals in the world.

Imagine a company wants to solve a DSML problem. Kaggle gives them a chance to present the problem to the world and have the best DSML professionals and enthusiasts compete with each other for the final prize.

Kaggle has created a very solid talent network for Data professionals. The winners of many of these competitions go on to get top jobs in the DSML industry. Many of the winners are already top professionals and academics in DSML, and seeing how they win a competition is the best kind of practical education in the DSML world.

Notebooks

Notebooks are the building blocks of all DSML workflows. Students, professors and professionals all use Notebooks during research and also in production.

Notebooks are a very different approach to developing technical solutions. For DSML professionals this is a huge advantage.

In a Notebook you can write code and see the output right there. And you can break up a long problem statement into small sub parts. And for each sub part you can write code and see the results.

With this approach you can build up from zero to the final solution. You can visualize data and results. You can extract output in the form of reports.

And you can also finally put this Notebook into production as the final DSML product.

On Kaggle you get to see how the top DSML professionals have solved problems. Their Notebooks are there for you to see and learn from. This is like working with the top DSML people but from the comfort of your home and country.

Discussions from the top professionals in the industry

Discussions on Kaggle are across many topics. These discussions are a gold mine. In these forums you get to learn from DSML professionals asking very important and critical questions.

These questions are based on data quality, feature engineering, model selection, performance testing of model results and many other critical aspects of DSML work.

The responses to these questions reveals the best practices and alternative approaches that DSML people apply in solving challenging problems. You can very well imagine that having access to such discussions can increase your understanding of DSML by leaps and bounds.

These are evolving discussions, so you should keep coming back to the forums to see how the discussions have changed over time. You are seeing the industry change in real time. This is what it means to be at the cutting edge.

Courses

Kaggle Courses are short introductions to all aspects of DSML. You can find courses on introductory programming, data analysis, machine learning and deep neural networks.

Python is used successfully in thousands of real-world business applications around the world, including many large and mission critical systems.

Here are some quotes from happy Python users:

"Python is fast enough for our site and allows us to produce maintainable features in record times, with a minimum of developers," said Cuong Do, Software Architect, YouTube.com.

"Python plays a key role in our production pipeline. Without it a project the size of Star Wars: Episode II would have been very difficult to pull off. From crowd rendering to batch processing to compositing, Python binds all things together," said Tommy Burnette, Senior Technical Director, Industrial Light & Magic.

"Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python, and we're looking for more people with skills in this language," said Peter Norvig, director of search quality at Google, Inc.

Python usage survey and reports

The PSF (Python Software Foundation) compiled a brochure many years ago to showcase the advantages of using Python. While this brochure is useful as a quick read, there is a lot more happening with Python today.

A more updated version lists a large number of Python usage trends in the industry, such as:

  • Internet of Things
  • Machine Learning
  • Startups
  • Web Development
  • Fintech
  • Data Science
  • Data Engineering

The JetBrains survey on Python is a great read. Check the video below to see the highlights of this survey.

Learn Python

Learning how to code can take on a variety of approaches. You can learn by reading a book, YouTube videos or blogs. But the most important thing is to practice.

In the section below you will come across a few concepts in Python which can be a great place to start.

Create your first Notebook

We will use Kaggle to create a Notebook where all of our code will be executed.

Data Types

# string
name = "Aryabhatta"
occupation = "Mathematician"

# an integer
age = 23

# float
weight = 64.81

# boolean
is_a_mathematician = True
can_ride_elephant = False

# list
isro_missions = ["GSLV-F10", "PSLV-C51", "PSLV-C50", "PSLV-C49"]

# dictionary
isro_mission_dates = {
'PSLV-C53' : 'Jun 30 2022',
'PSLV-C52' : 'Feb 14 2022',
'GSLV-F10' : 'Aug 12 2021',
'PSLV-C51' : 'Feb 28 2021',
'PSLV-C50' : 'Dec 17 2020'
}

just_the_isro_mission_names = isro_mission_dates.keys()  # a view of the mission names
just_the_isro_mission_dates = isro_mission_dates.values() # a view of the mission dates

# tuple
# can organize the basic variable types in a comma delimited manner
# and retrieve them in that order
pslv_isro_launches = ('PSLV-C53', 'PSLV-F10')
pslv_c, pslv_f = pslv_isro_launches 

# objects (classes)
# use objects when you have a lot of things with similar data structure and actions

class SatelliteLaunch(object):
    # __init__ automatically called when creating an object
    def __init__(self, name = None):
        print(f'Creating new launcher for {name}')
        self.name = name

    def set_name(self, name):
        self.name = name

launcher = SatelliteLaunch("Chandrayan")

launcher.set_name("Chandrayan 2")
launcher.payload = "Pragyan Rover" # dynamically set object property/variable

print(f'{launcher.name} is carrying the payload of {launcher.payload}')

print("This variable is :", type(launcher))

Print

print("Hello World!")

print("Python is the #1 programming language in the world")

temperature = 23
print(f"It is {temperature} degree Celsius in Bangalore today")

Control Flows

isro_location = 'blr'

if isro_location == 'blr':
    print("You got the location right")
else:
    print("You got the location wrong")

List Comprehensions

# List comprehensions are a Pythonic way of expressing a for-loop.

result = []
for i in range(10):
    if i%2 == 0:
        result.append(i)

print(result)

# this is a list comprehension. The above is not.
[i for i in range(10) if i % 2 == 0]


[i**2 for i in range(10)]

Continue Learning

FreeCodeCamp has great tutorials and you can learn Python from one of their YouTube videos.

Introduction to Data Analysis with Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

History of development

In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open sourced, and is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible.

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;

  • Tools for reading and writing data between in-memory data structures and different formats:

    • CSV and text files
    • Microsoft Excel
    • SQL databases
    • The fast HDF5 format
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

  • Flexible reshaping and pivoting of data sets;

  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;

  • Columns can be inserted and deleted from data structures for size mutability;

  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

  • High performance merging and joining of data sets;

  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;

  • Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;

  • Highly optimized for performance, with critical code paths written in Cython or C.

  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Mission

pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

Vision

A world where data analytics and manipulation software is:

  • Accessible to everyone
  • Free for users to use and modify
  • Flexible
  • Powerful
  • Easy to use
  • Fast

Values

It is at the core of pandas to be respectful and welcoming to everybody: users, contributors and the broader community, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, or nationality.

Quick Pandas Tour

This is a short introduction to pandas, geared mainly for new users. It is taken from the 10 minutes to pandas guide.

import numpy as np

import pandas as pd

Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1, 3, 5, np.nan, 6, 8])

s
Out[4]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range("20130101", periods=6)

dates
Out[6]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df
Out[8]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)


df2
Out[10]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
The columns of the resulting DataFrame have different dtypes:

df2.dtypes
Out[11]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

# Here is how to view the top and bottom rows of the frame:

df.head()
Out[13]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

df.tail(3)
Out[14]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Display the index, columns:

df.index
Out[15]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')

# describe() shows a quick statistic summary of your data:

df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804


# Transposing your data:

df.T
Out[20]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
Sorting by an axis:

df.sort_index(axis=1, ascending=False)
Out[21]: 
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
Sorting by values:

df.sort_values(by="B")
Out[22]: 
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

Selection

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code the optimized pandas data access methods, .at, .iat, .loc and .iloc, are recommended.

Selecting a single column, which yields a Series, equivalent to df.A:

df["A"]
Out[23]: 
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:

df[0:3]
Out[24]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

df["20130102":"20130104"]
Out[25]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

# Selection by label

df.loc[dates[0]]
Out[26]: 
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:

df.loc[:, ["A", "B"]]
Out[27]: 
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
Showing label slicing, both endpoints are included:

df.loc["20130102":"20130104", ["A", "B"]]
Out[28]: 
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

# Reduction in the dimensions of the returned object:

df.loc["20130102", ["A", "B"]]
Out[29]: 
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:

df.loc[dates[0], "A"]
Out[30]: 0.4691122999071863

# For getting fast access to a scalar (equivalent to the prior method):

df.at[dates[0], "A"]
Out[31]: 0.4691122999071863

# Select via the position of the passed integers:

df.iloc[3]
Out[32]: 
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64

# By integer slices, acting similar to NumPy/Python:

df.iloc[3:5, 0:2]
Out[33]: 
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

# By lists of integer position locations, similar to the NumPy/Python style:

df.iloc[[1, 2, 4], [0, 2]]
Out[34]: 
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

# For slicing rows explicitly:

df.iloc[1:3, :]
Out[35]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804


#For slicing columns explicitly:

df.iloc[:, 1:3]
Out[36]: 
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427

# For getting a value explicitly:

df.iloc[1, 1]
Out[37]: -0.17321464905330858

# For getting fast access to a scalar (equivalent to the prior method):

df.iat[1, 1]
Out[38]: -0.17321464905330858

# Boolean indexing
Using a single column’s values to select data:

df[df["A"] > 0]
Out[39]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860


#Selecting values from a DataFrame where a boolean condition is met:

df[df > 0]
Out[40]: 
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988

# Using the isin() method for filtering:

df2 = df.copy()

df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2
Out[43]: 
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

df2[df2["E"].isin(["two", "four"])]
Out[44]: 
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

Setting

Setting a new column automatically aligns the data by the indexes:

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

s1
Out[46]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df["F"] = s1

# Setting values by label:

df.at[dates[0], "A"] = 0

# Setting values by position:

df.iat[0, 1] = 0

# Setting by assigning with a NumPy array:

df.loc[:, "D"] = np.array([5] * len(df))

# The result of the prior setting operations:

df
Out[51]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0

# A where operation with setting:

df2 = df.copy()

df2[df2 > 0] = -df2

df2
Out[54]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0

Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.

# Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

df1.loc[dates[0] : dates[1], "E"] = 1

df1
Out[57]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN

# To drop any rows that have missing data:

df1.dropna(how="any")
Out[58]: 
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
Filling missing data:

df1.fillna(value=5)
Out[59]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0

# To get the boolean mask where values are nan:

pd.isna(df1)
Out[60]: 
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Stats

Operations in general exclude missing data.

# Performing a descriptive statistic:

df.mean()
Out[61]: 
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64

# Same operation on the other axis:

df.mean(1)
Out[62]: 
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64

Apply

Applying functions to the data:

df.apply(np.cumsum)
Out[66]: 
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

df.apply(lambda x: x.max() - x.min())
Out[67]: 
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64

# Histogramming


s = pd.Series(np.random.randint(0, 7, size=10))

s
Out[69]: 
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64

s.value_counts()
Out[70]: 
4    5
2    2
6    2
1    1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).


s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

s.str.lower()
Out[72]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Concat

Concatenating pandas objects together with concat():

df = pd.DataFrame(np.random.randn(10, 4))

df
Out[74]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

pd.concat(pieces)
Out[76]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

Join


# SQL style merges. 

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

left
Out[79]: 
   key  lval
0  foo     1
1  foo     2

right
Out[80]: 
   key  rval
0  foo     4
1  foo     5

pd.merge(left, right, on="key")
Out[81]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

# Another example that can be given is:

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

left
Out[84]: 
   key  lval
0  foo     1
1  bar     2

right
Out[85]: 
   key  rval
0  foo     4
1  bar     5

pd.merge(left, right, on="key")
Out[86]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)


df
Out[88]: 
     A      B         C         D
0  foo    one  1.346061 -1.577585
1  bar    one  1.511763  0.396823
2  foo    two  1.627081 -0.105381
3  bar  three -0.990582 -0.532532
4  foo    two -0.441652  1.453749
5  bar    two  1.211526  1.208843
6  foo    one  0.268520 -0.080952
7  foo  three  0.024580 -0.264610

#Grouping and then applying the sum() function to the resulting groups:

df.groupby("A").sum()
Out[89]: 
            C         D
A                      
bar  1.732707  1.073134
foo  2.824590 -0.574779

# Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:

df.groupby(["A", "B"]).sum()
Out[90]: 
                  C         D
A   B                        
bar one    1.511763  0.396823
    three -0.990582 -0.532532
    two    1.211526  1.208843
foo one    1.614581 -1.658537
    three  0.024580 -0.264610
    two    1.185429  1.348368

Plotting

# We use the standard convention for referencing the matplotlib API:

import matplotlib.pyplot as plt

plt.close("all")

# The close() method is used to close a figure window:

ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

ts = ts.cumsum()

ts.plot()

# If running under Jupyter Notebook, the plot will appear on plot(). Otherwise use matplotlib.pyplot.show to show it or matplotlib.pyplot.savefig to write it to a file.

plt.show();

# On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

df = pd.DataFrame(
    np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)


df = df.cumsum()

plt.figure()

df.plot()

plt.legend(loc='best');

Let's Dive In

Head over to this Kaggle Notebook and clone it.

Additional Practice with Pandas

Pandas Exercises

Spotify Data Analysis with Pandas

Pandas Stackoverflow Questions

Pandas Tutorials on real-world datasets

Plotting with Python

Plotly

A great practice notebook to learn Plotly can be found here.
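
If you would like to try Plotly before opening the practice notebook, here is a minimal sketch using Plotly Express and one of its bundled sample datasets; it assumes you have the plotly package installed.

import plotly.express as px

# Load a small sample dataset bundled with Plotly Express
df = px.data.iris()

# A scatter plot in one line of code; hover over points in the browser to inspect them
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 title="Iris sepal dimensions")
fig.show()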

Introduction to Machine Learning

Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks.[1] It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.

Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

Glossary

Machine Learning is an extensive area of study and as you continue down this path it is essential you have access to a glossary of terms you will routinely encounter.

Google's Machine Learning Glossary is a great resource to keep at hand.

There are many courses available online to learn Machine Learning.

FastAI

Machine Learning FastAI course.

There are around 24 hours of lessons, and you should plan to spend around 8 hours a week for 12 weeks to complete the material. The course is based on lessons recorded at the University of San Francisco for the Masters of Science in Data Science program. We assume that you have at least one year of coding experience, and either remember what you learned in high school math, or are prepared to do some independent study to refresh your knowledge.

NPTEL

NPTEL Machine Learning Course

India's NPTEL (National Programme on Technology Enhanced Learning), is a joint venture of the IITs and IISc, funded by the Ministry of Education (MoE) Government of India, and was launched in 2003. Initially started as a project to take quality education to all corners of the country, NPTEL now offers close to 600+ courses for certification every semester in about 22 disciplines.

Ebook

An Introduction to Statistical Learning

As the scale and scope of data collection continue to increase across virtually all fields, statistical learning has become a critical toolkit for anyone who wishes to understand data. An Introduction to Statistical Learning provides a broad and less technical treatment of key topics in statistical learning. Each chapter includes an R lab. This book is appropriate for anyone who wishes to use contemporary tools for data analysis.

The First Edition topics include:

  • Sparse methods for classification and regression
  • Decision trees
  • Boosting
  • Support vector machines
  • Clustering

The Second Edition adds:

  • Deep learning
  • Survival analysis
  • Multiple testing
  • Naive Bayes and generalized linear models
  • Bayesian additive regression trees
  • Matrix completion

Tips for Beginners

Set concrete goals or deadlines.

Machine learning is a rich field that's expanding every year. It can be easy to go down rabbit holes. Set concrete goals for yourself and keep moving.

Walk before you run.

You might be tempted to jump into some of the newest, cutting edge sub-fields in machine learning such as deep learning or NLP. Try to stay focused on the core concepts at the start. These advanced topics will be much easier to understand once you've mastered the core skills.

Alternate between practice and theory.

Practice and theory go hand-in-hand. You won't be able to master theory without applying it, yet you won't know what to do without the theory.

Write a few algorithms from scratch.

Once you've had some practice applying algorithms from existing packages, you'll want to write a few from scratch. This will take your understanding to the next level and allow you to customize them in the future.

Seek different perspectives.

The way a statistician explains an algorithm will be different from the way a computer scientist explains it. Seek different explanations of the same topic.

Tie each algorithm to value.

For each tool or algorithm you learn, try to think of ways it could be applied in business or technology. This is essential for learning how to "think" like a data scientist.

Don't believe the hype.

Machine learning is not what the movies portray as artificial intelligence. It's a powerful tool, but you should approach problems with rationality and an open mind. ML should just be one tool in your arsenal!

Ignore the show-offs.

Sometimes you'll see people online debating with lots of math and jargon. If you don't understand it, don't be discouraged. What matters is: Can you use ML to add value in some way? And the answer is yes, you absolutely can.

Think "inputs/outputs" and ask "why."

At times, you might find yourself lost in the weeds. When in doubt, take a step back and think about how data inputs and outputs piece together. Ask "why" at each part of the process.

Find fun projects that interest you!

Rome wasn't built in a day, and neither will your machine learning skills be. Pick topics that interest you, take your time, and have fun along the way.

Excerpted from Elite Data Science

Survey of Machine Learning

One of the leaders in the field of Machine Learning, Sebastian Raschka has written a very extensive survey of the state of Python in Machine Learning.

This survey paper is recommended for graduates and professionals who wish to equip themselves quickly with an overview of various aspects of Machine Learning.

Introduction to scikit-learn

“I literally owe my career in the data space to scikit-learn. It’s not just a framework but a school of thought regarding predictive modeling. Super well deserved, folks :)” - Maykon Schots from Brazil

scikit-learn is:

  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

scikit-learn is the most popular Python library for Machine Learning.

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
    [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

In the few lines of code you see above, we have done a lot of work.

scikit-learn allows you to apply a large number of ML techniques. All of these techniques can be applied through a common interface that looks much like the above code snippet.

The samples matrix (or design matrix) X has a size of (n_samples, n_features).

The target values y are real numbers for regression tasks, or integers for classification (or any other discrete set of values).

For unsupervised learning tasks, y does not need to be specified.

Once the estimator (Random Forest in the code snippet above) is fitted, it can be used for predicting target values of new data.
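
To round off the snippet above, here is a minimal, self-contained sketch that refits the same toy classifier and then calls predict on both the training data and two new samples; the new samples are made up for illustration.

from sklearn.ensemble import RandomForestClassifier

# Refit the same toy classifier shown above
clf = RandomForestClassifier(random_state=0)
X = [[1, 2, 3], [11, 12, 13]]   # 2 samples, 3 features
y = [0, 1]                      # classes of each sample
clf.fit(X, y)

# Predict on the training data and on new, unseen samples
print(clf.predict(X))                            # should recover the training labels
print(clf.predict([[4, 5, 6], [14, 15, 16]]))    # classes predicted for new samples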

Let's dive in with this Notebook to develop an end-to-end ML model with scikit-learn.

Introduction to Deep Learning

Coming Soon..

How to Keep Learning

We live in a time where knowledge sources are abundant and their access is free and easy. This allows millions of people across the world to learn and acquire skills and knowledge easily and at low cost.

We are also presented with a lot of choices adding to our confusion.

I have found the following sources of knowledge to be reliable.

Github

Awesome Data Science

Online Lectures

NPTEL

MIT Open Courseware

Research Papers

Arxiv

YouTube

FreeCodeCamp

About Harsh Singhal

Harsh Singhal side profile picture

Harsh Singhal is a global Data Science and Machine Learning (DSML) leader.

Over the last decade Harsh has developed high-impact Machine Learning solutions at companies like LinkedIn and Netflix.

After having spent a decade in California, Harsh decided to move to India and contribute to India's DSML ecosystem.

Harsh is the head of Machine Learning at Koo, India's #1 social media platform connecting millions of people and allowing them to express themselves in their mother tongue.

Harsh actively works with student communities and guides them to excel in their journey towards DSML excellence. Harsh is also involved as an advisor in developing DSML curricula at academic institutions to increase AI talent density amongst India's student community.