Introduction to Data Analysis with Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

History of development

In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open sourced, and is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible.

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;

  • Tools for reading and writing data between in-memory data structures and different formats:

    • CSV and text files
    • Microsoft Excel
    • SQL databases
    • The fast HDF5 format
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

  • Flexible reshaping and pivoting of data sets;

  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;

  • Columns can be inserted and deleted from data structures for size mutability;

  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

  • High performance merging and joining of data sets;

  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;

  • Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;

  • Highly optimized for performance, with critical code paths written in Cython or C.

  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Mission

pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

Vision

A world where data analytics and manipulation software is:

  • Accessible to everyone
  • Free for users to use and modify
  • Flexible
  • Powerful
  • Easy to use
  • Fast

Values

Is in the core of pandas to be respectful and welcoming with everybody, users, contributors and the broader community. Regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, or nationality.

Quick Pandas Tour

This is a short introduction to pandas, geared mainly for new users. Taken from 10 min pandas guide

import numpy as np import pandas as pd

Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1, 3, 5, np.nan, 6, 8]) s Out[4]: 0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range("20130101", periods=6) dates Out[6]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD")) df Out[8]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 2013-01-06 -0.673690 0.113648 -1.478427 0.524988 Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure: df2 = pd.DataFrame( { "A": 1.0, "B": pd.Timestamp("20130102"), "C": pd.Series(1, index=list(range(4)), dtype="float32"), "D": np.array([3] * 4, dtype="int32"), "E": pd.Categorical(["test", "train", "test", "train"]), "F": "foo", } ) df2 Out[10]: A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo The columns of the resulting DataFrame have different dtypes: df2.dtypes Out[11]: A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object

Viewing data

# Here is how to view the top and bottom rows of the frame: df.head() Out[13]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 df.tail(3) Out[14]: A B C D 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 2013-01-06 -0.673690 0.113648 -1.478427 0.524988 Display the index, columns: df.index Out[15]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') df.columns Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object') # describe() shows a quick statistic summary of your data: df.describe() Out[19]: A B C D count 6.000000 6.000000 6.000000 6.000000 mean 0.073711 -0.431125 -0.687758 -0.233103 std 0.843157 0.922818 0.779887 0.973118 min -0.861849 -2.104569 -1.509059 -1.135632 25% -0.611510 -0.600794 -1.368714 -1.076610 50% 0.022070 -0.228039 -0.767252 -0.386188 75% 0.658444 0.041933 -0.034326 0.461706 max 1.212112 0.567020 0.276232 1.071804 # Transposing your data: df.T Out[20]: 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06 A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690 B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648 C -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427 D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988 Sorting by an axis: df.sort_index(axis=1, ascending=False) Out[21]: D C B A 2013-01-01 -1.135632 -1.509059 -0.282863 0.469112 2013-01-02 -1.044236 0.119209 -0.173215 1.212112 2013-01-03 1.071804 -0.494929 -2.104569 -0.861849 2013-01-04 0.271860 -1.039575 -0.706771 0.721555 2013-01-05 -1.087401 0.276232 0.567020 -0.424972 2013-01-06 0.524988 -1.478427 0.113648 -0.673690 Sorting by values: df.sort_values(by="B") Out[22]: A B C D 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-06 -0.673690 0.113648 -1.478427 0.524988 2013-01-05 -0.424972 0.567020 0.276232 -1.087401

Selection

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, it is recommended the optimized pandas data access methods, .at, .iat, .loc and .iloc.

Selecting a single column, which yields a Series, equivalent to df.A:

df["A"] Out[23]: 2013-01-01 0.469112 2013-01-02 1.212112 2013-01-03 -0.861849 2013-01-04 0.721555 2013-01-05 -0.424972 2013-01-06 -0.673690 Freq: D, Name: A, dtype: float64 Selecting via [], which slices the rows: df[0:3] Out[24]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 df["20130102":"20130104"] Out[25]: A B C D 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 # Selection by label df.loc[dates[0]] Out[26]: A 0.469112 B -0.282863 C -1.509059 D -1.135632 Name: 2013-01-01 00:00:00, dtype: float64 Selecting on a multi-axis by label: df.loc[:, ["A", "B"]] Out[27]: A B 2013-01-01 0.469112 -0.282863 2013-01-02 1.212112 -0.173215 2013-01-03 -0.861849 -2.104569 2013-01-04 0.721555 -0.706771 2013-01-05 -0.424972 0.567020 2013-01-06 -0.673690 0.113648 Showing label slicing, both endpoints are included: df.loc["20130102":"20130104", ["A", "B"]] Out[28]: A B 2013-01-02 1.212112 -0.173215 2013-01-03 -0.861849 -2.104569 2013-01-04 0.721555 -0.706771 # Reduction in the dimensions of the returned object: df.loc["20130102", ["A", "B"]] Out[29]: A 1.212112 B -0.173215 Name: 2013-01-02 00:00:00, dtype: float64 For getting a scalar value: df.loc[dates[0], "A"] Out[30]: 0.4691122999071863 # For getting fast access to a scalar (equivalent to the prior method): df.at[dates[0], "A"] Out[31]: 0.4691122999071863 # Select via the position of the passed integers: df.iloc[3] Out[32]: A 0.721555 B -0.706771 C -1.039575 D 0.271860 Name: 2013-01-04 00:00:00, dtype: float64 # By integer slices, acting similar to NumPy/Python: df.iloc[3:5, 0:2] Out[33]: A B 2013-01-04 0.721555 -0.706771 2013-01-05 -0.424972 0.567020 # By lists of integer position locations, similar to the NumPy/Python style: df.iloc[[1, 2, 4], [0, 2]] Out[34]: A C 2013-01-02 1.212112 0.119209 2013-01-03 -0.861849 -0.494929 2013-01-05 -0.424972 0.276232 # For slicing rows explicitly: df.iloc[1:3, :] Out[35]: A B C D 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 #For slicing columns explicitly: df.iloc[:, 1:3] Out[36]: B C 2013-01-01 -0.282863 -1.509059 2013-01-02 -0.173215 0.119209 2013-01-03 -2.104569 -0.494929 2013-01-04 -0.706771 -1.039575 2013-01-05 0.567020 0.276232 2013-01-06 0.113648 -1.478427 # For getting a value explicitly: df.iloc[1, 1] Out[37]: -0.17321464905330858 # For getting fast access to a scalar (equivalent to the prior method): df.iat[1, 1] Out[38]: -0.17321464905330858 # Boolean indexing Using a single column’s values to select data: df[df["A"] > 0] Out[39]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 #Selecting values from a DataFrame where a boolean condition is met: df[df > 0] Out[40]: A B C D 2013-01-01 0.469112 NaN NaN NaN 2013-01-02 1.212112 NaN 0.119209 NaN 2013-01-03 NaN NaN NaN 1.071804 2013-01-04 0.721555 NaN NaN 0.271860 2013-01-05 NaN 0.567020 0.276232 NaN 2013-01-06 NaN 0.113648 NaN 0.524988 # Using the isin() method for filtering: df2 = df.copy() df2["E"] = ["one", "one", "two", "three", "four", "three"] df2 Out[43]: A B C D E 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four 2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three df2[df2["E"].isin(["two", "four"])] Out[44]: A B C D E 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two 2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four

Setting

Setting a new column automatically aligns the data by the indexes:

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6)) s1 Out[46]: 2013-01-02 1 2013-01-03 2 2013-01-04 3 2013-01-05 4 2013-01-06 5 2013-01-07 6 Freq: D, dtype: int64 df["F"] = s1 # Setting values by label: df.at[dates[0], "A"] = 0 # Setting values by position: df.iat[0, 1] = 0 # Setting by assigning with a NumPy array: df.loc[:, "D"] = np.array([5] * len(df)) # The result of the prior setting operations: df Out[51]: A B C D F 2013-01-01 0.000000 0.000000 -1.509059 5 NaN 2013-01-02 1.212112 -0.173215 0.119209 5 1.0 2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 2013-01-05 -0.424972 0.567020 0.276232 5 4.0 2013-01-06 -0.673690 0.113648 -1.478427 5 5.0 # A where operation with setting: df2 = df.copy() df2[df2 > 0] = -df2 df2 Out[54]: A B C D F 2013-01-01 0.000000 0.000000 -1.509059 -5 NaN 2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0 2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0 2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0 2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0 2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0

Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.

# Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"]) df1.loc[dates[0] : dates[1], "E"] = 1 df1 Out[57]: A B C D F E 2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1.0 2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0 2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 NaN 2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 NaN # To drop any rows that have missing data: df1.dropna(how="any") Out[58]: A B C D F E 2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0 Filling missing data: df1.fillna(value=5) Out[59]: A B C D F E 2013-01-01 0.000000 0.000000 -1.509059 5 5.0 1.0 2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0 2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 5.0 2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 5.0 # To get the boolean mask where values are nan: pd.isna(df1) Out[60]: A B C D F E 2013-01-01 False False False False True False 2013-01-02 False False False False False False 2013-01-03 False False False False False True 2013-01-04 False False False False False True

Stats

Operations in general exclude missing data.

# Performing a descriptive statistic: df.mean() Out[61]: A -0.004474 B -0.383981 C -0.687758 D 5.000000 F 3.000000 dtype: float64 # Same operation on the other axis: df.mean(1) Out[62]: 2013-01-01 0.872735 2013-01-02 1.431621 2013-01-03 0.707731 2013-01-04 1.395042 2013-01-05 1.883656 2013-01-06 1.592306 Freq: D, dtype: float64

Apply

Applying functions to the data:

df.apply(np.cumsum) Out[66]: A B C D F 2013-01-01 0.000000 0.000000 -1.509059 5 NaN 2013-01-02 1.212112 -0.173215 -1.389850 10 1.0 2013-01-03 0.350263 -2.277784 -1.884779 15 3.0 2013-01-04 1.071818 -2.984555 -2.924354 20 6.0 2013-01-05 0.646846 -2.417535 -2.648122 25 10.0 2013-01-06 -0.026844 -2.303886 -4.126549 30 15.0 df.apply(lambda x: x.max() - x.min()) Out[67]: A 2.073961 B 2.671590 C 1.785291 D 0.000000 F 4.000000 dtype: float64 # Histogramming s = pd.Series(np.random.randint(0, 7, size=10)) s Out[69]: 0 4 1 2 2 1 3 2 4 6 5 4 6 4 7 6 8 4 9 4 dtype: int64 s.value_counts() Out[70]: 4 5 2 2 6 2 1 1 dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"]) s.str.lower() Out[72]: 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object

Concat

Concatenating pandas objects together with concat(): df = pd.DataFrame(np.random.randn(10, 4)) df Out[74]: 0 1 2 3 0 -0.548702 1.467327 -1.015962 -0.483075 1 1.637550 -1.217659 -0.291519 -1.745505 2 -0.263952 0.991460 -0.919069 0.266046 3 -0.709661 1.669052 1.037882 -1.705775 4 -0.919854 -0.042379 1.247642 -0.009920 5 0.290213 0.495767 0.362949 1.548106 6 -1.131345 -0.089329 0.337863 -0.945867 7 -0.932132 1.956030 0.017587 -0.016692 8 -0.575247 0.254161 -1.143704 0.215897 9 1.193555 -0.077118 -0.408530 -0.862495 # break it into pieces pieces = [df[:3], df[3:7], df[7:]] pd.concat(pieces) Out[76]: 0 1 2 3 0 -0.548702 1.467327 -1.015962 -0.483075 1 1.637550 -1.217659 -0.291519 -1.745505 2 -0.263952 0.991460 -0.919069 0.266046 3 -0.709661 1.669052 1.037882 -1.705775 4 -0.919854 -0.042379 1.247642 -0.009920 5 0.290213 0.495767 0.362949 1.548106 6 -1.131345 -0.089329 0.337863 -0.945867 7 -0.932132 1.956030 0.017587 -0.016692 8 -0.575247 0.254161 -1.143704 0.215897 9 1.193555 -0.077118 -0.408530 -0.862495

Join

# SQL style merges. left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]}) right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]}) left Out[79]: key lval 0 foo 1 1 foo 2 right Out[80]: key rval 0 foo 4 1 foo 5 pd.merge(left, right, on="key") Out[81]: key lval rval 0 foo 1 4 1 foo 1 5 2 foo 2 4 3 foo 2 5 # Another example that can be given is: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]}) right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]}) left Out[84]: key lval 0 foo 1 1 bar 2 right Out[85]: key rval 0 foo 4 1 bar 5 pd.merge(left, right, on="key") Out[86]: key lval rval 0 foo 1 4 1 bar 2 5

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

df = pd.DataFrame( { "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], "B": ["one", "one", "two", "three", "two", "two", "one", "three"], "C": np.random.randn(8), "D": np.random.randn(8), } ) df Out[88]: A B C D 0 foo one 1.346061 -1.577585 1 bar one 1.511763 0.396823 2 foo two 1.627081 -0.105381 3 bar three -0.990582 -0.532532 4 foo two -0.441652 1.453749 5 bar two 1.211526 1.208843 6 foo one 0.268520 -0.080952 7 foo three 0.024580 -0.264610 #Grouping and then applying the sum() function to the resulting groups: df.groupby("A").sum() Out[89]: C D A bar 1.732707 1.073134 foo 2.824590 -0.574779 # Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function: df.groupby(["A", "B"]).sum() Out[90]: C D A B bar one 1.511763 0.396823 three -0.990582 -0.532532 two 1.211526 1.208843 foo one 1.614581 -1.658537 three 0.024580 -0.264610 two 1.185429 1.348368

Plotting

# We use the standard convention for referencing the matplotlib API: import matplotlib.pyplot as plt plt.close("all") # The close() method is used to close a figure window: ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000)) ts = ts.cumsum() ts.plot() # If running under Jupyter Notebook, the plot will appear on plot(). Otherwise use matplotlib.pyplot.show to show it or matplotlib.pyplot.savefig to write it to a file. plt.show(); # On a DataFrame, the plot() method is a convenience to plot all of the columns with labels: df = pd.DataFrame( np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"] ) df = df.cumsum() plt.figure() df.plot() plt.legend(loc='best');

Lets Dive In

Head over to this Kaggle Notebook and clone this notebook.

Additional Practice with Pandas

Pandas Exercises

Spotify Data Analysis with Pandas

Pandas Stackoverflow Questions

Pandas Tutorials on real-world datasets

Plotting with Python

Plotly

A great practice notebook to learn Plotly can be found here.