Python and `matplotlib`

In order to interpret data, it's important to be able to create visual representations of that data. Two common uses include looking at a large volume of data ("Big Data") to identify characteristics of that data, and looking at x-y data in which you expect to see some function-based relationship between those values.

If you're using Python to look at this information, a few Python libraries make it easy to create graphs. This brief tutorial takes you through that process.

I. Big Data

Sophisticated analysis of data relationships is often performed using SQL ("Structured Query Language," pronounced either "ess-cue-ell" or "sequel"), which operates on data stored in a relational database. The operations and procedures that can be performed with an SQL database system like MySQL or SQLite can be impressive.

Really, REALLY large sets of data may be stored in simple, text-based flat files, typically with each line of the file representing a record, and commas used to separate the data fields in each record. Because the commas separate the values in this system, the format is called "Comma-Separated Values," or CSV.

In this activity you'll get a brief introduction to using Python to load up CSV files of data, process that data, and display visual analyses of that data. Being able to examine and demonstrate numerical relationships in a visual format is fundamental to any kind of analysis of that data.

I.A. Getting Data Into Python

Python has the built-in ability to read data from external files, but we're going to use a module that's specifically built for handling CSV files.

If you've got a relatively small amount of data—a few hundred lines of data, or even 100MB of data—you can probably read it all into a Python program as a list of lines, like this:

"""Demonstration 1. Bringing in data from CSV files"""
    
import csv   # this module helps manage csv data

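infile = open('dataFile.csv', newline='')   # newline='' is recommended for the csv module
myData = []
for row in csv.reader(infile):
    myData.append(row)      # each row is already a list of field strings
infile.close()

Once this loop has run, the list myData contains all of the rows of data from the file. Each row is itself a list of strings, one string per field. (The csv.reader object takes care of the newline character at the end of each line in this csv/text file, so no .strip() call is needed.)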
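To see what csv.reader actually hands back, here is a minimal sketch that parses two made-up records held in a string instead of a file. The records and the name `sample` are hypothetical; `io.StringIO` is only used so the example runs without needing dataFile.csv.

```python
import csv
import io

# Two made-up records: name, state, age (hypothetical sample data)
sample = "Ada,CA,36\nGrace,NY,45\n"

# io.StringIO lets csv.reader treat the string like an open text file
myData = []
for row in csv.reader(io.StringIO(sample)):
    myData.append(row)

for row in myData:
    print(row)   # each row is a list of strings, one per field
```

Notice that even the ages come back as strings. Any value you want to do arithmetic with has to be converted to a number first.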

That works fine for manageable quantities of data. But what if I've got 100GB of data? My computer, a fairly powerful machine, only has 32GB of RAM, and not all of that is even available for program data. It's clear that I'm not going to be able to digest this entire file into the myData list.

For these circumstances, we'll be processing the file one line at a time:

"""Demonstration 2. Bringing in data from CSV files"""
    
import csv  # this module helps manage csv data

infile = open('dataFile.csv', newline='')
counter = {}            # A dictionary that will be used to count word frequency
for row in csv.reader(infile):
    # Do an analysis based on the single row of information
    # If each row contained a word, and we wanted to count the word frequency...
    word = row[0]       # row is a list of fields; the word is the first field
    if word in counter:
        counter[word] += 1
    else:
        counter[word] = 1
infile.close()

This has the benefit of only reading a single line into memory at a time, allowing you to process a data set that is larger than the memory you have on your computer.
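One possible way to package this row-at-a-time approach is sketched below. The function name `count_first_field` and the throwaway file it reads are assumptions made for the example, not part of the tutorial's data. The `with` statement is used so the file gets closed automatically.

```python
import csv
import os
import tempfile

def count_first_field(path):
    """Count how often each value appears in the first field, one row at a time."""
    counts = {}
    # The with statement closes the file automatically when the block ends
    with open(path, newline='') as infile:
        for row in csv.reader(infile):
            if not row:          # skip completely blank lines
                continue
            word = row[0]
            counts[word] = counts.get(word, 0) + 1
    return counts

# Build a small throwaway file so the sketch can run on its own
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write("apple\nbanana\napple\n")
    temp_path = tmp.name

word_counts = count_first_field(temp_path)
os.remove(temp_path)     # clean up the throwaway file
print(word_counts)
```

Only the dictionary of counts grows as the file is read. The rows themselves are discarded as soon as they've been counted, which is what keeps memory use small.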

I.B. Identifying Selected Data in a CSV File

Identifying the data that you want to use in a CSV file is usually easy. Figuring out how to extract that data can sometimes be tricky. For this exercise, we'll use the fakePeople.csv file that contains information on 1000 fake people, including their Gender, GivenName, Surname, State, Birthday, Race, Weight in Kilograms, Height in Centimeters, YearsEducation, and YearlyIncome, in that order.

How would we figure out the percentage of each Race in this population? We'd need to know how many there were of each Race, and how many total people there are.

"""Demonstration 3. Extracting data from CSV files"""
    
import csv
infile = open('fakePeople.csv')

races = {}                              # dictionary that counts race populations

for person in csv.reader(infile):       # the .reader() method splits the CSV line into a list
                                        
                                        # person[5] refers to the sixth comma-separated value
                                        # (counting from 0) in the record: the Race field
    if person[5] in races:
        races[person[5]] += 1           # increment the dictionary value for this race
    else:
        races[person[5]] = 1            # initialize a dictionary key for this race

If we look at the races dictionary now, it looks like this:

{'Black': 141, 'Brown': 186, 'White': 486, 'Asian': 81, 'Other': 106}
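The if/else counting pattern is common enough that Python's standard library has a shortcut for it, `collections.Counter`. The sketch below is an alternative, not the approach this tutorial uses, and its three sample records are hypothetical stand-ins that only mimic the field order of fakePeople.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for fakePeople.csv: three records in the same field order
sample = (
    "female,Ann,Lee,CA,1/2/1980,Asian,60.1,165,16,52000\n"
    "male,Bob,Ray,NY,3/4/1975,White,82.5,180,12,41000\n"
    "female,Cy,Poe,TX,5/6/1990,White,70.0,170,14,38000\n"
)

# Counter adds one to the tally for each Race value (field index 5) it is given
races = Counter(person[5] for person in csv.reader(io.StringIO(sample)))
print(dict(races))
```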

How can we find the total number of people identified? We could count the number of lines in our infile, but we can also just total up the values in our races dictionary:

totalPeople = 0
for aRace in races.keys():       # Set aRace = to each key in turn
    totalPeople += races[aRace]  # Get number of people of each race and sum them

totalPeople has a value of 1000 at this point.
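The same total can be had in one line with the built-in `sum` function. This sketch copies the counts from the races dictionary shown above so that it can run by itself.

```python
# Counts copied from the races dictionary shown earlier in this section
races = {'Black': 141, 'Brown': 186, 'White': 486, 'Asian': 81, 'Other': 106}

# .values() yields just the counts, and sum() adds them up
totalPeople = sum(races.values())
print(totalPeople)   # prints 1000
```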

Let's calculate percentages now:

racePercentages = {}
for aRace in races.keys():
    racePercentages[aRace] = (races[aRace] / totalPeople) * 100

Here is the final code for the race analysis:

#!/usr/bin/env python3
"""
Demonstration 4. Extracting data from CSV files
"""

import csv
    
def main():
    infile = open('fakePeople.csv')
    
    # First count how many of each race
    races = {}                              # dictionary that counts race populations
    for person in csv.reader(infile):       # the .reader() method splits the CSV line into a list
                                            # person[5] refers to the sixth comma-separated value
                                            # (counting from 0) in the record: the Race field
        if person[5] in races:
            races[person[5]] += 1           # increment the dictionary value for this race
        else:
            races[person[5]] = 1            # initialize a dictionary key for this race
    
    # Calculate how many people total
    totalPeople = 0
    for aRace in races.keys():       # Set aRace = to each key in turn
        totalPeople += races[aRace]  # Get number of people of each race and sum them

    # Calculate race percentages
    racePercentages = {}
    for aRace in races.keys():
        racePercentages[aRace] = ( races[aRace] / totalPeople) * 100
    print(racePercentages)

if __name__ == "__main__":
    main()

The dictionary racePercentages now looks like this:

{'Black': 14.099999999999998, 'Brown': 18.6, 'White': 48.6, 'Asian': 8.1, 'Other': 10.6}
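The long decimal in the 'Black' entry comes from floating-point arithmetic, not from a problem with the data. If you'd rather see tidy values, one option is to round each percentage as it is calculated. The sketch below does that with a dictionary comprehension; rounding to one decimal place is an arbitrary choice made for this example.

```python
# Counts copied from the races dictionary shown earlier in this section
races = {'Black': 141, 'Brown': 186, 'White': 486, 'Asian': 81, 'Other': 106}
totalPeople = sum(races.values())

# Same calculation as the loop above, rounded to one decimal place
racePercentages = {aRace: round(count / totalPeople * 100, 1)
                   for aRace, count in races.items()}
print(racePercentages)
```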

This is a fine collection of data, but what we really want to be able to do is to graph it. Because there's no quantitative value on the x-axis—it's just the labels of each race—we'll use a bar graph for this analysis.

I.C. Creating a Bar Graph of Data

We'll be using two additional modules to create our graphs: numpy and matplotlib. These modules aren't typically included with a standard Python installation, so you'll have to check whether you have them installed.

Check to see if `numpy` and `matplotlib` are installed

The easiest way to see if you have these modules available to you is to try to import them. Go into a terminal, start an interactive session in Python, and try to import numpy and matplotlib. If you get no error messages, you're okay.

(base) rwhite@Ligne2 Desktop % python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import matplotlib
>>>

If you see something more like this:

$ python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
>>> import matplotlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'matplotlib'
>>>

... then you'll need to install those modules. (How you do this depends on your Python installation. See the instructor for further info.)
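If you'd rather check from a script than from an interactive session, the standard library can tell you whether a module can be found without actually importing it. This is a minimal sketch; the helper name `is_installed` is made up for the example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

for name in ('numpy', 'matplotlib'):
    status = 'installed' if is_installed(name) else 'NOT installed'
    print(name, 'is', status)
```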

The information that we want to graph is contained in the racePercentages dictionary.

First, split that dictionary into two lists. One will be used to create the x-axis of entries, and one will be used to create the y-axis values.

"""
Demonstration 5. Creating bar graphs of data
This code doesn't show lines previously developed above.
"""
    
import matplotlib.pyplot as plt
import numpy

races = []          # plotted along the x-axis
popPercent = []     # plotted along the y-axis

# Split dictionary up into these two lists
for aRace in racePercentages:
    races.append(aRace)     # puts the keys in races
    popPercent.append(racePercentages[aRace])

# Identify how many columns will be in our bar graph
columns = numpy.arange(len(races))
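The split can also be done without a loop, since a dictionary can hand over its keys and its values directly, and in matching order. The sketch below uses rounded percentages for readability and substitutes plain `range` for `numpy.arange` so that it runs without numpy installed. Both produce the positions 0, 1, 2, and so on.

```python
# Rounded values, for illustration only
racePercentages = {'Black': 14.1, 'Brown': 18.6, 'White': 48.6,
                   'Asian': 8.1, 'Other': 10.6}

races = list(racePercentages.keys())          # x-axis labels
popPercent = list(racePercentages.values())   # y-axis values, in the same order

# One integer position per bar
columns = list(range(len(races)))
print(races)
print(popPercent)
print(columns)
```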

Now we just need to tell matplotlib.pyplot to create and display the bar graph:

plt.bar(columns, popPercent)
plt.show()

That's not bad, but we're obviously missing quite a bit of information: x-axis labels, y-axis labels, a title...

plt.bar(columns, popPercent)      # Indicates a bar chart
plt.xticks(columns, races, rotation=90)        # Label and Values on x-axis
plt.ylabel('Percentages of total population')  # Label on y-axis
plt.title('Race population by percentage')     # Title of graph
# And don't forget to display the plot!
plt.show()

Outstanding! Here's the complete code listing for the race-analysis.py program.

#!/usr/bin/env python3
"""
Demonstration 6. Extracting data from CSV files
"""

import csv
import matplotlib.pyplot as plt
import numpy

def main():
    infile = open('fakePeople.csv')

    # First count how many of each race
    races = {}    # A dictionary that counts race populations
    for person in csv.reader(infile):
        # person[5] refers to the sixth comma-separated value
        # (counting from 0) in the record: the Race field
        if person[5] in races:
            races[person[5]] += 1
        else:
            races[person[5]] = 1

    print(races)

    # Calculate how many people total
    totalPeople = 0
    for aRace in races.keys(): # Set aRace = to each race in turn
        totalPeople += races[aRace]  # Get value associated with each key

    # Calculate race percentages
    racePercentages = {}
    for aRace in races.keys():
        racePercentages[aRace] = ( races[aRace] / totalPeople) * 100

    print(racePercentages)

    races = []          # plotted along the x-axis
    popPercent = []     # plotted along the y-axis

    # Split dictionary up into these two lists
    for aRace in racePercentages:
        races.append(aRace)    # puts the keys in races
        popPercent.append(racePercentages[aRace])

    # Identify how many columns will be in our bar graph
    columns = numpy.arange(len(races))

    plt.bar(columns, popPercent)
    plt.xticks(columns, races, rotation=90)
    plt.ylabel('Percentages of total population')
    plt.title('Race population by percentage')
    # And don't forget to display the plot!
    plt.show()

if __name__ == "__main__":
    main()

II. Creating x-y Plots of Data

Creating x-y plots of data is even easier than creating a bar graph. Simply supply the plot function with two equal-length lists of x-data and y-data and you're done.

Let's see how to do that.

II.A. x and y axes

Let's compare two values available to us in the fakePeople.csv file: mass in kilograms and yearly income.

First, collect the data into two lists. Mass is in field 6, and yearly income is in field 9.

"""Demonstration 7. Creating x-y plots of data"""
    
import csv
import matplotlib.pyplot as plt
import numpy

# get mass and income information from file
infile = open('fakePeople.csv','r')
mass = []
income = []
for person in csv.reader(infile):
    mass.append(float(person[6]))
    income.append(float(person[9]))

# Note that matplotlib v2.1 can work with strings, so
# we have to be sure to convert values above to nums
# using the float function
plt.plot(mass, income)
plt.show()

If we try to plot this, matplotlib uses its default behavior, which is to draw blue lines from one data point to the next.

Instead, let's indicate that we want red "o"s to be displayed on the graph, making a scatter plot:

plt.plot(mass, income, 'ro')      # 'ro' --> Red "o" as symbol on plot
plt.show()

There doesn't appear to be any correlation between a person's weight and their yearly income. There are a number of quantities that we would expect to be related, though. Years of education and income? Height and weight?
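Judging correlation by eye works for a first look, but you can also put a number on it. The sketch below computes the Pearson correlation coefficient, which is close to 1 or -1 for a strong linear relationship and close to 0 when there is none. The function name `pearson_r` and the data values are hypothetical; they are not taken from fakePeople.csv.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, chosen to show the two extremes
x = [1.0, 2.0, 3.0, 4.0]
y_related = [2.0, 4.0, 6.0, 8.0]     # rises in step with x
y_unrelated = [5.0, 1.0, 1.0, 5.0]   # no linear trend with x

print(pearson_r(x, y_related))
print(pearson_r(x, y_unrelated))
```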

Let's look at height and weight for the population in newData.csv.

#!/usr/bin/env python3
"""Demonstration 8. Creating x-y plots of data"""

import csv
import matplotlib.pyplot as plt
import numpy

# get weight and height information from file
infile = open('newData.csv','r')
mass = []
height = []
for person in csv.reader(infile):
    mass.append(float(person[6]))
    height.append(float(person[7]))

# Note that matplotlib v2.1 can work with strings, so
# we have to be sure to convert values above to nums
# using the float function
plt.plot(mass, height, 'ro')      # 'ro' --> Red "o" as symbol on plot
plt.show()

This is a nice scatter plot that clearly indicates, at least qualitatively, that there is a relationship between height and weight.

II.B. Analysis of Data - Regressions (Trendlines)

Graphs are nice because they give us a visual way of recognizing the relationships between different sets of values.

In the sciences, we're often looking for a more specific description of the relationship: a mathematical model that we can use to test our understanding of the world.

How can we use Python and its data analysis capabilities to reveal this relationship more clearly? How do we get a regression of this data, a function that clearly demonstrates a statistical quantitative relationship between two terms?

One way is certainly to plug all these numbers into your TI-Calculator and use the Math capabilities of that device. Microsoft's Excel and Google's Sheets are able to take a table of data and perform a regression on that information. You won't be surprised to find that `numpy` is able to do the same thing.

As with any regression calculation, you need to decide what kind of mathematical model would be the best fit for your data. You then indicate which analysis model you want `numpy` to perform.

The three most common models are linear, polynomial ("quadratic" is a second-order polynomial, for example), and exponential.

Based on a preliminary view of the x-y scatterplot above, which model do you think will best fit our data?

The plot above is produced by this code. Check the comments to see what does what.

#!/usr/bin/env python3
"""Demonstration 9. Creating x-y plots of data"""

import matplotlib.pyplot as plt
import numpy
import os


# get weight and height information from file
infile = open('newData.csv','r')
mass = []
height = []
for line in infile:
    theLine = line.strip().split(',') # create list of fields
    mass.append(float(theLine[6]))
    height.append(float(theLine[7]))

# take heights list and convert it
# to a numpy array so we can process with it
heights = numpy.array(height) 

fig = plt.figure() # set up figure that we can save later
plt.title('Mass vs. Height for Fake Random People')
plt.ylabel('Mass (kg)')
plt.xlabel('Height (cm)')

plt.plot(heights,mass,'bo')  # put data on figure, 'bo' = blue dots

# identify slope, intercept of linear best fit line for data
slope, intercept = numpy.polyfit(height, mass, 1) # 1 for linear fit

# create "mass" values that will be used to show best-fit line
avgmasses = slope * heights + intercept # heights is numpy array, so avgmasses is too

# create regression plot and legend for graph
fit_label = 'Linear fit: y = {0:.2f}x + {1:.2f}'.format(slope, intercept)
plt.plot(heights, avgmasses, color='red', linestyle='--', label=fit_label)
plt.legend(loc='lower right')

'''
# Save the plot in a directory called plots
if not os.path.exists('plots'):
    os.mkdir('plots')
fig.savefig('plots/mass_vs_height.png')
'''

# Show the graph on screen
plt.show()
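If you're curious what a degree-1 fit is doing behind the scenes, it is an ordinary least-squares calculation that can be written out in a few lines of plain Python. The sketch below is only an illustration of that calculation and is not how numpy implements it. The function name `linear_fit` and the data are hypothetical, with the data chosen to lie exactly on the line mass = 0.5 * height - 20.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical heights (cm) and masses (kg), not taken from newData.csv
heights = [150.0, 160.0, 170.0, 180.0]
masses = [55.0, 60.0, 65.0, 70.0]

slope, intercept = linear_fit(heights, masses)
print('Linear fit: y = {0:.2f}x + {1:.2f}'.format(slope, intercept))
```

Real data won't sit exactly on a line the way these values do, so the fitted line will pass through the middle of the scatter instead of through every point.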
