Python and matplotlib
In order to interpret data, it's important to be able to create visual representations of that data. Two common uses include looking at a large volume of data ("Big Data") to identify characteristics of that data, and looking at x-y data in which you expect to see some function-based relationship between those values.
If you're using Python to look at this information, a few Python libraries make it easy to create graphs. This brief tutorial takes you through that process.
I. Big Data
Sophisticated analysis of data relationships is often performed using SQL ("Structured Query Language," pronounced either "ess-cue-ell" or "sequel"), which operates on data stored in a relational database. The operations and procedures that can be performed with an SQL database system like MySQL or SQLite3 can be impressive.
Really, REALLY large sets of data may be stored in simple, text-based flat files, typically with each line of data in the file representing a record, and commas used to separate the data fields in each record. Because the commas separate the values in this system, the format is called "Comma-Separated Values," or CSV.
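To make the format concrete, here is a minimal sketch of how Python's csv module splits CSV text into fields. The sample records are invented for illustration (they are not taken from any file used in this tutorial), and io.StringIO merely stands in for an open file.

```python
import csv
import io

# Hypothetical sample data: two records, three fields each.
# The quoted field shows how a comma can appear inside a value.
sample_text = 'Ada,Lovelace,1815\n"Hopper, Grace",USN,1906\n'

# io.StringIO lets us treat the string like an open text file
records = list(csv.reader(io.StringIO(sample_text)))

for record in records:
    print(record)   # each record is a list of field strings
```

Note that every field comes back as a string, even the ones that look like numbers.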
In this activity you'll get a brief introduction to using Python to load up CSV files of data, process that data, and display visual analyses of that data. Being able to examine and demonstrate numerical relationships in a visual format is fundamental to any kind of analysis of that data.
I.A. Getting Data Into Python
Python has the built-in ability to read data from external files, but we're going to use a module that's specifically built for handling CSV files.
If you've got a relatively small amount of data (a few hundred lines of data, or even 100MB of data), you can probably read it all into a Python program as a list of rows, like this:
    """Demonstration 1. Bringing in data from CSV files"""
    import csv      # this module helps manage csv data

    infile = open('dataFile.csv', newline='')
    myData = []
    for row in csv.reader(infile):      # each row is a list of field strings
        myData.append(row)
    infile.close()
Once this loop has run, the list myData contains all of the rows of data from the file. (The csv.reader object removes the newline character at the end of each line and splits the line into a list of field values, so no call to .strip() is needed.)
That works fine for manageable quantities of data. But what if I've got 100GB of data? My computer, a fairly powerful machine, only has 32GB of RAM, and not all of that is even available for program data. It's clear that I'm not going to be able to digest this entire file into the myData list.
For these circumstances, we'll be processing the file one line at a time:
    """Demonstration 2. Bringing in data from CSV files"""
    import csv      # this module helps manage csv data

    infile = open('dataFile.csv', newline='')
    counter = {}    # A dictionary that will be used to count word frequency
    for row in csv.reader(infile):
        # Do an analysis based on the single row of information
        # If each row contained a word, and we wanted to count the word frequency...
        word = row[0]       # row is a list, so take its first (only) field
        if word in counter:
            counter[word] += 1
        else:
            counter[word] = 1
    infile.close()
This has the benefit of only reading a single line into memory at a time, allowing you to process a data set that is larger than the memory you have on your computer.
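One way to convince yourself that the line-at-a-time approach works is to try it on a small file first. The sketch below is only an illustration: it writes a tiny temporary CSV file (the words in it are made up), then counts word frequency while holding one row in memory at a time. The function name count_first_field is hypothetical.

```python
import csv
import os
import tempfile
from collections import Counter


def count_first_field(path):
    """Count how often each value appears in the first field of a CSV file."""
    counts = Counter()
    # newline='' is what the csv module documentation recommends for files
    with open(path, newline='') as infile:
        for row in csv.reader(infile):   # only one row is in memory at a time
            if row:                      # skip blank lines
                counts[row[0]] += 1
    return counts


# Build a small throwaway file so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='',
                                 delete=False) as tmp:
    tmp.write('apple\nbanana\napple\ncherry\napple\n')
    tmp_path = tmp.name

word_counts = count_first_field(tmp_path)
os.remove(tmp_path)     # clean up the throwaway file
print(word_counts)
```

A Counter behaves like the counter dictionary above, except that a missing key starts at 0, so the if/else test is not needed.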
I.B. Identifying Selected Data in a CSV File
Identifying the data that you want to use in a CSV file is usually easy. Figuring out how to extract that data can sometimes be tricky. For this exercise, we'll use the fakePeople.csv file that contains information on 1000 fake people, including their Gender, GivenName, Surname, State, Birthday, Race, Weight in Kilograms, Height in Centimeters, YearsEducation, and YearlyIncome, in that order.
How would we figure out the percentage of each Race in this population? We'd need to know how many there were of each Race, and how many total people there are.
    """Demonstration 3. Extracting data from CSV files"""
    import csv

    infile = open('fakePeople.csv')
    races = {}      # dictionary that counts race populations
    for person in csv.reader(infile):
        # the .reader() method splits the CSV line into a list
        # person[5] refers to the sixth comma-separated value
        # (counting from 0) in the record: the Race field
        if person[5] in races:
            races[person[5]] += 1   # increment the dictionary value for this race
        else:
            races[person[5]] = 1    # initialize a dictionary key for this race
If we look at the races dictionary now, it looks like this:
{'Black': 141, 'Brown': 186, 'White': 486, 'Asian': 81, 'Other': 106}
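Index numbers like person[5] are easy to get wrong. As an alternative sketch, csv.DictReader can attach a name to each field. The fieldnames list below is an assumption based on the field order described above (the exact spellings are invented here), and the three sample records are made up so the example runs without fakePeople.csv.

```python
import csv
import io

# Assumed column names, in the order the tutorial lists the fields
FIELDS = ['Gender', 'GivenName', 'Surname', 'State', 'Birthday', 'Race',
          'Kilograms', 'Centimeters', 'YearsEducation', 'YearlyIncome']

# Made-up records standing in for the real file
sample = io.StringIO(
    'female,Ann,Lee,CA,1/2/1980,Asian,60.1,165,16,52000\n'
    'male,Bob,Roy,TX,3/4/1975,White,82.5,180,12,41000\n'
    'female,Cy,Diaz,NY,5/6/1990,White,70.0,170,18,67000\n'
)

races = {}
for person in csv.DictReader(sample, fieldnames=FIELDS):
    race = person['Race']                   # look the field up by name
    races[race] = races.get(race, 0) + 1    # start at 0 if race is new

print(races)
```

Because the fieldnames are supplied explicitly, DictReader treats the first line of the data as a record rather than as a header row.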
How can we find the total number of people identified? We could count the number of lines in our infile, but we can also just total up the values in our races dictionary:
    totalPeople = 0
    for aRace in races.keys():          # Set aRace = to each key in turn
        totalPeople += races[aRace]     # Get number of people of each race and sum them
totalPeople has a value of 1000 at this point.
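The same total can also be found in one step. This is a minimal sketch using the built-in sum() function on the counts shown above:

```python
# Counts copied from the races dictionary shown above
races = {'Black': 141, 'Brown': 186, 'White': 486, 'Asian': 81, 'Other': 106}

# .values() yields just the counts; sum() adds them up
totalPeople = sum(races.values())

print(totalPeople)
```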
Let's calculate percentages now:
    racePercentages = {}
    for aRace in races.keys():
        racePercentages[aRace] = (races[aRace] / totalPeople) * 100
Final code for the Race analysis:
    #!/usr/bin/env python3
    """
    Demonstration 4. Extracting data from CSV files
    """

    import csv

    def main():
        infile = open('fakePeople.csv')

        # First count how many of each race
        races = {}      # dictionary that counts race populations
        for person in csv.reader(infile):
            # the .reader() method splits the CSV line into a list
            # person[5] refers to the sixth comma-separated value
            # (counting from 0) in the record: the Race field
            if person[5] in races:
                races[person[5]] += 1   # increment the dictionary value for this race
            else:
                races[person[5]] = 1    # initialize a dictionary key for this race

        # Calculate how many people total
        totalPeople = 0
        for aRace in races.keys():          # Set aRace = to each key in turn
            totalPeople += races[aRace]     # Get number of people of each race and sum them

        # Calculate race percentages
        racePercentages = {}
        for aRace in races.keys():
            racePercentages[aRace] = (races[aRace] / totalPeople) * 100
        print(racePercentages)

    if __name__ == "__main__":
        main()
The dictionary racePercentages now looks like this:
{'Black': 14.099999999999998, 'Brown': 18.6, 'White': 48.6, 'Asian': 8.1, 'Other': 10.6}
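The value 14.099999999999998 is an artifact of floating-point arithmetic, not a counting error. Here is a small sketch of one way to tidy the numbers for display, using the built-in round() function and a format string. The dictionary name rounded is made up for this example.

```python
# Percentages copied from the racePercentages output shown above
racePercentages = {'Black': 14.099999999999998, 'Brown': 18.6,
                   'White': 48.6, 'Asian': 8.1, 'Other': 10.6}

rounded = {}
for aRace, percent in racePercentages.items():
    rounded[aRace] = round(percent, 1)      # keep one decimal place

# Print a simple two-column table: label on the left, percentage on the right
for aRace, percent in rounded.items():
    print('{0:<6}{1:>6.1f}%'.format(aRace, percent))
```

Rounding is best saved for display. Keeping the unrounded values for any further calculation avoids piling up rounding error.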
This is a fine collection of data, but what we really want to be able to do is to graph it. Because there's no quantitative value on the x-axis—it's just the labels of each race—we'll use a bar graph for this analysis.
I.C. Creating a Bar Graph of Data
We'll be using two additional modules to create our graphs: numpy and matplotlib. These modules aren't typically included with a standard Python installation, so you'll have to check whether you have them installed.
Check to see if numpy and matplotlib are installed
The easiest way to see if you have these modules available to you is to try to import them. Go into a Terminal, start an interactive session in Python, and try to import numpy and matplotlib. If you get no error messages, you're okay.
    (base) rwhite@Ligne2 Desktop % python
    Python 3.7.4 (default, Aug 13 2019, 15:17:50)
    [Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy
    >>> import matplotlib
    >>>
If you see something more like this:
    $ python3
    Python 3.6.9 (default, Nov 7 2019, 10:44:02)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'numpy'
    >>> import matplotlib
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'matplotlib'
    >>>
... then you'll need to install those modules. (How you do this depends on your Python installation. See the instructor for further info.)
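The same check can be run from a script instead of an interactive session. This is a small sketch using importlib.util.find_spec from the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ['numpy', 'matplotlib']:
    if is_installed(name):
        print(name, 'is installed')
    else:
        print(name, 'is NOT installed')
```

This only reports whether the module can be found. It does not import the module or install anything.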
The information that we want to graph is contained in the racePercentages dictionary.
First, split that dictionary into two lists. One will be used to create the x-axis of entries, and one will be used to create the y-axis values.
    """
    Demonstration 5. Creating bar graphs of data
    This code doesn't show lines previously developed above.
    """

    import matplotlib.pyplot as plt
    import numpy

    races = []          # plotted along the x-axis
    popPercent = []     # plotted along the y-axis

    # Split dictionary up into these two lists
    for aRace in racePercentages:
        races.append(aRace)     # puts the keys in races
        popPercent.append(racePercentages[aRace])

    # Identify how many columns will be in our bar graph
    columns = numpy.arange(len(races))
Now we just need to tell matplotlib.pyplot to create and display the bar graph:
    plt.bar(columns, popPercent)
    plt.show()
That's not bad, but we're obviously missing quite a bit of information: x-axis labels, y-axis labels, a title...
    plt.bar(columns, popPercent)                    # Indicates a bar chart
    plt.xticks(columns, races, rotation=90)         # Label and Values on x-axis
    plt.ylabel('Percentages of total population')   # Label on y-axis
    plt.title('Race population by percentage')      # Title of graph

    # And don't forget to display the plot!
    plt.show()
Outstanding! Here's the complete code listing for the race-analysis.py program.
    #!/usr/bin/env python3
    """
    Demonstration 6. Extracting data from CSV files
    """

    import csv
    import matplotlib.pyplot as plt
    import numpy

    def main():
        infile = open('fakePeople.csv')

        # First count how many of each race
        races = {}      # A dictionary that counts race populations
        for person in csv.reader(infile):
            # person[5] refers to the sixth comma-separated value
            # (counting from 0) in the record: the Race field
            if person[5] in races:
                races[person[5]] += 1
            else:
                races[person[5]] = 1
        print(races)

        # Calculate how many people total
        totalPeople = 0
        for aRace in races.keys():          # Set aRace = to each race in turn
            totalPeople += races[aRace]     # Get value associated with each key

        # Calculate race percentages
        racePercentages = {}
        for aRace in races.keys():
            racePercentages[aRace] = (races[aRace] / totalPeople) * 100
        print(racePercentages)

        races = []          # plotted along the x-axis
        popPercent = []     # plotted along the y-axis

        # Split dictionary up into these two lists
        for aRace in racePercentages:
            races.append(aRace)     # puts the keys in races
            popPercent.append(racePercentages[aRace])

        # Identify how many columns will be in our bar graph
        columns = numpy.arange(len(races))

        plt.bar(columns, popPercent)
        plt.xticks(columns, races, rotation=90)
        plt.ylabel('Percentages of total population')
        plt.title('Race population by percentage')

        # And don't forget to display the plot!
        plt.show()

    if __name__ == "__main__":
        main()
II. Creating x-y Plots of Data
Creating x-y plots of data is even easier than creating a bar graph. Simply supply the plot function with two equal-sized lists of x-data and y-data and you're done.
Let's see how to do that.
II.A. x and y axes
Let's compare two values available to us in the fakePeople.csv file: mass in kilograms and yearly income.
First, collect the data into two lists. Mass is in field 6, and yearly income is in field 9.
    """Demonstration 7. Creating x-y plots of data"""
    import csv
    import matplotlib.pyplot as plt
    import numpy

    # get mass and income information from file
    infile = open('fakePeople.csv','r')
    mass = []
    income = []
    for person in csv.reader(infile):
        mass.append(float(person[6]))
        income.append(float(person[9]))

    # Note that csv.reader returns every value as a string, and
    # matplotlib (v2.1 and later) would plot strings as category
    # labels, so we convert the values above to numbers using float()

    plt.plot(mass, income)
    plt.show()
If we plot this, matplotlib uses its default behavior, which is to draw blue lines from one data point to the next.
Instead, let's indicate that we want red "o"s to be displayed on the graph, making a scatter plot:
    plt.plot(mass, income, 'ro')    # 'ro' --> Red "o" as symbol on plot
    plt.show()
There doesn't appear to be any correlation between a person's weight and their yearly income. There are other pairs of values that we would expect to be related, though. Years of education and income? Height and weight?
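An eyeball judgment like this can be backed up with a number. The sketch below computes the Pearson correlation coefficient r, which is near 0 when two sets of values have no linear relationship and near +1 or -1 when the linear relationship is strong. The function name pearson_r and the short sample lists are invented for illustration; they are not taken from fakePeople.csv.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: "rising" increases in step with xs,
# while "unrelated" has no linear trend relative to xs
xs = [1, 2, 3, 4, 5]
rising = [10, 20, 30, 40, 50]
unrelated = [3, 1, 4, 1, 3]

print(pearson_r(xs, rising))
print(pearson_r(xs, unrelated))
```

To apply this to the tutorial's data, you would pass in the mass and income lists built in the code above.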
Let's look at height and weight for the population in newData.csv.
    #!/usr/bin/env python3
    """Demonstration 8. Creating x-y plots of data"""

    import csv
    import matplotlib.pyplot as plt
    import numpy

    # get weight and height information from file
    infile = open('newData.csv','r')
    mass = []
    height = []
    for person in csv.reader(infile):
        mass.append(float(person[6]))
        height.append(float(person[7]))

    # Note that csv.reader returns every value as a string, and
    # matplotlib (v2.1 and later) would plot strings as category
    # labels, so we convert the values above to numbers using float()

    plt.plot(mass, height, 'ro')    # 'ro' --> Red "o" as symbol on plot
    plt.show()
This is a nice scatter plot that clearly indicates, at least qualitatively, that there is a relationship between height and weight.
II.B. Analysis of Data - Regressions (Trendlines)
Graphs are nice because they give us a visual way of recognizing the relationships between different sets of values.
In the sciences, often, we're looking for a more specific description of the relationship: a mathematical model that we can use to test our understanding of the world.
How can we use Python and its data analysis capabilities to reveal this relationship more clearly? How do we get a regression of this data, a function that clearly demonstrates a statistical quantitative relationship between two terms?
One way is certainly to plug all these numbers into your TI-Calculator and use the Math capabilities of that device. Microsoft's Excel and Google's Sheets are able to take a table of data and perform a regression on that information. You won't be surprised to find that `numpy` is able to do the same thing.
As with any regression calculation, you need to decide what kind of mathematical model would be the best fit for your data. You then indicate which analysis model you want `numpy` to perform.
The three most common models are linear, polynomial ("quadratic" is a second-order polynomial, for example), and exponential.
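For the linear model, the fit boils down to two numbers: a slope and an intercept. As a rough sketch of what a least-squares linear regression computes, here is the standard textbook formula in plain Python. The function name linear_fit and the sample points are made up for this example, and this is not a description of how numpy implements its own fitting.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (sum of x-deviation times y-deviation) / (sum of squared x-deviations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the least-squares line passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = linear_fit(xs, ys)
print('y = {0:.2f}x + {1:.2f}'.format(slope, intercept))
```

With real, noisy data the points will not sit exactly on the line, and the slope and intercept describe the line that minimizes the total squared vertical distance to the points.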
Based on a preliminary view of the x-y scatterplot above, which model do you think will best fit our data?
The plot above is produced by this code. Check the comments to see what does what.
    #!/usr/bin/env python3
    """Demonstration 9. Creating x-y plots of data"""

    import matplotlib.pyplot as plt
    import numpy
    import os

    # get weight and height information from file
    infile = open('newData.csv','r')
    mass = []
    height = []
    for line in infile:
        theLine = line.strip().split(',')   # split the line into a list of fields
        mass.append(float(theLine[6]))      # convert the text values to numbers
        height.append(float(theLine[7]))

    # take heights list and convert it
    # to a numpy array so we can process with it
    heights = numpy.array(height)

    fig = plt.figure()      # set up figure that we can save later
    plt.title('Mass vs. Height for Fake Random People')
    plt.ylabel('Mass (kg)')
    plt.xlabel('Height (cm)')
    plt.plot(heights,mass,'bo')     # put data on figure, 'bo' = blue dots

    # identify slope, intercept of linear best fit line for data
    slope, intercept = numpy.polyfit(height, mass, 1)   # 1 for linear fit

    # create "mass" values that will be used to show best-fit line
    avgmasses = slope * heights + intercept
    # heights is numpy array, so avgmasses is too

    # create regression plot and legend for graph
    fit_label = 'Linear fit: y = {0:.2f}x + {1:.2f}'.format(slope, intercept)
    plt.plot(heights, avgmasses, color='red', linestyle='--', label=fit_label)
    plt.legend(loc='lower right')

    '''
    # Save the plot in a directory called plots
    if not os.path.exists('plots'):
        os.mkdir('plots')
    fig.savefig('plots/mass_vs_height.png')
    '''

    # Show the graph on screen
    plt.show()