Unit 2 - Algorithm Analysis; Big-O Notation
2.1. Overview
Now that we've gotten some Python syntax figured out, let's move on to the main topics in this course: using algorithms and Abstract Data Types (ADTs) to solve problems.
In this short unit, we'll be discussing the general idea behind algorithms, and figuring out how we can both predict and measure their efficiency:
- Why analyze our algorithms?
- Big-O notation describes execution time
- Big-O re: Python lists, dictionaries
Let's get started.
2.2. Algorithm Analysis
Which program is faster? Which program uses less memory? What algorithm strategies make one program better than another?
Definition: Algorithm
Algorithm = generic, step-by-step list of instructions to solve a problem.
Computer scientists study algorithms, which are often presented mathematically rather than programmatically. Computer scientists also study what types of problems are solvable and what problems may be unsolvable... and whether we can even determine that a given type of problem is unsolvable.
If you go into Computer Science, you may very well find yourself the eventual owner of one of the classics in the field, Donald E. Knuth's The Art of Computer Programming. Volume 1 of that series is entitled Fundamental Algorithms, and a glance at any page of it makes one thing clear:
Computer science is a math thing.
Definition: Program
Program = the coded implementation of an algorithm in a programming language.
Coders take the algorithms designed by computer scientists and implement them in a specific language.
There are different algorithms that can be used to solve any given problem, and for each of these, there are different programs that may be implemented for the same algorithm.
Which is better?
Compare the two programs here, one with variables/identifiers that are short, and the other with self-documenting variable names. Which program is better?
for p in d:
    if p.em():
        s(p)
for person in directory:
    if person.has_email_address():
        send_email(person)
The self-documenting version is better for readability, especially for a human trying to understand how the algorithm works; from an algorithmic perspective, however, the two programs are identical. In this course, we certainly want readable code, but our primary goal is examining algorithms and how efficient they are in terms of resources used, both time and memory (and that goal, of course, will be served by having readable code, even if the algorithm itself doesn't require it).
Definition: Benchmark
Benchmark = a standard by which something is measured or evaluated.
In this course, most of our efficiency measurements will be based on time of execution. With this in mind, we'll very commonly be importing Python's time module and using it to measure how much time an algorithm takes to execute.
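The basic pattern looks something like this (a minimal sketch; the summing loop is just a stand-in for whatever algorithm you want to measure):

import time

start = time.time()            # record the clock before the work
total = 0
for i in range(1000000):       # the algorithm being measured
    total = total + i
stop = time.time()             # record the clock after the work

print("Execution time:", stop - start, "seconds")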
Let's write a program to figure out how long it takes to run an algorithm on your computer.
Problem: Execution time of calculating a factorial
Write the program factorial_execution_time.py. How do your results compare with those of others?
So we know how quickly your computer can run the program. But your computer's speed is not our primary concern here.
Execution Time vs. Efficiency
Note that the time it takes to execute a program on your computer depends on lots of different variables, including your CPU, bus speed, whether output is displayed, the computer language used, etc. In this class, we're not concerned with your hardware or how quickly your computer runs.
We're concerned with how efficiently an algorithm performs a task as the size of the task increases.
Let's write a new program, and look at how the time it takes to run compares with the size of the task it has to do.
Problem: Summing a Varying Range of Values
Write the program sum_range_of_values.py. Can you identify what the graph of the data produced by this program looks like?
2.3. Big-O Notation
We can determine the efficiency of an algorithm in two ways:
- Empirically, by writing it, running a series of tests on it, and measuring the time each one takes to run, or
- Theoretically, by considering the number of steps involved in running the algorithm, where a "step" is usually an assignment statement. It takes time for a computer to assign a value to a variable, so if we count the number of times that happens, we'll have a good measure of how much time it takes the computer to complete the task.
Time: A Measure of Efficiency
In determining efficiency, one thing we can examine is how the time of execution T changes as a function of the task size n.

How can we go about determining T(n)?
2.3.1. Measuring T(n)
One way of determining T(n) is simply to write a program and measure it on your computer.
Let's try it.
Testing an algorithm
Write a program that uses a loop to calculate the sum of values 1 to n.

sum = 0
for i in range(1, n + 1):
    sum = sum + i
Once you've written the program, import the time module and use it to determine how much time it takes to calculate that sum.
Run a series of tests, adding all the numbers from 1 to 100000 (one-hundred thousand). Are the results the same each time? Are they relatively close? What might account for any variation in the times?
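One way to run that series of tests (a sketch; wrapping the summation in a function, and the name sum_to, are just suggestions):

import time

def sum_to(n):
    sum = 0
    for i in range(1, n + 1):
        sum = sum + i
    return sum

for trial in range(5):         # repeat the same test several times
    start = time.time()
    sum_to(100000)
    stop = time.time()
    print("Trial", trial, ":", stop - start, "seconds")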
Increasing the size of the problem
Continuing the summation problem from before, how much longer does the program take to run if we double the number of numbers we're summing (i.e. add the numbers from 1 to 200000)? Does the execution time double?

What if we add the numbers from 1 to a million, 10 times the number of values? Does the execution time increase correspondingly?
What would the graph of execution time vs number of values look like?
2.3.2. Analyzing T(n)
A second approach to determining T(n) is a more thorough, formal analysis.
As mentioned above, if the most "time-expensive" instruction is assigning a value into memory, we can get a very good measure of how efficient a program is by counting up how many times an assignment instruction is executed.
Let's take a look at the summation algorithm we wrote above.
Analyzing the summation algorithm
Here's the code we were running above.
sum = 0
for i in range(1, n + 1):
    sum = sum + i
How many assignment instructions were in that code?
- the line sum = 0 is one assignment
- the index variable i is assigned n values
- the variable sum is assigned a value n times

The total number of assignments, then, is 2n + 1.

We can write T(n) = 2n + 1.
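If you want to sanity-check that count empirically, one option (a sketch, purely for illustration) is to instrument the loop with a counter that increments at every assignment the algorithm makes:

n = 10
count = 0

sum = 0                      # one assignment
count = count + 1
for i in range(1, n + 1):    # i is assigned n times
    count = count + 1
    sum = sum + i            # sum is assigned n times
    count = count + 1

print(count)                 # prints 21, i.e. 2(10) + 1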
Analyzing a multiplication table
Take a look at this code, which places products into a "list of lists" that has been previously set up. Calculate the T(n) for filling the table.

for row in range(n):
    for col in range(n):
        table[col][row] = col * row
How many assignment instructions were in that code?
The variable row is assigned n times, and the variable col is assigned n * n times (n times for each pass through the row loop). The line table[col][row] = col * row is one assignment, and it too is executed n * n times: n times in the col loop for each of the n passes through the row loop. Therefore, the total number of assignments is n + (n * n) + (n * n), or 2n² + n.

We can write T(n) = 2n² + n.
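The analysis above assumes the "list of lists" already exists; for completeness, here's one way to set it up and run the code (a sketch; the list-comprehension setup is an assumption, and its assignments aren't counted in the T(n) above):

n = 5

table = [[0] * n for _ in range(n)]    # set up an n x n list of lists

for row in range(n):
    for col in range(n):
        table[col][row] = col * row

for r in table:
    print(r)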
2.3.3. Big-O notation
While the time function T(n) is occasionally useful, computer scientists typically use a different measure of efficiency, called "big-O notation." The "O" refers to the order of the time-based polynomial function that is dominant in determining the execution time.
If we're trying to determine the efficiency of T(n) = 2n + 1, you can see that as n gets very large, the value 1 becomes less and less significant in the time. The 2 is a constant factor, so it doesn't change with the size of n. In an effort to simplify the analysis of this function, then, the computer scientist classifies the function as O(n). We might say "The running time for this algorithm is O(n)," meaning that the time increases linearly with the size of the data set.
The multiplication table took time T(n) = 2n² + n. The order of this function is two, so the big-O representation of this algorithm would be O(n²).
More complex big-O analyses
What is the performance of this T(n) function, expressed in big-O notation?

T(n) = 2 + (n * (3*n)) + (4 * n) + 3

The function simplifies to T(n) = 3n² + 4n + 5.
While the 4n and the 5 might be significant with regard to actual running time, in determining the performance with increasing data size n, we are only concerned with the term of the greatest order of magnitude, the n² term. For n = 1000, for example, 3n² contributes 3,000,000 to the total, while 4n + 5 contributes only 4,005.

We would indicate that this algorithm has a performance of O(n²).
In some cases, an algorithm's performance may depend upon the nature of the specific data given to it: a sorting algorithm, for example, may be much faster if the data is already sorted. Other sorts may take the same amount of time regardless of the order, and some might even be worse if the data is already sorted. You may see references to a worst-case performance or an average performance when looking at big-O values. In most cases in this course, we'll be referring to average performance.
2.3.4. Common Big-O functions
Here are some of the more common big-O functions, listed in order of decreasing efficiency.
In conversation, one typically says "big oh of log n", or "big oh of n-squared", or simply "n-squared performance" or "has constant performance."
f(n) | Name
---- | ----
O(1) | Constant
O(log n) | Logarithmic
O(n) | Linear
O(n log n) | Log Linear
O(n²) | Quadratic
O(n³) | Cubic
O(2ⁿ) | Exponential
We'll take some time to look at these different orders as we work our way through the course.
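To get a feel for how differently these functions grow, here's a small sketch that tabulates a few of them (O(2ⁿ) is left out, since even n = 100 produces a 31-digit value):

import math

# tabulate some common big-O functions for a few values of n
print("n", "log n", "n log n", "n^2", "n^3", sep="\t")
for n in [10, 100, 1000]:
    print(n, round(math.log2(n), 1), round(n * math.log2(n)),
          n**2, n**3, sep="\t")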
2.4. Anagram Detection
An anagram of a word consists of the letters of the original word rearranged into a different order.
The word "cat" is an anagram of "act", and the words "genuine" and "ingenue" are anagrams of each other. "catt" and "cat" are not anagrams of each other, nor are "catt" and "ccat". The same letters and the same number of those letters must appear in both anagrams.
How to identify anagrams?
Given two words word1 and word2 that are possibly anagrams of each other, how would you check them to identify whether they are or not?
It may help to consider a concrete example. Are these two "words" anagrams of each other? How could you determine that?
word1 = "tyvcunaazhbeaqjolinucylllztjydjvdhfhsczdayrrmxkmhizpodhwkxlilkypllwqplothnysaoxgpmgatwtjqhdzzpzwkeed" word2 = "hljyzcqxmlmtlznnkjulacpszlarehpzemnkojdvowtpyawliaphyodtdldqhkpjvobigyukethraxhylhwfzdzadwyqgtcsizlx"
2.4.1. Anagram Detection Strategies
The following strategies are all valid for identifying anagrams.
- Checking off letters one-by-one
  Go through word1, take each letter found, and look through word2, checking off the found letters in word2 as you go.
- Sort and Compare
  Sort the letters in word1 and word2 and compare the two words that result.
- Brute Force
  Write out all the possible orderings of the letters in word1 and see if word2 matches any of them.
- Count and Compare
  Count the frequency of each letter in word1, and compare it to the frequencies of the letters in word2.
The real question, of course, is which strategy do you think might be most efficient?
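As a point of reference, the Sort and Compare strategy is nearly a one-liner in Python (a minimal sketch; the other strategies are left for you to implement):

def is_anagram(word1, word2):
    # two words are anagrams exactly when their sorted letters match
    return sorted(word1) == sorted(word2)

print(is_anagram("genuine", "ingenue"))    # True
print(is_anagram("catt", "cat"))           # False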
Anagram Analysis, Strategy 1
Write the program anagram_analysis.py, and test it.
2.5. Performance of Python Operations
When creating a new programming language, decisions have to be made about how the features of that language will be implemented.

Will that language have strict or dynamic data typing? Will int values be of a fixed size or will they be dynamically implemented? Etc.
Sometimes the language will have multiple ways of accomplishing something—creating a list of values, for example—in which case it may be helpful for you to know which strategy will be most efficient for a given purpose.
2.5.1. Creating a list of values
Which of these four strategies is best for creating a list of int values 0-999? How could we figure out which one is fastest?

# strategy 1
l = []
for i in range(1000):
    l = l + [i]                     # concatenates a list to another list

# strategy 2
l = []
for i in range(1000):
    l.append(i)                     # classic appending to a list

# strategy 3
l = [i for i in range(1000)]        # list comprehension strategy, functional style

# strategy 4
l = list(range(1000))               # using Python's list and range operators
Let's write a program to try each of these strategies and compare the results.
Here's a start:
#!/usr/bin/env python3
"""
listfuns.py
This program compares the performance of the various
list creation strategies.
Miller & Ranum, Ch 2
@author Richard White
@version 2017-02-21
"""

import time

# strategy 1
def test1():
    l = []
    for i in range(1000):
        l = l + [i]

# strategy 2
def test2():
    l = []
    for i in range(1000):
        l.append(i)

# strategy 3
def test3():
    l = [i for i in range(1000)]

# strategy 4
def test4():
    l = list(range(1000))

def main():
    NUM_OF_TESTS = 1000
    # Testing strategy 1
    total_time = 0
    for i in range(NUM_OF_TESTS):
        start = time.time()
        test1()
        stop = time.time()
        total_time += stop - start
    print("test1", total_time / NUM_OF_TESTS)
    # Testing strategy 2
    total_time = 0
    for i in range(NUM_OF_TESTS):
        start = time.time()
        test2()
        stop = time.time()
        total_time += stop - start
    print("test2", total_time / NUM_OF_TESTS)

if __name__ == "__main__":
    main()
Testing list-filling strategies

Complete the program above with timers for the two other strategies.

Which list-creation strategy turns out to be fastest?
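Incidentally, one way to avoid repeating the timing code four times is a small helper that times any function handed to it (a sketch; whether you refactor this way is a design choice):

import time

def average_time(f, num_of_tests=1000):
    """Returns the average execution time of calling f()."""
    total_time = 0
    for _ in range(num_of_tests):
        start = time.time()
        f()
        stop = time.time()
        total_time += stop - start
    return total_time / num_of_tests

# for test in [test1, test2, test3, test4]:
#     print(test.__name__, average_time(test))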
2.5.2. Time vs. Performance
In the problem above we timed how long each strategy took, but we haven't actually determined the performance based on increasing size.
How to measure Big-O performance
In order to measure performance empirically, you need to run a series of tests with increasingly large data sets to determine how the algorithm performs as a function of size of data set.
Let's write a program to see what the performance is of the first list creation strategy, the concatenation strategy.
#!/usr/bin/env python3
"""
performance_test1.py
This program analyzes the performance of the list
concatenation operation.
@author Richard White
@version 2017-02-21
"""

import time

def test1(num_of_values):
    """Creates a list of the specified number of values
    by concatenating lists.
    """
    l = []
    for i in range(num_of_values):
        l = l + [i]

def main():
    LOWERLIMIT = 1000
    UPPERLIMIT = 10000
    NUM_OF_TESTS = 10
    print("Num of tests","Execution time")             # prints results on monitor
    outfile = open("list_analysis_results.csv","w")    # stores results in file
    outfile.write("Num of tests,Execution time\n")
    for i in range(LOWERLIMIT, UPPERLIMIT, int((UPPERLIMIT - LOWERLIMIT) / NUM_OF_TESTS)):
        start = time.time()
        test1(i)
        stop = time.time()
        print(str(i) + "\t" + str(stop - start))                  # results on monitor
        outfile.write(str(i) + "," + str(stop - start) + "\n")    # store in file
    outfile.close()    # close file once tests are done

if __name__ == "__main__":
    main()
Big-O performance of concatenation
Run the program above and use the data produced to create a graph of the execution times. Based on the results, what would you guess the Big-O performance is of the concatenation operation?
Now create a new performance test, performance_test3.py, that analyzes the performance of strategy 3, the list comprehension strategy. What does the graph of those results look like? What would you estimate the Big-O performance of the list comprehension strategy to be?
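Only the test function itself needs to change for performance_test3.py; the timing harness in main() above can be reused as-is. A sketch of the replacement function:

def test3(num_of_values):
    """Creates a list of the specified number of values
    using a list comprehension.
    """
    l = [i for i in range(num_of_values)]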
2.5.3. Data Type comparison: list vs dict

Python includes two data types that allow for working with collections of data: the list and the dict. (There are other collection types as well.)
Lists are good for keeping track of data that needs to have its order maintained, while dictionaries don't have a preferred order: values in a dictionary are identified by their keys.
Again, there are trade-offs between the two types of collection data, and which one you choose to use in a program will be determined by what your needs are.
Both lists and dictionaries allow for adding items to a collection:
import random

l = []
l.append(random.randrange(1000))

d = {}
d[random.randrange(1000)] = 1
# the line above adds a dictionary key with a value of 1,
# which we're not really using in this analysis
... and both lists and dictionaries allow for items to be removed from a collection:
l.remove(7)    # removes the value 7 from the list (assuming it exists)
del d[7]       # removes the key 7 from the dict (assuming it exists)
Which is faster? Removing items from a dictionary, or removing items from a list?
del vs. remove

Write a program del_vs_remove.py that does the following:
- Create two large (n >= 10000) collections, one list and one dictionary, each containing the same set of randomly generated integers (in the range 0-10000). (A setup sketch follows this list.)
- Time how long it takes to try to remove one thousand randomly generated numbers from each collection type and report the results.
- It may be that one of the randomly generated numbers to be removed won't be found in your list or dictionary. To avoid the error that will be thrown in this case, use the try-except error trapping facility in Python:

  try:
      l.remove(value_to_remove)
      # print("Removed from list")
  except:
      # print("Couldn't find random value in list... continuing")
      pass

  Note the commented lines that I was using when I was testing my program to make sure it was working the way it should!
- In this basic version of the program, we're only timing the two operations. Which one is faster?
- Modify your program to collect data with collections of increasing size. What is the performance, expressed in Big-O notation, as you use del and remove on lists and dictionaries of increasing size?
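To get you started on the first step, one possible setup looks like this (a sketch; the variable names are just suggestions):

import random

N = 10000

values = [random.randrange(10001) for _ in range(N)]    # integers in the range 0-10000

l = list(values)    # the list collection
d = {}              # the dict collection, same values used as keys
for v in values:
    d[v] = 1        # throwaway value of 1, as in the example above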