Advanced Topics in Computer Science

Unit 2 - Algorithm Analysis; Big-O Notation

Table of Contents

  1. Overview
  2. Algorithm Analysis
  3. Big-O Notation
  4. Anagram Detection
  5. Performance of Python Collections

2.1. Overview

Now that we've gotten some Python syntax figured out, let's move on to the main topics in this course: using algorithms and Abstract Data Types (ADTs) to solve problems.

In this short unit, we'll be discussing the general idea behind algorithms, and figuring out how we can both predict and measure their efficiency.

Let's get started.

2.2. Algorithm Analysis

Which program is faster? Which program uses less memory? What algorithm strategies make one program better than another?

Definition: Algorithm

Algorithm = generic, step-by-step list of instructions to solve a problem.

A page from Knuth's "Fundamental Algorithms"

Computer scientists study algorithms, which are often presented mathematically rather than programmatically. They also study which types of problems are solvable, which are unsolvable, and even whether we can determine that a given type of problem is unsolvable at all.

If you go into Computer Science, you may very well find yourself the eventual owner of one of the classics in the field, Donald E. Knuth's The Art of Computer Programming. Volume 1 of that series is entitled Fundamental Algorithms, and the page shown above gives you a sense of what that text looks like.

Computer science is a math thing.

Definition: Program

Program = the coded implementation of an algorithm in a programming language.

Coders take the algorithms designed by computer scientists and implement them in a specific language.

Any given problem can typically be solved by a number of different algorithms, and each of those algorithms can be implemented as any number of different programs.


Which is better?

Compare the two programs here, one with short, cryptic identifiers and the other with self-documenting variable names. Which program is better?

# program 1: short identifiers
for p in d:
    if p.em():
        s(p)

# program 2: self-documenting names
for person in directory:
    if person.has_email_address():
        send_email(person)

Answer

The self-documenting code is better for readability, especially for a human trying to understand how an algorithm works. From an algorithmic perspective, however, they are identical.

In this course, we certainly want readable code. Our primary goal is examining algorithms and how efficient they are in terms of resources used, both time and memory (and that goal, of course, will be served by having readable code, even if the algorithm itself doesn't require it).

Definition: Benchmark

Benchmark = a standard by which something is measured or evaluated.

In this course, most of our efficiency measurements will be based on time of execution. With this in mind, we'll very commonly be importing Python's time module and using it to measure how much time an algorithm takes to execute.
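
The basic pattern looks like this (a minimal sketch; time.time() returns the current time in seconds, as a float):

import time

start = time.time()         # record the clock before the code runs
# ... the code being timed goes here ...
stop = time.time()          # record the clock after the code runs
print("Elapsed time:", stop - start, "seconds")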

Let's write a program to figure out how long it takes to run an algorithm on your computer.

Problem: Execution time of calculating a factorial

Write the program factorial_execution_time.py. How do your results compare with those of others?
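
Here's one possible sketch of that program (the function name and the test value 10000 are our own arbitrary choices):

#!/usr/bin/env python3

import time

def factorial(n):
    """Returns n! calculated iteratively."""
    result = 1
    for i in range(2, n + 1):
        result = result * i
    return result

def main():
    start = time.time()
    factorial(10000)        # an arbitrary, reasonably large test value
    stop = time.time()
    print("Execution time:", stop - start, "seconds")

if __name__ == "__main__":
    main()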

So we know how quickly your computer can run the program. But your computer's speed is not our primary concern here.

Execution Time vs. Efficiency

Note that the time it takes to execute a program on your computer depends on lots of different variables, including your CPU, bus speed, whether output is displayed, the programming language used, etc. In this class, we're not concerned with your hardware or how quickly your computer runs.

We're concerned with how efficiently an algorithm performs a task as the size of the task increases.

Let's write a new program, and look at how the time it takes to run compares with the size of the task it has to do.

Problem: Summing a Varying Range of Values

Write the program sum_range_of_values.py. Can you identify what the graph of the data produced by this program looks like?
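
One possible shape for this program, assuming we test a series of increasing range sizes (the specific sizes here are arbitrary choices of ours):

#!/usr/bin/env python3

import time

for n in range(100000, 1100000, 100000):    # arbitrary sizes: 100k up to 1M
    start = time.time()
    total = 0
    for i in range(1, n + 1):
        total = total + i
    stop = time.time()
    print(n, stop - start)                  # (size, time) pairs, ready to graph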

2.3. Big-O Notation

We can determine the efficiency of an algorithm in two ways:

  1. Empirically, by writing it, running a series of tests on it, and measuring the time each one takes to run, or
  2. Theoretically, by considering the number of steps involved in running the algorithm, where a "step" is usually an assignment statement. It takes time for a computer to assign a value to a variable, so if we count the number of times that happens, we'll have a good measure of how much time it takes the computer to complete the task.

Time: A Measure of Efficiency

In determining efficiency, one thing we can examine is how the time of execution T changes as a function of the task size, n.

How can we go about determining T(n)?

2.3.1. Measuring T(n)

One way of determining T(n) is simply to write a program and measure it on your computer.

Let's try it.

Testing an algorithm

Write a program that uses a loop to calculate the sum of values 1 to n.

sum = 0
for i in range(1, n + 1):
    sum = sum + i

Once you've written the program, import the time module and use it to determine how much time it takes to calculate that sum.

Run a series of tests, adding all the numbers from 1 to 100000 (one-hundred thousand). Are the results the same each time? Are they relatively close? What might account for any variation in the times?
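
A sketch of one way to run repeated trials of the same measurement (the number of trials is our own choice):

import time

n = 100000
for trial in range(5):              # run the same measurement several times
    start = time.time()
    sum = 0
    for i in range(1, n + 1):
        sum = sum + i
    stop = time.time()
    print("trial", trial, ":", stop - start)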

Increasing the size of the problem

Continuing the summation problem from before, how much longer does the program take to run if we double the number of numbers we're summing (i.e., add the numbers from 1 to 200000)? Does the execution time double?

What if we add the numbers from 1 to a million, 10 times the number of values? Does the execution time increase correspondingly?

What would the graph of execution time vs number of values look like?

2.3.2. Analyzing T(n)

A second approach to determining T(n) is a more thorough, formal analysis.

As mentioned above, if the most "time-expensive" instruction is assigning a value into memory, we can get a very good measure of how efficient a program is by counting up how many times an assignment instruction is executed.

Let's take a look at the summation algorithm we wrote above.

Analyzing the summation algorithm

Here's the code we were running above.

sum = 0
for i in range(1, n + 1):
    sum = sum + i

How many assignment instructions were in that code?

Answer

  • the line sum = 0 is one assignment
  • the index variable i is assigned n values
  • the variable sum is assigned a new value n times, once for each pass through the loop

The total number of assignments, then, is 2n + 1.

We can write T(n) = 2n + 1.

Analyzing a multiplication table

Take a look at this code which places products into a "list of lists" that has been previously set up. Calculate the T(n) for filling the table.

for row in range(n):
    for col in range(n):
        table[col][row] = col * row

How many assignment instructions were in that code?

Answer

The variable row is assigned n times, and the variable col is assigned n * n times (n assignments for each of the n values of row). The line table[col][row] = col * row is a single assignment statement, but it is executed n * n times, once for each (row, col) pair. Therefore, the total number of assignments is n + (n * n) + (n * n), or 2n² + n.

We can write T(n) = 2n² + n.
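
We can check this count empirically by instrumenting the loops with a counter (a sketch; n = 100 is an arbitrary choice):

n = 100
assignments = 0
table = [[0] * n for i in range(n)]     # set up the "list of lists"
for row in range(n):
    assignments += 1                    # row receives a new value
    for col in range(n):
        assignments += 1                # col receives a new value
        table[col][row] = col * row
        assignments += 1                # one table entry is assigned
print(assignments, 2 * n**2 + n)        # both print 20100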

2.3.3. Big-O notation

While the time function T(n) is occasionally useful, computer scientists typically use a different measure of efficiency, called "big-O notation." The "O" refers to the order of the time-based polynomial function that is dominant in determining the execution time.

If we're trying to determine the efficiency of T(n) = 2n + 1, you can see that as n gets very large, the constant 1 becomes less and less significant to the total time. The coefficient 2, likewise, is a constant factor: it scales the time but doesn't change how the time grows with n. To simplify the analysis of this function, then, the computer scientist classifies it as O(n). We might say "The running time for this algorithm is O(n)," meaning that the time increases linearly with the size of the data set.

The multiplication table took time T(n) = 2n² + n. The order of this function is two, so the big-O representation of this algorithm would be O(n²).
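
A quick numerical check shows why the lower-order term stops mattering as n grows (a minimal sketch):

for n in [10, 1000, 100000]:
    t = 2 * n**2 + n                # the full time function
    print(n, t / n**2)              # ratio approaches the constant factor 2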

More complex big-O analyses

What is the performance of this T(n) function, expressed in big-O notation?

T(n) = 2 + (n * (3*n)) + (4 * n) + 3

Answer

The function simplifies to T(n) = 3n² + 4n + 5.

While the 4n and the 5 might be significant with regard to actual time, in determining the performance with increasing data size n, we are only concerned with the highest-order term, 3n².

We would indicate that this algorithm has a performance of O(n²).

In some cases, an algorithm's performance may depend upon the nature of the specific data given to it: a sorting algorithm, for example, may be much faster if the data is already sorted. Other sorts may take the same amount of time regardless of the order, and some might even perform worse if the data is already sorted. You may see references to a worst-case performance or an average performance when looking at big-O values. In most cases in this course, we'll be referring to average performance.

2.3.4. Common Big-O functions

Here are some of the more common big-O functions, listed in order of decreasing efficiency.

In conversation, one typically says "big oh of log n", or "big oh of n-squared", or simply "n-squared performance" or "has constant performance."

f(n)          Name
O(1)          Constant
O(log n)      Logarithmic
O(n)          Linear
O(n log n)    Log Linear
O(n²)         Quadratic
O(n³)         Cubic
O(2ⁿ)         Exponential

We'll take some time to look at these different orders as we work our way through the course.

2.4. Anagram Detection

An anagram of a word consists of the letters of the original word rearranged into a different order.

The word "cat" is an anagram of "act", and the words "genuine" and "ingenue" are anagrams of each other. "catt" and "cat" are not anagrams of each other, nor are "catt" and "ccat". The same letters and the same number of those letters must appear in both anagrams.

How to identify anagrams?

Given two words word1 and word2 that are possibly anagrams of each other, how would you check them to identify whether they are or not?

It may help to consider a concrete example. Are these two "words" anagrams of each other? How could you determine that?

word1 = "tyvcunaazhbeaqjolinucylllztjydjvdhfhsczdayrrmxkmhizpodhwkxlilkypllwqplothnysaoxgpmgatwtjqhdzzpzwkeed"
word2 = "hljyzcqxmlmtlznnkjulacpszlarehpzemnkojdvowtpyawliaphyodtdldqhkpjvobigyukethraxhylhwfzdzadwyqgtcsizlx"

2.4.1. Anagram Detection Strategies

The following strategies are all valid for identifying anagrams.

  1. Checking off letters one-by-one
    Go through word1, take each letter found, and look through word2, checking off the found letters in word2 as you go.
  2. Sort and Compare
    Sort the letters in word1 and word2 and compare the two words that result.
  3. Brute Force
    Generate all the possible orderings (permutations) of the letters in word1 and see if word2 matches any of them.
  4. Count and Compare
    Count the frequency of each letter in word1, and compare it to the frequencies of the letters in word2.
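
As a quick illustration of how little code some of these strategies take, here's a sketch of strategy 2, Sort and Compare (the function name is our own):

def is_anagram_sort_and_compare(word1, word2):
    # sorted() puts each word's letters into a canonical order;
    # anagrams produce identical sorted lists of letters
    return sorted(word1) == sorted(word2)

print(is_anagram_sort_and_compare("genuine", "ingenue"))    # True
print(is_anagram_sort_and_compare("catt", "ccat"))          # False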

The real question, of course, is which strategy do you think might be most efficient?

Anagram Analysis, Strategy 1

Write the program anagram_analysis.py, and test it.
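
If you'd like a starting point, here's one possible shape for the checking-off logic of strategy 1 (a sketch only; the names are our own, and the timing code is left to you):

def is_anagram_checkoff(word1, word2):
    if len(word1) != len(word2):
        return False                # different lengths can't be anagrams
    letters = list(word2)           # a mutable copy we can check letters off of
    for ch in word1:
        if ch in letters:
            letters.remove(ch)      # check off the matched letter
        else:
            return False            # a letter of word1 isn't (still) in word2
    return True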

2.5. Performance of Python Collections

When creating a new programming language, decisions have to be made in how the features of that language will be implemented.

Will that language have static or dynamic typing? Will int values be of a fixed size or will they be dynamically implemented? Etc.

Sometimes the language will have multiple ways of accomplishing something—creating a list of values, for example—in which case it may be helpful for you to know which strategy will be most efficient for a given purpose.

2.5.1. Creating a list of values

Which of these four strategies is best for creating a list of int values 0-999? How could we figure out which one is fastest?

# strategy 1
l = []
for i in range(1000):
    l = l + [i]         # concatenates a list to another list
    
# strategy 2
l = []
for i in range(1000):
    l.append(i)         # classic appending to a list
    
# strategy 3
l = [i for i in range(1000)]    # list comprehension strategy, functional style

# strategy 4
l = list(range(1000))           # using Python's list and range operators

Let's write a program to try each of these strategies and compare the results.

Here's a start:

#!/usr/bin/env python3

"""
listfuns.py
This program compares the performance of the various
list creation strategies.
Miller & Ranum, Ch 2
@author Richard White
@version 2017-02-21
"""

import time

# strategy 1
def test1():
    l = []
    for i in range(1000):
        l = l + [i]

# strategy 2
def test2():
    l = []
    for i in range(1000):
        l.append(i)

# strategy 3
def test3():
    l = [i for i in range(1000)]

# strategy 4
def test4():
    l = list(range(1000))


def main():
    NUM_OF_TESTS = 1000
    
    # Testing strategy 1
    total_time = 0
    for i in range(NUM_OF_TESTS):
        start = time.time()
        test1()
        stop = time.time()
        total_time += stop - start
    print("test1", total_time / NUM_OF_TESTS)

    # Testing strategy 2
    total_time = 0
    for i in range(NUM_OF_TESTS):
        start = time.time()
        test2()
        stop = time.time()
        total_time += stop - start
    print("test2", total_time / NUM_OF_TESTS)


if __name__ == "__main__":
    main()

Testing list-filling strategies

Complete the program above with timers for the two other strategies.

Which list-creation strategy turns out to be fastest?
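
One way to avoid writing the same timing loop four times is to factor it into a helper function (a sketch that builds on the program above, where time is already imported; the name time_test is our own):

def time_test(test_function, num_of_tests=1000):
    """Returns the average execution time of test_function
    over num_of_tests runs."""
    total_time = 0
    for i in range(num_of_tests):
        start = time.time()
        test_function()
        stop = time.time()
        total_time += stop - start
    return total_time / num_of_tests

# in main():
#     for name, fn in [("test1", test1), ("test2", test2),
#                      ("test3", test3), ("test4", test4)]:
#         print(name, time_test(fn))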

2.5.2. Time vs. Performance

In the problem above we timed how long each strategy took, but we haven't actually determined the performance based on increasing size.

How to measure Big-O performance

In order to measure performance empirically, you need to run a series of tests with increasingly large data sets to determine how the algorithm performs as a function of size of data set.

Let's write a program to see what the performance is of the first list creation strategy, the concatenation strategy.

#!/usr/bin/env python3

"""
performance_test1.py
This program analyzes the performance of the 
list concatenation operation.
@author Richard White
@version 2017-02-21
"""

import time

def test1(num_of_values):
    """Creates a list of the specified number of values
    by concatenating lists.
    """
    l = []
    for i in range(num_of_values):
        l = l + [i]

def main():
    LOWERLIMIT = 1000
    UPPERLIMIT = 10000
    NUM_OF_TESTS = 10
    
    print("Num of tests","Execution time")              # prints results on monitor
    outfile = open("list_analysis_results.csv","w")     # stores results in file
    outfile.write("Num of tests,Execution time\n")
    
    for i in range(LOWERLIMIT, UPPERLIMIT, int((UPPERLIMIT - LOWERLIMIT) / NUM_OF_TESTS)):
        start = time.time()
        test1(i)
        stop = time.time()
        print(str(i) + "\t" + str(stop - start))        # results on monitor
        outfile.write(str(i) + "," + str(stop - start) + "\n")      # store in file
        
    outfile.close()             # close file once tests are done

if __name__ == "__main__":
    main()

Big-O performance of concatenation

Run the program above and use the data produced to create a graph of the execution times. Based on the results, what would you guess the Big-O performance is of the concatenation operation?

Now create a new performance test, performance_test3.py, that analyzes the performance of strategy 3, the list comprehension strategy. What does the graph of those results look like? What would you estimate the Big-O performance of the list comprehension strategy to be?

2.5.3. Data Type comparison: list vs dict

Python includes two data types designed for working with collections of data: the list and the dict. (There are other collection types as well.)

Lists are good for keeping track of data whose order needs to be maintained, while dictionaries are organized around keys rather than positions: each value in a dictionary is identified by its key.

Again, there are trade-offs between the two collection types, and which one you choose for a given program will be determined by your needs.

Both lists and dictionaries allow for adding items to a collection:

import random

l = []
l.append(random.randrange(1000))

d = {}
d[random.randrange(1000)] = 1
# the line above adds a dictionary key with a value of 1,
# which we're not really using in this analysis

... and both lists and dictionaries allow for items to be removed from a collection:

l.remove(7)
# removes the value 7 from the list (assuming it exists)

del d[7]
# removes the key 7 (and its value) from the dict (assuming the key exists)

Which is faster? Removing items from a dictionary, or removing items from a list?

del vs. remove

Write a program del_vs_remove.py that does the following:

  1. Create two large (n >= 10000) collections, one list and one dictionary, each containing the same set of randomly generated integers (in the range 0-10000).
  2. Time how long it takes to try to remove one thousand randomly generated numbers from each collection type and report the results. (One possible starting point is sketched after this list.)
  3. It may be that one of the randomly generated numbers to be removed won't be found in your list or dictionary. To avoid the error that will be thrown in this case, use the try-except error trapping facility in Python:
    try:
        l.remove(value_to_remove)
        # print("Removed from list")
    except ValueError:
        # print("Couldn't find random value in list... continuing")
        pass
    Note the commented lines that I was using when I was testing my program to make sure it was working the way it should!
  4. In this basic version of the program, we're only timing the two operations. Which one is faster?
  5. Modify your program to collect data with collections of increasing size. What is the performance, expressed in Big-O notation, as you use del and remove on lists and dictionaries of increasing size?
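
Here's a possible starting point for steps 1 and 2 (a sketch only; the variable names and the specific sizes are our own choices, following the problem statement):

#!/usr/bin/env python3

import random
import time

n = 10000
values = [random.randrange(10000) for i in range(n)]

l = list(values)                    # the list version of the collection
d = {}
for v in values:
    d[v] = 1                        # duplicate values simply collapse into one key

to_remove = [random.randrange(10000) for i in range(1000)]

start = time.time()
for value in to_remove:
    try:
        l.remove(value)
    except ValueError:              # value wasn't in the list
        pass
list_time = time.time() - start

start = time.time()
for value in to_remove:
    try:
        del d[value]
    except KeyError:                # key wasn't in the dict
        pass
dict_time = time.time() - start

print("list remove:", list_time)
print("dict del:   ", dict_time)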