Unit 4 - Sorting and Searching Algorithms
Table of Contents
- Intro to Algorithms; Selection Sort
- BubbleSort; Big-O notation
- Linear Search
- Binary Search
- Nested Loops
- Debugging
- Content Review
4.0. Introduction to Algorithms; Selection Sort
4.0.0. Overview
We've spent the first weeks of this class learning a little about Python's basic data types (int, float, str), Python's decision-making control structures (if, if-else, if-elif-else), Python's looping control structures (while, for), and the list data type.
Now let's learn how to use those tools to actually build some working algorithms.
4.1. Computational Thinking, Problem Solving, and Algorithms
Computational Thinking
Computational thinking is a problem-solving approach that is useful in learning how to program a computer, but also has applications in many other problem-solving contexts. Computational thinking includes:
- formulating problems in such a way that a formal problem-solving approach may be applied
- breaking down a problem into smaller components
- logically organizing data
- looking for patterns in data
- logically analyzing data
- using abstract models and simulations to represent data
- developing algorithms—step-by-step strategies—to solve problems
- considering efficiency as a factor in a problem solution
- applying a solution to more generalized situations
We've been using the principles of computational thinking almost every day in this course as we learn how to think about problems in a way that will help us better understand how to write code that will solve the problem.
Definition: Algorithm
An algorithm is a step-by-step strategy for solving a problem, usually discovered or developed by breaking a problem down into individual components. By understanding the pattern of problem presented, we can develop an abstract strategy for solving that type of problem.
Think of an algorithm as a "recipe," with instructions on how to do something.
What algorithms have we developed in here? The Sieve of Eratosthenes is one example. Based on the idea that prime numbers are numbers that don't have any multiples in them, we created a list of numbers, and starting at number 2 (the lowest prime), we identified that as a prime, and took out all the other numbers on the list that were multiples of that. Then we moved up to number 3, identified that as a prime, and took out the multiples of that number as well, and so on. That strategy, that recipe, that algorithm is one way of creating a list of prime numbers.
An algorithm may be expressed as a series of instructions, as above. It might be a numbered list of items in a procedure. It might be encoded as a flowchart, or written as pseudocode. For an algorithm to be expressed as code, of course, it eventually has to get written in a form that the computer will understand.
Show/hide Sieve of Eratosthenes
4.1.1. Sorting and Searching
A classic way to learn about algorithms is to examine how one performs two very important computer operations: sorting, and searching.
We'll begin with the Selection Sort.
4.1.2. The Selection Sort algorithm
Selection Sort
The Selection Sort algorithm works as follows:
- Begin with a list of unsorted items.
- Beginning with the first item on the list, go through and find the location of the smallest value in the list.
- After finding the location of the smallest value, exchange the value at that smallest location with the value at the start of the list. (Smallest value on list is now first in the list.)
- Repeat, starting at 2nd position: go through list, find location of smallest item on list, and place it in 2nd position.
- Go through entire list in this fashion until done.
Let's write a Selection Sort algorithm in Python.
selection_sort.py
Write a program that:
- creates a list called num_list of 20 items, each one a random integer between 0 and 1000, inclusive.
- applies a Select Sort to this list. Display the list onscreen as your program works, and included a pause statement that requires you to hit the <Enter> key to advance the program each step.
- prints out the final sorted list.
4.1.3. Refactoring
Once you've written the basic program, let's refactor it.
Refactoring
"Refactoring" a program is the process of rewriting it so that it works differently, often more efficiently, then the original version of the program.
You've probably written this program with a single main() function running the sorting process. Let's refactor your Selection Sort.
In this case, let's just split the program up into three logical pieces. We'll have:
- a main() function that establishes the initial values that our program will need: how many values we're going to be sorting, and what the range of those values are, for example.
- a create_random_values() function that is responsible for creating the list of random values, and
- a sort() function that actually sorts the values in our list.
Refactoring selection_sort.py
Rewrite your selection_sort program so that the the process of creating the list of numbers and the process of sorting a list of numbers each occur in separate functions.
4.1.4. Timing your program
Once the basic program is working, proceed through the following modifications of the program so that we can examine it in more detail.
Selection Sort continued
- In the selection_sort.py program, comment out your program's pause statement so that it will run without interference.
- Modify the program so that the list contains 1000 items rather than just 20.
- A major source of slowdown for the program is having to print out the list during each iteration. Comment out the print statement in the loop, and print only the final sorted list.
- You can determine more precisely how long it took the program to sort the items by including a time analysis like this.
4.2. Bubblesort, and Big-O Notation
4.2.0. Overview
The SelectionSort is one way of working through a list of items and putting them in order. Are there any other ways? Other strategies that we could use? Other algorithms that we could program?
And are some of the strategies faster than other, or more efficient? How would we know? You're running a Mac, I'm running a PC, that guy is running Windows 10, that girl is running Linux... We can't just time how fast the programs work, can we? That would measure the processing speed, and not the algorithm speed.
How do we determine which algorithm is better when we're all running it on different computers?
Let's find out!
4.2.1. The Bubble Sort algorithm
Bubblesort Algorithm
The Bubble Sort algorithm works as follows:
- Begin with a list of unsorted items.
- Start with the first item in the list, and compare it with the one right after it. If the second item is smaller than the first, swap them. (At this point, the higher of the two items has been "bubbled" up in the list.)
- Continue on to the second item, and compare it with the third, swapping them as required like before. Continue through the entire list in this fashion. By the time you've passed through the list this first time, the highest value will have bubbled up to the end of the list.
- Repeat this process again, passing through the list and swapping items as necessary.
- Continue passing through the list in this fashion until no more swaps are performed. At this time the list is sorted.
Let's write a BubbleSort algorithm in Python.
BubbleSort
Write a program that:
- creates a list called numList of 20 items, each one a random integer between 1 and 1000, inclusive.
- applies a Bubble Sort to this list. Display the list onscreen as your program works, and included a pause statement that requires you to hit the <Enter> key to advance the program each step.
- prints out the final sorted list.
Once the basic program is working, apply modifications to the program as you did above to determine how quickly your program can perform its sorting.
4.2.2. Big-O notation
Do you have any sense of which of these programs is better, and under what conditions? Which one is faster? Which one is more efficient?
One aspect of computer science is determining the relationship between the amount of data we feed an algorithm and how quickly that algorithm runs. Although computers are able to perform individual instructions very quickly, programs that require the execution of thousands, millions, and billions of instructions are obviously going to take longer, and execution time becomes a concern.
Which sorting algorithm above performed a sort of 20 numbers more quickly? Which algorithm performed a sort of 10,000 numbers more quickly?
Does the performance of these algorithms vary under certain conditions? What if the list to be sorted is already partly sorted? What if it's completely reversed? Does this affect the time it takes for the algorithm to work?
Big-O notation
To analyze the performance of a specific type of algorithm, computer scientists use "Big-O notation," which describes the number of instructions that must be performed as a function of the size of the data set.
Example: "Selection Sort is an O(n2) 'Oh-n-squared' algorithm." If the number of values to be sorted (n) triples, then the time to perform the sort will increase by a factor of n-squared, or nine.
There are three ways that we can determine the O-value of an algorithm.
- We can construct a mathematical proof of its operation parameters.
- We can do a quick-and-dirty analysis on a simplified version of the algorithm.
- We can conduct experiments that actually collect time measurements of the algorithm's operation.
Because we've already built a timer into our programs above, let's look at some data collected for a Selection Sort algorithm working to sort a random list of words.
All times were collected using a Selection Sort running in Python 2.7.2 on an Apple Macbook Pro with a 2.66GHz Intel Core i7 dual core processor.
list size | time (s) |
100 | 0.00186 |
500 | 0.020346 |
1000 | 0.069943 |
2000 | 0.26478 |
4000 | 0.577992 |
5000 | 1.63441 |
10000 | 6.392076 |
15000 | 15.56388 |
And here's a graph of those list sizes and times:
Notice that the data points line up almost perfectly along a quadratic curve, clearly indicating that Selection Sort is an O(n2) algorithm.
Bubble Sort efficiency
Collect data on the Bubble Sort algorithm by running it with data sets of varying size. Then do an analysis of your data points.
Is Bubble Sort an O(n2) algorithm also? an O(n) algorithm? an O(n log n) algorithm?
4.3. Linear Search
4.3.0. Overview
A linear search is just what it sounds like: given a list of values (sorted or unsorted), go through the list one item at a time, starting at the beginning, and checking each value in turn until you either:
- Find the item on the list, in which case you indicate its location, or
- You go through the entire list and don't find the number, in which case you indicate that the search term has not been found on the list.
Why would we want to do this? It turns out that knowing the location of an item in a list is an important element in more complex algorithms and problems.
A linear search is not usually the most efficient means of finding an item in a list, but it's a good start for us at this point.
Let's see how to write one.
linear_search.py
Write a program that uses a function to create a list of random numbers. Include:
- A main() function that calls the functions identified here.
- A create_random_values() function that creates a list of random values
- A search function takes two parameters: a list of numbers, and a number to search for. The "search" function returns the position (index) of the number in the list, or a -1 if the number is not on the list.
4.4. Binary Search
4.4.0. Overview
Sorting is one classic context for exploring different types of algorithms. You might be interested to know that Python actually provides a .sort() method that works on lists. It's built in to the language, and uses a Quicksort algorithm to sort any list you give it.
Pretty cool!
The next context for examining algorithms is the process of searching. Given a list of items, how do you go about finding an item on the list?
Imagine the various collections of things you have in your life: your music (if you have MP3s or CDs), your photos on your cellphone, your files on your computer, your notes for class... How do you look through those collections of items to find stuff? Do you keep those things ordered so that you can easily find one item in the collection, or do you have them piled up in various folders/piles?
Computers need to be able to find stuff. Let's see how we can search for things.
4.4.1. Searching
Okay, so we've got our lists of items sorted. How do you quickly go about finding something on the list? Just flipping through the list sequentially, starting with the first item and continuing until you get to the end doesn't seem very efficient, especially if you've got a very long list. This is a legitimate search strategy—it's called a Linear Search—but it's probably not the one we'd want to use here.
Binary Search algorithm
The Binary Search algorithm works as follows:
- Begin with a list of sorted items and a "search item" that you're looking for.
- Compare the search item with the item at the middle of the sorted list.
- If the items match, you've found the location of that item!
- If the search item is greater than the middle item of the list, the bottom half of the sorted list can be eliminated from consideration. The top half of the list will become our new search space.
- If the search item is less than the middle of the list, the upper half can be eliminated, and the bottom half of the list will become our new search space.
- Continue searching through the smaller and smaller sections of lists until either the search item is found, or it is not found.
The Binary Search can be tricky to implement. Challenges include identifying the middle of a list of an even number of items, and keeping track of upper- and lower-bounds of the search space.
There are also more sophisticated concerns, like "what happens if you have a billion items to search through?" Large lists like this can tax an algorithm, and even the experts can get tripped up.
We'll keep things simple in here.
4.4.2 Implementing BinarySearch
Once you think you know how a binary search works, take a few moments to see if you can write and debug a Python version of it.
BinarySearch
Write a program that:
- begins with a sorted list l = [1, 2, 3, 4, 7, 9, 13, 14, 20]
- asks for a search item (a number between 1 and 20).
- performs a Binary Search for that item in your list, displaying the results as it goes.
- Print out the location of the search item, if it exists. Otherwise, print out a "search item not found" message.
Test your program with the following lists to see if it works. When testing your program, be sure to look for an item that is on the list, and also an item that isn't on the list.
Search patterns to try | ||
List to search | Item to look for | Expected result |
[1, 2, 3, 4, 7, 9, 13, 14, 20] | 7 | Position 4 |
[1, 2, 3, 4, 7, 9, 13, 14, 20] | 1 | Position 0 |
[1, 2, 3, 4, 7, 9, 13, 14, 20] | 20 | Position 8 |
[1, 2, 3, 4, 7, 9, 13, 14, 20] | -2 | Search item not found |
[1, 2, 3, 4, 7, 9, 13, 14, 20] | 23 | Search item not found |
[1, 2, 3, 4, 7, 9, 13, 14, 20] | 10 | Search item not found |
[4, 7, 9, 13, 14, 20] | 9 | Position 2 |
[4, 7, 9, 13, 14, 20] | 13 | Position 3 |
[4, 7, 9, 13, 14, 20] | 20 | Position 5 |
[4, 7, 9, 13, 14, 20] | 2 | Search item not found |
[4, 7, 9, 13, 14, 20] | 10 | Search item not found |
[4, 7, 9, 13, 14, 20] | 22 | Search item not found |
4.5. Nested Loops
4.5.0. Overview
Sorting and Searching algorithms really give you a chance to get some practice with one-dimensional lists. The next logical step is to look at a "list of lists"—a two-dimensional array.
4.5.1. Basic nested loops
You may recall that a basic nested loops consists of a loop inside a loop. One example would be using variables for ones and tens to count from 0 0 to 9 9:
Another example is the Number Guessing Game. We had an inner loop that allowed the user to repeatedly guess a secret number, and an outer loop that asked them if they wanted to play again.
4.5.2. Lists of lists
We've been able to keep track of a list of data, what you might call a row of data:
A list can keep track of any kind of information: numbers, strings, ... and even lists. This "row" of data could hold three lists like this:
Let's make sure we know how we can refer to these values. If lst[1] is [5,6,7,8], what is lst[1][2] ?
It's the item index 2, in [5,6,7,8], the number 7.
In fact, we can refer to any element in the list-of-lists by referring to the index of the main list, and then the index of the element in the "sub-list."
4.5.3. Re-thinking Lists of Lists
The list of lists we've been looking at is this one:
This is fine, but you can also write the same list like this:
When we arrange the lists like this, it's easy to think of the big list of 3 items as representing three rows, and the sub-lists in each row as representing four columns of data.
And that's exactly how we're going to use them.
4.5.4. Setting up a list-of-lists in Python
Here's one way to fill up a list of lists, assuming we want it to be height rows high and width cols across:
4.5.5. Printing out a list-of-lists in Python
Once you've got a list of lists all set up, you can print it all out using something like this:
4.6. Debugging
From your very first program you know how easy it is to make mistakes when trying to get the computer to do something for you.
You may already have developed some sense of what kind of strategy you can use to figure out what's not working. Nevertheless, given that we spend so much of our time debugging, double-check that you're being as efficient as possible when fixing your code.
Here's how.
4.6.1. Debugging Strategies
Our programs have started to get pretty complex... and they're only going to get more complex from here on out. It's useful to have a few tricks that you can use when writing and debugging your program. You'll quickly learn the advantages of each of these as our problems become more difficult.
Coding and Debugging Strategies
- Top-Down Design
Begin writing your program at the top-level, and then work your way down to the finer parts. We did this with our secretcode program when we wrote the menu part first, and got that working, before we proceeded to working on what each menu item actually did. - Bottom-Up Design
Begin writing your program at the lowest-level—get one small part of the program to work first—and once that's working, start working your way up to the larger structure of the big program. In the craps program one of the very first things we did was figure out how to roll the dice and add them up. After that, we continued building the rest of the larger program that used that dice-rolling code. - "Break Points" in your code
Often, despite your best efforts, you'll find that your program isn't behaving as you expected it would. In these cases, it's a good idea to find out what's happening in your program by adding break points.
Break Point
A break point is a stopping place in a program that is used during debugging to understand the status of the program at that point.
Integrated Development Environments (IDEs) provide a programmer with the capability to create breakpoints automatically, but because we're coding "by hand," you'll need to write your own breakpoints. Here's how you do that:
- Insert a print statement to indicate the status of variables that you're concerned about:
print("DEBUG: a =",a,"and b=",b) print(locals()) # an easier, fancier way of doing this in Python
- Insert an input statement so that the program stops to wait for user input:
print("DEBUG: a =",a,"and b=",b) print(locals()) # an easier, fancier way of doing this in Python pause = input()
We don't actually care about any input here, but this statement has the effect of pausing the program for a moment so that we can read the status of the variables, and hopefully figure out what might be causing the program to misbehave.
Once the problem has been solved, you can comment out or remove these statements entirely so that only the clean code is in your program.
4.6.2. Example of Debugging: Binary Search
If I'm writing the Binary Search program described above, I might think I have a pretty good idea of how my program is working, and yet... I'm not getting the right results. You don't need to know about the Binary Search algorithm to understand the point here. Just look at my (not working yet) code initially:
This version of the program is producing output but the results are incorrect. How can I go about diagnosing what's going wrong with my program?
Here's a new version of the program, with additional print statements and a pause built in so I can see what's going on. The breakpoint code is highlighted in bold.
All those extra print statements slow down the program, and having to hit the [Enter] key every time the loop is definitely not a feature that I'll want in my final version. But in the meantime, I can use the breakpoint—the print statements along with the pause—to let me examine the inner workings of the program and figure out what's going on. (It turns out that I should have said if upper < lower, not if upper == lower.) And once I've got it working, I'll either remove those statements completely, or just comment them out in case I think I might need to use them again later on when I add a new feature to the program.
4.7. Review of the First Quarter
This just about brings us to the end of some of our major work in learning how to program. At this point you know the major strategies and structures for writing instructions in Python.
Here's a quick summary of what we've covered already in just the first few weeks of study.
Some major elements of programming in Python
- Output:
- the print() statement
- the print("abc", end = '') statement
- Data types:
- integers for whole numbers
- floats for decimal values
- strings for sequences of characters (enclosed in single (') or double (") quotes)
- lists for sequences of data (enclosed in square brackets [])
- Data structures:
- variables of various types as listed above
- lists as listed above
- functions, defined using a def functionName construcdtion
- objects (not yet covered)
- tuples (not covered in this course)
- Input:
- the input() statement for numbers
- the input() statement for strings
- Loops:
- the for loop when you known how many times the loop needs to repeat
- the while loop when you'll use a Boolean value (True or False) to decide when the loop should end
- Decision-Making:
- the if statement for a simple decision
- the if-else statement for an either this-or-that decision
- the if-elif-else statement (or nested if - else statements) for more complex decisions
There is a lot more to programming in Python than this, but we've made a very good start, and we're already in a position to write some complex and powerful programs.