CS50x Week 7

scanf()

The first argument of scanf is %i, a specifier for integer, scanf expects an integer from the terminal, when an integer is encountered, it will be stored in x. The second argument is the address of x.

This above shows how scanf can be used to input data into array. In each iteration, one number is stored in array x and one character is stored in array y.

Structures

By combining structures and scanf, we can input data into structures. A structure data type with 2 members is declared and is typedef into person. Next, an array class of datatype person is declared. Think of each element in the array class as having a data type person. Then, use a loop to loop over each element, for each element, use scanf to input student age into class[i].age and student gender into class[i].gender, where i is the looping index from 0 to 2.

Write data to file

From the structure section of this post, we learnt how to input data into an array of data type struct. Now, we would like to store the data in a file in our computer. The code above is similar to previous code, only the “save class to disk” and “free memory” code is added.

We first declare a pointer file of type FILE. Next, we use fopens() to open the file using a given mode. fopens() accepts file name and mode as arguments. In this case, our file name is class.csv and mode is w, w stands for write mode. There are other modes that we can use, such as ‘r’ which stands for read, and ‘a’ which stands for append. Next, we loop over each element of array class, and use fprintf to write to our file. Lastly, we close our file using fclose().

As we learnt from previous week, we have to free memory after we use them to prevent memory leak. The last segment of code shows just that.

A file called class.csv will be created in the same folder as this c file, inside the class.csv file, there will be age separated by gender. .csv file stands for comman-separated value.

Forensics

Hard disk (other than SSD) has tiny little magnetic particles on its surface, up and down represents 0 and 1. Inside your computer, there is a file that remembers maps the filenames to their memory address, e.g. resume.doc to address 0x123, and each byte on the hard disk is one memory address.

Laptop hard drive exposed by Evan-Amos is licensed under CC BY-SA 3.0
Perpendicular Recording Diagram by TyIzaeL

When you delete a file, what the computer does is erase the filename, so that you don’t know what file that particular byte is for, but the 1s and 0s are still there. The 1s and 0s will still be there until another file overwritten it. So before they are overwritten, there is a chance to recover this data.

Linked list

Disadvantages of array
1. Need to know in advance how big it is
2. Not easy to insert element into middle of an array. To insert an element into middle of an array, we have to make a new array, put the element in the position we want, and shift elements after the element by one place.

array
array

Linked list
Linked list

So, we use linked list, instead of elements, we have nodes. Each node has data and reference. It is much more convenient to insert new data into linked list than array (as you will see in the next section).

Each node is a struct node data type and contains the data we want to store, n, and a pointer next with data type struct node* that points to the next node.  The first pointer (the Head) points to the first node. The linked list ends with the last pointer pointing to NULL.

As you can see, linked list requires extra space for pointers, this is a disadvantage of linked list. But the disadvantage is offset by the fact that linked list is more flexible.

* Remember that the pointer points to the entire node instead of just the data of the node, because a node is considered as one thing of data type struct node.

Inserting data into linked list

To insert a data into a linked list, we first have to allocate memory of size node to newNode. The n part of newNode will be the data, and the pointer part of newNode will be pointing at NULL. Then, we make the newNode’s pointer points to the next node and the previous pointer points to the newNode .

Linked list adding node by Derrick Coetzee

Searching linked list

Besides inserting node into linked list, we would also want to search data from linked list.

The function search takes in the number you want to search, n, and the pointer to the first node in the linked list, list as arguments. Then a pointer ptr is declared to be used to walk through the list, it is first assigned the first pointer. While the ptr does not reach end of linked list, indicated by NULL, we check if the data of node the pointer is pointing at, ptr->n is n, if yes, we return true, if no, we assign the pointer of node that the pointer is pointing at, ptr->next to ptr. Return false, if NULL is met and no true has been returned.

Pointer, ptr, data of node the pointer is pointing at, ptr->n, and pointer of node the pointer is pointing at, ptr->next.
Pointer, ptr, data of node the pointer is pointing at, ptr->n, and pointer of node the pointer is pointing at, ptr->next.

The ptr->n is a syntactic sugar to make things easier for us, it is the same as (ptr).n. Similarly, ptr->next is syntactic sugar for (ptr).next.

Freeing linked list

Code example of linked list

This code has 4 functions: delete, insert, search, and traverse data in linked list.

Code: http://d2o9nyf4hwsci4.cloudfront.net/2013/fall/lectures/7/w/src7w/list-0.c

And a header file with struct node declaration at http://d2o9nyf4hwsci4.cloudfront.net/2013/fall/lectures/7/w/src7w/list-0.h

Section video of week 7 teaches you how to code an insertion function into linked list.

Hash table

Hash table consists of array where data (key) will be inserted and hash function. The purpose of the hash function is to give an index (hash value) to the data, and data will be stored at the index location. The process of getting the index/ store location for a data is called hashing.

Hash function outputs an index where data will be or is stored
Hash function outputs an index where data will be or is stored

A good hash function has two properties. First, it should be deterministic, this means it will always output the same indices if we put the same string in, for example, in the case above, hash function will always return 2 if we pass “Paul” into it. Second, it should return valid indices, in the above case, the hash function is good if it output any values from 0 to 9, and no other values.

Lets have a hash table for strings of names. We use a simple hash function, where the first character of the string is the index number, e.g. string starts with A, go to index 0, strings start with B, go to index 1, and so on. Hash function can be more complicated than this, but we will stick to this for this case.

First case, we give the hash function the string “Alice”, the hash function outputs 0, so we store “Alice” at location with index 0. Then, “David”, stores at location with index 3. Next, “Aaron”, stores at location with index 0, however, the location has already been occupied by “Alice”. So we have a collision.

There are two ways to deal with collision. We can use linear probing or separate chaining. For linear probing, we move “Aaron” to the next location until an empty location is found, then we put it there. This, in worst case, iterate over every elements, takes n operations O(n), where n is the number of elements in the array, compared to O(1), or constant time, if we just put the string into an empty location.

Second way to deal with collision is separate chaining. Separate chaining proposes storing  a pointer pointing to the first node of a linked list in each location of array. So, at each location, is a linked list. When “Aaron” who was suppose to go into index 0, met with “Alice”, “Aaron” will be inserted into the beginning of the linked list at index 0, now the linked list at index 0 contains “Aaron” and “Alice, in that order. This solves the problem of collision. Note that since every collision is solved by inserting the data at the beginning of the respective linked list, the data in the linked list is unsorted, however, the benefit of this is that insertion is done in constant time, O(1).

Hash collision by separate chaining with head records in the bucket array. by Jorge Stolfi is licensed under CC BY-SA 3.0

Separate chaining takes constant time, O(1), to insert/ store data, compared to linear probing that has O(n) for insertion. Now we know separate chaining is better at inserting data than linear probing, how efficient is searching data in hash table with separate chaining? Let’s assume the worst case scenario where all the data is stored in a linked list in only one location of the hash table (all data hash to the same value), and the data is at the last of the linked list, we have to traverse n elements to get to the data, so O(n). Separate chaining takes constant time to insert data, but has a O(n) for search.

Tries

Trie is a different data structure than hash table.

CS50x Lecture 7, continued notes by Andrew Sellergren

Each node of the array is another array (this takes up a lot of memory, however, what we are interested here is the time taken to perform the task, and not memory). If we want to store a data “Maxwell”, we put ‘M’ at the main node, then ‘a’ at the second node, ‘x’ at the third node and so on. Since each node is an array, we only need to take k steps, where k is the length of the data to store it. So the bigO for insertion into trie is O(k), which when the data set gets larger, becomes O(1), constant time. How about searching data in trie? To search for “Poincare”, we first go to first node’s ‘P’, then ‘o’ of second node, and so on. This takes O(k) also, which means search in trie is also constant time, O(1). Trie wins against hash table in terms of search and insertion time.

The Δ symbol at the end is to flag the end of string.

An algorithm is said to be constant time, O(1) when the time it takes to solve the problem does not depend on the size of problem. For example, searching a string “Maxwell” in Trie always take the same amount of time no matter how many other names are in the Trie, be it 10 names or 1 million names.

Other posts in the series: Harvard CS50x 2014

Leave a Reply