The best programming language

When I want to take a break at work, I sometimes read technology forums. And there is one kind of post that I really like: the flame wars between programming languages. I like these posts because you can see passionate and smart people arguing as if their lives were at stake.

These posts have 2 advantages:

  • they make me laugh
  • I learn new stuff

 

If I had to sum up this kind of post, it would go something like this:

Post Title “Java is the best language” by NewJavaFanBoy

NewJavaFanBoy: Java is the best language because of its community. Moreover, it has really cool features like lambdas. Why so many people hate Java?

FormerJavaFanBoy: Oracle killed Java.

DotNetFanBoy: The evolution of Java is too slow, C# had lambdas a while ago. Moreover, some critical features like optional and named parameters are not in Java. Now that dotnet is more open sourced and can be run on Linux with Mono, Java is going to die.

TrollRoxxoR: BecauseJavaDevelopersDontKnowHowToWriteCode

RealG33k: Both your languages are for kids, C++ is way better but it’s for real developers only. Do you even know what SOLID means?

HipsterGeek: So old and lame … you should try Node.js, it’s based on asynchronous calls and it’s very fast.

LinusTorvalds: Pussies, a real developer uses C or assembly. You can’t have performances with those high level shits.

 

I hate PHP. I can’t explain why; it must be because I tried to learn it when I was 14 and it messed with my brain. But guess what, you’re reading this post on a server using PHP/NGINX (which is a kickass server by the way). I’m good with Java. So, I could have used a Java framework running on a fast fat JVM. But WordPress is a great platform. It’s often looked down on by purists but it clearly answers my needs. The aim of my blog is not to be the fastest in the world (though it has surprisingly but painfully survived 2 Hacker News and Reddit front pages involving 500 simultaneous connections). I just want a user-friendly interface where I can share my thoughts.

 

Which leads to my point: there is no best programming language, it depends on the situation.

 

1) Do you need performance?

If yes, what kind of performance are we talking about?

  • Seconds? Every language can do it!
  • Milliseconds? Every language with good programmers can do it.
  • Microseconds? At this point, you can rule out all the interpreted languages (like Python, which is a good language). I know that a well-tuned JVM with very good Java programmers can do it. I imagine that it’s the same for C#. Of course, a purely compiled language can deal with that.

But in all these cases the programmer’s skills are more important than the language.

  • Nanoseconds? Only assembly or maybe C can deal with that.

So, in most situations developers’ skills are what matters.

 

2) What about the ecosystem?

More than the language itself, the ecosystem is important.

I used Visual Studio during my studies and I was amazed by the coherence of Microsoft’s ecosystem.

 

Now, I’m more of an Eclipse guy. Even in the Java community, Eclipse is looked down on by purists, who now use IntelliJ IDEA. Eclipse is open source software developed by many different people and it clearly shows (in a bad way): compared to the coherence of Visual Studio, you’ll find a different logic in each Eclipse plugin.

But if having tools is great, knowing how to use them is better. For example, when I started in Java, I was very slow. I learned some Eclipse keyboard shortcuts by heart and it changed my developer life. I’ve also looked for useful plugins, and Eclipse has plenty of them because it’s a rich ecosystem.

 

3) What about online help?

Ok, you’re using your kickass programming language, but don’t tell me you know every side of this language by heart. Using a well-known language is useful when you need help. A simple Google or StackOverflow search and you get your answer from Ninja_Guru_666 and I_AM_THE_EXPERT. If you’re more of an in-depth programmer, you can also check the official documentation, assuming it covers the problem you’re facing.

 

4) What are the skills of the team?

If the developers don’t really know how a computer works, using a compiled language is a suicidal move. And, unlike the purists, I don’t see why knowing (exactly) how a computer works makes you a good developer (though, I must admit, it helps; but there are more important skills).

It’s better to use a known tool than the theoretically best one. Moreover, many developers are fanboys: using their preferred language will help them stay motivated on the project.

 

5) The business side

An objective point of view is to see what the most in-demand languages are. It doesn’t mean they’re the best, but at least you’ll get a job. In this case, Java, C#, PHP, SQL and JavaScript are clearly ahead of the rest (at least in France).

Moreover, as a technical leader, it’s always good to check the skills available on the market before choosing a technology. If you choose the best but rarest technology for your problem, good luck finding skilled developers for it.

But what’s true in 2015 might change in 2018. ActionScript was a must-have not so long ago. Likewise, with Swift, all the hours spent on Objective-C will become obsolete in a few years.

 

 

To conclude, I’ll end with a lame and (I hope) obvious conclusion: there is no best programming language or best framework; what’s best now might not exist tomorrow. A programming language is just a tool; what matters is the way you overcome your problems.

How does a relational database work

When it comes to relational databases, I can’t help thinking that something is missing. They’re used everywhere. There are many different databases: from the small and useful SQLite to the powerful Teradata. But there are only a few articles that explain how a database works. You can google “how does a relational database work” by yourself to see how few results there are. Moreover, those articles are short. Now, if you look for the latest trendy technologies (Big Data, NoSQL or JavaScript), you’ll find more in-depth articles explaining how they work.

Are relational databases too old and too boring to be explained outside of university courses, research papers and books?

 

logos of main databases

 

As a developer, I HATE using something I don’t understand. And, if databases have been used for 40 years, there must be a reason. Over the years, I’ve spent hundreds of hours really understanding these weird black boxes I use every day. Relational databases are very interesting because they’re based on useful and reusable concepts. If understanding a database interests you but you’ve never had the time or the will to dig into this wide subject, you should like this article.

 

Though the title of this article is explicit, its aim is NOT to teach you how to use a database. Therefore, you should already know how to write a simple join query and basic CRUD queries; otherwise you might not understand this article. That is the only thing you need to know; I’ll explain everything else.

I’ll start with some computer science stuff like time complexity. I know that some of you hate this concept but, without it, you can’t understand the cleverness inside a database. Since it’s a huge topic, I’ll focus on what I think is essential: the way a database handles an SQL query. I’ll only present the basic concepts behind a database so that at the end of the article you’ll have a good idea of what’s happening under the hood.

 

Since it’s a long and technical article that involves many algorithms and data structures, take your time to read it. Some concepts are more difficult to understand; you can skip them and still get the overall idea.

For the more knowledgeable of you, this article is more or less divided into 3 parts:

  • An overview of low-level and high-level database components
  • An overview of the query optimization process
  • An overview of the transaction and buffer pool management

 

Back to basics

A long time ago (in a galaxy far, far away….), developers had to know exactly the number of operations they were coding. They knew by heart their algorithms and data structures because they couldn’t afford to waste the CPU and memory of their slow computers.

In this part, I’ll remind you about some of these concepts because they are essential to understanding a database. I’ll also introduce the notion of database index.

 

O(1) vs O(n²)

Nowadays, many developers don’t care about time complexity … and they’re right!

But when you deal with a large amount of data (I’m not talking about thousands) or if you’re fighting for milliseconds, it becomes critical to understand this concept. And guess what, databases have to deal with both situations! I won’t bore you for long, just long enough to get the idea. This will help us later to understand the concept of cost-based optimization.

 

The concept

The time complexity is used to see how long an algorithm will take for a given amount of data. To describe this complexity, computer scientists use the mathematical big O notation. This notation is used with a function that describes how many operations an algorithm needs for a given amount of input data.

For example, when I say “this algorithm is in O( some_function() )”, it means that for a certain amount of data the algorithm needs some_function(a_certain_amount_of_data) operations to do its job.

What’s important is not the amount of data but the way the number of operations increases when the amount of data increases. The time complexity doesn’t give the exact number of operations but a good idea.

time complexity analysis

In this figure, you can see the evolution of different types of complexities. I used a logarithmic scale to plot it. In other words, the amount of data quickly increases from 1 to 1 billion. We can see that:

  • The O(1) or constant complexity stays constant (otherwise it wouldn’t be called constant complexity).
  • The O(log(n)) complexity stays low even with billions of data elements.
  • The worst complexity is O(n²), where the number of operations quickly explodes.
  • The two other complexities are quickly increasing.

 

Examples

With a low amount of data, the difference between O(1) and O(n²) is negligible. For example, let’s say you have an algorithm that needs to process 2000 elements.

  • An O(1) algorithm will cost you 1 operation
  • An O(log(n)) algorithm will cost you 7 operations
  • An O(n) algorithm will cost you 2 000 operations
  • An O(n*log(n)) algorithm will cost you 14 000 operations
  • An O(n²) algorithm will cost you 4 000 000 operations

The difference between O(1) and O(n²) seems like a lot (4 million operations) but you’ll lose 2 ms at most, just the time to blink your eyes. Indeed, current processors can handle hundreds of millions of operations per second. This is why performance and optimization are not an issue in many IT projects.

 

As I said, it’s still important to know this concept when facing a huge amount of data. If this time the algorithm needs to process 1 000 000 elements (which is not that big for a database):

  • An O(1) algorithm will cost you 1 operation
  • An O(log(n)) algorithm will cost you 14 operations
  • An O(n) algorithm will cost you 1 000 000 operations
  • An O(n*log(n)) algorithm will cost you 14 000 000 operations
  • An O(n²) algorithm will cost you 1 000 000 000 000 operations

I didn’t do the math but I’d say with the O(n²) algorithm you have the time to take a coffee (even a second one!). If you put another 0 on the amount of data, you’ll have the time to take a long nap.
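
If you want to check these orders of magnitude yourself, here is a tiny back-of-the-envelope computation in Java. The figure of 10⁹ simple operations per second is my assumption about a rough modern CPU, not a measurement, and the class name is made up for the example.

// Rough estimate only: assumes a hypothetical CPU doing about 10^9 simple operations per second.
public class RoughTiming {
    public static void main(String[] args) {
        double opsPerSecond = 1_000_000_000d;                                 // assumption, not a benchmark

        // O(n²) with n = 2 000 elements -> 4 000 000 operations
        System.out.println(4_000_000 / opsPerSecond + " s");                  // ~0.004 s: a blink

        // O(n²) with n = 1 000 000 elements -> 10^12 operations
        System.out.println(1_000_000_000_000L / opsPerSecond / 60 + " min");  // ~16 min: coffee time
    }
}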

 

Going deeper

To give you an idea:

  • A search in a good hash table gives an element in O(1)
  • A search in a well-balanced tree gives a result in O(log(n))
  • A search in an array gives a result in O(n)
  • The best sorting algorithms have an O(n*log(n)) complexity.
  • A bad sorting algorithm has an O(n²) complexity

Note: In the next parts, we’ll see these algorithms and data structures.

 

There are multiple types of time complexity:

  • the average case scenario
  • the best case scenario
  • and the worst case scenario

The time complexity is often the worst case scenario.

I only talked about time complexity but complexity also works for:

  • the memory consumption of an algorithm
  • the disk I/O consumption of an algorithm

 

Of course there are worse complexities than n2, like:

  • n⁴: that sucks! Some of the algorithms I’ll mention have this complexity.
  • 3ⁿ: that sucks even more! One of the algorithms we’re going to see in the middle of this article has this complexity (and it’s really used in many databases).
  • n! (factorial of n): you’ll never get your results, even with a low amount of data.
  • nⁿ: if you end up with this complexity, you should ask yourself if IT is really your field…

 

Note: I didn’t give you the real definition of the big O notation but just the idea. You can read this article on Wikipedia for the real (asymptotic) definition.

 

Merge Sort

What do you do when you need to sort a collection? What? You call the sort() function … ok, good answer… But for a database, you have to understand how this sort() function works.

There are several good sorting algorithms so I’ll focus on the most important one: the merge sort. You might not understand right now why sorting data is useful but you should after the part on query optimization. Moreover, understanding the merge sort will help us later to understand a common database join operation called the merge join.

 

Merge

Like many useful algorithms, the merge sort is based on a trick: merging 2 sorted arrays of size N/2 into an N-element sorted array only costs N operations. This operation is called a merge.

Let’s see what this means with a simple example:

merge operation during merge sort algorithm

You can see on this figure that to construct the final sorted array of 8 elements, you only need to iterate once through the two 4-element arrays. Since both 4-element arrays are already sorted:

  • 1) you compare both current elements in the 2 arrays (current=first for the first time)
  • 2) then take the lowest one to put it in the 8-element array
  • 3) and go to the next element in the array you took the lowest element from
  • and repeat 1,2,3 until you reach the last element of one of the arrays.
  • Then you take the rest of the elements of the other array to put them in the 8-element array.

This works because both 4-element arrays are sorted and therefore you don’t need to “go back” in these arrays.
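
Since the whole merge sort rests on this merge step, here is a minimal version of it in Java. The int[] arrays and the method name are mine and simply follow the 3 steps listed above; this is a sketch, not database code.

// Merges two already-sorted arrays into one sorted array in N = left.length + right.length operations.
static int[] merge(int[] left, int[] right) {
    int[] result = new int[left.length + right.length];
    int i = 0, j = 0, k = 0;

    // steps 1, 2, 3: compare the current elements, take the lowest, advance in that array
    while (i < left.length && j < right.length) {
        result[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    }
    // one array is exhausted: copy the rest of the other one
    while (i < left.length)  result[k++] = left[i++];
    while (j < right.length) result[k++] = right[j++];

    return result;
}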

 

Now that we’ve understood this trick, here is my pseudocode of the merge sort.

array mergeSort(array a)
   if(length(a)==1)
      return a;
   end if

   //recursive calls
   [left_array right_array] := split_into_2_equally_sized_arrays(a);
   array new_left_array := mergeSort(left_array);
   array new_right_array := mergeSort(right_array);

   //merging the 2 small ordered arrays into a big one
   array result := merge(new_left_array,new_right_array);
   return result;

The merge sort breaks the problem into smaller problems, then combines the results of the smaller problems to get the result of the initial problem (note: this kind of algorithm is called divide and conquer). If you don’t understand this algorithm, don’t worry; I didn’t understand it the first time I saw it. If it can help you, I see this algorithm as a two-phase algorithm:

  • The division phase where the array is divided into smaller arrays
  • The sorting phase where the small arrays are put together (using the merge) to form a bigger array.

 

Division phase

division phase during the merge sort algorithm

During the division phase, the array is divided into unitary arrays using 3 steps. The formal number of steps is log(N) (since N=8, log(N) = 3).

How do I know that?

I’m a genius! In one word: mathematics. The idea is that each step divides the size of the initial array by 2. The number of steps is the number of times you can divide the initial array by two. This is the exact definition of logarithm (in base 2).

 

Sorting phase

sort phase during the merge sort algorithm

In the sorting phase, you start with the unitary arrays. During each step, you apply multiple merges and the overall cost is N=8 operations:

  • In the first step you have 4 merges that cost 2 operations each
  • In the second step you have 2 merges that cost 4 operations each
  • In the third step you have 1 merge that costs 8 operations

Since there are log(N) steps, the overall cost is N * log(N) operations.

 

The power of the merge sort

Why is this algorithm so powerful?

Because:

  • You can modify it in order to reduce the memory footprint, so that you don’t create new arrays but directly modify the input array.

Note: this kind of algorithm is called in-place.

  • You can modify it in order to use disk space and a small amount of memory at the same time without a huge disk I/O penalty. The idea is to load in memory only the parts that are currently processed. This is important when you need to sort a multi-gigabyte table with only a memory buffer of 100 megabytes.

Note: this kind of algorithm is called external sorting.

  • You can modify it to run on multiple processes/threads/servers.

For example, the distributed merge sort is one of the key components of Hadoop (which is THE framework in Big Data).

  • This algorithm can turn lead into gold (true fact!).

 

This sorting algorithm is used in most (if not all) databases but it’s not the only one. If you want to know more, you can read this research paper that discusses the pros and cons of the common sorting algorithms in a database.

 

Array, Tree and Hash table

Now that we understand the idea behind time complexity and sorting, I have to tell you about 3 data structures. It’s important because they’re the backbone of modern databases. I’ll also introduce the notion of database index.

 

Array

The two-dimensional array is the simplest data structure. A table can be seen as an array. For example:

array table in databases

This 2-dimensional array is a table with rows and columns:

  • Each row represents a subject
  • The columns represent the features that describe the subjects.
  • Each column stores a certain type of data (integer, string, date …).

Though it’s great for storing and visualizing data, it sucks when you need to look for a specific value.

For example, if you want to find all the guys who work in the UK, you’ll have to look at each row to check whether the row belongs to the UK. This will cost you N operations (N being the number of rows), which is not bad, but could there be a faster way? This is where trees come into play.
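
To make this cost concrete, here is a sketch in Java of that row-by-row search. The table layout (a list of String[] rows with the country in column 2) is an assumption made up for the example.

import java.util.*;

class FullTableScan {
    // Returns every row whose country column matches: N iterations, whatever the country.
    static List<String[]> rowsFromCountry(List<String[]> table, String country) {
        List<String[]> result = new ArrayList<>();
        for (String[] row : table) {            // one look per row: O(N)
            if (country.equals(row[2])) {       // the "does this row belong to the UK?" test
                result.add(row);
            }
        }
        return result;
    }
}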

 

Note: Most modern databases provide advanced arrays to store tables efficiently, like heap-organized tables or index-organized tables. But it doesn’t change the problem of quickly searching for a specific condition on a group of columns.

 

Tree and database index

A binary search tree is a binary tree with a special property, the key in each node must be:

  • greater than all keys stored in the left sub-tree
  • smaller than all keys stored in the right sub-tree

 

Let’s see what it means visually

The idea

binary search tree

 

This tree has N=15 elements. Let’s say I’m looking for 208:

  • I start with the root whose key is 136. Since 136<208, I look at the right sub-tree of the node 136.
  • 398>208 so, I look at the left sub-tree of the node 398
  • 250>208 so, I look at the left sub-tree of the node 250
  • 200<208 so, I look at the right sub-tree of the node 200. But 200 doesn’t have a right sub-tree, so the value doesn’t exist (because if it did exist, it would be in the right sub-tree of 200).

Now let’s say I’m looking for 40

  • I start with the root whose key is 136. Since 136>40, I look at the left sub-tree of the node 136.
  • 80>40 so, I look at the left sub-tree of the node 80
  • 40 = 40, the node exists. I extract the id of the row inside the node (it’s not in the figure) and look at the table for the given row id.
  • Knowing the row id tells me precisely where the data is in the table, and therefore I can get it instantly.

In the end, both searches cost me the number of levels inside the tree. If you read the part on the merge sort carefully, you should see that there are log(N) levels. So the cost of the search is log(N); not bad!
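
Here is what that descent looks like in Java. The node layout (an integer key plus the row id it points to) follows the description above; the class and field names are mine.

// A binary search tree node: the indexed key and the id of the table row it points to.
class TreeNode {
    long key;
    long rowId;
    TreeNode left, right;
}

class BinarySearchTreeIndex {
    // Returns the row id stored for the key, or -1 if the key is not in the tree.
    // Each iteration goes down one level, so the cost is the number of levels: O(log(N))
    // for a well-balanced tree.
    static long search(TreeNode root, long key) {
        TreeNode node = root;
        while (node != null) {
            if (key < node.key)       node = node.left;   // like 136 > 40: go to the left sub-tree
            else if (key > node.key)  node = node.right;  // like 136 < 208: go to the right sub-tree
            else                      return node.rowId;  // found: fetch the row at this row id
        }
        return -1;                                        // no node left: the value doesn't exist
    }
}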

 

Back to our problem

But this stuff is very abstract so let’s go back to our problem. Instead of a stupid integer, imagine the string that represents the country of someone in the previous table. Suppose you have a tree that contains the column “country” of the table:

  • If you want to know who is working in the UK
  • you look at the tree to get the node that represents the UK
  • inside the “UK node” you’ll find the locations of the rows of the UK workers.

This search only costs you log(N) operations instead of N operations if you directly use the array. What you’ve just imagined was a database index.

You can build a tree index for any group of columns (a string, an integer, 2 strings, an integer and a string, a date …) as long as you have a function to compare the keys (i.e. the group of columns) so that you can establish an order among the keys (which is the case for any basic types in a database).
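
For instance, here is a sketch in Java of a two-column key. The (country, lastName) pair and all the names are invented for the example; the sorted map simply stands in for the tree and maps each key to the row ids that match it.

import java.util.*;

// A hypothetical two-column key: the group of columns (country, lastName).
record PersonKey(String country, String lastName) {}

class CompositeTreeIndex {
    // The "function to compare the keys": first by country, then by last name.
    static final Comparator<PersonKey> ORDER =
        Comparator.comparing(PersonKey::country).thenComparing(PersonKey::lastName);

    // The sorted map plays the role of the tree index.
    private final TreeMap<PersonKey, List<Long>> index = new TreeMap<>(ORDER);

    void add(String country, String lastName, long rowId) {
        index.computeIfAbsent(new PersonKey(country, lastName), k -> new ArrayList<>()).add(rowId);
    }

    List<Long> find(String country, String lastName) {      // O(log(N)) lookup
        return index.getOrDefault(new PersonKey(country, lastName), List.of());
    }
}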

 

B+Tree Index

Although this tree works well to get a specific value, there is a BIG problem when you need to get multiple elements between two values. It will cost O(N) because you’ll have to look at each node in the tree and check if it’s between these 2 values (for example, with an in-order traversal of the tree). Moreover this operation is not disk I/O friendly since you’ll have to read the full tree. We need to find a way to efficiently do a range query. To answer this problem, modern databases use a modified version of the previous tree called B+Tree. In a B+Tree:

  • only the lowest nodes (the leaves) store information (the location of the rows in the associated table)
  • the other nodes are just here to route to the right node during the search.

B+Tree index in databases

As you can see, there are more nodes (twice as many). Indeed, you have additional nodes, the “decision nodes”, that help you find the right node (the one that stores the location of the rows in the associated table). But the search complexity is still O(log(N)) (there is just one more level). The big difference is that the lowest nodes are linked to their successors.

With this B+Tree, if you’re looking for values between 40 and 100:

  • You just have to look for 40 (or the closest value after 40 if 40 doesn’t exist) like you did with the previous tree.
  • Then gather the successors of 40 using the direct links to the successors until you reach 100.

Let’s say you found M successors and the tree has N nodes. The search for a specific node costs log(N) like the previous tree. But, once you have this node, you get the M successors in M operations with the links to their successors. This search only costs M + log(N) operations vs N operations with the previous tree. Moreover, you don’t need to read the full tree (just M + log(N) nodes), which means less disk usage. If M is low (like 200 rows) and N is large (1 000 000 rows), it makes a BIG difference.
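
Here is a small sketch of that leaf-level walk in Java. It only models the lowest level of the B+Tree (sorted leaves linked to their successors) and assumes the O(log(N)) descent through the decision nodes has already handed us the leaf containing the first key >= 40; the names are mine, not a real database structure.

import java.util.*;

// Leaf level only: each leaf stores sorted (key -> row id) entries and a link to its successor.
class Leaf {
    final TreeMap<Integer, Long> entries = new TreeMap<>();
    Leaf next;
}

class BPlusTreeRangeScan {
    // Gathers the row ids of all keys in [min, max], starting from the leaf found by the
    // (not shown) descent. The walk itself costs roughly one operation per matching row: O(M).
    static List<Long> range(Leaf startLeaf, int min, int max) {
        List<Long> rowIds = new ArrayList<>();
        for (Leaf leaf = startLeaf; leaf != null; leaf = leaf.next) {     // follow the successor links
            for (Map.Entry<Integer, Long> entry : leaf.entries.tailMap(min, true).entrySet()) {
                if (entry.getKey() > max) {
                    return rowIds;                                        // past the upper bound: stop
                }
                rowIds.add(entry.getValue());
            }
        }
        return rowIds;
    }
}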

 

But there are new problems (again!). If you add or remove a row in a database (and therefore in the associated B+Tree index):

  • you have to keep the order between nodes inside the B+Tree otherwise you won’t be able to find nodes inside the mess.
  • you have to keep the lowest possible number of levels in the B+Tree otherwise the time complexity in O(log(N)) will become O(N).

In other words, the B+Tree needs to be self-ordered and self-balanced. Thankfully, this is possible with smart deletion and insertion operations. But this comes with a cost: the insertion and deletion in a B+Tree are in O(log(N)). This is why some of you have heard that using too many indexes is not a good idea. Indeed, you’re slowing down the fast insertion/update/deletion of a row in a table since the database needs to update the indexes of the table with a costly O(log(N)) operation per index. Moreover, adding indexes means more workload for the transaction manager (we will see this manager at the end of the article).

For more details, you can look at the Wikipedia article about B+Trees. If you want an example of a B+Tree implementation in a database, look at this article and this article from a core developer of MySQL. They both focus on how InnoDB (the engine of MySQL) handles indexes.

Note: I was told by a reader that, because of low-level optimizations, the B+Tree needs to be fully balanced.

 

Hash table

Our last important data structure is the hash table. It’s very useful when you want to quickly look for values. Moreover, understanding the hash table will help us later to understand a common database join operation called the hash join. This data structure is also used by a database to store some internal stuff (like the lock table or the buffer pool; we’ll see both concepts later).

The hash table is a data structure that quickly finds an element with its key. To build a hash table you need to define:

  • a key for your elements
  • a hash function for the keys. The computed hashes of the keys give the locations of the elements (called buckets).
  • a function to compare the keys. Once you’ve found the right bucket, you have to find the element you’re looking for inside the bucket using this comparison.

 

A simple example

Let’s have a visual example:

hash table

This hash table has 10 buckets. Since I’m lazy I only drew 5 buckets, but I know you’re smart so I’ll let you imagine the 5 others. The hash function I used is the key modulo 10. In other words, I only keep the last digit of the key of an element to find its bucket:

  • if the last digit is 0 the element ends up in the bucket 0,
  • if the last digit is 1 the element ends up in the bucket 1,
  • if the last digit is 2 the element ends up in the bucket 2,

The compare function I used is simply the equality between 2 integers.

Let’s say you want to get the element 78:

  • The hash table computes the hash code for 78 which is 8.
  • It looks in the bucket 8, and the first element it finds is 78.
  • It gives you back the element 78
  • The search only costs 2 operations (1 for computing the hash value and the other for finding the element inside the bucket).

Now, let’s say you want to get the element 59:

  • The hash table computes the hash code for 59 which is 9.
  • It looks in the bucket 9, and the first element it finds is 99. Since 99!=59, element 99 is not the right element.
  • Using the same logic, it looks at the second element (9), the third (79), … , and the last (29).
  • The element doesn’t exist.
  • The search costs 7 operations.
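
Here is a sketch of this bucket layout in Java, with the same modulo-10 hash function. The class is made up for the example (a real hash table resizes itself and uses a much better hash function), but the operation counts mirror the two lookups above: 2 for 78, 7 for the missing 59.

import java.util.*;

// A deliberately naive hash table: 10 fixed buckets and hash(key) = key % 10 (non-negative keys assumed).
class Modulo10HashTable {
    private final List<List<Integer>> buckets = new ArrayList<>();

    Modulo10HashTable() {
        for (int i = 0; i < 10; i++) buckets.add(new ArrayList<>());
    }

    void put(int key) {
        buckets.get(key % 10).add(key);                  // 1 operation: compute the hash, find the bucket
    }

    boolean contains(int key) {
        for (int candidate : buckets.get(key % 10)) {    // then 1 comparison per element in the bucket
            if (candidate == key) return true;
        }
        return false;                                    // bucket exhausted: the element doesn't exist
    }
}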

 

A good hash function

As you can see, depending on the value you’re looking for, the cost is not the same!

If I now change the hash function to the key modulo 1 000 000 (i.e. taking the last 6 digits), the second search only costs 1 operation because there are no elements in bucket 000059. The real challenge is to find a good hash function that creates buckets containing a very small number of elements.

In my example, finding a good hash function is easy. But this is a simple example, finding a good hash function is more difficult when the key is:

  • a string (for example the last name of a person)
  • 2 strings (for example the last name and the first name of a person)
  • 2 strings and a date (for example the last name, the first name and the birth date of a person)

With a good hash function, the search in a hash table is in O(1).

 

Array vs hash table

Why not use an array?

Hmm, you’re asking a good question.

  • A hash table can be half loaded in memory and the other buckets can stay on disk.
  • With an array you have to use a contiguous space in memory. If you’re loading a large table it’s very difficult to have enough contiguous space.
  • With a hash table you can choose the key you want (for example the country AND the last name of a person).

For more information, you can read my article on the Java HashMap which is an efficient hash table implementation; you don’t need to understand Java to understand the concepts inside this article.

 

Global overview

We’ve just seen the basic components inside a database. We now need to step back to see the big picture.

A database is a collection of information that can easily be accessed and modified. But a simple bunch of files could do the same. In fact, the simplest databases like SQLite are nothing more than a bunch of files. But SQLite is a well-crafted bunch of files because it allows you to:

  • use transactions that ensure data are safe and coherent
  • quickly process data even when you’re dealing with millions of rows

 

More generally, a database can be seen as the following figure:

global overview of a database

Before writing this part, I’ve read multiple books/papers and every source had its own way to represent a database. So, don’t focus too much on how I organized this database or how I named the processes, because I made some choices to fit the plan of this article. What matters are the different components; the overall idea is that a database is divided into multiple components that interact with each other.

The core components:

  • The process manager: Many databases have a pool of processes/threads that needs to be managed. Moreover, in order to gain nanoseconds, some modern databases use their own threads instead of the Operating System threads.
  • The network manager: Network I/O is a big issue, especially for distributed databases. That’s why some databases have their own manager.
  • File system manager: Disk I/O is the first bottleneck of a database. Having a manager that will perfectly handle the Operating System file system or even replace it is important.
  • The memory manager: To avoid the disk I/O penalty, a large quantity of RAM is required. But if you handle a large amount of memory, you need an efficient memory manager, especially when you have many queries using memory at the same time.
  • Security manager: for managing the authentication and the authorizations of the users
  • Client manager: for managing the client connections

The tools:

  • Backup manager: for saving and restoring a database.
  • Recovery manager: for restarting the database in a coherent state after a crash
  • Monitor manager: for logging the activity of the database and providing tools to monitor a database
  • Administration manager: for storing metadata (like the names and the structures of the tables) and providing tools to manage databases, schemas, tablespaces, …

The query manager:

  • Query parser: to check if a query is valid
  • Query rewriter: to pre-optimize a query
  • Query optimizer: to optimize a query
  • Query executor: to compile and execute a query

The data manager:

  • Transaction manager: to handle transactions
  • Cache manager: to put data in memory before using them and to keep data in memory before writing them on disk
  • Data access manager: to access data on disk

 

For the rest of this article, I’ll focus on how a database manages an SQL query through the following processes:

  • the client manager
  • the query manager
  • the data manager (I’ll also include the recovery manager in this part)

 

Client manager

client manager in databases

The client manager is the part that handles the communications with the client. The client can be a (web) server or an end-user/end-application. The client manager provides different ways to access the database through a set of well-known APIs: JDBC, ODBC, OLE-DB …

It can also provide proprietary database access APIs.

 

When you connect to a database:

  • The manager first checks your authentication (your login and password) and then checks if you have the authorizations to use the database. These access rights are set by your DBA.
  • Then, it checks if there is a process (or a thread) available to manage your query.
  • It also checks that the database is not under heavy load.
  • It can wait a moment to get the required resources. If this wait reaches a timeout, it closes the connection and gives a readable error message.
  • Then it sends your query to the query manager and your query is processed
  • Since the query processing is not an “all or nothing” thing, as soon as it gets data from the query manager, it stores the partial results in a buffer and starts sending them to you.
  • In case of a problem, it stops the connection, gives you a readable explanation and releases the resources.

 

Query manager

query manager in databases

This part is where the power of a database lies. During this part, an ill-written query is transformed into fast executable code. The code is then executed and the results are returned to the client manager. It’s a multiple-step operation:

  • the query is first parsed to see if it’s valid
  • it’s then rewritten to remove useless operations and add some pre-optimizations
  • it’s then optimized to improve the performance and transformed into an execution and data access plan.
  • then the plan is compiled
  • at last, it’s executed

In this part, I won’t talk a lot about the last 2 points because they’re less important.

 

After reading this part, if you want a better understanding I recommend reading:

  • The initial research paper (1979) on cost based optimization: Access Path Selection in a Relational Database Management System. This article is only 12 pages and understandable with an average level in computer science.
  • A very good and in-depth presentation on how DB2 9.X optimizes queries here
  • A very good presentation on how PostgreSQL optimizes queries here. It’s the most accessible document since it’s more a presentation on “let’s see what query plans PostgreSQL gives in these situations“ than a “let’s see the algorithms used by PostgreSQL”.
  • The official SQLite documentation about optimization. It’s “easy” to read because SQLite uses simple rules. Moreover, it’s the only official documentation that really explains how it works.
  • A good presentation on how SQL Server 2005 optimizes queries here
  • A white paper about optimization in Oracle 12c here
  • 2 theoretical courses on query optimization from the authors of the book “DATABASE SYSTEM CONCEPTS” here and there. A good read that focuses on disk I/O cost but a good level in CS is required.
  • Another theoretical course that I find more accessible but that only focuses on join operators and disk I/O.

 

Query parser

Each SQL statement is sent to the parser where it is checked for correct syntax. If you made a mistake in your query, the parser will reject the query. For example, if you wrote “SLECT …” instead of “SELECT …”, the story ends here.

But this goes deeper. It also checks that the keywords are used in the right order. For example a WHERE before a SELECT will be rejected.

Then, the tables and the fields inside the query are analyzed. The parser uses the metadata of the database to check:

  • If the tables exist
  • If the fields of the tables exist
  • If the operations for the types of the fields are possible (for example you can’t compare an integer with a string, you can’t use a substring() function on an integer)

 

Then it checks if you have the authorizations to read (or write) the tables in the query. Again, these access rights on tables are set by your DBA.

During this parsing, the SQL query is transformed into an internal representation (often a tree)

If everything is ok then the internal representation is sent to the query rewriter.

 

Query rewriter

At this step, we have an internal representation of a query. The aim of the rewriter is:

  • to pre-optimize the query
  • to avoid unnecessary operations
  • to help the optimizer to find the best possible solution

 

The rewriter executes a list of known rules on the query. If the query fits a pattern of a rule, the rule is applied and the query is rewritten. Here is a non-exhaustive list of (optional) rules:

  • View merging: If you’re using a view in your query, the view is replaced by the SQL code of the view.
  • Subquery flattening: Having subqueries is very difficult to optimize so the rewriter will try to modify a query with a subquery to remove the subquery.

For example

SELECT PERSON.*
FROM PERSON
WHERE PERSON.person_key IN
(SELECT MAILS.person_key
FROM MAILS
WHERE MAILS.mail LIKE 'christophe%');

Will be replaced by

SELECT PERSON.*
FROM PERSON, MAILS
WHERE PERSON.person_key = MAILS.person_key
and MAILS.mail LIKE 'christophe%';
  • Removal of unnecessary operators: For example, if you use a DISTINCT while you already have a UNIQUE constraint that prevents the data from being non-unique, the DISTINCT keyword is removed.
  • Redundant join elimination: If you have twice the same join condition because one join condition is hidden in a view or if by transitivity there is a useless join, it’s removed.
  • Constant arithmetic evaluation: If you write something that requires a calculus, then it’s computed once during the rewriting. For example WHERE AGE > 10+2 is transformed into WHERE AGE > 12 and TODATE(“some date”) is transformed into the date in the datetime format
  • (Advanced) Partition Pruning: If you’re using a partitioned table, the rewriter is able to find what partitions to use.
  • (Advanced) Materialized view rewrite: If you have a materialized view that matches a subset of the predicates in your query, the rewriter checks if the view is up to date and modifies the query to use the materialized view instead of the raw tables.
  • (Advanced) Custom rules: If you have custom rules to modify a query (like Oracle policies), then the rewriter executes these rules
  • (Advanced) Olap transformations: analytical/windowing functions, star joins, rollup … are also transformed (but I’m not sure if it’s done by the rewriter or the optimizer; since both processes are very close, it must depend on the database).

 

This rewritten query is then sent to the query optimizer where the fun begins!

 

Statistics

Before we see how a database optimizes a query we need to speak about statistics because without them a database is stupid. If you don’t tell the database to analyze its own data, it will not do it and it will make (very) bad assumptions.

But what kind of information does a database need?

I have to (briefly) talk about how databases and operating systems store data. They use a minimum unit called a page or a block (4 or 8 kilobytes by default). This means that if you only need 1 kilobyte, it will cost you one page anyway. If the page takes 8 kilobytes, then you’ll waste 7 kilobytes.

 

Back to the statistics! When you ask a database to gather statistics, it computes values like:

  • The number of rows/pages in a table
  • For each column in a table:
    • distinct data values
    • the length of data values (min, max, average)
    • data range information (min, max, average)
  • Information on the indexes of the table.

These statistics will help the optimizer to estimate the disk I/O, CPU and memory usages of the query.

The statistics for each column are very important. For example, imagine a table PERSON that needs to be joined on 2 columns: LAST_NAME and FIRST_NAME. With the statistics, the database knows that there are only 1 000 different values for FIRST_NAME and 1 000 000 different values for LAST_NAME. Therefore, the database will join the data on LAST_NAME, FIRST_NAME instead of FIRST_NAME, LAST_NAME because it produces far fewer comparisons: the LAST_NAME values are unlikely to be the same, so most of the time a comparison on the first 2 (or 3) characters of the LAST_NAME is enough.

 

But these are basic statistics. You can ask a database to compute advanced statistics called histograms. Histograms are statistics that inform about the distribution of the values inside the columns. For example:

  • the most frequent values
  • the quantiles

These extra statistics will help the database to find an even better query plan, especially for equality predicates (e.g. WHERE AGE = 18) or range predicates (e.g. WHERE AGE > 10 AND AGE < 40), because the database will have a better idea of the number of rows concerned by these predicates (note: the technical word for this concept is selectivity).

 

The statistics are stored in the metadata of the database. For example you can see the statistics for the (non-partitioned) tables:

  • in USER/ALL/DBA_TABLES and USER/ALL/DBA_TAB_COLUMNS for Oracle
  • in SYSCAT.TABLES and SYSCAT.COLUMNS for DB2.

 

The statistics have to be up to date. There is nothing worse than a database thinking a table has only 500 rows whereas it has 1 000 000. The only drawback of statistics is that it takes time to compute them. This is why they’re not automatically computed by default in most databases. With millions of rows it becomes difficult to compute them. In this case, you can choose to compute only the basic statistics or to compute the stats on a sample of the database.

For example, when I was working on a project dealing with hundreds of millions of rows in each table, I chose to compute the statistics on only 10% of the rows, which led to a huge gain in time. For the record, it turned out to be a bad decision because occasionally the 10% chosen by Oracle 10G for a specific column of a specific table were very different from the overall 100% (which is very unlikely to happen for a table with 100M rows). This wrong statistic led to a query occasionally taking 8 hours instead of 30 seconds; the root cause was a nightmare to find. This example shows how important statistics are.

 

Note: Of course, there are more advanced statistics specific to each database. If you want to know more, read the documentation of your database. That being said, I’ve tried to understand how the statistics are used, and the best official documentation I found was the one from PostgreSQL.

 

Query optimizer

CBO

All modern databases use Cost Based Optimization (or CBO) to optimize queries. The idea is to put a cost on every operation and find the best way to reduce the cost of the query by using the cheapest chain of operations to get the result.

 

To understand how a cost optimizer works, I think it’s good to have an example to “feel” the complexity behind this task. In this part I’ll present the 3 common ways to join 2 tables, and we’ll quickly see that even a simple join query is a nightmare to optimize. After that, we’ll see how real optimizers do this job.

 

For these joins, I’ll focus on their time complexity, but a database optimizer computes their CPU cost, disk I/O cost and memory requirement. The difference between time complexity and CPU cost is that time complexity is very approximate (it’s for lazy guys like me). For the CPU cost, I should count every operation like an addition, an “if statement”, a multiplication, an iteration … Moreover:

  • Each high level code operation has a specific number of low level CPU operations.
  • The cost of a CPU operation is not the same (in terms of CPU cycles) whether you’re using an Intel Core i7, an Intel Pentium 4, an AMD Opteron…. In other words it depends on the CPU architecture.

 

Using the time complexity is easier (at least for me) and with it we can still get the concept of CBO. I’ll sometimes speak about disk I/O since it’s an important concept. Keep in mind that the bottleneck is most of the time the disk I/O and not the CPU usage.

 

Indexes

We talked about indexes when we saw the B+Trees. Just remember that these indexes are already sorted.

FYI, there are other types of indexes like bitmap indexes. They don’t offer the same cost in terms of CPU, disk I/O and memory as B+Tree indexes.

Moreover, many modern databases can dynamically create temporary indexes just for the current query if it can improve the cost of the execution plan.

 

Access Path

Before applying your join operators, you first need to get your data. Here is how you can get your data.

Note: Since the real problem with all the access paths is the disk I/O, I won’t talk a lot about time complexity.

 

Full scan

If you’ve ever read an execution plan you must have seen the word full scan (or just scan). A full scan is simply the database reading a table or an index entirely. In terms of disk I/O, a table full scan is obviously more expensive than an index full scan.

 

Range Scan

There are other types of scan like index range scan. It is used for example when you use a predicate like “WHERE AGE > 20 AND AGE <40”.

Of course, you need to have an index on the field AGE to use this index range scan.

We already saw in the first part that the time cost of a range query is something like log(N) + M, where N is the number of data in this index and M an estimation of the number of rows inside this range. Both N and M values are known thanks to the statistics (Note: M is the selectivity for the predicate AGE > 20 AND AGE < 40). Moreover, for a range scan you don’t need to read the full index so it’s less expensive in terms of disk I/O than a full scan.

 

Unique scan

If you only need one value from an index you can use the unique scan.

 

Access by row id

Most of the time, if the database uses an index, it will have to look for the rows associated to the index. To do so it will use an access by row id.

 

For example, if you do something like

SELECT LASTNAME, FIRSTNAME from PERSON WHERE AGE = 28

If you have an index on PERSON for the column AGE, the optimizer will use the index to find all the persons who are 28, then it will ask for the associated rows in the table, because the index only has information about the age and you want to know the last name and the first name.

 

But, if now you do something like

SELECT TYPE_PERSON.CATEGORY from PERSON ,TYPE_PERSON
WHERE PERSON.AGE = TYPE_PERSON.AGE

The index on PERSON will be used to join with TYPE_PERSON but the table PERSON will not be accessed by row id since you’re not asking information on this table.

Though it works great for a few accesses, the real issue with this operation is the disk I/O. If you need too many accesses by row id the database might choose a full scan.

 

Other paths

I didn’t present all the access paths. If you want to know more, you can read the Oracle documentation. The names might not be the same for the other databases but the concepts behind are the same.

 

Join operators

So, we know how to get our data, let’s join them!

I’ll present the 3 common join operators: Merge Join, Hash Join and Nested Loop Join. But before that, I need to introduce new vocabulary: inner relation and outer relation. A relation can be:

  • a table
  • an index
  • an intermediate result from a previous operation (for example the result of a previous join)

When you’re joining two relations, the join algorithms manage the two relations differently. In the rest of the article, I’ll assume that:

  • the outer relation is the left data set
  • the inner relation is the right data set

For example, A JOIN B is the join between A and B where A is the outer relation and B the inner relation.

Most of the time, the cost of A JOIN B is not the same as the cost of B JOIN A.

In this part, I’ll also assume that the outer relation has N elements and the inner relation M elements. Keep in mind that a real optimizer knows the values of N and M with the statistics.

Note: N and M are the cardinalities of the relations.

 

Nested loop join

The nested loop join is the easiest one.

nested loop join in databases

 

Here is the idea:

  • for each row in the outer relation
  • you look at all the rows in the inner relation to see if there are rows that match

Here is a pseudo code:

nested_loop_join(array outer, array inner)
  for each row a in outer
    for each row b in inner
      if (match_join_condition(a,b))
        write_result_in_output(a,b)
      end if
    end for
   end for

Since it’s a double iteration, the time complexity is O(N*M)

 

In terms of disk I/O, for each of the N rows in the outer relation, the inner loop needs to read M rows from the inner relation. This algorithm needs to read N + N*M rows from disk. But, if the inner relation is small enough, you can put the relation in memory and just have M + N reads. With this modification, the inner relation must be the smallest one since it has a better chance of fitting in memory.

In terms of time complexity it makes no difference, but in terms of disk I/O it’s way better to read both relations only once.

Of course, the inner relation can be replaced by an index, which will be better for the disk I/O.


Since this algorithm is very simple, here is another version that is more disk I/O friendly if the inner relation is too big to fit in memory. Here is the idea:

  • instead of reading both relations row by row,
  • you read them bunch by bunch and keep 2 bunches of rows (from each relation) in memory,
  • you compare the rows inside the two bunches and keep the rows that match,
  • then you load new bunches from disk and compare them
  • and so on until there are no bunches to load.

Here is a possible algorithm:

// improved version to reduce the disk I/O.
nested_loop_join_v2(file outer, file inner)
  for each bunch ba in outer
    // ba is now in memory
    for each bunch bb in inner
      // bb is now in memory
      for each row a in ba
        for each row b in bb
          if (match_join_condition(a,b))
            write_result_in_output(a,b)
          end if
        end for
      end for
    end for
  end for

 

With this version, the time complexity remains the same, but the number of disk access decreases:

  • With the previous version, the algorithm needs N + N*M accesses (each access gets one row).
  • With this new version, the number of disk accesses becomes number_of_bunches_for(outer) + number_of_bunches_for(outer) * number_of_bunches_for(inner).
  • If you increase the size of the bunch you reduce the number of disk accesses.

Note: Each disk access gathers more data than the previous algorithm but it doesn’t matter since they’re sequential accesses (the real issue with mechanical disks is the time to get the first data).

 

Hash join

The hash join is more complicated but gives a better cost than a nested loop join in many situations.

hash join in a database

The idea of the hash join is to:

  • 1) Get all elements from the inner relation
  • 2) Build an in-memory hash table
  • 3) Get all elements of the outer relation one by one
  • 4) Compute the hash of each element (with the hash function of the hash table) to find the associated bucket of the inner relation
  • 5) find if there is a match between the elements in the bucket and the element of the outer table
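
Here is a sketch of this classic in-memory hash join in Java. The generic row type R and the key extractor are assumptions made for the example; a real database hashes into its own bucket structures rather than a java.util.HashMap, but the 5 steps above are the same.

import java.util.*;
import java.util.function.Function;

class HashJoin {
    // Equi-joins inner and outer on the key produced by keyOf and returns the matching pairs.
    static <R> List<Object[]> join(List<R> inner, List<R> outer, Function<R, Long> keyOf) {
        // steps 1 and 2: read the inner relation once and build an in-memory hash table on the join key
        Map<Long, List<R>> hashTable = new HashMap<>();
        for (R innerRow : inner) {
            hashTable.computeIfAbsent(keyOf.apply(innerRow), k -> new ArrayList<>()).add(innerRow);
        }

        // steps 3, 4 and 5: read the outer relation once, probe the matching bucket, emit the matches
        List<Object[]> result = new ArrayList<>();
        for (R outerRow : outer) {
            for (R innerRow : hashTable.getOrDefault(keyOf.apply(outerRow), List.of())) {
                result.add(new Object[] { outerRow, innerRow });   // write_result_in_output(a, b)
            }
        }
        return result;
    }
}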

In terms of time complexity I need to make some assumptions to simplify the problem:

  • The inner relation is divided into X buckets
  • The hash function distributes hash values almost uniformly for both relations. In other words the buckets are equally sized.
  • The matching between an element of the outer relation and all elements inside a bucket costs the number of elements inside the bucket.

The time complexity is (M/X) * N + cost_to_create_hash_table(M) + cost_of_hash_function*N

If the hash function creates enough small-sized buckets, then the time complexity is O(M+N).

 

Here is another version of the hash join which is more memory friendly but less disk I/O friendly. This time:

  • 1) you compute the hash tables for both the inner and outer relations
  • 2) then you put them on disk
  • 3) then you compare the 2 relations bucket by bucket (with one loaded in-memory and the other read row by row)

 

Merge join

The merge join is the only join that produces a sorted result.

Note: In this simplified merge join, there are no inner or outer tables; they both play the same role. But real implementations make a difference, for example, when dealing with duplicates.

The merge join can be divided into two steps:

  1. (Optional) Sort join operations: Both the inputs are sorted on the join key(s).
  2. Merge join operation: The sorted inputs are merged together.

 

Sort

We already spoke about the merge sort; in this case a merge sort is a good algorithm (but not the best one if memory is not an issue).

But sometimes the data sets are already sorted, for example:

  • If the table is natively ordered, for example an index-organized table on the join condition
  • If the relation is an index on the join condition
  • If this join is applied on an intermediate result already sorted during the process of the query


Merge join

merge join in a database

This part is very similar to the merge operation of the merge sort we saw. But this time, instead of picking every element from both relations, we only pick the elements from both relations that are equal. Here is the idea:

  • 1) you compare both current elements in the 2 relations (current=first for the first time)
  • 2) if they’re equal, then you put both elements in the result and you go to the next element for both relations
  • 3) if not, you go to the next element for the relation with the lowest element (because the next element might match)
  • 4) and repeat 1,2,3 until you reach the last element of one of the relations.

This works because both relations are sorted and therefore you don’t need to “go back” in these relations.

This algorithm is a simplified version because it doesn’t handle the case where the same data appears multiple times in both arrays (in other words, multiple matches). The real version is more complicated “just” for this case; this is why I chose a simplified version.

 

If both relations are already sorted, then the time complexity is O(N+M).

If both relations need to be sorted, then the time complexity is the cost to sort both relations: O(N*log(N) + M*log(M)).

 

For the CS geeks, here is a possible algorithm that handles the multiple matches (note: I’m not 100% sure about my algorithm):

mergeJoin(relation a, relation b)
  relation output
  integer a_key := 0;
  integer b_key := 0;

  while (a[a_key] != null and b[b_key] != null)
    if (a[a_key] < b[b_key])
      a_key++;
    else if (a[a_key] > b[b_key])
      b_key++;
    else //Join predicate satisfied
      //i.e. a[a_key] == b[b_key]

      //count the number of duplicates in relation a
      integer nb_dup_in_a := 1;
      while (a[a_key] == a[a_key + nb_dup_in_a])
        nb_dup_in_a++;

      //count the number of duplicates in relation b
      integer nb_dup_in_b := 1;
      while (b[b_key] == b[b_key + nb_dup_in_b])
        nb_dup_in_b++;

      //write all the combinations of the duplicates in the output
      for (integer i = 0; i < nb_dup_in_a; i++)
        for (integer j = 0; j < nb_dup_in_b; j++)
          write_result_in_output(a[a_key + i], b[b_key + j])

      //skip past the duplicates in both relations
      a_key := a_key + nb_dup_in_a;
      b_key := b_key + nb_dup_in_b;

    end if
  end while

 

Which one is the best?

If there were a best type of join, there wouldn’t be multiple types. This question is very difficult because many factors come into play, like:

  • The amount of free memory: without enough memory you can say goodbye to the powerful hash join (at least the full in-memory hash join)
  • The size of the 2 data sets. For example if you have a big table with a very small one, a nested loop join will be faster than a hash join because the hash join has an expensive creation of hashes. If you have 2 very large tables the nested loop join will be very CPU expensive.
  • The presence of indexes. With 2 B+Tree indexes the smart choice seems to be the merge join
  • If the result needs to be sorted: Even if you’re working with unsorted data sets, you might want to use a costly merge join (with the sorts) because at the end the result will be sorted and you’ll be able to chain the result with another merge join (or maybe because the query asks implicitly/explicitly for a sorted result with an ORDER BY/GROUP BY/DISTINCT operation)
  • If the relations are already sorted: In this case the merge join is the best candidate
  • The type of joins you’re doing: is it an equijoin (i.e.: tableA.col1 = tableB.col2)? Is it an inner join, an outer join, a cartesian product or a self-join? Some joins can’t work in certain situations.
  • The distribution of data. If the data on the join condition are skewed (For example you’re joining people on their last name but many people have the same), using a hash join will be a disaster because the hash function will create ill-distributed buckets.
  • If you want the join to be executed by multiple threads/process

 

For more information, you can read the DB2, ORACLE or SQL Server documentation.

 

Simplified example

We’ve just seen 3 types of join operations.

Now let’s say we need to join 5 tables to have a full view of a person. A PERSON can have:

  • multiple MOBILES
  • multiple MAILS
  • multiple ADRESSES
  • multiple BANK_ACCOUNTS

In other words we need a quick answer for the following query:

SELECT * from PERSON, MOBILES, MAILS, ADRESSES, BANK_ACCOUNTS
WHERE
PERSON.PERSON_ID = MOBILES.PERSON_ID
AND PERSON.PERSON_ID = MAILS.PERSON_ID
AND PERSON.PERSON_ID = ADRESSES.PERSON_ID
AND PERSON.PERSON_ID = BANK_ACCOUNTS.PERSON_ID

As a query optimizer, I have to find the best way to process the data. But there are 2 problems:

  • What kind of join should I use for each join?

I have 3 possible joins (Hash Join, Merge Join, Nested Join) with the possibility to use 0,1 or 2 indexes (not to mention that there are different types of indexes).

  • What order should I choose to compute the join?

For example, the following figure shows different possible plans for only 3 joins on 4 tables

join ordering optimization problem in a database

 

So here are my possibilities:

  • 1) I use a brute force approach

Using the database statistics, I compute the cost for every possible plan and I keep the best one. But there are many possibilities. For a given order of joins, each join has 3 possibilities: HashJoin, MergeJoin, NestedJoin. So, for a given order of joins there are 3^4 possibilities. The join ordering is a permutation problem on a binary tree and there are (2*4)!/(4+1)! possible orders. For this very simplified problem, I end up with 3^4 * (2*4)!/(4+1)! possibilities.

In non-geek terms, it means 27 216 possible plans. If I now add the possibility for the merge join to take 0,1 or 2 B+Tree indexes, the number of possible plans becomes 210 000. Did I forget to mention that this query is VERY SIMPLE?
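If you want to double-check these numbers, here is a tiny Java snippet that reproduces the counting above (3 or 5 access choices per join, times the (2*4)!/(4+1)! possible orders):

public class PlanCount {
  static long factorial(int n) {
    long f = 1;
    for (int i = 2; i <= n; i++) f *= i;
    return f;
  }

  public static void main(String[] args) {
    int joins = 4;
    long orderings = factorial(2 * joins) / factorial(joins + 1); // (2*4)!/(4+1)! = 336
    long threeAlgos = 1, fiveAlgos = 1;
    for (int k = 0; k < joins; k++) { threeAlgos *= 3; fiveAlgos *= 5; }
    System.out.println(threeAlgos * orderings); // 3 join algorithms per join         -> 27216
    System.out.println(fiveAlgos * orderings);  // + 0/1/2 indexes for the merge join -> 210000
  }
}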

  • 2) I cry and quit this job

It’s very tempting but you wouldn’t get your result and I need money to pay the bills.

  • 3) I only try a few plans and take the one with the lowest cost.

Since I’m not superman, I can’t compute the cost of every plan. Instead, I can arbitrarily choose a subset of all the possible plans, compute their costs and give you the best plan of this subset.

  • 4) I apply smart rules to reduce the number of possible plans.

There are 2 types of rules:

I can use “logical” rules that will remove useless possibilities but they won’t filter a lot of possible plans. For example: “the inner relation of the nested loop join must be the smallest data set”

I accept not finding the best solution and apply more aggressive rules that greatly reduce the number of possibilities. For example: “If a relation is small, use a nested loop join and never use a merge join or a hash join”.

 

In this simple example, I end up with many possibilities. But a real query can have other relational operators like OUTER JOIN, CROSS JOIN, GROUP BY, ORDER BY, PROJECTION, UNION, INTERSECT, DISTINCT … which means even more possibilities.

So, how does a database do it?

 

Dynamic programming, greedy algorithm and heuristic

A relational database uses the multiple approaches I’ve just described. The real job of an optimizer is to find a good solution in a limited amount of time.

Most of the time an optimizer doesn’t find the best solution but a “good” one.

For small queries, doing a brute force approach is possible. But there is a way to avoid unnecessary computations so that even medium queries can use the brute force approach. This is called dynamic programming.

 

Dynamic Programming

The idea behind these 2 words is that many execution plans are very similar. If you look at the following plans:

overlapping trees optimization dynamic programming

They share the same (A JOIN B) subtree. So, instead of computing the cost of this subtree in every plan, we can compute it once, save the computed cost and reuse it when we see this subtree again. More formally, we’re facing an overlapping subproblem. To avoid the extra computation of the partial results, we use memoization.

Using this technique, instead of having a (2*N)!/(N+1)! time complexity, we “just” have 3^N. In our previous example with 4 joins, it means going from 336 orderings to 81. If you take a bigger query with 8 joins (which is not big), it means going from 57 657 600 to 6561.

 

For the CS geeks, here is an algorithm I found in the formal course I already gave you. I won’t explain this algorithm so read it only if you already know dynamic programming or if you’re good with algorithms (you’ve been warned!):

procedure findbestplan(S)
  if (bestplan[S].cost != infinite)
    return bestplan[S]   // bestplan[S] has already been computed: reuse it (memoization)
  // else bestplan[S] has not been computed earlier, compute it now
  if (S contains only 1 relation)
    set bestplan[S].plan and bestplan[S].cost based on the best way
    of accessing S  /* using selections on S and indices on S */
  else for each non-empty subset S1 of S such that S1 != S
    P1 = findbestplan(S1)
    P2 = findbestplan(S - S1)
    A = best algorithm for joining the results of P1 and P2
    cost = P1.cost + P2.cost + cost of A
    if cost < bestplan[S].cost
      bestplan[S].cost = cost
      bestplan[S].plan = “execute P1.plan; execute P2.plan;
                          join the results of P1 and P2 using A”
  return bestplan[S]

 

For bigger queries you can still do a dynamic programming approach but with extra rules (or heuristics) to remove possibilities:

  • If we analyze only a certain type of plan (for example: the left-deep trees) we end up with n*2^n instead of 3^n

left deep tree example

  • If we add logical rules to avoid plans for some patterns (like “if a table has an index for the given predicate, don’t try a merge join on the table but only on the index”) it will reduce the number of possibilities without hurting the best possible solution too much.
  • If we add rules on the flow (like “perform the join operations BEFORE all the other relational operations”) it also reduces a lot of possibilities.

 

Greedy algorithms

But for a very big query, or to have a very fast answer (but not a very fast query), another type of algorithm is used: greedy algorithms.

The idea is to follow a rule (or heuristic) to build a query plan in an incremental way. With this rule, a greedy algorithm finds the best solution to a problem one step at a time. The algorithm starts the query plan with one JOIN. Then, at each step, the algorithm adds a new JOIN to the query plan using the same rule.

 

Let’s take a simple example. Let’s say we have a query with 4 joins on 5 tables (A, B, C, D and E). To simplify the problem, we only consider the nested loop join. Let’s use the rule “use the join with the lowest cost”.

  • we arbitrarily start with one of the 5 tables (let’s choose A)
  • we compute the cost of every join with A (A being the inner or outer relation).
  • we find that A JOIN B gives the lowest cost.
  • we then compute the cost of every join with the result of A JOIN B (A JOIN B being the inner or outer relation).
  • we find that (A JOIN B) JOIN C gives the best cost.
  • we then compute the cost of every join with the result of the (A JOIN B) JOIN C …
  • ….
  • At the end we find the plan (((A JOIN B) JOIN C) JOIN D) JOIN E

Since we arbitrarily started with A, we can apply the same algorithm for B, then C, then D, then E. We then keep the plan with the lowest cost.

By the way, this algorithm has a name: it’s called the Nearest neighbor algorithm.

I won’t go into details, but with a good modeling and a sort in N*log(N) this problem can easily be solved. The cost of this algorithm is in O(N*log(N)) vs O(3^N) for the full dynamic programming version. If you have a big query with 20 joins, it means 26 vs 3 486 784 401, a BIG difference!
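For the CS geeks, here is a minimal Java sketch of this nearest-neighbor idea. The cost function is a made-up placeholder (the product of the estimated row counts, i.e. a naive nested loop cost), and so is the intermediate-size estimate; a real optimizer would use its statistics instead:

import java.util.ArrayList;
import java.util.List;

public class GreedyJoinOrdering {

  // Hypothetical cost model: joining two intermediate results costs
  // the product of their estimated row counts (naive nested loop).
  static long joinCost(long leftRows, long rightRows) {
    return leftRows * rightRows;
  }

  // Builds a left-deep join order starting from table `start`,
  // always picking the cheapest next join (nearest neighbor).
  static List<Integer> greedyOrder(long[] tableRows, int start) {
    List<Integer> order = new ArrayList<>();
    boolean[] used = new boolean[tableRows.length];
    order.add(start);
    used[start] = true;
    long currentRows = tableRows[start];

    for (int step = 1; step < tableRows.length; step++) {
      int best = -1;
      long bestCost = Long.MAX_VALUE;
      for (int t = 0; t < tableRows.length; t++) {
        if (!used[t] && joinCost(currentRows, tableRows[t]) < bestCost) {
          bestCost = joinCost(currentRows, tableRows[t]);
          best = t;
        }
      }
      order.add(best);
      used[best] = true;
      // crude estimate of the intermediate result size; a real optimizer
      // would use selectivity statistics here
      currentRows = Math.max(1, currentRows * tableRows[best] / 100);
    }
    return order;
  }

  public static void main(String[] args) {
    long[] rows = {1000, 50, 200000, 30, 800}; // estimated sizes of A, B, C, D, E
    // try every starting table and keep the cheapest full plan
    System.out.println(greedyOrder(rows, 0)); // prints [0, 3, 1, 4, 2] with this cost model
  }
}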

 

The problem with this algorithm is that we assume that finding the best join between 2 tables, keeping this join, and then adding the next best join will give us the best overall cost. But:

  • even if A JOIN B gives the best cost between A, B and C
  • (A JOIN C) JOIN B might give a better result than (A JOIN B) JOIN C.

To improve the result, you can run multiple greedy algorithms using different rules and keep the best plan.

 

Other algorithms

[If you’re already fed up with algorithms, skip to the next part, what I’m going to say is not important for the rest of the article]

The problem of finding the best possible plan is an active research topic for many CS researchers. They often try to find better solutions for more precise problems/patterns. For example,

  • if the query is a star join (it’s a certain type of multiple-join query), some databases will use a specific algorithm.
  • if the query is a parallel query, some databases will use a specific algorithm

 

Other algorithms are also studied to replace dynamic programming for large queries. Greedy algorithms belong to a larger family called heuristic algorithms. A greedy algorithm follows a rule (or heuristic), keeps the solution it found at the previous step and “appends” it to find the solution for the current step. Other algorithms follow a rule and apply it in a step-by-step way but don’t always keep the best solution found at the previous step; they are also called heuristic algorithms.

For example, genetic algorithms follow a rule but the best solution of the last step is not necessarily kept:

  • A solution represents a possible full query plan
  • Instead of one solution (i.e. plan) there are P solutions (i.e. plans) kept at each step.
  • 0) P query plans are randomly created
  • 1) Only the plans with the best costs are kept
  • 2) These best plans are mixed up to produce P new plans
  • 3) Some of the P new plans are randomly modified
  • 4) Steps 1, 2 and 3 are repeated T times
  • 5) Then you keep the best plan from the P plans of the last loop.

The more loops you do the better the plan will be.

Is it magic? No, it’s the laws of nature: only the fittest survives!

FYI, genetic algorithms are implemented in PostgreSQL but I wasn’t able to find if they’re used by default.

There are other heuristic algorithms used in databases like Simulated Annealing, Iterative Improvement, Two-Phase Optimization… But I don’t know if they’re currently used in enterprise databases or if they’re only used in research databases.

For more information, you can read the following research article that presents more possible algorithms: Review of Algorithms for the Join Ordering Problem in Database Query Optimization

 

Real optimizers

[You can skip to the next part, what I’m going to say is not important]

But, all this blabla is very theoretical. Since I’m a developer and not a researcher, I like concrete examples.

Let’s see how the SQLite optimizer works. It’s a light database so it uses a simple optimization based on a greedy algorithm with extra-rules to limit the number of possibilities:

  • SQLite chooses to never reorder tables in a CROSS JOIN operator
  • joins are implemented as nested joins
  • outer joins are always evaluated in the order in which they occur
  • Prior to version 3.8.0, SQLite used the “Nearest Neighbor” greedy algorithm when searching for the best query plan

Wait a minute … we’ve already seen this algorithm! What a coincidence!

  • Since version 3.8.0 (released in 2013), SQLite uses the “N Nearest Neighbors” greedy algorithm when searching for the best query plan

 

Let’s see how another optimizer does its job. IBM DB2 is an enterprise database like the others, but I’ll focus on this one since it’s the last one I really used before switching to Big Data.

If we look at the official documentation, we learn that the DB2 optimizer lets you use 7 different levels of optimization:

  • Use greedy algorithms for the joins
    • 0 – minimal optimization, use index scan and nested-loop join and avoid some Query Rewrite
    • 1 – low optimization
    • 2 – full optimization
  • Use dynamic programming for the joins
    • 3 – moderate optimization and rough approximation
    • 5 – full optimization, uses all techniques with heuristics
    • 7 – full optimization similar to 5, without heuristics
    • 9 – maximal optimization, sparing no effort/expense; considers all possible join orders, including Cartesian products

We can see that DB2 uses greedy algorithms and dynamic programming. Of course, they don’t share the heuristics they use since the query optimizer is the main power of a database.

FYI, the default level is 5. By default the optimizer uses the following characteristics:

  • All available statistics, including frequent-value and quantile statistics, are used.
  • All query rewrite rules (including materialized query table routing) are applied, except computationally intensive rules that are applicable only in very rare cases.
  • Dynamic programming join enumeration is used, with:
    • Limited use of composite inner relation
    • Limited use of Cartesian products for star schemas involving lookup tables
  • A wide range of access methods is considered, including list prefetch (note: we’ll see what this means later), index ANDing (note: a special operation with indexes), and materialized query table routing.

By default, DB2 uses dynamic programming limited by heuristics for the join ordering.

The other operations (GROUP BY, DISTINCT…) are handled by simple rules.

 

Query Plan Cache

Since the creation of a plan takes time, most databases store plans in a query plan cache to avoid useless re-computations of the same query plan. It’s kind of a big topic since the database needs to know when to update outdated plans. The idea is to set a threshold, and if the statistics of a table have changed beyond this threshold, the query plans involving this table are purged from the cache.
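To give you a rough (and purely illustrative) idea, a plan cache can be as simple as a map from the query text to its compiled plan, purged when the statistics of an involved table drift too much. This is my own toy sketch, not how any particular database implements it:

import java.util.HashMap;
import java.util.Map;

public class QueryPlanCache {

  // Hypothetical cached entry: the compiled plan plus the table row count
  // the optimizer saw when it built the plan.
  static class CachedPlan {
    final Object compiledPlan;         // would be an operator tree in a real database
    final long rowCountAtCompileTime;
    CachedPlan(Object compiledPlan, long rowCountAtCompileTime) {
      this.compiledPlan = compiledPlan;
      this.rowCountAtCompileTime = rowCountAtCompileTime;
    }
  }

  private static final double CHANGE_THRESHOLD = 0.2; // purge if the table changed by more than 20%
  private final Map<String, CachedPlan> cache = new HashMap<>();

  // Returns the cached plan, or null if it is missing or considered outdated.
  Object get(String queryText, long currentRowCount) {
    CachedPlan entry = cache.get(queryText);
    if (entry == null) return null;
    double drift = Math.abs(currentRowCount - entry.rowCountAtCompileTime)
                   / (double) Math.max(entry.rowCountAtCompileTime, 1);
    if (drift > CHANGE_THRESHOLD) {
      cache.remove(queryText); // statistics changed too much: the plan must be recomputed
      return null;
    }
    return entry.compiledPlan;
  }

  void put(String queryText, Object plan, long rowCountAtCompileTime) {
    cache.put(queryText, new CachedPlan(plan, rowCountAtCompileTime));
  }
}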

 

Query executor

At this stage we have an optimized execution plan. This plan is compiled into executable code. Then, if there are enough resources (memory, CPU), it is executed by the query executor. The operators in the plan (JOIN, SORT BY…) can be executed in a sequential or parallel way; it’s up to the executor. To get and write its data, the query executor interacts with the data manager, which is the next part of the article.

 

Data manager

data manager in databases

At this step, the query manager is executing the query and needs the data from the tables and indexes. It asks the data manager to get the data, but there are 2 problems:

  • Relational databases use a transactional model. So, you can’t get any data at any time because someone else might be using/modifying the data at the same time.
  • Data retrieval is the slowest operation in a database, therefore the data manager needs to be smart enough to get and keep data in memory buffers.

In this part, we’ll see how relational databases handle these 2 problems. I won’t talk about the way the data manager gets its data because it’s not the most important (and this article is long enough!).

 

Cache manager

As I already said, the main bottleneck of databases is disk I/O. To improve performance, modern databases use a cache manager.

cache manager in databases

Instead of directly getting the data from the file system, the query executor asks the cache manager for the data. The cache manager has an in-memory cache called the buffer pool. Getting data from memory dramatically speeds up a database. It’s difficult to give an order of magnitude because it depends on the operation you need to do:

  • sequential access (ex: full scan) vs random access (ex: access by row id),
  • read vs write

and the type of disks used by the database:

  • 7.2k/10k/15k rpm HDD
  • SSD
  • RAID 1/5/…

but I’d say memory is 100 to 100k times faster than disk.

But, this leads to another problem (as always with databases…). The cache manager needs to get the data in memory BEFORE the query executor uses them; otherwise the query manager has to wait for the data from the slow disks.

 

Prefetching

The answer to this problem is called prefetching. A query executor knows the data it’ll need because it knows the full flow of the query and has knowledge of the data on disk thanks to the statistics. Here is the idea:

  • When the query executor is processing its first bunch of data
  • It asks the cache manager to pre-load the second bunch of data
  • When it starts processing the second bunch of data
  • It asks the CM to pre-load the third bunch and informs the CM that the first bunch can be purged from cache.

The CM stores all these data in its buffer pool. In order to know if a piece of data is still needed, the cache manager adds extra information about the cached data (called a latch).

 

Sometimes the query executor doesn’t know what data it’ll need, and some databases don’t provide this functionality. Instead, they use speculative prefetching (for example: if the query executor asked for data 1, 3, 5 it’ll likely ask for 7, 9, 11 in the near future) or sequential prefetching (in this case the CM simply loads from disk the next contiguous data after the ones asked for).

 

To monitor how well the prefetching is working, modern databases provide a metric called the buffer/cache hit ratio. The hit ratio shows how often requested data is found in the buffer cache without requiring disk access.

Note: a poor cache hit ratio doesn’t always mean that the cache is working badly. For more information, you can read the Oracle documentation.

 

But, a buffer is a limited amount of memory. Therefore, it needs to remove some data to be able to load new ones. Loading and purging the cache has a cost in terms of disk and network I/O. If you have a query that is often executed, it wouldn’t be efficient to always load then purge the data used by this query. To handle this problem, modern databases use a buffer replacement strategy.

 

Buffer-Replacement strategies

Most modern databases (at least SQL Server, MySQL, Oracle and DB2) use an LRU algorithm.

 

LRU

LRU stands for Least Recently Used. The idea behind this algorithm is to keep in the cache the data that have been recently used and, therefore, are more likely to be used again.

Here is a visual example:

LRU algorithm in a database

For the sake of comprehension, I’ll assume that the data in the buffer are not locked by latches (and therefore can be removed). In this simple example the buffer can store 3 elements:

  • 1: the cache manager uses the data 1 and puts the data into the empty buffer
  • 2: the CM uses the data 4 and puts the data into the half-loaded buffer
  • 3: the CM uses the data 3 and puts the data into the half-loaded buffer
  • 4: the CM uses the data 9. The buffer is full so data 1 is removed since it’s the least recently used data. Data 9 is added into the buffer
  • 5: the CM uses the data 4. Data 4 is already in the buffer therefore it becomes the most recently used data again.
  • 6: the CM uses the data 1. The buffer is full so data 3 is removed since it’s now the least recently used data. Data 1 is added into the buffer

This algorithm works well but there are some limitations. What if there is a full scan on a large table? In other words, what happens when the size of the table/index is above the size of the buffer? Using this algorithm will remove all the previous values in the cache whereas the data from the full scan are likely to be used only once.
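Before looking at the improvements, here is a toy Java version of the basic policy, using a LinkedHashMap in access order (this only sketches the eviction rule, not a real buffer pool with pages, latches and dirty data):

import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU buffer: when the capacity is exceeded, the least
// recently used entry is evicted automatically.
public class LruBuffer<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  public LruBuffer(int capacity) {
    super(capacity, 0.75f, true); // true = iterate in access order, not insertion order
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity; // evict the least recently used entry when full
  }

  public static void main(String[] args) {
    LruBuffer<Integer, String> buffer = new LruBuffer<>(3);
    buffer.put(1, "data 1");
    buffer.put(4, "data 4");
    buffer.put(3, "data 3");
    buffer.put(9, "data 9");             // buffer full: data 1 (least recently used) is evicted
    buffer.get(4);                       // data 4 becomes the most recently used
    System.out.println(buffer.keySet()); // prints [3, 9, 4], from least to most recently used
  }
}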

 

Improvements

To prevent this from happening, some databases add specific rules. For example, according to the Oracle documentation:

“For very large tables, the database typically uses a?direct path read, which loads blocks directly […], to avoid populating the buffer cache. For medium size tables, the database may use a direct read or a cache read. If it decides to use a cache read, then the database places the blocks at the end of the LRU list to prevent the scan from effectively cleaning out the buffer cache.”

There are other possibilities like using an advanced version of LRU called LRU-K. For example, SQL Server uses LRU-K with K=2.

The idea behind this algorithm is to take into account more history. With simple LRU (which is also LRU-K for K=1), the algorithm only takes into account the last time the data was used. With LRU-K:

  • It takes into account the K last times the data was used.
  • A weight is put on the number of times the data was used
  • If a bunch of new data is loaded into the cache, the old but often used data are not removed (because their weights are higher).
  • But the algorithm can’t keep old data in the cache if they aren’t used anymore.
  • So the weights decrease over time if the data is not used.

Computing the weight is costly, which is why SQL Server only uses K=2: this value performs well for an acceptable overhead.

For a more in-depth knowledge of LRU-K, you can read the original research paper (1993): The LRU-K page replacement algorithm for database disk buffering.

 

Other algorithms

Of course there are other algorithms to manage cache like

  • 2Q (a LRU-K like algorithm)
  • CLOCK (a LRU-K like algorithm)
  • MRU (most recently used, uses the same logic as LRU but with another rule)
  • LRFU (Least Recently and Frequently Used)

Some databases let you use an algorithm other than the default one.

 

Write buffer

I only talked about read buffers that load data before they are used. But a database also has write buffers that store data and flush it to disk in batches, instead of writing each piece of data one by one and producing many single disk accesses.

 

Keep in mind that buffers store pages (the smallest unit of data) and not rows (which is a logical/human way to see data). A page in a buffer pool is dirty if the page has been modified and not yet written to disk. There are multiple algorithms to decide the best time to write the dirty pages on disk, but it’s highly linked to the notion of transaction, which is the next part of the article.

 

Transaction manager

Last but not least, this part is about the transaction manager. We’ll see how this process ensures that each query is executed in its own transaction. But before that, we need to understand the concept of ACID transactions.

 

I’m on acid

An ACID transaction is a unit of work that ensures 4 things:

  • Atomicity: the transaction is “all or nothing”, even if it lasts 10 hours. If the transaction crashes, the state goes back to before the transaction (the transaction is rolled back).
  • Isolation: if 2 transactions A and B run at the same time, the result of transactions A and B must be the same whether A finishes before/after/during transaction B.
  • Durability: once the transaction is committed (i.e. ends successfully), the data stay in the database no matter what happens (crash or error).
  • Consistency: only valid data (in terms of relational constraints and functional constraints) are written to the database. The consistency is related to atomicity and isolation.

 

one dollar

During the same transaction, you can run multiple SQL queries to read, create, update and delete data. The mess begins when two transactions are using the same data. The classic example is a money transfer from an account A to an account B. Imagine you have 2 transactions:

  • Transaction 1 that takes 100$ from account A and gives them to account B
  • Transaction 2 that takes 50$ from account A and gives them to account B

If we go back to the ACID properties:

  • Atomicity ensures that no matter what happens during T1 (a server crash, a network failure …), you can’t end up in a situation where the 100$ are withdrawn from A and not given to B (this case is an inconsistent state).
  • Isolation ensures that if T1 and T2 happen at the same time, in the end A will have lost 150$ and B gained 150$, and not, for example, A losing 150$ while B gains only 50$ because T2 partially erased the actions of T1 (this case is also an inconsistent state).
  • Durability ensures that T1 won’t disappear into thin air if the database crashes just after T1 is committed.
  • Consistency ensures that no money is created or destroyed in the system.
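To make this concrete with Java/JDBC, here is a minimal sketch of transaction 1 (the ACCOUNT table and its columns are made up for the example): either both updates are committed together, or everything is rolled back.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MoneyTransfer {

  // Transfers `amount` from account `from` to account `to` atomically.
  static void transfer(Connection con, String from, String to, int amount) throws SQLException {
    con.setAutoCommit(false); // start an explicit transaction
    try (PreparedStatement debit =
             con.prepareStatement("UPDATE ACCOUNT SET BALANCE = BALANCE - ? WHERE ID = ?");
         PreparedStatement credit =
             con.prepareStatement("UPDATE ACCOUNT SET BALANCE = BALANCE + ? WHERE ID = ?")) {
      debit.setInt(1, amount);
      debit.setString(2, from);
      debit.executeUpdate();

      credit.setInt(1, amount);
      credit.setString(2, to);
      credit.executeUpdate();

      con.commit();   // both updates become durable together
    } catch (SQLException e) {
      con.rollback(); // atomicity: the debit is undone if the credit fails
      throw e;
    }
  }
}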

 

[You can skip to the next part if you want, what I’m going to say is not important for the rest of the article]

Many modern databases don’t use pure isolation as their default behavior because it comes with a huge performance overhead. The SQL standard defines 4 levels of isolation:

  • Serializable (default behaviour in SQLite): The highest level of isolation. Two transactions happening at the same time are 100% isolated. Each transaction has its own “world”.
  • Repeatable read (default behavior in MySQL): Each transaction has its own “world” except in one situation. If a transaction ends successfully and adds new data, these data will be visible to the other, still-running transactions. But if a transaction modifies existing data and ends successfully, the modification won’t be visible to the still-running transactions. So, this break of isolation between transactions only concerns new data, not the existing data.

For example, if a transaction A does a “SELECT count(1) from TABLE_X” and then a new row is added and committed in TABLE_X by transaction B, the value won’t be the same if transaction A runs count(1) again.

This is called a phantom read.

  • Read committed (default behavior in Oracle, PostgreSQL and SQL Server): It’s a repeatable read + a new break of isolation. If a transaction A reads a data D and then this data is modified (or deleted) and committed by a transaction B, if A reads data D again it will see the modification (or deletion) made by B on the data.

This is called a non-repeatable read.

  • Read uncommitted: the lowest level of isolation. It’s read committed + a new break of isolation. If a transaction A reads a data D and then this data D is modified by a transaction B (that is not committed and still running), if A reads data D again it will see the modified value. If transaction B is rolled back, then the data D read by A the second time makes no sense, since it has been modified by a transaction B that never happened (since it was rolled back).

This is called a dirty read.

 

Most databases add their own custom levels of isolation (like the snapshot isolation used by PostgreSQL, Oracle and SQL Server). Moreover, most databases don’t implement all the levels of the SQL standard (especially the read uncommitted level).

The default level of isolation can be overridden by the user/developer at the beginning of the connection (it’s a very simple line of code to add).
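For example, with Java/JDBC it’s a single call on the connection (the constants map to the standard levels described above; not every database supports all of them):

import java.sql.Connection;
import java.sql.SQLException;

public class IsolationExample {
  static void useSerializable(Connection con) throws SQLException {
    // Override the database's default isolation level for this connection.
    con.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    // Other constants: TRANSACTION_REPEATABLE_READ, TRANSACTION_READ_COMMITTED,
    // TRANSACTION_READ_UNCOMMITTED.
  }
}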

 

Concurrency Control

The real issue in ensuring isolation, coherency and atomicity is the write operations on the same data (add, update and delete):

  • if all transactions are only reading data, they can work at the same time without modifying the behavior of another transaction.
  • if (at least) one of the transactions is modifying a data read by other transactions, the database needs to find a way to hide this modification from the other transactions. Moreover, it also needs to ensure that this modification won’t be erased by another transaction that didn’t see the modified data.

This problem is called concurrency control.

The easiest way to solve this problem is to run each transaction one by one (i.e. sequentially). But that doesn’t scale at all: only one core of a multi-processor/multi-core server would be working, which is not very efficient…

The ideal way to solve this problem is, every time a transaction is created or cancelled:

  • to monitor all the operations of all the transactions
  • to check if the parts of 2 (or more) transactions are in conflict because they’re reading/modifying the same data.
  • to reorder the operations inside the conflicting transactions to reduce the size of the conflicting parts
  • to execute the conflicting parts in a certain order (while the non-conflicting transactions are still running concurrently).
  • to take into account that a transaction can be cancelled.

More formally it’s a scheduling problem with conflicting schedules. More concretely, it’s a very difficult and CPU-expensive optimization problem. Enterprise databases can’t afford to wait hours to find the best schedule for each new transaction event. Therefore, they use less ideal approaches that lead to more time wasted between conflicting transactions.

 

Lock manager

To handle this problem, most databases are using locks and/or data versioning. Since it’s a big topic, I’ll focus on the locking part then I’ll speak a little bit about data versioning.

 

Pessimistic locking

The idea behind locking is:

  • if a transaction needs a data,
  • it locks the data
  • if another transaction also needs this data,
  • it’ll have to wait until the first transaction releases the data.

This is called an exclusive lock.

But using an exclusive lock for a transaction that only needs to read a data is very expensive since it forces other transactions that only want to read the same data to wait. This is why there is another type of lock, the shared lock.

With the shared lock:

  • if a transaction needs only to read a data A,
  • it “shared locks” the data and reads the data
  • if a second transaction also needs only to read data A,
  • it “shared locks” the data and reads the data
  • if a third transaction needs to modify data A,
  • it “exclusive locks” the data but it has to wait until the 2 other transactions release their shared locks to apply its exclusive lock on data A.

Still, if a piece of data has an exclusive lock on it, a transaction that just needs to read the data will have to wait for the end of the exclusive lock before putting a shared lock on the data.

lock manager in a database

The lock manager is the process that grants and releases locks. Internally, it stores the locks in a hash table (where the key is the data to lock) and knows, for each piece of data:

  • which transactions are locking the data
  • which transactions are waiting for the data
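If it helps, here is a very naive Java sketch of such a structure, granting shared and exclusive locks (it only answers “granted or not”; a real lock manager also handles waiting queues, fairness, lock granularities and escalation):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimpleLockManager {

  // For each piece of data: who holds shared locks, and who (if anyone) holds the exclusive lock.
  private final Map<String, Set<String>> sharedHolders = new HashMap<>();
  private final Map<String, String> exclusiveHolder = new HashMap<>();

  // Returns true if the shared lock is granted, false if the transaction has to wait.
  synchronized boolean acquireShared(String txId, String dataId) {
    String exclusive = exclusiveHolder.get(dataId);
    if (exclusive != null && !exclusive.equals(txId)) return false; // someone is writing: wait
    sharedHolders.computeIfAbsent(dataId, k -> new HashSet<>()).add(txId);
    return true;
  }

  // Returns true if the exclusive lock is granted, false if the transaction has to wait.
  synchronized boolean acquireExclusive(String txId, String dataId) {
    String exclusive = exclusiveHolder.get(dataId);
    if (exclusive != null && !exclusive.equals(txId)) return false;
    Set<String> readers = sharedHolders.getOrDefault(dataId, Set.of());
    // wait until all the other readers have released their shared locks
    if (!readers.isEmpty() && !(readers.size() == 1 && readers.contains(txId))) return false;
    exclusiveHolder.put(dataId, txId);
    return true;
  }

  synchronized void release(String txId, String dataId) {
    sharedHolders.getOrDefault(dataId, new HashSet<>()).remove(txId);
    if (txId.equals(exclusiveHolder.get(dataId))) exclusiveHolder.remove(dataId);
  }
}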

 

Deadlock

But the use of locks can lead to a situation where 2 transactions are waiting forever for a data:

deadlock with database transactions

In this figure:

  • transaction A has an exclusive lock on data1 and is waiting to get data2
  • transaction B has an exclusive lock on data2 and is waiting to get data1

This is called a deadlock.

During a deadlock, the lock manager chooses which transaction to cancel (rollback) in order to remove the deadlock. This decision is not easy:

  • Is it better to kill the transaction that modified the least amount of data (and therefore that will produce the least expensive rollback)?
  • Is it better to kill the youngest transaction because the users of the other transactions have waited longer?
  • Is it better to kill the transaction that will take less time to finish (and avoid a possible starvation)?
  • In case of rollback, how many transactions will be impacted by this rollback?

 

But before making this choice, it needs to check if there are deadlocks.

The hash table can be seen as a graph (like in the previous figures). There is a deadlock if there is a cycle in the graph. Since it’s expensive to check for cycles (because the graph with all the locks is quite big), a simpler approach is often used: a timeout. If a lock is not granted within this timeout, the transaction enters a deadlock state.
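For the record, here is a tiny sketch of the “graph” approach: build the wait-for graph (transaction A waits for transaction B) and look for a cycle with a depth-first search. This is my own illustration, not how any specific lock manager does it:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeadlockDetector {

  // waitsFor.get(T) = the transactions that T is waiting for.
  private final Map<String, List<String>> waitsFor = new HashMap<>();

  public DeadlockDetector(Map<String, List<String>> waitsFor) {
    this.waitsFor.putAll(waitsFor);
  }

  // There is a deadlock if the wait-for graph contains a cycle.
  boolean hasDeadlock() {
    Set<String> visited = new HashSet<>();
    Set<String> onStack = new HashSet<>();
    for (String tx : waitsFor.keySet()) {
      if (hasCycle(tx, visited, onStack)) return true;
    }
    return false;
  }

  private boolean hasCycle(String tx, Set<String> visited, Set<String> onStack) {
    if (onStack.contains(tx)) return true;   // we came back to a transaction on the current path
    if (visited.contains(tx)) return false;  // already fully explored, no cycle through it
    visited.add(tx);
    onStack.add(tx);
    for (String next : waitsFor.getOrDefault(tx, List.of())) {
      if (hasCycle(next, visited, onStack)) return true;
    }
    onStack.remove(tx);
    return false;
  }

  public static void main(String[] args) {
    // A waits for B (data2), B waits for A (data1): deadlock.
    System.out.println(new DeadlockDetector(
        Map.of("A", List.of("B"), "B", List.of("A"))).hasDeadlock()); // prints true
  }
}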

 

The lock manager can also check before giving a lock if this lock will create a deadlock. But again it’s computationally expensive to do it perfectly. Therefore, these pre-checks are often a set of basic rules.

 

Two-phase locking

The simplest way to ensure pure isolation is to acquire all the locks at the beginning of the transaction and release them at the end of the transaction. This means that a transaction has to wait for all its locks before it starts, and the locks held by a transaction are released when the transaction ends. It works, but it wastes a lot of time waiting for all the locks.

A faster way is the Two-Phase Locking Protocol (used by DB2 and SQL Server) where a transaction is divided into 2 phases:

  • the growing phase where a transaction can obtain locks, but can’t release any lock.
  • the shrinking phase where a transaction can release locks (on the data it has already processed and won’t process again), but can’t obtain new locks.

 

a problem avoided with two phase locking

The idea behind these 2 simple rules is:

  • to release the locks that aren’t used anymore to reduce the wait time of other transactions waiting for these locks
  • to prevent cases where a transaction reads data modified after the transaction started, data that therefore wouldn’t be coherent with the first data the transaction acquired.

 

This protocol works well except if a transaction that modified a data and released the associated lock is cancelled (rolled back). You could end up in a case where another transaction reads the modified value whereas this value is going to be rolled back. To avoid this problem, all the exclusive locks must be released at the end of the transaction.

 

A few words

Of course a real database uses a more sophisticated system involving more types of locks (like intention locks) and more granularities (locks on a row, on a page, on a partition, on a table, on a tablespace) but the idea remains the same.

I only presented the pure lock-based approach. Data versioning is another way to deal with this problem.

The idea behind versioning is that:

  • every transaction can modify the same data at the same time
  • each transaction has its own copy (or version) of the data
  • if 2 transactions modify the same data, only one modification will be accepted; the other will be refused and the associated transaction will be rolled back (and maybe re-run).

It increases the performance since:

  • reader transactions don’t block writer transactions
  • writer transactions don’t block reader transactions
  • there is no overhead from the “fat and slow” lock manager

Everything is better than locks, except when 2 transactions write the same data. Moreover, you can quickly end up with a huge disk space overhead.

 

Data versioning and locking are two different visions: optimistic locking vs pessimistic locking. They both have pros and cons; it really depends on the use case (more reads vs more writes). For a presentation on data versioning, I recommend this very good presentation on how PostgreSQL implements multiversion concurrency control.

Some databases like DB2 (until DB2 9.7) and SQL Server (except for snapshot isolation) only use locks. Others like PostgreSQL, MySQL and Oracle use a mixed approach involving locks and data versioning. I’m not aware of a database using only data versioning (if you know a database based on pure data versioning, feel free to tell me).

[UPDATE 08/20/2015] I was told by a reader that:

Firebird and Interbase use versioning without record locking.
Versioning has an interesting effect on indexes: sometimes a unique index contains duplicates, the index can have more entries than the table has rows, etc.

 

As you saw in the part on the different levels of isolation, when you increase the isolation level you increase the number of locks and therefore the time wasted by transactions waiting for their locks. This is why most databases don’t use the highest isolation level (Serializable) by default.

As always, you can check by yourself in the documentation of the main databases (for example MySQL, PostgreSQL or Oracle).

 

Log manager

We’ve already seen that to increase its performance, a database stores data in memory buffers. But if the server crashes while a transaction is being committed, you’ll lose the data still in memory during the crash, which breaks the Durability of the transaction.

You can write everything on disk but if the server crashes, you’ll end up with the data half written on disk, which breaks the Atomicity of a transaction.

Any modification written by a transaction must be undone or finished.

To deal with this problem, there are 2 ways:

  • Shadow copies/pages: Each transaction creates its own copy of the database (or just a part of the database) and works on this copy. In case of error, the copy is removed. In case of success, the database instantly switches to the data from the copy with a filesystem trick, then removes the “old” data.
  • Transaction log: A transaction log is a storage space. Before each write on disk, the database writes a record in the transaction log so that in case of crash/cancel of a transaction, the database knows how to remove (or finish) the unfinished transaction.

 

WAL

The shadow copies/pages approach creates a huge disk overhead when used on large databases involving many transactions. That’s why modern databases use a transaction log. The transaction log must be stored on stable storage. I won’t go deeper into storage technologies, but using (at least) RAID disks is mandatory to protect against a disk failure.

Most databases (at least Oracle, SQL Server, DB2, PostgreSQL, MySQL and SQLite) deal with the transaction log using the Write-Ahead Logging protocol (WAL). The WAL protocol is a set of 3 rules:

  • 1) Each modification into the database produces a log record, and the log record must be written into the transaction log before the data is written on disk.
  • 2) The log records must be written in order; a log record A that happens before a log record B must be written before B
  • 3) When a transaction is committed, the commit order must be written in the transaction log before the transaction is reported as successful.

 

log manager in a database

This job is done by a log manager. An easy way to see it is that, between the cache manager and the data access manager (that writes data on disk), the log manager writes every update/delete/create/commit/rollback in the transaction log before they’re written on disk. Easy, right?

 

WRONG ANSWER! After all we’ve been through, you should know that everything related to a database is cursed by the “database effect”. More seriously, the problem is to find a way to write logs while keeping good performances. If the writes on the transaction log are too slow they will slow down everything.

 

ARIES

In 1992, IBM researchers “invented” an enhanced version of WAL called ARIES. ARIES is more or less used by most modern databases. The logic might not be the same but the concepts behind ARIES are used everywhere. I put quotes around “invented” because, according to this MIT course, the IBM researchers did “nothing more than writing the good practices of transaction recovery”. Since I was 5 when the ARIES paper was published, I don’t care about this old gossip from bitter researchers. In fact, I only put this info to give you a break before we start this last technical part. I’ve read a huge part of the research paper on ARIES and I find it very interesting! In this part I’ll only give you an overview of ARIES, but I strongly recommend reading the paper if you want real knowledge.

 

ARIES stands for Algorithms for Recovery and Isolation Exploiting Semantics.

The aim of this technique is twofold:

  • 1) Having good performances when writing logs
  • 2) Having a fast and reliable recovery

 

There are multiple reasons a database has to rollback a transaction:

  • Because the user cancelled it
  • Because of server or network failures
  • Because the transaction has broken the integrity of the database (for example you have a UNIQUE constraint on a column and the transaction adds a duplicate)
  • Because of deadlocks

 

Sometimes (for example, in case of network failure), the database can recover the transaction.

How is that possible? To answer this question, we need to understand the information stored in a log record.

 

The logs

Each operation (add/remove/modify) during a transaction produces a log. This log record is composed of:

  • LSN: A unique Log Sequence Number. This LSN is given in a chronological order*. This means that if an operation A happened before an operation B the LSN of log A will be lower than the LSN of log B.
  • TransID: the id of the transaction that produced the operation.
  • PageID: the location on disk of the modified data. The minimum amount of data on disk is a page so the location of the data is the location of the page that contains the data.
  • PrevLSN: A link to the previous log record produced by the same transaction.
  • UNDO: a way to remove the effect of the operation

For example, if the operation is an update, the UNDO will store either the value/state of the updated element before the update (physical UNDO) or the reverse operation to go back at the previous state (logical UNDO)**.

  • REDO: a way to replay the operation

Likewise, there are 2 ways to do that. Either you store the value/state of the element after the operation or the operation itself to replay it.

  • …: (FYI, an ARIES log has 2 other fields: the UndoNxtLSN and the Type).

 

Moreover, each page on disk (that stores the data, not the log) has the id of the log record (LSN) of the last operation that modified the data.

*The way the LSN is given is more complicated because it is linked to the way the logs are stored. But the idea remains the same.

**ARIES uses only logical UNDO because it’s a real mess to deal with physical UNDO.

Note: From my little knowledge, only PostgreSQL is not using an UNDO. It uses instead a garbage collector daemon that removes the old versions of data. This is linked to the implementation of the data versioning in PostgreSQL.
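If it helps to see these fields as code, here is a rough Java sketch of such a log record (the field names follow the description above; the UNDO/REDO payloads are just string placeholders, not the real ARIES encoding):

public class LogRecord {
  final long lsn;        // unique Log Sequence Number, given in (roughly) chronological order
  final long transId;    // id of the transaction that produced the operation
  final long pageId;     // location on disk of the modified page
  final long prevLsn;    // LSN of the previous log record of the same transaction
  final String undo;     // how to remove the effect of the operation (logical UNDO)
  final String redo;     // how to replay the operation

  LogRecord(long lsn, long transId, long pageId, long prevLsn, String undo, String redo) {
    this.lsn = lsn;
    this.transId = transId;
    this.pageId = pageId;
    this.prevLsn = prevLsn;
    this.undo = undo;
    this.redo = redo;
  }
}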

 

To give you a better idea, here is a visual and simplified example of the log records produced by the query “UPDATE PERSON SET AGE = 18;”. Let’s say this query is executed in transaction 18.

simplified logs of the ARIES protocol

Each log has a unique LSN. The logs that are linked belong to the same transaction. The logs are linked in a chronological order (the last log of the linked list is the log of the last operation).

 

Log Buffer

To avoid that log writing becomes a major bottleneck, a log buffer is used.

log writing process in databases

When the query executor asks for a modification:

  • 1) The cache manager stores the modification in its buffer.
  • 2) The log manager stores the associated log in its buffer.
  • 3) At this step, the query executor considers that the operation is done (and therefore can ask for other modifications)
  • 4) Then (later) the log manager writes the log to the transaction log. The decision of when to write the log is made by an algorithm.
  • 5) Then (later) the cache manager writes the modification to disk. The decision of when to write data to disk is made by an algorithm.

 

When a transaction is committed, it means that for every operation in the transaction, steps 1, 2, 3, 4 and 5 are done. Writing in the transaction log is fast since it’s just “adding a log somewhere in the transaction log”, whereas writing data on disk is more complicated because it’s “writing the data in a way that makes it fast to read”.

 

STEAL and FORCE policies

For performance reasons, step 5 might be done after the commit because in case of a crash it’s still possible to recover the transaction with the REDO logs. This is called a NO-FORCE policy.

A database can choose a FORCE policy (i.e. step 5 must be done before the commit) to lower the workload during the recovery.

Another issue is to choose whether the data are written step-by-step on disk (STEAL policy) or if the buffer manager needs to wait until the commit order to write everything at once (NO-STEAL policy). The choice between STEAL and NO-STEAL depends on what you want: fast writes with a long recovery using the UNDO logs, or a fast recovery?

 

Here is a summary of the impact of these policies on recovery:

  • STEAL/NO-FORCE needs UNDO and REDO: the best performance, but more complex logs and recovery processes (like ARIES). This is the choice made by most databases. Note: I read this fact in multiple research papers and courses but I couldn’t find it stated (explicitly) in the official documentations.
  • STEAL/FORCE needs only UNDO.
  • NO-STEAL/NO-FORCE needs only REDO.
  • NO-STEAL/FORCE needs nothing: the worst performance, and a huge amount of RAM is needed.

 

The recovery part

Ok, so we have nice logs, let’s use them!

Let’s say the new intern has crashed the database (rule n°1: it’s always the intern’s fault). You restart the database and the recovery process begins.

 

ARIES recovers from a crash in three passes:

  • 1) The Analysis pass: The recovery process reads the full transaction log* to recreate the timeline of what was happening during the crash. It determines which transactions to rollback (all the transactions without a commit order are rolled back) and which data needed to be written on disk at the time of the crash.
  • 2) The Redo pass: This pass starts from a log record determined during the analysis, and uses the REDO to bring the database back to the state it was in before the crash.

During the redo phase, the REDO logs are processed in a chronological order (using the LSN).

For each log, the recovery process reads the LSN of the page on disk containing the data to modify.

If LSN(page_on_disk) >= LSN(log_record), it means that the data has already been written to disk before the crash (possibly overwritten by a later operation that happened after this log and before the crash), so nothing is done.

If LSN(page_on_disk) < LSN(log_record), then the page on disk is updated (a small sketch of this decision follows the list).

The redo is done even for the transactions that are going to be rolled back because it simplifies the recovery process (but I’m sure modern databases don’t do that).

  • 3) The Undo pass: This pass rolls back all transactions that were incomplete at the time of the crash. The rollback starts with the last logs of each transaction and processes the UNDO logs in an anti-chronological order (using the PrevLSN of the log records).
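Here is the tiny self-contained sketch of the redo-pass decision promised above; it’s my own illustration of the LSN comparison, not the actual ARIES code:

public class RedoPass {
  // Replays one REDO log record only if its effect is not already on disk.
  // pageLsnOnDisk: the LSN stored on the page that contains the data.
  // recordLsn:     the LSN of the REDO log record being processed.
  static void redoIfNeeded(long pageLsnOnDisk, long recordLsn, Runnable applyRedo) {
    if (pageLsnOnDisk >= recordLsn) {
      return;          // the change (or a later one) is already on disk: nothing to do
    }
    applyRedo.run();   // re-apply the change described by the REDO field of the log
    // the page would then store recordLsn as its new page LSN
  }
}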

 

During the recovery, the transaction log must be informed of the actions made by the recovery process so that the data written on disk stay synchronized with what’s written in the transaction log. A solution could be to remove the log records of the transactions that are being undone, but that’s very difficult. Instead, ARIES writes compensation logs in the transaction log that logically delete the log records of the transactions being removed.

When a transaction is cancelled “manually” or by the lock manager (to stop a deadlock) or just because of a network failure, then the analysis pass is not needed. Indeed, the information about what to REDO and UNDO is available in 2 in-memory tables:

  • a transaction table (stores the state of all current transactions)
  • a dirty page table (stores which data need to be written on disk).

These tables are updated by the cache manager and the transaction manager for each new transaction event. Since they are in-memory, they are destroyed when the database crashes.

The job of the analysis phase is to recreate both tables after a crash using the information in the transaction log.

*To speed up the analysis pass, ARIES provides the notion of checkpoint. The idea is to write to disk, from time to time, the content of the transaction table and the dirty page table along with the last LSN at the time of this write, so that during the analysis pass only the logs after this LSN are analyzed.

 

To conclude

Before writing this article, I knew how big the subject was and I knew it would take time to write an in-depth article about it. It turned out that I was very optimistic and I spent twice as much time as expected, but I learned a lot.

If you want a good overview of databases, I recommend reading the research paper “Architecture of a Database System”. It’s a good introduction to databases (110 pages) and, for once, it’s readable by non-CS guys. This paper helped me a lot to find a plan for this article; it’s less focused on data structures and algorithms than my article and more on architecture concepts.

 

If you read this article carefully you should now understand how powerful a database is. Since it was a very long article, let me remind you about what we’ve seen:

  • an overview of the B+Tree indexes
  • a global overview of a database
  • an overview of the cost based optimization with a strong focus on join operators
  • an overview of the buffer pool management
  • an overview of the transaction management

But a database contains even more cleverness. For example, I didn’t speak about some touchy problems like:

  • how to manage clustered databases and global transactions
  • how to take a snapshot when the database is still running
  • how to efficiently store (and compress) data
  • how to manage memory

 

So, think twice when you have to choose between a buggy NoSQL database and a rock-solid relational database. Don’t get me wrong, some NoSQL databases are great. But they’re still young and they answer specific problems that concern only a few applications.

 

To conclude, if someone asks you how a database works, instead of running away you’ll now be able to answer:

magic gif

Otherwise you can give him/her this article.

How to conduct technical interviews? http://www.sunsetandecho.com/how-to-conduct-technical-interviews/ http://www.sunsetandecho.com/how-to-conduct-technical-interviews/#comments Sat, 04 Jul 2015 00:09:34 +0000 http://www.sunsetandecho.com/?p=871

I’ve interviewed approximately 60 developers since I started working and I still wonder how I can improve my skills as a technical interviewer. I’ve never received any training on how to conduct interviews. Over the years, I’ve changed the way I interview people. I’ve also lowered my expectations since I’m not working on projects that require world-class developers.

 

French IT Market

Before speaking about recruitment, I must present the French IT market because it’s a very specific and peculiar market. In France, technical jobs are undervalued. It’s true in many countries but this phenomenon is particularly strong in France.

Only very few corporations value experience in technical fields, especially in IT. This means that beyond 35, if you’re still a developer you’re a “loser” and you’re likely to struggle beyond 40 to find a new job. Your salary quickly stops growing since you become too expensive for a “simple job a junior can do”. A friend of mine didn’t get a developer position because he wasn’t ambitious enough (meaning he didn’t say he wanted to be a manager in 5 years).

As a result, most good developers switch to management or functional jobs, leave France, or become independent (it’s one of the rare ways to stay in the technical field but it requires being more than a developer since it involves managing a one-person company).

French IT world is more or less a two-speed market with:

  • contractor companies (like Atos, Cap Gemini, Accenture) that sell applications or consultants (meaning technical guys) to other companies,
  • “client” companies (like Total, Sanofi, L’Oréal, BNP, Orange) that produce products or services for consumers.

 

It’s difficult to be hired by “client” corporations because French labor laws are overprotective, so it’s very difficult for these corporations to fire a guy even if he is very bad. Instead, they prefer using expensive consultants they can fire whenever they want (the money goes to the contractor company, not the consultant’s pocket). Moreover, French society is elitist: your Master’s Degree and the university you studied at are very important to get a job at these companies, even 10 years after you got your diploma. Therefore, many good IT guys end up unwillingly in a contractor company where they work as consultants for the same “client” corporation for years without the good benefits of the corporation.

Note: there are also small companies and startups in France but many developers end up in a contractor company and a few in a “client” company.

 

I’m one of the lucky ones who’ve never worked for a contractor company, but I still wonder about my future. Most candidates I’ve interviewed were consultants from contractor companies. Since consultants are expensive, client corporations want them to be “ready to use”. The consultants’ resumes are sometimes modified by the contractors so that they match exactly the needs of the client corporations. Many times, the consultant doesn’t go through a technical interview; that’s why the contractor modifies the resume to easily sell the consultant. I have friends in contractor companies; some of them discovered their resume was modified just before an interview. Consultants are sometimes forced by the contractor to take a mission (even if it’s far away or they don’t like the technologies of the mission), otherwise they’re fired. In fact, this is how contractor companies can easily fire people or make them want to leave.

Note: It’s not a black and white situation, I’ve met some people happy to work for a contractor company.

 

Sourcing

Before the interview there is the sourcing. We first need to provide a description of the technical skills we’re looking for and a description of the project we’re working on. Then, the manager sends it to many contractor companies. In the rare cases of employee and intern recruitment, the manager sends it directly to the HR service.

I’ve often tried to reduce the requirements because it’s impossible for someone to fit exactly what we’re looking for. For example, in a previous project we asked for someone who’s good at 13 different technologies including 3 very specific technologies… This kind of WTF requirement is a French particularity called the “five-legged sheep”. In my current Big Data project, we’re only looking for good and motivated Java developers.

When we’re dealing with consultants, there is no pre-selection done by an HR service; all the work is done by the technical team. The price of a consultant is decided by an HR service and doesn’t depend on the technical skills of the consultant (though there are some very rare exceptions).

 

At the beginning I was carefully reading each resume, but after dozens of them I realized it was taking a lot of time, especially when you have a project to finish. I became a “keyword reader”, which is kind of ironic since I became the type of recruiter I used to despise. I look at a resume for less than 30 seconds and if I find it interesting then I look at it more carefully. I know that by doing so:

  • I’m losing good candidates that don’t know how to write a good resume,
  • I’m losing good candidates that only partially match our needs,
  • I encourage contractors and candidates to fake their resume.

But I can’t afford to spend too much time on this stuff, I’m a developer after all.

If the resume matches the position requirements exactly, I look carefully for the candidate on the Internet because it looks very suspicious. Otherwise, I still look quickly for the candidate on the net. When I have time and I see an unknown technology, I read about it to roughly understand it so that I’m able to ask very simple questions.

Now, let’s talk about the interviews!

 

my experience as a candidate

Before speaking about how I handle interviews, let’s talk about some interviews I had as an interviewee. I haven’t done a lot of interviews but they all helped to shape the interviewer I became. Here are 2 very different experiences I had as a candidate.

my worst experience

The worst interview I had was for a junior Java developer position. At that time, my experience with Java was a 6-month internship. I did an HR interview, then a one-hour MCQ about Java, and was called a few days later for a technical interview. The technical interviewer asked me specific questions above my beginner level and was condescending. As a result, I lost my self-confidence and wasn’t able to answer simple questions. During this interview, I really felt like a punching bag. If I had to do the same kind of interview today, I think I would leave during the interview. Of course, I didn’t get the job (are you surprised?).

 

my best experience

As a candidate, the best interviews I had were at Microsoft. The interviewers were nice and the questions interesting (though that’s subjective). But here comes the problem: I had an HR call, a one-hour online exercise, a one-hour technical call (that I really liked) and a 5-round interview (some rounds were very fun), only to discover in the middle of the 5-round interview that the job position was mostly about writing SQL queries (whereas I thought it would involve complex algorithms)… In the end I didn’t get the job because they rightfully feared I’d get bored. I was offered the chance to schedule another round (with fewer interviews) for a more technical job but I declined because it’s exhausting and time-consuming (and at that time the future of Microsoft was uncertain).

Although I really liked these interviews, I don’t see the point of requiring a very good level in algorithms and data structures if the real job uses 5% of that level.

In my professional career:

  • I’ve never coded my own sorting algorithm,
  • I’ve never coded my own self-balancing tree,
  • I’ve never used recursion,
  • I’ve rarely had to write really tricky algorithms.

Still, since most of my projects were dealing with a large amount of data, my understanding of time complexity helped me to optimize processes.

In my free time I sometimes look at algorithms and data structures because I like them (I’m currently looking at quantum algorithms, wonderful stuff!). But I understand all the developers who criticize “Google-like” interviews, especially web developers: why would you need to know dynamic programming or divide-and-conquer algorithms to do MVC, MVP or MVVM?

I understand that “Google-like” interviews are a way to see how a candidate thinks, but someone who’s not used to writing algorithms will have great difficulty answering, and that doesn’t mean he’s stupid. As a result, many “Google-like” candidates use books or sites (like Glassdoor, CareerCup and GeeksforGeeks) to cram as many exercises as possible.

 

my experience as an interviewer

For me, a good interview is a nice technical discussion where both the candidate and I learn something. At the end of the technical interview (which lasts between 1 and 2 hours), I must be able to tell if the candidate is good. That leads to another question: what is a good developer?

A good developer

There are as many definitions of a good developer as there are developers. When I started working, my definition was strongly focused on (academic) problem-solving skills and the ability to quickly find a solution to a problem. But I realized that, unless you’re working in a specific field, these skills are not that important for day-to-day work. So, here is my current definition.

A good developer is someone who:

  • Knows when to ask for help

By that I mean he knows when he’s facing something too difficult, where it’s better to ask someone than to waste days on the problem. To do so, the developer needs to know his own level and not be too proud to ask for help. He also needs to know when he’s facing a simple problem that can quickly be solved by thinking/Google/Stack Overflow instead of asking for help and wasting a coworker’s time.

  • Can quickly understand new languages and concepts

A project involves many technologies that keep changing, so a developer must be able to quickly understand a new technology (not master it, which is useless most of the time). I must admit the notion of “speed” is very subjective.

  • Is logical

Logic is very important for developers, but I’ve worked with developers who weren’t logical. I’m not talking about formal logic, just common sense. For example, when working with data, we often need to filter and process it. I’ve seen developers process the whole data set before filtering it…
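
As a tiny, hypothetical illustration of that common-sense point (the class and field names are made up), filtering first means the expensive processing only runs on the data you actually keep:

import java.util.List;
import java.util.stream.Collectors;

public class OrderReport {

	static class Customer {
		private final String name;
		private final boolean active;
		Customer(String name, boolean active) { this.name = name; this.active = active; }
		String getName() { return name; }
		boolean isActive() { return active; }
	}

	// Good sense: reduce the data set first, then process what remains
	static List<String> activeCustomerNames(List<Customer> customers) {
		return customers.stream()
				.filter(Customer::isActive)          // cheap filter first
				.map(c -> c.getName().toUpperCase()) // "expensive" processing on fewer items
				.collect(Collectors.toList());
	}
}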

I’ve worked with a developer who, instead of looking for a way to replace a word in many Linux files (for example with sed, or even by coding a simple program), took 4 days to do it manually…

  • Can resolve problems alone and learn from previous problems

I think it’s important to be able to face a problem and solve it alone. If someone always needs help, he reduces the productivity of the team. Of course, a brainless copy-paste from Google/Stack Overflow is not solving a problem. I’ve also worked with developers who asked for help and weren’t able to see that they had already faced the same problem before.

  • Is willing to improve his code

The notion of “good code” is very subjective. For me, a good function has to be readable without scrolling. I’ve met people who prefer to put a full algorithm in a single function because it’s more readable for them. Neither of us is wrong; we just don’t have the same definition of good code.

A good developer thinks about how to implement the requirements so that the code is as readable and maintainable as possible. He also needs to be critical of his own work and willing to improve.

  • Can clearly express his ideas

This point is very important since a developer often has to communicate with both technical and non-technical people.

  • Has an average level for the technologies I’m looking for*

I don’t think knowing a specific language is mandatory to be a good developer. But, since I mostly interview consultants who are supposedly “ready to work”, they must have at least an average level. Moreover, having a technology in common helps me see how deeply the candidate has explored that technology, and sometimes how he learns.

  • Is passionate*

Though I like to see passion for programming, I can’t blame a candidate if he doesn’t read or code in his free time. Moreover, working for a contracting company (which is what most French developers do) is a passion killer. For me, passion is not required to be a good developer, but it is required to become a great developer, which is not what we need because we’re not doing rocket science (like 98% of IT projects). Of course, being passionate doesn’t imply being good. In fact, the best people I’ve worked with did IT only at work. But they were smart, knew their stuff and stayed longer when needed. Though I couldn’t share my passion with them, it was a real pleasure to work with them (I even became friends with some of them).

That being said, I’m currently working in Big Data, where the technologies are new and evolve quickly. That’s why I currently prefer a passionate candidate: he’s more likely to stay up to date on the technologies (but it’s not mandatory).

 

The way I interview

The number of interviews depends on the type of position, but most of the time:

  • An intern will have an HR interview, then a technical interview, then an interview with a manager,
  • A consultant will have a technical interview then an interview with a manager, or everything in one session,
  • An employee will have an HR interview, then one or two technical interviews, then one or two interviews with managers. I’ve only done this once since we rarely recruit employees.

 

I don’t have a fixed order in the way I run a technical interview. Sometimes I start by presenting my project, sometimes I let the candidate present himself. Most of the time I let the candidate (or another interviewer) choose the order.

When I have time, I read up before the interview on the technologies the candidate used that I don’t know. I also prepare some questions when an experience in the resume seems odd or when I’m interested in an unfamiliar technology.

I often start by asking “anti-bullshit” questions about the technologies the candidate has used.

Then, I like asking broad questions about the technologies/architectures the candidate used in his projects. Most of the time, I let the candidate present his experience and I ask my prepared questions and/or broad questions. I don’t know before the interview what kind of broad questions I’m going to ask; it really depends on the candidate. But here is what I try to see during the discussion:

  • how the candidate learned a technology/subject,
  • how far he went with that technology/subject,
  • the technical/management problems he faced and how he solved them,
  • the reasons he chose one solution instead of another,
  • the solutions he proposes if I add more constraints to a problem he faced,
  • how he works with people,
  • whether he understands what he did in his previous experiences (technically and functionally),
  • whether he expresses himself clearly,
  • how he behaves when he doesn’t know an answer,
  • how he behaves when I’m wrong.

I’m a big fan of paper and pencil, so that both the candidate and I can express our ideas. When the candidate can’t answer a question, I try to ask other questions that lead back to my initial one. If he still can’t answer, I give him a possible answer I was expecting.

 

When I present my project, I often tell the candidate he can ask questions if he wants. I like being asked questions or even challenged. It’s a good way to see if the candidate can think about problems quickly and state his point of view. But it rarely happens; I understand that I’m not a typical interviewer, so most candidates are not used to asking these kinds of questions.

 

I don’t ask for real code because I don’t believe I can learn much from watching someone code for a short and stressful period. I would need at least half a day (a full day would be better) of pair programming with the candidate on a real problem, which is too much time for both of us. Since I’ve never tried, I might be wrong. Who knows, maybe next year I’ll be asking candidates for FizzBuzz and Fibonacci.
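
For reference, FizzBuzz is the kind of trivial screening exercise I mean here; one possible Java version (among many):

public class FizzBuzz {
	public static void main(String[] args) {
		for (int i = 1; i <= 100; i++) {
			if (i % 15 == 0) {
				System.out.println("FizzBuzz");
			} else if (i % 3 == 0) {
				System.out.println("Fizz");
			} else if (i % 5 == 0) {
				System.out.println("Buzz");
			} else {
				System.out.println(i);
			}
		}
	}
}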

 

I also don’t believe in asking for a side project. My few side projects are badly coded, unfinished and mostly useless, because they’re not intended to be read by others and I have neither the time nor the will to write something clean and robust.

Moreover, asking for side projects filters out all the non-passionate developers and the passionate ones without side projects, which means a lot of potentially good candidates. As a candidate, I wouldn’t work for a company that hires only this type of candidate, because I’d assume it would expect me to work in my free time.

 

Though I haven’t done hundreds of interviews, I’ve faced some situations multiple times.

 

The bullshitters

I have to distinguish 2 kinds of liars:

  • those who lie a little,
  • those who are total liars (the bullshitters).

Unlike in some countries, we rarely check references in France.

I don’t mind being lied to about an experience if the candidate really has the level of his fake experience. But I’ve sometimes interviewed total liars. I hate this kind of candidate because it’s a waste of time for me and for them. On paper the guys look good, but I’m quickly disappointed during the interview.

The last bullshitter I interviewed told me he had developed a MapReduce job (Big Data stuff) that did complex mathematical processing on 250 gigabytes of data in 50 seconds. I asked him again to be sure he wasn’t confusing gigabytes with megabytes, but no, it was gigabytes. So the guy had supposedly worked on an overpowered platform that could read and process data at 5 gigabytes per second (250 GB / 50 s), not bad! Not to mention the overhead of Hadoop (a Big Data technology), which takes at least 5-10 seconds just to start a job, plus the network and disk overhead of HDFS (a Big Data file system). So the guy lied to my face but wasn’t smart enough to invent a realistic lie. As expected, he didn’t know the key components of a Hadoop cluster after one year of “working” with Hadoop (which was twice my experience at the time). Of course, he “did” some machine learning stuff, but it was too complicated to be explained… a total waste of time.

 

The use of wrong keywords

Again I have to distinguish 2 cases:

  • the case where I start to ask about a keyword in the candidate’s resume and he quickly tells me that he used this technology/concept only a little, or a long time ago;
  • the case where the candidate doesn’t understand a keyword in his own resume.

I remember interviewing a guy who had worked on a “real-time” banking application for several years. In my opinion, real-time applications in Java are one of the rare cases where you really need to understand how the JVM (Java Virtual Machine) works and the underlying garbage collection algorithms. I’ve never worked on real-time applications, but I’m interested in JVM internals because they use clever mechanisms (you can read my article about JVM memory zones). So, I was happy to interview the guy and learn new stuff. It turned out he hadn’t been working on a real-time application at all and had no idea how the JVM works. His “real-time” constraint was to answer a service call within a few seconds… I was very disappointed but kept it to myself.

In this kind of situation, I can’t directly blame the candidate because I don’t know whether it was him or his contracting firm that put the mysterious keyword on the resume.

 

The anxious guy

It’s very easy to see if someone is anxious: you can quickly detect it in the tone of his voice, the way he looks at you or some repetitive movements (especially the fingers and feet). I often try to break the ice by being kind or asking very simple questions, but it often doesn’t work. I find it very difficult to interview a nervous candidate because I can’t tell whether he isn’t answering because he is stressed or because he doesn’t know the answer. I prefer to focus on the projects the candidate did and add new constraints to them, because then I’m in his comfort zone and (I think) I’m more likely to get an answer despite the stress. But I can’t do that every time, since I need a high level of concentration to challenge someone in his own comfort zone.

 

When I disagree with another interviewer

I’ve interviewed alone a few times, but most of the time with one or two other interviewers (another technical interviewer and/or a manager). And here come the situations where I disagree with the other interviewer, for example:

  • An interviewer was intentionally stressing a candidate to see how he handled stress. In my opinion, an interview is stressful enough. I felt bad for the candidate, but I couldn’t express my strong disagreement during the interview.
  • Another situation was when a candidate didn’t understand the other interviewer’s questions, but neither did I: the questions didn’t make sense technically speaking! Again, for the sake of the interview, I acted as if I understood the questions and tried to “rephrase” them for the candidate.
  • The last situation I remember is when another interviewer asked (what I think are) pointless questions like “what’s the difference between JUnit 3 and JUnit 4”, or asked for a specific keyword/function/regexp.

 

Awkward situations

Since the sourcing has always been done as a team, I haven’t always decided whom we should interview. I’ve ended up a few times in very awkward situations:

  • the situation where I know before the interview that the guy won’t get the job because his experience doesn’t fit our needs. I was right (all 3 times) and felt sorry that the candidate had wasted his time.
  • I interviewed a candidate I had strongly rejected a year before for the same project. It happened just once, but it’s an 8/10 on the awkwardness scale.

The last awkward situation is typically French and is due to the way contracting firms behave toward their consultants: I’ve interviewed a few candidates who weren’t interested in the job and were only doing the interview because their firm forced them to.

 

When I’m wrong

Being a technical interviewer is not an easy task. You see many candidates with different technical backgrounds. Sometimes when a candidate describes his project, it sounds impossible. I always ask for more details, then move on to another point. But I write down exactly what the candidate told me and check it on the Internet after the interview (or ask some coworkers). For example, a candidate told me he was using Struts (a Java web framework) to develop WAP interfaces (the ancestor of the mobile internet). During the entire interview I thought the candidate was bullshitting me (and I was therefore biased), but after looking on the Internet, it turned out it was possible!

Evolution of the process

I don’t know if it’s just the French market, but many senior developers I’ve interviewed were at best very average candidates. This is why we recently added, in my current project, an MCQ with basic to intermediate questions about Java and object-oriented concepts. If someone doesn’t pass a certain threshold, we don’t do the face-to-face interview (which, as a reminder, lasts 1 to 2 hours). I don’t really like it because I know some good developers who couldn’t answer this MCQ, but it’s a way to filter out the bullshitters and the candidates whose resumes are “over-optimized”.

Unfortunately, the projects I’ve worked on had their ups and downs. That led good developers I had recruited to leave after 6 months (and honestly, I understand them). I now put more weight on my feeling about whether the candidate is likely to leave (which is very subjective).

To conclude

At the end of the face-to-face interview, if I can say “I could work with this guy” and “this guy seems good”, then it’s a yes. I’ve sometimes had to fight for a candidate I believed in when the other interviewer(s) didn’t. I’ve never been wrong when I felt a candidate was good, but I have a high rejection ratio, which is why I am lowering my threshold. I’ve sometimes accepted a candidate I wasn’t sure about because we desperately needed manpower and the other interviewer(s) liked the guy, which led to both good and bad developers.

 

If you’ve never conducted interviews, you now know what goes on behind the scenes (or at least one possible way of doing it).

If you’re a technical interviewer, I’d be glad to read how you do it.

Whether you’re an interviewer or an interviewee, what do you think of the way I do it?

What is a good application? http://www.sunsetandecho.com/what-is-a-good-application/ http://www.sunsetandecho.com/what-is-a-good-application/#comments Sun, 28 Jun 2015 16:26:35 +0000 http://www.sunsetandecho.com/?p=813

I recently applied for a position at my current corporation, and one of the questions I was asked was “what is a good application?”.

I had never thought about it before, so it was a really good exercise to formalize my own vision of a good software application. It was a technical position, so my answer was from a developer’s point of view. Since giving my answer, I have thought about it again, and here is my vision of a good application.

 

A simple application

When I design applications, I always think “the simpler, the better”. Developers should focus on the requirements and just the requirements. I really hate:

  • over-design (with overuse of design patterns)
  • over-optimization
  • over-development
  • “flexible” architectures

Nothing beats something simple that can easily be mastered. Most of the time, an over-complex architecture designed for potential future needs will never be used at 100% (or even 50%) because:

  • the needs have changed,
  • the expectations were too high and/or too far in the future (who can predict what will happen in 5 years?),
  • the budget of the project has decreased a lot,
  • the corporate strategy has changed.

That’s why I prefer something simple, even if it will require a big refactoring should the architecture turn out to be insufficient for future requirements.

For example, why would you need a Big Data cluster for an application that has just started and therefore deals with very little data? As a Big Data developer, most candidates I have interviewed were working on Big Data projects dealing with (at most) a few million records. When I asked them why they were using Big Data, some told me “because there is a lot of data” and others “because in the future the application could deal with more data”. A few years ago, I was working at a large bank and we were dealing with hundreds of millions of records with “only” relational databases. Between a well-mastered relational database and a buggy Hadoop cluster (buggy because of people’s limited knowledge, the complexity of a distributed platform, and the evolving technology itself), the choice is simple. For example, Criteo (a retargeting-ads startup) started with a Microsoft SQL Server. They only switched to Hadoop during the summer of 2011, and they now have the biggest Hadoop cluster in Europe.

For me, more complex means:

  • more expensive (because it’s more difficult to understand and therefore it takes more time)
  • more bugs (again, because it’s more difficult to understand and therefore to test)
  • more difficult to monitor (again, because it’s more difficult to understand)

Of course, if the future needs are 100% certain (for example, when designing the future master data repository of a very big corporation, which will be fully used in 3 years), designing a complex architecture is worth it.

 

A readable code

A key aspect of a good application is readable code. Since an application will be written and read by many developers, it’s important that they don’t spend too much time understanding legacy code.

In object-oriented programming, this means having short functions (with low cyclomatic complexity) and classes with few responsibilities. I’m a big fan of loose coupling, which increases the isolation of the components and therefore helps share the work among developers.

Another key point is code consistency. On big projects, hundreds of developers modify the source code. If everyone codes as he likes, the code quickly becomes a total nightmare with many coding styles. It’s like reading a book written sometimes in English, sometimes in French, sometimes in German, sometimes in Russian… The development team must converge on a single code convention, and the code must be collectively owned. I really like self-explanatory code: instead of writing tons of comments, using explicit names for functions, classes and variables is enough most of the time. Of course, it’s still necessary to write documentation (I prefer to write it directly in the class and the function).
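
To illustrate (with made-up names), here is the kind of self-explanatory code I have in mind: the constant and method names carry most of the documentation, so very few comments are needed:

public class InvoiceCalculator {
	//the explicit names replace a mysterious "p * 0.196" buried somewhere in the code
	private static final double VAT_RATE = 0.196;

	public double computeVat(double priceExcludingTax) {
		return priceExcludingTax * VAT_RATE;
	}
}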

One of the best examples I have read is the Spring Framework: it has more than a million lines of code, but the few parts I read (Spring Core, Spring Batch and Spring Data) were consistent and the naming conventions were very understandable (though some people don’t like the verbosity of Spring’s code conventions). Moreover, the Spring documentation is awesome. A bad example is a very famous C library: libavcodec. This library is used by most applications to deal with compression, decompression and on-the-fly transcoding of audio and video streams. The work done by this library is very impressive but, when you’re a beginner trying to understand it, it’s a total nightmare. I spent approximately 15 hours trying to understand parts of it in order to build an audio streaming server in my free time, but I gave up: there was not enough documentation and the code conventions were not explicit enough (for me) to be understood quickly without it.

 

A tested code

When I started working, a coworker told me about unit and integration tests, and I was thinking “man, why would I need tests, most of the time my code works on the first try”. To be honest, a part of me still thinks that… But my vision has changed a lot.

When I code, I always ask myself: “is it unit testable?”. It’s a very good exercise because it forces me (when I have enough time) to apply loose coupling. Moreover, having integration and acceptance tests allows you to refactor code and modify someone else’s code without the fear of changing the behavior of the application.

But OK, let’s say you have unit, integration and acceptance tests with very good code coverage and, on top of that, the tests are automated (with Jenkins for instance). But are the tests relevant? I’ll give you two bad (and real) examples:

  • In a previous project, people were faking most of the tests just to increase the code coverage so that it would look like a very good project. They didn’t have the time to write real tests, and a high code coverage was mandatory.
  • In another project, I audited someone else’s code and saw that the guy was not testing the behavior of the code but the behavior of the mocks used in his tests.

In both scenarios the code coverage was high but the application wasn’t really tested.
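
To make the second case concrete, here is a hedged sketch (hypothetical classes, assuming JUnit 4 and Mockito are available) of a test that only exercises the mock versus one that tests the real behavior:

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class PriceServiceTest {

	interface TaxProvider { double taxRate(String country); }

	static class PriceService {
		private final TaxProvider taxProvider;
		PriceService(TaxProvider taxProvider) { this.taxProvider = taxProvider; }
		double priceWithTax(double price, String country) {
			return price * (1 + taxProvider.taxRate(country));
		}
	}

	//useless test: it only checks what the mock was told to return,
	//the real logic of priceWithTax() is never executed
	@Test
	public void badTestOnlyExercisesTheMock() {
		TaxProvider taxMock = mock(TaxProvider.class);
		when(taxMock.taxRate("FR")).thenReturn(0.2);
		assertEquals(0.2, taxMock.taxRate("FR"), 0.0001);
	}

	//relevant test: the mock is only a stub, the assertion targets the code under test
	@Test
	public void goodTestChecksTheBusinessBehavior() {
		TaxProvider taxStub = mock(TaxProvider.class);
		when(taxStub.taxRate("FR")).thenReturn(0.2);
		PriceService service = new PriceService(taxStub);
		assertEquals(120.0, service.priceWithTax(100.0, "FR"), 0.0001);
	}
}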

Relevant tests are only possible if the development team really understands the client’s needs and the concepts behind testing. Of course, it also requires having enough time.

 

An exploitable application

So great, you have a good application, but:

  • Is it working well? Will it still be the case in 3 months?
  • Are you being hacked? Have you been hacked?
  • Are you sure?

In order to answer, you need to be able to monitor your application.

To do so, you must have a good logging system with different log levels, so that you can “switch” your production application into a more verbose mode when you really don’t understand a bug.

Moreover, you need to log what matters when a problem happens. I’ve dealt with production crashes many times, and sometimes the only thing I got was “ERROR”. Great, but what caused this error? What were the variables and the use case just before the crash? I had a hard time understanding some root causes (and it made me want to kill the developer who wrote that piece of ####).
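
As a hedged illustration (the class and messages are made up), here is what that looks like with the JDK’s own java.util.logging: a verbose FINE level you can enable in production, and an error log that carries the context instead of a bare “ERROR”:

import java.util.logging.Level;
import java.util.logging.Logger;

public class ContractService {
	private static final Logger LOG = Logger.getLogger(ContractService.class.getName());

	public void createContract(String customerId, double amount) {
		//verbose details, only visible when the logger is switched to FINE in production
		LOG.fine(() -> "creating contract for customer " + customerId + " amount=" + amount);
		try {
			//... business logic ...
		} catch (RuntimeException e) {
			//log the use case and the variables, not just "ERROR"
			LOG.log(Level.SEVERE,
					"contract creation failed for customer " + customerId + " amount=" + amount, e);
			throw e;
		}
	}
}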

Another aspect is to expose technical and functional indicators to monitor the application’s behavior: for instance the memory usage, the number of running processes, the number of customers/contracts/whatever created on a daily basis… These indicators might help you detect abnormal behavior. Maybe the memory and CPU are overused because you’re being hacked. Maybe too many customers have been created since the last release because a new bug duplicates creations. Here is an example of a poorly exploitable application: on a previous project (in a big corporation), we discovered we were being hacked only after 2 weeks, just because the hackers became too greedy and our clustered servers collapsed under the load. With a good monitoring system, we could have detected the intrusion much faster.

All these logs and indicators need to be stored somewhere. Again, I like simple things. For a simple application, flat files are enough. But if your application spans many servers, a centralized logging solution is useful (like Logstash/Kibana/Elasticsearch). I worked on a project with more than 30 servers in production and no centralized system; let me tell you that reading each log to see if everything was OK was a huge waste of time.

 

Customer satisfaction

I’ve seen many developers who were driven only by technologies. But an application is not about technologies; it’s about answering the needs of a client.

I love technologies (especially the new and shiny ones), but sometimes I choose an “old” technology I don’t like because it fits exactly the client’s needs.

Moreover, most of the time, knowing the subtleties of a language is useless, whereas understanding the business domain you’re working in is useful. I have never understood developers who don’t care about business domains and only care about technologies (often only a few technologies). For example, on a project in a bank (the one dealing with hundreds of millions of records) covering the Basel II norms (European regulations created to prevent bank bankruptcies), many developers had never read an article about Basel II nor tried to understand the basics of these norms. They could have been working on a nuclear plant or a web crawler, it would have been the same: they were only implementing what they were asked to, without understanding the purpose of their developments (even a little bit).

For me this point is the most important. Even if your code is a total nightmare, if what you deliver makes the client happy, that’s what matters. And to do that, you need to understand the client’s business domain.

Customer satisfaction should always be the top priority.

 

A few words

You’ve just read my vision of a good application, which is very close to the Agile philosophy. I didn’t speak about ergonomics or usability because there are many types of applications (user interfaces, batches, web services…).

It’s a very subjective and therefore controversial subject. So, what do YOU think makes a good application?

Design pattern: singleton, prototype and builder http://www.sunsetandecho.com/design-pattern-singleton-prototype-and-builder/ http://www.sunsetandecho.com/design-pattern-singleton-prototype-and-builder/#comments Sun, 21 Jun 2015 16:51:02 +0000 http://www.sunsetandecho.com/?p=815

 

In my previous article, I spoke about the factory patterns. These patterns are part of creational patterns. In this post we’ll focus on the rest of the creational patterns: Singleton, Builder and Prototype.

In my opinion, these patterns are less important than factories. Yet, it’s still useful to know them. Following the same logic as my previous article, I’ll provide a UML description, simple Java examples (so that you can understand even if you don’t know Java) and real examples from famous Java frameworks or APIs.

I’ll sometimes use factories so read my previous article if you don’t feel comfortable with factory patterns.

 

Creational Patterns

Creational patterns are design patterns that deal with object initialization and overcome the limitations of constructors. The Gang of Four, in their book “Design Patterns: Elements of Reusable Object-Oriented Software”, described five of them:

  • Singleton,
  • Builder,
  • Prototype,
  • Abstract Factory,
  • Factory pattern.

Since this book was released (in 1994), many creational patterns have been invented:

  • other types of factories (like the static factory),
  • pool pattern,
  • lazy initialization,
  • dependency injection,
  • service locator,

In this post, we’ll focus only on the GoF creational patterns I haven’t already described. As I said in the introduction, they are less important than factories because you can live without them (whereas factories are the backbone of many applications and frameworks). But they are useful and, unlike factories, they don’t make the code much more difficult to read.

 

Singleton Pattern

This pattern is the most famous. In past decades it was overused, but its popularity has decreased since. I personally avoid using it because it makes the code more difficult to unit test and creates tight coupling. I prefer to use a factory (like the Spring container) that controls the number of authorized instances of a class; we’ll speak about this approach later. I think you should avoid the singleton pattern. In fact, the most important use of this pattern is to be able to answer an interviewer when he asks “what is a singleton?”. This pattern is very controversial and there are still people in favor of it.

That being said, according to the GoF a singleton aims to:

“Ensure a class only has one instance, and provide a global point of access to it”

So, there are 2 requirements for a class to be a singleton:

  • Having a unique instance
  • Being accessible from anywhere

Some people only think about the first requirement (like me a few years ago). In that case, the class is only a single instance.

Let’s look at how to write a singleton in UML.

singleton pattern

 

In this UML diagram, the Singleton class has 3 items:

  • a class attribute (instance): this attribute holds the unique instance of the Singleton class.
  • a public class method (getInstance()): it provides the only way to get the unique instance of the Singleton class. The method can be called from anywhere since it’s a class method (and not an instance method).
  • a private constructor (Singleton()): it prevents anyone from instantiating a Singleton directly.

In this example, a developer that needs an instance of Singleton will call the Singleton.getInstance() class method.

The singleton instance inside the Singleton class can be:

  • pre-initialized (which means it is instantiated before anyone calls getInstance())
  • lazy-initialized (which means it is instantiated during the first call to getInstance())

Of course a real singleton has other methods and attributes to do its business logic.

 

Java implementations

Here is a very simple way to create a singleton in Java using the pre-instantiated approach.

public class SimpleSingleton {
	private static final SimpleSingleton INSTANCE = new SimpleSingleton();

    private SimpleSingleton() { }

	public static SimpleSingleton getInstance(){
		return INSTANCE;
	}
}

With this approach, the singleton instance is created only once, when the class is loaded by the classloader. If the class is never used in your code, the instance won’t be created (because the JVM’s classloader won’t load the class) and therefore won’t waste memory. But if the class is loaded and you don’t use the instance (for example, if it’s only needed in a very rare condition), the singleton will have been initialized for nothing. Unless your singleton takes a huge amount of memory, you should use this approach.

Still, if you need to create your singleton only when it’s really used (lazy initialization), here is a way to do it in a multithreaded environment. This part is a bit tricky since it involves thread safety.


public class TouchySingleton {
	private static volatile TouchySingleton instance;

	private TouchySingleton() {
	}

	public static TouchySingleton getInstance() {
		if (instance == null) {
			synchronized (TouchySingleton.class) {
				if (instance == null) {
					instance = new TouchySingleton();
				}
			}
		}
		return instance;
	}
}

As I said, it’s really more difficult to read (this is why the pre-instantiated way is better). This singleton uses a lock to prevent 2 threads calling getInstance() at the same time from creating 2 instances. Since the lock is costly, there is first a test without the lock, then a test with the lock (this is double-checked locking), so that when the instance already exists the lock is never taken.

Another particularity: the instance has to be volatile to ensure that its fully constructed state is visible to the other threads and processor cores when it is published.

 

When do you need to use a singleton?

  • When you need only one resource (a database connection, a socket connection …)
  • To avoid the memory waste of having multiple instances of a stateless class
  • For business reasons

You shouldn’t use a singleton to share variables/data between different objects, since that produces very tight coupling!

 

Why shouldn’t you use a singleton?

At the beginning, I said that you shouldn’t use singletons because of the way you get the singleton: it’s based on a class method that can be called anywhere in the code. I read an excellent answer on Stack Overflow that gives 4 reasons why it’s bad:

  • With singletons, you hide the dependencies between classes instead of exposing them through interfaces. This means you need to read the code of each method to know whether a class uses another class.
  • They violate the single responsibility principle: they control their own creation and lifecycle (with lazy initialization, the singleton chooses when it is created). A class should only focus on what it is meant to do. If you have a singleton that manages people, it should only manage people, not how/when it is created.
  • They inherently cause code to be tightly coupled. This makes faking or mocking them for unit testing very difficult.
  • They carry state around for the lifetime of the application (for stateful singletons).
    • This makes unit testing difficult, since you can end up in a situation where tests need to be ordered, which is nonsense. By definition, unit tests should be independent from each other.
    • Moreover, it makes the code less predictable.

OK, so singletons are bad. But what should you use instead?

 

Use a single instance instead of a singleton

A singleton is just a specific type of single instance, one that can be obtained anywhere through its class method. If you remove this second requirement, you remove many problems. But how do you deal with single instances?

A possible way is to manage single instances with a factory and Dependency Injection (it will be the subject of a future post).

Let’s take an example to understand:

  • You have a PersonBusiness class that needs a unique DatabaseConnection instance.
  • Instead of using a singleton to get this connection, the PersonBusiness will have a DatabaseConnection attribute.
  • This attribute will be injected at the instantiation of PersonBusiness through its constructor. Of course, you can inject any type of DatabaseConnection:
    • A MysqlDatabaseConnection for your development environment
    • An OracleDatabaseConnection for the production environment
    • A MockDatabaseConnection for the unit tests
  • At this stage, nothing guarantees that the DatabaseConnection is unique. This is where the factory is useful. You delegate the creation of PersonBusiness to a factory, and this factory also takes care of creating the DatabaseConnection:
    • It chooses which kind of connection to create (for example, using a property file that specifies the type of connection)
    • It ensures that the DatabaseConnection is unique.

 

If you didn’t understand what I’ve just said, look at the next Java example, then re-read this part; it should be clearer. Otherwise, feel free to tell me.

Here is an example in Java where the factory creates a MysqlDatabaseConnection, but you could imagine a more complex factory that decides the type of connection according to a property file or an environment variable.

////////////////An interface that represents a database connection
public interface DatabaseConnection {
	public void executeQuery(String sql);
}

////////////////A concrete implementation of this interface
////////////////In this example it's for a mysql database connection
public class MysqlDatabaseConnection implements DatabaseConnection {
	public MysqlDatabaseConnection() {
		// some stuff to create the database connection
	}

	public void executeQuery(String sql) {
		// some stuff to execute a SQL query
		// on the database
	}
}

////////////////Our business class that needs a connection
public class PersonBusiness {
	DatabaseConnection connection;

	//dependency injection using the constructor
	//the connection behaves like a singleton because the factory that
	//creates a PersonBusiness object ensures that
	//the MysqlDatabaseConnection has only one instance
	PersonBusiness(DatabaseConnection connection){
		this.connection = connection;
	}

        //a method that uses the injected singleton
	public void deletePerson(int id){

		connection.executeQuery("delete person where id="+id);
	}
}

////////////////A factory that creates business classes
//////////////// with a unique MysqlDatabaseConnection
public class Factory {
	private static MysqlDatabaseConnection databaseConnection = new MysqlDatabaseConnection();

	public static MysqlDatabaseConnection getUniqueMysqlDatabaseConnection(){
		return databaseConnection;
	}

	public static PersonBusiness createPersonBusiness(){
		//we inject a MysqlDatabaseConnection but we could choose
		//another connection that implements the DatabaseConnection interface
		//this is why this is loose coupling
		return new PersonBusiness(databaseConnection);
	}

}

This is not a perfect example since PersonBusiness itself could be a single instance (it has no state). But you could imagine a ContractBusiness and a HouseBusiness that also need that unique DatabaseConnection.

Still, I hope you see that with dependency injection plus a factory, you end up with a single instance of DatabaseConnection in your business classes, as if you had used a singleton. But this time the coupling is loose, which means that instead of using a MysqlDatabaseConnection, you can easily use a MockDatabaseConnection to test only the PersonBusiness class.
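
As a hedged sketch (the mock class is hypothetical, kept framework-free, and assumed to live in the same package as PersonBusiness since its constructor is package-private), this is how the loose coupling pays off when testing PersonBusiness without any real database:

////////////////A fake connection used only by the tests
public class MockDatabaseConnection implements DatabaseConnection {
	public String lastQuery;

	public void executeQuery(String sql) {
		//record the query instead of hitting a real database
		lastQuery = sql;
	}
}

////////////////A test of PersonBusiness that never touches MySQL
public class PersonBusinessTest {
	public static void main(String[] args) {
		MockDatabaseConnection mock = new MockDatabaseConnection();
		PersonBusiness business = new PersonBusiness(mock);

		business.deletePerson(42);

		if (!"delete person where id=42".equals(mock.lastQuery)) {
			throw new AssertionError("unexpected query: " + mock.lastQuery);
		}
	}
}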

Moreover, it’s easy to know that PersonBusiness uses a DatabaseConnection. You just have to look at the attributes of the class instead of reading its 2000 lines of code (OK, imagine this class has many functions and in total takes 2000 lines of code).

This approach is used by most Java frameworks (Spring, Hibernate…) and Java containers (EJB containers). It’s not a real singleton since you can instantiate the class multiple times if you want to, and you can’t get the instance from anywhere. But if you only create your instances through the factory/container, you’ll end up with a unique instance of the class in your code.

Note: I think the Spring Framework is very confusing because its “singleton” scope is only a single instance. It took me some time to understand that it wasn’t a real GoF’s singleton.

 

Some thoughts

The single instance has the same drawback as the singleton when it comes to global state. You should avoid using a single instance to share data between different classes! The only exception I see is caching:

  • Imagine you have a trading application that makes hundreds of calls per second and only needs the stock prices from the last few minutes. You could use a single instance (StockPriceManager) shared among the trading business classes, and every function that needs a price would get it from the cache. If the price is outdated, the cache would refresh it. In this situation, the drawbacks of tight coupling are worth the gain in performance. But when you need to understand a production bug caused by this global state, you cry (I’ve been there and it wasn’t funny). See the sketch below.
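
Here is a minimal sketch of such a cache (the names and the hard-coded value are made up; the real market call is just a placeholder):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StockPriceManager {
	private static final long MAX_AGE_MS = 60_000;//keep only the prices from the last minute

	private static class CachedPrice {
		final double price;
		final long timestamp;
		CachedPrice(double price, long timestamp) { this.price = price; this.timestamp = timestamp; }
	}

	private final Map<String, CachedPrice> cache = new ConcurrentHashMap<>();

	public double getPrice(String symbol) {
		CachedPrice cached = cache.get(symbol);
		if (cached == null || System.currentTimeMillis() - cached.timestamp > MAX_AGE_MS) {
			//outdated or missing: refresh it with the costly call
			cached = new CachedPrice(fetchFromMarket(symbol), System.currentTimeMillis());
			cache.put(symbol, cached);
		}
		return cached.price;
	}

	private double fetchFromMarket(String symbol) {
		return 42.0;//placeholder for the real (slow) market call
	}
}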

 

I told you to use the single-instance approach instead of the singleton, but sometimes it is worth using a real singleton when you need the object in all your classes. For example, when you need to log:

  • Each class needs to log, and the logging class is often unique (because the logs are written to the same file). Since all classes use the logging class, you know that every class has an implicit dependency on it. Moreover, it’s not a business need, so it’s “less important” to unit test the logs (shame on me).

Writing a singleton is easier than writing a single instance using dependency injection. For a quick and dirty solution I’d use a singleton. For a long-lived solution I’d use a single instance. Since most applications are based on frameworks, implementing a single instance is easier with a framework than from scratch (assuming you know how to use the framework).

If you want to know more about singletons:

 

Real examples

The single-instance pattern uses a factory. If you use an instantiable factory, you might need to ensure that this factory itself is unique. More broadly, when you use a factory you might want it to be unique, to avoid 2 factory instances messing with each other. You could use a “meta-factory” to build the unique factory, but you’d end up with the same problem for the “meta-factory”. So, the only way to do it is to make the factory a singleton.

It’s the case for java.awt.Toolkit in the old graphical library AWT. This class provides a getDefaultToolkit() method that returns the unique Toolkit instance, and it’s the only way to get one. Using this toolkit (which is a factory), you can create windows, buttons, checkboxes…
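
A tiny usage sketch (nothing more than calling the class method and one of the toolkit’s services):

import java.awt.Dimension;
import java.awt.Toolkit;

public class ToolkitDemo {
	public static void main(String[] args) {
		//the class method is the only way to get the unique Toolkit
		Toolkit toolkit = Toolkit.getDefaultToolkit();
		Dimension screenSize = toolkit.getScreenSize();
		System.out.println("screen resolution: " + screenSize.width + "x" + screenSize.height);
	}
}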

 

But you can also encounter singletons for other concerns. When you need to monitor the system in Java, you have to use the class java.lang.Runtime. I guess this class has to be unique because it represents the global state (environment) of the process. To quote the Java API:

“Every Java application has a single instance of class Runtime that allows the application to interface with the environment in which the application is running. The current runtime can be obtained from the getRuntime method.”
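
A small usage sketch with a few standard Runtime methods (it only reads process information):

public class RuntimeDemo {
	public static void main(String[] args) {
		Runtime runtime = Runtime.getRuntime();//the unique instance
		System.out.println("processors:   " + runtime.availableProcessors());
		System.out.println("free memory:  " + runtime.freeMemory() + " bytes");
		System.out.println("total memory: " + runtime.totalMemory() + " bytes");
		System.out.println("max memory:   " + runtime.maxMemory() + " bytes");
	}
}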

 

Prototype Pattern

I’ve used prototypes through Spring, but I’ve never needed to write my own. This pattern is meant to build objects by copying instead of using constructors. Here is the definition given by the GoF:

“Specify the kinds of objects to create using a prototypical instance, and create new objects by copying this prototype.”

As often with the GoF, I don’t understand their sentences (is it because English is not my native language?). If you’re like me, here is another explanation: if you don’t want to or can’t use the constructor of a class, the prototype pattern lets you create new instances of this class by duplicating an already existing instance.

Let’s look at the formal definition using a UML diagram:

prototype pattern

In this diagram:

  • the Prototype is an interface that defines a clone() function,
  • a ConcretePrototype has to implement this interface, and its clone() implementation returns a copy of itself.

A developer will have to instantiate the ConcretePrototype once. Then, he will be able to create new instances of ConcretePrototype by:

  • duplicating the first instance using the clone() function
  • or creating a ConcretePrototype using (again) the constructor.

 

When to use prototypes?

According to the GoF, the prototype should be used:

  • when a system should be independent of how its products are created, composed, and represented
  • when the classes to instantiate are specified at run-time, for example, by dynamic loading
  • to avoid building a class hierarchy of factories that parallels the class hierarchy of products
  • when instances of a class can have one of only a few different combinations of state. It may be more convenient to install a corresponding number of prototypes and clone them rather than instantiating the class manually, each time with the appropriate state.

 

The dynamic loading of an unknown class is a very rare case, even more so if the dynamically loaded instance needs to be duplicated.

This book was written in 1994. Nowadays, you can “avoid building a class hierarchy of factories” by using dependency injection (again, I’ll present this wonderful pattern in a future article).

In my opinion, the most common case is when creating a stateful instance is much more expensive than copying an existing one and you need to create lots of these objects. For example, if the creation needs to:

  • get data from a database connection,
  • get data from the system (with system calls) or the filesystem,
  • get data from another server (with sockets, web services or whatever),
  • compute a large amount of data (for example if it needs to sort data),
  • do anything that takes time.

The object must be stateful because if it has no state, a Singleton (or a single instance) will do the trick.

There is also another use case: defensive copies. If you have a mutable instance and you want to give it to another part of the code, for safety reasons you might want to hand out a duplicate instead of the real instance, because the real instance could be modified by the client code and impact other parts of the code that use it.
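
A hedged sketch of this defensive-copy use case (the class and method names are made up):

import java.util.ArrayList;
import java.util.List;

public class PriceHistory {
	private final List<Double> prices = new ArrayList<>();

	public void record(double price) {
		prices.add(price);
	}

	//callers get a duplicate: mutating the returned list cannot corrupt the internal one
	public List<Double> getPrices() {
		return new ArrayList<>(prices);
	}
}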

 

Java implementation

Let’s look at a simple example in Java:

  • We have a CarComparator business class. This class contains a function that compares 2 cars.
  • To instantiate a CarComparator, the constructor needs to load a default configuration from a database to configure the car comparison algorithm (for example, to put more weight on fuel consumption than on speed or price).
  • This class cannot be a singleton because the configuration can be modified by each user (therefore each user needs his own instance).
  • This is why we create an instance only once using the costly constructor.
  • Then, when a client needs a CarComparator, he gets a duplicate of that first instance.
//////////////////////////The Prototype interface
public interface Prototype {
	Prototype duplicate();
}

//////////////////////////The class we want to duplicate
public class CarComparator implements Prototype{
	private int priceWeigth;
	private int speedWeigth;
	private int fuelConsumptionWeigth;

	//a constructor that makes costly calls to a database
	//to get the default weigths
	public CarComparator(DatabaseConnection connect){
		//I let you imagine the costly calls to the database
	}

	//A private constructor only used to duplicate the object
	private CarComparator(int priceWeigth,int speedWeigth,int fuelConsumptionWeigth){
		this.priceWeigth=priceWeigth;
		this.speedWeigth=speedWeigth;
		this.fuelConsumptionWeigth=fuelConsumptionWeigth;
	}

	//The prototype method
	@Override
	public Prototype duplicate() {
		return new CarComparator(priceWeigth, speedWeigth, fuelConsumptionWeigth);
	}

	int compareCars(Car first, Car second){
		//some kickass and top secret algorithm using the weights
		return 0;//placeholder value so that the example compiles
	}

	////////////////The setters that lets the possibility to modify
	//////////////// the algorithm behaviour
	public void setPriceWeigth(int priceWeigth) {
		this.priceWeigth = priceWeigth;
	}

	public void setSpeedWeigth(int speedWeigth) {
		this.speedWeigth = speedWeigth;
	}

	public void setFuelConsumptionWeigth(int fuelConsumptionWeigth) {
		this.fuelConsumptionWeigth = fuelConsumptionWeigth;
	}
}

////////////////////////// A factory that creates a CarComparator instance using
////////////////////////// constructors then it creates the others by duplication.
////////////////////////// When a client ask for a CarComparator
////////////////////////// he gets a duplicate

public class CarComparatorFactory {
	CarComparator carComparator;
	public CarComparatorFactory(DatabaseConnection connect) {
		//We create one instance of CarComparator
		carComparator = new CarComparator(connect);
	}

	//we duplicate the instance so that
	//the duplicated instances can be modified
	public CarComparator getCarComparator(){
		return carComparator.duplicate();
	}

}

If you look at the next part, you’ll see that I could have written simpler code using the Java interface made for this, but I wanted you to understand the prototype pattern first.

In this example, at start-up, the prototype will be created using the default configuration in the database and each client will get a copy of this instance using the getCarComparator() method of the factory.

 

Real example

The Java API provides a prototype-style interface called Cloneable. It’s a marker interface: a concrete prototype implements Cloneable and overrides the clone() method (defined on Object) to return a copy of itself. Many classes from the Java APIs implement this interface, for example the collections from the collections API. Using an ArrayList, I can clone it and get a new ArrayList that contains the same data as the original one:

   // Let's initialize a list
   // with 10  integers
   ArrayList<Integer> list = new ArrayList<Integer>();
   for (int i = 0; i < 10; i++) {
      list.add(i);
   }
   System.out.println("content of the list "+list);

   //Let's now duplicate the list using the prototype method
   ArrayList<Integer> duplicatedList = (ArrayList<Integer>) list.clone();
   System.out.println("content of the duplicated list "+duplicatedList);

The result of this code is:

content of the list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
content of the duplicated list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

 

Builder Pattern

The Builder pattern is very useful for factoring out object-construction code. According to the GoF, this pattern aims to:

“Separate the construction of a complex object from its representation so that the same construction process can create different representations.”

Its use has changed over time, and nowadays it is mostly used to avoid creating a lot of constructors that differ only by their number of arguments. It’s a way to avoid the telescoping constructor anti-pattern.

 

The problem it solves

Let’s look at the problem this pattern solves. Imagine a class Person with 5 attributes:

  • age
  • weight
  • height
  • id
  • name

We want to be able to construct a person knowing:

  • only his age,
  • or only his age and weight,
  • or only his age, weight and height,
  • or only his age, weight, height and id
  • or only his age, weight, height, id and name

 

In Java, we could write something like this:

public class Person {
	private int age;
	private int weigth;
	private int height;
	private int id;
	private String name;

	//////////////Here comes the telescopic constructors
	public Person() {
		//some stuff
	}

	public Person(int age) {
		this();//we're using the previous constructor
		this.age = age;
	}

	public Person(int age, int weigth) {
		this(age);//we're using the previous constructor
		this.weigth = weigth;
	}

	public Person(int age, int weigth, int height) {
		this(age, weigth);//we're using the previous constructor
		this.height= height;
	}	

	public Person(int age, int weigth, int height,int id) {
		this(age, weigth, height);//we're using the previous constructor
		this.id = id;
	}	

	public Person(int age, int weigth, int height,int id,String name) {
		this(age, weigth, height, id);//we're using the previous constructor
		this.name = name;
	}	

}

In order to deal with this simple need, we’ve just created 5 constructors, which is a lot of code. I know Java is a very verbose language (troll inside), but what if there were a cleaner way?

Moreover, with this telescoping approach, the calling code is hard to read. For example, if you read the following code, can you easily tell what the parameters are? Are they age, id or height?

Person person1 = new Person (45, 45, 160, 1);

Person person2 = new Person (45, 170, 150);

Next problem: imagine you now want to be able to create a Person from every possible combination of information you get, for example:

  • an age
  • a weight
  • an age and a weight
  • an age and an id,
  • an age, a weight and a name,

With the constructor approach, you’d end up with a constructor for every combination of the 5 attributes (2^5 = 32 of them). And you’d have another problem: how can you deal with different constructors using the same types? For example, how can you have both:

  • a constructor for the age and the weight (which are 2 ints) and
  • a constructor for the age and the id (which are also 2 ints)?

You could use static factory methods, but you’d still need one per combination.

This is where the builder comes into play!

The idea of this pattern is to simulate named optional arguments, which are natively available in some languages like Python.

Since the UML version is very complicated (I think), we’ll start with a simple Java example and end with the formal UML definition.

 

A simple Java example

In this example, I have a person but this time the id field is mandatory and the other fields are optional.

I create a builder so that a developer can set the optional fields if he wants to.

//////////////////////the person class
/////////////////////if you look at its constructor
/////////////////////it requires a builder
public class Person {
   private final int id;// mandatory
   private int weigth;// optional
   private int height;// optional
   private int age;// optional
   private String name;// optional

   public Person(PersonBuilder builder) {
       age = builder.age;
       weigth = builder.weigth;
       height = builder.height;
       id = builder.id;
       name = builder.name;
    }
}
//////////////////////the builder that
/////////////////////takes care of
/////////////////////Person creation
public class PersonBuilder {
	// Required parameters
	final int id;

	// Optional parameters - initialized to default values
        int height;
	int age;
	int weigth;
	String name = "";

	public PersonBuilder(int id) {
		this.id = id;
	}

	public PersonBuilder age(int val) {
		age = val;
		return this;
	}

	public PersonBuilder weigth(int val) {
		weigth = val;
		return this;
	}

	public PersonBuilder height(int val) {
		height = val;
		return this;
	}

	public PersonBuilder name(String val) {
		name = val;
		return this;
	}

	public Person build() {
		return new Person(this);
	}
}

//////////////////////Here is how to use the builder in order to build a Person
//////////////////////You can see how readable is the code
public class SomeClass {
	public void someMethod(int id){
		PersonBuilder pBuilder = new PersonBuilder(id);
		Person robert = pBuilder.name("Robert").age(18).weigth(80).build();
		//some stuff
	}

	public void someMethodBis(int id){
		PersonBuilder pBuilder = new PersonBuilder(id);
		Person jennifer = pBuilder.height(170).name("Jennifer").build();
		//some stuff
	}

}

In this example, I assume the classes Person and PersonBuilder are in the same package, which allows the builder to use the Person constructor, while classes outside the package have to use the PersonBuilder to create a Person.

The PersonBuilder has 2 kinds of methods: those that set a part of a person and one that creates the person. All the properties of a person can only be modified by classes in the same package. I should have used getters and setters, but I wanted to keep the example short. You can see that the part that uses the builder is easy to read; we know that we are creating:

  • a person named Robert whose age is 18 and weight is 80,
  • another person named Jennifer whose height is 170.

Another advantage of this technique is that you can still create immutable objects. In my example, if I don’t add public setters to the Person class, a Person instance is immutable since no class outside the package can modify its attributes.

 

The formal definition

Now let’s look at the UML:

builder pattern by the GoF

 

This diagram is really abstract; a GoF builder has:

  • a Builder interface that specifies functions for creating parts of a Product object. In my diagram, there is just one method, buildPart().
  • a ConcreteBuilder that constructs and assembles parts of the product by implementing the Builder interface.
  • a Director: it constructs a product using the Builder interface.

 

According to the GoF, this pattern is useful when:

  • the algorithm for creating a complex object should be independent of the parts that make up the object and how they’re assembled,
  • the construction process must allow different representations for the object that’s constructed.

The example given by the GoF was a TextConverter builder that has 3 implementations to build an ASCIIText, a TeXText or a TextWidget. The 3 builder implementations (ASCIIConverter, TeXConverter and TextWidgetConverter) have the same functions except the createObject() function, which differs (this is why this function is not in the interface of this pattern). Using this pattern, the code that converts a text (the Director) uses the builder interface so it can easily switch from ASCII to TeX or TextWidget. Moreover, you can add a new converter without modifying the rest of the code. In a way, this pattern is very close to the State pattern.

But this problem is a rare case.

 

Another use of this pattern was popularized by Joshua Bloch, a Java developer who led the construction of many Java APIs. He wrote in his book “Effective Java”:

“Consider a builder when faced with many constructor parameters”

Most of the time the pattern is used for this use case. You don’t need the builder interface, multiple builder implementations or a director for this problem. In my Java example, and most of the time, you will find just a concrete builder.

The UML then becomes easier:

builder pattern by Joshua Bloch

In this diagram, the ConcreteBuilder has multiple functions that create each part of the product (but I just put one, buildPart(), because I’m lazy). These functions return the ConcreteBuilder so that you can chain the function calls, for example: builder.buildPart1().buildPart7().createObject(). The builder has a createObject() method to create the product when you don’t need to add more parts.

 

To sum up, the builder pattern is a good choice when you have a class with many optional parameters and you don’t want to end up with too many constructors. Though this pattern was not designed for this problem, it’s most of the time used for that (at least in Java).

 

Real example

The most common example in the Java APIs is the StringBuilder. Using it, you can build a temporary string, append new strings to it and, when you’re finished, create a real String object (which is immutable).

      StringBuilder sBuilder = new StringBuilder();
      String example = sBuilder.append("this").append(" is").
         append(" an").append(" example").toString();
      System.out.println(example);

 

Conclusion

You should now have a better overview of the creational patterns. If you need to remember one thing, it’s to use single instances instead of singletons. Keep in mind the builder pattern (Joshua Bloch’s version); it might be useful if you’re dealing with many optional parameters.

Design Pattern: factory patterns http://www.sunsetandecho.com/design-pattern-factory-patterns/ http://www.sunsetandecho.com/design-pattern-factory-patterns/#comments Thu, 11 Jun 2015 10:06:41 +0000 http://localhost/wordpress/?p=1


Factories are among the key creational patterns that every developer should know. They are the main component of many advanced patterns. For a long time I had trouble with the different types of factory patterns. Moreover, it’s difficult to find information about all these types in the same article. This article is about 4 types of factory patterns:

  • the factory method pattern,
  • the abstract factory pattern,
  • the static factory method,
  • the simple factory (also called factory).

The factory method pattern was described in the book “Design Patterns: Elements of Reusable Object-Oriented Software” by the Gang of Four. The first time I read about this pattern, I confused it with the static one, which was described by Joshua Bloch – one of the main architects of the Java APIs – in his book “Effective Java”. The simple factory (sometimes just called factory) is informal but appears many times on the net. The last one, the abstract factory pattern, was also described in the book of the “Gang of Four” and is a broader concept than the factory method pattern.

In this post, I’ll explain why factories are useful, then I’ll present each type with real examples from famous Java frameworks or Java APIs. I’ll use Java code to implement the factories, but if you don’t know Java you’ll still be able to understand the idea. Moreover, I’ll use UML to describe the patterns formally.

Anti-pattern

Though this article is about factory patterns, using patterns just for the sake of using patterns is worse than never using them. This behaviour is an anti-pattern. Indeed, most patterns make the code more difficult to understand. Most of the time, I don’t use factories. For example:

  • When I code alone at home/work, I avoid using them.
  • For small projects that won’t change a lot I also avoid factories.
  • For medium to large projects involving multiple developers using the same code I find them useful.

I always think of factories as a tradeoff between their advantages (we’ll see them in the next parts) and the readability and comprehension of the code.

 

The main objective of factories is to instantiate objects. But why not directly create objects with constructor calls?

For simple use cases, there is no need to use a factory. Let’s look at this code.

public class SimpleClass {
   private final Integer arg1;
   private final Integer arg2;

   SimpleClass(Integer arg1, Integer arg2) {
      this.arg1 = arg1;
      this.arg2 = arg2;
   }

   public Integer getArg1(){
      return arg1;
   }

   public Integer getArg2(){
      return arg2;
   }
}
...
public class BusinessClassXYZ {
   public static void someFunction(){
      SimpleClass mySimpleClass = new SimpleClass(1,2);
      // some stuff
   }
}

In this code, SimpleClass is a very simple class with a state, no dependencies, no polymorphism and no business logic. You could use a factory to create this object but it would double the amount of code. Therefore, it would make the code more difficult to understand. If you can avoid using factories, do it: you’ll end up with simpler code!

But you’ll often encounter more complex cases when writing large applications that require many developers and many code changes. For these complex cases, the advantages of factories outweigh their drawbacks.

 

The need for factories

Now that I warned you about the use of factories, let’s see why they are so powerful and therefore used in most projects.

Control over instantiation

A common use case in enterprise applications is to limit the number of instances of a class. How would you manage to have only one (or 2, or 10) instance(s) of a class because it consumes a resource like a socket, a database connection, a file system descriptor or whatever?

With the constructor approach, it would be difficult for different functions (from different classes) to know whether an instance of the class already exists. And even if there was an instance, how could a function get it? You could do that by using shared variables that each function would check, but:

  • it would link the behaviour of all the functions that need to instantiate the same class since they would be using and modifying the same shared variables,
  • multiple parts of the code would have the same logic to check whether the class has already been instantiated, which would lead to code duplication (very bad!).

Using a static factory method, you could easily do that:

public class Singleton {
   private static final Singleton INSTANCE = new Singleton();

   private Singleton(){}

   public static Singleton getInstance(){
      return INSTANCE;
   }
...
}
...
public class ClassXXX{
   ...
   public static void someFunctionInClassXXX(){
      Singleton instance = Singleton.getInstance();
      //some stuff
   }
}
...

public class ClassYYY{
   ...
   public static void someFunctionInClassYYY(){
      Singleton instance = Singleton.getInstance();
      //some stuff
   }
}

In this code, we’re using a factory that limits the number of instances of the class Singleton to one. By limiting the number of objects, we’re creating a pool of instances; this Pool pattern is based on a factory.

Note: Instead of limiting the number of instances, we could have modified the way an instance is created (for example by using a prototype pattern instead of creating a new object from scratch each time).
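
As an illustration of that pool idea, here is a minimal sketch of a static factory that hands out at most a fixed number of instances. The class name, the pool size and the choice of returning null when the pool is empty are made up for this sketch; a real pool would also handle blocking, timeouts and error cases.

import java.util.ArrayDeque;
import java.util.Deque;

public class PooledConnection {
   private static final int MAX_INSTANCES = 2;
   private static final Deque<PooledConnection> POOL = new ArrayDeque<>();

   static {
      // the pool is filled once, so at most MAX_INSTANCES objects ever exist
      for (int i = 0; i < MAX_INSTANCES; i++) {
         POOL.add(new PooledConnection());
      }
   }

   // nobody can bypass the static factory method
   private PooledConnection() {}

   // the static factory method: returns a pooled instance, or null if none is left
   public static synchronized PooledConnection getInstance() {
      return POOL.poll();
   }

   // gives an instance back to the pool once the caller is done with it
   public static synchronized void release(PooledConnection connection) {
      POOL.add(connection);
   }
}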

 

Loose coupling

Another advantage of factories is the loose coupling.

Let’s assume you write a program that computes stuff and needs to write logs. Since it’s a big project, one of your teammates codes the class that writes the logs into a filesystem (the class FileSystemLogger) while you’re coding the business classes. Without factories, you need to instantiate the FileSystemLogger with a constructor before using it:

public class FileSystemLogger {
   ...
   public void writeLog(String s) {
      // implementation
   }
}
...
public void someFunctionInClassXXX(some parameters){
   FileSystemLogger logger = new FileSystemLogger(some parameters);
   logger.writeLog("This is a log");
}

But what happens if there is a sudden change and you now need to write the logs in a database with the implementation DatabaseLogger? Without factories, you’ll have to modify all the functions using the FileSystemLogger class. Since this logger is used everywhere, you’ll need to modify hundreds of functions/classes, whereas with a factory you could easily switch from one implementation to another by modifying only the factory:

//this is an abstraction of a Logger
public interface ILogger {
   public void writeLog(String s);
}

public class FileSystemLogger implements ILogger {
   ...
   public void writeLog(String s) {
      // implementation
   }
}

public class DatabaseLogger implements ILogger {
   ...
   public void writeLog(String s) {
      // implementation
   }
}

public class FactoryLogger {
   public static ILogger createLogger() {
      //you can choose the logger you want
      // as long as it's an ILogger
      return new FileSystemLogger();
   }
}
////////////////////some code using the factory
public class SomeClass {
   public void someFunction() {
      //if the logger implementation changes
      //you have nothing to change in this code
      ILogger logger = FactoryLogger.createLogger();
      logger.writeLog("This is a log");
   }
}

If you look at this code, you can easily change the logger implementation from FileSystemLogger to DatabaseLogger: you just have to modify the function createLogger() (which is a factory). This change is invisible to the client (business) code since the client code uses a logger interface (ILogger) and the choice of the logger implementation is made by the factory. By doing so, you’re creating a loose coupling between the implementation of the logger and the parts of the code that use the logger.

 

Encapsulation

Sometimes, using a factory improves the readability of your code and reduces its complexity through encapsulation.

Let’s assume you need to use a business class CarComparator that compares 2 cars. This class needs a DatabaseConnection to get the features of millions of cars and a FileSystemConnection to get a configuration file that parametrizes the comparison algorithm (for example: giving more weight to the fuel consumption than to the maximum speed).
Without a factory you could code something like:

public class DatabaseConnection {
   DatabaseConnection(some parameters) {
      // some stuff
   }
   ...
}

public class FileSystemConnection {
   FileSystemConnection(some parameters) {
      // some stuff
   }
   ...
}

public class CarComparator {
   CarComparator(DatabaseConnection dbConn, FileSystemConnection fsConn) {
      // some stuff
   }

   public int compare(String car1, String car2) {
      // some stuff with objects dbConn and fsConn
   }
}
...
public class CarBusinessXY {
   public void someFunctionInTheCodeThatNeedsToCompareCars() {
      DatabaseConnection db = new DatabaseConnection(some parameters);
      FileSystemConnection fs = new FileSystemConnection(some parameters);
      CarComparator carComparator = new CarComparator(db, fs);
      carComparator.compare("Ford Mustang","Ferrari F40");
   }
...
}

public class CarBusinessZY {
   public void someOtherFunctionInTheCodeThatNeedsToCompareCars() {
      DatabaseConnection db = new DatabaseConnection(some parameters);
      FileSystemConnection fs = new FileSystemConnection(some parameters);
      CarComparator carComparator = new CarComparator(db, fs);
      carComparator.compare("chevrolet camaro 2015","lamborghini diablo");
   }
...
}

This code works but you can see that in order to use the comparison method, you need to instantiate

  • a DatabaseConnection,
  • a FileSystemConnection,
  • then a CarComparator.

If you need to use the comparison in multiple functions, you will have to duplicate your code, which means that if the construction of the CarComparator changes, you will have to modify all the duplicated parts. Using a factory factorizes the code and hides the complexity of the construction of the CarComparator class.

...
public class Factory {
   public static CarComparator getCarComparator() {
      DatabaseConnection db = new DatabaseConnection(some parameters);
      FileSystemConnection fs = new FileSystemConnection(some parameters);
      CarComparator carComparator = new CarComparator(db, fs);
      return carComparator;
   }
}
//////////////////////////////some code using the factory
public class CarBusinessXY {
   public void someFunctionInTheCodeThatNeedsToCompareCars() {
      CarComparator carComparator = Factory.getCarComparator();
      carComparator.compare("Ford Mustang","Ferrari F40");
   }
...
}
...
public class CarBusinessZY {
   public void someOtherFunctionInTheCodeThatNeedsToCompareCars() {
      CarComparator carComparator = Factory.getCarComparator();
      carComparator.compare("chevrolet camaro 2015","lamborghini diablo");
   }
...
}

If you compare both codes, you can see that using a factory:

  • Reduces the number of lines of code.
  • Avoids code duplication.
  • Organizes the code: the factory has the responsibility to build a CarComparator and the business class just uses it.

The last point is important (in fact, they’re all important!) because it’s about separation of concerns. A business class shouldn’t have to know how to build a complex object it needs to use: the business class needs to focus only on business concerns. Moreover, it also improves the division of work among the developers of the same project:

  • One works on the CarComparator and the way it’s created.
  • Others work on business objects that use the CarComparator.

 

Disambiguation

Let’s assume that you have a class with multiple constructors (with very different behaviors). How can you be sure that you won’t use the wrong constructor by mistake?
Let’s look at the following code:

 class Example{
     //constructor one
     public Example(double a, float b) {
         //...
     }
    //constructor two
     public Example(double a) {
         //...
     }
     //constructor three
     public Example(float a, double b) {
         //...
     }
}

Though constructors one and two don’t have the same number of arguments, you can quickly fail to choose the right one, especially at the end of a busy day using the nice autocomplete of your favorite IDE (I’ve been there). It’s even more difficult to see the difference between constructor one and constructor three. This example looks like a fake one but I saw it in legacy code (true story!).
The question is: how could you implement different constructors with the same types of parameters (while avoiding a dirty way like constructors one and three)?

Here is a clean solution using a factory:

 class Complex {
     public static Complex fromCartesian(double real, double imag) {
         return new Complex(real, imag);
     }

     public static Complex fromPolar(double rho, double theta) {
         return new Complex(rho * Math.cos(theta), rho * Math.sin(theta));
     }

     private Complex(double a, double b) {
         //...
     }
 }

In this example, using a factory adds a description of what the creation is about through the factory method name: you can create a Complex number from Cartesian coordinates or from polar coordinates. In both cases, you know exactly what the creation is about.

 

The factory patterns

Now that we’ve seen the pros and cons of factories, let’s focus on the different types of factory patterns.

I’ll present each factory from the simplest to the most abstract. If you want to use factories, keep in mind that the simpler the better.

 

static factory method

Note: If you read this article and don’t know a lot about Java, a static method is a class method.

The static factory method was described by Joshua Bloch in “Effective Java”:

“A class can provide a public static factory method, which is simply a static method that returns an instance of the class.”

In other words, instead of using a constructor to create an instance, a class can provide a static method that returns an instance. If this class has subtypes, the static factory method can return an instance of the class or of its subtypes. Though I hate UML, I said at the beginning of the article that I’d use UML to give a formal description. Here it is:

 

simplified UML version of static factory

In this diagram, the class ObjectWithStaticFactory has a static factory method (called getObject()). This method can instantiate any type of the class ObjectWithStaticFactory, which means a type ObjectWithStaticFactory, a type SubType1 or a type SubType2. Of course, this class can have other methods, properties and static factory methods.

 

Let’s look at this code:

public class MyClass {
   Integer a;
   Integer b;

   MyClass(int a, int b){
      this.a=a;
      this.b=b;
   };

   public static MyClass getInstance(int a, int b){
      return new MyClass(a, b);
   }

   public static void main(String[] args){
      //instanciation with a constructor
      MyClass a = new MyClass(1, 2);
      //instanciation with a static factory method
      MyClass b = MyClass.getInstance(1, 2);
   }
}

This code shows 2 ways to create an instance of MyClass:

  • a static factory method getInstance() inside MyClass
  • a constructor of MyClass

But this concept can go deeper. What if a class with a static factory method could instantiate another class? Joshua Bloch described this possibility:

“Interfaces can’t have static methods, so by convention, static factory methods for?an interface named Type are put in a noninstantiable class (Item 4) named Types.”

Here is the associated UML:

simplified UML version of static factory

In this case, the factory method getObject() is inside an abstract class named Types. The factory method can create instances of the class Type or of any subtype of Type (SubType1 or SubType2 in the diagram). The getObject() method can have parameters so that it returns a SubType1 for a given parameter and a SubType2 otherwise.

Let’s go back to Java and assume that we have 2 classes, Ferrari and Mustang, that implement an interface Car. The static factory method can be put in an abstract class named “CarFactory” (using Joshua Bloch’s conventions the name of the class should be “Cars” but I don’t like it):

/////////////////////////the products
public interface Car {
   public void drive();
}

public class Mustang implements Car{
   public void drive() {
      //	some stuff
   }
   ...
}

public class Ferrari implements Car{
   public void drive() {
      //	some stuff
   }
	...
}
///////////////////////// the factory
public abstract class CarFactory{
   public static Car getCar() {
      // choose which car you want, for example:
      return new Mustang();
   }
}
...
/////////////////////////some code using the factory
public static void someFunctionInTheCode(){
   Car myCar = CarFactory.getCar();
   myCar.drive();
}

The power of this pattern compared to the other factory patterns is that you don’t need

  • to instantiate the factory in order to use it (you’ll understand what I mean in a few minutes),
  • the factory to implement an interface (same comment).

It’s easy to use but works only for languages that provide class methods (i.e. the static keyword in Java).

Note: When it comes to factories, many posts on the net are wrong, like this post on Stack Overflow that was upvoted 1.5k times. The problem with the given examples of “factory method patterns” is that they are static factory methods. If I quote Joshua Bloch:

“A static factory method is not the same as the Factory Method pattern from Design Patterns [Gamma95, p. 107]. The static factory method described in this item has no direct equivalent in Design Patterns.”

If you look at the Stack Overflow post, only the last example (the URLStreamHandlerFactory) is a factory method pattern by the GoF (we’ll see this pattern in a few minutes).

Real examples

Here are some examples of static factory methods in Java frameworks and Java APIs. Finding examples in the Java APIs is very easy since Joshua Bloch was the main architect of many of them.

Logging frameworks

The Java logging frameworks slf4j, logback and log4j use a noninstantiable class, LoggerFactory. If a developer wants to write logs, he needs to get an instance of Logger from the static method getLogger() of LoggerFactory.
The Logger implementation returned by getLogger() will depend on the getLogger() implementation (and also on the configuration file written by the developer, which is used by getLogger()).

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Example {

   public void example() {
      //we're using the static factory method to get our logger
      Logger logger = LoggerFactory.getLogger(Example.class);

      logger.info("This is an example.");
   }

}

Note: the name of the factory class and of its static factory method is not exactly the same depending on whether you’re using slf4j, logback or log4j.

Java String class

The String class in Java represents a string. Sometimes, you need to get a string from a boolean or an integer. But String doesn’t provide constructors like String(Integer i) or String(Boolean b). Instead, it provides multiple static factory methods String.valueOf(…).

	int i = 12;
	String integerAsString = String.valueOf(i);

 

simple factory

This pattern is not a “real” one but I’ve seen it many times on the Internet. It doesn’t have a formal description, so here is mine: a simple factory (or factory) is a tool

  • whose job is to create/instantiate objects,
  • and is neither a factory method pattern (we’ll see this pattern after),
  • nor an abstract factory pattern (same comment).

You can see it as a generalization of the static factory method, but this time the factory can be instantiated (or not) because the “factory method” doesn’t have to be a class method (though it can be). For a Java developer, using the simple factory in its non-static form is rare. So, this pattern is most of the time equivalent to the static one. Here is a UML diagram for the non-static form:

 

simplified UML version of simple factory

In this case, the factory method getObject() is inside a class named Factory. The factory method is not a class method, so you need to instantiate the Factory before using it. The factory method can create instances of the class Type or of any of its subtypes.

 

Here is the previous example from the static factory method, but this time I instantiate the factory before using it:

/////////////////////////the products
public interface Car {
   public void drive();
}

public class Mustang implements Car{
   public void drive() {
      //	some stuff
   }
   ...
}

public class Ferrari implements Car{
   public void drive() {
      //	some stuff
   }
	...
}
/////////////////////////The factory
public class CarFactory{
   //this class is instantiable
   public CarFactory(){
      //some stuff
   }
   public Car getCar() {
      // choose which car you want, for example:
      return new Ferrari();
   }
}
...
/////////////////////////some code using the factory
public static void someFunctionInTheCode(){
   CarFactory carFactory = new CarFactory();
   Car myCar = carFactory.getCar();
   myCar.drive();
}

As you can see, this time I need to instantiate the factory in order to use it. I didn’t find a real example in Java since it’s better to use a static factory method than a simple factory. Still, you can use this pattern in its non-static form if your factory method needs some instances to work. For example, if you need a database connection, you could first instantiate your factory (which would instantiate the database connection) and then use the factory method that requires this connection. Personally, in this case I’d still use a static factory with lazy initialization (and a pool of database connections).
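
As a rough sketch of that non-static form (DatabaseConnection, CarRepository and RepositoryFactory are made-up names, not the API of a real framework), the factory owns the instance its factory method needs:

//a minimal sketch of a simple factory in its non-static form
class DatabaseConnection {
   DatabaseConnection(String url) {
      // open the connection
   }
}

class CarRepository {
   private final DatabaseConnection connection;
   CarRepository(DatabaseConnection connection) {
      this.connection = connection;
   }
}

public class RepositoryFactory {
   private final DatabaseConnection connection;

   //instantiating the factory creates the resource its factory method needs
   public RepositoryFactory(String url) {
      this.connection = new DatabaseConnection(url);
   }

   //the non-static factory method: every repository it builds shares the same connection
   public CarRepository getCarRepository() {
      return new CarRepository(connection);
   }
}
//////////////////////some code using the factory
public class SomeBusinessClass {
   public void someFunction() {
      RepositoryFactory factory = new RepositoryFactory("some connection string");
      CarRepository carRepository = factory.getCarRepository();
      //some stuff
   }
}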

Tell me if you know a Java framework that uses a simple factory in its non-static form.

 

factory method pattern

The factory method pattern is a more abstract factory. Here is the definition of the pattern given by the “Gang of Four”:

“Define an interface for creating an object, but let subclasses decide which class to instantiate. Factory Method lets a class defer instantiation to subclasses”

Here is a simplified UML diagram of the factory method pattern:

 

simplified UML version of factory method pattern

This diagram looks like the simple factory one (in its non-static form). The only (and BIG!) difference is the Factory interface:

  • the Factory represents the “interface for creating an object”. It describes a factory method: getObjects().
  • the ConcreteFactory represents one of the “subclasses [that] decide which class to instantiate”. Each ConcreteFactory has its own implementation of the factory method getObjects().

In the diagram, getObjects() has to return a Type (or one of its subtypes), which means that one concrete factory could return a SubType1 whereas another could return a SubType2.

Why use a factory method pattern instead of a simple factory?

Only when your code requires multiple factory implementations. This forces each factory implementation to have the same logic, so that a developer who uses one implementation can easily switch to another without wondering how to use it (since he just has to call the factory method, which has the same signature).

 

Since this is abstract, let’s go back to the car example. It’s not a great example, but I use it so that you can see the difference with the simple factory (we’ll then look at real examples to understand the power of this pattern):

/////////////////////////the products
public interface Car {
   public void drive();
}

public class Mustang implements Car{
   public void drive() {
      //	some stuff
   }
   ...
}

public class Ferrari implements Car{
   public void drive() {
      //	some stuff
   }
	...
}
///////////////////////// the factory
//the definition of the interface
public interface CarFactory{
   public Car getCar();
}

//the real factory with an implementation of the getCar() factory method
public class ConcreteCarFactory implements CarFactory{
   //this class is instantiable
   public ConcreteCarFactory(){
      //some stuff
   }
   public Car getCar() {
      //	choose which car you want
      return new Ferrari();
   }
}
...
/////////////////////////some code using the factory
public static void someFunctionInTheCode(){
   CarFactory carFactory = new ConcreteCarFactory();
   Car myCar = carFactory.getCar();
   myCar.drive();
}

If you compare this code with the simple factory, you can see that this time I added an interface (CarFactory). The real factory (ConcreteCarFactory) implements this interface.

As I said, this is not a great example because here you shouldn’t use a factory method pattern since there is only one concrete factory. It would be useful only if I had multiple implementations like SportCarFactory, VintageCarFactory, LuxeCarFactory, CheapCarFactory, etc. In this case, a developer could easily switch from one implementation to another since the factory method is always getCar(), as shown in the sketch below.
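
To make that switch concrete, here is a small sketch reusing the Car, Ferrari, Mustang and CarFactory types defined above; SportCarFactory, VintageCarFactory and CarShop are invented for the example.

public class SportCarFactory implements CarFactory{
   public Car getCar() {
      return new Ferrari();
   }
}

public class VintageCarFactory implements CarFactory{
   public Car getCar() {
      return new Mustang();
   }
}
/////////////////////////some code using the factories
public class CarShop {
   private final CarFactory carFactory;

   //the client code only depends on the CarFactory interface
   public CarShop(CarFactory carFactory) {
      this.carFactory = carFactory;
   }

   public void testDrive() {
      Car myCar = carFactory.getCar();
      myCar.drive();
   }
}

Switching from new CarShop(new SportCarFactory()) to new CarShop(new VintageCarFactory()) changes the cars produced without touching the CarShop code.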

 

Real examples

Java API

In Java, a common example is the iterator() function in the collection API. Each collection implements the interface Iterable<E>. This interface describes a function iterator() that returns an Iterator<E>. An ArrayList<E> is a collection, so it implements the interface Iterable<E> and its factory method iterator(), which returns a subtype of Iterator<E>.

//here is a simplified definition of an iterator from the java source code
public interface Iterator<E> {
    boolean hasNext();
    E next();
    void remove();
}

//here comes the factory interface!
public interface Iterable<T> {
    Iterator<T> iterator();
}

//here is a simplified definition of ArrayList from the java source code
//you can see that this class is a concrete factory that implements
//a factory method  iterator()
//Note : in the real Java source code, ArrayList is derived from
//AbstractList which is the one that implements the factory method pattern
public class ArrayList<E> {
 //the iterator() returns a subtype and an "anonymous" Iterator<E>
 public Iterator<E> iterator()
   {
     return new Iterator<E>()
     {
	//implementation of the methods hasNext(), next() and remove()
     }
   }
...
}

And here is a standard use of the ArrayList:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Example {
	public static void main(String[] args){
		//instantiation of the (concrete factory) ArrayList
		List<Integer> myArrayList = new ArrayList<>();
		//calling the factory method iterator() of ArrayList
		Iterator<Integer> myIterator = myArrayList.iterator();
	}
}

I showed an ArrayList but I could have shown a HashSet, a LinkedList or any other collection since they are all part of the collection API. The strength of this pattern is that you don’t need to know what type of collection you’re using: each collection will provide an Iterator through the factory method iterator().

Another good example is the stream() method in the new Java 8 collection API.
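
For instance, here is a small usage sketch: the collection decides which Stream implementation to return, and the client code never sees it.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class StreamExample {
   public static void main(String[] args) {
      List<Integer> numbers = Arrays.asList(1, 2, 3, 4);
      //stream() is the factory method exposed by the collection API
      Stream<Integer> stream = numbers.stream();
      //the concrete Stream implementation is chosen by the collection, not by us
      int sum = stream.mapToInt(Integer::intValue).sum();
      System.out.println(sum); //prints 10
   }
}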

Spring

The Spring Framework is based on a factory method pattern. The ApplicationContext implements the BeanFactory interface. This interface describes a function getBean(param) that returns an Object. This example is interesting because every class in Java derives from Object, so this factory can return an instance of any class (depending on the parameters).


import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Example {

   public static void main(String[] args) {
      //creation of the BeanFactory
      ApplicationContext context = new ClassPathXmlApplicationContext("config.xml");
      //creation of totally different types of objects with the same factory
      MyType1 objectType1 = (MyType1) context.getBean("myType1");
      MyType2 objectType2 = (MyType2) context.getBean("myType2");
   }
}

The abstract factory

Here comes the fat one! This factory was described by the Gang of Four with the following sentence:

“Provide an interface for creating families of related or dependent objects without specifying their concrete classes”

If you don’t understand this sentence, don’t worry, it’s normal. It’s not called abstract factory for nothing!

If it can help you, I see the abstract factory pattern as a generalization of the factory method pattern, except that this time the factory interface has multiple factory methods that are related. When I say related, I mean conceptually linked so that they form a “family” of factory methods. Let’s look at the UML diagram to see the difference with the factory method pattern:

 

simplified UML version of abstract factory pattern

  • Factory is an interface that defines multiple factory methods, 2 in our case: getObject1() and getObject2(). Each method creates a type (or one of its subtypes).
  • ConcreteFactory implements the Factory interface and therefore has its own implementation of getObject1() and getObject2(). Now imagine 2 concrete factories: one could return instances of SubType1.1 and SubType2.1 and the other of SubType1.2 and SubType2.2.

 

Since this is very abstract, let’s go back to the CarFactory example.

With the factory method pattern, the factory interface had just one method, getCar(). An abstract factory could be an interface with 3 factory methods: getEngine(), getBody() and getWheel(). You could have multiple concrete factories:

  • a SportCarFactory that could return instances of PowerfulEngine, RaceCarBody and RaceCarWheel,
  • a CheapCarFactory that could return instances of WeakEngine, HorribleBody and RottenWheel.

If you want to build a sports car, you’ll need to instantiate a SportCarFactory and then use it. And if you want to build a cheap car, you’ll need to instantiate a CheapCarFactory and then use it.

The 3 factory methods of this abstract factory are related: they all belong to the car production concept.
Of course, the factory methods can have parameters so that they return different types. For example, the getEngine(String model) factory method of SportCarFactory could return a Ferrari458Engine, a FerrariF50Engine or a Ferrari450Engine depending on the parameter.


Here is the same example in Java (with only the SportCarFactory and 2 factory methods):

/////////////////////////the different products
public interface Wheel{
   public void turn();
}

public class RaceCarWheel implements Wheel{
   public void turn(){
      //	some stuff
   }
   ...
}

public interface Engine{
   public void work();
}

public class PowerfulEngine implements Engine{
   public void work(){
      //	some stuff
   }
   ...
}

/////////////////////////the factory
public interface CarFactory{
   public Engine getEngine();
   public Wheel getWheel();
}

public class SportCarFactory implements CarFactory{
   public Engine getEngine(){
       return new PowerfulEngine();
   }
   public Wheel getWheel(){
       return new RaceCarWheel();
   }
}
/////////////////////////some code using the factory
public class SomeClass {
   public void someFunctionInTheCode(){
      CarFactory carFactory = new SportCarFactory();
      Wheel myWheel= carFactory.getWheel();
      Engine myEngine = carFactory.getEngine();
   }
}

This factory is not an easy one. So, when should you use it?

NEVER!!!! Hum, it’s difficult to answer. I see this factory pattern as a way to organise code. If you end up with many factory method patterns in your code and you see a common theme between some of them, you can gather this group into an abstract factory. I’m not a big fan of the “let’s use an abstract factory because we could need one in the future” approach, because this pattern is very abstract. I prefer building simple things and refactoring them later if needed.

Yet, a common use case is when you need to create a user interface with different looks and feels. This example was used by the Gang of Four to present this pattern. The UI will need products like a window, a scroll bar, buttons, etc. You can create a Factory interface with a concrete factory for each look and feel. Of course, this example was written before the Internet era; now you could have one component and modify its look and feel using CSS (or some scripting languages), even for desktop applications, which means a static factory method is most of the time enough.

But if you still want to use this pattern, here are some use cases from the GoF:

“a system should be configured with one of multiple families of products”

“you want to provide a class library of products, and you want to reveal just their interfaces, not their implementations”

Real examples

Most DAO (Data Access Object) frameworks use an abstract factory to specify the basic operations a concrete factory should provide. Though the names of the factory methods depend on the framework, they’re often close to:

  • createObject(…) or persistObject(…)
  • updateObject(…) or saveObject(…)
  • deleteObject(…) or removeObject(…)
  • readObject(…) or findObject(…)

 

For each type of object you manipulate, you’ll need a concrete factory. For example, if you need to manage people, houses and contracts using a database, you’ll have a PersonFactory, a HouseFactory and a ContractFactory.
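
As a rough, framework-agnostic sketch (the Dao interface, its method names and the Person class are made up for the example, not the API of a real library), such a family of operations could look like this:

class Person { /* some fields: id, name, ... */ }

//the "abstract factory": a family of related operations for one type of object
public interface Dao<T> {
   T createObject(T object);
   T readObject(long id);
   T updateObject(T object);
   void deleteObject(long id);
}

//one concrete factory per managed type, for example for Person
public class PersonDao implements Dao<Person> {
   public Person createObject(Person person) {
      //INSERT ... then return the persisted person
      return person;
   }
   public Person readObject(long id) {
      //SELECT ... and build a Person from the result
      return new Person();
   }
   public Person updateObject(Person person) {
      //UPDATE ...
      return person;
   }
   public void deleteObject(long id) {
      //DELETE ...
   }
}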

The CrudRepository from Spring is a good example of an abstract Factory.

If you want Java code, you can look for JPA, Hibernate or SpringData tutorials.

 

Conclusion

I hope that you now have a good idea of the different types of factory patterns and when to use them. Though I said it many times in this article, keep in mind that most of the time a factory makes the code more complex/abstract. Even if you know the factories (if you don’t, read this article again!), what about your co-workers? Yet, when working on medium/large applications, it’s worth using factories.

I struggled for a long time with factories to understand which is which, and an article like this one would have helped me. I hope this article was understandable and that I didn’t make too many mistakes. Feel free to tell me if something you read bothered you so that I can improve this article.

Leonard Susskind’s Quantum Mechanics course http://www.sunsetandecho.com/leonard-susskinds-quantum-mechanics-course/ http://www.sunsetandecho.com/leonard-susskinds-quantum-mechanics-course/#comments Mon, 08 Jun 2015 12:54:38 +0000 http://www.sunsetandecho.com/?p=668

I’ve heard about quantum mechanics a few times, from friends and TV shows/series (The Big Bang Theory, Fringe …) that make references to this weird subject. Since then, quantum mechanics has been on my “to-learn” list. I also wanted to learn quantum mechanics in order to understand the developments of quantum computing, which is likely to be the future of IT (it will be one of my next articles).

 

I first tried to learn it by reading articles on Wikipedia but I quickly stopped because the subject is spread across too many articles and some pages are (very) hard to understand for a non-physicist. This is why I watched a 10-lecture course on YouTube about quantum mechanics by Leonard Susskind. Those videos are recordings of a continuing education course given at Stanford in 2008. I chose this course because of Leonard Susskind’s reputation and because it’s a community course and therefore not aimed at physics (under)graduate students.

 

But for a “simplified” course it’s a pretty tough one! Indeed, this course involves:

  • advanced linear algebra (matrices and vectors, inner/outer products, eigenvalues and eigenvectors, complex numbers)
  • mathematical analysis (vector/Hilbert spaces, Hermitian operators, Hamiltonian operators, dual spaces, integration, Fourier transforms, the Dirac delta function)
  • good notions of classical physics (Newton’s laws, phase, phase space, waves, photons, electrons, electromagnetic fields, polarization)

Though Leonard Susskind (quickly) explains all those concepts, if you’ve never learned/seen most of them before (especially the mathematical ones and some physical ones like polarization) it will require a lot of work and re-watching to really understand the lectures. You can still skip the parts you don’t understand but if you skip too many parts without getting at least the overall idea you won’t understand the course.

 

Course overview

picture from youtube video Modern Physics: Quantum Mechanics

 

The course is divided into 10 two-hour lectures. Each lecture starts with a review of the previous one. Here is a summary of the 10 lectures:

  • The first one is the most accessible: Leonard Susskind gives some concrete and physical ideas of what quantum mechanics is and the differences with classical mechanics
  • The second one is a reminder of mathematical concepts: vector spaces (orthogonal and orthonormal bases, scalar product, linear operators), complex numbers, ket and bra vectors. He also gives some concrete examples of linear operators.
  • The third one continues on mathematical concepts: Hermitian operators, eigenvectors/eigenvalues. It then presents the (mathematical) postulates of quantum mechanics with a physical interpretation. The rest of the course is about the position and momentum mathematical/physical operators, with a focus on the position operator (and its physical interpretation).
  • The fourth one introduces the Dirac function and focuses on the mathematical momentum operator. Susskind ends with a physical example of what this operator means.
  • The fifth one presents 2 mathematical concepts: the outer product between 2 vectors and the Fourier transform that links the position with the momentum. Then the lecturer recalls some physical concepts: electromagnetic fields, photons and polarization, and ends with a definition of the quantum polarization state and the quantum polarization operator.
  • The sixth one continues on (linear) photon polarization and photon circular polarization that involves complex numbers. The course ends with the concept of expected values (the average value of an observable).
  • The seventh one starts with a comparison between a classical mechanics problem and a quantum one. Susskind introduces the concept of phase. The rest of the lecture is about how quantum states change with time and change with each other (quantum entanglement). This part introduces new math concepts: Hermitian conjugate and unitary operator.
  • The eighth one continues on how quantum states change with time using a new mathematical concept: Hamiltonian operators. It also introduces Schrödinger’s equation.
  • The ninth one starts with a (very good) talk on the history of QM and then continues with Schrödinger’s equation. It ends by studying how a wave packet moves in time, which links QM with classical mechanics.
  • The last one is an application of the concepts learned during the course by analyzing the quantum harmonic oscillator.

 

Review of the course

I took this course to get a good understanding of quantum physics and I’m satisfied. Before this course, I knew nothing about quantum mechanics. Now, I can read technical papers/articles on this subject and understand the ideas. The last time I studied advanced physics and mathematics was in 2007 and the “light” explanations from Leonard Susskind were enough to make me remember most of my old courses. Concerning the difficulty, I watched another online course from Stanford (Coursera) on machine learning (which also involves heavy mathematics) and the difficulty of Leonard Susskind’s course is way above Andrew Ng’s famous course. I had to watch some parts multiple times to understand them (or just get the overall idea), especially in the last lectures.

In my opinion, the advantages of this course are:

  • Leonard Susskind is a good speaker (it’s not boring)
  • Though I didn’t carefully understand every mathematical proof, I got the overall idea thanks to Leonard Susskind’s explanations.
  • It’s a very detailed course and not a “it’s too difficult, but trust me this is the truth” course. If you follow it carefully you’ll have a deep understanding of the subject (but won’t be an expert).
  • Each lecture starts with a summary of the previous lecture so that you can watch the course on multiple months without forgetting the previous concepts.
  • It’s free (the “live” course costs $350).

The drawbacks are:

  • Leonard Susskind is not rigorous when it comes to mathematics and sometimes it’s difficult to follow his mathematical proofs. For example, he sometimes
    • uses the same notation for Eigen values and Eigen vectors,
    • uses different notations for the same mathematical object,
    • makes no difference between a function f and f(x) which is the function f at “x”
  • In my opinion, this course focuses too much on the (heavy) mathematical part of quantum mechanics and not enough on the physical interpretation. This course has approximately an 85/15 ratio. Although QM involves a lot of math, I wonder if a 70/30 ratio would have been possible (especially for a “simplified” course).
  • The course requires a high level of abstraction (I can’t tell if it’s because of the teacher or the subject itself).
  • The questions from students are sometimes difficult to hear since the microphone is on the teacher.
  • The video has a low resolution (240p) and is too dark. However, it doesn’t prevent you from reading the whiteboard.

 

For physics enthusiasts with a good knowledge of physics and mathematics, this course is good but not great, due to a (too) strong focus on the mathematical parts. If you don’t know more than 70% of the concepts I used in the course overview, you should learn more about physics and/or mathematics before watching this course.

For physicists (or physics students), this course can be helpful to have another point of view of quantum physics but I don’t think that it is sufficient to be a stand-alone course.

 

Quantum confusion

Concerning quantum mechanics, I’m more confused than before starting the course. Unlike all the physics subjects I learned when I was a student (Newton’s laws, electricity, electromagnetism, thermodynamics …), it is counter-intuitive and the physical results are just consequences of mathematical theories applied to a set of axioms. Though some results have really been observed since the end of the 20th century (like the dual particle-like and wave-like behavior that only appears at the quantum level, or quantum entanglement), they don’t make sense to me. I feel like there is a missing piece in this theory, the one that would explain why the quantum world behaves like it does. Like Richard Feynman – one of the greatest minds in QM – said: “If you think you understand quantum mechanics, you don’t understand quantum mechanics.”

How does Shazam work http://www.sunsetandecho.com/how-shazam-works/ http://www.sunsetandecho.com/how-shazam-works/#comments Sat, 23 May 2015 15:29:50 +0000 http://localhost/wordpress/?p=109

 

Have you ever wondered how Shazam works? I asked myself this question a few years ago and I read a research article written by Avery Li-Chun Wang, the co-founder of Shazam, to understand the magic behind it. The quick answer is audio fingerprinting, which leads to another question: what is audio fingerprinting?

 

shazam logo

 

When I was a student, I never took a course in signal processing. To really understand Shazam (and not just have a vague idea) I had to start with the basics. This article is a summary of the research I did to understand Shazam.

I’ll start with the basics of music theory, present some signal processing stuff and end with the mechanisms behind Shazam. You don’t need any prior knowledge to read this article but, since it involves computer science and mathematics, it’s better to have a good scientific background (especially for the last parts). If you already know what the words “octaves”, “frequencies”, “sampling” and “spectral leakage” mean, you can skip the first parts.

Since it’s a long and technical article (11k words) feel free to read each part at different times.

 

 

Music and physics

A sound is a vibration that propagates through air (or water) and can be decoded by our ears. For example, when you listen to your mp3 player, the earphones produce vibrations that propagate through the air until they reach your ears. Light is also a vibration but you can’t hear it because your ears can’t decode it (but your eyes can).

A vibration can be modeled by sinusoidal waveforms. In this chapter, we’ll see how music can be physically/technically described.

 

Pure tones vs real sounds

A pure tone is a tone with a sinusoidal waveform. A sine wave is characterized by:

  • Its frequency: the number of cycles per second. Its unit is the Hertz (Hz), for example 100Hz = 100 cycles per second.
  • Its amplitude (related to loudness for sounds): the size of each cycle.

Those characteristics are decoded by the human ear to form a sound. Humans can hear pure tones from 20 Hz to 20,000 Hz (for the best ears) and this range decreases with age. By comparison, the light you see is composed of sinewaves from 4*10^14 Hz to 7.9*10^14 Hz.

You can check the range of your ears with YouTube videos like this one that plays all the pure tones from 20 Hz to 20 kHz; in my case I can’t hear anything above 15 kHz.

 

The human perception of loudness depends on the frequency of the pure tone. For instance, a pure tone of amplitude 10 and frequency 30Hz will sound quieter than a pure tone of amplitude 10 and frequency 1000Hz. Human ears follow a psychoacoustic model; you can check this article on Wikipedia for more information.

Note: This fun fact will have consequences at the end of the article.

 

pure sinewave at 20 Hz

In this figure, you can see the representation of a pure sine wave of frequency 20 Hz and amplitude 1.

Pure tones don’t exist in nature, but every sound in the world is the sum of multiple pure tones at different amplitudes.

 

composition of sinewaves

 

In this figure, you can see the representation of a more realistic sound, which is the composition of multiple sinewaves:

  • a pure sinewave of frequency 20 Hz and amplitude 1
  • a pure sinewave of frequency 40 Hz and amplitude 2
  • a pure sinewave of frequency 80 Hz and amplitude 1.5
  • a pure sinewave of frequency 160 Hz and amplitude 1

A real sound can be composed of thousands of pure tones.
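
To make this concrete, here is a small Java sketch (the sample rate and the one-second duration are arbitrary choices for the example) that builds the composite signal above by summing its four pure tones:

public class CompositeTone {
   public static void main(String[] args) {
      int sampleRate = 1000;                      // samples per second, arbitrary for the example
      double[] frequencies = {20, 40, 80, 160};   // the four pure tones above, in Hz
      double[] amplitudes  = {1, 2, 1.5, 1};

      double[] signal = new double[sampleRate];   // one second of "sound"
      for (int n = 0; n < signal.length; n++) {
         double t = (double) n / sampleRate;      // time of this sample, in seconds
         for (int i = 0; i < frequencies.length; i++) {
            // a pure tone is a sine wave: amplitude * sin(2 * pi * frequency * time)
            signal[n] += amplitudes[i] * Math.sin(2 * Math.PI * frequencies[i] * t);
         }
      }
      System.out.println("first samples: " + signal[0] + ", " + signal[1] + ", " + signal[2]);
   }
}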

 

Musical Notes

a sheet music example (Simple Gifts)

A music score is a set of notes executed at certain moments. Those notes also have a duration and a loudness.

The notes are divided into octaves. In most Western countries, an octave is a set of 8 notes (A, B, C, D, E, F, G in most English-speaking countries and Do, Re, Mi, Fa, Sol, La, Si in most Latin countries) with the following property:

  • The frequency of a note doubles from one octave to the next. For example, the frequency of A4 (A in the 4th octave), 440Hz, equals 2 times the frequency of A3 (A in the 3rd octave), 220Hz, and 4 times the frequency of A2 (A in the 2nd octave), 110Hz.

Many instruments provide more than 8 notes per octave; those additional notes are called semitones or half steps.

octave

For the 4th octave (or 3rd octave in Latin countries), the notes have the following frequencies (see the small sketch after this list):

  • C4 (or Do3) = 261.63Hz
  • D4 (or Re3) = 293.67Hz
  • E4 (or Mi3) = 329.63Hz
  • F4 (or Fa3) = 349.23Hz
  • G4 (or Sol3) = 392Hz
  • A4 (or La3) = 440Hz
  • B4 (or Si3) = 493.88Hz
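
These values all follow one rule (twelve-tone equal temperament, which the list above uses): going up one semitone multiplies the frequency by 2^(1/12), with A4 = 440Hz as the reference. Here is a minimal sketch:

public class NoteFrequency {
   //frequency of the note located 'semitonesFromA4' semitones away from A4 (440 Hz)
   static double frequency(int semitonesFromA4) {
      return 440.0 * Math.pow(2, semitonesFromA4 / 12.0);
   }

   public static void main(String[] args) {
      System.out.println(frequency(-9)); // C4: ~261.63 Hz
      System.out.println(frequency(0));  // A4: 440 Hz
      System.out.println(frequency(12)); // A5: 880 Hz (one octave up, the frequency doubles)
   }
}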

Though it might seem odd, the frequency sensitivity of our ears is logarithmic. It means that:

  • between 32.70 Hz and 61.74Hz (the 1st octave)
  • or between 261.63Hz and 466.16Hz (4th octave)
  • or between 2,093 Hz and 3,951.07 Hz (7th octave)

Human ears will be able to detect the same number of notes.

FYI, the A4/La3 at 440Hz is a standard reference for the calibration of acoustic equipment and musical instruments.

 

Timbre

The same note doesn’t sound exactly the same if it’s played by a guitar, a piano, a violin or a human singer. The reason is that each instrument has its own timbre for a given note.

For each instrument, the sound produced is a multitude of frequencies that sounds like a given note (the scientific term for a musical note is pitch). This sound has a fundamental frequency (the lowest frequency) and multiple overtones (any frequency higher than the fundamental).

Most instruments produce (close to) harmonic sounds. For those instruments, the overtones are multiples of the fundamental frequency called harmonics. For example the composition of pure tones A2 (fundamental), A4 and A6 is harmonic whereas the composition of pure tones A2, B3, F5 is inharmonic.

Many percussion instruments (like cymbals or drums) create inharmonic sounds.

Note: The pitch (the musical note perceived) might not be present in the sound played by an instrument. For example, if an instrument plays a sound with pure tones A4, A6 and A8, the human brain will interpret the resulting sound as an A2 note. This note/pitch will be an A2 whereas the lowest frequency in the sound is an A4 (this phenomenon is called the missing fundamental).

 

Spectrogram

A song is played by multiple instruments and singers. All those instruments produce a combination of sinewaves at multiple frequencies and the overall result is an even bigger combination of sinewaves.

It is possible to see music with a spectrogram. Most of the time, a spectrogram is a 3-dimensional graph where:

  • on the horizontal (X) axis, you have the time,
  • on the vertical (Y) axis you have the frequency of the pure tone
  • the third dimension is described by a color and it represents the amplitude of a frequency at a certain time.

For example, here is the sound of a piano playing a C4 note (whose fundamental frequency is 261.63Hz).

And here is the associated spectrogram:

spectrogram of a C4 piano note

The color represents the amplitude in dB (we’ll see in a later chapter what that means).

As I told you in the previous chapter, though the note played is a C4, there are other frequencies than 261Hz in this recording: the overtones. What’s interesting is that the other frequencies are multiples of the first one: the piano is an example of a harmonic instrument.

Another interesting fact is that the intensity of the frequencies changes through time; it’s another particularity of an instrument that makes it unique. If you take the same artist but replace the piano, the frequencies won’t evolve in the same way and the resulting sound will be slightly different, because each artist/instrument has its own style. Technically speaking, these evolutions of frequencies modify the envelope of the sound signal (which is a part of the timbre).

To give you a first idea of Shazam’s music fingerprinting algorithm, you can see in this spectrogram that some frequencies (the lowest ones) are more important than others. What if we kept just the strongest ones?

 

Digitalization

Unless you’re a vinyl lover, when you listen to music you’re using a digital file (mp3, Apple Lossless, ogg, audio CD, whatever). But when artists produce music, it is analog (not represented by bits). The music is digitalized in order to be stored and played by electronic devices (like computers, phones, mp3 players, CD players …). In this part we’ll see how to go from an analog sound to a digital one. Knowing how digital music is made will help us analyse and manipulate it in the next parts.

 

Sampling

Analog signals are continuous signals, which means that if you take one second of an analog signal, you can divide this second into [put the greatest number you can think of, and I hope it’s a big one!] parts that each last a fraction of a second. In the digital world, you can’t afford to store an infinite amount of information. You need a minimum unit of time, for example 1 millisecond. During this unit of time the sound cannot change, so this unit needs to be short enough for the digital song to sound like the analog one and big enough to limit the space needed to store the music.

For example, think about your favorite music. Now think about it with the sound changing only every 2 seconds: it sounds like nothing. Technically speaking, the sound is aliased. To be sure that your song sounds great, you could choose a very small unit like a nanosecond (10^-9 second). This time your music sounds great but you don’t have enough disk space to store it. Too bad.

This problem is called sampling.

The standard unit of time in digital music is 44,100 units (or samples) per second. But where does this 44.1kHz come from? Well, some dude thought it would be good to use 44,100 units per second and that’s all … I’m kidding, of course.

In the first chapter, I told you that humans can hear sounds from 20Hz to 20kHz. A theorem from Nyquist and Shannon states that if you want to digitalize a signal from 0Hz to 20kHz, you need at least 40,000 samples per second. The main idea is that a sine wave at a frequency F needs at least 2 points per cycle to be identified. If your sampling frequency is at least twice the frequency of your signal, you’ll end up with at least 2 points per cycle of the original signal.

Let’s try to understand with a picture; look at this example of a good sampling:

sampling of a signal

In this figure, a sound at 20Hz is digitalized using a 40Hz sampling rate:

  • the blue curve represents the sound at 20 Hz,
  • the red crosses represent the sampled sound, which means I marked the blue curve with a red cross every 1/40th of a second,
  • the green line is an interpolation of the sampled sound.

Though it hasn’t the same shape nor the same amplitude,?the?frequency of the sampled signal remains the same.

And here is an example of bad sampling:

example of undersampling

In this figure, a sound at 20 Hz is digitized with a 30 Hz sampling rate. This time the frequency of the sampled signal is not the same as the original signal: it's only 10 Hz. If you look carefully, you can see that one cycle in the sampled signal represents two cycles in the original signal. This is a case of undersampling.

This case also shows something else: if you want to digitize a signal between 0 Hz and 20 kHz, you need to remove its frequencies above 20 kHz from the signal before the sampling. Otherwise those frequencies will be transformed into frequencies between 0 Hz and 20 kHz and therefore add unwanted sounds (this is called aliasing).
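To make aliasing concrete, here is a small sketch of that 20 Hz example (it assumes numpy is available; the phase offset and the helper name are mine):

import numpy as np

def dominant_frequency(signal_freq, sampling_rate, duration=1.0):
    # Sample a pure sine of signal_freq Hz at sampling_rate Hz
    # (the small phase offset avoids sampling the sine exactly at its zero crossings)
    t = np.arange(int(sampling_rate * duration)) / sampling_rate
    samples = np.sin(2 * np.pi * signal_freq * t + np.pi / 4)
    # The strongest bin of the spectrum is the frequency we "see" after sampling
    spectrum = np.abs(np.fft.rfft(samples))
    return np.argmax(spectrum) / duration

print(dominant_frequency(20, 44100))  # 20.0: the frequency is preserved
print(dominant_frequency(20, 30))     # 10.0: the 20 Hz sine is aliased to 10 Hz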

 

To sum up, if you want a good music conversion from analog to digital, you have to record the analog music at least 40,000 times per second. Hi-Fi corporations (like Sony) chose 44.1 kHz during the 80s because it was above 40,000 Hz and compatible with the NTSC and PAL video standards. Other audio standards exist like 48 kHz (Blu-ray), 96 kHz or 192 kHz, but if you're neither a professional nor an audiophile you're likely to listen to 44.1 kHz music.

Note1: The Nyquist-Shannon theorem is broader than what I said; you can check Wikipedia if you want to know more about it.

Note2: The sampling rate needs to be strictly greater than 2 times the frequency of the signal to digitize because, in the worst case scenario, you could end up with a constant digitized signal.

 

Quantization

We saw how to digitize the frequencies of an analog music, but what about the loudness? The loudness is a relative measure: for the same loudness inside the signal, if you turn up your speakers the sound will be louder. The loudness measures the variation between the lowest and the highest level of sound inside a song.

The same problem appears for loudness: how to pass from a continuous world (with an infinite variation of volume) to a discrete one?

Imagine your favorite music with only 4 states of loudness: no sound, low sound, high sound and full power. Even the best song in the world becomes unbearable. What you've just imagined was a 4-level quantization.

Here is an example of a low quantization of an audio signal:

8_level_quantization-min

This figure presents an 8-level quantization. As you can see, the resulting sound (in red) is very altered. The difference between the real sound and the quantized one is called the quantization error or quantization noise. This 8-level quantization is also called a 3-bit quantization because you only need 3 bits to encode the 8 different levels (8 = 2^3).

Here is the same signal with a 64-level quantization (or 6-bit quantization):

64_levels_quantization-min

Though the resulting sound is still altered, it looks (and sounds) more like the original sound.

 

Thankfully, humans don't have extra sensitive ears. The standard quantization is coded on 16 bits, which means 65,536 levels. With a 16-bit quantization, the quantization noise is low enough for human ears.
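To give a rough idea of what quantization looks like in code, here is a minimal sketch (it assumes samples normalized between -1.0 and 1.0 and ignores dithering; the level count is approximate, as in my figures):

import numpy as np

def quantize(samples, bits):
    # Map samples in [-1.0, 1.0] to (approximately) 2**bits discrete levels
    levels = 2 ** (bits - 1)
    return np.clip(np.round(samples * levels) / levels, -1.0, 1.0)

t = np.linspace(0, 1, 44100, endpoint=False)
signal = 0.8 * np.sin(2 * np.pi * 440 * t)            # a 440 Hz sine
print(np.max(np.abs(signal - quantize(signal, 3))))   # large quantization noise
print(np.max(np.abs(signal - quantize(signal, 16))))  # inaudible quantization noise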

Note: In the studio, the quantization used by professionals is 24 bits, which means there are 2^24 (about 16.8 million) possible variations of loudness between the lowest point of the track and the highest.

Note2: I made some approximations in my examples concerning the number of quantization levels.

 

Pulse Code Modulation

PCM or Pulse Code Modulation is a standard that represents digital signals. It is used by compact discs and most electronic devices. For example, when you listen to an MP3 file on your computer/phone/tablet, the MP3 is automatically transformed into a PCM signal and then sent to your headphones.

A PCM stream is a stream of organized bits. It can be composed of multiple channels. For example, a stereo music has 2 channels.

In a stream, the amplitude of the signal is divided into samples. The number of samples per second corresponds to the sampling rate of the music. For instance, a 44.1 kHz sampled music will have 44,100 samples per second. Each sample gives the (quantized) amplitude of the sound for the corresponding fraction of a second.

There are multiple PCM formats but the most used one in audio is the (linear) PCM 44.1 kHz, 16-bit depth stereo format. This format has 44,100 samples for each second of music. Each sample takes 4 bytes:

  • 2 bytes (16 bits) for the intensity (from -32,768 to 32,767) of the left speaker
  • 2 bytes (16 bits) for the intensity (from -32,768 to 32,767) of the right speaker

 

example of a pulse code modulation stereo sample

In a PCM 44.1 kHz, 16-bit depth stereo format, you have 44,100 samples like this one for every second of music.
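To make the byte layout concrete, here is a sketch that packs and unpacks one stereo sample with Python's struct module (the little-endian layout is my assumption; it is the usual choice in WAV files):

import struct

def pack_stereo_sample(left, right):
    # One PCM frame: 2 signed 16-bit integers, little-endian, 4 bytes in total
    return struct.pack('<hh', left, right)

def unpack_stereo_sample(frame):
    return struct.unpack('<hh', frame)

frame = pack_stereo_sample(-32768, 32767)
print(len(frame))                   # 4 bytes per sample
print(unpack_stereo_sample(frame))  # (-32768, 32767)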

 

From digital sound to frequencies

You now know how to pass from an analog sound to a digital one. But how can you get the frequencies inside a digital signal? This part is very important since the Shazam fingerprinting algorithm works only with frequencies.

For analog (and therefore continuous) signals, there is a transformation called the Continuous Fourier Transform. This function transforms a function of time into a function of frequencies. In other words, if you apply the Fourier transform to a sound, it will give you the frequencies (and their intensities) inside this sound.

But there are 2 problems:

  • We are dealing with digital sounds and therefore finite (non-continuous) sounds.
  • To have a better knowledge of the frequencies inside a music, we need to apply the Fourier Transform on small parts of the full-length audio signal (like 0.1-second parts) so that we know the frequencies for each 0.1-second part of the audio track.

Thankfully, there is another mathematical function, the Discrete Fourier Transform (DFT), that works with some limitations.

Note: The Fourier Transform must be applied to only one channel, which means that if you have a stereo song you need to transform it into a mono song.

 

Discrete Fourier Transform

The DFT (Discrete Fourier Transform) applies to discrete signals and gives a discrete spectrum (the frequencies inside the signal).

 

Here is the magic formula to transform a digital signal into frequencies (don’t run away, I’ll explain it):

X(n) = Σ_{k=0}^{N-1} x(k) · e^(-2iπ·k·n/N)

In this formula:

  • N is the size of the window: the number of samples that compose the signal (we'll talk a lot about windows in the next part).
  • X(n) represents the nth bin of frequencies.
  • x(k) is the kth sample of the audio signal.

 

For example, for an audio signal with a 4096-sample window, this formula must be applied 4096 times (a naive Python version is sketched just after this list):

  • 1 time for n = 0 to compute the 0th bin of frequencies
  • 1 time for n = 1 to compute the 1st bin of frequencies
  • 1 time for n = 2 to compute the 2nd bin of frequencies
  • … and so on up to n = 4095
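Here is that naive Python version, a direct translation of the formula (it is only meant to make the formula concrete; a real implementation would use an FFT, as explained later):

from cmath import exp, pi
from math import sin

def dft(x):
    # Naive Discrete Fourier Transform: O(N^2), straight from the formula
    N = len(x)
    return [sum(x[k] * exp(-2j * pi * k * n / N) for k in range(N))
            for n in range(N)]

# Example: a sine that completes 4 cycles over a 32-sample window
signal = [sin(2 * pi * 4 * k / 32) for k in range(32)]
magnitudes = [abs(b) for b in dft(signal)]
print(max(range(16), key=lambda n: magnitudes[n]))  # 4: the energy is in the 4th bin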

 

As you might have noticed, I spoke about bins of frequencies and not frequencies. The reason is that the DFT gives a discrete spectrum. A bin of frequencies is the smallest unit of frequency the DFT can compute. The size of the bin (called the spectral/spectrum resolution or frequency resolution) equals the sampling rate of the signal divided by the size of the window (N). In our example, with a 4096-sample window and a standard audio sampling rate of 44.1 kHz, the frequency resolution is 10.77 Hz (except the 0th bin, which is special):

  • the 0th bin represents the frequencies from 0 Hz to 5.38 Hz
  • the 1st bin represents the frequencies from 5.38 Hz to 16.15 Hz
  • the 2nd bin represents the frequencies from 16.15 Hz to 26.92 Hz
  • the 3rd bin represents the frequencies from 26.92 Hz to 37.69 Hz

 

That means that the DFT can't dissociate 2 frequencies that are closer than 10.77 Hz. For example, notes at 27 Hz, 32 Hz and 37 Hz end up in the same bin. If the note at 37 Hz is very powerful, you'll just know that the 3rd bin is powerful. This is problematic for dissociating notes in the lowest octaves. For example:

  • an A1 (or La) is at 55 Hz whereas an A#1 is at 58.27 Hz and a G1 (or Sol) is at 49 Hz.
  • the first note of a standard 88-key piano is an A0 at 27.5 Hz, followed by an A#0 at 29.14 Hz.

 

You can improve the frequency resolution by increasing the window size but that means losing fast frequency/note changes inside the music:

  • An audio signal has a sampling rate of 44.1 kHz
  • Increasing the window means taking more samples and therefore increasing the time taken by the window.
  • With 4096 samples, the window duration is 0.1 sec and the frequency resolution is 10.7 Hz: you can detect a change every 0.1 sec.
  • With 16384 samples, the window duration is 0.37 sec and the frequency resolution is 2.7 Hz: you can detect a change every 0.37 sec.

 

Another particularity of an audio signal is that we only need half the bins computed by the DFT. In the previous example, the bin size is 10.7 Hz, which means that the 2047th bin represents the frequencies from 21,902.9 Hz to 21,913.6 Hz. But:

  • The 2049th bin gives the same information as the 2047th bin
  • The 2050th bin gives the same information as the 2046th bin
  • More generally, the (4096 − X)th bin gives the same information as the Xth bin (for a real-valued signal, it is its complex conjugate)
  • …

If you want to know why the bin resolution equals “the sampling rate” divided by “the size of the window”, or why this formula is so bizarre, you can read a 5-part article on the Fourier Transform on this very good website (especially part 4 and part 5). It is the best article for beginners that I have read (and I have read a lot of articles on the matter).

 

Window functions

If you want to get the frequencies of a one-second sound for each 0.1-second part, you have to apply the Fourier Transform to the first 0.1-second part, then to the second 0.1-second part, then to the third 0.1-second part …

The problem

By doing so, you are implicitly applying a (rectangular) window function:

  • For the first 0.1 second, you are applying the Fourier transform to the full one-second signal multiplied by a function that equals 1 between 0 and 0.1 second, and 0 for the rest
  • For the second 0.1 second, you are applying the Fourier transform to the full one-second signal multiplied by a function that equals 1 between 0.1 and 0.2 second, and 0 for the rest
  • For the third 0.1 second, you are applying the Fourier transform to the full one-second signal multiplied by a function that equals 1 between 0.2 and 0.3 second, and 0 for the rest

Here is a visual example of the window function to apply to a digital (sampled) audio signal to get the first 0.01-second part:

rectangular window with a signal

In this figure, to get the frequencies for the first 0.01-second part, you need to multiply the sampled audio signal (in blue) with the window function (in green).

rectangular window with a signal

In this figure, to get the frequencies for the second 0.01-second part, you need to multiply the sampled audio signal (in blue) with the window function (in green).

By “windowing” the audio signal, you multiply your signal audio(t) by a window function window(t). This window function produces spectral leakage. Spectral leakage is the appearance of new frequencies that don't exist inside the audio signal: the power of the real frequencies leaks to other frequencies.

Here is a non-formal (and very light) mathematical explanation. Let's assume you want a part of the full audio signal. You multiply the audio signal by a window function that lets the sound pass only for the part you want:

part_of_audio(t) = full_audio(t) · window(t)

When you try to get the frequencies of this part of audio, you apply the Fourier transform to the signal:

Fourier(part_of_audio(t)) = Fourier(full_audio(t) · window(t))

According to the convolution theorem (* represents the convolution operator and · the multiplication operator):

Fourier(full_audio(t) · window(t)) = Fourier(full_audio(t)) * Fourier(window(t))

→ Fourier(part_of_audio(t)) = Fourier(full_audio(t)) * Fourier(window(t))

→ The frequencies of part_of_audio(t) depend on the window() function used.

I won't go deeper because it requires advanced mathematics. If you want to know more, look at this link: on page 29, the chapter “the truncate effects” presents the mathematical effect of applying a rectangular window to a signal. What you need to keep in mind is that cutting an audio signal into small parts to analyze the frequencies of each part produces spectral leakage.

 

different types of windows

You can't avoid spectral leakage, but you can control how the leakage behaves by choosing the right window function: instead of using a rectangular window function, you can choose a triangular window, a Parzen window, a Blackman window, a Hamming window …

The rectangular window is the easiest window to use (because you just have to “cut” the audio signal into small parts) but for analyzing the most important frequencies in a signal, it might not be the best type of window. Let's have a look at 3 types of windows: rectangular, Hamming and Blackman. In order to analyze the effect of the 3 windows, we will use the following audio signal composed of:

  • A frequency 40 Hz with an amplitude of 2
  • A frequency 160 Hz with an amplitude of 0.5
  • A frequency 320 Hz with an amplitude of 8
  • A frequency 640Hz with an amplitude of 1
  • A frequency 1000 Hz with an amplitude of 1
  • A frequency 1225 Hz with an amplitude of 0.25
  • A frequency 1400 Hz with an amplitude of 0.125
  • A frequency 2000 Hz with an amplitude of 0.125
  • A frequency 2500Hz with an amplitude of 1.5

In a perfect world, the Fourier transform of this signal should give us the following spectrum:

example of a spectrum of multiple sinewave

This figure shows a spectrum with only 9 vertical lines (at 40 Hz, 160 Hz, 320 Hz, 640 Hz, 1000 Hz, 1225 Hz, 1400 Hz, 2000 Hz and 2500 Hz). The y axis gives the amplitude in decibels (dB), which means the scale is logarithmic. With this scale, a sound at 60 dB is 100 times more powerful than a sound at 40 dB and 10,000 times more powerful than a sound at 20 dB. To give you an idea, when you speak in a quiet room, the sound you produce is 20-30 dB higher (at 1 m from you) than the sound of the room.

In order to plot this “perfect” spectrum, I applied the Fourier Transform with a very long window: a 10-second window. Using a very long window reduces the spectral leakage, but 10 seconds is too long because in a real song the sound changes much faster. To give you an idea of how fast the music changes:

  • here is a video with 1 change (or beat) per second; it sounds slow but it's a common rhythm for classical music.
  • here is a video with 2.7 changes per second; it sounds much faster but this rhythm is common for electro music.
  • here is a video with 8.3 changes per second; it's a very (very) fast rhythm but possible for small parts of songs.

In order to capture those fast changes, you need to “cut” the sound into very small parts using window functions. Imagine you want to analyze the frequencies of a sound every 1/3 second.

example of rectangular, blackman and hamming windows

In this figure, you can multiply the audio signal by one of the 3 window types to get the part of the signal between 0.333 sec and 0.666 sec. As I said, using a rectangular window is like cutting the signal between 0.333 sec and 0.666 sec, whereas with the Hamming or the Blackman windows you need to multiply the signal by the window signal.
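In code, “multiplying the signal by the window” is just an element-wise product. Here is a sketch with numpy, which already provides the classic window shapes (the random signal is only a stand-in for real audio):

import numpy as np

chunk_size = 14700                          # 1/3 second at 44.1 kHz
signal = np.random.randn(44100)             # stand-in for one second of audio

# The part of the signal between 0.333 sec and 0.666 sec, with 3 different windows
chunk = signal[chunk_size:2 * chunk_size]
rectangular = chunk * np.ones(chunk_size)   # a plain cut
hamming = chunk * np.hamming(chunk_size)    # tapered at both ends
blackman = chunk * np.blackman(chunk_size)  # stronger taper

# Each windowed chunk leaks differently once you look at its spectrum
spectrum = np.abs(np.fft.rfft(blackman))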

Now, here is the spectrum of the previous audio signal with a 4096-sample window:

spectrogram of rectangular, blackman and hamming windows

The signal is sampled at 44100Hz so a 4096-sample window represents a 93-millisecond part (4096/44100) and a frequency resolution of 10.7 Hz.

This figure shows that all windows modify the real spectrum of the sound. We clearly see that a part of the power of the real frequencies is spread to their neighbours. The spectrum from the rectangular window is the worst since the spectral leakage is much higher than with the 2 others. It's especially visible between 40 and 160 Hz. The Blackman window gives the spectrum closest to the real one.

Here is the same example with a Fourier Transform on a 1024-sample window:

spectrogram of rectangular, blackman and hamming windows

The signal is sampled at 44100Hz so a 1024-sample window represents a 23-millisecond part (1024/44100) and a frequency resolution of 43 Hz.

This time the rectangular window gives the best spectrum. With all 3 windows, the 160 Hz frequency is hidden by the spectral leakage produced by the 40 Hz and 320 Hz frequencies. The Blackman window gives the worst result, with the 1225 Hz frequency close to invisible.

Comparing both figures shows that the spectral leakage increases (for all the window functions) as the frequency resolution gets coarser. The fingerprinting algorithm used by Shazam looks for the loudest frequencies inside an audio track. Because of spectral leakage, we can't just take the X highest frequencies. In the last example, the 3 loudest frequencies are approximately 320 Hz, 277 Hz (320 − 43) and 363 Hz (320 + 43), whereas only the 320 Hz frequency exists.

 

Which window is the best?

There are no “best” or “worst” windows. Each window has its specificities, and depending on the problem you might want to use a certain type.

A rectangular window has excellent resolution characteristics for sinusoids of comparable strength, but it is a poor choice for sinusoids of disparate amplitudes (which is the case inside a song because the musical notes don’t have the same loudness).

Windows like Blackman are better at preventing the spectral leakage of strong frequencies from hiding weak frequencies. But these windows deal badly with noise, since noise will hide more frequencies than with a rectangular window. This is problematic for an algorithm like Shazam that needs to handle noise (for instance, when you Shazam a music in a bar or outdoors there is a lot of noise).

A Hamming window is between these two extremes and is (in my opinion) a better choice for an algorithm like Shazam.

 

Here are some useful links to go deeper on window functions and spectrum leakage:

http://en.wikipedia.org/wiki/Spectral_leakage

http://en.wikipedia.org/wiki/Window_function

http://web.mit.edu/xiphmont/Public/windows.pdf

 

Fast Fourier Transform and time complexity

 

the problem

Discrete Fourier Transform formula

If you look again at the DFT formula (don't worry, it's the last time you see it), you can see that to compute one bin you need to do N additions and N multiplications (where N is the size of the window). Getting the N bins therefore requires 2 * N^2 operations, which is a lot.

For example, let's assume you have a three-minute song at 44.1 kHz and you compute the spectrogram of the song with a 4096-sample window. You'll have to compute 10.7 (44100/4096) DFTs per second, so 1938 DFTs for the full song. Each DFT needs 3.35*10^7 operations (2 * 4096^2). To get the spectrogram of the song you need to do 6.5*10^10 operations.

Let's assume you have a music collection of 1000 three-minute-long songs: you'll need 6.5*10^13 operations to get the spectrograms of your songs. Even with a good processor, it would take days/months to get the result.

Thankfully, there are faster implementations of the DFT called FFTs (Fast Fourier Transforms). Some implementations require just 1.5 * N * log(N) operations. For the same music collection, using the FFT instead of the DFT requires roughly 450 times fewer operations (1.43*10^11) and it would take minutes/hours to get the result.

This example shows another tradeoff: though increasing the size of the window improves the frequency resolution, it also increases the computation time. For the same music collection, if you compute the spectrogram using a 512-sample window (frequency resolution of 86 Hz), you get the result with the FFT in 1.07*10^11 operations, approximately 25% faster than with a 4096-sample window (frequency resolution of 10.77 Hz).

This time complexity matters because when you Shazam a sound, your phone needs to compute the spectrogram of the recorded audio, and a mobile processor is less powerful than a desktop one.

 

downsampling

Thankfully, there is a trick to keep the frequency resolution and reduce the window size at the same time: it's called downsampling. Let's take a standard song at 44,100 Hz; if you resample it at 11,025 Hz (44100/4), you will get the same frequency resolution whether you do an FFT on the 44.1 kHz song with a 4096-sample window or an FFT on the 11 kHz resampled song with a 1024-sample window. The only difference is that the resampled song will only have frequencies from 0 to 5 kHz. But the most important part of a song is between 0 and 5 kHz. In fact, most of you won't hear a big difference between a music at 11 kHz and a music at 44.1 kHz. So, the most important frequencies are still in the resampled song, which is what matters for an algorithm like Shazam.

 

example of signal downsampling

Downsampling a 44.1 kHz song to an 11.025 kHz one is not very difficult: a simple way to do it is to take the samples in groups of 4 and to transform each group into just one sample by taking the average of the 4 samples. The only tricky part is that before downsampling a signal, you need to filter out the higher frequencies in the sound to avoid aliasing (remember the Nyquist-Shannon theorem). This can be done by using a digital low-pass filter.
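Here is a crude sketch of that averaging (the low-pass filtering that must happen first is only mentioned in a comment; a real implementation would use a proper filter, for example from scipy.signal):

import numpy as np

def downsample_by_4(samples):
    # Average groups of 4 samples: 44.1 kHz -> 11.025 kHz.
    # Warning: the input must already be low-pass filtered below ~5.5 kHz,
    # otherwise the higher frequencies will alias into the result.
    usable = len(samples) - (len(samples) % 4)   # drop the trailing incomplete group
    return samples[:usable].reshape(-1, 4).mean(axis=1)

mono_44k = np.random.randn(44100)   # stand-in for one second of (already filtered) audio
mono_11k = downsample_by_4(mono_44k)
print(len(mono_11k))                # 11025 samples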

 

FFT

But let's go back to the FFT. The simplest implementation of the FFT is the radix-2 Cooley–Tukey algorithm, which is a divide and conquer algorithm. The idea is that instead of directly computing the Fourier Transform on the N-sample window, the algorithm:

  • divides the N-sample window into 2 N/2-sample windows
  • computes (recursively) the FFT for the 2 N/2-sample windows
  • efficiently computes the FFT for the N-sample window from the 2 previous FFTs

The last part only costs N operations using a mathematical trick on the roots of unity (the exponential terms).

Here is a readable version of the FFT (written in Python) adapted from Wikipedia:

 

from cmath import exp, pi

def fft(x):
    # Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2
    N = len(x)
    if N == 1:
        return x

    # FFTs of the even-indexed and odd-indexed samples
    even = fft([x[k] for k in range(0, N, 2)])
    odd = fft([x[k] for k in range(1, N, 2)])

    # Combine the two half-size FFTs using the roots of unity
    M = N // 2
    l = [even[k] + exp(-2j * pi * k / N) * odd[k] for k in range(M)]
    r = [even[k] - exp(-2j * pi * k / N) * odd[k] for k in range(M)]

    return l + r

For more information on the FFT, you can check this article on Wikipedia.

 

Shazam

We've seen a lot of stuff during the previous parts. Now, we'll put everything together to explain how Shazam quickly identifies songs (at last!). I'll first give you a global overview of Shazam, then I'll focus on the generation of the fingerprints, and I'll finish with the efficient audio search mechanism.

Note: From now on, I assume that you have read the parts on musical notes, the FFT and window functions. I'll sometimes use the words “frequency”, “bin”, “note” or the full expression “bin of frequencies”, but it's the same concept since we're dealing with digital audio signals.

 

Global overview

An audio fingerprint is a digital summary that can be used to identify an audio sample or quickly locate similar items in an audio database. For example, when you're humming a song to someone, you're creating a fingerprint because you're extracting from the music what you think is essential (and if you're a good singer, the person will recognize the song).

Before going deeper, here is a simplified architecture of what Shazam might be. I don't work at Shazam, so it's only a guess (based on the 2003 paper by Shazam's co-founder):

shazam overview

On the server side:

  • Shazam precomputes fingerprints from a very big database of music tracks.
  • All those fingerprints are put in a fingerprint database, which is updated whenever a new song is added to the song database

On the client side:

  • when a user uses the Shazam app, the app first records the current music with the phone's microphone
  • the phone applies the same fingerprinting algorithm as Shazam on the record
  • the phone sends the fingerprint to Shazam
  • Shazam checks if this fingerprint matches one of its fingerprints
    • If not, it informs the user that the music can't be found
    • If yes, it looks for the metadata associated with the fingerprints (name of the song, iTunes URL, Amazon URL …) and gives it back to the user.

The key points of Shazam are:

  • being noise/fault tolerant:
    • because the music recorded by a phone in a bar or outdoors has a bad quality,
    • because of the artifacts due to window functions,
    • because of the cheap microphone inside a phone that produces noise/distortion,
    • because of many physical phenomena I'm not aware of
  • fingerprints need to be time invariant: the fingerprint of a full song must be able to match with just a 10-second record of the song
  • fingerprint matching needs to be fast: who wants to wait minutes/hours to get an answer from Shazam?
  • having few false positives: who wants to get an answer that doesn't correspond to the right song?

 

Spectrogram filtering

Audio fingerprints differ from standard computer fingerprints like SHA-1 or MD5 because two different files (in terms of bits) that contain the same music must have the same audio fingerprint. For example, a song in a 256 kbit/s AAC format (iTunes) must give the same fingerprint as the same song in a 256 kbit/s MP3 format (Amazon) or in a 128 kbit/s WMA format (Microsoft). To solve this problem, audio fingerprinting algorithms use the spectrogram of audio signals to extract fingerprints.

Getting our spectrogram

I told you before that to get the spectrogram of a digital sound you need to apply an FFT. For a fingerprinting algorithm, we need a good frequency resolution (like 10.7 Hz) to reduce spectral leakage and have a good idea of the most important notes played inside the song. At the same time, we need to reduce the computation time as much as possible and therefore use the smallest possible window size. The research paper from Shazam doesn't explain how they get the spectrogram, but here is a possible solution:

getting a spectrogram of a signal

On the server side (Shazam), the 44.1 kHz sampled sound (from CD, MP3 or whatever sound format) needs to pass from stereo to mono. We can do that by taking the average of the left channel and the right one. Before downsampling, we need to filter out the frequencies above 5 kHz to avoid aliasing. Then, the sound can be downsampled at 11.025 kHz.

On the client side (phone), the sampling rate of the microphone that records the sound needs to be at 11.025 kHz.

Then, in both cases, we need to apply a window function to the signal (like a Hamming 1024-sample window; read the chapter on window functions to see why) and apply the FFT for every 1024 samples. By doing so, each FFT analyzes about 0.1 second of music. This gives us a spectrogram (a sketch of this pipeline follows the list below):

  • from 0 Hz to 5000Hz
  • with a bin size of 10.7Hz,
  • 512 possible frequencies
  • and a unit of time of 0.1 second.
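Here is a sketch of that pipeline for a signal that is already mono and resampled at 11,025 Hz (this is only one possible way to do it, consistent with the choices above):

import numpy as np

def spectrogram(samples, window_size=1024):
    # One array of 512 bin magnitudes per ~0.1 second chunk of an 11.025 kHz signal
    window = np.hamming(window_size)
    chunks = len(samples) // window_size
    result = []
    for i in range(chunks):
        chunk = samples[i * window_size:(i + 1) * window_size] * window
        # rfft returns 513 bins for 1024 samples; drop the last one to keep 512
        result.append(np.abs(np.fft.rfft(chunk))[:512])
    return np.array(result)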

 

Filtering

At this stage, we have the spectrogram of the song. Since Shazam needs to be noise tolerant, only the loudest notes are kept. But you can't just keep the X most powerful frequencies every 0.1 second. Here are some reasons:

  • At the beginning of the article I spoke about psychoacoustic models. Human ears have more difficulty hearing a low sound (<500 Hz) than a mid sound (500 Hz-2000 Hz) or a high sound (>2000 Hz). As a result, the low sounds of many “raw” songs are artificially boosted before being released. If you only take the most powerful frequencies, you'll end up with only the low ones, and if 2 songs have the same drum pattern, they might have very close filtered spectrograms even though there are flutes in the first song and guitars in the second.
  • We saw in the chapter on window functions that if you have a very powerful frequency, other powerful frequencies close to this one will appear in the spectrum even though they don't exist (because of spectral leakage). You must be able to keep only the real one.

 

Here is a simple way to keep only strong frequencies while reducing the previous problems:

step1 – For each FFT result, you put the 512 bins inside 6 logarithmic bands:

  • the very low sound band (from bin 0 to 10)
  • the low sound band (from bin 10 to 20)
  • the low-mid sound band (from bin 20 to 40)
  • the mid sound band (from bin 40 to 80)
  • the mid-high sound band (from bin 80 to 160)
  • the high sound band (from bin 160 to 511)

step2 – For each band you keep the strongest bin of frequencies.

step3 – You then compute the average value of these 6 powerful bins.

step4 – You keep the bins (from these 6) that are above this mean (multiplied by a coefficient).

Step 4 is very important because you might have:

  • an a cappella song involving soprano singers with only mid or mid-high frequencies
  • a jazz/rap song with only low and low-mid frequencies
  • …

And you don’t want to keep a weak frequency in a band just because this frequency is the strongest of its band.

But this algorithm has a limitation. In most songs some parts are very weak (like the beginning or the end of the song). If you analyze these parts, you'll end up with false strong frequencies because the mean value (computed at step 3) of these parts is very low. To avoid that, instead of taking the mean of the 6 powerful bins of the current FFT (which represents only 0.1 sec of the song), you could take the mean of the most powerful bins of the full song.
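Here is a sketch of steps 1 to 4 for a single FFT result (the band boundaries are the ones listed above; the coefficient value is an arbitrary assumption):

import numpy as np

BANDS = [(0, 10), (10, 20), (20, 40), (40, 80), (80, 160), (160, 511)]

def keep_strongest_bins(fft_magnitudes, coefficient=1.0):
    # step 1 + step 2: the strongest bin of each logarithmic band
    candidates = []
    for low, high in BANDS:
        strongest = low + int(np.argmax(fft_magnitudes[low:high]))
        candidates.append((strongest, fft_magnitudes[strongest]))
    # step 3: the mean power of these 6 bins
    mean_power = np.mean([power for _, power in candidates])
    # step 4: keep only the bins above the (scaled) mean
    return [bin_index for bin_index, power in candidates
            if power > mean_power * coefficient]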

To summarize, by applying this algorithm we’re filtering the spectrogram of the song to keep the peaks of energy in the spectrum that represent the loudest notes. To give you a visual idea of what this filtering is, here is a real spectrogram of a 14-second song.

shazam_full_spectrogram_min

This figure is from the Shazam research article. In this spectrogram, you can see that some frequencies are more powerful than others. If you apply the previous algorithm on the spectrogram here is what you’ll get:

shazam_filtered_spectrogram-min

This figure (still from the Shazam research article) is a filtered spectrogram. Only the strongest frequencies from the previous figure are kept. Some parts of the song have no frequency at all (for example between 4 and 4.5 seconds).

The number of frequencies in the filtered spectrogram depends on the coefficient used with the mean during step4. It also depends on the number of bands you use (we used 6 bands but we could have used another number).

 

At this stage, the intensity of the frequencies is useless. Therefore, this spectrogram can be modeled as a 2-column table where:

  • the first column represents the frequency inside the spectrogram (the Y axis)
  • the second column represents the time when the frequency occurred during the song (the X axis)

This filtered spectrogram is not the final fingerprint but it’s a huge part of it. Read the next chapter to know more.

 

Note: I gave you a simple algorithm to filter the spectrogram. A better approach could be to use a logarithmic sliding window and to keep only the most powerful frequencies above the mean + the standard deviation (multiplied by a coefficient) of a moving part of the song. I used this approach when I did my own Shazam prototype, but it's more difficult to explain (and I'm not even sure that what I did was correct …).

 

Storing Fingerprints

We've just ended up with a filtered spectrogram of a song. How can we store and use it in an efficient way? This part is where the power of Shazam lies. To understand the problem, I'll first present a simple approach where I search for a song by using the filtered spectrograms directly.

 

Simple search approach

Pre-step: I precompute a database of filtered spectrograms for all the songs on my computer

Step 1: I record a 10-second part of a song from TV in my computer

Step 2: I compute the filtered spectrogram of this record

Step 3: I compare this “small” spectrogram with the “full” spectrogram of each song. How can I compare a 10-second spectrogram with the spectrogram of a 180-second song? Instead of losing myself in a bad explanation, here is a visual explanation of what I need to do.

spectrogram-min

Visually speaking, I need to superpose the small spectrogram everywhere inside the spectrogram of the full song to check if the small spectrogram matches with a part of the full one.

spectrogram2-min

And I need to do this for each song until I find a perfect match.

spectrogram3-min

In this example, there is a perfect match between the record and the end of the song. If it's not the case, I have to compare the record with another song, and so on until I find a perfect match. If I don't find a perfect match, I can choose the closest match I found (among all the songs) if the matching rate is above a threshold. For instance, if the best match I found gives me a 90% similarity between the record and a part of a song, I can assume it's the right song because the remaining 10% of non-similarity is most likely due to external noise.

 

Though it works well, this simple approach requires a lot of computation time. It needs to compute all the possibilities of matching between the 10-second record and each song in the collection. Let's assume that on average music contains 3 peak frequencies per 0.1 second. Therefore, the filtered spectrogram of the 10-second record has 300 time-frequency points. In the worst case scenario, you'll need 300 * 300 * 30 * S operations to find the right song, where S is the number of seconds of music in your collection. If like me you have 30k songs (7 * 10^6 seconds of music) it might take a long time, and it's even harder for Shazam with its 40-million-song collection (a guess: I couldn't find the current size of Shazam's catalog).

So, how does Shazam do it efficiently?

 

Target zones

Instead of comparing each point one by one, the idea is to look for multiple points at the same time. In the Shazam paper, this group of points is called a target zone. The paper doesn't explain how to generate these target zones, but here is a possibility. For the sake of comprehension, I'll fix the size of the target zone at 5 frequency-time points.

In order to be sure that both the record and the full song will generate the same target zones, you need an order relation between the time-frequency points in a filtered spectrogram. Here is one:

  • If two time-frequency points have the same time, the point with the lowest frequency comes before the other one.
  • If a time-frequency point has a lower time than another one, it comes before.

Here is what you get if you apply this order on the simplified spectrogram we saw before:

ordered_spectrogram1-min

In this figure I labeled all the time-frequency points using this order relation. For example:

  • The point 0 is before any other point in the spectrogram.
  • The point 2 is after points 0 and 1 but before all the others.

Now that the spectrograms can be inner-ordered, we can create the same target zones on different spectrograms with the following rule: “to generate target zones in a spectrogram, for each time-frequency point you create a group composed of this point and the 4 points after it”. We'll end up with approximately the same number of target zones as the number of points. This generation is the same for the songs and for the record.
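Here is a sketch of this ordering and grouping, where a point is represented as a (time, frequency) pair (this tuple layout is my own choice):

def generate_target_zones(points, zone_size=5):
    # points: list of (time, frequency) pairs from a filtered spectrogram.
    # The order relation: by time first, then by frequency for points at the same time.
    ordered = sorted(points, key=lambda point: (point[0], point[1]))
    # One target zone per point: the point itself plus the 4 points after it
    return [ordered[i:i + zone_size]
            for i in range(len(ordered) - zone_size + 1)]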

ordered_spectrogram2-min

In this simplified spectrogram, you can see the different target zones generated by the previous algorithm. Since the target size is 5, most of the points belong to 5 target zones (except the points at the beginning and the end of the spectrogram).

Note: I didn't understand at first why we needed to compute that many target zones for the record. We could generate target zones with a rule like “for each point whose label is a multiple of 5, you create a group composed of this point and the 4 points after it”. With this rule, the number of target zones would be divided by 5 and so would the search time (explained in the next part). The only reason I found is that computing all the possible zones on both the record and the song greatly increases the noise robustness.

 

Address generation

We now have multiple target zones; what do we do next? We create an address for each point, based on those target zones. In order to create those addresses, we also need an anchor point per target zone. Again, the paper doesn't explain how to do it. I propose that this anchor point be the 3rd point before the target zone. The anchor can be anywhere as long as the way it is generated is reproducible (which it is, thanks to our order relation).

ordered_spectrogram4-min

 

In this picture, I plotted 2 target zones with their anchor points. Let's focus on the purple target zone. The address formula proposed by Shazam is the following one:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”]

For the purple target zone:

  • the address of point 6 is [“frequency of point 3”; “frequency of point 6”; “delta time between point 3 and point 6”], so concretely [10;30;1],
  • the address of point 7 is [10;20;2].

Both points also appear in the brown target zone; their addresses for this target zone are [10;30;2] for point 6 and [10;20;3] for point 7.

I spoke about addresses, right? That means those addresses are linked to something. In the case of the full songs (so only on the server side), each address is linked to the couple [“absolute time of the anchor in the song”; “Id of the song”]. In our simple example with the 2 previous points, we have the following result:

[10;30;1] → [2;1]

[10;20;2] → [2;1]

[10;30;2] → [1;1]

[10;20;3] → [1;1]

If you apply the same logic for all the points of all the target zones of all the song spectrograms, you’ll end up with a very big table with 2 columns:

  • the addresses
  • the couples (“time of anchor” ; “song Id”).

This table is the fingerprint database of Shazam. If on average a song contains 30 peak frequencies per second and the size of the target zone is 5, the size of this table is 5 * 30 * S, where S is the number of seconds of music in the collection.

If you remember, we used an FFT with 1024 samples, which means that there are only 512 possible frequency values. Those frequencies can be coded on 9 bits (2^9 = 512). Assuming that the delta time is in milliseconds, it will never be over 16 seconds because that would imply a song with a 16-second part without music (or with very low sound). So, the delta time can be coded on 14 bits (2^14 = 16384). The address can be coded in a 32-bit integer:

  • 9 bits for the “frequency of the anchor”
  • 9 bits for the “frequency of the point”
  • 14 bits for the “delta time between the anchor and the point”

Using the same logic, the couple (“time of anchor”; “song Id”) can be coded in a 64-bit integer (32 bits for each part).

The fingerprint table can be implemented as a simple array of lists of 64-bit integers where:

  • the index of the array is the 32-bit integer address
  • the list of 64-bit integers contains all the couples for this address.

In other words, we transformed the fingerprint table into an inverted look-up table that allows search operations in O(1) (i.e. a very efficient search time).
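Here is a sketch of this packing and of the inverted look-up, using a plain Python dict instead of a raw array (the 9/9/14 bit layout is the one described above; the function names are mine):

from collections import defaultdict

def pack_address(anchor_freq, point_freq, delta_time):
    # 32-bit address: 9 bits + 9 bits + 14 bits
    return (anchor_freq << 23) | (point_freq << 14) | delta_time

def pack_couple(anchor_time, song_id):
    # 64-bit value: 32 bits for the anchor time, 32 bits for the song id
    return (anchor_time << 32) | song_id

fingerprint_db = defaultdict(list)   # address -> list of packed couples

def index_point(anchor_freq, point_freq, delta_time, anchor_time, song_id):
    address = pack_address(anchor_freq, point_freq, delta_time)
    fingerprint_db[address].append(pack_couple(anchor_time, song_id))

# Retrieving all the couples for one address is a single O(1) dictionary access
index_point(10, 30, 1, 2, 1)
print(fingerprint_db[pack_address(10, 30, 1)])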

Note: You may have noticed that I didn't choose the anchor point inside the target zone (I could have chosen the first point of the target zone, for example). If I had, it would have generated a lot of addresses like [frequency of the anchor; frequency of the anchor; 0], and therefore too many couples (“time of anchor”; “song Id”) would have an address like [Y;Y;0] where Y is a frequency (between 0 and 511). In other words, the look-up would have been skewed.

 

Searching And Scoring the fingerprints

We now have a great data structure on the server side; how can we use it? It's my last question, I promise!

 

Search

To perform a search, the fingerprinting step is performed on the recorded sound file to generate an address/value structure slightly different on the value side:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] → [“absolute time of the anchor in the record”]

 

This data is then sent to the server side (Shazam). Let's take the same assumptions as before (300 time-frequency points in the filtered spectrogram of the 10-second record and a target zone size of 5 points): it means approximately 1500 addresses are sent to Shazam.

Each address from the record is used to search the fingerprint database for the associated couples [“absolute time of the anchor in the song”; “Id of the song”]. In terms of time complexity, assuming that the fingerprint database is in memory, the cost of the search is proportional to the number of addresses sent to Shazam (1500 in our case). This search returns a large number of couples; let's say for the rest of the article that it returns M couples.

 

Though M is huge, it's way lower than the number of notes (time-frequency points) of all the songs. The real power of this search is that instead of looking whether one note exists in a song, we're looking whether 2 notes separated by delta_time seconds exist in the song. At the end of this part we'll talk more about time complexity.

 

Result filtering

Though it is not mentioned in the Shazam paper, I think the next thing to do is to filter the M results of the search by keeping only the couples of the songs that have a minimum number of target zones in common with the record.

 

For example, let’s suppose our search has returned:

  • 100 couples from song 1 which has 0 target zone in common with the record
  • 10 couples from song 2 which has 0 target zone in common with the record
  • 50 couples from song 5 which has 0 target zone in common with the record
  • 70 couples from song 8 which has 0 target zone in common with the record
  • 83 couples from song 10 which has 30 target zones in common with the record
  • 210 couples from song 17 which has 100 target zones in common with the record
  • 4400 couples from song 13 which has 280 target zones in common with the record
  • 3500 couples from song 25 which has 400 target zones in common with the record

Our 10-second record has (approximately) 300 target zones. In the best case scenario:

  • song 1 and the record will have a 0% matching ratio
  • song 2 and the record will have a 0% matching ratio
  • song 5 and the record will have a 0% matching ratio
  • song 8 and the record will have a 0% matching ratio
  • song 10 and the record will have a 10% matching ratio
  • song 17 and the record will have a 33% matching ratio
  • song 13 and the record will have a 93.3% matching ratio
  • song 25 and the record will have a 100% matching ratio

We'll only keep the couples of songs 13 and 25 from the result. Although songs 1, 2, 5 and 8 have multiple couples in common with the record, none of them forms at least one target zone (of 5 points) in common with the record. This step can filter a lot of false results because the fingerprint database of Shazam has a lot of couples for the same address, and you can easily end up with couples at the same address that don't belong to the same target zone. If you don't understand why, look at the last picture of the previous part: the [10;30;2] address is used by 2 time-frequency points that don't belong to the same target zone. If the record also has a [10;30;2] address, (at least) one of the 2 couples in the result will be filtered out in this step.

 

This step can be done in O(M) with the help of a hash table whose key is the couple (songID; absolute time of the anchor in the song) and whose value is the number of times it appears in the result:

  • We iterate through the M results and count (in the hash table) the number of times each couple is present
  • We remove all the couples (i.e. the keys of the hash table) that appear less than 4 times (in other words, we remove all the points that don't form a target zone)*
  • We count the number X of times the song ID is part of a key in the hash table (i.e. we count the number of complete target zones in the song; since the couples come from the search, those target zones are also in the record)
  • We only keep the songs whose count X is above 300 * coeff (300 is the number of target zones of the record; we reduce this number with a coefficient because of the noise)
  • We put the remaining results in a new hash table whose index is the songId (this hash table will be useful for the next step)

 

* The idea is to look for the target zone created by an anchor point in a song. This anchor point can be defined by the id of the song it belongs to and the absolute time at which it occurs. I made an approximation because in a song you can have multiple anchor points at the same time. Since we're dealing with a filtered spectrogram, you won't have a lot of anchor points at the same time, but the key [songID; absolute time of the anchor in the song] will gather all the target zones created by these anchor points.

Note: I used 2 hash tables in this algorithm. If you don't know how they work, just see them as a very efficient way to store and get data. If you want to know more, you can read my article on the HashMap in Java, which is just an efficient hash table.
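Here is a sketch of this filtering with Python dictionaries (the 4-occurrence threshold comes from the list above; the 0.5 coefficient and the data layout are my own assumptions):

from collections import Counter, defaultdict

def filter_candidates(couples, record_zone_count=300, coefficient=0.5):
    # couples: list of (song_id, anchor_time) pairs returned by the address search
    anchor_counts = Counter(couples)
    # Keep only the anchors whose target zone is (almost) complete
    complete_zones = [anchor for anchor, count in anchor_counts.items() if count >= 4]
    # Count the complete target zones per song
    zones_per_song = Counter(song_id for song_id, _ in complete_zones)
    # Keep only the songs with enough target zones in common with the record
    kept_songs = {song_id for song_id, zones in zones_per_song.items()
                  if zones >= record_zone_count * coefficient}
    # Group the surviving couples by song id for the time-coherency step
    couples_by_song = defaultdict(list)
    for song_id, anchor_time in couples:
        if song_id in kept_songs:
            couples_by_song[song_id].append(anchor_time)
    return couples_by_song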

 

Time coherency

At this stage, we only have songs that are really close to the record. But we still need to verify the time coherency between the notes of the record and these songs. Let's see why:

search-min

 

In this figure, we have 2 target zones that belong to 2 different songs. If we didn't look for time coherency, those target zones would increase the matching score between the 2 songs even though they don't sound alike, since the notes in those target zones are not played in the same order.

This last step is about time ordering. The idea is:

  • to compute, for each remaining song, the notes and their absolute time positions in the song,
  • to do the same for the record, which gives us the notes and their absolute time positions in the record,
  • if the notes in the song and those in the record are time coherent, we should find a relation like this one: “absolute time of the note in the song = absolute time of the note in the record + delta”, where delta is the starting time of the part of the song that matches the record,
  • for each song, we need to find the delta that maximizes the number of notes that respect this time relation,
  • then we choose the song that has the maximum number of time-coherent notes with the record.

 

Now that you get the idea, let's see how to do it technically. At this stage, we have for the record a list of address/value pairs:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] → [“absolute time of the anchor in the record”]

 

And we have for each song a list of address/value pairs (stored in the hash table of the previous step):

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] → [“absolute time of the anchor in the song”; “Id of the song”]

 

The following process needs to be done for all the remaining songs:

  • For each address in the record, we get the associated value(s) in the song and we compute delta = “absolute time of the anchor in the song” − “absolute time of the anchor in the record”, and we put this delta in a “list of deltas”.
  • It is possible that the address in the record is associated with multiple values in the song (i.e. multiple points in different target zones of the song); in this case, we compute the delta for each associated value and we put all the deltas in the “list of deltas”.
  • For each different value of delta in the “list of deltas”, we count its number of occurrences (in other words, we count for each delta the number of notes that respect the rule “absolute time of the note in the song = absolute time of the note in the record + delta”).
  • We keep the delta with the highest count, which gives us the maximum number of notes that are time coherent between the record and the song.

From all the songs, we keep the song with the maximum number of time-coherent notes (a sketch of this delta counting follows below). If this coherency is above “the number of notes in the record” * “a coefficient”, then this song is the right one.
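Here is a sketch of this delta counting for one candidate song, with the addresses as dictionary keys (the addresses and times in the example are of course made up):

from collections import Counter

def best_time_coherency(record_addresses, song_addresses):
    # Both arguments map an address to the list of anchor times where it occurs
    deltas = Counter()
    for address, record_times in record_addresses.items():
        for song_time in song_addresses.get(address, []):
            for record_time in record_times:
                # delta = where the record starts inside the song
                deltas[song_time - record_time] += 1
    # The best delta gives the number of time-coherent notes
    return max(deltas.values()) if deltas else 0

record = {(10, 30, 1): [0], (10, 20, 2): [0], (40, 60, 3): [1]}
song = {(10, 30, 1): [20], (10, 20, 2): [20], (40, 60, 3): [21]}
print(best_time_coherency(record, song))   # 3: all the notes agree on delta = 20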

We then just have to look for the metadata of the song (“artist name”, “song name”, “iTunes URL”, “Amazon URL”, …) with the song ID and give the result back to the user.

 

Let’s talk about complexity!

This search is really more complicated than the simple one we first saw, so let's see if it is worth it. The enhanced search is a step-by-step approach that reduces the complexity at each step.

For the sake of comprehension, I'll recall all the assumptions (or choices) I made and make new ones to simplify the problem:

  • We have 512 possible frequencies
  • on average a song contains 30 peak frequencies per second
  • Therefore the 10-sec record contains 300 time-frequency points
  • S is the number of seconds of music of all the songs
  • The size of the target zone is 5 notes
  • (new) I assume that the delta time between a point and its anchor is either 0 or 10 msec
  • (new) I assume the generation of addresses is uniformly distributed, which means there is the same amount of couples for any address [X;Y;T], where X and Y are one of the 512 frequencies and T is either 0 or 10 msec

The first step, the search, only requires 5 * 300 unitary searches.

The size of the result M is the sum of the results of the 5 * 300 unitary searches:

M = (5 * 300) * (S * 30 * 5 * 300) / (512 * 512 * 2)

The second step, the result filtering, can be done in M operations. At the end of this step, there are N notes distributed over Z songs. Without a statistical analysis of the music collection, it's impossible to get the values of N and Z. I feel that N is much lower than M and that Z represents only a few songs, even for a 40-million-song database like Shazam's.

The last step is the analysis of the time coherency of the Z songs. We'll assume that each song has approximately the same amount of notes: N/Z. In the worst case scenario (a record that comes from a song containing only one note played continuously), the complexity of one analysis is (5 * 300) * (N/Z).

The cost for the Z songs is 5 * 300 * N.

Since N << M, the real cost of this search is M = (300 * 300 * 30 * S) * (5 * 5) / (512 * 512 * 2).

If you remember, the cost of the simple search was: 300 * 300 * 30 * S.

This new search is approximately 20,000 times faster.

Note: The real complexity depends on the distribution of frequencies inside the songs of the collection, but this simple calculation gives us a good idea of the real one.

 

Improvements

The Shazam paper is from 2003, which means the associated research is even older. In 2003, 64-bit processors were released to the mainstream market. Instead of using one anchor point per target zone like the paper proposes (because of the limited size of a 32-bit integer), you could use 3 anchor points (for example the 3 points just before the target zone) and store the address of a point in the target zone in a 64-bit integer. This would dramatically improve the search time. Indeed, the search would then look for 4 notes in a song separated by delta_time1, delta_time2 and delta_time3 seconds, which means the number of results M would be much (much) lower than the one we just computed.

A great advantage of this fingerprint search is its high scalability:

  • Instead of having 1 fingerprint database you can have D databases, each of them containing 1/D of the full song collection
  • You can search at the same time for the closest song to the record in the D databases
  • Then you choose the closest song from the D songs
  • The whole process is D times faster.

 

Tradeoffs

Another good discussion is the noise robustness of this algorithm. I could easily add 2k words just on this subject, but after 11k words I think it's better not to speak about it … or just a few words.

If you read carefully, you noticed that I used a lot of thresholds, coefficients and fixed values (like the sampling rate, the duration of a record, …). I also chose/made many algorithms (to filter a spectrogram, to generate a spectrogram, …). They all have an impact on the noise resistance and the time complexity. The real challenge is to find the right values and algorithms that maximize:

  • The noise resistance
  • The time complexity
  • The precision (reducing the number of false positive results)

 

Conclusion

I hope you now understand how Shazam works. It took me a lot of time to understand the different subjects of this article and I still don’t master them. This article won’t make you an expert but I hope you have a very good picture of the processes behind Shazam. Keep in mind that Shazam is just one possible audio fingerprinting implementation.

 

You should be able to code your own Shazam. You can look at this very good article that focuses more on how to code a simplified Shazam in Java than on the concepts behind it. The same author made a presentation at a Java conference and the slides are available here. You can also check this link for a MATLAB/Octave implementation of Shazam. And of course, you can read the paper from Shazam co-founder Avery Li-Chun Wang yourself by clicking right here.

 

The world of music computing is a very interesting field, with tricky algorithms that you use every day without knowing it. Though Shazam is not easy to understand, it's easier than:

  • query by humming: for example SoundHound, a competitor of Shazam, allows you to hum/sing the song you're looking for
  • speech recognition and speech synthesis: implemented by Skype, Apple's “Siri” and Android's “Ok Google”
  • music similarity: the ability to find that 2 songs are similar. It's used by The Echo Nest, a start-up recently acquired by Spotify

If you’re interested, there is an annual contest between researchers on those topics and the algorithms of each participant are available. Here is the link to the MIREX contest.

 

I spent approximately 200 hours during the last 3 years to understand the signal processing concepts and the mathematics behind them, to make my own Shazam prototype, to fully understand Wang's paper and to imagine the processes the paper doesn't explain. I wrote this article because I have never found an article that really explains Shazam, and I wish I could have found one when I began this side project in 2012. I hope I didn't make too many technical mistakes. The only thing I'm sure of is that despite my efforts there are many grammar and spelling mistakes (alas!). Tell me what you think of this article in the comments.

 

 

JVM memory model http://www.sunsetandecho.com/jvm-memory-model/ http://www.sunsetandecho.com/jvm-memory-model/#comments Wed, 01 Apr 2015 11:57:23 +0000 http://www.sunsetandecho.com/?p=414

 

The leitmotiv of Java is its famous WORA: “write once, run anywhere”. In order to achieve it, Sun Microsystems created the Java Virtual Machine, an abstraction of the underlying OS that interprets compiled Java code. The JVM is the core component of the JRE (Java Runtime Environment) and was created to run Java code, but is now used by other languages (Scala, Groovy, JRuby, Clojure …).

In this article, I'll focus on the Runtime Data Areas described in the JVM specification. Those areas are designed to store the data used by a program or by the JVM itself. I'll first present an overview of the JVM, then what bytecode is, and end with the different data areas.

 

Global Overview

The JVM is an abstraction of the underlying OS. It ensures that the same code will run with the same behavior no matter what hardware or OS the JVM is running on. For example:

  • The size of the primitive type int will always be a 32-bit signed integer from -2^31 to 2^31-1 whether the JVM is running on a 16bit/32bit/64bit OS.
  • Each JVM stores and uses data in memory in big-endian order (where high bytes come first) whether the underlying OS/hardware is big-endian or little-endian.

Note: sometimes, the behavior of a JVM implementation differs from another one but it’s generally the same.

 

overwiew of the functioning of a JVM

 

This diagram gives an overview of the JVM:

  • The JVM interprets bytecode, which is produced by the compilation of the source code of a class. Though the term JVM stands for “Java Virtual Machine”, it can run other languages like Scala or Groovy, as long as they can be compiled into Java bytecode.
  • In order to avoid disk I/O, the bytecode is loaded into the JVM by classloaders into one of the runtime data areas. This code stays in memory until the JVM is stopped or the classloader (that loaded it) is destroyed.
  • The loaded code is then interpreted and executed by an execution engine.
  • The execution engine needs to store data like a pointer to the line of code being executed. It also needs to store the data handled in the developer's code.
  • The execution engine also takes care of dealing with the underlying OS.

Note: Instead of always interpreting bytecode, the execution engine of many JVM implementations compiles frequently used bytecode into native code. This is called Just In Time (JIT) compilation and it greatly speeds up the JVM. The compiled code is temporarily kept in a zone often called the Code Cache. Since this zone is not in the JVM specifications, I won’t talk about it in the rest of the article.

 

Stack based architecture

The JVM uses a stack based architecture. Though it’s invisible to the developer, it has a huge impact on the generated bytecode and the JVM architecture, which is why I’ll briefly explain the concept.

The JVM executes the developer’s code by executing basic operations described in the Java bytecode (we’ll see it in the next chapter). An operand is a value on which an instruction operates. According to the JVM specifications, those operations require that the parameters are passed through a stack called the operand stack.

example of the state of a java operand stack during the iadd operation

For example, let’s take the basic addition of 2 integers. This operation is called iadd (for integer addition). If one wants to add 3 and 4 in bytecode (a sketch follows the list):

  • First, 3 and 4 are pushed onto the operand stack.
  • Then the iadd instruction is called.
  • iadd pops the last 2 values from the operand stack.
  • The int result (3 + 4) is pushed onto the operand stack so it can be used by other operations.
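For illustration, here is roughly what that sequence looks like as javap-style bytecode (a minimal sketch; javac would typically load the operands from local variables or the constant pool rather than hard-code them):

iconst_3   // push the int constant 3 onto the operand stack
iconst_4   // push the int constant 4 onto the operand stack
iadd       // pop both values, add them and push the result (7)
istore_1   // store the result into local variable 1 (only if the result is kept)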

This way of functioning is called a stack based architecture. There are other ways to deal with basic operations; for example, a register based architecture stores the operands in small registers instead of a stack. This register based architecture is used by desktop/server (x86) processors and by the former Android virtual machine, Dalvik.

 

Bytecode

Since the JVM interprets bytecode, it’s useful to understand what it is before going deeper.

Java bytecode is the Java source code transformed into a set of basic operations. Each operation is composed of one byte that represents the instruction to execute (called the opcode or operation code), along with zero or more bytes for passing parameters (though most operations use the operand stack to pass parameters). Of the 256 possible one-byte opcodes (from 0x00 to 0xFF in hexadecimal), 204 are currently in use in the Java 8 specifications.

Here is a list of the different categories of bytecode operations. For each category, I added a small description and the hexadecimal range of the operation codes:

  • Constants: for pushing values from the constant pool (we’ll see it later) or from known values into the operand stack. From value 0x00 to 0x14
  • Loads: for loading values from local variables into the operand stack. From value 0x15 to 0x35
  • Stores: for storing from the operand stack into local variables. From value 0x36 to 0x56
  • Stack: for handling the operand stack. From value 0x57 to 0x5f
  • Math: for basic mathematical operations on values from the operand stack. From value 0x60 to 0x84
  • Conversions: for converting from one type to another. From value 0x85 to 0x93
  • Comparisons: for basic comparison between two values. From value 0x94 to 0xa6
  • Control: basic operations like goto, return, … that allow more advanced constructs like loops or functions that return values. From value 0xa7 to 0xb1
  • References: for allocating objects or arrays, getting or checking references on objects, fields or methods. Also used for invoking (static) methods. From value 0xb2 to 0xc3
  • Extended: operations from the other categories that were added later. From value 0xc4 to 0xc9
  • Reserved: for internal use by each Java Virtual Machine implementation. 3 values: 0xca, 0xfe and 0xff.

These 204 operations are very simple, for example:

  • The operation ifeq (0x99) branches if the value on top of the operand stack equals 0
  • The operation iadd (0x60) adds 2 values
  • The operation i2l (0x85) converts an integer to a long
  • The operation arraylength (0xbe) gives the size of an array
  • The operation pop (0x57) pops the top value from the operand stack

To create bytecode one needs a compiler; the standard Java compiler included in the JDK is javac.

Let’s have a look at a simple addition:

public class Test {
  public static void main(String[] args) {
    int a = 1;
    int b = 15;
    int result = add(a, b);
  }

  public static int add(int a, int b){
    int result = a + b;
    return result;
  }
}

The “javac Test.java” command generates the bytecode in Test.class. Since Java bytecode is binary code, it’s not readable by humans. Oracle provides a tool in its JDK, javap, that transforms the binary bytecode into a human-readable set of labeled operation codes from the JVM specifications.

The command “javap -verbose Test.class” gives the following result:

Classfile /C:/TMP/Test.class
  Last modified 1 avr. 2015; size 367 bytes
  MD5 checksum adb9ff75f12fc6ce1cdde22a9c4c7426
  Compiled from "Test.java"
public class com.codinggeek.jvm.Test
  SourceFile: "Test.java"
  minor version: 0
  major version: 51
  flags: ACC_PUBLIC, ACC_SUPER
Constant pool:
   #1 = Methodref          #4.#15         //  java/lang/Object."<init>":()V
   #2 = Methodref          #3.#16         //  com/codinggeek/jvm/Test.add:(II)I
   #3 = Class              #17            //  com/codinggeek/jvm/Test
   #4 = Class              #18            //  java/lang/Object
   #5 = Utf8               <init>
   #6 = Utf8               ()V
   #7 = Utf8               Code
   #8 = Utf8               LineNumberTable
   #9 = Utf8               main
  #10 = Utf8               ([Ljava/lang/String;)V
  #11 = Utf8               add
  #12 = Utf8               (II)I
  #13 = Utf8               SourceFile
  #14 = Utf8               Test.java
  #15 = NameAndType        #5:#6          //  "<init>":()V
  #16 = NameAndType        #11:#12        //  add:(II)I
  #17 = Utf8               com/codinggeek/jvm/Test
  #18 = Utf8               java/lang/Object
{
  public com.codinggeek.jvm.Test();
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1                  // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 3: 0

  public static void main(java.lang.String[]);
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=2, locals=4, args_size=1
         0: iconst_1
         1: istore_1
         2: bipush        15
         4: istore_2
         5: iload_1
         6: iload_2
         7: invokestatic  #2                  // Method add:(II)I
        10: istore_3
        11: return
      LineNumberTable:
        line 6: 0
        line 7: 2
        line 8: 5
        line 9: 11

  public static int add(int, int);
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=2, locals=3, args_size=2
         0: iload_0
         1: iload_1
         2: iadd
         3: istore_2
         4: iload_2
         5: ireturn
      LineNumberTable:
        line 12: 0
        line 13: 4
}

The readable .class file shows that the bytecode contains more than a simple transcription of the Java source code. It contains:

  • the description of the constant pool of the class. The constant pool is one of the data areas of the JVM that stores metadata about classes, like the names of the methods and their arguments. When a class is loaded inside the JVM, this part goes into the runtime constant pool.
  • information like the LineNumberTable or LocalVariableTable that specify the location (in bytes) of the functions and their variables in the bytecode.
  • a transcription in bytecode of the developer’s Java code (plus the hidden default constructor).
  • Specific operations that handle the operand stack and more broadly the way of passing and getting parameters.

FYI, here is a light description of the information stored in a .class file:

ClassFile {
  u4 magic;
  u2 minor_version;
  u2 major_version;
  u2 constant_pool_count;
  cp_info constant_pool[constant_pool_count-1];
  u2 access_flags;
  u2 this_class;
  u2 super_class;
  u2 interfaces_count;
  u2 interfaces[interfaces_count];
  u2 fields_count;
  field_info fields[fields_count];
  u2 methods_count;
  method_info methods[methods_count];
  u2 attributes_count;
  attribute_info attributes[attributes_count];
}

 

Runtime Data Areas

The runtime data areas are the in-memory areas designed to store data. Those data are used by the developer’s program or by the JVM for its inner workings.

 

overview of the different runtime memory data areas of a JVM

This figure shows an overview of the different runtime data areas in the JVM. Some areas are unique to the JVM while the others exist per thread.

 

Heap

The heap is a memory area shared among all Java Virtual Machine Threads. It is created on virtual machine start-up. All class instances and arrays are allocated in the heap (with the new operator).

 MyClass myVariable = new MyClass();
 MyClass[] myArrayClass = new MyClass[1024];

This zone must be managed by a garbage collector to remove the instances allocated by the developer when they are not used anymore. The strategy for cleaning the memory is up to the JVM implementation (for example, Oracle HotSpot provides multiple algorithms).

The heap can be dynamically expanded or contracted and can have a fixed minimum and maximum size. For example, in Oracle HotSpot, the user can specify the minimum and maximum sizes of the heap with the -Xms and -Xmx parameters, in the following way: “java -Xms512m -Xmx1024m …”
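If you want to observe those limits from inside a program, the standard Runtime API exposes them (a minimal sketch using real JDK methods):

long maxHeap   = Runtime.getRuntime().maxMemory();   // upper bound of the heap, roughly the -Xmx value, in bytes
long committed = Runtime.getRuntime().totalMemory(); // heap currently committed by the JVM
System.out.println("max heap: " + maxHeap + " bytes, committed: " + committed + " bytes");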

 

Note: There is a maximum size that the heap can’t exceed. If this limit is exceeded the JVM throws an OutOfMemoryError.

Method area

The Method Area is a memory area shared among all Java Virtual Machine threads. It is created on virtual machine start-up and is populated by the classloaders from the bytecode. The data in the Method Area stay in memory as long as the classloader that loaded them is alive.

The method area stores:

  • class information (number of fields/methods, super class name, interfaces names, version, …)
  • the bytecode of methods and constructors.
  • a runtime constant pool per class loaded.

The specifications don’t force implementations to put the Method Area in the heap. For example, until Java 7, Oracle HotSpot used a zone called the PermGen to store the Method Area. This PermGen was contiguous with the Java heap (and managed by the JVM like the heap) and was limited to a default space of 64 MB (which can be modified with the argument -XX:MaxPermSize). Since Java 8, HotSpot stores the Method Area in a separate native memory space called the Metaspace; the maximum available space is the total available system memory.

 

Note: There is a maximum size that the method area can’t exceed. If this limit is exceeded the JVM throws an OutOfMemoryError.

Runtime constant pool

This pool is a subpart of the Method Area. Since it’s an important part of the metadata, the Oracle specifications describe the runtime constant pool separately from the Method Area. A runtime constant pool is added for each loaded class/interface. This pool is like a symbol table for a conventional programming language. In other words, when a class, method or field is referred to, the JVM uses the runtime constant pool to find the actual address in memory. It also contains constant values like string literals or constant primitives.

String myString1 = "This is a string literal";
static final int MY_CONSTANT = 2;
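As a side note, this is also why two identical string literals usually resolve to the same String instance: both are looked up through the constant pool. A small sketch (the == below is only there to show the shared reference; always use equals() to compare strings in real code):

String a = "This is a string literal";
String b = "This is a string literal";
System.out.println(a == b); // true: both literals resolve to the same pooled instance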

 

The pc Register (Per Thread)

Each thread has its own pc (program counter) register, created at the same time as the thread. At any point, each Java Virtual Machine thread is executing the code of a single method, namely the current method for that thread. The pc register contains the address of the Java Virtual Machine instruction (in the method area) currently being executed.

Note: If the method currently being executed by the thread is native, the value of the Java Virtual Machine’s pc register is undefined. The Java Virtual Machine’s pc register is wide enough to hold a returnAddress or a native pointer on the specific platform.

 

Java Virtual Machine Stacks (Per Thread)

The stack area stores multiple frames, so before talking about stacks I’ll present frames.

Frames

A frame is a data structure that contains multiple pieces of data representing the state of the thread in the current method (the method being called):

  • Operand Stack: I’ve already presented the operand stack in the chapter about the stack based architecture. This stack is used by the bytecode instructions for handling parameters. It is also used to pass parameters in a (Java) method call and to get the result of the called method at the top of the stack of the calling method.
  • Local variable array: this array contains all the local variables in the scope of the current method. It can hold values of primitive types, references, or returnAddresses. The size of this array is computed at compilation time. The Java Virtual Machine uses local variables to pass parameters on method invocation; the array of the called method is created from the operand stack of the calling method.
  • Run-time constant pool reference: a reference to the constant pool of the class of the current method being executed. It is used by the JVM to translate a symbolic method/variable reference (ex: myInstance.method()) into the real memory reference.

Stack

Each Java Virtual Machine thread has a private Java Virtual Machine stack, created at the same time as the thread. A Java Virtual Machine stack stores frames. A new frame is created and put in the stack each time a method is invoked. A frame is destroyed when its method invocation completes, whether that completion is normal or abrupt (it throws an uncaught exception).

Only one frame, the frame for the executing method, is active at any point in a given thread. This frame is referred to as the current frame, and its method is known as the current method. The class in which the current method is defined is the current class. Operations on local variables and the operand stack are typically with reference to the current frame.

 

Let’s look at the following example, which is a simple addition:

public int add(int a, int b){
  return a + b;
}

public void functionA(){
  // some code without function call
  int result = add(2,3); // call to add()
  // some code without function call
}

Here is how it works inside the JVM when functionA() is running:

example of the state of a JVM method stack before, during and after an inner call

 

Inside functionA(), Frame A is at the top of the stack and is the current frame. At the beginning of the inner call to add(), a new frame (Frame B) is pushed onto the stack and becomes the current frame. The local variable array of Frame B is populated by popping the operand stack of Frame A. When add() finishes, Frame B is destroyed and Frame A becomes the current frame again. The result of add() is put on the operand stack of Frame A so that functionA() can use it by popping its operand stack.

 

Note: the functioning of this stack makes it dynamically expandable and contractable. There is a maximum size that a stack can’t exceed, which limits the number of recursive calls. If this limit is exceeded, the JVM throws a StackOverflowError.

With Oracle HotSpot, you can specify this limit with the parameter -Xss.
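Here is a deliberately broken sketch that makes the limit visible (don’t write code like this in a real program):

public class StackDepth {
  static int depth = 0;

  static void recurse() {
    depth++;    // each call pushes a new frame onto the thread's JVM stack
    recurse();  // no base case: the recursion only stops when the stack limit is hit
  }

  public static void main(String[] args) {
    try {
      recurse();
    } catch (StackOverflowError e) {
      System.out.println("StackOverflowError after " + depth + " frames");
    }
  }
}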

Native method stack (Per Thread)

This is a stack for native code written in a language other than Java and called through JNI (Java Native Interface). Since it’s a “native” stack, its behavior is entirely dependent on the underlying OS.

 

Conclusion

I hope this article helps you to have a better understanding of the JVM. In my opinion, the trickiest part is the JVM stack since it’s strongly linked to the internal functioning of the JVM.

If you want to go deeper:

  • you can read the JVM specifications here.
  • there is also a very good article here.
  • (for French readers) here is a series of 22 posts about JVM with a very strong focus on bytecode.

 

How does a HashMap work in JAVA http://www.sunsetandecho.com/how-does-a-hashmap-work-in-java/ http://www.sunsetandecho.com/how-does-a-hashmap-work-in-java/#comments Sun, 22 Mar 2015 18:43:44 +0000 http://www.sunsetandecho.com/?p=411

 

Most Java developers are using Maps, and especially HashMaps. A HashMap is a simple yet powerful way to store and get data. But how many developers know how a HashMap works internally? A few days ago, I read a large part of the source code of java.util.HashMap (in Java 7, then Java 8) in order to get a deep understanding of this fundamental data structure. In this post, I’ll explain the implementation of java.util.HashMap, present what’s new in the Java 8 implementation, and talk about performance, memory and known issues when using HashMaps.

 

Internal storage

The Java HashMap class implements the interface Map<K,V>. The main methods of this interface are (a short usage sketch follows the list):

  • V put(K key, V value)
  • V get(Object key)
  • V remove(Object key)
  • boolean containsKey(Object key)
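Here is a quick usage sketch of those methods (the keys and values are just made-up examples):

Map<String, Integer> ages = new HashMap<>();
ages.put("Alice", 30);                    // V put(K key, V value)
Integer age = ages.get("Alice");          // V get(Object key) -> 30
boolean hasBob = ages.containsKey("Bob"); // boolean containsKey(Object key) -> false
ages.remove("Alice");                     // V remove(Object key) -> 30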

HashMaps use an inner class to store data: the Entry<K,V>. This entry is a simple key-value pair with two extra pieces of data:

  • a reference to another Entry, so that a HashMap can store entries as singly linked lists
  • the hash value of the key. This hash value is stored to avoid recomputing the hash every time the HashMap needs it.

Here is a part of the Entry implementation in JAVA 7:

static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        int hash;
…
}

A HashMap stores data in multiple singly linked lists of entries (also called buckets or bins). All the lists are registered in an array of Entry (Entry<K,V>[] array), and the default capacity of this inner array is 16.

 

internal_storage_java_hashmap

The following picture shows the inner storage of a HashMap instance: an array of nullable entries. Each Entry can link to another Entry to form a linked list.

 

All the keys with the same hash value are put in the same linked list (bucket). Keys with different hash values can end up in the same bucket.

When a user calls put(K key, V value) or get(Object key), the function computes the index of the bucket in which the Entry should be. Then, the function iterates through the list to look for the Entry that has the same key (using the equals() function of the key).

In the case of get(), the function returns the value associated with the entry (if the entry exists).

In the case of put(K key, V value), if the entry exists the function replaces it with the new value, otherwise it creates a new entry (from the key and value in arguments) at the head of the singly linked list.

 

The index of the bucket (linked list) is generated in 3 steps by the map:

  • It first gets the hashcode of the key.
  • It rehashes the hashcode to protect against a bad hash function of the key that would put all the data at the same index (bucket) of the inner array.
  • It takes the rehashed hashcode and bit-masks it with the length (minus 1) of the array. This operation ensures that the index can’t be greater than the size of the array. You can see it as a very computationally optimized modulo function.

Here is the Java 7 and Java 8 source code that deals with the index:

// the "rehash" function in JAVA 7 that takes the hashcode of the key
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
// the "rehash" function in JAVA 8 that directly takes the key
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
// the function that returns the index from the rehashed hash
static int indexFor(int h, int length) {
    return h & (length-1);
}

In order to work efficiently, the size of the inner array needs to be a power of 2; let’s see why.

Imagine the array size is 17; the mask value is going to be 16 (size - 1). The binary representation of 16 is 0…010000, so for any hash value H the index generated with the bitwise formula “H AND 16” is going to be either 16 or 0. This means that the array of size 17 would only be used for 2 buckets: the one at index 0 and the one at index 16, which is not very efficient…

But if you now take a size that is a power of 2, like 16, the bitwise index formula is “H AND 15”. The binary representation of 15 is 0…001111, so the index formula can output values from 0 to 15 and the array of size 16 is fully used. For example:

  • if H = 952, its binary representation is 0..01110111000, the associated index is 0…01000 = 8
  • if H = 1576, its binary representation is 0..011000101000, the associated index is 0…01000 = 8
  • if H = 12356146, its binary representation is 0..0101111001000101000110010, the associated index is 0…00010 = 2
  • if H = 59843, its binary representation is 0..01110100111000011, the associated index is 0…00011 = 3

 

This is why the array size is a power of two. This mechanism is transparent for the developer: if he chooses a HashMap with a size of 37, the Map will automatically choose the next power of 2 after 37 (i.e. 64) for the size of its inner array.
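Here is a sketch of that rounding (the real HashMap does it with a bit-twiddling helper, but the effect is the same):

// Rounds a requested capacity up to the next power of two, e.g. 37 -> 64.
static int nextPowerOfTwo(int requested) {
  int n = 1;
  while (n < requested) {
    n <<= 1; // double until we reach or exceed the requested capacity
  }
  return n;
}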

 

Auto resizing

After getting the index, the function (get, put or remove) visits/iterates through the associated linked list to see if there is an existing Entry for the given key. Without modification, this mechanism could lead to performance issues because the function needs to iterate through the entire list to see if the entry exists. Imagine that the size of the inner array is the default value (16) and you need to store 2 million values. In the best case scenario, each linked list will have a size of 125,000 entries (2 million / 16). So, each get(), remove() and put() will lead to 125,000 iterations/operations. To avoid this case, the HashMap has the ability to increase its inner array in order to keep the linked lists very short.

When you create a HashMap, you can specify an initial size and a loadFactor with the following constructor:

public HashMap(int initialCapacity, float loadFactor)

If you don’t specify arguments, the default initialCapacity is 16 and the default loadFactor is 0.75. The initialCapacity represents the size of the inner array of linked lists.

Each time you add a new key/value in your Map with put(…), the function checks if it needs to increase the capacity of the inner array. In order to do that, the map stores 2 pieces of data:

  • The size of the map: it represents the number of entries in the HashMap. This value is updated each time an Entry is added or removed.
  • A threshold: it’s equal to (capacity of the inner array) * loadFactor and it is refreshed after each resize of the inner array

Before adding the new Entry, put(…) checks whether size > threshold and, if it is the case, it recreates a new array with a doubled size. Since the size of the new array has changed, the indexing function (which returns the bitwise operation “hash(key) AND (sizeOfArray-1)”) changes. So, the resizing of the array creates twice as many buckets (i.e. linked lists) and redistributes all the existing entries into the buckets (the old ones and the newly created).

The aim of this resize operation is to decrease the size of the linked lists so that the time cost of the put(), remove() and get() methods stays low. All entries whose keys have the same hash will stay in the same bucket after the resizing. But, 2 entries with different key hashes that were in the same bucket before might not be in the same bucket after the transformation.

resizing_of_java_hashmap

The picture shows a representation before and after the resizing of the inner array. Before the increase, in order to get Entry E, the map had to iterate through a list of 5 elements. After the resizing, the same get() just iterates through a linked list of 2 elements; the get() is 2 times faster after the resizing!

 

Note: the HashMap only increases the size of the inner array, it doesn’t provide a way to decrease it.

 

Thread Safety

If you already know HashMaps, you know that they are not thread safe, but why? For example, imagine that you have a Writer thread that only puts new data into the Map and a Reader thread that reads data from the Map; why shouldn’t it work?

Because during the auto-resizing mechanism, if a thread tries to put or get an object, the map might use the old index value and won’t find the new bucket in which the entry is.

The worst case scenario is when 2 threads put data at the same time and the 2 put() calls resize the Map at the same time. Since both threads modify the linked lists at the same time, the Map might end up with an inner loop in one of its linked lists. If you try to get data from a list with an inner loop, the get() will never end.

The Hashtable implementation is a thread safe implementation that prevents this situation. But, since all the CRUD methods are synchronized, this implementation is very slow. For example, if thread 1 calls get(key1), thread 2 calls get(key2) and thread 3 calls get(key3), only one thread at a time will be able to get its value, whereas all 3 of them could access the data at the same time.

A smarter implementation of a thread safe HashMap has existed since Java 5: the ConcurrentHashMap. Only the buckets are synchronized, so multiple threads can get(), remove() or put() data at the same time as long as it doesn’t imply accessing the same bucket or resizing the inner array. It’s better to use this implementation in a multithreaded application.
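A minimal usage sketch (ConcurrentHashMap lives in java.util.concurrent; the merge() call below comes from the Java 8 Map API and is performed atomically by ConcurrentHashMap):

ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();
// Safe to call from several threads at once, no external synchronization needed:
hits.merge("page1", 1, Integer::sum); // atomically increments the counter for "page1"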

 

Key immutability

Why are Strings and Integers a good implementation of keys for a HashMap? Mostly because they are immutable! If you choose to create your own key class and don’t make it immutable, you might lose data inside the HashMap.

Look at the following use case:

  • You have a key that has an inner value “1”
  • You put an object in the HashMap with this key
  • The HashMap generates a hash from the hashcode of the Key (so from “1”)
  • The Map stores this hash in the newly created Entry
  • You modify the inner value of the key to “2”
  • The hash value of the key is modified but the HashMap doesn’t know it (because the old hash value is stored)
  • You try to get your object with your modified key
  • The map computes the new hash of your key (so from “2”) to find in which linked list (bucket) the entry is
    • Case 1: Since you modified your key, the map tries to find the entry in the wrong bucket and doesn’t find it
    • Case 2: Luckily, the modified key generates the same bucket as the old key. The map then iterates through the linked list to find the entry with the same key. But to find the key, the map first compares the hash values and then calls the equals() comparison. Since your modified key doesn’t have the same hash as the old hash value (stored in the entry), the map won’t find the entry in the linked list.

Here is a concrete example in Java. I put 2 key-value pairs in my Map, modify the first key and then try to get the 2 values. Only the second value is returned from the map; the first value is “lost” in the HashMap:

public class MutableKeyTest {

	public static void main(String[] args) {

		class MyKey {
			Integer i;

			public void setI(Integer i) {
				this.i = i;
			}

			public MyKey(Integer i) {
				this.i = i;
			}

			@Override
			public int hashCode() {
				return i;
			}

			@Override
			public boolean equals(Object obj) {
				if (obj instanceof MyKey) {
					return i.equals(((MyKey) obj).i);
				} else
					return false;
			}

		}

		Map<MyKey, String> myMap = new HashMap<>();
		MyKey key1 = new MyKey(1);
		MyKey key2 = new MyKey(2);

		myMap.put(key1, "test " + 1);
		myMap.put(key2, "test " + 2);

		// modifying key1
		key1.setI(3);

		String test1 = myMap.get(key1);
		String test2 = myMap.get(key2);

		System.out.println("test1= " + test1 + " test2=" + test2);

	}

}

The output is: “test1= null test2=test 2”. As expected, the Map wasn’t able to retrieve the first string with the modified key 1.
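A simple way to avoid this trap is to make the key class immutable. Here is a sketch (not part of the example above) of what such a key could look like, with a final field and no setter:

final class ImmutableKey {
  private final int i; // no setter: the hashCode can never change after construction

  ImmutableKey(int i) {
    this.i = i;
  }

  @Override
  public int hashCode() {
    return i;
  }

  @Override
  public boolean equals(Object obj) {
    return (obj instanceof ImmutableKey) && ((ImmutableKey) obj).i == i;
  }
}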

 

JAVA 8 improvements

The inner representation of the HashMap has changed a lot in Java 8. Indeed, the implementation in Java 7 takes about 1k lines of code whereas the implementation in Java 8 takes 2k lines. Most of what I’ve said previously is still true, except for the linked lists of entries. In Java 8, you still have an array but it now stores Nodes that contain the exact same information as Entries and are therefore also linked lists:

Here is a part of the Node implementation in Java 8:


   static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

So what’s the big difference with Java 7? Well, Nodes can be extended to TreeNodes. A TreeNode is a red-black tree structure that stores much more information, so that it can add, delete or get an element in O(log(n)).

FYI, here is the exhaustive list of the data stored inside a TreeNode:

static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
	final int hash; // inherited from Node<K,V>
	final K key; // inherited from Node<K,V>
	V value; // inherited from Node<K,V>
	Node<K,V> next; // inherited from Node<K,V>
	Entry<K,V> before, after;// inherited from LinkedHashMap.Entry<K,V>
	TreeNode<K,V> parent;
	TreeNode<K,V> left;
	TreeNode<K,V> right;
	TreeNode<K,V> prev;
	boolean red;
 

Red-black trees are self-balancing binary search trees. Their inner mechanisms ensure that their depth is always in O(log(n)) despite additions or removals of nodes. The main advantage of using those trees is that when many entries are at the same index (bucket) of the inner table, a search in a tree costs O(log(n)) whereas it would cost O(n) with a linked list.

As you can see, the tree takes significantly more space than the linked list (we’ll talk about that in the next part).

By inheritance, the inner table can contain both Nodes (linked lists) and TreeNodes (red-black trees). Oracle decided to use both data structures with the following rules:
– If for a given index (bucket) in the inner table there are more than 8 nodes, the linked list is transformed into a red-black tree.
– If for a given index (bucket) in the inner table there are fewer than 6 nodes, the tree is transformed into a linked list.

internal_storage_java8_hashmap

This picture shows an inner array of a Java 8 HashMap with both trees (at bucket 0) and linked lists (at buckets 1, 2 and 3). Bucket 0 is a tree because it has more than 8 nodes.

 

Memory overhead

JAVA 7

The use of a HashMap comes at a cost in terms of memory. In Java 7, a HashMap wraps key-value pairs in Entries. An Entry has:

  • a reference to a next entry
  • a precomputed hash (integer)
  • a reference to the key
  • a reference to the value

Moreover, a Java 7 HashMap uses an inner array of Entry. Assuming a Java 7 HashMap contains N elements and its inner array has a capacity CAPACITY, the extra memory cost is approximately:

sizeOf(integer) * N + sizeOf(reference) * (3*N + CAPACITY)

Where:

  • the size of an integer is 4 bytes
  • the size of a reference depends on the JVM/OS/Processor but is often 4 bytes.

Which means that the overhead is often 16 * N + 4 * CAPACITY bytes

Reminder: after an auto-resizing of the Map, the CAPACITY of the inner array equals the next power of two after N.
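For example, using this approximation with N = 1,000,000 entries and CAPACITY = 2^20 = 1,048,576, the overhead would be roughly 16 * 1,000,000 + 4 * 1,048,576 ≈ 20 MB, on top of the memory used by the keys and values themselves.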

Note: Since Java 7, the HashMap class has a lazy initialization. That means that even if you allocate a HashMap, the inner array of entries (that costs 4 * CAPACITY bytes) won’t be allocated in memory until the first use of the put() method.

JAVA 8

With the Java 8 implementation, it becomes a little bit more complicated to get the memory usage because a Node can contain the same data as an Entry, or the same data plus 6 references and a boolean (if it’s a TreeNode).

If all the nodes are only Nodes, the memory consumption of the Java 8 HashMap is the same as that of the Java 7 HashMap.

If all the nodes are TreeNodes, the memory consumption of a Java 8 HashMap becomes:

N * sizeOf(integer) + N * sizeOf(boolean) + sizeOf(reference) * (9*N + CAPACITY)

In most standard JVMs, it’s equal to 44 * N + 4 * CAPACITY bytes.

 

Performance issues

Skewed HashMap vs well balanced HashMap

In the best case scenario, the get() and put() methods have a O(1) cost in time complexity. But, if you don’t take care of the hash function of the key, you might end up with very slow put() and get() calls. The good performance of put() and get() depends on the distribution of the data among the different indexes of the inner array (the buckets). If the hash function of your key is ill-designed, you’ll have a skewed distribution (no matter how big the capacity of the inner array is). All the put() and get() calls that use the biggest linked lists of entries will be slow because they’ll need to iterate through the entire lists. In the worst case scenario (if most of the data are in the same bucket), you could end up with a O(n) time complexity.
Here is a visual example. The first picture shows a skewed HashMap and the second picture a well balanced one.

skewed_java_hashmap

 

In the case of this skewed HashMap, the get()/put() operations on bucket 0 are costly. Getting the Entry K will cost 6 iterations.

well_balanced_java_hashmap

In the case of this well balanced HashMap, getting the Entry K will cost 3 iterations. Both HashMaps store the same amount of data and have the same inner array size. The only difference is the hash (of the key) function that distributes the entries in the buckets.

Here is an extreme example in Java where I create a hash function that puts all the data in the same bucket, and then I add 2 million elements.

public class Test {

	public static void main(String[] args) {

		class MyKey {
			Integer i;
			public MyKey(Integer i){
				this.i =i;
			}

			@Override
			public int hashCode() {
				return 1;
			}

			@Override
			public boolean equals(Object obj) {
			…
			}

		}
		Date begin = new Date();
		Map <MyKey,String> myMap= new HashMap<>(2_500_000,1);
		for (int i=0;i<2_000_000;i++){
			myMap.put( new MyKey(i), "test "+i);
		}

		Date end = new Date();
		System.out.println("Duration (ms) "+ (end.getTime()-begin.getTime()));
	}
}

On my Core i5-2500K @ 3.6 GHz, it takes more than 45 minutes with Java 8u40 (I stopped the process after 45 minutes).

Now, if I run the same code but this time with the following hash function:

	@Override
	public int hashCode() {
		int key = 2097152 - 1;
		return key + 2097152 * i;
	}

it takes 46 seconds, which is way better! This hash function has a better distribution than the previous one, so the put() calls are faster.

And if I run the same code with the following hash function, which provides an even better hash distribution:

 @Override
 public int hashCode() {
     return i;
 }

it now takes 2 seconds.

I hope you realize how important the hash function is. If I had run the same test on Java 7, the results would have been worse for the first and second cases (since the time complexity of put() is O(n) in Java 7 vs O(log(n)) in Java 8).

When using a HashMap, you need to find a hash function for your keys that spreads the keys across as many buckets as possible. To do so, you need to avoid hash collisions. The String object is a good key because it has a good hash function. Integers are also good because their hashcode is their own value.

 

Resizing overhead

If you need to store a lot of data, you should create your HashMap with an initial capacity close to your expected volume.

If you don’t do that, the Map will take the default size of 16 with a loadFactor of 0.75. The first 11 put() calls will be very fast, but the 12th (16*0.75) will recreate a new inner array (with its associated linked lists/trees) with a new capacity of 32. The 13th to 23rd will be fast, but the 24th (32*0.75) will recreate (again) a costly new representation that doubles the size of the inner array. The internal resizing operation will occur again at the 48th, 96th, 192nd, … call of put(). At low volume the full recreation of the inner array is fast, but at high volume it can take seconds to minutes. By initially setting your expected size, you can avoid these costly operations.
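Here is a sketch of such pre-sizing (the 1_400_000 figure is just an example: roughly 1,000,000 expected entries divided by the 0.75 load factor, rounded up):

// Expecting about 1,000,000 entries: size the map up front to avoid intermediate resizes.
Map<String, String> bigMap = new HashMap<>(1_400_000);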

But there is a drawback: if you set a very high array size like 2^28 while you’re only using 2^26 buckets in your array, you will waste a lot of memory (approximately 2^30 bytes in this case).

 

Conclusion

For simple use cases, you don’t need to know how HashMaps work, since you won’t see the difference between a O(1), a O(n) or a O(log(n)) operation. But it’s always better to understand the underlying mechanism of one of the most used data structures. Moreover, for a Java developer position, it’s a typical interview question.

At high volume, it becomes important to know how it works and to understand the importance of the hash function of the key.

I hope this article helped you to have a deep understanding of the HashMap implementation.
