Talk 10: Concurrency Control in RDBMS
Talk given by Arjen Lentz, Community Relations Manager, MYSQL AB. (Don't you love how half the time I get the name... it depends on how long they leave the first slide up. Hurrah.)
Concurrency control is the mechanism required to manage multiple users accessing the same resource (in this case, tables and rows). In general the goal is to implement this with the least possible overhead, optimized for what you're trying to accomplish. Every approach has pros and cons. Also, this talk is not specific to MySQL at all.
Without any rules, obviously you're going to run into rules. Connection A and B do a select, then an update, but they're a little behind on B, so someone's update gets lost. Whatever concurrency control does, it has to prevent this sort of situation.
You must also avoid "uncommitted dependencies." When one person does something, then another does something using the results of the first person, then the first person issues a rollback. This is similar to something called "inconsistent analysis", which is someone who's working on rows that have changed, and then you end up with bad data.
One big concept here is locking granularity. You want to lock things, but how low level do you want to go? SQLite does it on the database level, MySQL/MyISAM and MySQL/MEMORY do it at the tabel, BerkeleyDB and Ingres do it at the page level, and pretty much everybody else uses row level locking.
Once you've got granularity, you have types of locks. Shared read locks allow other people to read rows, as long as they don't change them. Update locks mean you intend to update the row. Nobody else can get an update lock while you've got one. Exclusive locks mean it's just you. All you!
Then, after locking is implemented, you want transactions. It's certainly not always necessary, but it's definitely something that needs supporting. It allows multiple commands to execute within a safe area where they can't get run over. This is using the ACID rules...
Atomicity - one or more operations are a single unit of work, and indivisible. Either all or nothing.
Consistency - the database must always move from one consistent state to another. Inconsistencies may exist during a transaction, but must not after you commit or rollback.
Isolation - changes in flight (in an uncommitted transaction) must not be visible to other transactions.
Durability - once committed, data must persist somewhere/somehow.
(This is all stuff I know/am familiar with. The talk is a bit geared towards newer people and reminds me a lot about the database courses at school...)
There are now four isolation levels. Not all of these are implemented by every database server, but some of them are. For completeness here they are:
Read Uncommitted - doesn't solve isolation, you can see data that is coming from other transactions that are in flight. You can get bad data/dirty reads this way.
Read Committed - you only see data that is committed, and you don't see uncommitted transactions. This can cause non-repeatable reads, which is where you're reading data, someone commits a transaction, and you do the same read again, getting new data.
Repeatable Read - you only see committed transactions that were committed before you started your transaction. Except for phantoms. Which is where you select data that doesn't exist, then someone inserts the row, and you select again - now you get the row. (InnoDB prevents this by locking the gap on the rows that don't exist.)
Serializable - transactions are completely isolated from eachother. The problem is that every single select requires a shared lock. Prevents phantoms, regardless of the database you use. Can be very slow.
So, higher isolation level is more correct, but hurts concurrency. It's a huge tradeoff to make. If you don't implement every isolation level, then your database should escalate to a level higher if you don't implement one they're asking for. (Unfortunately, some databases don't do this.)
When doing locking, escalate to the next higher level if you're starting to do a lot of it. If you're starting to lock a ton of rows, lock pages. Lots of pages? Lock tables. This is not stuff that the end user decides, this is all dictated by the database. Set it up to save memory as appropriate.
There are two types of locking... optimistic, and pessimistic. The former is to only take action when necessary, the latter assumes that you're going to run into problems, so lock things all the time just to make sure we don't have problems. (At this point he's skimming through things, as we're 5 minutes to the end of the session.)
There are other concepts here, such as two-phase locking, multi-versioned concurrency control (MVCC), and deadlocks. The latter mean you have to rollback transactions.
Most database also support explicit locking - you can tell it to lock something. Not all databases support this, and it's not defined in the SQL standards. (Interesting - a community standard? Among databases? Huh.)