• 19 August 2016Brainstorming upcoming performance update

    I'm looking into improving the database queries that are used when browsing the Study pages. The main query that pulls out the shared stories is actually pretty bad. I'm regularly seeing queries that take between 1 - 2 seconds! For comparison, the forum (which is based on MyBB) typically returns a full page (php and mysql queries) under 100 milliseconds!

    It looks like it's not so complicated after all. Although MyBB uses "Join Decomposition" for pulling posts within a thread, and I thought this would be a similar scenario for the stories, it turns out I don't even need to use this technique.

    I might be able to do this first performance update within a week or two. Read on if you are interested in the MySQL ins and outs!

    First, I converted the stories from a TEXT to a VARCHAR field. This is not strictly necessary because we want to remove the use of "temporary" tables, but it may still be a good thing in unforeseen queries. Stories are currently limited to 512 characters, so that's approx. 1500 bytes assuming the whole story uses UTF-8 characters, which is not very long.

    Secondly, I moved the sum of stories' stars and reports back into the main stories table. Those fields store the total of votes and reports, in order to simplify queries. We do store every individual vote in another table, which can be used to recalculate the vote / report sums if necessary.

    This is called "de normalization". This could potentially be worse with MyISAM because whenever someones upvotes or edits a story, the whole table is locked (MyISAM does not handle individual row locks for writing). Previously we would lock a separate table holding those vote / report totals. But now we would lock the stories table itself.

    Hence, I am updating the table to InnoDB. InnoDB as far as I understand can lock individual rows when writing. This means concurrent votes, story edits, and new story insertions can all happen in parrallel without too much of an impact (hopefully).

    After moving the total stars / reports into the stories table, we have solved the "Using temporary" problem. There is no longer a JOIN with the vote / reports summary table. Because we used a TEXT field as well, the temporary tables were on disk. So hopefully that is a significant improvement to the query performance. In memory temporary tables can be 10 to 100 times faster (source).

    Lastly, I added a new index which covers the main sort for the shared stories : stars descending, "last edit time" descending. The index can actually be used for sorting, so we also get rid of the "filesort" step in the MySQL query.

    Along with the schema change I need to refactor some code. The backend code that handles votes / reports need to update a different table now. Because of additional columns I also need to change the way the "lest edit" timestamp is handled in the stories table.

    Those "ON UPDATE" timestamps are incredibly annoying. I just noticed that I destroyed all the timestamps in the stories table on my local copy, after I did some simple ALTER TABLE queries. So now I have to restore a backup, and repeat the schema changes. So I am changing the code now to manually update the time stamp when the user actually edits a story. May as well do that since we don't actually want to change the "last edit" timestamp when we just want to update the count of votes or reports.

    There are also interesting query design decisions I didn't think about before. It turns out, for example, that public stories only amount to approx. 6 percent of all the stored stories! Yes, only 6 percent of 6+ millions of rows in the stories table are public. Therefore, the "public" column in a story query is highly selective. If we add "WHERE public = 1" we have already excluded ~94 % of the rows in that table. So this also informs how to better design the indexes.

    All in all, not very complicated. I'm looking forward to browse the Study pages and have much faster response times.