Blog / August 2016
-
30 August 2016More brainstorming of the database performance
Since the last post I've tried different techniques and tested the performance of various indexes in the database. I thought I had a decent solution now, but I am still seeing queries from 30 to 100 ms. 100 ms is ten times better than 1 second, but it's still pretty bad considering I'm running the queries on my computer, where I am the only user!I'm pretty sure there is a significant improvement already if I updated the site with the changes I've made.
One good improvement already is that the "New Stories" and "Shared Stories" can both be cached together. This wasn't the case before for odd reasons involving how I wrote the SQL queries. So a better caching of the shared stories is a good thing.
Another thing is that the improved schema and indexing of the stories removed the temporary table step from the queries, and I suspect that was a bottleneck when many users are accessing the site at the same time.
However today I've just run into the notion of Partitioned Tables. This sounds like it could be very useful. Indeed only 6 to 7 % of 6+ million rows, that is approx. 6 % of 1 GIGABYTE of stories is actually being publicly shared. And I suspect a big performance impact is due to how some of the older stories that have many votes are at beginning of the physical storage, while newer stories that get one or two votes will be towards the ends. Likely what happens is that MySQL, even with a good index, has to collect stories that are spread in a 1 GB physical storage space.
It would be far more efficient if all the public stories were all clustered together on disk. Even if the server has SSD drives, a 1GB table won't fit in memory and that also limits the performance.
So I need to look into this Partitioned Tables thingamagic right now. In theory if I can have a partition store only the public stories, it will be less than a hundred megabytes (ie. ~7 % of 1 GB) which may fit into memory. And even if it doesn't fit in memory it means all the stories pulled by the Shared Stories query will be clustered together on the physical storage.
Assuming the other partition would be private stories, and it takes approx 900+ MB .. it is not impacted in the same way because typically what the website does is simply read ONE row of the database to retrieve whichever story the user is viewing / editing. So in that case the index works fine.
All in all it's nice to get to grips with MySQL.. and I need to take care of this because it's a problem that only gets worse as the database grows. In fact this problem wasn't there several years ago. I'm surprised actually that no one ever tells me about it... but I suspect the reason for that is that when studying you advance through the pages at a relatively slow pace.
But even when users themselves don't feel the lag too much when browsing the Study pages, it will affect users who are reviewing! While there are different tables the MySQL server will be busy with slow queries and when people are reviewing this can cause lag too. That is why I really want to solve this problem.
Gaining this experience now is crucial also because the flashcards table is growing too and it's nearing 500 MB as I write and will run into performance issues as well. In fact today if I want to add any kind of feature to the flashcards or stories, I am invariably expanding the size of those tables. -
19 August 2016Brainstorming upcoming performance update
I'm looking into improving the database queries that are used when browsing the Study pages. The main query that pulls out the shared stories is actually pretty bad. I'm regularly seeing queries that take between 1 - 2 seconds! For comparison, the forum (which is based on MyBB) typically returns a full page (php and mysql queries) under 100 milliseconds!
It looks like it's not so complicated after all. Although MyBB uses "Join Decomposition" for pulling posts within a thread, and I thought this would be a similar scenario for the stories, it turns out I don't even need to use this technique.
I might be able to do this first performance update within a week or two. Read on if you are interested in the MySQL ins and outs!First, I converted the stories from a TEXT to a VARCHAR field. This is not strictly necessary because we want to remove the use of "temporary" tables, but it may still be a good thing in unforeseen queries. Stories are currently limited to 512 characters, so that's approx. 1500 bytes assuming the whole story uses UTF-8 characters, which is not very long.
Secondly, I moved the sum of stories' stars and reports back into the main stories table. Those fields store the total of votes and reports, in order to simplify queries. We do store every individual vote in another table, which can be used to recalculate the vote / report sums if necessary.
This is called "de normalization". This could potentially be worse with MyISAM because whenever someones upvotes or edits a story, the whole table is locked (MyISAM does not handle individual row locks for writing). Previously we would lock a separate table holding those vote / report totals. But now we would lock the stories table itself.
Hence, I am updating the table to InnoDB. InnoDB as far as I understand can lock individual rows when writing. This means concurrent votes, story edits, and new story insertions can all happen in parrallel without too much of an impact (hopefully).
After moving the total stars / reports into the stories table, we have solved the "Using temporary" problem. There is no longer a JOIN with the vote / reports summary table. Because we used a TEXT field as well, the temporary tables were on disk. So hopefully that is a significant improvement to the query performance. In memory temporary tables can be 10 to 100 times faster (source).
Lastly, I added a new index which covers the main sort for the shared stories : stars descending, "last edit time" descending. The index can actually be used for sorting, so we also get rid of the "filesort" step in the MySQL query.
Along with the schema change I need to refactor some code. The backend code that handles votes / reports need to update a different table now. Because of additional columns I also need to change the way the "lest edit" timestamp is handled in the stories table.
Those "ON UPDATE" timestamps are incredibly annoying. I just noticed that I destroyed all the timestamps in the stories table on my local copy, after I did some simple ALTER TABLE queries. So now I have to restore a backup, and repeat the schema changes. So I am changing the code now to manually update the time stamp when the user actually edits a story. May as well do that since we don't actually want to change the "last edit" timestamp when we just want to update the count of votes or reports.
There are also interesting query design decisions I didn't think about before. It turns out, for example, that public stories only amount to approx. 6 percent of all the stored stories! Yes, only 6 percent of 6+ millions of rows in the stories table are public. Therefore, the "public" column in a story query is highly selective. If we add "WHERE public = 1" we have already excluded ~94 % of the rows in that table. So this also informs how to better design the indexes.
All in all, not very complicated. I'm looking forward to browse the Study pages and have much faster response times. -
17 August 2016Querying the database. Bleep, bloop.
Yes I am alive! I've been somewhat busy since January but I've been looking into the website code now for about a week or so. Some database queries are really slow and I always wanted to fix those. Browsing the Study pages ought to work much better.
I'm reading "High Performance MySQL" and looking into ways to improve the database schema. Some things that stood out is that I used multiple column keys as primary keys in a way that is inefficient and prevents so useful techniques, such as "Join Decomposition". Another thing is called "de normalization", and I should probably have stored the count of votes / reports along with the stories in order to avoid the "Temporary table" problem. I realized also that storing stories as "TEXT" instead of "VARCHAR" causes the temporary tables to be on disk instead of in memory!
All kinds of fun stuff. It'll probably take a few weeks and then I'll have to take down the site for an hour or so, make a backup first, then do all kind of wizardry to make major changes to the schema and queries. I should probably do this in different phases because it's quite tricky.
By Month
- Nov 2024 (1)
- Sep 2024 (1)
- Jun 2024 (2)
- May 2024 (4)
- Apr 2024 (3)
- Mar 2024 (1)
- Feb 2024 (1)
- Dec 2023 (1)
- Nov 2023 (2)
- Oct 2023 (2)
- Apr 2023 (2)
- Mar 2023 (2)
- Feb 2023 (1)
- Jan 2023 (2)
- Dec 2022 (1)
- Nov 2022 (2)
- Oct 2022 (3)
- Sep 2022 (1)
- May 2022 (4)
- Apr 2022 (1)
- Feb 2022 (2)
- Jan 2022 (2)
- Dec 2021 (4)
- Nov 2021 (2)
- Oct 2021 (2)
- Sep 2021 (2)
- Aug 2021 (1)
- Apr 2021 (2)
- Feb 2021 (3)
- Jan 2021 (3)
- Dec 2020 (1)
- Nov 2020 (1)
- May 2020 (1)
- Apr 2020 (1)
- Jan 2020 (1)
- Oct 2019 (1)
- Sep 2019 (1)
- Aug 2019 (4)
- Jul 2019 (3)
- Jun 2019 (1)
- May 2019 (1)
- Mar 2019 (2)
- Jan 2019 (1)
- Nov 2018 (3)
- Oct 2018 (8)
- Sep 2018 (4)
- Aug 2018 (3)
- Jul 2018 (1)
- Jun 2018 (4)
- May 2018 (1)
- Apr 2018 (1)
- Mar 2018 (1)
- Jan 2018 (1)
- Dec 2017 (6)
- Nov 2017 (4)
- Oct 2017 (4)
- Sep 2017 (5)
- Aug 2017 (5)
- Jun 2017 (3)
- May 2017 (2)
- Apr 2017 (3)
- Mar 2017 (7)
- Feb 2017 (10)
- Jan 2017 (11)
- Dec 2016 (6)
- Nov 2016 (5)
- Oct 2016 (6)
- Sep 2016 (7)
- Aug 2016 (3)
- May 2016 (1)
- Mar 2016 (2)
- Jan 2016 (1)
- Dec 2015 (3)
- Nov 2015 (1)
- Oct 2015 (1)
- Sep 2015 (7)
- Jul 2015 (2)
- Jun 2015 (1)
- May 2015 (5)
- Apr 2015 (4)
- Mar 2015 (5)
- Feb 2015 (4)
- Jan 2015 (5)
- Dec 2014 (4)
- Nov 2014 (3)
- Oct 2014 (2)
- Jun 2014 (1)
- Apr 2014 (2)
- Mar 2014 (4)
- Feb 2014 (3)
- Jan 2014 (4)
- Dec 2013 (2)
- Oct 2013 (1)
- Sep 2013 (1)
- Jun 2013 (4)
- May 2013 (1)
- Mar 2013 (1)
- Jan 2013 (2)
- Oct 2012 (2)
- Aug 2012 (1)
- Jul 2012 (2)
- Jun 2012 (2)
- May 2012 (1)
- Mar 2012 (2)
- May 2011 (1)
- Apr 2011 (4)
- Mar 2011 (3)
- Feb 2011 (2)
- Jan 2011 (2)
- Dec 2010 (8)
- Nov 2010 (8)
- Oct 2010 (3)
- Sep 2010 (3)
- Aug 2010 (1)
- Jul 2010 (2)
- Jun 2010 (5)
- May 2010 (1)
- Apr 2010 (3)
- Mar 2010 (4)
- Feb 2010 (2)
- Jan 2010 (1)
- Dec 2009 (5)
- Nov 2009 (5)
- Oct 2009 (1)
- Aug 2009 (1)
- May 2009 (5)
- Apr 2009 (2)
- Mar 2009 (1)
- Feb 2009 (2)
- Jan 2009 (2)
- Nov 2008 (1)
- Oct 2008 (1)
- Sep 2008 (1)
- May 2008 (2)
- Apr 2008 (1)
- Feb 2008 (6)
- Jan 2008 (5)
- Dec 2007 (6)
- Oct 2007 (1)
- Sep 2007 (2)
- Aug 2007 (3)
- Jun 2007 (1)
- May 2007 (5)
- Apr 2007 (1)
- Mar 2007 (2)
- Feb 2007 (1)
- Jan 2007 (4)
- Dec 2006 (3)
- Aug 2006 (1)
- Jun 2006 (3)
- Apr 2006 (6)
- Mar 2006 (8)
- Feb 2006 (1)
- Jan 2006 (4)
- Nov 2005 (1)
- Oct 2005 (4)
- Sep 2005 (1)
- Aug 2005 (11)