• 19 May 2022Downtime Friday 20th: Spring cleaning time for the database

    The website will be unavailable tomorrow (Friday May 20th, 2022) for up to an hour. The downtime will be sometime between 3-6 pm CEST.

    This is an annual "spring cleaning time" where I will run some manual queries on the database to trim stale data from very old, inactive accounts. The goal is to prevent the database from slowing down due to large indexes, as well as keep the full backups to a manageable size.

    As I have done in the past I will use the following criteria :

    • Delete flashcards from accounts inactive for 2 YEARS.
    • Delete private stories where the user's account was inactive for 5 YEARS.

    An inactive account is an account for which the user did not sign in within the period. For example if you sign in just once within the last 5 years, the account data is persisted. It is not necessary to do any activity - just sign in. In other words what I'm looking at is "the last time the user logged in".

    For flashcards : I have found in the past from emails asking "how do I reset my account?" that users ; when they come back after a long break ; usually want to delete all their cards and start over. It makes sense for how the SRS works after all. So the 2 YEAR interval seems sensible to me. Keep in mind on Koohii a "flashcard" refers to the SRS status for each kanji - not stories.

    For stories : we're looking at deleting private stories only, from old inactive accounts. This ensures the shared stories area is unaffected and as you'll see below, the shared stories only account for 7% of 7 million+ stories! Still, it's not unusual for users to come back after several years break, which is why I used a more sensible 5 year interval.

    How to Backup Your Data

    If you think you may take a long break, or have finished with Kanji Koohii for the foreseeable future - keep in mind your data is not "locked in". You can export all your stories from the My Stories page. This will download a CSV format file which you can load up in a typical spreadsheet app like Google Docs or Libre Office.

    The stories export could be useful also for eg. printing and browsing offline.

    You can also export your kanji flashcards from the Manage Flashcards section of the site. This is a CSV export and how you can use that data elsewhere is up to you. To clarify, this is the SRS status data.

    On Koohii, stories are independent of kanji cards. Many users are only interested in the stories section of the site, other users are only interested in the SRS. Some use both. Just remember : stories are stored independently of flashcards. You can remove a card and add it back later and the corresponding story for that kanji will still be there.

    More database nitty gritty and thoughts

    I have found over the years that as long as the full gzipped database backup is under 1 GB, it is manageable for me. It doesn't take too long to download, and if I need to debug something, it doesn't take forever either, to "copy" that database in my development environment.

    By running some statistics I have found in the past that the stories table in the database is, unsurprisingly, the most space consuming.

    As of a recent May backup, there are over 7,294,500 stories! (yes, that is seven millions). Of those, "only" 513,800 are shared - that is just 7% !

    So what this means, is when users go through RTK and input or copy shared stories for 2000 kanji, and then leave and never come back,... this just adds up to a lot of data over the years.

    Thankfully due to the nature of this service, I can afford to trim some of that excess data: it turns out that those users who do in fact return (after several years break) often want to "reset" their account and start over. I know that because I'd get occasional emails like "hey, I haven't used koohii in a while, how can I reset my account ?".

    More recently, the crude admin panel that I have lets me see when users delete accounts and create accounts - useful to pinpoint issues with registration, or abuse by spam bots. And I could see a user initiated an account deletion, and minutes later created a new account under the same name. I had a hunch that not all accounts deleted are necessarily users leaving and this confirmed it - many users want to "reset" their account after a long break.

    While we're here let's have some numbers - as of a recent May backup:

    Table         Rows        data_length   index_length
    ----------------------------------------------------
    reviews       8,513,399    194 MB       273 MB
    stories       7,294,513   1103 MB       191 MB
    custkeywords  1,834,099     58 MB        23 MB
    

    The reviews (ie. flashcards) table may have an issue with how I designed it. I'm not sure. It's pretty silly you can see that the index itself is larger than the data... although I think it is not unusual when the average row length is small (24 bytes).

    Based on some test runs tomorrow's "spring cleaning" will shave off 2,057,521 private stories (from accounts not used in 5+ years), of a total of 7,294,513. That is 28% of the stories table, and not negligible 300 MB trim. It will also shave 40% of the flashcards table (flashcards from accounts not used in 2+ years), another significant ~180 MB trim.

    Hope you enjoyed this little foray into the database!

    I will post an update when the site is back up tomorrow after the downtime.

By Month