I have a few reasons for writing this blog:
- I thought the other Chess.com members would enjoy reading what goes on behind the scenes to make a website like this function.
- It's therapeutic to get all of the stuff that is floating in my brain down onto paper so I can "flush the buffer".
- People have been wondering what I've been so stressed out about this week, and instead of trying to explain it to 10 different people over IM, I can just send them this URL.
- Finally, it'll hopefully act as a good technical reference for other developers around the world who might encounter similar issues. (So warning, parts of this blog may be technical in nature and boring to non-techies.)
After we launched the project Thursday of last week (December 6, 2007), it received nothing but praise and compliments from those who have tried it. I have shown the project to fellow developers and they have given me nothing but "wow" and "awesome!", as they've been very impressed with the technical feat that's been accomplished, something that non-developers can't really appreciate. This is definitely the kind of stuff that makes me excited, keeps me going, and builds up my morale, even when I'm not actually seeing a paycheck for the work I'm putting in.
I was riding high on cloud nine watching as our first subscriptions started rolling in, confirming that we had really created a high-value, usable product that people were willing to pay money for.
Then... it all started crashing down. A report here, a report there, all from Chess Mentor users reporting that from time to time their lessons would just freeze, or they'd get an Ajax error, or IE would crash entirely and they'd have to reboot or restart IE to fix it. Nothing kills your morale faster than bug reports coming in about an intermittent, not-so-easy-to-replicate bug that only affects SOME users. I immediately told Erik this could potentially be one of those bugs that takes days or even weeks to figure out and fix (if fixing is possible).
So on Monday began the laborious task of trying to recreate the bug myself. After a few lessons of Chess Mentor on our production boxes, sure enough, I clicked the "Try Again" button and the spinning Ajax indicator popped up, but then... nothing. Just waiting and waiting forever, just as our users had reported. It's a good and bad feeling: good in that I now knew we had a problem and had seen it with my own eyes, but with a depressing "What am I going to do about it?" feeling at the same time.
Chess.com was built using a PHP framework, so much of what goes on deep down under the Chess.com application logic is somewhat of a black box. You can try at your own peril to open up this black box, peek inside, and try to fix things, but you can quickly find yourself down a rabbit hole with few or no escapes. So I shot off an email to the founder of the framework to see if he had any ideas, or could tell me where to begin adding more logging to track down the problem. He gave me a few suggestions and off I went.
I added logging galore to the Chess Mentor product to track every button clicked and every move made by every user of the product. The hope was that the next time this problem occurred, I could look at the log files and determine the problem, or I could reproduce it myself and look at my own log.
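For the technically inclined, the idea is roughly "one log line per user action". The real logging lives in the PHP application; what follows is only a minimal Python sketch of the same idea, and the logger name and the field names (`user_id`, `lesson_id`, `action`) are made up for illustration.

```python
import json
import logging

# Hypothetical logger name; the real application is PHP and logs differently.
logger = logging.getLogger("chess_mentor.actions")


def log_action(user_id: int, lesson_id: int, action: str, **details) -> str:
    """Write one user action as a single JSON line and return that line.

    One line per click or move makes it easy to search the log for a single
    user and replay what they did just before a freeze.
    """
    record = {"user_id": user_id, "lesson_id": lesson_id, "action": action}
    record.update(details)  # e.g. move="Nf3" (illustrative only)
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

A call such as `log_action(7, 12, "try_again")` records the click on the "Try Again" button for (hypothetical) user 7 in lesson 12.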
After the logging was in place, I went off to try and reproduce the freezing, and after 10-20 minutes, I got it to freeze again. I checked out the log files, and there were absolutely zero clues there. Everything on the server looked just fine, with no errors in the PHP and no weird behavior. Hmmmm, now what?
Well, any time you use third-party software or a framework, as we do, one option is to upgrade to the latest version, especially if the problems you are experiencing seem to be in the framework. The bad news is that this is not an easy task, and sometimes moving to the latest version of any software introduces NEW bugs, which was just what I didn't want to deal with right then. But I was out of ideas, so I downloaded the latest version (we were about 11 versions behind) and started the super tedious task of upgrading.
It is so difficult because when you have an application the size of Chess.com, there comes a time here and there when you need to make changes to the core of the framework, for bug fixes or whatever. Upgrading then becomes that much more difficult, because you have to merge your "hacks" with the latest files.
I posted in the forums for the framework we use and asked for very experienced developers to help take a look at the problem. I was excited to get responses from two of the big names in that community. I ended up working with "Kristof" from Belgium, who has written the manual on the framework. Great guy, and he started working on the problem right away. I hadn't even given him access to our servers yet and he was already trying to fix it on Chess.com using proxies and capturing network data. AMBITIOUS! I like that!
He also LOVED the Chess Mentor program, so he was more than happy to work on the bug AND learn chess at the same time! We negotiated a deal where he could use Chess Mentor for free in exchange for helping me work on these bugs. Perfect! So I got him set up with access, and he started adding all kinds of alerts to the JS Ajax code (on our test server, of course) so we could track down why it was crashing.
At this point, I was in bad spirits, and as I told Erik, I wanted to just "crawl under a rock and die". Sometimes you wish you were in retail selling clothes or something instead of trying to fix weird random computer bugs. I think it was about 5pm on Tuesday when I decided to just get away, so I curled up under the covers on my bed and was out in no time. Later that night I went out and watched a great movie with my wife and Igor and his wife (yes, the Igor from Chess.com). When I got home around 11:30, I was feeling a little reinvigorated, so I got back onto my computer, found Kristof was online, and we started working on the problem.
With the alerts in place on Chess Mentor, it didn't take too long for him to realize that in IE, certain links on the Chess Mentor page were triggering the window unload event. This will definitely cause Ajax errors, because once the browser has unloaded the page, the DOM elements no longer exist, so the Ajax code runs into issues.
Kristof had seen this issue before and pointed me to a few links that discuss it:
So, things were looking good. We had figured out the problem (or so I thought), and I went ahead and changed all the links back to use #. I also found that if the click handler returns false, the window won't jump to the top, so we get the best of both worlds here. The last thing to do was some thorough testing on our test server before going live with the new code.
All my hopes and dreams came crashing down. I had upgraded to the latest framework version. We had found a problem in IE with window unloading and fixed it, and it was STILL freezing!! Ahhhhhhh!
Kristof had to go to bed at this point, so we decided we'd pick it up again the next day. In the meantime, I asked shadowc to take a look at it, and sure enough, it froze for him as well, and he said, "definitely a js issue". I asked him if he wanted to look into it, and he said sure. The more eyes and brains on this problem, the better, because I was clean out of ideas.
I set him up with access to our test server, and with the help of Midnight Commander, he started poking around the files. I pointed him in the direction of the JS files he should be looking at, and he started doing some analysis on them. While he was looking around, I continued to play with Chess Mentor to see if I could notice any patterns, and sure enough, I stumbled upon one! It was a huge breakthrough! If you can consistently reproduce a problem, that in my mind is at least 60% of the battle towards fixing a bug. I told shadowc, and sure enough, he was able to reproduce the bug, so now we had a point at which we could attack the problem. Super excited, adrenaline flowing! Things were looking up again! What a roller-coaster of emotions!
He started analyzing the XML string in the Ajax call to see if there was a problem there. He emailed it to me and suggested that it might be a weird "special character" in the string. Uh oh... this brought back memories. I've seen this before, where weird characters like the curly quote can break the XML (but usually only in IE). So, feeling optimistic we had found the problem (and a bit depressed I hadn't thought of this before), I went and removed the special character, and we tried to make it freeze again, and sure enough, no freeze! We had found the problem! Unfortunately, all these special characters were mixed throughout dozens of columns/tables in the database and had come across in our export from Access to MySQL a long time ago.
I decided to shut down the website, export the database to a text file, then do some search and replacing on these special chars, then reimport the database. Many of you may have noticed the site was down for about 25 minutes during this process. After I finished that, I was SOOO confident we had finally fixed Chess Mentor for good!
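For other developers facing the same cleanup, the search-and-replace step looks roughly like the sketch below. It is Python rather than whatever I ran against the dump, and the list of characters is an assumption: the curly quotes are what bit us, while the dashes and ellipsis are other common leftovers from Office-style exports that you may or may not have.

```python
# Assumed mapping of "special" punctuation to plain ASCII equivalents.
# Only the curly quotes are confirmed troublemakers; the rest are guesses.
REPLACEMENTS = {
    "\u2018": "'",    # left single curly quote
    "\u2019": "'",    # right single curly quote / apostrophe
    "\u201c": '"',    # left double curly quote
    "\u201d": '"',    # right double curly quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
}
_TABLE = str.maketrans(REPLACEMENTS)


def to_plain_punctuation(text: str) -> str:
    """Replace each mapped special character with its ASCII stand-in."""
    return text.translate(_TABLE)
```

Run over an exported text dump before reimporting, this turns a lesson string such as `“Nf3” is White’s best move…` into `"Nf3" is White's best move...`. Doing the same replacement on the way in (whenever new lesson text is saved) would keep the characters from creeping back.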
About this time (it was now Friday), Kristof was back online, and the first thing he said to me when I woke up was: "good morning.... i must be a pain for you... last thing you hear, first thing you hear. experienced lots of timeouts today on production"
NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO! What a sinking feeling of absolute and utter despair! Twice now I thought I had licked this problem and we're STILL having freezing problems. My wife can't stand to be around me any more because I'm so angry at the world, at the internet, but above all, angry at Bill Gates, Microsoft, and that terrible IE browser!!
Kristof confirms that the freezing only happens on IE, and he says that he can reproduce it pretty well. He tells me to make a move, then click back, wait 3 seconds, click continue, wait 3 seconds, click back, wait 3 seconds, click continue, and so on. Wow, talk about a mysterious process to reproduce a bug. Anyhow, after several tries I gave up; it never froze for me.
I added even more logging to the server and had him reproduce it some more while I watched the log files, but there were still no clues as to what was going on. Super weird and frustrating. So what next... hmmmm.
Finally I decided we had better try changing some Apache settings to see what happens, and who's our #1 Apache ace? Igor, of course. First we tried turning off compression: no luck. Next we tried changing the keepalive timeout from 3 seconds to 60 seconds (yes, I did say 3 seconds). Sure enough, that made the problem pretty much go away, or at least, to reproduce it now, Kristof had to do his button clicking 60 seconds apart. If you want to learn more about what keepalive does and what it's used for, go to:
So essentially what was happening with the freezes was that if you send your Ajax request in IE at the exact moment that the server kills your idle connection, the signals "get crossed" and the Ajax request is never really received by the server, because that connection is now closed. It only causes an issue if both ends send their signals at the same time, each before it has received the other side's new signal/status. For me, it was impossible to reproduce because I'm very close geographically to the servers, so the window for that overlap is tiny, but Kristof in Belgium is very far away. Amazing, huh??!!
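To put rough numbers on why distance matters, here is a back-of-the-envelope model in Python. It is a simplification I'm assuming for illustration, not something we measured: the server closes an idle connection exactly at the keepalive timeout, and both the close notice and the request take the same one-way latency to cross the network. The latency figures in the usage note are hypothetical.

```python
def request_is_lost(idle_seconds: float,
                    keepalive_timeout: float,
                    one_way_latency: float) -> bool:
    """Return True if a request sent after `idle_seconds` of silence hits the race.

    Simplified model (an assumption, not measured behavior):
    - the server closes an idle connection exactly at `keepalive_timeout`;
    - the close notice reaches the browser `one_way_latency` seconds later;
    - a request takes `one_way_latency` seconds to reach the server.
    """
    request_arrives_at = idle_seconds + one_way_latency
    browser_learns_of_close_at = keepalive_timeout + one_way_latency
    server_already_closed = request_arrives_at > keepalive_timeout
    browser_still_thinks_open = idle_seconds < browser_learns_of_close_at
    return server_already_closed and browser_still_thinks_open


def danger_window_seconds(one_way_latency: float) -> float:
    """Width of the idle-time window in which a request can be lost."""
    return 2 * one_way_latency
```

With a 3-second timeout and a made-up 100 ms one-way latency, `request_is_lost(2.95, 3.0, 0.1)` is true: the click goes out just before the browser hears about the close and arrives just after the server has hung up. With a made-up 5 ms latency the same click gets through, because the danger window shrinks from 200 ms to 10 ms. In this model, raising the timeout to 60 seconds doesn't remove the window, it just moves it to around the 60-second mark, which matches what Kristof saw.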
So sure enough, we turned off keepalive and the problems went away altogether! Turning it off isn't an issue for us because we use a separate asset server to serve all our images/JS/CSS files, so we don't really need keepalive.
Woooo hooooo! Problem solved! No more Chess Mentor freezes!!!
*** A special thanks to all those involved in helping to fix this bug:
(In the process of writing this, however, 5 more bugs have been reported to me and I'm way behind. Gotta run!!!)