<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>Rheta’s World &#187; Asides</title>
	<atom:link href="http://rhetashan.name/category/asides/feed/" rel="self" type="application/rss+xml" />
	<link>http://rhetashan.name</link>
	<description>Blogging Rheta Shan’s Second Life</description>
	<pubDate>Fri, 09 Oct 2009 22:01:54 +0000</pubDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Fair warning</title>
		<link>http://rhetashan.name/2008/01/22/fair-warning/</link>
		<comments>http://rhetashan.name/2008/01/22/fair-warning/#comments</comments>
		<pubDate>Tue, 22 Jan 2008 17:21:46 +0000</pubDate>
		<dc:creator>Rheta Shan</dc:creator>
		
		<category><![CDATA[Asides]]></category>

		<category><![CDATA[jokes]]></category>

		<category><![CDATA[navelgazing]]></category>

		<guid isPermaLink="false">http://rhetasworld.wordpress.com/2008/01/22/fair-warning/</guid>
		<description><![CDATA[Just a bumper sticker for my blog, alas more appropriate than funny.]]></description>
			<content:encoded><![CDATA[<p>Seeing I just managed to mangle <span class="effect" style="text-decoration:line-through;">Taturo</span> <span class="effect" style="text-decoration:line-through;">Trateru</span> Tateru Nino’s name in <a href="/2008/01/22/second-life-guess/">my last post</a>, I thought I’d better put this up:</p>
<p><a href="http://www.justsayhi.com/bb/stickers"><img src="http://wp-uploads.rhetashan.name/typos-ahead.jpg" alt="Typos ahead" title="typos-ahead" width="175" height="80" class="aligncenter size-full wp-image-586" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://rhetashan.name/2008/01/22/fair-warning/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Happy Holidays and a Happy New Year</title>
		<link>http://rhetashan.name/2007/12/23/happy-holidays-and-a-happy-new-year/</link>
		<comments>http://rhetashan.name/2007/12/23/happy-holidays-and-a-happy-new-year/#comments</comments>
		<pubDate>Sun, 23 Dec 2007 22:50:06 +0000</pubDate>
		<dc:creator>Rheta Shan</dc:creator>
		
		<category><![CDATA[Asides]]></category>

		<category><![CDATA[greetings]]></category>

		<category><![CDATA[personal]]></category>

		<category><![CDATA[photos]]></category>

		<guid isPermaLink="false">http://rhetasworld.wordpress.com/2007/12/23/happy-holidays-and-a-happy-new-year/</guid>
		<description><![CDATA[I have added a seasonal greeting to my modest Flickr stream. Hope you like it :)
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/21800297@N08/2132046228/" title="Happy Holidays all" class="flickr-image"><img src="http://farm3.static.flickr.com/2291/2132046228_c9a1cdd4f8_s.jpg" alt="Happy Holidays all" class="alignleft" /></a>I have added <a href="http://www.flickr.com/photos/rhetasworld/2132046228/">a seasonal greeting</a> to my modest <a href="http://www.flickr.com/photos/rhetasworld/" rel="me">Flickr stream</a>. Hope you like it :)</p>
]]></content:encoded>
			<wfw:commentRss>http://rhetashan.name/2007/12/23/happy-holidays-and-a-happy-new-year/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Gasp</title>
		<link>http://rhetashan.name/2007/12/06/gasp/</link>
		<comments>http://rhetashan.name/2007/12/06/gasp/#comments</comments>
		<pubDate>Thu, 06 Dec 2007 00:11:53 +0000</pubDate>
		<dc:creator>Rheta Shan</dc:creator>
		
		<category><![CDATA[Asides]]></category>

		<category><![CDATA[jokes]]></category>

		<category><![CDATA[lindenlab]]></category>

		<guid isPermaLink="false">http://rhetasworld.wordpress.com/2007/12/06/gasp/</guid>
		<description><![CDATA[An excuse for an apology, or maybe an apology for an excuse.]]></description>
			<content:encoded><![CDATA[<p>I really, really will update this blog real soon now, any minute… uh… well, as soon as possible… some time. Duh. <abbr class="allcaps initialism" title="Real Life">RL</abbr> is a bit mad right now, so while the blog topics and snippets pile up in my del.icio.us closet, I’ll try to keep everybody (if that is anybody but me) distracted by pointing to my contribution to <a href="http://nicholaz-beresford.blogspot.com/2007/11/secondlifelinden-jokes.html">Nicholaz Beresford’s call for Linden jokes</a>, also published <a href="/2007/12/01/linden-lightbulb/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://rhetashan.name/2007/12/06/gasp/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Linden Lightbulb Re-Deploy Post Mortem</title>
		<link>http://rhetashan.name/2007/12/01/linden-lightbulb/</link>
		<comments>http://rhetashan.name/2007/12/01/linden-lightbulb/#comments</comments>
		<pubDate>Sat, 01 Dec 2007 14:56:30 +0000</pubDate>
		<dc:creator>Rheta Shan</dc:creator>
		
		<category><![CDATA[Asides]]></category>

		<category><![CDATA[jokes]]></category>

		<category><![CDATA[lindenlab]]></category>

		<guid isPermaLink="false">http://rhetasworld.wordpress.com/jokes/</guid>
		<description><![CDATA[My contribution to Nicholaz Beresford’s call for Linden lightbulb jokes. With a little help from Joshua Linden.]]></description>
			<content:encoded><![CDATA[<p><em>This one was my contribution to <a href="http://nicholaz-beresford.blogspot.com/2007/11/secondlifelinden-jokes.html">Nicholaz Beresford’s call for Linden lightbulb jokes</a>…</em></p>
<h2>Post mortem</h2>
<p>The Linden Lightbulb 1.18.5 release included updates for several systems, including new carbon filament libraries, alloy couplings (a piece of infrastructure which handles a variety of services, such as local fixation and capabilities, and proxies current between systems), and glass geometry. The deploy as planned for November 6th did not require any downtime – all components could be updated live. We planned to perform the rollout per our patch deploy sequences: updating central rooms one by one, then offices. Read on for the day-by-day, blow-by-blow sequence of events which followed…</p>
<h3>Tuesday, November 6th</h3>
<p>Prior to the 1.18.5 Lightbulb deploy, at around midnight (all times are Pacific Standard Time) we suffered an electricity outage to our restroom facilities, which caused many systems to drop offline. The system recovered on its own after about an hour, and our electricity provider’s initial investigation pointed to hardware issues with the network infrastructure.</p>
<p>Starting at 10:00&nbsp;am we began the actual update of the lighting fixtures to the Linden 1.18.5 Lightbulbs. We started by updating the “backbone” fixtures on central facilities one by one, such as hall areas, tackling the “non risky” fixtures first. At 11:00&nbsp;am we got to the “risky” fixtures, which handle emergency lighting (i.e. show the way in case of evacuation) as well as several other key services. Closely monitoring the load on the electrical grid (which usually shows increased load when something goes wrong) as well as internal graphs which closely track the number of appliances online, we started making updates. Everything seemed to be going well.</p>
<p>Towards about 11:15&nbsp;am the various internal communication channels lit up with reports of appliance failures. We stopped updates of these central systems (7/8ths of the way through) and started to gather data. We have seen this problem in the past when hardware issues or bugs caused the grid monitoring systems to spin out of control, but this time there were no obvious failures; for unknown reasons the grid wasn’t responding to requests from the appliances. Hoping for a quick-fix (i.e. a simple configuration change that could be applied live) we spent about 30 minutes trying to determine the cause, then gave up and rolled back to the previous lightbulb generation.</p>
<p>(Fortunately, in this case, a rollback was straightforward, and simply resulted in “unknown” lighting status for about 10 minutes. Rollbacks are not always so easy – see below!)</p>
<p>Simultaneously, lighting in developer cubicles and coffee rooms failed. These were due to the update as well (but, as it turned out, for different reasons). Once the dust had settled on the rollback it was easy to roll back one more set of fixtures to restore the lights.</p>
<p>Completely unrelated to the update, the electrical load on the central systems required us to pause the Tuesday stipend payouts, delaying the payouts for several hours.</p>
<h3>Wednesday, November 7th</h3>
<p>Several Lindens continued the investigation, and determined a source of the issues seen on Tuesday: the “emergency lighting” system was updated to use eolian and solar sources to increase performance, but the capacity of these sources was set too low. After some work, we were able to replicate this failure in test environments to verify the fix. The updated bulbs were re-distributed to the fixtures making up the service, and we prepared to try again on Thursday.</p>
<p>(Little did we know that the insufficient electrical capacity was merely a symptom, not the root cause.)</p>
<h3>Thursday, November 8th</h3>
<p>On Thursday, we proceeded with the 1.18.5 Lightbulb update. The first half of the central fixtures were updated by 12:00&nbsp;pm. We paused to ensure that the system was behaving as expected, then continued at about 12:30&nbsp;pm completing the updates. Shortly thereafter, as the number of online lights in the building passed 46,000, the lighting began failing in a new way. Although most of Linden Lab was functioning properly, many light fixtures were slow to go on or failed to light altogether, and some other appliances failed as well. We diagnosed the problem as an unrecognized dependency – the central transformers were assuming that the fuses would shut down on overload, but the fuse circuits (which had not yet been updated) were assuming the transformers would throttle down instead. Once this root cause was identified (by about 2:15&nbsp;pm) we were able to change the breaker code in the central transformers’ controllers to resume throttling current consumption, since that was a faster fix. Restarting the transformers did cause employees to sit in the dark for a short period of time, which was unexpected (and is being investigated). Starting after 3&nbsp;pm we initiated a rolling restart to update the electrical grid as well to complete the update, a process which took about 5 hours. During a rolling restart, in order to reduce electricity consumption and load on central systems, the service is in an unusual state – employees are not allowed to put lights or appliances back on in case of a crash. There was anecdotal evidence that some floors were crashing a lot, but we were unable to verify that this was not simply due to bad hardware until after the process was complete.</p>
<p>After the post-roll cleanup, it became clear that this was not an anomaly. A few contingency plans were discussed, including rollbacks for specific floors, but we were primarily in a data-gathering phase.</p>
<h3>Friday, November 9th</h3>
<p>As sleepy Lindens stumbled back into work, one incorrect (but ostensibly harmless) idea was tried; unfortunately, due to a typo, this accidentally knocked many employees off the electrical grid entirely around 9:40&nbsp;am. Shortly thereafter, more testing including complete rollbacks on simulator offices showed that the new transformer controller code was indeed the culprit, but it took a while longer to identify the cause. By 12:00&nbsp;pm the investigation had turned up a likely candidate – and an indication that a simple widespread rollback of the code would not, in fact, be safe or easy!</p>
<p>The crashing was caused by the transformer “message queue” getting backed up. A server-to-viewer message (related to the grid emergency control system) was updated and changed to move over <abbr class="allcaps initialism" title="Transmission Control Protocol">TCP</abbr> (reliable, but costly) instead of <abbr class="allcaps initialism" title="User Datagram Protocol">UDP</abbr> (unreliable, but cheap and fast). On floors with many appliances and lights, this would cause the grid to become backed up (storing the “reliability” data) and eventually crash. We have a switchboard that allows us to toggle individual messages from <abbr class="allcaps initialism" title="Transmission Control Protocol">TCP</abbr> to <abbr class="allcaps initialism" title="User Datagram Protocol">UDP</abbr> on the fly, but while testing we discovered a second issue – another circuit necessary for the <abbr class="allcaps initialism" title="User Datagram Protocol">UDP</abbr> channel needed to be updated, and it could not be changed on the fly, and if we flipped the switch back from <abbr class="allcaps initialism" title="Transmission Control Protocol">TCP</abbr> to <abbr class="allcaps initialism" title="User Datagram Protocol">UDP</abbr> the transformer would crash. (The <abbr class="allcaps initialism" title="User Datagram Protocol">UDP</abbr> to <abbr class="allcaps initialism" title="Transmission Control Protocol">TCP</abbr> update on-the-fly worked, which is how we were able to do the rolling restart in the first place.)</p>
<p>By testing on individual floors, we were able to confirm that by switching back to <abbr class="allcaps initialism" title="User Datagram Protocol">UDP</abbr> the problem was eliminated, although this required cutting off all electrical current before throwing the switch. We co-opted an existing engineer for “host-based” rolling restarts (which he had been employed for once in the past), and had him shut down offices on each floor (doing several in parallel), update the breaker circuits, and restart the transformers. After significant testing, we asked this engineer to perform another rolling restart of the service, which was completed by 11&nbsp;pm on Friday, including subsequent cleanup.</p>
<h3>Saturday, November 10th</h3>
<p>Unrelated to the deploy (but included here to clear up any confusion), on Saturday at 5:20&nbsp;pm we suffered another electrical outage, which resulted in hundreds of developers being offline for just under two hours. The cause was the expiration of a contract renewal term with our electricity provider. We extended the contract, and our <abbr class="allcaps initialism">DNOC</abbr> team brought the affected floors back up.</p>
<h3>What Have We Learned</h3>
<p>Readers with technical backgrounds have probably said “Well, duh…” while reading the above transcription. There are obviously many improvements that can be made to our tools and processes to prevent at least some of these issues from occurring in the future. (And we’re hiring operations and release engineers and developers worldwide, so if you want to be a part of that future, head on over to the <a href="http://lindenlab.com/employment">Linden Lab Employment page</a>.)</p>
<p>Here are a few of the take-aways:</p>
<ul>
<li>Our load testing of systems is insufficient to catch many issues before they are deployed. Although we have talked about janitors and in-house technicians as a way to roll out changes to a small number of offices to find issues before they are widely deployed, this will not allow us to catch problems on central systems. We need better monitoring and reporting; our reliability track record is such that even problems such as electricity failures for 1/16th of employees aren’t noted for a significant period of time.</li>
<li>When problems are detected, we don’t do a good enough job internally in communicating what changes went into each release at the level of detail necessary for first responders to be most effective.</li>
<li>Our end-to-end deployment process takes long enough that responding to issues caused during the rollout is problematic.</li>
<li>Our tools for managing deploys have not kept pace with the scale of the service, and manual processes are error prone.</li>
<li>Track date-driven work (e.g. contract renewal expiry) more closely; build pre-emptive alerts into the system if possible.</li>
<li>Be more skeptical about doing updates while the office is live, especially when involving third-party providers.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://rhetashan.name/2007/12/01/linden-lightbulb/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Greatest Thing</title>
		<link>http://rhetashan.name/2007/11/29/the-greatest-thing/</link>
		<comments>http://rhetashan.name/2007/11/29/the-greatest-thing/#comments</comments>
		<pubDate>Thu, 29 Nov 2007 16:20:01 +0000</pubDate>
		<dc:creator>Rheta Shan</dc:creator>
		
		<category><![CDATA[Asides]]></category>

		<category><![CDATA[love]]></category>

		<category><![CDATA[personal]]></category>

		<category><![CDATA[poetry]]></category>

		<guid isPermaLink="false">http://rhetasworld.wordpress.com/2007/11/29/the-greatest-thing/</guid>
		<description><![CDATA[Just a love poem, luckily not my own.]]></description>
			<content:encoded><![CDATA[<blockquote><p>There was a girl<br />
A very strange, enchanted girl<br />
They say she wandered very far<br />
Very far, over land and sea<br />
A little shy and sad of eye<br />
But very wise was she</p>
<p>And then one day,<br />
One magic day she passed my way<br />
While we spoke of many things<br />
Fools and Kings<br />
This she said to me</p>
<p>The greatest thing you’ll ever learn<br />
Is just to love and be loved in return.</p></blockquote>
<p>Thank you, <a href="http://en.wikipedia.org/wiki/Eden_Ahbez">Eden Ahbez</a>, for words ringing so true.</p>
]]></content:encoded>
			<wfw:commentRss>http://rhetashan.name/2007/11/29/the-greatest-thing/feed/</wfw:commentRss>
		</item>
		<item>
		<title>To absent friends,</title>
		<link>http://rhetashan.name/2007/09/26/to-absent-friends/</link>
		<comments>http://rhetashan.name/2007/09/26/to-absent-friends/#comments</comments>
		<pubDate>Wed, 26 Sep 2007 17:45:00 +0000</pubDate>
		<dc:creator>Rheta Shan</dc:creator>
		
		<category><![CDATA[Asides]]></category>

		<category><![CDATA[love]]></category>

		<category><![CDATA[personal]]></category>

		<category><![CDATA[poetry]]></category>

		<guid isPermaLink="false">http://rhetasworld.wordpress.com/2007/09/26/to-absent-friends/</guid>
		<description><![CDATA[Missing your beloved. With thanks to Neil Gaiman.]]></description>
			<content:encoded><![CDATA[<blockquote><p>lost lovers,<br />
old gods<br />
and the season of mist;<br />
and may each and every one of us<br />
always give the devil his due.</p></blockquote>
<p>You never know how much you love somebody until they are gone.<br />
You never know how much it hurts until it happens to you.</p>
]]></content:encoded>
			<wfw:commentRss>http://rhetashan.name/2007/09/26/to-absent-friends/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

