From disappearing SharePoint searches to Exchange outages, IT pros tell of Microsoft product mishaps and how they got fixed.
Did you know that today is National Systems Administrator Appreciation Day? If your answer is no, shame on you.
Hard-working IT soldiers deserve a day of recognition, so why not July 30? Life on the IT front has its share of bad days, when servers blow up and the CEO's computer gets a virus. To honor sysadmins everywhere, Azaleos, a managed services provider for Microsoft, has launched a "Bad IT Days" story contest. The IT pro with the worst (or should I say best) story of a disaster involving Microsoft products, and of how the problems were fixed, wins an iPad!
OK, so Azaleos isn't exactly being kind to partner Microsoft here (an iPad! Really?), but the contest is all in good fun, cringe-worthy tales notwithstanding.
So without further ado, here are the contest’s top five hard-luck stories from IT professionals involving Exchange, SharePoint, Active Directory and even BlackBerry Enterprise Server.
Due to contest rules, only the entrants’ first names and last initials were made available.
5. SharePoint Search Lost and Found — Paul B.
My company was working with a fairly large SharePoint 2007 environment: approximately 3,000 users across eight sites. I was the lead SharePoint administrator and in charge of all aspects of the farm. One day we began experiencing serious problems with SharePoint Search. A search job that normally took an hour was now running for more than five hours and then simply hanging.
I spent a full day troubleshooting before turning to the Microsoft Support Service phone line for assistance. Working with Microsoft, we collectively accrued another 130 hours of phone support time and escalated the issue to the most senior Microsoft support engineers. No matter what we tried, we could not get search to work.
Finally, I happened to be talking with one of my IT colleagues about our VMware virtualization setup. When I mentioned the problems I was having, he said he didn't think SharePoint should be having those sorts of problems, especially running virtualized on top of VMotion. That's when it hit me! First of all, I hadn't even known that my colleague had set up my farms on top of VMotion!
Once I learned this, I quickly confirmed what had been happening. When a new search kicked off, things would run as planned and the initial search results would start to pour back. However, because SharePoint Search puts such a heavy load on the servers, VMotion would detect that load and move the SharePoint servers to a new virtual guest instance. The move broke the connection between the search and the databases being searched, so the search could not continue. Once we removed SharePoint from the VMotion platform, we were all systems go.
4. CEO Calling: Where’s My E-mail? — Martha K.
One morning my CEO called to say that he was not able to receive mail on his BlackBerry. I looked into it and found that the RADIUS server was down in the subsidiary office where the CEO was located, and VPN was not an option given the network troubles there at the time.
So I talked the very new and very green local sysadmin through checking the BES [BlackBerry Enterprise Server] for SRP connectivity and the CEO's last contact time. While I was talking him through the SRP test, he announced that his Exchange e-mail had gone offline. I asked him how he knew, and he responded, "I just rebooted the Exchange server." I then used a third-party screen-sharing application to take control of his local box so I would have access to the environment. It took me four hours to get him out of a degraded state. That's four hours of him asking, 'Why are you clicking that? What does that do? Are you sure that will fix this? How long will this take?'
The best part? The CEO’s BlackBerry wasn’t receiving mail … because it was turned off.
3. Executive Wireless Privilege — Adam S.
A couple of years back, my network was severely hacked by an outsider who deleted the main Exchange message store. Firewall logs had gotten the local IT admin nowhere, so we were called in to do a little snooping around. I wish I'd thought of it, but another guy on the team had the sense to run a wireless utility, and he found a wide-open Linksys wireless access point in about six seconds.
The internal admin insisted there was no wireless running anywhere on the network. It took some legwork, but about 20 minutes later we found the rogue access point in a senior exec's office. It seems he saw how cheap they were at the local CompUSA and plugged one into the secondary network port in his office so he could use his notebook's wireless instead of the wired connection, because no wires "looks better." Once we found the leak, we were able to patch it up and get Exchange running again.
Needless to say, we instituted much tighter controls on how peripherals could be used and let employees know what is and isn’t acceptable!
2. Outage Maid to Order — Harris L.
Back when I worked for a major consumer electronics manufacturer, we had an Exchange server at our site in Florida. Every Thursday evening after 6:00, it would go offline, and someone would have to drive down there and manually power it back up. It was always an ungraceful shutdown, and we could never identify the cause. Other equipment nearby was not affected. After a month of this, we decided to stage a stakeout. I went down there in the afternoon and hung around for a few hours. After most of the staff went home, the cleaners arrived to dust the desks, empty the bins and polish the floors. It turned out the cleaning crew was unplugging our server so they could use its power outlet for their equipment. When they were done, they helpfully plugged the server back in, but the damage was already done. The solution was to move our equipment to another rack and leave that power outlet to the cleaners.
One day I woke up and checked e-mail as usual before driving into the office. There were a couple of help desk requests saying that users were not getting their mapped drives. Just before leaving, I checked e-mail again and found more help desk requests: the student roster was not working, faculty members were not getting their mapped drives and staff members could not access network resources. I realized the day was destined to be a bad one.
When I reached the office, I was just in time for an IT all-hands meeting, where we found out that almost all the security groups were missing from Active Directory. We were shocked. A few minutes later, our application folks came into the meeting, and one of them told us that he had written a script that "went crazy" and inadvertently deleted the security groups. Our first thought was: who's going to hold him down, and who's going to hit him?
Our second thought was that we could simply do a complete Active Directory restore (because we had a backup) or start re-creating the groups (because we had a list of the deleted security groups). But there was another problem: we didn't have the list of members and permissions. Luckily, we had a cloned AD machine that was only a week older than the current one. We started re-creating the security groups and adding their members. When we checked the folder permissions, we found orphaned SIDs where the group names should have been. We matched each orphaned SID with the SID of the corresponding group on the cloned machine and reassigned the permissions. Man, it was a horrible day. We worked 20 hours straight without a break and were finally able to bring the situation back to normal.
Shane O’Neill is a senior writer at CIO.com. Follow him on Twitter at twitter.com/smoneill. Follow everything from CIO.com on Twitter at twitter.com/CIOonline.