Saturday, January 23, 2010

Hard Drive Partition and version control of UC Appliance

Active and Inactive Version

Many Cisco Unified Communication appliances (CUCM, CUPS, CER, CUMA, etc.) share the same OS (Cisco customized Linux).

For maintenance purposes, you may install two copies (versions) of the system on the hard drive. Cisco calls them the "active version" and the "inactive version".

CLI commands:

show version active
show version inactive
utils system switch-version

Please note: "active" and "inactive" are relative. When you switch versions (utils system switch-version), the "active" version becomes "inactive" and vice versa.

If you're a Windows guy, you should be familiar with C:\boot.ini.

If you're a Linux guy, you should be familiar with grub.conf.

A Cisco UC appliance controls which version to boot from in the same way.

Partitions

The two copies of the software are installed into two partitions: / (referred to as Partition A) and /partB (referred to as Partition B). Whenever you use "utils system switch-version", the active partition becomes inactive and the inactive partition becomes active.
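The relative nature of "active" and "inactive" can be illustrated with a toy model. This is only a sketch: the Appliance class, its method names, and the version strings are hypothetical and are not how the appliance is implemented.

```python
from dataclasses import dataclass


@dataclass
class Appliance:
    """Toy model of a UC appliance with two installed software copies."""
    versions: dict       # partition name -> installed version (hypothetical values)
    active: str = "A"    # which partition is currently active

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def show_version_active(self) -> str:
        return self.versions[self.active]

    def show_version_inactive(self) -> str:
        return self.versions[self.inactive]

    def switch_version(self) -> None:
        # The partitions themselves do not move; only their roles swap.
        self.active = self.inactive


box = Appliance(versions={"A": "7.1.3", "B": "7.1.5"})
before = (box.show_version_active(), box.show_version_inactive())
box.switch_version()
after = (box.show_version_active(), box.show_version_inactive())
print(before, after)
```

Switching a second time returns the box to its original state, which is why falling back to the old version is easy.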

Upgrade and Switch Version

The word "upgrade/patch" has a different meaning in the Cisco UC world. Instead of "replacing" files, the "upgrade/patch" process actually installs a full copy of the system in the inactive partition. This has two benefits:

1) You may perform an upgrade/patch during production hours.
2) It's easy to fall back to the old version.

Scenario:
You have 6 CUCM servers in the cluster. Let's say upgrading each server takes about 2 hours. Your business does not allow any downtime during business hours.

Questions:
Q1: How long does it take to upgrade the whole cluster?
Q2: How much time will you have to spend after hours on the upgrade?

Answers:
A1: About 4 hours (2 + 2)
A2: A few minutes (usually less than 10).

Explanation:

1. You need to finish the upgrade on the CUCM publisher before you can upgrade the subscribers. Upgrading the publisher takes 2 hours. The new code is installed into the inactive partition, and you don't have to switch to the new version right after the install, so you can do this during business hours.

2. Once the new version has been installed on the publisher (even though it's in the inactive partition), you may start the upgrade on all subscribers simultaneously. This takes about 2 hours because the subscribers upgrade in parallel. Again, you don't have to switch to the new version right after the install, so you can do this during business hours.

3. After hours, use the "utils system switch-version" command to switch all boxes to the new version. This usually takes less than 10 minutes.

However, there's a catch: if you make any configuration changes after the point of upgrade, those changes won't be reflected in the new version. For example, suppose you performed the upgrade at 10AM but didn't switch to the new version until 6:30PM. Any configuration changes made between 10AM and 6:30PM will be lost.
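The arithmetic behind the scenario, and the window in which configuration changes are lost, can be sketched as follows. The function names and the sample changes are made up for illustration, and the sketch assumes the cut-off is the moment the upgrade starts.

```python
from datetime import datetime


def cluster_upgrade_hours(publisher_hours, subscriber_hours):
    """Publisher is upgraded first; subscribers then upgrade in parallel."""
    return publisher_hours + max(subscriber_hours, default=0)


def lost_changes(changes, upgrade_started, switched):
    """Names of changes made after the upgrade started but before the switch."""
    return [name for when, name in changes if upgrade_started <= when < switched]


# 1 publisher + 5 subscribers, 2 hours each (numbers from the scenario).
total = cluster_upgrade_hours(2, [2, 2, 2, 2, 2])

day = datetime(2010, 1, 22)                       # arbitrary example date
upgrade_started = day.replace(hour=10)            # 10AM
switched = day.replace(hour=18, minute=30)        # 6:30PM
changes = [                                       # hypothetical changes
    (day.replace(hour=9), "add phone"),
    (day.replace(hour=14), "change route pattern"),
]
lost = lost_changes(changes, upgrade_started, switched)

print(total)  # 4
print(lost)   # ['change route pattern']
```

The 9AM change survives because it was made before the upgrade copied the configuration; the 2PM change exists only in the old version.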

Under the hood of "utils system switch-version"

What actually happens when you type the command "utils system switch-version"?

1) It modifies the /grub/boot/grub/grub.conf file to make the other partition active.
2) It synchronizes UFF (User-Facing Features) data to the other version. UFF refers to Call Forwarding, MWI (Message Waiting Indicator), etc.
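To give a feel for step 1, here is a minimal sketch of flipping the boot default in a GRUB-legacy style configuration. The post does not show the real contents of the appliance's grub.conf, so the sample text and the assumption that the switch is a simple default=0/default=1 toggle are purely illustrative.

```python
import re


def toggle_default(grub_conf: str) -> str:
    """Flip a GRUB-legacy 'default=N' line between entry 0 and entry 1."""
    def flip(match):
        return f"default={1 - int(match.group(1))}"

    new_text, count = re.subn(
        r"^default=([01])[ \t]*$", flip, grub_conf, flags=re.MULTILINE
    )
    if count != 1:
        raise ValueError("expected exactly one default=0 or default=1 line")
    return new_text


# Hypothetical two-entry configuration, NOT copied from a real appliance.
sample = """default=0
timeout=5
title Partition A
title Partition B
"""

switched = toggle_default(sample)
print(switched.splitlines()[0])  # default=1
```

The sketch works on a string rather than on the file itself, so it can be tried safely without touching a real system.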

If the system fails to switch versions, here are some options:

Option 1: Try "utils system switch-version nodatasync"
This turns off the UFF data sync action.

Option 2: Use "Recovery CD" (downloadable from CCO) to switch version.

Option 3: If you're a Linux guy, it shouldn't be too difficult for you to get access to /grub/boot/grub/grub.conf.

OCS 2007 R2 on Windows 2008 R2 with SQL 2008

Neither Windows 2008 R2 nor SQL 2008 is supported by OCS 2007 R2. But if you really want to do it, here are some tips:

Use SQL 2008 R2

SQL 2008 will fail to install on Windows 2008 R2. Instead of trying to 'fix' it, you may just use SQL 2008 R2.

Pool Creation Failure

a) Manually create the rtcconfig database (assuming you know how to use SQL Management Studio).

b) Manually run the script from the OCS installation CD:
D:\Setup\amd64\DbSetup>cscript.exe poolcfgdbsetup.wsf /clean /sqlserver:rcdn /serverrole:EP /verbose

c) Continue with the GUI install.

Error: "Not available: IIS 6 Management Compatibility and IIS Windows Authentication role services must be installed before you Deploy Server."

Solution: Install all IIS 6 compatibility options and all authentication options.

Error: "The Windows Media Format Runtime is required in order to install this component. Installing the Windows Media Format Runtime may require a system restart to complete the installation. Click OK to continue with the installation."

Solution: Install the Desktop Experience feature.

Error: "[0xC3EC78D8] Failed to read the Office Communications Server version information. This can happen if the computer clock is not set to correct date and time."

Solution: Uninstall MS Crypto API security update KB974571

Friday, January 22, 2010

The art of troubleshooting

Since I joined TAC, I've been the top case solver for 16 consecutive quarters, no matter what technology group I worked in. I'd like to share some tips on troubleshooting.

Understand the user's expectation

Remember the old joke about the user who called IT support and said the "cup holder" on his computer had stopped working? It turned out to be the CD drive. He insisted he'd been using it that way for years.

Understanding the user's expectation helps you determine whether you should educate the customer or troubleshoot.

Keep it simple

Which one is easier? Troubleshoot a light switch or troubleshoot a space shuttle?

Multiple-system integration adds complexity to the problem. You should try to simplify it as much as possible.

For example: when a PSTN call comes in, it hits the Unity Auto Attendant. The caller presses 2 to transfer to the sales queue, which is a CTI route point handled by UCCX. If no one answers the call, it should go into a voicemail box dedicated to the sales department. Instead of going into voicemail, the caller just hears "transferring..." repeated.

In this case, we have too many elements in the picture: Unity, UCCX, CUCM, voice gateway, service provider. Instead of troubleshooting from end to end, we should troubleshoot it segment by segment:

1) What if we call the sales agent directly? If he doesn't answer, does the call go into voicemail? (get Unity Auto Attendant and UCCX out of the picture)
2) What if we bypass the Unity Auto Attendant and call the UCCX route point directly? Does it work properly? (get Unity Auto Attendant out of the picture)
3) What if we make a test call from an internal phone? Is the problem the same? (get PSTN and voice gateway out of the picture)
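The segment-by-segment idea can be sketched as a small set calculation. This is only an illustration: the test outcomes below are invented, and the sketch assumes a single fault and that any element seen in a passing test is healthy, which real systems do not always honor.

```python
def suspects(tests):
    """tests: list of (elements_in_call_path, passed).

    Under a single-fault assumption, the faulty element must appear in
    every failing test and in no passing test.
    """
    failing = [set(path) for path, passed in tests if not passed]
    passing = [set(path) for path, passed in tests if passed]
    if not failing:
        return set()
    common = set.intersection(*failing)
    cleared = set().union(*passing)
    return common - cleared


# Hypothetical outcomes for the scenario above.
results = [
    # Original end-to-end PSTN call: fails.
    ({"gateway", "Unity AA", "UCCX", "CUCM", "voicemail"}, False),
    # Internal call straight to the UCCX route point: still fails.
    ({"UCCX", "CUCM", "voicemail"}, False),
    # Internal call straight to the agent, no answer: reaches voicemail.
    ({"CUCM", "voicemail"}, True),
]

print(suspects(results))  # {'UCCX'}
```

Each extra test that removes elements from the path shrinks the set of suspects, which is the whole point of simplifying.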

Other tips to make things simple during troubleshooting:
1) Use default settings. For example, use a "vanilla" Windows (freshly installed from the Microsoft CD) instead of a "corporate customized" image.
2) Test on the LAN instead of over VPN (again, decrease the number of elements)
3) Always assume the system is case sensitive (err on the safe side)

Find a reference point

If a piece of software doesn't work for one user but works for another, use the working one as a reference point and find out the difference.

Of course there are many differences between two users, such as their wife and kids. :) But we should look at the most relevant ones.

Most software nowadays follows the "client-server" model. The most relevant differences are the account and the computer. Switch the computer (or switch the account) to see if the problem follows the computer or the account. If it follows the computer, it might be a network or computer setting (client side). If it follows the account, it might be a configuration issue (most likely server side).
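The swap test boils down to a small decision table. The sketch below is one way to write it down; the function and parameter names are made up for illustration.

```python
def diagnose(bad_account_on_good_pc_fails: bool,
             good_account_on_bad_pc_fails: bool) -> str:
    """Interpret the two swap tests described above."""
    if bad_account_on_good_pc_fails and not good_account_on_bad_pc_fails:
        return "follows the account: check configuration (most likely server side)"
    if good_account_on_bad_pc_fails and not bad_account_on_good_pc_fails:
        return "follows the computer: check network or computer settings (client side)"
    if bad_account_on_good_pc_fails and good_account_on_bad_pc_fails:
        return "follows both: there may be more than one problem"
    return "follows neither: the problem may need that account on that computer"


print(diagnose(True, False))
print(diagnose(False, True))
```

The last two branches are not covered in the text above; they are included only so that every combination of results gets an answer.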

Understand positive and negative results of a test

e.g.

"Dad, I couldn't find any Easter eggs in the backyard!". Does that mean there are no eggs there?

"Dad, I found some Easter eggs in the backyard!". That means there are some eggs there.