Backups (an overview)
Since I’ll be rebuilding/migrating our backup environment for the rest of the year, I’ll probably talk about backups a fair bit.
Philosophically, there are 3 driving needs behind backups. When people say ‘backups’, they usually mean one of:
1) Disaster recovery. The ‘Our building was just hit by a meteor, and we need to rebuild our environment pronto’ scenario. The requirements for this are fairly straightforward: back up all the data needed to rebuild quickly if the building isn’t there tomorrow. Very data intensive, but the data grows stale very quickly. You’d want 2, maybe 3 sets worth of full data.
2) User error. The ‘Our DBAs dropped this table 3 weeks ago, but forgot about the monthly report that still requires it’ or ‘The VP of marketing blew away this file 2 months ago… is it in our backups?’ scenarios. This can be fairly data intensive, since you need to keep so many more images than Disaster Recovery requires, but you can prune out a portion of the data (the files for the OS are a waste, for example… you could also prune lots of transient data). You also need to consider what your retention period is, since that’s the big multiplier for your cost.
3) Archival. The ‘The IRS requires you to keep this class of financial records for 15 years’ scenario. The good news is this is usually a very thin subset of your data. The better news is that usually it’s just kept in the various databases, and these applications don’t rely on backups to keep the data long term. The bad news is the retention times are crazy long, and you’re often backing up this 6- and 7-year-old data multiple times.
Other topics: Backup to tape, or disks? Deduplication? NetBackup or someone else?
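To put rough numbers on the retention multiplier from the user error case, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name, the weekly-full-plus-daily-incremental schedule, the 1 TB data set and the 2% daily change rate are all made up, and it ignores compression and deduplication.

```python
def retained_storage_gb(full_gb, daily_change_rate, retention_days, full_every_days=7):
    """Rough size of everything kept on backup media for one retention window.

    Assumes one full backup every `full_every_days` days and an incremental
    on every other day, each incremental being `daily_change_rate` of a full.
    """
    fulls = -(-retention_days // full_every_days)  # ceiling division
    incrementals = retention_days - fulls
    return fulls * full_gb + incrementals * full_gb * daily_change_rate


if __name__ == "__main__":
    # Hypothetical 1 TB data set with 2% daily change.
    for days in (14, 30, 90, 365):
        total = retained_storage_gb(1000, 0.02, days)
        print(f"{days:>3} days retention: ~{total:,.0f} GB on media")
```

The point is only that the full backups dominate, so the total grows almost linearly with the retention period.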
OCFS2
OCFS2. It’s a great concept. Give admins a simple way to present a single disk/lun/volume to multiple hosts. Multiple hosts can read/write to the volume, and you’re good to go. It handles the locking, it handles almost everything for you, and just gives you a global file system without a lot of the hassles.
So, I’m setting up an FTP server cluster (don’t ask). The first problem I ran into is that OCFS2 v1.4.4 was apparently a little over-aggressive in locking around the sendfile call[1]. When user x writes a file for user y, and user y tries to pull the file while the write is still going, then if the stars align, the locking deadlocks. All future reads/writes to that file hang, and the processes stay forever in ‘D state’ (uninterruptible sleep, meaning they can’t be killed, even with kill -9). The only solution is to reboot the node. The good news is that the authors fixed the bug in the current release, v1.4.7. The better news (and what ‘resolved’ the problem for me) is that proftpd has an option to disable the use of sendfile (the UseSendfile directive).
However, the real deal killer for me with OCFS2 is that it deals with fragmentation very poorly. Since I have lots of files, and lots of turnover, this kills me. Theoretically, v1.4.7 fixes this if you can afford to rebuild the file system. Unfortunately, I can’t. (To be fair, they claim the problem is lessened in v1.4.7, and several important fixes are in the mainline kernel, but if you’re using something like CentOS/RHEL/OEL, you have to wait on those fixes.)
grrr. Great product, but not ready for everything.
[1] sendfile is a kernel-based, zero-copy way of moving files around efficiently. Several network servers (FTP and web servers, for example) use it so they’re not copying file data around in user-space memory. I know Solaris, Linux and FreeBSD support the syscall, but it isn’t in the POSIX spec, and the signature differs between platforms.
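For anyone who hasn’t run into sendfile before, here is a small sketch of what a server does with it, using Python’s os.sendfile wrapper on Linux. This is not how proftpd implements it; the helper name send_file_zero_copy and the socketpair standing in for a real client connection are assumptions made for illustration.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send the whole file at `path` over `sock` with os.sendfile; return bytes sent."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves the data from the file to the socket directly,
            # so the file contents never pass through a user-space buffer.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:  # file was truncated underneath us
                break
            sent += n
    return sent


if __name__ == "__main__":
    # A connected socket pair stands in for the server/client connection.
    server_side, client_side = socket.socketpair()
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello from sendfile\n")
        tmp.flush()
        print("bytes sent:", send_file_zero_copy(tmp.name, server_side))
    server_side.close()
    print("client received:", client_side.recv(1024))
    client_side.close()
```

The read-while-someone-else-is-writing case from the post is exactly the situation where this call and the file system’s locking have to cooperate.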
NetApp and SnapMirror
I’ve been pretty critical of NetApp recently (not here, at least not yet). But I have to admit that their SnapMirror tool is pretty sweet.
If you’re going to migrate a volume from filer X to filer Y, first you run “snapmirror on” on both filers. Then, on the destination filer, you run
snapmirror initialize -S <src_host>:<src_volume> <dst_host>:<dst_volume>
Then you wait. I was doing this over a fairly short network hop, but it still took a lot less time than I expected. 300 GB in ~20 minutes isn’t bad.
Now you have to set up the snapmirror.conf file. The only gotcha in the docs is that if you’re NOT specifying any options, you need to have a ‘-’ in that field. So my example is:
<src_host>:<src_volume> <dst_host>:<dst_volume> - 0-59/5 * * *
Which translates to syncing the volumes every 5 minutes. They also offer ‘sync-mirrors’ if need be, but that’s not going to work in my case.
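The schedule part of that line reads like cron: four fields for minute, hour, day of month and day of week. As a sanity check on what 0-59/5 expands to, here is a minimal Python sketch. It is not NetApp’s parser; the function names and the filerX/filerY hostnames are hypothetical, and it only handles ‘*’, single numbers, ranges, comma lists and ‘/step’.

```python
def expand_field(field, low, high):
    """Expand one cron-style field such as '0-59/5', '*' or '1,15' into sorted values."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return sorted(values)


def parse_entry(line):
    """Split one snapmirror.conf-style line into its seven whitespace-separated fields."""
    source, destination, options, minute, hour, dom, dow = line.split()
    return {
        "source": source,
        "destination": destination,
        "options": None if options == "-" else options,  # '-' means no options
        "minutes": expand_field(minute, 0, 59),
        "hours": expand_field(hour, 0, 23),
        "days_of_month": expand_field(dom, 1, 31),
        "days_of_week": expand_field(dow, 0, 6),
    }


if __name__ == "__main__":
    entry = parse_entry("filerX:vol1 filerY:vol1 - 0-59/5 * * *")
    print(entry["minutes"])
```

Run on the example line, the minutes field comes out as every fifth minute of the hour, with every hour and every day matching.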
It’s apparently even possible to migrate the active NFS handles, which is pretty cool too.
The only downside? SnapMirror is a licensed add-on. But it’s NetApp; what would you expect?
Cacti
So, I’m trying to graph SNMP data from my Brocade switches. Should be pretty simple, right?
Well, it’s not the easiest thing in the world… Cacti’s great if what you want to graph is already canned in there, or if there’s a good template for it. Otherwise it takes some work, but it is possible…
My goals:
1) Get a pretty graph of my Brocade traffic
2) Have 1 minute resolution on the graphs
First, you need to get Cacti going. There’s lots of docs on that. I suggest getting spine working as well[1]. On my dual-CPU desktop box I’m only polling 2 Brocade switches. Using cmd.php, it takes 45-50 seconds to grab everything. Using spine, it takes 6 seconds, without tuning thread counts and whatnot.
Second, you need to go to Cacti -> Settings -> Poller and set the Poller Interval and the Cron Interval to 1 minute. Then you need to adjust the crontab entry to run every minute. While you’re here, go to Cacti -> Settings -> General and set the default “SNMP Version” to v1[2].
Third, you need to grab the Brocade templates/queries. The best ones I found were in this post on the Cacti forums. Rename brocade_sensors_436.xml to brocade_sensors.xml, rename brocade_interfaces_166.xml to brocade_interfaces.xml, and put both files in <Cacti_web_root>/resource/snmp_queries. The cacti_host_template_brocade_fc_switch_interfacessensors_120.xml file must be imported through the Cacti GUI.
Fourth, you need to set the timings for the new Brocade template. Go to Cacti -> Data Templates -> Brocade FC DataTemplate. You’ll come back here a total of 13 times. First, you need to set the ‘Step’ to 60 (to match your one minute intervals). Click Save. Go back to “Brocade FC DataTemplate”, and for each of the twelve queries (swFCPortRxBadEofs, swFCPortRxC3Frames, etc.), you need to change the Heartbeat to 120. The problem is that once you click over to the second of the twelve, you lose your changes. So you have to set the Heartbeat to 120, then save, then go back to “Brocade FC DataTemplate”, click the next query, set the Heartbeat to 120, and then save, and so on through all 12 of them. It’s a pain.
There’s one other thing you need to fix in these templates: instead of “LAST”, you need “AVERAGE” for some of your Consolidation Functions. Go to Cacti -> Graph Templates. You need to change the Consolidation Function of “Item #1” in “Brocade FC Fan Sensor” from LAST to AVERAGE. For “Brocade FC Switch Traffic”, it’s “Item #1” and “Item #5”. And for “Brocade FC Temperature Sensors”, it’s “Item #1”. When I left these as ‘LAST’, I ended up with broken graphs.
And now you should be good to set up your graphs. I ended up creating 2 graphs per port, and organizing my graph trees so they’d be separate. The Brocade FC Switch Traffic graphs are the ones I’m most interested in… they only show the TX/RX traffic in bits/sec. The ‘normal’ “Brocade FC” graphs show Frames, Multicasts, Errors, EOFs and other things. I’ll only need those for troubleshooting.
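Why 60 and 120? Cacti stores its data in RRDtool files, where the step is the expected number of seconds between samples and the heartbeat is how long a data source may go without an update before its value is recorded as unknown. A heartbeat of twice the step tolerates one late or missed poll. The Python sketch below just builds the kind of rrdtool create arguments those settings correspond to. The helper name, the COUNTER type, the file name and the one-day RRA are assumptions for illustration, not what Cacti actually generates for this template.

```python
def rrd_create_args(rrd_path, ds_names, step=60, heartbeat=120):
    """Build an `rrdtool create` argument list for COUNTER data sources."""
    if heartbeat < step:
        raise ValueError("heartbeat should be at least one step")
    args = ["rrdtool", "create", rrd_path, "--step", str(step)]
    for name in ds_names:
        # DS:<name>:<type>:<heartbeat>:<min>:<max>; U means no upper bound.
        args.append(f"DS:{name}:COUNTER:{heartbeat}:0:U")
    # Keep one day of unconsolidated one-step samples (hypothetical retention).
    rows = 24 * 3600 // step
    args.append(f"RRA:AVERAGE:0.5:1:{rows}")
    return args


if __name__ == "__main__":
    names = ["swFCPortRxBadEofs", "swFCPortRxC3Frames"]
    print(" ".join(rrd_create_args("brocade_port0.rrd", names)))
```

It only prints the command line; nothing is created on disk.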
[1] Spine has its own tricks. If you’re a good little admin, and installing things like this to /opt or some such, you’ll need to symlink the /opt/cacti-spine-0.8.7e/etc/spine.conf file to /etc… Spine for some reason only searches a static list of paths for its config file. No one ever seems to mention the spine config file, but it needs the same info as the /include/config.php, just in a slightly different format. I also tend to link /opt/cacti-spine-0.8.7e/bin/spine to /usr/local/bin, so if I need to upgrade later, I can just move the symlink around. Next, you’ll need to go into Cacti -> Settings -> Paths and set the spine path correctly. Finally, you’ll go to Cacti -> Settings -> Poller and select ‘spine’ rather than ‘cmd.php’.
[2] It’s not ideal, but it’s a good default to set.