Friday, 8 March 2013
Recovering data with grep from Linux volumes
We have all done it.
We all should have known better.
We all know scripts should be developed in source control, not in vim on your running instance.
Finally, we were warned that “rm -f” is a powerful tool, and not to be used without adult supervision.
That feeling of dread hits you almost before you have finished pressing the return key: the part of your brain that has figured out what you are about to do just can’t get the message to your fingers in time to stop you pressing the key.
What makes this worse is that when this happens to you, Murphy’s law says that you won’t be in the office, and it will be on a production system that you can’t take down to single user mode.
What can you do?
Well, despite the odds, in this case I was able to get my file back with relatively little pain, using good old grep.
[DISCLAIMER – Your mileage may vary. If your data is critical, contract your data recovery woes out to a professional. My data loss was irritating, but not the end of the world!]
Step 1
Go to single user mode (if you can). In my example I didn’t do this, but to increase your chances you want to stop any further IO to the volume.
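For example, on a SysV-style init system (typical at the time of writing – adjust for your distro) you could drop to single user mode with:

telinit 1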
Step 2
Mount the volume as RO (if you can) – see above
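Assuming the volume is not your root filesystem, a read-only remount along these lines should do it (the device and mount point here are just placeholders):

mount -o remount,ro /dev/sdb1 /data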
Step 3
Rack your brains and try to remember the contents of the file you were working on. Think of a passage of text, or a combination of words, that you know (100%) were in the file. If it’s a script, try to remember some of your comments (you do comment your scripts… right?). Avoid commands or file paths, because they are unlikely to be unique enough – my phrase was “some trickery to find”.
Step 4
Estimate the length of your file before your phrase, and after. I went for 100 lines in both directions.
Step 5
grep the nuts out of your HD (the -a flag tells grep to treat the raw binary data as text), looking for any occurrence of your search string:
grep -a -B[lines before] -A[lines after] 'text' /dev/[your_partition] > file.txt
If you’re running this remotely, consider running it within ‘screen’ in case you get disconnected.
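To give a concrete example of steps 4 and 5 combined (the device name here is just a placeholder), and making sure the output file lands on a DIFFERENT volume so you don’t overwrite the very blocks you are trying to recover:

screen -S recovery
grep -a -B100 -A100 'some trickery to find' /dev/sdb1 > /mnt/other_volume/file.txt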
Step 6
Grab a coffee, this may take a while depending on the size of your disk
Step 7
You should now have a large text file, which will contain loads of junk, but hopefully most of your missing text.
Search through your recovered file for your string, until you find something useful.
Note: you may not find all of your text, and the first copy you find may not be the best. Make sure you work through the file thoroughly before settling on a version which only contains 50% of your code.
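A quick way to see how many copies of your phrase were recovered, and where they sit in the output file, is:

grep -n 'some trickery to find' file.txt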
As said above, this is a quick and dirty way to recover ASCII data which you may have lost through your own (or someone else’s) stupidity, and it depends on a number of factors, such as knowing a phrase to search for EXACTLY, and the IO on the volume being low enough not to have overwritten your data.
Good luck, and next time, just use “rm”
…….and source control.
Persisting changes to FileHandles with ulimit
You can check your user limits using the ulimit -a command, which will display the currently configured limits for your user account.
One of the values you are most likely to need to tweak when building large scalable systems is the number of files you can hold open at any one time, which impacts such things as concurrent sessions or TCP sockets.
You can temporarily override your ulimit parameters for the duration of your session (assuming you have permissions) using the command:
ulimit -n 2048
But when you next login, you will find your modification has reverted to defaults. To ensure your change is persisted, you need to update the values stored in:
/etc/security/limits.conf
Make sure you add BOTH a hard and soft limit, as setting one has no effect without the other!
In the case of root, you could use the following:
root hard nofile 8092
root soft nofile 4096
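After logging out and back in, you can confirm both values took effect with:

ulimit -Hn   # hard limit
ulimit -Sn   # soft limit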
AWS Elastic Load Balancing with a Static IP Address
As anyone using AWS to host their applications and services already knows, Amazon has done a great job in building a scalable and reliable cloud platform.
One of the AWS tools is the Elastic Load Balancer, which allows you to host multiple instances for scalability, or for tolerance of failures across multiple geographic locations or availability zones. As with many of the other AWS tools, this ‘just works’ and looks after things such as load balancing requests and ensuring that failed hosts are removed from your LB pool. Add to this the simple yet effective SSL offload, and if you haven’t considered using an ELB to host your app, you probably should.
There is, however, one minor problem…
If your application might require your customers to change their firewalls, you won’t be able to provide them an IP address to create a rule around.
Due to the way the ELB works, you can find the IP addresses on your LB changing without notice, perhaps several times a day. This can be a problem for enterprises which want to know specifically what IP address your server is running on.
HAProxy to the rescue.
In order to provide your customers with a ‘static’ IP address (OK, in AWS we call it an Elastic IP) you can use HAProxy to operate as a transparent SSL proxy.
In this way you can spin up an HAProxy instance, have your customers send their requests to it, and let HAProxy forward them on to your ELB.
To maximize your availability you will want to run a proxy in more than one AZ, and assign each of them an EIP. Then use your DNS (you’re using Route 53, right?) to DNS round-robin your requests onto each HAProxy.
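In plain DNS terms that just means two (or more) A records for the same hostname, one per proxy – the name and addresses below are purely illustrative:

app.example.com.    A    203.0.113.10    ; EIP of HAProxy in AZ a
app.example.com.    A    203.0.113.20    ; EIP of HAProxy in AZ b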
On the back end, you point each HAProxy to forward onto your ELB.

HAProxy & ELB Config
Configure HAProxy:
# this config needs haproxy-1.1.28 or haproxy-1.2.1
global
    log 127.0.0.1 local0
    maxconn 4096
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode tcp
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

listen tcp-80 *:80
    option persist
    mode tcp
    balance roundrobin
    server inst1 your-elb.elb.amazonaws.com:80 check inter 30000 fall 3

listen tcp-443 *:443
    option ssl-hello-chk
    mode tcp
    balance roundrobin
    server inst1 your-elb.elb.amazonaws.com:443 check inter 30000 fall 3
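Before putting the proxies into service, it’s worth checking that the config parses cleanly and then restarting HAProxy – the path and service name below are typical defaults and may differ on your distro:

haproxy -c -f /etc/haproxy/haproxy.cfg
service haproxy restart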
Changing the mindset
I started in my new role with a preconception of what my job was to be.
I was hired as Systems Operations Manager for a Telecoms company and, as many SysAdmins do, assumed it was my responsibility to keep the systems running, whatever the weather – patch systems when vulnerabilities are found, delete files when you’re running out of space, spec the network for armageddon, and then backup-verify-test-prove that the DR plans stand up.
But despite all the preparations, the fail-safes and the redundancy you build in, something will always go wrong, and the SysAdmin has to fix it.
Or do you?
What if…
You didn’t care when things go wrong?
I’m not suggesting a blasé “meh…”, but what if your infrastructure didn’t matter any more?
HB or not 2B…
When I was at school, I had a pencil. I used it all day, every day. I had a pencil case to put it in, so that I could bring it home along with the sort of things many of us have not used since leaving school, like a set of compasses, protractors, those funny wedge rubbers you put on the end of your pencil (which always seemed to smudge more of a mess than they erased), and of course a pencil sharpener.
When your pencil snapped, or had simply worn to such a dull point it wouldn’t write any more, you had to hunch over the waste bin and sharpen it back into a precise point. The best sharpener was always one of the metal ones with the spare blade screwed to the side (see!… redundant components, even at primary school). But in striving to get that perfect point you’d invariably overcook it from time to time and snap it before even taking it out of the sharpener. Still, if you were careful and took your time, a pencil would last you a lifetime (or at least most of the term).
Not to tug on your heartstrings too much (and for the sake of my story), this was how I used to operate throughout school and university. But once my student days were over, flush with the trappings of employment, I was able to afford replacement pencils. Now when one breaks, I throw it away (and I can never find a pencil sharpener anymore).
Plan to Fail…
When I started looking at cloud services, I found dozens of articles warning that cloud instances fail, and often. A phrase that kept cropping up was “Plan to fail”.
What the authors were getting at is that you should have a plan in place to cope with an instance going wrong. Use load balancers and multiple instances so that if one goes wrong you have time to fix it and get it back into service.
Turn it all off, and back on again…
My thought process is slightly different. If you could start each day with a new pencil, it’s unlikely you’d wear it out before the end of the day, and if it breaks, just grab a new one out of the box.
If you could do this with your infrastructure, you’d never need to worry about it going wrong again. If it does ‘go bang’, just drop a new one in and carry on. In fact, why not nuke the whole lot at 23:59 each day and start fresh at midnight with brand new servers?
Once you come round to this way of thinking, you’ll wonder why you ever needed to try and fix a server problem again…
How hard could it be?
Starting out in the cloud
First, a little background on me:
I have 15 years’ experience in managing a “traditional” corporate network, but over the last 10 years the traditional network has changed, with the greatest advances coming in the last 5 years thanks to the “virtualization revolution”.
CIOs and network managers were seemingly clambering over one another to virtualize large parts of the corporate estate, using a variety of technologies from the likes of VMware, Microsoft and Citrix.
Around 5 years ago I started a project to virtualize the computing requirements of a UK local authority, which comprised some 60 physical servers running workloads such as AD, Exchange, GIS, finance and a whole host of other legacy business apps.
We were there at the start of Hyper-V, and we felt a lot of pain when certain aspects did not live up to expectations, but the release of 2008 R2 solved a great many problems, and I remain convinced it is a great platform for a Windows shop.
Moving to a virtual estate meant a great many changes in how we did normal things like backing up, applying updates and restarting servers, but the principle was always the same… test changes, deploy out of hours, test again, and hope rollback is not required… or worse still, a restore.
Not to go too deep into a potential article for another day, we had a major failure once and Microsoft DPM saved our bacon, but restoring VMs is always a worrying and stressful ordeal.
Then I changed jobs, and my entire way of thinking….