The thought of countless office workers slaving over spreadsheets and performing mindless work makes me sad. These machines that I love were supposed to free us from menial labor. Instead, there are people out there copying something from one spreadsheet and pasting it into another, over and over, ad nauseum. It’s because the tools are incomplete, a consequence of trying to tie every conceivable operation to a button on the screen.
You don’t need to be a computer scientist to process data. With casual programming, we can accomplish a lot without the operational complexity of “real” software development. If we take some basic precautions, we can save time, have fun, and look like a total wizard without appreciable risk.
Here are some examples of things I’ve done on a locked-down Windows 10 laptop:
- Translated an Excel timeline document to a Google Gantt chart.
- Automated the process of verifying that developers are correctly billing their time.
- Automated status reports that include financial calculations.
- Identified duplicate orders in a spreadsheet based on the shipping address using Perl.
- Automatically added weekly tasks to my todo list every Monday.
- Condensed multiple daily journal entries with a Perl script so that I can paste them into my timesheet.
- Fetched image URLs for a list of products in a spreadsheet using bash and curl.
- Added child elements to an XML file with awk.
- Automated the download of 10,000 unique codes from an online GUID generator with bash and curl.
- Used a Vim macro to restructure content that was mangled by another application.
To do all of this, I use Git for Windows. I don’t actually use git itself, just the tools that come with Git for Windows. Even on locked-down systems, I haven’t had trouble installing it. Note: Run the installer from the file explorer, not through your web browser’s download manager. My guess is that it’s an executable zip file, and extracting an archive doesn’t require admin access. Git for Windows is based on MSYS2. which includes MinGW, a collection of GNU utilities that have been ported to Windows. Git for Windows is updated about every 12 weeks, so be sure to keep your installation up to date.
These tools will let you do a lot of cool tricks. The key feature of this environment is the composability of trivial programs. It’s not as easy as clicking a button, but you won’t need to break out a calculator, either. This is a great toolset for working with small datasets. I believe most Excel spreadsheets fall into that category.
Key tools include:
- Vim and nano for text editing
- Awk and Perl for data processing
- GNU coreutils like curl, sed, and grep, also handy for data processing
- A bash shell emulator to hold everything together
Other features that make this environment easy to use:
- It uses the Windows filesystem, so you have access to all your files and folders.
- Integration with the Windows clipboard lets you can copy and paste bits of text to and from.
- Very lightweight compared to other Windows applications so it’s fast, even on older machines.
To work with Excel data in the Git for Windows environment, you will need to save the file as “Text (Tab Separated)”. You can always paste the resulting data back into your original file. I recommend tab-separated because it is very unlikely that you will encounter a tab character embedded in an Excel document. However, it is quite easy to embed a newline character in a cell using Ctrl-Enter and this will be hard to compensate for in a script. If your Excel document has cells with unstructured data, consider excluding those columns from the tab-separated version of the file.
I also recommend installing Pandoc. I hate messing around with styles in Word. With Pandoc, I can compose using a plain text editor. I use markdown to indicate headers, create tables, quote text, and bold and italicize text. Pandoc will then convert my markdown into a nice-looking Word document. I have also converted markdown into HTML, HTML into plain text, and Word into markdown. The key advantage for me is that I get to edit documents in Vim, my favorite editor, and then convert them to a document that is more appropriate for my audience (usually Word). It also brings this content into an environment with powerful automation capabilities.
I’ll focus on Excel because I use it a lot at work and awk was made to process structured data like you find in most Excel documents. The key things missing from Excel are regular expressions and scriptability. For really trivial stuff, like summing up numbers or creating pivot tables, staying in Excel is probably your best bet. When things get complicated, using a tool like awk or sed can save you a ton of time and soul-crushing, repetitive work. This is especially true of anything you will have to do more than once or on a regular schedule.
- Find rows that meet certain criteria and
- change or add a column value based on a set of rules or,
- copy the line to a new file and/or,
- remove them from the file or
- find matching information and another file and pull it in.
You can also perform find-and-replace operations that are too complicated or fuzzy for Excel’s built-in find-and-replace function.
Basically, if you find yourself eyeballing every row in a spreadsheet and applying some kind of rule, you should strongly consider creating a script to do it for you. Not only will this be faster in the long run, but less prone to human error. It’s also just more fun and it elevates your thinking.