18 Jul 2012 4 min read

Cleaning up data - Regexp and Notepad++ FTW!

A lot of times I'm digging data from some source thats presented in columns and rows, usually this is from a web-page (bug tracker for example) where I've performed some query or another, but there is to much data, or data in the wrong order.
In other words, it's there but just not 100% right.
Today I was faced with summarizing a list of issues from Jira that we've worked on in the team, and I wanted to email them to people.

*Note:* I have left out explaining what the regexp:es do since the intent of this post is to show the possibilities of regular expressions and the the ones that I used were specific for my context at the time.
Figuring out the ones below are left as an exercise for the reader ;-)

So I performed an "advanced query" in Jira looking for the issues that:

were affecting the product we were using
were targeted for the coming release
had the resolution fixed
did not have the status closed (aka verified and closed).

So I got back a list of issues in the following format with a bunch of results:

T	Key	Summary	Assignee	Reporter	P	Status	Resolution	Created	Updated	Due	Customer
	ID-1	Crash in component A	Developer 1	Evil Tester		Resolved	Fixed	2012-01-01	2012-02-03
	ID-2	Documentation error	Architect 1	Nice Tester		Open	Fixed	2012-02-03	2012-03-06
	ID-3	Slow startup	Developer 3	Busy Tester		Resolved	Fixed	2012-01-01	2012-01-03
	ID-4	Crash in component B	Developer 2	Evil Tester		Resolved	Fixed	2012-02-01	2012-02-03

Now I was only really interest in the ID and the summary, and I could of course have:

Summarised the issues by hand (typing.. and then more typing)
Copy&Paste all rows and then manually selecting and deleting what I didn't want (boring...)

But I am kind of allergic to that kind of work and what I normally do is select all the text, copy it, and go to my favourite text editor Notepad++ to paste it in.

Doing this I now had the following text mass in Notepad++ (all lines ending with garbage tabs of course)

ID-1
Crash in component A
Developer 1 Evil Tester Resolved  Fixed 2012-01-01 2012-02-03
ID-2
Documentation error
Architect 1 Nice Tester Open Fixed 2012-02-03 2012-03-06
ID-3
Slow startup
Developer 3 Busy Tester Resolved Fixed 2012-01-01 2012-01-03
ID-4
Crash in component B
Developer 2 Evil Tester Resolved Fixed 2012-02-01 2012-02-03

I wanted the ID and the summary on the same row and removing all else. Doing this manually for a small amount of rows is no problem, but doing it for many rows is a hassle to select the correct rows and deleting them, and then... well it just takes forever... and is boring.

So here comes the regexp!
In Notepad++ you just press Ctrl+H (Find&Replace) and then we enter a suitable regexp to look for and replace it with whatever text string (or even another regexp!) we want.

Now I could have omitted some steps below and done them in one, but since this is quick and dirty work, I prefer just solving the problem at hand with two regexp:es in a few seconds rather than constructing the "uber-regexp" for the next 10 minutes or so.

First I want to remove the tab from the row with the ID:s

Find: "t(ID.*)"
Replace with: "1"

ID-1
Crash in component A
Developer 1 Evil Tester Resolved Fixed 2012-01-01 2012-02-03
ID-2
Documentation error
Architect 1 Nice Tester Open Fixed 2012-02-03 2012-03-06
ID-3
Slow startup
Developer 3 Busy Tester Resolved Fixed 2012-01-01 2012-01-03
ID-4
Crash in component B
Developer 2 Evil Tester Resolved  Fixed 2012-02-01 2012-02-03

After this we want to put the ID and the summary on the same line:

Find: "^(ID.*)trn"
Replace with: "1:"

ID-1: Crash in component A
Developer 1 Evil Tester Resolved  Fixed 2012-01-01 2012-02-03
ID-2: Documentation error
Architect 1 Nice Tester Open Fixed 2012-02-03 2012-03-06
ID-3: Slow startup
Developer 3 Busy Tester Resolved Fixed 2012-01-01 2012-01-03
ID-4: Crash in component B
Developer 2 Evil Tester Resolved  Fixed 2012-02-01 2012-02-03

And finally we remove all lines not starting with ID:

Find: "^(?!ID).*rn"
Replace with: ""

ID-1: Crash in component A
ID-2: Documentation error
ID-3: Slow startup
ID-4: Crash in component B
... and many more rows ...

And voila! In under a minute we have taken the data from the web page and transformed it into something I can email my collueges.

So if you have large amounts of text data that you need to transform to a better format, there is many times the possibility to use Notepad++ and regexps (for Unix you should always prefer the universal grep/sed/awk!)

You might also like...

Test automation is more than... well test automation...

Python for Testers - A class that is never done...

Sending emails with Python and MailGun

TITAN is coming to Karlskrona

The Moomins goes Belgium