A lot of times I'm digging data from some source thats presented in columns and rows, usually this is from a web-page (bug tracker for example) where I've performed some query or another, but there is to much data, or data in the wrong order.
In other words, it's there but just not 100% right.
Today I was faced with summarizing a list of issues from Jira that we've worked on in the team, and I wanted to email them to people.

*Note:* I have left out explaining what the regexp:es do since the intent of this post is to show the possibilities of regular expressions and the the ones that I used were specific for my context at the time.
Figuring out the ones below are left as an exercise for the reader ;-)

So I performed an "advanced query" in Jira looking for the issues that:
  • were affecting the product we were using
  • were targeted for the coming release
  • had the resolution fixed
  • did not have the status closed (aka verified and closed).
So I got back a list of issues in the following format with a bunch of results:


TKeySummaryAssigneeReporterPStatusResolutionCreatedUpdatedDueCustomer
ID-1Crash in component ADeveloper 1Evil TesterResolvedFixed2012-01-012012-02-03
ID-2Documentation errorArchitect 1Nice TesterOpenFixed2012-02-032012-03-06
ID-3Slow startupDeveloper 3Busy TesterResolved Fixed2012-01-012012-01-03
ID-4Crash in component BDeveloper 2Evil TesterResolved Fixed2012-02-012012-02-03
 
Now I was only really interest in the ID and the summary, and I could of course have:

  • Summarised the issues by hand (typing.. and then more typing)
  • Copy&Paste all rows and then manually selecting and deleting what I didn't want (boring...)
But I am kind of allergic to that kind of work and what I normally do is select all the text, copy it, and go to my favourite text editor Notepad++ to paste it in.

Doing this I now had the following text mass in Notepad++ (all lines ending with garbage tabs of course)

ID-1 
Crash in component A
Developer 1 Evil Tester  Resolved  Fixed 2012-01-01 2012-02-03   
ID-2 
Documentation error
Architect 1 Nice Tester  Open Fixed 2012-02-03 2012-03-06   
ID-3 
Slow startup
Developer 3 Busy Tester  Resolved Fixed 2012-01-01 2012-01-03   
ID-4 
Crash in component B
Developer 2 Evil Tester  Resolved Fixed 2012-02-01 2012-02-03   


I wanted the ID and the summary on the same row and removing all else. Doing this manually for a small amount of rows is no problem, but doing it for many rows is a hassle to select the correct rows and deleting them, and then... well it just takes forever... and is boring.

So here comes the regexp!
In Notepad++ you just press Ctrl+H (Find&Replace) and then we enter a suitable regexp to look for and replace it with whatever text string (or even another regexp!) we want.

Now I could have omitted some steps below and done them in one, but since this is quick and dirty work, I prefer just solving the problem at hand with two regexp:es in a few seconds rather than constructing the "uber-regexp" for the next 10 minutes or so.

First I want to remove the tab from the row with the ID:s

Find: "t(ID.*)"
Replace with: "1"


ID-1 
Crash in component A
Developer 1 Evil Tester  Resolved Fixed 2012-01-01 2012-02-03   
ID-2 
Documentation error
Architect 1 Nice Tester  Open Fixed 2012-02-03 2012-03-06   
ID-3 
Slow startup
Developer 3 Busy Tester  Resolved Fixed 2012-01-01 2012-01-03   
ID-4 
Crash in component B
Developer 2 Evil Tester  Resolved  Fixed 2012-02-01 2012-02-03  


After this we want to put the ID and the summary on the same line:


Find: "^(ID.*)trn"
Replace with: "1:"

ID-1: Crash in component A
Developer 1 Evil Tester  Resolved  Fixed 2012-01-01 2012-02-03   
ID-2: Documentation error
Architect 1 Nice Tester  Open Fixed 2012-02-03 2012-03-06   
ID-3: Slow startup
Developer 3 Busy Tester  Resolved Fixed 2012-01-01 2012-01-03   
ID-4: Crash in component B
Developer 2 Evil Tester  Resolved  Fixed 2012-02-01 2012-02-03  


And finally we remove all lines not starting with ID:

Find: "^(?!ID).*rn"
Replace with: ""

ID-1: Crash in component A
ID-2: Documentation error
ID-3: Slow startup
ID-4: Crash in component B
... and many more rows ...


And voila! In under a minute we have taken the data from the web page and transformed it into something I can email my collueges.

So if you have large amounts of text data that you need to transform to a better format, there is many times the possibility to use Notepad++ and regexps (for Unix you should always prefer the universal grep/sed/awk!)