Cleaning up data - Regexp and Notepad++ FTW!
In other words, it's there but just not 100% right.
Today I was faced with summarizing a list of issues from Jira that we've worked on in the team, and I wanted to email them to people.
*Note:* I have left out explaining what the regexp:es do since the intent of this post is to show the possibilities of regular expressions and the the ones that I used were specific for my context at the time.
Figuring out the ones below are left as an exercise for the reader ;-)
So I performed an "advanced query" in Jira looking for the issues that:
- were affecting the product we were using
- were targeted for the coming release
- had the resolution fixed
- did not have the status closed (aka verified and closed).
So I got back a list of issues in the following format with a bunch of results:
| T | Key | Summary | Assignee | Reporter | P | Status | Resolution | Created | Updated | Due | Customer |
| ID-1 | Crash in component A | Developer 1 | Evil Tester | Resolved | Fixed | 2012-01-01 | 2012-02-03 | ||||
| ID-2 | Documentation error | Architect 1 | Nice Tester | Open | Fixed | 2012-02-03 | 2012-03-06 | ||||
| ID-3 | Slow startup | Developer 3 | Busy Tester | Resolved | Fixed | 2012-01-01 | 2012-01-03 | ||||
| ID-4 | Crash in component B | Developer 2 | Evil Tester | Resolved | Fixed | 2012-02-01 | 2012-02-03 |
Now I was only really interest in the ID and the summary, and I could of course have:
- Summarised the issues by hand (typing.. and then more typing)
- Copy&Paste all rows and then manually selecting and deleting what I didn't want (boring...)
Doing this I now had the following text mass in Notepad++ (all lines ending with garbage tabs of course)
ID-1
Crash in component A
Developer 1 Evil Tester Resolved Fixed 2012-01-01 2012-02-03
ID-2
Documentation error
Architect 1 Nice Tester Open Fixed 2012-02-03 2012-03-06
ID-3
Slow startup
Developer 3 Busy Tester Resolved Fixed 2012-01-01 2012-01-03
ID-4
Crash in component B
Developer 2 Evil Tester Resolved Fixed 2012-02-01 2012-02-03
I wanted the ID and the summary on the same row and removing all else. Doing this manually for a small amount of rows is no problem, but doing it for many rows is a hassle to select the correct rows and deleting them, and then... well it just takes forever... and is boring.
So here comes the regexp!In Notepad++ you just press Ctrl+H (Find&Replace) and then we enter a suitable regexp to look for and replace it with whatever text string (or even another regexp!) we want.
Now I could have omitted some steps below and done them in one, but since this is quick and dirty work, I prefer just solving the problem at hand with two regexp:es in a few seconds rather than constructing the "uber-regexp" for the next 10 minutes or so.
First I want to remove the tab from the row with the ID:s
Find: "t(ID.*)"
Replace with: "1"
ID-1
Crash in component A
Developer 1 Evil Tester Resolved Fixed 2012-01-01 2012-02-03
ID-2
Documentation error
Architect 1 Nice Tester Open Fixed 2012-02-03 2012-03-06
ID-3
Slow startup
Developer 3 Busy Tester Resolved Fixed 2012-01-01 2012-01-03
ID-4
Crash in component B
Developer 2 Evil Tester Resolved Fixed 2012-02-01 2012-02-03
After this we want to put the ID and the summary on the same line:
Find: "^(ID.*)trn"
Replace with: "1:"
ID-1: Crash in component A
Developer 1 Evil Tester Resolved Fixed 2012-01-01 2012-02-03
ID-2: Documentation error
Architect 1 Nice Tester Open Fixed 2012-02-03 2012-03-06
ID-3: Slow startup
Developer 3 Busy Tester Resolved Fixed 2012-01-01 2012-01-03
ID-4: Crash in component B
Developer 2 Evil Tester Resolved Fixed 2012-02-01 2012-02-03
And finally we remove all lines not starting with ID:
Find: "^(?!ID).*rn"
Replace with: ""
ID-1: Crash in component A
ID-2: Documentation error
ID-3: Slow startup
ID-4: Crash in component B
... and many more rows ...
And voila! In under a minute we have taken the data from the web page and transformed it into something I can email my collueges.
So if you have large amounts of text data that you need to transform to a better format, there is many times the possibility to use Notepad++ and regexps (for Unix you should always prefer the universal grep/sed/awk!)
