One of the pet hates of anyone working in data is that dreaded free text field. You know the one, every system has one. That little white box that just invites the user to enter what ever little collection of text they want to add. In an ideal world all data entry would be a drop down box or a selection of phrases so that you could do some data analysis on it. However, thats a dream world, full of fluffy clouds, endless summer days and cool drinks on a warm beach. In reality we have records in databases that are strings of text that contain useful data if only we could group it together.
I was asked at work this week if I could provide some analysis on a free text field. At the Welcome Trust Sanger Institute we have a large DNA Sequencing facility that produces huge amounts of data. This data is produced by running DNA samples on a number of Illumina DNA Sequencers, each run takes about 5 days and produces the equivalent of 100 peoples genomes. Each run gets QC'd by a real life person who will annotate some runs if there are interesting things to note. This annotation is a free text field that we would like to be able to use to monitor for potential problems. So heres how i did it.
First we need a parameter to act as our search entry. We create a parameter and change the data type to string and set the allowable values to all. Leave the current value blank.
Next we create a calculated field based on the parameter, this will then allow us to use the results of the search to filter our data.
In my data set the text is in the [lane_annotation] field so we create a calculation using the CONTAINS(string1,string2) function. What this does is search through the first field, string1, looking for the text in string2. Bit like searching for text in a text editor. If it find a match then it will return true or if not false. In my case i am searching the [lane_annotation] field looking for text entered in the [Parameters].[Annotation Search]. If it matches it will return "Annotation" and if not "None"
Next i build my viz.
So we have the number of lanes that have been QC'd each month since 2011. So now we can use the calculated field we created earlier and place it on the colour shelf.
So now we can see the number of lanes with and without annotations. This is useful as it shows that a spike in the number of annotations might just be due to an overall increase in lane throughput. This is using just the calculated field, we've not used the search parameter yet.
Now add the parameter control to the viz and we can search for something eg "uneven"
Now the search string in the parameter is being passed to the calculated field and gets evaluated. All the 6000 or so records are checked and those that contain the word "uneven" are tagged with the "Annotation" flag. We now we have a line chart that shows the occurance of the word "uneven" in the [lane_annotation] field
I then decided to add a second field and look at the % of total
To finish it off i added it to a dashboard along with a detailed view of each lane and annotation and used a dashboard action to filter the results. So now we have a little search tool that we can use to go back through a couple of years worth of data, looking for text in a free text field and look for potential issues.
I've already used this twice this week already so hopefully you might find this useful for getting information out of your free text fields.