Protein is Good for You (Matt Gertz)


As I was preparing to graduate from the University of Michigan way back in the late eighties, I had a big decision to make regarding grad school – robotics at Carnegie Mellon, or biology at Washington State?  On the one hand, biology was something I’d always really loved, having even intended to go to med school at one point.  On the other hand, robotics was a very exciting field with a lot of challenges still ahead of it (and, frankly, more likely to help me pay off my debts).  Ultimately, I chose robotics (gaining a degree that, to this day, I have never used), but I often wonder what it might have been like to go into biology.

Anyway, back when I was dating my future wife (who was in the molecular biology graduate program), I wrote a quick’n’dirty program to translate DNA coding sequences to chains of amino acids for her advisor.  That was fun, and I got to feel like I was participating in the research (in a very teeny-tiny way).  Beyond that, I haven’t had much interaction with hands-on biology work in many years, although I try to keep up with what’s going on.  Recently, though, I’ve been scrambling to come up with new ideas for blog articles, and that program I wrote nearly 18 years ago came to mind.  I’d never been happy with the visualization of the data, so I decided to give it a second try, this time using the WPF designer to help me out.

In this blog, I’ll cover the creation of a program to translate DNA to proteins, and tomorrow I’ll talk about visualizing the results using StackPanel controls.  The overall example requires VS2008 or later to code up, although today’s blog code is mostly machinery that would pretty much work on either WinForms or WPF.

“Captain… the alien virus is rewriting his DNA!  He’s changing!”

One of the problems in being both a computer specialist and also somewhat knowledgeable about biology is that it’s very difficult to make it through your average Star Trek episode, between the bad computer science and the bad biology.  (For the record, changing your DNA won’t change your appearance, since the protein structures they code for already exist and represent roughly seven years of dead-end energy investment on your part.  If you’re lucky, any changes will have no impact at all or even be beneficial; if you’re less lucky, the cell will die or cause some deleterious behavior.)  At any rate, this will be a simple DNA/RNA/protein visualization program, and no DNA altering will be allowed. :-)

I’m not going to give a big overview on DNA transcription; if you don’t remember enough from school to follow along, the Wikipedia article on DNA is pretty good for refreshing one’s memory (as I found out).  For the purposes of this exercise, we’ll just note that DNA is used to determine which proteins are created for a cell: 

(1)    DNA is composed of four bases (A, T, C, G) joined longitudinally by sugar/phosphate groups and paired latitudinally by hydrogen bonds.  A (adenine) always pairs with T (thymine), and C (cytosine) always pairs with G (guanine).

(2)    During transcription, the DNA (a double helix) in a given chromosome is split down the middle.

(3)    An mRNA (“messenger RNA”) string is built up from the DNA side which contains the appropriate information for that part of the string.   (RNA is very similar to DNA except that the connective sugar is ribose instead of deoxyribose, and thymine is replaced with uracil (U).)  The resulting mRNA string is a complementary (“inverse”) copy of the strand it copied: each A becomes U, each T becomes A, each G becomes C, and each C becomes G.  I’ll be using the terms mRNA and RNA interchangeably for the purpose of this blog.

(4)    Amino acids are assembled together based on the sequence of the mRNA.  It takes three bases to code for one amino, so given four bases, there are 4^3 = 64 possible combinations for a given triplet (codon) of bases.  Some codons code for the same amino; there are 20 standard aminos mapping to 61 codons.  (Three of the possible codons simply indicate the end of a sequence; one codon indicates the start of a sequence and also codes for a specific amino, methionine.)
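To make the 4^3 = 64 arithmetic in step (4) concrete, here is a minimal sketch that enumerates every possible codon.  The module and function names (CodonCount, AllCodons) are made up for this illustration and are not part of the VBProtein project.

```vb
Imports System
Imports System.Collections.Generic

Module CodonCount
    ' Hypothetical helper: builds every three-base combination of the four RNA bases.
    Public Function AllCodons() As List(Of String)
        Dim bases As Char() = New Char() {"U"c, "C"c, "A"c, "G"c}
        Dim codons As New List(Of String)
        For Each first As Char In bases
            For Each second As Char In bases
                For Each third As Char In bases
                    codons.Add(New String(New Char() {first, second, third}))
                Next
            Next
        Next
        Return codons
    End Function

    Sub Main()
        Dim codons As List(Of String) = AllCodons()
        Console.WriteLine(codons.Count)   ' 64 = 4 * 4 * 4
        Console.WriteLine(codons(0))      ' UUU
    End Sub
End Module
```

Three of those 64 entries are the stop codons, which leaves the 61 that map onto the 20 standard aminos.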

Basically, the plan for the program is this:  allow the user to read in a sequence of DNA, automatically convert it to an mRNA sequence, and then convert it to zero or more sequences of amino acids.  We’ll then throw a visual representation of all strings onto the form.

The basic application

First, I’ll create a new WPF application called “VBProtein.”  On its grid, I’m going to throw three main controls:

(1)    A ScrollViewer control for displaying the sequences graphically.  I’ll set mine to be 112 pixels high (enough for three rows of sequences of height 32 pixels plus the scrollbar).  In the properties of the ScrollViewer, I’ll set “HorizontalScrollBarVisibility” to “Visible” and “VerticalScrollBarVisibility” to “Disabled,” since the sequences will be listed left-to-right.  I’ve also set its “TabIndex” to “0”.

(2)    A Button control labeled “Load” (“TabIndex” = “1”).

(3)    A Button control labeled “Save” (“TabIndex” = “2”).

I’ve also added a few label controls and changed a few colors, but that’s all of the important stuff.  Everything else gets added in code, so let’s take a look at that.   Double-click on the window frame to generate the Window1_Loaded event.  We’ll populate it later, but for the moment we’ll concentrate on the members we’ll need for the application.  These are:

    Public Translations As New Microsoft.VisualBasic.Collection
    Public DNA As String
    Public RNA As String
    Public Proteins As New List(Of String)


We’ll load in the value of DNA, translate it to RNA, and then translate that to the Proteins – in other words, we’ll deal with those later.  Let’s worry about the Translations instead.

For the translations, I decided to go with a Collection since they are easy to work with, they support keys for lookup, and I’m not dealing with too many objects – just the 64 possible codons.  There’s a lot of information I’ll want to keep with each Translation:

    Enum Sequence
        Normal = 0
        SequenceStart = 1
        SequenceStop = 2
    End Enum

    Class Translation
        Public Sub New(ByVal Triplet As String, ByVal Acid As String, _
                       ByVal Mnem As Char, ByVal Clue As Sequence)
            Codon = Triplet
            Amino = Acid
            shortAmino = Mnem
            Usage = Clue
        End Sub

        ' The codon doubles as the key in the Translations collection
        Public Overrides Function ToString() As String
            Return Codon
        End Function

        Public Codon As String
        Public Amino As String
        Public shortAmino As Char
        Public Usage As Sequence
    End Class


Note that I’m overriding the “ToString” method to return the Codon value, which I’ll be using as a key in the collection.   With this structure, I can initialize the translation collection (abridged from the actual code for the purposes of legibility):

    Private Sub InitializeTranslations()
        Translations.Add(New Translation("UUU", "Phe", "F", Sequence.Normal), "UUU") ' Phenylalanine
        Translations.Add(New Translation("UUC", "Phe", "F", Sequence.Normal), "UUC") ' Phenylalanine
        Translations.Add(New Translation("UAA", "OCH", ".", Sequence.SequenceStop), "UAA") ' Ochre stop sequence
        Translations.Add(New Translation("UAG", "AMB", ".", Sequence.SequenceStop), "UAG") ' Amber stop sequence
        Translations.Add(New Translation("UGA", "OPA", ".", Sequence.SequenceStop), "UGA") ' Opal stop sequence
        Translations.Add(New Translation("AUG", "Met", "M", Sequence.SequenceStart), "AUG") ' Methionine
        ' Etc…
    End Sub


Each translation has a codon, the abbreviated name of the matching amino, the one-character name of the matching amino (which I never use, but what the heck), and a setting which determines if this is a normal codon, a starting codon, or a stopping codon.
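As an aside, the same lookup could also be sketched with a generic Dictionary instead of a Collection.  This is only an alternative for comparison, not what the project uses; CodonInfo, BuildTable, and Lookup are hypothetical names, and the table holds just three sample codons.  The possible advantages are that lookups come back strongly typed and that TryGetValue lets an unknown codon be handled without an exception.

```vb
Imports System
Imports System.Collections.Generic

Module TranslationLookup
    ' Hypothetical, trimmed-down stand-in for the Translation class above.
    Public Class CodonInfo
        Public Amino As String
        Public IsStop As Boolean

        Public Sub New(ByVal acid As String, ByVal stopCodon As Boolean)
            Amino = acid
            IsStop = stopCodon
        End Sub
    End Class

    ' Builds a deliberately tiny table; a real one would hold all 64 codons.
    Public Function BuildTable() As Dictionary(Of String, CodonInfo)
        Dim table As New Dictionary(Of String, CodonInfo)
        table.Add("UUU", New CodonInfo("Phe", False))
        table.Add("AUG", New CodonInfo("Met", False))
        table.Add("UAA", New CodonInfo("OCH", True))
        Return table
    End Function

    ' Returns the amino abbreviation, or "???" when the codon is not in the table.
    Public Function Lookup(ByVal table As Dictionary(Of String, CodonInfo), _
                           ByVal codon As String) As String
        Dim info As CodonInfo = Nothing
        If table.TryGetValue(codon, info) Then Return info.Amino
        Return "???"
    End Function

    Sub Main()
        Dim table As Dictionary(Of String, CodonInfo) = BuildTable()
        Console.WriteLine(Lookup(table, "AUG"))   ' Met
        Console.WriteLine(Lookup(table, "XYZ"))   ' ???
    End Sub
End Module
```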

Now, in the Window1_Loaded code, I’ll add the following:

        InitializeTranslations()
        SaveResultsBtn.IsEnabled = False


(The second line is unrelated to the previous code and just disables the Save button until we have something to save.)

I can now start writing the functional code.  Back on the grid, I’ll double-click the “Load” button, and in the resulting LoadSequenceBtn_Click event, I’ll add the following code:

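        ' Load in the file (OpenFileDialog here is Microsoft.Win32.OpenFileDialog)
        Dim dlg As New OpenFileDialog
        dlg.Filter = My.Resources.FILT_FileFilter
        If dlg.ShowDialog() = True Then
            ' FileName is the full path; SafeFileName would drop the folder
            DNA = My.Computer.FileSystem.ReadAllText(dlg.FileName)
        End If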


As you can see, first I’m throwing up a file dialog to get the name of the file to load (which is just a TXT file filled with bases -- FILT_FileFilter is a resource I've mapped to "Text Files (*.txt)|*.txt"), and then I open it up and read it into the DNA string variable using a handy “My.Computer” function. 
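One thing to watch for: a text file often contains line breaks, spaces, or lowercase letters, none of which are valid bases.  Below is a minimal sketch of a cleanup step that could be applied to the loaded text before transcription.  CleanSequence is a hypothetical helper, not part of the original program.

```vb
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module SequenceCleanup
    ' Hypothetical helper: uppercases the text and drops all whitespace,
    ' so that only the base letters remain.
    Public Function CleanSequence(ByVal raw As String) As String
        Dim cleaned As New StringBuilder(raw.Length)
        For Each ch As Char In raw
            If Not Char.IsWhiteSpace(ch) Then cleaned.Append(Char.ToUpperInvariant(ch))
        Next
        Return cleaned.ToString()
    End Function

    Sub Main()
        Console.WriteLine(CleanSequence("tac aaa" & vbCrLf & "att"))   ' TACAAAATT
    End Sub
End Module
```

Any other unexpected character is left in place on purpose, so it can still be reported as an unknown base later on.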

Given the DNA, I can transcribe it to mRNA.  First, I’ll need a transcription function:

    Private Function Transcribe(ByVal Base As Char) As Char
        Select Case Base
            Case "A" ' Adenine
                Transcribe = "U" ' Uracil
            Case "T" ' Thymine
                Transcribe = "A" ' Adenine
            Case "G" ' Guanine
                Transcribe = "C" ' Cytosine
            Case "C" ' Cytosine
                Transcribe = "G" ' Guanine
            Case Else
                Throw New ApplicationException(String.Format(My.Resources.ERR_UnknownBase, Base))
        End Select
    End Function


This is a very simple function which, given a character representing a DNA base, returns the character for the complementary RNA base.  If it finds a character that it doesn’t understand, it throws an exception whose message comes from a string (ERR_UnknownBase) that I’ve defined in the project resources (right-click your project, choose “Properties,” and select the “Resources” tab from the resulting dialog to find the resources for the project).  Transcription is now very easy:

    Private Sub DNA2RNA()
        RNA = ""
        If DNA.Length < 6 Then Return ' Need two codons at least, or we're just wasting time.
        Dim mRNA As New StringBuilder(DNA.Length) ' StringBuilder lives in System.Text
        For i As Integer = 0 To DNA.Length - 1
            mRNA.Append(Transcribe(DNA(i))) ' Complement each base in turn
        Next
        RNA = mRNA.ToString
    End Sub


First I check to make sure there are at least two codons (six bases) before doing anything – with fewer than that, you can’t have both a start and a stop.  Given that, I’ll iterate through the DNA string and collect the corresponding RNA characters.  To save memory thrashing, I’ll use a StringBuilder to collect those and then save the result out to the RNA string.
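As a quick worked example of the transcription step, here is a standalone sketch that can be run outside the WPF project.  Complement and TranscribeStrand are hypothetical names that mirror the logic described above, and a plain ArgumentException stands in for the resource-based error message.

```vb
Imports System
Imports System.Text

Module TranscribeDemo
    ' Maps one DNA base to its complementary RNA base.
    Public Function Complement(ByVal dnaBase As Char) As Char
        Select Case dnaBase
            Case "A"c : Return "U"c
            Case "T"c : Return "A"c
            Case "G"c : Return "C"c
            Case "C"c : Return "G"c
            Case Else
                Throw New ArgumentException("Unknown base: " & dnaBase)
        End Select
    End Function

    ' Transcribes a whole strand, one base at a time.
    Public Function TranscribeStrand(ByVal dna As String) As String
        Dim rna As New StringBuilder(dna.Length)
        For Each b As Char In dna
            rna.Append(Complement(b))
        Next
        Return rna.ToString()
    End Function

    Sub Main()
        ' TAC AAA ATT transcribes to AUG UUU UAA: start, Phe, stop.
        Console.WriteLine(TranscribeStrand("TACAAAATT"))   ' AUGUUUUAA
    End Sub
End Module
```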

Translating mRNA to aminos is slightly more difficult.  We have to search the mRNA for a particular start sequence, and then end when one of three end sequences is found – and then repeat that in case there are more sequences available farther along the mRNA string.  (That’s why I’ve used List(Of String) for the collected proteins instead of just one string – if I ever chose to expand this program, it’s nice to have all of the sequences in separate strings already.)  I’ll begin by freeing up any existing protein memory, doing the check on lengths, and setting up an index:

    Private Sub RNA2Proteins()
        ' First, find the start codon.  This is typically AUG.
        ' I'm deliberately not dealing with the rare exceptional
        ' cases sometimes found in mature mRNA.
        Proteins.Clear() ' Free up any proteins left over from a previous load
        If RNA.Length < 6 Then Return ' Need two codons at least, or we're just wasting time.
        Dim sProteins As New StringBuilder
        Dim currentIndex = 0


As with the DNA->RNA case, I’ll need a way of collecting bits of a string that doesn’t thrash memory, and thus I’ve declared a StringBuilder above.  The currentIndex will track where we are in the RNA string.

I’ve got to look for the first start codon.  Since there’s only one type of those, I can use InStr to find it.  I’ve got to be careful, though, because InStr is 1-based and not 0-based:

        Do While currentIndex < RNA.Length
            ' Get past the current intron code (so-called "junk" DNA,
            ' though it may serve a function after all) to
            ' the next exon.
            ' CONSIDER:  We could also modify this method to create
            ' mature mRNA from the precursor mRNA.
            Dim startCodon = InStr(RNA.Substring(currentIndex), "AUG") - 1 ' InStr is 1-based, not 0-based
            If startCodon = -1 Then Return ' Remember I subtracted 1 from the value, so not checking for 0 as normal
            currentIndex = currentIndex + startCodon ' Jump to the start codon for the next exon
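For comparison, here is a small sketch of how the 1-based InStr result differs from the 0-based String.IndexOf, which can also start searching at an offset without taking a Substring first.  FindStartCodon is a hypothetical helper, shown only as an alternative to the InStr approach used here.

```vb
Imports System
Imports Microsoft.VisualBasic

Module SearchDemo
    ' Hypothetical helper: 0-based position of the next "AUG" at or after
    ' fromIndex, or -1 when there is none.
    Public Function FindStartCodon(ByVal rna As String, ByVal fromIndex As Integer) As Integer
        Return rna.IndexOf("AUG", fromIndex, StringComparison.Ordinal)
    End Function

    Sub Main()
        Dim rna As String = "GGAUGUUU"
        Console.WriteLine(InStr(rna, "AUG"))        ' 3: InStr counts from 1
        Console.WriteLine(FindStartCodon(rna, 0))   ' 2: IndexOf counts from 0
        Console.WriteLine(InStr(rna, "CCC"))        ' 0 means "not found"
        Console.WriteLine(FindStartCodon(rna, 3))   ' -1 means "not found"
    End Sub
End Module
```

Note that IndexOf returns a position relative to the whole string, whereas the InStr-on-Substring form returns one relative to currentIndex, so the two are not drop-in replacements for each other.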


So now I’m pointing at the first starting codon (if any exists at all).  At this point, I need to check my spec and note that I’m going to want to display the DNA, RNA, and aminos all next to each other and lined up properly.  If I write out the aminos without taking account of the space taken up by the theoretically-unused portions of the genome, the aminos won’t line up properly with their corresponding genetic material.  So, I’ll introduce whitespace for each base I skipped over:

            Dim sProtein As New StringBuilder
            sProtein.Append(" "c, startCodon) ' Remember the offset; it will be important later when drawing.


Having inserted that reminder for later use, I then proceed to grab every three bases and translate them to their amino equivalents.  I’ll do this until I find one of the “stop” codons.  (If I don’t find a stop codon before I get to the end of the mRNA, I’ll ignore the entire string.)  Note that I’m using the Translations collection to easily map the three-base codon (the key to the collection) to its amino equivalent.  “Stop” codons don’t map to aminos, so when I find one of those, I simply insert three spaces, add the current protein string to the list of proteins, and exit the loop:


            Dim StopFound As Boolean = False
            Do While currentIndex + 2 < RNA.Length ' Need to consume this character plus 2 more
                Dim entry As Translation
                entry = Me.Translations(RNA.Substring(currentIndex, 3))
                currentIndex += 3 ' Skip to next sequence, no matter what happens
                If entry.Usage <> Sequence.SequenceStop Then
                    sProtein.Append(entry.Amino) ' Ordinary codon: record its amino
                Else
                    StopFound = True
                    sProtein.Append("   ") ' Add three spaces to make sure spacing works when drawing
                    Proteins.Add(sProtein.ToString) ' This protein is complete
                    Exit Do
                End If
            Loop



I might have exited the loop due to running out of mRNA, so now I need to check for that condition.  (If I didn’t run out of space, I’ll just loop back up and look for the next start codon, if any.)

            If StopFound <> True Then Exit Do
        Loop
    End Sub
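To see the whole scan in one place, here is a condensed, standalone sketch of the same start/stop logic with a worked input.  It is a simplification rather than the project code: LookupAmino is a hypothetical stand-in for the Translations collection and knows only the handful of codons used in the sample, and IndexOf is used in place of InStr.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module TranslateDemo
    ' Toy lookup: returns Nothing for a stop codon and "???" for anything
    ' outside this deliberately tiny table.
    Private Function LookupAmino(ByVal codon As String) As String
        Select Case codon
            Case "AUG" : Return "Met"
            Case "UUU", "UUC" : Return "Phe"
            Case "UAA", "UAG", "UGA" : Return Nothing
            Case Else : Return "???"
        End Select
    End Function

    Public Function Translate(ByVal rna As String) As List(Of String)
        Dim proteins As New List(Of String)
        Dim index As Integer = 0
        Do While index < rna.Length
            Dim start As Integer = rna.IndexOf("AUG", index, StringComparison.Ordinal)
            If start < 0 Then Exit Do
            Dim protein As New StringBuilder
            protein.Append(" "c, start - index)   ' One space per skipped base
            index = start
            Dim stopFound As Boolean = False
            Do While index + 2 < rna.Length
                Dim amino As String = LookupAmino(rna.Substring(index, 3))
                index += 3
                If amino Is Nothing Then
                    protein.Append("   ")          ' Stop codon occupies three columns
                    proteins.Add(protein.ToString())
                    stopFound = True
                    Exit Do
                End If
                protein.Append(amino)
            Loop
            If Not stopFound Then Exit Do          ' Ran out of mRNA; discard the partial protein
        Loop
        Return proteins
    End Function

    Sub Main()
        ' Two bases of padding, then AUG UUU UAA, two more, then AUG UUC UGA.
        For Each p As String In Translate("GGAUGUUUUAACCAUGUUCUGA")
            Console.WriteLine("[" & p & "]")       ' prints [  MetPhe   ] twice
        Next
    End Sub
End Module
```

Each returned string is exactly as long as the stretch of mRNA it covers, which is what keeps the rows aligned when the proteins are written out one after another.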


Finally, I’ll go back to the LoadSequenceBtn_Click event and call the two translator functions (inserting this code after the file is loaded), and also notify the Save button that we now have something to save:

        DNA2RNA()
        RNA2Proteins()

        ' We've loaded & processed, so we can save now.
        SaveResultsBtn.IsEnabled = True


The code for the Save button (double-click on it to create the event handler) is very similar to that of the Load button, except that instead of reading one string, we’re writing two strings plus the list of proteins:

        Dim dlg As New SaveFileDialog
        dlg.Filter = My.Resources.FILT_FileFilter
        If dlg.ShowDialog() = True Then
            ' FileName is the full path; SafeFileName would drop the folder
            My.Computer.FileSystem.WriteAllText(dlg.FileName, DNA & vbCrLf, False) ' Overwrite existing file on first write
            My.Computer.FileSystem.WriteAllText(dlg.FileName, RNA & vbCrLf, True)
            For Each p In Proteins
                My.Computer.FileSystem.WriteAllText(dlg.FileName, p, True)
            Next
        End If


And if you look at the results in a fixed-width font without word wrap (e.g., in Notepad), everything should line up correctly.  I could do more work here to break it all up into 80-char-width chunks, but I’ll leave that as an exercise for the audience.
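For anyone who wants a head start on that exercise, here is one possible sketch.  Chunk is a hypothetical helper, and the sample uses a width of 6 just to keep the output short.  For real output, a width that is a multiple of three (78 rather than 80, say) would keep each three-letter amino label from being split across rows.

```vb
Imports System
Imports System.Collections.Generic

Module ChunkDemo
    ' Hypothetical helper: splits source into pieces of at most width characters.
    Public Function Chunk(ByVal source As String, ByVal width As Integer) As List(Of String)
        Dim pieces As New List(Of String)
        Dim i As Integer = 0
        Do While i < source.Length
            pieces.Add(source.Substring(i, Math.Min(width, source.Length - i)))
            i += width
        Loop
        Return pieces
    End Function

    Sub Main()
        Dim dnaRows As List(Of String) = Chunk("TACAAAATT", 6)
        Dim rnaRows As List(Of String) = Chunk("AUGUUUUAA", 6)
        Dim aminoRows As List(Of String) = Chunk("MetPhe   ", 6)
        ' Interleave the rows so each block shows DNA, RNA, and aminos together.
        For row As Integer = 0 To dnaRows.Count - 1
            Console.WriteLine(dnaRows(row))
            Console.WriteLine(rnaRows(row))
            If row < aminoRows.Count Then Console.WriteLine(aminoRows(row))
            Console.WriteLine()
        Next
    End Sub
End Module
```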

We’ve got all of the information translated properly – now we need to display it.  In tomorrow’s blog post, I’ll show one way of doing this using the WPF Designer.

‘Til next time,

