The ABC's of Bioinformatics

There are many ways in which computational ideas can be applied to biology; therefore, a career in bioinformatics will include a multitude of diverse paths. The path I have chosen has led me to the field of biological ontologies.

An ontology is a description of knowledge about a subject using a controlled vocabulary of terms and defined relationships between those terms. To build one, we define the key concepts and how they relate to one another. When data are assigned to categories in the ontology, those relationships become a powerful tool for exploring the data. Developing biological ontologies and the software that uses them has allowed me to pursue two interests at the same time: biology and knowledge representation.

Getting Started

I embarked on a bachelor's degree in biological and biochemical science in the United Kingdom and was lucky enough to participate in an exchange program with a university in the United States. During my year as an exchange student, I spent time in a parasitology lab. This exposed me to real bench work for the first time and forced me to use a computer for more than just e-mail. During this period I learned three important things:

- Biology was even more interesting than I had realized.

- Bench work was not for me.

- Computers aren't that scary, once you learn how to switch them on.

When I returned to England to complete my degree, I chose coursework that would further develop my fledgling skills in bioinformatics. Learning about sequence-similarity searches and protein folding whetted my appetite enough that I enrolled in a master's course in bioinformatics at the University of Manchester. This course exposed me to a couple of programming languages, Perl and C; some algorithms; and new (to me) ways of thinking, such as the object-oriented programming paradigm and using ontologies to describe knowledge. I stayed in the bioinformatics lab for my Ph.D., where my focus was on postgenome sequencing technologies.

The Best of Both Worlds

Although I was based in the biochemistry division, I also had an adviser in the computer science department. Because what I wanted to learn spanned two different departments, this became an important bridge. One of the biological issues that interested me was capturing, representing, and analyzing protein-interaction data. To get to grips with the actual data, I spent time in a lab performing yeast two-hybrid experiments. This lab experience was vital, because I learned what the requirements of the system were both from my own work and from other scientists in the lab.

To capture those requirements, I described a series of use cases, which can be thought of as discrete goals for users when they interact with the system. The approach I took was to create an object-oriented database using the programming language Java. Learning Java was especially fun for me, because it seemed to be the perfect introduction to object-oriented techniques. (I've gone on to use these techniques in other languages, such as Perl, since then.) The experience of mixing biology and computer science techniques has been fairly typical for me; I seem to reside somewhere at the interface between the two disciplines.

After finishing my Ph.D., I moved to the United States again, this time to work for a genomics company for 2 years. Entering industry without postdoc experience was unconventional, but it was a decision that I don't regret. There I gained a perspective on large-scale analysis in an industrial setting; in general, I learned a tremendous amount.

Last year, I moved to the Berkeley Drosophila Genome Project to work on the Sequence Ontology (SO), which is part of the Open Biological Ontologies (OBO) project (see box).

The OBO Project

The purpose of OBO is to provide the community with ontologies for shared use across different biological domains. The most famous is the gene ontology, in which categories are assigned to genes based on their cellular location, molecular function, and biological process. The gene ontology was a success because it allowed researchers to categorize genes quickly and because it is useful in many large-scale analyses (for example, microarray experiments).

Among other things, the gene ontology has allowed researchers to efficiently ask, "What kinds of genes are up-regulated under these conditions?" OBO projects run the gamut from describing biological sequence and cell type to development and anatomy. In order for ontology projects to join OBO, the following criteria must be fulfilled. They should

1. be open source and able to be used without constraint

2. use a common syntax

3. not overlap and compete with one another

4. have unique identifiers and definitions.

The SO Project

The goal of the SO project is to produce an ontology with which to label the parts of a genomic annotation and to describe the relations among them. Genomic annotations are the focal point of sequencing, bioinformatics analysis, and molecular biology. They are the means by which we attach what we know about a genome to its sequence. There are two issues that the SO addresses:

First, consider the ambiguity of biological terms. Often there are many ways to describe something--many dialects, or cases where one word can mean different things. Even when we think we are describing the same object, we can run into trouble. For example, "Does this coding sequence (CDS) contain a stop codon or not?" The SO unifies the language we use to describe biological sequence, so we can communicate about the same concepts in the same way.

Second, consider the relations among the terms. SO specifies the following relations: kind_of, part_of, and derived_from. For example, a promoter is a kind_of regulatory_region, and an exon is a part_of a transcript.
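To make the idea of typed relations concrete, here is a minimal Python sketch of a toy ontology stored as (term, relation, term) triples. The promoter and exon triples come from the examples above; the third triple, the names RELATIONS and related, and the storage format are illustrative assumptions, not how SO itself is stored or distributed.

```python
# Toy ontology as (subject, relation, object) triples.
# The first two triples follow the examples in the text; the third is an
# illustrative assumption added to show a chain of kind_of relations.
RELATIONS = [
    ("promoter", "kind_of", "regulatory_region"),
    ("exon", "part_of", "transcript"),
    ("regulatory_region", "kind_of", "sequence_feature"),  # assumed for illustration
]


def related(term, relation, relations=RELATIONS):
    """Return every term reachable from `term` by following `relation` edges."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for subject, rel, obj in relations:
            if subject == current and rel == relation and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


if __name__ == "__main__":
    print(related("promoter", "kind_of"))
    print(related("exon", "part_of"))
```

Because the lookup follows a relation repeatedly, a question asked about a general term can also reach data labeled with a more specific one.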

Practically speaking, labeling genomic data with SO terms means that software can understand the data. When visualization software encounters the term exon, it knows to draw a blue box if that exon is a part_of an mRNA, and a green box if that exon is a part_of an ncRNA; likewise, an annotation validation program can find the transcript that the exon is a part_of and check that their coordinates are consistent with one another. Labeling genomic data with SO makes it possible to ask many questions of the data. Using SO as part of a database schema ensures that such questions mean the same thing in different databases.
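The two uses just described, choosing a box color and checking coordinates, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Feature class, its fields, and the functions box_color and coordinates_consistent are hypothetical names, and "consistent" is assumed here to mean that the exon lies within its parent's start and end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """A hypothetical annotated feature labeled with an SO term."""
    so_term: str
    start: int
    end: int
    part_of: Optional["Feature"] = None  # the feature this one is a part_of


def box_color(exon):
    """Pick a drawing color from the SO term of the exon's parent."""
    parent = exon.part_of
    if parent is None:
        return None
    # Colors follow the example in the text; other parents get no color.
    return {"mRNA": "blue", "ncRNA": "green"}.get(parent.so_term)


def coordinates_consistent(feature):
    """Assumed check: a part must lie within the coordinates of its parent."""
    parent = feature.part_of
    if parent is None:
        return True
    return parent.start <= feature.start and feature.end <= parent.end


if __name__ == "__main__":
    mrna = Feature("mRNA", 100, 900)
    exon = Feature("exon", 150, 300, part_of=mrna)
    print(box_color(exon), coordinates_consistent(exon))
```

Neither function needs to know which database or group produced the annotation; both rely only on the shared SO labels and the part_of relation.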

Putting My Knowledge to Work

In my day-to-day work on SO, I use a variety of skills. I need a broad understanding of genomics and biology, and I need the ability to acquire new information quickly when I extend the ontology. In addition, I need to be aware of both computer science and philosophical arguments about ontologies and their design, so I continually read the latest information and have discussions with others in the field. I must also write code on a regular basis. These days I mostly use Perl, because I am dealing with text-based problems. There is an administration aspect to the job, as well, which involves coordinating with the user community, maintaining a Web site and mailing lists, and organizing user meetings.

My advice to someone who is interested in this field: Work to develop a sound knowledge of biology, and keep it real by learning necessary computer skills that complement the field. It is one thing to sit through lectures about relational databases--and quite another to actually sit down and design one for your data and implement it on your system. As the old Chinese proverb goes, "Read it--forget it; watch it--remember it; do it--understand it." Don't be abstract; make your knowledge concrete.

If you get the opportunity to travel as part of your scientific career, take it. Travel broadens the mind; you will be exposed to new ideas and interesting people. Going to conferences and presenting posters and talks is a good way to get new feedback on your work and develop channels of communication with other groups and individuals. Finally, sitting in front of a computer every day wreaks havoc on your body, so go outside occasionally and get some exercise.

Karen Eilbeck, Ph.D., is a programmer-analyst at the Berkeley Drosophila Genome Project at the University of California, Berkeley.
