
Bioinformatics

The application of computer science to biology, in particular to the analysis of genomic data, is called “bioinformatics”. This relatively new field helps researchers work out the meaning and function of genes by comparing different genomes.
Data obtained from EST (expressed sequence tag) banks, for example, can be analysed by dedicated programs. Scientists used to run these programs one after another, which took a lot of time.
Within MGE, these processing steps are now carried out automatically (for example, through the SAMS interface).

Automatic sequencers provide raw sequencing data that need to be processed in several steps:

1) Reducing the background noise
2) Removing the plasmid sequences (= stretches of sequence that are not part of the DNA fragment of interest that has been amplified)
3) Comparing the sequence, once it is clean, with those collected in worldwide databases
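
The first two steps can be illustrated with a minimal Python sketch. It is not the code used by SAMS or by any sequencer software: the function names, the quality threshold of 20 and the exact-match search for the plasmid flanks are all assumptions made for the example. Real tools align the read against the vector sequence and tolerate sequencing errors.

```python
def trim_low_quality(seq, quals, min_quality=20):
    """Drop bases from both ends while their quality is below min_quality.

    seq is the read, quals is a list with one quality value per base.
    The threshold of 20 is an illustrative assumption.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end]


def clip_vector(seq, left_flank, right_flank):
    """Keep only the insert located between two known plasmid flanks.

    Exact string matching is a simplification (hypothetical helper).
    """
    left = seq.find(left_flank)
    if left != -1:
        seq = seq[left + len(left_flank):]
    right = seq.find(right_flank)
    if right != -1:
        seq = seq[:right]
    return seq


if __name__ == "__main__":
    # Invented read: 2 noisy bases, left flank, insert, right flank, 2 noisy bases.
    raw = "AC" + "GAATTC" + "ATGGCCTTG" + "AAGCTT" + "GT"
    quals = [5, 10] + [30] * 21 + [8, 3]
    cleaned = clip_vector(trim_low_quality(raw, quals), "GAATTC", "AAGCTT")
    print(cleaned)
```

Only the cleaned insert is then passed on to the comparison step.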

The main worldwide databases are EMBL in Europe, GenBank (hosted by the NCBI) in the United States and DDBJ in Japan. They are updated almost every day and hold an enormous quantity of data.
This comparison is usually carried out with BLAST (Basic Local Alignment Search Tool). It is very useful, as it helps to determine whether your unknown sequence corresponds to a known (identified) sequence or whether it is new data, possibly coding for new proteins.
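There are different kinds of BLAST searches, depending on the type of sequence to be compared: blastn compares a nucleotide sequence with nucleotide sequences, blastp compares a protein with proteins, and translated searches such as blastx or tblastn are used when, for example, a transcript has to be compared with protein sequences.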
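
As an illustration, the choice of search type can be written as a small lookup table. The sketch below uses the standard program names from the NCBI BLAST suite (blastn, blastp, blastx, tblastn); the function `choose_blast_program` and the labels "nucleotide" and "protein" are hypothetical names chosen for this example.

```python
def choose_blast_program(query_type, database_type):
    """Return the BLAST program matching a query type and a database type.

    Both arguments are either "nucleotide" or "protein".
    """
    programs = {
        ("nucleotide", "nucleotide"): "blastn",
        ("protein", "protein"): "blastp",
        # The nucleotide query is translated, then compared with proteins.
        ("nucleotide", "protein"): "blastx",
        # The nucleotide database is translated, then compared with the protein.
        ("protein", "nucleotide"): "tblastn",
    }
    key = (query_type, database_type)
    if key not in programs:
        raise ValueError("unknown sequence types: %r" % (key,))
    return programs[key]


if __name__ == "__main__":
    # An EST (nucleotide) compared with a protein database.
    print(choose_blast_program("nucleotide", "protein"))
```

A fifth program, tblastx, translates both the query and the database; it is left out of the table because it shares its input types with blastn.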

SAMS is an interface to which EST sequences can be submitted and in which their analysis is automated.
An advantage of the SAMS interface is that thousands of sequences can be processed at the same time without having to stay in front of the computer.
In the end, each result is given a score and an E-value. The score reflects how closely the unknown sequence resembles a sequence present in the database, and the E-value estimates how many hits with a score at least that good would be expected by chance alone in a database of that size. If the E-value is low and the score high, the resemblance is unlikely to be due to chance, and the unknown sequence is very likely to correspond to the best hit obtained after comparison with the banks.
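
The link between the two values can be made concrete with the usual relation between a normalised (bit) score S' and the E-value, E = m × n × 2^(−S'), where m is the query length and n the database length. The sketch below is a simplified illustration: the function name is hypothetical, and real BLAST implementations use corrected "effective" lengths rather than the raw ones.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value for a given bit score: E = m * n * 2**(-S').

    Raw lengths are used here as a simplifying assumption.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same query and database, two different scores.
    for score in (40, 50):
        print(score, expected_hits(score, 300, 1_000_000))
```

Each additional bit of score halves the E-value, while a larger database raises it: the same score is less convincing when more sequences have been searched.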

In terms of biological questions, one interesting application of this interface is the study of environmental diversity. In a sample of seawater, we are only able to culture about 5 to 10% of the organisms that are present. One way of accessing this biodiversity is to look at known genes carried by the organisms in the seawater. (More specifically, markers such as the 16S or 18S ribosomal RNA are used very often: the total ribosomal RNAs are extracted from the seawater sample and amplified, then compared with the sequences present in the banks.)
This technique can be used, for example, to study the proportion of different types of larvae present in the seawater.
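
Once each marker sequence has been assigned to its best hit, estimating such proportions comes down to counting. The following sketch assumes a hypothetical dictionary mapping each sequence identifier to the group of its best hit; the identifiers and group names are invented for the example.

```python
from collections import Counter


def taxon_proportions(best_hits):
    """Return the fraction of sequences assigned to each group.

    best_hits maps a sequence identifier to the group of its best hit
    (hypothetical input format).
    """
    counts = Counter(best_hits.values())
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    hits = {
        "seq1": "copepod",
        "seq2": "copepod",
        "seq3": "bivalve",
        "seq4": "polychaete",
    }
    print(taxon_proportions(hits))
```

These figures describe the sequences, not directly the organisms: the number of ribosomal gene copies and the amplification step can both bias the counts.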

Things get more complicated when we start talking about proteins, as the resemblance between sequences can be much weaker than for DNA or RNA sequences, and is often too low to be detected simply by a BLAST analysis. Sequences are therefore compared using profiles or motifs: for example, a protein composed of a group of hydrophobic amino acids followed by a group of hydrophilic amino acids, and so on, is compared with another showing the same pattern (amino acids are the building blocks of proteins). To carry out these pairwise comparisons, more sophisticated algorithms are used, such as hidden Markov models, which are available inside work environments like SAMS or GenDB.
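
The idea of comparing patterns can be shown with a deliberately crude sketch. Each amino acid is classed as hydrophobic (H) or hydrophilic (P), and consecutive residues of the same class are merged into one block. The classification used here (A, C, F, I, L, M and V counted as hydrophobic, everything else as hydrophilic) is a simplifying assumption, the function names are hypothetical, and this is far simpler than the profile methods found in SAMS or GenDB.

```python
from itertools import groupby

# Simplified two-class scheme (assumption made for this example).
HYDROPHOBIC = set("ACFILMV")


def hydropathy_pattern(protein):
    """Reduce a protein to its succession of H and P blocks."""
    classes = ("H" if aa in HYDROPHOBIC else "P" for aa in protein.upper())
    return "".join(key for key, _ in groupby(classes))


def same_motif(protein_a, protein_b):
    """True when both proteins show the same succession of blocks."""
    return hydropathy_pattern(protein_a) == hydropathy_pattern(protein_b)


if __name__ == "__main__":
    # Two invented sequences with no residue in common at any position.
    print(hydropathy_pattern("LLVVKKDE"), hydropathy_pattern("AIMFRDNQ"))
    print(same_motif("LLVVKKDE", "AIMFRDNQ"))
```

The two invented sequences differ at every position, so a letter-by-letter comparison finds nothing, yet both reduce to a hydrophobic block followed by a hydrophilic one.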

It is the expert's role to analyse the results proposed by these programs and to check whether they are consistent. This part of the work is called genome annotation, and it is very important for strengthening the hypotheses put forward by this type of analysis.

Analyses carried out by computer, although very practical, are in silico analyses based on resemblance between sequences, and are therefore hypothetical to some extent. Biological experiments remain indispensable to really demonstrate the role of a specific gene or protein inside a living organism (in vivo), using molecular methods such as transformation or mutagenesis.

Bioinformatics is a set of high-throughput analysis techniques that emerged with the field of genomics and will evolve along with it. This tool is essential for analysing the enormous amount of data produced by genomic studies. It is already keeping pace with transcriptomics, and will follow proteomics and probably metabolomics.

Contributed by Stephanie Ries

All rights reserved © 2010, MGE