Publications by Cirad staff


Development of indexing compressed structure for analyzing a collection of similar genomes: application to rice

Agret C., Sarah G., Chateau A., Mancheron A., Ruiz M. 2017. Lille: CNRS, p. 12-13. Journées Seqbio 2017, 2017-11-06/2017-11-07, Lille (France).

As the cost of DNA sequencing decreases, high-throughput sequencing technologies are becoming accessible to more and more laboratories. Consequently, new issues emerge that require new algorithms, including tools for indexing and compressing thousands of genomes, as for example in the 3000 rice genomes project [1], in which we are particularly interested. Genomes can be considered as very large texts over a simple alphabet Σ = {A, C, G, T}. We can refer to the indexable dictionary problem, which consists in storing a set S ⊆ {0, ..., i, ..., m-1} of a universe U of size n, represented by a bit vector B(n) where B[i] = 1 ⟺ i ∈ S. The indexable dictionary problem supports two additional operations, rank_s(i) and select_s(i), for s ∈ {0, 1}. The function rank_s(i) returns the number of elements equal to s up to position i, and select_s(i) returns the position of the i-th occurrence of s. The indexing of complete genomes is an important stage in the exploration and understanding of data from living organisms. An efficient index should provide a quick answer to the following questions: How many times does a given pattern appear in the genome? What are the positions of a given pattern? What is the pattern length at the i-th position in the genome? The common way to index and compress one genome is to use the Burrows-Wheeler Transform (BWT) [2], with the FM-index [3] built on the BWT sequence to answer queries. To index several genomes against one reference genome, one may use MuGI [4]. To build its index, MuGI stores the reference in compact form (4 bits to encode a single character), a variant database, one bit vector for each variant, and an array kMA keeping information about each k-mer. This is a really interesting approach, but it requires a reference genome. We present a structure that proposes a solution to index and compress highly repetitive texts over a small alphabet using k-mers. k-mers are factors of length k in the considered sequences. We built a 4^k1 array, where k1 < k, and each entry, namely an ar
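The rank and select operations mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' structure: it uses a plain, uncompressed bit vector with a prefix-sum table, and the class name BitVector, the helper from_set, and the conventions chosen here (rank counts position i inclusively, select counts occurrences from 1) are assumptions made only for illustration.

```python
from itertools import accumulate


class BitVector:
    """Uncompressed bit vector with rank/select (illustrative, hypothetical class)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] holds the number of 1 bits in bits[0 .. i-1].
        self.prefix = [0] + list(accumulate(self.bits))

    @classmethod
    def from_set(cls, members, universe_size):
        """Build B with B[i] = 1 if and only if i is in the set."""
        return cls(1 if i in members else 0 for i in range(universe_size))

    def rank(self, s, i):
        """Number of bits equal to s in bits[0 .. i] (assumed inclusive of i)."""
        ones = self.prefix[i + 1]
        return ones if s == 1 else (i + 1) - ones

    def select(self, s, j):
        """Position of the j-th bit equal to s, with j counted from 1 (assumed)."""
        if j < 1 or j > self.rank(s, len(self.bits) - 1):
            raise ValueError("no such occurrence")
        # Binary search for the smallest position whose rank reaches j.
        lo, hi = 0, len(self.bits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(s, mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo


# Example: the set S = {1, 4, 5} in a universe of size 8.
bv = BitVector.from_set({1, 4, 5}, 8)
print(bv.bits)
print(bv.rank(1, 4), bv.select(1, 3))
```

Practical succinct structures answer rank in constant time with far less extra space than this prefix table, which stores one integer per bit; the sketch only fixes the semantics of the two operations.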

Associated documents

Conference paper

Cirad staff, authors of this publication: