FINAL REPORT
GOVERNMENT OF CHILE
CONICYT
FONDECYT
FONDECYT REGULAR PROJECT

Project number: 1-050493
Duration: 3 years
Year of execution: 2007
Principal investigator: Gonzalo Navarro Badino
RUT:
Address:
Phone:
E-mail: [email protected]
Reporting period: from 15/3/2005 to 14/3/2008

CONTENTS
(MARK THE APPROPRIATE BOX WITH AN X)
Final Report Form: included (X)
Publications: included (X)
Thesis summary (professional title/degree): included (X)
Information on inventions and patents: not included (X)
Other (specify)
International Cooperation Incentive report (if applicable)
Co-investigators' signatures X
Principal investigator's signature
Date: 14/3/2008
CONTENTS OF THE FINAL REPORT
I. FULFILLMENT OF THE OBJECTIVES SET OUT IN THE PROJECT. Mark the corresponding box with an X.
Columns: Objectives; Fulfillment (Total, Partial, No); Justification of partial fulfillment or non-fulfillment.
1. Development of new compressed indexes and variants offering relevant tradeoffs between the space used by the index and query time. X
2. Development of new techniques to handle compressed indexes on secondary memory. X
3. Development of new techniques to build compressed indexes in little space. X
4. Development of indexes that can be updated efficiently when the text collection changes. X
5. Development of new search techniques for complex patterns and approximate search on compressed indexes. X
6. Development of compressed index variants oriented to natural language, to compete with traditional compressed inverted indexes. X
7. Development of free prototypes of the compressed indexes developed, to demonstrate their practical usefulness and transfer the research to real applications. X
8. Development of free software for sequential search of simple and complex patterns in text compressed with the most popular Ziv-Lempel formats. X
9. Development of new techniques relevant to natural language, such as adaptive compression and compression of structured text. X
10. Development of algorithms and data structures to solve fundamental search problems on plain text, with applications to search in compressed text. X
Other aspect(s) you consider important in evaluating the fulfillment of the objectives set out in the original proposal or in the modifications authorized by the Councils.
II. RESULTS OBTAINED
Briefly describe the results obtained in the project in at most five letter-size, single-spaced pages. For each specific objective, describe or summarize the results. Relate the publications and/or manuscripts submitted for publication to the specific objectives. Include in annexes any supporting information you consider pertinent and necessary for the evaluation.
(1) Development of new compressed indexes and variants offering relevant tradeoffs between the space used by the index and query time.
We obtained a new index, called Huffman-FM-index, considerably simpler than others and competitive in practice. In particular, it is robust to large alphabets (as in natural language), a weak point of other alternatives. The results were published in [GMNS05]. We later worked on an alternative encoding that performs better on certain types of text, which was published in [PGNS06]. The complete results were published in [GNPSM06] (ISI journal).
We also obtained a more compact variant of the LZ-index, which reports occurrences faster than alternative indexes. The new version takes less than half the space of the current one. This work was published in [ANS06] (ISI). Finally, we obtained an intermediate version that takes more space than this compact variant but is faster, and that is also faster than the original LZ-index while taking less space than it. This work was published in [AN07].
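As an illustration of the kind of query that FM-index variants answer, the following is a minimal sketch of backward search for counting pattern occurrences. It is a plain, uncompressed FM-index and not the Huffman-shaped structure of [GMNS05]: the function names (`bwt`, `build_fm`, `count`), the quadratic construction by sorting rotations, and the explicit occurrence tables are all assumptions made for readability.

```python
def bwt(text):
    """Burrows-Wheeler transform by sorting all rotations (illustration only)."""
    s = text + "\0"  # sentinel, assumed smaller than every text symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def build_fm(text):
    """Build the C array and uncompressed occurrence tables over the BWT."""
    last = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of symbols in the text that are strictly smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)


def count(fm, pattern):
    """Count occurrences of pattern by backward search over the BWT."""
    C, occ, n = fm
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count(build_fm("abracadabra"), "abra")` counts the two occurrences of "abra". A practical index replaces the occurrence tables with compressed rank structures, which is where the alphabet size matters.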
(2) Development of new techniques to handle compressed indexes on secondary memory.
We obtained a Ziv-Lempel-type index that is very competitive on secondary memory and outperforms other existing indexes in several cases. The result was published in [AN07'].
We also obtained a secondary-memory version of an index based on the Burrows-Wheeler transform, which improves locality of reference in order to perform well on secondary memory. The result was published in [GN07].
(3) Development of new techniques to build compressed indexes in little space.
We obtained a technique to build the LZ-index in space very close to that of the final structure, that is, 4 times the k-th order entropy of the text. This was the first compressed index for which a result of this kind was achieved; it was published in [AN05] (ISI).
In addition, all the results mentioned in point 4 were adapted (in the same articles mentioned there) to build indexes in space similar to that of the final structure. In general, we showed how to use a dynamic index to build a static one.
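The space bounds above are stated in terms of the empirical k-th order entropy of the text. As a hedged illustration of that measure (the standard textbook definition, not code from the project), the sketch below computes the zero-order entropy H0 and the k-th order entropy Hk in bits per symbol; the function names `h0` and `hk` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Empirical zero-order entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(seq).values())


def hk(text, k):
    """Empirical k-th order entropy: weighted H0 of the symbols that
    follow each distinct context of length k."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)
```

For a text of n symbols, n * hk(text, k) is the number of bits that a statistical compressor using contexts of length k would aim for, so a structure of "4 times the k-th order entropy" is within a constant factor of that target.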
(4) Development of indexes that can be updated efficiently when the text collection changes.
We obtained a representation of bit sequences in which bits can be inserted and deleted dynamically, which, combined with standard techniques, yields a dynamic compressed index whose space is asymptotically equal to the zero-order entropy of the text. This result was published in [MN06] (ISI). We later showed that an adaptation of the technique achieves space asymptotically equal to the k-th order entropy of the text. This result was published in [MN07] and is, within the k-th order entropy model, the best possible result in terms of space. These results were published in journal form in [MN08].
Finally, the time was improved by an O(log log n) factor, which is the best result to date and can be improved only by a very narrow margin because of existing lower bounds. The result was published in [GN08].
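To make the operations on a dynamic bit sequence concrete, the following is a minimal sketch, assuming a simple list-of-blocks layout, of a bit vector that supports insert, delete, access and rank. It is not the structure of [MN06] or [GN08]: those keep the blocks compressed and under a balanced tree to obtain logarithmic-type times, whereas this hypothetical `DynamicBitVector` scans the blocks linearly.

```python
class DynamicBitVector:
    """Bits kept in small blocks that are split when they grow too large."""

    def __init__(self, block_size=64):
        self.block_size = block_size
        self.blocks = [[]]  # each block is a list of 0/1 values

    def __len__(self):
        return sum(len(b) for b in self.blocks)

    def _locate(self, pos, inserting):
        """Map a global position to (block index, offset inside the block)."""
        if pos < 0:
            raise IndexError("position out of range")
        for b, block in enumerate(self.blocks):
            limit = len(block) + 1 if inserting else len(block)
            if pos < limit:
                return b, pos
            pos -= len(block)
        raise IndexError("position out of range")

    def insert(self, pos, bit):
        b, off = self._locate(pos, inserting=True)
        block = self.blocks[b]
        block.insert(off, bit)
        if len(block) > 2 * self.block_size:  # split an overfull block
            half = len(block) // 2
            self.blocks[b:b + 1] = [block[:half], block[half:]]

    def delete(self, pos):
        b, off = self._locate(pos, inserting=False)
        del self.blocks[b][off]
        if not self.blocks[b] and len(self.blocks) > 1:
            del self.blocks[b]  # drop blocks that became empty

    def access(self, pos):
        b, off = self._locate(pos, inserting=False)
        return self.blocks[b][off]

    def rank1(self, pos):
        """Number of 1 bits among the first pos positions."""
        ones = 0
        for block in self.blocks:
            if pos <= len(block):
                return ones + sum(block[:pos])
            ones += sum(block)
            pos -= len(block)
        raise IndexError("position out of range")
```

Replacing the linear scan with a balanced tree whose nodes store bit and one counts is the usual step towards the polylogarithmic update and query times discussed above.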
(5) Development of new search techniques for complex patterns and approximate search on compressed indexes.
We obtained a technique for approximate search on the ILZI, a variant of the LZ-index. The result was published in [RNO07] and is one of the few practical results in the area.
(6) Development of compressed index variants oriented to natural language, to compete with traditional compressed inverted indexes.
We worked on combining compressed indexes with natural language preprocessors, obtaining very promising results, which were published in [FNP08]. We also developed variants that directly adapt ideas from compressed indexes to natural language. These results have been submitted for publication, but notification of acceptance or rejection has not yet been received [BFLN08]. We have also worked on a word-oriented compressed suffix array; the article is in preparation.
(7) Development of free prototypes of the compressed indexes developed, to demonstrate their practical usefulness and transfer the research to real applications.
Jointly with the University of Pisa, we developed the PizzaChili site (mirrors http://pizzachili.dcc.uchile.cl and http://pizzachili.di.unipi.it). The site is a repository of software, text collections and performance measurement tools. It contains implementations of all the relevant indexes under a common interface, which allows researchers, students and practitioners to use them for teaching, research or industrial applications. A GPL license was used to allow free use of all the software. Work on this site has been funded from several sources, one of them this Fondecyt project.
(8) Development of free software for sequential search of simple and complex patterns in text compressed with the most popular Ziv-Lempel formats.
We completed the development of the Lzgrep software, which also supports approximate search. The software is free, can search the gzip and compress formats, and can be obtained from http://www.dcc.uchile.cl/gnavarro/software.
(9) Development of new techniques relevant to natural language, such as adaptive compression and compression of structured text.
We developed a method to compress growing collections of natural language text that does not degrade the compression ratio when the text changes its distribution. The result was published in [BFNP05] (ISI). We then worked on an adaptive code that nevertheless does not modify the codes already assigned to words, so as to ease direct search in the compressed text. This was published in [BFNP06] (ISI).
Regarding compression of structured text, we worked on a technique that models separately the texts under different tags. The result was published in [ANF07] (ISI journal).
(10) Development of algorithms and data structures to solve fundamental search problems on plain text, with applications to search in compressed text.
We obtained new results on approximate search in plain text that allow errors with a bounded probability. This makes it possible to break the existing lower bounds for the problem (which hold when no error is allowed). The results were published in [KNT08], and can easily be extended to search in compressed text, even indexed search.
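For reference, the exact problem whose lower bounds are mentioned above can be solved with the classical dynamic-programming scan sketched below, which reports every text position where the pattern ends with at most k edit errors. This is only the error-free baseline, not the probabilistic algorithm of [KNT08]; the function name `approximate_occurrences` is hypothetical.

```python
def approximate_occurrences(pattern, text, k):
    """Return the end positions j (number of text symbols consumed) such that
    some substring of text ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and a suffix of the
    # text read so far; col[0] stays 0 because a match may start anywhere.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        for i in range(1, m + 1):
            cur = col[i]  # value of row i in the previous column
            cost = 0 if pattern[i - 1] == tc else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         cur + 1,           # extra symbol in the text
                         col[i - 1] + 1)    # missing symbol in the text
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

The scan inspects every text symbol and costs time proportional to the product of the pattern and text lengths; tolerating a small probability of error, as described above, is what makes it possible to do less work than any exact algorithm.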
-> Publications (included here to ease the reading of point II; full details are given in point III).
[GMNS05] Szymon Grabowski, Veli Mäkinen, Gonzalo Navarro and Alejandro Salinger.
A Simple Alphabet-Independent FM-Index.
Proc. PSC'05, pages 230-244.
[PGNS06] Rafal Przywarski, Szymon Grabowski, Gonzalo Navarro and Alejandro Salinger.
FM-KZ: An Even Simpler Alphabet-Independent FM-Index.
Proc. PSC'06, pages 226-240.
[GNPSM06] Szymon Grabowski, Gonzalo Navarro, Rafal Przywarski, Alejandro Salinger and Veli Mäkinen.
A Simple Alphabet-Independent FM-Index.
International Journal of Foundations of Computer Science (IJFCS) 17(6):1365-1384, 2006.
[ANS06] Diego Arroyuelo, Gonzalo Navarro and Kunihiko Sadakane.
Reducing the Space Requirement of LZ-index.
Proc. CPM'06, pages 319-330. LNCS 4009.
[AN07] Diego Arroyuelo and Gonzalo Navarro.
Smaller and Faster Lempel-Ziv Indices.
Proc. IWOCA'07, pages 11-20. College Publications, UK.
[AN07'] Diego Arroyuelo and Gonzalo Navarro.
A Lempel-Ziv Text Index on Secondary Storage.
Proc. CPM'07, pages 83-94. LNCS 4580.
[GN07] Rodrigo González and Gonzalo Navarro.
A Compressed Text Index on Secondary Memory.
Proc. IWOCA'07, pages 80-91. College Publications, UK.
[AN05] Diego Arroyuelo and Gonzalo Navarro.
Space-efficient Construction of LZ-index.
Proc. ISAAC'05, pages 1143-1152. LNCS 3827.
[MN06] Veli Mäkinen and Gonzalo Navarro.
Dynamic Entropy-Compressed Sequences and Full-Text Indexes.
Proc. CPM'06, pages 307-318. LNCS 4009.
[MN07] Veli Mäkinen and Gonzalo Navarro.
Implicit Compression Boosting with Applications to Self-Indexing.
Proc. SPIRE'07, pages 214-226. LNCS 4726.
[MN08] Veli Mäkinen and Gonzalo Navarro.
Dynamic Entropy-Compressed Sequences and Full-Text Indexes.
ACM Transactions on Algorithms. To appear.
[GN08] Rodrigo González and Gonzalo Navarro.
Improved Dynamic Rank-Select Entropy-Bound Structures.
Proc. LATIN'08, to appear. LNCS.
[RNO07] Luís Russo, Gonzalo Navarro and Arlindo Oliveira.
Approximate String Matching with Lempel-Ziv Compressed Indexes.
Proc. SPIRE'07, pages 264-275. LNCS 4726.
[BFLN08] Nieves Brisaboa, Antonio Fariña, Susana Ladra and Gonzalo Navarro.
Reorganizing Compressed Text.
Submitted to ACM SIGIR'08.
[BFNP05] Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá.
Compressing Dynamic Text Collections via Phrase-Based Coding.
Proc. ECDL'05, pages 462-474. LNCS 3652.
[BFNP06] Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá.
Improving Semistatic Compression via Pair-Based Coding.
Proc. PSI'06, pages 124-134. LNCS 4378.
[ANF07] Joaquín Adiego, Gonzalo Navarro and Pablo de la Fuente.
Using Structural Contexts to Compress Semistructured Text Collections.
Information Processing and Management (IPM) 43:769-790, 2007.
[KNT08] Marcos Kiwi, Gonzalo Navarro and Claudio Telha.
On-line Approximate String Matching with Bounded Errors.
Proc. CPM'08, to appear. LNCS.
III. PRODUCTS GENERATED BY THE PROJECT
This section must include every document or material whose content corresponds substantially to the objectives of the project being reported and in which the FONDECYT project number is indicated. Follow the formats provided for each type of product. Attach a copy of any document not previously sent to FONDECYT. Use as many additional sheets as necessary.
If you have an International Cooperation Incentive project, mark with (*) the publications generated as its product, listed after those corresponding to the Regular project.
1. Articles in national or foreign scientific journals with an Editorial Board.
Mark the applicable option with an "X". For works In Press/Accepted/Submitted, attach a copy of the acceptance or submission letter.
Author(s): Szymon Grabowski, Gonzalo Navarro, Rafal Przywarski, Alejandro Salinger and Veli Mäkinen
Title (original language): A Simple Alphabet-Independent FM-Index
Full name of the journal: International Journal of Foundations of Computer Science (IJFCS) (ISI)
Bibliographic reference: Year: 2006, Vol. 17, No. 6, Pages 1365-1384
Publication status to date (Published / In press / Accepted / Submitted / In preparation):
Other funding sources, if any:

Author(s): Joaquín Adiego, Gonzalo Navarro and Pablo de la Fuente
Title (original language): Using Structural Contexts to Compress Semistructured Text Collections
Full name of the journal: Information Processing and Management (IPM)
Bibliographic reference: Year: 2007, Vol. 43, Pages 769-790
Publication status to date (Published / In press / Accepted / Submitted / In preparation):
Other funding sources, if any:

Author(s): Veli Mäkinen and Gonzalo Navarro
Title (original language): Dynamic Entropy-Compressed Sequences and Full-Text Indexes
Full name of the journal: ACM Transactions on Algorithms
Bibliographic reference: Year: , Vol. , No. , Pages
Publication status to date (Published / In press / Accepted / Submitted / In preparation):
Other funding sources, if any:
2. Other publications/products.
Author(s): Szymon Grabowski, Gonzalo Navarro, Veli Mäkinen and Alejandro Salinger
Title: A Simple Alphabet-Independent FM-Index
Type of publication or product (mark the applicable option with an "X"): Monograph / Seminar, Workshop, Course / Technical Report / Book / Software / Book Chapter / Patent / Map / Art Exhibition / Other. Specify: Full-length paper in conference proceedings
Editor(s) (books or book chapters): Jan Holub and Milan Simanek (editors), Proceedings of the 10th Prague Stringology Conference (PSC 2005), pages 230-244
Publisher/Organization: Czech Technical University in Prague
Place and date of publication: Country: Czech Republic; City: Prague; Date: 2005

Author(s): Rafal Przywarski, Szymon Grabowski, Gonzalo Navarro and Alejandro Salinger
Title: FM-KZ: An Even Simpler Alphabet-Independent FM-Index
Editor(s) (books or book chapters): Jan Holub and Bořivoj Melichar (editors), Proceedings of the 11th Prague Stringology Conference (PSC 2006), pages 226-240
Publisher/Organization: Czech Technical University in Prague
Place and date of publication: Country: Czech Republic; City: Prague; Date: 2006

Author(s): Diego Arroyuelo, Gonzalo Navarro and Kunihiko Sadakane
Title: Reducing the Space Requirement of LZ-index
Editor(s) (books or book chapters): Moshe Lewenstein and Gabriel Valiente (editors), Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM 2006), pages 319-330
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 4009 (ISI)
Place and date of publication: Country: Germany; City: Berlin; Date: 2006

Author(s): Diego Arroyuelo and Gonzalo Navarro
Title: Smaller and Faster Lempel-Ziv Indices
Editor(s) (books or book chapters): Bill Smyth and Ljiljana Brankovic (editors), Proceedings of the 18th International Workshop on Combinatorial Algorithms (IWOCA'07), pages 11-20
Publisher/Organization: College Publications, UK
Place and date of publication: Country: United Kingdom; City: Cambridge; Date: 2007

Author(s): Diego Arroyuelo and Gonzalo Navarro
Title: A Lempel-Ziv Text Index on Secondary Storage
Editor(s) (books or book chapters): Bin Ma and Kaizhong Zhang (editors), Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM 2007), pages 83-94
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 4580
Place and date of publication: Country: Germany; City: Berlin; Date: 2007

Author(s): Rodrigo González and Gonzalo Navarro
Title: A Compressed Text Index on Secondary Memory
Editor(s) (books or book chapters): Bill Smyth and Ljiljana Brankovic (editors), Proceedings of the 18th International Workshop on Combinatorial Algorithms (IWOCA'07), pages 80-91
Publisher/Organization: College Publications, UK
Place and date of publication: Country: United Kingdom; City: Cambridge; Date: 2007

Author(s): Diego Arroyuelo and Gonzalo Navarro
Title: Space-efficient Construction of LZ-index
Editor(s) (books or book chapters): X. Deng and D.-Z. Du (editors), Proceedings of the 16th International Symposium on Algorithms and Computation (ISAAC 2005), pages 1143-1152
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 3827 (ISI)
Place and date of publication: Country: Germany; City: Berlin; Date: 2005

Author(s): Veli Mäkinen and Gonzalo Navarro
Title: Dynamic Entropy-Compressed Sequences and Full-Text Indexes
Editor(s) (books or book chapters): Moshe Lewenstein and Gabriel Valiente (editors), Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM 2006), pages 307-318
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 4009 (ISI)
Place and date of publication: Country: Germany; City: Berlin; Date: 2006

Author(s): Veli Mäkinen and Gonzalo Navarro
Title: Implicit Compression Boosting with Applications to Self-Indexing
Editor(s) (books or book chapters): Nivio Ziviani and Ricardo Baeza-Yates (editors), Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE 2007), pages 214-226
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 4726
Place and date of publication: Country: Germany; City: Berlin; Date: 2007

Author(s): Rodrigo González and Gonzalo Navarro
Title: Improved Dynamic Rank-Select Entropy-Bound Structures
Editor(s) (books or book chapters): Eduardo Laber (editor), Proceedings of the 8th International Symposium on Latin American Theoretical Informatics (LATIN 2008), to appear
Publisher/Organization: Springer, Lecture Notes in Computer Science series
Place and date of publication: Country: Germany; City: Berlin; Date: 2008

Author(s): Luís Russo, Gonzalo Navarro and Arlindo Oliveira
Title: Approximate String Matching with Lempel-Ziv Compressed Indexes
Type of publication or product: Other. Specify: Full-length paper in conference proceedings
Editor(s) (books or book chapters): Nivio Ziviani and Ricardo Baeza-Yates (editors), Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE 2007), pages 264-275
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 4726
Place and date of publication: Country: Germany; City: Berlin; Date: 2007

Author(s): Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá
Title: Compressing Dynamic Text Collections via Phrase-Based Coding
Editor(s) (books or book chapters): A. Rauber, S. Christodulakis and A. Min Tjoa (editors), Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), pages 462-474
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 3652 (ISI)
Place and date of publication: Country: Germany; City: Berlin; Date: 2005

Author(s): Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá
Title: Improving Semistatic Compression via Pair-Based Coding
Editor(s) (books or book chapters): Proceedings of the 5th International Conference on Perspectives of System Informatics (PSI 2006), pages 124-134
Publisher/Organization: Springer, Lecture Notes in Computer Science series, volume 4378 (ISI)
Place and date of publication: Country: Germany; City: Berlin; Date: 2005

Author(s): Marcos Kiwi, Gonzalo Navarro and Claudio Telha
Title: On-line Approximate String Matching with Bounded Errors
Editor(s) (books or book chapters): Paolo Ferragina and Gad Landau (editors), Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM 2008), to appear
Publisher/Organization: Springer, Lecture Notes in Computer Science series
Place and date of publication: Country: Germany; City: Berlin; Date: 2008

Author(s): Paolo Ferragina and Gonzalo Navarro
Title: PizzaChili site
Type of publication or product: Software (X)
Publisher/Organization: University of Pisa and University of Chile
Mirrors: http://pizzachili.dcc.uchile.cl and http://pizzachili.di.unipi.it
Date: 2005

Author(s): Gonzalo Navarro, Marco Mora, Marcos Avendaño
Title: LZgrep software
Type of publication or product: Software (X)
Publisher/Organization: Universidad de Chile
http://www.dcc.uchile.cl/gnavarro/software
Date: 2005

Author(s): Gonzalo Navarro
Title: Estructuras de Datos Compactas (Compact Data Structures)
Type of publication or product: Seminar / Workshop / Course (X)
Invited tutorial given at the Encuentro Mexicano de Computación 2007, Morelia, Mexico.
Publisher/Organization: Universidad de Chile
http://www.dcc.uchile.cl/gnavarro/tutorial.pdf
Date: 2007
3. Presentations at National and International Congresses. Attach a copy of the abstract or text of the presentation and of the cover of the book of abstracts, if not previously sent.
We preferred to use item 2 for full-length publications in conference proceedings, since its fields fit them better.
Author(s):
Name of the congress:
Place and date: Country: ; City: ; Date:
4. Theses and/or undergraduate dissertations in progress and/or completed within the project. Attach a copy of any summary not previously reported and certification of approval, if applicable.
Thesis title: Búsqueda Aproximada Permitiendo Errores (Approximate Search Allowing Errors)
Student: Claudio Telha Cornejo
Advisors: Marcos Kiwi and Gonzalo Navarro
Title/Degree: Civil Engineer in Mathematics, Civil Engineer in Computer Science, and Master of Science in Computer Science
Institution, Faculty, Department: Universidad de Chile, Facultad de Ciencias Físicas y Matemáticas, Departments of Mathematics and of Computer Science
Place: Country: Chile; City: Santiago
Thesis status: Completed (X)
Start date: March 2005. End date: August 2007

Thesis title: Estructuras Comprimidas para Grafos de la Web (Compressed Structures for Web Graphs)
Student: Francisco Claude Faust
Advisor: Gonzalo Navarro
Title/Degree: Master of Science in Computer Science
Institution, Faculty, Department: Universidad de Chile, Facultad de Ciencias Físicas y Matemáticas, Department of Computer Science
Thesis status: In progress (X)
Start date: January 2006. End date: May 2008 (expected)

Thesis title: Combinando Indexación y Compresión en Texto Dinámico Semi-Estructurado (Combining Indexing and Compression on Dynamic Semi-Structured Text)
Student: Felipe Sologuren Gutierrez
Advisors: Gonzalo Navarro and Benjamin Piwowarski
Title/Degree: Civil Engineer in Computer Science
Institution, Faculty, Department: Universidad de Chile, Facultad de Ciencias Físicas y Matemáticas, Department of Computer Science
Start date: July 2007. End date: July 2008 (expected)

Thesis title: Análisis del Uso de q-Gramas y q-Samples de ADN (Analysis of the Use of DNA q-Grams and q-Samples)
Student: Nicolás Ozimica Gacitua
Advisor: Gonzalo Navarro
Title/Degree: Civil Engineer in Computer Science
Institution, Faculty, Department: Universidad de Chile, Facultad de Ciencias Físicas y Matemáticas, Department of Computer Science
Start date: July 2007. End date: September 2008 (expected)
IV. HIGHLIGHT OTHER ACHIEVEMENTS OF THE PROJECT, SUCH AS:
• Research stays.
• Training of human resources other than the thesis students already reported.
• Dissemination and/or outreach activities on the subject of the project.
• Any other achievement not covered in the previous items that you wish to highlight.
In addition to the official undergraduate and graduate thesis students, the doctoral students Diego Arroyuelo and Rodrigo González collaborated very closely with the project and were authorized to receive funding to attend conferences and make research visits. The results of that collaboration are reflected in the number of project articles on which they appear as coauthors. Fondecyt's support was crucial in enabling long research stays (typically one month) with the most important researchers in the area. Diego Arroyuelo visited Kyushu University (Japan), Università di Pisa (Italy), University of Waterloo (Canada) and University of Melbourne (Australia), while Rodrigo González visited the University of Waterloo. These stays and conference attendances are invaluable for their training, and were complemented by several other funding sources.
I also find it relevant, in terms of training, to note where some of the project's students have gone. Claudio Telha has been pursuing his doctorate at MIT since late 2007. Francisco Claude has been accepted to pursue his doctorate at Waterloo starting in September 2008, with a substantial fellowship in addition to the teaching and research assistantship. Diego Arroyuelo has postdoctoral offers from Kyushu, Waterloo, Melbourne, and Coruña.
Finally, I highlight one dissemination activity. The 8-hour tutorial "Estructuras de Datos Compactas" (Compact Data Structures), which I gave by invitation at the Encuentro Mexicano de Computación 2007 in Morelia, Mexico, introduced attendees to the project's area of interest, seeking to attract potential collaborators and students from the region. The tutorial had about 15 attendees. The aim was to use the invitation to disseminate the subject and some of the project's results. It is somewhat early to assess its impact, but two students from that tutorial have already applied to the Master's program in Computer Science at our Department.
V. RESUMEN
Describa en forma precisa y breve el tópico general del proyecto, sus metas y objetivos y los resultados alcanzados.
Utilice un lenguaje apropiado para la comprensión del público no especialista en el tema. Esta información podrá ser
difundida. (No debe exceder este espacio en fuente Verdana 9)
El tópico general del proyecto fue explotar la relación entre compresión e indexación de texto. Desde hace una década
se ha observado que, lejos de representar objetivos contrapuestos, la compresión y la indexación están íntimamente
relacionadas, pues en cierto sentido un índice es una representación compacta y buscable del texto. Esto ha llevado a la
invención de los "autoíndices", estructuras de datos cuyo tamaño es proporcional al del texto comprimido, pueden
reemplazar al texto (de la misma forma que un archivo comprimido puede reemplazar al original), pero además permiten
búsqueda indexada en él (lo que tradicionalmente requería un índice aparte, en general sumamente costoso en
memoria). En el proyecto se plantearon varios objetivos alrededor de este objetivo general. En líneas generales,
podemos agrupar los objetivos en las siguientes líneas: (1) Investigar en distintos aspectos de autoíndices (objetivos
específicos 1-7), (2) Investigar en aplicaciones específicas para lenguaje natural (objetivos 6 y 9), (3) Investigar en
búsqueda secuencia¡ en texto comprimido (objetivos 8 y 10). La línea principal es la (1), mientras que la (2) busca
explotar un caso particular de mucho interés práctico, y la (3) se refiere a búsqueda sin índices, cuyos resultados
siempre resultan útiles para la búsqueda indexada.
Con respecto a autoíndices, el proyecto obtuvo una serie de resultados muy relevantes. Cuando se comenzó, existían
ya varios autoindices, pero eran muy primitivos con respecto a la funcionalidad que soportaban. Esto reducía
considerablemente su aplicabilidad práctica. Este proyecto obtuvo significativos resultados sobre aspectos como
agregarles dinamismo (es decir, poder modificar un índice comprimido cuando cambia la colección de texto),
construcción en espacio reducido (y no requerir primero construir el índice clásico para luego comprimirlo, lo que es muy
mpráctico), índices que funcionen en memoria secundaria (algo en que los índices existentes funcionaban muy mal), y
complex searches on compressed indexes (indexes typically searched only for simple patterns, which is of little interest in applications such as computational biology). We worked on two types of indexes: one that compresses using methods of the Ziv-Lempel family, and another based on the Burrows-Wheeler transform. Each has its merits.
In theoretical terms, some of the results obtained are extremely relevant, as attested by their publication in ACM TALG, one of the two most relevant journals on algorithms. Our results are currently the best known, and some existing lower bounds show that they cannot be improved much further. Likewise, we have not lost sight of the relation between theory and practice, a frequent problem in this area. We have developed a Web site with test collections, ready-made self-index implementations, and performance measurement tools, which we believe will be a very useful resource for education, research, and industrial use of the self-indexes that are most successful in practice. This site has recorded about 500 accesses from researchers and students.
Regarding the application to natural language, we obtained several relevant results on compression of semistructured text, compression of text that can be searched efficiently, and the combination of self-indexes with natural language. Some of our most promising results are in their early stages. Our preliminary results indicate that a suitable combination of self-indexes with natural language compression techniques can compete successfully with inverted indexes, which have reigned for decades as the best techniques for compressing natural language. This line will continue to be developed in the next Fondecyt project.
Finally, we obtained some results of interest on sequential search. The first is the best existing software for searching directly in files compressed with Lempel-Ziv, faster than decompressing and then searching. The second is a probabilistic algorithm for approximate search that we believe may find application in the future to searching compressed text.
VI.- REPORT ON THE INTERNATIONAL COOPERATION INCENTIVE PROJECT
INTERNATIONAL COOPERATION INCENTIVE PROJECT NUMBER
FONDECYT REGULAR PROJECT NUMBER
PRINCIPAL INVESTIGATOR
SIGNATURE
SUBMISSION DATE
REPORTING PERIOD
FROM
TO
NAME OF FOREIGN COLLABORATOR
CURRENT INSTITUTIONAL AFFILIATION
DATES OF STAY
FROM
TO
Describe the activities carried out and the results obtained. Highlight their contribution to achieving the objectives of the Regular project. If relevant, indicate the joint publications generated, referring to what was reported in item III of the progress/final report. Add any necessary annexes.
International Journal of Foundations of Computer Science
© World Scientific Publishing Company
A SIMPLE ALPHABET-INDEPENDENT FM-INDEX*
SZYMON GRABOWSKI
Computer Engineering Department, Technical University of Łódź
Al. Politechniki 11, 90-924 Łódź, Poland
sgrabow@zly.kis.p.lodz.pl
GONZALO NAVARRO
Department of Computer Science, University of Chile
Blanco Encalada 2120, 3er piso, Santiago, Chile
gnavarro@dcc.uchile.cl
RAFAŁ PRZYWARSKI
Computer Engineering Department, Technical University of Łódź
Al. Politechniki 11, 90-924 Łódź, Poland
rafal.przywarski@svensson.com.pl
ALEJANDRO SALINGER
David R. Cheriton School of Computer Science, University of Waterloo
200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
asalinger@cs.uwaterloo.ca
VELI MÄKINEN
Department of Computer Science, University of Helsinki
P. O. Box 68 (Gustaf Hällströmin katu 2b), FIN-00014 Helsinki, Finland
vmakinen@cs.helsinki.fi
Received (received date)
Revised (revised date)
Communicated by Editor's name
* Earlier partial versions of this work appeared in [7, 9, 21].
We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, $\sigma$, present in that structure. On a text of length $n$ with zero-order entropy $H_0$, our index needs $O(n(H_0 + 1))$ bits of space, without any significant dependence on $\sigma$. The average search time for a pattern of length $m$ is $O(m(H_0 + 1))$, under reasonable assumptions. Each position of a text occurrence can be located in worst case time $O((H_0 + 1)\log n)$, while any text substring of length $L$ can be retrieved in $O((H_0 + 1)L)$ average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.
1. Introduction
A full-text index is a data structure that enables one to determine the $occ$ occurrences of a short pattern $P = p_1 p_2 \ldots p_m$ in a large text $T = t_1 t_2 \ldots t_n$ without the need of scanning the whole text $T$. Text and pattern are sequences of characters over an alphabet $\Sigma$ of size $\sigma$. In practice one wants to know not only the value $occ$, i.e., how many times the pattern appears in the text (a counting query) but also the text positions of those $occ$ occurrences (a locating query), and usually also a text context around them (a displaying query).
A classic example of a full-text index is the suffix tree [24], which achieves $O(m)$ and $O(m + occ)$ time complexities for counting and locating queries, respectively. Unfortunately, a suffix tree requires $O(n \log n)$ bits of space (footnote a), and also the constant factor is large. A smaller space complexity factor is achieved by the suffix array [15], where term $m$ in the time complexities becomes $m \log n$ or $m + \log n$ depending on the variant. Still the space usage is high and may rule out the structure from some applications, for example in computational biology.
The large space requirement of traditional full-text indexes has raised a natural interest in succinct full-text indexes that achieve good tradeoffs between search time and space complexity [3, 5, 10, 11, 12, 13, 16, 18, 20, 23]. A truly exciting perspective originated in the work of Ferragina and Manzini [3]: They showed that a full-text index may allow discarding the original text, as it contains enough information to recover the text and even access any arbitrary substring of it. We denote a structure with such a property a self-index.
The FM-index of Ferragina and Manzini [3], in addition, had a space complexity proportional to $H_k$, the $k$th order (empirical) entropy of $T$. The space complexity, however, contains an exponential dependence on the alphabet size $\sigma$. A dependence on $\sigma$ also appears in the time used to solve a locating or displaying query. Such weaknesses make the original FM-index appealing only for texts with very small alphabets, such as DNA.
More precisely, the FM-index needs up to $5H_k n + O\left((\sigma\log\sigma + \log\log n)\frac{n}{\log n} + n^{\gamma}\sigma^{\sigma+1}\right)$ bits of space, where $0 < \gamma < 1$. The time needed to solve a counting query is just $O(m)$. The text position of each occurrence can be located in $O(\sigma \log^{1+\varepsilon} n)$ time, for some $0 < \varepsilon < 1$ that shows up in the sublinear terms of the space complexity. Finally, the time needed to display a text substring of length $L$ is $O(\sigma(L + \log^{1+\varepsilon} n))$. The last operation is important not only to show a text context around each occurrence, but also because a self-index replaces the text and hence it must provide the functionality of retrieving any desired text substring.
Footnote a: By log we mean $\log_2$ in this paper.
This alphabet dependence is eliminated in a practical implementation of the FM-index [4], at the price of not achieving the optimal search time anymore. Further developments [5] achieve $nH_k + o(n \log\sigma)$ bits of space for any $k \le \alpha \log_\sigma n$ and constant $\alpha < 1$. The counting time complexity now rises to $O(m \log\sigma)$, yet the $\sigma$ terms multiplying the locating and displaying complexities of the FM-index become now $\log\sigma$.
The compressed suffix array (CSA) of Sadakane [23] offers another tradeoff related to the dependence on $\sigma$. The CSA needs $(H_0/\varepsilon + O(\log\log\sigma))n$ bits of space. Its counting time is $O(m \log n)$. Each occurrence can be located in $O(\log^{\varepsilon} n)$ time, and a text substring of length $L$ can be displayed in time $O(L + \log^{\varepsilon} n)$. Other later developments in the line of the CSA [10, 11] achieve results similar to those in [5].
In this paper we present an alternative approach to removing the large space dependence of the FM-index. We Huffman-compress the text and then, as in the FM-index, apply the Burrows-Wheeler transform over it. The resulting structure can be regarded as an FM-index built over a binary sequence. As a result, we remove any significant dependence on the alphabet size.
Our index needs $n(2H_0 + 3 + \varepsilon)(1 + o(1))$ bits of space, for any $0 < \varepsilon < 1$. It solves counting queries in $O(m(H_0 + 1))$ average time. The text position of each occurrence can be located in worst-case time $O(\frac{1}{\varepsilon}(H_0 + 1)\log n)$. Any text substring of length $L$ can be displayed in $O((H_0 + 1)L)$ average time, in addition to the mentioned worst-case time required to locate a text position. In the worst case all the terms $(H_0 + 1)$ in the time complexities become $\log n$. It is possible to convert this $\log n$ into $\log\sigma$ without affecting the average complexities [8], but we refrain from this idea in a real implementation.
We also study several variants of the original index that reduce the term 2 in front of the space complexity, such as K-ary Huffman and Kautz-Zeckendorf coding.
Our experimental results on English and proteins show that, although not among the most succinct, our index is faster than the others in many aspects, even letting the others use significantly more space. On the other hand, on DNA our index is both the fastest and smallest compared to previous work. Furthermore, our index is attractive for its simplicity.
2. The FM-index Structure
The FM-index [3] is based on the Burrows-Wheeler transform (BWT) [1], which produces a permutation of the original text denoted by $T^{bwt} = bwt(T)$. String $T^{bwt}$ is the result of the following forward transformation:
1. Append to the end of $T$ a special end marker "$", which is lexicographically smaller than any other character.
2. Form a conceptual matrix $M$ whose rows are the cyclic shifts of the string $T\$$, sorted in lexicographic order.
3. Construct the transformed text $L$ by taking the last column of $M$. The first column is denoted by $F$.
The suffix array (SA) $A$ of text $T\$$ is essentially the matrix $M$: $A[i] = j$ iff the $i$th row of $M$ contains string $t_j t_{j+1} \cdots t_n \$ t_1 \cdots t_{j-1}$. The occurrences of any pattern $P = p_1 p_2 \cdots p_m$ form an interval $[sp, ep]$ in $A$, such that suffixes $t_{A[i]} t_{A[i]+1} \cdots t_n$, $sp \le i \le ep$, contain the pattern as a prefix. This interval can be searched for by using two binary searches in time $O(m \log n)$.
The suffix array of text $T$ is represented implicitly by $T^{bwt}$. The novel idea of the FM-index is to store $T^{bwt}$ in compressed form, and to simulate the search in the suffix array. To describe the search algorithm, we need two definitions that will be useful later as well.
Definition 1 Given a text $T$ over an ordered alphabet $\Sigma = \{c_1, \ldots, c_\sigma\}$, $C[c_1, c_\sigma]$ stores in $C[c_i]$ the number of occurrences of characters $\{c_1, \ldots, c_{i-1}\}$ in $T$.
Definition 2 Let $X$ be a sequence, then $Occ(X, c, i)$ is the number of occurrences of character $c$ in the prefix $X[1, i]$.
With these definitions we can introduce the backward BWT that produces $T$ given $T^{bwt}$.
1. Compute the array $C$ for $T^{bwt}$. Notice that $C[c] + 1$ is the position of the first occurrence of $c$ in $F$ (if any).
2. Define the LF-mapping $LF[1, n+1]$ as $LF[i] = C[L[i]] + Occ(L, L[i], i)$.
3. Reconstruct $T$ backwards as follows: set $s = 1$ (because $M[1] = \$T$) and then, for each $i = n, \ldots, 1$ do $T[i] \leftarrow L[s]$ and $s \leftarrow LF[s]$.
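The forward and backward transformations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `bwt` and `inverse_bwt` are ours, the rotations are sorted explicitly (quadratic space, nothing like a real construction), indices are 0-based, and the text is assumed to contain only characters that sort after "$" in ASCII.

```python
def bwt(text):
    """Forward BWT via explicitly sorted cyclic shifts of text + '$'."""
    t = text + "$"                      # assumes every character of text sorts after '$'
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rows)


def inverse_bwt(last):
    """Rebuild T from L = T^bwt with the LF-mapping (0-based rows)."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of characters in L that are smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # lf[i] = C[L[i]] + (occurrences of L[i] strictly before position i)
    seen, lf = {}, []
    for ch in last:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    out, s = [], 0                      # row 0 is "$T"
    for _ in range(len(last) - 1):
        out.append(last[s])             # L[s] is the character preceding row s
        s = lf[s]
    return "".join(reversed(out))
```

For instance, `bwt("banana")` yields `"annb$aa"`, and `inverse_bwt` recovers `"banana"` from it. Note that the sketch reads `L[s]` before following the LF-mapping, so that the last character of the text is the first one produced.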
We are now ready to describe the search algorithm given in [3] (Fig. 1). It finds the interval of $A$ containing the occurrences of the pattern $P$, and returns the number of occurrences. The algorithm uses the array $C$ and function $Occ(X, c, i)$ defined above. Using the properties of the backward BWT, it is easy to see that the algorithm maintains the following invariant [3]: After phase $i$, with $i$ from $m$ to 1, the variable $sp$ points to the first row of $M$ prefixed by $P[i, m]$ and the variable $ep$ points to the last row of $M$ prefixed by $P[i, m]$. The correctness of the algorithm follows from this observation.
Ferragina and Manzini [3] describe an implementation of $Occ(T^{bwt}, c, i)$ that uses a compressed form of $T^{bwt}$. They show how to compute $Occ(T^{bwt}, c, i)$ for any $c$ and $i$ in constant time. However, to achieve this they need exponential space (in the size of the alphabet).
The FM-index can also locate the text positions where $P$ occurs, and display any text substring. The details are deferred to Section 4.
Algorithm FMCount(P[1,m], T^bwt[1,n])
    i = m;
    sp = 1; ep = n;
    while (sp <= ep) and (i >= 1) do
        c = P[i];
        sp = C[c] + Occ(T^bwt, c, sp - 1) + 1;
        ep = C[c] + Occ(T^bwt, c, ep);
        i = i - 1;
    if ep < sp then return 0 else return ep - sp + 1.
Figure 1: Algorithm for counting the number of occurrences of P[1, m] in T[1, n].
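As a hedged illustration of Figure 1, the following Python sketch runs the backward search over an uncompressed $T^{bwt}$. The helper `bwt` and the naive linear-time `occ` are our own stand-ins (a real FM-index answers `Occ` in constant time over a compressed sequence); positions are 1-based as in the figure.

```python
def bwt(text):
    """T^bwt from sorted cyclic shifts of text + '$' (illustration only)."""
    t = text + "$"
    return "".join(row[-1] for row in sorted(t[i:] + t[:i] for i in range(len(t))))


def fm_count(pattern, last):
    """Backward search over L = T^bwt; returns the number of occurrences of pattern."""
    C, total = {}, 0
    for ch in sorted(set(last)):
        C[ch] = total                   # characters of L smaller than ch
        total += last.count(ch)

    def occ(ch, i):
        # occurrences of ch in L[1, i]; naive scan standing in for the real structure
        return last[:i].count(ch)

    sp, ep = 1, len(last)
    for ch in reversed(pattern):        # phases i = m, ..., 1
        if ch not in C:
            return 0
        sp = C[ch] + occ(ch, sp - 1) + 1
        ep = C[ch] + occ(ch, ep)
        if sp > ep:
            return 0
    return ep - sp + 1
```

With `last = bwt("abracadabra")`, the call `fm_count("abra", last)` narrows the interval to rows 3 and 4 and reports two occurrences.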
3. First Huffman, then Burrows-Wheeler
We now introduce our new index. From now on assume $T$ already contains the terminator "$" at the end (footnote b). To begin, this text $T$ will be Huffman-compressed into a binary stream $T'$ and the codeword beginnings marked in $Th$ (the final index will not store $T'$ nor $Th$). The idea is that, instead of searching $T$ for $P$, we can Huffman-encode $P$ into $P'$ and search the binary text $T'$ for $P'$. Yet we have to ensure that the occurrences of $P'$ are codeword-aligned.
Definition 3 Let $T'[1, n']$ be the binary stream resulting from Huffman-compressing $T$, where $n' < (H_0 + 1)n$ since (binary) Huffman poses a maximum representation overhead of 1 bit per symbol. Let $Th[1, n']$ be a second binary stream such that $Th[i] = 1$ iff $i$ is the starting position of a Huffman codeword in $T'$. In the Huffman code, we ensure that the last bit assigned to the end marker "$" is zero.
The reason for the final condition will be clear later. Note that this can always be done, by making the node corresponding to "$" a left child of its parent in the Huffman tree.
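Definition 3 can be made concrete with a small Python sketch. It is a minimal illustration, not the authors' implementation: the function names are ours, the Huffman tree is built with a plain binary heap, and the "left child" condition on "$" is enforced afterwards by swapping the two subtrees below the parent of "$", which changes no codeword length.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {symbol: bitstring} for a binary Huffman code of text."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate one-symbol text
        return {s: "0" for s in freq}
    # heap items: (weight, unique tiebreak, {symbol: partial codeword})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def force_terminator_left(code, end="$"):
    """Swap the children of end's parent if needed, so that code[end] ends in 0."""
    if code[end].endswith("0"):
        return dict(code)
    prefix, pos = code[end][:-1], len(code[end]) - 1
    flipped = {}
    for s, c in code.items():
        if c.startswith(prefix) and len(c) > pos:
            c = c[:pos] + ("1" if c[pos] == "0" else "0") + c[pos + 1:]
        flipped[s] = c
    return flipped


def encode(text, code):
    """Return (T', Th) as lists of 0/1: the compressed bits and the codeword-start marks."""
    bits, starts = [], []
    for ch in text:
        cw = code[ch]
        bits.extend(int(b) for b in cw)
        starts.extend([1] + [0] * (len(cw) - 1))
    return bits, starts
```

For the text `"abracadabra$"` the resulting stream has 28 bits, whichever way ties are broken, and `Th` has exactly one mark per text character.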
3.1. Structure
We apply the Burrows-Wheeler transform over text $T'$, so as to obtain $B = (T')^{bwt}$. Yet, in order to have a binary alphabet, $T'$ will not have its own special terminator character "$" (note that the end marker of $T$ is encoded in binary at the end of $T'$, just as any other character of $T$). To formally define $B$ we resort to the suffix array $A'$ of $T'$, yet the final index will not store $A'$.
Definition 4 Let $A'[1, n']$ be the suffix array for text $T'$, that is, a permutation of $\{1, \ldots, n'\}$ such that $T'[A'[i], n'] < T'[A'[i+1], n']$ in lexicographic order, for all $1 \le i < n'$. In these lexicographic comparisons, if a string $x$ is a prefix of $y$, we assume $x < y$.
Our index will represent $A'$ in succinct form, via array $B$ and another array $Bh$ used to track the codeword beginnings.
Definition 5 Let $B[1, n']$ be a binary stream such that $B[i] = T'[A'[i] - 1]$ (except that $B[i] = T'[n']$ if $A'[i] = 1$). Let $Bh[1, n']$ be another binary stream such that $Bh[i] = Th[A'[i]]$. This tells whether position $i$ in $A'$ points to the beginning of a codeword.
Footnote b: Thus the term $nH_0$ will refer to this new text with terminator included. The difference with the term $nH_0$ corresponding to the text without the terminator is only $O(\log n)$, and will be absorbed by the $o(n)$ terms that will appear later in the space complexity.
3.2. Searching
Our goal is to search $B$ exactly like the FM-index. For this sake we need array $C$ and function $Occ$ of Definitions 1 and 2, now applied to $T'$ and $B$. As we are dealing with binary sequences, $C$ and $Occ$ are easy to compute using the well-known function $rank$.
Definition 6 Given a binary sequence $X$, $rank(X, i)$ is the number of 1's in $X[1, i]$. In particular $rank(X, 0) = 0$. The inverse function, $select(X, j)$, tells the position of the $j$th 1 in $X$.
Functions $rank$ and $select$ can be computed in constant time using only $o(n)$ extra bits on top of the original sequence of $n$ bits [19, 2]. An optimized practical variant is described in [6].
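To fix the semantics of Definition 6, here is a deliberately simple Python sketch. It is not the constant-time, $o(n)$-bit structure of [19, 2]: the class name `BitVector`, the block size, and the binary search used for `select` are our own simplifications, chosen only to show what the two operations return (with 1-based positions).

```python
class BitVector:
    """Bit sequence with rank/select; positions are 1-based as in the text."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        # ones_before[b] = number of 1s in the first b blocks
        self.ones_before = [0]
        for start in range(0, len(self.bits), block):
            chunk = self.bits[start:start + block]
            self.ones_before.append(self.ones_before[-1] + sum(chunk))

    def rank(self, i):
        """Number of 1s in X[1, i]; rank(0) = 0."""
        b = i // self.block
        return self.ones_before[b] + sum(self.bits[b * self.block:i])

    def select(self, j):
        """Position of the j-th 1, assuming 1 <= j <= rank(n); binary search on rank."""
        lo, hi = 1, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo
```

On a binary sequence, `Occ(X, 1, i)` is then `rank(i)` and `Occ(X, 0, i)` is `i - rank(i)`.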
Note that our $C$ array has only two entries, which are easily precomputed. Similarly, $Occ$ can be expressed in terms of $rank$.
$$C[0] = 0$$
$$C[1] = n' - rank(B, n')$$
$$Occ(B, 0, i) = i - rank(B, i)$$
$$Occ(B, 1, i) = rank(B, i)$$
Therefore, formulas $C[c] + Occ(T^{bwt}, c, i)$ in the search algorithm of Figure 1 are solved in our index by using $rank$ on $B$.
There is a small twist, however, because we are not putting a terminator to our binary sequence $T'$ and hence no terminator appears in $B$. Let us call "#" ($\# < 0 < 1$) the terminator that should appear in $T'$, so that it is not confused with the terminator "$" of $T$. In the position $p_\#$ such that $A'[p_\#] = 1$, we should have $B[p_\#] = \#$. Instead, we are setting $B[p_\#]$ to the last bit of $T'$. This is the last bit of the Huffman codeword assigned to the terminator "$" of $T$, and it is zero according to Definition 3. Hence the correct $B$ sequence would be of length $n' + 1$, starting with 0 (which corresponds to $T'[n']$, the character preceding the occurrence of "#"), and it would have $B[p_\#] = \#$. To obtain the right mapping to our binary $B$, we must add 1 to $C[0] + Occ(B, 0, i)$ when $i < p_\#$. The computation of $C[1] + Occ(B, 1, i)$ remains unchanged. Overall, formula $C[c] + Occ(T^{bwt}, c, i)$ is computed as follows
$$C[c] + Occ(T^{bwt}, c, i) = \begin{cases} i - rank(B, i) + [i < p_\#] & \text{if } c = 0 \\ n' - rank(B, n') + rank(B, i) & \text{if } c = 1 \end{cases} \qquad (1)$$
where $p_\# = (A')^{-1}[1]$.
Therefore, by preprocessing $B$ to solve $rank$ queries, we can search $B$ exactly as in the FM-index. Our search pattern is not the original $P$, but its binary encoding $P'[1, m']$ using the Huffman code we applied to $T$.
Algorithm Huff-FM-Count(P'[1,m'], B, Bh)
    i = m';
    sp = 1; ep = n';
    while (sp <= ep) and (i >= 1) do
        if P'[i] = 0 then
            sp = (sp - 1) - rank(B, sp - 1) + [sp - 1 < p#] + 1;
            ep = ep - rank(B, ep) + [ep < p#];
        else
            sp = n' - rank(B, n') + rank(B, sp - 1) + 1;
            ep = n' - rank(B, n') + rank(B, ep);
        i = i - 1;
    if ep < sp then return 0 else return rank(Bh, ep) - rank(Bh, sp - 1);
Figure 2: Algorithm for counting the number of occurrences of P'[1, m'] in T'[1, n'].
The answer to that search, however, is different from that of the search of $T$ for $P$. The reason is that the search of $T'$ for $P'$ returns the number of suffixes of $T'$ that start with $P'$. Certainly these include the suffixes of $T$ that start with $P$, but also other suffixes of $T'$ that do not start a Huffman codeword, yet start with $P'$.
Array $Bh$ now comes into play to filter out those spurious occurrences. In the range $[sp, ep]$ found by the search of $B$ for $P'$, every bit set in $Bh[sp, ep]$ represents a true occurrence. Hence the true number of occurrences can be computed as $rank(Bh, ep) - rank(Bh, sp - 1)$. Figure 2 shows the final search algorithm.
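The whole counting pipeline (Definitions 3 to 5, Eq. (1) and Figure 2) can be sketched end to end in Python. Several things here are assumptions made for illustration: the fixed table `CODE` is a hypothetical prefix code standing in for a real Huffman code (its "$" codeword ends in 0, as Definition 3 requires), the suffix array is built by naively sorting suffixes, `rank` is answered from a precomputed prefix-sum table, and patterns are assumed not to contain "$".

```python
from itertools import accumulate

# Hypothetical prefix code standing in for a Huffman code; "$" ends in 0 (Definition 3).
CODE = {"a": "0", "b": "110", "r": "111", "c": "1010", "d": "1011", "$": "100"}


def build(text):
    """Return (B, Bh, p_hash) for text, which must already end with '$'."""
    tp = "".join(CODE[ch] for ch in text)                              # T'
    th = [b for ch in text for b in [1] + [0] * (len(CODE[ch]) - 1)]   # Th
    n = len(tp)
    # A': Python string comparison already treats a proper prefix as smaller
    sa = sorted(range(1, n + 1), key=lambda p: tp[p - 1:])
    B = [int(tp[p - 2]) if p > 1 else int(tp[-1]) for p in sa]
    Bh = [th[p - 1] for p in sa]
    p_hash = sa.index(1) + 1                                           # A'[p#] = 1
    return B, Bh, p_hash


def prefix_ranks(bits):
    """r[i] = rank(bits, i) for 0 <= i <= len(bits)."""
    return [0] + list(accumulate(bits))


def huff_fm_count(pattern, B, Bh, p_hash):
    """Count codeword-aligned occurrences of the encoded pattern (Figure 2)."""
    rB, rBh = prefix_ranks(B), prefix_ranks(Bh)
    n = len(B)
    zeros = n - rB[n]

    def step(bit, i):
        # Eq. (1): C[bit] + Occ(., bit, i) evaluated on the terminator-free B
        if bit == "0":
            return i - rB[i] + (1 if i < p_hash else 0)
        return zeros + rB[i]

    sp, ep = 1, n
    for bit in reversed("".join(CODE[ch] for ch in pattern)):
        sp, ep = step(bit, sp - 1) + 1, step(bit, ep)
        if sp > ep:
            return 0
    return rBh[ep] - rBh[sp - 1]
```

After `B, Bh, p = build("abracadabra$")`, the call `huff_fm_count("abra", B, Bh, p)` returns 2: the interval found in $B$ may contain unaligned matches of the bit pattern, and only the rows marked in `Bh` are counted.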
3.3. Analysis
The index stores $B$ and $Bh$, each of $n' < (H_0 + 1)n$ bits. The extra space required by the rank structures is $o(n') = o((H_0 + 1)n)$. The only dependence on $\sigma$ is that we must store the Huffman code, for which $\sigma \log n$ bits is sufficient (say, using a canonical Huffman tree). Thus our index requires at most $2n(H_0 + 1)(1 + o(1)) + \sigma\log n$ bits. The latter term is $o(n)$ even for very large alphabets, $\sigma = o(n/\log n)$. Note that alternative indexes achieving $k$th order compression [5, 10, 11, 18] require $\sigma = O(n^{1/k})$. The space of our index will grow slightly in the next sections due to additional requirements for locating and displaying queries.
Let us now consider the time for counting queries. If we assume that the characters in $P$ have the same distribution of $T$ (which holds in particular if $P$ is randomly chosen from $T$, or generated by the same statistical source), then the length of $P'$ is $m' < m(H_0 + 1)$. This is the number of steps to search $B$ using the algorithm of Figure 2, so the search complexity is $O(m(H_0 + 1))$. Since $H_0 \le \log\sigma$, our time is better than the $O(m\log\sigma)$ complexity of several indexes [5, 10, 11] (footnote c).
Footnote c: In practice, those indexes can also achieve $O(m(H_0 + 1))$ average time using Huffman-shaped wavelet trees.
We now analyze our worst-case search cost, which depends on the maximum height of a Huffman tree with total frequency $n$. Consider the longest root-to-leaf path in the Huffman tree. The leaf symbol has frequency at least 1. Let us traverse the path upwards and consider the (sum of) frequencies encountered in the other branch at each node. These numbers must be, at least, 1, 1, 2, 3, 5, ..., that is, the Fibonacci sequence $F(i)$. Hence, a Huffman tree with depth $d$ needs that the text is of length at least $n \ge 1 + \sum_{i=1}^{d} F(i) = F(d+2)$ [25, pp. 397]. Therefore, the maximum length of a codeword is $F^{-1}(n) - 2 = \log_\phi(n) - 2 + o(1)$, where $\phi = (1 + \sqrt{5})/2$.
Thus, the encoded pattern $P'$ cannot be longer than $O(m\log n)$ and this is also the worst case search cost. This matches the worst-case search cost of the original CSA, while our average case is better. It is actually possible to reduce our worst-case time to $O(m\log\sigma)$, without altering the average search time nor the space usage, by forcing the Huffman tree to become balanced after level $(1 + x)\log\sigma$, for some suitable constant $x > 0$. For details see [8].
4. Locating Occurrences and Displaying the Text
Up to now we have focused on counting time, that is, the time needed to determine the suffix array interval containing all the occurrences. In practice, one needs also the text positions where they appear, as well as possibly a text context. Since self-indexes replace the text, in general one needs to extract arbitrary text substrings from the index.
Given the suffix array interval that contains the $occ$ occurrences found, the FM-index locates each such position in $O(\sigma\log^{1+\varepsilon} n)$ time, for any $0 < \varepsilon < 1$ (which affects the sublinear space component). The CSA can locate each occurrence in $O(\log^{\varepsilon} n)$ time, where $\varepsilon$ is paid in the space, $nH_0/\varepsilon$. Similarly, a text substring of length $L$ can be displayed in time $O(\sigma(L + \log^{1+\varepsilon} n))$ by the FM-index and $O(L + \log^{\varepsilon} n)$ by the CSA.
In this section we show that our index can do better than the FM-index, although not as well as the CSA. Using $(1 + \varepsilon)n$ additional bits, we can locate each occurrence in time $O(\frac{1}{\varepsilon}(H_0 + 1)\log n)$ and display a text context in time $O(L\log\sigma + \log n)$ in addition to locating time. On average, if random text positions are involved, the overall complexity to display a text interval is $O((H_0 + 1)(L + \log n))$.
A first problem is how to extract, in $O(occ)$ time, the $occ$ positions of the bits set in $Bh[sp, ep]$. This is easy using the $select$ function of Definition 6. Actually we need a simpler version, $selectnext(Bh, j)$, which returns the position of the first 1 in $Bh[j, n']$.
Let $r = rank(Bh, sp - 1)$. Then, the positions of the bits set in $Bh$ are $select(Bh, r + 1), select(Bh, r + 2), \ldots, select(Bh, r + occ)$. We recall that $occ = rank(Bh, ep) - rank(Bh, sp - 1)$. This can be expressed using $selectnext$: The positions $pos_1 \ldots pos_{occ}$ can be found as
$$pos_1 = selectnext(Bh, sp),$$
$$pos_{i+1} = selectnext(Bh, pos_i + 1).$$
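The chained use of selectnext can be sketched as follows. This is a minimal Python illustration with a linear scan in place of a real constant-time structure; the function names and the `None` return value for "no further 1" are our own conventions, and positions are 1-based.

```python
def selectnext(bits, j):
    """Position (1-based) of the first 1 in bits[j..n], or None if there is none."""
    for pos in range(j, len(bits) + 1):
        if bits[pos - 1] == 1:
            return pos
    return None


def marked_positions(Bh, sp, ep):
    """All positions i in [sp, ep] with Bh[i] = 1, found by chained selectnext calls."""
    out = []
    pos = selectnext(Bh, sp)
    while pos is not None and pos <= ep:
        out.append(pos)
        pos = selectnext(Bh, pos + 1)
    return out
```

For `Bh = [0, 1, 1, 1, 0, 1]`, the interval `[2, 3]` yields positions 2 and 3, while `[5, 5]` yields none.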
To complete the locating and displaying processes, we need additional structures.
4.1. Structure
We sample $T'$ at approximately regular intervals, so that only codeword beginnings can be sampled. A sampling parameter $0 < \varepsilon < 1$ will control the density of the sampling and the corresponding space/time tradeoff.
Definition 7 Given $0 < \varepsilon < 1$, let $\ell = \lceil \frac{2n'}{\varepsilon n}\log n \rceil$ be the sampling step. Our sampling of $T'$ is a sequence $\mathcal{S}[1, \lfloor n'/\ell \rfloor]$, so that $\mathcal{S}[i]$ is the first position of the codeword that covers position $1 + \ell(i - 1)$ in $T'$, that is, $\mathcal{S}[i] = select(Th, rank(Th, 1 + \ell(i - 1)))$.
Our index will include three additional structures called $ST$, $TS$, and $S$. $TS$ is an array storing the positions of $A'$ that point to the sampled positions in $T'$, in increasing text position order.
Definition 8 $TS[1, \lfloor n'/\ell \rfloor]$ is an array such that $TS[i] = j$ iff $A'[j] = \mathcal{S}[i]$.
Array $ST$ is formed using the same positions of $A'$, now sorted by position in $A'$ and storing their position in $T$.
Definition 9 $ST[1, \lfloor n'/\ell \rfloor]$ is an array such that $ST[i] = rank(Th, A'[j])$, where $j$ is the $i$-th position in $A'$ that points to a position present in $\mathcal{S}$.
Finally, $S[i]$ tells whether the $i$-th entry of $A'$ that points to a codeword beginning points to a sampled text position. $S$ will be further processed for $rank$ queries.
Definition 10 $S[1, n]$ is a bit array such that $S[i] = 1$ iff $A'[select(Bh, i)]$ is in $\mathcal{S}$.
4.2. Locating
We have to determine the text position corresponding to an entry $A'[i]$ for which $Bh[i] = 1$, that is, a valid occurrence. We use bit array $S[rank(Bh, i)]$ to determine whether $A'[i]$ points to a sampled codeword beginning, in which case its position in $T$ is $ST[rank(S, rank(Bh, i))]$ and we are done. Otherwise, just as with the FM-index, we determine position $i'$ whose value is $A'[i'] = A'[i] - 1$. This process is repeated until a new codeword beginning is found, that is, $Bh[i'] = 1$ (this corresponds to moving backward bit by bit in $T'$). We then check again whether this position is sampled, and so on until finding a sampled codeword beginning. If we finally obtain position $pos$ after $d$ repetitions, the answer is $pos + d$ as we have moved backward $d$ positions in $T$.
It is left to specify how to determine $i'$ from $i$. In the FM-index, this is done via the LF-mapping, $i' = C[T^{bwt}[i]] + Occ(T^{bwt}, T^{bwt}[i], i)$. In our index, the LF-mapping over $A'$ is implemented using Eq. (1). Figure 3 gives the pseudocode.
4.3. Displaying
In order to display a text substring $T[l, r]$ of length $L = r - l + 1$, we start by binary searching $TS$ for the smallest sampled text position larger than $r$. Let $j$ be the index found in $TS$. Given value $i = TS[j]$, we know that $S[rank(Bh, i)] = 1$ as $i$ is a sampled entry in $A'$. The corresponding position in $T$ is $ST[rank(S, rank(Bh, i))]$.
Once we find the first sampled text position that follows $r$, we know its corresponding position $i$ in $A'$. From there on, we move backwards in $T'$ (via the LF-mapping over $A'$), position by position, until reaching the first bit of the codeword for $T[r + 1]$. Then, we obtain the $L$ preceding characters of $T$, by further traversing $T'$ backwards, now collecting all its bits until reaching the first bit of the codeword for $T[l]$. The collected bit stream is reversed and Huffman-decoded to obtain $T[l, r]$. Figure 4 shows the pseudocode.
Algorithm Huff-FM-Locate(i, B, Bh, S, ST)
    d = 0;
    while S[rank(Bh, i)] = 0 do
        do
            if B[i] = 0 then i = i - rank(B, i) + [i < p#];
            else i = n' - rank(B, n') + rank(B, i);
        while Bh[i] = 0;
        d = d + 1;
    return d + ST[rank(S, rank(Bh, i))];
Figure 3: Algorithm for locating the text position of the occurrence at B[i]. It is invoked for each i = select(Bh, r + k), 1 <= k <= occ, r = rank(Bh, sp - 1).
Algorithm Huff-FM-Display(l, r, B, Bh, S, ST, TS)
    j = min{k, ST[rank(S, rank(Bh, TS[k]))] > r};    // binary search
    i = TS[j];
    p = ST[rank(S, rank(Bh, i))];
    L = empty list;
    while p > l do
        do
            L = B[i] . L;
            if B[i] = 0 then i = i - rank(B, i) + [i < p#];
            else i = n' - rank(B, n') + rank(B, i);
        while Bh[i] = 0;
        p = p - 1;
    Huffman-decode the first r - l + 1 characters from list L;
Figure 4: Algorithm for extracting T[l, r].
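The locating and displaying procedures of Figures 3 and 4 can be tried out with the following Python sketch. It is an illustration under several assumptions, not the authors' code: `CODE` is a hypothetical prefix code standing in for the Huffman code, the suffix array is built naively, `rank` comes from prefix-sum tables, and the sampling step is passed directly as the parameter `step`. Two details are additions of this sketch only: the LF-mapping wraps from the row of $A' = 1$ to row 1, and `display` falls back to that row when no sampled position follows $r$.

```python
from itertools import accumulate

# Hypothetical prefix code standing in for the Huffman code; "$" ends in 0.
CODE = {"a": "0", "b": "110", "r": "111", "c": "1010", "d": "1011", "$": "100"}
DECODE = {cw: ch for ch, cw in CODE.items()}


def ranks(bits):
    """ranks(bits)[i] = rank(bits, i) for 0 <= i <= len(bits)."""
    return [0] + list(accumulate(bits))


class HuffFM:
    def __init__(self, text, step):
        """text must end with '$'; step plays the role of the sampling step."""
        tp = "".join(CODE[ch] for ch in text)                              # T'
        th = [b for ch in text for b in [1] + [0] * (len(CODE[ch]) - 1)]   # Th
        self.n, self.n_text = len(tp), len(text)
        sa = sorted(range(1, self.n + 1), key=lambda p: tp[p - 1:])        # A'
        self.B = [int(tp[p - 2]) if p > 1 else int(tp[-1]) for p in sa]
        self.Bh = [th[p - 1] for p in sa]
        self.p_hash = sa.index(1) + 1                                      # A'[p#] = 1
        self.rB, self.rBh, rth = ranks(self.B), ranks(self.Bh), ranks(th)
        starts = [p for p in range(1, self.n + 1) if th[p - 1]]
        # sampled positions of T': start of the codeword covering 1, 1+step, ...
        sampled = sorted({starts[rth[q] - 1] for q in range(1, self.n + 1, step)})
        in_sample = set(sampled)
        row_of = {p: i for i, p in enumerate(sa, 1)}
        self.TS = [row_of[p] for p in sampled]                 # rows, in text order
        marked = [p for p in sa if th[p - 1]]                  # codeword starts, in A' order
        self.S = [int(p in in_sample) for p in marked]
        self.ST = [rth[p] for p in marked if p in in_sample]   # text positions
        self.rS = ranks(self.S)

    def text_pos(self, i):
        """Text position stored for a sampled row i: ST[rank(S, rank(Bh, i))]."""
        return self.ST[self.rS[self.rBh[i]] - 1]

    def lf(self, i):
        """Row of the suffix that starts one bit earlier, via Eq. (1)."""
        if i == self.p_hash:        # sketch-only wrap-around to the last bit of T'
            return 1
        if self.B[i - 1] == 0:
            return i - self.rB[i] + (1 if i < self.p_hash else 0)
        return self.n - self.rB[self.n] + self.rB[i]

    def locate(self, i):
        """Figure 3: text position of row i, which must have Bh[i] = 1."""
        d = 0
        while self.S[self.rBh[i] - 1] == 0:
            i = self.lf(i)
            while self.Bh[i - 1] == 0:
                i = self.lf(i)
            d += 1
        return d + self.text_pos(i)

    def display(self, l, r):
        """Figure 4: return T[l..r], 1-based and inclusive."""
        lo, hi = 0, len(self.TS)
        while lo < hi:              # first sample whose text position exceeds r
            mid = (lo + hi) // 2
            if self.text_pos(self.TS[mid]) > r:
                hi = mid
            else:
                lo = mid + 1
        if lo < len(self.TS):
            i = self.TS[lo]
            p = self.text_pos(i)
        else:                       # sketch-only fallback: start just past the end of T
            i, p = self.p_hash, self.n_text + 1
        bits = []
        while p > l:
            bits.append(self.B[i - 1])
            i = self.lf(i)
            while self.Bh[i - 1] == 0:
                bits.append(self.B[i - 1])
                i = self.lf(i)
            p -= 1
        out, cur = [], ""
        for b in reversed(bits):    # decode the collected bits left to right
            cur += str(b)
            if cur in DECODE:
                out.append(DECODE[cur])
                cur = ""
        return "".join(out)[: r - l + 1]
```

For example, with `idx = HuffFM("abracadabra$", 5)`, the call `idx.display(5, 7)` returns `"cad"`. A smaller `step` stores more samples and shortens the backward walk, which mirrors the role of $\varepsilon$ in the analysis.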
4.4. Analysis
Array $TS$ requires $\frac{\varepsilon n}{2}(1 + o(1))$ bits, since there are $n'/\ell$ entries and each entry needs $\log n' \le \log n + O(\log\log n)$ bits. Array $ST$ requires other $\frac{\varepsilon n}{2}$ bits, as its entries require $\log n$ bits. Finally, array $S$ preprocessed for $rank$ queries requires $n(1 + o(1))$ bits. Overall, we spend $(1 + \varepsilon)n(1 + o(1))$ additional bits of space for locating and displaying queries. This raises our final space requirement to $n(2H_0 + 3 + \varepsilon)(1 + o(1)) + \sigma\log n$ bits.
Let us now consider the time for locating. This corresponds to the maximum distance between two consecutive samples in $T'$, as we traverse it backwards until finding a sampled position. Recall from Section 3 that no Huffman codeword can be longer than $\log_\phi n - 2 + o(1)$ bits. Then, the distance between two consecutive samples in $T'$, after the adjustment to codeword beginnings, cannot exceed
$$\ell + \log_\phi n - 2 + o(1) \le \frac{2}{\varepsilon}(H_0 + 1)\log n + \log_\phi n - 1 + o(1) = O\left(\frac{1}{\varepsilon}(H_0 + 1)\log n\right),$$
which is therefore the worst-case locating complexity.
For the displaying time, each of the $L$ characters obtained costs us $O(H_0 + 1)$ on average because we obtain the codeword bits one by one. In the worst case they cost us $O(\log n)$. Note that we might have to traverse some additional characters from the next sampled position until reaching the text area of interest. Finally, we must consider the $O(\log n)$ time for the binary search of $TS$. Overall, the time complexity is $O((H_0 + 1)(L + \log n))$ on average and $O(L\log n + (H_0 + 1)\log n)$ in the worst case.
Theorem 1 Given a text $T[1, n]$ over an alphabet of size $\sigma$ and with zero-order entropy $H_0$, the FM-Huffman index requires $n(2H_0 + 3 + \varepsilon)(1 + o(1)) + \sigma\log n$ bits of space, for any constant $0 < \varepsilon < 1$ fixed at construction time. It can count the occurrences of $P[1, m]$ in $T$ in average time $O(m(H_0 + 1))$ and worst-case time $O(m\log n)$. Each such occurrence can be located in worst-case time $O(\frac{1}{\varepsilon}(H_0 + 1)\log n)$. Any text substring of length $L$ can be displayed in time $O((H_0 + 1)(L + \log n))$ on average and $O((L + (H_0 + 1))\log n)$ in the worst case.
5. K-ary Huffman
While storing $B$ seems necessary as we are using zero-order compression of $T$, doubling the space requirement to store $Bh$ seems a waste of space. In this section we explore a way to reduce the size of $Bh$. Instead of using Huffman over a binary coding alphabet, we can use a coding alphabet of $k > 2$ symbols, so that each symbol needs $\lceil \log k \rceil$ bits. Varying the value of $k$ yields interesting time/space tradeoffs. We use only powers of 2 for $k$ values, so that each symbol can be represented without wasting space.
The space usage varies in different aspects. The size of $B$ increases since Huffman's compression ratio degrades as $k$ grows. $B$ has length $n' < (H_0^{(k)} + 1)n$ symbols, where $H_0^{(k)}$ is the zero-order entropy of the text computed using base $k$ logarithm, that is, $H_0^{(k)} = H_0/\log_2 k$. Therefore, the size of $B$ is bounded by $n'\log k < (H_0 + \log k)n$ bits. The size of $Bh$, on the other hand, is reduced since it needs one bit per symbol, that is $n'$ bits.
The total space used by $B$ and $Bh$ structures is then $n'(1 + \log k) < n(H_0^{(k)} + 1)(1 + \log k)$, which is not larger than the space requirement of the binary version, $2n(H_0 + 1)$, for $1 \le \log k \le H_0$. In particular, if we choose $\log k = \alpha H_0$, then the space is upper bounded by $n((1 + \alpha)H_0 + 1 + 1/\alpha)$, which is optimized at $\alpha = 1/\sqrt{H_0}$ (that is, $\log k = \sqrt{H_0}$). Using this optimal $\alpha$ value, the overall space required by $B$ and $Bh$ is $n(\sqrt{H_0} + 1)^2 \le n(H_0 + 1)(1 + 2/\sqrt{H_0})$. The original overhead factor of 2 over pure Huffman compression has been reduced to $1 + O(1/\sqrt{H_0})$.
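The choice of $\alpha$ can be checked with a short worked derivation. It only expands the bound stated above with $\log k = \alpha H_0$, minimizes the part that depends on $\alpha$, and substitutes the optimum back; no quantity beyond those in the text is introduced.

```latex
\begin{align*}
n\left(\frac{H_0}{\log k}+1\right)(1+\log k)
  &= n\left(\frac{1}{\alpha}+1\right)(1+\alpha H_0)
   = n\left((1+\alpha)H_0 + 1 + \frac{1}{\alpha}\right),\\
\frac{d}{d\alpha}\left(\alpha H_0 + \frac{1}{\alpha}\right)
  &= H_0 - \frac{1}{\alpha^2} = 0
  \;\Longrightarrow\; \alpha = \frac{1}{\sqrt{H_0}},\\
n\left(H_0 + 2\sqrt{H_0} + 1\right) &= n\left(\sqrt{H_0}+1\right)^2
  \le n(H_0+1)\left(1+\frac{2}{\sqrt{H_0}}\right).
\end{align*}
```

As a numeric illustration, for $H_0 = 4$ the optimum is $\log k = 2$ (that is, $k = 4$), giving a bound of $(\sqrt{4} + 1)^2 = 9$ bits per symbol for $B$ and $Bh$ together, against $2(H_0 + 1) = 10$ for the binary version.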
The space for the rank structures changes as well. The rank structure for $Bh$ is computed in the same way of the binary version, and therefore its size is reduced to $o(H_0^{(k)} n)$ bits. To solve $Occ(B, c, i)$ queries, we must build the sublinear-size rank structures over $k$ virtual binary sequences $B_c[1, n']$, so that $B_c[i] = 1$ iff $B[i] = c$. Therefore, $Occ(B, c, i) = rank(B_c, i)$ can be computed in constant time. The size of those rank structures adds up to $o(k H_0^{(k)} n)$ bits. (The solution for $rank$ requires accessing the bit vectors $B_c$, but one can use $B$ itself instead.)
If we use the optimum $k$ derived above, the space for the rank structures is $o(n 2^{\sqrt{H_0}}\sqrt{H_0})$ extra space, which turns out to be still $o(n)$ (more precisely, $O(n/\log\log n)$) for $H_0 \le (\log\log n)^2$. This value is reasonably large in practice.
Regarding the time complexities, the pattern has average length $< m(H_0^{(k)} + 1)$ symbols. This is the counting complexity, which is reduced as we increase $k$. Using the value $k = 2^{\sqrt{H_0}}$ that optimizes the space complexity, the counting time is $O(m\sqrt{H_0})$. On the other hand, the average counting time can be made $O(m)$ by using a constant $\alpha$. For locating queries and displaying text, we need the same additional structures $TS$, $ST$ and $S$ as for the binary version. The $k$-ary version can locate the position of an occurrence in $O(\frac{1}{\varepsilon}(H_0^{(k)} + 1)\log n)$ time, which is the maximum distance between two sampled positions. Similarly, the time used to display a substring of length $L$ becomes $O((H_0^{(k)} + 1)(L + \log n))$ on average and $O(L\log n + (H_0^{(k)} + 1)\frac{1}{\varepsilon}\log n)$ in the worst case. Again, with the optimum $k$, $H_0^{(k)}$ is $\sqrt{H_0}$, and it can be made $O(1)$ by using a constant $\alpha$.
6. Kautz-Zeckendorf Coding
The previous section aims at reducing the size of Bh in exchange for increasing
the sizc of other structures. In this section we aim at completeiy getting rid of
the Bh array, by replacing Huffman coding with another for which the bit strearn
itself enables synchronization at codeword boundaries. Our solution is based on
a representation of integers first advocated by Kautz [14] for its synchronization
properties, that presents each number in a unique form as a sum of Fibonacci
numbers. This technique is better known from a work by Zeckendorf [26], therefore
we will cali it Kautz-Zeckendorf coding.
Consider the (slightly displaced) Fibonacci sequence 1, 2, 3, 5, 8, 13, ..., that
is, $f_1 = 1$, $f_2 = 2$, and $f_{i+2} = f_{i+1} + f_i$. It is easy to prove by induction that any
natural number N can be uniquely decomposed into a sum of Fibonacci numbers,
where each number is summed at most once and no two consecutive elements of
the sequence are used in the decomposition. (If two consecutive elements $f_i$ and
$f_{i+1}$ appear in the decomposition we can use $f_{i+2}$ instead.) Thus we can represent
N as a bit vector, whose i-th bit is set if the i-th Fibonacci number is used to
represent N. No two consecutive bits can be set in this representation because this
would mean that we used two consecutive numbers in the decomposition. This can
be generalized to k consecutive ones [14]. The recurrence is now $f_i^{(k)} = 2^{i-1}$ for $i \le k$
and $f_{i+k}^{(k)} = f_{i+k-1}^{(k)} + f_{i+k-2}^{(k)} + \cdots + f_{i+1}^{(k)} + f_i^{(k)}$.
In this representation we do not permit a sequence of k consecutive elements of the sequence in the decomposition,
and thus no run of k 1's appears in the bit vector.
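As an illustration of the decomposition just described, the following Python sketch computes the generalized weights and a greedy decomposition of N. The function names (first_weights, decompose, value) are hypothetical, and the greedy strategy is one standard way to obtain the representation; the text does not say how the authors compute it. The sketch assumes initial weights 2^(i-1) for i <= k, which for k = 2 gives the sequence 1, 2, 3, 5, 8, ...

```python
def first_weights(k, count):
    """First `count` weights: 2^(i-1) for the first k (assumed), then sums of the previous k."""
    if k < 2:
        raise ValueError("k must be at least 2")
    weights = []
    for i in range(count):
        weights.append(2 ** i if i < k else sum(weights[-k:]))
    return weights


def decompose(n, k=2):
    """Greedy decomposition of n >= 0; bits are listed from the largest weight down."""
    count = 1
    # Grow the weight list while the next weight still fits into n.
    while first_weights(k, count + 1)[-1] <= n:
        count += 1
    bits = []
    for w in reversed(first_weights(k, count)):
        if w <= n:
            bits.append(1)
            n -= w
        else:
            bits.append(0)
    return bits


def value(bits, k=2):
    """Inverse of decompose: add up the weights whose bit is set."""
    weights = reversed(first_weights(k, len(bits)))
    return sum(w for w, b in zip(weights, bits) if b)
```

For instance, decompose(11) yields [1, 0, 1, 0, 0], that is, 11 = 8 + 3, and no two adjacent bits are set.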
The binary encoding we use for symbols differs slightly from the above description.
The reason is that, for example, 0, 00, 000, ... are all different codewords,
albeit all of them represent N = 0. Operationally, our codes of a given length are
obtained by generating all the binary sequences of that length and then removing
those having k consecutive 1's. We also require the codeword to finish with a 0, for
reasons to be made clear soon. We then generate the codes by increasing length,
assigning the i-th code to the i-th most frequent source symbol. In addition, all the
codewords are prepended with a sequence of k 1's followed by one 0.
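The operational description above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names kz_bodies and kz_code are hypothetical, ties in frequency are broken by symbol order (an assumption), codes of equal length are taken in lexicographic order (also an assumption), and the brute-force enumeration is only practical for small alphabets.

```python
from itertools import product


def kz_bodies(length, k):
    """Binary strings of the given length that end in 0 and contain no k consecutive 1's."""
    for bits in product("01", repeat=length):
        s = "".join(bits)
        if s.endswith("0") and "1" * k not in s:
            yield s


def kz_code(freqs, k=2):
    """Map each symbol to header + body, shorter bodies going to more frequent symbols."""
    # Most frequent first; ties broken by symbol (an assumption of this sketch).
    symbols = sorted(freqs, key=lambda s: (-freqs[s], s))
    header = "1" * k + "0"
    bodies, length = [], 1
    while len(bodies) < len(symbols):
        bodies.extend(kz_bodies(length, k))
        length += 1
    return {s: header + b for s, b in zip(symbols, bodies)}
```

With k = 2 and frequencies a: 5, b: 3, c: 2, d: 1, this assigns a = 1100, b = 11000, c = 11010 and d = 110000.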
If, during the LF-mapping, we read a 0 and then k successive 1's from T', we
know that we are at a codeword beginning. Thus, Bh is no longer needed. This is
expected to outweigh the fact that the encoding is not optimal as Huffman is. An
important side-effect is also that there is no need for select nor selectnext to find
the successive matches: they all are in a contiguous range in A'. The rest of the
operation remains unchanged.
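The synchronization property can be seen on a plain bit stream, without any index. The sketch below is a simplification: it scans T' left to right rather than following the LF-mapping, and the function names are hypothetical. It reports every position where a run of k 1's starts, which in a stream of codewords built as described should be exactly the codeword beginnings.

```python
def codeword_starts(bits, k):
    """Positions where a run of k consecutive 1's begins (the codeword headers)."""
    starts, run = [], 0
    for i, b in enumerate(bits):
        run = run + 1 if b == "1" else 0
        if run == k:
            starts.append(i - k + 1)
    return starts


def split_codewords(bits, k):
    """Cut the stream at the detected headers."""
    starts = codeword_starts(bits, k)
    ends = starts[1:] + [len(bits)]
    return [bits[s:e] for s, e in zip(starts, ends)]
```

For k = 2, the stream 110011010110001100 is cut into 1100, 11010, 11000 and 1100.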
There is another consequence of the way we generate the codewords. Because the
codewords are zero-terminated, the longest runs of 1's are precisely the codeword
headers, of k 1's. Those are the lexicographically largest suffixes of T', and thus
the characters preceding them occupy the n largest positions in B. As all those
preceding characters are 0, we can remove the last n bits from B knowing that they
will be zero. This saves one additional bit per symbol in T. Letting codewords
finish with up to k - 1 1's does not save that much space.
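The claim about the last n bits of B can be checked on a toy example. The sketch below assumes that B is the Burrows-Wheeler transform of T' and, as a simplification, computes it by sorting cyclic rotations instead of building a suffix array; the code table is a hand-built example for k = 2 and the function names are hypothetical.

```python
# Hypothetical code table for k = 2: header "110" followed by a body ending in 0.
CODE = {"a": "1100", "b": "11000", "c": "11010", "d": "110000"}


def encode(text, code=CODE):
    """Concatenate the codewords of the symbols of text to obtain T'."""
    return "".join(code[ch] for ch in text)


def bwt(s):
    """Burrows-Wheeler transform via sorted cyclic rotations (quadratic, illustration only)."""
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The transformed symbol is the one cyclically preceding each rotation.
    return "".join(s[i - 1] for i in order)


def trailing_zero_bits(text):
    """Number of trailing zeros in B for the given text."""
    b = bwt(encode(text))
    return len(b) - len(b.rstrip("0"))
```

For a text of n symbols, trailing_zero_bits should return at least n, since the n rotations that start with a header sort last and each is preceded by the final 0 of the previous codeword.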
7. Experimental Results
In this section we present experimental results on counting, locating and displaying queries, and compare the efficiency to existing indexes. The indexes used
for the experiments were the FM-index implemented by Navarro [20], Sadakane's
CSA [23], the RLFM index [18], the SSA index using balanced wavelet trees [18],
and the LZ index [20]. Other indexes, like the Compressed Compact Suffix Array
(CCSA) of Mäkinen and Navarro [17], the Compact SA of Mäkinen [16] and the
implementation of Ferragina and Manzini of the FM-index were not included because they are not comparable to the FM-Huffman index due either to their large
space requirement (Compact SA) or their high search times (CCSA and original
FM-index).
We considered three types of text for the experiments: 80 MB of English text
obtained from the TREC-3 collection (Text Retrieval Conference,
http://trec.nist.gov; files WSJ87-89), 60 MB of DNA and 55 MB
of protein sequences, both obtained from the BLAST database of the NCBI
(National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov;
files month.est_others and swissprot, respectively).
Our experiments were run on an Intel(R) Xeon(TM) processor at 3.06 GHz, with 2
GB of RAM and 512 KB cache, running Gentoo Linux 2.6.10. We compiled the
code with gcc 3.4.2 using optimization option -O9.
We first give the results regarding the space used by our index and then the
results of the experiments classified by query type.
7.1. Space Consumption
Table 1 (top) shows the space that the k-ary Huffman index takes as a fraction
of the text for different values of k and for the three types of text considered. These
values do not include the space required to locate positions and display text.
We can see that the space requirements are the lowest for k = 4. For higher
values this space increases, although staying reasonable until k = 16. With larger
k values the spaces are too high for these indexes to be comparable to the rest.
It is also interesting to see how the space requirement of the index is divided
among its different structures. Table 1 (bottom) shows the space used by each of
the structures for the index with k = 2 and k = 4, considering the three types of
text. For higher values of k the space used by the rank tables will increase too fast
compared to the reduction in Bh.
k     Fraction of text
      English   DNA     Proteins
2     1.68      0.76    1.45
4     1.52      0.74    1.30
8     1.60      0.91    1.43
16    1.84      -       1.57
32    2.67      -       1.92
64    3.96      -       -
FM-Huffman k = 2
Structure   Space [MB]
            English   DNA     Proteins
B           48.98     16.59   29.27
Bh          48.98     16.59   29.27
Rank(B)     18.37     6.22    10.97
Rank(Bh)    18.37     6.22    10.97
Total       134.69    45.61   80.48
Text        80.00     60.00   55.53
Fraction    1.68      0.76    1.45

FM-Huffman k = 4
Structure   Space [MB]
            English   DNA     Proteins
B           49.81     18.17   29.60
Bh          24.91     9.09    14.80
Rank(B)     37.36     13.63   22.20
Rank(Bh)    9.34      3.41    5.55
Total       121.41    44.30   72.15
Text        80.00     60.00   55.53
Fraction    1.52      0.74    1.30
Table 1: On top, space requirement of our k-ary Huffman index for different values
of k. The value corresponding to row k = 8 for DNA actually corresponds to k = 5,
since this is the total number of symbols to code in this file. Similarly, the value of
row k = 32 for the protein sequence corresponds to k = 24. On the bottom, detailed
comparison of k = 2 versus k = 4. We omit the spaces used by the Huffman table,
the constant-size tables for rank, and array C, as they are all negligible.
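The bottom half of Table 1 can be cross-checked with a few lines of Python: each total should be the sum of the four structures, and each fraction the total divided by the text size. The sketch below only re-enters the values printed in the table; the layout of the dictionary, the function name and the tolerances (0.02 MB and 0.005) are assumptions that allow for rounding.

```python
# (k, text) -> (B, Bh, Rank(B), Rank(Bh), Total, Text, Fraction); sizes in MB, as printed in Table 1.
TABLE_1 = {
    (2, "English"): (48.98, 48.98, 18.37, 18.37, 134.69, 80.00, 1.68),
    (2, "DNA"): (16.59, 16.59, 6.22, 6.22, 45.61, 60.00, 0.76),
    (2, "Proteins"): (29.27, 29.27, 10.97, 10.97, 80.48, 55.53, 1.45),
    (4, "English"): (49.81, 24.91, 37.36, 9.34, 121.41, 80.00, 1.52),
    (4, "DNA"): (18.17, 9.09, 13.63, 3.41, 44.30, 60.00, 0.74),
    (4, "Proteins"): (29.60, 14.80, 22.20, 5.55, 72.15, 55.53, 1.30),
}


def inconsistencies(table=TABLE_1, tol_mb=0.02, tol_fraction=0.005):
    """Return the keys whose total or fraction disagrees with its parts."""
    bad = []
    for key, (b, bh, rank_b, rank_bh, total, text, fraction) in table.items():
        sum_ok = abs(b + bh + rank_b + rank_bh - total) <= tol_mb
        fraction_ok = abs(total / text - fraction) <= tol_fraction
        if not (sum_ok and fraction_ok):
            bad.append(key)
    return bad
```

The same data also shows where the saving for k = 4 comes from: Bh halves with respect to B, while Rank(B) grows from 37.5% to 75% of B.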
A similar study is carried out on Kautz-Zeckendorf coding in Table 2, although
in this case there is no array Bh. The space is not the result of a tension between B
and Bh, but between the length of the header and the number of different codewords
of each length.
k     Fraction of text
      English   DNA     Proteins
1     2.04      0.41    1.39
2     0.91      0.54    0.88
3     1.04      0.71    1.02
4     1.20      0.89    1.19
5     1.37      1.06    1.36
Table 2: Space requirement of our FM-KZ index with parameter k, for different
values of k.
7.2. Counting Queries
For the three files, we show the counting time as a function of the pattern length,
varying from 10 to 100, with a step of 10. For each length we used 1000 patterns
taken at random positions from each text. Each search was repeated 1000 times.
Figure 5 (left) shows the time for counting the occurrences for each index and for
the three files considered. As the CSA index has a space/time tuning parameter
for this type of queries, we adjusted it to use approximately the same space
as the binary FM-Huffman index.
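The query workload described above is easy to reproduce. The following Python sketch draws patterns at random positions of a text, for lengths 10 to 100 in steps of 10; the function names, the seeding scheme and the use of Python's random module are assumptions of this sketch, as the text does not describe the authors' tooling.

```python
import random


def sample_patterns(text, length, count, seed=0):
    """Return `count` substrings of the given length taken at random positions of text."""
    last = len(text) - length
    if last < 0:
        raise ValueError("text is shorter than the pattern length")
    rng = random.Random(seed)
    positions = (rng.randint(0, last) for _ in range(count))
    return [text[p:p + length] for p in positions]


def workload(text, lengths=range(10, 101, 10), count=1000, seed=0):
    """One list of patterns per pattern length, as in the counting experiment."""
    return {m: sample_patterns(text, m, count, seed + m) for m in lengths}
```

Because the patterns are extracted from the text itself, every query has at least one occurrence.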
We show in Figure 5 (right) the average search time per character along with
the space requirement to count occurrences. Only the CSA permits a space/time
tradeoff for counting queries, so it appears as a line while the other indexes are
represented by points.
7.3. Locating and Displaying
We measured the time each index took to search for a pattern and locate the
positions of the occurrences found. From the English text and the DNA sequence
we took 1000 random patterns of length 10. From the protein sequence we used
patterns of length 5.
Figure 6 (left) shows the time per occurrence located for each index as a function
of its size. Most indexes (except LZ) permit a space/time tradeoff for locating, so
they appear as lines in the plots. The CSA has two such parameters now, and we
show the optimal combination that achieves each space occupancy.
Figure 6 (right) shows the time to display a text character as a function of
the index size. For the same searched patterns above, we displayed 100 characters around each of their occurrences. As for counting, only the CSA permits a
space/time tradeoff for this operation.
7.4. Analysis of Results
We can see that our FM-Huffman index with k = 16 is the fastest for counting queries for English and proteins. The version with k = 4 also gives relevant
space/time tradeoffs. On the other hand, FM-KZ is the clear winner on DNA, as
it takes by far the least space and its counting time is the best, together with SSA
and FM-Huffman. The other outstanding index is SSA, as it offers an attractive
space/time tradeoff on English and proteins, being second-best on DNA. As expected, all the FM-Huffman and FM-KZ versions are faster than CSA, RLFM and
LZ, the latter not being competitive for counting queries.
[Figure 5 plots not reproduced. Left panels: search time on English text (80 MB), DNA (60 MB) and proteins (55 MB), for pattern lengths 10 to 100. Right panels: space vs. search time per character on the same texts, with space given as a fraction of the text.]
Figure 5: On the left, counting time as a function of the pattern length over English
(80 MB), DNA (60 MB), and proteins (55 MB). On the right, average search time
per character as a function of the index size. The times of the LZ index are not
competitive in this experiment.
[Figure 6 plots not reproduced. Left panels: time to report an occurrence on English text (80 MB), DNA (60 MB) and proteins (55 MB). Right panels: space vs. display time on the same texts, with space given as a fraction of the text.]
Figure 6: On the left, time to locate the positions of the occurrences as a function
of the size of the index. On the right, time per character to display text passages.
We show the results of searching on 80 MB of English text, 60 MB of DNA and
finally 55 MB of proteins. The reporting time of LZ on English is 0.07 milliseconds.
For locating queries, our indexes do not give competitive space/time tradeoffs
on English nor proteins, where the FM-index, CSA and SSA dominate in all the
spectrum. When all the indexes use much space, our FM-index variants can be
faster than RLFM, CSA, LZ, and, barely, the SSA. For DNA, however, our FM-KZ
index gives the best tradeoff in all the spectrum. The next relevant indexes are the
SSA and the FM-Huffman variants.
Regarding display time, our FM-Huffman index variants are again the fastest.
On English text, however, the LZ is equally fast and much smaller (k = 16 is the
relevant FM-Huffman version here). The FM-index, FM-KZ, and CSA also give
relevant space/time tradeoffs. On DNA, the FM-Huffman version with k = 4 is
the fastest, requiring also little space. The only other interesting tradeoff is given
by FM-KZ, which takes by far the least space and competitive time. Finally, on
proteins, the FM-Huffman version with k = 16 is clearly the fastest. The best competitor,
the FM-index, uses 30% less space but it is twice as slow. The other relevant
space/time tradeoff is given by FM-KZ.
In general we can see that the FM-Huffman index is in many cases the fastest,
albeit it cannot operate on very little space as other indexes. On DNA, on the other
hand, FM-KZ is in most cases the smallest and fastest index.
8. Conclusions
We have presented a practical data structure inspired by the FM-index [3], which
removes its sharp dependence on the alphabet size σ. Our key idea is to Huffman-compress the text before applying the Burrows-Wheeler transform over it. Over
a text of n characters, our structure needs $O(n(H_0 + 1))$ bits, where $H_0$ is the zero-order entropy of the text. It can search for a pattern of length m in $O(m(H_0 + 1))$
average time. Our structure has the advantage of (almost) not depending on
the alphabet size, and of having better complexities than other indexes for some
operations. We also discussed and tested alternative variants of our index, where the
binary Huffman was replaced with other encodings with stronger synchronization
properties.
Our structures are simple and easy to implement. Our experimental results show
that our indexes are competitive in practice against other implemented alternatives.
In some cases they are not the most succinct, but they are the fastest, even if we
let the other structures use significantly more space. In other cases, our indexes are
both the smallest and fastest among the compared alternatives.
After several years of mainly theoretical development, the field of compressed
full-text self-indexing is moving fast to practical considerations. Our work can be
seen as one of the first practice-oriented developments [7]. Recently, new indexes
and variants have been implemented and a site devoted to practical implementations and testbeds is being developed (http://pizzachili.dcc.uchile.cl and
http://pizzachili.di.unipi.it). New implementations are being constantly
added to this site. Our immediate future work is to adapt the most promising variants of our indexes to the common interface of this site, so as to permit a uniform
comparison among the most up-to-date implementations. We also plan to continue
the research on coding variants whose properties can be used to reduce the size of
the index.
Acknowledgments
We thank the anonymous referees for suggesting improvements to the manuscript.
This work was partially funded by Fondecyt Grant 1-050493 (Gonzalo Navarro).
References
1. M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm.
DEC SRC Research Report 124, 1994.
2. D. Clark. Compact Pat Trees. PhD thesis, University of Waterloo, 1996.
3. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In
Proc. FOCS'00, pp. 390-398, 2000.
4. P. Ferragina and G. Manzini. An experimental study of an opportunistic index. In
Proc. SODA'01, pp. 269-278, 2001.
5. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. An alphabet-friendly FM-index. In Proc. SPIRE'04, pp. 150-160, 2004. LNCS 3246.
6. R. González, Sz. Grabowski, V. Mäkinen, and G. Navarro. Practical implementation
of rank and select queries. In Poster Proc. WEA'05, pp. 27-38, 2005.
7. Sz. Grabowski, V. Mäkinen, and G. Navarro. First Huffman, then Burrows-Wheeler:
an alphabet-independent FM-index. In Proc. SPIRE'04, pp. 210-211, 2004. Poster.
8. Sz. Grabowski, V. Mäkinen, and G. Navarro. First Huffman, then
Burrows-Wheeler: an alphabet-independent FM-index. Technical Report
TR/DCC-2004-4, Dept. of Computer Science, Univ. of Chile, July 2004.
ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/huffbwt.ps.gz.
9. Sz. Grabowski, V. Mäkinen, G. Navarro, and A. Salinger. A simple alphabet-independent FM-index. In Proc. PSC'05, pp. 230-244, 2005.
10. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes.
In Proc. SODA'03, pp. 841-850, 2003.
11. R. Grossi, A. Gupta, and J. Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. In Proc. SODA'04, 2004.
12. R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications
to text indexing and string matching. In Proc. STOC'00, pp. 397-406, 2000.
13. J. Kärkkäinen. Repetition-Based Text Indexes. PhD thesis, Report A-1999-4,
Department of Computer Science, University of Helsinki, Finland, 1999.
14. W. Kautz. Fibonacci codes for synchronization control. IEEE Trans. on Inf. Th.,
11, pp. 284-292, 1965.
15. U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches.
SIAM J. Comput., 22, pp. 935-948, 1993.
16. V. Mäkinen. Compact Suffix Array: A space-efficient full-text index. Fundamenta
Informaticae 56(1-2), pp. 191-210, 2003.
17. V. Mäkinen and G. Navarro. Compressed compact suffix arrays. In Proc. CPM'04,
pp. 420-433. LNCS 3109, 2004.
18. V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding.
Nordic J. of Computing 12(1):40-66, 2005.
19. I. Munro. Tables. In Proc. FSTTCS'96, pp. 37-42, 1996.
20. G. Navarro. Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms
2(1):87-114, 2004.
21. R. Przywarski, Sz. Grabowski, G. Navarro, and A. Salinger. FM-KZ: An even
simpler alphabet-independent FM-index. In Proc. PSC'06, 2006. To appear.
22. R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with
applications to encoding k-ary trees and multisets. In Proc. 13th ACM-SIAM SODA,
pp. 233-242, 2002.
23. K. Sadakane. Compressed text databases with efficient query algorithms based on
the compressed suffix array. In Proc. ISAAC'00, LNCS 1969, pp. 410-421, 2000.
24. P. Weiner. Linear pattern matching algorithm. Proc. 14th Annual IEEE Symposium
on Switching and Automata Theory, pp. 1-11, 1973.
25. I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufmann Publishers, New York, 1999. Second edition.
26. E. Zeckendorf. Représentation des nombres naturels par une somme de nombres de
Fibonacci ou de nombres de Lucas. Bull. Soc. Roy. Sci. Liège 41, pp. 179-182, 1972.