Abstract:
|
In recent years, the cost of obtaining biological data
has dropped drastically. This has led to an
exponential growth of the available data. Analyzing data
at this scale sometimes results in very
platform-dependent and difficult-to-scale software
solutions.
This final project tries to provide a solution to those
problems in a real bioinformatics core facility at the
Science For Life Laboratory. Science For Life Laboratory
is a center for large-scale biosciences with a focus on
health and environmental research. It is located in
Stockholm, Sweden. The laboratory currently has 15
next-generation sequencing instruments, with a
combined capacity for DNA sequencing equal to several
hundred complete human genomes per year. This
implies a massive amount of data to be managed and
analyzed.
This data is analyzed using bcbio-nextgen, a genomics
pipeline maintained in-house and originally developed by
Brad Chapman at Harvard School of Public Health [Rom12].
The first goal of this project is to automate the
installation, deployment and testing of the aforementioned
pipeline. The second goal is to modify the alignment step
of the analysis to use Seal, a Hadoop-based aligner. This
will also allow us to check that all automations are
working properly, as the pipeline will have to be
installed and tested on several nodes. |