New Technologies for Identifying and Characterizing Genomic Repetitive Elements
Mobile element insertions (MEIs), including long interspersed element-1 (L1), Alu, and SVA (SINE-VNTR-Alu) retrotransposons, comprise approximately 46% of the human genome and have been shown to play an important role in human development and disease. Various strategies have been developed to identify candidate polymorphic MEIs from short-read whole genome sequencing data, but they struggle in regions where reads map equally well to multiple genomic positions. Long-read sequencing technology provides better resolution in such regions by directly sequencing long stretches of contiguous DNA, enabling the discovery of potentially overlooked MEIs. Here, we present a comprehensive analysis of retrotransposon insertions across a diverse set of samples using an enhanced version of PALMER, an approach that uses a pre-masking strategy to account for endogenous reference repeats and then searches the remaining unmasked sequences against a library of mobile element sequences to detect non-reference insertions. We applied PALMER to recently generated data from fifteen high-coverage (>50x) PacBio whole genome sequences and identified a set of polymorphic MEIs (902 L1s, 5958 Alus, and 358 SVAs), 53.8% (3880/7218) of which were detected in multiple samples. We observed that 42.2% (381/902) of L1s, 33.8% (2011/5958) of Alus, and 39.4% (141/358) of SVAs were absent from recent PacBio assembly-based studies, in which SVAs may be over-estimated (755 vs 358) and human-specific L1s under-estimated (615 vs 902) owing to their omission or mis-annotation. We showed that these missing MEIs were predominantly found in endogenous reference LINE/SINE regions (P < .0001), suggesting that such regions hinder typical discovery approaches.
An analysis of unique breakpoint junction sequences in high-coverage short-read 1000 Genomes Project samples (n=2504, ~30x) further revealed that 92.6% (6683/7218) of our detected MEIs were found in at least one sample and that 71.8% (5185/7218) had an allele frequency of >5%. We next developed a Cas9 targeted enrichment approach using nanopore sequencing to enable the high-throughput identification and resolution of MEIs in large cohorts, and showed that new insertions can be preferentially enriched. Together, we present a more holistic view of retrotransposon insertions in human populations and provide additional utility for studies using both short- and long-read sequencing technology. The PALMER software is available at https://github.com/mills-lab/palmer.
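The pre-masking idea described above can be illustrated with a small toy sketch. It is not the PALMER implementation: the function names (`mask_read`, `find_candidates`), the toy sequences, and the use of exact shared k-mers in place of a real sequence search are all assumptions made for illustration. The sketch masks the part of a read that corresponds to a known reference repeat, then compares only the unmasked sequence against a small mobile element library.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in read coordinates


def mask_read(read: str, repeat_intervals: List[Interval]) -> str:
    """Replace bases inside known reference-repeat intervals with 'N'."""
    bases = list(read)
    for start, end in repeat_intervals:
        for i in range(max(0, start), min(len(bases), end)):
            bases[i] = "N"
    return "".join(bases)


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_candidates(read: str, library: Dict[str, str],
                    k: int = 8, min_hits: int = 3) -> List[Tuple[str, int]]:
    """Return (element name, shared k-mer count) for library elements that
    share at least min_hits k-mers with the unmasked part of the read.
    Exact k-mer sharing is a hypothetical stand-in for a real aligner."""
    read_kmers = {km for km in kmers(read, k) if "N" not in km}
    hits = []
    for name, element in library.items():
        shared = len(read_kmers & kmers(element, k))
        if shared >= min_hits:
            hits.append((name, shared))
    return sorted(hits, key=lambda hit: -hit[1])


# Toy data (hypothetical sequences, not real mobile elements).
library = {
    "toyA": "ACGTTGCAAGGCTTAGC",
    "toyB": "TTTTCCCCGGGGAAAATTCC",
}
flank1, flank2, flank3 = "GATTACAGATTACA", "CATCATCATCAT", "GTGTGTGTGT"

# The read carries a reference copy of toyA and a non-reference copy of toyB.
read = flank1 + library["toyA"] + flank2 + library["toyB"] + flank3
reference_repeats = [(len(flank1), len(flank1) + len(library["toyA"]))]

masked = mask_read(read, reference_repeats)
unmasked_hits = find_candidates(read, library)
candidates = find_candidates(masked, library)
print("without masking:", unmasked_hits)
print("with masking:   ", candidates)
```

Without masking, the reference copy of `toyA` is reported alongside the new insertion; with masking, only `toyB` remains as a non-reference candidate. A real pipeline would also need read alignment to the reference, breakpoint refinement, and filtering, none of which is modeled here.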