Reads in Undetermined but Barcodes Are in Sample Sheet

idemuxCPP - inline barcode demultiplexing


idemuxCPP is a command line tool designed to demultiplex paired-end fastq files from QuantSeq-Pool.

idemuxCPP can demultiplex based on i7, i5 and i1 inline barcodes. While this tool can generally be used to demultiplex on any barcodes (as long as they are correctly supplied and in the fastq header), it performs best when used in combination with Lexogen indices, as it will correct common sequencing errors in the sequenced barcodes. This will allow you to retain more reads from your sequencing experiment, while minimizing cross contamination.

idemuxCPP use is permitted under the following licence.

idemuxCPP is a direct translation of the python tool idemux (https://github.com/lexogen-tools/idemux) in order to decrease the runtime. It is 2 times faster than the python version.

General usage:

    idemuxCPP [-h] --r1 READ1 --r2 READ2 [--sample-sheet SAMPLE_SHEET] --out OUTPUT_DIR [--i1-start I1_START] [--i5-rc] [-v]

Run idemuxCPP:

                              idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out /some/output/path --i1-start pos_in_read_2

Features

FASTQ file demultiplexing based on i7, i5 or i1 barcodes
Correction of barcode sequencing errors to maximize read yield (only works with Lexogen 12 nt UDIs that have been sequenced for at least 8 nt)
Reverse complementation in case the i5 index has been sequenced as reverse complement
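
To make the idea of barcode error correction concrete, here is a minimal Python sketch. It is a generic nearest-match illustration and not idemuxCPP's own correction procedure, which is tailored to the Lexogen index sets. The function names and the one-mismatch threshold are assumptions chosen for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known, max_mismatches=1):
    """Return the single known barcode within max_mismatches, else None."""
    if observed in known:
        return observed
    hits = [k for k in known
            if len(k) == len(observed) and hamming(observed, k) <= max_mismatches]
    # Several candidates (ambiguous) or none (too many errors): do not guess.
    return hits[0] if len(hits) == 1 else None


known = ["AAAACATGCGTT", "AAAATCCCAGTT"]
print(correct_barcode("AAAACATGCGTA", known))  # one mismatch to the first barcode
print(correct_barcode("GGGGGGGGGGGG", known))  # far from every known barcode
```

Refusing to assign a read when the match is ambiguous is what keeps cross contamination low while still rescuing reads with a single sequencing error.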

Getting started

To get started with demultiplexing you need to:

Install idemuxCPP
Set up a sample sheet csv
Run idemuxCPP

1. Installation

dependencies:

compiler supporting C++11 standard and OpenMP
boost C++ library version >= 1.55.0 (install the development versions of the following libraries, or install all of them, e.g. in Ubuntu via the package libboost-all-dev)
- libboost-filesystem
- libboost-system
- libboost-iostreams
- libboost-test (only required if you want to compile unit tests)
zlib (e.g. zlib1g-dev in Ubuntu)
gengetopt

Windows 10 64bit binary

For Windows you do not need to install any dependencies (they are included in the package). Simply download the pre-compiled Windows binary and extract the zip file. To execute the tool press Windows+R, enter cmd, change into the directory with cd C:\location_of_the_extracted_zip_file\bin and execute .\idemuxCPP

From Source (distribution tar)

To configure, compile and install execute the following commands on your command line:

    ./configure [--help for additional configuration options]
    make
    make install

From Source (git)

The installation from source requires additional tools and libraries:

gnulib (autoconf, automake, etc.)

Generate the configure file with:

Then proceed with ./configure and make as in the previous section.

From Linux Package

Distribution	Release	Packages
Debian	Debian_11	idemuxcpp 0.1.9 (32 bit), idemuxcpp 0.1.9 (64 bit)
Debian	Debian_10	idemuxcpp 0.1.9 (32 bit), idemuxcpp 0.1.9 (64 bit)
Debian	Debian_9.0	idemuxcpp 0.1.9 (32 bit), idemuxcpp 0.1.9 (64 bit)
Ubuntu	xUbuntu_21.04	idemuxcpp 0.1.9 (64 bit)
Ubuntu	xUbuntu_20.10	idemuxcpp 0.1.9 (64 bit)
Ubuntu	xUbuntu_20.04	idemuxcpp 0.1.9 (64 bit)
Ubuntu	xUbuntu_19.04	idemuxcpp 0.1.9 (64 bit)
Ubuntu	xUbuntu_18.04	idemuxcpp 0.1.9 (64 bit)
Ubuntu	xUbuntu_16.04	idemuxcpp 0.1.9 (32 bit), idemuxcpp 0.1.9 (64 bit)
Fedora	Fedora 34	idemuxcpp 0.1.9 (64 bit)
Fedora	Fedora 33	idemuxcpp 0.1.9 (64 bit)
Fedora	Fedora 32	idemuxcpp 0.1.9 (64 bit)

On Ubuntu you can install it, for example, with:

idemuxCPP will also soon be available via bioconda!

2. Preparing the sample sheet

In order to run idemuxCPP on your QuantSeq-Pool data you first need to prepare a csv file. We call this csv a sample sheet and it specifies which barcodes correspond to each sample.

This is a necessity as the software needs to know into which bins reads should be sorted during demultiplexing. A sample sheet can easily be generated by filling in an Excel spreadsheet and exporting it as csv.

Example sample sheet (i7, i5 and i1 demultiplexing):

    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT
    sample_2,GAAAATTTACGC,GCCCCTTTCAGA,GAAAATTTACGC
    sample_3,AAACTAACTGTC,CCCATCCATGTA,AAACTAACTGTC

A sample sheet consists of 4 columns and always starts with the header illustrated above. 'sample_name' values will be used as output file names, while the sequences specified in i7, i5 and i1 will be used for demultiplexing.

Therefore, only specific, unique and unambiguous combinations of sample names and barcodes are allowed. This means that using duplicated or ambiguous combinations will result in an error. However, idemuxCPP will do its best to tell you where the problem lies once this happens.

In brief the rules are:

Sample names need to be unique.
Barcode combinations need to be unique.
i7 and/or i5 indices have to be used consistently within the csv file. i7 and/or i5 indices need to be either present for all samples or none at all.
In contrast to i7/i5 indices, i1 indices can be used for a subset of samples in the csv file.
Absence of a barcode needs to be indicated by an empty field (no value between commas ,,).
If your i5 has been sequenced as reverse complement, do not enter the reverse complement sequences in the sample sheet. Use the --i5-rc option!

See below for more showcases of sample/barcode combinations that are allowed or disallowed.
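
A few of these rules can be checked before running the tool. The following Python sketch is only an illustration under stated assumptions: it covers the basic four-column layout, and the function name check_sample_sheet and its messages are invented for this example. It does not reproduce idemuxCPP's own validation or error messages.

```python
import csv
import io

EXPECTED_HEADER = ["sample_name", "i7", "i5", "i1"]


def check_sample_sheet(text):
    """Return a list of problems found in a sample sheet given as CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be: " + ",".join(EXPECTED_HEADER)]
    problems = []
    seen_names = set()
    seen_barcodes = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 4 columns, got {len(row)}")
            continue
        name, barcodes = row[0], tuple(row[1:])
        if name in seen_names:
            problems.append(f"line {line_no}: duplicated sample name {name}")
        if not any(barcodes):
            problems.append(f"line {line_no}: no barcodes given")
        elif barcodes in seen_barcodes:
            problems.append(f"line {line_no}: duplicated barcode combination")
        seen_names.add(name)
        seen_barcodes.add(barcodes)
    return problems


good = "sample_name,i7,i5,i1\ns0,AAAACATGCGTT,,\ns1,AAAATCCCAGTT,,\n"
bad = "sample_name,i7,i5,i1\ns0,AAAACATGCGTT,,\ns0,AAAACATGCGTT,,\n"
print(check_sample_sheet(good))
print(check_sample_sheet(bad))
```

The second sheet is reported twice for the same line, once for the repeated sample name and once for the repeated barcode combination.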

3. Running idemuxCPP

Once you have installed the tool you can run it by typing idemuxCPP in the terminal.

idemuxCPP accepts the following arguments:

      -h, --help                    Print help and exit
      -V, --version                 Print version and exit

    Required arguments:
      -1, --r1=STRING               Fastq.gz read file 1 (or .fastq file).
                                      (default='')
      -2, --r2=STRING               Fastq.gz read file 2 (required only in paired
                                      end mode).
                                      (default='')
      -o, --out=STRING              Where to write the output files.
                                      (default='./')
      -s, --sample-sheet=STRING     Input a csv file describing sample names and
                                      barcode combinations (i7, i5 and i1
                                      barcodes).
                                      (default='sample-sheet.csv')

    Optional arguments:
      -b, --barcode-corrections=STRING
                                    Outputs a csv file that contains the number of
                                      corrected barcodes
      -5, --i5-rc                   Should be set when the i5 barcode has been
                                      sequenced as reverse complement. Make sure to
                                      enter non-reverse complement sequences in the
                                      barcode file.  (default=off)
      -i, --i1-start=INT            First position of the i1 index (1-based) on
                                      read 2.
                                      (default='11')
          --i1-read=INT             Read in which the i1 index should be corrected
                                      (1 or 2).
                                      (default='2')
      -q, --queue-size=INT          Queue size for reads that will be processed in
                                      one block.
                                      (default='4000000')
      -r, --reading-threads=INT     Number of threads used for reading gz files.
                                      Either 1 or 2 (1 thread per input file is
                                      used).
                                      (default='2')
      -w, --writing-threads=INT     Number of threads used for writing gz files.
                                      Default is the number of processor cores.
      -p, --processing-threads=INT  Number of threads used for processing the error
                                      correction. Default is the number of
                                      processor cores.
      -d, --demux-only              Do a one on one mapping for the barcodes
                                      specified in the sample sheet. No error
                                      correction will be done. Barcodes that do not
                                      match are written to the undetermined reads
                                      file.  (default=off)
      -v, --verbose                 Verbose.
                                      (default=off)

Example commands:

    # demultiplexes read 1 and 2 into the folder 'demux'
    idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux

    # demultiplexing assuming the i1 barcode starts at the first base
    idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux --i1-start 1

    # demultiplexing assuming i5 is present as reverse complement in the fastq header
    # if the i5 has been sequenced as reverse complement use this option and provide
    # the NON reverse complement sequences in the sample sheet.
    idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux --i5-rc

After a successfully completed run idemuxCPP will write a summary report to the output folder ('demultipexing_stats.tsv').

Technicalities

When you run idemuxCPP the following will happen:

It will check if your sample sheet is okay. See the sample sheet examples below.

It will check the fastq header for barcodes and expects them in the following format:

    single index (i7 or i5): @NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT
    dual index (i7 and i5): @NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT+NANGGNNCNNNN

Reads that cannot be demultiplexed will be written to undetermined_R{1/2}.fastq.gz
When you demultiplex based on i1 inline barcodes, a successfully recognized barcode sequence will be cut out and removed from read 2. This is a design choice and will leave you with the 10 nt UMI + the nucleotides that potentially follow the i1 barcode (or don't).

This allows you to:

Use other software, such as UMI_tools, to deal with the 10 nt UMI if desired
Demultiplex lanes where QuantSeq-Pool has been pooled with other libraries and read 2 has been sequenced longer than the actual barcode.
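
The header format and the removal of the i1 barcode described above can be sketched in Python as follows. This is an illustration only, not the tool's implementation: the function names header_barcodes and cut_i1 are hypothetical, and a real implementation would also have to trim the matching quality string and tolerate sequencing errors in the barcode.

```python
def header_barcodes(header):
    """Return (first, second) index sequences from a FASTQ header line.

    second is None for single-index headers.
    """
    # The barcode field is the last ':'-separated part of the header,
    # e.g. '1:N:0:TCAGGTAANNTT+NANGGNNCNNNN'.
    field = header.rstrip().rsplit(":", 1)[-1]
    first, _, second = field.partition("+")
    return first, (second or None)


def cut_i1(read2_seq, i1, i1_start=11):
    """Remove i1 from read 2 if it sits at the 1-based position i1_start."""
    begin = i1_start - 1
    end = begin + len(i1)
    if read2_seq[begin:end] != i1:
        return None  # barcode not found at the expected position
    # Keep what precedes the barcode (the UMI) and whatever follows it.
    return read2_seq[:begin] + read2_seq[end:]


header = "@NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT+NANGGNNCNNNN"
print(header_barcodes(header))

umi = "ACGTACGTAC"  # made-up 10 nt UMI for the example
print(cut_i1(umi + "AAAACATGCGTT" + "TTTT", "AAAACATGCGTT"))
```

With the default start position of 11, the first 10 nt of read 2 (the UMI) are kept in front of whatever follows the barcode, which is what allows downstream UMI handling.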

If you sequenced i5 as a reverse complement, make sure not to fill in reverse complement barcodes in the sample sheet, but to use the --i5-rc parameter.
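
For readers unsure what "reverse complement" means for a barcode, here is a small Python sketch of the operation. The helper name reverse_complement is an assumption made for this example; it only shows the transformation and says nothing about how idemuxCPP applies it internally.

```python
# Translation table for the complement of each base; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# The i5 barcode of sample_0 from the example sample sheet.
print(reverse_complement("CCCCACTGAGTT"))
```

The sample sheet should contain the barcode as given to the function here, not the printed result.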

Help

If you are demultiplexing a large number of samples (more than 500) you might see the following error:

OSError: [Errno 24] Too many open files

This error occurs because most operating systems have a limit on how many files can be opened and written to at the same time. In order to temporarily increase the limit run:

    # multiply your sample number*2 (as data is paired end)
    # then round to the next multiple of 1024
    ulimit -n the_number_above
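
The arithmetic in the comments above can be written out as a short Python helper. The function name is made up for this sketch, and it assumes that "next multiple" means the first multiple of 1024 strictly above the number of output files, which leaves some headroom for the input files and standard streams.

```python
def suggested_open_file_limit(n_samples, files_per_sample=2, step=1024):
    """Value to pass to 'ulimit -n' for a paired-end run with n_samples."""
    needed = n_samples * files_per_sample  # one output file per read per sample
    # First multiple of step that is strictly greater than needed.
    return (needed // step + 1) * step


print(suggested_open_file_limit(600))
```

For 600 samples this amounts to 1200 output files, so the limit would be raised with ulimit -n 2048.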

If you are looking for a permanent solution you can change your ulimit values this way.

In case you experience any issues with this software please open an issue describing your problem. Make sure to post the version of the tool you are running (-V, --version) and your OS.

Sample sheet examples

This is allowed:

    # demultiplexing via full i7, i5, i1
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT

    # demultiplexing via full i7, i5 and sparse i1
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,

    # demultiplexing via full i7, i5
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,

    # demultiplexing via full i7, no i5 and sparse i1
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,,

    # demultiplexing via full i7 only
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,,
    sample_1,AAAATCCCAGTT,,

    # demultiplexing via full i5 and i1
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,,CCCCTAAACGTT,AAAATCCCAGTT

    # demultiplexing via full i5 and sparse i1
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,,CCCCTAAACGTT,

    # demultiplexing via full i5
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,
    sample_1,,CCCCTAAACGTT,

    # demultiplexing via full i1
    sample_name,i7,i5,i1
    sample_0,,,AAAACATGCGTT
    sample_1,,,AAAATCCCAGTT

    # mixed indexing (if not ambiguous) (full i7 and sparse i5, i1)
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,,AAAATCCCAGTT
    sample_2,GAAAATTTACGC,GCCCCTTTCAGA,GAAAATTTACGC
    sample_3,AAACTAACTGTC,,AAACTAACTGTC

    # mixed indexing (if not ambiguous) (no i7, sparse i5 & i1)
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,
    sample_1,,,AAAATCCCAGTT

    # mixed indexing (if not ambiguous) (sparse i7, full i5 & i1)
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT
    sample_2,,GCCCCTTTCAGA,GAAAATTTACGC
    sample_3,AAACTAACTGTC,CCCATCCATGTA,AAACTAACTGTC

    # additional parameter columns for i1_read and i1_start index (1-based).
    sample_name,i7,i5,i1,i1_read,i1_start
    sample_0,AANACATGCGTT,,TTTTAG,2,1
    sample_1,AANACATGCG,,AAAACATG,2,11
    sample_2,AANACA,,CACCCC,1,5

This is not allowed:

    # missing i1 column (or any other)
    sample_name,i7,i5,
    sample_0,AAAACATGCGTT,CCCCACTGAGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT

    # duplicated barcode combination
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT

    # duplicated sample names
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_0,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT

    # missing comma separator
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTTCCCCACTGAGTT,AAAACATGCGTT

    # no barcodes
    sample_name,i7,i5,i1
    sample_0,,,

    # wrong column headers
    wrong_col_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
