Reads in Undetermined but Barcodes Are in Sample Sheet
idemuxCPP - inline barcode demultiplexing
idemuxCPP is a command line tool designed to demultiplex paired-end fastq files from QuantSeq-Pool.
idemuxCPP can demultiplex based on i7, i5 and i1 inline barcodes. While this tool can generally be used to demultiplex on any barcodes (as long as they are correctly supplied and in the fastq header), it performs best when used in combination with Lexogen indices, as it will correct common sequencing errors in the sequenced barcodes. This will allow you to retain more reads from your sequencing experiment, while minimizing cross contamination.
Use of idemuxCPP is permitted under the following licence.
idemuxCPP is a direct translation of the Python tool idemux (https://github.com/lexogen-tools/idemux), written to decrease the runtime. It is 2 times faster than the Python version.
General usage:
idemuxCPP [-h] --r1 READ1 --r2 READ2 [--sample-sheet SAMPLE_SHEET] --out OUTPUT_DIR [--i1-start I1_START] [--i5-rc] [-v]
Run idemuxCPP:
idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out /some/output/path --i1-start pos_in_read_2
Features
- FASTQ file demultiplexing based on i7, i5 or i1 barcodes
- Correction of barcode sequencing errors to maximize read yield (only works with Lexogen 12 nt UDIs that have been sequenced for at least 8 nt)
- Reverse complementation in case the i5 index has been sequenced as reverse complement
Getting started
To get started with demultiplexing you need to:
- Install idemuxCPP
- Set up a sample sheet csv
- Run idemuxCPP
1. Installation
Dependencies:
- compiler supporting the C++11 standard and OpenMP
- boost C++ library version >= 1.55.0 (install the development versions of the following libraries, or install all of them, e.g. in Ubuntu via the package libboost-all-dev)
  - libboost-filesystem
  - libboost-system
  - libboost-iostreams
  - libboost-test (only required if you want to compile unit tests)
- zlib (e.g. zlib1g-dev in Ubuntu)
- gengetopt
Windows 10 64bit binary
For Windows you do not need to install any dependencies (they are included in the package). Simply download the pre-compiled Windows binary and extract the zip file. To execute the tool press windows+r, enter cmd, cd C:\location_of_the_extracted_zip_file\bin and execute .\idemuxCPP
From Source (distribution tar)
To configure, compile and install execute the following commands on your command line:
./configure [--help for additional configuration options]
make
make install
From Source (git)
The installation from source requires additional tools and libraries:
- gnulib (autoconf, automake, etc.)
Generate the configure file with:
Then proceed with ./configure and make as in the previous section.
From Linux Package
Debian | Ubuntu | Fedora |
---|---|---|
Debian_11: idemuxcpp - 0.1.9 - 32 bit, idemuxcpp - 0.1.9 - 64 bit; Debian_10: idemuxcpp - 0.1.9 - 32 bit, idemuxcpp - 0.1.9 - 64 bit; Debian_9.0: idemuxcpp - 0.1.9 - 32 bit, idemuxcpp - 0.1.9 - 64 bit | xUbuntu_21.04: idemuxcpp - 0.1.9 - 64 bit; xUbuntu_20.10: idemuxcpp - 0.1.9 - 64 bit; xUbuntu_20.04: idemuxcpp - 0.1.9 - 64 bit; xUbuntu_19.04: idemuxcpp - 0.1.9 - 64 bit; xUbuntu_18.04: idemuxcpp - 0.1.9 - 64 bit; xUbuntu_16.04: idemuxcpp - 0.1.9 - 32 bit, idemuxcpp - 0.1.9 - 64 bit | Fedora 34: idemuxcpp - 0.1.9 - 64 bit; Fedora 33: idemuxcpp - 0.1.9 - 64 bit; Fedora 32: idemuxcpp - 0.1.9 - 64 bit |
On Ubuntu you can install it for example with:
idemuxCPP will also soon be available via bioconda!
2. Preparing the sample sheet
In order to run idemuxCPP on your QuantSeq-Pool data you first need to prepare a csv file. We call this csv a sample sheet and it specifies which barcodes correspond to each sample.
This is a necessity as the software needs to know into which bins reads should be sorted during demultiplexing. A sample sheet can easily be generated by filling in an Excel spreadsheet and exporting it as csv.
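As an alternative to a spreadsheet export, the file can also be written by a short script. The following is only a sketch using Python's standard csv module; the helper name write_sample_sheet and the example rows are illustrative and not part of idemuxCPP.

```python
import csv
import io

# Header that the sample sheet starts with, as documented in this README.
HEADER = ["sample_name", "i7", "i5", "i1"]


def write_sample_sheet(rows, handle):
    """Write (sample_name, i7, i5, i1) tuples as a sample sheet csv.

    A missing barcode may be given as None or "" and is written as an
    empty field, i.e. nothing between the commas.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for name, i7, i5, i1 in rows:
        writer.writerow([name, i7 or "", i5 or "", i1 or ""])


# Illustrative rows; sample_1 has no i1 barcode.
rows = [
    ("sample_0", "AAAACATGCGTT", "CCCCACTGAGTT", "AAAACATGCGTT"),
    ("sample_1", "AAAATCCCAGTT", "CCCCTAAACGTT", None),
]

buffer = io.StringIO()
write_sample_sheet(rows, buffer)
sheet_text = buffer.getvalue()
print(sheet_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with open("samples.csv", "w", newline="") as the handle.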
Example sample sheet (i7, i5 and i1 demultiplexing):
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT
sample_2,GAAAATTTACGC,GCCCCTTTCAGA,GAAAATTTACGC
sample_3,AAACTAACTGTC,CCCATCCATGTA,AAACTAACTGTC
A sample sheet consists of 4 columns and always starts with the header illustrated above. 'sample_name' values will be used as output file names, while the sequences specified in i7, i5 and i1 will be used for demultiplexing.
Therefore, only specific, unique and unambiguous combinations of sample names and barcodes are allowed. This means using duplicated or ambiguous combinations will result in an error. However, idemuxCPP will do its best to tell you where the problem lies, once this happens.
In brief the rules are:
- Sample names need to be unique.
- Barcode combinations need to be unique.
- i7 and/or i5 indices have to be used consistently within the csv file. i7 and/or i5 indices need to be either present for all samples or none at all.
- In contrast to i7/i5 indices, i1 indices can be used for a subset of samples in the csv file.
- Absence of a barcode needs to be indicated by an empty field (no value between commas ,,).
- If your i5 has been sequenced as reverse complement, do not enter the reverse complement sequences in the sample sheet. Use the --i5-rc option!
See below for more showcases of sample/barcode combinations that are allowed or disallowed.
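Some of these rules can be checked before starting a run. The following sketch is not idemuxCPP's own validation; it is a hypothetical pre-check (the function name check_sample_sheet is made up here) that covers only the header, the column count, unique sample names, unique barcode combinations and rows without any barcode. It deliberately does not try to reproduce the i7/i5 consistency or ambiguity checks, which depend on the tool's internal logic.

```python
import csv
import io

REQUIRED_HEADER = ["sample_name", "i7", "i5", "i1"]


def check_sample_sheet(text):
    """Return a list of problems found in a sample sheet (empty list = ok)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows or rows[0][:4] != REQUIRED_HEADER:
        return ["header must start with sample_name,i7,i5,i1"]

    problems = []
    seen_names = set()
    seen_combos = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) < 4:
            problems.append(f"row {line_no}: expected at least 4 columns")
            continue
        name, i7, i5, i1 = (field.strip() for field in row[:4])
        if name in seen_names:
            problems.append(f"row {line_no}: duplicated sample name {name}")
        seen_names.add(name)
        combo = (i7, i5, i1)
        if not any(combo):
            problems.append(f"row {line_no}: no barcodes given")
        elif combo in seen_combos:
            problems.append(f"row {line_no}: duplicated barcode combination")
        seen_combos.add(combo)
    return problems


good_sheet = (
    "sample_name,i7,i5,i1\n"
    "sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT\n"
    "sample_1,AAAATCCCAGTT,CCCCTAAACGTT,\n"
)
duplicated_names = (
    "sample_name,i7,i5,i1\n"
    "sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT\n"
    "sample_0,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT\n"
)
print(check_sample_sheet(good_sheet))
print(check_sample_sheet(duplicated_names))
```

A sheet that passes this pre-check can still be rejected by idemuxCPP, so treat the tool's own error messages as authoritative.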
3. Running idemuxCPP
Once you have installed the tool you can run it by typing idemuxCPP in the terminal.
idemuxCPP accepts the following arguments:
  -h, --help                    Print help and exit
  -V, --version                 Print version and exit

Required arguments:
  -1, --r1=STRING               Fastq.gz read file 1 (or .fastq file). (default='')
  -2, --r2=STRING               Fastq.gz read file 2 (required only in paired end mode). (default='')
  -o, --out=STRING              Where to write the output files. (default='./')
  -s, --sample-sheet=STRING     Input a csv file describing sample names and barcode combinations (i7, i5 and i1 barcodes). (default='sample-sheet.csv')

Optional arguments:
  -b, --barcode-corrections=STRING
                                Outputs a csv file that contains the number of corrected barcodes
  -5, --i5-rc                   Should be set when the i5 barcode has been sequenced as reverse complement. Make sure to enter non-reverse complement sequences in the barcode file. (default=off)
  -i, --i1-start=INT            First position of the i1 index (1-based) on read 2. (default='11')
      --i1-read=INT             Read in which the i1 index should be corrected (1 or 2). (default='2')
  -q, --queue-size=INT          Queue size for reads that will be processed in one block. (default='4000000')
  -r, --reading-threads=INT     Number of threads used for reading gz files. Either 1 or 2 (one thread per input file is used). (default='2')
  -w, --writing-threads=INT     Number of threads used for writing gz files. Default is the number of processor cores.
  -p, --processing-threads=INT  Number of threads used for processing the error correction. Default is the number of processor cores.
  -d, --demux-only              Do a one on one mapping for the barcodes specified in the sample sheet. No error correction will be done. Barcodes that do not match are written to the undetermined reads file. (default=off)
  -v, --verbose                 Verbose. (default=off)
Example commands:
# demultiplexes read 1 and 2 into the folder 'demux'
idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux

# demultiplexing assuming the i1 barcode starts at the first base
idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux --i1-start 1

# demultiplexing assuming i5 is present as reverse complement in the fastq header
# if the i5 has been sequenced as reverse complement use this option and provide
# the NON reverse complement sequences in the sample sheet.
idemuxCPP --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux --i5-rc
After a successfully completed run idemuxCPP will write a summary report to the output folder ('demultipexing_stats.tsv').
Technicalities
When you run idemuxCPP the following will happen:
- It will check if your sample sheet is okay. See the sample sheet examples at the end of this document.
- It will check the fastq header for barcodes and expects them in the following format:
  single index (i7 or i5): @NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT
  dual index (i7 and i5): @NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT+NANGGNNCNNNN
- Reads that cannot be demultiplexed will be written to undetermined_R{1/2}.fastq.gz
- When you demultiplex based on i1 inline barcodes, a successfully recognized barcode sequence will be cut out and removed from read 2. This is a design choice and will leave you with the 10 nt UMI + the nucleotides that potentially follow the i1 barcode (or don't).
This allows you to:
- Use other software, such as UMI_tools, to deal with the 10 nt UMI if desired
- Demultiplex lanes where QuantSeq-Pool has been pooled with other libraries and read 2 has been sequenced longer than the actual barcode.
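The header format and the i1 removal described above can be illustrated with a small sketch. This is not idemuxCPP's code: both function names are hypothetical, the default start position 11 is taken from the --i1-start default, and the i1 length of 12 nt is an assumption based on the 12 nt barcodes used in the example sample sheets.

```python
def barcodes_from_header(header):
    """Return (first, second) barcode from a fastq header line.

    For a single index the second element is None. Whether a single
    barcode is the i7 or the i5 cannot be told from the header alone.
    """
    # The barcode field is the last colon-separated item of the header.
    field = header.strip().rsplit(":", 1)[-1]
    first, _, second = field.partition("+")
    return first, (second or None)


def cut_i1(sequence, quality, i1_start=11, i1_length=12):
    """Cut the i1 barcode out of read 2.

    Returns (i1, trimmed_sequence, trimmed_quality). The bases before
    the barcode (the UMI) and any bases after it are kept and joined.
    i1_length=12 is an assumption, not a documented default.
    """
    begin = i1_start - 1  # convert the 1-based start to a 0-based index
    end = begin + i1_length
    i1 = sequence[begin:end]
    return i1, sequence[:begin] + sequence[end:], quality[:begin] + quality[end:]


header = "@NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT+NANGGNNCNNNN"
print(barcodes_from_header(header))

# Made-up read 2: 10 nt UMI, 12 nt i1 barcode, 6 further nucleotides.
read_2 = "ACGTACGTAC" + "AAAACATGCGTT" + "TTTTGG"
qual_2 = "I" * len(read_2)
i1, trimmed, trimmed_qual = cut_i1(read_2, qual_2)
print(i1, trimmed)
```

Sequence and quality string are trimmed together so that the record stays a valid fastq entry.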
If you sequenced i5 as a reverse complement, make sure not to enter reverse complement barcodes in the sample sheet, but to use the --i5-rc parameter.
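For orientation, this is what the reverse complement of a barcode looks like. The sketch below only illustrates the relationship between the sequence in the sample sheet and the sequence that appears in the fastq header when i5 was read as reverse complement; how idemuxCPP handles this internally is not described here, and the function name is an assumption.

```python
# Translation table for the complement of each base; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(barcode):
    """Return the reverse complement of a barcode sequence."""
    return barcode.upper().translate(_COMPLEMENT)[::-1]


sheet_i5 = "CCCCACTGAGTT"  # i5 as entered in the sample sheet
header_i5 = reverse_complement(sheet_i5)  # as seen in the header when read as reverse complement
print(sheet_i5, header_i5)
```

The sample sheet keeps the forward sequence (sheet_i5 above); the conversion itself is left to the --i5-rc option.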
Help
If you are demultiplexing a large number of samples (more than 500) you might see the following error:
- OSError: [Errno 24] Too many open files
This error occurs because most operating systems have a limit on how many files can be opened and written to at the same time. In order to temporarily increase the limit run:
# multiply your sample number*2 (as data is paired end)
# then round to the next multiple of 1024
ulimit -n the_number_above
If you are looking for a permanent solution you can change your ulimit values this way.
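The arithmetic for the ulimit value can be written down as a small helper. This is a sketch only: the function name is made up, and it always rounds up to a strictly larger multiple of 1024, which is an assumption made here so that a few file handles remain for the input files and the undetermined reads.

```python
def open_file_limit(n_samples, files_per_sample=2, step=1024):
    """Suggest a value for `ulimit -n`.

    n_samples * files_per_sample output files are needed (two per sample
    for paired-end data); the result is the next multiple of `step`
    above that number.
    """
    needed = n_samples * files_per_sample
    return (needed // step + 1) * step


# 600 samples -> 1200 output files -> next multiple of 1024
print(open_file_limit(600))
```

The printed number is what would replace the_number_above in the ulimit command.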
In case you experience any issues with this software please open an issue describing your problem. Make sure to post the version of the tool you are running (-V, --version) and your operating system.
Sample sheet examples
This is allowed:
# demultiplexing via full i7, i5, i1
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT

# demultiplexing via full i7, i5 and sparse i1
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAATCCCAGTT,CCCCTAAACGTT,

# demultiplexing via full i7, i5
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,
sample_1,AAAATCCCAGTT,CCCCTAAACGTT,

# demultiplexing via full i7, no i5 and sparse i1
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,,AAAACATGCGTT
sample_1,AAAATCCCAGTT,,

# demultiplexing via full i7 only
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,,
sample_1,AAAATCCCAGTT,,

# demultiplexing via full i5 and i1
sample_name,i7,i5,i1
sample_0,,CCCCACTGAGTT,AAAACATGCGTT
sample_1,,CCCCTAAACGTT,AAAATCCCAGTT

# demultiplexing via full i5 and sparse i1
sample_name,i7,i5,i1
sample_0,,CCCCACTGAGTT,AAAACATGCGTT
sample_1,,CCCCTAAACGTT,

# demultiplexing via full i5
sample_name,i7,i5,i1
sample_0,,CCCCACTGAGTT,
sample_1,,CCCCTAAACGTT,

# demultiplexing via full i1
sample_name,i7,i5,i1
sample_0,,,AAAACATGCGTT
sample_1,,,AAAATCCCAGTT

# mixed indexing (if not ambiguous) (full i7 and sparse i5, i1)
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAATCCCAGTT,,AAAATCCCAGTT
sample_2,GAAAATTTACGC,GCCCCTTTCAGA,GAAAATTTACGC
sample_3,AAACTAACTGTC,,AAACTAACTGTC

# mixed indexing (if not ambiguous) (no i7, sparse i5 & i1)
sample_name,i7,i5,i1
sample_0,,CCCCACTGAGTT,
sample_1,,,AAAATCCCAGTT

# mixed indexing (if not ambiguous) (sparse i7, full i5 & i1)
sample_name,i7,i5,i1
sample_0,,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT
sample_2,,GCCCCTTTCAGA,GAAAATTTACGC
sample_3,AAACTAACTGTC,CCCATCCATGTA,AAACTAACTGTC

# additional parameter columns for i1_read and i1_start index (1-based).
sample_name,i7,i5,i1,i1_read,i1_start
sample_0,AANACATGCGTT,,TTTTAG,2,1
sample_1,AANACATGCG,,AAAACATG,2,11
sample_2,AANACA,,CACCCC,1,5
This is not allowed:
# missing i1 column (or any other)
sample_name,i7,i5,
sample_0,AAAACATGCGTT,CCCCACTGAGTT
sample_1,AAAATCCCAGTT,CCCCTAAACGTT

# duplicated barcode combination
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT

# duplicated sample names
sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_0,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT

# missing comma separator
sample_name,i7,i5,i1
sample_0,AAAACATGCGTTCCCCACTGAGTT,AAAACATGCGTT

# no barcodes
sample_name,i7,i5,i1
sample_0,,,

# wrong column headers
wrong_col_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
© Lexogen GmbH, 2020
Source: https://github.com/Lexogen-Tools/idemuxcpp