CScape predicts the oncogenic status (disease-driver or neutral) of somatic point mutations in the coding and non-coding regions of the cancer genome. Enter a mutation or list of mutations (one per line) into the form below using the format chromosome,position,reference,mutant (see Help for more details). Mutations uploaded from a file should use the VCF format with a minimum of five columns (chromosome, position, id, reference, mutant). Note: if a VCF file is uploaded, any entries in the User Input field will be ignored.
July 2018: liftOver conversion of the database has now been performed to provide predictions for GRCh38/hg38. Check the box below to access them.

Enter Your Mutations:

User Input

Use GRCh38/hg38

Alternatively, upload a VCF file (see example):

Publication:

Publications that use these data should cite the following: Rogers MF, Shihab H, Gaunt TR, Campbell C (2017). CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Scientific Reports.

Input Format:

Our software accepts comma-separated mutation data in the following format:

Chromosome
Position
Reference Base
Mutant Base

For example:

11,219046,A,C
11,224139,A,T
11,375885,G,T
11,408898,A,T
11,499190,G,C
11,551832,C,A
11,607532,C,T
11,773638,A,T
11,800755,C,A
11,828599,C,G
11,988551,G,C
11,1025084,C,G
11,1027680,C,A
17,46827903,A,G
17,79060569,A,G
18,756761,C,A
18,3879501,C,A
19,407408,G,T
19,407519,G,C
19,407627,G,A
19,757693,C,A
19,757792,G,T
19,812882,G,T
2,45966,C,A
20,9048655,A,G
20,9923941,A,G
20,18479366,A,G
20,53170414,T,C
3,48265219,A,G
3,52848428,A,G
3,66659209,A,G
3,184195375,A,G
7,193598,C,T
9,916799,C,T
9,3324019,A,T
9,5050791,G,T
9,5077554,C,T
9,6013277,T,A
9,6550908,C,A
9,6554763,C,A

Note 1: The 'Chr' prefix (e.g. Chr1) is not required when specifying the chromosome. All our predictions are derived using the forward strand.

Note 2: All predictions are based on version GRCh37 (ENSEMBL release 87) of the human genome.
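The input format above can be checked before submission. The following is a minimal sketch, not part of the CScape software: the function name parse_mutation is hypothetical, and restricting bases to A, C, G and T and rejecting identical reference and mutant bases are assumptions made for illustration.

```python
import re

VALID_BASES = {"A", "C", "G", "T"}  # assumption: only unambiguous bases are accepted


def parse_mutation(line):
    """Parse one 'chromosome,position,reference,mutant' query line.

    Returns a (chromosome, position, reference, mutant) tuple or raises
    ValueError if the line does not match the documented format.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields: {line!r}")
    chrom, pos, ref, alt = fields

    # The 'Chr' prefix is optional, so drop it if present (case-insensitive).
    chrom = re.sub(r"^chr", "", chrom, flags=re.IGNORECASE)
    if not chrom:
        raise ValueError(f"missing chromosome: {line!r}")

    # Positions must be positive integers.
    if not re.fullmatch(r"[0-9]+", pos) or int(pos) < 1:
        raise ValueError(f"invalid position: {line!r}")

    ref, alt = ref.upper(), alt.upper()
    if ref not in VALID_BASES or alt not in VALID_BASES:
        raise ValueError(f"bases must be one of A, C, G, T: {line!r}")
    if ref == alt:
        raise ValueError(f"reference and mutant base are identical: {line!r}")

    return chrom, int(pos), ref, alt
```

For example, parse_mutation("Chr17,46827903,A,G") and parse_mutation("17,46827903,A,G") yield the same tuple, reflecting Note 1.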

VCF files

The software also accepts Variant Call Format (VCF) files with up to 100,000 queries. This is a tab-delimited format that must have, at a minimum, these first five columns:

Chromosome
Position
Identifier
Reference Base
Mutant Base

As an example, try the file: test.vcf
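As an illustration of how the five required VCF columns map onto the comma-separated query format, here is a short sketch. It is not the server's own parser: the function name vcf_to_queries is hypothetical, and it assumes that header lines start with '#' (as in standard VCF) and that each line carries a single mutant base.

```python
MAX_QUERIES = 100_000  # upload limit stated in the text above


def vcf_to_queries(lines):
    """Convert VCF lines to 'chromosome,position,reference,mutant' strings.

    Only the first five tab-delimited columns are used:
    chromosome, position, identifier, reference base, mutant base.
    """
    queries = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and VCF header/meta lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 5:
            raise ValueError(f"fewer than five columns: {line!r}")
        chrom, pos, _identifier, ref, alt = cols[:5]
        queries.append(f"{chrom},{pos},{ref},{alt}")
        if len(queries) > MAX_QUERIES:
            raise ValueError(f"more than {MAX_QUERIES} queries in file")
    return queries
```

The identifier column is read but discarded, since the comma-separated format has no place for it.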

Prediction Interpretation:

Predictions are given as p-values in the range [0, 1]: values above 0.5 are predicted to be deleterious, while those below 0.5 are predicted to be neutral or benign. P-values close to the extremes (0 or 1) are the highest-confidence predictions and yield the highest accuracy.

We also apply cautious classification thresholds, defined as those thresholds that yield the highest possible accuracy (see our paper for details). These are reported using different thresholds for coding (0.89 or above) and noncoding (0.70 or above) SNVs.

We use distinct predictors for positions in coding regions (positions within coding-sequence exons) and in non-coding regions (positions in intergenic regions, introns or non-coding genes).
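The thresholds described in this section can be applied to a prediction as in the sketch below. The function name interpret and the returned label strings are hypothetical, and treating a value of exactly 0.5 as neutral is an assumption, since the text only covers values above and below 0.5.

```python
# Cautious thresholds quoted above for coding and noncoding SNVs.
CAUTIOUS_THRESHOLDS = {"coding": 0.89, "noncoding": 0.70}


def interpret(p_value, region):
    """Return a text label for a CScape prediction.

    p_value: prediction in the range [0, 1]
    region:  'coding' or 'noncoding', selecting the cautious threshold
    """
    if region not in CAUTIOUS_THRESHOLDS:
        raise ValueError(f"region must be 'coding' or 'noncoding': {region!r}")
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value outside [0, 1]: {p_value!r}")

    if p_value >= CAUTIOUS_THRESHOLDS[region]:
        return "deleterious (cautious threshold)"
    if p_value > 0.5:
        return "deleterious"
    # Assumption: a value of exactly 0.5 is reported as neutral here.
    return "neutral or benign"
```

Note that the same value can fall on different sides of the cautious threshold depending on the region: 0.75 passes it for a noncoding SNV but not for a coding one.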

In our paper we consider regions of interest in the cancer genome. For coding regions these are listed in the file coding-regions.tab as

chromosome : start position : end position : probability

where the probability is derived from the Kolmogorov-Smirnov (KS) test. A supplementary file, coding-regions-annotated.tab, also gives associated gene names and locus annotations. The annotated file has the following format:

chromosome : start position : end position : gene name : locus : probability

Finally, the number of regions associated with each gene may be found in the file coding-genelist.tab, which has the following format:

count : gene name

Non-coding regions of interest are provided in the file noncoding-regions.tab as

chromosome : start position : end position : probability

where the probability is derived from the KS test. These regions can also be visualised by loading the CScape track in our Genome Tolerance Browser (gtb.biocompute.org.uk).
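To check whether a position lies in one of the listed regions, the four-column region files could be read as sketched below. This rests on several assumptions not confirmed by the text: that the .tab files are tab-delimited with no header row, that the ' : ' in the format lines above is only notation, and that start and end positions are inclusive. The function names are hypothetical.

```python
def load_regions(lines):
    """Read 'chromosome, start, end, probability' rows from a region file.

    Assumes tab-delimited columns and no header row.
    """
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, prob = line.split("\t")[:4]
        regions.append((chrom, int(start), int(end), float(prob)))
    return regions


def regions_containing(regions, chrom, pos):
    """Return every region on `chrom` whose interval includes `pos`.

    Assumes both interval ends are inclusive.
    """
    return [r for r in regions if r[0] == chrom and r[1] <= pos <= r[2]]
```

A linear scan is enough for a handful of lookups; for many queries, sorting the regions per chromosome and using binary search would be a natural refinement.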

Downloads:

To run CScape queries locally, download the following files and run the query script as outlined below. Please note that you must have tabix installed to run the script.

Python query script (7.5KB)
cscape_coding.vcf.gz (669MB)
cscape_coding.vcf.gz.tbi (664KB)
cscape_noncoding.vcf.gz (48GB)
cscape_noncoding.vcf.gz.tbi (2.3MB)

Usage: cscape_query.py query-file [options]

Predict the oncogenic potential of single nucleotide variants (SNVs).  The query
file must be a list of queries that use the following format:

chromosome,position,reference,mutant

Example:

1,69094,G,A
11,168961,T,A
18,119888,G,A

Options:
  -h, --help  show this help message and exit
  -c CDB      CScape coding database [default: cscape_coding.bed.gz]
  -n NDB      CScape noncoding database [default: cscape_noncoding.bed.gz]
  -o OUTPUT   Output file [default: stdout]
  -v          Verbose mode [default: False]
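The default database names in the help text above end in .bed.gz, whereas the downloadable files end in .vcf.gz, so it may be safest to pass the database paths explicitly with -c and -n. The sketch below builds such a command line from Python. The helper names build_command and run_query are hypothetical, as is the "python" interpreter name, and the script is assumed to be saved as cscape_query.py in the working directory.

```python
import subprocess


def build_command(query_file,
                  coding_db="cscape_coding.vcf.gz",
                  noncoding_db="cscape_noncoding.vcf.gz",
                  output=None,
                  verbose=False,
                  interpreter="python"):
    """Build the argument list for a local cscape_query.py run.

    Only the options shown in the script's help text (-c, -n, -o, -v) are used.
    """
    cmd = [interpreter, "cscape_query.py", query_file,
           "-c", coding_db, "-n", noncoding_db]
    if output is not None:
        cmd += ["-o", output]  # otherwise the script writes to stdout
    if verbose:
        cmd.append("-v")
    return cmd


def run_query(query_file, **kwargs):
    """Run the query script; requires tabix and the database files."""
    return subprocess.run(build_command(query_file, **kwargs), check=True)
```

A call such as run_query("queries.txt", output="predictions.txt") would then write the predictions to a file, provided that tabix is installed and the .tbi index files sit alongside the databases.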

Training and test data used to develop CScape are provided below. Each file within the .zip archives has five (tab-delimited) columns:

Chromosome	Position	Reference	Allele	Label

training_data.zip (820KB)
cscape_coding_tests.zip (840KB)
cscape_noncoding_tests.zip (17MB)
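Once an archive has been unpacked, the five-column files can be read as in the following sketch. The function names are hypothetical. Because the text does not say how labels are encoded or whether the files contain a header row, labels are kept as plain strings and any row without a numeric position is skipped.

```python
from collections import Counter


def read_labelled_variants(lines):
    """Read 'Chromosome, Position, Reference, Allele, Label' rows.

    Rows with fewer than five tab-delimited columns, or with a non-numeric
    position (such as a possible header row), are silently skipped.
    """
    records = []
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 5 or not cols[1].isdigit():
            continue
        chrom, pos, ref, allele, label = cols[:5]
        records.append((chrom, int(pos), ref, allele, label))
    return records


def label_counts(records):
    """Count how many records carry each label value."""
    return Counter(record[4] for record in records)
```

Passing an open file handle to read_labelled_variants and the result to label_counts gives a quick check of the class balance in each file.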

An Investigation of the Frequency Count of Single Point Driver Mutations across Common Solid Tumours

In our paper:

Madeleine Darbyshire, Zachary du Toit, Mark F. Rogers, Tom R. Gaunt and Colin Campbell. Estimating the Frequency of Single Point Driver Mutations across Common Solid Tumours. Scientific Reports (2019).

we investigate the frequency count of single nucleotide variants driving common solid tumours. We also discuss predicted driver counts stratified by stage of disease and driver counts in non-coding regions of the cancer genome, in addition to driver genes. The file driver-genes.xlsx gives the full count of single nucleotide variant drivers (SNV-drivers) across 25 different types of cancer, as discussed in this paper.