Kozak consensus sequence

The Kozak consensus sequence (Kozak consensus or Kozak sequence) is a nucleic acid motif that functions as the protein translation initiation site in most eukaryotic mRNA transcripts.^[1] Regarded as the optimum sequence for initiating translation in eukaryotes, it is an integral aspect of protein regulation and overall cellular health, and it has implications in human disease.^[1]^[2] It ensures that a protein is correctly translated from the genetic message by mediating ribosome assembly and translation initiation. Initiation at a wrong start site can result in non-functional proteins.^[3] As the sequence has been studied further, expansions of the nucleotide sequence, bases of importance, and notable exceptions have been described.^[1]^[4]^[5] The sequence is named after the scientist who discovered it, Marilyn Kozak, who identified it through a detailed analysis of genomic DNA sequences.^[6]

The Kozak sequence is not to be confused with the 5′ cap of a messenger RNA or an internal ribosome entry site (IRES).

Sequence

The Kozak sequence was determined by sequencing 699 vertebrate mRNAs and verified by site-directed mutagenesis.^[7] While the analysis was initially limited to a subset of vertebrates (human, cow, cat, dog, chicken, guinea pig, hamster, mouse, pig, rabbit, sheep, and Xenopus), subsequent studies confirmed its conservation in higher eukaryotes generally.^[1] The sequence was defined as 5′-(gcc)gccRccAUGG-3′ in IUPAC nucleobase notation, where:^[7]

The nucleotides AUG are the translation start codon, coding for methionine.
Upper-case letters indicate highly conserved bases, i.e. the 'AUGG' sequence is constant or rarely, if ever, changes.^[8]
'R' indicates that a purine (adenine or guanine) is always observed at this position (with adenine being more frequent according to Kozak).
A lower-case letter denotes the most common base at a position where the base can nevertheless vary.
The sequence in parentheses (gcc) is of uncertain significance.

The AUG is the initiation codon, encoding a methionine at the N-terminus of the protein. (Rarely, GUG is used as an initiation codon, but methionine is still the first amino acid because it is the Met-tRNA in the initiation complex that binds to the mRNA.) Variation within the Kozak sequence alters its "strength", that is, the favorability of initiation, which affects how much protein is synthesized from a given mRNA.^[4]^[9]

The A nucleotide of the AUG is designated +1 in mRNA sequences and the preceding base −1; there is no 0 position. For a 'strong' consensus, the nucleotides at positions +4 (G in the consensus) and −3 (A or G in the consensus) must both match the consensus. An 'adequate' consensus matches at only one of these sites, while a 'weak' consensus matches at neither. The cc at −1 and −2 is not as conserved but contributes to the overall strength.^[10] There is also evidence that a G at the −6 position is important in the initiation of translation.^[4]

While the +4 and −3 positions have the greatest relative importance in establishing a favorable initiation context, a CC or AA motif at −2 and −1 was found to be important for the initiation of translation in tobacco and maize plants.^[11] Protein synthesis in yeast was found to be highly affected by the composition of the Kozak sequence, with adenine enrichment resulting in higher levels of gene expression.^[12] A suboptimal Kozak sequence can allow the pre-initiation complex (PIC) to scan past the first AUG and initiate at a downstream AUG codon.^[13]^[2]
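
The strong/adequate/weak classification can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a published algorithm: the function names `kozak_index` and `kozak_strength` are hypothetical, the mRNA is a plain string, and the AUG is identified by its 0-based string index. Only the −3 and +4 rules are checked; the contributions of positions −1, −2 and −6 are ignored.

```python
PURINES = {"A", "G"}


def kozak_index(aug_index: int, position: int) -> int:
    """Map a Kozak position (+1 is the A of AUG; there is no 0) to a string index."""
    if position == 0:
        raise ValueError("Kozak numbering has no position 0")
    # Positive positions start at the A of AUG; negative ones skip over zero.
    return aug_index + position - 1 if position > 0 else aug_index + position


def kozak_strength(mrna: str, aug_index: int) -> str:
    """Classify the context of the AUG that starts at aug_index (hypothetical helper)."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    if mrna[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    minus3 = kozak_index(aug_index, -3)
    plus4 = kozak_index(aug_index, +4)
    # A position that falls off either end of the sequence counts as a mismatch.
    purine_at_minus3 = minus3 >= 0 and mrna[minus3] in PURINES
    g_at_plus4 = plus4 < len(mrna) and mrna[plus4] == "G"
    matches = int(purine_at_minus3) + int(g_at_plus4)
    return {2: "strong", 1: "adequate", 0: "weak"}[matches]
```

With the consensus itself, `kozak_strength("GCCACCAUGG", 6)` gives `"strong"`; replacing the −3 purine with U gives `"adequate"`, and additionally replacing the +4 G with C gives `"weak"`.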

Figure: sequence logo of the bases around the initiation codon in mRNAs. Larger letters indicate a higher frequency of incorporation. Note the larger size of A and G at position 8 (−3 in Kozak numbering) and of G at position 14 (+4 in Kozak numbering).

Ribosome assembly

The ribosome assembles on the start codon (AUG), located within the Kozak sequence. Prior to translation initiation, the mRNA is scanned by the pre-initiation complex (PIC). The PIC consists of the 40S small ribosomal subunit bound to the ternary complex (TC) of eIF2, GTP and initiator Met-tRNA, which together form the 43S complex. Assisted by several other initiation factors (eIF1, eIF1A, eIF5, eIF3 and poly(A)-binding protein), it is recruited to the 5′ end of the mRNA. Eukaryotic mRNA is capped with a 7-methylguanosine (m7G) nucleotide, which can help recruit the PIC to the mRNA and initiate scanning. Recruitment to the m7G 5′ cap is supported by the inability of eukaryotic ribosomes to translate circular mRNA, which has no 5′ end.^[14] Once the PIC binds to the mRNA, it scans until it reaches the first AUG codon in a Kozak sequence.^[15]^[16] This process is referred to as the scanning mechanism of initiation.

The scanning mechanism of initiation starts when the PIC binds the 5′ end of the mRNA. Scanning is stimulated by Dhx29, Ddx3/Ded1 and eIF4 proteins.^[1] Dhx29 and Ddx3/Ded1 are DEAD/DExH-box RNA helicases that help to unwind any mRNA secondary structure that could hinder scanning.^[17] Scanning continues until the first AUG codon on the mRNA is reached; this is known as the "First AUG Rule".^[1] While exceptions to the "First AUG Rule" exist, most take place at a second AUG codon located 3 to 5 nucleotides downstream from the first AUG, or within 10 nucleotides of the 5′ end of the mRNA.^[18] At the AUG codon, the anticodon of the initiator methionine tRNA base-pairs with the mRNA codon.^[19] Upon base pairing with the start codon, eIF5 in the PIC helps to hydrolyze the guanosine triphosphate (GTP) bound to eIF2.^[20]^[21] This leads to a structural rearrangement that commits the PIC to binding the large ribosomal subunit (60S) and forming the 80S ribosomal complex. Once the 80S complex is formed, the elongation phase of translation starts.
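
As a rough illustration of the "First AUG Rule", the sketch below scans a sequence from its 5′ end, stops at the first AUG, and then reads codon by codon until an in-frame stop codon. It is a toy model only: the function name `first_aug_orf` is hypothetical, the mRNA is a plain string, and the cap, the initiation factors and sequence context are all ignored.

```python
from typing import Optional, Tuple

STOP_CODONS = {"UAA", "UAG", "UGA"}


def first_aug_orf(mrna: str) -> Optional[Tuple[int, str]]:
    """Return (start index, open reading frame) for the first AUG, or None."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    start = mrna.find("AUG")  # 5'-to-3' scan: the first AUG wins
    if start == -1:
        return None
    # Read in frame, three nucleotides at a time, until a stop codon.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, mrna[start:i + 3]
    # No in-frame stop codon: report everything from the AUG onward.
    return start, mrna[start:]
```

For the made-up sequence `"GGCAUGAAAUAGCC"`, the function returns `(3, "AUGAAAUAG")`: scanning stops at the AUG at index 3 and the reading frame ends at the in-frame UAG.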

The start codon closest to the 5′ end of the strand is not always recognized if it is not contained in a Kozak-like sequence, so exceptions to the first AUG rule may occur: the PIC can bypass such a codon and initiate at an AUG further downstream. This is called leaky scanning and could be a potential way to control translation through initiation.^[23] Lmx1b is an example of a gene with a weak Kozak consensus sequence.^[22] For initiation of translation from such a site, other features are required in the mRNA sequence in order for the ribosome to recognize the initiation codon.
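
Leaky scanning can be illustrated with a deliberately simplified, deterministic model. The assumptions here are ours, not the literature's: an AUG is bypassed whenever fewer than `min_matches` of the two key positions (a purine at −3, a G at +4) match the consensus, whereas in a cell bypassing is a matter of degree rather than all or nothing. The names `context_matches` and `leaky_scan` are hypothetical.

```python
from typing import List, Optional, Tuple


def context_matches(mrna: str, aug: int) -> int:
    """Count matches at -3 (purine) and +4 (G) for the AUG starting at index aug."""
    minus3 = aug - 3  # string index of position -3
    plus4 = aug + 3   # string index of position +4
    purine = minus3 >= 0 and mrna[minus3] in ("A", "G")
    g = plus4 < len(mrna) and mrna[plus4] == "G"
    return int(purine) + int(g)


def leaky_scan(mrna: str, min_matches: int = 1) -> Tuple[Optional[int], List[int]]:
    """Return (index of the AUG used, indices of AUGs that were bypassed)."""
    mrna = mrna.upper().replace("T", "U")
    skipped: List[int] = []
    pos = mrna.find("AUG")
    while pos != -1:
        if context_matches(mrna, pos) >= min_matches:
            return pos, skipped
        skipped.append(pos)  # weak context: keep scanning downstream
        pos = mrna.find("AUG", pos + 1)
    return None, skipped
```

For the made-up sequence `"CCUUUUAUGCCCGCCACCAUGGCC"`, the first AUG (index 6) has neither a purine at −3 nor a G at +4, so the model bypasses it and returns `(18, [6])`.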

It is believed that the PIC is stalled at the Kozak sequence by interactions between eIF2 and the −3 and +4 nucleotides of the Kozak sequence.^[24] This stalling gives the start codon and the corresponding anticodon time to form the correct hydrogen bonds. The Kozak consensus sequence is so common that the similarity of the sequence around an AUG codon to the Kozak sequence is used as a criterion for finding start codons in eukaryotes.^[25]
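
The use of consensus similarity as a criterion for finding start codons can be sketched as follows. The scoring scheme is an assumption made for illustration: each of the ten positions of gccRccAUGG (−6 to +4) counts equally, although the positions differ in importance and a practical tool would be expected to weight them. The names `kozak_similarity` and `rank_start_codons` are hypothetical.

```python
from typing import List, Tuple

CONSENSUS = "GCCRCCAUGG"  # positions -6..-1 followed by +1..+4
ALLOWED = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "AG"}  # R = purine


def kozak_similarity(mrna: str, aug: int) -> int:
    """Count positions around the AUG at index aug that match the consensus."""
    score = 0
    for offset, code in enumerate(CONSENSUS):
        i = aug - 6 + offset  # string index of this consensus position
        if 0 <= i < len(mrna) and mrna[i] in ALLOWED[code]:
            score += 1
    return score


def rank_start_codons(mrna: str) -> List[Tuple[int, int]]:
    """Return (score, index) for every AUG, best score first."""
    mrna = mrna.upper().replace("T", "U")
    candidates = []
    pos = mrna.find("AUG")
    while pos != -1:
        candidates.append((kozak_similarity(mrna, pos), pos))
        pos = mrna.find("AUG", pos + 1)
    # Highest similarity first; ties go to the AUG nearer the 5' end.
    return sorted(candidates, key=lambda c: (-c[0], c[1]))
```

For the made-up sequence `"CCUUUUAUGCCCGCCACCAUGGCC"`, this returns `[(10, 18), (4, 6)]`: the downstream AUG sits in a perfect consensus, while the upstream one matches at only four positions, three of which are the AUG itself.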

Differences from bacterial initiation

The scanning mechanism of initiation, which utilizes the Kozak sequence, is found only in eukaryotes and differs significantly from the way bacteria initiate translation. The biggest difference is the existence of the Shine-Dalgarno (SD) sequence in bacterial mRNA. The SD sequence is located near the start codon, whereas the Kozak sequence actually contains the start codon. The SD sequence base-pairs with the 16S rRNA of the small ribosomal subunit, allowing the subunit to bind at the AUG (or alternative) start codon immediately. In contrast, scanning along the mRNA results in a more rigorous selection process for the AUG codon than in bacteria.^[26] An example of bacterial start codon promiscuity can be seen in the use of the alternate start codons UUG and GUG for some genes.^[27]

Archaeal transcripts can also use leaderless initiation. Haloarchaea are known to have a variant of the Kozak consensus sequence in their Hsp70 genes.^[28]

Mutations and disease

Marilyn Kozak demonstrated, through systematic study of point mutations, that any mutation of a strong consensus sequence at the −3 position or at the +4 position resulted in highly impaired translation initiation both in vitro and in vivo.^[29]^[30]

Research has shown that a G→C mutation at the −6 position of the human β-globin gene (β+45) altered the haematological and biosynthetic phenotype. This was the first mutation found in the Kozak sequence, and it showed a 30% decrease in translational efficiency. It was found in a family from southeast Italy whose members suffered from thalassaemia intermedia.^[4]

Mutations to the Kozak sequence can also have drastic effects upon human health; in particular, certain forms of congenital heart disease are caused by Kozak sequence mutations in the 5′ untranslated region of the GATA4 gene. The GATA4 gene is responsible for gene expression in a wide variety of tissues including the heart.^[34] When the guanosine at the −6 position in the Kozak sequence of GATA4 is mutated to a cytosine, GATA4 protein levels are reduced, which leads to a decrease in the expression of genes regulated by the GATA4 transcription factor and is linked to the development of atrial septal defect.^[35]
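
A point mutation at a given Kozak position, such as the G-to-C changes at −6 described above, can be applied to a sequence string as sketched below. The helper names `kozak_index` and `point_mutation` are hypothetical, and the sketch only edits the string; it says nothing about the size of the effect on translation.

```python
def kozak_index(aug_index: int, position: int) -> int:
    """Map a Kozak position (+1 is the A of AUG; there is no 0) to a string index."""
    if position == 0:
        raise ValueError("Kozak numbering has no position 0")
    return aug_index + position - 1 if position > 0 else aug_index + position


def point_mutation(mrna: str, aug_index: int, position: int, ref: str, alt: str) -> str:
    """Replace the base at a Kozak position, checking the expected reference base."""
    i = kozak_index(aug_index, position)
    if not 0 <= i < len(mrna):
        raise IndexError("position lies outside the sequence")
    if mrna[i] != ref:
        raise ValueError(f"expected {ref} at position {position}, found {mrna[i]}")
    return mrna[:i] + alt + mrna[i + 1:]
```

Applied to the ten-nucleotide context GACACCAUGG (AUG at index 6), `point_mutation("GACACCAUGG", 6, -6, "G", "C")` gives `"CACACCAUGG"`.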

The ability of the Kozak sequence to optimize translation means that mutations can create novel initiation codons in the normally untranslated 5′ region (5′ UTR) of the mRNA transcript. Bohlen et al. (2017) described a G-to-A mutation in a Kozak-like region of the SOX9 gene that created a new translation initiation codon in an out-of-frame open reading frame. The correct initiation codon was located in a region that did not match the Kozak consensus sequence as closely as the sequence surrounding the new, upstream initiation site, which resulted in reduced translation efficiency of functional SOX9 protein. The patient in whom this mutation was detected had developed acampomelic campomelic dysplasia, a developmental disorder that causes skeletal, reproductive and airway problems due to insufficient SOX9 expression.^[32]

Variations in the consensus sequence

The Kozak consensus has been variously described as:^[36]

     65432-+234
(gcc)gccRccAUGG (Kozak 1987)
       AGNNAUGN
        ANNAUGG
        ACCAUGG (Spotts et al., 1997, mentioned in Kozak 2002)
     GACACCAUGG (H. sapiens HBB, HBD, R. norvegicus Hbb, etc.)

Kozak-like sequences in various eukaryotes
Biota	Phylum	Consensus sequences
Vertebrate (Kozak 1987)		`gccRccATGG`^[7]
Fruit fly (Drosophila spp.)	Arthropoda	`atMAAMATGamc`^[37]
Budding yeast (Saccharomyces cerevisiae)	Ascomycota	`aAaAaAATGTCt`^[38]
Slime mold (Dictyostelium discoideum)	Amoebozoa	`aaaAAAATGRna`^[39]
Ciliate	Ciliophora	`nTaAAAATGRct`^[39]
Malarial protozoa (Plasmodium spp.)	Apicomplexa	`taaAAAATGAan`^[39]
Toxoplasma (Toxoplasma gondii)	Apicomplexa	`gncAaaATGg`^[40]
Trypanosomatidae	Euglenozoa	`nnnAnnATGnC`^[39]
Terrestrial plants		`acAACAATGGC`^[41]
Microalga (Chlamydomonas reinhardtii)	Chlorophyta	`gccAaCATGGcg`^[42]^[43]
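
Consensus strings such as those in the table can be turned into regular expressions for searching DNA sequences. The sketch below is one possible reading of the notation, stated as an assumption: upper-case letters are treated as required (with the IUPAC codes R = A or G, M = A or C, N = any base), and every lower-case position is treated as free to vary. The name `consensus_to_regex` is hypothetical.

```python
import re

# IUPAC codes that appear in upper case in the table (DNA alphabet).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "M": "[AC]",    # amino
    "N": "[ACGT]",  # any base
}


def consensus_to_regex(consensus: str) -> "re.Pattern[str]":
    """Build a regex from a consensus string written as in the table."""
    parts = []
    for ch in consensus:
        if ch.islower():
            parts.append("[ACGT]")  # lower case: most common base, may vary
        else:
            parts.append(IUPAC_DNA[ch])  # upper case: conserved base or code
    return re.compile("".join(parts))
```

For example, `consensus_to_regex("gccRccATGG").search("TTGCCACCATGGCG")` matches `"GCCACCATGG"` starting at index 2. Treating lower-case positions as fully unconstrained discards the information about which base is most common there, so this is a permissive filter rather than a measure of strength.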

References

[1] PMID 2645293.
[2] PMID 12459250.
[3] PMID 10395892.
[4] S2CID 86704907.
[5] PMID 31353284.
[6] PMID 6694911.
[7] PMID 3313277.
[8] Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences, NC-IUB, 1984.
[9] S2CID 4366379.
[10] S2CID 15613863.
[11] PMID 10725562.
[12] PMID 28835771.
[13] PMID 15531618.
[14] S2CID 4319259.
[15] PMID 30795538.
[16] PMID 25248153.
[17] PMID 24499181.
[18] PMID 7708701.
[19] PMID 3051379.
[20] S2CID 3739106.
[21] PMID 16246727.
[22] PMID 15498463.
[23] S2CID 14157104.
[24] PMID 21885680.
[25] S2CID 25936805.
[26] PMID 9308967.
[27] PMID 2200518.
[28] PMID 26379277.
[29] S2CID 15613863.
[30] S2CID 4366379.
[31] PMID 20301724 (via NIH National Library of Medicine, National Center for Biotechnology Information).
[32] PMID 28546996.
[33] doi:10.1182/blood.v94.1.186.413k19_186_191.
[34] PMID 9584153.
[35] S2CID 32674053.
[36] PMID 20417269.
[37] PMID 3822832.
[38] PMID 3554144.
[39] PMID 2041747.
[40] S2CID 10433917.
[41] PMID 3556162.
[42] PMID 26701783.
[43] PMC 7896298.

Further reading

Kozak M (November 1990). "Downstream secondary structure facilitates recognition of initiator codons by eukaryotic ribosomes". Proceedings of the National Academy of Sciences of the United States of America. 87 (21): 8301–5. PMID 2236042.

Kozak M (November 1991). "An analysis of vertebrate mRNA sequences: intimations of translational control". The Journal of Cell Biology. 115 (4): 887–903. PMID 1955461.

Kozak M (October 2002). "Pushing the limits of the scanning mechanism for initiation of translation". Gene. 299 (1–2): 1–34. PMID 12459250.

