Peptide Nomenclature and Chemical Naming Conventions: Decoding Three-Letter Codes, Sequence Notation, and Research Compound Identification
Peptide science operates across an unusually wide range of disciplines — synthetic chemistry, molecular biology, pharmacology, and regulatory affairs — each of which has developed its own conventions for naming and identifying compounds. A single peptide may appear in the literature under a systematic IUPAC chemical name, a shorthand single-letter sequence, a three-letter code string, a WHO International Nonproprietary Name (INN), and one or more trade designations. Without a working knowledge of how these systems relate to one another, cross-referencing a compound across sources becomes unreliable, and errors in compound identification can undermine the reproducibility of research.
This article provides a structured reference for understanding peptide naming conventions, interpreting sequence notation, decoding modification nomenclature, and verifying compound identity across databases and supplier documentation.
Single-Letter and Three-Letter Amino Acid Codes
The Two Parallel Systems
The twenty standard amino acids are represented by two parallel coding systems, both sanctioned by the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature [1]. The three-letter code — such as Gly for glycine, Phe for phenylalanine, or Tyr for tyrosine — is the older system and remains the standard in formal chemical nomenclature, regulatory submissions, and detailed structural descriptions. The single-letter code — G, F, and Y for the same residues — was introduced to accommodate the computational demands of sequence analysis and is now dominant in bioinformatics, proteomics databases, and high-throughput research contexts [4].
The two systems are not interchangeable in all contexts. A three-letter code such as Asn is recognisable as an amino acid on its own, whereas an isolated single letter such as N can be mistaken for other abbreviations (an element symbol, for instance) and is reliably interpretable only as part of a sequence string. For this reason, regulatory documents and synthesis specifications typically favour three-letter notation, while sequence databases and alignment tools use single-letter strings.
Converting Between Systems
The conversion between systems is straightforward for the twenty canonical amino acids, but becomes more complex when non-standard or modified residues are involved. Several single-letter codes are counterintuitive — asparagine is N, not A (which belongs to alanine), and glutamine is Q rather than G (which is glycine). Researchers new to peptide literature frequently misread sequences because of these assignments. A reliable conversion table should be consulted whenever a sequence is being transcribed between systems, particularly before synthesis or procurement.
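The conversion described above can be sketched in Python. This is a minimal illustration covering only the twenty canonical residues; the function names are hypothetical, and any token outside the table (a modified or non-standard residue, for example) is rejected rather than guessed.

```python
# IUPAC-IUBMB codes for the twenty canonical amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_seq: str) -> str:
    """Convert a hyphen-separated three-letter sequence to a single-letter string."""
    letters = []
    for position, residue in enumerate(three_letter_seq.split("-"), start=1):
        if residue not in THREE_TO_ONE:
            raise ValueError(f"no canonical code for {residue!r} at position {position}")
        letters.append(THREE_TO_ONE[residue])
    return "".join(letters)


def to_three_letter(one_letter_seq: str) -> str:
    """Convert a single-letter string to a hyphen-separated three-letter sequence."""
    residues = []
    for position, letter in enumerate(one_letter_seq, start=1):
        if letter not in ONE_TO_THREE:
            raise ValueError(f"no canonical residue for {letter!r} at position {position}")
        residues.append(ONE_TO_THREE[letter])
    return "-".join(residues)
```

Calling to_three_letter("NQ") gives Asn-Gln, which makes the two counterintuitive assignments mentioned above explicit.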
Non-canonical amino acids, including D-enantiomers and synthetic residues, have no universally assigned single-letter codes. In such cases, three-letter notation is obligatory, often supplemented by parenthetical descriptors or modification prefixes, as discussed below.
Directionality Notation: N-Terminus to C-Terminus
Why Direction Matters
Peptide sequences are written and read from the N-terminus (the free amino group end) to the C-terminus (the free carboxyl group end) by universal convention [1]. This directionality is not merely typographic; it reflects the actual biochemical orientation of the molecule and is directly relevant to biological activity, receptor binding, and structural interpretation. A peptide written as Tyr-Gly-Gly-Phe-Leu (or YGGFL, the sequence of Leu-enkephalin) is chemically and biologically distinct from Leu-Phe-Gly-Gly-Tyr, even though both contain the same residues.
Errors in directionality notation are among the most consequential in peptide research. A reversed sequence may have an entirely different activity profile, solubility, or secondary structure. When interpreting a sequence from any source — literature, supplier catalogue, or database — confirming that the N-to-C convention has been applied consistently is a necessary verification step.
Notation in Practice
In three-letter notation, residues are separated by hyphens and listed N-to-C: H-Tyr-Gly-Gly-Phe-Leu-OH. The H- prefix denotes a free N-terminal amine and the -OH suffix denotes a free C-terminal carboxyl. When the termini are modified — for example, when the N-terminus is acetylated or the C-terminus is amidated — this is indicated explicitly: Ac-Tyr-Gly-Gly-Phe-Leu-NH₂. In single-letter notation, the same conventions apply but are often omitted in informal usage, creating ambiguity that must be resolved by consulting the original source.
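As a rough sketch of how the terminal markers can be separated from the residues, the following Python function splits a three-letter string into its N-terminal group, residue list, and C-terminal group. The function name and the small lookup tables are assumptions for illustration and cover only the four terminal groups named above. A terminus that is not stated is returned as None, so the ambiguity is surfaced rather than silently resolved.

```python
# Terminal groups mentioned in the text; other caps would need to be added.
N_TERMINAL_GROUPS = {"H": "free amine", "Ac": "acetyl"}
C_TERMINAL_GROUPS = {"OH": "free carboxyl", "NH2": "amide"}


def parse_termini(notation: str) -> dict:
    """Split 'Ac-Tyr-...-Leu-NH2' into N-terminus, residues, and C-terminus."""
    # Normalise the subscript two so that NH₂ and NH2 are treated alike.
    tokens = notation.replace("₂", "2").split("-")
    n_term = tokens.pop(0) if tokens and tokens[0] in N_TERMINAL_GROUPS else None
    c_term = tokens.pop() if tokens and tokens[-1] in C_TERMINAL_GROUPS else None
    return {"n_terminus": n_term, "residues": tokens, "c_terminus": c_term}


if __name__ == "__main__":
    print(parse_termini("H-Tyr-Gly-Gly-Phe-Leu-OH"))
    print(parse_termini("Ac-Tyr-Gly-Gly-Phe-Leu-NH₂"))
    print(parse_termini("Tyr-Gly-Gly-Phe-Leu"))  # termini unstated
```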
Post-Translational and Synthetic Modification Nomenclature
Acetylation
N-terminal acetylation — one of the most common modifications in both natural peptides and synthetic research compounds — is denoted by the prefix Ac- before the first residue in three-letter notation, or by the descriptor N-acetyl in IUPAC-style names [5]. Acetylation eliminates the positive charge of the free amine, altering the peptide's isoelectric point, proteolytic stability, and membrane permeability. When a specification sheet lists a compound as Ac-peptide without further clarification, it should be understood that the N-terminus is blocked, and this must be factored into any molecular weight calculation or activity comparison with the unmodified sequence.
Phosphorylation
Phosphorylation is indicated by a lowercase p prefixed to the residue code (pSer, pThr, or pTyr) or, in alternative notation, by a parenthetical suffix (Ser(P), Thr(P), Tyr(P)) for phosphoserine, phosphothreonine, and phosphotyrosine respectively [5]. The position of phosphorylation within the sequence is critical, as the same peptide phosphorylated at different residues may have entirely different properties. Literature descriptions should always specify both the modified residue and its position number within the sequence.
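To illustrate why position matters, the short Python sketch below reads a three-letter sequence and reports each phosphorylated residue together with its 1-based position. It recognises both notations described above; the function name and the example sequence are hypothetical.

```python
import re

# Matches pSer/pThr/pTyr or the alternative Ser(P)/Thr(P)/Tyr(P) form.
PHOSPHO_TOKEN = re.compile(r"^(?:p(Ser|Thr|Tyr)|(Ser|Thr|Tyr)\(P\))$")


def phosphosites(three_letter_seq: str) -> list:
    """Return (position, residue) pairs for each phosphorylated residue, N-to-C."""
    sites = []
    for position, token in enumerate(three_letter_seq.split("-"), start=1):
        match = PHOSPHO_TOKEN.match(token)
        if match:
            # Exactly one of the two groups is populated, depending on notation.
            sites.append((position, match.group(1) or match.group(2)))
    return sites


if __name__ == "__main__":
    # Hypothetical sequence, used only to show the two notations side by side.
    print(phosphosites("Ala-pSer-Gly-Tyr(P)-Leu"))
```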
Lipidation and PEGylation
Lipidated peptides, in which a fatty acid chain is conjugated to the peptide (typically at the N-terminal amine or a lysine side chain), are described using the fatty acid name as a prefix or parenthetical modifier. Palmitoylation, for instance, may appear as Palm-Lys or C16-Lys in supplier notation. PEGylation, the conjugation of polyethylene glycol chains, is denoted by PEG followed by a subscript or parenthetical indicating the chain length or molecular weight of the PEG moiety, such as PEG₂₀₀₀ or mPEG-40kDa [5]. Both modifications substantially alter molecular weight, hydrodynamic radius, and pharmacokinetic behaviour, making precise notation essential for compound verification.
Stereochemistry Notation
The standard amino acids incorporated during ribosomal synthesis are L-enantiomers, with the exception of glycine, which is achiral. When D-amino acids are incorporated, a common strategy in research compounds to confer proteolytic resistance, this must be explicitly noted. The convention is to prefix the residue with D-, as in D-Phe or d-Phe, or to use lowercase single-letter codes in some database contexts. A peptide containing D-Phe at position four is a distinct chemical entity from its all-L counterpart, and the two should never be treated as equivalent in research documentation.
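The lowercase convention can be sketched as follows. This Python example converts a three-letter sequence with optional D- prefixes into a single-letter string in which D-residues are lowercase. The function name is hypothetical, the example sequence is purely illustrative, and the lowercase mapping is one database convention rather than a universal standard.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# An optional D-/d- prefix followed by a three-letter residue code.
RESIDUE_TOKEN = re.compile(r"([Dd]-)?([A-Z][a-z]{2})")


def to_stereo_string(three_letter_seq: str) -> str:
    """Single-letter string with D-residues in lowercase, L-residues in uppercase."""
    letters = []
    for d_prefix, residue in RESIDUE_TOKEN.findall(three_letter_seq):
        letter = THREE_TO_ONE[residue]  # KeyError for non-canonical residues
        if d_prefix:
            if residue == "Gly":
                raise ValueError("glycine is achiral; a D- prefix is not meaningful")
            letter = letter.lower()
        letters.append(letter)
    return "".join(letters)


if __name__ == "__main__":
    print(to_stereo_string("Tyr-D-Ala-Gly-Phe-Leu"))
    print(to_stereo_string("Tyr-Gly-Gly-Phe-Leu"))
```

Because the two output strings differ, a simple string comparison is enough to flag that the D-containing peptide and its all-L counterpart are not the same entity.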
Chemical Naming Versus Trade Names and INNs
Multiple Identifiers for a Single Compound
Research peptides frequently accumulate multiple identifiers over the course of their development. A compound may be synthesised under a systematic IUPAC name, assigned a code number by the originating laboratory (such as BPC-157 or AOD-9604), granted an INN by the WHO if it reaches clinical development, and marketed under a trade name if approved [3]. Each of these identifiers may appear in different sections of the literature, and none is inherently more authoritative than the others for the purpose of compound identification.
The WHO INN system is the most internationally standardised framework for naming peptide therapeutics intended for clinical use. INNs are assigned based on pharmacological class and are designed to be unique, pronounceable, and internationally recognisable [3]. However, INNs are assigned relatively late in the development process and are absent for the majority of research-stage compounds. For investigational peptides, the systematic chemical name or the sequence notation is typically the most reliable identifier.
Recognising Compound Equivalence
Determining whether two differently named compounds are in fact the same entity requires cross-referencing across naming systems. A compound listed as Selank in one source may appear as Thr-Lys-Pro-Arg-Pro-Gly-Pro in another, and as a specific CAS registry number in a third. Confirming equivalence requires verifying that the sequence, modification state, stereochemistry, and molecular weight are consistent across all sources. Discrepancies in any of these parameters indicate either a different compound or an error in one of the sources.
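The equivalence check described above amounts to a field-by-field comparison. The Python sketch below is one way to structure it; the record fields, the function name, and the 0.5 Da tolerance are assumptions chosen for illustration, and the example values are illustrative entries rather than data from any particular source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideRecord:
    source: str               # where the record came from
    sequence: str             # single-letter, written N-to-C
    n_terminus: str           # e.g. "H" or "Ac"
    c_terminus: str           # e.g. "OH" or "NH2"
    d_positions: frozenset    # 1-based positions of any D-residues
    molecular_weight: float   # Da, as reported by the source


def discrepancies(a: PeptideRecord, b: PeptideRecord, mw_tolerance: float = 0.5) -> list:
    """List the parameters on which two records disagree (empty if consistent)."""
    issues = []
    if a.sequence != b.sequence:
        issues.append("sequence")
    if (a.n_terminus, a.c_terminus) != (b.n_terminus, b.c_terminus):
        issues.append("modification state")
    if a.d_positions != b.d_positions:
        issues.append("stereochemistry")
    if abs(a.molecular_weight - b.molecular_weight) > mw_tolerance:
        issues.append("molecular weight")
    return issues


if __name__ == "__main__":
    literature = PeptideRecord("literature", "TKPRPGP", "H", "OH", frozenset(), 751.9)
    supplier = PeptideRecord("supplier sheet", "TKPRPGP", "Ac", "OH", frozenset(), 793.9)
    print(discrepancies(literature, supplier))
```

An empty list is a necessary condition for treating two records as the same compound; any entry in the list points to the parameter that needs to be resolved first.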
Standardised Nomenclature Systems
IUPAC-IUBMB Conventions
The IUPAC-IUBMB recommendations for the nomenclature of peptides and proteins establish the formal rules governing sequence notation, modification description, and the use of amino acid codes [1]. These recommendations are the reference standard for formal scientific publications and regulatory submissions. They specify, among other things, that sequences are always written N-to-C, that modifications are described using defined prefixes and suffixes, and that non-standard residues must be identified explicitly.
Regulatory Expectations
Regulatory agencies, including the US Food and Drug Administration, require that peptide compounds submitted in Investigational New Drug (IND) applications be identified with sufficient precision to allow unambiguous characterisation [3]. This typically means providing the full sequence in three-letter notation, specifying all modifications, confirming stereochemistry, and providing a molecular formula and molecular weight consistent with the stated structure. Nomenclature inconsistencies between sections of a regulatory submission can trigger requests for clarification and delay review timelines.
Decoding Supplier Specification Sheets
Purity Grades and Analytical Methods
Supplier specification sheets for research peptides typically report purity as a percentage determined by reverse-phase high-performance liquid chromatography (HPLC), with values commonly ranging from greater than 95% to greater than 99% for research-grade material [7]. The HPLC method used — including column type, gradient conditions, and detection wavelength — affects the reported purity value and should be noted when comparing specifications across suppliers. A purity figure without an associated analytical method is difficult to interpret reliably.
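HPLC purity is commonly reported on an area-percent basis, in which the main peak area is divided by the total integrated peak area. The Python sketch below assumes that calculation; the function name and the peak areas are hypothetical, and a given supplier may integrate or normalise peaks differently, which is one reason the method details matter.

```python
def area_percent_purity(peak_areas: dict, main_peak: str) -> float:
    """Main-peak area as a percentage of the total integrated peak area."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return 100.0 * peak_areas[main_peak] / total


if __name__ == "__main__":
    # Hypothetical integrated areas (arbitrary units) from one chromatogram.
    areas = {"main": 980.0, "impurity_1": 15.0, "impurity_2": 5.0}
    print(f"{area_percent_purity(areas, 'main'):.1f}% by area")
```

Because the result depends on which peaks are detected and integrated, the same material can yield different figures under a different gradient or detection wavelength.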
Molecular Weight Calculation and Confirmation
The molecular weight listed on a specification sheet should correspond to the calculated monoisotopic or average mass of the peptide as defined by its sequence and modification state. Mass spectrometry, typically electrospray ionisation (ESI-MS) or matrix-assisted laser desorption/ionisation (MALDI-MS), is the standard method for confirming molecular weight [7]. A matching intact mass shows that the composition and modification state are consistent with the stated sequence, but it cannot distinguish sequence isomers such as a reversed or scrambled peptide, which share the same mass. A specification sheet that reports only HPLC purity without mass spectrometric confirmation provides incomplete identity verification.
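The calculated mass can be reproduced from the sequence with a few lines of Python. The sketch below sums standard monoisotopic residue masses, adds one water for the free termini, and applies mass shifts for N-terminal acetylation and C-terminal amidation. The function names are hypothetical, and only the terminal groups listed in the table are supported.

```python
# Monoisotopic residue masses (Da) for the twenty canonical amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
# Mass shifts relative to free termini: acetyl adds C2H2O, amide swaps OH for NH2.
TERMINAL_SHIFT = {"H": 0.0, "OH": 0.0, "Ac": 42.01056, "NH2": -0.98402}


def monoisotopic_mass(sequence: str, n_term: str = "H", c_term: str = "OH") -> float:
    """Neutral monoisotopic mass of a linear peptide given as a single-letter string."""
    mass = sum(RESIDUE_MASS[residue] for residue in sequence) + WATER
    return mass + TERMINAL_SHIFT[n_term] + TERMINAL_SHIFT[c_term]


def ppm_error(observed: float, calculated: float) -> float:
    """Relative deviation of an observed mass from the calculated mass, in ppm."""
    return (observed - calculated) / calculated * 1e6


if __name__ == "__main__":
    print(f"YGGFL:       {monoisotopic_mass('YGGFL'):.4f} Da")
    print(f"LFGGY:       {monoisotopic_mass('LFGGY'):.4f} Da")
    print(f"Ac-YGGFL-NH2: {monoisotopic_mass('YGGFL', 'Ac', 'NH2'):.4f} Da")
```

The forward and reversed sequences give the same value, which is why an intact mass on its own cannot confirm residue order.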
Sequence Confirmation Methods
Amino acid analysis, which confirms residue composition, and Edman degradation, which reads residues sequentially from the N-terminus, are classical confirmation methods, though they are less commonly used for short synthetic peptides where mass spectrometry is sufficient. For longer or more complex sequences, tandem mass spectrometry (MS/MS) provides residue-level sequence confirmation. Researchers should request certificates of analysis that include both HPLC chromatograms and mass spectra before using a peptide in quantitative or mechanistic studies.
Common Nomenclature Errors and Ambiguities
Sequence Length and Truncation
A frequent source of confusion in the literature is the use of truncated sequences without explicit notation. A compound described as a "fragment" of a parent peptide may refer to any sub-sequence, and without specifying the residue positions (using the parent sequence numbering), the identity of the fragment is ambiguous. The convention of indicating fragment positions as a parenthetical or superscript residue range (for example, IGF-1(1-3) to denote the first three residues of insulin-like growth factor 1) should be applied consistently.
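The range notation maps directly onto a 1-based, inclusive slice of the parent sequence. The Python sketch below parses a label of the form NAME(start-end) and extracts the corresponding residues; the function name is hypothetical, and the example applies the notation to the Leu-enkephalin sequence purely for illustration.

```python
import re

# Parent name followed by a parenthetical residue range, e.g. "IGF-1(1-3)".
FRAGMENT_LABEL = re.compile(r"^(?P<parent>.+)\((?P<start>\d+)-(?P<end>\d+)\)$")


def extract_fragment(label: str, parent_sequence: str) -> str:
    """Return the sub-sequence named by a NAME(start-end) label (1-based, inclusive)."""
    match = FRAGMENT_LABEL.match(label)
    if match is None:
        raise ValueError(f"{label!r} does not specify residue positions")
    start, end = int(match["start"]), int(match["end"])
    if not 1 <= start <= end <= len(parent_sequence):
        raise ValueError(f"range {start}-{end} is outside the parent sequence")
    return parent_sequence[start - 1:end]


if __name__ == "__main__":
    # Illustrative only: positions 1-3 of the Leu-enkephalin sequence YGGFL.
    print(extract_fragment("Leu-enkephalin(1-3)", "YGGFL"))
```

A label without a range, such as a bare "fragment", raises an error here, which mirrors the point that such a description does not identify a compound.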
Modification Position Ambiguity
When a modification such as phosphorylation or glycosylation can occur at multiple residues within a sequence, the position must be specified numerically. A phosphopeptide described only as "phospho-Ser peptide" without a position number is not adequately characterised. This ambiguity is particularly problematic in studies of signalling peptides, where the phosphorylation site determines which kinase or phosphatase is involved and what downstream effects are expected.
Stereochemical Omissions
The omission of stereochemical descriptors is among the most consequential nomenclature errors in peptide research. D-amino acid substitutions are used deliberately in many research compounds to alter proteolytic stability and receptor selectivity, and a sequence described without stereochemical annotation cannot be assumed to consist entirely of L-residues. Any compound containing non-standard stereochemistry should have this explicitly noted in every context where the sequence appears.
Cross-Referencing Peptide Identity Across Databases
PubChem and ChemSpider
PubChem, maintained by the National Center for Biotechnology Information at the US National Library of Medicine, assigns Compound Identifier (CID) numbers to peptides and allows cross-referencing of synonyms, molecular formulas, and structural representations [6]. Searching PubChem by sequence, IUPAC name, or CAS number can confirm whether multiple identifiers refer to the same compound. ChemSpider offers similar functionality with additional cross-links to commercial and regulatory sources.
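For scripted lookups, PubChem exposes a REST interface (PUG REST). The Python sketch below only builds the lookup URL for a name or synonym and does not perform the request; the helper name is hypothetical, and the URL pattern shown should be checked against the current PUG REST documentation before being relied on.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_lookup_url(identifier: str, result: str = "cids") -> str:
    """Build a PUG REST URL that looks up a compound by name or synonym."""
    if result not in {"cids", "synonyms"}:
        raise ValueError("result must be 'cids' or 'synonyms'")
    # Percent-encode the identifier so spaces and slashes are safe in the path.
    return f"{PUG_REST_BASE}/compound/name/{quote(identifier, safe='')}/{result}/JSON"


if __name__ == "__main__":
    print(pubchem_lookup_url("Selank"))
    print(pubchem_lookup_url("Selank", result="synonyms"))
```

Comparing the CIDs returned for two different identifiers is one way to test whether they resolve to the same PubChem record, though a shared record still needs to be checked against sequence and modification state.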
UniProt and Peptide Atlases
For peptides derived from known proteins, UniProt provides authoritative sequence data, post-translational modification annotations, and literature cross-references [6]. The PeptideAtlas and related resources extend this to experimentally observed peptides from mass spectrometry studies. These databases are particularly useful for confirming the sequence and modification state of endogenous peptides that are also studied as synthetic research compounds.
Verification as a Research Practice
Cross-referencing a peptide's identity across at least two independent database sources — comparing sequence, molecular weight, and modification state — should be treated as a standard step in research design rather than an optional precaution. Discrepancies between sources may indicate nomenclature errors in the literature, differences in modification state between studies, or the existence of multiple compounds sharing a common name. Resolving these discrepancies before initiating a study is substantially less costly than discovering them during data interpretation.
Conclusion
Peptide nomenclature is a practical discipline as much as a formal one. The ability to move fluently between single-letter and three-letter codes, to interpret modification prefixes, to confirm directionality, and to cross-reference identifiers across databases and specification sheets is foundational to research integrity in this field. As the volume of peptide research compounds in preclinical investigation continues to grow, so does the importance of nomenclature literacy — not as an abstract standard, but as a concrete tool for ensuring that the compound studied, described, and referenced is unambiguously the same entity across every context in which it appears.