.. -*- coding: utf-8 -*-

=====================
Finding AdK sequences
=====================

Sequence of AdK
===============

1. Start with PDB 1AKE_ in the `Protein Data Bank`_.
2. Find the UniProtKB accession number **P69441** on the PDB page.
3. Find P69441 in UniProt_ (`P69441 (KAD_ECOLI)`_).
4. Browse the *KAD_ECOLI* page and find the `KAD_ECOLI sequence`_ ::

     >sp|P69441|KAD_ECOLI Adenylate kinase OS=Escherichia coli (strain K12) GN=adk PE=1 SV=1
     MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
     DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
     VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
     YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG

   (This is the sequence in FASTA_ format, one of the common sequence
   formats.)

.. _1AKE: http://www.rcsb.org/pdb/explore/explore.do?structureId=1ake
.. _Protein Data Bank: http://www.rcsb.org/pdb
.. _UniProt: http://www.uniprot.org/
.. _`P69441 (KAD_ECOLI)`: http://www.uniprot.org/uniprot/P69441
.. _`KAD_ECOLI sequence`: http://www.uniprot.org/uniprot/P69441#sequences
.. _FASTA: https://www.ncbi.nlm.nih.gov/blast/fasta.shtml


Finding other AdK sequences with BLAST
======================================

The Basic Local Alignment Search Tool (BLAST_) finds regions of local
similarity between sequences. These can be used to infer functional and
evolutionary relationships between sequences and to help identify
members of gene families.

Use **blastp** (`Protein BLAST`_) with the following settings
(everything else can be left at its default):

:guilabel:`Enter Query Sequence`
  - :guilabel:`Enter accession number(s), gi(s), or FASTA sequence(s)`:
    paste the *KAD_ECOLI* FASTA sequence into the search box

:guilabel:`Choose Search Set`
  - :guilabel:`Database`: *Non-redundant protein sequences (nr)*

:guilabel:`Program Selection`
  - :guilabel:`Algorithm`: *blastp (protein-protein BLAST)*

.. _BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
.. _Protein BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome

The search returns many sequences that are almost identical to the query
(identity 99% to 100%, E-values around 1e-150). A useful data set would
have to be filtered to remove some of these nearly identical sequences,
but that is beyond the scope of this introduction.


PFAM
====

Use the sequence to search PFAM_.

.. _PFAM: http://pfam.xfam.org
.. _CL0023: http://pfam.xfam.org/clan/CL0023
.. _ADK: http://pfam.xfam.org/family/PF00406
.. _`Adk_lid pfam`: http://pfam.xfam.org/family/ADK_lid

Find clan CL0023_ (bit score 204.7, E-value 6.8e-61) and, within it, the
family ADK_ (PF00406).

- Domain organization: *There are 2742 sequences with the following
  architecture: ADK, ADK_lid*

Even the `Adk_lid pfam`_ family (PF05191) is still too big.

The view of PDB structures is useful (and could be used with MultiSeq).
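
As a small aside on the FASTA record quoted in the "Sequence of AdK" section: before pasting a sequence into BLAST or PFAM it can help to check that it was copied completely. The following is a minimal sketch of a FASTA reader using only the Python standard library. The names ``KAD_ECOLI_FASTA`` and ``parse_fasta`` are hypothetical helpers introduced for this illustration and are not part of any tool mentioned here. The sketch assumes plain FASTA text with ``>`` header lines and does not handle comments or other format variants.

```python
"""Minimal FASTA reader (standard library only); an illustrative sketch."""

# The KAD_ECOLI record as quoted earlier in this document.
KAD_ECOLI_FASTA = """\
>sp|P69441|KAD_ECOLI Adenylate kinase OS=Escherichia coli (strain K12) GN=adk PE=1 SV=1
MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None   # header of the record currently being read
    chunks = []     # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Store the last record in the text.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    for header, sequence in parse_fasta(KAD_ECOLI_FASTA):
        # UniProt headers have the form "db|accession|entry_name description".
        accession = header.split("|")[1]
        print(accession, len(sequence))
```

For the record above, this reports the accession P69441 and a length of 214 residues. For real work, an established parser such as the one in Biopython is likely a better choice than this hand-written sketch.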