.. -*- coding: utf-8 -*-

=====================
Finding AdK sequences
=====================


Sequence of AdK
===============

1. Start with PDB entry 1AKE_ in the `Protein Data Bank <Protein Databank_>`_.
2. Find the UniProtKB accession number **P69441** from the PDB page.
3. Find P69441 in UniProt_ (`P69441 (KAD_ECOLI)`_).
4. Browse the *KAD_ECOLI* page. Find the `KAD_ECOLI sequence`_ ::

      >sp|P69441|KAD_ECOLI Adenylate kinase OS=Escherichia coli (strain K12) GN=adk PE=1 SV=1
      MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
      DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
      VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
      YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG

   (This is the sequence in FASTA_ format, one of the common sequence
   formats.)
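
Once copied, the FASTA record can be processed programmatically. The
following is a minimal sketch using only the Python standard library; the
function name ``parse_fasta`` and the variable names are illustrative
choices, not part of any particular library. It splits the record into its
header and sequence and extracts the accession number and entry name from
the UniProt-style header.

.. code-block:: python

   def parse_fasta(text):
       """Return a list of (header, sequence) tuples from FASTA text."""
       records = []
       header, chunks = None, []
       for line in text.splitlines():
           line = line.strip()
           if not line:
               continue
           if line.startswith(">"):
               # A new header closes the previous record, if there was one.
               if header is not None:
                   records.append((header, "".join(chunks)))
               header, chunks = line[1:], []
           elif header is not None:
               chunks.append(line)
       if header is not None:
           records.append((header, "".join(chunks)))
       return records


   KAD_ECOLI_FASTA = """\
   >sp|P69441|KAD_ECOLI Adenylate kinase OS=Escherichia coli (strain K12) GN=adk PE=1 SV=1
   MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
   DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
   VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
   YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
   """

   header, sequence = parse_fasta(KAD_ECOLI_FASTA)[0]

   # The header above has the form "db|accession|entry_name description".
   database, accession, rest = header.split("|", 2)
   entry_name = rest.split()[0]

   print(accession, entry_name, len(sequence))

The sequence string is joined from all lines that follow the header, so it
can be pasted into a search form or passed to other tools without line
breaks.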

.. _1AKE:
   http://www.rcsb.org/pdb/explore/explore.do?structureId=1ake
.. _Protein Databank: http://www.rcsb.org/pdb

.. _UniProt: http://www.uniprot.org/
.. _`P69441 (KAD_ECOLI)`:  http://www.uniprot.org/uniprot/P69441
.. _`KAD_ECOLI sequence`: http://www.uniprot.org/uniprot/P69441#sequences
.. _FASTA: https://www.ncbi.nlm.nih.gov/blast/fasta.shtml


Finding other AdK sequences with BLAST
======================================

The Basic Local Alignment Search Tool (BLAST_) finds regions of local
similarity between sequences, which can be used to infer functional
and evolutionary relationships between sequences as well as help
identify members of gene families.
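
To illustrate what "local similarity" means, the sketch below computes a
Smith-Waterman local alignment score. This is not the BLAST algorithm
itself (BLAST is a faster heuristic and scores proteins with a substitution
matrix); the function name and the simple match/mismatch/gap scores are
assumptions chosen for illustration only.

.. code-block:: python

   def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
       """Best Smith-Waterman local alignment score (linear gap penalty)."""
       previous = [0] * (len(b) + 1)
       best = 0
       for i in range(1, len(a) + 1):
           current = [0] * (len(b) + 1)
           for j in range(1, len(b) + 1):
               step = match if a[i - 1] == b[j - 1] else mismatch
               current[j] = max(
                   0,                       # start a new local alignment
                   previous[j - 1] + step,  # align a[i-1] with b[j-1]
                   previous[j] + gap,       # gap in b
                   current[j - 1] + gap,    # gap in a
               )
               best = max(best, current[j])
           previous = current
       return best


   # The shared segment GAPGAGKGT (taken from the start of KAD_ECOLI) is
   # detected even though the flanking residues are unrelated.
   score = local_alignment_score("MRIILLGAPGAGKGTQ", "WWWWGAPGAGKGTWWWW")
   print(score)

Because every cell is floored at zero, poorly matching flanks do not
penalize a well-matching region, which is what distinguishes a local from
a global alignment.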

Use **blastp** (`Protein BLAST`_) with the following settings
(everything else can be left at defaults):

:guilabel:`Enter Query Sequence`: 

- :guilabel:`Enter accession number(s), gi(s), or FASTA
  sequence(s)`: paste the *KAD_ECOLI* FASTA sequence into the search box

:guilabel:`Choose Search Set`:

- :guilabel:`Database`: *Non-redundant protein sequences (nr)*

:guilabel:`Program Selection`:

- :guilabel:`Algorithm`: *blastp (protein-protein BLAST)*

.. _BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
.. _Protein BLAST:
   https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome

This search returns many sequences that are almost identical to the query
(99% to 100% identity, E-values around 1e-150). To obtain a more diverse
data set, one would have to remove some of the nearly identical sequences,
but this is beyond the scope of this introduction.
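
As a rough idea of what such a filtering step could look like, the sketch
below greedily keeps a sequence only if it is not too similar to any
sequence already kept. It uses ``difflib.SequenceMatcher`` from the Python
standard library as a crude stand-in for alignment-based sequence identity;
the function names, the threshold of 0.9, and the toy sequences (the
"distant" one is made up) are assumptions for illustration.

.. code-block:: python

   from difflib import SequenceMatcher


   def similarity(a, b):
       """Crude similarity in [0, 1]; not a true alignment identity."""
       # autojunk=False disables difflib's heuristic that ignores very
       # frequent characters in long sequences.
       return SequenceMatcher(None, a, b, autojunk=False).ratio()


   def remove_near_duplicates(sequences, threshold=0.9):
       """Greedily keep sequences that are below threshold to all kept ones."""
       kept = []
       for name, seq in sequences:
           if all(similarity(seq, other) < threshold for _, other in kept):
               kept.append((name, seq))
       return kept


   hits = [
       ("query", "MRIILLGAPGAGKGTQAQFIMEKYGIPQIS"),
       ("near_copy", "MRIILLGAPGAGKGTQAQFIMEKYGIPQIT"),  # one substitution
       ("distant", "MNLVLMGLPGAGKGTQAEKIVAKYGIPHIS"),    # hypothetical
   ]

   kept = remove_near_duplicates(hits, threshold=0.9)
   print([name for name, _ in kept])

The result depends on the input order because the first representative of
each group of similar sequences is the one that is kept.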


PFAM
====

Search PFAM_ with the *KAD_ECOLI* sequence.

.. _PFAM: http://pfam.xfam.org
.. _CL0023: http://pfam.xfam.org/clan/CL0023
.. _ADK: http://pfam.xfam.org/family/PF00406
.. _`Adk_lid pfam`: http://pfam.xfam.org/family/ADK_lid


The search finds clan CL0023_ (bit score 204.7, E-value 6.8e-61) and,
within it, the family ADK_ (PF00406).

- Domain organization: *There are 2742 sequences with the following
  architecture: ADK, ADK_lid*

Even the `Adk_lid pfam`_ family (PF05191) is still too large to work
with directly. The `view of PDB
structures <http://pfam.xfam.org/family/ADK_lid#tabview=tab9>`_ for this
family is useful (and could be used with MultiSeq).