Unix Basics

We will talk to the computer using the text-based terminal, also known as “the commandline”. On Unix-like operating systems (typically used for high-performance computing) this is a very powerful way to interact with the computer.

On a typical Linux desktop system you open an “xterm” or “kterm” (or similar application) to get access to the commandline. On Mac OS X you open “Terminal.app” (in the Utilities folder in Applications).

  • We use the bash shell (“Bourne again shell”, which replaces the Bourne shell sh). Bash is a very good shell to write scripts in and to use on an everyday basis. Pretty much everything written about sh also applies to bash (including the original and very readable Introduction to the Unix Shell written by Steve Bourne in 1978).
  • Bash is available on most modern Unix-like operating systems; it is the default shell on most Linux distributions and was the default on Mac OS X (newer macOS versions default to zsh).
  • There are other shells out there (like csh and tcsh (“C-shells”) or the Korn shell ksh). We will not deal with them. In particular, the C-shells should not be used for writing scripts, for many good reasons.

Command syntax: options and arguments:

command [-o] [-o OPTVAL] ... [--long-option] [--long-option=OPTVAL]  [file ...]
  • short options: single-letter, can be combined (e.g. -l -a as -la)
  • long options: start with two dashes and may take a value
  • arguments: typically files; if not supplied, input is often read from “standard input” and output written to “standard output”
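
As a small illustration of these conventions, the following sketch uses ls and wc. The scratch directory and the file names are made up for the example and are removed again at the end.

```shell
# Scratch directory so that none of your own files are touched.
demo=$(mktemp -d)
touch "$demo/visible.txt" "$demo/.hidden"

# Single-letter options can be given separately or combined.
separate=$(ls -a -1 "$demo")
combined=$(ls -a1 "$demo")

# Without a file argument, wc reads from standard input.
lines=$(printf 'one\ntwo\nthree\n' | wc -l)

echo "$combined"
echo "lines read from stdin:" $lines

rm -r "$demo"
```

Both ls calls should print the same listing, including the hidden file.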

Help

Not all help functions described below are always available. Simply try them all. You need to learn to find out about commands by yourself. This introduction can only point you in the right direction and give you hints as to what could be useful to you.

  • built in help function:

    command -h
    command --help
    command -?
    command
  • manual page (‘man page’):

    man command
    man -k search_phrase

    This is where to look when someone tells you to RTFM.

  • help function of the shell:

    help command
  • info page (can be somewhat similar to man but can contain hyperlinks):

    info command

And of course, you can always search on the web. Just make sure you understand which version of the command is described. There can be huge differences. If in doubt, check whether a command complies with the “POSIX standard”, a lowest common denominator for various flavors of Unix (yes, there are many different Unix-like operating systems out there!).

Copying, renaming, deleting

  • download from PHY598/01/ vim_basics.txt

  • copy to your docs directory:

    cp vim_basics.txt docs
  • move file to editor directory:

    mv vim_basics.txt p01/editor
  • make a temporary copy:

    cp docs/vim_basics.txt tmp
    cd tmp
    cp vim_basics.txt foo.txt
    mkdir tmp2
    cd tmp2
    mv ../foo.txt .    # note '.' for current dir
  • remove the copy:

    cd ..
    pwd
    rm vim_basics.txt
  • remove the tmp dir:

    rmdir tmp2   # only empty dirs!
    ## FAILS
    
    rm -r tmp2   # deletes full dirs recursively (dangerous)

Warning

rm -r and rm -rf (“force”, which even overrides permissions) are very dangerous!! They delete everything recursively. They do not ask and they do not keep a backup (no “Trash”).

In general, Unix commands do what you ask them to. If they succeed they typically will not output anything (unless it’s part of their job such as ls). Only if there’s a problem will you get a (terse) message.
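
The behaviour of rmdir and rm -r can be tried safely in a throw-away directory. The following is only a sketch: the directory comes from mktemp, the file names are made up, and the exit status ($?, where 0 means success) is used here as an additional way to see whether a command worked.

```shell
# Work in a throw-away directory to stay safe.
sandbox=$(mktemp -d)
mkdir "$sandbox/tmp2"
touch "$sandbox/tmp2/foo.txt"

# rmdir refuses to delete a non-empty directory and prints a terse error.
rmdir "$sandbox/tmp2" || echo "rmdir failed with exit status $?"

# Look at what would be deleted before using rm -r.
ls -R "$sandbox/tmp2"

# rm -r succeeds silently; exit status 0 means success.
rm -r "$sandbox/tmp2"
rm_status=$?
echo "rm -r exit status: $rm_status"

rmdir "$sandbox"    # now empty, so rmdir works
```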

Looking at files

Different ways to display a text file:

cat FILE
less FILE     # h for help, q for quit
head FILE
tail FILE

Try head -5 vim_basics.txt. For looking at log files of a running simulation, try

tail -f output.log

which will continuously update.
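
To see what head and tail select, a numbered test file helps. This sketch assumes a scratch file created with mktemp and uses the -n form of the line-count option (head -n 3 is the POSIX spelling of head -3).

```shell
# Create a small numbered file in a scratch location.
log=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$log"

first=$(head -n 3 "$log")    # first three lines
last=$(tail -n 2 "$log")     # last two lines

echo "$first"
echo "$last"

rm "$log"
```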

Shell name generation

“glob” patterns

  • * means any characters (even zero):

    ls *
    ls *.txt
    
    ls /usr/bin/*grep
  • ? means any single character:

    ls /usr/bin/?grep

  • [x-y] means a range:

    ls /usr/bin/[a-y]grep
  • Advanced: brace expansion

    head_{a,b,42,xx}_tail -->  head_a_tail
                               head_b_tail
                               head_42_tail
                               head_xx_tail

    Can be useful when making a complicated directory layout, e.g. simulations for three systems S1,S2,S3 at four temperatures 273, 300, 310, 373 and 2 pressures (1 atm and 1000 atm) (3*4*2 = 24 directories):

    mkdir -p {S1,S2,S3}/T={273,300,310,373}K/P={1,1000}atm

    Will create

    S1/T=273K/P=1atm
    S1/T=273K/P=1000atm
    S1/T=300K/P=1atm
    S1/T=300K/P=1000atm
    ...
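
The *, ? and [x-y] patterns can be explored without depending on what is installed in /usr/bin. The sketch below creates a few empty files with made-up names in a scratch directory and lets echo show what each pattern expands to.

```shell
# Scratch directory with a few made-up file names.
globdir=$(mktemp -d)
cd "$globdir" || exit 1
touch agrep bgrep zgrep grep notes.txt

star=$(echo *grep)        # * matches zero or more characters
single=$(echo ?grep)      # ? matches exactly one character
range=$(echo [a-y]grep)   # one character from the range a to y

echo "*grep     -> $star"
echo "?grep     -> $single"
echo "[a-y]grep -> $range"

cd - > /dev/null && rm -r "$globdir"
```

Note that the shell, not the command, performs the expansion: echo only sees the resulting list of names.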

Input/Output Redirection

  • commands read from standard input (stdin; by default, the terminal, i.e. the keyboard) and write to standard output (stdout):

    stdin --> command ---> stdout

    By default, standard output is printed to the screen.

    (There’s also a second output channel, called standard error (stderr), which is used for error messages. By default it is also sent to the screen.)

  • redirection operators:

    command > file      # create/overwrite output file
    command >> file     # append
    command < file      # read contents of file into stdin
    
  • Note on Unix philosophy: “Everything is a file”: a file is a file (of course) but whole disks are also files, the terminal is a file, memory can be treated as a file, a random number generator can appear as a file, ... Not overly important for the course but to keep at the back of your mind because it means that anything you learn about “real” files can be applied in a wider context!
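
The redirection operators can be sketched as follows. The scratch files come from mktemp and the missing directory is deliberately made up. The 2> operator, which redirects standard error, is not listed above and is added here only to show that stderr is a separate channel.

```shell
out=$(mktemp)    # scratch files with generated names
err=$(mktemp)

echo "first line"  > "$out"     # create/overwrite
echo "second line" >> "$out"    # append
count=$(wc -l < "$out")         # wc reads the file contents via stdin

# 2> redirects standard error; the path below does not exist on purpose.
ls /no/such/directory 2> "$err" || echo "ls failed, message saved"

echo "lines in output file:" $count
cat "$err"

err_size=$(wc -c < "$err")
rm "$out" "$err"
```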

Exercise

Run

cat

and type something, using return to finish lines... what happens?

To end input, type CONTROL-D (press control and ‘D’ at the same time; often written “^D”). (Knowing this is often useful.)

Note

To terminate a running command, use CONTROL-C (^C).

cat reads your input from stdin (keyboard) and writes it to stdout (screen).

example: cat as simple “editor”:

cat > TODO
- learn Unix
- learn vi
^D

less TODO

example:

mkdir p01
cd p01
ls -R ~/Documents
ls -R ~/Documents > Documents.lsR

less Documents.lsR

wc Documents.lsR

cat Documents.lsR > double1.lsR    # create/overwrite
cat Documents.lsR >> double1.lsR   # append
wc double1.lsR

or

cat Documents.lsR Documents.lsR > double2.lsR
wc double2.lsR

wc can also read from stdin:

wc < double1.lsR

What’s the difference?

Exercise:

  1. find out how to only print the number of words with wc
  2. create a new file with this number at the top and the remaining Documents.lsR following

Solution:

wc -w < Documents.lsR > Documents.NlsR
cat Documents.lsR >> Documents.NlsR

or

(wc -w < Documents.lsR; cat Documents.lsR) > Documents.NlsR

( command; command; ... ) runs a sequence of commands in a sub-shell whose output can again be redirected.
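
The difference between giving wc a file name and feeding it through stdin, as well as the grouping with parentheses, can be checked on a tiny scratch file. This is a minimal sketch with made-up content.

```shell
f=$(mktemp)    # scratch file with three words
echo "learn Unix today" > "$f"

with_name=$(wc -w "$f")      # prints the count followed by the file name
from_stdin=$(wc -w < "$f")   # wc only sees stdin, so it prints just the count

echo "wc -w FILE   -> $with_name"
echo "wc -w < FILE -> $from_stdin"

# The output of both grouped commands goes through the same pipe.
combined=$( (echo "header"; cat "$f") | wc -l )
echo "lines from the sub-shell:" $combined

rm "$f"
```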

Pipelines

“|” is the “pipe” character. It connects stdout from one command with stdin from another one:

command1 | command2

output of 1 is input of 2 (filter):

ls ~/Documents | wc

Much of the power of Unix comes from the fact that a Unix system contains many small programs that do one job particularly well and which can be strung together as filters in a pipeline.

Useful filter commands:

grep

(“global regular expression print”, after the ed command g/re/p)

Shows lines that match the expression REGEX:

command | grep 'REGEX'

Note

It is generally a good idea to enclose REGEX in “hard quotes” (single quote character “’”) so that the shell does not interpret special characters such as $ or ~.

Simple REGEX (“basic regular expressions”):

word          matches "word" literally anywhere
^word         matches "word" at beginning of line
word$         matches "word" at end of line
a *b          matches ab, a b, a   b, i.e. ' *' is zero or more
              spaces
a \+b         matches a b, a  b, ..., i.e. ' \+' is one or more
              spaces (NOTE: in "extended regular expressions"
              as used in 'egrep' this is just '+', i.e. 'a +b')
a[A-Z]b       matches aAb, aBb, ..., aZb  (range expression)
a[0-9][0-9]b  matches a00b, a01b, a02b, ..., a99b
a[A-Z]*b      ab, aAb, ..., aZb, aABb, ... (zero or more capitals)
a[A-Za-z]b    aAb,..., aZb, aab, ..., azb
a[^A-Z]b      aab, axb, a1b, a+b, ... but not ab ([^...] is a negation)
a.b           matches aXb a3b a_b a b  but not ab: '.' stands for
              a single character
a...b         a123b aXYZb etc: ... are three characters
a.*b          ab a1b a12b a123b etc: .* is zero or more characters
              (this is used very often)
a.\+b         a1b a12b but not ab: .\+ is one or more characters

(Regular expressions are amazingly useful but it takes some time to learn them. See ‘man re_format’ for the bare bones and various tutorials on the internet. The above barely scratches the surface.)
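
A few of the patterns from the table can be tried on made-up sample lines. The sketch below uses only basic regular expressions and grep -c, which counts matching lines instead of printing them.

```shell
# Sample lines are made up for illustration.
sample=$(printf '%s\n' 'ab' 'a b' 'a   b' 'a12b' 'aXb' 'word first' 'last word')

echo "$sample" | grep 'a *b'           # ab, a b, a   b
echo "$sample" | grep '^word'          # word first
echo "$sample" | grep 'word$'          # last word
echo "$sample" | grep 'a[0-9][0-9]b'   # a12b

# Count the lines that match instead of printing them.
n_any=$(echo "$sample" | grep -c 'a.*b')   # a, anything, b
n_one=$(echo "$sample" | grep -c 'a.b')    # a, exactly one character, b
echo "a.*b matches $n_any lines, a.b matches $n_one lines"
```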

Examples:

ls /usr/bin | grep lp
ls /usr/bin | grep ^lp

Exercise

How many lp commands?

ls /usr/bin | grep ^lp | wc -l

sort

alphabetical or numerical sort, e.g.

who | sort

uniq

cat FILE | sort | uniq

(note: uniq -c: histogram)
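
uniq only collapses identical lines that are adjacent, which is why the input is sorted first. A minimal sketch of the histogram idiom, using made-up residue names as input:

```shell
# Made-up input, one name per line.
counts=$(printf '%s\n' GLY ALA GLY LYS GLY ALA | sort | uniq -c | sort -rn)
echo "$counts"

# The most frequent entry is now on the first line.
top=$(echo "$counts" | head -n 1)
echo "most frequent:" $top
```

The final sort -rn orders the histogram numerically, largest count first.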

cut

cut -c N-M,X-Y  FILE  --> data from cols N-M and X-Y
cut -f 2,3 -d ' '     --> separate fields by space and print fields 2 and 3

(But for field splitting, awk works better (see below).)
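
Both modes of cut can be checked on short made-up strings:

```shell
# Character columns: keep columns 2-4 and 8-9 of each line.
chars=$(echo 'abcdefghij' | cut -c 2-4,8-9)

# Fields: split at single spaces and keep fields 2 and 3.
fields=$(echo 'alpha beta gamma delta' | cut -d ' ' -f 2,3)

echo "$chars"     # bcdhi
echo "$fields"    # beta gamma
```

Note that cut treats every single space as a separator, so runs of spaces produce empty fields.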

sed

“stream editor”: reads a file line by line and applies a sed-program to each line in turn. It is rather complicated and a typical use is to search and replace in a file:

cat FILE | sed 's/SEARCH/REPLACE/g'

where SEARCH is a “basic regular expression” as for grep.

Warning

sed sed-program FILE > FILE will destroy FILE. You must redirect to a temporary file, e.g. sed sed-program FILE > FILE.temp && mv FILE.temp FILE. Modern versions of sed have the -i (inplace) option to take care of that.
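
The safe search-and-replace pattern from the warning can be sketched on a scratch file (file name from mktemp, content made up):

```shell
f=$(mktemp)    # scratch file
printf 'learn vi\nvi is everywhere\n' > "$f"

# Write to a temporary file first, then replace the original on success.
sed 's/vi/vim/g' "$f" > "$f.temp" && mv "$f.temp" "$f"

result=$(cat "$f")
echo "$result"

rm "$f"
```

The && ensures that the original is only overwritten if sed finished without an error.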

awk

awk also scans a file line-by-line and applies an awk program to each line. It is also fairly complicated (actually, awk is a full-blown scripting language) but typical use is straightforward: awk splits the line into fields (separated by white space, i.e. spaces and tabs) and then allows you to access fields by the special variables $1 (first field), $2 (second field), etc. For most data files you can think of fields as columns.

cat FILE | awk '/REGEX/ {awk-command; awk-command; ...}'

e.g.:

ls -lR /usr/bin | awk '/grep$/ {print $9, $5/1024}'

prints the file name and the size in kB instead of bytes but only of those commands that end in grep.
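
Because the output of ls -l differs between systems, the same idea can be tried on a made-up two-column listing (size in bytes, then name). This is only a sketch of the pattern-action form of awk:

```shell
# Made-up listing: size in bytes, then name.
listing=$(printf '%s\n' '2048 egrep' '4096 ls' '1024 fgrep')

# For lines ending in "grep", print the name and the size divided by 1024.
sizes=$(echo "$listing" | awk '/grep$/ {print $2, $1/1024}')
echo "$sizes"

# Without a pattern the action runs on every line; END runs once at the end.
total=$(echo "$listing" | awk '{sum += $1} END {print sum}')
echo "total bytes: $total"
```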

Filter exercises

Download 1AKE and 4AKE from the PDB (Protein Databank). Look at the files with less.

  • search with the PDB code (e.g. “1ake”)

  • download file (Files -> Download Files -> PDB File (gz))

  • Or from the command line:

    curl http://www.rcsb.org/pdb/files/1ake.pdb.gz -o 1ake.pdb.gz
    curl http://www.rcsb.org/pdb/files/4ake.pdb.gz -o 4ake.pdb.gz
    gunzip *.gz

    (curl can be read as “cat URL”, i.e. by default it writes the file pointed to by URL to stdout, ready to be used in a pipeline.)

    Or (if wget is installed):

    wget http://www.rcsb.org/pdb/files/{1ake,4ake}.pdb.gz
    gunzip 1ake.pdb.gz

Put files into your PDB dir.

Look at files with less and recognize the PDB file format.

We are particularly interested in the Coordinates Section where individual atoms are listed together with their coordinates. Move to the ATOM and HETATM sections (use / (e.g. /ATOM) to search; press n repeatedly to move forward through the matches).

  1. count the number of residues [1]

    Hint: each residue has exactly one CA atom; protein residues are stored with ATOM records. Other molecules are in HETATM.

    Bonus: How many residues in each chain?

    Solution: manual inspection of the file showed that there are only two chains, A and B, so we simply grep for those separately:

    grep '^ATOM.*CA.* A ' 1ake.pdb | wc -l
    grep '^ATOM.*CA.* B ' 1ake.pdb | wc -l

    214 (the same for both chains)

    Total:

    grep '^ATOM.*CA' 1ake.pdb | wc -l

    428

  2. histogram of residue names: how often does each amino acid occur in the protein? Are some rarer than others?

    • find the CA
    • use cut -c N-M to extract the name from the fixed (not white-space separated!) columns (check PDB ATOM specs for N-M)
    • use sort and uniq -c

    Solution:

    cat 1ake.pdb | grep '^ATOM.*CA' | cut -c 18-20 | sort | uniq -c

    Question: How to sum up the totals?

    cat 1ake.pdb | grep '^ATOM.*CA' | cut -c 18-20 | sort | uniq -c \
      | awk '{sum+=$1}; END {print "total: ", sum}'

Footnotes

[1] A protein is a polypeptide that is made up from a linear sequence of amino acids; each amino acid is called a residue. There are 20 naturally (and frequently) occurring amino acids. Each has a three-letter residue name. For instance, glycine is Gly, arginine is Arg, and glutamine is Gln.

Access rights (permissions)

Take a detailed look at the output of a long file listing:

ls -la ....
drwxr-xr-x  2 oliver oliver     68 Jan 11 02:34 tmp
-rw-r--r--  1 oliver oliver 495559 Jan 11 13:56 Documents.lsR
 uuugggooo    owner  group  size   date         name

d = directory (see man ls for the full list)
r = read
w = write
x = execute

Fields:
u = user/owner   = oliver
g = group  = oliver
o = other

Additionally, after the permissions there can also be a single character that shows if alternative access controls (such as Access Control Lists) are applicable. This is typically signified through a + sign. The ls -l command on Mac OS X also shows information about extended attributes (@), i.e. there exists meta data stored in the file system. This is only of concern if you copy the file to a non-Mac OS X formatted disk or USB flash drive because the “foreign” filesystem will not be able to store these extended attributes.

chmod

change permissions:

chmod go-rwx FILE   # make it fairly private
chmod go+r   FILE   # let others read it
chmod a+r    FILE   # let everyone read it
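
The effect of chmod is visible in the first ten characters of the ls -l output. A minimal sketch on a scratch file, which first sets a known starting state so the result does not depend on your defaults:

```shell
f=$(mktemp)    # scratch file created by mktemp

chmod u=rw,go=r "$f"                   # start from rw-r--r--
before=$(ls -l "$f" | cut -c 1-10)

chmod go-rwx "$f"                      # remove all rights of group and others
after=$(ls -l "$f" | cut -c 1-10)

echo "before: $before"
echo "after:  $after"

rm "$f"
```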

Exercise:

  1. remove execute permission from tmp dir. Try ‘cd tmp’ and ‘ls tmp’
  2. Fix the permissions so that everything works again.

Other useful commands

df — file system information

(“disk free”):

df

du — file use in directories

(“disk usage”):

du -s DIR

compression

compress (shrink file size without losing information):

zip
gzip
bzip2  (may not be installed)

uncompress:

unzip
gunzip, zcat
bunzip2

(These commands can typically be asked to take either a FILE as input or read from stdin. They can also write to stdout so that one can put a compress or uncompress step into a pipeline.)
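
As a sketch of compression inside a pipeline, gzip and gunzip are used here with the -c option (write to stdout); the data and the scratch file are made up.

```shell
# Compress to stdout (-c) and immediately uncompress again.
roundtrip=$(printf 'ATOM records\n' | gzip -c | gunzip -c)
echo "$roundtrip"

# Store compressed data, then read it back inside a pipeline.
gz=$(mktemp)
printf 'one\ntwo\nthree\n' | gzip -c > "$gz"
nlines=$(gunzip -c < "$gz" | wc -l)
echo "lines:" $nlines

rm "$gz"
```

This way a large compressed file can be inspected without ever writing the uncompressed version to disk.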

file — guess file type

file FILE

diff — compare two files

diff FILE1 FILE2

The diff command (together with its cousin patch) is very powerful when it comes to big software development projects. For us it is mostly useful to quickly compare two files. diff -U2, a so-called “unified diff” is generally more readable than the standard diff output. Also look at sdiff, which shows the differences in a side-by-side view (pipe the output through less!).
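
A unified diff of two small scratch files (made-up content) might be produced as sketched below. Note that diff signals “files differ” through a non-zero exit status, which is not an error.

```shell
a=$(mktemp)    # two scratch files that differ in one line
b=$(mktemp)
printf 'learn Unix\nlearn vi\n'   > "$a"
printf 'learn Unix\nlearn nano\n' > "$b"

# diff exits with status 1 when the files differ.
diff -U 2 "$a" "$b" || echo "files differ"

# Changed lines start with a single + or - (the headers use +++ and ---).
changed=$(diff -U 2 "$a" "$b" | grep -c '^[-+][^-+]')
echo "changed lines: $changed"

rm "$a" "$b"
```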

find — find files

Complicated syntax but can be extremely useful:

find . -name '*.txt'

Find files over 1M in size:

find . -size +1M -ls
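
find can be practised on a small made-up directory tree. In this sketch the -type d test (select directories only) is an addition to the examples above.

```shell
tree=$(mktemp -d)    # scratch tree with made-up names
mkdir -p "$tree/docs/old"
touch "$tree/a.txt" "$tree/docs/b.txt" "$tree/docs/old/c.txt" "$tree/docs/d.pdf"

# Quote the pattern so that find, not the shell, expands it.
find "$tree" -name '*.txt'

ntxt=$(find "$tree" -name '*.txt' | wc -l)
ndirs=$(find "$tree" -type d | wc -l)    # includes the top directory itself
echo "text files:" $ntxt "directories:" $ndirs

rm -r "$tree"
```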

history — all the commands you typed

The shell remembers all the commands that you typed in the “history” (for bash, typically the hidden file ~/.bash_history). This history allows you to

  • go back with the Cursor-up key to recall commands
  • search backwards with ^R

and see the commands with

history

The history is actually truncated at HISTSIZE (you can set the environment variable HISTSIZE yourself, e.g. HISTSIZE=500; export HISTSIZE).

Downloading files via the commandline

curl (think “cat URL”) treats a URL as a file. It is a great tool and well worth learning:

curl URL | command
curl URL -o FILENAME

wget is straightforward for downloading files (but not installed on Mac OS X by default):

wget URL

or

wget URL -O NEWNAME

Exercise: Download PHY598/01/vimqrc.pdf and move it into your docs directory, using the commandline:

cd ~/NAME/p01/docs
curl http://becksteinlab.physics.asu.edu/pages/courses/2012/PHY598/01/vimqrc.pdf -o vimqrc.pdf

echo — printing a string to stdout

If you want to see what the shell thinks of an expression or if you want to have a script output a message you can use the echo command:

echo "Hello world!"
echo "nothing happens *"
echo "all the files " *

(Note that in the last line the shell expands * to all the files in the directory.)

There’s also the printf command, which is more versatile but less often used (see man page).
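
A short sketch of printf: the first argument is a format string in which %s is replaced by a string and %d by an integer; the values are made up.

```shell
# %s inserts a string, %d an integer; \n ends the line explicitly.
msg=$(printf '%s has %d files\n' docs 3)

# %05d pads an integer with zeros to a width of five characters.
padded=$(printf 'run_%05d' 42)

echo "$msg"
echo "$padded"
```

Unlike echo, printf does not add a newline by itself, which makes it handy for building file names.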

Unix variables

Variables are containers to store content in. Bash knows simple variables and arrays. It does not distinguish between text and numbers (essentially, everything is treated as text and if needed interpreted as a number).

Shell variables and variable expansion

Assign value to the variable NAME:

NAME=value

E.g:

WORK=$HOME/NAME/p01
ls $WORK
TMP_DIR=$WORK/tmp
ls ${TMP_DIR}/*    # braces mark where the variable name ends (optional here)

By convention we use uppercase letters for NAME but it can be any mixture between upper and lower case characters and numbers (though it can’t start with a number). Stick to letters, numbers, underscores.

The value is accessed (“expanded”) by prefixing NAME with the $ (dollar symbol).

Variables behave differently in quotes:

echo "My work directory is $WORK"

This will print something like

My work directory is /Users/USERNAME/NAME/p01

The value is expanded inside strings with double quotes (“soft quotes”). Within single quotes the variable is not expanded (or “interpolated into the string”) hence we call them “hard quotes”:

echo 'The WORK variable can be accessed as $WORK'

will print

The WORK variable can be accessed as $WORK

In order to keep special characters such as $ in a string you can

  • in double quotes, prefix it with a single backslash \
  • put it in single quotes
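
The quoting rules can be summarised in a small sketch. The path stored in WORK is hypothetical and is never accessed on disk.

```shell
WORK=/home/user/p01    # hypothetical path, only used as text

soft="Work directory: $WORK"        # expanded inside double quotes
hard='Work directory: $WORK'        # kept literally inside single quotes
escaped="The variable is \$WORK"    # backslash protects $ in double quotes
braces="${WORK}_backup"             # braces mark where the name ends

echo "$soft"
echo "$hard"
echo "$escaped"
echo "$braces"
```

Without the braces, the shell would look for a variable called WORK_backup in the last example.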

Environment variables

A number of variables are already set to certain values. These “environment variables” have special meaning. Examples

echo $HOME
echo $USER

Only modify if you know what you are doing!

Show the whole environment:

env
env | less

New environment variables are generated with

export NAME=value

or:

NAME=value
export NAME

Note

In other shells (not bash), one has to use different commands, e.g. in csh and tcsh it is setenv NAME value.
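
What export changes can be observed by starting a child shell. In this sketch DEMO_VAR is a made-up name, and ${DEMO_VAR:-not set} prints the text “not set” when the variable is empty or missing.

```shell
unset DEMO_VAR                  # made-up name, chosen for this example

DEMO_VAR=hello                  # plain shell variable
child1=$(sh -c 'echo "${DEMO_VAR:-not set}"')

export DEMO_VAR                 # now part of the environment
child2=$(sh -c 'echo "${DEMO_VAR:-not set}"')

echo "before export: $child1"
echo "after export:  $child2"
```

Only exported variables are passed on to the programs that the shell starts.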

PATH

PATH is a very special environment variable. It lists the directories where commands are searched for.

echo $PATH

which ls

The second command shows the full path of the ls command (which is simply a file in a bin directory). Its directory is also listed in PATH. This is why one can simply type

ls

although one could alternatively use the full path to the command:

/bin/ls

Note

If an executable (e.g. a code that you compiled yourself) is not on the PATH then you will always have to provide the path name in order to execute it.

  • The shell searches directories for a command in the order listed in PATH.
  • Directories are separated by ‘:’.
  • If a command is not on PATH then one has to provide its path to the shell in order to run it.
  • It can make sense to add ‘.’ to the path to be able to run executables in the current directory.

Changes to PATH are typically done in the shell startup file ~/.bashrc. E.g. adding your own bin directory:

export PATH=$PATH:$HOME/NAME/bin
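
How the PATH lookup works can be sketched with a tiny script in a scratch directory that stands in for your own bin directory. The script name hello_demo is made up, and command -v is used as a shell built-in alternative to which.

```shell
bindir=$(mktemp -d)    # stands in for your own bin directory

# A tiny executable script with a made-up name.
printf '#!/bin/sh\necho "hello from my own command"\n' > "$bindir/hello_demo"
chmod u+x "$bindir/hello_demo"

by_path=$("$bindir/hello_demo")    # always works: explicit path

PATH=$PATH:$bindir                 # append the directory to the search path
by_name=$(hello_demo)              # now found by name
where=$(command -v hello_demo)     # shows which file the shell would run

echo "$by_name"
echo "$where"

rm -r "$bindir"
```

The change to PATH only lasts for the current shell session unless it is added to the startup file.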

General Unix gotchas

  • Unix is generally case-sensitive (however, Mac OS X is typically not!)

  • directories are separated by the forward slash /

  • filenames: may contain any characters, however:

    • special characters need to be quoted, using

      • backslash: \
      • “soft” quotes: " "
      • “hard” quotes: ' '
    • Avoid spaces, slashes /, backslashes \, dollar sign $, ampersand &, question mark ?, parentheses (), square [] and curly brackets {}, pipe |, binary relations >, <, back tick `; they will make your life difficult.

      Also avoid non-English (i.e. non-ASCII) letters such as German umlauts (äöü ...) or accented characters (éîè ...) or special symbols (©≠–† ...).

    • Good: standard letters, numbers, underscore _, dot ., dash -. The equality sign = can be used but it can lead to confusion.
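
If you do have to deal with awkward file names, the three quoting methods look as sketched below. The names are made up and live in a scratch directory.

```shell
d=$(mktemp -d)    # scratch directory
cd "$d" || exit 1

touch "my notes.txt"          # soft quotes keep the space in the name
touch 'cost $5.txt'           # hard quotes also protect the dollar sign
touch good_name-v2.txt        # no quoting needed

with_backslash=$(ls my\ notes.txt)    # backslash quotes a single character
nfiles=$(ls | wc -l)

echo "$with_backslash"
echo "files:" $nfiles

cd - > /dev/null && rm -r "$d"
```

Without any quoting, touch my notes.txt would create two files, my and notes.txt.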

Setting up an editor

We will use the vi editor (actually, the editor is really called vim but it will appear as “the vi editor”... there’s a longish Unix history behind this, which does not need to concern us). vi is very powerful and available on any Unix system. Its learning curve is fairly steep, though, but “vi” is a very useful skill to have.

vi/vim

Go through vi/vim essentials (vim_basics.txt).

If this is not sufficient consider doing vimtutor (takes about 20-30 mins): On the commandline type

vimtutor

and follow the instructions.

nano

nano is a light-weight editor available on most modern Unix-like systems. If you absolutely hate vi (some people do) then you can try this one. It has fewer features than vi and requires a few more customizations in order to provide a comparable experience.

To learn more about nano, launch it and read the help (^G, i.e. CTRL-G) and have a look at nano’s homepage.

  • customize ~/.nanorc (see also nanorc (5)). Example:

    ## Backup files to filename~.
    set backup
    
    ## Enable ~/.nano_history for saving and reading search/replace strings.
    set historylog
    
    ## The opening and closing brackets that can be found by bracket
    ## searches.  They cannot contain blank characters.  The former set must
    ## come before the latter set, and both must be in the same order.
    ##
    set matchbrackets "(<[{)>]}"
    
    ## Enable mouse support, if available for your system.  When enabled,
    ## mouse clicks can be used to place the cursor, set the mark (with a
    ## double click), and execute shortcuts.  The mouse will work in the X
    ## Window System, and on the console when gpm is running.
    ##
    set mouse
    
    ## Use smooth scrolling as the default.
    # set smooth
    
    ## Constantly display the cursor position in the statusbar.  Note that
    ## this overrides "quickblank".
    set const
    
    ## For python
    ## Use this tab size instead of the default; it must be greater than 0.
    set tabsize 4
    ## Convert typed tabs to spaces.
    set tabstospaces
    
    ## Use auto-indentation.
    set autoindent
  • add syntax highlighting: Mac OS X’s nano misses the files that describe syntax highlighting. You can download them from http://becksteinlab.physics.asu.edu/pages/courses/2012/PHY598/nanorc.tar.gz . Unpack into a new directory ~/.nano. Or do it in one go:

    mkdir ~/.nano
    cd ~/.nano
    curl http://becksteinlab.physics.asu.edu/pages/courses/2012/PHY598/nanorc.tar.gz | tar zxvf -

    (Note how the tar command can read from stdin (f -) and curl provides the archive to stdin.)

    Now you have to enable the syntax highlighting in your ~/.nanorc file. Instead of doing this manually we use a simple for loop:

    cd ~/.nano
    (for f in *.nanorc; do echo "include \"~/.nano/$f\""; done) >> ~/.nanorc

    That appends the correct commands such as

    include "~/.nano/asm.nanorc"
    include "~/.nano/awk.nanorc"
    include "~/.nano/python.nanorc"
    ...

    to the configuration file.