Introduction to Shell#

We started by learning the Python programming language and the Jupyter notebook, which are powerful tools for scientific computing. Now we are going to introduce another useful and ubiquitous tool: the shell.

License Information

Much of the material for this part of the course has been inspired by, or directly taken from, the Software Carpentry lesson “The Unix Shell”, which is graciously provided by Software Carpentry under the permissive CC BY 4.0 license.

The sonnets of William Shakespeare in the sonnets/ directory are used under the conditions of the Project Gutenberg license (a copy of which is included in sonnets/LICENSE.txt).

What is “the shell”#

A shell is a program that allows you to interact with the computer in a REPL (read-evaluate-print loop). You type in commands, the shell interprets these commands and interacts with the rest of the computer on your behalf, then it will perhaps print some output to the screen, and then allow you to type in another command. For example, instead of browsing through your folders to find a file and click on it to view it, you can type commands to browse and view in the shell.

There are many programs that act as shells, but the most common is bash, the “Bourne Again SHell”. This is the shell that we are going to learn today. While each shell program has its own idiosyncrasies, the principles and most of the syntax for operating them are the same.

Note on nomenclature

Often you will hear the terms “shell”, “terminal” and “command line” (CLI) used interchangeably. While there are technical differences between these terms, the distinction is not important for the purposes of this course.

The term “shell” itself evokes the idea that the program in question acts as a thin layer around the core of the “computer” (operating system + other tools).

Why bother?#

Most of our daily computing is done using GUIs (Graphical User Interfaces) driven mainly by a mouse and, increasingly, by touch screen interfaces. Given this, why should we learn to interact with the computer using a CLI (Command Line Interface), which requires use of a keyboard and provides much less rich input and output?

Here are several reasons why it is useful to have at least a basic proficiency in using the shell, in the context of scientific computing:

Operating in restricted environments#

There are computers that may not provide a GUI, for example:

  • High Performance Computing (HPC) clusters

  • controllers for experimental equipment (e.g. Raspberry Pi)

  • web servers (e.g. when deploying your website)

Shell is an efficient environment for manipulating files and running programs#

The shell’s whole job is to make it easy to run other programs and to interact with files on the computer. When running programs from the shell you can provide arguments that will modify the program’s behaviour; this is typically more difficult or impossible when running programs by clicking on icons in a file browser.

In addition the shell provides other features, such as being able to use the output of one command as the input of another, and so being able to chain simple commands together into more complex ones.
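As a small illustration of chaining, the sketch below connects three standard commands so that each one works on the output of the previous one. The fruit names are made up for the example, and piping itself is covered in more detail later in the course.

```shell
# The | character (a "pipe") sends the output of the command on its left
# to the input of the command on its right.
# printf writes three lines, sort orders them, head -n 1 keeps the first one.
first=$(printf 'pear\napple\nfig\n' | sort | head -n 1)
echo "$first"
```

Here the three simple commands together answer a question ("which name comes first alphabetically?") that none of them answers alone; the result printed is apple.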

Writing programs that expose a CLI is very simple#

CLI-driven programs only need to know how to accept command line “arguments” and how to print text to the screen. The operating system typically provides these mechanisms “for free”. In Python you can access the list of arguments provided with sys.argv (more on this later) and can print text by using the print function.

The consequence of this is that a lot of scientific software only provides a CLI. It is therefore important to learn how to use the shell to be able to operate this software.
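The same two mechanisms can be seen from the shell side. The sketch below is a rough shell analogue of reading sys.argv and calling print in Python; show_args is a hypothetical helper written for this illustration, not a standard command.

```shell
# show_args is a hypothetical helper: it reports the arguments it was given.
show_args() {
    echo "number of arguments: $#"     # $# holds the number of arguments
    for arg in "$@"; do                # "$@" expands to the arguments, one by one
        echo "argument: $arg"
    done
}

# The shell splits the line on whitespace and hands over two arguments.
show_args -F casimir_programming_course
```

Note that the flag -F is passed as an ordinary argument; it is up to the program to decide what it means.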

Learning Goals#

There are four main aims for this part of the course:

Be able to launch and ascertain the “state” of your shell by executing basic commands#

This includes information such as which user you are logged in as, the directory in which you find yourself, and the files in the current directory.

Be able to navigate the filesystem and perform basic file manipulations#

For many people manipulating files forms the bulk of the work they will do in the shell. This includes copying, renaming and deleting files, as well as viewing and editing file contents.

Be able to use the shell’s productivity tools#

This includes utilizing command completion (so that you avoid excessive typing) and searching through the command history (to re-use commands you previously executed). This also includes using more advanced shell features such as piping.

Be able to find documentation for CLI programs, and use the documentation to interact with these programs#

It is not the aim of this course to give you knowledge of any specific tools (beyond the basic ones that are common to essentially all shell usage). This is why it is important to be able to effectively find and understand documentation for CLI programs, so that you can solve problems yourself.

Launching a shell session and getting oriented#

You will usually launch a shell session from an existing graphical environment (your desktop!) by launching a “terminal” program that opens a window through which you can interact with the shell.

The location of this terminal program will depend on your operating system (search the web for “how do I open a terminal in <X>” if you are unsure). Jupyter also provides a terminal program that runs in your web browser (screencast below).

The command prompt#

After launching the terminal you will see the following:

learner@casimir:~ $

This is the shell prompt; it’s asking you for input!

If you now type whoami you will see this appear to the right of the prompt:

learner@casimir:~ $ whoami

Now hitting the “enter” key will tell the shell to execute the command:

learner@casimir:~ $ whoami
learner
learner@casimir:~ $ 

The whoami command printed the current user (who the shell thinks we are) to the screen, and finished. The shell then printed another prompt, indicating that it is ready for more input.

More specifically, when we hit “enter” after typing whoami the shell:

  1. finds a program called whoami,

  2. runs that program,

  3. displays that program’s output, then

  4. displays a new prompt to tell us that it’s ready for more commands.
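The first step above can be made visible. In the sketch below, command -v asks the shell where it would find the program for a given name; the exact location printed depends on your system (often something like /usr/bin/whoami, but that is an assumption about your installation).

```shell
# command -v prints the location of the program the shell would run for a name.
program_path=$(command -v whoami)
echo "$program_path"

# Running the program through its full location gives the same result
# as typing just its name and letting the shell find it.
"$program_path"
whoami
```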

Unknown commands#

If you try to get the shell to execute a command that it does not know about, it will print an error:

learner@casimir:~$ nonexistantcommand
-bash: nonexistantcommand: command not found

Absolute Paths#

We can uniquely identify a file by giving its location relative to the root of the filesystem. This is called the absolute path of the file.

The absolute path of a file starts with a / and contains the “path” that you would follow from the root directory to reach the file in question, with each directory separated by /.

../_images/filesystem-highlighted.svg

The path to the a.py file shown above would be written as

/home/learner/a.py

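The pwd (print working directory) command prints the absolute path of the current working directory. When run in our home directory it prints /home/learner, which means that we are in the learner directory, which itself is in the home directory, which itself is in the root directory.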
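To try this without touching your real files, the sketch below builds a small throwaway directory tree that mimics the one in the figure. The variable names root and here, and the use of mktemp -d to create a temporary directory, are choices made for this example only.

```shell
# Build a small throwaway tree: <temporary directory>/home/learner/a.py
root=$(mktemp -d)
mkdir -p "$root/home/learner"
touch "$root/home/learner/a.py"

cd "$root/home/learner"
here=$(pwd)             # pwd prints the current working directory
echo "$here"            # an absolute path: it starts with /

# An absolute path identifies the same file no matter where we are.
cd /
ls "$here/a.py"
```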

Listing Files#

We can use the ls command to list files and directories in the current working directory:

learner@casimir:~$ ls 
a.py  casimir_programming_course  f.txt  id_rsa.pub  shared  Sync

This works, but it is not very informative: which of the above are files, and which are directories?

We can modify what the ls command does by using the flag -F (also known as a switch or option). This flag tells ls to add a trailing / to the names of directories:

learner@casimir:~$ ls -F
a.py  casimir_programming_course/  f.txt  id_rsa.pub  shared/  Sync/
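You can reproduce this behaviour in a scratch directory. The sketch below assumes nothing about your home directory: it creates two empty files and one directory in a temporary location (the names are borrowed from the listing above) and then lists them with -F.

```shell
demo=$(mktemp -d)     # throwaway directory for the experiment
cd "$demo"
touch a.py f.txt      # two empty files
mkdir shared          # one directory

listing=$(ls -F)
echo "$listing"       # shared/ carries a trailing slash, the files do not
```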

Finding documentation and manuals#

ls has lots of flags. We can use the --help flag to ls to find out more information:

learner@casimir:~$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters
      --block-size=SIZE      scale sizes by SIZE before printing them; e.g.,
                               '--block-size=M' prints sizes in units of
                               1,048,576 bytes; see SIZE format below
  -B, --ignore-backups       do not list implied entries ending with ~
  -c                         with -lt: sort by, and show, ctime (time of last
                               modification of file status information);
                               with -l: show ctime and sort by name;
                               otherwise: sort by ctime, newest first
  -C                         list entries by columns
      --color[=WHEN]         colorize the output; WHEN can be 'always' (default
                               if omitted), 'auto', or 'never'; more info below
  -d, --directory            list directories themselves, not their contents
  -D, --dired                generate output designed for Emacs' dired mode
  -f                         do not sort, enable -aU, disable -ls --color
  -F, --classify             append indicator (one of */=>@|) to entries
      --file-type            likewise, except do not append '*'
      --format=WORD          across -x, commas -m, horizontal -x, long -l,
                               single-column -1, verbose -l, vertical -C
      --full-time            like -l --time-style=full-iso
  -g                         like -l, but do not list owner
      --group-directories-first
                             group directories before files;
                               can be augmented with a --sort option, but any
                               use of --sort=none (-U) disables grouping
  -G, --no-group             in a long listing, don't print group names
  -h, --human-readable       with -l and/or -s, print human readable sizes
                               (e.g., 1K 234M 2G)
      --si                   likewise, but use powers of 1000 not 1024
  -H, --dereference-command-line
                             follow symbolic links listed on the command line
      --dereference-command-line-symlink-to-dir
                             follow each command line symbolic link
                               that points to a directory
      --hide=PATTERN         do not list implied entries matching shell PATTERN
                               (overridden by -a or -A)
      --indicator-style=WORD  append indicator with style WORD to entry names:
                               none (default), slash (-p),
                               file-type (--file-type), classify (-F)
  -i, --inode                print the index number of each file
  -I, --ignore=PATTERN       do not list implied entries matching shell PATTERN
  -k, --kibibytes            default to 1024-byte blocks for disk usage
  -l                         use a long listing format
  -L, --dereference          when showing file information for a symbolic
                               link, show information for the file the link
                               references rather than for the link itself
  -m                         fill width with a comma separated list of entries
  -n, --numeric-uid-gid      like -l, but list numeric user and group IDs
  -N, --literal              print raw entry names (don't treat e.g. control
                               characters specially)
  -o                         like -l, but do not list group information
  -p, --indicator-style=slash
                             append / indicator to directories
  -q, --hide-control-chars   print ? instead of nongraphic characters
      --show-control-chars   show nongraphic characters as-is (the default,
                               unless program is 'ls' and output is a terminal)
  -Q, --quote-name           enclose entry names in double quotes
      --quoting-style=WORD   use quoting style WORD for entry names:
                               literal, locale, shell, shell-always,
                               shell-escape, shell-escape-always, c, escape
  -r, --reverse              reverse order while sorting
  -R, --recursive            list subdirectories recursively
  -s, --size                 print the allocated size of each file, in blocks
  -S                         sort by file size, largest first
      --sort=WORD            sort by WORD instead of name: none (-U), size (-S),
                               time (-t), version (-v), extension (-X)
      --time=WORD            with -l, show time as WORD instead of default
                               modification time: atime or access or use (-u);
                               ctime or status (-c); also use specified time
                               as sort key if --sort=time (newest first)
      --time-style=STYLE     with -l, show times using style STYLE:
                               full-iso, long-iso, iso, locale, or +FORMAT;
                               FORMAT is interpreted like in 'date'; if FORMAT
                               is FORMAT1<newline>FORMAT2, then FORMAT1 applies
                               to non-recent files and FORMAT2 to recent files;
                               if STYLE is prefixed with 'posix-', STYLE
                               takes effect only outside the POSIX locale
  -t                         sort by modification time, newest first
  -T, --tabsize=COLS         assume tab stops at each COLS instead of 8
  -u                         with -lt: sort by, and show, access time;
                               with -l: show access time and sort by name;
                               otherwise: sort by access time, newest first
  -U                         do not sort; list entries in directory order
  -v                         natural sort of (version) numbers within text
  -w, --width=COLS           set output width to COLS.  0 means no limit
  -x                         list entries by lines instead of by columns
  -X                         sort alphabetically by entry extension
  -Z, --context              print any security context of each file
  -1                         list one file per line.  Avoid '\n' with -q or -b
      --help     display this help and exit
      --version  output version information and exit

The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).

Using color to distinguish file types is disabled both by default and
with --color=never.  With --color=auto, ls emits color codes only when
standard output is connected to a terminal.  The LS_COLORS environment
variable can change the settings.  Use the dircolors command to set it.

Exit status:
 0  if OK,
 1  if minor problems (e.g., cannot access subdirectory),
 2  if serious trouble (e.g., cannot access command-line argument).

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/ls>
or available locally via: info '(coreutils) ls invocation'

Many other commands have a --help flag.

Note that flags will often have a short version, which consists of a single - and a single character, and a long version, which consists of a -- and several characters.
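As a quick check of the short and long forms, the sketch below compares -a with --all, both taken from the help text above. It assumes the GNU version of ls shown there; other versions of ls may not accept long flags. It also shows that several short flags can usually be written together behind a single -.

```shell
demo=$(mktemp -d)
cd "$demo"
touch .hidden visible.txt

short=$(ls -a)         # short form
long=$(ls --all)       # long form (GNU ls; other ls versions may reject it)

combined=$(ls -aF)     # two short flags written together ...
separate=$(ls -a -F)   # ... behave like the same flags given separately

echo "$short"
echo "$combined"
```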

If you provide a flag that a command does not support, it will print an error:

learner@casimir:~$ ls --nonexistant-flag
ls: unrecognized option '--nonexistant-flag'
Try 'ls --help' for more information.

You can also find help on commands by using the man command:

man ls

This will display the “manual” for the ls command in the terminal, which you can scroll through with the arrow keys (use q to quit).

man pages are also available online at: https://linux.die.net/man/. This is often much more practical than using the man command.

While --help and man pages are certainly complete, they are often very poor at giving a feel for how a particular program should be used, with common usage examples. For this, google is your friend.

Relative Paths#

We can also use ls to list the files in other directories than the current working directory, like so:

learner@casimir:~$ ls -F casimir_programming_course
course_announcement.md  day1/  day3/  exercises/                       intro/   misc/       projects.ipynb
day1/         day2/          day4/  installation_instructions.ipynb  LICENSE  program.md  README.md

We passed a directory name, casimir_programming_course, to ls as an argument, after the flag -F.

We can also list the contents of day1, inside casimir_programming_course:

learner@casimir:~$ ls -F casimir_programming_course/day1/
babynames.txt  helpers.py  if_figure.png  introduction_to_python.ipynb  projects_day_1.ipynb  romeo_and_juliet.txt

Here we provided the directory to ls as a relative path, which does not start with a /, and is to be understood as a path relative to the current working directory. Using relative paths is a useful shortcut that saves quite a bit of typing.

There are several other shortcuts that can be used when specifying paths:

  • . refers to the current working directory

  • .. refers to the parent of the current working directory

  • ~ refers to the user’s home directory
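The shortcuts in the list above can be tried out in a scratch directory. The directory names course and day1 below are placeholders invented for this sketch.

```shell
base=$(mktemp -d)
mkdir -p "$base/course/day1"
cd "$base/course"

ls -F .              # . is the current directory: lists day1/
ls -F ..             # .. is the parent directory: lists course/
cd day1/..           # into day1 and straight back up: we are still in course
after=$(pwd)
echo "$after"
echo ~               # the shell replaces ~ with the path of your home directory
```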

Moving around the filesystem#

Even though using relative paths adds convenience, most of the time you will want to do operations on files in a single directory. In this case it is useful to be able to change the working directory.

This is accomplished using the cd command:

learner@casimir:~$ cd casimir_programming_course
learner@casimir:~/casimir_programming_course$

We changed to the casimir_programming_course directory that is inside our home directory, specifying it using a relative path.

The cd command did not print anything to the screen, and the shell returned us a prompt. We can use the pwd command to check what directory we are in:

learner@casimir:~/casimir_programming_course$ pwd
/home/learner/casimir_programming_course
learner@casimir:~/casimir_programming_course$ 

Note that the prompt also contains the current working directory between the : and $, using the shortcut ~ for the “home directory”. Can you use this to deduce what our home directory is?

Be aware that when using different systems the shell prompt may not look exactly the same.

We can of course use all the shortcuts we learned above when specifying paths to cd. In addition cd has the following behaviour when certain “special” arguments are provided:

  • cd (no arguments) takes us back to our home directory

  • cd - returns to the last directory we were in (like an “undo”)
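A short sketch of both special cases follows, using two throwaway directories (day1 and day3 are just example names). Note that cd - also prints the directory it switches to; the example silences that output.

```shell
base=$(mktemp -d)
mkdir -p "$base/day1" "$base/day3"

cd "$base/day1"
cd "$base/day3"
cd - > /dev/null       # back to day1, the directory we were in before
back_to=$(pwd)
echo "$back_to"

cd                     # no arguments: go to the home directory
pwd
```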

✓ Mini exercises#

  • Execute the example commands in this notebook in a shell, if you haven’t done so yet

Exercise#

  • Navigate to the following places in your home directory:

    1. Make sure you are in the home directory

    2. Go to casimir_programming_course

    3. List the contents of this directory

    4. List the contents of day2 without switching to this directory first

    5. Go to day3

    6. List the contents of this directory

    7. Go back to the parent directory (i.e. you should then be in /home/learner/casimir_programming_course; check with pwd!)

Solution#

  1. You can check with pwd or the command prompt if you are in the home directory. If not, use cd to go there.

  2. cd casimir_programming_course

  3. ls

  4. ls day2

  5. cd day3

  6. ls

  7. cd .. and then run pwd

Exercise#

  • After completing the previous exercise, you should be in /home/learner/casimir_programming_course/. Which of the following commands could you use to navigate to your home directory, which is /home/learner? Try them out and check where you end up! Make sure you always start from /home/learner/casimir_programming_course/ (e.g. execute cd /home/learner/casimir_programming_course/ to go back there if necessary)

    1. cd .

    2. cd /

    3. cd /home/amanda

    4. cd ../..

    5. cd ~

    6. cd home

    7. cd ~/data/..

    8. cd

    9. cd ..

Solution#

  1. No: . stands for the current directory. You will stay where you were.

  2. No: / stands for the root directory.

  3. No: Learner’s home directory is /home/learner.

  4. No: this goes up two levels, i.e. ends in /home.

  5. Yes: ~ stands for the user’s home directory, in this case /home/learner.

  6. No: this would navigate into a directory home in the current directory if it exists.

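  7. Yes: unnecessarily complicated, but correct, provided that a directory called data exists in your home directory (otherwise cd reports an error).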

  8. Yes: shortcut to go back to the user’s home directory.

  9. Yes: goes up one level.

Exercise#

  • Using the filesystem diagram below, if pwd displays /home/learner, what will ls -F ../backup display?

  1. ../backup: No such file or directory

  2. 2017-07 2017-08 2017-09

  3. 2017-07/ 2017-08/ 2017-09/

  4. base/ orig/ recent/

../_images/fs-check-question.svg

Solution#

  1. No: there is a directory backup in /home.

  2. No: this is the content of /home/learner/backup, but with .. we asked for one level further up.

  3. No: see previous explanation.

  4. Yes: ../backup/ refers to /home/backup/.

Exercise#

  • Assuming a directory structure as in the above figure, if pwd displays /home/backup, and -r tells ls to display things in reverse order, what command will display:

    recent/ orig/ base/
    
  1. ls pwd

  2. ls -r -F

  3. ls -r -F /home/backup

  4. Either #2 or #3 above, but not #1.

Solution#

  1. No: pwd is not the name of a directory.

  2. Yes: ls without directory argument lists files and directories in the current directory.

  3. Yes: uses the absolute path explicitly.

  4. Correct: see explanations above.

Manipulating files and directories#

So now we can move around the filesystem effectively and list the contents of directories as we move around. The next step is to be able to create, read, update and delete files and directories.

Directories are created with the mkdir (make directory) command, which takes the name of the directory to create as its argument.

learner@casimir:~$ mkdir work
learner@casimir:~$ 

Editing text files#

Simple! If we now run ls -F we will see that there is a directory called work in our current working directory, which we can cd into and create some files:

learner@casimir:~$ cd work
learner@casimir:~/work$ nano draft.txt

Above we use the nano command, providing the filename draft.txt as an argument. nano is a text editor, which will allow us to type some text and save it into the file provided as command line argument.

Note on text editors

When we say, “nano is a text editor,” we really do mean “text”: it can only work with plain character data, not tables, images, or any other human-friendly media. We use it in examples because it is one of the least complex text editors. However, because of this trait, it may not be powerful enough or flexible enough for the work you need to do after this course. On Unix systems (such as Linux and Mac OS X), many programmers use Emacs or Vim (both of which require more time to learn), or a graphical editor such as Gedit. On Windows, you may wish to use Notepad++. Windows also has a built-in editor called notepad that can be run from the command line in the same way as nano for the purposes of this lesson.

Once we’re happy with our text, we can press Ctrl-O (press the Ctrl or Control key and, while holding it down, press the O key) to write our data to disk (we’ll be asked what file we want to save this to: press Return to accept the suggested default of draft.txt).

Once our file is saved, we can use Ctrl-X to quit the editor and return to the shell.

Note on the control key

The Control key is also called the “Ctrl” key. There are various ways in which using the Control key may be described. For example, you may see an instruction to press the Control key and, while holding it down, press the X key, described as any of:

  • Control-X

  • Control+X

  • Ctrl-X

  • Ctrl+X

  • ^X

  • C-x

In nano, along the bottom of the screen you’ll see ^G Get Help ^O WriteOut. This means that you can use Control-G to get help and Control-O to save your file.

Naming tips for files or directories#

  • Don’t use whitespace

    The shell separates command arguments on whitespace, so having filenames with whitespace is problematic. While it is possible to get around this restriction, it is easier to just avoid whitespace in filenames.

  • Don’t begin names with a - (dash)

    Commands treat names starting with - as options.

  • Use only letters, numbers, . (period) and _ (underscore)

    Many other characters have special meaning to the shell or to other programs you may invoke. This may mean that you provide unintended options to certain commands, and may make them misbehave or result in data loss.
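The whitespace problem is easy to see in practice. The sketch below uses touch, which creates an empty file for each name it is given; the filenames are made up for the example, and everything happens in a throwaway directory.

```shell
demo=$(mktemp -d)
cd "$demo"

touch my notes.txt        # two arguments: creates "my" AND "notes.txt"
ls

touch "my notes.txt"      # quotes keep it as one name, but every later use needs them too
touch my_notes.txt        # an underscore avoids the problem altogether

file_count=$(ls | wc -l)  # wc -l counts lines, i.e. one per file
echo "$file_count"
```

The directory ends up with four files, two of which were created by accident.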

Copying, moving, and deleting files and directories#

We can use the cp command to copy a file from one place to another:

learner@casimir:~/work$ cp draft.txt previous-draft.txt
learner@casimir:~/work$ ls -F
draft.txt  previous-draft.txt
learner@casimir:~/work$ 

The above invocation of cp copied the file draft.txt to the file previous-draft.txt, creating it in the process.

Caution: if the target filename given to cp already exists, it will be overwritten.

We can also copy files between directories, so to copy draft.txt into the parent directory we could simply do the following:

learner@casimir:~/work$ cp draft.txt ..

We can remove files using the rm command:

learner@casimir:~/work$ rm previous-draft.txt 
learner@casimir:~/work$ ls -F
draft.txt

Caution: deleting is forever

The Unix shell doesn’t have a trash bin that we can recover deleted files from (though most graphical interfaces to Unix do). Instead, when we delete files, they are unhooked from the file system so that their storage space on disk can be recycled. Tools for finding and recovering deleted files do exist, but there’s no guarantee they’ll work in any particular situation, since the computer may recycle the file’s disk space right away.

By default, rm will not work on directories:

learner@casimir:~/work$ cd ..
learner@casimir:~$ rm work
rm: cannot remove 'work': Is a directory
learner@casimir:~$ 

We need to supply the -r flag (for recursive):

learner@casimir:~$ rm -r work
learner@casimir:~$ 

For extra safety, you can use the -i flag to rm, which will ask you to confirm the deletion of each of the files inside work in turn.

Now let’s recreate the work directory and move draft.txt (which we previously copied into our home directory) into it.

For this we will use the mv (move) command.

learner@casimir:~$ mkdir work
learner@casimir:~$ mv draft.txt work
learner@casimir:~$ ls -F work
draft.txt

cp, mv and rm all expect paths as arguments. In the preceding examples we used relative paths to refer to files and directories in our current working directory, but we can also use these same programs to operate on files in locations other than our current working directory.
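To illustrate operating on files outside the current working directory, the sketch below repeats the copy, move and remove steps from above while sitting in the root directory, referring to every file by its absolute path. The variable home_dir is a temporary directory that stands in for the home directory, so nothing in your real home is touched.

```shell
home_dir=$(mktemp -d)                  # stands in for the home directory
mkdir "$home_dir/work"
touch "$home_dir/work/draft.txt"

cd /                                   # work from somewhere else entirely
cp "$home_dir/work/draft.txt" "$home_dir/work/previous-draft.txt"
mv "$home_dir/work/draft.txt" "$home_dir"      # move draft.txt up into home_dir
rm "$home_dir/work/previous-draft.txt"

ls -F "$home_dir"                      # draft.txt and the now empty work/
```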

Protip: shell autocompletion

If your file and directory names are descriptive (and they should be!) they may become somewhat long and cumbersome to type manually.

Luckily the shell provides autocompletion for paths. While typing out a filename, simply hit the Tab key to have the shell autocomplete the path for you. If the shell cannot unambiguously determine which file you want, it will not autocomplete; if you hit the Tab key a second time it will print the possible autocompletion options. You will then need to carry on typing the filename until the shell is able to determine which one you mean, at which point you can hit Tab again to have the shell autocomplete.

Protip: shell history

Now that we are composing commands into more complicated ones we shall look at another shell productivity tool: history.

The shell keeps around a history of all the commands that you send to it (this is configurable but usually defaults to ~1000 commands). Pressing the Up Arrow key at a blank prompt will allow you to cycle through your most recently executed commands. Before executing a command from the history, you also have the opportunity to edit it. This is incredibly useful if you are performing repetitive tasks.

You can also use the history command to print out the command history to the screen.

It is also possible to search through the command history. To activate this, press Ctrl-R and begin typing part of the command that you are looking for. The shell will display the match that occurred most recently in the history, and you can press Ctrl-R again to cycle through matches further back in time. Pressing any other key will stop the search, with the selected command already typed for you on the command line.

✓ Mini exercises#

Exercise#

We have seen how to create text files using the nano editor. Now, try the following command in your home directory:

touch my_file.txt

What did the touch command do? When you look at your home directory using the GUI file explorer, does the file show up?

Use ls -l to inspect the files. How large is my_file.txt?

When might you want to create a file this way?

Solution#

  1. The touch command generates a new file called my_file.txt in your home directory. If you are in your home directory, you can observe this newly generated file by typing ls at the command line prompt. my_file.txt can also be viewed in your GUI file explorer.

  2. When you inspect the file with ls -l, note that the size of my_file.txt is 0 bytes. In other words, it contains no data. If you open my_file.txt using your text editor it is blank.

  3. Some programs do not generate output files themselves, but instead require that empty files have already been generated. When the program is run, it searches for an existing file to populate with its output. The touch command allows you to efficiently generate a blank text file to be used by such programs.
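The size can also be checked directly. The sketch below repeats the exercise in a throwaway directory and uses wc -c, which counts bytes; the < symbol feeds the file to wc as its input, a feature not otherwise covered in this section.

```shell
demo=$(mktemp -d)
cd "$demo"
touch my_file.txt

ls -l my_file.txt                 # the size column shows 0
size=$(wc -c < my_file.txt)       # wc -c counts the bytes it reads
echo "$size"
```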

Exercise#

We now consider the case that you created a file with a wrong filename.

Use nano to create a file called statstics.txt (you may write anything you want in there)

After creating and saving this file you realize you misspelled the filename! You want to correct the mistake. Which of the following commands could you use to do so? Think about it and try them out!

  1. cp statstics.txt statistics.txt

  2. mv statstics.txt statistics.txt

  3. mv statstics.txt .

  4. cp statstics.txt .

Solution#

First, you need to run nano statstics.txt, enter some text, and then use Ctrl-O and finally Ctrl-X to close.

Do the suggested commands do what you want?

  1. No. While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.

  2. Yes, this would work to rename the file.

  3. No, the period (.) indicates where to move the file, but does not provide a new file name; identical file names cannot be created.

  4. No, the period (.) indicates where to copy the file, but does not provide a new file name; identical file names cannot be created.

Exercise#

We have prepared some example files for you. Go to data1 in the folder day3 of this programming course. What is the output of the closing ls command in the sequence shown below? First think about it, then try out the commands.

learner@casimir:~/casimir_programming_course/day3/data1$ pwd
/home/learner/casimir_programming_course/day3/data1
learner@casimir:~/casimir_programming_course/day3/data1$ ls
proteins.dat
learner@casimir:~/casimir_programming_course/day3/data1$ mkdir recombine
learner@casimir:~/casimir_programming_course/day3/data1$ mv proteins.dat recombine
learner@casimir:~/casimir_programming_course/day3/data1$ cp recombine/proteins.dat ../proteins-saved.dat
learner@casimir:~/casimir_programming_course/day3/data1$ ls

  1. proteins-saved.dat recombine

  2. recombine

  3. proteins.dat recombine

  4. proteins-saved.dat

Finally, delete proteins-saved.dat again.

Solution#

Use cd ~/casimir_programming_course/day3/data1 to go to the correct place. The first command creates a new folder called recombine. The second command moves (mv) the file proteins.dat into the new folder (recombine). The third command makes a copy of the file we just moved. The tricky part here is where the file was copied to. Recall that .. means “go up a level”, so the copied file is now in /home/learner/casimir_programming_course/day3. Notice that .. is interpreted with respect to the current working directory, not with respect to the location of the file being copied. So, the only thing that ls will show is the recombine folder.

  1. No, see explanation above. proteins-saved.dat is located at /home/learner/casimir_programming_course/day3

  2. Yes.

  3. No, see explanation above. proteins.dat is located at /home/learner/casimir_programming_course/day3/data1/recombine

  4. No, see explanation above. proteins-saved.dat is located at /home/learner/casimir_programming_course/day3

Finally, run rm ../proteins-saved.dat
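To see how the relative paths in the cp command are resolved, the following sketch joins each of them onto the working directory printed by pwd in the exercise. It uses posixpath only to manipulate the path strings; no files are touched.

```python
import posixpath

# The working directory shown by `pwd` in the exercise.
cwd = "/home/learner/casimir_programming_course/day3/data1"

# Both arguments of `cp recombine/proteins.dat ../proteins-saved.dat`
# are relative paths, so both are resolved against the working directory.
source = posixpath.normpath(posixpath.join(cwd, "recombine/proteins.dat"))
destination = posixpath.normpath(posixpath.join(cwd, "../proteins-saved.dat"))

print(source)
print(destination)
```

The destination ends up one level above data1, regardless of where the source file lives.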

Exercise#

Go to data2 in the folder day3. Imagine you are working on a project and see that your files aren’t very well organized:

learner@casimir:~/casimir_programming_course/day3/data2$ ls -F
fructose.dat   sucrose.dat

The fructose.dat and sucrose.dat files contain output from your data analysis. First, create two empty directories called analyzed and raw. What command(s) covered so far do you then need to run so that the commands below will produce the output shown?

learner@casimir:~/casimir_programming_course/day3/data2$ ls -F
analyzed/   raw/
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F analyzed
fructose.dat    sucrose.dat

Solution#

mv *.dat analyzed

You need to move your files fructose.dat and sucrose.dat to the analyzed directory. The shell will expand *.dat to match all .dat files in the current directory. The mv command then moves the list of .dat files to the analyzed directory.
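The expansion step can be imitated in Python with the fnmatch module. This is a sketch of the idea, not of how bash is implemented; the list of names is written out by hand to mirror the directory in the exercise.

```python
import fnmatch

# Hand-written directory listing: the two data files plus the new directories.
names = ["analyzed", "fructose.dat", "raw", "sucrose.dat"]

# The shell replaces *.dat with the matching names before mv starts...
expanded = sorted(fnmatch.filter(names, "*.dat"))

# ...so mv receives this argument list and never sees the asterisk.
argv = ["mv", *expanded, "analyzed"]
print(argv)
```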

Exercise#

After running the following commands, you realize that you put the file sucrose.dat into the wrong folder:

learner@casimir:~/casimir_programming_course/day3/data2$ ls -F
raw/ analyzed/
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F analyzed
fructose.dat sucrose.dat
learner@casimir:~/casimir_programming_course/day3/data2$ cd raw/

Fill in the blanks to move sucrose.dat to the current folder (i.e., the one you are currently in):

learner@casimir:~/casimir_programming_course/day3/data2/raw$ mv ___/sucrose.dat  ___

Solution#

mv ../analyzed/sucrose.dat .

Recall that .. refers to the parent directory (i.e. one above the current directory) and that . refers to the current directory.

Python scripts and the shell#

Let’s try and make a poor-man’s clone of the echo command in Python.

The Python sys module has an attribute called argv, which is a list of the arguments that were provided to the script on the command line (the first entry is the name of the script itself). echo just takes its arguments and prints them to stdout, so in Python we could implement this as:

#!/opt/conda/bin/python

import sys

# remove the program name from 'argv'
arguments = sys.argv[1:]

print(' '.join(arguments))

We can save this as echo.py, add execute permissions with chmod (chmod +x echo.py), and run it:

learner@casimir:~$ ./echo.py hello world
hello world

Making a better CLI for echo.py#

echo.py as it is written does not provide a very good CLI. Notably it does not have any documentation accessible from the command line:

learner@casimir:~$ ./echo.py --help
--help
learner@casimir:~$ 

This means that anyone who merely wants to use our script will need to go digging round in the source code.

The argparse module in the Python standard library provides tools for making nice CLIs:

#!/opt/conda/bin/python

import argparse

parser = argparse.ArgumentParser(description='Echo arguments to stdout')
parser.add_argument('word', nargs='*', help="word to print")

args = parser.parse_args()

print(' '.join(args.word))

It gives you things like automatic help-page generation:

learner@casimir:~$ ./echo.py --help
usage: echo.py [-h] [word [word ...]]

Echo arguments to stdout

positional arguments:
  word        word to print

optional arguments:
  -h, --help  show this help message and exit
learner@casimir:~$ 

And reasonable error messages if an unrecognized argument is provided:

learner@casimir:~$ ./echo.py --options
usage: echo.py [-h] [word [word ...]]
echo.py: error: unrecognized arguments: --options
learner@casimir:~$
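The parser can also be tried out without going through the shell, because parse_args accepts an explicit list of arguments in place of the command line. The sketch below wraps the parser from echo.py in a helper function; the name build_parser is our own choice, not something argparse requires.

```python
import argparse


def build_parser():
    """Return the same parser as in echo.py (the helper name is our own)."""
    parser = argparse.ArgumentParser(description="Echo arguments to stdout")
    parser.add_argument("word", nargs="*", help="word to print")
    return parser


# Passing a list to parse_args() stands in for the command-line arguments.
args = build_parser().parse_args(["hello", "world"])
message = " ".join(args.word)
print(message)
```

Keeping the parser in a function like this makes it easy to test the command-line interface from a notebook.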

Installing software#

Now we know how to run existing software from the command line and how to write our own command-line driven software. In this section we will cover the basics of installing software from the command line.

This is the bare minimum of what will be required for this course; we will not cover installing software from source code distributions etc.

Installing Python packages#

There are two tools that you need to be aware of for installing Python packages onto your computer: pip and conda.

pip is the official Python package manager, and comes installed by default with most distributions of Python. It downloads and installs packages from PyPI (Python Package Index): https://pypi.python.org

conda is the package manager for the Anaconda Python distribution (which we are using for this course). It downloads and installs packages from the Anaconda website: https://repo.continuum.io/pkgs/

Anyone who distributes their Python package will (in the overwhelming majority of cases) first put it on PyPI, as this is the place that is visible to and known by the most people. For this reason PyPI can be considered the “canonical” place to get the most up to date versions of packages.

The caveat to this is that PyPI and pip do not know anything about dependencies external to the Python ecosystem. This means that if the Python package that you want to use depends on some external library not written in Python, then you will need to install this yourself by some other means (see below). conda, on the other hand, does have non-Python packages. The disadvantage of conda is that not all packages available on PyPI will have corresponding conda packages, or the conda packages available may not be up to date with the latest version available on PyPI.

To make the situation even more complicated, sometimes packages on PyPI will come with the non-Python dependencies already linked (i.e. things will “just work”). This is typical of the more widely used packages, which are looking for a smoother user experience (e.g. opencv and ZeroMQ).

If you think all this is insane, you are in good company.

TL; DR#

For the purposes of this course you can just find a package by searching PyPI to find the package name you want, and then

pip install my-sweet-package
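Once a package is installed, one way to confirm this from Python is to ask for its version through the standard-library importlib.metadata module (available from Python 3.8). The helper name installed_version is made up for this sketch, and my-sweet-package is the placeholder name from above, so the call below will most likely print None.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# 'my-sweet-package' is a placeholder, not a package we expect to find.
print(installed_version("my-sweet-package"))
```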

Installing other packages#

All Linux distros come with a package manager pre-installed. There are also package managers for OSX and Windows, but they are less commonly used: https://en.wikipedia.org/wiki/List_of_software_package_management_systems

The package manager handles installing and updating all the programs and libraries on the computer, taking into account all the dependencies.

Package managers do typically have many Python packages in their repositories, but due to the typically long vetting procedure and release schedule, they often only have out of date versions. This is why we recommend using pip or conda to install your Python packages.

You will need to use the distribution’s package manager for everything else. In this course we have an environment based on Ubuntu 16.04, which uses the apt package manager. There is an exhaustive list of packages available online: https://packages.ubuntu.com/xenial/

To install a package my-sweet-package we say

apt-get install my-sweet-package

Running this produces the following:

learner@casimir:~$ apt-get install my-sweet-package
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?

apt-get complains that it cannot install the package because we do not have sufficient permissions. This is because we are running apt-get as our regular user, learner, who does not have write permissions on the directories into which apt-get wants to install things.

We can use the sudo command to execute things as the super user, root:

sudo apt-get install my-sweet-package

The first argument to sudo is the command to run as root, and the remaining arguments are passed to that command.

Advanced topics (for self-study, if you are interested!)#

Joining commands together#

Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways.

In the sonnets directory (next to this notebook) there is a collection of files containing the sonnets of William Shakespeare, which we will use in what follows.

learner@casimir:~$ pwd
/home/learner
learner@casimir:~$ cp -r casimir_programming_course/day3/sonnets .
learner@casimir:~$ cd sonnets
learner@casimir:~/sonnets$ 

In this directory there is a collection of files with names like sonnet_001.txt and sonnet_124.txt, each with the corresponding Shakespeare sonnet in them.

We can use the wc (word count) command with the -w flag to get the number of words in each of these files:

learner@casimir:~/sonnets$ wc -w sonnet_*.txt
  106 sonnet_001.txt
  115 sonnet_002.txt
  115 sonnet_003.txt
  104 sonnet_004.txt
  104 sonnet_005.txt
  110 sonnet_006.txt
  100 sonnet_007.txt
  110 sonnet_008.txt
  118 sonnet_009.txt
  114 sonnet_010.txt
  116 sonnet_011.txt
  118 sonnet_012.txt
  110 sonnet_013.txt
  112 sonnet_014.txt
  111 sonnet_015.txt
  110 sonnet_016.txt
  123 sonnet_017.txt
  114 sonnet_018.txt
  115 sonnet_019.txt
  114 sonnet_020.txt
  117 sonnet_021.txt
  122 sonnet_022.txt
  114 sonnet_023.txt
  120 sonnet_024.txt
  108 sonnet_025.txt
  118 sonnet_026.txt
  110 sonnet_027.txt
  116 sonnet_028.txt
  115 sonnet_029.txt
  116 sonnet_030.txt
  113 sonnet_031.txt
  114 sonnet_032.txt
  110 sonnet_033.txt
  120 sonnet_034.txt
  106 sonnet_035.txt
  113 sonnet_036.txt
  116 sonnet_037.txt
  114 sonnet_038.txt
  119 sonnet_039.txt
  121 sonnet_040.txt
  110 sonnet_041.txt
  130 sonnet_042.txt
  122 sonnet_043.txt
  117 sonnet_044.txt
  105 sonnet_045.txt
  115 sonnet_046.txt
  123 sonnet_047.txt
  117 sonnet_048.txt
  114 sonnet_049.txt
  118 sonnet_050.txt
  117 sonnet_051.txt
  109 sonnet_052.txt
  107 sonnet_053.txt
  112 sonnet_054.txt
  106 sonnet_055.txt
  112 sonnet_056.txt
  117 sonnet_057.txt
  112 sonnet_058.txt
  109 sonnet_059.txt
  108 sonnet_060.txt
  121 sonnet_061.txt
  105 sonnet_062.txt
  110 sonnet_063.txt
  111 sonnet_064.txt
  112 sonnet_065.txt
   87 sonnet_066.txt
  106 sonnet_067.txt
  108 sonnet_068.txt
  119 sonnet_069.txt
  110 sonnet_070.txt
  122 sonnet_071.txt
  118 sonnet_072.txt
  121 sonnet_073.txt
  113 sonnet_074.txt
  117 sonnet_075.txt
  114 sonnet_076.txt
  107 sonnet_077.txt
  110 sonnet_078.txt
  117 sonnet_079.txt
  114 sonnet_080.txt
  116 sonnet_081.txt
  105 sonnet_082.txt
  116 sonnet_083.txt
  114 sonnet_084.txt
  110 sonnet_085.txt
  110 sonnet_086.txt
  118 sonnet_087.txt
  112 sonnet_088.txt
  112 sonnet_089.txt
  121 sonnet_090.txt
  114 sonnet_091.txt
  120 sonnet_092.txt
  114 sonnet_093.txt
  106 sonnet_094.txt
  112 sonnet_095.txt
  120 sonnet_096.txt
  107 sonnet_097.txt
  118 sonnet_098.txt
  125 sonnet_099.txt
  111 sonnet_100.txt
  114 sonnet_101.txt
  117 sonnet_102.txt
  117 sonnet_103.txt
  117 sonnet_104.txt
  105 sonnet_105.txt
  110 sonnet_106.txt
  113 sonnet_107.txt
  115 sonnet_108.txt
  116 sonnet_109.txt
  119 sonnet_110.txt
  110 sonnet_111.txt
  114 sonnet_112.txt
  117 sonnet_113.txt
  114 sonnet_114.txt
  115 sonnet_115.txt
  109 sonnet_116.txt
  110 sonnet_117.txt
  112 sonnet_118.txt
  113 sonnet_119.txt
  117 sonnet_120.txt
  115 sonnet_121.txt
  105 sonnet_122.txt
  116 sonnet_123.txt
  109 sonnet_124.txt
  105 sonnet_125.txt
   96 sonnet_126.txt
  111 sonnet_127.txt
  111 sonnet_128.txt
  110 sonnet_129.txt
  123 sonnet_130.txt
  119 sonnet_131.txt
  114 sonnet_132.txt
  122 sonnet_133.txt
  121 sonnet_134.txt
  116 sonnet_135.txt
  124 sonnet_136.txt
  124 sonnet_137.txt
  117 sonnet_138.txt
  120 sonnet_139.txt
  118 sonnet_140.txt
  118 sonnet_141.txt
  117 sonnet_142.txt
  119 sonnet_143.txt
  113 sonnet_144.txt
   97 sonnet_145.txt
  112 sonnet_146.txt
  107 sonnet_147.txt
  123 sonnet_148.txt
  119 sonnet_149.txt
  118 sonnet_150.txt
  119 sonnet_151.txt
  124 sonnet_152.txt
  109 sonnet_153.txt
  107 sonnet_154.txt
17516 total

We used the pattern sonnet_*.txt, where the * is a wildcard that matches any number of characters, to match all the sonnet files. We did this to avoid counting the LICENSE.txt file.

By default (without any flags) wc prints several pieces of information about each file it receives as an argument: line, word and character counts. With the -w flag, as used here, it prints only the word counts. The last line gives the totals over all files.

This is a lot of output — it fills up more than one screen of text!

Output redirection#

We can redirect the output of the command to a file like so:

learner@casimir:~/sonnets$ wc -w sonnet_*.txt > word_counts.txt
learner@casimir:~/sonnets$

The character > used in this way is interpreted specially by the shell. The filename to the right of the > indicates where the output of the command on the left of the > should go. The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution.

The sequence >> also has a special meaning. It means to redirect output (as with >), but to append to the file on the right, rather than to overwrite.
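Python's open function makes the same distinction through its mode argument. The sketch below assumes that mode "w" is a fair analogue of > and mode "a" of >>; the file names are made up and live in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    overwritten = Path(tmp) / "overwritten.txt"
    appended = Path(tmp) / "appended.txt"

    for _ in range(2):
        # Mode "w" empties the file first, like `echo hello > file`.
        with open(overwritten, "w") as f:
            f.write("hello\n")
        # Mode "a" adds to the end, like `echo hello >> file`.
        with open(appended, "a") as f:
            f.write("hello\n")

    overwritten_text = overwritten.read_text()
    appended_text = appended.read_text()

print(repr(overwritten_text))
print(repr(appended_text))
```

After two rounds the first file holds a single line and the second holds two.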

Imagine now that we wanted to find which sonnet has the most words in it. Sure we can manually inspect word_counts.txt, but surely there’s a better way?

There is a command called sort that we can use to sort the lines of word_counts.txt. The -n flag to sort specifies that we should compare lines by interpreting them as integers.

learner@casimir:~/sonnets$ sort -n word_counts.txt > sorted_word_counts.txt
learner@casimir:~/sonnets$

Then we can use the tail command to print only the last few lines of sorted_word_counts.txt

learner@casimir:~/sonnets$ tail sorted_word_counts.txt 
  123 sonnet_017.txt
  123 sonnet_047.txt
  123 sonnet_130.txt
  123 sonnet_148.txt
  124 sonnet_136.txt
  124 sonnet_137.txt
  124 sonnet_152.txt
  125 sonnet_099.txt
  130 sonnet_042.txt
17516 total
learner@casimir:~/sonnets$ 

Pipes#

This is all very well, but somewhat cumbersome if we don’t want to keep all these intermediate files around.

We can do this more succinctly by running sort and tail together:

learner@casimir:~/sonnets$ sort -n word_counts.txt | tail
  123 sonnet_017.txt
  123 sonnet_047.txt
  123 sonnet_130.txt
  123 sonnet_148.txt
  124 sonnet_136.txt
  124 sonnet_137.txt
  124 sonnet_152.txt
  125 sonnet_099.txt
  130 sonnet_042.txt
17516 total
learner@casimir:~/sonnets$ 

The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.

We are, of course, not limited to using a single pipe in a command, so we can also pipe the output of wc into sort, and then the output of sort into tail:

learner@casimir:~/sonnets$ wc -w sonnet_*.txt | sort -n | tail
  123 sonnet_017.txt
  123 sonnet_047.txt
  123 sonnet_130.txt
  123 sonnet_148.txt
  124 sonnet_136.txt
  124 sonnet_137.txt
  124 sonnet_152.txt
  125 sonnet_099.txt
  130 sonnet_042.txt
17516 total
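The three stages of this pipeline can be mirrored step by step in Python. The sketch below uses a few made-up miniature texts instead of the real sonnet files, and it does not produce the total line that wc adds.

```python
# Made-up miniature "sonnets"; the real files live in the sonnets directory.
texts = {
    "a.txt": "one two three",
    "b.txt": "one",
    "c.txt": "one two three four five",
    "d.txt": "one two",
}

# Stage 1 (wc -w): count the words in each text.
counts = [(len(body.split()), name) for name, body in texts.items()]

# Stage 2 (sort -n): order by word count, smallest first.
ordered = sorted(counts)

# Stage 3 (tail -n 2): keep only the last two entries.
longest = ordered[-2:]
print(longest)
```

Each stage consumes the result of the previous one, which is exactly the role the pipe plays between the shell commands.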

What is actually going on#

Here’s what actually happens behind the scenes when we create a pipe. When a computer runs a program — any program — it creates a process in memory to hold the program’s software and its current state. Every process has an input channel called standard input. (By this point, you may be surprised that the name is so memorable, but don’t worry: most programmers call it stdin). Every process also has a default output channel called standard output (or stdout). A second output channel called standard error (stderr) also exists. This channel is typically used for error or diagnostic messages, and it allows a user to pipe the output of one program into another while still receiving error messages in the terminal.

The shell is actually just another program. Under normal circumstances, whatever we type on the keyboard is sent to the shell on its standard input, and whatever it produces on standard output is displayed on our screen. When we tell the shell to run a program, it creates a new process and temporarily sends whatever we type on our keyboard to that process’s standard input, and whatever the process sends to standard output to the screen.

Here’s what happens when we run wc -w sonnet_*.txt > word_counts.txt. The shell starts by telling the computer to create a new process to run the wc program. Since we’ve provided some filenames as arguments, wc reads from them instead of from standard input. And since we’ve used > to redirect output to a file, the shell connects the process’s standard output to that file.

If we run wc -w sonnet_*.txt | sort -n instead, the shell creates two processes (one for each command in the pipe) so that wc and sort run simultaneously. The standard output of wc is fed directly to the standard input of sort; since there’s no redirection with >, sort’s output goes to the screen. And if we run wc -w sonnet_*.txt | sort -n | tail, we get three processes with data flowing from the files, through wc to sort, and from sort through tail to the screen.

../_images/pipes.svg
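The three channels can be observed from Python with the subprocess module. In this sketch the child process is a second Python interpreter running a tiny made-up program, so that the example does not depend on any particular shell command being installed.

```python
import subprocess
import sys

# A tiny child program: upper-case stdin to stdout, and send a note to stderr.
child_code = (
    "import sys\n"
    "sys.stdout.write(sys.stdin.read().upper())\n"
    "sys.stderr.write('done\\n')\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="hello\n",       # becomes the child's standard input
    capture_output=True,   # collect stdout and stderr separately
    text=True,
)

print(repr(result.stdout))
print(repr(result.stderr))
```

Because standard output and standard error are separate channels, the diagnostic note does not get mixed into the data that a following program would receive.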

✓ Check Questions#

Question#

What is the difference between:

learner@casimir:~$ echo hello > testfile01.txt

and:

learner@casimir:~$ echo hello >> testfile02.txt

Hint: Try executing each command twice in a row and then examining the output files.

Solution#

The first command (with >) overwrites any content in testfile01.txt, whereas the second command (with >>) appends to the existing content of testfile02.txt.

Exercise#

In our current directory, we want to find the 3 files which have the fewest lines. Which command listed below would work?

  1. wc -l * > sort -n > head -n 3

  2. wc -l * | sort -n | head -n 1-3

  3. wc -l * | head -n 3 | sort -n

  4. wc -l * | sort -n | head -n 3

Solution#

Option 4 is the solution. The pipe character | is used to feed the standard output from one process to the standard input of another. > is used to redirect standard output to a file.

Question#

A file called animals.txt contains the following data:

2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear

What text passes through each of the pipes and the final redirect in the pipeline below?

learner@casimir:~$ cat animals.txt | head -n 5 | tail -n 3 | sort -r > final.txt

Hint: build the pipeline up one command at a time to test your understanding.

Solution#

The contents of animals.txt pass verbatim through the first pipe:

2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear

Only the first 5 lines pass through the second pipe:

2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer

The last 3 lines of the first 5 lines (i.e. lines 3, 4, and 5) pass through the third pipe:

2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer

The last command sorts the lines in reverse lexicographic ordering:

2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon

And writes this into the file final.txt.
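The same steps can be checked in Python with list slicing. This sketch assumes that plain string comparison orders these particular lines the same way sort does, which holds here because they consist only of digits, hyphens, commas and lower-case letters.

```python
animals = """2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
""".splitlines()

first_five = animals[:5]                  # head -n 5
last_three = first_five[-3:]              # tail -n 3
final = sorted(last_three, reverse=True)  # sort -r

for line in final:
    print(line)
```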

Writing CLI programs#

So far we have used commands that are already installed on our system. Because the shell is a general purpose tool for launching programs, it should come as no surprise that it is also possible to write your own programs, with which you can interact from the shell.

Shell scripts#

The simplest way to write such a program is just to put a bunch of shell commands into a file, one per line:

echo "The longest of the Shakespeare sonnets are:"

# print the 4 longest Shakespeare sonnets, and a summary line about all the sonnets
wc -l sonnet_*.txt | sort -n | tail -5

We name this file commands.sh, and we can run the commands using the bash command (we get the shell to call an instance of itself):

learner@casimir:~/sonnets$ bash commands.sh 
The longest of the Shakespeare sonnets are:
   13 sonnet_152.txt
   13 sonnet_153.txt
   13 sonnet_154.txt
   14 sonnet_099.txt
 2001 total
learner@casimir:~/sonnets$ 

In the above we explicitly said that we wanted the bash program to interpret and execute the commands contained in the commands.sh file.

This is all very well, but what if we give this script to somebody else? We have to also tell them that they need to run it by providing it as an argument to bash. Is there a better way?

It turns out there is. When you execute a command in the shell, the shell will look for a file with that name, will open it and will attempt to interpret what is inside. Commands such as ls are binary files that the shell can ask the operating system to execute directly; however, the shell can also interpret text files.

If the first line of a program file starts with the special character sequence #! (pronounced “shebang”), followed by the path to another program, the operating system will run that other program and hand it the path of the program file as an argument, so that it can interpret and execute the contents of the file.

We should thus modify our script to be:

#!/bin/bash

echo "The longest of the Shakespeare sonnets are:"

# print the 4 longest Shakespeare sonnets, and a summary line about all the sonnets
wc -l sonnet_*.txt | sort -n | tail -5

We can now run our script just by executing the command ./commands.sh.

We need the leading ./ to tell the shell to look in the current directory for the program; by default the shell looks in several special places (such as /bin and /usr/bin) for programs, but not in the current working directory.

learner@casimir:~/sonnets$ ./commands.sh
-bash: ./commands.sh: Permission denied

An error! The shell told us that we do not have permission to execute commands.sh. Why is this?

Permissions#

Every file and directory has a certain set of permissions associated with it. We can inspect these permissions using ls with the -l flag:

learner@casimir:~/sonnets$ ls -l commands.sh
-rw-r--r-- 1 learner users 188 Nov  4 12:54 commands.sh

../_images/ls_output.svg

../_images/permissions.svg

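In the first column, -rw-r--r--, the leading character gives the file type (- for a regular file) and the remaining nine characters are three groups of rwx (read, write, execute) permissions: one for the owner, one for the file’s group, and one for everyone else. None of the groups contains an x, so we cannot execute commands.sh because we have not set the x permission.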

We can do so with the chmod command:

learner@casimir:~/sonnets$ chmod +x commands.sh 
learner@casimir:~/sonnets$ ls -l commands.sh 
-rwxr-xr-x 1 learner users 188 Nov  4 12:54 commands.sh
learner@casimir:~/sonnets$ 

We just gave the owner user, users in the file’s group and everyone else permission to execute commands.sh.
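The same change can be made from Python with os.chmod and the constants in the stat module. This sketch assumes a Linux or other POSIX system, works on a throwaway copy of the script in a temporary directory, and sets the execute bit for all three groups explicitly.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "commands.sh")
    with open(script, "w") as f:
        f.write("#!/bin/bash\necho hello\n")

    os.chmod(script, 0o644)  # start from rw-r--r--, as in the first listing
    before = stat.filemode(os.stat(script).st_mode)

    # Add the execute bit for the owner, the group and everyone else.
    execute_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    current = stat.S_IMODE(os.stat(script).st_mode)
    os.chmod(script, current | execute_bits)
    after = stat.filemode(os.stat(script).st_mode)

print(before)
print(after)
```

stat.filemode produces the same ten-character string that ls -l shows in its first column.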

Now we can execute commands.sh directly:

learner@casimir:~/sonnets$ ./commands.sh 
The longest of the Shakespeare sonnets are:
   13 sonnet_152.txt
   13 sonnet_153.txt
   13 sonnet_154.txt
   14 sonnet_099.txt
 2001 total
learner@casimir:~/sonnets$ 

Avoiding calling python script.py all the time#

In the above we provided /bin/bash as the interpreter for our shell script in the shebang line.

Similarly, we can provide python as an interpreter for our Python scripts, if we wish to invoke them directly from the command line without writing python all the time.

In the computational environment provided for this course, the python interpreter is found at /opt/conda/bin/python, so our shebang line for Python scripts should be:

#!/opt/conda/bin/python
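On a different machine the interpreter may well live somewhere else. One way to find out, sketched below, is to ask a running Python for the path of its own executable via sys.executable and build the shebang line from that.

```python
import sys

# Absolute path of the interpreter that is running this code.
interpreter = sys.executable

# The corresponding first line for a script on this machine.
shebang = "#!" + interpreter
print(shebang)
```

In the course environment this should print the line shown above; elsewhere the path will differ.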

Python or Bash?#

Now we’ve seen how you can make scripts in Bash and Python that you can run from the command line.

If you just need to chain a few shell commands together, a Bash script is probably the simplest way to do this.

If, however, you find yourself needing more complicated logic (conditionals and loops) we strongly recommend writing a Python script to do the job for you. While if and for are possible from the shell, they are very unintuitive and difficult to write in a robust way.

The os and shutil modules in the Python standard library provide shell-like functionality for manipulating files and changing directory. If you ever find that you really need to call a shell command, you can use the subprocess module and parse the command output.
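As an illustration of the kind of logic that is easier to write in Python, the sketch below sorts files into two folders using a loop and a conditional. The file names, the folder names and the rule (files ending in .dat count as analyzed) are all made up for this example, and everything happens in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    # Made-up files: two results and one file of unprocessed measurements.
    for name in ["fructose.dat", "sucrose.dat", "notes.raw"]:
        (project / name).write_text("placeholder\n")

    (project / "analyzed").mkdir()
    (project / "raw").mkdir()

    # sorted() builds the full list first, so moving files is safe in the loop.
    for path in sorted(project.iterdir()):
        if path.is_dir():
            continue  # leave the two target folders alone
        folder = "analyzed" if path.suffix == ".dat" else "raw"
        shutil.move(str(path), str(project / folder / path.name))

    analyzed = sorted(p.name for p in (project / "analyzed").iterdir())
    raw = sorted(p.name for p in (project / "raw").iterdir())

print(analyzed)
print(raw)
```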