Introduction to Shell#
We started by learning the Python programming language and the Jupyter notebook, which are powerful tools for scientific computing. Now we are going to introduce another useful and ubiquitous tool: the shell.
License Information
Much of the material for this part of the course has been inspired by, or taken directly from, the Software Carpentry lesson “The Unix Shell”, which is graciously provided by Software Carpentry under the permissive CC BY 4.0 license.
The sonnets of William Shakespeare in the sonnets/ directory are used under the conditions of the Project Gutenberg license (a copy of which is included in sonnets/LICENSE.txt).
What is “the shell”#
A shell is a program that allows you to interact with the computer in a REPL (read-evaluate-print loop). You type in commands, the shell interprets these commands and interacts with the rest of the computer on your behalf, then it will perhaps print some output to the screen, and then allow you to type in another command. For example, instead of browsing through your folders to find a file and click on it to view it, you can type commands to browse and view in the shell.
There are many programs that act as shells, but the most common is bash, the “Bourne Again SHell”. This is the shell that we are going to learn today. While each shell program has its own idiosyncrasies, the principles and most of the syntax for operating them are the same.
Note on nomenclature
Often you will hear the terms “shell”, “terminal” and “command line” (CLI) used interchangeably. While there are technical differences between these terms, the distinction is not important for the purposes of this course.
The term “shell” itself evokes the idea that the program in question acts as a thin layer around the core of the “computer” (operating system + other tools).
Why bother?#
Most of our daily computing is done using GUIs (Graphical User Interfaces) driven mainly by a mouse and, increasingly, by touch screen interfaces. Given this, why should we learn to interact with the computer using a CLI (Command Line Interface), which requires use of a keyboard and provides much less rich input and output?
Here are several reasons why it is useful to have at least a basic proficiency in using the shell, in the context of scientific computing:
Operating in restricted environments#
There are computers that may not provide a GUI, for example:
High Performance Computing (HPC) clusters
controllers for experimental equipment (e.g. a Raspberry Pi)
web servers (e.g. when deploying your website)
Shell is an efficient environment for manipulating files and running programs#
The shell’s whole job is to make it easy to run other programs and to interact with files on the computer. When running programs from the shell you can provide arguments that will modify the program’s behaviour; this is typically more difficult or impossible when running programs by clicking on icons in a file browser.
In addition the shell provides other features, such as being able to use the output of one command as the input of another, and so being able to chain simple commands together into more complex ones.
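As a small illustration of chaining, the sketch below uses the | (pipe) operator to feed the output of printf into sort, and the output of sort into head. The three words are made-up sample data.

```shell
# Print three lines, sort them alphabetically, then keep only the first one.
first=$(printf 'pear\napple\nfig\n' | sort | head -n 1)
echo "$first"    # prints apple
```

Each program does one simple job; the shell only connects them.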
Writing programs that expose a CLI is very simple#
CLI-driven programs only need to know how to accept command line “arguments” and how to print text to the screen.
The operating system typically provides these mechanisms “for free”. In Python you can access the list of arguments provided with sys.argv (more on this later) and can print text by using the print function.
The consequence of this is that a lot of scientific software only provides a CLI. It is therefore important to learn how to use the shell to be able to operate this software.
Learning Goals#
There are three main aims for this part of the course:
Be able to launch and ascertain the “state” of your shell by executing basic commands#
This includes information such as which user you are logged in as, the directory in which you find yourself, and the files in the current directory
Be able to use the shell’s productivity tools#
This includes utilizing command completion (so that you avoid excessive typing) and searching through the command history (to re-use commands you previously executed). This also includes using more advanced shell features such as piping.
Be able to find documentation for CLI programs, and use the documentation to interact with these programs#
It is not the aim of this course to give you knowledge of any specific tools (beyond the basic ones that are common to essentially all shell usage). This is why it is important to be able to effectively find and understand documentation for CLI programs, so that you can solve problems yourself.
Launching a shell session and getting oriented#
You will usually launch a shell session from an existing graphical environment (your desktop!) by launching a “terminal” program that opens a window through which you can interact with the shell.
The location of this terminal program will depend on your operating system (google “how do I open a terminal in <X>” if you are unsure). Jupyter also provides a terminal program that runs in your web browser (screencast below).
The command prompt#
After launching the terminal you will see the following:
learner@casimir:~ $
This is the shell prompt; it’s asking you for input!
If you now type whoami you will see this appear to the right of the prompt:
learner@casimir:~ $ whoami
Now hitting the “enter” key will tell the shell to execute the command:
learner@casimir:~ $ whoami
learner
learner@casimir:~ $
The whoami command printed the current user (who the shell thinks we are) to the screen, and finished.
The shell then printed another prompt, indicating that it is ready for more input.
More specifically, when we hit “enter” after typing whoami the shell:
finds a program called whoami,
runs that program,
displays that program’s output, then
displays a new prompt to tell us that it’s ready for more commands.
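The first of these steps can be inspected directly. As a rough sketch, the shell built-in command -v reports where the shell would find a given program; the exact path printed depends on your system.

```shell
# Ask the shell where it would find the whoami program.
location=$(command -v whoami)
echo "$location"
```

On many Linux systems this prints something like /usr/bin/whoami, but that particular path is an assumption and may differ on your machine.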
Unknown commands#
If you try to get the shell to execute a command that it does not know about, it will print an error:
learner@casimir:~$ nonexistantcommand
-bash: nonexistantcommand: command not found
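Besides printing an error, the shell records a numeric exit status for every command it runs, available in the special variable $?. When bash cannot find a command, that status is 127. A minimal sketch, using the same deliberately made-up command name:

```shell
# Try to run a command that does not exist; hide the error message
# and capture the exit status instead.
status=0
nonexistantcommand 2>/dev/null || status=$?
echo "exit status: $status"    # prints exit status: 127
```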
Absolute Paths#
We can uniquely identify a file by giving its location relative to the root of the filesystem. This is called the absolute path of the file.
The absolute path of a file starts with a / and contains the “path” that you would follow from the root directory to reach the file in question, with each directory separated by /.
The path to the a.py file shown above would be written as
/home/learner/a.py
We can now see that the output of the pwd command above means that we are in the learner directory, which itself is in the home directory, which itself is in the root directory.
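The structure of an absolute path can be taken apart with the standard dirname and basename utilities. Both work on the text of the path only, so the file does not need to exist for this sketch to run.

```shell
path=/home/learner/a.py

# The directory part: everything before the last /
dirname "$path"     # prints /home/learner

# The final component: the file name itself
basename "$path"    # prints a.py
```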
Listing Files#
We can use the ls command to list files and directories in the current working directory:
learner@casimir:~$ ls
a.py casimir_programming_course f.txt id_rsa.pub shared Sync
This is OK but not very informative: which of the above are files, and which are directories, for example?
We can modify what the ls command does by using the flag -F (also known as a switch or option). This flag tells ls to add a trailing / to the names of directories:
learner@casimir:~$ ls -F
a.py casimir_programming_course/ f.txt id_rsa.pub shared/ Sync/
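To try this without touching your own files, the sketch below builds a small scratch directory (created with mktemp -d, so its name is random) containing one file and one directory. The names a.py and shared are borrowed from the listing above; touch, mkdir and rm are covered later in this lesson.

```shell
# Create a scratch directory with one empty file and one sub-directory.
scratch=$(mktemp -d)
touch "$scratch/a.py"
mkdir "$scratch/shared"

# Plain listing: names only.
ls "$scratch"

# With -F the directory is shown as shared/
listing=$(ls -F "$scratch")
echo "$listing"

# Clean up the scratch directory.
rm -r "$scratch"
```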
Finding documentation and manuals#
ls has lots of flags. We can pass the --help flag to ls to find out more:
learner@casimir:~$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
Mandatory arguments to long options are mandatory for short options too.
-a, --all do not ignore entries starting with .
-A, --almost-all do not list implied . and ..
--author with -l, print the author of each file
-b, --escape print C-style escapes for nongraphic characters
--block-size=SIZE scale sizes by SIZE before printing them; e.g.,
'--block-size=M' prints sizes in units of
1,048,576 bytes; see SIZE format below
-B, --ignore-backups do not list implied entries ending with ~
-c with -lt: sort by, and show, ctime (time of last
modification of file status information);
with -l: show ctime and sort by name;
otherwise: sort by ctime, newest first
-C list entries by columns
--color[=WHEN] colorize the output; WHEN can be 'always' (default
if omitted), 'auto', or 'never'; more info below
-d, --directory list directories themselves, not their contents
-D, --dired generate output designed for Emacs' dired mode
-f do not sort, enable -aU, disable -ls --color
-F, --classify append indicator (one of */=>@|) to entries
--file-type likewise, except do not append '*'
--format=WORD across -x, commas -m, horizontal -x, long -l,
single-column -1, verbose -l, vertical -C
--full-time like -l --time-style=full-iso
-g like -l, but do not list owner
--group-directories-first
group directories before files;
can be augmented with a --sort option, but any
use of --sort=none (-U) disables grouping
-G, --no-group in a long listing, don't print group names
-h, --human-readable with -l and/or -s, print human readable sizes
(e.g., 1K 234M 2G)
--si likewise, but use powers of 1000 not 1024
-H, --dereference-command-line
follow symbolic links listed on the command line
--dereference-command-line-symlink-to-dir
follow each command line symbolic link
that points to a directory
--hide=PATTERN do not list implied entries matching shell PATTERN
(overridden by -a or -A)
--indicator-style=WORD append indicator with style WORD to entry names:
none (default), slash (-p),
file-type (--file-type), classify (-F)
-i, --inode print the index number of each file
-I, --ignore=PATTERN do not list implied entries matching shell PATTERN
-k, --kibibytes default to 1024-byte blocks for disk usage
-l use a long listing format
-L, --dereference when showing file information for a symbolic
link, show information for the file the link
references rather than for the link itself
-m fill width with a comma separated list of entries
-n, --numeric-uid-gid like -l, but list numeric user and group IDs
-N, --literal print raw entry names (don't treat e.g. control
characters specially)
-o like -l, but do not list group information
-p, --indicator-style=slash
append / indicator to directories
-q, --hide-control-chars print ? instead of nongraphic characters
--show-control-chars show nongraphic characters as-is (the default,
unless program is 'ls' and output is a terminal)
-Q, --quote-name enclose entry names in double quotes
--quoting-style=WORD use quoting style WORD for entry names:
literal, locale, shell, shell-always,
shell-escape, shell-escape-always, c, escape
-r, --reverse reverse order while sorting
-R, --recursive list subdirectories recursively
-s, --size print the allocated size of each file, in blocks
-S sort by file size, largest first
--sort=WORD sort by WORD instead of name: none (-U), size (-S),
time (-t), version (-v), extension (-X)
--time=WORD with -l, show time as WORD instead of default
modification time: atime or access or use (-u);
ctime or status (-c); also use specified time
as sort key if --sort=time (newest first)
--time-style=STYLE with -l, show times using style STYLE:
full-iso, long-iso, iso, locale, or +FORMAT;
FORMAT is interpreted like in 'date'; if FORMAT
is FORMAT1<newline>FORMAT2, then FORMAT1 applies
to non-recent files and FORMAT2 to recent files;
if STYLE is prefixed with 'posix-', STYLE
takes effect only outside the POSIX locale
-t sort by modification time, newest first
-T, --tabsize=COLS assume tab stops at each COLS instead of 8
-u with -lt: sort by, and show, access time;
with -l: show access time and sort by name;
otherwise: sort by access time, newest first
-U do not sort; list entries in directory order
-v natural sort of (version) numbers within text
-w, --width=COLS set output width to COLS. 0 means no limit
-x list entries by lines instead of by columns
-X sort alphabetically by entry extension
-Z, --context print any security context of each file
-1 list one file per line. Avoid '\n' with -q or -b
--help display this help and exit
--version output version information and exit
The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Using color to distinguish file types is disabled both by default and
with --color=never. With --color=auto, ls emits color codes only when
standard output is connected to a terminal. The LS_COLORS environment
variable can change the settings. Use the dircolors command to set it.
Exit status:
0 if OK,
1 if minor problems (e.g., cannot access subdirectory),
2 if serious trouble (e.g., cannot access command-line argument).
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/ls>
or available locally via: info '(coreutils) ls invocation'
Many other commands have a --help flag.
Note that flags will often have a short version, which consists of a single - and a single character, and a long version, which consists of a -- and several characters.
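As a quick check of this, the sketch below compares the short flag -a with its long form --all, both taken from the help text above. It assumes GNU ls (the version whose help is shown here); other implementations of ls may not accept the long form.

```shell
scratch=$(mktemp -d)
touch "$scratch/.hidden" "$scratch/visible"

# Short form and long form of the same flag (GNU ls).
short=$(ls -a "$scratch")
long=$(ls --all "$scratch")

# Short flags can usually be combined: -aF behaves like -a -F.
ls -aF "$scratch"

rm -r "$scratch"
```

Both variables end up holding the same listing, including the entry starting with a dot.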
If you provide a flag that a command does not support, it will print an error:
learner@casimir:~$ ls --nonexistant-flag
ls: unrecognized option '--nonexistant-flag'
Try 'ls --help' for more information.
You can also find help on commands by using the man command:
man ls
This will display the “manual” for the ls command in the terminal, which you can scroll through with the arrow keys (use q to quit).
man pages are also available online at: https://linux.die.net/man/. This is often much more practical than using the man command.
While --help and man pages are usually complete, they are often poor at giving a feel for how a particular program should be used, with common usage examples. For this, google is your friend.
Relative Paths#
We can also use ls to list the files in directories other than the current working directory, like so:
learner@casimir:~$ ls -F casimir_programming_course
course_announcement.md day1/ day3/ exercises/ intro/ misc/ projects.ipynb
day1/ day2/ day4/ installation_instructions.ipynb LICENSE program.md README.md
We passed a directory name, casimir_programming_course, to ls as an argument, after the flag -F.
We can also list the contents of day1, inside casimir_programming_course:
learner@casimir:~$ ls -F casimir_programming_course/day1/
babynames.txt helpers.py if_figure.png introduction_to_python.ipynb projects_day_1.ipynb romeo_and_juliet.txt
Here we provided the directory to ls as a relative path, which does not start with a /, and is to be understood as a path relative to the current working directory. Using relative paths is a useful shortcut that saves quite a bit of typing.
There are several other shortcuts that can be used when specifying paths:
. refers to the current working directory
.. refers to the parent of the current working directory
~ refers to the user’s home directory
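The shortcuts can be used anywhere in a path, not only at the start. A minimal sketch, using a scratch directory with the made-up names parent and child:

```shell
scratch=$(mktemp -d)
mkdir "$scratch/parent"
mkdir "$scratch/parent/child"

# Appending /. changes nothing: this still lists parent.
ls "$scratch/parent/."

# Appending /.. steps up one level: child/.. is parent again.
via_dotdot=$(ls "$scratch/parent/child/..")
echo "$via_dotdot"    # prints child

# ~ expands to the home directory, normally the same value as $HOME.
echo ~

rm -r "$scratch"
```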
Moving around the filesystem#
Even though using relative paths adds convenience, most of the time you will want to do operations on files in a single directory. In this case it is useful to be able to change the working directory.
This is accomplished using the cd command:
learner@casimir:~$ cd casimir_programming_course
learner@casimir:~/casimir_programming_course$
We changed to the casimir_programming_course directory that is inside our home directory, specifying it using a relative path.
The cd command did not print anything to the screen, and the shell gave us a new prompt.
We can use the pwd command to check what directory we are in:
learner@casimir:~/casimir_programming_course$ pwd
/home/learner/casimir_programming_course
learner@casimir:~/casimir_programming_course$
Note that the prompt also contains the current working directory between the : and $, using the shortcut ~ for the “home directory”. Can you use this to deduce what our home directory is?
Be aware that when using different systems the shell prompt may not look exactly the same.
We can of course use all the shortcuts we learned above when specifying paths to cd. In addition, cd has the following behaviour when certain “special” arguments are provided:
cd (no arguments) takes us back to our home directory
cd - returns to the last directory we were in (like an “undo”)
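Both special forms can be tried safely in a scratch directory. In this sketch the directory names a and b are made up, and && simply means “run pwd only if cd succeeded”.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/a"
mkdir "$scratch/b"

cd "$scratch/a"
cd "$scratch/b"

# cd - jumps back to the previous directory (and prints its name).
cd -
back_in=$(basename "$(pwd)")
echo "$back_in"    # prints a

# cd with no arguments goes to the home directory.
cd && pwd

rm -r "$scratch"
```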
✓ Mini exercises#
Execute the example commands in this notebook in a shell, if you haven’t done so yet
Exercise#
Navigate to the following places in your home directory:
1. Make sure you are in the home directory
2. Go to casimir_programming_course
3. List the contents of this directory
4. List the contents of day2 without switching to this directory first
5. Go to day3
6. List the contents of this directory
7. Go back to the parent directory (i.e. you should then be in /home/learner/casimir_programming_course. Check with pwd!)
Solution#
1. You can check with pwd or the command prompt if you are in the home directory. If not, use cd to go there.
2. cd casimir_programming_course
3. ls
4. ls day2
5. cd day3
6. ls
7. cd .. and then run pwd
Exercise#
After completing the previous exercise, you should be in /home/learner/casimir_programming_course/. Which of the following commands could you use to navigate to your home directory, which is /home/learner? Try them out and check where you end up! Make sure you always start from /home/learner/casimir_programming_course/ (e.g. execute cd /home/learner/casimir_programming_course/ to go back there if necessary).
1. cd .
2. cd /
3. cd /home/amanda
4. cd ../..
5. cd ~
6. cd home
7. cd ~/data/..
8. cd
9. cd ..
Solution#
1. No: . stands for the current directory. You will stay where you were.
2. No: / stands for the root directory.
3. No: learner’s home directory is /home/learner, not /home/amanda.
4. No: this goes up two levels, i.e. ends in /home.
5. Yes: ~ stands for the user’s home directory, in this case /home/learner.
6. No: this would navigate into a directory home in the current directory if it exists.
7. Yes, provided a directory data exists in the home directory: unnecessarily complicated, but correct.
8. Yes: shortcut to go back to the user’s home directory.
9. Yes: goes up one level.
Exercise#
Using the filesystem diagram below, if pwd displays /home/learner, what will ls -F ../backup display?
1. ../backup: No such file or directory
2. 2017-07 2017-08 2017-09
3. 2017-07/ 2017-08/ 2017-09/
4. base/ orig/ recent/
Solution#
1. No: there is a directory backup in /home.
2. No: this is the content of /home/learner/backup, but with .. we asked for one level further up.
3. No: see previous explanation.
4. Yes: ../backup/ refers to /home/backup/.
Exercise#
Assuming a directory structure as in the above figure, if pwd displays /home/backup, and -r tells ls to display things in reverse order, what command will display:
recent/ orig/ base/
1. ls pwd
2. ls -r -F
3. ls -r -F /home/backup
4. Either #2 or #3 above, but not #1.
Solution#
1. No: pwd is not the name of a directory.
2. Yes: ls without a directory argument lists files and directories in the current directory.
3. Yes: uses the absolute path explicitly.
4. Correct: see explanations above.
Manipulating files and directories#
So now we can move around the filesystem effectively and list the contents of directories as we move around. The next step is to be able to create, read, update and delete files and directories.
Directories are created with the mkdir (make directory) command, which takes the name of the directory to create as its argument.
learner@casimir:~$ mkdir work
learner@casimir:~$
Editing text files#
Simple! If we now run ls -F we will see that there is a directory called work in our current working directory, which we can cd into and create some files:
learner@casimir:~$ cd work
learner@casimir:~/work$ nano draft.txt
Above we use the nano command, providing the filename draft.txt as an argument. nano is a text editor, which will allow us to type some text and save it into the file provided as a command line argument.
Note on text editors
When we say, “nano is a text editor,” we really do mean “text”: it can only work with plain character data, not tables, images, or any other human-friendly media. We use it in examples because it is one of the least complex text editors. However, because of this trait, it may not be powerful enough or flexible enough for the work you need to do after this course. On Unix systems (such as Linux and Mac OS X), many programmers use Emacs or Vim (both of which require more time to learn), or a graphical editor such as Gedit. On Windows, you may wish to use Notepad++. Windows also has a built-in editor called notepad that can be run from the command line in the same way as nano for the purposes of this lesson.
Once we’re happy with our text, we can press Ctrl-O (press the Ctrl or Control key and, while holding it down, press the O key) to write our data to disk (we’ll be asked what file we want to save this to: press Return to accept the suggested default of draft.txt).
Once our file is saved, we can use Ctrl-X to quit the editor and return to the shell.
Note on the control key
The Control key is also called the “Ctrl” key. There are various ways in which using the Control key may be described. For example, you may see an instruction to press the Control key and, while holding it down, press the X key, described as any of:
Control-X
Control+X
Ctrl-X
Ctrl+X
^X
C-x
In nano, along the bottom of the screen you’ll see ^G Get Help ^O WriteOut.
This means that you can use Control-G to get help and Control-O to save your file.
Naming tips for files or directories#
Don’t use whitespace
The shell separates command arguments on whitespace, so having filenames with whitespace is problematic. While it is possible to get around this restriction, it is easier to just avoid whitespace in filenames.
Don’t begin names with a - (dash)
Commands treat names starting with - as options.
Use only letters, numbers, . (period) and _ (underscore)
Many other characters have special meaning to the shell or to other programs you may invoke. This may mean that you provide unintended options to certain commands, and may make them misbehave or result in data loss.
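The whitespace problem is easy to reproduce. In the sketch below, run in a scratch directory, the unquoted name is split into two arguments, so touch creates two files instead of one. Putting the name in quotes is the usual way around the restriction. The file names are made up for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Unquoted: the shell splits on the space and passes two arguments,
# so touch creates two files, "my" and "notes.txt".
touch my notes.txt

# Quoted: one argument, so one file called "my notes.txt".
touch "my notes.txt"

# Three files now exist in the scratch directory.
listing=$(ls)
echo "$listing"

cd /
rm -r "$scratch"
```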
Copying, moving, and deleting files and directories#
We can use the cp command to copy a file from one place to another:
learner@casimir:~/work$ cp draft.txt previous-draft.txt
learner@casimir:~/work$ ls -F
draft.txt previous-draft.txt
learner@casimir:~/work$
The above invocation of cp copied the file draft.txt to the file previous-draft.txt, creating it in the process.
Caution: if the target file given to cp already exists, it will be overwritten without warning.
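The overwrite can be seen in a scratch directory. This sketch uses two things not covered in this lesson so far: echo ... > file, which writes a line of text into a file, and cat, which prints a file’s contents.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "new text" > draft.txt
echo "old text" > previous-draft.txt

# previous-draft.txt already exists: cp replaces its contents without asking.
cp draft.txt previous-draft.txt
contents=$(cat previous-draft.txt)
echo "$contents"    # prints new text

cd /
rm -r "$scratch"
```

If you prefer to be asked first, cp accepts an -i flag that prompts for confirmation before overwriting.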
We can also copy files between directories, so to copy draft.txt into the parent directory we could simply do the following:
learner@casimir:~/work$ cp draft.txt ..
We can remove files using the rm command:
learner@casimir:~/work$ rm previous-draft.txt
learner@casimir:~/work$ ls -F
draft.txt
Caution: deleting is forever
The Unix shell doesn’t have a trash bin that we can recover deleted files from (though most graphical interfaces to Unix do). Instead, when we delete files, they are unhooked from the file system so that their storage space on disk can be recycled. Tools for finding and recovering deleted files do exist, but there’s no guarantee they’ll work in any particular situation, since the computer may recycle the file’s disk space right away.
By default, rm will not work on directories:
learner@casimir:~/work$ cd ..
learner@casimir:~$ rm work
rm: cannot remove 'work': Is a directory
learner@casimir:~$
We need to supply the -r flag (for recursive):
learner@casimir:~$ rm -r work
learner@casimir:~$
For extra safety, you can use the -i flag to rm, which will ask you to confirm the deletion of each of the files inside work in turn.
Now let’s recreate the work directory and move draft.txt (which we previously copied into our home directory) into it.
For this we will use the mv (move) command.
learner@casimir:~$ mkdir work
learner@casimir:~$ mv draft.txt work
learner@casimir:~$ ls -F work
draft.txt
cp, mv and rm all expect paths as arguments. In the preceding examples we used relative paths to refer to files and directories in our current working directory, but we can also use these same programs to operate on files in locations other than our current working directory.
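As a sketch of that last point, the commands below never change directory: every path is written out in full. The directory and file names inside the scratch directory are made up for illustration.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/work"
mkdir "$scratch/archive"
touch "$scratch/work/draft.txt"

# None of these commands depend on the current working directory,
# because every path is given in full.
cp "$scratch/work/draft.txt" "$scratch/archive/draft-copy.txt"
mv "$scratch/work/draft.txt" "$scratch/archive/final.txt"   # move and rename
rm "$scratch/archive/draft-copy.txt"

remaining=$(ls "$scratch/archive")
echo "$remaining"    # prints final.txt

rm -r "$scratch"
```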
Protip: shell autocompletion
If your file and directory names are descriptive (and they should be!) they may become somewhat long and cumbersome to type manually.
Luckily the shell provides autocompletion for paths. While typing out a filename, simply hit the Tab key to have the shell autocomplete the path for you. If the shell cannot unambiguously determine which file you want, it will not autocomplete; if you hit the Tab key a second time it will print the possible autocompletion options. You will then need to carry on typing the filename until the shell is able to determine which one you mean, at which point you can hit Tab again to have the shell autocomplete.
Protip: shell history
Now that we are composing commands into more complicated ones we shall look at another shell productivity tool: history.
The shell keeps around a history of all the commands that you send to it (this is configurable but usually defaults to ~1000 commands). Pressing the Up Arrow key at a blank prompt will allow you to cycle through your most recently executed commands. Before executing a command from the history, you also have the opportunity to edit it. This is incredibly useful if you are performing repetitive tasks.
You can also use the history command to print out the command history to the screen.
It is also possible to search through the command history. To activate this, press Ctrl-R and begin typing the start of the command that you are looking for. The shell will display the match that occurred most recently in the history, and you can press Ctrl-R again to cycle through the matches further back in time. Pressing any other key will stop the search, with the selected command already typed for you on the command line.
✓ Mini exercises#
Exercise#
We have seen how to create text files using the nano editor. Now, try the following command in your home directory:
touch my_file.txt
1. What did the touch command do? When you look at your home directory using the GUI file explorer, does the file show up?
2. Use ls -l to inspect the files. How large is my_file.txt?
3. When might you want to create a file this way?
Solution#
1. The touch command generates a new file called my_file.txt in your home directory. If you are in your home directory, you can observe this newly generated file by typing ls at the command line prompt. my_file.txt can also be viewed in your GUI file explorer.
2. When you inspect the file with ls -l, note that the size of my_file.txt is 0 bytes. In other words, it contains no data. If you open my_file.txt using your text editor it is blank.
3. Some programs do not generate output files themselves, but instead require that empty files have already been generated. When the program is run, it searches for an existing file to populate with its output. The touch command allows you to efficiently generate a blank text file to be used by such programs.
Exercise#
We now consider the case where you created a file with the wrong filename.
Use nano to create a file called statstics.txt (you may write anything you want in there).
After creating and saving this file you realize you misspelled the filename! Which of the following commands could you use to correct the mistake? Think about it and try them out!
1. cp statstics.txt statistics.txt
2. mv statstics.txt statistics.txt
3. mv statstics.txt .
4. cp statstics.txt .
Solution#
First, you need to run nano statstics.txt, enter some text, and then use Ctrl-O and finally Ctrl-X to close.
Do the suggested commands do what you want?
1. No. While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.
2. Yes, this would work to rename the file.
3. No, the period (.) indicates where to move the file, but does not provide a new file name; identical file names cannot be created.
4. No, the period (.) indicates where to copy the file, but does not provide a new file name; identical file names cannot be created.
Exercise#
We have prepared some example files for you. Go to data1 in the folder day3 of this programming course. What is the output of the closing ls command in the sequence shown below? First think about it, then try out the commands.
learner@casimir:~/casimir_programming_course/day3/data1$ pwd
/home/learner/casimir_programming_course/day3/data1
learner@casimir:~/casimir_programming_course/day3/data1$ ls
proteins.dat
learner@casimir:~/casimir_programming_course/day3/data1$ mkdir recombine
learner@casimir:~/casimir_programming_course/day3/data1$ mv proteins.dat recombine
learner@casimir:~/casimir_programming_course/day3/data1$ cp recombine/proteins.dat ../proteins-saved.dat
learner@casimir:~/casimir_programming_course/day3/data1$ ls
1. proteins-saved.dat recombine
2. recombine
3. proteins.dat recombine
4. proteins-saved.dat
Finally, delete proteins-saved.dat again.
Solution#
Use cd ~/casimir_programming_course/day3/data1 to go to the correct place. There, create a new folder called recombine. The second line moves (mv) the file proteins.dat to the new folder (recombine). The third line makes a copy of the file we just moved. The tricky part here is where the file was copied to. Recall that .. means “go up a level”, so the copied file is now in /home/learner/casimir_programming_course/day3. Notice that .. is interpreted with respect to the current working directory, not with respect to the location of the file being copied. So, the only thing that will show using ls is the recombine folder.
1. No, see explanation above. proteins-saved.dat is located at /home/learner/casimir_programming_course/day3
2. Yes.
3. No, see explanation above. proteins.dat is located at /home/learner/casimir_programming_course/day3/data1/recombine
4. No, see explanation above. proteins-saved.dat is located at /home/learner/casimir_programming_course/day3
Finally, run rm ../proteins-saved.dat
Exercise#
Go to data2 in the folder day3. Imagine you are working on a project and see that your files aren’t very well organized:
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F
fructose.dat sucrose.dat
The fructose.dat and sucrose.dat files contain output from your data analysis. First, create two empty directories called analyzed and raw. What command(s) covered so far do you then need to run so that the commands below will produce the output shown?
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F
analyzed/ raw/
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F analyzed
fructose.dat sucrose.dat
Solution#
mv *.dat analyzed
You need to move your files fructose.dat and sucrose.dat to the analyzed directory. The shell will expand *.dat to match all .dat files in the current directory. The mv command then moves the list of .dat files to the analyzed directory.
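To see the expansion for yourself, you can hand the same pattern to echo, which prints whatever arguments it receives. This sketch recreates the two data files in a scratch directory, plus a made-up notes.txt to show that only matching names are selected.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch fructose.dat sucrose.dat notes.txt
mkdir analyzed

# echo shows what the shell substitutes for the pattern before mv ever runs.
expanded=$(echo *.dat)
echo "$expanded"    # prints fructose.dat sucrose.dat

# mv therefore receives the arguments: fructose.dat sucrose.dat analyzed
mv *.dat analyzed
ls analyzed

cd /
rm -r "$scratch"
```

The expansion is done by the shell, not by mv, which is why it works the same way with other commands.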
Exercise#
After running the following commands, you realize that you put the file sucrose.dat into the wrong folder:
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F
raw/ analyzed/
learner@casimir:~/casimir_programming_course/day3/data2$ ls -F analyzed
fructose.dat sucrose.dat
learner@casimir:~/casimir_programming_course/day3/data2$ cd raw/
Fill in the blanks to move sucrose.dat to the current folder (i.e., the one you are currently in):
learner@casimir:~/casimir_programming_course/day3/data2/raw$ mv ___/sucrose.dat ___
Solution#
mv ../analyzed/sucrose.dat .
Recall that .. refers to the parent directory (i.e. one above the current directory) and that . refers to the current directory.
Python scripts and the shell#
Let’s try and make a poor man’s clone of the echo command in Python.
The Python sys module has an attribute called argv, which is a list of the arguments that were provided to the script on the command line (the first entry is the name of the script itself). echo just takes its arguments and prints them to stdout, so in Python we could implement this as:
#!/opt/conda/bin/python
import sys
# remove the program name from 'argv'
arguments = sys.argv[1:]
print(' '.join(arguments))
We can save this as echo.py, add execute permissions with chmod, and run it:
learner@casimir:~$ ./echo.py hello world
hello world
Making a better CLI for echo.py#
echo.py
as it is written does not provide a very good CLI. Notably it does not have any documentation accessible from the command line:
learner@casimir:~$ ./echo.py --help
--help
learner@casimir:~$
This means that anyone who merely wants to use our script will need to go digging round in the source code.
The argparse
module in the Python standard library provides tools for making nice CLIs:
#!/opt/conda/bin/python
import argparse
parser = argparse.ArgumentParser(description='Echo arguments to stdout')
parser.add_argument('word', nargs='*', help="word to print")
args = parser.parse_args()
print(' '.join(args.word))
It gives you things like automatic help-page generation:
learner@casimir:~$ ./echo.py --help
usage: echo.py [-h] [word [word ...]]
Echo arguments to stdout
positional arguments:
word word to print
optional arguments:
-h, --help show this help message and exit
learner@casimir:~$
And reasonable error messages if an unrecognized argument is provided:
learner@casimir:~$ ./echo.py --options
usage: echo.py [-h] [word [word ...]]
echo.py: error: unrecognized arguments: --options
learner@casimir:~$
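As a hedged sketch of how this parser could be extended, the block below adds a -n flag that suppresses the trailing newline, in the spirit of the shell's echo -n. The flag and the helper names build_parser and render are illustrative assumptions, not part of the echo.py shown above.

```python
import argparse


def build_parser():
    """Build the echo parser; the -n flag is a hypothetical addition."""
    parser = argparse.ArgumentParser(description='Echo arguments to stdout')
    parser.add_argument('word', nargs='*', help='word to print')
    parser.add_argument('-n', dest='no_newline', action='store_true',
                        help='do not print the trailing newline')
    return parser


def render(argv):
    """Return the text that would be written to stdout for these arguments."""
    args = build_parser().parse_args(argv)
    ending = '' if args.no_newline else '\n'
    return ' '.join(args.word) + ending


# An explicit argument list stands in for sys.argv[1:] here.
print(render(['hello', 'world']), end='')
print(repr(render(['-n', 'hello'])))
```

Keeping the parsing in a function that takes a list of arguments makes the behaviour easy to try from a notebook without going through the command line.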
Installing software#
Now we know how to run existing software from the command line and how to write our own command-line driven software. In this section we will cover the basics of installing software from the command line.
This is the bare minimum of what will be required for this course; we will not cover installing software from source code distributions etc.
Installing Python packages#
There are 2 tools that you need to be aware of for installing Python packages onto your computer: pip
and conda
.
pip
is the official Python package manager, and comes installed by default with all distributions of Python. It downloads and installs packages from PyPI (Python Package Index): https://pypi.python.org
conda
is the package manager for the Anaconda Python distribution (which we are using for this course). It downloads and installs packages from the Anaconda website: https://repo.continuum.io/pkgs/
Anyone who distributes their Python package will (in the overwhelming majority of cases) first put it on PyPI, as this is the place that is visible to and known by the most people. For this reason PyPI can be considered the “canonical” place to get the most up to date versions of packages.
The caveat to this is that PyPI and pip
do not know anything about dependencies external to the Python ecosystem. This means that if the Python package that you want to use depends on some external library not written in Python, then you will need to install this yourself by some other means (see below). conda
, on the other hand, does have non-Python packages. The disadvantage of conda
is that not all packages available on PyPI will have corresponding conda
packages, or the conda
packages available may not be up to date with the latest version available on PyPI.
To make the situation even more complicated, sometimes packages on PyPI will come with the non-Python dependencies already linked (i.e. things will “just work”). This is typical of the more widely used packages, which are looking for a smoother user experience (e.g. opencv and ZeroMQ).
If you think all this is insane, you are in good company.
Installing other packages#
All Linux distros come with a package manager pre-installed. There are also package managers for OSX and Windows, but they will probably be less-used: https://en.wikipedia.org/wiki/List_of_software_package_management_systems
The package manager handles installing and updating all the programs and libraries on the computer, taking into account all the dependencies.
Package managers do typically have many Python packages in their repositories, but due to the typically long vetting procedure and release schedule, they often only have out-of-date versions. This is why we recommend using pip
or conda
to install your Python packages.
You will need to use the distribution’s package manager for everything else. In this course we have an environment based on Ubuntu 16.04, which uses the apt
package manager. There is an exhaustive list of packages available online: https://packages.ubuntu.com/xenial/
To install a package my-sweet-package
we say
apt-get install my-sweet-package
Running this produces the following:
learner@casimir:~$ apt-get install my-sweet-package
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
apt-get
complains that it cannot install the package because we do not have sufficient permissions. This is because we are running apt-get
as our regular user, learner
, who does not have write permissions on the directories into which apt-get
wants to install things.
We can use the sudo
command to execute things as the super user, root
:
sudo apt-get install my-sweet-package
The first argument to sudo
is the command to run as root
, and the remaining arguments are passed to that command.
Advanced topics (for self-study, if you are interested!)#
Joining commands together#
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways.
In the sonnets
directory (next to this notebook) there is a collection of files containing the sonnets of William Shakespeare, which we will use in what follows.
learner@casimir:~$ pwd
/home/learner
learner@casimir:~$ cp -r casimir_programming_course/day3/sonnets .
learner@casimir:~$ cd sonnets
learner@casimir:~/sonnets$
In this directory there is a collection of files with names like sonnet_001.txt
and sonnet_124.txt
, each with the corresponding Shakespeare sonnet in them.
We can use the wc
(word count) command with the -w
flag to get the number of words in each of these files:
learner@casimir:~/sonnets$ wc -w sonnet_*.txt
106 sonnet_001.txt
115 sonnet_002.txt
115 sonnet_003.txt
104 sonnet_004.txt
104 sonnet_005.txt
110 sonnet_006.txt
100 sonnet_007.txt
110 sonnet_008.txt
118 sonnet_009.txt
114 sonnet_010.txt
116 sonnet_011.txt
118 sonnet_012.txt
110 sonnet_013.txt
112 sonnet_014.txt
111 sonnet_015.txt
110 sonnet_016.txt
123 sonnet_017.txt
114 sonnet_018.txt
115 sonnet_019.txt
114 sonnet_020.txt
117 sonnet_021.txt
122 sonnet_022.txt
114 sonnet_023.txt
120 sonnet_024.txt
108 sonnet_025.txt
118 sonnet_026.txt
110 sonnet_027.txt
116 sonnet_028.txt
115 sonnet_029.txt
116 sonnet_030.txt
113 sonnet_031.txt
114 sonnet_032.txt
110 sonnet_033.txt
120 sonnet_034.txt
106 sonnet_035.txt
113 sonnet_036.txt
116 sonnet_037.txt
114 sonnet_038.txt
119 sonnet_039.txt
121 sonnet_040.txt
110 sonnet_041.txt
130 sonnet_042.txt
122 sonnet_043.txt
117 sonnet_044.txt
105 sonnet_045.txt
115 sonnet_046.txt
123 sonnet_047.txt
117 sonnet_048.txt
114 sonnet_049.txt
118 sonnet_050.txt
117 sonnet_051.txt
109 sonnet_052.txt
107 sonnet_053.txt
112 sonnet_054.txt
106 sonnet_055.txt
112 sonnet_056.txt
117 sonnet_057.txt
112 sonnet_058.txt
109 sonnet_059.txt
108 sonnet_060.txt
121 sonnet_061.txt
105 sonnet_062.txt
110 sonnet_063.txt
111 sonnet_064.txt
112 sonnet_065.txt
87 sonnet_066.txt
106 sonnet_067.txt
108 sonnet_068.txt
119 sonnet_069.txt
110 sonnet_070.txt
122 sonnet_071.txt
118 sonnet_072.txt
121 sonnet_073.txt
113 sonnet_074.txt
117 sonnet_075.txt
114 sonnet_076.txt
107 sonnet_077.txt
110 sonnet_078.txt
117 sonnet_079.txt
114 sonnet_080.txt
116 sonnet_081.txt
105 sonnet_082.txt
116 sonnet_083.txt
114 sonnet_084.txt
110 sonnet_085.txt
110 sonnet_086.txt
118 sonnet_087.txt
112 sonnet_088.txt
112 sonnet_089.txt
121 sonnet_090.txt
114 sonnet_091.txt
120 sonnet_092.txt
114 sonnet_093.txt
106 sonnet_094.txt
112 sonnet_095.txt
120 sonnet_096.txt
107 sonnet_097.txt
118 sonnet_098.txt
125 sonnet_099.txt
111 sonnet_100.txt
114 sonnet_101.txt
117 sonnet_102.txt
117 sonnet_103.txt
117 sonnet_104.txt
105 sonnet_105.txt
110 sonnet_106.txt
113 sonnet_107.txt
115 sonnet_108.txt
116 sonnet_109.txt
119 sonnet_110.txt
110 sonnet_111.txt
114 sonnet_112.txt
117 sonnet_113.txt
114 sonnet_114.txt
115 sonnet_115.txt
109 sonnet_116.txt
110 sonnet_117.txt
112 sonnet_118.txt
113 sonnet_119.txt
117 sonnet_120.txt
115 sonnet_121.txt
105 sonnet_122.txt
116 sonnet_123.txt
109 sonnet_124.txt
105 sonnet_125.txt
96 sonnet_126.txt
111 sonnet_127.txt
111 sonnet_128.txt
110 sonnet_129.txt
123 sonnet_130.txt
119 sonnet_131.txt
114 sonnet_132.txt
122 sonnet_133.txt
121 sonnet_134.txt
116 sonnet_135.txt
124 sonnet_136.txt
124 sonnet_137.txt
117 sonnet_138.txt
120 sonnet_139.txt
118 sonnet_140.txt
118 sonnet_141.txt
117 sonnet_142.txt
119 sonnet_143.txt
113 sonnet_144.txt
97 sonnet_145.txt
112 sonnet_146.txt
107 sonnet_147.txt
123 sonnet_148.txt
119 sonnet_149.txt
118 sonnet_150.txt
119 sonnet_151.txt
124 sonnet_152.txt
109 sonnet_153.txt
107 sonnet_154.txt
17516 total
We used the pattern sonnet_*.txt
, where the *
is a wildcard that matches any number of characters, to match all the sonnet files. We did this to avoid counting the LICENSE.txt
file.
By default wc
prints several pieces of information about each file it receives as an argument: line, word and character counts. On the last line are the total lines, words and characters.
This is a lot of output — it fills up more than one screen of text!
Output redirection#
We can redirect the output of the command to a file like so:
learner@casimir:~/sonnets$ wc -w sonnet_*.txt > word_counts.txt
learner@casimir:~/sonnets$
The character >
used in this way is interpreted specially by the shell. The filename to the right of the >
indicates where the output of the command on the left of the >
should go. The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution.
The sequence >>
also has a special meaning. It means to redirect output (as with >
), but to append to the file on the right, rather than to overwrite.
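For readers who know Python better than the shell, the difference between the two redirections resembles the difference between opening a file in 'w' mode and in 'a' mode. The following is only an analogy, sketched with a hypothetical helper called redirect and made-up file names in a temporary directory.

```python
import os
import tempfile


def redirect(text, path, append=False):
    """Write text to path: overwrite like `>`, or append like `>>`."""
    mode = 'a' if append else 'w'
    with open(path, mode) as handle:
        handle.write(text)


workdir = tempfile.mkdtemp()
overwritten = os.path.join(workdir, 'overwrite.txt')
appended = os.path.join(workdir, 'append.txt')

# Write the same line twice to each file.
for _ in range(2):
    redirect('hello\n', overwritten)             # behaves like `>`
    redirect('hello\n', appended, append=True)   # behaves like `>>`

with open(overwritten) as handle:
    print(repr(handle.read()))
with open(appended) as handle:
    print(repr(handle.read()))
```

The first file ends up with a single line because every write replaced the previous content, while the second file keeps both lines.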
Imagine now that we wanted to find which sonnet has the most words in it. Sure, we can manually inspect word_counts.txt
, but surely there’s a better way?
There is a command called sort
that we can use to sort the lines of word_counts.txt
. The -n
flag to sort
specifies that we should compare lines by interpreting them as integers.
learner@casimir:~/sonnets$ sort -n word_counts.txt > sorted_word_counts.txt
learner@casimir:~/sonnets$
Then we can use the tail
command to print only the last few lines (10 by default) of sorted_word_counts.txt:
learner@casimir:~/sonnets$ tail sorted_word_counts.txt
123 sonnet_017.txt
123 sonnet_047.txt
123 sonnet_130.txt
123 sonnet_148.txt
124 sonnet_136.txt
124 sonnet_137.txt
124 sonnet_152.txt
125 sonnet_099.txt
130 sonnet_042.txt
17516 total
learner@casimir:~/sonnets$
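To see why numeric comparison matters, here is a small Python sketch that sorts four of the lines from word_counts.txt first as plain text and then by their leading number. The helper name leading_number is an assumption made for this illustration; it mirrors the idea behind sort -n rather than reproducing the sort program itself.

```python
# Four lines taken from the word count listing above.
lines = [
    '87 sonnet_066.txt',
    '130 sonnet_042.txt',
    '96 sonnet_126.txt',
    '125 sonnet_099.txt',
]


def leading_number(line):
    """Interpret the first whitespace-separated field as an integer."""
    return int(line.split()[0])


as_text = sorted(lines)                         # character-by-character comparison
as_numbers = sorted(lines, key=leading_number)  # comparison by numeric value

print(as_text[-1])     # 96 sonnet_126.txt
print(as_numbers[-1])  # 130 sonnet_042.txt
```

Compared as text, "96" sorts after "130" because the character 9 comes after the character 1, so the last line would not be the longest sonnet.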
Pipes#
This is all very well, but somewhat cumbersome if we don’t want to keep all these intermediate files around.
We can do this more succinctly by running sort
and tail
together:
learner@casimir:~/sonnets$ sort -n word_counts.txt | tail
123 sonnet_017.txt
123 sonnet_047.txt
123 sonnet_130.txt
123 sonnet_148.txt
124 sonnet_136.txt
124 sonnet_137.txt
124 sonnet_152.txt
125 sonnet_099.txt
130 sonnet_042.txt
17516 total
learner@casimir:~/sonnets$
The vertical bar, |
, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.
We are, of course, not limited to using a single pipe in a command, so we can also pipe the output of wc
into sort
, and then the output of sort
into tail
:
learner@casimir:~/sonnets$ wc -w sonnet_*.txt | sort -n | tail
123 sonnet_017.txt
123 sonnet_047.txt
123 sonnet_130.txt
123 sonnet_148.txt
124 sonnet_136.txt
124 sonnet_137.txt
124 sonnet_152.txt
125 sonnet_099.txt
130 sonnet_042.txt
17516 total
What is actually going on#
Here’s what actually happens behind the scenes when we create a pipe. When a computer runs a program — any program — it creates a process in memory to hold the program’s software and its current state. Every process has an input channel called standard input. (By this point, you may be surprised that the name is so memorable, but don’t worry: most programmers call it stdin
). Every process also has a default output channel called standard output (or stdout
). A second output channel called standard error (stderr
) also exists. This channel is typically used for error or diagnostic messages, and it allows a user to pipe the output of one program into another while still receiving error messages in the terminal.
The shell is actually just another program. Under normal circumstances, whatever we type on the keyboard is sent to the shell on its standard input, and whatever it produces on standard output is displayed on our screen. When we tell the shell to run a program, it creates a new process and temporarily sends whatever we type on our keyboard to that process’s standard input, and whatever the process sends to standard output to the screen.
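The separation between standard output and standard error can be observed from Python as well. The sketch below starts a small child process (a one-off Python snippet invented for this example) and captures its two output channels separately using the subprocess module.

```python
import subprocess
import sys

# A tiny child program that writes one line to each output channel.
child_code = (
    "import sys\n"
    "print('result on stdout')\n"
    "print('warning on stderr', file=sys.stderr)\n"
)

completed = subprocess.run(
    [sys.executable, '-c', child_code],
    stdout=subprocess.PIPE,      # capture standard output
    stderr=subprocess.PIPE,      # capture standard error separately
    universal_newlines=True,     # decode bytes to text
)

print('stdout:', completed.stdout.strip())
print('stderr:', completed.stderr.strip())
```

Because the two channels are distinct, a pipe or redirect that only touches standard output leaves the diagnostic messages untouched.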
Here’s what happens when we run wc -w sonnet_*.txt > word_counts.txt
. The shell starts by telling the computer to create a new process to run the wc
program. Since we’ve provided some filenames as arguments, wc
reads from them instead of from standard input. And since we’ve used >
to redirect output to a file, the shell connects the process’s standard output to that file.
If we run wc -w sonnet_*.txt | sort -n
instead, the shell creates two processes (one for each command in the pipeline) so that wc
and sort
run simultaneously. The standard output of wc is fed directly to the standard input of sort; since there’s no redirection with >
, sort
’s output goes to the screen. And if we run wc -w sonnet_*.txt | sort -n | tail
, we get three processes with data flowing from the files, through wc
to sort
, and from sort
through tail
to the screen.
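The same wiring can be sketched by hand in Python, which may make the mechanism more concrete. In this hedged example two child processes (small Python snippets invented for the illustration, standing in for wc and sort) run at the same time, and the standard output of the first is connected to the standard input of the second.

```python
import subprocess
import sys

# The producer prints unsorted numbers; the consumer sorts whatever it reads.
producer_code = "print('3'); print('1'); print('2')"
consumer_code = (
    "import sys\n"
    "for line in sorted(sys.stdin):\n"
    "    sys.stdout.write(line)\n"
)

producer = subprocess.Popen([sys.executable, '-c', producer_code],
                            stdout=subprocess.PIPE)
consumer = subprocess.Popen([sys.executable, '-c', consumer_code],
                            stdin=producer.stdout,   # the pipe between the two
                            stdout=subprocess.PIPE,
                            universal_newlines=True)

# Close our copy of the pipe so that only the consumer holds the read end.
producer.stdout.close()
output, _ = consumer.communicate()
producer.wait()

print(output, end='')
```

This is roughly what the shell arranges for us every time we type a single | character.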
✓ Check Questions#
Question#
What is the difference between:
learner@casimir:~$ echo hello > testfile01.txt
and:
learner@casimir:~$ echo hello >> testfile02.txt
Hint: Try executing each command twice in a row and then examining the output files.
Solution#
The first command (with >
) overwrites any content in testfile01.txt
, whereas the second command (with >>
)
appends to the existing content of testfile02.txt
.
Exercise#
In our current directory, we want to find the 3 files which have the fewest lines. Which command listed below would work?
1. wc -l * > sort -n > head -n 3
2. wc -l * | sort -n | head -n 1-3
3. wc -l * | head -n 3 | sort -n
4. wc -l * | sort -n | head -n 3
Solution#
Option 4 is the solution. The pipe character |
is used to feed the standard output from one process to the standard input of another. >
is used to redirect standard output to a file.
Question#
A file called animals.txt
contains the following data:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
What text passes through each of the pipes and the final redirect in the pipeline below?
learner@casimir:~$ cat animals.txt | head -n 5 | tail -n 3 | sort -r > final.txt
Hint: build the pipeline up one command at a time to test your understanding.
Solution#
The contents of animals.txt
pass verbatim through the first pipe:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
Only the first 5 lines pass through the second pipe:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
The last 3 lines of the first 5 lines (i.e. lines 3, 4, and 5) pass through the third pipe:
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
The last command sorts the lines in reverse lexicographic ordering:
2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon
And writes this into the file final.txt
.
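The stages of this pipeline can be double-checked with a few lines of Python, where list slicing plays the role of head and tail. This is a sketch of the logic only; it does not run the actual shell commands.

```python
animals = """2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear""".splitlines()

first_five = animals[:5]                   # like: head -n 5
last_three = first_five[-3:]               # like: tail -n 3
final = sorted(last_three, reverse=True)   # like: sort -r

for line in final:
    print(line)
```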
Writing CLI programs#
So far we have used commands that are already installed on our system. Because the shell is a general purpose tool for launching programs, it should come as no surprise that it is also possible to write your own programs, with which you can interact from the shell.
Shell scripts#
The simplest way to write such a program is just to put a bunch of shell commands into a file, one per line:
echo "The longest of the Shakespeare sonnets are:"
# print the 4 longest Shakespeare sonnets, and a summary line about all the sonnets
wc -l sonnet_*.txt | sort -n | tail -5
We name this file commands.sh
, and we can run the commands using the bash
command (we get the shell to call an instance of itself):
learner@casimir:~/sonnets$ bash commands.sh
The longest of the Shakespeare sonnets are:
13 sonnet_152.txt
13 sonnet_153.txt
13 sonnet_154.txt
14 sonnet_099.txt
2001 total
learner@casimir:~/sonnets$
In the above we explicitly said that we wanted the bash
program to interpret and execute the commands contained in the commands.sh
file.
This is all very well, but what if we give this script to somebody else? We have to also tell them that they need to run it by providing it as an argument to bash
. Is there a better way?
It turns out there is. When you execute a command in the shell, the shell will look for a file with that name, will open it and will attempt to interpret what is inside. Commands such as ls
are binary files that the shell can ask the operating system to execute directly; however, the shell can also interpret text files.
If the first line of a program file starts with the special character sequence #!
(pronounced “shebang”), followed by the path to another program, then that program is used to interpret and execute the contents of the file: the operating system starts the program specified after the shebang and passes it the path of the script file as an argument.
We should thus modify our script to be:
#!/bin/bash
echo "The longest of the Shakespeare sonnets are:"
# print the 4 longest Shakespeare sonnets, and a summary line about all the sonnets
wc -l sonnet_*.txt | sort -n | tail -5
We can now run our script just by executing the command ./commands.sh
.
We need the leading ./
to tell the shell to look in the current directory for the program; by default the shell looks in several special places (such as /bin
and /usr/bin
) for programs, but not in the current working directory.
learner@casimir:~/sonnets$ ./commands.sh
-bash: ./commands.sh: Permission denied
An error! The shell told us that we do not have permission to execute commands.sh
. Why is this?
Permissions#
Every file and directory has a certain set of permissions associated with it. We can inspect these permissions using ls
with the -l
flag
learner@casimir:~/sonnets$ ls -l commands.sh
-rw-r--r-- 1 learner users 188 Nov 4 12:54 commands.sh
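After the leading file-type character, the first column shows the permissions for the file’s owner, the file’s group and everyone else, in that order: r means read, w means write and x means execute. Here the owner may read and write the file and everyone else may only read it, so we cannot execute commands.sh
because the x
permission is not set.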
We can do so with the chmod
command:
learner@casimir:~/sonnets$ chmod +x commands.sh
learner@casimir:~/sonnets$ ls -l commands.sh
-rwxr-xr-x 1 learner users 188 Nov 4 12:54 commands.sh
learner@casimir:~/sonnets$
We just gave the owner user, users in the file’s group and everyone else permission to execute commands.sh
.
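The permission bits can also be inspected and changed from Python with the os and stat modules. The sketch below assumes a POSIX system such as the course environment, works on a temporary file rather than on commands.sh, and uses a hypothetical helper called add_execute. Note that it always sets the execute bit for all three classes of user, whereas chmod +x also takes the user's umask into account.

```python
import os
import stat
import tempfile

# Create a scratch file and give it the same permissions as in the listing.
fd, path = tempfile.mkstemp(suffix='.sh')
os.close(fd)
os.chmod(path, 0o644)  # -rw-r--r--


def add_execute(path):
    """Add the execute bit for owner, group and others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


before = stat.filemode(os.stat(path).st_mode)
add_execute(path)
after = stat.filemode(os.stat(path).st_mode)

print(before)
print(after)

os.remove(path)  # tidy up the scratch file
```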
Now we can execute commands.sh
directly:
learner@casimir:~/sonnets$ ./commands.sh
The longest of the Shakespeare sonnets are:
13 sonnet_152.txt
13 sonnet_153.txt
13 sonnet_154.txt
14 sonnet_099.txt
2001 total
learner@casimir:~/sonnets$
Avoiding calling python script.py all the time#
In the above we provided /bin/bash
as the interpreter for our shell script in the shebang line.
Similarly, we can provide python
as an interpreter for our Python scripts, if we wish to invoke them directly from the command line without writing python
all the time.
In the computational environment provided for this course, the Python interpreter is found at /opt/conda/bin/python
,
so our shebang line for Python scripts should be:
#!/opt/conda/bin/python
Python or Bash?#
Now we’ve seen how you can make scripts in Bash and Python that you can run from the command line.
If you just need to chain a few shell commands together, a Bash script is probably the simplest way to do this.
If, however, you find yourself needing more complicated logic (conditionals and loops) we strongly recommend writing a Python script to do the job for you. While if
and for
are possible from the shell, they are very unintuitive and difficult to write in a robust way.
The os
and shutil
modules in the Python standard library provide shell-like functionality for manipulating files and changing directory.
If you ever find that you really need to call a shell command, you can use the subprocess
module and parse the command output.
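As a rough illustration of these modules, the sketch below redoes the earlier mv *.dat analyzed exercise in Python and then calls an external command and parses its output. It works in a temporary directory with empty stand-in files, and the child command is just a Python one-liner chosen so that the example runs anywhere; both are assumptions made for the example.

```python
import glob
import os
import shutil
import subprocess
import sys
import tempfile

# Set up a scratch directory containing two empty data files.
workdir = tempfile.mkdtemp()
for name in ('fructose.dat', 'sucrose.dat'):
    open(os.path.join(workdir, name), 'w').close()

analyzed = os.path.join(workdir, 'analyzed')
os.mkdir(analyzed)                                      # like: mkdir analyzed
for path in glob.glob(os.path.join(workdir, '*.dat')):  # like the *.dat wildcard
    shutil.move(path, analyzed)                         # like: mv *.dat analyzed

moved = sorted(os.listdir(analyzed))
print(moved)

# Run an external command and split its output into words.
result = subprocess.run([sys.executable, '-c', "print('hello world')"],
                        stdout=subprocess.PIPE, universal_newlines=True)
words = result.stdout.split()
print(words)

shutil.rmtree(workdir)  # remove the scratch directory again
```

Passing the command as a list of arguments, rather than as one string, avoids having the shell re-interpret spaces and wildcards.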