Welcome to the Software Carpentry Etherpad for the October 6-7 workshop at Harvard University!
This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of the Software Carpentry and Data Carpentry community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org/).
Users are expected to follow our code of conduct: http://software-carpentry.org/conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
We will use this Etherpad during the workshop for chatting, taking notes, and sharing URLs and bits of code.
Instructors:
Byron Smith, PhD Candidate, University of Michigan - <bsmith89@gmail.com>
James Mickley, PhD Candidate, University of Connecticut - http://jamesmickley.com - <james.mickley@uconn.edu>
Helpers:
Jeremy Muhlich
Gabriel Berriz
Douglas Russell
David "Quint" Gribbin
Attendees:
- Mariya Atanasova
- Amy Thurber
- Neeti Mittal
- Bobby Sheehan
- Robert Everley
- yuyu song
- Nienke Moret
- Changchang Liu
- Sameer Chopra
- Adam Palmer
- Jia-Yun Chen
- John Santa Maria
- Greg Baker
- Caitlin Mills
Options for lunch:
==============
- Day 1 - sandwiches
- Chicken salad
- Turkey
- Roast beef
- Grilled chicken breast
- Tuna salad
- Hummus and vegetables
- Day 2 - ???
- (?beer?)
YES! pizza?
=================================================================================================================
Day 1
Setup:
1. Download and install software from our course website: http://tinyurl.com/harvard-swc
2. Go to Socrative, and put in MICKLEY as the room: https://b.socrative.com/login/student/
3. Put your name under Attendees above (You can get here from the Etherpad link on the course website)
4. Navigate to https://sorgerlab.github.io/2016-10-06-harvard/setup/index.html
Unix Shell
==========
Follow along with what I typed: https://www.dropbox.com/s/09kdy6gyocpxigs/shell.txt
Data for shell: http://swcarpentry.github.io/shell-novice/data/shell-novice-data.zip
A really useful website for figuring out what a complicated line of commands does: http://explainshell.com/
On mac you can get into the shell by finding the application called "Terminal"
On windows, you should use "Git-Bash"
/Users/<USERNAME> means that we're in a directory ("<USERNAME>") inside of another directory ("Users"), which is itself inside the root directory
## Getting Help ##
ls --help # on windows (Git-Bash); on a mac the built-in ls may not accept --help, so use man ls there
man ls # on mac
http://man.cx/ls (in a browser)
hidden files: files whose names begin with a . are "hidden"; they are not shown by default by ls or other tools
ls -F (-F is a "flag" which changes the behavior of the `ls` command to print a "/" after directories)
(NB: some versions of ls print a "/" after directories by default, without needing the -F flag)
ls -a (show "all" which includes "hidden" files/directories)
~ stands for the home directory
cd <SOME DIRECTORY> (change directories)
cd (with no argument: changes directory to ~)
cd - (change directory to the previous directory)
../ is a sort-of directory which refers to the "parent" directory (the directory which contains our current directory)
./ is another "sort-of" directory, which refers to the current directory itself
The `/` at the front of a path means the "root directory" (the directory that contains all other directories)
Within a path the `/` separates directories going from the "top" directory to the "bottom" (most specific) directory
Naming files:
1. avoid spaces in file names
2. avoid leading hyphens in file names
3. when including dates in file names:
- use the year-month-day convention (e.g. 2016-01-23)
- use the 2-digit version of months and days (e.g. 2016-01-01, instead of 2016-1-1)
nano filename : edit a file
While using nano:
control-X : exit (prompts to save file if you haven't already)
control-O : save file (prompts for filename if you want to change it)
to remove a file
rm FILENAME
rm -r directory # remove a directory and its contents ("recursive")
rm -r -i directory # it will ask you whether it should delete each file and directory ("interactive")
REMEMBER: rm is forever!
to rename or move a file:
mv original_name new_name # rename a file
mv subdirectory/filename . # move a file from a subdirectory to the current directory
to copy a file:
cp old_file new_file # make a copy of old_file, named new_file
NB: cp and mv will overwrite existing files; for example, if one runs
mv foo.txt bar.txt
and a file bar.txt happens to already exist, it will be overwritten by foo.txt
in the shell, to recall an earlier command use the up arrow
the up and down arrows can be used to navigate through the shell's history
recalled commands can be edited before re-running them
wc FILENAME # print the number of lines, words, and characters in a file (in that order)
# If you don't type a filename, wc waits for input and you will seem "stuck"; press control-C to get out.
wc -l FILENAME # print just number of lines
* is a wildcard character that matches any number (zero or more) of characters in a filename (except for a leading .)
? is a wildcard character that matches exactly one character (except for a leading .)
*[AB] matches all files whose names end with A or B
( http://regexr.com is a site where you can learn more about "regular expressions", a more powerful pattern system that shares the [AB] style of matching with shell wildcards)
The > character (greater-than) is used after a command to "redirect" the output from the command into a file:
wc -l *.pdb > lengths.txt
NB: COMMAND > FILENAME will always overwrite FILENAME
The < character (less-than) is used after a command to "redirect" the input to the command from a file:
wc < methane.pdb
cat FILENAME prints the contents of FILENAME to the terminal
less FILENAME also prints the contents of FILENAME to the terminal, but page by page (less is what is known as a "pager")
to get out of less: type q
sort takes an input file and prints out the lines from the file in sorted order:
sort file.txt
sort -n lengths.txt # the -n flag is required to sort numbers properly
head shows just the first few lines of a file:
head -n 1 sorted-lengths.txt # show just the first line
head -n 5 sorted-lengths.txt # show the first 5
head -5 is a shortcut for head -n 5
tail is the counterpart of head: it shows n lines at the end of a file
tail -n 1 # shows the last line of a file
Q: how to get only the second line of a file
A: use both head and tail; e.g.:
head -n 2 FILENAME | tail -n 1
The | character (vertical bar or pipe) is used after a command to "pipe" the output from one command directly into another command:
sort -n lengths.txt | head -n 1 # show the line with the smallest count from lengths.txt, without creating any extra files
Any number of commands may be "piped" together in a line:
wc -l *.pdb | sort -n | head -n 1
"standard in" : source of input to commands when a file is not specified as an argument
"standard out" : target of output from commands
to kill current command: ^C (control-C)
this is useful when a command appears to be stuck
ASIDE: R has pipes too! (in dplyr package) %>% sign
there are no training wheels in the shell
UNIX philosophy: small, single-purpose tools that can be composed to perform complex tasks
echo hi # just outputs hi to the terminal
for loops:
for VARIABLE in ITEM1 ITEM2 ITEM3 ... ITEMn
do
## shell commands using $VARIABLE
## where $VARIABLE sequentially takes on the values ITEM1, ITEM2, ITEM3, ..., ITEMn
done
Semicolons can be used to type complex commands (such as for loops) in a single line; e.g.
for filename in *.dat; do head -3 $filename; done
PROGRAMMING TIP: use human-readable variable names rather than cryptic ones
PROGRAMMING TIP: use indentation to indicate logical structure; e.g.
# GOOD
for filename in *.dat
do
    head -3 $filename
done
# NOT SO GOOD
for filename in *.dat
do
head -3 $filename
done
scripts are files that collect multiple shell commands, and that can be run all at once; example:
#!/bin/bash
# script to run the command frobozz on every *.txt file, appending the output to frobozz.log
for filename in *.txt
do
    echo $filename
    frobozz $filename >> frobozz.log
done
Note: the # character means "everything that follows on this line is a comment"; comments are only for the benefit of human readers; they are otherwise ignored
The shebang line
if the first line of a script begins with #!, it is called "the shebang line"; what follows this sequence gets used as the program that processes the contents of the script file
For example, the script above begins with the line
#!/bin/bash
...the program /bin/bash will be used to execute the script.
Python
==========
Download and unzip on your desktop: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
Rename the directory to `gapminder`
To open a python command prompt (in shell), type: python
To run a python program in the shell: python somepythonscript.py
Command to open jupyter notebook (in shell): jupyter notebook
To create a notebook in jupyter:
- In shell run: jupyter notebook
- Click New in the top right, and select python 3 (or python)
- Click Untitled at the top to rename the notebook
Use SHIFT+Enter to run a command in a jupyter notebook
== Jupyter Notebook ==
- This is a very nice way to keep track of what you've run, kind of like a lab notebook.
- You can save it and return to it later
- You can convert it to a pdf or html to send a report to someone
- Typing h will give you a list of keyboard shortcuts
- Jupyter Notebook can also be set up to run shell or R code instead of Python
Variables in python are not like cells in Excel.
If one variable depended on another and you change the original variable, it won't change the later one
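A small sketch of the point above (the variable names first and second are made up for illustration): second is calculated once, when its line runs, and it does not update when first changes later.

```python
# Variables hold values, not formulas (unlike spreadsheet cells).
first = 1
second = 5 * first   # second is computed right now, using first == 1
first = 2            # changing first afterwards does not touch second

print("first is", first, "and second is", second)
```

If you want second to reflect the new value of first, you have to run the line that calculates it again.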
use CTRL+m and then press m for changing code into markdown text. You can still run it using SHIFT+Enter (but it will display formatted text)
Strings can be combined (or concatenated) with the '+' character
Variables store different types of data in python
Use the type() function to figure out what kind of data a python variable holds
- int contains an integer
- float contains a floating point or decimal number
- str contains a string (or text)
- bool contains either True or False (note the capital letters)
- lists contain multiple values
You can convert between types of variables using functions like str(), float(), int(), bool(), list()
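A quick sketch of the types and conversions listed above. All the variable names and values here are invented for illustration.

```python
age = 42                  # int
weight = 60.5             # float
name = "Ada"              # str
is_student = True         # bool
scores = [90, 85, 77]     # list

print(type(age), type(weight), type(name), type(is_student), type(scores))

# Converting between types
age_text = str(age)                  # int -> str
label = name + " is " + age_text     # '+' only joins strings to strings
number = int("7") + float("2.5")     # str -> int and str -> float, then add
truth = bool(0)                      # 0 converts to False
letters = list("abc")                # a string becomes a list of characters

print(label)
print(number, truth, letters)
```

Note that name + " is " + age would fail with a TypeError, because Python will not add a string and an integer together; that is why str(age) is needed.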
Variable names are case-sensitive. By convention (the PEP 8 style guide), Python variable and function names are written in lowercase, with underscores between words
Unlike R, and some other languages, when using a list in Python, the first item is #0, not #1. Eg: list[0]
The "slice" of a list is a subset of that list, only some of the items.
Strings can be "sliced" too in the same way to get part of the string
every 2nd item example: important_people[0:4:2]
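Here is a sketch of indexing and slicing. The list name important_people comes from the example above, but its contents are made up here.

```python
important_people = ["Ada", "Grace", "Alan", "Linus", "Guido"]

first_person = important_people[0]        # counting starts at 0
first_three = important_people[0:3]       # items 0, 1 and 2 (3 is not included)
every_second = important_people[0:4:2]    # items 0 and 2 (start:stop:step)

word = "carpentry"
first_three_letters = word[0:3]           # strings slice the same way

print(first_person)
print(first_three)
print(every_second)
print(first_three_letters)
```

The stop position in a slice is never included, so [0:3] gives three items, not four.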
== Some python built-in functions: ==
print (Extremely useful for understanding your code or fixing bugs)
type
str
bool
int
float
round
len
== Getting help in Python ==
- help(functionname), eg help(print)
- In Jupyter Notebook, you can also type print?
- If your cursor is inside a function's parentheses, you can type SHIFT+TAB to bring up some help
To import a Python library, eg pandas or matplotlib: import pandas or import matplotlib
== Using Pandas (a data manipulation library for Python that works similarly to R) ==
- Documentation for pandas: http://pandas.pydata.org/pandas-docs/stable/
- You can also use pandas? in Jupyter
- pandas.read_csv() to read in a csv file into a pandas dataframe
- If you have a pandas dataframe named data:
- data.columns gives you a list of columns
- data.index gives you the row labels; len(data.index) gives you the number of rows
- data.columnname or data['columnname'] gives you the data in that column
- data.iloc[rownumber] gives you the data in that row (in class we used data.ix[rownumber], which newer versions of pandas have removed)
- To subset data, eg only Africa: data[data.continent == "Africa"]
- in a list: data[data.country.isin(['Algeria','Angola'])]
- with regular expression data[data.country.str.match('.*ria$')]
- or more interesting data[data.gdpPercap_1952-data.gdpPercap_1957 >0]
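The commands above can be tried on a tiny dataframe without reading any file. This is only a sketch: the column names follow the gapminder data, but the numbers are made up and are NOT the real gapminder values.

```python
import pandas

# Made-up numbers for illustration only (not real gapminder data).
data = pandas.DataFrame({
    "country": ["Algeria", "Angola", "Nigeria", "Canada"],
    "continent": ["Africa", "Africa", "Africa", "Americas"],
    "gdpPercap_1952": [2400.0, 3500.0, 1000.0, 11000.0],
    "gdpPercap_1957": [3000.0, 3400.0, 1100.0, 12000.0],
})

column_names = list(data.columns)      # the column names
number_of_rows = len(data.index)       # how many rows there are
countries = data["country"]            # one column
first_row = data.iloc[0]               # one row, chosen by position

# Subsetting: the expression inside [] is True or False for each row
africa = data[data.continent == "Africa"]
two_countries = data[data.country.isin(["Algeria", "Angola"])]
ends_in_ria = data[data.country.str.match(".*ria$")]
gdp_fell = data[data.gdpPercap_1952 - data.gdpPercap_1957 > 0]

print(column_names)
print(number_of_rows)
print(africa)
print(gdp_fell)
```

Each subset is itself a dataframe, so you can keep working with it (or subset it again) in the same way.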
== Using Matplotlib (a plotting library for Python) ==
- import matplotlib.pyplot as plt Note: this lets us use plt.scatter() instead of matplotlib.pyplot.scatter(), just a shortcut :)
- To display plots inline in the Jupyter notebook (Jupyter only) %matplotlib inline
- For a scatter plot: plt.scatter(x, y)
- To change the X or Y labels: plt.xlabel(), plt.ylabel()
=================================================================================================================
Day 2
Python Continued
================
== General instructions for using a Python library (eg. How do I find the cos(x)?) ==
- Get help!
- Google something like "python math cos" to find the help page
- Or type math? or math.cos? or dir(math)
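For the cos(x) question, a minimal sketch using the math library that comes with Python (the variable names are just for illustration):

```python
import math

cos_of_zero = math.cos(0)          # the angle is given in radians
cos_of_pi = math.cos(math.pi)      # math.pi is the constant pi

# dir(math) lists everything the library provides; check that cos is there
has_cos = "cos" in dir(math)

print(cos_of_zero, cos_of_pi, has_cos)
```

Outside of Jupyter, help(math.cos) shows the same information as math.cos? does in a notebook.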
== Using Pandas Continued ==
- If you have a pandas dataframe named data:
- data.head() gets you the first few rows, useful to remember what's in there
- data prints out the whole dataset
- data.to_csv() saves your dataset as a CSV file. (useful if you've subset your data, for example)
== For loops in Python ==
- Don't have a done at the end as in shell, and instead rely on indentation. Everything indented is inside the loop
- For loops can be nested inside of each other
For loop syntax:
for loop_variable in list:
    ... do something with loop_variable ...
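A small sketch of a for loop and a nested for loop. The names total, labels, letter and digit are invented for this example.

```python
# A simple loop: the indented lines run once for each item in the list
total = 0
for number in [1, 2, 3]:
    total = total + number
    print("running total:", total)
print("final total:", total)    # not indented, so this runs once, after the loop

# A nested loop: the inner loop runs completely for each item of the outer loop
labels = ""
for letter in ["a", "b"]:
    for digit in [1, 2]:
        labels = labels + letter + str(digit) + " "
print(labels)
```

The only thing telling Python where each loop ends is the indentation, so be consistent with it (4 spaces is the usual choice).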
== If statements in Python ==
- Also don't have a done or endif at the end and rely on indentation, just like for loops
- else or elif let us consider alternatives to the first if condition (elif stands for 'else if')
- else and elif do NOT get indented (they line up with the if), and each one gets its own indented block
- You can use and, or, not to test multiple conditions
- eg if variable == "A" or variable == "B":
- As soon as one condition in an if ... elif ... else sequence matches, all the rest get skipped
If statement syntax:
if variable == test_condition:
    ... do something ...
elif variable == other_test_condition:
    ... do something different ...
else:
    ... do something else ...
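A sketch of if ... elif ... else in action. The grade example and its cutoffs are made up for illustration.

```python
grade = 95

# 95 passes both tests below, but only the first matching block runs
if grade >= 90:
    letter = "A"
elif grade >= 80:
    letter = "B"
else:
    letter = "C"
print(letter)

# Combining conditions with or
variable = "B"
if variable == "A" or variable == "B":
    group = "first two"
else:
    group = "other"
print(group)
```

Because the first match wins, the order of the conditions matters: putting grade >= 80 first would give a 95 the letter "B".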
Accumulators
In class, we did the following problem:
- find the product of all the numbers greater than 2 in the input list
The solution was
input_list = [1, 2, 3, 4, 5]
accum = 1
for num in input_list:
    if num > 2:
        accum = accum * num
print(accum)
To get more practice with the concept of accumulators, try the following variants of the problem we did in class:
1. find the sum of all the numbers greater than 2 in input_list (final value of accum should be 12)
2. produce a list of all the numbers greater than 2 in input_list (final value of accum should be [3, 4, 5])
3. produce a string consisting of the concatenation of the string form of all the numbers greater than 2 in input_list (final value of accum should be the string "345")
Hint: the solutions to the problems above all have the same general structure as the solution to the product problem; only two lines will change: the one setting the initial value of the accumulator, and the one updating the value of the accumulator (inside the if-statement).
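One more variant, which was not done in class and is not one of the exercises: counting how many numbers are greater than 2. It is shown here only as a sketch of the same pattern, where the starting value and the update line are the two things that differ.

```python
input_list = [1, 2, 3, 4, 5]
accum = 0                      # start the count at zero
for num in input_list:
    if num > 2:
        accum = accum + 1      # add one for every number that passes the test
print(accum)
```

For this input_list the final value of accum is 3, since 3, 4 and 5 are greater than 2.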
== Writing your own functions in Python ==
- You've been using functions already made for you so far, but you can also make your own!
- Defining a function doesn't run it. You have to run it yourself AFTER you've defined it (so Python knows to look for it)
- Functions can have more than one argument
- Arguments can have default values, eg def my_function(argument1=1):
- You can add a docstring to a function, just below the def statement
- Start with """ and end with """. Everything inside of those two sets of 3 quotes will show up when you call help(my_function) or my_function?
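Putting those points together in a sketch. The function name my_function and argument1 come from the notes above; argument2 and the body of the function are made up for illustration.

```python
def my_function(argument1=1, argument2=10):
    """Return the sum of argument1 and argument2."""
    return argument1 + argument2

# Nothing has run yet; the function only runs when we call it
default_result = my_function()               # uses both default values
custom_result = my_function(5)               # argument1 is 5, argument2 stays 10
keyword_result = my_function(argument2=0)    # set only argument2, by name

print(default_result, custom_result, keyword_result)
print(my_function.__doc__)                   # the docstring that help() displays
```

Calling help(my_function) shows the function's name, its arguments with their defaults, and the docstring.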
Afternoon: Git
==============
Setup
1. Open your shell and cd to your home directory
2. Go to Socrative, and put in MICKLEY as the room: https://b.socrative.com/login/student/
Follow along with what I typed: https://www.dropbox.com/s/565vbg0c87bne58/git.txt
== Getting a Github Account ==
- Go to https://github.com/ and sign up for a new account
- You can also get GitHub premium features for FREE as an educational professional: https://education.github.com/discount_requests/new
- This comes with unlimited private repositories