A function that calculates the mean in Python:
def mean(x): sum = 0 for i in x: sum += i mean = sum / len(x) return mean
or the same function in R:
mean <- function(x) { sum = 0 for (i in x) { sum = sum + i } mean = sum / length(x) print(mean) }
A function that calculates the mean in Python:
def mean(x): sum = 0 for i in x: sum += i mean = sum / len(x) return mean
or the same function in R:
mean <- function(x) { sum = 0 for (i in x) { sum = sum + i } mean = sum / length(x) print(mean) }
It’s always entertaining when I see computer scientists and engineers make analogies between biology and human artifacts. Then they make predictions about biomedical progress as if it marches at the same ineluctable pace as computer science.
Is the brain like a computer? Is the genome like a program? Let’s examine that. Let’s assume that evolution is a programmer.
First, our programmer has no clue what kind of program she wants to write. Unlike real programmers, who have some goal for their code, evolution just knows that she must write code. It is her nature. Second, she doesn’t know how to program. She’s blind, has no foresight, and doesn’t understand the syntax or logic of her programming language.
Luckily that doesn’t matter much, because she has several saving graces. For one, we gave her a programming language that is extremely forgiving of syntax, logic and execution. A missing parenthesis here or erroneous indentation there doesn’t crash the system. Neither does misspelling the names of variables, most of the time (the analogy here is of the vast neutral fitness zone, where most mutations are basically neutral, and only rarely are they significantly deleterious). It’s like a programming language where every statement is an implicit try: with a fallback of except: pass, but not really, since the point is that the language doesn’t care about trivial errors in syntax. It exhibits smooth fitness gradients.
The second saving grace is that our programmer gets immediate user feedback. They say the reason why FOSS works, despite the lack of revenue, is because of immediate user testing and feedback (see: The Cathedral and the Bazaar by Eric Raymond). Open source is open, and rather than hammering out code for months or years in a closed environment, FOSS developers get constant, immediate feedback from a large testing community, which improves the code faster. Well, that’s evolution. The genome is open source, and there is no production, no final release, only testing.
Our programmer doesn’t know how to code. She mostly types random stuff, mashes on the keyboard, and patches from /dev/urandom. Actually, she almost exclusively patches from /dev/urandom, for lack of any other insight, but everything she writes gets quickly thrown out to user testing (the environment), and she gets immediate feedback.
Along the way, she randomly forks the code into various branches and tries out different things. Over time, the branches diverge to the point where they are no longer interoperable, but some relics of their shared history remain. Eventually, many branches are discarded.
The third saving grace is that she literally has all the time in the world. She’s not constrained by deadlines, and it took her a few hundred million years to publish the first usable code. In this case, it was code that accomplished the singular task of making more copies of itself. And why not? Since she had no goal in mind, it should be obvious that the code most likely to persist would simply be code that replicates itself. All other tasks that it eventually accomplishes are secondary to that goal .
And that’s how it goes for millenia. Despite the massive inefficiency of this process, it works because our programmer is extremely productive. She shotguns the problem. She iterates on the code billions of times a day and handles a bug tracker of unfathomable proportions.
What is the end result? Incredibly complex code with little underlying logic that just works, most of the time. That is why Kurzweil and many computer scientists and engineers are wrong about their predictions of biomedical progress. You’re not just reverse engineering the Kinect or some proprietary code, which you know has a purpose and internal logic. Cracking the genome is not like cracking MD5, where the time it takes to arrive at a cryptographic solution is inversely proportional to computing power. To understand the genome, and biology more generally, you have to decipher every line of code empirically.
That requires research on real biological systems, which is hamstrung by things like reproductive output, generation time, ethics, and pure luck. David Linden is right: empirical progress in the biological sciences is much more linear than you imagine. And if some aspects of it are exponential, the exponent is much smaller than you think.
Kurzweil likes to use the Human Genome Project as an example of exponential growth. It took seven years to sequence the first 1% (or thereabouts), and most scientists directly involved in the project thought it would take much longer to finish the full genome. But they underestimated the power of exponential (sequencing technology) growth, and several doubling times later, a full (draft) sequence was published in 2003. That’s great, but then what?
Then the hard problem of empirically deciphering the genome became relevant, and a decade later, we still don’t have personalized medicine. We’re not even close to understanding the human genome in its entirety. We didn’t even fully appreciate the importance of non-coding RNAs until after it was completed. We have the source code but we don’t know what to make of it.
In the widest interpretation, the genome is code, but it’s nothing like the code that you write. The programmer is nothing like you. And deciphering biology is not a straightforward engineering task, because it wasn’t made by engineers. You can’t look for internal logic, because there is none. You have to decipher each line empirically, and that’s hard work which is not subject to Moore’s Law. You can’t project current computational trends to some distant point in the future and seriously predict the emergence of a specific discovery or innovation to within a few years.
That might work when the entire system is a fabrication of goal-oriented human logic. It doesn’t work for biology.
Just as sequencing the human genome didn’t automatically produce usable knowledge of genetics and development, Kurzweil’s predictions about high resolution neural imaging will not automatically produce usable knowledge of the brain and consciousness. That will require lab work.
In the end, the analogies are superficial, and you can’t reason your way to biological conclusions through the lens of computer science or engineering. You should really learn biology and the travails of biological research to make insightful statements about them.
Sometimes I use batch files to run dozens of commands in Matlab, which can take days to process. This saves time, because if I entered each command manually, some amount of time would be wasted before I realized the previous job was done. Of course, writing the batch file can also take time. So I do something like this:
from os import listdir # Get a list of files to batch process filelist = listdir('/path/to/dir/') # I may only want to process some of those files selectfiles = [filename for filename in filelist if 'term' in filename] batchfile = open('BatchFile.m', 'w') for filename in selectfiles: batchfile.writelines("command('/path/to/files/', '" + filename + "')n")
That occasionally saves me half an hour of mindless typing.
If you import a module like Numpy into your Python interactive session and inspect its contents with dir(), you’ll get a list with 530 items. A saner method is to search within that list:
import numpy [item for item in dir(numpy) if 'poly' in item] ['poly', 'poly1d', 'polyadd', 'polyder', 'polydiv', 'polyfit', 'polyint', 'polymul', 'polysub', 'polyval']
So we turn that into a function:
def derp(module, term): return [item for item in dir(module) if term in item]
Download your sequence and remove any header lines so it looks like this.
seq = '' for line in open('seq.txt'): line = line.rstrip() seq += line gc = 0 for char in seq: if char == 'G' or char == 'C': gc += 1 print 'GC content = %.2f%%' % (gc / len(seq) * 100)
Turns out that BioPython makes life easier. Without removing the FASTA header, we can do:
from Bio.SeqIO import parse from Bio.SeqUtils import GC gc = [GC(record.seq) for record in parse(open('seq.txt'), 'fasta')] print "GC content = %.2f%%" % gc[0]
That will print the GC content of the first record, but if you have more than one record, the others are calculated and can be printed with a for loop.
from random import shuffle def fancysort(thelist): while thelist != sorted(thelist): print thelist shuffle(thelist) print 'The list is sorted!', thelist
I’ve been learning Numpy lately, and one thing that it has in common with Matlab is that the shape of an array is specified according to the m x n convention in matrix algebra. For example,
import numpy as np a = np.arange(1, 11).reshape(2, 5) print a [[ 1 2 3 4 5] [ 6 7 8 9 10]]
As you can see, the reshape() method does what it says, and takes the first argument as m (the number of rows) and the second argument as n (the number of columns). Since the m x n convention is always rows x columns, with the rows specified first, you would expect this to be consistent elsewhere. This works for indexing, so a[1, 3] will correctly grab the value that is in the second row and fourth column (because Python starts indexing at 0). Now, when summing rows and columns, you specify the axis with 0 and 1. Obviously, 0 comes before 1. So if rows come first in reshape() and indexing, they should come first in sum().
print a.sum(axis=0) [ 7 9 11 13 15] print a.sum(axis=1) [15 40]
Of course, it can’t be that easy, and I’m forced to memorize a special case.
Aside from this little annoyance, Numpy is pretty nice, though. You don’t need any special syntax for array operations like you do in Matlab, and it automatically handles ZeroDivisionErrors like Matlab (printing 0 instead of Inf, but whatever).
I’d really like to replace Matlab entirely with Numpy and related tools, which is why I’m going through a tutorial to evaluate it.
Update: And just to add to the confusion, the insert() function specifies the axis with 0 for row and 1 for column, the exact opposite of sum()!!
a = np.arange(1, 16).reshape(3, 5) print a [[ 1 2 3 4 5] [ 6 7 8 9 10] [11 12 13 14 15]] print np.insert(a, 1, 777, axis=0) [[ 1 2 3 4 5] [777 777 777 777 777] [ 6 7 8 9 10] [ 11 12 13 14 15]]
So basically, any sort of indexing feature follows 0 = row, 1 = column, while any descriptive feature (sum, mean, std) follows 0 = column, 1 = row.
I posted about Gitosis before, but I’ve come to the realization that it’s not necessary for my needs. Here’s how I set up a central Git repository with Gitweb on Ubuntu Server. With this configuration, specific people can be allowed to push to the repo, while everyone else can browse through Gitweb.
On the remote machine, create a new user:
sudo adduser git
Now we need to configure the ssh account. If password logins are enabled, you can just do this on your local machine:
ssh-copy-id git@remoteserver.com
Otherwise, if you’ve already copied your ssh key over, then on the remote machine:
sudo mkdir /home/git/.ssh sudo touch /home/git/.ssh/authorized_keys sudo cat ~/.ssh/authorized_keys >> /home/git/.ssh/authorized_keys sudo chown -R git:git /home/git/.ssh
If you want to give others access to the repo, add their keys the same way you did in line 3 above.
Next, limit the git user to git-shell, so others can only execute git commands on your server:
sudo vim /etc/passwd
And change the git line from /bin/bash to /usr/bin/git-shell
Now you can create or upload a repository to the git account. Since the git user is limited to git-shell, you can scp it to your own account and copy it over. If the repository is in a directory under the root of the git account, like /home/git/myproject, then anyone with access can clone it with:
git clone git@remoteserver.com:myproject
That’s it!
Now, to configure Gitweb, install it and edit /etc/apache2/conf.d/gitweb.conf and change $projectroot to /home/git. Then edit your vhost file and add:
RewriteEngine on RewriteRule ^/gitweb/([a-zA-Z0-9_-]+.git)/?(?.*)?$ /cgi-bin/gitweb.cgi/$1 [L,PT] Alias /gitweb /home/git <Directory /home/git> Options Indexes FollowSymlinks ExecCGI DirectoryIndex /cgi-bin/gitweb.cgi AllowOverride None </Directory>
Finally, symlink the icons and stylesheet for the Gitweb front end, and restart Apache:
sudo ln -s /usr/share/gitweb/*.png /home/git/. sudo ln -s /usr/share/gitweb/*.css /home/git/. sudo service apache2 restart
Then you can browser your repositories at remotemachine.com/gitweb.
I’m writing these posts to help myself remember. After all, you don’t really learn something until you teach someone else, right?
So I’ve been version tracking some Matlab scripts with git, and I decided to make a public repository. Here’s how I did it.
First, if you haven’t started tracking changes yet, install git and and initialize a repository:
sudo apt-get install git
cd /path/to/files
git init
git add .
git commit -am “I wish I was as commited as these files!”
Now, on the remote machine, install and setup gitosis:
sudo apt-get install gitosis
sudo -H -u gitosis gitosis-init < ~/.ssh/id_rsa.pub
That last part is my public SSH key. I created the public (gitosis) repo on the same machine as my development (git) repo. If you’re creating the gitosis repo on a remote machine, you’ll have to copy your public key over first. Just remember to put it in a file that ends in .pub.
Ok, back on the development machine, clone the admin repo:
git clone gitosis@mybox.net:gitosis-admin.git
Open up the gitosis config file:
nano [or vim or whatever] /path/to/gitosis-admin/gitosis.conf
And add two sections:
[repo MyProject]
description = “My witty description”
owner = me[group MyProject]
writable = MyProject
members = me bro dannyboy blue gaga
Notice how repo, group and writable are all the same. Also, the members will be based on SSH keys that you add to /path/to/gitosis-admin/keyfiles, and they should all end in .pub.
We need to commit and push these changes:
git commit -am “added MyProject”
git push origin master
Now go to your personal project directory and set that up:
git remote add origin gitosis@mybox.net:myproject.git
git push origin master
That’s it. The other members of your team can clone the project with:
git clone gitosis@mybox.net:myproject.git
And start pushing and pulling to their heart’s desire.
BTW, did you know that Linus Torvalds, the founder of the Linux kernel, created git, and many people consider it to be his greatest software contribution?