10. January 2012 · Categories: Computers, Linux

My small contribution to the FOSS community: over 200 GB of Linux ISOs served via BitTorrent.

27. December 2011 · Categories: Computers, Internet, Linux

YouHaveDownloaded.com claims to track the IP addresses of people who use BitTorrent, but I’ve been torrenting Linux ISOs for months, pushing almost 200 GB of data, and the IP address of that box is not in their database. I don’t know what their methodology is, but they’re obviously not interested in people sharing legal files.

14. December 2011 · Categories: Computers, Internet, Linux

I manage a Debian server that hosts the Evolving Scientist Podcast, among other things. I have no background in computer science or Linux administration. I’m just a Linux fan who loves learning, and administering a “production” server can be simultaneously entertaining, educational, perplexing, and infuriating.

A few weeks ago, I noticed htop reporting a load over 1.0, yet CPU use was always close to 0%. I didn’t think much of it at the time, but as the load persisted, I got more worried. Was this harming my box in some way? It turns out that if it’s not CPU use, it’s probably I/O, since the Linux load average also counts processes stuck waiting on the disk. Something was continuously reading from or writing to the hard drive, and that can’t be good for the drive.
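
One way to check whether a high load comes from I/O rather than CPU is to look for processes in uninterruptible sleep (state D), which Linux includes in the load average. The following is a minimal Python sketch, not a polished tool; the function names parse_proc_stat and blocked_processes are made up for this example, and it assumes a Linux /proc filesystem.

```python
import os


def parse_proc_stat(stat_line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    command = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_proc_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, command))
    return blocked


# Usage on a Linux box (hypothetical):
#     for pid, command in blocked_processes():
#         print(pid, command)
```

If the list stays empty while the load stays high, the cause is probably somewhere else, such as the kernel itself, which is where this story ends up.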

If something was constantly reading from or writing to the hard drive, what was it? Where was it? I just happened to do this:

$ tail -n30 /var/log/syslog
 
kernel: [   76.392218] hub 1-0:1.0: unable to enumerate USB device on port 5
kernel: [   76.580230] hub 1-0:1.0: unable to enumerate USB device on port 5
kernel: [   76.768221] hub 1-0:1.0: unable to enumerate USB device on port 5
kernel: [   76.952222] hub 1-0:1.0: unable to enumerate USB device on port 5
# repeated 30 times

What the hell? The kernel was writing error messages at a rate of 5 to 6 times per second. That might be the problem. But what does “unable to enumerate USB device” mean? Why didn’t I see it before?
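
The message rate can be estimated from the kernel timestamps in the log lines. Here is a small Python sketch, assuming the bracketed numbers are seconds since boot; the function name messages_per_second is invented for this example.

```python
import re

# Matches the kernel timestamp, e.g. "[   76.392218]"
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def messages_per_second(lines):
    """Estimate the message rate from the first and last kernel timestamps."""
    matches = (TIMESTAMP.search(line) for line in lines)
    stamps = [float(match.group(1)) for match in matches if match]
    if len(stamps) < 2 or stamps[-1] == stamps[0]:
        return 0.0
    # n timestamps span n - 1 intervals
    return (len(stamps) - 1) / (stamps[-1] - stamps[0])


sample = [
    "kernel: [   76.392218] hub 1-0:1.0: unable to enumerate USB device on port 5",
    "kernel: [   76.580230] hub 1-0:1.0: unable to enumerate USB device on port 5",
    "kernel: [   76.768221] hub 1-0:1.0: unable to enumerate USB device on port 5",
    "kernel: [   76.952222] hub 1-0:1.0: unable to enumerate USB device on port 5",
]
print(round(messages_per_second(sample), 1))
```

For the four lines shown, this works out to a bit over 5 messages per second.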

A Google search turned up this bug, along with numerous forum posts, speculating and pontificating on the matter. I tried upgrading the kernel. I tried rebooting the server (losing 131 days of uptime), to no avail. I was ready to move everything to a new server. Finally I stumbled across this:

cd /sys/bus/pci/drivers/ehci_hcd/
sudo sh -c 'find ./ -name "0000:00:*" -print | sed "s/\.\///" > unbind'

Immediately the load dropped, and 15 minutes later, it sits comfortably near zero. It was just that simple, but as always, you don’t know what you don’t know.

This is yet another reminder that there are no hard problems, only problems that are hard for a certain level of intelligence and knowledge.

I don’t know why it worked; I was just desperate and wanted it to stop, so I pasted some code from (yet another) tutorial on the net. Now that I have some peace of mind, I can dig deeper.
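
For what it’s worth, here is what that one-liner appears to do. In sysfs, each driver directory lists the devices bound to it and has an unbind file; writing a device’s bus address to that file detaches the driver from the device. The command finds the PCI addresses bound to ehci_hcd (the USB 2.0 host controller driver) and writes them to unbind, which presumably took the misbehaving hub offline along with the other USB 2.0 ports. A rough Python sketch of the same idea follows; the function names are made up, it writes one address per write, it needs root to do anything real, and the change should not survive a reboot.

```python
import fnmatch
import os

DRIVER_DIR = "/sys/bus/pci/drivers/ehci_hcd"


def bound_devices(driver_dir=DRIVER_DIR, pattern="0000:00:*"):
    """Return the sorted PCI addresses currently bound to the driver."""
    return sorted(name for name in os.listdir(driver_dir)
                  if fnmatch.fnmatch(name, pattern))


def unbind_all(driver_dir=DRIVER_DIR, pattern="0000:00:*"):
    """Write each PCI address to the driver's unbind file, one per write."""
    unbind_path = os.path.join(driver_dir, "unbind")
    for address in bound_devices(driver_dir, pattern):
        with open(unbind_path, "w") as handle:
            handle.write(address)


# Usage as root (hypothetical):
#     print(bound_devices())   # see what would be detached first
#     unbind_all()
```

Listing the devices with bound_devices() before unbinding anything is the safer order, since USB keyboards and drives on those ports will stop working.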

02. October 2011 · Categories: Internet, Linux, Research

A while ago I posted a tutorial on setting up your own webdav server to sync Zotero. Although you may not consider your bibliography to be sensitive data, authenticating to the server by sending a plaintext password is a bad idea. Here I’ll show you how to sync over an encrypted connection.
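
To see why plain HTTP is a problem here: with HTTP Basic authentication, the username and password are only base64-encoded, not encrypted, so anyone who can read the traffic can recover them. A short Python sketch of the encoding and decoding follows; the function names are made up for illustration.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic_auth(header_value):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the fields; passwords may contain colons.
    username, _, password = decoded.partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```

No key is involved in the decoding step, which is the whole point: the protection has to come from the TLS layer underneath.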

First, create a self-signed SSL certificate:

mkdir -p /etc/apache2/ssl
openssl req -new -x509 -days 365 -nodes -out /etc/apache2/ssl/ssl.pem \
        -keyout /etc/apache2/ssl/ssl.key

You will be asked a series of questions, but you don’t have to fill out most of those details. Just put your server’s domain name as the common name.

Then edit /etc/apache2/ports.conf and add your server’s IP address:

NameVirtualHost 12.34.56.78:443

Next, edit the configuration file for your webdav server. For example, if your Zotero data is being synced from domain1.net/webdav, then you should append the following to the vhost file for that domain:

<VirtualHost 12.34.56.78:443>
    SSLEngine On
    SSLCertificateFile /etc/apache2/ssl/ssl.pem
    SSLCertificateKeyFile /etc/apache2/ssl/ssl.key
 
    ServerAdmin webmaster@domain1.net
    ServerName domain1.net
    ServerAlias www.domain1.net
    DocumentRoot /path/to/domain1.net/
    ErrorLog /path/to/logs/error.log
    CustomLog /path/to/logs/access.log combined
 
    <Directory /path/to/domain1.net/webdav>
        DAV On
        AuthType Basic
        AuthName "webdav"
        AuthUserFile /path/to/domain1.net/webdav/passwd.dav
        Require valid-user
    </Directory>
</VirtualHost>

Now we need to enable SSL for the web server:

a2enmod ssl
service apache2 restart

Lastly, point Firefox at https://domain1.net and make an exception for your self-signed certificate, then change the sync option in Zotero from HTTP to HTTPS.

18. September 2011 · Categories: Computers, Linux

A standard way of measuring hard drive performance is to write a few hundred megabytes of zeros, like this:

$ dd if=/dev/zero of=test bs=64k count=6400 conv=fdatasync

Most hard drives can write in the range of 40 to 100 MB/s this way. We know RAM is faster, but how much faster? In Linux, we can conveniently mount a tmpfs filesystem, which lives in RAM, and write to that.

$ sudo mount -t tmpfs tmpfs /mnt -o size=1024m
 
$ df -H
 
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda1               51G   5.4G    43G  12% /
/dev/sda5              680G   272G   374G  43% /home
tmpfs                  1.1G      0   1.1G   0% /mnt
 
$ dd if=/dev/zero of=/mnt/test bs=64k count=6400 conv=fdatasync
 
# DDR3 1066 RAM
6400+0 records in
6400+0 records out
419430400 bytes (419 MB) copied, 0.246482 s, 1.7 GB/s

Wow. Even solid state drives write at a maximum of about 250 MB/s. Imagine if hard drives were as fast as RAM. Well, for now, you can do disk IO-heavy work in a temporary RAM-mounted partition. When you’re done, remember to move or delete the files and unmount the partition:

$ rm /mnt/test
 
$ sudo umount -fl /mnt
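
The dd measurement can be approximated in Python as well. This is a rough sketch, not a benchmark tool: the function name write_throughput is invented, it uses os.fsync where dd’s conv=fdatasync uses fdatasync, and interpreter overhead will probably make the numbers a little lower than dd’s.

```python
import os
import time


def write_throughput(path, block_size=64 * 1024, count=6400):
    """Write count blocks of zeros to path, sync, and return bytes per second."""
    block = bytes(block_size)  # a block of zero bytes, like /dev/zero
    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(count):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data actually hit the device
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed


# Usage (hypothetical), comparing the disk with the tmpfs mount:
#     print(write_throughput("test") / 1e6, "MB/s")
#     print(write_throughput("/mnt/test") / 1e6, "MB/s")
```

The defaults mirror the dd call: 6400 blocks of 64 KiB is 419430400 bytes, the same total that dd reports.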

15. August 2011 · Categories: Computers, Internet, Linux

All right, I finally solved that problem of running multiple vhosts on Apache with WSGI. So now evolvingpodcast.net brings back a different result from the IP address itself. As always, the solution was simple. You just don’t know what you don’t know.

It’s always entertaining when I see computer scientists and engineers make analogies between biology and human artifacts. Then they make predictions about biomedical progress as if it marches at the same ineluctable pace as computer science.

Is the brain like a computer? Is the genome like a program? Let’s examine that. Let’s assume that evolution is a programmer.

First, our programmer has no clue what kind of program she wants to write. Unlike real programmers, who have some goal for their code, evolution just knows that she must write code. It is her nature. Second, she doesn’t know how to program. She’s blind, has no foresight, and doesn’t understand the syntax or logic of her programming language.

Luckily that doesn’t matter much, because she has several saving graces. For one, we gave her a programming language that is extremely forgiving of syntax, logic and execution. A missing parenthesis here or erroneous indentation there doesn’t crash the system. Neither does misspelling the names of variables, most of the time (the analogy here is of the vast neutral fitness zone, where most mutations are basically neutral, and only rarely are they significantly deleterious). It’s like a programming language where every statement is an implicit try: with a fallback of except: pass, but not really, since the point is that the language doesn’t care about trivial errors in syntax. It exhibits smooth fitness gradients.

The second saving grace is that our programmer gets immediate user feedback. They say the reason why FOSS works, despite the lack of revenue, is because of immediate user testing and feedback (see: The Cathedral and the Bazaar by Eric Raymond). Open source is open, and rather than hammering out code for months or years in a closed environment, FOSS developers get constant, immediate feedback from a large testing community, which improves the code faster. Well, that’s evolution. The genome is open source, and there is no production, no final release, only testing.

Our programmer doesn’t know how to code. She mostly types random stuff, mashes on the keyboard, and patches from /dev/urandom. Actually, she almost exclusively patches from /dev/urandom, for lack of any other insight, but everything she writes gets quickly thrown out to user testing (the environment), and she gets immediate feedback.

Along the way, she randomly forks the code into various branches and tries out different things. Over time, the branches diverge to the point where they are no longer interoperable, but some relics of their shared history remain. Eventually, many branches are discarded.

The third saving grace is that she literally has all the time in the world. She’s not constrained by deadlines, and it took her a few hundred million years to publish the first usable code. In this case, it was code that accomplished the singular task of making more copies of itself. And why not? Since she had no goal in mind, it should be obvious that the code most likely to persist would simply be code that replicates itself. All other tasks that it eventually accomplishes are secondary to that goal.

And that’s how it goes for millennia. Despite the massive inefficiency of this process, it works because our programmer is extremely productive. She shotguns the problem. She iterates on the code billions of times a day and handles a bug tracker of unfathomable proportions.

What is the end result? Incredibly complex code with little underlying logic that just works, most of the time. That is why Kurzweil and many computer scientists and engineers are wrong about their predictions of biomedical progress. You’re not just reverse engineering the Kinect or some proprietary code, which you know has a purpose and internal logic. Cracking the genome is not like cracking MD5, where the time it takes to arrive at a cryptographic solution is inversely proportional to computing power. To understand the genome, and biology more generally, you have to decipher every line of code empirically.

That requires research on real biological systems, which is hamstrung by things like reproductive output, generation time, ethics, and pure luck. David Linden is right: empirical progress in the biological sciences is much more linear than you imagine. And if some aspects of it are exponential, the exponent is much smaller than you think.

Kurzweil likes to use the Human Genome Project as an example of exponential growth. It took seven years to sequence the first 1% (or thereabouts), and most scientists directly involved in the project thought it would take much longer to finish the full genome. But they underestimated the power of exponential growth in sequencing technology, and several doubling times later, a draft sequence was published in 2001, with the finished sequence following in 2003. That’s great, but then what?

Then the hard problem of empirically deciphering the genome became relevant, and a decade later, we still don’t have personalized medicine. We’re not even close to understanding the human genome in its entirety. We didn’t even fully appreciate the importance of non-coding RNAs until after it was completed. We have the source code but we don’t know what to make of it.

In the widest interpretation, the genome is code, but it’s nothing like the code that you write. The programmer is nothing like you. And deciphering biology is not a straightforward engineering task, because it wasn’t made by engineers. You can’t look for internal logic, because there is none. You have to decipher each line empirically, and that’s hard work which is not subject to Moore’s Law. You can’t project current computational trends to some distant point in the future and seriously predict the emergence of a specific discovery or innovation to within a few years.

That might work when the entire system is a fabrication of goal-oriented human logic. It doesn’t work for biology.

Just as sequencing the human genome didn’t automatically produce usable knowledge of genetics and development, Kurzweil’s predictions about high resolution neural imaging will not automatically produce usable knowledge of the brain and consciousness. That will require lab work.

In the end, the analogies are superficial, and you can’t reason your way to biological conclusions through the lens of computer science or engineering. You should really learn biology and the travails of biological research to make insightful statements about them.

12. July 2011 · Categories: Computers, Linux, Research

If you import a module like NumPy into your Python interactive session and inspect its contents with dir(), you’ll get a list with 530 items. A saner method is to search within that list:

import numpy
 
[item for item in dir(numpy) if 'poly' in item]
 
['poly',
 'poly1d',
 'polyadd',
 'polyder',
 'polydiv',
 'polyfit',
 'polyint',
 'polymul',
 'polysub',
 'polyval']

So we turn that into a function:

def derp(module, term):
    return [item for item in dir(module) if term in item]
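
A possible variation, sketched here under a made-up name (search_names), ignores case by default so that a search for 'poly' also finds names like Polynomial:

```python
def search_names(module, term, ignore_case=True):
    """Return the names in dir(module) that contain term."""
    if ignore_case:
        term = term.lower()
        return [name for name in dir(module) if term in name.lower()]
    return [name for name in dir(module) if term in name]


# Usage in an interactive session (hypothetical):
#     import numpy
#     search_names(numpy, 'poly')
```

Since dir() works on any object, the same function can be pointed at classes and instances, not just modules.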

28. June 2011 · Categories: Computers, Linux

Usually when I uninstall a package on Debian, Ubuntu, or Linux Mint, I want to get rid of all the files associated with that package. Other times, I may plan to reinstall the package later, so I just remove the binaries, but leave the config files. apt-get has the purge and remove commands for these two cases.

However, occasionally I will remove packages and then decide that I never want to reinstall them. So how can I find the leftover files and purge them? Here’s a solution:

dpkg --get-selections > packages.txt
grep deinstall packages.txt | awk '{print $1}' | xargs sudo dpkg -P

You can generate a list of installed packages with the first command. Packages that were removed but not purged, i.e., that have leftover files, are marked with “deinstall”, so it’s a matter of grepping for that word, slicing out the package names, and piping them to dpkg for purging.
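
The filtering step can also be sketched in Python, which makes it easy to review the list before purging anything. The function name and the sample package names below are made up. Unlike a plain grep, this checks the status column only, so a package that merely has “deinstall” in its name would not match.

```python
def removed_but_not_purged(selections_text):
    """Return package names marked "deinstall" in dpkg --get-selections output."""
    packages = []
    for line in selections_text.splitlines():
        fields = line.split()
        # Each line is "<package><whitespace><status>"
        if len(fields) == 2 and fields[1] == "deinstall":
            packages.append(fields[0])
    return packages


# Made-up sample in the format that dpkg --get-selections prints
sample = """\
apache2\t\t\t\tinstall
oldeditor\t\t\tdeinstall
deinstall-helper\t\tinstall
somelib1:amd64\t\t\tdeinstall
"""
print(removed_but_not_purged(sample))
```

To run it against the real package database, the text could come from subprocess.run(["dpkg", "--get-selections"], capture_output=True, text=True).stdout.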

Now show me a solution for Windows as neat and tidy as this. What you get instead is a file system cluttered with leftover files and folders. A utility like CCleaner does a reasonable job of purging cache folders and obsolete registry keys, but you have to dig for some leftover program folders manually.

27. June 2011 · Categories: Linux, Personal, Philosophy

Learning the Linux command line has turned out to be a great decision, now that I work almost exclusively at the computer. Don’t get me wrong: I wouldn’t call myself an expert. I’ve been using the command line for the last two years, but I still learn something new every week. However, the command line has increased my productivity 2- to 10-fold compared to how I would have done things the GUI way.

Quite often, if you dedicate yourself to learning something “hard”, you will get much bigger rewards in the end. A good example is typing. My mother has been using a computer for almost 20 years, and she still hunts-and-pecks. I took a keyboarding class in high school. It was onerous and boring, but now I can type 80-90 wpm, whereas I would have hit an asymptote of about 40 wpm if I were still doing it her way. That commitment has paid dividends for the last 15 years, and if I spent 200 hours in keyboarding class, I’ve gained 2000 back by typing faster.

Now the command line is paying similar dividends. Say that I need to find every CSV file spread out across 20 folders, copy them to a central location, and rename all of them by prepending a word. I can do that in two commands and about 10 seconds. Mucking around a graphical file manager, it would take me an hour.
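
The same job scripts nicely too. Here is a minimal Python sketch of the CSV task, with a made-up function name; it assumes the destination folder lies outside the folders being searched, and that no two CSV files share a name (otherwise the later copy overwrites the earlier one).

```python
import shutil
from pathlib import Path


def collect_csv(source_root, destination, prefix):
    """Copy every *.csv under source_root into destination, prepending prefix."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() lists all matches up front, before any copying starts
    for csv_path in sorted(Path(source_root).rglob("*.csv")):
        target = destination / f"{prefix}{csv_path.name}"
        shutil.copy2(csv_path, target)  # copy2 also keeps timestamps
        copied.append(target)
    return copied


# Usage (hypothetical paths):
#     collect_csv("experiments", "all_csv", "run1_")
```

Either way, the point stands: a few lines of text replace an hour of clicking.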

It’s a good life lesson: learn to do things that are hard.