The magic of join

Often we want to compare or connect two lists and see how they overlap. While this is possible in spreadsheet software (Excel and the like), it is often overly complicated and does not produce the results we would like.

UNIX join to the rescue.

With the following example we can join two files easily. Two things to keep in mind: join splits fields on whitespace by default, so for CSV we have to set the separator with -t,, and both files must be sorted on the join field:

 cat students.csv
 > #studentID,name,semester
 > 0,Peter,3
 > 1,Anna,2
 > 2,Sonja,7
 cat grades.csv
 > #studentID,course,grade
 > 0,Math,40
 > 0,Physics,30
 > 2,Physics,89
 join -t, students.csv grades.csv
 > #studentID,name,semester,course,grade
 > 0,Peter,3,Math,40
 > 0,Peter,3,Physics,30
 > 2,Sonja,7,Physics,89

What happened to Anna? She is not in both files and is thus omitted from the output. If we want to include all entries from one file, we can do so with -a and the file number.

 join -t, -a 1 students.csv grades.csv
 > #studentID,name,semester,course,grade
 > 0,Peter,3,Math,40
 > 0,Peter,3,Physics,30
 > 1,Anna,2
 > 2,Sonja,7,Physics,89

Better! But if we want to use this table to sort by grade or course, there are no entries for Anna. In fact, most parsers will complain that row 4 has fewer fields than the others. We can include empty fields with the added separators using the -o auto output format.

 join -t, -a 1 -o auto students.csv grades.csv
 > #studentID,name,semester,course,grade
 > 0,Peter,3,Math,40
 > 0,Peter,3,Physics,30
 > 1,Anna,2,,
 > 2,Sonja,7,Physics,89

These two added commas will save us a lot of headache down the line.
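If your inputs are not already sorted on the join field, you can sort them on the fly. A minimal sketch using bash process substitution, with the example files from above:

```shell
# join requires both inputs sorted on the join field;
# sort each file on field 1 on the fly before joining
join -t, <(sort -t, -k1,1 students.csv) \
         <(sort -t, -k1,1 grades.csv)
```

Note that the `#studentID` header sorts before the numeric IDs, so it conveniently stays at the top.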

2017/04/06 14:04 · thoelken

DNA binding proteins

I found this very interesting site: Lecture notes on DNA binding proteins

Very nice but extremely ugly and somewhat old course material about DNA binding proteins, covering all the physics relevant for interactions and general DNA folding properties.

2017/03/31 15:08 · thoelken

The fun of GFF parsing

Lately I have come to work a lot with eukaryotic genome annotation in particular (-sigh- prokaryotes are sooo much easier) and have to rely on tools that read GFF- or GTF-formatted annotations.

Once you get into the trenches of the elaborate exon structures of different isoforms, you will notice that neither GFF nor GTF was ever a good idea for quick parsing or any analysis. To make matters worse, both formats only sometimes adhere, very loosely, to some vague (mostly optional) conventions -sigh-.

Some explanations or rather recommendations can be found here:

The last is part of a widely used GFF utility suite including gffread:

A glimmer of light seems to be the python package gffutils:

And the following site tries to validate any of the arbitrarily pieced-together GFFs:

How come all these will-do “standards” have become so widely adopted, and why does every improvement only lead to even more confusion and more unparsable laissez-faire data junk? </rant>
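Not a substitute for a real validator, but a quick first sanity check is easy with awk: a proper GFF record must have exactly nine tab-separated columns. A sketch, with a placeholder file name:

```shell
# report records that do not have exactly 9 tab-separated fields
# (comment and directive lines starting with '#' are skipped)
awk -F'\t' '!/^#/ && NF != 9 { print "line " NR ": " NF " fields" }' annotation.gff
```

This catches truncated lines and stray spaces-instead-of-tabs before they confuse a downstream parser.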

2017/01/27 11:21 · thoelken

Wiping old hard drives securely

TL;DR: Wipe hard drives with shred -v /dev/sda

Scenario: I want to give away my 10 year old self-built desktop computer to complete strangers for free. The hardware was actively used until around one year ago and should still be very functional with a lightweight linux distro. However, I feel somewhat paranoid about giving away my old IDE hard drives, once containing my personal stuff.

So I thought I'd fire up some trusty live image of Linux and overwrite my drives with random data. A little searching brought me to the following command:

# dd if=/dev/urandom of=/dev/sda bs=4M

Note that sda is the target hard drive (not a partition), which must not be mounted, and bs sets a reasonable block size for the chunks of data written per operation.

/dev/urandom, unlike /dev/random, does not block once the pool of entropy (fed by external input such as mouse movements) runs out; it falls back to weaker pseudo-random data and should thus be fairly quick.

However, this operation does not emit any progress indication, and I almost thought my computer had silently crashed when the process was still not finished after a couple of hours. Killing the process with Ctrl + C revealed that dd was writing at 3-4 MB/s, which is slow even for my ancient green-label IDE drives.
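As an aside, newer versions of GNU dd (coreutils 8.24 onwards, if I remember correctly) can report progress themselves, which would have spared me the guessing; same device name as in the example above:

```shell
# status=progress makes dd continuously print bytes written and throughput
dd if=/dev/urandom of=/dev/sda bs=4M status=progress
```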

Further investigation on the net brought me to use the following command:

# shred -v /dev/sda

shred is designed precisely for this kind of job and by default overwrites the drive with three passes of pseudo-random data. The -v flag will keep me informed about the progress. Optionally, you can add the -z flag to write all zeros after the last pass (somewhat disguising the shredding operation).

On my admittedly small hard drives (max. 200 GB) the three passes took a couple of hours, which is way too much effort and time for potentially recycling outdated hardware, but it seemed like the fastest option for secure destruction.
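If three passes are overkill for your threat model, shred can be told to do fewer. A sketch with one random pass plus a final zero pass, using the same device name as above:

```shell
# -n 1: a single pass of pseudo-random data instead of the default three
# -z:   finish with a pass of zeros to disguise the shredding
# -v:   show progress
shred -v -n 1 -z /dev/sda
```

For a drive that is merely being given away (rather than defending against forensic recovery), a single pass cuts the runtime to roughly a third.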

2016/03/05 14:33 · thoelken

System backup using dd

I want to backup my complete SSD with Win7 and UEFI boot partition on it.

Determine which drive is my SSD and which is the HDD where I want to store the backup. Both drives contain NTFS partitions, which most Linux distros should handle just fine.

$ sudo fdisk -l
Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xa2719665

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 3907026943 3907024896  1.8T  7 HPFS/NTFS/exFAT

Disk /dev/sda: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xd9596aeb

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sda1  *         2048    206847    204800   100M  7 HPFS/NTFS/exFAT
/dev/sda2          206848 409599999 409393152 195.2G  7 HPFS/NTFS/exFAT
/dev/sda3  *    409600000 472514559  62914560    30G 83 Linux
/dev/sda4       472514560 488397167  15882608   7.6G 82 Linux swap / Solaris

Disk /dev/sdc: 59.6 GiB, 64023257088 bytes, 125045424 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xa0feffa7

Device     Boot     Start       End   Sectors  Size Id Type
/dev/sdc1  *         2048 102402047 102400000 48.8G 83 Linux
/dev/sdc2       102402048 125044735  22642688 10.8G 82 Linux swap / Solaris

Ok so sda is my SSD, sdb is my HDD and sdc is another SSD from which I run the Linux OS I am currently in.

So let's mount the HDD where I want to put the backup:

$ sudo mkdir /mnt/backup
$ sudo mount /dev/sdb1 /mnt/backup

I want to monitor the progress of the backup and need pv for this:

$ sudo pacman -S pv # for Arch-like distros
$ sudo apt install pv # for Ubuntu-like distros

I use the pv parameters p (progress bar), r (rate), t (elapsed time), e (estimated time) and b (bytes transferred) and open /dev/sda with it; this gets piped through dd with a block size of 4 MB and is then gzipped to the HDD:

$ sudo pv -prteb /dev/sda | dd bs=4M | gzip > /mnt/backup/DATE_DESCRIPTION.img.gz
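For the record, restoring such an image later is essentially the same pipeline in reverse (device and file names as in the backup example above):

```shell
# decompress the image and write it back to the SSD block device
gunzip -c /mnt/backup/DATE_DESCRIPTION.img.gz | sudo dd of=/dev/sda bs=4M
```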
2015/12/31 16:38 · thoelken
