Sat 17 Oct 2009
Disclaimer / Motivation
First of all, it is important to note that this isn’t supposed to be a benchmark. The results of this test should be taken for what they are: they only mean that PHP performed better than Perl for this particular program.
For my thesis, I need to analyse large sets of data. The data is stored in the DataSeries format, a format developed by HP Labs specifically for this kind of thing. I need to do several things with the data, so I created a script to do some basic analysis.
I had several options: I could write a shell script, or use PHP, Perl, or something else instead. I realized that writing a shell script for this would be very complex, so I weighed PHP against Perl. My feeling is that PHP is more suitable for the Web and Perl is more suitable for sysadmin tasks, parsing, etc. I should have chosen Perl then, but my knowledge of Perl is very basic, so I would have needed to learn it first. Unfortunately I am racing against time, so I ended up choosing PHP, since it would be very easy for me to write the script in it.
So I wrote a first version of the PHP script. The script is not optimized whatsoever, but works as expected. The problem with a large dataset is that parsing it takes ages, and my first run was no exception. I started wondering whether a Perl equivalent would be a lot faster than the script I wrote, so I asked a friend to write an equivalent script in Perl. He wrote it and I ran both scripts at the same time on the same server. My friend noticed later that he had left an extra instruction in the main loop that doesn’t exist in the PHP version. Both scripts were already running by then, so I didn’t abort the run. The results were quite surprising: the PHP version took 551m56.349s and the Perl equivalent took 712m16.792s.
- PHP version is: PHP 5.2.6-1+lenny3 with Suhosin-Patch 0.9.6.2 (cli) + APC
- Perl version is: perl, v5.10.0 built for x86_64-linux-gnu-thread-multi
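To put the two timings above side by side, here is a minimal Python sketch that converts them to seconds and computes the ratio. It assumes the figures are the `real` (wall-clock) values printed by `time`, which the post does not state explicitly; the helper name `parse_time` is made up for this illustration.

```python
import re


def parse_time(value: str) -> float:
    """Convert a duration such as '551m56.349s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Timings quoted in the post (assumed to be wall-clock time).
php_seconds = parse_time("551m56.349s")
perl_seconds = parse_time("712m16.792s")
ratio = perl_seconds / php_seconds

print(f"PHP:  {php_seconds:.3f} s")
print(f"Perl: {perl_seconds:.3f} s")
print(f"Perl/PHP: {ratio:.2f}")
```

With these numbers the Perl run took about 1.29 times as long as the PHP run, i.e. roughly 29% longer. Keep in mind that both scripts ran at the same time on the same server.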
The commands I used to run the scripts, and their respective output, follow:
$ time nfsdsanalysis -Z common archive/lindump_total.ds | ./stats_basic.php > stats_basic.txt
$ time nfsdsanalysis -Z common archive/lindump_total.ds | perl stats_basic.pl > stats_basic1.txt
The file lindump_total.ds is an 80 GB file. The output of nfsdsanalysis (what is piped to the script) looks something like this:
# Extent, type='Trace::NFS::common'
packet_at source source_port dest dest_port is_udp is_request nfs_version transaction_id op_id operation rpc_status payload_length record_id
1253831523212739 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 0
1253831523212743 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 1
1253831523212746 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 2
1253831523212748 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 3
1253831523214877 2a2622c2 2049 1a264421 790 TCP response V3 2ffdae28 3 lookup 0 216 4
1253831523214886 2a2622c2 2049 1a264421 897 TCP response V3 2ffca15e 1 getattr 0 88 5
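To give an idea of what processing this output involves, here is a small Python sketch that reads lines in the format above and counts request rows per operation. This is not the actual stats_basic script, whose statistics are not described here. The function name `count_operations` and the choice of tallying requests by operation are assumptions made purely for illustration.

```python
from collections import Counter

# Sample rows copied from the nfsdsanalysis output shown above.
SAMPLE = """\
# Extent, type='Trace::NFS::common'
packet_at source source_port dest dest_port is_udp is_request nfs_version transaction_id op_id operation rpc_status payload_length record_id
1253831523212739 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 0
1253831523212743 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 1
1253831523212746 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 2
1253831523212748 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 3
1253831523214877 2a2622c2 2049 1a264421 790 TCP response V3 2ffdae28 3 lookup 0 216 4
1253831523214886 2a2622c2 2049 1a264421 897 TCP response V3 2ffca15e 1 getattr 0 88 5
"""


def count_operations(lines):
    """Tally request rows per NFS operation (hypothetical statistic)."""
    columns = None
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and '# Extent, ...' markers
        if fields[0] == "packet_at":
            columns = fields  # the header row names the columns
            continue
        if columns is None or len(fields) != len(columns):
            continue  # ignore rows that do not match the header
        row = dict(zip(columns, fields))
        if row["is_request"] == "request":
            counts[row["operation"]] += 1
    return counts


counts = count_operations(SAMPLE.splitlines())
for operation, total in sorted(counts.items()):
    print(operation, total)
```

For the six sample rows this prints `getattr 2` and `lookup 2`. To use it in a pipeline like the ones above, `sys.stdin` could be passed in place of `SAMPLE.splitlines()`, since the function only iterates over lines and never loads the whole input into memory.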
Some people asked me to run the scripts in isolation, i.e., not in parallel like last time. I got optimized versions from several people, and I even got some versions in other languages like Python and C.
Apparently, the Perl version was so slow due to a serious performance bug related to list assignment. Thanks to Pedro Figueiredo for the tip. Just by installing 5.10.1 I got a 37% performance improvement. Even though the improvement was significant, Perl still came in last.
Below you can see the results of the runs of the several optimized scripts in different languages. The results are ordered by run time, from fastest to slowest:
C Version (By Jose Celestino):
$ time nfsdsanalysis -Z common archive/lindump_total.ds | ./stats_basic > stats_basic4.txt
PHP Version (Optimized by Diogo Neves, and modified by me since there were several bugs):
$ time nfsdsanalysis -Z common archive/lindump_total.ds | ./stats_basic_optimized.php > stats_basic5.txt
Some people have already asked me why the user time is greater than the real time. Keep in mind that the server where I ran these scripts has 8 cores: the processes in the pipeline can run on different cores at the same time, so their combined CPU time can exceed the elapsed time.
It really surprised me that Perl performed the worst; I wasn’t expecting it. I also ran PHP without APC and the results were similar.