wirespeed

the hypothetical maximum data transmission rate of a telecommunications medium

Posts Tagged ‘programming’

Subversion – how many times has a file been modified?

Posted by dlandgren on 2011-06-08

Someone at work asked me the other day which file in a project had undergone the most changes. The idea was to look at which files are “hot”, as in frequently touched, and which files are “cold”, rarely edited.

This information is not available directly, but can be assembled from the commit history.

The first step is to produce the list of files changed in each commit since the beginning of time:

svn log -qvr 1:HEAD

which will produce a rather verbose description of what files were changed in each revision:

r382 | david | 2011-04-07 15:32:57 +0200 (Thu, 07 Apr 2011)
Changed paths:
   M /trunk/Assemble.pm
   M /trunk/Changes
   M /trunk/MANIFEST
   M /trunk/README
   M /trunk/t/03_str.t
   M /trunk/t/09_debug.t
   A /trunk/t/10_perl514.t
------------------------------------------------------------------------
r387 | david | 2011-04-17 16:29:59 +0200 (Sun, 17 Apr 2011)
Changed paths:
   M /trunk
   M /trunk/MANIFEST
   M /trunk/t/03_str.t
   M /trunk/t/09_debug.t
   D /trunk/t/10_perl514.t
------------------------------------------------------------------------

What we want to do is throw away all the fluff in each revision stanza, and retain the file paths. A quick Perl one-liner will do that for us, by printing out only the lines between “Changed paths” and a line of dashes. Since this will also include the delimiting lines, the line is also tested to ensure it starts with a space. This is probably overkill, but offers a slightly improved guarantee against surprises.

perl -nle 'print if /^Changed paths:/ ... /^-+$/ and /^\s/'

If the svn log output is piped through this, we obtain

   M /trunk/Assemble.pm
   M /trunk/Changes
   M /trunk/MANIFEST
   M /trunk/README
   M /trunk/t/03_str.t
   M /trunk/t/09_debug.t
   A /trunk/t/10_perl514.t
   M /trunk
   M /trunk/MANIFEST
   M /trunk/t/03_str.t
   M /trunk/t/09_debug.t
   D /trunk/t/10_perl514.t
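
If Perl's range operator is unfamiliar, the same filter can be sketched in JavaScript. This is only an illustration and not part of the pipeline: the function name `changedPathLines` is made up, and it assumes the complete `svn log -qv` output has already been read into a single string.

```javascript
// Hypothetical helper: mimics the Perl flip-flop filter on "svn log -qv" output.
function changedPathLines(logText) {
    const kept = [];
    let inStanza = false;       // true between "Changed paths:" and the dashes
    for (const line of logText.split("\n")) {
        if (!inStanza) {
            // The header itself does not start with a space, so it is never kept.
            if (/^Changed paths:/.test(line)) inStanza = true;
        } else if (/^-+$/.test(line)) {
            inStanza = false;   // the line of dashes closes the stanza
        } else if (/^\s/.test(line)) {
            kept.push(line);    // an indented path line
        }
    }
    return kept;
}
```

The `inStanza` flag plays the role of the hidden state that the `...` operator keeps for you in Perl.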

The next step is to throw away the Subversion action code, and discard any paths not under /trunk (such as /branches or /tags). To do this, we’ll attempt a substitution that eliminates the leading space, some non-space characters and a space, and then capture a path that begins with /trunk. If this succeeds, then print the line:

perl -nle '/^Changed paths:/ ... /^-+$/ and s/^\s+\S+\s+(\/trunk)/$1/ and print'

Now we’re down to:

/trunk/Assemble.pm
/trunk/Changes
/trunk/MANIFEST
/trunk/README
/trunk/t/03_str.t
/trunk/t/09_debug.t
/trunk/t/10_perl514.t
/trunk
/trunk/MANIFEST
/trunk/t/03_str.t
/trunk/t/09_debug.t
/trunk/t/10_perl514.t
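
The substitute-and-print idiom translates to JavaScript roughly as follows. It is a sketch under assumptions: `trunkPath` is an invented name, and it is fed one line at a time.

```javascript
// Hypothetical helper: strip the action code and keep only paths under /trunk.
// Returns null when the line is not a /trunk path, the counterpart of the
// Perl substitution failing and nothing being printed.
function trunkPath(line) {
    const stripped = line.replace(/^\s+\S+\s+(\/trunk)/, "$1");
    // A successful match always removes the leading whitespace, so an
    // unchanged string means the pattern did not match.
    return stripped === line ? null : stripped;
}
```

As with the Perl version, anything after the path on the same line is left in place.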

Now it’s a simple matter to either grep for the file we want to look for, or count how many times each file occurs, and sort the files by the number of times they appear. The latter is done trivially with the Unix toolkit: sort, count unique occurrences, and sort by count:

sort | uniq -c | sort -n

Which results in

...
  40    M /trunk/t/00_basic.t
  58    M /trunk/Changes
  59    M /trunk/t/03_str.t
 109    M /trunk/Assemble.pm
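
For comparison, the counting step can be sketched in JavaScript with a Map. The function name `countOccurrences` is invented for this example, and lines with equal counts may come out in a different order than `sort -n` would give.

```javascript
// Hypothetical helper: count how often each line occurs, least frequent first.
function countOccurrences(lines) {
    const counts = new Map();
    for (const line of lines) {
        counts.set(line, (counts.get(line) || 0) + 1);
    }
    // Ascending by count, so the hottest files end up at the bottom.
    return [...counts.entries()].sort((a, b) => a[1] - b[1]);
}
```

Each entry of the result is a `[line, count]` pair.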

So putting it all together, the magic command is

svn log -qvr 1:HEAD|perl -nle 'print if /^Changed paths:/ ... /^-+$/ and /^\s/' \
    | sort | uniq -c | sort -n

And the deed is done.

Posted in perl, programming | Tagged: , | 1 Comment »

Using Perl to scan a Lotus Notes database quickly

Posted by dlandgren on 2009-04-22

I had a couple of hundred messages lying around in the depths of my work e-mail account. They were old Majordomo subscribe/unsubscribe alerts for a mailing list I managed (until we switched over to Mailman). I had kept them around because one of these days I figured I’d load the information into a database to track the evolution of the subscriber base.

I haven’t managed to get around to doing that yet, but I did want to get rid of the messages. All I needed was the subject line of the mail (which contained all the necessary info) and the date the message was received. The idea of doing it manually would have been a nightmare. I searched around a bit for a way of automating the task, and discovered that it could be done through OLE.

So I wrote a quick Perl program to do it. It went like this:

use strict;
use warnings;

use Win32::OLE;

my $Notes = Win32::OLE->new('Notes.NotesSession')
    or die "Cannot start Lotus Notes Session object.\n";
my $db = $Notes->GetDatabase("MyServer/MyDomain", "mail/mymail.nsf")
    or die "Could not open database.\n";
my $all = $db->AllDocuments;

foreach my $n (1 .. $all->Count) {
    my $doc  = $all->GetNthDocument($n);
    my $item = $doc->GetFirstItem('Subject');
    if (!$item) {
        warn "doc $n has no subject\n";
        next;
    }

    my $subject = $item->{Text};
    next unless $subject =~ /^(?:UN)?SUBSCRIBE my-mailing-list/;
    print $doc->GetFirstItem('DeliveredDate')->{Text}, " $subject\n";
}

and presto, the deed was done. This saved me I don’t know how many hours of mind-crushingly boring and RSI-inducing cutting and pasting. It’s so trivial it’s not even worth bundling up, so this is probably the best place for other people to stumble across it in a search (hi!).

I’m not sure if Strawberry Perl is bundled with Win32::OLE, but this works straight out of the box with ActivePerl.

Posted in perl, programming | Tagged: , | 8 Comments »

Limiting of the number of checkboxes checked in HTML

Posted by dlandgren on 2008-12-07

An issue came up at work on the website I’m developing. One page contains an entry form with a list of elements that may be chosen via a list of checkboxes. The constraint is that only a maximum of three choices out of eight are allowed.

Thinking about this for a bit, it seemed evident that the ideal user experience is to click as many choices as you want, but if more than the maximum allowed are checked then the page should begin to discard the oldest clicks; that is, the first clicked checkbox would return to its original unchecked state. The idea being that you’re most interested in the thing you clicked most recently. If you still want the first choice that was clicked, go back and click it, and then the second choice is discarded. I find this behaviour is the easiest for someone to learn.

In other words, first-in first-out, rather than last-in first-out.

The existing state of the art

This seems like a reasonable thing to do, so I was hoping I could reuse some Javascript published somewhere on the web to handle this for me. I went looking, and found some really miserable answers. The first example allows you to click up to the permitted maximum, and if you try to click an additional checkbox, an error message pops up telling you how stupid you are (in their defense they say Please) and discards the last change. You have to go back and decide what other checkbox you want to uncheck in order to check the one you now want.

My first thought was that this was just someone being lazy. So I kept on looking, and found another bad example. And another. Each time, you can click up to the limit, but go beyond it and you get back an error message. How lazy are these people? No wonder people hate computer programmers. How difficult is it to arrange a user experience with no error messages, that just does the right thing?

Taking this approach to its logical conclusion, if the goal is to prohibit people from going any further, then the ideal solution would be to scan the list of checkboxes once the limit is reached, and disable any checkbox left unchecked. That way the user can never break the limit and no error condition can occur. If the user wanted to choose another checkbox, they would first have to uncheck one. At this point, being one under the limit, the page would re-enable the blocked checkboxes and another checkbox could be chosen.
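
To make that alternative concrete, here is a minimal sketch of the rule, leaving the DOM out of it. The function name `disabledFlags` and its array-of-booleans interface are inventions for this illustration; a real page would copy each result into the corresponding checkbox's `disabled` property.

```javascript
// Hypothetical helper: given the checked state of each box, decide which
// boxes should be disabled. Nothing is disabled until the limit is reached.
function disabledFlags(checked, limit) {
    const count = checked.filter(Boolean).length;
    const atLimit = count >= limit;
    // Only unchecked boxes are ever blocked; checked ones stay clickable
    // so that the user can free up a slot.
    return checked.map(isChecked => atLimit && !isChecked);
}
```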

The big problem is that such a design provides very little feedback to explain what’s happening and is thus not at all intuitive. I think people would click frustratedly at a blocked checkbox and then just give up and either reload the page and start over, or go somewhere else.

After a while I gave up and realised that I’d have to write my own code.

Implementing a better design

I started thinking about how to implement this. At the most basic level, code needs to be fired each time a checkbox is clicked. This means an onclick handler on the input element.

Then there’s the question of state. The page will have to keep track of each checkbox in the form, so we’ll need an array for that. The onclick handler for each checkbox will pass an offset to the underlying function so that the latter knows which entry of the array to update.

Then we shall have to keep track of the order in which the checkboxes are clicked, so that we can discard the oldest in the event of an overflow.

At first I thought about reading the epoch timestamp (the number of seconds since 1970-01-01 UTC) and using that each time a checkbox was clicked. This makes it easy to see which element has the oldest timestamp, but it raised a number of questions I’d need to solve:

  • Can you obtain the epoch timestamp in Javascript? If so, how?
  • Is it an integer or a floating-point value? If the former, we’ll run into trouble when dealing with people who click rapidly on more than one checkbox during the same second. Unlikely with a human, almost a certainty with an automated testing tool, but definitely a flaw.
  • If the underlying time_t variable for the epoch is a 32 bit quantity, we’ll have trouble in the year 2038.
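
For the record, the first two questions have short answers: `Date.now()` returns the number of milliseconds (not seconds) since the epoch as an ordinary Number, which is already far larger than a 32-bit value and is held exactly. The sketch below, with the helper name `clickStamp` made up for the occasion, shows why ties remain possible all the same.

```javascript
// Hypothetical helper: the value one might have stored on each click.
function clickStamp() {
    return Date.now();   // milliseconds since 1970-01-01 UTC, as a Number
}

const first  = clickStamp();
const second = clickStamp();
// Two calls in quick succession will often return the same value,
// which is exactly the tie that makes "oldest" ambiguous.
console.log(first, second, first === second);
```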

It then occurred to me that a much simpler solution is to use a monotonic sequence. Start with a serial number set to 1, and store that when a checkbox is clicked. On each new click, increment the number and store that in the array element that corresponds to the checkbox. The oldest element that has been clicked will have the lowest (non-zero) serial number.

The code of the handler is quite simple. We need to pass in a number referring to the position of the checkbox in the form. If the checkbox has just been checked, we’ll store the serial number and increment it afterwards. If the checkbox has just been unchecked, we’ll wipe out the existing value and store 0 (zero) in its place.

After that’s taken care of, we need to count how many entries in the state array have a non-zero value (meaning that the checkbox is clicked), and we need to keep track of where we saw the lowest non-zero element in the array (in case we have to uncheck it).

After we’ve visited the entire array, if more than the limit (3 in my case) were checked, uncheck the oldest.

In the first round of development I preallocated an array with the necessary number of elements set to zero, to track the checkboxes. After experimenting for a while I remembered that Javascript is perfectly happy to extend arrays lazily and so I was able to just define an empty array, and update elements as needed, and everything else continued to work.

This is important, since it means an additional checkbox can be added to the form and the validation code doesn’t have to be revisited to allocate one extra element for the array.
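
A small sketch of why the lazy approach works. The names `lazyState` and `countChecked` are illustrative only: slots that were never assigned read as `undefined`, and `undefined > 0` is false, so they are counted as unchecked without any preallocation.

```javascript
const lazyState = [];   // nothing preallocated
lazyState[4] = 1;       // writing past the end extends the array to length 5

// Hypothetical helper: count the entries holding a non-zero serial number.
function countChecked(arr) {
    let nr = 0;
    for (let i = 0; i < arr.length; ++i) {
        if (arr[i] > 0) ++nr;   // holes compare as false and are skipped
    }
    return nr;
}

console.log(lazyState.length, countChecked(lazyState));
```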

Basic version

The first version (and the version that I used on my webpage) is as follows:

    var tstamp = new Array;
    var seq    = 1;
    function t(n) {
        tstamp[n] = (document.f.x[n].checked == true) ? seq++ : 0;

        var nr     =  0; // how many have been clicked
        var oldest = -1; // offset of oldest checked
        var i;
        for (i=0; i < tstamp.length; ++i) {
            if (tstamp[i] > 0) {
                ++nr;
                if (oldest < 0 || tstamp[oldest] > tstamp[i]) {
                    oldest = i;
                }
            }
        }

        // more than 3, uncheck oldest
        if (nr > 3) {
            tstamp[oldest] = 0;
            document.f.x[oldest].checked = false;
        }
        return true;
    }

The name of the array “tstamp” retains an echo of the initial assumption that I was going to use timestamps. The code also contains some hardcoded constants relating to the page. The form was named “f” and the checkboxes were all named “x” (thus sending a multivalued result for the CGI “x” parameter).

This can be used with a form that looks something like:

<form name="f">
<input type="checkbox" name="x" value="a" onclick="t(0)" /> apple  <br>
<input type="checkbox" name="x" value="b" onclick="t(1)" /> banana <br>
<input type="checkbox" name="x" value="c" onclick="t(2)" /> carrot <br>
<input type="checkbox" name="x" value="d" onclick="t(3)" /> date
</form>

There are two problems with the code as it stands: firstly the names of the form and its elements are intertwined in the HTML and the Javascript. The second problem is more or less a direct consequence of the first problem, which is that it is essentially impossible to use this code on a page with two or more sets of checkboxes.

The HTML onclick attribute could be amended to pass in the names of the form and the field, and the handler rewritten to use them:

<input type="checkbox" name="x" value="1" onclick="t('f','x',0)" />
    apple<br />
<input type="checkbox" name="x" value="2" onclick="t('f','x',1)" />
    banana<br />

… but there is a lot of repetition going on, which in turn is a maintenance hassle and thus a magnet for errors. And while this might seem to make the code a bit more flexible, we still have a problem with the state array. The sequence counter could be shared between different forms. It would not matter if one array contained (0, 0, 4, 0, 3) and another array contained (1, 2, 0, 5, 0, 0); everything would continue to work. But sharing the state array itself is impossible, without resorting to some very ugly hacks such as reserving elements 0-9 for one array, and 10-20 for another.

A final problem is that it is not at all simple to take this code and use it on other web pages. Since I’m going to go to the trouble of writing this article, I wanted to ensure that if other people want to reuse the final code, it should be easy to do so. Hence, it has to be self-contained, so that it can be included into any page with a minimum of fuss.

Enter…

Object-oriented Javascript

Reframing the problem in terms of object-oriented programming offers a nice solution. If you’ve understood the above, but don’t understand object-oriented programming, I think this makes a nice tutorial.

What we want to do is create, well, an object, to hold a state array, and we may as well toss in a sequence counter for it as well. During the creation, we’ll also record which form and which name the object has to deal with. That means we only have to specify that once. When we use the object in an onclick handler, it already knows which form and which element it is associated with.

This means if there are two forms or checkbox groups to manage, we create two objects. First off, we need a name for the object class. After thinking about this for 3.5 seconds, I decided on CheckboxClamp, since the idea is to clamp the number of checkboxes to a fixed limit. The code required to implement the object is as follows (the main thing to note is that the function that checks the number of checkboxes has migrated within the CheckboxClamp function):

    function CheckboxClamp(formname, fieldname, limit) {
        this.state = new Array;
        this.seq    = 1;
        this.limit  = limit;
        this.fname  = formname;
        this.field  = fieldname;

        this.check = function (n) {
            var f = document.forms[this.fname];
            this.state [n] = (f.elements[this.field][n].checked == true)
                    ? this.seq++
                    : 0;

            var nr     =  0; // how many have been clicked
            var oldest = -1; // offset of oldest checked
            var i;
            for (i=0; i < this.state.length; ++i) {
                if (this.state[i] > 0) {
                    ++nr;
                    if (oldest < 0 || this.state[oldest] > this.state[i]) {
                        oldest = i;
                    }
                }
            }

            // more than combo limit, uncheck oldest
            if (nr > this.limit) {
                this.state[oldest] = 0;
                f.elements[this.field][oldest].checked = false;
            }
            return true;
        }
    }

This can be set up and used as follows:

    <script>
        var u = new CheckboxClamp("f","x", 3);
        var v = new CheckboxClamp("f","y", 2);
    </script>

    <form name="f">
    <input type="checkbox" name="x" value="a" onclick="u.check(0)" /> apple  <br>
    <input type="checkbox" name="x" value="b" onclick="u.check(1)" /> banana <br>
    <input type="checkbox" name="x" value="c" onclick="u.check(2)" /> carrot <br>
    <input type="checkbox" name="x" value="d" onclick="u.check(3)" /> date <br>
    <input type="checkbox" name="y" value="A" onclick="v.check(0)" /> Alice <br>
    <input type="checkbox" name="y" value="B" onclick="v.check(1)" /> Bob <br>
    <input type="checkbox" name="y" value="C" onclick="v.check(2)" /> Carol <br>
    <input type="checkbox" name="y" value="D" onclick="v.check(3)" /> David
    </form>

Not hard at all. If this still seems a little difficult to follow, take a look at a demo that shows what it looks like.

Using this in your own code

To save you the bother of saving and installing the file, you are welcome to link to the implementation here:

  <script type="text/javascript"
      src="http://www.landgren.net/js/checkboxclamp.js"></script>

Put that at the top of your page and you’re done. And if any bugs show up, you’ll get the fix for free. There is a possible bug I can think of, but I’m not sure how easy it would be to trigger: if the state array winds up having two checkboxes over the limit, only one will be removed. This could be addressed by using a while loop (keep unchecking the oldest until we get back to the limit), but right now I’m not going to lose any sleep over it.
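
The while-loop idea could look something like the sketch below. It works on a bare state array rather than on the form, and the name `evictOldest` is hypothetical; it is not what the published file does.

```javascript
// Hypothetical helper: zero out the oldest entries until at most `limit`
// remain, and return the offsets that were evicted (oldest first) so the
// caller can uncheck the matching checkboxes.
function evictOldest(state, limit) {
    const evicted = [];
    while (true) {
        let nr = 0;        // how many are currently checked
        let oldest = -1;   // offset of the lowest non-zero serial number
        for (let i = 0; i < state.length; ++i) {
            if (state[i] > 0) {
                ++nr;
                if (oldest < 0 || state[oldest] > state[i]) oldest = i;
            }
        }
        // Stop once within the limit, or when there is nothing left to evict.
        if (nr <= limit || oldest < 0) return evicted;
        state[oldest] = 0;
        evicted.push(oldest);
    }
}
```

Rescanning the array after each eviction is wasteful in principle, but with a handful of checkboxes it hardly matters.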

One final word of caution

Just because you have limited the number of checkboxes that may be checked on the client side, don’t assume that this means you don’t have to bother checking when the information is posted to the server. You must check the values on the server side as well. For instance, the client may have Javascript disabled. This can occur with the wonderful NoScript Firefox plugin, which protects you from all sorts of nasties on the web. My telephone doesn’t do Javascript at all, although it renders the demo page well enough.

In these circumstances, the user will be able to check every checkbox, and it will be up to the server to handle the situation gracefully. In this scenario an error message might be reasonable, but even then I would tend to discard additional values over the limit (possibly reporting “hey, I threw away choices E, F and G. If you don’t like that, go back and change things”).
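
A server-side version of that policy might be sketched as follows, assuming the submitted values for the field arrive as an array. The function name `clampChoices` and the shape of its result are made up for this example. Note that the server only sees the values in the order they were submitted, not the order in which they were clicked, so "keep the first few" is an arbitrary but predictable choice.

```javascript
// Hypothetical server-side helper: keep at most `limit` submitted values
// and report which ones were thrown away.
function clampChoices(values, limit) {
    return {
        kept: values.slice(0, limit),
        discarded: values.slice(limit),   // empty when within the limit
    };
}
```

The `discarded` list is what you would use to build the “I threw away E, F and G” message.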

So there you have it. Enjoy. (And help stamp out bad user interfaces).

Posted in programming | Tagged: , | 3 Comments »

Changing the type of a column in Postgres

Posted by dlandgren on 2008-11-07

Someone handed me a large spreadsheet at $work and said “We need to turn this into a web site”. So I wrote some code to transfer it to a database. A number of columns were numeric, so I assigned numeric datatypes to those columns.

Afterwards, when I imported the information, Postgresql threw up its hands in horror, complaining about my attempts to stuff non-numeric data into a field designated as numeric. I went back and looked at the spreadsheet more closely, and sure enough, buried away many lines below in the fine print, a cell contained “N/A”. I decided that I would just coerce such a value to zero, and was done with it.

It turned out later on, during a review, that no, “N/A” is really what needs to be displayed. I couldn’t add code to say ‘if the field read from the database is zero, display “N/A”’ because there were already other rows that legitimately contained zero.

What we have in fact is data that is almost always numeric, but not quite. So the solution is to change the column data type. At first I thought I’d be able to just

alter table t1 alter column ca_scientif type varchar(5);

… but Postgres doesn’t like that (surprisingly it doesn’t even produce an error message, it’s just that nothing changes). Hmm. After searching around on the web for a bit, I found half a solution, which I was able to fill out into a complete solution. The idea is to add a new temporary column with the right datatype to the table. Then copy the contents of the old column name over to the new name. Then drop the old column.

Now that the old column name no longer exists, we can create it again, this time with the right datatype. And then we can copy the contents of the temporary column over to the new version of the old column, now with the right datatype.

Finally, we can then drop the temporary column since we don’t need it any more. The exact sequence of SQL statements (for each column that needed munging) looks like:

alter table t1 add column new_ca_scientif varchar(5);
update t1 set new_ca_scientif = cast(ca_scientif as varchar(5));
alter table t1 drop column ca_scientif;
alter table t1 add column ca_scientif varchar(5);
update t1 set ca_scientif = new_ca_scientif;
alter table t1 drop column new_ca_scientif;

After all that’s been done for as many columns as necessary, the table will be littered with dead tuples, so it’s a good idea to tidy up afterwards:

vacuum full t1;
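
Since the same six statements are needed for every column, they can be generated rather than typed. A minimal sketch, assuming trusted table and column names (they are interpolated as-is, with no quoting); `retypeColumn` is a hypothetical name and the `new_` prefix simply follows the convention used above.

```javascript
// Hypothetical generator for the add/copy/drop/add/copy/drop sequence.
function retypeColumn(table, column, newType) {
    const tmp = "new_" + column;   // temporary column name
    return [
        `alter table ${table} add column ${tmp} ${newType};`,
        `update ${table} set ${tmp} = cast(${column} as ${newType});`,
        `alter table ${table} drop column ${column};`,
        `alter table ${table} add column ${column} ${newType};`,
        `update ${table} set ${column} = ${tmp};`,
        `alter table ${table} drop column ${tmp};`,
    ];
}

console.log(retypeColumn("t1", "ca_scientif", "varchar(5)").join("\n"));
```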

And the deed is done.

Update: I discovered a minor flaw in this plan: if the original column had a comment attached to it, that gets thrown away. Hmmm. I needed those comments to help me track Excel columns to Postgresql names.

Fixing this turned out to be more difficult than I thought. Fortunately a question posted on Stackoverflow received a good answer. You have to grovel through the system catalogs to pull out the information. So I wrote a Perl program to do that for me. It’s always more fun to write a program that writes SQL rather than writing the SQL directly. As a result I was able to fully automate the process.
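
The "program that writes SQL" idea, reduced to its simplest piece, might look like the sketch below: given a comment text saved before the column was dropped, emit the statement that puts it back. This is not the Perl program mentioned above; `restoreComment` is a hypothetical name, and the example assumes the comment contains no backslashes. Reading the comment beforehand is the catalog-grovelling part (PostgreSQL's `col_description` function is one way in) and is not shown.

```javascript
// Hypothetical helper: build a "comment on column" statement, doubling any
// single quotes so that the text is a valid SQL string literal.
function restoreComment(table, column, comment) {
    const quoted = "'" + comment.replace(/'/g, "''") + "'";
    return `comment on column ${table}.${column} is ${quoted};`;
}
```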

I’ve made the program available on my website. Keeping the source on my own server allows me to manage the issues of revisions easily.

Enjoy.

Posted in programming | Tagged: , | 3 Comments »