Misc Thoughts

Syncing a Google Spreadsheet to a Calendar

The problem

In Aistemos we manage our holiday (vacation) bookings using a simple Google Apps spreadsheet. It’s not perfect, but it’s simple and lets everyone manage their own entries.

Screen shot of spreadsheet with holiday entries in

Incidentally, the value in the Days column is calculated with the NETWORKDAYS() function:

=NETWORKDAYS(B2,C2,'Bank holidays'!$A$1:$A$10) - IF(D2="Yes", 0.5, 0.0)

That’s all great, but it’s quite hard to see when people are off just by reading a list, so it would be useful if the data in the sheet were somehow synced to a shared calendar, so people could see who was away when.

In theory that’s easy: the Google Spreadsheet API is straightforward to use, and there’s an API to the Google Calendar system, so how hard can it be?

The gotchas

The first gotcha is that it’s not actually possible to create an all-day, multi-day event using the version 2 API. There are some hacks, but they all have serious disadvantages when it comes to updating, removing, or viewing the events created.

So I decided to try the v3 API, which does support multi-day all-day events, but it’s not properly documented yet (I think it’s still in beta), especially the in-app scripting part of it.

Another gotcha is that you have to enable the v3 API in two places before you can use it. If you don’t, you’ll get an error about some object not being available.

Using the v3 API you add / update / delete calendar entries by calling functions with small JavaScript object fragments. This is quite a bit more humane than the v2 API, which uses a chain of badly-named methods. The fragments look like:

{
  "summary": "Bob's Holiday",
  "location": "Sunny place",
  "start": {
    "date": "2014-05-23"
  },
  "end": {
    "date": "2014-05-26"
  }
}

Despite what it looks like, this will create a multi-day event that starts on the 23rd and ends on the 25th (the end date is exclusive), so first we need to add one day to the end date with date.setDate(date.getDate() + 1); [src].
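
To make that concrete, here’s a minimal sketch of fixing up the end date before building the event object (the variable names are mine, not the real script’s):

// The v3 API treats the all-day end date as exclusive, so push the end
// date read from the sheet forward by one day before using it.
var end = new Date(endDate.getTime());   // endDate: the Date from column C
end.setDate(end.getDate() + 1);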

The next challenge is timezones: there’s some complex and unpredictable relationship between the timezone of the spreadsheet user, the timezone of the date in the spreadsheet cell, and the timezone of the calendar. The only sensible way to resolve all these is just to pull the raw un-normalised date numbers from the JavaScript Date object and make up an ISO 8601 date string manually [src]:

Utilities.formatString("%04d-%02d-%02d", date.getFullYear(),
                       date.getMonth()+1, date.getDate());

If we don’t do that, we get hard-to-trace off-by-one errors in the calendar, caused by daylight-saving changes sometimes pushing the date one day later than expected.

Once we’ve done this, we can add dates to the calendar with Calendar.Events.insert() [src], capture the event ID, and store it in an invisible column G of the sheet, so we can delete and update the entries later if the sheet changes… except the update method is buggy and only seems to work once after you’ve created a new event, so I ended up removing and re-inserting on changes.
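
As a rough sketch of the insert path (my own variable names, and formatDate() stands in for the ISO-string helper above), assuming the advanced Calendar service is enabled:

// Insert the event and stash the returned ID in column G so the row can
// be matched up with its calendar entry on later runs.
var event = {
  summary: who + "'s Holiday",
  start: { date: formatDate(startDate) },
  end:   { date: formatDate(endDatePlusOne) }
};
var created = Calendar.Events.insert(event, calendarId);
sheet.getRange(row, 7).setValue(created.id);   // column G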

Note that the delete call is actually called remove() in JavaScript, presumably because delete would collide with a reserved word. Either way, a note to that effect in the documentation would have helped. [src]

You have to be careful when re-inserting to generate a new ID, in case the event has been deleted in the meantime. Otherwise the changes are made to the deleted event, which doesn’t help.
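
The remove-and-re-insert path ends up looking roughly like this (again a sketch with made-up variable names, not the script verbatim):

// Drop the old event if it still exists, then insert a replacement and
// record the fresh ID, so we never write changes to a deleted event.
try {
  Calendar.Events.remove(calendarId, oldEventId);
} catch (e) {
  // already deleted by hand; nothing to do
}
var replacement = Calendar.Events.insert(event, calendarId);
sheet.getRange(row, 7).setValue(replacement.id);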

To see if an event needs (re)creating, we look in column G, find the event ID, and look it up with Calendar.Events.get(), but if you pass a null as the second argument you seem to get a Calendar object back instead of an Event. This is quite dangerous, so we have to guard against that. [src]
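
A minimal sketch of that guard (my own variable names; checking the kind field is one way of being defensive about what comes back):

// Only treat the lookup as a hit if we actually had an ID, and the result
// really is an event resource rather than the calendar itself.
var existing = null;
if (eventId) {
  var result = Calendar.Events.get(calendarId, eventId);
  if (result && result.kind === "calendar#event") {
    existing = result;
  }
}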

The solution

You can see the final script here. It just needs editing to match your IDs and sheet name, and adding to your spreadsheet’s project scripts (Tools | Script editor).

Lastly, we just need to hook a timer function up to the project’s Triggers with the Resources | Current project’s triggers menu item. I added an hourly timer event calling the timerEvent() function:

Screenshot showing trigger setup for the script
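
If you prefer doing it in code, the same trigger can be set up programmatically; a small sketch, assuming the handler is called timerEvent():

// Run timerEvent() once an hour; equivalent to setting it up in the
// Current project's triggers dialog.
ScriptApp.newTrigger("timerEvent")
    .timeBased()
    .everyHours(1)
    .create();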

Future work

There’s a fairly major bug: if someone deletes a row from the spreadsheet, the corresponding calendar event doesn’t get deleted. I’m fairly sure it’s possible to handle this with the onEdit() trigger, but it would involve a load more messing with the script API, and I can’t be bothered right now. It doesn’t happen often enough to be a huge problem for us.

Excel and SPARQL

I often end up running a big SPARQL query (usually on a server), exporting the results as TSV, post-processing them with some combination of vi, perl, awk, sort etc., then loading the processed data into a copy of Excel to get stats out of it, produce a chart, or whatever.

The other day I was wondering if you could pull results directly from a SPARQL endpoint into Excel. Well, it turns out that you can, via something called an Internet Query File.

I’ll give a quick example of how to make it work with a local copy of 4store, then show one on a remote SPARQL endpoint.

Local 4store database

First, create a local 4store database, and import some DOAP files:

$ 4s-backend-setup doap
$ 4s-backend doap
$ 4s-import doap http://4store.org/doap.rdf
$ 4s-import doap http://librdf.org/rasqal/rasqal.rdf
$ 4s-httpd doap

Next, you need to create your .iqy file. It should have the following lines, but needs to use CRLF (\r\n) line endings; you can download this example:

WEB
1
http://localhost:["port", "Port number"]/sparql/?output=text&query=["query","SPARQL query."]

PreFormattedTextToColumns=True
ConsecutiveDelimitersAsOne=False
SingleBlockTextImport=False

Save this file as 4store.iqy.

Now, open Excel, and fill out column A with the following lines (one line per cell):

Query
PREFIX foaf: <http://xmlns.com/foaf/0.1/> PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT ?name ?homepage (GROUP_CONCAT(STR(?l); separator=", ") AS ?licenses)
WHERE { ?proj doap:name ?name ; doap:homepage ?homepage ; doap:license ?l }
GROUP BY ?proj

Next, select a cell under the query, open the Data menu, and pick Get External Data, Run Saved Query…

Run Saved Query

Now pick your .iqy file using the file browser and open it with Get Data. You will get a dialog called Returning Data to Microsoft Excel, or something similar; just take the default options.

Next you will be asked for a port number. If it’s a default 4store install you should enter 8080 in the textbox, and select Use this value/reference for future refreshes.

Enter parameter

Now you need to tell it where to get the query text from: enter =A2:A5 into the text box, and select Use this value/reference for future refreshes.

Enter parameter

When you click OK, after a short delay you should see a table filled out with the results:

Result

Tips & Tricks

You can rerun the SPARQL query by right-clicking on the cell you did the data binding from and selecting Refresh Data.

The output=text CGI parameter in the URI tells 4store to output TSV results instead of SPARQL XML. According to the Microsoft documentation web queries only process HTML, but not only does TSV work, it seems to be a lot faster. The correct way would be to set the Accept header, but I don’t know of any way to do that in the .iqy syntax.
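
For what it’s worth, once the parameters are filled in, the request Excel makes ends up looking something like this (shown here with a trivial query, URL-encoded; the real query text comes from cells A2:A5):

http://localhost:8080/sparql/?output=text&query=SELECT%20%2A%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010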

You might be wondering why the query is broken up over lots of cells. Well, two reasons:

  1. Editing long sections of text in Excel is painful, so it’s easier if you can get to significant portions of the query separately.

  2. If text cells in Excel go over 255 characters long, lots of Bad Things happen internally, and you won’t be able to run the query. For some reason ranges of cells work fine; the strings just get concatenated.

If you’re using 5store, instead of parametrising the port number, you can pick a KB name:

http://localhost:8080/["kb", "KB name"]/sparql?output=text&query=["query","SPARQL query."]

Querying Remote Endpoints

Create the following .iqy file:

WEB
1
http://dbpedia.org/sparql?format=text/html&query=["query","SPARQL query."]

Selection=EntirePage
PreFormattedTextToColumns=True

Or you can download a premade one. Fill out a spreadsheet like:

Query
PREFIX yago: <http://dbpedia.org/class/yago/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?mfr (STR(?n) AS ?name) (STR(?found) AS ?founded)
WHERE { ?mfr a yago:SportsCarManufacturers ; rdfs:label ?n OPTIONAL { ?mfr <http://dbpedia.org/property/foundation> ?found } FILTER(LANG(?n) = "en") }
ORDER BY ?name

And follow the Excel instructions as per a local 4store, but with the dbpedia.iqy file.

Excel Screenshot

For some reason processing HTML results is much slower than TSV. The dbpedia endpoint returns pretty quickly in curl, but Excel takes ages to ingest the data.

You can also download a premade XLSX file with the .iqy file baked in.

CONSTRUCT JSON

I’ve been going on about CONSTRUCT JSON for a while, trying to persuade people to implement it. At the moment it’s just a thought experiment, but I think it’s at least worth blogging about…

So the idea is that it’s pretty common that you have a SPARQL store containing some interesting RDF, and you’d like to expose it to JS developers as JSON data. We’ve done this in Garlik a few times with FOAF and the like, using PHP scripts, but I think there’s an easy way to generalise it.

Imagine there was a CONSTRUCT syntax that, instead of constructing RDF graphs, constructed an array of JSON objects, one per solution, e.g.:

CONSTRUCT JSON {
  {
    "name": ?name,
    "age": ?age 
  }
} 
WHERE {
  ?x a <Person> ;
      <name> ?name ;
      <age> ?age ;
}

So, what this will do is execute the following query:

SELECT ?name ?age 
WHERE {
  ?x a <Person> ;
      <name> ?name ;
      <age> ?age ;
}

And substitute the variable bindings from each row into a JSON array, with one member per solution, e.g.:

[
    {
        "name": "Alice",
        "age": 23 
    },
    {
        "name": "Bob",
        "age": 24 
    }
]

The obvious use for this would be as a config in an HTTP server that sits between a user and a SPARQL endpoint, and does a literal transform from the CONSTRUCT JSON syntax to SPARQL, but ultimately it could be a syntax spoken directly by SPARQL endpoints.

The user would make some RESTful API request like

http://example.com/service/cat/moggy.json

The parameters (“moggy” in this case, see below) would be filled out in a CONSTRUCT JSON template, which would be executed and the results returned as JSON.

The intermediary service could also implement stuff like developer API keys / OAuth / whatever else you need to make a public API on the internet, then you could just configure a service like this with some templates, point it at a SPARQL endpoint, and you have a custom JSON API, for very little effort.

I see this as sitting between things like the Linked Data API and full SPARQL in terms of capability, and ease of configuration.

Each service would require two templates: one that specifies how variables are parsed from a RESTful URI, and the SPARQL-ish query form like in the first example, e.g.:

/service/cat/$NAME.json

PREFIX a: <http://example.com/animal#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT JSON {
  {
    "name": ?name,
    "breed": ?breed,
    "colour": ?colour
  }
}
WHERE {
   ?cat a a:Cat ;
       a:name ?name ;
       a:breed [
         rdfs:label ?breed ;
         a:colour ?colour ;
       ] .
   FILTER(REGEX(?name, $NAME, "i"))
}

Note: I’m abusing SPARQL $ variables a bit, but they’re pretty uncommon in the wild, so I think using them as parameter substitution variables in config files is fine.
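
To make the intermediary idea a little more concrete, here’s a very rough sketch of what the translation layer might do, assuming a Node-style server; toSelectQuery() and runSelect() are hypothetical helpers, and none of this is an existing implementation:

// 1. Match the incoming path against the URI template and pull out $NAME.
// 2. Rewrite the CONSTRUCT JSON body as a plain SELECT over the same WHERE
//    clause, with $NAME substituted in as a quoted string.
// 3. Run the SELECT and map each solution row onto the JSON skeleton.
function handle(path, template) {
  var m = path.match(/^\/service\/cat\/(.+)\.json$/);
  if (!m) throw new Error("no template matches " + path);
  var params = { NAME: JSON.stringify(decodeURIComponent(m[1])) };
  var select = toSelectQuery(template, params);   // hypothetical rewrite step
  return runSelect(select).map(function (row) {   // hypothetical SPARQL client
    return { name: row.name, breed: row.breed, colour: row.colour };
  });
}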

There are probably lots of corner cases to worry about; an obvious one is what to do with OPTIONALs which don’t bind. The “natural” thing is probably to omit any pair where either the key or value is unbound, but that’s not entirely simple w.r.t. JSON syntax, so sending the empty string or something may be easier.

You could also apply the same idea to XML, or whatever else, but JSON is more useful to people writing JavaScript apps.

Just as a quick addendum, in case anyone is worried, I’m not proposing to add this to SPARQL 1.1; it’s just an experiment that I’d like to encourage someone else to implement!

BSBM v3 Post Mortem

http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/

4store was included in the recent BSBM test for the first time. Other than the qualification test, we’d never run the BSBM benchmark before, so weren’t really sure what to expect.

Overall 4store gave a pretty good showing. I’ll go through the various sections, and make some comments on each.

Load

4store did much worse here than I’d expect: the load times shown are equivalent to about 60 kT/s, and generally we see between 150 and 250 kT/s on machines of that class, with “average” data. The BSBM data has lots of very long literals in it, and I think either raptor or 4store doesn’t handle them very efficiently at load time.

I recommended running with 8 segments, but I hadn’t looked at the machine spec closely enough, and I expect it was IO saturated. With fewer segments the import speed would have been better, but not hugely so.

BigOwlim won this test, 4store was second. BigOwlim was about 35% faster with 100MT, and 46% faster with 200MT.

Explore

I didn’t really have any expectations of performance here, as we’ve never tested 4store’s speed against stores other than 5store. The excessive number of segments doesn’t hurt query performance as much as import (there’s less IO involved), so we did pretty well.

Virtuoso came first in the Explore usecase, and 4store was second. Virtuoso was 24% faster with 100MT, and 2% faster at 200MT. You can tell which situation we optimise more for :)

Interestingly you can see the variety in performance drops as we go from 100MT to 200MT:

Store      Perf. drop
4store     18%
BigData    26%
BigOwlim   49%
TDB        37%
Virtuoso   36%

Sadly there’s a bug in the current version of 4store which meant it errored on some queries in the parallel test. The last time we tested this internally, peak performance was at about 10 simultaneous clients, but that was some time ago, and the query engine has changed quite a bit since then.

Explore and Update

4store won this one handily. That’s quite a surprise — the SPARQL Update implementation is quite new, and has had very little optimisation. I guess that’s true for other implementations too.

4store was 47% faster than the second placed store, BigOwlim.

Conclusion

Overall I think 4store did well. One win, and two second places is pretty respectable.

Of the other stores that are mentioned here, Virtuoso came first in the query test, last in the load test, and didn’t place in the update usecase. BigOwlim came first in the load test, and second in the update usecase, and joint fourth in the query test.

4store doesn’t fully exploit single machines — it was originally designed to be run on clusters, and always behaves as if it is. The query engine is split across two halves of a TCP connection, and all the writes have to be streamed across TCP too. However, the vast majority of 4store users run it on single machines, so it’s good to know that it performs well in that situation too.

4store Index Design and SSDs

A couple of people asked about the design of the 4store indexes and SSDs. In a previous post I mentioned that I had considered flash disks when designing the current index.

So, this post is:

  1. Very geeky. Probably not even of much interest to general RDF geeks; it’s really about RDF indexing.
  2. Based on speculation about how SSDs would behave, before I actually got to try one, so probably technically not very accurate.


For anyone that’s still reading, I’ll try to add any notes about the actual behaviour to this post, but I still don’t have that much practical experience. It’s a combination of speculation, reading, and a little experimentation.

I’m a very keen photographer, so have been used to working with Compact Flash cards for some time. The CF interface is really just IDE with a different physical interface, and the advantages of flash media — durability, size, power consumption, and low heat all play very well in data centre applications, so I was expecting flash storage to take over from spinning disk media for applications where capacity isn’t the primary concern, in a reasonably short period of time.

That didn’t actually happen, and it’s taken a few years for SSDs to start to become common in data centres. Consequently we never got to run 4store on SSDs in anger in Garlik, as we’d already migrated our large data apps to 5store by the time SSDs became economic.

One thing I understood from CF cards is that write speeds on NAND flash media are much slower than read speeds, usually 2x slower. This led me to avoid data structures which require significant rebalancing operations, like B-Trees. I’m guessing that was a good choice, though I don’t really have a basis for comparison. I went for a kind of radix trie in the end, which has some real-world advantages when it comes to writing.

You could probably do OK with B-Trees too by keeping rebalancing operations within a limited number of blocks; the blocks are so large in some SSDs that it’s not really that difficult to limit the scope of the writes to a few blocks.

Another factor is that SSD seek times are very, very short, which led me to discount algorithms which place a lot of stock in locality (space-filling curves and the like). Pretty much ignoring locality seems to work fine for reads, but due to the way flash block writes work, locality is still important for writes, though linear ordering isn’t. Native flash writes happen in very large blocks, anything from 128kB to a megabyte or so, so if you can arrange for all the writes to fall within one native block you can save a lot of writes.

Luckily 4store has write ordering anyway, to make it work better on spinning disks, and that seems to be sufficient. You could probably do better if you had that in mind when designing your IO routines, though. Writing native blocks 1, 790, 34 is essentially as fast as 1, 2, 3, from what I can tell.

That’s enough storage geeking for now :)

SPARQL Command Line Client

Some time ago Nick at Garlik wrote a UNIX-flavoured command line tool for interacting with SPARQL query endpoints. He released it under the GPL, and we occasionally make small hacks on it.

I got to thinking about what the most annoying thing about writing SPARQL queries is. For me, it’s remembering and cut-and-pasting all the QName prefixes.

So, I hacked in some simple regex-based scanning of entered queries that looks for QNames without a matching PREFIX declaration, tries to guess the right PREFIX, and adds it to the front of the query.

PREFIXes are searched for in the following priority:

  1. PREFIXes used in this session
  2. PREFIXes used in the history
  3. A lookup on prefix.cc
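
The scanning logic is roughly along these lines (a sketch only; the real tool is written in C, and this ignores plenty of SPARQL syntax corner cases):

// Find prefixes that are used as QNames in the query but never declared.
function missingPrefixes(query) {
  var declared = {}, used = {}, m;
  var declRe = /PREFIX\s+([A-Za-z][\w.-]*):/gi;
  while ((m = declRe.exec(query))) declared[m[1]] = true;
  var qnameRe = /(?:^|[\s({,;])([A-Za-z][\w.-]*):[A-Za-z_]/g;
  while ((m = qnameRe.exec(query))) used[m[1]] = true;
  return Object.keys(used).filter(function (p) { return !declared[p]; });
}

Each prefix that comes back is then looked up in the session, the history, and finally prefix.cc, and the winning PREFIX line is prepended to the query.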

I also added a “sparql-update” command that works the same way (it’s actually the same binary), and uses POST requests to send SPARQL Update requests.

Example:

$ sparql-query http://localhost:8080/sparql/
sparql$ SELECT * WHERE { ?x foaf:knows ?y };
Missing PREFIXes, adding:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
┌───────────┬───────────┐
│ ?x        │ ?y        │
├───────────┼───────────┤
│ <local:b> │ <local:d> │
│ <local:b> │ <local:c> │
│ <local:d> │ <local:e> │
│ <local:a> │ <local:c> │
│ <local:a> │ <local:b> │
└───────────┴───────────┘

Want to store hundreds of megatriples on lowend hardware?

The answer is: buy an SSD. It might seem crazy to spend that much on disks for a lowend machine, but economically it makes sense. More details below.

This post specifically talks about 4store, but probably applies to other RDF stores too.

When I was designing the index used in the current version of 4store, I had in mind that solid state disks were probably coming along, so the IO optimisation is a bit different from how you would normally tune DB indexes, to take more advantage of the low seek and read times, and to allow for the relatively expensive writes, of flash media. However, I didn’t predict the huge internal block sizes in flash storage, so until this weekend I wasn’t sure how it would actually perform. This might mean that 4store takes more advantage of SSD hardware than other RDF stores, but I wouldn’t want to predict that until someone compares them.

Well, at least on the SSD I tried, a Corsair Force F120, the performance is excellent. On a 3 year old, lowend desktop machine with 2x 2GHz cores and 2GB of RAM I managed to import almost 280MT in under 4 hours. The machine wasn’t idle during that time, it was doing its normal job as a file server, but the SSD wasn’t used for anything but 4store indexes.

I’ve not done any formal query performance testing, but anecdotally it seemed to be completely acceptable.

It’s dangerous to extrapolate out from the behaviour of a low-end machine, but I’d expect a more sensibly specced server to be able to go to similar ratios of storage to main memory, hopefully making gigatriples on single nodes a practical proposition.

The data I used was from the US data.gov datasets - it was just the first thing that came to hand, and I strongly believe in testing with real, not synthetic, data. I used the following files: data-1058.nt, data-10.nt, data-57.nt, data-793.nt, data-795.nt, data-805.nt.

The average performance over the whole import came to just over 20kT/s, and the speed over the last few million triples was about 17kT/s, but it wasn’t decreasing very rapidly.

The working-set size for this data is about 30GB, and normally we’d recommend a server with at least 16GB of RAM to hold that amount of data, and probably 24GB would be safer. A Dell R210 with 4GB of RAM and a 120GB SSD (you will need to add that yourself) will set you back about £700, whereas one with 16GB of RAM and a second fast HD will set you back more like £1000, draw significantly more power, and probably have a lower usable capacity. So, economically, SSDs look very sensible on the face of it.

The unknowns at this point are:

  • Will the performance scale usefully to the class of servers you’d actually justify having in a 2011 datacentre?

  • What the lifespan of the SSDs will be - I’m using a consumer one; “enterprise” SSDs have a longer lifespan, but are substantially more expensive. I’d expect a consumer SSD to last long enough to become obsolete with a reasonable daily write load (a few tens of MT/day), based on what I’ve seen so far.

I’ve included an abbreviated log of the import below, and the full log is available elsewhere. The figures in []s are the size of the import, and the average import speed for the last 5 million triples of that import:

$ 4s-import flashtest -v data-*
Reading <file:///storage/rdf/data-1058.nt>
[ 20449300 triples, 87242 triples/s ]
Reading <file:///storage/rdf/data-10.nt>
[ 4719314 triples, 94577 triples/s ]
Reading <file:///storage/rdf/data-57.nt>
[ 26946509 triples, 100992 triples/s ]
Reading <file:///storage/rdf/data-793.nt>
[ 96307678 triples, 28270 triples/s ]
Reading <file:///storage/rdf/data-795.nt>
[ 98193090 triples, 15586 triples/s ]
Reading <file:///storage/rdf/data-805.nt>
[ 32505405 triples, 16346 triples/s ]

Imported 279121296 triples, average 20098 triples/s

$ 4s-size flashtest                
  seg   quads (s)  quads (sr)      models   resources
    0   139533763          +0           7    24076141
    1   139587533          +0           6    24077738

TOTAL   279121296          +0           7    48153879

ISWC 2010 Review

As promised some time ago, here’s the stuff I found interesting at ISWC. Obviously it’s biased by presentations I got to see, and what my current interests are, which are very heavily at the practical end of the scale. I’ve no doubt that there was some great work I didn’t get to see.

Papers

SPARQL Query Optimization on Top of DHTs
http://iswc2010.semanticweb.org/accepted-papers/144

This is a solid bit of work on triplestore optimisers. It’s couched as being over DHTs, but it’s rather like the way some versions of 4store worked, and is a much better writeup than I’d ever have been able to do.

Making sense of Twitter
http://iswc2010.semanticweb.org/accepted-papers/352

Didn’t seem terribly relevant to Linked Data / Semantic Web specifically, but interesting anyway.

dbrec - Music Recommendations Using DBpedia
http://iswc2010.semanticweb.org/accepted-papers/431

A genuinely useful application of Linked Data. Music+RDF was going to get my attention anyway!

Semantic MediaWiki in Operation: Experiences with Building a Semantic Portal
http://iswc2010.semanticweb.org/accepted-papers/397

Semantic MediaWiki seems like a great tool.

Posters

Generating RDF for Application Testing
http://iswc2010.semanticweb.org/accepted-poster-demo/473

Generating test data is a significant problem with any data model; this seems to be an approach that takes advantage of the structure of RDF to make it easier. We have significant challenges in this area inside Garlik, and some of the techniques here could be useful.

Semantics for music researchers: How country is my country?
http://iswc2010.semanticweb.org/accepted-poster-demo/483

Music+RDF+IR. Interesting work. Very BI-type questions, beyond the applicability to the music domain, showing how/why RDF is relevant to BI — though I don’t think that was the intention.

4sr - Scalable Decentralized RDFS Backward Chained Reasoning
http://iswc2010.semanticweb.org/accepted-poster-demo/498

My name was on this poster, but actually I had very little to do with the work, other than the odd discussion over a pint. The work Manuel is doing on practical (from a business point of view, rather than just academic) incomplete reasoning systems is great. It’s a shame more people aren’t working in this area.

[N.B. by practical here I don’t just mean raw performance, there are other considerations to do with maintainability, robustness, footprint, deployment practicalities, and many others. In short, will your IT team want to strangle you when you ask them to install it?]

Billion Triples Track

Creating voiD Descriptions for Web-scale Data
http://www.cs.vu.nl/~pmika/swc/submissions/swc2010_submission_3.pdf

Couldn’t find a link to the paper, but this pre-print describes the work. Interesting, and it points to some techniques for building voiD descriptions quickly, using statistical methods, and heuristics.

Reflections on ISWC2010

Collecting my thoughts from ISWC2010.

It seems that the Semantic Web is slowly inching towards the mainstream, though not really fast enough for my liking. I think that in Garlik we’ve demonstrated that there are real advantages to deploying software on top of the bottom layers of the infamous layer cake (basically RDF + SPARQL) - whether or not you go on to add more advanced things.

ISWC was really focussed on the academic side of things, and that’s as it should be (it’s an academic conference), but personally I would have liked to have seen a little more real-world stuff.

I think I detected a hint of disquiet from the reasoning-focussed researchers that a number of the industrial applications out there aren’t really using their technologies. This doesn’t really come as a surprise; I’ve seen it before with the Hypertext community and the web - the problems we thought we were solving, with link bases, generic linking and so on, turned out not to be significant problems — or at least the cure was worse than the disease.

A good case in point was an excellent talk by Silver Oliver from the BBC about the BBC’s World Cup website, which was largely driven by RDF data. After the talk there was some discussion about data maintenance, and how they tracked players being in different squads at different times. The answer was straightforward: it’s just manually curated data. There was a general feeling of incredulity from the reasoning types in the audience, and some questions about why they hadn’t used such-and-such a type of reasoning and so on. An interesting reaction, but one demonstrating a fundamental lack of understanding of the economics of running a service like that. The cost of having someone manually curate the data is noticeable, but nothing like the cost and risk of deploying complex software which requires an expert to understand. If you build a tool to manage the data, then anyone with an understanding of football can maintain it, whereas finding someone who understands temporal reasoning (even if it takes much less of their time) is a significantly harder role to fill.

It’s not to say that people in the Real World™ aren’t using inference (lower case i), just that it’s often not the classical GOFAI type beloved of researchers. There are people in the bioinformatics domain using classic DL reasoning and so on, but not currently in more mainstream areas, as far as I can see.

There was an interesting trend towards ontology alignment. A couple of sessions that were ostensibly about other topics had a pretty strong vein of ontology alignment problems running through them, in at least one case against the preferences of the chairs. This is a potentially worrying trend, as I’m not really sure that it reflects a real-world problem — yet.

Overall it was a good conference. The state of the art in SPARQL query engines, which is the research topic I really care about, seems to be moving along nicely, though it still has a way to go to catch up with relational databases, but that’s to be expected. There were also several good papers/demos/posters in areas I’m not directly working in, some very inspirational.

Command line mOTP in Perl

Here’s some Perl code that implements mOTP (a standard for two-factor authentication, designed for mobile phones) on the command line.

You run the script with the hash of your secret (an N-digit hex number, without the 0x etc.) as an argument; it asks for the PIN, and pushes the password into the cut buffer so you can paste it into an ssh window with Cmd-V. It only works on OS X, but it would be pretty easy to convert to other systems.

#!/usr/bin/perl -w

# Usage: motp <hex-secret>

use Digest::MD5 qw(md5_hex);
use Term::ReadKey;

my $initsecret = shift;

# mOTP works on 10 second time steps, so round the epoch time down.
my $time = int(time() / 10) * 10;

# Read the PIN without echoing it to the terminal.
print("PIN: ");
ReadMode('noecho');
my $pin = ReadLine(0);
chomp $pin;
ReadMode(0);
print("\n");

# The one-time password is the first 6 hex digits of
# MD5(epoch/10 . secret . PIN); substr($time,0,-1) drops the trailing
# zero, which is the same as dividing the epoch time by 10.
my $pw = substr(md5_hex(substr($time,0,-1).$initsecret.$pin),0,6)."\n";

# Push the password into the OS X cut buffer, ready for Cmd-V.
open(COPY, "| pbcopy") || die "cannot open pbcopy";
print COPY $pw;
close COPY;