Disqus - Latest Comments for chl

Re: live.hackr : Gmahte Wiesn

chl — Fri, 23 Apr 2010 13:29:24 -0000

Although right now it looks more like a "Club Like" than a true "Open Like" ...

Re: An Exercise in Species Barcoding

chl — Tue, 03 Mar 2009 22:13:57 -0000

I experimented a bit with n-gram histograms, using, as Ryszard suggested, cosine as the similarity measure (instead of the Jensen-Shannon divergence used in the paper mentioned above). After filtering out all n-grams containing either "N" or "-" (to mirror Peter's adaptation of the Levenshtein distance), I get the following correlations (edit distance/cosine) and distinct n-gram counts:

 n   r            #
 1  -0.6722       4
 2  -0.8715      16
 3  -0.9088      64
 4  -0.9383     256
 5  -0.9649     967
 6  -0.9754    2926
 7  -0.9839    6240
 8  -0.9882   10299
 9  -0.9900   14202
10  -0.9907   18413
11  -0.9913   22515
12  -0.9916   26555
13  -0.9917   31257
14  -0.9917   36081
15  -0.9915   40961

Using 7-grams and a cutoff value of 0.81, the neighbourhoods match in 1246 of 1248 cases; calculation of the similarity matrix takes ~11s (thanks, NumPy!).
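For anyone who wants to try this, here is a minimal sketch of the n-gram/cosine part as described above. It is not the script that produced the numbers; the function names and the toy sequences are made up for illustration, and the edit-distance side of the comparison is left out.

```python
from collections import Counter

import numpy as np


def ngram_counts(seq, n, skip="N-"):
    """Count the n-grams of seq, dropping any that contain a character in skip."""
    grams = (seq[i:i + n] for i in range(len(seq) - n + 1))
    return Counter(g for g in grams if not any(c in skip for c in g))


def cosine_matrix(seqs, n):
    """Return the pairwise cosine similarities of the n-gram histograms."""
    counts = [ngram_counts(s, n) for s in seqs]
    # One column per distinct n-gram seen in any sequence.
    vocab = sorted(set().union(*counts))
    index = {g: j for j, g in enumerate(vocab)}
    m = np.zeros((len(seqs), len(vocab)))
    for i, c in enumerate(counts):
        for g, k in c.items():
            m[i, index[g]] = k
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # sequences without any valid n-gram stay all-zero
    unit = m / norms
    return unit @ unit.T


# Toy data (hypothetical); real barcode segments are much longer.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTNTTT"]
sim = cosine_matrix(seqs, 3)
neighbours = sim >= 0.81  # same cutoff as above, applied to the toy matrix
```

Building the whole histogram matrix first and then doing a single matrix product is what keeps the similarity computation cheap in NumPy; a sparse matrix would probably be the better choice once the number of distinct n-grams gets large.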

Maybe it's obvious, well-known or both, but I wouldn't have thought that n-grams correlate with edit distance so strongly (at least in this particular case ;-).

Re: An Exercise in Species Barcoding

chl — Tue, 03 Mar 2009 21:43:06 -0000

I think all the machinery you want is in ibol.py.

Re: An Exercise in Species Barcoding

chl — Wed, 25 Feb 2009 17:51:18 -0000

Comparing those procedures for measuring distance would be _very_ interesting, indeed!

As for n-grams, maybe this paper is of interest to you:

Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions
Gregory E. Sims, Se-Ran Jun, Guohong A. Wu, and Sung-Hou Kim
http://www.pnas.org/content...

From the abstract:

"For comparison of whole-genome (genic + nongenic) sequences, multiple sequence alignment of a few selected genes is not appropriate. One approach is to use an alignment-free method in which feature (or l-mer) frequency profiles (FFP) of whole genomes are used for comparison—a variation of a text or book comparison method, using word frequency profiles."

"[...] to illustrate the utility of the method, phylogenies are reconstructed from concatenated mammalian intronic genomes; the FFP derived intronic genome topologies for each l within the optimal range are all very similar. The topology agrees with the established mammalian phylogeny revealing that intron regions contain a similar level of phylogenic signal as do coding regions."

If simple n-gram-based methods turn out to produce interesting results for segments as short as those used in DNA barcoding, that'd be quite exciting (to me, at least ;-).

Re: An Exercise in Species Barcoding

chl — Wed, 25 Feb 2009 13:09:45 -0000

Given that it's the "real task", a few more details on the clustering algorithm would be much appreciated.

Update: Sorry, I didn't realize that all the clustering details I could ever ask for actually are in the Python script mentioned:

http://norvig.com/ibol.py

Re: Gonzo Reader

chl — Wed, 17 Sep 2008 13:51:07 -0000

This is probably the best application on all of the internets. Massive KTHX and mega success!

Re: Parallax

chl — Thu, 14 Aug 2008 20:51:51 -0000

I'm all for entity shift! Or maybe entity pivoting? ;-)

Parallax is a fascinating demonstration for sure; however, a powerful exploration (and query formulation) tool like that makes it all the more obvious how Freebase (still) has a Herculean task in front of them when it comes to data quality & coverage.

Maybe sponsoring DBpedia wouldn't be a bad idea ...

Re: Checking Out Google Trends For Websites

chl — Sun, 22 Jun 2008 10:01:36 -0000

The services are reporting different things: Google Trends reports "Daily Unique Visitors", while the other two services report "Monthly Unique Visitors", it seems (if I'm not misinterpreting labels like "People Counts - Monthly").

The huge disparity could be explained (for example) by indeed.com having a large number of non-repeat visitors (day-to-day). The effect would be especially pronounced when comparing to sites like Twitter, where a big fraction of one day's visitors will visit again the day after.
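To make the arithmetic concrete, here is a toy sketch. The model and all numbers are made up for illustration and are not taken from any of the services: it simply assumes that on every day after the first, a fixed fraction of that day's visitors has already been counted earlier in the month.

```python
def monthly_uniques(daily_uniques, repeat_fraction, days=30):
    """Estimate monthly unique visitors under a deliberately crude model.

    On every day after the first, `repeat_fraction` of that day's visitors
    are assumed to have been counted earlier in the month already.
    """
    new_per_day = daily_uniques * (1.0 - repeat_fraction)
    return daily_uniques + (days - 1) * new_per_day


# Two hypothetical sites with identical daily numbers.
job_site = monthly_uniques(100_000, repeat_fraction=0.1)     # mostly one-off visitors
social_site = monthly_uniques(100_000, repeat_fraction=0.9)  # mostly returning visitors
```

With the same 100,000 daily uniques, the first site ends up at roughly 2.7 million monthly uniques and the second at roughly 390,000, so a ranking by monthly figures can look very different from one by daily figures.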

Update: Whoops, I should actually read the comments before posting ...

Re: Delicious 2.0: We've Been Waiting 9 Months

chl — Mon, 09 Jun 2008 09:08:47 -0000

As a fairly heavy user, I'm not exactly holding my breath for 2.0. del.icio.us stumbled upon an awesome mix of functionality, and frankly I fear that any (heavy) tampering would more likely make it worse.

There's obviously a lot of innovation potential on top of del.icio.us; but does that really have to come from Yahoo?

The one thing I'd really like to see is del.icio.us uncrippling its API. Currently, the request limits are draconian.

Re: live.hackr : TTYtter

chl — Thu, 18 Oct 2007 19:56:29 -0000

Where'd it go?

Re: Visual tools for the socio–semantic web

chl — Mon, 11 Jun 2007 13:06:03 -0000

Congratulations! Great title, great ideas, great design.

Re: Wil Wheaton via Eventful Demand

chl — Sat, 20 May 2006 22:03:17 -0000

I'm sure the folks at eventful.org are pretty happy about all the traffic you're sending them ;-)

Re: Oooo, I like this Idea

chl — Mon, 02 Jan 2006 23:44:57 -0000

Greg is (as usual) spot on - "Fast Multiresolution Image Querying" is the paper that guided the implementation (which is, by the way, one of my all-time favourites, and recommended reading for anyone with even a passing interest in image retrieval).

I first came across it when someone (I think Edd Dumbill) linked to imgSeek a couple of years back; imgSeek is a standalone image management application that incorporates the same algorithm. retrievr is a new implementation in pure Python (plus a host of great libraries: PIL, aggdraw and numarray).

In my experience, the results are usually fairly good, sometimes even stunning - considering the artistic sophistication most of us are able to come up with (gallery forthcoming); and in the cases they're not so stellar, they are at least entertaining ;-) But clearly, the approach has its limits.

One thing to keep in mind is that it doesn't do object/face/text recognition of any kind, so if you're drawing an outline sketch of a chair (or corporate logos like Tara Calishain has tried), it almost certainly won't get you one back (unless your index only contains images of chairs). It helps to think of it as matching the most pronounced slabs of colors. Another thing to know is that there's currently no way to specify the aspect ratio, so you have to rescale the image in your head (things that are close to the borders of the image you're thinking of should be close to the borders of your sketches), but that's really more a missing feature of the drawing flashlet than an inherent problem. Sometimes it also helps to _remove_ detail instead of adding it. And finally, the index covers only about 85k of Flickr's "most interesting" images at the moment (I didn't want to use up even more of their resources before checking back with them; it's fantastic enough that Flickr isn't imposing any up-front limits on API usage like most everyone else is doing).

In a way, I see retrievr less as a "search" tool than an "exploration" tool, and it seems to work very well for that.

Re: Oooo, I like this Idea

chl — Mon, 02 Jan 2006 18:44:57 -0000

In a way, I see retrievr less as a "search" tool than an "exploration" tool, and it seems to work very well for that.

Re: Blog category tags too cumbersome

chl — Thu, 27 Jan 2005 14:42:15 -0000

Right on. I've been using a blog/tag thing for about 2 months now (internally), and it sure rocks.