Disqus - Latest Comments for tfmorris

Re: jq: error – Cannot iterate over null (null)

Tom Morris — Fri, 28 Dec 2018 21:32:09 -0000

I think the alternative operator (//) would have worked if you adjusted the parentheses slightly and used (.answers // [])[]
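For anyone more comfortable in Python, here is a rough sketch of the same idea: fall back to an empty list first, then iterate. The `iter_answers` helper and the sample documents are made up for illustration. Note that Python's `or` also treats empty values as falsy, whereas jq's `//` falls back only on null or false, although the result is the same when the fallback is an empty list.

```python
def iter_answers(doc):
    """Yield each answer, treating a missing or null "answers" key as empty."""
    # `or []` plays the role of jq's `// []`: substitute a fallback
    # before iterating, instead of iterating over null.
    for answer in (doc.get("answers") or []):
        yield answer


# Hypothetical input: one normal record, one null, one with the key missing.
docs = [{"answers": ["a", "b"]}, {"answers": None}, {}]
flattened = [a for d in docs for a in iter_answers(d)]
```

The key point is the grouping: the fallback has to be applied to `.answers` before the iteration happens, not to the result of iterating.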

Re: Tools for Extracting Data and Text from PDFs - A Review

Tom Morris — Wed, 20 Apr 2016 13:07:22 -0000

There's also Apache PDFBox & Apache Tika. The website www.newocr.com is based on Tesseract OCR, so it does deal with image-only PDFs, which your intro says are excluded.

Kind of weird to see the ScraperWiki folks being called "great" in the same post that says they took an open-source project and made it closed source.

Re: Cleaning Google Analytics Data using Open Refine

Tom Morris — Sat, 31 May 2014 02:41:20 -0000

The product has been called OpenRefine since the end of 2012.

Re: Re, Datu skolas rokasgrāmata!

Tom Morris — Wed, 12 Feb 2014 12:48:27 -0000

If anyone else is wondering exactly *where* on GitHub, since there's no link anywhere, it's at: https://github.com/okfn/dat...

Re: Four short links: 28 November 2013

Tom Morris — Fri, 06 Dec 2013 10:53:24 -0000

Appreciate the OpenRefine mention, but your comment about Google is incorrect. Not sure who told you that Google abandoned Refine, but it's not true. It was developed and open sourced by Metaweb before Google acquired them. Google continued to develop it as an open source tool with the help of outside contributors (although they did the bulk of the heavy lifting), then eventually opened up governance of the project. We rebranded at that time and chose a new project lead (me), but other than that it's the same set of committers, the same code, the same goals.

Re: OpenRefine/LODRefine – A Power Tool for Cleaning Data

Tom Morris — Tue, 16 Jul 2013 00:29:17 -0000

Which parts of this couldn't be done with standard Refine? I know the EU funded the "research" (ie repackaging) for LODRefine, so there's probably a geographical bias, but I'm pretty sure there's nothing that the badge-engineered version adds to the standard version for this use case.

Re: iPhylo: Correcting OCR using hOCR in Firefox

Tom Morris — Thu, 11 Jul 2013 14:16:30 -0000

Thanks. I (belatedly) made a Gist out of it to make it easier for others to find. Hope that's ok! https://gist.github.com/tfm...

Re: A Proposal to FHISO

Tom Morris — Mon, 10 Jun 2013 10:48:49 -0000

On second thought, the copyright is a red herring. The specs are all CC-BY-SA and the code is all Apache licensed, so anyone can fork/reuse either, although it would never be a viable fork unless it gathered sufficient mass to counterweight the church (ie Ancestry or a future, successful FHISO).

Although not stated anywhere, I'm sure the license grant doesn't include rights to the GEDCOM X brand, which would be another point of control.

All written texts are copyright. The prominent IR copyright just means that a) Ryan assigned his copyright (probably as part of his employment agreement) and b) someone wants it prominently shown in the specs. Neither changes what you can do with the specs. It's purely a cosmetic annoyance (if you're annoyed by that kind of thing).

Re: A Proposal to FHISO

Tom Morris — Mon, 10 Jun 2013 10:42:27 -0000

Tod - I probably should have split that into two replies. The first question was for you, while the rest of the reply was general commentary on styles of standards development/promotion.

My simple reason for asking was that when Ryan writes "we," I understand it to mean him and his employers. I didn't know if you were part of that "we" or a different "we."

FWIW - when I write, it's as an individual, and my opinions are mine alone.

Re: A Proposal to FHISO

Tom Morris — Mon, 10 Jun 2013 10:21:29 -0000

Tod - who is "we" in the context of your reply?

There are lots of different standards models. HTML is a good example of what happens when a community divides into two competing groups (WHATWG vs W3C). Standards don't have to be open to be successful if there's a dominant market force behind them (GEDCOM classic, Java, Microsoft products). GEDCOM is probably the least successful of those examples, but I'd argue that that's because the church abandoned it and left a vacuum, not because they controlled it unilaterally.

The file format discussion (https://github.com/FamilySe...) is illustrative of the problems with the combination of unilateral control and lack of transparency. There were a few days/weeks of discussion followed by a year of radio silence, then "OK, we've decided" accompanied by throwing a copy of the spec over the wall to the unwashed. The recipients are an "audience," not a "community." There is no real two-way communication.

Re: A Proposal to FHISO

Tom Morris — Sat, 08 Jun 2013 14:56:44 -0000

That's a good question, but more important than the copyright, I think, is the governance model. Will the Mormon church actually share control of the spec? (For those who don't know, Intellectual Reserve is the name of one of the church's wholly owned shell corporations.)

A good counter to the proposal for FHISO to rubber stamp GEDCOM X would be for FHISO to propose that FamilySearch collaborate with the rest of the genealogical community in developing a community-driven standard for genealogical data.

Re: Talking New Media: '68 Blocks': The Boston Globe's series inside Dorchester’s Bowdoin-Geneva neighborhood published as an eBook

Tom Morris — Wed, 05 Jun 2013 13:21:16 -0000

Has this blog moved to a new location?

Re: Nomenklatura - Matching and Reconciliation Made Easy

Tom Morris — Sun, 26 May 2013 13:20:50 -0000

Is there any documentation on the OpenRefine reconciliation API?

Re: DocGraph: Open social doctor data

Tom Morris — Fri, 23 Nov 2012 11:07:36 -0000

Thanks Fred, that makes sense. The examples that I looked at also included insurance IDs, so that's another set of identifiers which could be joined on and/or validated. Having a well-linked (and validated) set of strong identifiers would be a very useful first step.

Re: DocGraph: Open social doctor data

Tom Morris — Fri, 23 Nov 2012 11:03:25 -0000

I agree that "social graph" is a misnomer, even though "social" is all the rage these days.

Re: DocGraph: Open social doctor data

Tom Morris — Mon, 19 Nov 2012 10:54:28 -0000

Even if the state licensing data isn't coded with the NPI number, the NPI file has all the state license numbers in it, so aren't these data sets linked? What am I missing that causes you to state that they're unlinked?

Re: Closing Gender Gaps with Data Science: Women's Voices in UK Media

Tom Morris — Fri, 21 Sep 2012 17:56:01 -0000

Is the pool of authors so big that you need to guess at gender? "Lynn" may be ambiguous, but "Lynn Jones who writes sports for the Guardian" (made up example) most certainly has a specific knowable gender. Also if a name isn't in the ONS dataset, you should consider expanding the search to other sources such as http://genderednames.freeba... (see the bottom of the page for API access)

Re: brain of mat kelcey

Tom Morris — Mon, 20 Aug 2012 19:07:58 -0000

Very cool! I'm not sure whether you're saying your run included English-only filtering or not. There are a bunch of foreign-language month names (Dezember, Oktober) and other foreign words in the list on GitHub. Also, there are some things which are clearly not noun phrases (Posted, Written, Powered, etc). Any idea why they are getting tagged as NPs? Actually, the boilerplate noise (dates, button labels, link labels, etc) highlights one of the downsides of using the supplied simple visible text extraction. It's great because it's already done and super easy to use, but it does have the downside of including boilerplate text.

Re: resemblance with the jaccard coefficient

Tom Morris — Sun, 04 Mar 2012 12:26:48 -0000

Using the processor's POPCNT instruction would be much faster than your little Kernighan inspired loop.
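To make the comparison concrete, here is a minimal Python sketch. It is only an illustration: Python can't issue POPCNT directly, so `bin(x).count("1")` stands in for "hand the whole count to an optimized primitive" (Python 3.10+ also has `int.bit_count()`; in C you'd typically reach for a compiler builtin such as GCC/Clang's `__builtin_popcount`). The function names, and the assumption that the sets are stored as integer bitmasks, are mine rather than taken from the post.

```python
def popcount_kernighan(x: int) -> int:
    """Count set bits by clearing the lowest set bit once per iteration."""
    count = 0
    while x:
        x &= x - 1  # drops the lowest set bit
        count += 1
    return count


def popcount_builtin(x: int) -> int:
    """Count set bits in a single call (stand-in for a hardware popcount)."""
    return bin(x).count("1")


def jaccard(a: int, b: int) -> float:
    """Jaccard coefficient of two sets encoded as non-negative bitmasks."""
    union = popcount_builtin(a | b)
    # Assumption: two empty sets are treated as identical (resemblance 1.0).
    return popcount_builtin(a & b) / union if union else 1.0


resemblance = jaccard(0b1011, 0b0110)
```

The loop runs once per set bit, while a hardware popcount handles a whole machine word in one instruction, which is where the speedup comes from.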

Re: brain of mat kelcey

Tom Morris — Sun, 04 Mar 2012 12:08:07 -0000

Both of your longest phrase examples seem suspicious to me. Did you drop all numbers in your initial parse? Intentionally?

A little Googling confirms my suspicions that these were actually of the form:

United Nations Security Council Resolution 1699, adopted unanimously on August 8, 2006, after recalling
As of the census [ 1 ] of 2000, there were 9536 people, 3922 households, and 2517 families residing in the city.

To answer your question about templates vs cut & paste, templates (infoboxes) are excluded from the body text, but this type of stylized pro forma structure is pretty common in Wikipedia. Some of it's from cut & paste or a single author working on a series of related articles, but often it's a semi-formal convention adopted by a group of authors.

There are a number of Wikipedia-isms that would be fascinating to study statistically if there was an equivalent corpus to compare against. For example, I suspect the frequency of the word "notable" is much higher in Wikipedia than elsewhere because they've got a notability requirement for inclusion, so authors writing about marginal cases take pains to stress why their subject is "notable."

Re: Anonymizing user, company, and location data using Faker

Tom Morris — Thu, 23 Feb 2012 11:21:41 -0000

As @tll says, this should have "anonymizing" in huge quotes. People reading this article could easily develop a false sense of security that randomly replacing the fields in a few columns of their DB actually provides an adequate level of protection.
Google "de-anonymization" or "re-identification" to get an idea of how easy it is to correlate and re-identify personal information. There have been a number of high profile PR fiascos including the Netflix contest data, AOL search logs, and others. Do you want your company to be the next?

Re: As Journal Boycott Grows, Elsevier Defends Its Practices

Tom Morris — Wed, 15 Feb 2012 18:50:41 -0000

[This is intended to be a reply to Mr. Gunn's reply, but that comment has no Reply button. Sorry for the thread confusion!] One of the very first lines in that RSS feed is {copyright}Copyright 2012, Mendeley Ltd. {/copyright} [begin/end tags modified so they don't get munged].

If Mendeley isn't attempting to assert copyright on the RSS feed, then that line should be removed. Also, copyright and licensing are disjoint. Saying stuff is CC-BY (where the BY in this case is Mendeley, not the contributor) doesn't say anything about its copyright status (although if it's not copyright or Mendeley isn't the copyright holder, I'm not sure why I'd care what license they think should apply). If you're attempting to assert that the RSS feed consists of licensed data, it would be useful to include license information in the feed in addition to the copyright statement.

Re: As Journal Boycott Grows, Elsevier Defends Its Practices

Tom Morris — Fri, 10 Feb 2012 09:35:35 -0000

Kind of ironic, given the present discussion, that Mendeley claims copyright on that list of papers curated by scientists around the world. http://www.mendeley.com/gro...

Re: Andrey Tarantsov: The third definition of open, or How I nearly picked GPL for my product, but ended up simply publishing the source with no license (for now)

Tom Morris — Thu, 09 Feb 2012 10:07:59 -0000

Wow, that's long. I'd encourage you to continue learning about intellectual property protection and your options. For example, you can license the software under whatever license you want and still reserve rights to the name, preventing other people from using it without your permission. Dual licensing (GPL + commercial) is an option if you are the sole contributor (or all contributors agree). Making money is at least as much about your business strategy as the particular license.

In my opinion, not choosing a license is the worst of all worlds. Conscientious and legal minded people are completely prevented from using your code while unscrupulous types are given a tiny opening to claim that they thought it was public domain. At least put a clear copyright notice with your name and date on the code.

Re: iPhylo: Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

Tom Morris — Mon, 06 Feb 2012 09:32:10 -0000

A way to do this which preserves your reconciliation data is to "Create a column based on this column" for the Names column and just use 'value' to copy the original value to a new column named 'Original Names' or some such.

If you just want to facet on name mismatches, you can create a custom text facet on the reconciled column using the expression 'value == cell.recon.match.name' (or perhaps 'value.toLowercase() == cell.recon.match.name.toLowercase()' for a case-insensitive comparison).
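Outside of Refine, the logic of that facet can be sketched in a few lines of Python. This is just an illustration of the comparison, not how Refine implements facets; the rows and the `names_match` helper are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (original cell value, name of the reconciled match).
rows = [
    ("Homo sapiens", "Homo sapiens"),
    ("homo sapiens", "Homo sapiens"),
    ("Pan troglodites", "Pan troglodytes"),
]


def names_match(value: str, match_name: str, ignore_case: bool = False) -> bool:
    """Return True when the original value equals the matched name."""
    if ignore_case:
        return value.lower() == match_name.lower()
    return value == match_name


# Each Counter mimics the facet: how many rows fall under True and False.
exact_facet = Counter(names_match(v, m) for v, m in rows)
loose_facet = Counter(names_match(v, m, ignore_case=True) for v, m in rows)
```

Selecting the False bucket is what leaves you with only the rows whose original name differs from the name it was reconciled to.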