<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Disqus - Latest Comments for tfmorris</title><link>http://disqus.com/by/tfmorris/</link><description></description><atom:link href="http://disqus.com/tfmorris/comments.rss" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 28 Dec 2018 21:32:09 -0000</lastBuildDate><item><title>Re: jq: error &amp;#8211; Cannot iterate over null (null)</title><link>http://www.markhneedham.com/blog/2015/10/09/jq-error-cannot-iterate-over-null-null/#comment-4260518132</link><description>&lt;p&gt;I think the alternative operator (//) would have worked if you adjusted the parentheses slightly and used &lt;code&gt;(.answers // [])[]&lt;/code&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Fri, 28 Dec 2018 21:32:09 -0000</pubDate></item><item><title>Re: Tools for Extracting Data and Text from PDFs - A Review</title><link>http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html#comment-2633821644</link><description>&lt;p&gt;There's also Apache PDFBox &amp;amp; Apache Tika. 
The website &lt;a href="http://www.newocr.com" rel="nofollow noopener" target="_blank" title="www.newocr.com"&gt;www.newocr.com&lt;/a&gt; is based on Tesseract OCR, so does deal with image-only PDFs, which your intro says are excluded.&lt;/p&gt;&lt;p&gt;Kind of weird to see the ScraperWiki folks being called "great" in the same post that says they took an open-source project and made it closed source.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Wed, 20 Apr 2016 13:07:22 -0000</pubDate></item><item><title>Re: Cleaning Google Analytics Data using Open Refine</title><link>http://localhost/wordpress-3.7.1/cleaning-google-analytics-data-using-google-refine/#comment-1413223937</link><description>&lt;p&gt;The product has been called OpenRefine since the end of 2012.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sat, 31 May 2014 02:41:20 -0000</pubDate></item><item><title>Re: Re, Datu skolas rokasgrāmata!</title><link>http://lv.schoolofdata.okblogfarm.org/rokasgramata/#comment-1241457431</link><description>&lt;p&gt;If anyone else is wondering exactly *where* on Github since there's no link anywhere, it's at: &lt;a href="https://github.com/okfn/datawrangling" rel="nofollow noopener" target="_blank" title="https://github.com/okfn/datawrangling"&gt;https://github.com/okfn/dat...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Wed, 12 Feb 2014 12:48:27 -0000</pubDate></item><item><title>Re: Four short links: 28 November 2013</title><link>http://radar.oreilly.com/2013/11/four-short-links-28-november-2013.html#comment-1152959151</link><description>&lt;p&gt;Appreciate the OpenRefine mention, but your comment about Google is incorrect. Not sure who told you that Google abandoned Refine, but it's not true.  It was developed and open sourced by Metaweb before Google acquired them.  
Google continued to develop it as an open source tool with the help of outside contributors (although they did the bulk of the heavy lifting), then eventually opened up governance of the project.  We rebranded at that time and chose a new project lead (me), but other than that it's the same set of committers, the same code, the same goals.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Fri, 06 Dec 2013 10:53:24 -0000</pubDate></item><item><title>Re: OpenRefine/LODRefine &amp;#8211; A Power Tool for Cleaning Data</title><link>http://lv.schoolofdata.okblogfarm.org/?p=5408#comment-963917095</link><description>&lt;p&gt;Which parts of this couldn't be done with standard Refine?  I know the EU funded the "research" (ie repackaging) for LODRefine, so there's probably a geographical bias, but I'm pretty sure there's nothing that the badge-engineered version adds to the standard version for this use case.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Tue, 16 Jul 2013 00:29:17 -0000</pubDate></item><item><title>Re: iPhylo: Correcting OCR using hOCR in Firefox</title><link>http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html#comment-958923093</link><description>&lt;p&gt;Thanks.  I (belatedly) made a Gist out of it to make it easier for others to find.  Hope that's ok! &lt;a href="https://gist.github.com/tfmorris/5977784" rel="nofollow noopener" target="_blank" title="https://gist.github.com/tfmorris/5977784"&gt;https://gist.github.com/tfm...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Thu, 11 Jul 2013 14:16:30 -0000</pubDate></item><item><title>Re: A Proposal to FHISO</title><link>http://familysearch.github.com/gedcomx/2013/06/05/fhiso-proposal.html#comment-925149758</link><description>&lt;p&gt;On second thought, the copyright is a red herring.  
The specs are all CC-BY-SA and the code is all Apache licensed, so anyone can fork/reuse either, although it would never be a viable fork unless it gathered sufficient mass to counterweight the church (ie Ancestry or a future, successful FHISO).&lt;/p&gt;&lt;p&gt;Although not stated anywhere, I'm sure the license grant doesn't include rights to the GEDCOM X brand, which would be another point of control.&lt;/p&gt;&lt;p&gt;All written texts are copyright.  The prominent IR copyright just means that a) Ryan assigned his copyright (probably as part of his employment agreement) and b) someone wants it prominently shown in the specs.  Neither changes what you can do with the specs.  It's purely a cosmetic annoyance (if you're annoyed by that kind of thing).&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Mon, 10 Jun 2013 10:48:49 -0000</pubDate></item><item><title>Re: A Proposal to FHISO</title><link>http://familysearch.github.com/gedcomx/2013/06/05/fhiso-proposal.html#comment-925142306</link><description>&lt;p&gt;Tod - I probably should have split that into two replies.  The first question was for you, while the rest of the reply was general commentary on styles of standards development/promotion.&lt;/p&gt;&lt;p&gt;My simple reason for asking was that when Ryan writes "we," I understand it to mean him and his employers.  I didn't know if you were part of that "we" or a different "we."&lt;/p&gt;&lt;p&gt;FWIW - when I write, it's as an individual, and my opinions are mine alone.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Mon, 10 Jun 2013 10:42:27 -0000</pubDate></item><item><title>Re: A Proposal to FHISO</title><link>http://familysearch.github.com/gedcomx/2013/06/05/fhiso-proposal.html#comment-925124791</link><description>&lt;p&gt;Tod - who is "we" in the context of your reply?&lt;/p&gt;&lt;p&gt;There are lots of different standards models.  
HTML is a good example of what happens when a community divides into two competing groups (WHATWG vs W3C).  Standards don't have to be open to be successful if there's a dominant market force behind them (GEDCOM classic, Java, Microsoft products).  GEDCOM is probably the least successful of those examples, but I'd argue that that's because the church abandoned it and left a vacuum, not because they controlled it unilaterally.&lt;/p&gt;&lt;p&gt;The file format discussion (&lt;a href="https://github.com/FamilySearch/gedcomx/issues/185" rel="nofollow noopener" target="_blank" title="https://github.com/FamilySearch/gedcomx/issues/185"&gt;https://github.com/FamilySe...&lt;/a&gt;) is illustrative of the problems with the combination of unilateral control and lack of transparency.  There were a few days/weeks of discussion followed by a year of radio silence then "OK, we've decided" accompanied by throwing a copy of the spec over the wall to the unwashed.  The recipients are an "audience," not a "community."  There is no real two-way communication.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Mon, 10 Jun 2013 10:21:29 -0000</pubDate></item><item><title>Re: A Proposal to FHISO</title><link>http://familysearch.github.com/gedcomx/2013/06/05/fhiso-proposal.html#comment-923478575</link><description>&lt;p&gt;That's a good question, but more important than the copyright, I think, is the governance model.  Will the Mormon church actually share control of the spec? 
(For those who don't know, Intellectual Reserve is the name of one of the church's wholly owned shell corporations.)&lt;/p&gt;&lt;p&gt;A good counter to the proposal for FHISO to rubber stamp GEDCOM X would be for FHISO to propose that FamilySearch collaborate with the rest of the genealogical community in developing a community-driven standard for genealogical data.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sat, 08 Jun 2013 14:56:44 -0000</pubDate></item><item><title>Re: Talking New Media: &amp;#039;68 Blocks&amp;#039;: The Boston Globe&amp;#039;s series inside Dorchester&amp;rsquo;s Bowdoin-Geneva neighborhood published as an eBook</title><link>http://beta.boston.com/post/44236086254#comment-920003380</link><description>&lt;p&gt;Has this blog moved to a new location?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Wed, 05 Jun 2013 13:21:16 -0000</pubDate></item><item><title>Re: Nomenklatura - Matching and Reconciliation Made Easy</title><link>http://okfnlabs.org/blog/2013/05/16/nomenklatura-matching-service-reconciliation-made-easy.html#comment-909176235</link><description>&lt;p&gt;Is there any documentation on the OpenRefine reconciliation API?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sun, 26 May 2013 13:20:50 -0000</pubDate></item><item><title>Re: DocGraph: Open social doctor data</title><link>http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html#comment-717572525</link><description>&lt;p&gt;Thanks Fred, that makes sense.  The examples that I looked at also included insurance IDs, so that's another set of identifiers which could be joined on and/or validated.  
Having a well linked (and validated) set of strong identifiers would be a very useful first step.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Fri, 23 Nov 2012 11:07:36 -0000</pubDate></item><item><title>Re: DocGraph: Open social doctor data</title><link>http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html#comment-717570259</link><description>&lt;p&gt;I agree that "social graph" is a misnomer, even though "social" is all the rage these days.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Fri, 23 Nov 2012 11:03:25 -0000</pubDate></item><item><title>Re: DocGraph: Open social doctor data</title><link>http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html#comment-713871526</link><description>&lt;p&gt;Even if the state licensing data isn't coded with the NPI number, the NPI file has all the state license numbers in it, so aren't these data sets linked?  What am I missing that causes you to state that they're unlinked?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Mon, 19 Nov 2012 10:54:28 -0000</pubDate></item><item><title>Re: Closing Gender Gaps with Data Science: Women&amp;#039;s Voices in UK Media</title><link>https://civic.mit.edu/blog/natematias/closing-gender-gaps-with-data-science-womens-voices-in-uk-media#comment-658306726</link><description>&lt;p&gt;Is the pool of authors so big that you need to guess at gender?  "Lynn" may be ambiguous, but "Lynn Jones who writes sports for the Guardian" (made up example) most certainly has a specific knowable gender.  
Also if a name isn't in the ONS dataset, you should consider expanding the search to other sources such as &lt;br&gt;&lt;a href="http://genderednames.freebaseapps.com/" rel="nofollow noopener" target="_blank" title="http://genderednames.freebaseapps.com/"&gt;http://genderednames.freeba...&lt;/a&gt; (see the bottom of the page for API access)&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Fri, 21 Sep 2012 17:56:01 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2012/08/18/finding_names_in_common_crawl#comment-624423566</link><description>&lt;p&gt;Very cool!  I'm not sure whether you're saying your run included English-only filtering or not.  There are a bunch of foreign language month names (Dezember, Oktober) and other foreign words in the list on Github.  Also, there are some things which are clearly not noun phrases (Posted, Written, Powered, etc).  Any idea why they are getting tagged NPs?  Actually, the boilerplate noise (dates, button labels, link labels, etc) highlights one of the downsides of using the supplied simple visible text extraction.  
It's great because it's already done and super easy to use, but it does have the downside of including boilerplate text.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Mon, 20 Aug 2012 19:07:58 -0000</pubDate></item><item><title>Re: resemblance with the jaccard coefficient</title><link>http://matpalm.com/resemblance/jaccard_coeff/#comment-456041811</link><description>&lt;p&gt;Using the processor's POPCNT instruction would be much faster than your little Kernighan inspired loop.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sun, 04 Mar 2012 12:26:48 -0000</pubDate></item><item><title>Re: brain of mat kelcey</title><link>http://matpalm.com/blog/2011/11/15/collocations_3#comment-456027047</link><description>&lt;p&gt;Both of your longest phrase examples seem suspicious to me.  Did you drop all numbers in your initial parse?  Intentionally?&lt;/p&gt;&lt;p&gt;A little Googling confirms my suspicions that these were actually of the form:&lt;/p&gt;&lt;p&gt;United Nations Security Council Resolution 1699, adopted unanimously on August 8, 2006, after recalling  &lt;br&gt;As of the census [ 1 ] of 2000, there were 9536 people, 3922households, and 2517 families residing in the city. &lt;/p&gt;&lt;p&gt;To answer your question about templates vs cut &amp;amp; paste, templates (infoboxes) are excluded from the body text, but this type of stylized pro forma structure is pretty common in Wikipedia.  Some of it's from cut &amp;amp; paste or a single author working on a series of related articles, but often it's a semi-formal convention adopted by a group of authors.&lt;/p&gt;&lt;p&gt;There are a number of Wikipedia-isms that would be fascinating to study statistically if there was an equivalent corpus to compare against.  
For example, I suspect the frequency of the word "notable" is much higher in Wikipedia than elsewhere because they've got a notability requirement for inclusion, so authors writing about marginal cases take pains to stress why their subject is "notable."&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Sun, 04 Mar 2012 12:08:07 -0000</pubDate></item><item><title>Re: Anonymizing user, company, and location data using Faker</title><link>http://robots.thoughtbot.com/post/18070048430#comment-447008558</link><description>&lt;p&gt;As @tll  says, this should have "anonymizing" in huge quotes.  People reading this article could easily develop a false sense of security that randomly replacing the fields in a few columns of their DB actually provides an adequate level of protection.&lt;br&gt;Google "de-anonymization" or "re-identification" to get an idea of how easy it is to correlate and re-identify personal information.  There have been a number of high profile PR fiascos including the Netflix contest data, AOL search logs, and others.  Do you want your company to be the next?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Thu, 23 Feb 2012 11:21:41 -0000</pubDate></item><item><title>Re: As Journal Boycott Grows, Elsevier Defends Its Practices</title><link>http://chronicle.com/article/As-Journal-Boycott-Grows/130600/#comment-440302810</link><description>&lt;p&gt;[This is intended to be a reply to Mr. Gunn's reply, but that comment has no Reply button.  Sorry for the thread confusion!]  One of the very first lines in that RSS feed is {copyright}Copyright 2012, Mendeley Ltd. {/copyright} [begin/end tags modified so they don't get munged].&lt;/p&gt;&lt;p&gt;If Mendeley isn't attempting to assert copyright on the RSS feed, then that line should be removed.  Also, copyright and licensing are disjoint.  
Saying stuff is CC-BY (where the BY in this case is Mendeley, not the contributor) doesn't say anything about its copyright status (although if it's not copyright or Mendeley isn't the copyright holder, I'm not sure why I'd care what license they think should apply).  If you're attempting to assert that the RSS feed consists of licensed data, it would be useful to include license information in the feed in addition to the copyright statement.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Wed, 15 Feb 2012 18:50:41 -0000</pubDate></item><item><title>Re: As Journal Boycott Grows, Elsevier Defends Its Practices</title><link>http://chronicle.com/article/As-Journal-Boycott-Grows/130600/#comment-435268332</link><description>&lt;p&gt;Kind of ironic, given the present discussion, that Mendeley claims copyright on that list of papers curated by scientists around the world.  &lt;a href="http://www.mendeley.com/groups/530031/future-of-science/feed/rss/" rel="nofollow noopener" target="_blank" title="http://www.mendeley.com/groups/530031/future-of-science/feed/rss/"&gt;http://www.mendeley.com/gro...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Fri, 10 Feb 2012 09:35:35 -0000</pubDate></item><item><title>Re: Andrey Tarantsov: The third definition of open, or How I nearly picked GPL for my product, but ended up simply publishing the source with no license (for now)</title><link>http://tarantsov.com/blog/2012/02/the-third-definition-of-open/#comment-434015427</link><description>&lt;p&gt;Wow, that's long.  I'd encourage you to continue learning about intellectual property protection and your options.  For example, you can license the software under whatever license you want and still reserve rights to the name, preventing other people from using it without your permission.  
Dual licensing (GPL + commercial) is an option if you are the sole contributor (or all contributors agree).  Making money is at least as much about your business strategy as the particular license.&lt;/p&gt;&lt;p&gt;In my opinion, not choosing a license is the worst of all worlds.  Conscientious and legal minded people are completely prevented from using your code while unscrupulous types are given a tiny opening to claim that they thought it was public domain.  At least put a clear copyright notice with your name and date on the code.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Thu, 09 Feb 2012 10:07:59 -0000</pubDate></item><item><title>Re: iPhylo: Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data</title><link>http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html#comment-430788657</link><description>&lt;p&gt;A way to do this which preserves your reconciliation data is to "Create a column based on this column" for the Names column and just use 'value' to copy the original value to a new column named 'Original Names' or some such.&lt;/p&gt;&lt;p&gt;If you just want to facet on name mismatches, you can create a custom text facet on the reconciled column using the expression 'value == cell.recon.match.name' (or perhaps 'value.toLowercase() == cell.recon.match.name.toLowercase()')&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Morris</dc:creator><pubDate>Mon, 06 Feb 2012 09:32:10 -0000</pubDate></item></channel></rss>