{"id":174,"date":"2013-04-29T18:02:43","date_gmt":"2013-04-30T00:02:43","guid":{"rendered":"http:\/\/logicalchaos.org\/blog\/?p=174"},"modified":"2013-04-29T18:06:45","modified_gmt":"2013-04-30T00:06:45","slug":"kerning-pairs-part-ii","status":"publish","type":"post","link":"http:\/\/logicalchaos.org\/blog\/2013\/04\/kerning-pairs-part-ii\/","title":{"rendered":"Kerning Pairs Part II"},"content":{"rendered":"<p>In my previous post <a href=\"http:\/\/logicalchaos.org\/blog\/2013\/04\/kerning-pairs-part-i\/\" title=\"Kerning Pairs Part I\">Kerning Pairs Part I<\/a>, I looked at kerning pairs calculated from the standard Mac dictionary that font designers would concentrate on to get the most payoff for their editing time.  This time, I&#8217;ll do a simple <a href=\"http:\/\/hadoop.apache.org\" title=\"Hadoop\">Hadoop<\/a> implementation of the same calculation.<br \/>\n<!--more--><br \/>\nI&#8217;m not doing this for performance (obviously), as it takes ~1 sec on my iMac, and 2m30s on my Hadoop cluster.  I&#8217;m doing it to learn more about Big Data, and how to work with it (like I need something more to do with my spare time).  My cluster has three machines in it, a dual core AMD Athlon MP on 1Gb network, an Intel Core i5 (iMac) over 802.11n, and an Intel Core i7 (iMac), also on the 1Gb network.  The i7 is the master node.<\/p>\n<p>So all nodes have something to do, I put the 2.4MB dictionary file into the Hadoop file system in block sizes of 8KB.  The task ratio for this problem mapped to the three nodes is about 1:3.5:18.<br \/>\n<code lang=\"bash\">hadoop fs -Ddfs.block.size=8192 -put \/usr\/share\/dict\/words \/usr\/hadoop\/words<\/code><br \/>\nThe KPHReducer class is not needed for a simple counting reduction like this one, but it shows a gotcha with correctly counting values passed in through the iterator.  
Despite the mapper only ever writing 1&#8217;s per pair, when a combiner class is registered (as it is in this simple counting implementation), the data can be coalesced before it reaches the reducer class, so the reducer must sum the values rather than count them.<br \/>\n<code lang=\"java\"><br \/>\nimport java.io.IOException;<br \/>\nimport org.apache.hadoop.conf.Configuration;<br \/>\nimport org.apache.hadoop.fs.Path;<br \/>\nimport org.apache.hadoop.io.IntWritable;<br \/>\nimport org.apache.hadoop.io.Text;<br \/>\nimport org.apache.hadoop.mapreduce.Job;<br \/>\nimport org.apache.hadoop.mapreduce.Mapper;<br \/>\nimport org.apache.hadoop.mapreduce.Reducer;<br \/>\nimport org.apache.hadoop.mapreduce.lib.input.FileInputFormat;<br \/>\nimport org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;<\/p>\n<p>public class KPH {<\/p>\n<p>    public static class KPHMapper extends<br \/>\n            Mapper<Object, Text, Text, IntWritable> {<\/p>\n<p>        private final static IntWritable one           = new IntWritable(1);<br \/>\n        private final static int         STRING_LENGTH = 2;<br \/>\n        private final Text               word          = new Text();<\/p>\n<p>        @Override<br \/>\n        public void map(final Object key, final Text value,<br \/>\n                final Context context) throws IOException, InterruptedException {<br \/>\n            final String dictWord = value.toString();<br \/>\n            String lcLine;<br \/>\n            \/*<br \/>\n             * Only looking at words made up of 2 or more alphabetic characters.<br \/>\n             *\/<br \/>\n            if (dictWord.matches(\"\\\\p{Alpha}{2,}\")) {<br \/>\n                lcLine = dictWord.toLowerCase();<br \/>\n                for (int i = 0; i <= (lcLine.length() - KPHMapper.STRING_LENGTH); i++) {<br \/>\n                    \/*<br \/>\n                     * Grab each sequential pair and write it out.<br \/>\n                     *\/<br \/>\n                    this.word.set(lcLine.substring(i, i<br \/>\n                            + KPHMapper.STRING_LENGTH));<br \/>\n                    
context.write(this.word, KPHMapper.one);<br \/>\n                }<br \/>\n            }<br \/>\n        }<br \/>\n    }<\/p>\n<p>    public static class KPHReducer extends<br \/>\n            Reducer<Text, IntWritable, Text, IntWritable> {<br \/>\n        @Override<br \/>\n        public void reduce(final Text key, final Iterable<IntWritable> values,<br \/>\n                final Context context) throws IOException, InterruptedException {<br \/>\n            int total = 0;<br \/>\n            for (final IntWritable val : values) {<br \/>\n                \/*<br \/>\n                 * We may see values other than 1 here, since this class also<br \/>\n                 * runs as the combiner and its partial sums feed the reducer.<br \/>\n                 *\/<br \/>\n                total += val.get();<br \/>\n            }<br \/>\n            context.write(key, new IntWritable(total));<br \/>\n        }<br \/>\n    }<\/p>\n<p>    public static void main(final String[] args) throws Exception {<br \/>\n        final Configuration conf = new Configuration();<br \/>\n        final Job job = new Job(conf, \"Kerning Pairs\");<br \/>\n        job.setJarByClass(KPH.class);<br \/>\n        job.setMapperClass(KPHMapper.class);<br \/>\n        \/*<br \/>\n         * The combiner isn't strictly needed for the simple count we're<br \/>\n         * doing (the reducer alone produces the same totals), but setting<br \/>\n         * it shows how to utilize one.<br \/>\n         *\/<br \/>\n        job.setCombinerClass(KPHReducer.class);<br \/>\n        job.setReducerClass(KPHReducer.class);<br \/>\n        job.setOutputKeyClass(Text.class);<br \/>\n        job.setOutputValueClass(IntWritable.class);<br \/>\n        FileInputFormat.addInputPath(job, new Path(args[0]));<br \/>\n        FileOutputFormat.setOutputPath(job, new Path(args[1]));<br \/>\n        System.exit(job.waitForCompletion(true) ? 
0 : 1);<br \/>\n    }<br \/>\n}<br \/>\n<\/code><br \/>\nThe files containing the pairs are <a href=\"http:\/\/logicalchaos.org\/blog\/wp-content\/uploads\/2013\/04\/KP.txt\">KP<\/a>, sorted alphabetically, and <a href=\"http:\/\/logicalchaos.org\/blog\/wp-content\/uploads\/2013\/04\/KPn.txt\">KPn<\/a>, sorted numerically by frequency.<br \/>\n<code lang=\"bash\">sort -nrk2 KP.txt > KPn.txt<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post Kerning Pairs Part I, I looked at kerning pairs calculated from the standard Mac dictionary that font designers would concentrate on to get the most payoff for their editing time. This time, I&#8217;ll do a simple &hellip; <a href=\"http:\/\/logicalchaos.org\/blog\/2013\/04\/kerning-pairs-part-ii\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[8],"tags":[20,19],"_links":{"self":[{"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/posts\/174"}],"collection":[{"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/comments?post=174"}],"version-history":[{"count":41,"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/posts\/174\/revisions"}],"predecessor-version":[{"id":220,"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/posts\/174\/revisions\/220"}],"wp:attachment":[{"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/media?parent=174"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/categories?post=174"},{"taxonomy":"post_tag","embeddable"
:true,"href":"http:\/\/logicalchaos.org\/blog\/wp-json\/wp\/v2\/tags?post=174"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}