Kerning Pairs Part II

In my previous post, Kerning Pairs Part I, I used the standard Mac dictionary to calculate the kerning pairs that font designers should concentrate on to get the most payoff for their editing time. This time, I’ll do a simple Hadoop implementation of the same calculation.

I’m not doing this for performance (obviously), as the calculation takes ~1 sec on my iMac and 2m30s on my Hadoop cluster. I’m doing it to learn more about Big Data and how to work with it (like I need something more to do with my spare time). My cluster has three machines in it: a dual-core AMD Athlon MP on the gigabit network, an Intel Core i5 (iMac) over 802.11n, and an Intel Core i7 (iMac), also on the gigabit network. The i7 is the master node.

So that all nodes have something to do, I put the 2.4MB dictionary file into the Hadoop file system with a block size of 8KB, which splits it into roughly 300 blocks, and hence roughly 300 map tasks to spread around. The ratio of tasks handled by the three nodes on this job is about 1:3.5:18.
hadoop fs -Ddfs.block.size=8192 -put /usr/share/dict/words /usr/hadoop/words
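As a sanity check, fsck will list the file's blocks; given 2.4MB at 8KB per block, it should report around 300 of them:
hadoop fsck /usr/hadoop/words -files -blocks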
Strictly speaking, the KPHReducer class isn’t needed for a counting reduction as simple as this one, but it illustrates a gotcha with correctly totalling the values passed in through the iterator. Although the mapper only ever writes 1s per pair, the same class is registered as the combiner, so the data can be partially summed before it reaches the reducer. For example, if three map outputs of ("th", 1) pass through a combiner, the reducer may receive a single ("th", 3); it must therefore add the values rather than count them.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KPH {

    public static class KPHMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final static int STRING_LENGTH = 2;
        private final Text word = new Text();

        @Override
        public void map(final Object key, final Text value,
                final Context context) throws IOException, InterruptedException {
            final String dictWord = value.toString();
            String lcLine;
            /*
             * Only look at purely alphabetic words of 2 or more characters.
             */
            if (dictWord.matches("\\p{Alpha}{2,}")) {
                lcLine = dictWord.toLowerCase();
                /*
                 * Grab each sequential pair and write it out. The <= bound
                 * is needed so the final pair of each word is included.
                 */
                for (int i = 0; i <= (lcLine.length() - KPHMapper.STRING_LENGTH); i++) {
                    this.word.set(lcLine.substring(i, i + KPHMapper.STRING_LENGTH));
                    context.write(this.word, KPHMapper.one);
                }
            }
        }
    }

    public static class KPHReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(final Text key, final Iterable<IntWritable> values,
                final Context context) throws IOException, InterruptedException {
            int total = 0;
            for (final IntWritable val : values) {
                /*
                 * We will see values other than 1 here, since this class
                 * also runs as the combiner and may have partially summed
                 * the mapper's 1s already.
                 */
                total += val.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final Job job = new Job(conf, "Kerning Pairs");
        job.setJarByClass(KPH.class);
        job.setMapperClass(KPHMapper.class);
        /*
         * A custom combiner/reducer class isn't strictly needed for the
         * simple count operation we're doing, but this shows how to wire
         * them up; the same class serves as both.
         */
        job.setCombinerClass(KPHReducer.class);
        job.setReducerClass(KPHReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
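To try the job out, one possible build-and-run sequence looks like this (the jar name, classes directory, and output path are just placeholders, and I'm assuming a Hadoop version that provides the hadoop classpath command):

mkdir classes
javac -classpath "$(hadoop classpath)" -d classes KPH.java
jar cf KPH.jar -C classes .
hadoop jar KPH.jar KPH /usr/hadoop/words /usr/hadoop/kp-out

Here args[0] is the dictionary already loaded into HDFS, and args[1] must be a directory that doesn't yet exist; Hadoop refuses to overwrite an existing output directory.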

The files containing the pairs are KP.txt, sorted alphabetically (the natural key order out of the reducer), and KPn.txt, sorted numerically by descending frequency.
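The reducer's output actually lands in HDFS as one or more part files; one way to collapse them into the local KP.txt (assuming the output path from the run sketched above) is getmerge:
hadoop fs -getmerge /usr/hadoop/kp-out KP.txt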
cat KP.txt | sort -nrk2 > KPn.txt
