Monday, November 28, 2016

Don't Repeat Yourself... Really!

...And I hope that someday
That you, you people will all have the chance
To read The Helping Friendly Book
And experience the wisdom
Of the great, the great and knowledgeable
Man who wrote The Helping Friendly Book 
Because he is, the great and knowledgeable
He is the one, the only author of The Helping Friendly Book
He is the man
The great man
The only, the special
His name is
The author of The Helping Friendly Book
He is the great
The knowledgeable, the one and the only
The great, the knowledgeable
Person who wrote The Helping Friendly Book
His name could only be
The one, the only, the only, the special
The author of The Helping Friendly Book...
-Excerpt from Icculus by Phish
There are few mantras that are as universally agreed upon by Software Engineers as Don't Repeat Yourself or DRY.  It rings deeply and globally true that any form of code duplication is at best a compromise, at worst, an abominable trap.  Writing and maintaining code is difficult enough without doing it many times.  So if we all know this, and we all agree, then why is it that the industry's preferred architectural best practices embrace duplication?

If you've ever built a controller, a repository, a service layer, a data access provider, or anything remotely like one of these things, you know exactly what I'm talking about.  Duplication is what you do.  You copy the logging code.  You copy the query strategy.  You copy the update methodology. You copy the attributes and data structures. You copy the exception handling pattern.  "That's how it was shown to you by smart people, so it must somehow be right," you think as you faithfully type away, all the while wondering if you've become little more than a digital version of a truck driver.

Like the proverbial cliff all your friends jumped off of (according to your mother), it doesn't matter how much bad architecture you have seen; there is another, better way... so stop it!  But perhaps you're so used to following your lemming-like friends that you don't know how to escape the allure of duplication?  I'll offer some suggestions.

Existing Hooks and Abstractions

I've seen many applications that duplicate logging and exception handling while being built in frameworks that provide easy-to-implement hooks for just these things.  At a minimum, know your tools.


There are new frameworks every day.  Most of them encourage repetitious coding practices and should be avoided, but some aim to automate patterns and escape the madness.  Don't use a framework just because everyone else is doing it, and if you do, at least learn how to extend it.

Aspect Orientation

Aspect Orientation, or AOP, is an approach that applies on-the-fly code generation or similar techniques to automate recurring patterns.  For instance, an AOP-based framework would be one which allows you to define once how logging is performed, then apply that definition across all service calls.  Note that aspect orientation isn't synonymous with code generation; there are clever ways to accomplish the approach using composition hooks or code injected at runtime.
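To make the idea concrete, here's a minimal sketch of cross-cutting logging in Python using a decorator as the composition hook.  The names (`logged`, `update_customer`) are hypothetical, not from any particular framework; the point is that the logging pattern is defined exactly once:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("service")

def logged(fn):
    """Cross-cutting logging aspect: defined once, applied to any service call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info("entering %s", fn.__name__)
        try:
            result = fn(*args, **kwargs)
            log.info("leaving %s", fn.__name__)
            return result
        except Exception:
            log.exception("error in %s", fn.__name__)
            raise
    return wrapper

@logged
def update_customer(customer_id, name):
    # The service body contains only business logic; no logging boilerplate.
    return {"id": customer_id, "name": name}
```

Change the logging policy in one place and every service picks it up; no copying required.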

Data Access Platforms

Most repetition in today's applications seems to revolve around data access.  It seems somehow accepted that most aspects of applications can be automated, but not data access.  Granted, it may seem difficult to orchestrate and automate proper concurrent access of data across all layers, but platforms like Dataphor have solidly proven that all aspects of data access, from user interface to storage, can be completely automated.  If your platform or frameworks do not automate it, don't settle.


Regardless of what you do, say no to pattern duplication and begin enjoying your work again.  Do it for the lemmings; do it for your mother!

Sunday, October 09, 2016

Convolutional Neural Networks for Image Categorization


I'm writing this to help developers without a degree in artificial intelligence understand the subject, as most material on the matter is rather complex.  I am also hoping to shore up my own understanding, as I am relatively new to the subject of deep learning neural networks.


The problem domain is essentially a function whereby an image is passed in and one or more categories result, each with an associated probability.
A convolution neural network used for image categorization.
There are many other similar use cases as well, such as gesture recognition, but for this article we'll limit discussion to this case.

More technically, a Convolutional Neural Network (CNN) is a specific type of deep learning neural network which is composed of:
  • An input layer containing a bitmap.
  • Some number of convolution layers, which apply convolution filters and output the result to the next layer.
  • Some number of sub-sampling layers, which each downsize the data from the previous layer.
  • A fully connected hidden layer and a fully connected categorization (output) layer.  Don't worry if you don't know what these things are; we'll cover them.
There are infinitely many ways to compose such layers, and this fact has led to CNN design sitting somewhere between an art form and a science.  Each year there are competitions among researchers attempting to reach new benchmarks in performance and scale.  A well-known such competition is hosted by ImageNet, where recent contests have, for instance, involved recognizing objects in one of hundreds of categories and even identifying the placement of those objects within the image.  From the years of research there have arisen some recognized "standard" designs which have traditionally given good results for certain domains.


There are several open source frameworks that have arisen for this domain.  A few of them are:
  • Caffe — Written in C++; well established, fast, and supports GPU optimization. Originally targeted Linuxy OSes, but has been ported to Windows (in beta as of this writing).
  • Torch — Written in Lua; also well established and optimized. Currently for Linux, and the community seems a bit opinionated about remaining there, but there are porting efforts that look to have been somewhat successful. I wouldn't tread here unless you either know and love Lua already, or are prepared to get your feet wet in the same.
  • Intel Deep Learning Framework — Vectorized CPU and OpenCL GPU implementations in C++ of CNNs and others. Looks like Intel is no longer actively developing this, but it appears to be pretty mature and might at least inspire optimizations.
  • CNN Workbench — In C#. There used to be a really nice article on this with a great-looking GUI on CodeProject, but the author pulled it down. This link is all I have, which is a fork of just the library part of it. No community to speak of here.
  • CNNCSharp — In C#. Again no community, but this repo seems to have all the needed goodies.
  • ConvNetJS — JavaScript implementation of CNNs which is easy to try in your browser. Great visualizations.


Now we'll take the details step by step.  I'm going to use visualizations from the default network for CIFAR-10 in ConvNetJS to illustrate the process.  The network is structured as follows:
Default network layout for CIFAR-10 in ConvNetJS

Input layer

The first thing that's typically done is to turn the image into a 3D tensor of floating point values: two dimensions for the X and Y pixels and a third for the three color channels.  As a reminder, a tensor is simply an n-dimensional array; vectors are 1-dimensional tensors, matrices are 2-dimensional tensors.  The conversion accommodates the fuzzy math that is coming; such math routines are written to take generic structures of floating point values, not discrete integers such as are used to store bitmap data.
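As a minimal sketch of that conversion (plain Python, no particular framework; the function name and channel-first layout are just illustrative assumptions):

```python
def image_to_tensor(pixels):
    """Convert an H x W bitmap of (R, G, B) byte triples into a 3D tensor
    of floats in [0, 1], indexed as [channel][y][x]."""
    height, width = len(pixels), len(pixels[0])
    return [[[pixels[y][x][c] / 255.0 for x in range(width)]
             for y in range(height)]
            for c in range(3)]

# A tiny 2x2 test image: red, green, blue, and white pixels.
bitmap = [[(255, 0, 0), (0, 255, 0)],
          [(0, 0, 255), (255, 255, 255)]]
tensor = image_to_tensor(bitmap)
```

Real frameworks do the same thing with optimized array types, and often also center the values around zero, but the idea is identical.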


First convolution layer.  Uses 16 5x5 filters (labeled Weights) against the input.
A convolution function essentially returns how well a portion of data matches a given filter.  Most filters are some type of edge detector of some angle and perhaps some color combination.  A convolution layer steps in increments of some stride (one in the example) across and down the input data, and outputs how well each position matches the filter.  In the case of the example, the configuration is to use 16 filters for the first convolution layer, so this scanning and outputting is done for each of 16 filters.  You'll notice there are 16 filters (labeled Weights) represented in the image, but the filter size is 5x5x3.  This is because the above visualization combines the 3 color channels into composite (RGB) images.

What you're seeing in the Activations images in the example are pixel representations of how well that part of the image matched the filter, so lighter portions match better.
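The scanning itself is simple enough to sketch.  Here is a single-channel, single-filter version in plain Python (an assumed toy implementation, ignoring padding and the color channels), where the kernel is a vertical edge detector:

```python
def convolve2d(data, kernel, stride=1):
    """Slide `kernel` over `data` in steps of `stride`, outputting how
    strongly each position matches the filter (sum of products)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(0, len(data) - kh + 1, stride):
        row = []
        for x in range(0, len(data[0]) - kw + 1, stride):
            row.append(sum(data[y + j][x + i] * kernel[j][i]
                           for j in range(kh) for i in range(kw)))
        out.append(row)
    return out

# A vertical-edge filter responds strongly where dark meets light.
image  = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
edges = convolve2d(image, kernel)  # strongest response down the middle
```

A real convolution layer does this for every filter (16 in the example) across all input channels at once, producing one activation map per filter.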


Activation function of first convolution layer brings out only the highlights.
Every intermediate layer of the network has something called an activation function.  The name comes from a biological neuron, which only fires (activates) if incoming signals from upstream neurons cumulatively reach some level of signal.  This function's job is to amplify contrast, and it must be non-linear.  The extreme case is a threshold function which turns fully on (1) when the inputs reach a certain level and otherwise is fully off (0).  In practice, most neural networks use hyperbolic tangent, sigmoid, or ReLU functions, which are non-linear but not fully discrete, so as to preserve some uncertainty.  The reason a linear function would not work well is that the entire network would degenerate to grays rather than portions of the network taking on specific features.  The activation function can either be thought of as part of the neural network layer, or as a layer of its own.
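The common activation functions mentioned above are each one-liners; here they are in Python for reference:

```python
import math

def relu(x):
    """Rectified linear unit: passes positive signal through, zeroes the rest."""
    return max(0.0, x)

def sigmoid(x):
    """Smooth threshold: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hyperbolic tangent squashes into (-1, 1); it's built into the math module.
samples = [relu(-2.0), relu(3.0), sigmoid(0.0), math.tanh(0.0)]
```

Each is applied element-wise to a layer's output; note that all three bend (they are non-linear), which is exactly the property the paragraph above requires.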


First pooling layer down-samples the result.
Pooling is simply the process of downsizing.  There are different techniques for doing this, the most common for CNNs being to take the maximum (Max) value within each sample segment.

The reason for pooling is to create a degree of scale invariance and to allow the next layers to match on higher level visual patterns.  The nature of max pooling also creates a degree of rotation invariance because the "max" feature can move a little within each sample segment.
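A minimal sketch of 2x2 max pooling in plain Python (again a toy, assumed implementation on one channel) shows how each segment collapses to its strongest feature:

```python
def max_pool(data, size=2):
    """Downsize by keeping only the maximum within each size x size segment."""
    out = []
    for y in range(0, len(data) - size + 1, size):
        row = []
        for x in range(0, len(data[0]) - size + 1, size):
            row.append(max(data[y + j][x + i]
                           for j in range(size) for i in range(size)))
        out.append(row)
    return out

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
pooled = max_pool(feature_map)  # each 2x2 block becomes its maximum
```

Notice that a strong feature can move anywhere within its 2x2 segment without changing the output, which is the small translation invariance described above.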

Rinse and Repeat

After this, the Convolution, ReLU, and Pooling processes are repeated, though with variations such as the number of convolution filters.  Some networks add back-to-back convolutions without pooling near the end, and some nest entire CNNs within each other; clearly there are infinite possibilities.  How is one to know what is best?  The short answer is that this is a design domain, so there will be various trade-offs for each, and the best designs may have yet to be discovered.  That said, it is important to know the principles.

The desire is for the network to learn the essential characteristics that constitute the essence of the category.  In order to do this, sufficient layers are needed to represent the complexity of the concept.  For example, it is relatively simple to distinguish hand-written characters from each other, compared to recognizing the difference between Corgi and Dachshund dogs in nearly any pose.  To distinguish the latter, sufficient convolution layers must build up from simpler ones.  The first convolution layer basically establishes fundamental edge shapes.  The second recognizes compound shapes such as circles or ripples.  By the time you arrive at the 5th or higher convolution layers, the features are very high level, such as "eyes", "words", etc.  Note that those labels are not necessarily known to the network, but the network will recognize the patterns regardless.  For instance, the actual category of Corgi might fire when the right combination of lower-level features is present, such as "eyes", "stubby tail", "short legs".  Yes, this is amazing.

So why not just add hundreds of layers with hundreds of filters each?  Performance is one reason; diminishing returns is another.  Though CNNs are quite efficient considering the prospect of trying to accomplish the same feats with fully connected per-pixel deep networks, they are still expensive computationally and memory-wise, especially when training.  One promising avenue of research is building networks for new purposes out of networks that were trained for others.  This works because many of the primitives (such as "eyes") are common to higher-level patterns and can be reused.

Fully Connected

The fully connected (FC) layer is essentially a traditional neural network, where every "neuron" coming from the previous layer is "wired" to every neuron in the fully connected layer.  These neurons decide what combination of features from the convolution layers ultimately constitutes each category.  In other words, the FC layer has a neuron for every category, and the connections coming from each neuron (convolution feature) of the previous layer are weighted such that when the correct combination(s) are present, the category scores highly.
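Mechanically, a fully connected layer is just a weighted sum per category.  Here is a toy sketch; the feature values and weights are entirely made up for illustration, echoing the Corgi example from earlier:

```python
def fully_connected(features, weights, biases):
    """Each output neuron is a weighted sum of every input feature plus a
    bias: one raw score per category."""
    return [sum(w * f for w, f in zip(neuron_weights, features)) + b
            for neuron_weights, b in zip(weights, biases)]

# Hypothetical high-level feature strengths: "eyes", "stubby tail", "short legs".
features = [0.9, 0.8, 0.7]
# One weight row per category; the first row rewards this combination,
# the second penalizes the tail/leg features.
weights = [[1.0, 1.0, 1.0],    # "corgi"
           [1.0, -1.0, -1.0]]  # "dachshund"
biases = [0.0, 0.0]
scores = fully_connected(features, weights, biases)
```

Training is the process of discovering weight rows like these; nobody writes them by hand.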


The fully connected layer produces the category scores as arbitrary real values.  The final step is to normalize those into probabilities.  Normalization means that rather than just producing a set of arbitrary numbers for each category, the category values are factored together so that the sum across all categories is 1, or at least nearly so.  For this task, the Softmax function is often used because it gives increased weight to the most positive values.
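Softmax is small enough to show in full.  This sketch includes the standard max-subtraction trick for numerical stability (an implementation detail, not part of the math):

```python
import math

def softmax(scores):
    """Normalize raw category scores into probabilities that sum to 1,
    giving disproportionate weight to the largest values."""
    m = max(scores)  # subtracting the max avoids overflow; result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # highest score gets the lion's share
```

Because of the exponential, a score that is only modestly larger than the others ends up with a much larger probability, which is exactly the "increased weight to the most positive values" behavior described above.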

Now What?

Here are some resources to get you going:
  • I'd recommend spending a little time with ConvNetJS.
  • This article shows how to construct a CNN for the CIFAR-10 dataset.  It is for Torch, but is useful as a detailed exposé of CNN construction regardless.  The CIFAR datasets are a good place to start, as they are only 32x32 pixels and a relatively simple network works well.
  • nVidia DIGITS is software which makes it easy to construct a CNN and run it against nVidia GPUs.
  • If you're interested in Torch, there are ready-made Amazon G2 (GPU accelerated) instances such as this one.
  • When you're ready for more advanced nets, check out GoogLeNet, AlexNet and others on the Caffe Model Zoo.
  • To better understand what these networks are learning and thus improve them, there are efforts to visualize their learning through deconvolution.

Sunday, July 08, 2007

Camtasia Studio Tips

Camtasia is great software for recording audio/screen presentations.  Here are some tips and standards we've used:


  • Change the recording options to record to AVI format (the CAMREC format can lose audio/video sync over a long recording)
  • Recorded movies should be 800x600 or less unless there is a specific reason to record larger.  Using the smallest capture size possible improves the readability and reduces the file size.
  • All videos should start with an announcement of what is being demonstrated (e.g. "This is a demonstration of the Device Capture System introduced in Cashwise 3.7")
  • Don't record the application window; set one of your screens to 800x600 and record the screen.  This ensures that any popup windows or menus remain within the recorded region.
  • If recording other than the primary screen, select 'region' and then select the entire alternate screen
  • Hide any toolbars or other desktop clutter.
  • Run a short test to be sure that audio/video are working before recording a long demo


  • Files should be recorded to AVI format. Recommended settings:
     Video Codec:    TechSmith Screen Capture Codec
     Audio Codec:    MPEG Layer-3 Codec
     Audio Format:    32 kBit/s, 24000Hz, mono, 3KB/s


  • Before delving into details, give a high-level description of the purpose of the feature and a high-level description of the components of the feature.
  • Consider dividing the topic into multiple videos if it is possible that someone might wish to view topics separately
  • Try to keep videos under 10 minutes long
  • Don't script out the content.  If necessary, create a brief outline to ensure each important facet is discussed.  It's hard to listen to scripted speech.
  • If you realize that some topic is missing from a video, record an addendum rather than re-record the whole thing (unless there is a mess of addendums)
  • Keep a reasonably fast pace.  A video says even more than a thousand words.  The viewer can always pause or rewind.
  • Try to keep the content timeless; e.g. do not say things like "this is a new feature".
  • Try to spend more time on abstract concepts, gotchas and special cases that might not be readily apparent rather than mundane descriptions of the obvious.

Sunday, May 20, 2007

Web applications - now only 15 years behind!

It seems to me that web applications are finally approaching where "rich client" applications were in the late 90s. People are finally pulling together abstractions and forming UI toolkits around them.

What would be really great would be if we could skip 15 more years and stop using HTML altogether.

Tuesday, May 15, 2007

Down with whiners... Microsoft

I've pandered to, accommodated, supported, and even defended Microsoft for years now, but I'm through. To this point, they may have occasionally been the big bully, but more the special ed kid kind of bully than the vindictive Harvard graduate type. They've had a sort of unspoken culture of defensive litigation. Though they may have used some aggressive tactics, their overall approach has seemed to be: win the battle through providing better software. With their recent whining about how the open source world is stomping on all their B.S. patents, they have crossed the line. I don't like Microsoft any more. Go Linux! Go Mac!

Thursday, May 10, 2007

Needed: voice tracking webcam

We recently acquired Logitech's motion tracking webcam in order to better capture presentations, discussions, and training sessions. In case you are interested, the camera is the Logitech QuickCam Orbit MP.

The camera does what it advertises and does so pretty well.  As the presenter moves around, the camera keeps the person within the field of view.  As an aside, I was a little surprised by the fact that this camera moves in small jolts rather than in a smooth manner.  Though the resulting video doesn't feel very professional, I would rather have this behavior than see constant movement as the system tries to decide where to point.  Perhaps with sufficiently sophisticated software, a gradual movement system would be better.

Where the camera doesn't do as well is when more than one person is involved.  The camera basically seems to track the "biggest" movement, so in a presentation setting, if one person walks away from the center of the action, the camera basically follows them.  As a result, in our usage we had to basically wave the camera down to keep the presenter visible.

Clearly this camera was not designed for this type of scenario, but it did get me thinking about how easy it would be to provide a solid solution to this problem.  The answer is to put dual microphones on the camera, and have it rotate to point to the dominant source of audio.  This would be ideal for many scenarios, from video conferencing, to training, to Q & A and presenting.

So, could somebody please make such a thing and not charge too much for it?  Thanks.  :-)

Sunday, May 06, 2007

Example Oriented Programming

I was letting my mind wander a bit this evening about how far one could take the idea of example-based programming in order to make the programming process more concrete.  It seems natural to think of one or more concrete cases, then deduce abstractions from there, so why not allow the system to cooperate in the entire process?  Let me try to explain what I mean.

Query By Example (QBE) is a heavily used and long-established concept whereby the user provides an example of what data she seeks.  One could argue that modern search engines are in fact a form of QBE.  QBE was used successfully as a database query language in what used to be Borland's Paradox.  In a database context, QBE is usually exposed only through a user interface: a set of fields which allow the user to "filter" the dataset.  Paradox took this much further and defined a complete language in which to express even complex disjunctive queries.  When working in Paradox and even afterwards, I would often build queries in QBE, then ask Paradox to translate them into SQL, as I found QBE to be easier and more natural in many cases.

Thinking about the analogue of QBE for programming, a common case comes to mind: test scripts.  QA engineers often use software that allows them to record a series of actions against a piece of software, then go back to the resulting scripts and randomize or parameterize whatever aspects.  This is a process of taking a concrete imperative process, and making it abstract and reusable.

The question is: could there be value in this for "regular" programming?  When we program, it seems we already undergo this exercise in our minds.  It seems that we "play" algorithms in our minds, considering the various boundary conditions and possible variable states.  If, rather than just thinking this way, we could state concrete "executions" to the system, then inform the system that we wish certain constants to be variable, then perhaps the system could enhance the process.

Still a "thought in progress" to be sure...