Superexponential Conceptspace, and Simple Words

❦

Thingspace, you might think, is a rather huge space. Much larger than reality, for where reality only contains things that actually exist, Thingspace contains everything that could exist.

Actually, the way I “defined” Thingspace to have dimensions for every possible attribute—including correlated attributes like density and volume and mass—Thingspace may be too poorly defined to have anything you could call a size. But it’s important to be able to visualize Thingspace anyway. Surely, no one can really understand a flock of sparrows if all they see is a cloud of flapping cawing things, rather than a cluster of points in Thingspace.

But as vast as Thingspace may be, it doesn’t hold a candle to the size of Conceptspace.

“Concept,” in machine learning, means a rule that includes or excludes examples. If you see the data {2:+, 3:-, 14:+, 23:-, 8:+, 9:-} then you might guess that the concept was “even numbers.” There is a rather large literature (as one might expect) on how to learn concepts from data… given random examples, given chosen examples… given possible errors in classification… and most importantly, given different spaces of possible rules.

Suppose, for example, that we want to learn the concept “good days on which to play tennis.” The possible attributes of Days are

Sky:	{Sunny, Cloudy, Rainy}
AirTemp:	{Warm, Cold}
Humidity:	{Normal, High}
Wind:	{Strong, Weak}.

We’re then presented with the following data, where + indicates a positive example of the concept, and − indicates a negative classification:

+	Sky: Sunny;	AirTemp: Warm;
	Humidity: High;	Wind: Strong.
−	Sky: Rainy;	AirTemp: Cold;
	Humidity: High;	Wind: Strong.
+	Sky: Sunny;	AirTemp: Warm;
	Humidity: High;	Wind: Weak.

What should an algorithm infer from this?

A machine learner might represent one concept that fits this data as follows:

{Sky: ?; AirTemp: Warm; Humidity: High; Wind: ?}.

In this format, to determine whether this concept accepts or rejects an example, we compare element-by-element: ? accepts anything, but a specific value accepts only that specific value.

So the concept above will accept only Days with AirTemp = Warm and Humidity = High, but the Sky and the Wind can take on any value. This fits both the negative and the positive classifications in the data so far—though it isn’t the only concept that does so.

We can also simplify the above concept representation to

{?, Warm, High, ?}.

Without going into details, the classic algorithm would be:

Maintain the set of the most general hypotheses that fit the data—those that positively classify as many examples as possible, while still fitting the facts.
Maintain another set of the most specific hypotheses that fit the data— those that negatively classify as many examples as possible, while still fitting the facts.
Each time we see a new negative example, we strengthen all the most general hypotheses as little as possible, so that the new set is again as general as possible while fitting the facts.
Each time we see a new positive example, we relax all the most specific hypotheses as little as possible, so that the new set is again as specific as possible while fitting the facts.
We continue until we have only a single hypothesis left. This will be the answer if the target concept was in our hypothesis space at all.

In the case above, the set of most general hypotheses would be

{{?, Warm, ?, ?},{Sunny, ?, ?, ?}},

while the set of most specific hypotheses contains the single member {Sunny, Warm, High, ?}.

Any other concept you can find that fits the data will be strictly more specific than one of the most general hypotheses, and strictly more general than the most specific hypothesis.

(For more on this, I recommend Tom Mitchell’s Machine Learning, from which this example was adapted.1)

Now you may notice that the format above cannot represent all possible concepts. E.g., “Play tennis when the sky is sunny or the air is warm.” That fits the data, but in the concept representation defined above, there’s no quadruplet of values that describes the rule.

Clearly our machine learner is not very general. Why not allow it to represent all possible concepts, so that it can learn with the greatest possible flexibility?

Days are composed of these four variables, one variable with 3 values and three variables with 2 values. So there are 3 × 2 × 2 × 2 = 24 possible Days that we could encounter.

The format given for representing Concepts allows us to require any of these values for a variable, or leave the variable open. So there are 4 × 3 × 3 × 3 = 108 concepts in that representation. For the most-general/most-specific algorithm to work, we need to start with the most specific hypothesis “no example is ever positively classified.” If we add that, it makes a total of 109 concepts.

Is it suspicious that there are more possible concepts than possible Days? Surely not: After all, a concept can be viewed as a collection of Days. A concept can be viewed as the set of days that it classifies positively, or isomorphically, the set of days that it classifies negatively.

So the space of all possible concepts that classify Days is the set of all possible sets of Days, whose size is 2²⁴ = 16,777,216.

This complete space includes all the concepts we have discussed so far. But it also includes concepts like “Positively classify only the examples {Sunny, Warm, High, Strong} and {Sunny, Warm, High, Weak} and reject everything else” or “Negatively classify only the example {Rainy, Cold, High, Strong} and accept everything else.” It includes concepts with no compact representation, just a flat list of what is and isn’t allowed.

That’s the problem with trying to build a “fully general” inductive learner: They can’t learn concepts until they’ve seen every possible example in the instance space.

If we add on more attributes to Days—like the Water temperature, or the Forecast for tomorrow—then the number of possible days will grow exponentially in the number of attributes. But this isn’t a problem with our restricted concept space, because you can narrow down a large space using a logarithmic number of examples.

Let’s say we add the Water: {Warm, Cold} attribute to days, which will make for 48 possible Days and 325 possible concepts. Let’s say that each Day we see is, usually, classified positive by around half of the currently-plausible concepts, and classified negative by the other half. Then when we learn the actual classification of the example, it will cut the space of compatible concepts in half. So it might only take 9 examples (2⁹ = 512) to narrow 325 possible concepts down to one.

Even if Days had forty binary attributes, it should still only take a manageable amount of data to narrow down the possible concepts to one. Sixty-four examples, if each example is classified positive by half the remaining concepts. Assuming, of course, that the actual rule is one we can represent at all!

If you want to think of all the possibilities, well, good luck with that. The space of all possible concepts grows superexponentially in the number of attributes.

By the time you’re talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you’d have to see over a trillion examples before you could say what was In, and what was Out. You’d have to see every possible example, in fact.

That’s with forty binary attributes, mind you. Forty bits, or 5 bytes, to be classified simply “Yes” or “No.” Forty bits implies 2⁴⁰ possible examples, and 2²⁴⁰ possible concepts that classify those examples as positive or negative.

So, here in the real world, where objects take more than 5 bytes to describe and a trillion examples are not available and there is noise in the training data, we only even think about highly regular concepts. A human mind—or the whole observable universe—is not nearly large enough to consider all the other hypotheses.

From this perspective, learning doesn’t just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.

But what has this (you inquire) to do with the proper use of words?

It’s the whole reason that words have intensions as well as extensions.

In the last essay, I concluded:

The way to carve reality at its joints is to draw boundaries around concentrations of unusually high probability density.

I deliberately left out a key qualification in that (slightly edited) statement, because I couldn’t explain it until now. A better statement would be:

The way to carve reality at its joints, is to draw simple boundaries around concentrations of unusually high probability density in Thingspace.

Otherwise you would just gerrymander Thingspace. You would create really odd noncontiguous boundaries that collected the observed examples, examples that couldn’t be described in any shorter message than your observations themselves, and say: “This is what I’ve seen before, and what I expect to see more of in the future.”

In the real world, nothing above the level of molecules repeats itself exactly. Socrates is shaped a lot like all those other humans who were vulnerable to hemlock, but he isn’t shaped exactly like them. So your guess that Socrates is a “human” relies on drawing simple boundaries around the human cluster in Thingspace. Rather than, “Things shaped exactly like [5-megabyte shape specification 1] and with [lots of other characteristics], or exactly like [5-megabyte shape specification 2] and [lots of other characteristics], … , are human.”

If you don’t draw simple boundaries around your experiences, you can’t do inference with them. So you try to describe “art” with intensional definitions like “that which is intended to inspire any complex emotion for the sake of inspiring it,” rather than just pointing at a long list of things that are, or aren’t art.

In fact, the above statement about “how to carve reality at its joints” is a bit chicken-and-eggish: You can’t assess the density of actual observations until you’ve already done at least a little carving. And the probability distribution comes from drawing the boundaries, not the other way around—if you already had the probability distribution, you’d have everything necessary for inference, so why would you bother drawing boundaries?

And this suggests another—yes, yet another—reason to be suspicious of the claim that “you can define a word any way you like.” When you consider the superexponential size of Conceptspace, it becomes clear that singling out one particular concept for consideration is an act of no small audacity—not just for us, but for any mind of bounded computing power.

Presenting us with the word “wiggin,” defined as “a black-haired green-eyed person,” without some reason for raising this particular concept to the level of our deliberate attention, is rather like a detective saying: “Well, I haven’t the slightest shred of support one way or the other for who could’ve murdered those orphans… not even an intuition, mind you… but have we considered John Q. Wiffleheim of 1234 Norkle Rd as a suspect?”

Tom M. Mitchell, Machine Learning (McGraw-Hill Science/Engineering/Math, 1997). ↩︎

Mutual Information, and Density in Thingspace

Top

Book

Sequence

Conditional Independence, and Naive Bayes