Wednesday, October 26, 2011

Lower bound for element distinctness problem

Summary: We discuss the method of proving lower bounds on a linear decision tree model using connected components argument. A lower bound of Ω(n logn) for element distinctness problem(EDP) follows from this argument.

We are given an input X=(x1,x2,..,x_n) and a property P; the problem is to decide if X has property P. In EDP, we are interested in knowing if all x_is are distinct.

What is a connected component?

Every input point belongs to R^n. Hence, the entire input space R^n can be partioned into `Yes' and `No' regions; a point from `Yes' region has property P while a point from `No' region does not.

In EDP, problem suppose n=2, so the input space is R^2. The input space has two regions: Yes and No. Points that lie in x1<x2 and x1>x2 form the `Yes' region and points on the line x1=x2 form the `No' region.

Two points a and b are said to be connected if they both belong to `Yes'(No) region and there exists a continuous path from a to b that passes only through `Yes'(No) region. For example, all points in x1<x2 region are connected. But a point a in x1<x2 and b in x1>x2 are not connected since any path from a to b must pass through the line x1=x2 which is a `No' region.

Computational Model

The model of computation here is a linear decision tree, which takes as input X and outputs `Yes' or `No' depending on whether X satisfies the given property P or not. For EDP, the decision tree outputs `Yes` if all the elements are distinct and `No' otherwise. Each internal node v of the tree contains a linear function in input variables d_v(X) and three edges: >0, <0 and =0. Given an input vector X, we decide to take the branch corresponding to the value of d_v(X). Computation starts at the root node and control branches recursively to its children. The decision is made when control reaches a leaf. Each leaf node is labelled `Yes' or `No'. We can think of the decision tree as a function F : Rn → {Yes,No} .

Linear decision trees have the following geometric interpretation: Each internal node defines a hyperplane with equation d(X). We branch at a node depending on whether the input is above, below or on the hyperplane. Let l be a leaf node. Consider the set of all input points R(l) that reach the leaf node l. The set R(l) contains all the points which satisfy the inequalities or equalities along its root-to-leaf path. Since such a set is formed by intersection of half-planes, the resulting set must be a convex region. A convex region is always connected. Hence, a single leaf node can accommodate only one connected component.

Claim 1. If two points a and b reach a leaf  then any point that is a convex combination of a and b must also reach the same leaf.

Bound on the height of the tree

The number of leaves is at least the number of connected components in \$F^{-1}(Yes)\$ and \$F^{-1}(No)\$. It follows that the height of the linear decision tree must be at least \$\log_{3}( F^{-1}(Yes) + F^{-1}(No) )\$, implying a lower bound on the number of comparisons the linear decision tree must make in order to give the correct answer.

Number of connected components in EDP problem

All that we have to do is get a lower bound of the number of connected components.

Claim 2. There are n! 'Yes' connected components in R^n.

Take a point a in R^n with all distinct coordinates and let b be a non-identical permutation of coordinates of a. For example, if a=(1,4,5) then b can be (4,5,1) in R^3. Since b is a non-identical permutation of a, there must be an inversion pair (i,j). i.e there must be two coordinates i and j such that a_i > a_j and b_i < b_j or vice versa. Can inputs a and b reach the same leaf? This is not possible since the straight line connecting a and b must pass through some point with x_i=x_j. So there exists a convex combination of a and b which belongs to 'No' region(which contradicts Claim 1). Every non-identical permutation of a is in a connected component of its own, hence there are n! connected components.

Claim 3. The height of the decision tree must be at least log(n!) = Ω(n log n).

Takeaway Puzzle: Prove an Ω(n log n) lower bound for the following problem by reducing from EDP: Given an array of numbers is there a contiguous subsequence whose sum is zero?