## Abstract corpora (part 1)

June 6, 2008

In *The Logical Structure of Linguistic Theory* (written around 1956, although not published until 1975), Chomsky outlined a theory of linguistic form and stated the goal at the outset: “we will try to show how an abstract theory of linguistic structure can be developed within a framework that admits of operational interpretation, and how such a theory can lead to a practical mechanical procedure by which, given a corpus of linguistic material, various proposed grammars can be compared and the best of them selected” (Chomsky 1975, p. 61). For such a mechanical procedure to be used, it would be necessary to supply an actual collection of linguistic material, that is, utterances recorded in some suitable form, on which it could operate. A grammar, in this context, is construed as a theory (Chomsky 1975, p. 63):

> By “the grammar of a language $L$” we mean that theory of $L$ that attempts to deal with such problems as [projection, ambiguity, sentence type, etc.] wholly in terms of the formal properties of utterances. And by “the general theory of linguistic form” we mean the abstract theory in which the basic concepts of grammar are developed, and by means of which each proposed grammar can be evaluated.

The relationship between a language $L$ and a grammar of $L$, in early generative theory, is conceived of as follows (Chomsky 1957, p. 13):

> From now on I will consider a language to be a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements. … The fundamental aim in the linguistic analysis of a language $L$ is to separate the grammatical sequences which are the sentences of $L$ from the ungrammatical sequences which are not sentences of $L$ and to study the structure of the grammatical sequences. The grammar of $L$ will thus be a device that generates all of the grammatical sequences of $L$ and none of the ungrammatical ones.
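
As a toy illustration of this set-based definition, the sketch below treats a grammar as a device that enumerates word sequences and a language as the set of sequences it produces. The rewrite rules, the vocabulary, and the names `RULES`, `generate`, `LANGUAGE`, and `is_grammatical` are all hypothetical, invented for the example; nothing here comes from Chomsky's text. The rules are deliberately non-recursive, so this language is finite, whereas the definition also allows infinite ones.

```python
from itertools import product

# Hypothetical toy rewrite rules (not from Chomsky): each nonterminal maps
# to a list of alternative right-hand sides. Symbols with no entry in RULES
# are treated as words, i.e. the "finite set of elements".
RULES = {
    "S": [("NP", "VP")],
    "NP": [("the", "N")],
    "N": [("dog",), ("cat",)],
    "VP": [("sleeps",), ("sees", "NP")],
}


def generate(symbol="S"):
    """Yield every word sequence derivable from `symbol`.

    Terminates only because the toy rules contain no recursion.
    """
    if symbol not in RULES:
        # A word: it derives only itself.
        yield (symbol,)
        return
    for rhs in RULES[symbol]:
        # Expand each right-hand-side symbol, then combine the expansions.
        expansions = [list(generate(s)) for s in rhs]
        for parts in product(*expansions):
            yield tuple(word for part in parts for word in part)


# The language, in the quoted sense: the set of sequences the device generates.
LANGUAGE = set(generate())


def is_grammatical(sequence):
    """Return True if the word sequence is a sentence of the toy language."""
    return tuple(sequence) in LANGUAGE
```

Calling `is_grammatical("the dog sees the cat".split())` checks a sequence against the generated set. The device both generates the sentences and, by exclusion, marks every other sequence over the same words as ungrammatical, which is the separation the quoted passage describes.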

The general proposal here is, to some degree, analogous to filling out tax forms. A person’s actual financial situation is a collection of transactions, with money being received and dispensed at various points in time. In filling out a tax form, the person needs to deal with certain problems (net income, withholdings, and the like) that are financial properties of the transactions, and to ignore such things as whether the money was earned by clearing clogged plumbing or by managing a team of financial auditors. The financial situation is evaluated against the tax laws, which define the basic concepts independently of any specific person’s financial situation. The “general theory of linguistic form” is roughly analogous to the pertinent tax laws, and the “grammar of a language $L$” plays a role similar to the information provided on a tax form. In the tax scenario, all of this description and analysis is performed relative to an actual set of financial transactions. The language $L$ is analogous to these transactions, in that it provides the material to be described and analyzed.