R. DUNCAN LUCE, ROBERT R. BUSH, EUGENE GALANTER, EDITORS

HANDBOOK OF MATHEMATICAL PSYCHOLOGY
VOLUME II, CHAPTERS 9-14

R. DUNCAN LUCE received both his B.S. and his Ph.D. from MIT. At present, he is Professor of Psychology at the University of Pennsylvania, a position he has held since 1959. His publications include Games and Decisions (with Howard Raiffa, Wiley, 1957), Individual Choice Behavior (Wiley, 1959), and the editorship of Developments in Mathematical Psychology.

ROBERT R. BUSH graduated from Michigan State and received his Ph.D. from Princeton. He has been Professor and Chairman of the Department of Psychology at the University of Pennsylvania since 1958. His previous publications include Stochastic Models for Learning (with F. Mosteller, Wiley, 1955) and the editorship (with W. K. Estes) of Studies in Mathematical Learning Theory.

EUGENE GALANTER is a graduate of Swarthmore and received his Ph.D. from the University of Pennsylvania. In 1962, he was appointed Professor and Chairman of the Department of Psychology at the University of Washington. His publications include the editorship of Automatic Teaching (Wiley, 1959) and Plans and the Structure of Behavior (with G. A. Miller and K. H. Pribram).

Handbook of Mathematical Psychology
Volume II, Chapters 9-14

WITH CONTRIBUTIONS BY
Saul Sternberg, Noam Chomsky, Richard C. Atkinson, George A. Miller, William K. Estes, Anatol Rapoport

EDITED BY
R. Duncan Luce, University of Pennsylvania
Robert R. Bush, University of Pennsylvania
Eugene Galanter, University of Washington

New York and London, John Wiley and Sons, Inc.

Copyright © 1963 by John Wiley & Sons, Inc. All Rights Reserved. This book or any part thereof must not be reproduced in any form without the written permission of the publisher. Library of Congress Catalog Card Number: 63-9428. Printed in the United States of America.

Preface

A general statement about the background, purposes, assumptions, and scope of the Handbook of Mathematical Psychology can be found in the Preface to Volume I. Those observations need not be repeated; indeed, nothing need be added except to express our appreciation to Mrs. Judith White who managed the administrative details while the chapters of Volume II were being written, to Mrs. Sally Kraska who assumed these responsibilities during the production stage, to Miss Ada Katz for assistance in typing, and to Mrs. Kay Estes who ably and conscientiously prepared the indices. As we said in the Preface to Volume I, "Although editing of this sort is mostly done in spare moments, the cumulative amount of work over three years is really quite staggering and credit is due the agencies that have directly and indirectly supported it, in our case the Universities of Pennsylvania and Washington, the National Science Foundation, and the Office of Naval Research."

Philadelphia, Pennsylvania
June 1963

R. DUNCAN LUCE
ROBERT R. BUSH
EUGENE GALANTER

Contents

9. STOCHASTIC LEARNING THEORY 1
by Saul Sternberg, University of Pennsylvania

10. STIMULUS SAMPLING THEORY 121
by Richard C. Atkinson, Stanford University, and William K. Estes, Stanford University

11. INTRODUCTION TO THE FORMAL ANALYSIS OF NATURAL LANGUAGES 269
by Noam Chomsky, Massachusetts Institute of Technology, and George A. Miller, Harvard University

12.
FORMAL PROPERTIES OF GRAMMARS 323 by Noam Chomsky, Massachusetts Institute of Technology 13. FINITARY MODELS OF LANGUAGE USERS 419 by George A. Miller, Harvard University and Noam Chomsky, Massachusetts Institute of Technology 14. MATHEMATICAL MODELS OF SOCIAL INTERACTION 493 by Anatol Rapoport, University of Michigan INDEX 581 9 Stochastic Learning Theory1 Saul Steinberg University of Pennsylvania 1. Preparation of this chapter was supported by Grant G- 186 30 from the National Science Foundation to the University of Pennsylvania. Doris Aaronson provided valuable help with computations; her work was supported in part by NSF Grant G- 14 839. I wish to thank Francis W. Irwin for his helpful criticism of the manuscript. Contents 1. Analysis of Experiments and Model Identification 1.1. Equivalent events, 7 1.2. Response symmetry and complementary events, 9 1.3. Outcome symmetry, 11 1 .4. The control of model events, 12 1.5. Contingent experiments and contingent events, 14 2. Axiomatics and Heuristics of Model Construction 15 2.1. Path-independent event effects, 1 6 2.2. Commutative events, 17 2.3. Repeated occurrence of a single event, 18 2.4. Combining-classes condition: Bush and Mosteller's linear-operator models, 19 2.5. Independence from irrelevant alternatives : Luce's beta response-strength model, 25 2.6. Urn schemes and explicit forms, 30 2.7. Event effects and their invariance, 36 2.8. Simplicity, 38 Deterministic and Continuous Approximations 39 3.1. Approximations for an urn model, 40 3.2. More on the expected-operator approximation, 43 3.3. Deterministic approximations for a model of operant conditioning, 47 4. Classification and Theoretical Comparison of Models 49 4.1. Comparison by transformation of the explicit formula, 50 4.2. Note on the classification of operators and recursive formulas, 56 CONTENTS 4.3. Implications of commutativity for responsiveness and asymptotic behavior, 56 4.4. Commutativity and the asymptote in prediction experiments, 61 4.5. Analysis of the explicit formula, 65 5. Mathematical Methods for the Analysis of Models 75 5.1. The Monte Carlo method, 76 5.2. Indicator random variables, 77 5.3. Conditional expectations, 78 5.4. Conditional expectations and the development of functional equations, 81 5.5. Difference equations, 83 5.6. Solution of functional equations, 85 6. Some Aspects of the Application and Testing of Learning Models 89 6.1. Model properties: a model type as a subspace, 89 6.2. The estimation problem, 93 6.3. Individual differences, 99 6.4. Testing a single model type, 102 6.5. Comparative testing of models, 104 6.6. Models as baselines and aids to inference, 106 6.7. Testing model assumptions in isolation, 109 7. Conclusion 116 References 117 Stochastic Learning Theory The process of learning in an animal or a human being can often be analyzed into a series of choices among several alternative responses. Even in simple repetitive experiments performed under highly controlled conditions, the choice sequences are typically erratic, suggesting that probabilities govern the selection of responses. It is thus useful to think of the systematic changes in a choice sequence as reflecting trial-to-trial changes in response probabilities. From this point of view, much of the study of learning is concerned with describing the trial-to-trial probability changes that characterize a stochastic process. 
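As a concrete (and entirely illustrative) picture of this view, the short sketch below simulates such a sequence: the probability of one response changes systematically from trial to trial while the observed choices remain erratic. The linear probability-change rule, the parameter values, and the function name are arbitrary assumptions made only for the illustration, not a model proposed in this chapter.

```python
import random

random.seed(1)

def simulate_choices(n_trials=20, p=0.2, gain=0.1):
    """Generate a choice sequence whose response probability changes from
    trial to trial.  Response A1 is coded 1, the alternative A2 is coded 0.
    After each trial the probability of A1 moves a fraction `gain` of the
    way toward 1 (an arbitrary illustrative rule, not the chapter's model)."""
    probs, choices = [], []
    for _ in range(n_trials):
        probs.append(p)
        choices.append(1 if random.random() < p else 0)
        p = p + gain * (1.0 - p)          # systematic trial-to-trial change
    return probs, choices

probs, choices = simulate_choices()
for n, (p, c) in enumerate(zip(probs, choices), start=1):
    print(f"trial {n:2d}: Pr{{A1}} = {p:.3f}   choice = A{2 - c}")
```

Even with a smoothly increasing probability, the printed 0/1 choice record looks irregular; it is the underlying probability sequence, not any single realization, that the models of this chapter describe.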
In recent mathematical studies of learning investigators have assumed that there is some stochastic process to which the behavior in a simple learning experiment conforms. This is not altogether a new idea (for a sketch of its history, see Bush, 1960b). But two important features appear primarily in the work since 1950 that was initiated by Bush, Estes, and Mosteller. First, the step-by-step nature of the learning process has been an explicit feature of the proposed models. Second, these models have been analyzed and applied in ways that do not camouflage their statistical aspect. Various models have been proposed and studied as possible approxi- mations to the stochastic processes of learning. The purpose of this chapter is to review some of the methods and problems of formulating models, analyzing their properties, and applying them in the analysis of learning data.2 The focus of our attention is a simple type of learning experiment. Each of a sequence of trials consists of the selection of a response alternative by the subject followed by an outcome provided by the experimenter. The response alternative may be pressing one of a set of buttons, turning right in a maze, jumping over a barrier before a shock is delivered, or failing to recall a word.3 2 The model enterprise is not and should not be separate from other efforts in the study of learning. It is partly for this reason that I have not attempted a summary of present knowledge vis-a-vis various models. A good proportion of the entire learning literature is directly relevant to many of the questions raised in work with stochastic models. For this reason an adequate survey would be gargantuan and soon outdated. 8 The reader will note from these examples that the terms "choice" and "response alternative" are used in the abstract sense discussed in Chapter 2 of Volume I, For example, I ignore the question whether a choice represents a conscious decision; classes of responses are defined both spatially (as in the maze) and temporally (as in the shuttlebox); a subject's inability to recall a word is grouped with his "choice" not to say it. STOCHASTIC LEARNING THEORY We shall be concerned almost entirely with experiments in which the subject's behavior is partitioned into two mutually exclusive and exhaustive response alternatives. The outcome may be a pellet of food, a shock, or the onset of one of several lights. The outcome may or may not change from trial to trial and its occurrence may or may not depend on the response chosen. When no differential outcome (Irwin, 1961) is provided, as, for example, when the experimenter gives food on every trial or when he does not explicitly pro vide any reward or punishment, we think of the experiment simply as a sequence of response choices. We do not consider experiments in which the stimulus situation is deliberately altered from trial to trial; for our purposes it can be referred to once and then ignored. Because little mathematical work has been done on bar-pressing or runway experiments, they receive little attention. The elements of a stochastic learning model correspond to the com- ponents of the experiment. The sequence of trials is indexed by n = 1 , 2, ... 5 TV. There is a set of response alternatives, {Al9 . . . , A39 . . , Ar}, and a set of outcomes, {Ol9 <92, . . . , Os}. Each response-outcome pair constitutes a possible experimental trial event, Ek. A probability dis- tribution, {px), (A* O^} and {(Ai9 <92), (A2, O2)} have been assumed to be complementary. 
In their treatment of an experiment on imitation Bush and Mosteller (1955) rejected the assump- tion that rewarding an imitative response was complementary to rewarding a nonimitative response. It is in dealing with pairs of events in which the same outcome (a food reward, for example) occurs in conjunction with a pair of "symmetric" responses (left turn and right turn, for example) that investigators have been most inclined to assume that the events are complementary. There appears, however, to be no available response theory that would allow us to determine, from the properties of two (or more) responses, whether they are symmetric in the desired sense. Learning model analyses of a variety of experiments would provide one source of information on which such a theory could be based. At present, therefore, it is primarily intuition that leads us to assume that left and right are symmetric in a sense in which imitation and nonimita- tion are not. Perhaps more obvious examples of asymmetric responses are alternatives A^ and A% in the shuttlebox experiment and alternatives A± and A± in the runway (see Table 1). In the foregoing discussion I have considered the relation between the events determined by two responses for each of which the same outcome is provided, such as "left turn — food" and "right turn — food." A second sense in which response symmetry may be invoked in the design of learning model operators arises when we consider the effects of the same event (such as "left turn — food") on the probabilities of two different responses. In many models the operators that represent the effects of an event are of the same form for all responses; that is, the operators are members of a restricted family, such as multiplicative or linear transformations. These models are, therefore, invariant under a reassignment of labels to responses, so long as the values of one or two parameters are altered. Such invariance represents a second type of response symmetry. This type of symmetry may be defined only in relation to a specified family of operators; when such symmetry obtains, then the family is complete (Luce, 1963) in the sense that it contains the operators appropriate to all the responses. As an example, let us consider the Bush-Mostelier model for two re- sponses, in which p = Pr {^1} and the occurrence of Ek transforms p into Xn-i> Xn_2, . . . , Xj), (1) 6 In this chapter vector random variables are designated by boldface capitals and scalar random variables by boldface lower-case letters. Realizations (particular values) of random variables are designated by the corresponding lightface capital and lower- case letters. STOCHASTIC LEARNING THEORY where the initial probability and other parameters are suppressed.7 This equation makes it clear that pn, a function of random variables, is itself a random variable. Because it gives pn explicitly in terms of the event sequence, we refer to Eq. 1 as the explicit equation for pn. In this section I consider some of the arguments that have been used to restrict the form 2.1 Path-Independent Event Effects At the start of the nth trial of an experiment, pn is the subject's response probability, and the sequence Xl9 X* . . . , X^ describes the course of the experiment up to this trial. This sequence, then, specifies the "path" traversed by the subject in attaining the probability pn. A simplifying assumption which underlies most of the learning models that have been studied is that the event on trial n has an effect that depends onpn but not on the path. 
The implication is that insofar as past experience has any influence on the future behavior of the process this influence is mediated entirely by the value ofpn. Another way of saying this is that the subject's state or "memory" is completely specified by his p- value. The assumption of independence of path leads naturally to a recursive expression for the model and to the definition of a set of operators. The recursive form is given by (2) 7 Equation 1, and many of the other equations in this chapter in which response prob- abilities appear, may be regarded in two different ways. The first alternative, expressed by the notation in Eq. 1 , is to regard pn as a function of random variables and therefore to consider pn itself as a random variable. This alternative is useful in emphasizing one of the important features of modern learning models — the fact that most of them specify a distribution of /^-values on every trial after the first. By restricting the event sequence in any way, we determine a new, conditional distribution for the random variable. And we may be interested, for example, in determining the corresponding conditional expectation. The second alternative is to regard the arguments in a formula such as Eq. 1 as realizations of the indicated random variables and the p- values it defines as conditional probabilities, conditioned by the particular event sequence. The formula is then more properly written as pn = Pr {A! on trial n \ Xi = Xl9 X2 = X^ . . . , Xw-i = X»-i} = F(nt JCn-i, -A«-2j • • • ? -^i)- Aside from easing the notation problem by reducing the number of boldface letters required, this alternative is occasionally useful; for example, the likelihood of a sequence of events can be expressed as the product of a sequence of such conditional probabilities. In this chapter, however, I make use of the first alternative. 8 1 omit stimulus sampling considerations, which are discussed in Chapter 10. AXIOMATIGS AND HEURISTICS OF MODEL CONSTRUCTION J/ which indicates thatpn+l depends only on pn and on the event of the nth trial. Equation 2 is to be contrasted with the explicit form given by Eq. 1. We note that conditional on the value of pw, pn+1 is not only independent of the particular events that have occurred (the content of the path) but also of their number (the path length). By writing Eq. 2 separately for each possible value of Xn we arrive at a set of trial-independent operators or transition rules : (1,0, ...,0)] if E± on trial n ]n =/[pw; (0, 1, . . . , 0)] if E2 on trial n I»=/[P»; (0,0, ...,!)] if Et on trial n. A common method for developing a model for an experiment is to begin with a set of plausible operators and rules for their application. If the event probabilities during the course of an experiment are functions of pn alone, as they usually are, then path independence implies that the learning model is a discrete-time Markov process with an infinite number of states, the states corresponding to /rvalues. The assumption is an extremely strong one, as indicated by three of its consequences, each of which is weaker than the assumption itself: 1. The effect of an event on the response probability is completely manifested on the succeeding trial. There can be no "delayed" effects. Examples of direct tests of this consequence are given later in this chapter. 2. 
When conditioned by the value of pw (i.e., for any particular value of pn), the magnitude of the effect on the response probability of the nth event is independent of the sequence of events that precedes it. 3. When conditioned by the value of pn, the magnitude of the effect on the response probability of the nth event is independent of the trial number. Operators cannot be functions of the trial number. Several models that meet conditions 1 and 2 but not 3 have been studied (Audley & Jonckheere, 1956, and Hanania, 1959). These models are quasi-independent of path, involving event effects that are independent of the content of the path but dependent on its length. 2.2 Commutative Events Events are defined to be commutative if pn is invariant with respect to alterations in the order of occurrence of the events in the path. To make this idea more precise, let us define a ^-dimensional row vector, Wn, J(£ STOCHASTIC LEARNING THEORY whose fcth component gives the cumulative number of occurrences of event Ek on trials 1, 2, ...,«- 1. We then have Ww = Xx + X2 + . . . + Xn_i. Under conditions of commutativity, the vector Wn gives sufficient in- formation about the path to determine the value of pn, and although a recursive expression does not result naturally Eq. 1 is simplified and becomes (4) Several models have been proposed in terms of an explicit equation for yn of this simple form, in which pn depends only on the components of Wn. A further simplification arises if we require not only that jFbe a function of the components of Wn but also that it be expressible as a continuous function of an argument that is linear in these components. Under these circumstances we have not only commutativity but path independence as well.9 As we shall see, path-independent models need not be com- mutative. Events that commute are, in a certain sense, not subject to being for- gotten. In a commutative model the effect of an event on pn is the same, whether it occurred on trial 1 or trial n — 1. Because the distant past is as significant as the immediate past, models of this kind tend to be rela- tively unresponsive to changes in the outcome sequence. (An example is given in Sec. 4.) Commutativity leads to considerable simplifications in the analysis of models and in estimation procedures. 2.3 Repeated Occurrence of a Single Event What is the effect on response probabilities of the repeated occurrence of a particular event? This is an important consideration in formulating a 9 If F can be written as a continuous function of an argument that is linear in the components of W», then events must have path-independent effects. The linearity condition requires that there exist some column vector, A, of coefficients for which pn = .F(Wn) = G(Wn • A). Then pn+1 = G(Wn+1 • A) = G[(Wn + Xn) - A] = G[Wn - A + Xn • A] = GtG-Kpn) + Xn - A] =/(pn • Xn). A slight modification of the argument, which uses the continuity of G, is needed if G does not possess a unique inverse. One implication of some recent work by Luce (1963) on the commutativity property is that the converse of this result is also true : if a model is both commutative and path-independent then F can be written as a function of an argument that is linear in the components of Wn. AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION 1$ model. Does p converge to any value as Ek is repeated again and again, and, if so, what value? 
Except for events that have no effect at all on p, it has been assumed in many applications of learning models that a par- ticular event has either an incremental or a decremental eifect on p and has this effect whatever the value of p as long as it is in the open (0, 1) interval. In most cases, then, repetition of a particular event causes p to converge to zero or one. Although in principle, learning models need not have this convergence property, it seems to be called for by most of the experiments to which they have been applied. This does not imply that the limiting behavior of these models always involves extreme response prob- abilities ; in some instances successive repetition of any one event is im- probable. A related question concerns the sense in which the effect of an event is invariant during the course of an experiment. Neither commutativity nor path independence requires that the effect of an event on p be in the same direction — incremental or decremental — throughout an experiment. Commutativity alone, for example, does not require that F in Eq. 4 be monotonic in any one of the components of Wn. Path independence implies that if the direction of effect is to change at all in the course of an experiment the direction can depend at most on pn. These possibilities and restrictions are relevant to the question whether existing models can handle such phenomena as the development of secondary reinforce- ment or any changes that may occur in the effects of reward and nonreward. 2.4 Combining-Classes Condition: Bush and Mosteller's Linear-Operator Models Despite their strong implications, neither path independence nor com- mutativity is restrictive enough to produce useful models. Further assumptions are needed. Two important families of models have used path independence as a starting point. The first, the family with which most work has been done, comprises Bush and Mosteller's linear-operator models. In this section we shall consider the general characteristics of the linear-operator family and its application to two experiments. The second family, to be discussed in Sec. 2.5, with applications to the same pair of experiments, is Luce's response-strength operator family. In both families the additional assumptions are invariance conditions concerning multiple-response alternatives. The combining-classes condition is a precise statement of the assumption that the definition of response alternatives is arbitrary and that therefore 20 STOCHASTIC LEARNING THEORY any set of actions by the subject can be combined and treated as one alter- native in a learning model.10 It might appear that the assumption is un- tenable because it ignores any natural response organization that may characterize the organism; this issue has yet to be clarified by theoretical and experimental work. The assumption is not inconsistent with current learning theory, however. Concerning the problem of response definition, Logan (I960, p. 117-120) has written: The present approach assumes that responses that differ in any way whatso- ever are different in the sense that they may be put into different response classes so that separate response tendencies can be calculated for them .... Differences among responses that are not importantly related to response tendency can be suppressed. This conception is consistent with what appears to be the most common basis of response aggregation, namely the conditions of reinforcement. Responses are separated if the environment . . . 
distinguishes between them in administering rewards The rules discussed above would permit any aggregation that preserves the reinforcement contingencies in the situation. Thus, if the reward is given independently of how the rat gets into the correct goal box of a T-maze, then the various ways of doing it can be classed together. The combining-classes condition is concerned with an experiment in which three or more response alternatives, Al9 A& . . . , Ar9 are initially defined. A subset, Ah+l9 Ah+2, . . . , Ar, of the alternatives is to be com- bined into A*, thus producing the reduced set of alternatives Al9 A2, . . . , Ah9 A*. We consider the probability vector associated with the reduced set of alternatives after a particular event occurs. The combining-classes condition requires this vector to be the same, no matter whether the com- bining operation is performed before or after the event occurs; this invariance with respect to the combining operation is required to hold for all events and for any subset of the alternatives. The result of applying the condition is that in a multiple-alternative 10 As originally stated by Bush, Mosteller, and Thompson (1954) and later by Bush and Mosteller (1955) the actions to be combined must have the same outcome probabilities associated with them. For example, if Al and Az are the alternatives to be combined, then Pr {Ot \ AJ = Pr {Ot- 1 Az} must hold for all outcomes Ot. This was necessary because outcome probabilities conditional on responses were thought of as a part of the model, and without the equality the probability Pr {Ot \ A± or A%} would not be well defined. From the viewpoint of this chapter, the model for a subject's behavior depends on the particular sequence of outcomes, and any probability mechanism that the experimenter uses to generate the sequence is extraneous to the model. What must be well defined is the sequence of outcomes actually applied to any one of the alternative responses, and this sequence is well defined even if the actions to be combined into one alternative are treated differently. For a formal treatment of the combining classes condition see the references already cited, Mosteller (1955), or Bush (1960b). AXIOMATIGS AND HEURISTICS OF MODEL CONSTRUCTION 21 experiment the effect of an event on the probability pn of one of the alter- natives can depend onpn but cannot depend on the relative probabilities of choice among the other alternatives. To see this, let us suppose that the change mpn does depend on the way in which 1 — pn is distributed among the other alternatives. Then, if alternatives defined initially are combined before the event, the model cannot reflect the distribution of 1 — pn among them. Thus, even if we can arrange to have/?^ well defined, its value will not, in general, be the same as if the alternatives were combined after the event, in contradiction to the invariance assumption. Together with the path-independence assumption, the combining-classes condition requires not only that each component of the/^-vector depend on Xn and on the corresponding component of the pn-vector only, but it also requires that this dependence be linear. The effect of a particular event on a set of response probabilities is therefore to transform each of them linearly, allowing us to write the operator Qk as Pn+l = QlcPn = *lcPn + <*& (5) where the values of OCA and ak are determined by the event Ek. This result can be proved only when r > 3, where r is the number of response alter- natives. 
However, if we regard r = 2 as having arisen from the combination of a larger set of alternatives for which the condition holds, then the form of the operator is that given by Eq. 5.

One aspect of using linear transformations on response probabilities is that the parameters must be constrained to keep the probability within the unit interval. The constraints have usually been determined by the requirement that all possible values of p_n, from 0 to 1, be transformed into probabilities by the operator. A consequence of this requirement is that with r alternatives −1/(r − 1) ≤ α ≤ 1. It is in the spirit of the combining-classes condition that, in principle, an unlimited number of response classes may be defined. If r becomes arbitrarily large, then we see that one implication of the condition is that negative α's are inadmissible.

In comparing operators Q_k and considering their properties, it is useful to define a new parameter λ_k = a_k/(1 − α_k). The general operator can then be rewritten as

    p_{n+1} = Q_k p_n = α_k p_n + (1 − α_k)λ_k,    (6)

where the constraints are 0 ≤ α_k ≤ 1 and 0 ≤ λ_k ≤ 1. The transformed probability Q_k p_n may be thought of as a weighted sum of p_n and λ_k, and the operator may be thought of as moving p_n in the direction of λ_k. Because Q_k λ_k = λ_k, the parameter λ_k is the fixed point of the operator: the operator does not alter a probability whose value is λ_k. In addition, when α_k ≠ 1, λ_k is the limit point of the operator: repeated occurrence of the event E_k leads to repeated application of Q_k, and this causes the probability to approach λ_k asymptotically. This may be seen by first calculating the effect of m successive applications of the operator Q to p:

    Q^m p = α^m p + (1 − α^m)λ = λ − α^m(λ − p).

If α < 1, lim_{m→∞} α^m = 0 and therefore lim_{m→∞} Q^m p = λ.

As noted in Sec. 2.3, values of λ other than the extreme values zero or one have seldom been used in practice. An extreme limit point automatically requires that α be nonnegative if p_{n+1} is to be confined to the unit interval for all values of p_n. In order to justify the assumption that α is nonnegative, we can therefore usually appeal to the required limit point rather than to the extension of the combining-classes condition already mentioned. Because of this and because multiple-choice studies with r ≥ 3 are relatively rare, the combining-classes condition has never been put directly to the test.

The parameter α_k may be thought of as a learning-rate parameter. Its value is a measure of the ineffectiveness of the event E_k in altering the response probability. When α_k takes on its maximum value of 1, then Q_k is an identity operator and event E_k induces no change in the response probability. The smaller the value of α_k, the greater the change in probability that E_k produces. When an operator has the limit point λ = 1, it is convenient to deal with q = 1 − p, the probability of the other response, and to make use of the complementary operator, whose form is Q₂q = α₂q.

To be concrete, we consider examples of the Bush-Mosteller model for two of the experiments discussed in Sec. 1.

ESCAPE-AVOIDANCE SHUTTLEBOX. We interpret this experiment to consist of two subject-controlled events: escape (shock) and avoidance. Both events reduce the probability p_n of escape in the direction of the limit point λ = 0.
It is convenient to define a binary random variable that represents the event on trial n,

    x_n = 0 if avoidance (E₁),  x_n = 1 if escape (E₂),    (7)

with a probability distribution given by Pr{x_n = 1} = p_n, Pr{x_n = 0} = 1 − p_n. The operators and the rule for their application are

    Q₁p_n = α₁p_n   if x_n = 0 (i.e., with probability 1 − p_n)    (8)
    Q₂p_n = α₂p_n   if x_n = 1 (i.e., with probability p_n).

Because events are subject-controlled, the sequence of operators (events) is not predetermined. The vector X_n (Sec. 2) is given by (x_n, 1 − x_n) and the recursive form of Eq. 2 (Sec. 2.1) is given by

    p_{n+1} = f(p_n; X_n) = α₂^{x_n} α₁^{1−x_n} p_n.    (9)

The operators have equal limit points and therefore commute. This makes possible a simple explicit formula for p_n. Let W_n = (s_n, t_n), where

    s_n = Σ_{j=1}^{n−1} x_j

is the number of shocks before trial n and

    t_n = Σ_{j=1}^{n−1} (1 − x_j)

is the number of avoidances before trial n. The explicit form of Eq. 4 (Sec. 2.2) is given by

    p_n = F(W_n) = α₂^{s_n} α₁^{t_n} p₁.    (10)

Later I shall make use of the fact that, by redefining the parameters of the model, p_n may be written as a function of an expression that is linear in the components of W_n. To do this, let p₁ = e^{−a}, α₁ = e^{−b}, and α₂ = e^{−c}. Then Eq. 10 becomes

    p_n = exp[−(a + b t_n + c s_n)].    (11)

It should be emphasized that, for this model, t_n and s_n and, therefore, p_n are random variables whose values are unknown until trial n − 1. The set of trials is a dependent sequence.

PREDICTION EXPERIMENT. For purposes of illustration we make the customary assumption that this experiment consists of two experimenter-controlled events. Onset of the left light (E₁) increases the probability p_n of a left button press toward the limit point λ = 1. Onset of the right light (E₂) decreases p_n toward the limit point λ = 0. The events are assumed to be complementary (Sec. 1.2), and therefore the operators have equal rate parameters (α₁ = α₂ = α) and complementary limit points (λ₁ = 1 − λ₂). It is convenient to define a binary variable that represents the event on trial n,

    y_n = 0 if right light (E₂),  y_n = 1 if left light (E₁).

Because events are experimenter-controlled, the sequence y₁, y₂, . . . can be predetermined. In some experiments a random device may be used to generate the actual sequence used. For example, the {y_n} may be a realization of a sequence of independent random variables with Pr{y_n = 1} = π. However, insofar as we are interested in the behavior of the subject, the actual sequence, rather than any properties of the random device used to generate it, is of interest. It is shown later how simplifying approximations may be developed by assuming that the subject has experienced the average of all the sequences that the random device generates. For the purpose of such approximations, which, of course, involve loss of information, y_n may be considered a random variable with a probability distribution. The more exact treatment, however, deals with the experiment conditional on the actual outcome sequences that are used. The operators and the rules for their application are

    Q₁p_n = αp_n + 1 − α   if y_n = 1    (12)
    Q₂p_n = αp_n           if y_n = 0.

The vector X_n (Sec. 2) is given by (y_n, 1 − y_n) and the recursive form of Eq. 2 is given by

    p_{n+1} = αp_n + (1 − α)y_n.    (13)

Note that in the exact treatment p_n is not a random variable, unlike the case for an experiment with subject control. The operators do not commute, and therefore the cumulative number of E₁'s and E₂'s does not determine p_n uniquely.
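To make the two examples concrete, here is a minimal simulation sketch (not part of the original chapter): it applies the shuttlebox operators of Eqs. 8-9 trial by trial and checks the result against the explicit formulas of Eqs. 10-11, and it iterates the prediction-experiment recursion of Eq. 13 for a fixed outcome sequence. The function names, the parameter values (p₁, α₁, α₂, α, and the 75:25 outcome mix), and the trial counts are arbitrary illustrations, not values taken from the text.

```python
import math
import random

random.seed(2)

# --- Escape-avoidance shuttlebox (subject-controlled events, Eqs. 7-11) ---
def shuttlebox(p1, alpha1, alpha2, n_trials):
    """Apply Q1 (avoidance, x_n = 0) or Q2 (escape, x_n = 1) on each trial.
    Returns the final escape probability and the event counts
    (t_n avoidances, s_n shocks), so the explicit formula, Eq. 10,
    can be checked against the recursion, Eq. 9."""
    p, t, s = p1, 0, 0
    for _ in range(n_trials - 1):
        x = 1 if random.random() < p else 0      # escape occurs with probability p_n
        p = (alpha2 if x == 1 else alpha1) * p   # Eq. 9: p_{n+1} = a2^x a1^(1-x) p_n
        s += x
        t += 1 - x
    return p, t, s

p1, a1, a2 = 0.9, 0.80, 0.95
p_n, t_n, s_n = shuttlebox(p1, a1, a2, n_trials=50)
explicit = (a2 ** s_n) * (a1 ** t_n) * p1        # Eq. 10
print(f"shuttlebox: recursive p_50 = {p_n:.6f}, explicit Eq. 10 = {explicit:.6f}")
# Eq. 11 is the same value written as exp[-(a + b t_n + c s_n)]:
a, b, c = -math.log(p1), -math.log(a1), -math.log(a2)
print(f"            Eq. 11 form     = {math.exp(-(a + b * t_n + c * s_n)):.6f}")

# --- Prediction experiment (experimenter-controlled events, Eqs. 12-13) ---
def prediction(p1, alpha, outcomes):
    """Outcomes y_n = 1 (left light) or 0 (right light) are fixed in advance;
    Eq. 13 gives p_{n+1} = alpha * p_n + (1 - alpha) * y_n."""
    p = p1
    for y in outcomes:
        p = alpha * p + (1 - alpha) * y
    return p

outcomes = [1 if random.random() < 0.75 else 0 for _ in range(50)]  # a 75:25 sequence
print(f"prediction: p after 50 trials = {prediction(0.5, 0.9, outcomes):.4f}")
```

Because the shuttlebox operators commute, only the counts t_n and s_n matter in the check; for the prediction experiment the particular order of the y_n does matter, which is why the sketch carries the whole outcome list through the recursion.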
The explicit formula for pn9 in contrast to the shuttlebox example, includes the entire sequence yl9 ya, . . . and is given by Pn = F(n> -^is X29 . . . , ^M_i) = ocw~ .pi + (1 — <*) 2 aW *2^- (^ v 3=1 Equation 14 shows that (when a < 1) a recent event has more effect on pn than an event in the distant past. By contrast, Eq. 10 indicates that for the shuttlebox experiment there is no "forgetting" in this sense: given that the event sequence has a particular number of avoidances, the effect of an early 'avoidance onpn is no different from the effect of a late avoid- ance. As I mentioned earlier, this absence of forgetting is a characteristic of all experiments with commutative events. The model for the prediction experiment consists of a sequence of inde- pendent binomial trials: the /^-sequence is determined by the 2/n-sequence which is independent of all responses. AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION 2$ 2.5 Independence from Irrelevant Alternatives: Luce's Beta Response-Strength Model Stimulus-response theory has traditionally treated response probability as deriving from a more fundamental response-strength variable. For example, Hull (1943, Chapter 18) conceived of the probability of reaction (where the alternative was nonreaction) as dependent, in a complicated way, on a reaction-potential variable that is more fundamental in his system than the probability itself. The momentary reaction potential was thought to be the sum of an underlying value and an error (behavioral oscillation) that had a truncated normal distribution: the response would occur if the effective reaction potential exceeded a threshold value. The result of these considerations was that reaction probability was related to reaction potential by means of a cumulative normal distribution which was altered to make possible zero probabilities. The alteration implied that a range of reaction potentials could give rise to the same (zero) probability. Such a threshold mechanism has not been explicitly embodied in any of the modern stochastic learning models. The reaction potential variable was more fundamental partly because it changed in a simple way in response to experimental events and partly because the state of the organism was more completely described by the reaction potential than by the probability. The last observation is clearer if we turn from the single-response situation considered by Hull11 to an experiment with two symmetric responses. The viewpoint to be considered is that such an experiment places into competition two respon- ses, each of which may vary independently in its strength. Let us suppose that response strengths, symbolized by v(\) and r(2), are associated with each of the two responses and that the probability of a response is given by the ratio of its strength to the sum of the two strengths: />{!} = KI)/M!) + »(2)]- Then> although the two strengths determine the probability uniquely, knowledge of the probability can tell us only the ratio of the strengths, v(l)/v(2). Multiplying both strengths by the same constant does not alter the response probability, but it might, for example, correspond to the change from an avoidance-avoidance to an approach- approach conflict and might be revealed by response times or amplitudes. The response strengths therefore provide a more basic description of the state of the organism than the response probabilities. 
This is the sort of thinking that might lead one to favor a learning model whose underlying 11 For the symmetric two-choice experiment a response-strength analysis that is con- siderably different from the one discussed here is given by Hull (1952). 2Q STOCHASTIC LEARNING THEORY variables are response strengths. A critical question regarding this view- point is whether there are aspects of the behavior in choice learning experi- ments that can be accounted for by changes in response strengths but that are not functions of response probabilities alone. An invariance condition concerning multiple response alternatives, but different from the combining-classes condition, is used by Luce (1959) in arriving at his beta response-strength model. Path independence of the sequence of response strengths is also assumed, and within the model this entails path independence of the sequence of probability vectors. The invariance condition (Luce's Axiom 1) states that, in an experiment in which one of a set of responses is made, the ratio of the probabilities of two alternatives is invariant with respect to changes in the set of remaining alternatives from which the subject can select. As stated, the condition applies to choice situations in which the probabilities of choice from a constant set of alternatives are unchanging. Nonetheless, by assuming that the condition holds during any instant of learning, we can use it to restrict the form of a learning model. The condition implies that a positive response-strength function, v(j\ can be defined over the set of alternatives with the property that The subsequent argument rests significantly on the fact that v(j) is a ratio scale and that the scale values are determined by the choice probabilities only up to multiplication by a positive constant; that is, the unit of the response-strength scale is arbitrary. The argument begins with the idea that in a learning experiment the effects of an event on the organism can be thought of as* a transformation of the response strengths. Two steps in the argument are critical in restricting the form of this transformation. First, it is observed that because the unit of response strength is arbitrary the transformation /must be invariant with respect to changes in this unit: f[kv(j)] = kf[v(j)]. Second, it is assumed that the scale of response strength is unbounded and that therefore any real number is a possible scale value. The independence- of-unit condition, together with the unboundedness of the scale, leads to the powerful conclusion that the only admissible transformation is multiplication by a constant.12 The requirement that response strengths 12 As suggested by Violet Cane (1960), it is not possible to have both an unbounded response-strength scale and choice probabilities equal to zero or unity ("perfect dis- crimination"); for, if choice probabilities can take on all values in the closed unit interval, then the z?-scale must map onto this closed interval and must therefore itself extend over a closed, and thus bounded, interval. But the unboundedness of the scale AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION 2J be positive implies that the multiplying constant must be positive. Path independence implies that the constant depends only on the event and not on the trial number or the response strength. This argument completely defines the form of a learning model— called the "beta model"— for experiments with two alternative responses. 
The model defines a stochastic process on response strengths, which in turn determines a stochastic process on the choice probabilities.

When event E_k occurs, let v(1) and v(2) be transformed into a_k v(1) and b_k v(2). The new probability is then

    Pr{1} = a_k v(1) / [a_k v(1) + b_k v(2)] = (a_k/b_k)[v(1)/v(2)] / {1 + (a_k/b_k)[v(1)/v(2)]}.

If we let v = v(1)/v(2) be the ratio of response strengths and β_k = a_k/b_k be the ratio of constants, the original probability is v/(1 + v) and the transformed probability is β_k v/(1 + β_k v). The ratio β_k and the relative response strength v are sufficient to determine p and its transformed value. Because response strengths are important in this chapter only insofar as they govern probabilities, the simplified notation is adequate. We let v_n be the relative response strength v(1)/v(2) on trial n, let β_k be the multiplier of v_n that is associated with the event E_k, and let p_n be Pr{A₁ on trial n}. Then

    p_n = v_n/(1 + v_n)   and   v_n = p_n/(1 − p_n).    (15)

Moreover, if event E_k occurs on trial n,

    v_{n+1} = β_k v_n,

which gives the corresponding nonlinear transformation on response probability,

    p_{n+1} = β_k p_n / [(1 − p_n) + β_k p_n].    (16)

A number of implications of the model can be seen immediately from the form of the probability operator. If E_k has an incremental (decremental) effect on Pr{A₁}, then β_k > 1 (< 1). An identity operator results when β_k = 1. The only limit points possible are p = 0 and p = 1, which are obtained, respectively, when β_k < 1 and β_k > 1. This follows because

    Q^m p = β^m v / (1 + β^m v),

and, when β ≠ 1, either β^m or β^{−m} approaches zero. These properties imply that the effect of an event on p_n must always be in the same direction; all operators are unidirectional, in contrast to operators in the linear model, which may have fixed points other than zero and unity. The restriction to extreme limit points appears not to be serious in practice, however; as noted in Sec. 2.4, most experiments seem to call for unidirectional operators. Perhaps more important from the viewpoint of applications is the fact that operators in the beta response-strength model must always commute; the model requires that events in learning experiments have commutative effects. That nonlinear probability operators of the form given by Eq. 16 commute can be shown directly, or can be seen more simply by noting the commutativity of the multiplicative transformations of v_n, to which such operators correspond. Whether or not commutativity is realistic, it is a desirable simplifying feature of the model. A final property is that the model cannot produce learning when p₁ = 0 because this requires that v₁, hence all v_n, be zero.

(Footnote 12, continued.) The unboundedness of the scale is an important feature of the argument that forces the learning transformation to be multiplicative. It therefore appears that Luce's axiom leads to a multiplicative learning model only when it is combined with the assumption that response probabilities can never be exactly zero or unity. In practice, this assumption is not serious, since a finite number of observations does not allow us to distinguish between a probability that is exactly unity and one that is arbitrarily close to that value. The fact that the additional assumption is needed, however, makes it difficult to disprove the axiom on the basis of a failure of the learning model, since the fault may lie elsewhere.

Let us consider applications of the beta model to the experiments discussed in Sec. 2.4. When applicable, the same definitions are used as in that section.

ESCAPE-AVOIDANCE SHUTTLEBOX. It is convenient to let p_n be the probability of escape (response A₂, event E₂) and to let v_n be the ratio of escape strength to avoidance strength. Both events reduce the probability of escape, and therefore both β₁ < 1 and β₂ < 1. The binary random variable x_n is defined as in Eq. 7. The operators and the rules for their application are

    Q₁p_n = β₁p_n / [(1 − p_n) + β₁p_n]   if x_n = 0 (i.e., with probability 1 − p_n)    (17)
    Q₂p_n = β₂p_n / [(1 − p_n) + β₂p_n]   if x_n = 1 (i.e., with probability p_n).

The recursive form of Eq. 2 is given by

    p_{n+1} = β₂^{x_n} β₁^{1−x_n} p_n / [(1 − p_n) + β₂^{x_n} β₁^{1−x_n} p_n].    (18)
Both events reduce the prob- ability of escape and therefore both fa < 1 and fa< I. The binary random variable xn is defined as in Eq. 7. The operators and the rules for their application are C1 — Pn) + AiPn (i.e., with probability 1 — pn) (1 - Pn) + fan (i.e., with probability pn). The recursive form of Eq. 2 is given by AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION 2Q Both expressions are cumbersome. More light is shed by the explicit formula Redefining the parameters simplifies Eq. 19. Let vl = ea, ^ = eb and /?2 = ec. Then the expression becomes Pw " 1 + exp [-(a + btn + csj] ' (20) It is instructive to compare this explicit formula with Eq. 1 1 for the linear model. In the usual experiment a would be positive and b and c would be negative, unlike the coefficients in Eq. 11, all of which are positive. (Recall that as tw, sw increase pn decreases. These definitions are awkward, but they will facilitate matters later on.) Again it should be noted that tn and sn are random variables whose behavior is governed by ^-values earlier in the sequence and that the model therefore defines a dependent sequence of trials. PREDICTION EXPERIMENT. Experimenter-controlled events are assumed here, as in Sec. 2.4. The {pn} and {yn} are defined as in that section. The complementarity of events demands that & = f$^ = /? > 1. This can be seen by noting that if E± transforms v(l)/v(2) into j3v(l)/v(2) then for operators to be complementary E2 must transform z?(2)/u(l) into /5t?(2)/t?(l). The operators and the rules for their application therefore are I — « ,1 A . P lf y" ^ I (1 - pj + j8pn Pn+l / - Pn + Pn The recursive expression is given by Because of the universal commutativity of the beta model, the explicit formula is simple in contrast to Eq. 14. We have Ww = (ln, rj, where n-l 'n = n is the number of left-light outcomes before trial n and 20 STOCHASTIC LEARNING THEORY is the corresponding number of right-light outcomes. We define dn = ln - rn to be the difference between these numbers. The explicit formula is then . 1 - (23) For this model commutativity seems to be of more use than path independ- ence in simplifying formulas. Again we can define new parameters v^ = ea, ft = e-* to obtain Pn = 1 + exp [-(a + bdn)] ' (24) Equation 24 indicates that the response probability is expressed in terms of dn by means of the well-known logistic function. Equation 20 is a generalized form of this function. All of the two-alternative beta models have explicit formulas that are (generalized) logistics. Because the logistic function is similar to the cumulative normal distribution, the relation in the beta model between response strength and probability is reminiscent of Hull's treatment of this problem. 2.6 Urn Schemes and Explicit Forms The treatment of examples in Sees. 2.4 and 2.5 illustrates two of the alter- native ways of regarding a stochastic learning model. One approach is to specify the change in pn that is induced by the event (represented by Xn) on trial n. This change in probability depends, in general, on X1? . . . , Xn_! as well as on Xn. In the general recursive formula for response prob- ability we therefore express pn+1 as a function of pw and the events through Pw+l =/(PnJ Xn, Xn_l5 . . . , XJ. If the model is path-independent, then pn+1 is uniquely specified by pn and Xn, and the expression may be simplified to give the recursive formula of Sec. 2.1, and its corresponding operator expressions. 
The second approach is to specify the way in which pn depends on the entire sequence of events through trial n — 1. This is done by the explicit formula in which pn is expressed as a function of the event sequence: Pn = -F(Xn_i, Xn_2, . . . , X-L). A model may have a more "natural" representation in one of these forms than in the other. In this section I discuss models for which the explicit form is the more natural. AXIOMATIGS AND HEURISTICS OF MODEL CONSTRUCTION 3! URN SCHEMES. Among the devices traditionally used in fields other than learning to represent the aftereffects of events on probabilities are games of chance known as urn schemes (Feller, 1957, Chapter V). An urn contains different kinds of balls, each kind representing an event. The occurrence of an event is represented by randomly drawing a ball from the urn. After- effects are represented by changes in the urn's composition. Schemes of this kind are among the earliest stochastic models for learning (Thurstone, 1930; Gulliksen, 1934) and are still of interest (Audley & Jonckheere, 1956; Bush & Mosteller, 1959). The stimulus-sampling models discussed in Chapter 10 may be regarded as urn schemes whose balls are interpreted as "elements" of the stimulus situation. In contrast, Thurstone (1930) suggested that the balls in his scheme be interpreted as elements of response classes. In all of these examples two kinds of balls, corresponding to two events, are used. An urn scheme is introduced to help make concrete one's intuitive ideas about the learning process. Except in the case of stimulus-sampling theory, the interpretation of the balls as psychological entities has not been pressed far. The general scheme discussed by Audley and Jonckheere (1956) encom- passes most of the others as special cases. It is designed for experiments with two subject-controlled events. On trial 1 the urn contains wl white balls and rx red balls. A ball is selected at random. If it is white, event E^ occurs (xx = 0), the ball is replaced, and the contents of the urn are changed so that there are now wx + w white balls and r± + r red balls. If the chosen ball is red, event E2 occurs (xx = 1), the ball is replaced, and the new numbers are % + w' and ^ + rf. This process is repeated on each trial. The quantities w, w', r and r' have fixed integral values that may be positive, zero, or negative, but, if any of them are negative, then arrangements must be made so that the urn always contains at least one ball and so that the number of balls of either color is never negative. »-i Let tn = 2 C1 ~" xj) ke the number of occurrences of El and sn n-l j=l = 2 XJ be the number of E% occurrences before trial n. Let wn and rn }=i be the number of white and red balls in the urn before the nth trial. Then wn = wx + wtn + w'sn» Tn = rx + rtn + r'$n and pn = Pr {event E2 on trial n} rtn ri + Wj) + (r + w)tn + (r' + £2 STOCHASTIC LEARNING THEORY Equation 25 gives the explicit formula for pn. It demonstrates the most important property of these models — their commutativity. If ww and rn are interpreted as response strengths, the model can be regarded as a description of additive (rather than multiplicative) trans- formations of these strengths.13 The recursive formula and corresponding operators are unwieldy and are not given. Suffice it to say that the operators are nonlinear and depend on the trial number (that is, on the path length) but not on the particular sequence of preceding events (the content of the path). 
The model, therefore, is only quasi-independent of path (Sec. 2.1). This is the case because the change induced by an event in the proportion of red balls depends on pn, on the numbers of reds and whites added (which depend only on the event), and on the total number of balls in the urn before the event occurred (which can in general be inferred only from knowledge of both pn and ri). Two special cases of the urn scheme that are exceptions to the fore- going statement and produce path-independent models are (1) those for which r + w = r' + v/ = 0, so that the total number of balls is constant, and (2) those for which either r = r' = 0 or w = w' = 0, so that the number of balls of one color is constant. The first condition is met by Estes' model, which, however, departs in another respect from the general scheme: its additive increments vary with the changing composition of the urn instead of being constant. (This modification sacrifices commuta- tivity, but it is necessary if the probability operators are to be linear. The modification follows from the identification of balls with stimulus ele- ments, and so is less artificial than it sounds.) The second condition is assumed in the urn scheme that Bush and Mosteller (1959) apply to the shuttlebox experiment. They assume that r = /•' = 0, so that only white balls are added to the urn; neither escape nor avoidance alters the "strength" of the escape response. The model is modified so that w and w' are continuous parameters rather than discrete numbers, as they would have to be in a strict interpretation as numbers of balls. The result may be expressed in simple form by defining a = (rx + Wj)//*!. To be consistent with Eqs. 1 1 and 20, we let b = w/rx and c = v//fi and obtain 13 The model is also appropriate if it is thought that strength v(J) is transformed mul- tiplicatively but that response probability depends on logarithms of strengths: /?(!) = log 0(l)/log 1X1X2)]. In such a case wn and rn are interpreted as logarithms of response strengths. AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION 5 The operators are given by rtn = 1 *" if xB = 0 (i.e., with probability 1 - pj aPn ^ i + "c if x^ = 1 fr6-' with Probability pn). LINEAR MODELS FOR SEQUENTIAL DEPENDENCE. It has been indi- cated earlier that the responses produced by learning models consist of stochastically dependent sequences, except for the case of experimenter- controlled events. Moreover, insofar as experimenter control is present, the sequence of responses will be dependent on the sequence of outcomes. The autocorrelation of responses and the correlation of responses with outcomes are interesting in themselves, whether in learning experiments or, for example, in trial-by-trial psychophysical experiments in which there is no over-all trend in response probability. Several models have arisen directly from hypotheses about repetition or alternation tendencies that perturb the learning process and produce a degree or kind of response- response or response-outcome dependence that is unexpected on the basis of other learning models. The example to be mentioned is neither path- independent nor commutative. The one trial perse veration model (Sternberg, 1959a,b) is suggested by the following observation: in certain two-choice experiments with sym- metric responses the probability of a particular response is greater on a trial after it occurs and is rewarded than on a trial after the alternative response occurs and is not rewarded. There are several possible expla- nations. 
One is that reward has an immediate and lasting effect on pn that is greater than the effect of nonreward. This hypothesis attributes the observed effect to a differential influence of outcomes in the cumulative learning process. One of the models already discussed could be used to describe this mechanism: for example, the model given by Eq. 8 (Sec. 2.4) with <*! < oc2. A second hypothesis is that the two outcomes are equally effective (i.e., they are symmetric) but that there is a short-term one-trial tendency to repeat the response just made. This hypothesis, when applied to an experi- ment with 100:0 reward14 leads to the one-trial perseveration model. Without the repetition tendency, the assumption of outcome symmetry leads to a model with experimenter-controlled events of the kind that was applied in Sec. 2.4 to the prediction experiment. The 100:0 reward 14 The term "TTV.^ reward" describes a two-choice experiment in which one choice is rewarded with probability TT-J. and the other with probability 7r2- STOCHASTIC LEARNING THEORY J T schedule implies that yn = 0 on all trials and therefore that the same oper- ator, Qs in Eq. 12, is applied on every trial. Equation 14 shows the explicit formula to be This single-operator model was discussed by Bush and Sternberg (1959). It may also be regarded as a special case of the subject-controlled model used for the shuttlebox (Eq. 8) with at = a2 = a. In developing the perseveration model, the single-operator model is taken to represent the "underlying" learning process. Define xn so that Pr {xn = 1} = P, and Pr {xn = 0} = 1 - pn. We note that the strongest possible tendency to repeat the previous response can be described by the model pw = xn_x. This is the effect that perturbs the learning process. To combine the underlying and perturbing processes, we take a weighted combination of the two, with nonnegative weights 1 — /3 and 0. This gives the explicit formula15 for the subject-controlled model: pn = F(n, xn_J = (1 - floc^i + /?xn_1? (n > 2). (29) Knowledge of the trial number and of only the last response is needed to determine the value of pn. The two possible values that pn can have on a particular trial differ by the (constant) value of /3; pn takes on the higher of the two values when x^ = 1 and the lower when xn_x = 0. The extent to which the learning process is perturbed by the repetition tend- ency is greater with larger /?. That the model is path-dependent is shown by the form of its recursive expression: Pn+i =/(P,; *„, *n-i) = "Pn + j8xn - a/Sxn_l9 (/i > 2). (30) Knowledge of the values of pw and xn alone is insufficient to specify the value of pn+1. For this model, in contrast to most others, more past history is needed in order to specify pn by the recursive form than by the explicit form. Data from a two-armed bandit experiment have been fruitfully analyzed with the perseveration model. The development of the perseveration model illustrates a technique that is of general applicability and is occasionally of interest. A tendency to alternate responses may be represented by a similar device. Linear equa- tions may also be used to represent positive or negative correlation between outcome and subsequent response — for example, a tendency in a 16 A trivial modification in this expression is made by Sternberg (1959a) based on considerations about starting values. AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION j?J prediction experiment to avoid predicting the event that most recently occurred. LOGISTIC MODELS. 
The most common approach to the construction of models begins with an expression for trial-to-trial probability changes, an expression that seems plausible and that may be buttressed by more general assumptions. An alternative approach is to consider what features of the entire event sequence might affect pn and to postulate a plausible expression for this dependence in terms of an explicit formula. The second approach is exemplified by the perseveration model and also by a suggestion by Cox based on his work on the regression analysis of binary sequences (1958). In many of the models we have considered the problem arises of con- taining pn in the unit interval, and it is solved by restrictions on the parameter values, restrictions that are occasionally complicated and interdependent. The problem is that although pn lies in the unit interval the variables on which it may depend, such as total errors or the difference between the number of left-light and right-light onsets, may assume arbitrarily large positive or negative values. The probability itself, there- fore, cannot depend linearly on these variables. If a linear relationship is desired, then what is needed is a transformation of pn that maps the unit interval into the real line.16 Such a transformation is given by logit p = log [p/(l — p)]. Suppose that this quantity depends linearly on a variable, x, so that logit p = a + bx. Then the function that relates p to x is the logistic function that we have already encountered in Eq. 24 and is represented by P = • (31) 1 + exp [-(a + bx)] As was mentioned in Sec. 2.5, the logistic function is similar in form to the normal ogive and therefore it closely resembles Hull's relation between probability and reaction potential. One advantage of the logistic trans- formation is that no constraints on the parameters are necessary. A second advantage, to be discussed later, is that good estimates of the parameter values are readily obtained. Cox (1958) has observed that many studies utilizing stochastic learning models, . . . have led to formidable statistical problems of fitting and testing. When these studies aim at linking the observations to a neurophysiological mechanism, it is reasonable to take the best model practicable and to wrestle as vigorously 16 When the dependent variables are nonnegative, the unit interval needs to be mapped only into the positive reals. This can be achieved, for example, by arranging that the transformations /r1 or log (p~l) depend linearly on the variables, as illustrated by Eq. 11 andEq. 25. 3S STOCHASTIC LEARNING THEORY as possible with the resulting statistical complications. If, however, the object is primarily the reduction of data to a manageable and revealing form, it seems fair to take for the probability of a success ... as simple an expression as possible that seems to be the right general shape and which is flexible enough to represent the various possible dependencies that one wants to examine. For this the logistic seems a good thing to consider. The desirable features of the logistic function carry over into its general- ized form, in which logit/? is a linear function of several variables. When these variables are given by the components of Wn (the cumulative number of times each of the events E19 E2, . . . , Et has occurred in the first n — 1 trials), then the logistic function is exactly equivalent to Luce's beta model, so that the same model is obtained from quite different considerations. 
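Since the logit transformation carries the whole argument of this subsection, a minimal sketch in Python (not part of the original chapter) may help the reader experiment with it. The function names and the values of a and b below are arbitrary illustrations of a logit-linear trial dependence of the kind Cox recommends; they are not estimates from any experiment.

```python
import math

def logit(p):
    """Map a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def logistic(x):
    """Inverse of the logit; Eq. 31 with a = 0 and b = 1."""
    return 1.0 / (1.0 + math.exp(-x))

# A logit-linear trial dependence: logit p_n = a + b*n keeps p_n inside the
# unit interval for every n, with no constraints on the parameters a and b.
a, b = -2.0, 0.25        # arbitrary illustrative values
for n in range(1, 11):
    p_n = logistic(a + b * n)
    print(f"trial {n:2d}: p = {p_n:.3f}")
```

Whatever variables are put on the right-hand side (trial number, cumulated errors, components of W_n), the resulting probability automatically respects the unit interval, which is the point of the transformation.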
An example of the logistic function generalized to two dependent variables is given by Eq. 20 for the shuttlebox experiment. A second example of a generalized logistic function, one that does not follow from Luce's axioms, is given by the analogue of the one-trial perseveration model (Eq. 29) in which logit pw = a + bn + cxn_^ or exp [-(a + bn 2.7 Event Effects and Their Invariance The magnitude of the effect of an event is usually represented by the value of a parameter that, ideally, depends only on constant features of the organism and of the apparatus and therefore does not change during the course of an experiment. There are some experiments or phases of experiments in which the ideal is clearly not achieved, at least not in the context of the available models and the customary definitions of events. Magazine training, detailed instructions, practice trials, and other types of pretraining are some of the devices used to overcome this difficulty. Probably few investigators believe that the condition is ever exactly met in actual experiments, but the principle of parameter invariance within an experiment is accepted as a working rule with the hope that it will at least approximate the truth. (It is also desirable that event effects be invariant from experiment to experiment; this principle provides one test of a j i \ model.) A careful distinction should be drawn between invariance of parameter values and equality of an event's effects in the course of an experiment. AXIOMATICS AND HEURISTICS OF MODEL CONSTRUCTION ft All of the models that have thus far been mentioned imply that the prob- ability change induced by an event varies systematically in the course of an experiment; different models specify different forms for this variation. It is therefore only in the context of a particular model that the question of parameter invariance makes sense. Insofar as changes in event effects are in accord with the model, parameters will appear invariant, and we would be inclined to favor the model. In most models, event effects, defined as probability differences, change because pn+l — pn depends on at least the value ofpn. This dependence arises in part from the already mentioned need to avoid using transition rules that may take pn outside the unit interval. But the simple fact that most learning curves (of probability, time, or speed versus trials) are not linear first gave rise to the idea that event effects change. Gulliksen (1934) reviewed the early mathematical work on the form of the learning curve, and he showed that most of it was based on one of two assumptions about changes in the effect of a trial event. Let t represent time or trials and let y represent a performance measure. Models of Type A begin with the assumption that the improvement in performance induced by an event is proportional to the amount of improvement still possible. The chemical analogy was the monomolecular reaction. This assumption led to a differential equation approximation, dyjdt = a(b — y) whose solution is the exponential growth function y = b — c exp (—at). Models of Type B begin with the assumption that the improvement in performance induced by an event is proportional to the product of the improvement still possible and the amount already achieved. The chemical analogy was the monomolecular autocatalytic reaction. The assumption led to a differential equation approximation, dy/dt = ay(b — y\ whose solution is a logistic function y = b/[l + c exp (—At)]. 
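To make the two classical assumptions concrete, here is a short numerical check (mine, not the chapter's) that the exponential and logistic growth curves do solve the Type A and Type B differential equations. The parameter values are arbitrary, and the constant A in the text's logistic solution corresponds to the product a*b in this parameterization.

```python
import math

def type_a(t, a, b, c):
    """Type A curve: y = b - c*exp(-a*t), the solution of dy/dt = a*(b - y)."""
    return b - c * math.exp(-a * t)

def type_b(t, a, b, c):
    """Type B curve: y = b/(1 + c*exp(-a*b*t)), the solution of dy/dt = a*y*(b - y).
    (The text writes the exponent as -A*t; here A corresponds to a*b.)"""
    return b / (1.0 + c * math.exp(-a * b * t))

# Crude check, by central differences, that each curve satisfies its equation.
a, b, c, h = 0.5, 1.0, 0.8, 1e-5          # arbitrary illustrative values
for t in (0.5, 2.0, 5.0):
    ya = type_a(t, a, b, c)
    dya = (type_a(t + h, a, b, c) - type_a(t - h, a, b, c)) / (2 * h)
    yb = type_b(t, a, b, c)
    dyb = (type_b(t + h, a, b, c) - type_b(t - h, a, b, c)) / (2 * h)
    print(f"t = {t}:  Type A residual {dya - a * (b - ya):+.2e},"
          f"  Type B residual {dyb - a * yb * (b - yb):+.2e}")
```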
The modern versions of these two models are, of course, Bush and Mosteller's linear-operator models and Luce's beta models. In the linear models the quantity pn+1 — pn is proportional to hk — pw the magnitude of the change still possible on repeated occurrences of the event Ek. If all events change pn in the same direction, let us say toward p = 0, then their effects are greatest when pn is large. In contrast, the effect of an event in the beta-model is smallest when pn is near zero and unity and greatest when/?n is near 0.5; these statements are true whether the event tends to increase or decrease pn. The sobering fact is that in more than forty years of study of learning curves and learning a decision has not been reached between these two fundamentally different conceptions. There is one exception to the rule that no model has the property that the increment or decrement induced inp by an event is a constant. In the middle range of probabilities the effects vary only slightly in many models, o# STOCHASTIC LEARNING THEORY and Mosteller (1955) has suggested an additive-increment model to serve as an approximation for this range. The transitions are of the form pn+1 = GfcPn = P« + <^> where dk is a small quantity that may be either positive or negative. This model is not only path-independent and commutative, it is also '^^-independent." The foregoing discussion is restricted to path-independent models. In other models the magnitude of the effect of an event depends on other variables in addition to the /rvalue. 2.8 Simplicity In model construction appeal is occasionally made to a criterion of simplicity. Because this criterion is always ambiguous and sometimes misleading, it must be viewed with caution : simplicity in one respect may carry with it complexity in another. The relevant attributes of the model are the form of its expressions and the number of variables they contain. Linear forms are thought to be simpler than nonlinear forms (and are approximations to them), which suggests that models with linear operators are simpler than those whose operators are nonlinear. Path-independent models have recursive expressions containing fewer variables than those in path-dependent models, and so they may be thought to be simpler. Classification becomes difficult, however, when other aspects of the models are considered, as a few examples will show. If we consider the explicit formula, our perspective changes. Com- mutativity is more fruitful of simplicity than path independence. A conflict arises when we find in the context of an urn (or additive response- strength) model that we can have one only at the price of losing the other (see Sec. 2.6). Also, in such a model even the path-independence criterion taken alone is somewhat ambiguous: one must choose between path independence of the numbers of balls added and path independence of probability changes. Among models with more than one event, the greatest reduction of the number of variables in the explicit formula is achieved by sacrificing both commutativity and path independence, as illustrated by the one-trial perseveration model (Sec. 2.6). To avoid complicated constraints on the parameters, it appears that nonlinear operators are needed. On the other hand, by using the compli- cated logistic function, we are assured of the existence of simple sufficient statistics for the parameters. These complications in the simplicity argument are unfortunate: they suggest that simplicity may be an elusive criterion by which to judge models. 
DETERMINISTIC AND CONTINUOUS APPROXIMATIONS 3$ 3. DETERMINISTIC AND CONTINUOUS APPROXIMATIONS The models we have been dealing with are, in a sense, doubly stochastic. Knowledge of starting conditions and parameters is not only insufficient to allow one to predict the future response sequence exactly, but, in general, it does not allow exact prediction of the future behavior of the underlying probability or response-strength variable. Even if all subjects, identical in initial probability and other parameter values, behave exactly in accordance with the model, the population is characterized by a distri- bution of /^-values on every trial after the first. For a single subject both the sequence of responses and the sequence of ^-values are governed by probability laws. The variability of behavior in most learning experiments is undeniable, and probably few investigators have ever hoped to develop a mathe- matical representation that would describe response sequences exactly. Early students of the learning curve, such as Thurstone (1930), acknowl- edged behavioral variability in the stochastic basis of their models. This basis is obscured by the deterministic learning curve equations which they derived, but these investigators realized that the curves could apply only to the average behavior of a large number of subjects. Stimulus-response theorists, such as Hull, have dealt somewhat differently with the variability problem. In such theories the course of change of the underlying response- strength variable (effective reaction potential) is governed deterministically by the starting values and parameters. Variability is introduced through a randomly fluctuating error term which, in combination with the under- lying variable, governs behavior. Although the stochastic aspect of the learning process has therefore usually been acknowledged, it is only in the developments of the last decade or so that its full implications have been investigated and that probability laws have been thought to apply to the aftereffects of a trial as well as to the performed response.17 A second distinguishing feature of recent work is its exact treatment of the discrete character of many learning experiments. This renders the models consistent with the trial-to-trial changes of which learning experi- ments consist. In early work the discrete trials variable was replaced by a 17 This change parallels comparable developments in the mathematical study of epidem- ics and population growth. For discussions of deterministic and stochastic treatments of these phenomena, see Bailey (1955) on epidemics and Kendall (1949) on population growth. 40 STOCHASTIC LEARNING THEORY continuous time variable, and the change from one trial to the next was averaged over a unit change in time. The difference equations, repre- senting a discrete process, were thereby approximated by differential equations. The differential equation approximations mentioned in Sec. 2.7 are examples. Roughly speaking, then, a good deal of early work can be thought of as dealing in an approximate way with processes that have been treated more exactly in recent years. Usually the exact treatment is more difficult, and modern investigators are sometimes forced to make continuous or deter- ministic approximations of a discrete stochastic process. 
Occasionally these approximations lead to expressions for the average learning curve, for example, that agree exactly with the stochastic process mean obtained by more difficult methods, but sometimes the approximations are considerably in error. In general, the stochastic treatment of a model allows a greater richness of implications to be drawn from it. It is probably a mistake to think of deterministic and stochastic treat- ments of a stochastic model as dichotomous. Deterministic approximations can be made at various stages in the analysis of a model by assuming that the probability distribution of some quantity is concentrated at its mean. A few examples will illustrate ways in which approximations can be made and may also help to clarify the stochastic-deterministic distinction. 3.1 Approximations for an Urn Model In Sec. 2.6 I considered a special case of the general urn scheme, one that has been applied to the shuttlebox experiment. A few approxima- tions will be demonstrated that are in the spirit of Thurstone's (1930) work with urn schemes. Red balls, whose number, r, is constant, are associated with escape; white balls, whose number, wn, increases, are associated with avoidance. Pr {escape on trial n} = pn = r/(r + wj. An avoidance trial increases vrn by an amount b; an escape trial results in an increase of c balls. Therefore, if Dk represents an operator that acts on ww, (iwn = wn + b with probability 1 — pn = W" = wn + c with probability pn = Consider a large population of organisms that behave in accordance with the model and have common values of r, wi9 b, and c. On the first trial all subjects have the same probability px = p: of escape. Some will escape and the rest will avoid. If b ^ c, there will be two subsets on the DETERMINISTIC AND CONTINUOUS APPROXIMATIONS 41 second trial, one for which w2 = v^ + b and another for which w2 = wx + c. Each of these subsets will divide again on the second trial, but because of commutativity there will be three, not four, distinct values of w3: W-L + 2b, W-L + b + c, and H^ + 2c. Each distinct value of wn corresponds to a distinct /7-value. On every trial after the first there is a distribution of p- values. With two events there are, in general, 2n~1 distinct p-values on the nth trial, each corresponding to a distinct sequence of events on the preceding n — 1 trials. If the events commute, as in this case, then the number of distinct />-values is reduced to n, the trial number. Our problem for the urn model is to determine the mean probability of escape on the nth trial, the average being taken over the population. We let 1 < v < n be the index for the n subsets with distinct ^-values on trial n. Let Pvn be the proportion of subjects in the vth subset on trial n and letpvn be the/»-value for this subset. Then the mth raw moment of the distribution on trial n is defined by Vmtn = E(ffl=2p?nPvn. (34) V We use this definition later. Because wn, but not pw, is transformed linearly, it is convenient to determine J5(w J = wn first. The increment in numbers of white balls, Awn = wn+1 — wB9 is either b or c, and its conditional expectation, con- ditional on the value of wn, is given by (35) where Eb denotes the operation of averaging over the binomial distribution of the increment. The unconditional expectation of the increment is obtained by averaging Eq. 35 over the distribution of wn-values. Using the expectation operator Ew to represent this averaging process, we have E(AwJ = E^AWn | ww) = Ew(b^ + Cr\ . 
(36) V wn T- r I Note that the right-hand member of this expression is not in general expressible as a simple function of vvn. Now we perform two steps of deterministic approximation. First, we replace Aww, which has a binomial distribution, by its average value. From Eq. 35 the increment in ww (conditional on the value of wj can then be written . T fcwn + cr Aww c^ Awn = — - - . ^2 STOCHASTIC LEARNING THEORY Second, we act as if the distribution of wn is entirely concentrated at its mean value vPn. The expectation of the ratio in Eq. 36 is then the ratio itself, and we have , _ , A T _. bwn + cr X-QV Awn ~ Aww = — 2 . (38) Wn + r These two steps accomplish what Bush and Mosteller (1955) call the expected-operator approximation. In this method, the change in the distribu- tion of/7-values (or w-values) on a trial is represented as the mean /rvalue (or w-value) acted on by an "average" operator (that is, subject to an average change). Two approximations are involved: the first replaces a distribution of quantities by its mean value and the second replaces a distribution of changes by the mean change. The average operator D is revealed in this example if we rewrite Eq. 38 as and compare it to Eq. 33. The increments b and c are weighted by their approximate probabilities of being applied. In general, ( r }=- \r + Ew(wn)/ r But notice that our approximation, which assumes that every u^-value is equal to wn, leads to this simple relationship. The discrete stochastic process given by the urn scheme has surely been transformed by virtue of the approximations — but transformed into what? There are at least two interpretations. The first is that the approximate process defines a determined sequence of approximate probabilities for a subject. Like the original process the approximation is stochastic, but on only one "level": the response sequence is governed by probability laws but the response probability sequence is not. According to this interpreta- tion, the approximate model is not deterministic, but it is "more deter- ministic" than the original urn scheme. The second interpretation is that the approximate process defines a determined sequence of proportions of white balls, ww, for a population of subjects, and thereby defines a determined sequence of proportions of correct responses, that is, the mean learning curve. According to this interpretation the approximate model is deterministic and it applies only to groups of subjects. However we think of the approximate model, it is defined by means of a nonlinear difference equation for wn (Eq. 38). Solution of such equations is difficult and a continuous approximation is helpful. We assume that the trials variable n is continuous and that the growth of wn is gradual DETERMINISTIC AND CONTINUOUS APPROXIMATIONS 43 rather than step-by-step. The approximate difference equation given by Eq, 38 can thus itself be approximated by a differential equation: dw bw + cr dn w + r (39) Integration gives a relation between w and n and therefore between Vl>n and n. For the special case of b = c and w± = 0 the relation has the simple form w/c = « — l, giving ?i.« = — -7 — IT- (40) r + (n — l)c Equation 40 is an example of an approximation that is also an exact result. In this example it occurs for an uninteresting reason: equating the values of b and c transforms the urn scheme into a single-event model in which the approximating assumption, namely, that all p- values are con- centrated at their mean, is correct. 
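For readers who wish to see the expected-operator approximation of Sec. 3.1 at work, the following Python sketch is offered; it is not part of the original text. It simulates a population of stat-organisms obeying the urn scheme exactly and compares the resulting mean escape probability with the deterministic curve obtained by iterating the approximate difference equation (Eq. 38). The parameter values, subject count, and random seed are arbitrary choices of mine.

```python
import random

def simulate_urn(n_trials, r, w1, b, c, n_subjects=2000, seed=0):
    """Monte Carlo estimate of the mean escape probability V_{1,n} for the urn scheme:
    p_n = r/(r + w_n); avoidance (prob 1 - p_n) adds b white balls, escape adds c."""
    rng = random.Random(seed)
    mean_p = [0.0] * n_trials
    for _ in range(n_subjects):
        w = w1
        for n in range(n_trials):
            p = r / (r + w)
            mean_p[n] += p / n_subjects
            w += c if rng.random() < p else b   # escape with probability p, else avoidance
    return mean_p

def expected_operator_curve(n_trials, r, w1, b, c):
    """Deterministic approximation: iterate Eq. 38, treating w_n as concentrated at its mean."""
    w, curve = w1, []
    for _ in range(n_trials):
        curve.append(r / (r + w))
        w += (b * w + c * r) / (w + r)
    return curve

r, w1, b, c = 10.0, 1.0, 3.0, 1.0          # illustrative values, not fitted to any data
mc = simulate_urn(25, r, w1, b, c)
approx = expected_operator_curve(25, r, w1, b, c)
for n in (0, 4, 9, 24):
    print(f"trial {n+1:2d}: Monte Carlo {mc[n]:.3f}   expected-operator {approx[n]:.3f}")
```

Setting b = c in this sketch reproduces the single-event case in which the approximation is exact (Eq. 40); with b and c unequal the two curves separate slightly, which is the point of the discussion above.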
3.2 More on the Expected-Operator Approximation

The expected-operator approximation is important because results obtained by this method are, unfortunately, the only ones known for certain models. Because the approximation also generates a more deterministic model, it is discussed here rather than in Sec. 5 on methods for the analysis of models.

Suppose that a model is characterized by a set {Q_k} of operators on response probabilities, where Q_k is applied on trial n with probability P_k = P_k(p_n). This discussion is confined to path-independent models, and therefore P_k can be thought of as a function of at most the p-value on the trial in question and fixed parameters. Because the p-value on a trial may have a distribution over subjects, the probability P_k may also have a probability distribution. The expected operator \bar{Q} is defined by the conditional expectation

\bar{Q}p_n = E_k(Q_k p_n \mid p_n) = \sum_k P_k(p_n)\, Q_k p_n.   (41)

The expectation operator E_k represents the operation of averaging over values of k. The first deterministic approximation in the expected-operator method is the assumption that the same operator, the expected operator, is applied on every trial. Therefore p_{n+1} \cong \bar{Q}p_n for all n. What is of interest is the average probability on the (n + 1)st trial. This can be obtained by removing the condition on the expectation in Eq. 41 by averaging again, this time over the distribution of p_n-values. Symbolizing this averaging process by E_p, we have

V_{1,n+1} \cong E_p(\bar{Q}p_n).

The second approximation is to replace E_p(\bar{Q}p_n) by \bar{Q}[E_p(p_n)] = \bar{Q}(V_{1,n}). This approximation is equivalent to the (deterministic) assumption that the p_n-distribution is concentrated at its mean. In cases in which \bar{Q}p is linear in p, however, E_p[\bar{Q}p] = \bar{Q}[E_p(p)] is exact and therefore no assumption is needed. (In nonlinear models the method does not seem to give exact results. Of course, it is just for these models that exact methods are difficult to apply.) Applying the second approximation to Eq. 41, we get

V_{1,n+1} \cong \bar{Q}V_{1,n} = \sum_k P_k(V_{1,n})\, Q_k V_{1,n}   (42)

as an approximate recursive formula for the mean of the p-value distribution on the nth trial.

EXPECTED OPERATOR FOR TWO EXPERIMENTER-CONTROLLED EVENTS. Consider the model in Sec. 2.4 for the prediction experiment. The operators and the rules for their application are given by Eq. 12:

Q_1 p_n = \alpha p_n + 1 - \alpha   if y_n = 1
Q_2 p_n = \alpha p_n                if y_n = 0.

Recall that the y-sequence, and thus the sequence of operators, can be predetermined. Therefore the probability p_n is known exactly from Eq. 14. Often the event sequence is generated by a random device, and, as mentioned in Sec. 2.4, an approximation for p_n can be developed by assuming that the subject experiences the average of all the sequences that the random device generates. Because the subject has, in fact, experienced one particular event sequence, the approximation may be a poor one. An alternative interpretation of the approximation is that, like many deterministic models, it applies to the average behavior of a large group of subjects. This interpretation is reasonable only if the event sequences are independently generated for each of a large number of subjects. In many experiments to which the approximation has been applied this proviso has unfortunately not been met.

Suppose that the {y_n} are independent and that Pr{y_n = 1} = \pi and Pr{y_n = 0} = 1 - \pi. Then P_1(p) = \pi and P_2(p) = 1 - \pi are independent of the p-value, as is always the case with experimenter control. Equation 42 gives the recursive relation

V_{1,n+1} = \alpha V_{1,n} + (1 - \alpha)\pi
(43) Unlike Eq. 38 for the urn model, Eq. 43 is easily solved and a continuous approximation is not necessary. The solution has already been given in DETERMINISTIC AND CONTINUOUS APPROXIMATIONS 4J Sec. 2.4 for the repeated application of the same linear operator. The approximate learning curve equation is PI,» = ^J7!,! + (1 - ct"-V = 77 - 0^(77 - Fif-J. (44) In this example the result of the approximation is exact in a certain sense: if we average the explicit equation for the model, Eq. 14, over the binomial event distributions, then the result obtained for V1>n will be the same as that given by Eq. 44. EXPECTED OPERATOR AND THE ASYMPTOTIC PROBABILITY FOR EXPERIMENTER-SUBJECT EVENTS. If one is reluctant to assume for the prediction experiment that reward and nonreward have identical effects, then changes in response probability may depend on the response per- formed, and the model events are under experimenter-subject control. Let us assume that the outcomes are independent and that Pr {O3- = Oj = 77, Pr {Oy = <92} =1—77. If we assume response-symmetry and outcome-symmetry and use the symbols given in Table 1, the appropriate Bush-Mosteller model is given as follows : Event Operator, Qk Pfc(p J ^1» Ol 2lPn == alPn ~T~ 1 — &l pn77 ^2, Ox g2Pn = a2pn + 1 — a2 (1 — pn)77 (45) ^ O% Q$n == a^^ (1 - pj(l - 77) The expected operator approximation (Eq. 42) gives - [77 + (ax — a2)(l — 277)]Flj7i + 0*i - oc2)(l - 277) Ff n. (46) This quadratic difference equation is difficult to solve, and Bush and Mosteller approximate it by a differential equation which can then be integrated to give an approximate value for F1)W. The continuous approximation is not necessary if we confine our attention to the asymptotic behavior of the process. At the asymptote the moments of the /?-value distribution are no longer subject to change, and therefore Fiftl+i = VI>n = FljC0. Using these substitutions in Eq. 46, we get a quadratic equation whose solution is v _, - y) - 1 + 2(^ - I)2 ^ ~ 2(2. - 1X1 - y) where y = (1 - a2)/(l - ax) ^ 1 (Bush & Mosteller, 1955, p. 289). For the model defined by Eq. 45 no expression for the asymptotic proportion 46 STOCHASTIC LEARNING THEORY of A! responses is known other than the approximation of Eq. 47. There is little evidence concerning its accuracy. Our ignorance about this model is especially unfortunate because of the considerable recent interest in asymptotic behavior in the prediction experiment. Several conclusions about the use of the expected- operator approxi- mation are illustrated in Figs. 1 and 2. Each figure shows average pro- portions of A! responses for 20 artificial subjects behaving in accordance with the model of Eq. 45. All 20 subjects in each group experienced the same event sequence. For both sets of subjects, o^ = 0.90, OLZ = 0.95, and pi = Vlfl = 0.50. Reward, then, had twice the effect of nonreward. For the subjects of Fig. 1, TT = 0.9; for those of Fig. 2, 77 = 0.6. In both examples the expected operator estimate for the asymptote (Eq. 47) seems too high. Also shown in each figure are exact and approximate (Eq. 44) Fig. L a. The jagged solid line gives the mean proportion of Al responses of 20 stat-organisms behaving in accordance with the four-event model with experimenter-subject control (Eq. 45) with ax = 0.90, a2 = 0.95, pl = 0.50 and •JT = 0.90. b. The horizontal line gives the expected-operator approximation of the four-event model asymptote (Eq. 47). c. The smooth curve gives the ap- proximate learning curve (Eq. 
44) for the two-event model with experimenter control (Eq. 12) with a = 0.8915 estimated from stat-organism "data," and pl = 0.50. d. The dotted line gives the exact learning curve (Eq. 14) for the two-event model. DETERMINISTIC AND CONTINUOUS APPROXIMATIONS 1.0 Fig. 2. a. The jagged solid line gives the mean proportion of Al responses of 20 stat-organisms behaving in accordance with the four-event model with experimenter-subject control (Eq. 45) with ax = 0.90, a2 = 0.95, />1 = 0.50, and?r = 0.60. b. The horizontal line gives the expected-operator approximation of the four-event model asymptote (Eq. 47). c. The smooth curve gives the approximate learning curve (Eq. 44) for the two-event model with experimenter- control (Eq. 12) with a = 0.8960 estimated from stat-organism "data," and pi — 0.50. d. The dotted line gives the exact learning curve (Eq. 14) for the two-event model. learning curves of the model with two experimenter-controlled events, which has been fitted to the data. The superiority of the exact curve is evident. I shall discuss later the interesting fact that even though the data were generated by a model (Eq. 45) in which reward had more effect than nonreward a model that assumes equal effects (Eq. 12) produces learning curves that are in "good" agreement with the data. 3.3 Deterministic Approximations for a Model of Operant Conditioning Examples of approximations that transform a discrete stochastic model into a continuous and completely deterministic model are to be found in STOCHASTIC LEARNING THEORY treatments of operant conditioning (Estes, 1950, 1959; Bush & Mosteller, 1951). To demonstrate the flavor of these treatments and the approxi- mations used, a sketch of a model along the lines of Estes' is given. In applying a choice-experiment analysis to a free-operant situation, each interresponse period is thought of as a sequence of short intervals of con- stant length h. It is these intervals that are identified as "trials." During each interval the subject chooses either to press (A) or not to press (A2) the lever; pressing occurs with some probability and is rewarded. The probability is assumed to be unchanged by trials (intervals) on which A2 occurs and increased by trials (intervals) on which A± occurs. The problem is to describe the resulting sequence of interresponse times. To define the model completely it is necessary to consider the way in which Pr {A]} is increased by reward. We do this in the context of an urn scheme. An urn contains x white balls (which correspond to A^ and b — x red balls (which correspond to A^. At the beginning of a trial a sample of balls is randomly selected from the urn. Each ball has the same fixed prob- ability of being included in the sample, which is of size s. The proportion of white balls in the sample defines the probability p = Pr {A^} for the trial in question. At the end of an interval in which A% occurs the sample of balls is retained and used to define p for the interval that follows. At the end of an interval in which A^ occurs all the red balls in the sample [there are s(l — p) of them] are replaced by white balls and the sample is returned to the urn. The number of trials (intervals of length h) from one press to the next, including the one on which the lever is pressed, is m. The interresponse time thus defined is r = m/2. The deterministic approximations are as follows : 1. The sample size s is binomially distributed. It is replaced by its mean s. 2. 
Conditional on the value of s, the proportion of white balls p is binomially distributed. It is replaced by its mean x/b.
3. Conditional on the value of p, the number of intervals m is distributed geometrically, with Pr {m = m} = p(1 - p)^{m-1}. It is replaced by its mean, 1/p. By combining the other approximations with this one, we can approximate the number of intervals in the interresponse period by m \cong b/x, and therefore the interresponse time is approximated by \tau \cong hb/x.
4. Finally, s(1 - p), which is the increase in x (the number of white balls in the urn) that results from reward, is replaced by the product of the means of s and 1 - p, and becomes \bar{s}(b - x)/b.

The result of this series of approximations is a deterministic process. Given a starting value of x/b and a value for the mean sample size \bar{s}, the approximate model generates a determined sequence of increasing values of x and of decreasing latencies. The final approximation is a continuous one. The discrete variables x and \tau are considered to be continuous functions of time, x(t) and \tau(t), and n = n(t) is the cumulative number of lever presses. A first integration of an approximate differential equation gives the rate of lever pressing as a (continuous) function of time; integration of this rate gives n(t).

Little work has been done on this model or its variants in a stochastic form. We therefore have little knowledge as to which features of the stochastic process are obscured by the deterministic approximations. One feature that definitely is obscured depends on sampling fluctuations of the proportion of white balls p. When p has a high value, then the interresponse time will tend to be short and the increment in x small; when the value of p happens to be low, then the interresponse time will tend to be above its mean and the increment in x large. One consequence is that interresponse times constitute a dependent sequence such that the variance of the cumulated interresponse time will be less than the sum of the variances of its components.

4. CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS

A good deal of work has been devoted to the mathematical analysis of various model types, but less attention has been paid to the development of systematic criteria by which to characterize or compare models. What are the important features that distinguish one model from another? More pertinent, in what aspects or statistics of the data do we expect these features to be reflected? The need to answer these questions arises primarily in comparative and "baseline" applications of models to data. Comparative studies, in which we seek to determine which of several models is most appropriate for a set of data, require us to discover discriminating statistics of the data: these are statistics that are sensitive to the important differences among the models and that should therefore help us to select one of several models as best. Once a model is selected as superior, the statistician may be satisfied but the psychologist is not; the data presumably have certain properties that are responsible for the model's superiority, properties that the psychologist wants to know about. Finally, a model is occasionally used as a baseline against which data are compared in order to discover where the discrepancies lie.
Again, a study is incomplete if it leads simply to a list of agreeing and disagreeing statistics ; JO STOCHASTIC LEARNING THEORY what is needed as well is an interpretation of these results that suggests which of the model's features seem to characterize the data and which do not. Analysis of the distinctive features of model types and how they are reflected in properties of the data is useful in the discovery of discriminating statistics, in the interpretation of a model's superiority to others, and in the interpretation of points of agreement and disagreement between a model and data. Some of the important features of several models were indicated in passing as the models were introduced in Sec. 2. A few ex- amples of more systematic methods of comparison are given in this section. Where possible, they are illustrated by reference to one of the comparative studies that have been performed on the Solomon-Wynne shuttlebox data (Bush & Mosteller, 1959; Bush, Galanter, &Luce, 1959), on the Good- now two-armed bandit data (Sternberg, 1959b), and on some T-maze data (Galanter & Bush, 1959; Bush, Galanter, & Luce, 1959). 4.1 Comparison by Transformation of the Explicit Formula A comparable form of expression can be used for all of the path-inde- pendent commutative-operator models that were introduced in Sec. 2. By suitably defining new parameters in terms of the old, we can write the explicit formula for pw as a function of an expression that is linear in the components of Wn. (Recall that Ww is the vector whose t components give the cumulative number of occurrences of events El9 . . . ,Et prior to the Tith trial). Suppose Ww = (tn, sn), as in the shuttlebox experiment, where tn is the total number of avoidances and sn the total number of shocks before the nth trial. Let pw be the probability of error (nonavoid- ance) which decreases as $n and tn increase. Then for the linear-operator model (Eq. 11) we have pn = exp [-(a + btn + csj], and thus logP«=-(fl + *t« + cO. (48) For Luce's beta model (Eq. 20) and thus logit pw = -(a + 6tB + csn). (49) For the special case of the urn scheme (Eq. 26) CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS and thus — = a + 6tn + csn. Pn 51 (50) Finally, for Mosteller's additive-increment-approximation model (Sec. 2.7), applied to the two-event experiment, n = -(a + btn (51) For each model some transformation, g(p\ of the response probability is a linear function of tn and sn. The models differ only in the transfor- mations they specify. The behavior of models for which this type of expression is possible can be described by a simple nomogram. Examples for the first three models above are given in Fig. 3, in which the transformations = dg(p) + e is plotted for each model (d and e are constants). The units on the abscissa are arbitrary. To facilitate comparisons, the coefficients d and e are -5 -4 -: Fig. 3. Nomograms for three models. The linear-operator model (Eq. 48) is represented by log, p = -0.500* + 0.837. The beta model (Eq. 49) is represented by logit p = x. The urn scheme (Eq. 50) is represented by p-i = 1.214* 4- 2.667. Constants were so chosen that curves would agree at p = 0.25 and p = 0.75. The units on the abscissa are arbitrary. j2 STOCHASTIC LEARNING THEORY chosen so that values of x agree at p = 0.25 and/? — 0.75. The additive- increment model is represented by a straight line passing through the two common points. 
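The construction behind Fig. 3 can be reproduced numerically. The Python sketch below (not part of the original text) calibrates each transformation g(p) so that the curves agree at p = 0.25 and p = 0.75, as the figure legend specifies, and then reads off p for several positions along the abscissa. The particular anchor values of x, the function names, and the bisection routine are my own choices; only the transformations themselves come from Eqs. 48 to 50.

```python
import math

# Error-probability transformations of Eqs. 48-50: in each model some g(p) is
# linear in (t_n, s_n), so a subject's state can be placed on a common axis x.
transforms = {
    "linear operator, g(p) = log p":   lambda p: math.log(p),
    "beta model,      g(p) = logit p": lambda p: math.log(p / (1.0 - p)),
    "urn scheme,      g(p) = 1/p":     lambda p: 1.0 / p,
}

x_at_75, x_at_25 = -math.log(3.0), math.log(3.0)   # common x-values at p = 0.75 and p = 0.25

def calibrate(g):
    """Choose d, e in x = d*g(p) + e so the curve passes through both anchor points."""
    d = (x_at_25 - x_at_75) / (g(0.25) - g(0.75))
    e = x_at_75 - d * g(0.75)
    return d, e

def p_of_x(g, d, e, x):
    """Invert x = d*g(p) + e by bisection; for all three models x rises as p falls."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if d * g(mid) + e > x:
            lo = mid          # this p gives too large an x, so the true p is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

print("     x  " + "".join(f"{name.split(',')[0]:>18}" for name in transforms))
for x in (-1.0, 0.0, 1.0, 2.0, 4.0, 6.0):
    cells = []
    for name, g in transforms.items():
        d, e = calibrate(g)
        cells.append(f"{p_of_x(g, d, e, x):18.3f}")
    print(f"{x:6.1f}  " + "".join(cells))
```

The printed table exhibits the features discussed below: the curves nearly coincide in the middle range, while at large x the urn scheme keeps the error probability well above the other two models.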
The nomogram is interpreted as follows: a subject's state (the value of the linear expression a + btn + csn) is represented by a point on the abscissa, and his error probability is given by the corresponding point on the p-axis. The occurrence of an event corresponds to a displacement along the abscissa whose magnitude depends only on the event and not on the starting position. For this example all displacements are to the right and correspond to reductions in the error probability. An avoidance corresponds to a displacement b units to the right, and a shock to a dis- placement c units to the right. If events have equal effects, then we have a single-event model, and each trial corresponds to the same displacement. Although a displacement along the abscissa is a constant for a given event, the corresponding displacement on the probability axis depends on the slope of the curve at that point. Because the slope depends on the p-value (except for the additive-increment model), the probability change corresponding to an event depends on that value. Because the probability change induced by an event depends only on the /rvalue, this type of nomogram is limited to path-independent models. Its use is also limited to models in which the operators commute. For the additive-increment, path-independent urn, and beta models it can be used when there are more than two events. For these models events that increase/^ correspond to displacements to the left. A number of significant features of the models can be seen immediately from Fig. 3. First, the figure indicates that in the range of probabilities from 0.2 to 0.8 the additive-increment model approximates each of the other three models fairly well. This supports Mosteller's (1955) suggestion that estimates for the additive model based on data from this range be used to answer simple questions such as which of two events has the bigger effect. Caution should be exercised, however, in applying the model to data from a subsequence of trials that begins after trial 1. Even if we had reason to believe that all subjects have the same /?-value on trial 1, we would probably be unwilling to assume that the probabilities on the first trial of the subsequence are equal from subject to subject. Therefore the estimation method used should not require us to make this assumption. A second feature disclosed by Fig. 3 concerns the rate of learning (rate of change of p J at the early stages of learning. When p^ is near unity, events in the urn and linear operator models have their maximum effects. In contrast, the beta model requires that pn change slowly when it is near unity. If the error probability is to be reduced from its initial value to, let us say, 0.75 in a given number of early trials, then for the beta model CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS Jj? to accomplish this reduction it must start at a lower initial value than the other models or its early events must correspond to larger displacements along the z-axis than they do in the other models. Early events tend pre- dominantly to be errors, and therefore the second alternative corresponds to a large error-effect. A third feature has to do with the behavior of the models at low values of pn. The models differ in the rates at which the error probability ap- proaches zero. Especially notable is the urn model which, after a transla- tion, is of the form p = or1 (x > 1). 
Unlike the situation in the other models, the area under the curve of p versus x diverges, so that in an unlimited sequence of trials we expect an unlimited number of errors. (The expected number of trials between successive errors increases as the trial number increases but not rapidly enough to maintain the total number of errors at a finite value). The urn model, then, tends to produce more errors than the other models at late stages in learning. This analysis of the three models helps us to understand some of the results that have been obtained in applications to the Solomon- Wynne avoidance-learning data. The analyses have assumed common values over subjects for the initial probability and other parameters. The relevant results are as follows : 1. A "model-free" analysis, which makes only assumptions that are common to the three models, shows that an avoidance response leads to a greater reduction in the escape probability than an escape response. This analysis is described in Sec. 6.7. 2. The best available estimate of pl in this experiment is 0.997 and is based on 331 trials, mainly pretest trials (Bush & Mosteller, 1955). 3. The linear-operator model is in good agreement with the data in every way that they have been compared (Bush & Mosteller, 1959). The estimates are fa = 1.00, ^ (avoidance parameter) = 0.80, and a2 (escape parameter) = 0.92. According to the parameter values, for which approximately the same estimates are obtained by several methods (Bush & Mosteller, 1955), escape is less effective than avoidance in reducing the escape probability. 4. There is one large discrepancy between the urn model and the data. Twenty-five trials were examined for each of 30 subjects. The last escape response occurred, on the average, on trial 12. The comparable figure for the urn model is trial 20. As in the linear-operator model, the estimates suggest that escape is less potent than avoidance (Bush & Mosteller, 1959). 5. One set of estimates for the beta model is given by pl = 0.94, /?x (avoidance parameter) = 0.83, and /S2 (escape parameter) = 0.59 (Bush, Galanter, & Luce, 1959). With these estimates, the model differs from the 54 STOCHASTIC LEARNING THEORY data in several respects, notably producing underestimates of intersubject variances of several quantities, such as total number of escapes. As discussed later, this probably occurs because the relative effects of avoid- ance and escape trials are incorrectly represented by the model. 6. The approximate maximum-likelihood estimates for the beta model are given by pl = 0.86, ^ = 0.74, and /32 = 0.81. In contrast to the inference made by Bush, Galanter, and Luce, these estimates imply that escape is less effective than avoidance. The estimate of the initial probability of escape is lower than theirs, however. These results strongly favor the linear-operator model. The results of analysis of the data with the other models are intelligible in the light of our study of the nomogram. The first set of estimates for the beta model gives an absurdly low value ofp^ In addition, the relative magnitudes of escape and avoidance eifects are reversed. The second set of estimates, which avoids attributing to error trials an undue share of the learning, under- estimates/?! by an even greater amount. Apparently, if the beta model is required to account for other features of the data as well, it cannot describe the rapidity of learning on the early trials of the experiment. 
The major discrepancy between the urn model and data is in accord with the excep- tional behavior of that model at low p- values; the average trial of the last error is a discriminating statistic when the urn model is compared to others. Before we leave this type of analysis, it is instructive to consider Hull's model (1943) in the same way. This model was discussed briefly in Sec. 2.5. It is intended to describe the change in probability of reactions of the all-or-none type, such as conditioned eyelid responses and barpressing in a discrete-trial experiment. If we assume that incentive and drive conditions are constant from trial to trial, then the model involves the assumptions that (1) reaction potential (SER) is a growth function of the number of reinforcements and (2) reaction probability (q) is a (normal) ogival function of the difference between the reaction potential and its threshold (SL^ when this difference is positive; otherwise the prob- ability is zero. The assumptions can be stated formally as follows : L 8ER = M(\ - e~AN\ (A, M > 0). (52) The quantity N is defined to be the number of reinforcements, not the total number of trials. Unrewarded trials correspond to the application of an identity operator to SER. 2. (53) CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS 55 For uniformity and ease of expression the logistic function is substituted for the normal ogive. When these equations are combined, we obtain for the probability/? of not performing the response log(logit/? + bN}9 -co, (N>k) (N < k). (54) This is to be compared to Eq. 48 and Eq. 49. From Hull's figures a rough estimate of c = 5 is obtained. This allows us to construct a nomogram, again choosing constants so that the curve agrees with the other models at p = 0.25 and p = 0.75. The result is displayed in Fig. 4, with nomograms for the linear and beta models. The curve for Hull's model falls in between those of the other two. The exist- ence of a threshold makes the model difficult to handle mathematically, but, in contrast to the beta model, it allows learning to occur in a finite number of trials even with an initial error-probability of unity. -5 -4 - Fig. 4. Nomogram for Hullian model (Eq. 48) and two other models. The Hullian model is represented by loge (logit p + 5) = — 0.204* + 1.585. The nomograms for linear-operator and beta models are those presented in Fig. 3. STOCHASTIC LEARNING THEORY 4.2 Note on the Classification of Operators and Recursive Formulas The classification of operators and recursive formulas is probably more relevant to the choice of mathematical methods for the analysis of models than it is to the direct appreciation of their properties. (Two exceptions considered later are the implications of commutativity and of the relative magnitudes of the effects of rewarded and nonrewarded trials.) The classification described here is based on the arguments that appear in the recursive formula. Let us consider models with two subject-controlled events, in which xn = 1 if £2 occurs on trial n, xn = 0 if E: occurs on trial 72, and j?n = Pr {xn =1}. A rough classification is given by the following list : 1. pn+l = f(pn). Response-independent, path-independent. Example: single-operator model (Eq. 28). 2. pn+l =f(n,pn). Response-independent, quasi-independent of path. Example: urn scheme (Eq. 25) with equal event-effects. 
Classes 1 and 2 produce sequences of independent trials and, if there are no individual differences in initial probabilities and other parameters, they do not lead to distributions of jp-values. 3. pw+1 =/(pn;xJ. Response-dependent, path-independ- ent. Example: linear commutative operator model (Eq. 8). 4. pn+1 =/(TZ, pw; xn). Response-dependent, quasi-independ- ent of path. Example: general urn scheme (Eq. 25). 5. pw+i — /(pn; X7i> xn-i)- Path-dependent. Example: one-trial perseveration model (Eq. 30). 4.3 Implications of Commutativity for Responsiveness and Asymptotic Behavior In Sec. 2.2 I pointed out that in a model with commutative events there is no "forgetting": the effect of an event on pn is the same whether it occurred on trial 1 or on trial n — 1. The result is that models with com- mutative events tend to respond sluggishly to changes in the experiment. As an example to illustrate this phenomenon we take the prediction CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS 57 experiment and, for the moment, consider it as a case of experimenter- controlled events. We use the linear model (Eq. 12) to illustrate non- commutative events and the beta model (Eq. 21) to illustrate commutative events. The explicit formulas are revealing. The quantity dn is defined as before as the number of left-light outcomes less the number of right-light outcomes, cumulated through trial n — 1. The beta model is then rep- resented by Eq. 23 which is reproduced here: Pn 1 + In this model all trials with equal ^-values als° have equal /^-values. The response probability can be returned to its initial value simply by introducing a sequence of trials that brings dn back to its initial value of zero. The response of the model to successive reversals is illustrated in Fig. 5 with the event sequence E1ElE1E1E2EtEiEtE1E1Ei. Despite the fact that on the ninth trial, on which d9 is zero, the most recent outcomes have been right-light onsets, the probability of predicting the left light is no lower than it was initially. The behavior of the commutative model is in contrast to that of the linear model, whose explicit formula (Eq. 14) is reproduced here: Pn = OC^ft + (1 - GC) 5 OC"-1-'^, 3=1 The formula shows that when oc < 1 more recent events are weighted more heavily and that equal Jn-values do not in general imply equal probabilities. The response of this model to successive reversals is also illustrated in Fig. 5. Parameters were chosen so that the two models would agree on the first and fifth trials. This model is more responsive to the reversal than the beta model; not only does the curve of probability versus trials change more rapidly, but its direction of curvature is also altered by the reversal. At first glance the implications of commutativity for responsiveness of a model seem to suggest crucial experiments or discriminating statistics that would allow an easy selection to be made among models. The question is more complicated, however. The contrast shown in Fig. 5 is clear-cut only if we are willing to make the dubious assumption that events in the prediction experiment are experimenter-controlled. Matters become complicated if we allow reward and nonreward to have different effects. The relative effectiveness of reward and nonreward trials is then another factor that determines the responsiveness of a model. To show this, we shift attention to models with experimenter-subject control of events. 
To make the conditions extreme, we compare equal- parameter models (experimenter-control) with models in which the STOCHASTIC LEARNING THEORY 1.0 0.9 0.8 0.7 0.6 ^ 0.4 0.3 0.2 0.1 _n i r i i r 1 I 1 E1 J L_L I I I I I 1 I 6 7 8 9 10 I E2 E2 E2 E2 EI E\ Trial number and event 11 12 Fig. 5. Comparison of models with commutative and noncommutative experimenter-controlled events. The broken curve represents pn for the linear-operator model (Eq. 12) with a = 0.8 and p± — 0.5. The con- tinuous curve represents pn for the beta model with ft = 0.713 and p± = 0.5. Parameter values were selected so that curves would coincide at trials 1 and 5. identity operator is associated with either reward or nonreward. The results are shown in Figs. 6 and 7. Parameter values are chosen so that the models agree approximately on the value of Vli5. In Fig. 6 the "equal alpha" linear model (Eq. 12) with a = 0.76 is compared with the two models defined in (55) : Response Outcome Model with Identity Operator for Nonreward (a - 0.60) Model with Identity Operator for Reward (a = 0.40) A, 0, Pw+i = Pn + 1 - <* Pn+i = Pn Az Ol Prt+1 = Pn Pn+i = <*pn + 1 — a A: 0, Pn+1 = Pn Pn+i = apn A2 02 Pn+i = apn P*i+l = Pn' (55) In Fig. 7 the "equal beta" model (Eq. 21) with /3 = 0.68 is compared with the two models defined in (56) : CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS 59 Response Outcome Model with Identity Operator for Nonreward (0 = 0.50) Model with Identity Operator for Reward 05 = 0.30) /4-I Ot n — fan fan ^1 ^l Pn+1 " ^2 ^1 Pn-|-l = AI O2 pn+1 = (1 ~ Prc) + fan !)„ P«+l ~ Pn Pn-rl (1 -*)+/*, «+l ,0/" 1 - Pn) + P, (56) p-+1"g(l-pl)+pn p^1=p- 1.0 0.9 0.8 0.7 0.6 0.4 0.3 0.2- 0.1- n r n r I I I _L I 123456789 Oi Oi Oi Oi O2 O2 02 O2 Trial number and outcome Fig. 6. Responsiveness of the linear-operator model (Eq. 55) depends on the relative effectiveness of reward and nonreward. The solid curve represents Vl>n for the linear-operator model with equal reward and nonreward parameters (ax = a2 = 0.757, pl = 0.5). The broken curve represents P^ n for the linear-operator model with an identity operator for reward (ocx — 1.0, as = 0.4,^ = 0.5). The dotted curve represents KlfW for the linear-operator model with an identity operator for nonreward (ax = 0.6, a2 = 1.0, pi — 0.5). Parameter values were selected so that the models would agree approximately on the values of F"1}1 and Flf5. 6o STOCHASTIC LEARNING THEORY 1.0 0.9 0.8 0.7 0.6 "•0.5 0.4 0.3 0.2 0.1 T I I J 4 5 6 7 )i Oi 02 02 02 Trial number and outcome 02 Fig. 7. Responsiveness of the beta model (Eq. 56) depends on the relative effectiveness of reward and nonreward. The solid curve represents V^n for the beta model with equal reward and non- reward parameters (/?! = 02 = 0.68, pl = 0.5). The broken curve represents V1>n for the beta model with an identity operator for reward (/?! = 1.0, ftz = 0.3,^ = 0.5). The dotted curve represents F1>n for the beta model with an identity operator for nonreward (ft i = 0.5, j52 = 1.0, p! = 0.5). Parameter values were selected so that the models would agree approximately on the values of F1?1 and 71>5. Roughly the same pattern appears for both models. When nonreward is less effective than reward, the response to a change in the outcome sequence that leads to a higher probability of nonreward is sluggish. When reward is less effective than nonreward, the response is rapid. 
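A reader who wishes to reproduce the pattern of Fig. 6 can do so with a short Monte Carlo sketch of the kind shown below; it is my addition, not the chapter's. The parameter values follow the figure caption, but the function name, the number of stat-organisms, and the random seed are arbitrary, and the operator table of Eq. 55 is encoded in the single rule that the applied operator moves p toward 1 when O1 occurs and toward 0 when O2 occurs, on exactly those trials (rewarded, nonrewarded, or both) that the model treats as effective.

```python
import random

def mean_curve(outcomes, alpha, effective, p1=0.5, n_subjects=5000, seed=7):
    """Mean P(A1) trial by trial for a linear model with experimenter-subject control.
    `effective` names the trials on which the operator acts: 'both', 'reward only'
    (identity operator for nonreward), or 'nonreward only' (identity for reward)."""
    rng = random.Random(seed)
    means = [0.0] * (len(outcomes) + 1)
    for _ in range(n_subjects):
        p = p1
        means[0] += p / n_subjects
        for t, o in enumerate(outcomes, start=1):
            a1 = rng.random() < p                      # response A1 with probability p
            rewarded = (a1 and o == 1) or (not a1 and o == 2)
            if (effective == "both"
                    or (effective == "reward only" and rewarded)
                    or (effective == "nonreward only" and not rewarded)):
                p = alpha * p + (1.0 - alpha) * (1.0 if o == 1 else 0.0)
            means[t] += p / n_subjects
    return means

outcomes = [1, 1, 1, 1, 2, 2, 2, 2]                    # O1 on trials 1-4, O2 on trials 5-8
models = [
    ("equal parameters, alpha = 0.757",      0.757, "both"),
    ("identity for nonreward, alpha = 0.60", 0.60,  "reward only"),
    ("identity for reward, alpha = 0.40",    0.40,  "nonreward only"),
]
for name, alpha, effective in models:
    curve = mean_curve(outcomes, alpha, effective)
    print(f"{name:38s}" + "  ".join(f"{v:.2f}" for v in curve))
```

The output shows the sluggish response after the reversal when only rewarded trials are effective and the rapid response when only nonrewarded trials are effective.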
From these examples it appears that the influence on responsiveness of changing the relative effects of reward and nonreward is less marked in the commutative-oper- ator beta model than in the linear model. Responsiveness alone, then, is not useful in helping us to choose between the models. We must use it in conjunction with knowledge about the CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS 6l relative effects of reward and nonreward. This situation is typical in working with models: observation of a single aspect of the data is often insufficient to lead to a decision. If, in this example, we observe that subjects' behavior is highly responsive, this might imply that a model with commutative operators is inappropriate, but, alternatively, it might mean that the effect of nonreward is relatively great. Also, if we examined such data under the hypothesis that events in the prediction experiment are experimenter-controlled, then the increased rate of change of Vi>n after the reversal would probably lead us to conclude, perhaps in error, that a change in the value of a learning rate parameter had occurred. This example indicates how delicate are the conclusions one draws regarding event invariance (Sec. 2.7) and illustrates how the apparent failure of event invariance may signify that the wrong model has been applied. In a number of studies the prediction experiment has been analyzed by the experimenter-controlled event model of Eq. 12 (Estes & Straughan, 1954; Bush & Mosteller, 1955). This model also arises from Estes' stimulus sampling theory. One of the findings that has troubled model builders is that estimates of the learning-rate parameter a tend to vary sys- tematically from experiment to experiment as a function of the outcome probabilities. It is not known why this occurs, but the phenomenon has occasionally been interpreted as indicating that event effects are not invar- iant as desired. Another interpretation, which has not been investigated, is that because reward and nonreward have different (but possibly invariant) effects the estimate of a single learning-rate parameter is, in effect, a weighted average of reward and nonreward parameters. Variation of outcome prob- abilities alters the relative number of reward-trials and thus influences the weights given to reward and nonreward effects in the over-all estimate. The estimation method typically used depends on the responsiveness of the model, which, as we have seen, depends on the extent to which rewarded trials predominate. 4.4 Commutativity and the Asymptote in Prediction Experiments One result of two-choice prediction experiments that has interested many investigators is that when Pr (yn = 1} = TrandPr {yn = 0} = 1 — TT then for some experimental conditions the asymptotic mean probability F"lj00 with which human subjects predict yw = 1 appears to "match" 6s STOCHASTIC LEARNING THEORY Pr {yn = 1}, that is, Klf00 ~ TT.IB (The artificial data in Figs. 1 and 2 illustrate this phenomenon.) The phenomenon raises the question of which model types or model families are capable of mimicking it. No answer even approaching completeness seems to have been proposed, but a little is known. Certain linear-operator models are included in the class, and we shall see that models with commutative events can be excluded, at least when the events are assumed to be experimenter-controlled and symmetric. 
[Feldman and Newell (1961) have defined a family of models more general than the Bush-Mosteller model that displays the matching phenomenon.] Figures 1 and 2 suggest that a linear-operator model with experimenter-subject control can approximate the matching effect. As already mentioned, an exact expression for the asymptotic mean of this model is not known. It is easy to demonstrate that the linear model with experimenter-controlled events can produce the effect exactly; indeed, a number of investigators have derived confidence in the adequacy of this particular model from the "probability matching" phenomenon (e.g., Estes, 1959; Bush & Mosteller, 1955, Chapter 13). The value of V_{1,∞} for the experimenter-controlled model when the {y_n} are independent binomial random variables can easily be determined from its explicit formula (Eq. 14), which is reproduced here:

p_n = α^{n−1} p_1 + (1 − α) Σ_{j=1}^{n−1} α^{n−1−j} y_j.

We take the expectation of both sides of this equation with respect to the independent binomial distributions of the {y_j}, making use of the fact that E(y_j) = π. Performing the summation, we obtain

V_{1,n} = α^{n−1} p_1 + (1 − α^{n−1}) π.    (57)

Note that this is the same result given by the expected-operator approximation in Eq. 44. The final result, V_{1,∞} = π, is obtained by letting n → ∞.

As an example of a model with experimenter-controlled events that cannot produce the effect, we consider Luce's model (Eq. 23), with 0 < β < 1. We restrict our attention to experiments in which π ≠ ½; without loss of generality we can restrict it further to π > ½. Because p_n is governed entirely by the value of d_n, the excess of the number of E1-events over the number of E2-events among the first n − 1 trials, we must concern ourselves with the behavior of d_n. Roughly speaking, because the number of left-light outcomes (E1) increases faster than the number of right-light outcomes (E2), the difference between their numbers increases, and with an unlimited number of trials this difference d_n becomes indefinitely large. More precisely, we note that E(d_n) = (n − 1)(2π − 1) and that therefore E(d_n) → ∞ as n → ∞. From the law of large numbers (Feller, 1957, Chapter X) we conclude that with probability one d_n → ∞ as n → ∞. Using Eq. 23, it follows that for this model p_n → 1 when π > 0.50.

The asymptotic properties of other examples of the beta model for the prediction experiment have been studied by Luce (1959) and Lamperti and Suppes (1960). They find that there are special conditions, determined by the values of π and model parameters, under which V_{1,n} = E(p_n) does not approach a limiting value of either zero or unity. Therefore it should not be inferred from the foregoing example that the beta model is incapable of producing probability matching. (In view of the fact that the phenomenon does not occur regularly or in all species, one might consider a model that invariably produces it to be more suspect than one in which its occurrence depends on conditions or parameter values.)
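The contrast between Eq. 57 and the behavior of the commutative model is easy to verify numerically. The sketch below is illustrative only; it iterates the linear model p_{n+1} = αp_n + (1 − α)y_n and a commutative beta-type model in which p_n depends only on d_n. In the sketch the response-strength ratio is multiplied by a constant b > 1 for each E1 and divided by b for each E2; the chapter's Eq. 23 parametrizes the same idea with a constant between 0 and 1. All parameter values are arbitrary choices for the demonstration.

```python
import random

def simulate(pi=0.7, alpha=0.9, b=2.0, p1=0.5, n_trials=500, n_subjects=300, seed=1):
    """Mean Pr(A1) on the last trial for (a) the linear experimenter-controlled
    model and (b) a commutative beta-type model in which p_n depends only on d_n,
    the excess of E1-events over E2-events."""
    rng = random.Random(seed)
    lin_sum = beta_sum = 0.0
    for _ in range(n_subjects):
        p_lin, d = p1, 0
        for _ in range(n_trials - 1):
            y = 1 if rng.random() < pi else 0
            p_lin = alpha * p_lin + (1 - alpha) * y
            d += 1 if y else -1
        lin_sum += p_lin
        beta_sum += 1.0 / (1.0 + ((1 - p1) / p1) * b ** (-d))   # p = v/(1+v), v = (p1/(1-p1)) b^d
    return lin_sum / n_subjects, beta_sum / n_subjects

lin, bet = simulate()
print("linear model: ", round(lin, 3), "  (close to pi = 0.7)")
print("beta model:   ", round(bet, 3), "  (driven toward 1)")
```

The linear model's average probability stabilizes near π, while the commutative model is carried to the boundary, as the argument based on d_n requires.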
As an illustration of the present state of knowledge, we consider the beta model with experimenter-subject control for the prediction experiment. In this model, absorption at a limiting probability of zero or unity does not always occur. The outcomes are O1: y = 1 and O2: y = 0. The responses are A1: predict O1, and A2: predict O2. We assume that the pairs of events {A1O1, A2O2} and {A1O2, A2O1} are complementary. The transformations of the response-strength ratio v = v(1)/v(2) are therefore as follows:

Event     Transformation
A1O1      v → βv
A2O2      v → v/β
A2O1      v → β′v
A1O2      v → v/β′

The parameters β and β′ correspond to reward and nonreward, respectively; both are greater than one. For this model the results of Lamperti and Suppes (1960, Theorem 3) imply that the asymptotic value of p_n = Pr{A1 on trial n} is either zero or unity except when the following inequality is satisfied:

log β / log β′ < π/(1 − π) < log β′ / log β.

Luce has shown (1959, Chapter 4, Theorem 17) that when the inequality is satisfied the asymptotic value of V_{1,n} is given by

V_{1,∞} = [π (log β′/log β) − (1 − π)] / [(log β′/log β) − 1].

(It is interesting to note that the value of V_{1,∞} for the corresponding linear-operator model, given by Eq. 47, is known only approximately.) From these results several conclusions may be drawn. First, if β > β′, then this model always produces asymptotic absorption at zero or one; only if nonreward is more potent than reward (β′ > β) is a limiting average probability other than zero or one possible. Second, for a fixed pair of parameter values, β′ > β > 1, absorption at zero or one can be avoided, but only for a limited range of π-values. Third, when V_{1,∞} is between zero and one, it is equal to π only if π = ½; otherwise the asymptote is further from ½ than π is and in the same direction, with the magnitude of the "overshoot" or "undershoot" increasing linearly with |π − ½|.
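The condition and the limiting value can be checked by simulation. In the sketch below (illustrative code, with arbitrarily chosen parameter values that satisfy the inequality), each simulated subject follows the event transformations tabulated above, and the average probability after many trials is compared with the expression for V_{1,∞}.

```python
import math, random

def v1_asymptote_mc(pi=0.6, beta=1.2, beta_prime=1.8, v1=1.0,
                    n_trials=5000, n_subjects=300, seed=2):
    """Monte Carlo estimate of V_{1,n} at a large n for the beta model with
    experimenter-subject control: reward multiplies or divides the ratio v
    by beta, nonreward by beta_prime."""
    rng, total = random.Random(seed), 0.0
    for _ in range(n_subjects):
        log_v = math.log(v1)
        for _ in range(n_trials):
            p = 0.5 * (1.0 + math.tanh(log_v / 2.0))   # p = v/(1+v), written to avoid overflow
            a1 = rng.random() < p                      # response A1?
            o1 = rng.random() < pi                     # outcome O1?
            if a1 and o1:            log_v += math.log(beta)         # A1O1: reward of A1
            elif not a1 and not o1:  log_v -= math.log(beta)         # A2O2: reward of A2
            elif not a1 and o1:      log_v += math.log(beta_prime)   # A2O1: nonreward of A2
            else:                    log_v -= math.log(beta_prime)   # A1O2: nonreward of A1
        total += 0.5 * (1.0 + math.tanh(log_v / 2.0))
    return total / n_subjects

pi, r = 0.6, math.log(1.8) / math.log(1.2)             # r = log(beta')/log(beta)
print("condition:", round(1 / r, 3), "<", round(pi / (1 - pi), 3), "<", round(r, 3))
print("simulated:", round(v1_asymptote_mc(), 3),
      "  formula:", round((pi * r - (1 - pi)) / (r - 1), 3))
```

With π = 0.6 the formula gives an asymptote near 0.69, above π, which illustrates the "overshoot" noted in the third conclusion.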
In the experimenter-controlled-events example, which was first discussed, it is the commutativity of the beta model that is responsible for its asymptotic behavior. An informal argument shows that the same asymptotic behavior characterizes any model with two events that are complementary, commutative, and experimenter-controlled and in which repeated occurrence of a particular event leads to a limiting probability of zero or unity. In any such model the response probability returns to its initial value on any trial on which d_n = 0. Moreover, the probability on any trial is invariant under changes in the order of the events that precede that trial. Therefore the probability p_n after a mixed sequence composed of m E2's and (n − m − 1) E1's is the same as the value of p_n after a block of (n − m − 1) E1's preceded by a block of m E2's. Now let m(n) be an integral random variable whose value is the number of E2-events in n trials. Note that E[m(n)] = n(1 − π). For π > ½ we have already seen that as n increases we have n − m(n) − 1 > m(n) with probability one. Consider what happens when the order of the events is rearranged so that a block of all the E2's precedes a block containing all the E1's. On the mth trial of the second block the probability returns to its initial value. After this trial there are [n − 2m(n) − 1] E1-trials; but E[n − 2m(n) − 1] = n(2π − 1) − 1, which becomes indefinitely large. The behavior of the model is the same as if, starting at the initial probability, an indefinitely long sequence of E1-trials occurred. The limiting value of p_n is therefore unity. A similar result applies when the event effects have different magnitudes. Without further calculations we know that the urn scheme of Eq. 25 cannot mimic the matching effect.

I have discussed the asymptotic behavior of these models partly because it is of interest in itself but mainly to emphasize the strong implications of the commutativity property. As a final example of the absence of "forgetting," let us consider an experiment in which first a block of E1's occurs and then a series in which E1's and E2's occur independently with probability π = ½. In both the beta and linear models for complementary experimenter-controlled events the initial block of E1-events will increase the probability to some value, say p′. In the linear-operator model the mixed event series will reduce the probability from p′ toward p = ½. In the beta model, on the other hand, the mixed event series will cause the probability to fluctuate indefinitely about p′ with, on the average, no decrement. The last statement is true for any model whose events are complementary, experimenter-controlled, and commutative.

4.5 Analysis of the Explicit Formula¹⁹

In this section I consider some of the important features of explicit formulas for models with two subject-controlled events. These models are meant to apply to experiments such as the escape-avoidance shuttlebox and 100:0 prediction, bandit, and T-maze experiments. The event (response) on trial n is represented by the value of x_n, where x_n = 0 if the rewarded response is made and x_n = 1 if the nonrewarded response (an "error") is made. The probability p_n = Pr{x_n = 1} decreases over the sequence of trials toward a limiting value of p = 0.

EXAMPLES USED. The following models are used as examples:

Model I:    p_n = F(n) = α^{n−1} p_1,    (0 < α < 1);    (58)

Model II:   p_n = F(n, x_{n−1}) = α^{n−1} p_1 (1 − β) + β x_{n−1},    (0 < β < 1, n ≥ 2);    (59)

Model III:  p_n = F(n, x_{n−1}, x_{n−2}, ..., x_1) = α^{n−1} p_1 + β Σ_{j=1}^{n−1} α^{n−1−j} x_j,    (0 < α, β < 1);    (60)

Model IV:   p_n = F(n, s_n) = exp[−(a + bn + c s_n)],    (0 < a, b).    (61)

We have seen Models I, II, and IV before. Model I is the single-operator model of Eq. 28 (Bush & Sternberg, 1959) and is an example of the family of single-event models. Model II is the one-trial perseveration model of Eq. 29. (An analogous nonlinear model is given by the generalized logistic in Eq. 32.) Model IV is the Bush-Mosteller model of Eqs. 10 and 11, rewritten by using the fact that t_n = n − 1 − s_n. The quantity s_n = Σ_{j=1}^{n−1} x_j is the number of errors before trial n, and c = −log(α_2/α_1). If the effect of reward is greater than the effect of nonreward (α_1 < α_2), then c < 0 and more errors (larger s_n) imply a higher probability of error (larger p_n); if α_1 > α_2, then c > 0 and the converse holds. This model has been studied by Tatsuoka and Mosteller (1959). (Analogous beta and urn models are given by Eqs. 19 and 26.)

Almost all the models that have been applied to data involve either identity operators or operators with limit points of zero or unity. One exception is Model III, whose operators are given by

p_{n+1} = αp_n         if x_n = 0,
p_{n+1} = αp_n + β     if x_n = 1.

19. Much of this discussion is drawn from Sternberg (1959b).
Referred to as the "many-trial perseveration model," this model has been applied to two-armed bandit data by Sternberg (1959b). The explicit formula is similar in form to Eq. 14 for the linear model for two experi- menter-controlled events; more recent events are weighted more heavily. DIRECT RESPONSE EFFECTS. Consider first the direct effect of a response, x^, on the probability pn. By "direct effect" is meant the influence of Xj on the magnitude of pn when intervening responses x3-+1, . . . , xn_l are held fixed. Response x^ has a direct effect on pn if it appears as an argument of the explicit formula F. The effect is positive if x^ = 1 results in a larger value of pn than does x3 = 0; otherwise the effect of xy is negative. Models II and III show positive response effects, achieved by associating an additive constant with x3- in the explicit formula. In Model IV the direct effects can be positive or negative, depending on the sign of c. The effect is achieved by adding a constant to log pn when x, = 1 ; this is equivalent to applying a multiplicative constant to pn. When CLASSIFICATION AND THEORETICAL COMPARISON OF MODELS *J response effects occur in one of these models, they are all of the same sign; the direction of the effect of an event does not depend on when the event occurred. Let us confine our discussion to this type of model. If none of the x3- appears in F, then there are no response effects and the model is response-independent. This is a characteristic of all single-event models. Model I is an example. If any of the x,- appear in F, there are direct response effects. If only x«-i appears, then pw is directly affected only by the immediately preceding response, as in Model II. Because the pw for m > n are not affected by xw_! we say that the direct effect is erased as the process advances. If several, say k, of the x, appear in F, then the direct effect of a response continues for k — 1 trials and is then erased. [Audley and Jonckheere (1956) have considered a special case of their urn scheme that has this property.] If all the x, (; = n - 1, n - 2, . . . , 1) appear in F, the effect of a response is never erased and continues indefinitely. This last condition must hold for any response-dependent model that is also path-independent. Models III and IV are examples. When more than one x5- appears in F, we can ask two further questions concerned with the way in which the arguments x, appear in F. The first is whether there is damping of the continuing effects. We define the magnitude of the effect of x, on pn to be the change in the value of pn when the value of x, in F(n9 0, 0, 0, . . .) is increased from 0 to 1. When the magnitude of the effect of x, is smaller for earlier x,, then we say that direct response effects are damped. If the magnitudes are equal, then the effects are undamped. (Direct effects might also be augmented with trials; this could occur in a model in which the full effect of a response took more than one trial to appear. No such models have been studied, however. In what follows we assume that effects are either damped or undamped.) The second question we can ask, when two or more of the x, appear in F, is whether their effects accumulate. If so, then the effect on pn when two of the x, are errors is greater than the effect when either one of them alone is an error. In all of the models mentioned in this chapter for which effects continue they also accumulate. 
If a model exhibits damped response effects, the cumulative number of errors alone is not sufficient to tell us the value of pn; we must also know on which trials the errors occurred. Therefore, the events in a model with damped effects cannot commute; and, conversely, if a model with com- mutative events shows response effects, then these effects cannot be damped. Models III and IV provide examples of the foregoing statement and its converse. In Model III the response effects are damped and events do not commute; in Model IV, a commutative event model, response effects are undamped. These two models are analogous to the linear and beta models 68 STOCHASTIC LEARNING THEORY for experimenter-controlled events (Sec. 4.3). In the linear model outcome effects are damped (there is "forgetting") and events do not commute; in the beta model there is no damping and we have commutativity. By means of these ideas models can be roughly ordered in terms of the extent to which direct response effects occur. First is the response-inde- pendent, single-event model in which there is no effect at all (Model I). Then we have a model in which an effect occurs but is erased (Model II). Next is a model in which the effect continues but is damped (Model III); and finally we have a model with an undamped, continuing effect (Model IV). INDIRECT RESPONSE EFFECTS. One of the most important properties of models with subject-control of events is the fact that the responses in a sequence are not independent. This is the property that causes subjects' response probabilities to differ even when they have common parameter values and are run under identical reinforcement schedules. One result is that, in contrast to models in which only the experimenter controls events, we must deal with distributions rather than single values of the response probabilities. A second implication is that events (responses) have indirect as well as direct effects on future responses, effects that are transmitted by the intervening trials. In contrast, experimenter-controlled events have only direct effects. Until now we have been considering only the direct effect of a response xy on pn. If j < n — 1, so that trials intervene between responses x,- and xn, there also may be indirect effects mediated by the intervening responses. For example, whether or not xn_2 has a direct effect on pw, it may have an indirect effect, mediated through its direct effect on pn_l and the relation of pn_i to the value of xn_lt Therefore, even if the direct effect of \3 on pn is erased, the response may influence the probability. Model II provides an example. In this model the p- value on a trial is determined uniquely by the trial number and the preceding response, so that, conditional on the value of xn_l5 xn is independent of all the xm, m < n — 1 . On the other hand, if the value of xn_x is not specified, then xn depends on any one of the xm, m < n — 1, that may be selected. Put another way, the condi- tional probability Pr {xn = 1 | #n_i} is uniquely determined, whatever the xl5 x2, . . . , xn_2 sequence is. But, given any trial at all before the wth, the unconditional ("absolute") probability Pr{xu = 1} depends on the response that is made on that trial. A more familiar example is the one- step Markov chain, in which the higher-order conditional probabilities are not the same as the corresponding "absolute probabilities" (Feller, 1957), despite the fact that the direct effects extend only over a single trial. 
Because x_{n−1} has no effect at all on p_n in a response-independent model, it cannot have any indirect effect on p_m (m > n). The total effect of a response x_j on the probability p_n can be represented by the difference between two conditional probabilities:²⁰

Pr{x_n = 1 | x_j = 1} − Pr{x_n = 1 | x_j = 0}.

When direct effects are positive, the total effect of x_j on p_n cannot be less than its direct effect alone. The extent to which the total effect is greater depends in part on whether there is accumulation of the direct effects of x_j and the intervening responses and in part on whether and how effects are damped. When direct effects are negative, the situation is more complicated, and the relation between total and direct effects depends on whether the number of intervening trials is even or odd as well as on accumulation and damping.

SUBJECT-CONTROLLED EVENTS AS A PROCESS WITH FEEDBACK. In most of the foregoing discussion we have been considering the effects of responses on probabilities. The altered probabilities influence their associated responses, and these responses in turn have effects on future probabilities. Thus the effects we have been considering "feed back" the "output" of the p-sequence so as to influence that sequence. Insofar as two response sequences have different p-values on some trial, the nature of the response effects determines whether this probability difference will be enhanced, maintained, reduced, or reversed in sign on the next trial.

Each of the p-sequences produced by a model is an individual learning curve, and the "area" under this curve represents the expected number of errors associated with that sequence. In a large population of response sequences (subjects) the model specifies a proportion of the population that will be characterized by each of the possible individual learning curves. The mean learning curve is the average of these individual curves. If there are no response effects, there is, of course, only one individual curve. When response effects exist and are positive, we may speak of a positive feedback of probability differences and determine measures of its magnitude. With more positive feedback of p-differences, individual learning curves have a greater tendency to deviate from their mean curve as n increases. The negative feedback of p-differences, which may occur if response effects exist and are negative, may cause the opposite result: p-differences that arise among sequences may be neutralized or reversed in sign. Thus an individual curve that deviated from the mean curve would tend to return to it or to cross it and therefore to compensate for the deviation. A rough idea of the magnitude of the feedback can be obtained by comparing an assumed p-difference of Δp_n on trial n with the associated expected difference of Δp_{n+1} on the next trial.

Also relevant to the feedback question is the range of the p_n-values that a model can produce on a given trial. For example, in a model with positive response effects the maximum possible value of p_n is attained when all the responses have been errors and the minimum is attained when they have all been successes. For a model with negative effects the reverse holds. Therefore the p_n-range is given by the absolute value of F(n, 1, 1, 1, ...) − F(n, 0, 0, 0, ...).

20. For a binary-event sequence that is generated by a stationary stochastic process this expression gives the autocorrelation function with lag n − j.
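The explicit formulas of Eqs. 58 to 61 translate directly into code, and the p_n-range just defined is then |F(n, 1, 1, ..., 1) − F(n, 0, 0, ..., 0)|. The sketch below is an illustration only; the parameter values are arbitrary and are not estimates from any experiment. It exhibits the ordering described in the next paragraph: a zero range for Model I, a constant range for Model II, and ranges that grow over the trials shown for Models III and IV.

```python
import math

# Explicit formulas of Eqs. 58-61; x is the response sequence (x_1, ..., x_{n-1}).
def model_I(n, x, alpha=0.9, p1=1.0):
    return alpha ** (n - 1) * p1

def model_II(n, x, alpha=0.9, beta=0.3, p1=1.0):
    return p1 if n == 1 else alpha ** (n - 1) * p1 * (1 - beta) + beta * x[-1]

def model_III(n, x, alpha=0.9, beta=0.05, p1=1.0):
    return alpha ** (n - 1) * p1 + beta * sum(
        alpha ** (n - 1 - j) * xj for j, xj in enumerate(x, start=1))

def model_IV(n, x, a=0.1, b=0.2, c=-0.15):
    s_n = sum(x)                                  # errors before trial n
    return math.exp(-(a + b * n + c * s_n))

def p_range(model, n):
    """Range of attainable p_n: all-error history versus all-success history."""
    return abs(model(n, [1] * (n - 1)) - model(n, [0] * (n - 1)))

for name, m in [("I", model_I), ("II", model_II), ("III", model_III), ("IV", model_IV)]:
    print("Model", name, [round(p_range(m, n), 3) for n in (2, 5, 10)])
```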
Whatever the sign or magnitude of any feedback of probability differences that may occur, it cannot lead to p-differences larger than the p_n-range. The p_n-values corresponding to the extremes of the p_n-range produce the pair of individual learning curves that differ maximally in area. In general, the p_n-range imposes a limit on all the response effects discussed in this section. For Model I the p_n-range is zero. For Model II it is a constant. For Models III and IV the range increases with n.

DISCRIMINATING STATISTICS: SEQUENTIAL PROPERTIES. The analysis of response effects presented above is useful, first in suggesting statistics of the data that may discriminate among Models I to IV and second in helping us to interpret the results of applications of these models. To illustrate these uses, let us consider results of Sternberg's (1959b) application of these models to data collected by Goodnow in a two-armed bandit experiment with 100:0 reward. The analysis tells us that fundamental differences among the four models lie in the extent to which response effects occur and are erased or damped. This suggests that the models differ in their sequential properties and that it is among the sequential features of the data that we should find discriminating statistics. This suggestion is confirmed by the following results that were obtained in application of the models to the Goodnow data:

1. Parameter values can be chosen for all four models so that they produce mean learning curves in good agreement with the observed curve of trial-by-trial proportions of errors. The observed and fitted curves are shown in Fig. 8. Despite the differences among the models, the mean learning curve does not discriminate one from another.

Fig. 8. Observed trial-by-trial proportions p_n of errors in the Goodnow experiment and theoretical means V_{1,n} for the four models of Eqs. 58 to 61.

2. Now we begin to examine sequential properties. First we consider the mean number of runs of errors. The parameters in Model I cannot be adjusted so that it will retain its good agreement with the learning curve and at the same time produce few enough runs of errors; this model can be immediately disqualified. In contrast, parameters in Models II, III, and IV can be chosen so that these models will agree with both the learning curve and the number of error runs. (This difference is not altogether surprising because Model I has one less free parameter than the others.) A finer analysis of error runs, considering the average number of runs of each length j, ...

... y_{i,n} = 1 if outcome O1 follows, y_{i,n} = 0 if outcome O2 follows. The number of subjects for which O1 occurred on trial n and A1 on trial n + 1 (a measure of the correlation of responses with prior outcomes) is given by

Σ_{i} y_{i,n} x_{i,n+1}.    (65)

5.3 Conditional Expectations

Partly because of the "doubly stochastic" nature of most learning models (Sec. 3), in which both the responses and the p-values have probability distributions, it is often convenient when finding the expectation of a statistic to determine first the expectation conditional on, say, the p-value and then to average the result over the distribution of p-values. A few examples will illustrate this use of conditional expectations. We let p_n = Pr{x_n = 1}.
It will be convenient to let V_{1,k}(p) denote the first moment of the p-value distribution of a process that started k trials ago at probability p; that is,

V_{1,k}(p) = Pr{x_{n+k} = 1 | p_{n+1} = p} = E(x_{n+k} | p_{n+1} = p).    (66)

It is also useful to let E_x denote an average over the binomial distribution of responses, E_y an average over a binomial distribution of outcomes, and E_p an average over a p-value distribution. Recall that the response probability on trial n may itself be treated as a random variable, of which p_n denotes a particular value.

Suppose we wish to evaluate E(r_T) for an experiment with subject-controlled events. First,

E(x_n) = E_p E_x(x_n | p_n) = E_p(p_n) = V_{1,n},

and

E(x_n x_{n+1}) = E_p E_x(x_n x_{n+1} | p_n) = E_p[p_n Pr{x_{n+1} = 1 | x_n = 1, p_n}].

To evaluate the last expression further requires us to specify the model. Consider the commutative linear-operator model that we discussed in connection with the shuttlebox experiment (Eq. 8). Then

Pr{x_{n+1} = 1 | x_n = 1, p_n} = α_2 p_n,

and therefore E(x_n x_{n+1}) = α_2 V_{2,n}, where V_{2,n} is the second (raw) moment of the p-value distribution on trial n. We thus have

E(r_T) = Σ_{n=1}^{∞} V_{1,n} − α_2 Σ_{n=1}^{∞} V_{2,n},    (67)

and the sums can be evaluated in terms of the model parameters (Bush, 1959). For the single-operator model (α_1 = α_2 = α), V_{1,n} = α^{n−1} p_1 and V_{2,n} = α^{2(n−1)} p_1², and Eq. 67 gives

E(r_T) = p_1/(1 − α) − α p_1²/(1 − α²),

which function is illustrated, for p_1 = 1, in Fig. 12, p. 91.

As a second example, suppose we wish to evaluate the expectation of the statistic Σ_{n=1}^{∞} x_n x_{n+1}, which we encountered in Sec. 4.5. We have

E(Σ_{n=1}^{∞} x_n x_{n+1}) = Σ_{n=1}^{∞} E_p[p_n Pr{x_{n+1} = 1 | x_n = 1, p_n}].

Again we use the commutative linear-operator model as our example. The conditional probability is α_2 p_n and therefore

E(Σ_{n=1}^{∞} x_n x_{n+1}) = α_2 Σ_{n=1}^{∞} V_{2,n}.

Turning to experiments in which outcomes may vary from trial to trial, let us consider the evaluation of the expectation of

t = (1/NI) Σ_{n=m}^{m+N−1} Σ_{i=1}^{I} y_{i,n} x_{i,n+1},

which is the proportion of outcome-response pairs in the indicated block of trials for which A1 on trial n + 1 follows O1 on trial n. Statistics of this type were considered by Anderson (1959) and are examples of aspects of the data that are of interest even after the average response probability has stabilized. We assume a linear-operator model with experimenter control and let Pr{y_{i,n} = 1} = π.

First let us consider E(t) conditional on the particular {y_{i,n}} sequences used. Let E_n denote an average taken over trials and E_i an average over subjects. Because E(x_{i,n+1} | p_{i,n}, y_{i,n}) = αp_{i,n} + (1 − α)y_{i,n} and y_{i,n}² = y_{i,n},

E(t | {y_{i,n}}) = α E_n E_i(y_{i,n} p_{i,n}) + (1 − α)ȳ,    (68)

where ȳ is the average value of y_{i,n} for the sequences used. The corresponding equation in terms of statistics of the data is

t ≅ (α/NI) Σ_{n=m}^{m+N−1} Σ_{i=1}^{I} y_{i,n} x_{i,n} + (1 − α)ȳ.    (69)

The expectation in Eq. 68 can be evaluated if parameters are known, or Eq. 69 can be used for estimation or testing. By using the fact that the y_{i,n} are generated by a probability mechanism, we can arrive at an approximation that is easier to work with. We do this by evaluating E(t) for the "average" y_{i,n}-sequence produced by the probability mechanism rather than for the particular sequences used in the experiment. The approximation is obtained by applying to Eq. 68 the expectation operator E_y, which averages over the binomial outcome distribution of y_{i,n}. Let V̄_1 = E_n(V_{1,n}), where the expectation is taken over the indicated trial block. Then E_y(y_{i,n} p_{i,n}) = π E_y(p_{i,n}), and so

E(t) ≅ α π V̄_1 + (1 − α)π.    (70)

The corresponding equation in terms of statistics of the data is

t ≅ (απ/NI) Σ_{n=m}^{m+N−1} Σ_{i=1}^{I} x_{i,n} + (1 − α)π,    (71)

which is to be compared to the more exact Eq. 69.
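Equation 67 is easy to check directly. In the sketch below (illustrative only), r_T is read as the total number of runs of errors, the quantity that the identity E[Σ_n x_n(1 − x_{n+1})] = Σ V_{1,n} − α_2 Σ V_{2,n} counts; for the single-operator model the two moment sums have the closed forms used above, and a Monte Carlo estimate obtained by generating response sequences agrees with them.

```python
import random

def theoretical_runs(alpha, p1):
    # Eq. 67 for the single-operator model:
    # sum of alpha^(n-1) p1 minus alpha times the sum of alpha^(2(n-1)) p1^2.
    return p1 / (1 - alpha) - alpha * p1 ** 2 / (1 - alpha ** 2)

def monte_carlo_runs(alpha, p1, n_subjects=5000, n_trials=300, seed=3):
    # For the single-operator model the p-sequence is deterministic,
    # p_n = alpha^(n-1) p1; only the responses are random.
    rng, total = random.Random(seed), 0
    for _ in range(n_subjects):
        p, prev, runs = p1, 0, 0
        for _ in range(n_trials):
            x = 1 if rng.random() < p else 0
            if x == 1 and prev == 0:      # a new run of errors begins
                runs += 1
            prev, p = x, alpha * p
        total += runs
    return total / n_subjects

print(theoretical_runs(0.9, 1.0))         # 10 - 0.9/0.19 = 5.263...
print(monte_carlo_runs(0.9, 1.0))         # should agree closely
```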
5.4 Conditional Expectations and the Development of Functional Equations Conditional expectations are useful also in establishing functional equations for interesting model properties. Let Gl5 G2, . . . , Gk, ... be a set of mutually exclusive and exhaustive events, and let h be a statistic whose expectation is desired. Then the property used is and we consider two examples of its application to path-independent models with two subject-controlled events. Let xn = 1 if there is an error on trial n and xn = 0 if there is a success; let the operator for error be Q% and for success, Q^ As our first example we let h = ul5 the total number of errors in an infinite sequence. The conditioning events are the possible responses on trial 1, so that Gl corresponds to Xj. = 0 and G2 corresponds to Xj = 1 . Equation 72 becomes £(Ul \Pl =/?) = Pr {Xl = 0}£(Ul | Xl = 0, />!=/>) + Pr Now we note that E(ut \xl = 0,pl=p) = £(ut | p± = Q^p)\ that is, we can consider the process as if it began on the second trial with a differ- ent initial probability. Similarly, Efa \ xl = 1, Pi= p) = 1 -h £(ui | pl = Qzp) ; in this case we consider the process as beginning on the second trial but we add the error that has already occurred. The result is £(11! \pl=p) = (l-p) £(Ul | Pl = QlP) + p[l + £(Ul | Pl = (73) For a particular model, the expectation £(% \PI= p) depends on the parameters and the value of p; we can suppress the parameters and write it simply as/(jp), a function of p. This function is unknown, but Eq. 73 tells us that it has the property that 82 STOCHASTIC LEARNING THEORY If QiP = &*p and Q2p = a2/?, then /(p) = (1 - p)f(^p) + P[l + /(a2/?)], (74) with the boundary condition /(O) = 0. Equation 74 is an example of a functional equation, which defines some property of an unknown function that we seek to specify explicitly. It has been studied by Tatsuoka and Mosteller (1959). In the preceding example a relation is given among the values of the function at an infinite set of triples of the values of its argument; that is, the set defined {/>, a^, a2/? | 0 < p < 1}. A more familiar example of a functional equation is a difference equation; the values of the argument differ only by multiples of some constant. An example of such a set of arguments is {p,p + h,p + 2h\p = 0,h,2h,..., Nh}. Without loss of generality, a difference equation of this kind can be converted into one in which the arguments of the function are a subset of successive integers. A second familiar example of a functional equation is any differential equa- tion. For both of these special types of functional equations, there is a much wider variety of methods of solution — methods of specifying the unknown function — than for the more general equations. As a second example of the use of Eq. 72 in developing a functional equation let us consider a model with two subject-controlled events in which xn = 1 results in an increase in Pr {xn = 1} = pw toward pn = 1 and xn = 0 results in a decrease in pn toward pn = 0. In such a model, after a sufficient number of trials, any response sequence will consist of either all "errors" or all "successes" ; there are two asymptotically absorb- ing barriers, at p = 1 and p = 0. An example of a linear-operator model of this kind is (GiPn = <*iP« + (1 - *i), with probability pn, Pn+l = ^ {Q&n = <*2Pn> with probability 1 — pw. One of the interesting questions about such a model is to determine the probability of asymptotic absorption at p^ = 1. Bush and Mosteller (1955, p. 
155) show by an elementary argument that the distribution of p^ in this model is entirely concentrated at the two absorbing barriers. Therefore Pr {p^ = 1} = EQiJ, and it is fruitful to identify the h in Eq. 72 with p^. As before, we let Gx correspond to xx = 0 and G2 cor- respond to K! = 1, and we consider the expectation as a function of the starting probability. Equation 72 thus becomes MATHEMATICAL METHODS FOR THE ANALYSIS OF MODELS 8$ We note that ^(p^ | xl = 0,^ =/?) = £($«, \p1 = Q-jf) and similarly that E(pw \Xl=\9pl=p)== E(POO \PI = Qzp). We then have £(Poo | />!=/>) = /?£(?„ | ft = fitf) + 0 - /O^CPoo | ft = fitfO- Letting g-(/?) represent the expectation as a function of the starting prob- ability, we arrive at £00 = as a functional equation for the probability of absorption at p = 1. Boundary conditions are given by g(\) = 1 and g(0) = 0. The function g(p) is understood to depend on parameters of the model in addition to p^ Mosteller and Tatsuoka (1960) and others22 have studied this functional equation for the foregoing linear-operator model. In general, no simple closed solution seems to be available. For the symmetric case of o^ = a2 = a < 1 the solution is g(p) = p. A similar functional equation can be developed for the beta model with two absorbing barriers and symmetric events. It is convenient to work with logit pw instead of pw itself. Following Eq. 49, we have, for this model logitpn= -(a + *tn - 6sn), and the corresponding operator expression is flogit pn + b if xn = 1 (i.e., with probability pn), logit pw+1 = logit pw — b if XTC = 0 (i.e., with probability 1 — pj. Let Ln = logit pn and let g(L) be the probability of absorption at L = oo (which corresponds to p^ = 1) for a process that starts at Lx = L. Then Eq. 75 becomes the linear difference equation g(L) =pg(L + b) + (\- p) g(L - b), (76) where p = antilogit L = 1/(1 + e~~L). The boundary conditions are g(—. oo) = 0 and g(+oo) = 1. This equation has been studied by Bush (1960) and Kanal (1962b). 5.5 Difference Equations The discreteness of learning models makes difference equations ubiqui- tous in their exact analysis. The recursive equation for pn is a differ- ence equation whose solution is given by the explicit equation for pn; the argument of the difference equation is in this case the trial number n. 22 See Shapiro and Bellman, cited by Bush & Mosteller, 1955. 84 STOCHASTIC LEARNING THEORY A simple example is the recursive equation for the single linear operator model pn+l = &pn, whose solution is pn = a71"1/^. A more interesting case is Eq. 13 for the prediction experiment, Pn+l = «pn + (1 - °0yn, with the solution Pn+i = a^'Pi + (1 - *) IX'1'^- 3=1 There are systematic methods of solution for many linear difference equations such as these (see, for example, Goldberg, 1958). Often a little manipulation yields a conjectured solution whose validity can be proved by mathematical induction. Partial difference equations occasionally arise in learning-model analysis ; they are more difficult to solve. We have seen in Sec. 5.2 that it is often necessary to know the moments {Km>n} = |£(pn™)} of the ^-value dis- tributions generated by a model; properties of a model are often expressed in terms of these moments. The transition rules of linear-operator models lead to linear difference equations or other recurrence formulas for the moments. Occasionally these equations are "ordinary": one of the sub- scripts of Vm>n is constant throughout the equation. 
An example is the equation for Vl>n in the experimenter-controlled events model above, when we consider the {yy} to be random variables and Pr {yw = 1} = -n\ it is given by the ordinary difference equation (Eq. 43) whose solution is easily obtained (Eq. 44). It is more usual for neither m nor n to be constant in the recurrence formula for Vmtn9 and the formula is then a partial difference equation. In this case we cannot ignore the fact that Fw>n is a function of a bivariate argument, and methods of solution are correspondingly more difficult. As an example we consider the linear-operator model with two com- mutative events and subject control, for which (aiPn with probability 1 — pw oc2pw with probability pn. First let us consider how the partial difference equation for Vm>n is derived. We assume a population of subjects with common values of /?!, a1? and oc2. On trial 72, after n — 1 applications of the operators, the population consists of n distinct subgroups defined by the number of times <*! has been applied. Let 1 < v < n be the index for these subgroups, let pv >n be the /?-value for the vth subgroup on trial n, and let PVtU be the size of this subgroup, expressed as a proportion of the population. Now let us consider the fate of the vih subgroup on trial n. A proportion, MATHEMATICAL METHODS FOR THE ANALYSIS OF MODELS 85 Pv.n> of the subgroup makes an error on that trial, and its /rvalue becomes **Pv.n- The remaining proportion of the subgroup, 1 — pVtn, performs a correct response, and its p- value becomes ^pVmn. The result is expressed in the following table : New /^-Values New Proportions *2P',» Pv,nPVin (78) alA,n (1 -/>„,„) Pv,w. Therefore, >W = (a2™ - O^i.n + *imVm.n- (79) One feature of this equation, which is generally true of models with subject control, is that Km>n is expressed in terms of moments higher than the mih moment of the /rvalue distribution on preceding trials. With experimenter control this complicating feature is absent, as illustrated by Eq. 77. Equation 79 has been solved by conjecture and inductive proof rather than by any direct method. To illustrate how cumbersome some of the results become in this field, I reproduce the solution here: n k+-m— 2 r j j\fi ^ra— 4+m— 1\ vMn = ar(^i>pl(fi+2«r(^*>pi+wi"1 n ~ l ( r" * — }- fc=2 j=m 1 — a^ m+i (m^\,n> 1), (80) where the sum is defined to be zero for n = 1. For more examples of the development of recursive formulas for mo- ments, see Bush & Mosteller (1955, Chapter 4), and for some examples of their use see Bush (1959), Estes & Suppes (1959, Sec. 8), and Sternberg (1959b). 5.6 Solution of Functional Equations23 Two methods by which functional equations have been studied are illustrated here ; the first is a power-series expansion and the second is a differential equation approximation. 23 See Kanal (1962a,b) for the formulation of some functional equations arising in the analysis of the linear and beta models and for methods of solution and approximation. STOCHASTIC LEARNING THEORY Tatsuoka and Mosteller (1959) solved Eq. 74 by using a power-series expansion. Assume that/(/?) is expressible as a power series in/?: /*(!>) =f*. (81) The boundary condition/*(0) = 0 implies that c0 = 0. By substituting the series expansion into the functional equation and equating coefficients of like powers of/? we find fc-i IT fo' - axO <* = ^ - > fe>l- (82) mi -«i') 3=1 For certain special cases this expression can be simplified. 
For example, with a2 = 1 (identity operator for "error") and 0 < ax < 1, ck = I/ (1 - a/) and therefore (83) . fc-i 1 — 1. In general, however, a functional equation may possess solutions for which a power-series expan- sion is not possible. For this reason it is necessary either to provide a general proof of the uniqueness of/*(p) or to show that we are interested only in solutions of Eq. 74 with power-series expansions. Kanal (I960, 1962a) has shown that/*(p) is the only solution of Eq. 74 that is continuous at/? = 0 but no general proof of uniqueness is available at present. Fortunately, we can use Eq. 80 to show that for the model in question E(uJ has a series expansion in powers of p and that power-series solutions of Eq. 74 are therefore the only ones of interest. To do this, we note that MATHEMATICAL METHODS FOR THE ANALYSIS OF MODELS 8j and that /£ \ * E 2*J = l£(xn) (84) \n=l / 71=1 if the right-hand series converges. Equation 80 provides an expression for E(xn) = Vl>n', it is a polynomial in pl = p. If ax < 1 and either oc2 < 1 00 or/?! < 1, J PI,* converges. We therefore know, incidentally, that under these conditions £(ux) exists. Moreover, because it is the sum of a con- vergent infinite series of polynomials, it must have a power-series expansion. For some functional equations the power series that is obtained may not converge, and we cannot apply the foregoing method. As an example of a second method we consider Bush's (1960) solution of Eq. 76. First the equation is written in terms of first differences and (1 — /?)//? is replaced by e~L S(L + b)~ g(L) = erL\g(L) - g(L - b}}. (85) In order to convert (85) into a linear equation, the logarithmic transforma- tion is applied to both sides, and the logarithm of the first difference is defined as a new function, h(L) == log [g(L) — g(L — 6)], to give h(L + b) = h(L) - L. (86) Equation 86 is the difference equation to be solved. In this case a solution is sought, not by a power series expansion but by a differential equation approximation of the difference equation. We write Eq. 86 in a form symmetric about L: divide by b, Aft = _ L I AL~ 62* and treat the result as a derivative, ^ = _^+i dL b 2' Integration gives L2 . L . as a conjectured solution. The result satisfies Eq. 86, and therefore a particular solution of the complete equation is given by Eq. 87 with C = 0. The homogeneous equation h(L + b) = h(L) has as its general solution STOCHASTIC LEARNING THEORY P(L), an arbitrary periodic function of L with period b. For the general solution of the complete equation we then have . (88) 20 To recover g, we use g(L) - g(L -b) = exp [fc(L>] = exp [| - ^ and completing the square gives us - g(L -b) = P(L) exp - L - - , (89) where P(L) is some other periodic function of L, with period b. This new difference equation is simpler than the original (Eq. 85) because it contains one difference instead of two, and therefore routine procedures can be used. We first note the boundary conditions g(— oo) == 0 and g(cd) = 1. Then we replace L by L — b, L - 26, and so on, to obtain the semi- infinite system g(L) - g(L~ V) = P(L) exp - 6) - g(L- 26) = P(L) exp - L- b - (90) - 26) - « to define a relation between the two expectations. Two such relations are shown in Fig. 14, in which E(f) and E(rT) are plotted against £(uj). These two relations are projec- tions of a curve in the three-dimensional property-space with dimensions £(f), E(rT), and £0%); this curve is the subspace of the property-space to which the model corresponds. 
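For the single-operator model the three expectations can be written down, or summed numerically, as functions of the single parameter α, which is what makes the picture of a curve in property space concrete. The sketch below is illustrative; it assumes p_1 = 1, as in the experiments of Table 4, and reads f as the length of the initial run of errors, a reading consistent with the numerical values used in Sec. 6.2 below. Varying α then traces out the model's one-dimensional subspace of the (E(u_1), E(f), E(r_T)) property space.

```python
def properties(alpha, terms=200):
    """E(u1), E(f), E(rT) for the single-operator model with p1 = 1."""
    e_u1 = 1.0 / (1.0 - alpha)                      # expected total errors
    e_rt = e_u1 - alpha / (1.0 - alpha ** 2)        # expected runs of errors (Eq. 67)
    # Initial error run: Pr(first k responses are all errors) = alpha^(k(k-1)/2).
    e_f = sum(alpha ** (k * (k - 1) / 2) for k in range(1, terms))
    return e_u1, e_f, e_rt

for a in (0.80, 0.90, 0.95):
    u1, f, rt = properties(a)
    print(f"alpha={a:.2f}  E(u1)={u1:6.2f}  E(f)={f:5.2f}  E(rT)={rt:5.2f}")
```

At α = 0.90 this gives E(u_1) = 10.0 and E(f) ≅ 3.9, the "true" values quoted in the estimation example that follows.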
The units are so chosen that the scales are roughly comparable in standard deviation units, that is, a 1-cm discrep- ancy on the EOiJ scale is as serious as a 1-cm discrepancy on either of the other scales. In Table 4 the average values of the three statistics are given for three Table 4 Observed Values of wl9 /, rT in Three Experiments Experiment ul f fT T-Maze reversal after overlearning (Galanter & Bush, 1959) T-Maze reversal (Galanter & Bush, 1959, Period 2) Solomon- Wynne shuttlebox (Bush & Mosteller, 1959) 24.68 13.32 6.11 14.10 5.30 6.60 7.80 4.50 3.24 APPLICATION AND TESTING OF LEARNING MODELS §3 experiments in which the assumption that p± = 1 is tenable. It is instruc- tive to examine these values in conjunction with the four graphs for the single-operator model. In none of the experiments does any of the pairs of statistics satisfy the relations for this simple model. Put another way, in no case does the observed point (wl9 /, f T) fall within the allowed sub- space. How large a discrepancy is tolerable is a statistical problem that is touched on later. A good deal of the work that has been done on fitting and testing models can be thought of as being analogous to the process exemplified above : mathematical study of a model yields several functions, s;-(0), of properties in terms of parameter values, and the question is asked whether there is a choice of parameter values (a point in the parameter space) for which the observed s^ are close to their theoretical values. The process is usually conducted in two stages: first, estimation, in which parameter values 0 are selected so that a subset of the ss agrees exactly with the theoretical values, and, second, testing, in which the remaining s^ are compared to their corresponding s3(Q). In the second stage some or all of the theoretical values may be estimated from Monte Carlo calculations. These stages correspond in Fig. 14, for example, to first letting u± determine a point on the abscissa and, second, comparing the corresponding ordinate values to /and FT. Conclusions from this method are conditional on the choice of prop- erties used in each of the two stages. To assert that "model X cannot describe the observed distribution of error-run lengths" is stronger than is usually warranted. More often the appropriate statement is of the form "when parameters for model X are chosen so that it describes ^ and s2 exactly then it cannot describe §3" Occasionally there are exceptions in which some property of a model is independent of its parameter values. For example, we can assert unconditionally that "an S-shaped curve of probability versus trials cannot be described by the single-operator model" or that "the linear-operator model with complementary experimenter- controlled events for the prediction experiment must (if learning occurs at all) produce asymptotic probability-matching for all values of TT." Such parameter-free properties of a model are worthy of energetic search. 6.2 The Estimation Problem Most model types have one or more free parameters whose values must be estimated from data. Estimates that satisfy over-all optimal criteria, such as maximum likelihood or minimum chi-square, cannot usually be obtained explicitly in terms of statistics of the data. Because the iterative 94 STOCHASTIC LEARNING THEORY or other numerical methods that are needed in order to obtain such esti- mates are inconvenient, they have seldom been used in research with learning models. The more common method has been briefly touched on in Sec. 
6.1: parameter values are chosen to equate several of the observed statistics of the data with their expectations as given by the model. The estimates θ̂ are therefore produced by the solution of a set of equations of the form s_j(θ̂) = ŝ_j, where ŝ_j is the observed value. Because the properties of estimates so obtained are not well understood, this method may lead to serious errors, as is illustrated by an example. Let us suppose that a learning process behaves in accordance with the single-operator model with α = 0.90 and that we do not know this but wish to test the model as a possible description of the data. Suppose that we have a single sequence of responses and that we have reason to assume that p_1 = 1. Suppose, further, that because of sampling variability the number of errors at the beginning of the sample sequence is accidentally too large and that the observed values of u_1 and f are inflated, each by an amount equal to its theoretical standard deviation. Figures 12 and 14 then provide us with the following values:

                               u_1      f
True (population) value       10.00    3.91
Sample value                  12.18    5.91

We have two properties s_j, namely u_1 and f, and a single parameter, α, to estimate. One property is needed for estimation and the remaining one is available for testing. The choice of which property to use for which purpose involves only two alternatives; but it has the essential character of the more complicated choice usually available to the investigator. One procedure has been to use "gross" features of the data, such as the total number of errors, for estimation, and "fine-grain" features, such as the distribution of error-run lengths, for testing. (For examples, see Bush & Mosteller, 1959, and Sternberg, 1959b.) Occasionally the investigator cannot choose; whatever few statistics he is lucky enough to have analytic expressions for are automatically elected for use in estimation, and for testing he must resort to Monte Carlo calculations. (For an example, see Bush, Galanter, & Luce, 1959.)

Let us examine, in the case of the two statistics tabulated above, how the choice that is made affects our inference about goodness of fit of the model. First, suppose that we use f for estimation, choosing α so that E(f | α) = f. Entering Fig. 13 with the observed value of f = 5.9, we find the corresponding estimate to be α = 0.957. Now to test the model we refer to Fig. 11. Corresponding to α = 0.957 are the values of E(u_1) = 23.3 and σ(u_1) = 3.35. The difference between E(u_1) and its observed value of u_1 = 12.18 is more than three times its theoretical standard deviation, a sizable discrepancy. On these grounds we would be inclined to discard the model. But first let us consider the result if we take the second option. We use u_1 for estimation, and Fig. 11 gives the value of α = 0.917 for E(u_1) = 12.18. To test the model, we enter Fig. 13 with α = 0.917; the corresponding theoretical values are E(f) = 4.3 and σ(f) = 2.15. The observed value f = 5.91 is therefore within one standard deviation of the theoretical value. The second option inclines us to accept the model. It is worth noting that, if anything, this example is conservative: had the number of errors been accidentally large, but toward the end of the sequence instead of the beginning, f would have been close to its theoretical value, u_1 would have been inflated, and the two results would have been still more discrepant. The reason for the disagreement may be clarified by Fig. 14.
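Because the relevant theoretical functions for the single-operator model with p_1 = 1 are simple — E(u_1) = 1/(1 − α) and, reading f as the initial error run as above, E(f) = Σ_k α^{k(k−1)/2} — the example can be replayed without the figures. The sketch below is illustrative only; the bisection routine and the specific reading of f are assumptions of the sketch, not part of the original analysis.

```python
def e_u1(alpha):                        # expected total errors, p1 = 1
    return 1.0 / (1.0 - alpha)

def e_f(alpha, terms=400):              # expected initial error-run length, p1 = 1
    return sum(alpha ** (k * (k - 1) / 2) for k in range(1, terms))

def invert(fn, target, lo=0.5, hi=0.999, iters=60):
    """Solve fn(alpha) = target by bisection; both functions increase with alpha."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if fn(mid) < target else (lo, mid)
    return (lo + hi) / 2

u1_obs, f_obs = 12.18, 5.91             # the inflated sample values from the text

a_from_f = invert(e_f, f_obs)           # option 1: estimate alpha from f
print(round(a_from_f, 3), "-> E(u1) =", round(e_u1(a_from_f), 1))   # far from 12.18

a_from_u1 = invert(e_u1, u1_obs)        # option 2: estimate alpha from u1
print(round(a_from_u1, 3), "-> E(f) =", round(e_f(a_from_u1), 1))   # close to 5.91
```

The first option rejects the (correct) model; the second accepts it, which is the point of the example.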
Here it can be seen that for this particular model an error in £(f) corresponds to a much larger error in EO^), in terms of standard deviation units. The total-errors statistic is the most "sensitive" of the three; therefore, to give the model the best chance, it is the one that should be used for estimation. The question of the choice of an estimating statistic is therefore a delicate one. In the lucky instance in which a model approximates the data well in many respects it is unimportant how the estimation is carried out. Such an instance is the Bush-Mosteller (1955) analysis of the Solomon- Wynne data, in which several methods gave estimates in very good agree- ment; indeed, this fact in itself is strong evidence in favor of the model. But this is a rare case, and more often estimates are in conflict. The question becomes especially important when several models are to be compared in their ability to describe a set of data. It is crucial that the estimation methods be equally "fair" to the models, and the standard procedures do not ensure this. For example, we might be comparing with the single-operator model of Fig. 14 another hypothetical model for which the curve of £(f) versus E(uJ had a slope greater rather than less than unity. If we then used wa in estimation for both models, we would be prejudicing the comparison in favor of the single-operator model. For different models different sets of statistics may be the best estimators : we do not ensure equal fairness by using the same estimating statistics for all the models to be compared. This observation, for which I am indebted to A. R. Jonckheere,25 casts doubt on the results of certain com- parative studies, such as those of Bush and Mosteller (1959), Bush, Galanter, and Luce (1959), and Sternberg (1959b). One possibility for retrieving the situation is to search for aspects of the data that one or more of the competing models are incapable of describing, 26 Personal communication, 1960. g$ STOCHASTIC LEARNING THEORY regardless of the values of its parameters. An example arises in the analysis of reversal after overlearning in a T-maze, one of the experiments included by Bush, Galanter, and Luce (1959) in their comparison of the linear and beta models. The observed curve of proportion of successes versus trials starts at zero and is markedly S-shaped, rising steeply in its middle portion. Parameters can be chosen for the beta model so that its curve agrees well with the one observed. But, as Galanter and Bush (1959) show, although the linear model of Eq. 8 is capable of producing an S-shaped curve that starts at zero fa = 1), no choice of o^ and a2 permits its curve to rise both slowly enough at the beginning and end of learning and steeply enough in its middle portion. As the analysis stands, then, the beta model is to be preferred for these data. The problem is that if we search long enough we may be able to find a property of the data that this model cannot describe and that the linear model can. To choose between the models, we would then have to decide which of the two properties is the more "important," and the problem of being equally fair to the competing models would again face us. A solution to the problem lies in the use of maximum likelihood (or other "best") estimates (Wald, 1948), despite their frequent inconvenience, and in the comparison of the maximized likelihoods and the use of like- lihood-ratio tests to assess relative goodness of fit. 
Bush and Mosteller (1955) discuss several over-all measures of goodness of fit. The use of such over-all tests has occasionally been objected to on grounds that they may be sensitive to uninteresting differences among models or between models and data and that they may not reveal the particular respects in which a model is deficient. Our example of the f and ux statistics shows that the first objection applies to the more usual methods as well. In answer to the second objection, there is no reason why detailed comparison of par- ticular statistics cannot be used as a supplement to the over-all test. One of the desirable features of the beta model and of more general logistic models is that a simple set of sufficient statistics exists for the parameters and that the standard iterative method (Berkson, 1957) for obtaining the maximum-likelihood estimates is easily generalized for more than two parameters, converges rapidly, and is facilitated by existing tables. Cox (1958) suggests that, in applications to learning, initial estimates be obtained by the minimum-logit %2 method (Berkson, 1955; Anscombe, 1956), which does not require iteration. Examples of the results of these methods applied to the Solomon- Wynne data (Sec. 4.1) are given in Table 5. Both the maximum-likelihood and minimum-logit #2 methods can be thought of as ways of fitting the linear regression equation given by Eq. 49 : logit pn = -(a + btn + csn). APPLICATION AND TESTING OF LEARNING MODELS 97 The random variables tn and sn are considered to be the independent variables, and the logit of the escape probability is the dependent variable. The equation defines a plane; the observed points to which the plane is fitted are the logits of the proportions of escapes at given values of (ta, sj. A difficulty arises with the minimum-logit f method when the observed proportion for a (tn, sj pair is zero, as happens often when sn + tn is large in the later trials. Most of these zero observations were omitted in obtaining the values in the second row of Table 5, so that this row of values depends principally on early trials. Relations between the values of pl9 /?!, and /?2 and the estimates of a, 6, and c are given in Sec. 2.5. Table 5 Results of Four Procedures for Estimating Parameters of the Beta Model (Eqs. 17, 19) from the Solomon- Wynne Data Method pi (initial escape probability) h (avoidance) (escape) Bush-Galanter-Luce (1959) 0.94 0.59 0.83 Minimum logit #2 0.864 0.760 0.778 One maximum-likelihood iteration 0.857 0.805 0.718 Two maximum-likelihood iterations 0.857 0.811 0.735 When there are only two parameters, as in the case of the beta model for two symmetric experimenter-controlled events (Eq. 24), logit pn = -(a + bdn\ a simple graphical method (Hodges, 1958) provides close approximations to the maximum-likelihood estimates; it is probably preferable to mini- mum-logit #2 for obtaining starting values for maximum-likelihood itera- tion. The minimum-logit %2 method should be used with caution; it may occasionally be misleading, perhaps because of the difficulty with zero entries already mentioned. Consider, as an example, the logistic one-trial perseveration model (Eq. 32): logit pw = a + b(n — 2) + cxn_1? (n > 2). A simple graphical method is the visual fitting of a pair of parallel lines to the proportions that estimate Pr (xw = 1 | xn_x = 0) and Pr (xn = 1 | xn-i = 1) when they are plotted against n > 2 on logistic (or normal probability) paper. 
These lines then represent logit pn = a + b(n — 2) and logit pn = (a + c) + b(n — 2) and provide estimates of the three parameters. Values obtained for the Goodnow data (Sec. 4.5), using the graphical method and then applying one cycle of maximum-likelihood ?8 STOCHASTIC LEARNING THEORY iteration to its results, are presented in Table 6. For these data, the minimum-logit #2 method gave values that departed more from the maxi- mum-likelihood values than the simple graphical procedure. The advantages of maximum-likelihood estimates are that their variances are known, at least asymptotically, and that their values tend to represent much of the information in the data. When the maximum-likelihood method is not used, alternative methods that have these properties are to be preferred. As an example, let us consider estimation for the linear-operator model with experimenter control (Eq. 12) that has been used for the pre- diction experiment with Pr [yn = 1} = TT and Pr {yw = 0} = 1 — TT. If Table 6 Estimates for the Logistic Perseveration Model (Eq. 32) from the Goodnow Data Method a b c Visual fit of parallel lines on logistic paper 0 -0.24 0.94 One maximum-likelihood iteration 0.035 -0.236 0.927 t^ is the total number of A± responses by the zth subject during the first N trials, then for this model £(t,) = NTT - (TT - 7ia) , (93) — a and, having determined Jf^, we can estimate + 2F2, where p is the (approximately constant) probability on the two trials and F! and V2 are the (approximate) first and second moments of its distribu- tion. The expectations on the left are replaced by the averages of x± + #2 and (xl + #2)2 over subjects, and then the equations are solved for Kx and v* The homogeneity assumption requires that on the first trial V2 = Fx2. In his analysis of the Goodnow two-armed bandit data Bush estimated these two quantities by using three-trial blocks, drawing a smooth curve through the estimates and extrapolating back to the first trial. The result was F2jl = 0.13 and J^ = 0.11, making the homogeneity assumption tenable for initial probability. In another test of the assumption Bush and Wilson (1956) examined the number of A± responses in the first 10 trials of a two-choice experiment for each of 49 paradise fish. The distribution of number of choices had more spread than could be accounted for by a common probability for all subjects. The assumption was therefore rejected and a distribution of initial probabilities was used. Instead of one initial probability parameter, two were then needed, one giving the mean and the other giving the variance of the distribution of initial probabilities, whose form was assumed to be that of a beta distribution. Even less work has been done in which variation in the learning-rate parameters is allowed. One example appears in Bush and Mosteller's (1959) analysis of the Solomon- Wynne data: the linear single-operator model was used with a distribution of oc-values. In certain respects this generalization improved the agreement between model and data. 27 This "block-moment" method was developed by Bush, as a general estimation scheme, in an unpublished manuscript, 1955. 102 STOCHASTIC LEARNING THEORY 6.4 Testing a Single Model Type In a good deal of the work with learning models a single model is used in the analysis of a set of data. Estimates are obtained, and then several properties of the model are compared with their counterparts in the data. There is little agreement as to which properties should be examined or how many. 
6.4 Testing a Single Model Type

In a good deal of the work with learning models a single model is used in the analysis of a set of data. Estimates are obtained, and then several properties of the model are compared with their counterparts in the data. There is little agreement as to which properties should be examined or how many. Informal comparisons, sometimes aided by the theoretical variances of the statistics considered, are used in order to decide on the model's adequacy. Values of the parameter estimates may be used as descriptive statistics of the data.

As with any theory, a stochastic learning model can be more readily discredited than it can be accepted. Two reasons, however, lead investigators to expect and allow some degree of discrepancy between model and data. One reason is the view, held by some, that a model is intended only as an approximation to the process of interest. A second is the fact that today's experimental techniques probably do not prevent processes other than the one described by the model from affecting the data. The matter is one of degree: how good an approximation to the data do we desire, and to which of their properties? And how deeply must we probe for discrepancies before we can be reasonably confident that there are no important ones?

Recent work that reveals how difficult it may be to select among models (e.g., Bush, Galanter, & Luce, 1959; Sternberg, 1959b) suggests that some of our testing methods for a single model may lack power with respect to alternatives of interest to us and that we may be accepting models in error. One finding is that the learning curve is often a poor discriminator among models. Two examples have already been illustrated in this chapter. In Figs. 1 and 2 a model with experimenter-controlled events provides an excellent description of learning curves generated by a process with a high degree of subject control. In Fig. 9 four models that differ fundamentally in the nature of their response effects produce equally good agreement with an observed learning curve.

It would be an error to conclude from these examples that the learning curve can never discriminate between models; this is far from true. Occasionally it provides us with strong negative evidence. We have already seen (Sec. 4.4) that the beta model with experimenter control cannot account for the asymptote of the learning curve in prediction experiments with π ≠ ½, in those experiments in which probability matching occurs. On the other hand, the linear experimenter-controlled event model can be eliminated for a T-maze experiment (Galanter & Bush, 1959) in which its theoretical asymptote is exceeded. The shape of the preasymptotic learning curve may also occasionally discriminate between models; for example, as mentioned in Sec. 6.2, the linear-operator model cannot produce a learning curve that is steep enough to describe the T-maze data of Galanter and Bush (1959) on reversal after overlearning, whereas the beta model provides a curve that is in reasonable agreement with these data. The important point, however, is that agreement between an observed learning curve and a curve produced by a model cannot, alone, give us a great deal of confidence in the model.

More surprising, perhaps, is that the distribution of error-run lengths also seems to be insensitive. In Fig. 9 it can be seen that three distinctly different models can be made to agree equally well with an observed distribution. As another example, let us consider the fourth period of a T-maze reversal experiment of Galanter and Bush (1959, Experiment III). In this experiment three trials were run each day, and by the fourth period there appeared a marked daily "recovery" effect: on the first trial of each day there was a large proportion of errors.
Needless to say, this effect was not a property of the path-independent model used for the analysis. Despite the oscillating feature of the learning curve, a feature that one might think would have grave consequences for the sequential aspects of the data, the agreement between model and data, as regards the run-length distribution, was judged to be satisfactory. Again, as for the learning curve, there are examples in which the run-length distribution can discriminate. In Fig. 9 it can be seen that one of the four models cannot be forced into agreement with it. And Bush and Mosteller (1959) show that a Markov model and an "insight" model, when fitted to the Solomon-Wynne data, produce significantly fewer runs than the other models studied.

We do not wish to be limited to negative statements about the agreement between models and data, yet we have evidence that some of the usual tests are insensitive, and we have no rules to tell us when to stop testing. In comparative studies of models this situation is somewhat ameliorated: we continue making comparisons until all but one of the competing models is discredited. Another possible solution is to use over-all tests of goodness of fit. As already mentioned, these tests suffer from being powerful with respect to uninteresting alternatives: such a test, for example, might lead us to discard a model type under conditions in which only the homogeneity assumption is at fault. In contrast, the usual methods seem to suffer from low power with respect to alternatives that may be important.

One role proposed for estimates of the parameters of a model is that they can serve as descriptive statistics of the data. Such descriptive statistics are useful only if the model approximates the data well and if the values are not strongly dependent on the particular method used for their estimation. I have already discussed how an apparent lack of parameter invariance from one experiment to another may be an artifact of applying the wrong model. This has been recognized in recent suggestions that the invariance of parameter estimates from experiment to experiment be used as an additional criterion by which to test a model type.

6.5 Comparative Testing of Models

As I have already suggested, the comparative testing of several models improves in some ways on the process of testing models singly. The investigator is forced to use comparisons sensitive enough so that all but one of the models under consideration can be discredited. Attention is thereby drawn to the features that distinguish the models used, and this allows a more detailed characterization of the data than might otherwise be possible.

As an example, let us take the Bush-Mosteller (1959) comparison of eight models in the analysis of the Solomon-Wynne data. Examination of several properties eliminates all but two of the models. The remaining two are the linear-operator model (Eq. 8) and a model of the kind developed by Restle (1955). A theoretical comparison of the two models is necessary to discover a potential discriminating statistic. The "Restle model" is an example of a single-event model; under the homogeneity assumption all subjects have the same value of p_n. The linear-operator model, on the other hand, involves two subject-controlled events, and parameter estimates suggest a positive response effect. This difference should be revealed in the magnitude of the correlation between number of early and late errors (shocks).
The linear-operator model calls for a positive correlation; Restle's model (together with homogeneity) calls for a zero correlation. The observed correlation is positive, and the linear-operator model is selected as the better. (This inference exemplifies the type discussed in Sec. 6.3 that depends critically on the validity of the homogeneity assumption.)

As a second example of a comparative study let us take the Bush-Wilson study (1956) of the two-choice behavior of paradise fish. On each trial the fish swam to one of two goalboxes. On 75% of the trials food was presented in one (the "favorable" goalbox); on the remaining 25% the food was presented in the other. For one group of subjects food in one goalbox was visible from the other. Two models were compared, each of which expressed a different theory about the effects of nonfood trials. The first theory suggests that on these trials the performed response is weakened, giving a model that has commonly been applied to the prediction experiment with humans:

Information Model

Event                            p_{n+1}
Favorable goalbox, food          αp_n + 1 - α
Favorable goalbox, no food       αp_n
Unfavorable goalbox, food        αp_n
Unfavorable goalbox, no food     αp_n + 1 - α

The second theory suggests that on nonfood trials the performed response is strengthened:

Secondary Reinforcement Model

Event                            p_{n+1}
Favorable goalbox, food          α_1 p_n + 1 - α_1
Favorable goalbox, no food       α_2 p_n + 1 - α_2
Unfavorable goalbox, food        α_1 p_n
Unfavorable goalbox, no food     α_2 p_n

For the model with equal event effects,

logit p_n = -[a + b(t_n + s_n)] = -(a + bn).   (95)

The test is equivalent to the question: are the magnitudes of the two event effects equal? It is performed by obtaining maximum-likelihood estimates of a, b, and c in Eq. 94 and of a and b in Eq. 95 and calculating the maximized likelihood for each model. These steps are straightforward for logistic models. Because Eq. 94 has an additional degree of freedom, the likelihood associated with it will generally be greater than the likelihood associated with the first model. The question of how much greater it must be in order for us to reject the first model and decide that the two events have unequal effects can be answered by making use of the (large sample) distribution of the likelihood ratio λ under the hypothesis that the equal-event model holds (Wilks, 1962).²⁸ In an alternative procedure a statistic that behaves monotonically with the likelihood ratio is used, and its (small sample) distribution under the hypothesis of equal event effects can be obtained analytically or, if this is difficult, from a Monte Carlo experiment based on Eq. 95. A comparable test for a generalized linear-operator model has been developed and applied by Hanania (1959) in the analysis of data from a prediction experiment. She concludes for those data that the effect of reward is significantly greater than the effect of nonreward.

28. For this example -2 log λ is distributed as chi-square with one degree of freedom.
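The likelihood-ratio comparison just described can be sketched computationally. In the fragment below, the model with separate event effects (the model referred to as Eq. 94 in the text, with regressors t_n and s_n) is fitted against the equal-effect model of Eq. 95, and -2 log λ is referred to the chi-square distribution with one degree of freedom; the data arrays and starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Hypothetical event counts and 0/1 responses for a handful of subject-trials.
t = np.array([0, 1, 1, 2, 3, 3, 4, 5, 0, 0, 1, 2, 2, 3, 4, 5], dtype=float)
s = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 1, 1, 2, 2, 2, 2], dtype=float)
n = t + s + 1.0                                   # trial number
x = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0], dtype=float)

def neg_ll(design):
    def f(params):
        # logit p = -(design @ params), so p = 1 / (1 + exp(design @ params))
        p = 1.0 / (1.0 + np.exp(design @ params))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return f

ones = np.ones_like(t)
full = minimize(neg_ll(np.column_stack([ones, t, s])), np.zeros(3), method="BFGS")
reduced = minimize(neg_ll(np.column_stack([ones, n])), np.zeros(2), method="BFGS")

lr = 2.0 * (reduced.fun - full.fun)               # -2 log(lambda)
print(lr, chi2.sf(lr, df=1))                      # compare with chi-square, 1 d.f.
```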
6.6 Models as Baselines and Aids to Inference

Most of our discussion to this point has been oriented towards the question whether a model can be said to describe a set of data. Usually a model entails several assumptions about the learning process, and therefore in asking about the adequacy of a model we are testing the set of assumptions taken together. Models are also occasionally useful when a particular assumption is at stake or when a particular feature of the data is of interest. Occasionally, as with the "null hypothesis" in other problems, it is the discrepancy between model and data that is of interest, and the analysis of discrepancies may reveal effects that a model-free analysis might not disclose. A few examples may make these ideas clear.

In Sec. 6.5 the choice between the two models of Eqs. 94 and 95 was equivalent to the question whether the effects of reward and nonreward are different. We might, on the other hand, start with this question and perform the same analysis, not being concerned with whether either of the models fitted especially well but simply with whether one (Eq. 94) was significantly preferable to the other (Eq. 95). With the same kind of question in mind we could estimate reward and nonreward parameters for some model and compare their values. One difficulty with this procedure is that unless we know the sampling properties of the estimator it is difficult to interpret such results. A second difficulty is that if a model is not in accord with the data the estimates may depend strongly on the method used to obtain them. An example of this dependence is shown in Table 5; different estimates lead to contradictory answers to the question which of the two events has the larger effect. That the parameters in a model properly represent the event effects in a set of data to which the model is fitted may be conditional on the validity of many of the assumptions embodied in the model. What is needed for this type of question is a model-free test, one that makes as few assumptions as possible. Applications of such tests are illustrated in Sec. 6.7.

If a model agrees with a number of the properties of a set of data, then the discrepancies that do appear may be instructive and may be useful guides in refining the model. One example is provided by the analysis of free-recall verbal learning by a model developed by Miller and McGill (1952). The model is intended to apply to an experiment in which, on each trial, a randomly arranged list of words is presented and the subject is asked to recall as many of the words as he can. The model assumes that the process that governs whether or not a particular word is recalled on a trial is independent of the process for any of the other words. The model is remarkably successful, but one discrepancy appears when the estimated recall probability after v recalls, p_v, is examined as a function of v and compared to the theoretical mean curve (Bush & Mosteller, 1955, p. 234). The observed proportions oscillate about the mean more than the model allows. This suggests the hypothesis that a subject learns words in clusters rather than independently and that either all or none of the words in a cluster tend to be recalled on a trial.

A second example of the baseline use of a model is provided by Sternberg's (1959b) analysis of the Goodnow data by means of the linear one-trial perseveration model (Eq. 29). Several properties of the data are described adequately by the model, but in at least one respect it is deficient. The model implies that for all trials n ≥ 2,

Pr{x_n = 1 | x_{n-1} = 1} - Pr{x_n = 1 | x_{n-1} = 0} = β,

a constant. What is observed is that the difference between the estimates of these conditional probabilities decreases somewhat during the course of learning. It was inferred from this finding and from other properties of the data that the tendency to perseverate may change as a function of experience.
This example may serve to caution us, however, and to indicate that the hypotheses suggested by a baseline analysis may be only tentative. The inference mentioned depends strongly on the use of a linear model. If the logistic perseveration model (Eq. 32) is used instead, the observed decrease in the difference between conditional probabilities is produced automatically, without requiring changes in parameter values during the course of learning.

A final use of models as aids to inference is in the study of methods of data analysis. In an effort to reveal some feature of data the investigator may define a statistic whose value is thought to reflect the feature. Because the relations between the behavior of the statistic and the feature of interest may not be known, errors of inference may occur. These errors are sometimes known as artifacts: the critical property of the statistic arises from its definition rather than from the feature of interest in the data. A simple example of an artifact has already been mentioned in Sec. 6.3. If individuals differ in their p-values and if the existence of response effects is assessed by means of the correlation over subjects between early and late errors, then a positive response effect may be inferred when there is none. A second example is the evaluation of response-repetition tendencies. If subjects differ in their p-values on trial n - 1 and if the existence of a perseverative tendency is assessed by comparing the proportions that correspond to Pr{x_n = 1 | x_{n-1} = 1} and Pr{x_n = 1 | x_{n-1} = 0}, then a perseverative tendency may be inferred where none is present. This occurs because the samples used to determine the two proportions are not randomly selected: they are chosen on the basis of the response on trial n - 1. The subjects used to determine Pr{x_n = 1 | x_{n-1} = 1} tend to have higher p-values on trial n - 1 (and consequently on trial n) than the subjects in the other sample. Selective sampling effects of this kind have been discussed by Anderson (1960, p. 85) and Anderson and Grant (1958).

The question in both of these examples is how extreme a value of the statistic must be observed in order to make the inference valid. In order to answer this question, the behavior of the statistic under the hypothesis of no effect must be known. Such behavior can be studied by applying the method of analysis to data for which the underlying process is known, namely Monte Carlo sequences generated by a model.

A more complicated problem to which this type of study has been applied is the criterion-reference learning curve (Hayes & Pereboom, 1959). This is a method of data analysis in which an average learning curve is constructed by pooling scores for different subjects on trials that are specified by a performance criterion rather than by their ordinal position. Underwood (1957) developed a method of this kind in an effort to detect a cyclical component in serial verbal learning. The result of the analysis was a learning curve with a distinct cyclical component. Whether the inference of cyclical changes in p_n is warranted depends on the result of the analysis when it is applied to a process in which p_n increases monotonically. Hayes and Pereboom apply the method to Monte Carlo sequences generated by such a process and obtain a cyclical curve. We conclude that to infer cyclical changes in p_n the magnitude of the cycles in the criterion-reference curve must be carefully examined; the existence of cycles is not sufficient.
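The use of Monte Carlo sequences as a baseline for detecting artifacts is easy to illustrate. The sketch below simulates subjects whose responses are independent from trial to trial but whose p-values differ; the pooled conditional proportions nevertheless differ, mimicking a perseverative tendency. The distribution of p-values and the sample sizes are illustrative assumptions only.

```python
import numpy as np

# Each simulated subject responds independently across trials with a constant,
# subject-specific p, so there is no true perseverative tendency.
rng = np.random.default_rng(2)
n_subjects, n_trials = 200, 20
p = rng.beta(2, 2, size=n_subjects)                 # heterogeneous p-values
x = rng.binomial(1, p[:, None], size=(n_subjects, n_trials))

prev, curr = x[:, :-1].ravel(), x[:, 1:].ravel()
p_after_1 = curr[prev == 1].mean()                  # estimates Pr{x_n = 1 | x_{n-1} = 1}
p_after_0 = curr[prev == 0].mean()                  # estimates Pr{x_n = 1 | x_{n-1} = 0}
print(p_after_1, p_after_0)   # the first exceeds the second, mimicking perseveration
```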
6.7 Testing Model Assumptions in Isolation

A particular learning model can be thought of as the embodiment of several assumptions about the learning process. Testing the model, then, is equivalent to testing all of these assumptions jointly. If the model fails, we are still left with the question of the assumptions that are at fault. Light may be shed on this question by the comparative method and the detailed analysis of discrepancies discussed in the last two sections. But a preferable technique is to test particular assumptions in as much isolation from the others as possible. Examples of several techniques are described in this section.

To illustrate the equivalence of a model to several assumptions, each of which might be tested separately, let us consider the analysis of the data from an escape-avoidance shuttlebox experiment by means of the linear-operator model (Eqs. 7 to 11 in Sec. 2.4). Suppose that an estimation procedure yields an avoidance operator that is more potent than the escape operator (α_1 < α_2). Some of the assumptions involved in this model are listed:

1. The effect of an event on p_n = Pr{escape} is manifested completely on the next trial.
2. Conditional on the value of p_n, the effect of an event is independent of the previous sequence of events.
3. Conditional on the value of p_n, the effect of avoidance (reward) on p_n is greater than the effect of escape (nonreward).
4. The reduction in p_n caused by an event is proportional to the value of p_n.
5. The proportional reduction is constant throughout learning; therefore the change in p_n induced by avoidance, for example, decreases during the course of learning.
6. The value of p_n for a subject depends only on the number of avoidances and escapes on the first n - 1 trials and not on the order in which they occurred.
7. All subjects have the same values of p_1, α_1, and α_2.

Let us consider the third assumption listed: avoidance has more effect than escape. It has already been indicated that to test this assumption simply by examining the parameter estimates for a model may be misleading. For the Solomon-Wynne data, Bush and Mosteller (1955) have found the values α_1 = 0.80, α_2 = 0.92, and p_1 = 1.00 for the linear model, confirming the third assumption. In Table 5 it is shown that estimates for the beta model may or may not confirm the assumption, depending on the method of estimation used. We wish to test this assumption without the encumbrance of all the others.

One step in this direction is to apply a more general model, in which at least some of the constraints are discarded. Hanania's work (1959) provides an example of this approach, in which she uses a linear, commutative model, but with trial-dependent operators, so that it is quasi-independent of path:

p_{n+1} = w_{n+1} p_n        if x_n = 1 (nonreward)
p_{n+1} = θ w_{n+1} p_n      if x_n = 0 (reward).     (96)

Although θ is assumed to be constant, w_n may take on a different value for each trial number. The explicit formula, after a redefinition of the parameters, is

p_n = p_1 θ^{r_n} ∏_{j=2}^{n} w_j,   (97)

where r_n is the number of rewarded trials among the first n - 1. When the w_n are all equal, this formulation reduces to the Bush-Mosteller model. The relative effect of reward and nonreward is reflected by the value of θ, for which Hanania develops statistical tests. This method is an improvement, but a good many assumptions are still needed.
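A direct way to get a feel for the model of Eq. 96 is to simulate it. The values of θ and of the trial-dependent rates w_n below are illustrative assumptions only; with all w_n equal, the recursion is the familiar two-operator linear model.

```python
import numpy as np

# Simulate p_{n+1} = w_{n+1} * p_n after nonreward (escape) and
#          p_{n+1} = theta * w_{n+1} * p_n after reward (avoidance), as in Eq. 96.
rng = np.random.default_rng(3)
n_trials = 25
w = np.full(n_trials + 1, 0.95)     # trial-dependent rates (here constant)
theta = 0.80                        # theta < 1: reward reduces p more than nonreward

p = 1.0                             # p_1, initial escape probability
trajectory = [p]
for n in range(1, n_trials):
    escape = rng.random() < p       # x_n = 1 is an escape (nonreward)
    p = w[n + 1] * p if escape else theta * w[n + 1] * p
    trajectory.append(p)
print(np.round(trajectory, 3))
```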
A more direct, if less powerful, method that requires fewer assumptions is illustrated in Figs. 15 and 16. A trial is selected on which each response is performed by about half the subjects. The subjects are divided into two groups, according to the response they perform, and learning curves are plotted separately for each group.

Fig. 15. Performance of two groups of dogs in the Solomon-Wynne avoidance-learning experiment, selected on the basis of their response on trial 7. The broken curve represents the performance of the 14 dogs who failed to avoid on trial 7. The solid curve represents the performance of the 16 dogs who avoided on trial 7.

In Fig. 15 this analysis is performed on the Solomon-Wynne shuttlebox data. Subjects are selected on the basis of their seventh response (x_7). It can be seen that animals who escape on trial 7 have a relatively high escape probability on the preceding trials. There is a positive correlation between escape on trial 7 and the number of escapes on earlier trials. One assumption is needed in order to make the desired inference: the absence of individual differences in parameter values. Individual differences alone, with no positive response effect, could produce a result of this kind; slower learners would tend to fall into the escape group on trial seven. On the other hand, if we assume no individual differences, the result strongly suggests that there is a positive response effect, confirming the third assumption. This result also casts doubt on the validity of the beta model for these data. As shown by Table 5, estimates for that model are either in conflict with the third assumption or require an absurdly low value for the initial escape probability.

Fig. 16. Performance of two groups of rats in the Galanter-Bush experiment on reversal after overlearning, selected on the basis of their response on trial 24. The broken curve represents the performance of the 10 rats that chose the unfavorable side on trial 24. The solid curve represents the performance of the nine rats that chose the favorable side on trial 24. Except for the point at trial 24, points represent average proportions for blocks of three trials.

Figure 15 also illustrates the errors in sampling that can occur if a subject-controlled event is used as a criterion for selection. This type of error was touched on briefly in Sec. 6.6. It is exemplified by an alternative method that we might have used for assessing the response effect. In this method we would compare the performance of the two subgroups on trials after the seventh. Escape on trial 7 is associated with a high escape probability on future trials. It would be an error, however, to infer from this fact alone that the seventh response caused the observed differences; the subgroups are far from identical on trials before the seventh. The method of selecting subjects on the basis of an outcome and examining their future behavior to assess the effect of the outcome must be used with caution. It can be applied either when the occurrence of the outcome is controlled by the experimenter and is independent of the state of the subject or when it can be demonstrated that the subgroups do not differ before the outcome. For an example of this type of demonstration in a model-free analysis of the effects of subject-controlled events, see Sheffield (1948).
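The split-group analysis of Figs. 15 and 16 requires nothing more than sorting subjects and averaging. A minimal sketch, assuming a hypothetical 0/1 escape matrix with one row per dog:

```python
import numpy as np

# Split subjects by their response on a chosen trial and compute a mean
# learning curve for each subgroup, as in Fig. 15.
rng = np.random.default_rng(4)
n_dogs, n_trials, split_trial = 30, 15, 7
escape = (rng.random((n_dogs, n_trials)) < np.linspace(0.9, 0.1, n_trials)).astype(int)

escaped_on_7 = escape[:, split_trial - 1] == 1
curve_escape_group = escape[escaped_on_7].mean(axis=0)   # broken curve of Fig. 15
curve_avoid_group = escape[~escaped_on_7].mean(axis=0)   # solid curve of Fig. 15
print(curve_escape_group.round(2))
print(curve_avoid_group.round(2))
```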
Figure 16 gives the result of dividing subjects on the basis of the twenty-fourth trial of an experiment on reversal after overlearning (Galanter & Bush, 1959). These data are mentioned in Sec. 4.5 (Table 3), in which the coefficient of variation of u_1 is shown to be unusually small, and in Sec. 6.4, in which the inability of a linear-operator model to fit the learning curve is discussed. The results of Fig. 16 are the reverse of those in Fig. 15: animals that make errors on trial 24 tend to make fewer errors on preceding and following trials than those that give the correct response on trial 24. This negative relationship cannot be attributed to failure of the homogeneity assumption; individual differences in parameter values would tend to produce the opposite effect. Therefore we can conclude, without having to call on even the homogeneity assumption, that in this experiment there is a negative response effect.

This result gives additional information to help choose between the linear (Galanter & Bush, 1959) and beta (Bush, Galanter, & Luce, 1959) models for the data on reversal after overlearning. Estimation for the linear model suggested a positive response effect, which had to be increased if the learning curve was to be even roughly approximated. Because of its positive response effect, the model produced a value for Var(u_1) that was far too large. For the beta model, on the other hand, estimation produces results in agreement with the analysis of Fig. 16, and the value for Var(u_1) is slightly too small, a result consistent with the existence of small individual differences that are not incorporated in the model. The conclusion seems clear that, of the two, the beta model is to be preferred.

We are in the embarrassing position of having discredited both the linear and beta models, each in a different experiment. Unfortunately we cannot conclude that one applies to rats and the other to dogs; evidence similar to that presented clearly supports the linear model for a T-maze experiment on reversal without overlearning (Galanter & Bush, 1959, Experiment III, Period 2).

It has not been mentioned that when outcomes are independent of the state of the subject a model-free analysis of their effects can be performed. As an example, let us consider a T-maze experiment with a 75:25 reward schedule. The favorable (75%) side is baited with probability 0.75, independent of the rat's response. Suppose that we wish to examine the effect of reward on response probability. Rats that choose the favorable side on a selected trial are divided into those that are rewarded on that trial and those that are not. The behavior of the subgroups can be compared on future trials, and any differences can be attributed to the reward effect. To enlarge the sample size, averaging procedures can be employed. The same comparison can be made among the rats that choose the unfavorable side on the selected trial.

Results of an analysis of this kind are shown in Fig. 17. The data are the first 20 trials of a 75:25 T-maze experiment conducted by Weinstock (1955). On each trial, n, the rats were divided into four subgroups on the basis of the response and outcome, and the number of choices during each of the next four trials was tabulated for each subgroup. These choice frequencies for all values of n, 1 ≤ n ≤ 20, were added and the proportions given in the figure were obtained.

Fig. 17. Performance after four different events in the Weinstock 75:25 T-maze experiment, as a function of the number of trials after the event. The solid (broken) curves represent performance after choice of the 75% (25%) arm. Circles (triangles) represent performance after reward (nonreward). The number of observations used for each curve is indicated in the legend: 75% arm, reward, 217; 75% arm, nonreward, 59; 25% arm, nonreward, 113; 25% arm, reward, 50.
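The four-event analysis behind Fig. 17 can be outlined in a few lines. In the sketch below, rats are classified on each trial by the (response, outcome) pair, and their choices of the favorable side on the following trials are pooled over all trials; the choice and reward matrices are hypothetical.

```python
import numpy as np

# 1 = favorable side chosen; 1 = rewarded.  Hypothetical data.
rng = np.random.default_rng(5)
n_rats, n_trials, lookahead = 24, 24, 4
choice = rng.binomial(1, 0.6, size=(n_rats, n_trials))
reward = rng.binomial(1, np.where(choice == 1, 0.75, 0.25))

sums = np.zeros((2, 2, lookahead))       # indexed by (response, outcome, lag - 1)
counts = np.zeros((2, 2, lookahead))
for n in range(n_trials - lookahead):
    for lag in range(1, lookahead + 1):
        later = choice[:, n + lag]
        for r in (0, 1):
            for o in (0, 1):
                mask = (choice[:, n] == r) & (reward[:, n] == o)
                sums[r, o, lag - 1] += later[mask].sum()
                counts[r, o, lag - 1] += mask.sum()

proportions = sums / counts              # the four curves of Fig. 17, by lag
print(np.round(proportions, 2))
```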
The results are in comforting agreement with the assumptions in several of our models. After reward, the performed response has a higher probability of occurrence than after nonreward, and this is true for both responses. In keeping with the first assumption mentioned in this section, there is no evidence of a delayed effect of the reinforcing event: the hypothesis that its full effect is manifested on the next trial cannot be rejected. On the contrary, there is a tendency for the effect to be reduced as trials proceed; this last finding would be expected if the effect of reward were less than that of nonreward.

A method that has been used to study a "negative-recency effect" in the binary prediction experiment (e.g., Jarvik, 1951; Nicks, 1959; Edwards, 1961) provides us with a final example of the testing of model assumptions in isolation. The assumption in question is that the direction in which p_n is changed by an event is the same, regardless of the value of p_n and the sequence of prior events.²⁹

Consider the prediction experiment in terms of four experimenter-subject controlled events, as presented in Table 1. On a trial on which O_1 occurs, either A_1 or A_2 may occur, so that two events are possible. Moreover, which of these two events occurs on a trial depends on the state of the subject. Separate analysis of their effects is therefore difficult, as explained earlier in this section. Fortunately, we are willing to assume that both events have effects that are in the same direction: if either (A_1 O_1) or (A_2 O_1) occurs on trial n, then p_{n+1} > p_n. This assumption allows us to perform the test by averaging over all subjects that experienced O_1 on trial n, regardless of their response, and thus examining the average effect of O_1 on the average p-value. This is equivalent to examining an average of the effects of the two events (A_1 O_1) and (A_2 O_1), and therefore the test lacks power: if only one of the events violates the assumption in question, the test can fail to detect the violation.³⁰

The variable of interest is the length of the immediately preceding tuple of O_1's. Does the direction of the effect of O_1 on Pr{A_1} depend on this length? The method involves averaging the proportion of A_1 responses on trials after all j-tuples of O_1's in the outcome sequence for various values of j and considering the average proportion as a function of j. The results of such an analysis, for O_2's as well as O_1's (Nicks, 1959), are given in Fig. 18. The data are from a 380-trial, 67:33 prediction experiment, and the analysis is performed separately for each quarter of the outcome sequence. Under the assumption in question, and in the light of the fact that a 1-tuple of O_1's (an O_1 that is preceded by an O_2) markedly increases Pr{A_1}, all curves in the figure should have slopes that are uniformly nonnegative. Such is not the case, and the results lead us to reject the assumption.³¹ If we assume in this experiment that the effects of events sharing the same outcome are in the same direction, then the direction of the effect of an event appears to depend in a complicated way on the prior sequence of events.

Fig. 18. Proportion of subjects predicting "green" immediately after outcome runs of various lengths of green (G) and red (R) lights in Nicks' 67:33 binary prediction experiment. Separate curves are presented for the four quarters of the 380-trial sequence. After Nicks, 1959, Fig. 3.

29. As mentioned in Sec. 2.3, simple models exist in which the direction of the effect of an event depends on the value of p_n. Such models have seldom been applied, however.
30. If we assume that the experiment consists of two experimenter-controlled events, then this criticism does not apply; however, this is precisely the sort of additional assumption that we do not wish to make.
31. Using this type of analysis of performance in a prediction experiment after 520 trials of practice, Edwards (1961) obtained results favorable to the assumption.
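The run-length analysis itself is straightforward to compute. The sketch below, with a hypothetical outcome sequence and hypothetical response matrix, tabulates the proportion of A_1 responses as a function of the length of the immediately preceding run of O_1's.

```python
import numpy as np

# For each j, find every trial preceded by a run of exactly j O1's and average
# the subjects' A1 proportions on that trial.  All data here are hypothetical.
rng = np.random.default_rng(6)
n_subjects, n_trials = 40, 380
outcomes = rng.binomial(1, 0.67, size=n_trials)                  # 1 = O1, 0 = O2
responses = rng.binomial(1, 0.6, size=(n_subjects, n_trials))    # 1 = A1

prop_after_run = {}
for n in range(1, n_trials):
    j, k = 0, n - 1
    while k >= 0 and outcomes[k] == 1:       # length of preceding run of O1's
        j += 1
        k -= 1
    if j > 0:
        prop_after_run.setdefault(j, []).append(responses[:, n].mean())

for j in sorted(prop_after_run):
    print(j, round(np.mean(prop_after_run[j]), 3))
```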
7. CONCLUSION

Implicit in these pages are two alternative views of the place of stochastic models in the study of learning. The first view is that a model furnishes a sophisticated statistical method for drawing inferences about the effects of individual trial events in a learning experiment or for providing descriptive statistics of the data. The method has to be sophisticated because the problem is difficult: the time-series in question is usually nonstationary³² and involves only a small number of observations per subject; if the observations are discrete rather than continuous, each one gives us little information. The use of a model, then, can be thought of as a method of combining data, of averaging, in a process with trend. The model is not expected to fit exactly even the most refined experiment; it is simply a tool.

The second view is that a model, or a family of models, is a mathematical representation of a theory about the learning process. In this case our focus shifts from features of a particular set of data to the extent to which a model describes those data and to the variety of experiments the model can describe. We become more concerned with the assumptions that give rise to the model and with crucial experiments or discriminating statistics to use in its evaluation. We attempt to refine our experiments so that the process purportedly described by the model can be observed in its pure form.

Whether we are concerned more with describing particular aspects of the process or more with evaluating an over-all theory, many of the fundamental questions that arise about learning can be answered only by the use of explicit models. The use of models, however, does not automatically produce easy answers.

32. In an experiment in which there is no over-all trend (i.e., V_{1,n} and V_{2,n} are constant), statistical methods for the analysis of stationary time series can be used (see, e.g., Hannan, 1960, and Cox, 1958). Interesting models for such experiments have been developed and applied by Cane (1959, 1961).

References

Anderson, N. H. An analysis of sequential dependencies. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory.
Stanford: Stanford Univer. Press, 1959. Pp. 248-264. Anderson, N. H. Effect of first-order conditional probability in a two-choice learning situation. J. exp. PsychoL, 1960, 59, 73-93. Anderson, N. H., & Grant, D. A. Correction and reanalysis. J. exp. Psychol, 1958, 56, 453-454. Anscombe, F. J. On estimating binomial response relations. Biometrika, 1956, 43, 461-464. Audley, R. J. A stochastic description of the learning behavior of an individual subject. Quart. J. exp. PsychaL, 1957, 9, 12-20. Audley, R. J., & Jonckheere, A. R. The statistical analysis of the learning process. Brit. J. Stat. PsychoL, 1956, 9, 87-94. Bailey, N. T. J. Some problems in the statistical analysis of epidemic data. /. Roy. Stat. Soc. (B\ 1955, 17, 35-57. Barucha-Reid, A. T. Elements of the theory of Markov processes and their applications. New York: McGraw-Hill, 1960. Behrend, E. R., & Bitterman, M. E. Probability-matching in the fish. Amer. J. PsychoL, 1961,74,542-551. Berkson, J. Maximum likelihood and minimum %z estimates of the logistic function. /. Amer. Stat. Assoc., 1955, 50, 130-162. Berkson, J. Tables for the maximum-likelihood estimation of the logistic function. Biometrics, 1957, 13, 28-34. Bush, R. R. A block-moment method of estimating paratneters in learning models. Unpublished manuscript, 1955. IlB STOCHASTIC LEARNING THEORY Bush, R. R. Sequential properties of linear models. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 215-227. Bush, R. R. Some properties of Luce's beta model for learning. In K. J. Arrow, S. Karlin, & P. Suppes (Eds.), Mathematical methods in the social sciences, 1959. Stanford: Stanford Univer. Press, 1960. Pp. 254-264. (a) Bush, R. R. A survey of mathematical learning theory. In R. D. Luce (Ed.), Develop- ments in mathematical psychology. Glencoe, 111. Free Press, 1960. Pp. 123-170. (b) Bush, R. R., & Estes, W. K. (Eds.). Studies in mathematical learning theory. Stanford : Stanford Univer. Press, 1959. Bush, R. R., Galanter, E.,x& Luce, R. D. Tests of the "beta model." In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 381-399. Bush, R. R., & Mosteller, F. A mathematical model for simple learning. Psycho I. Rev., 1951, 58, 313-323. Bush, R. R., & Mosteller, F. Stochastic models for learning. New York: Wiley, 1955. Bush, R. R., & Mosteller, F. A comparison of eight models. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 293-307. Bush, R. R., Mosteller, F., & Thompson, G. L. A formal structure for multiple-choice situations. In R. M. Thrall, C. H. Coombs, & R. L. Davis (Eds.), Decision processes. New York: Wiley, 1954. Pp. 99-126. Bush, R. R., & Sternberg, S. H. A single-operator model. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 204-214. Bush, R. R., & Wilson, T. R. Two-choice behavior of paradise fish. /. exp. PsychoL, 1956,51,315-322. Cane, Violet R. Behaviour sequences as semi-Markov chains. J. Roy. Stat. Soc. (B), 1959, 21, 36-58, Cane, Violet R. Review of R. D. Luce, Individual Choice Behavior. J. Roy. Stat. Soc. (A), 1960, 22, 486-488. Cane, Violet R. Some ways of describing behavior. In W. H. Thorpe & O. L. Zangwill (Eds.), Current problems in animal behavior. Cambridge: Cambridge Univer. Press, 1961. Pp. 361-388. Cox, D. R. The regression analysis of binary sequences. /. Roy. 
Stat. Soc. (B), 1958, 20, 215-232. Edwards, W. Reward probability, amount, and information as determiners of se- quential two-alternative decisions, /, exp. PsychoL, 1956, 52, 177-188. Edwards, W. Probability learning in 1000 trials. J. exp. PsychoL, 1961, 62, 385-394. Estes, W. K. Toward a statistical theory of learning. PsychoL Rev., 1950, 57, 94-107. Estes, W. K. The statistical approach to learning theory. In S. Koch (Ed.), Psy- chology: a study of a science. Vol. II. General systematic formulations, learning, and special processes. New York: McGraw-Hill, 1959. Pp. 380-491. Estes, W. K. Learning theory. Ann. Rev. PsychoL, 1962, 13, 107-144. Estes, W. K., & Straughan, J. H. Analysis of a verbal conditioning situation in terms of statistical learning theory. /. exp. PsychoL, 1954, 47, 225-234. Estes, W. K., & Suppes, P. Foundations of linear models. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 137-179. REFERENCES 1/9 Feldman, J., & Newell, A. A note on a class of probability matching models. Psycho- metrika, 1961, 26, 333-337. Feller, W. An introduction to probability theory and its applications, second edition. New York: Wiley, 1957. Galanter, E., & Bush, R. R. Some T-maze experiments. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 265-289. Goldberg, S. Introduction to difference equations. New York: Wiley, 1958. Gulliksen, H. A rational equation of the learning curve based on Thorndike's law of effect. J.gen. Psycho!., 1934, 11, 395-^34. Hanania, Mary I. A generalization of the Bush-Mosteller model with some significance tests. Psychometrika, 1959, 24, 53-68. Hannan, E. J. Time series analysis. London: Methuen, 1960. Hayes, K. J., & Pereboom, A. C. Artifacts in criterion-reference learning curves. Psychol. Rev., 1959, 66, 23-26. Hodges, J. L., Jr. Fitting the logistic by maximum likelihood. Biometrics, 1958, 14, 453-461. Hull, C. L. Principles of behavior. New York: Appleton-Century Crofts, 1943. Hull, C. L. A behavior system. New Haven: Yale Univer. Press, 1952. Irwin, F. W. On desire, aversion, and the affective zero. Psychol. Rev., 1961, 68, 293-300. Jarvik, M. E. Probability learning and a negative recency effect in the serial anticipation of alternating symbols. /. exp. Psychol., 1951, 41, 291-297. Kanal, L. Analysis of some stochastic processes arising from a learning model. Unpublished doctoral thesis, Univer. Penn., 1960. Kanal, L. A functional equation analysis of two learning models. Psychometrika, 1962,27,89-104. (a) Kanal, L. The asymptotic distribution for the two-absorbing-barrier beta model. Psychometrika, 1962, 27, 105-109. (b) Karlin, S. Some random walks arising in learning models. Pacific J. Math., 1953, 3, 725-756. Kendall, D. G. Stochastic processes and population growth. /. Roy. Stat. Soc. (B), 1949, 11, 230-264. Lamperti, J., & Suppes, P. Some asymptotic properties of Luce's beta learning model. Psychometrika, 1960, 25, 233-241. Logan, F. A. Incentive. New Haven: Yale Univer. Press, 1960. Luce, R. D. Individual choice behavior. New York: Wiley, 1959. Luce, R. D. Some one-parameter families of commutative learning operators. In R. C. Atkinson (Ed.), Studies in mathematical psychology, 1963. Stanford : Stanford Univer. Press, 1963, in press. Miller, G. A., & McGill, W. J. A statistical description of verbal learning. Psycho- metrika, 1952, 17, 369-396. Mosteller, F. Stochastic learning models. In Proc. 
Third Berkeley Symp. Math. Stat. Probability, 1955, 5. Pp. 151-167. Mosteller, F., & Tatsuoka, M. Ultimate choice between two attractive goals: pre- dictions from a model. Psychometrika, 1960, 25, 1-17. Nicks, D. C. Prediction of sequential two-choice decisions from event runs. /. exp. Psychol., 1959, 57, 105-114. Restle, F. A theory of discrimination learning. Psychol. Rev., 1955, 62, 11-19. Restle, F. Psychology of judgment and choice. New York: Wiley, 1961. I2Q STOCHASTIC LEARNING THEORY Sheffield, F. D. Avoidance training and the contiguity principle. J. comp, physiol. PsychoL, 1948, 41, 165-177. Sternberg, S. H. A path-dependent linear model. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning- theory. Stanford: Stanford Univer. Press, 1959. Pp. 308-339. (a) Sternberg, S. H. Application of four models to sequential dependence in human learning. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning- theory. Stanford: Stanford Univer. Press: 1959. Pp. 340-380. (b) Tatsuoka, M., & Mosteller, F. A commuting-operator model. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 228-247. Thurstone, L. L. The learning function. J.gen. PsychoL, 1930, 3, 469-491. Underwood, B. J. A graphical description of rote learning. PsychoL Rev., 1957, 64, 119-122. Wald, A. Asymptotic properties of the maximum-likelihood estimate of an unknown parameter of a discrete stochastic process. Ann. math. Stat., 1948, 19, 40-46. Weinstock, S. Unpublished experiment, 1955. Wilks, S. S. Mathematical statistics. New York: Wiley, 1962. IO Stimulus Sampling Theory Richard C. Atkinson and William K. Estes Stanford University 1. Preparation of this chapter was supported in part by Contract Nonr-908(16] between the Office of Naval Research and Indiana University, Grant M-5 184 from the National Institute of Mental Health to Stanford University, and Contract Nonr-225(17] between the Office of Naval Research and Stanford University. Contents 1. One-Element Models 125 1.1. Learning of a single stimulus response association, 126 1.2. Paired-associate learning, 128 1.3. Probabilistic reinforcement schedules, 141 2. Multi-Element Pattern Models 153 2.1. General formulation, 153 2.2. Treatment of the simple noncontingent case, 162 2.3. Analysis of a paired-comparison learning experiment, 181 3. A Component Model for Stimulus Compounding and Generalization 191 3.1. Basic concepts; conditioning and response axioms, 191 3.2. Stimulus compounding, 193 3.3. Sampling axioms and major response theorem of fixed sample size model, 198 3.4. Interpretation of stimulus generalization, 200 4. Component and Linear Models for Simple Learning 206 4.1. Component models with fixed sample size, 207 4.2. Component models with stimulus fluctuation, 219 4.3. The linear model as a limiting case, 226 4.4. Applications to multiperson interactions, 234 5. Discrimination Learning 238 5.1. The pattern model for discrimination learning, 239 5.2. A mixed model, 243 5.3. Component models, 249 5.4. Analysis of a signal detection experiment, 250 5.5. Multiple-process models, 257 References 265 Stimulus Sampling Theory Stimulus sampling theory is concerned with providing a mathematical language in which we can state assumptions about learning and per- formance in relation to stimulus variables. 
A special advantage of the formulations to be discussed is that their mathematical properties permit application of the simple and elegant theory of Markov chains (Feller, 1957; Kemeny, Snell, & Thompson, 1957; Kemeny & Snell, 1959) to the tasks of deriving theorems and generating statistical tests of the agreement between assumptions and data. This branch of learning theory has developed in close interaction with certain types of experimental analyses ; consequently it is both natural and convenient to organize this presentation around the theoretical treatments of a few standard reference experiments. At the level of experimental interpretation most contemporary learning theories utilize a common conceptualization of the learning situation in terms of stimulus, response, and reinforcement. The stimulus term of this triumvirate refers to the environmental situation with respect to which behavior is being observed, the response term to the class of observable behaviors whose measurable properties change in some orderly fashion during learning, and the reinforcement term to the experimental operations or events believed to be critical in producing learning. Thus, in a simple paired-associate experiment concerned with the learning of English equivalents to Russian words, the stimulus might consist in presentation of the printed Russian word alone, the response measure in the relative frequency with which the learner is able to supply the English equivalent from memory, and reinforcement in paired presentation of the stimulus and response words. In other chapters of this Handbook, and in the general literature on learning theory, the reader will encounter the notions of sets of responses and sets of reinforcing events. In the present chapter mathematical sets are used to represent certain aspects of the stimulus situation. It should be emphasized from the outset, however, that the mathematical models to be considered are somewhat abstract and that the empirical interpretations of stimulus sets and their elements are not to be considered fixed and immu- table. Two main types of interpretations are discussed: in one the empirical correspondent of a stimulus element is the full pattern of stimulation effective on a given trial; in the other the correspondent of an 123 124 STIMULUS SAMPLING THEORY element is a component, or aspect, of the full pattern of stimulation. In the first, we speak of "pattern models" and in the second, of "component models" (Estes, 1959V). There are a number of ways in which characteristics of the stimulus situation are known to affect learning and transfer. Rates and limits of conditioning and learning generally depend on stimulus magnitude, or intensity, and on stimulus variability from trial to trial. Retention and transfer of learning depend on the similarity, or communality, between the stimulus situations obtaining during training and during the test for retention or transfer. These aspects of the stimulus situation can be given direct and natural representations in terms of mathematical sets and relations between sets. The basic notion common to all stimulus sampling theories is the conceptualization of the totality of stimulus conditions that may be effective during the course of an experiment in terms of a mathematical set. Although it is not a necessary restriction, it is convenient for mathe- matical reasons to deal only with finite sets, and this limitation is assumed throughout our presentation. 
Stimulus variability is taken into account by assuming that of the total population of stimuli available in an experi- mental situation generally only a part actually affects the subject on any one trial. Translating this idea into the terms of a stimulus sampling model, we may represent the total population by a set of "stimulus ele- ments" and the stimulation effective on any one trial by a sample from this set. Many of the simple mathematical properties of the models to be discussed arise from the assumption that these trial samples are drawn randomly from the population, with all samples of a given size having equal probabilities. Although it is sometimes convenient and suggestive to speak in such terms, we should not assume that the stimulus elements are to be identified with any simple neurophysiological unit, as, for example, receptor cells. At the present stage of theory construction we mean to assume only that certain properties of the set-theoretical model represent certain properties of the process of stimulation. If these assump- tions prove to be sufficiently well substantiated when the model is tested against behavioral data, then it will be in order to look for neurophysio- logical variables that might underlie the correspondences. Just as the ratio of sample size to population size is a natural way of representing stimulus variability, sample size per se may be taken as a correspondent of stimulus intensity, and the amount of overlap (i.e., proportion of common elements) between two stimulus sets may be taken to represent the degree of communality between two stimulus situations. Our concern in this chapter is not to survey the rapidly developing area of stimulus sampling theory but simply to present some of the fundamental ONE-ELEMENT MODELS mathematical techniques and illustrate their applications. For general background the reader is referred to Bush (1960), Bush & Estes (1959), Estes (1959a, 1962), and Suppes & Atkinson (1960). We shall consider first, and in some detail, the simplest of all learning models — the pattern model for simple learning. In this model the population of available stimuli is assumed to comprise a set of distinct stimulus patterns, exactly one of which is sampled on each trial. In the important special case of the one-element model it is assumed that there is only one such pattern and that it recurs intact at the beginning of each experimental trial. Granting that the one-element model represents a radical idealization of even the most simplified conditioning situations, we shall find that it is worthy of study not only for expositional purposes but also for its value as an analytic device in relation to certain types of learning data. After a relatively thorough treatment of pattern models for simple acquisi- tion and for learning under probabilistic reinforcement schedules, we shall take up more briefly the conceptualization of generalization and transfer; component models in which the patterns of stimulation effective on individual trials are treated not as distinct elements but as overlapping samples from a common population; and, finally, some examples of the more complex multiple-process models that are becoming increasingly important in the analysis of discrimination learning, concept formation, and related phenomena. 1. ONE-ELEMENT MODELS We begin by considering some one-element models that are special cases of the more general theory. 
These examples are especially simple mathe- matically and provide us with the opportunity to develop some mathe- matical tools that will be necessary in later discussions. Application of these models is appropriate when the stimulus situation is sufficiently stable from trial to trial that it may be theoretically represented (to a good approximation) by a single stimulus element which is sampled with proba- bility 1 on each trial. At the start of a trial the element is in one of several possible conditioning states ; it may or may not remain in this conditioning state, depending on the reinforcing event for that trial. In the first part of this section we consider a model for paired-associate learning. In the second part we consider a model for a two-choice learning situation involving a probabilistic reinforcement schedule. The models generate some predictions that are undoubtedly incorrect, except possibly under ideal experimental conditions; nevertheless, they provide a useful intro- duction to more general cases which we pursue in Section 2. 126 STIMULUS SAMPLING THEORY 1.1 Learning of a Single Stimulus-Response Association Imagine the simplest possible learning situation. A single stimulus pattern, 5, is to be presented on each of a series of trials and each trial is to terminate with reinforcement of some designated response, the "correct response" in this situation. According to stimulus sampling theory, learning occurs in an all-or-none fashion with respect to 5. 1. If the correct response is not originally conditioned to ("connected to") S, then, until learning occurs, the probability of the correct response is zero. 2. There is a fixed probability c that the reinforced response will become conditioned to S on any trial. 3. Once conditioned to S, the correct response occurs with probability 1 on every subsequent trial. These assumptions constitute the simplest case of the "one-element pattern model." Learning situations that completely meet the specifica- tions laid down above are as unlikely to be realized in psychological experiments as perfect vacuums or frictionless planes in the physics laboratory. However, reasonable approximations to these conditions can be attained. The requirement that the same stimulus pattern be reproduced on each trial is probably fairly well met in the standard paired-associate experiment with human subjects. In one such experiment, conducted in the laboratory of one of the writers (W. K. E.), the stimulus member of each item was a trigram and the correct response an English word, for example, S R . xvk house On a reinforced trial the stimulus and response members were exposed together, as shown. Then, after several such items had received a single reinforcement, each of the stimuli was presented alone, the subject being instructed to give the correct response from memory, if he could. Then each item was given a second reinforcement, followed by a second test, and so on. According to the assumptions of the one-element pattern model, a subject should be expected to make an incorrect response on each test with a given stimulus until learning occurs, then a correct response on every subsequent trial; if we represent an error by a 1 and a correct response by a 0, the protocol for an individual item over a series of trials should, then, consist in a sequence of O's preceded in most cases by a sequence of 1's. 
Actual protocols for several subjects are shown below:

a  0000000000
b  1111111111
c  1000000000
d  0000000000
e  1100000000
f  1100000000
g  1111100000
h  1000000100
i  1111011000

The first seven of these correspond perfectly to the idealized theoretical picture; the last two deviate slightly. The proportion of "fits" and "misfits" in this sample is about the same as in the full set of 80 cases from which the sample was taken. The occasional lapses, that is, errors following correct responses, may be symptomatic of a forgetting process that should be incorporated into the theory, or they may be simply the result of minor uncontrolled variables in the experimental situation which are best ignored for theoretical purposes. Without judging this issue, we may conclude that the simple one-element model at least merits further study.

Before we can make quantitative predictions we need to know the value of the conditioning parameter c. Statistical learning theory includes no formal axioms that specify precisely what variables determine the value of c, but on the basis of considerable experience we can safely assume that this parameter will vary with characteristics of the populations of subjects and items represented in a particular experiment. An estimate of the value of c for the experiment under consideration is easy to come by. In the full set of 80 cases (40 subjects, each tested on two items) the proportion of correct responses on the test given after a single reinforcement was 0.39. According to the model, the probability is c that a reinforced response will become conditioned to its paired stimulus; consequently c is the expected proportion of successful conditionings out of 80 cases, and therefore the expected proportion of correct responses on the subsequent test. Thus we may simply take the observed proportion 0.39 as an estimate of c.

In order to test the model, we need now to derive theoretical expressions for other aspects of the data. Suppose we consider the sequences of correct and incorrect responses, 000, 001, etc., on the first three trials. According to the model, a correct response should never be followed by an error, so the probability of the sequence 000 is simply c, and the probabilities of 001, 010, 011, and 101 are all zero. To obtain an error on the first trial followed by a correct response on the second, conditioning must fail on the first reinforcement but occur on the second, and this joint event has probability (1 - c)c. Similarly, the probability that the first correct response will occur on the third trial is given by (1 - c)²c and the probability of no correct response in three trials by (1 - c)³. Substituting the estimate 0.39 for c in each of these expressions, we obtain the predicted values, which are compared with the corresponding empirical values for this experiment in Table 1.

Table 1  Observed and Predicted (One-Element Model) Values for Response Sequences Over First Three Trials of a Paired-Associate Experiment

Sequence*   Observed Proportions   Theoretical Proportions
000              0.36                    0.39
001              0.02                    0
010              0.01                    0
011              0                       0
100              0.27                    0.24
101              0                       0
110              0.11                    0.14
111              0.23                    0.23

* 0 = correct response; 1 = error.

The correspondences are seen to be about as close as could be expected with proportions based on 80 response sequences.
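The theoretical column of Table 1 follows directly from the estimate c = 0.39, as the brief script below illustrates; only the value of c is taken from the text, and the script is merely a check on the arithmetic.

```python
# Theoretical probabilities of the response sequences in Table 1 under the
# one-element model, using the estimate c = 0.39.
c = 0.39

theory = {
    "000": c,                       # conditioned on the first reinforcement
    "001": 0.0, "010": 0.0,         # an error can never follow a correct response
    "011": 0.0, "101": 0.0,
    "100": (1 - c) * c,             # first success on trial 2
    "110": (1 - c) ** 2 * c,        # first success on trial 3
    "111": (1 - c) ** 3,            # no success in three trials
}

for seq, prob in theory.items():
    print(seq, round(prob, 2))
```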
1.2 Paired-Associate Learning

In order to apply the one-element model to paired-associate experiments involving fixed lists of items, it is necessary to adjust the "boundary conditions" appropriately. Consider, for example, an experiment reported by Estes, Hopkins, and Crothers (1960). The task assigned their subjects was to learn associations between the numbers 1 through 8, serving as responses, and eight consonant trigrams, serving as stimuli. Each subject was given two practice trials and two test trials. On the first practice trial the eight syllable-number pairs were exhibited singly in a random order. Then a test was given, the syllables alone being presented singly in a new random order and the subjects attempting to respond to each syllable with the correct number. Four of the syllable-number pairs were presented on a second practice trial, and all eight syllables were included in a final test trial.

In writing an expression for the probability of a correct response on the first test in this experiment, we must take account of the fact that, after the first practice trial, the subjects knew that the responses were the numbers 1 to 8 and were in a position to guess at the correct answers when shown syllables that they had not yet learned. The minimum probability of achieving a correct response to an unlearned item by guessing would be 1/8. Thus we would have for $p_0$, the probability of a correct response on the first test,

$$p_0 = c + \frac{1-c}{8},$$

that is, the probability c that the correct association was formed plus the probability $(1-c)/8$ that the association was not formed but the correct response was achieved by guessing. Setting this expression equal to the observed proportion of correct responses on the first trial for the twice-reinforced items, we readily obtain an estimate of c for these experimental conditions,

$$0.404 = c + (1-c)(0.125),$$

and so ĉ = 0.32.

Now we can proceed to derive expressions for the joint probabilities of various combinations of correct and incorrect responses on the first and second tests for the twice-reinforced items. For the probability of correct responses to a given item on both tests, we have

$$p_{00} = c + (1-c)(0.125)c + (1-c)^2(0.125)^2.$$

With probability c, conditioning occurs on the first reinforced trial, and then correct responses necessarily occur on both tests; with probability $(1-c)c(0.125)$, conditioning does not occur on the first reinforced trial but does on the second, and a correct response is achieved by guessing on the first test; with probability $(1-c)^2(0.125)^2$, conditioning occurs on neither reinforced trial but correct responses are achieved by guessing on both tests. Similarly, we obtain

$$p_{01} = (1-c)^2(0.875)(0.125),$$
$$p_{10} = (1-c)(0.875)[c + (1-c)(0.125)],$$

and

$$p_{11} = (1-c)^2(0.875)^2.$$

Substituting for c in these expressions the estimate computed above, we arrive at the predicted values which we compare with the corresponding observed values below.

             Observed    Predicted
    p_00     0.35        0.35
    p_01     0.05        0.05
    p_10     0.27        0.24
    p_11     0.33        0.35

Although this comparison reveals some disparities, which we might hope to reduce with a more elaborate theory, it is surprising, to the writers at least, that the patterns of observed response proportions in both experiments considered can be predicted as well as they are by such an extremely simple model.
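The estimate of c and the four joint probabilities can be verified in a few lines (illustrative only; the variable names are ours):

```python
# Solve 0.404 = c + (1 - c)/8 for c, then compute the joint test probabilities.
g = 0.125                          # guessing probability with 8 response alternatives
c = (0.404 - g) / (1 - g)          # estimate of the conditioning parameter
p00 = c + (1 - c) * g * c + (1 - c) ** 2 * g ** 2
p01 = (1 - c) ** 2 * g * (1 - g)
p10 = (1 - c) * (1 - g) * (c + (1 - c) * g)
p11 = (1 - c) ** 2 * (1 - g) ** 2
print(round(c, 2), [round(p, 2) for p in (p00, p01, p10, p11)])
# -> 0.32 [0.35, 0.05, 0.24, 0.35]
```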
Ordinarily, experiments concerned with paired-associate learning are not limited to a couple of trials, like those just considered, but continue until the subjects meet some criterion of learning. Under these circumstances it is impractical to derive theoretical expressions for all possible sequences of correct and incorrect responses. A reasonable goal, instead, is to derive expressions for various statistics that can be conveniently computed for the data of the standard experiment; examples of such statistics are the mean and variance of errors per item, frequencies of runs of errors or correct responses, and serial correlation of errors over trials with any given lag. Bower (1961, 1962) carried out the first major analysis of this type for the one-element model. We shall use some of his results to illustrate application of the model to a full "learning-to-criterion" experiment. Essential details of his experiment are as follows: a list of 10 items was learned by 29 undergraduates to a criterion of two consecutive errorless trials. The stimuli were different pairs of consonant letters and the responses were the integers 1 and 2; each response was assigned as correct to a randomly selected five items for each subject. A response was obtained from the subject on each presentation of an item, and he was informed of the correct answer following his response.

As in the preceding application, we shall assume that each item in the list is to be represented theoretically by exactly one stimulus element, which is sampled with probability 1 when the item is presented, and that the correct response to that item is conditioned in an all-or-none fashion. On trial n of the experiment an element is in one of two "conditioning states": in state C the element is conditioned to the correct response; in state $\bar{C}$ the element is not conditioned. The response the subject makes depends on his conditioning state. When the element is in state C, the correct response occurs with probability 1. The probability of the correct response when the element is in state $\bar{C}$ depends on the experimental procedure. In Bower's experiment the subjects were told the r responses available to them and each occurred equally often as the to-be-learned response. Therefore we may assume that in the unconditioned state the probability of a correct response is 1/r, where r is the number of alternative responses.

The conditioning assumptions can readily be restated in terms of the conditioning states:

1. On any reinforced trial, if the sampled element is in state $\bar{C}$, it has probability c of going into state C.
2. The parameter c is fixed in value in a given experiment.
3. Transitions from state C to state $\bar{C}$ have probability zero.

We shall now derive some predictions from the model and compare these with observed data. The data of particular interest will be a subject's sequence of correct and incorrect responses to a specific stimulus item over trials. Similarly, in deriving results from the model we shall consider only an isolated stimulus item and its related sequence of responses. However, when we apply the model to data, we assume that all items in the list are comparable, that is, all items have the same conditioning parameter c and all items start out in the same conditioning state ($\bar{C}$). Consequently the response sequence associated with any given item is viewed as a sample of size 1 from a population of sequences all generated by the same underlying process.
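As a preview of the quantitative analysis that follows, a Monte Carlo sketch of this two-state process is given below. Everything in it (the function name, the trial count, and the value 0.344 used for c) is an illustrative assumption; the mean number of errors per item that it produces can be compared with the closed-form expectation derived in the next paragraphs.

```python
import random

def simulate_bower_item(c, r=2, n_trials=25, rng=None):
    """Error sequence (1 = error, 0 = correct) for one item under the
    one-element model: guessing with probability 1/r in the unconditioned
    state, all-or-none conditioning with probability c on each trial."""
    rng = rng or random.Random()
    conditioned = False
    seq = []
    for _ in range(n_trials):
        if conditioned:
            seq.append(0)
        else:
            seq.append(0 if rng.random() < 1 / r else 1)   # guess in state C-bar
            if rng.random() < c:                           # conditioning on feedback
                conditioned = True
    return seq

rng = random.Random(7)
items = [simulate_bower_item(0.344, rng=rng) for _ in range(2000)]
print(sum(sum(s) for s in items) / len(items))   # mean total errors per item
```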
A feature of this model which makes it especially tractable for purposes of deriving various statistics is the fact that the sequences of transitions between states C and $\bar{C}$ constitute a Markov chain. This means that, given the state on any one trial, we can specify the probability of each state on the next trial without regard to the previous history. If we represent by $C_n$ and $\bar{C}_n$ the events that an item is in the conditioned or unconditioned state, respectively, on trial n, then the conditioning assumptions yield the transition matrix

$$\begin{matrix} & C_{n+1} & \bar{C}_{n+1} \\ C_n & 1 & 0 \\ \bar{C}_n & c & 1-c \end{matrix}$$

Since every item starts in the unconditioned state, the probability that it is still unconditioned on trial n is $\Pr(\bar{C}_n) = (1-c)^{n-1}$, which approaches 0 as n increases, provided c > 0. Thus with probability 1 the subject is eventually to be found in the conditioned state.

Next we prove some theorems about the observable sequence of correct and incorrect responses in terms of the underlying sequence of unobservable conditioning states. We define the response random variable

$$A_n = \begin{cases} 0 & \text{if a correct response occurred on trial } n, \\ 1 & \text{if an error occurred on trial } n. \end{cases}$$

By our assumed response rule the probabilities of an error, given that the subject is in the conditioned or unconditioned state, respectively, are

$$\Pr(A_n = 1 \mid C_n) = 0 \quad\text{and}\quad \Pr(A_n = 1 \mid \bar{C}_n) = 1 - \frac{1}{r}.$$

To obtain the probability of an error on trial n, namely $\Pr(A_n = 1)$, we sum these conditional probabilities weighted by the probabilities of being in the respective states:

$$\Pr(A_n = 1) = \Pr(A_n = 1 \mid C_n)\Pr(C_n) + \Pr(A_n = 1 \mid \bar{C}_n)\Pr(\bar{C}_n) = \Bigl(1 - \frac{1}{r}\Bigr)(1-c)^{n-1}. \qquad (1)$$

Consider next the infinite sum of the random variables $A_1, A_2, A_3, \ldots$, which we denote A; specifically,

$$A = \sum_{n=1}^{\infty} A_n.$$

But

$$E(A) = \sum_{n=1}^{\infty} E(A_n) = \Bigl(1 - \frac{1}{r}\Bigr)\sum_{n=1}^{\infty}(1-c)^{n-1} = \frac{1 - (1/r)}{c}. \qquad (2)$$

Thus the number of errors expected during the learning of any given item is given by Eq. 2.

Equation 2 provides an easy method for estimating c. For any given subject we can obtain his average number of errors over stimulus items, equate this number to the right-hand side of Eq. 2 with r = 2, and solve for c. We thereby obtain an estimate of c for each subject, and intersubject differences in learning are reflected in the variability of these estimates. Bower, in analyzing his data, chose to assume that c was the same for all subjects; thus he set E(A) equal to the observed number of errors averaged over both list items and subjects and obtained a single estimate of c. This group estimate of c simplifies the computations involved in generating predictions. However, it has the disadvantage that a discrepancy between observed and predicted values may arise as a consequence of assuming equal c's when, in fact, the theory is correct but c varies from subject to subject. Fortunately, Bower has obtained excellent agreement between theory and observation using the group estimate of c and, for the particular conditions he investigated, any increase in precision that might be achieved by individual estimates of c does not seem crucial.

For the experiment described above, Bower reports 1.45 errors per stimulus item averaged over all subjects. Equating E(A) in Eq. 2 to 1.45, with r = 2, we obtain the estimate c = 0.344. All predictions that we derive from the model for this experiment will be based on this single estimate of c. It should be remarked that the estimate of c in terms of Eq. 2 represents only one of many methods that could have been used. The method one selects depends on the properties of the particular estimator (e.g., whether the estimator is unbiased and efficient in relation to other estimators). Parameter estimation is a theory in its own right, and we shall not be able to discuss the many problems involved in the estimation of learning parameters. The reader is referred to Suppes & Atkinson (1960) for a discussion of various methods and their properties.
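A short numerical restatement of this estimation step (illustrative code; the rounding convention is ours):

```python
# Estimate c from Bower's mean total errors (Eq. 2), then evaluate Eq. 1.
mean_errors = 1.45
r = 2
c = (1 - 1 / r) / mean_errors          # Eq. 2 solved for c -> 0.345 (text: 0.344)
error_curve = [(1 - 1 / r) * (1 - c) ** (n - 1) for n in range(1, 14)]
print(round(c, 3))
print([round(p, 3) for p in error_curve])   # predicted Pr(error on trial n), as in Fig. 1
```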
Associated with this topic is the problem of assessing the statistical agreement between data and theory (i.e., the goodness of fit between predicted and observed values) once parameters have been estimated. In our analysis of data in this chapter we offer no statistical evaluation of the predictions but simply display the results for the reader's inspection. Our reason is that we present the data only to illustrate features of the theory and its application; these results are not intended to provide a test of the model. However, in rigorous analyses of such models the problem of goodness of fit is extremely important and needs careful consideration. Here again the reader is referred to Suppes & Atkinson (1960) for a discussion of some of the problems and possible statistical tests.

By using Eq. 1 with the estimate of c obtained above we have generated the predicted learning curve presented in Fig. 1. The fit is sufficiently close that most of the predicted and observed points cannot be distinguished on the scale of the graph.

Fig. 1. The average probability of an error on trial n in Bower's paired-associate experiment.

As a basis for the derivation of other statistics of total errors, we require an expression for the probability distribution of A. To obtain this, we note first that the probability of no errors at all occurring during learning is given by

$$\Pr(A = 0) = \frac{c}{r} + \frac{(1-c)c}{r^2} + \frac{(1-c)^2 c}{r^3} + \cdots = \frac{c/r}{1 - (1-c)/r} = \frac{b}{r},$$

where $b = c/[1 - (1-c)/r]$. This event may arise if a correct response occurs by guessing on the first trial and conditioning occurs on the first reinforcement, if a correct response occurs by guessing on the first two trials and conditioning occurs on the second reinforcement, and so on. Similarly, the probability of no additional errors following an error on any given trial is given by

$$c + \frac{(1-c)c}{r} + \frac{(1-c)^2 c}{r^2} + \cdots = \frac{c}{1 - (1-c)/r} = b.$$

To have exactly k errors, we must have a first error (if k > 0), which has probability $1 - b/r$, then $k - 1$ additional errors, each of which has probability $1 - b$, and then no more errors. Therefore the required probability distribution is

$$\Pr(A = k) = b\Bigl(1 - \frac{b}{r}\Bigr)(1 - b)^{k-1}, \quad\text{for } k \geq 1. \qquad (3)$$

Equation 3 can be applied to data directly to predict the form of the frequency distribution of total errors. It may also be utilized in deriving, for example, the variance of this distribution. Preliminary to computing the variance, we need the expectation of A²,

$$E(A^2) = \sum_{k=0}^{\infty} k^2 \Pr(A = k) = b\Bigl(1 - \frac{b}{r}\Bigr)\sum_{k=0}^{\infty}[k(k-1) + k](1-b)^{k-1},$$

where the second step is taken in order to facilitate the summation. Using the familiar expression

$$\sum_{k=0}^{\infty}(1-b)^k = \frac{1}{b}$$

for the sum of a geometric series, together with the relations

$$\sum_{k=0}^{\infty} k(1-b)^{k-1} = \frac{1}{b^2} \quad\text{and}\quad \sum_{k=0}^{\infty} k(k-1)(1-b)^{k-2} = \frac{2}{b^3},$$

we obtain

$$E(A^2) = \Bigl(1 - \frac{b}{r}\Bigr)\frac{2-b}{b^2} = \frac{(r-1)(2r - 2 + 2c - cr)}{(rc)^2}$$

and

$$\operatorname{Var}(A) = E(A^2) - [E(A)]^2 = \frac{(r-1)(2c - cr + r - 1)}{(rc)^2} = E(A)[1 + E(A)(1 - 2c)]. \qquad (4)$$

Inserting in Eq. 4 the estimates E(A) = 1.45 and c = 0.344 from Bower's data, we obtain 1.44 for the predicted standard deviation of total errors, which may be compared with the observed value of 1.37.

Another useful statistic of the error sequence is $E(A_n A_{n+k})$, namely, the expectation of the product of error random variables on trials n and n + k. This quantity is related to the autocorrelation between errors on trial n + k and trial n. By elementary probability theory,

$$E(A_n A_{n+k}) = \Pr(A_{n+k} = 1, A_n = 1) = \Pr(A_{n+k} = 1 \mid A_n = 1)\Pr(A_n = 1).$$

But for an error to occur on trial n + k conditioning must have failed to occur during the intervening k trials and the subject must have guessed incorrectly on trial n + k.
Hence

$$\Pr(A_{n+k} = 1 \mid A_n = 1) = (1-c)^k\Bigl(1 - \frac{1}{r}\Bigr).$$

Substitution of this result into the preceding expression, along with the result presented in Eq. 1, yields the following expression:

$$E(A_n A_{n+k}) = \Bigl(1 - \frac{1}{r}\Bigr)^2(1-c)^{n+k-1}. \qquad (5)$$

A convenient statistic for comparison with data (directly related to the average autocorrelation of errors with lag k, but easier to compute) is obtained by summing the cross product of $A_n$ and $A_{n+k}$ over all trials. We define $c_k$ as the mean of this random variable, where

$$c_k = E\Bigl(\sum_{n=1}^{\infty} A_n A_{n+k}\Bigr) = \frac{1}{c}\Bigl(1 - \frac{1}{r}\Bigr)^2(1-c)^k. \qquad (6)$$

To be explicit, consider the following response protocol running in time from left to right: 1101010010000. The observed values for $c_k$ are $c_1 = 1$, $c_2 = 2$, $c_3 = 2$, and so on. The predictions for $c_1$, $c_2$, and $c_3$ computed from the c estimate given above were 0.479, 0.310, and 0.201. Bower's observed values were 0.486, 0.292, and 0.187.

Next we consider the distribution of the number of errors between the kth and (k + 1)st success. The methods to be used in deriving this result are general and can be used to derive the distribution of errors between the kth and (k + m)th success for any nonnegative integer m. The only limitation is that the expressions become unwieldy as m increases. We shall define $J_k$ as the random variable for the number of errors between the kth and (k + 1)st success; its values are 0, 1, 2, .... An error following the kth success can occur only if the kth success itself occurs as a result of guessing; that is, the subject necessarily is in state $\bar{C}$ when the kth success occurs. Letting $g_k$ denote the probability that the kth success occurs by guessing, we can write the probability distribution

$$\Pr(J_k = i) = \begin{cases} 1 - \alpha g_k & \text{for } i = 0, \\ (1-\alpha)\alpha^i g_k & \text{for } i > 0, \end{cases} \qquad (7)$$

where $\alpha = (1-c)[1 - (1/r)]$. To obtain $\Pr(J_k = 0)$, we note that 0 errors can occur in one of three ways: (1) the kth success occurs because the subject is in state C (which has probability $1 - g_k$) and necessarily a correct response occurs on the next trial; (2) the kth success occurs by guessing, the subject remaining in state $\bar{C}$ and again guessing correctly on the next trial [which has probability $g_k(1-c)(1/r)$]; or (3) the kth success occurs by guessing but conditioning is effective on that trial (which has probability $g_k c$). Thus

$$\Pr(J_k = 0) = 1 - g_k + g_k(1-c)\frac{1}{r} + g_k c = 1 - \alpha g_k.$$

The event of i errors (i > 0) between the kth and (k + 1)st successes can occur in one of two ways: (1) the kth and (k + 1)st successes occur by guessing {with probability $g_k(1-c)^{i+1}[1 - (1/r)]^i(1/r)$}, or (2) the kth success occurs by guessing and conditioning does not take place until the trial immediately preceding the (k + 1)st success {with probability $g_k(1-c)^i[1 - (1/r)]^i c$}. Hence

$$\Pr(J_k = i) = g_k(1-c)^i\Bigl(1 - \frac{1}{r}\Bigr)^i\Bigl[c + \frac{1-c}{r}\Bigr] = (1-\alpha)\alpha^i g_k.$$

From Eq. 7 we may obtain the mean and variance of $J_k$, namely

$$E(J_k) = \sum_{i=0}^{\infty} i \Pr(J_k = i) = \frac{\alpha g_k}{1 - \alpha}, \qquad (8)$$

and

$$\operatorname{Var}(J_k) = \frac{\alpha(1+\alpha)g_k}{(1-\alpha)^2} - \frac{\alpha^2 g_k^2}{(1-\alpha)^2} = \frac{\alpha g_k[1 + \alpha(1 - g_k)]}{(1-\alpha)^2}. \qquad (9)$$

In order to evaluate these quantities, we require an expression for $g_k$. Consider $g_1$, the probability that the first success will occur by guessing. It could occur in one of the following ways: (1) the subject guesses correctly on trial 1 (with probability 1/r); (2) the subject guesses incorrectly on trial 1, conditioning does not occur, and the subject guesses successfully on trial 2 {this joint event has probability $[1 - (1/r)](1-c)(1/r)$}; or (3) conditioning does not occur on trials 1 and 2, and the subject guesses incorrectly on both of these trials but guesses correctly on trial 3 {with probability $[1 - (1/r)]^2(1-c)^2(1/r)$}, and so forth.
Thus

$$g_1 = \frac{1}{r}\bigl[1 + \alpha + \alpha^2 + \cdots\bigr] = \frac{1}{(1-\alpha)r}.$$

Now consider the probability that the kth success occurs by guessing for k > 1. In order for this event to occur it must be the case that (1) the (k - 1)st success occurs by guessing, (2) conditioning fails to occur on the trial of the (k - 1)st success, and (3) since the subject is assumed to be in state $\bar{C}$ on the trial following the (k - 1)st success, the next correct response occurs by guessing, which has probability $g_1$. Hence

$$g_k = g_{k-1}(1-c)g_1.$$

Solving this difference equation³ we obtain $g_k = (1-c)^{k-1}g_1^k$. Finally, substituting the expression obtained for $g_1$ yields

$$g_k = \frac{(1-c)^{k-1}}{(r - \alpha r)^k}. \qquad (10)$$

We may now combine Eqs. 7 and 10, inserting our original estimate of c, to obtain predictions about the number of errors between the kth and (k + 1)st success in Bower's data. To illustrate, for k = 1, the predicted mean is 0.361 and the observed value is 0.350.

To conclude our analysis of this model, we consider the probability $p_k$ that a response sequence to a stimulus item will exhibit the property of no errors following the kth success. This event can occur in one of two ways: (1) the kth success occurs when the subject is in state C (the probability of which is $1 - g_k$), or (2) the kth success occurs when the subject is in state $\bar{C}$ and no errors occur on subsequent trials. Let b denote the probability of no more errors following a correct guess. Then

$$p_k = (1 - g_k) + g_k b = 1 - g_k(1 - b). \qquad (11)$$

But the probability of no more errors following a successful guess is simply

$$b = c + (1-c)\frac{1}{r}c + (1-c)^2\Bigl(\frac{1}{r}\Bigr)^2 c + \cdots = \frac{c}{1 - (1-c)/r} = \frac{c}{\alpha + c}.$$

Substituting this result for b into Eq. 11, along with our expression for $g_k$ in Eq. 10, we obtain

$$p_k = 1 - \frac{\alpha(1-c)^{k-1}}{(\alpha + c)(r - \alpha r)^k}. \qquad (12)$$

Observed and predicted values of $p_k$ for Bower's experiment are shown in Table 2.

Table 2  Observed and Predicted Values for $p_k$, the Probability of No Errors Following the kth Success

    k     Observed p_k    Predicted p_k
    0     0.255           0.256
    1     0.628           0.636
    2     0.812           0.822
    3     0.869           0.912
    4     0.928           0.957
    5     0.963           0.979
    6     0.973           0.990
    7     0.990           0.995
    8     0.990           0.997
    9     0.993           0.998
    10    0.996           0.999
    11    1.000           1.000

    (Interpret $p_0$ as the probability of no errors at all during the course of learning.)

We shall not pursue more consequences of this model.⁴ The particular results we have examined were selected because they illustrated fundamental features of the model and also introduced mathematical techniques that will be needed later. In Bower's paper more than 30 predictions of the type presented here were tested, with results comparable to those exhibited above. The goodness of fit of theory to data in these instances is quite representative of what we may now expect to obtain routinely in simple learning experiments when experimental conditions have been appropriately arranged to approximate the simplifying assumptions of the mathematical model.

Concepts of the sort developed in this section can be extended to more traditional types of verbal learning situations involving stimulus similarity, meaningfulness, and the like. For example, Atkinson (1957) has presented a model for rote serial learning which is based on similar ideas and deals

³ The solution of this equation can quickly be obtained. Note that $g_2 = g_1(1-c)g_1 = (1-c)g_1^2$; similarly, $g_3 = g_2(1-c)g_1$; substituting the result for $g_2$, we obtain $g_3 = (1-c)g_1^2(1-c)g_1 = (1-c)^2 g_1^3$. If we continue in this fashion, it will be obvious that $g_k = (1-c)^{k-1}g_1^k$.
⁴ Bower also has compared the one-element model with a comparable single-operator linear model presented by Bush and Sternberg (1959).
The linear model assumes that the probability of an incorrect response on trial n is a fixed number $p_n$, where $p_{n+1} = (1-c)p_n$ and $p_1 = 1 - (1/r)$. The one-element model and the linear model generate many identical predictions (e.g., the mean learning curve), and it is necessary to look at the finer structure of the data to differentiate the models. Among the 20 possible comparisons Bower makes between the two models, he finds that the one-element model comes closer to the data on 18.

with such variables as intertrial interval, list length, and types of errors (perseverative, anticipatory, or response-failure). Unfortunately, theoretical analyses of this sort for traditional experimental routines often lead to extremely complicated mathematical models, with the result that only a few consequences of the axioms can be derived. Stated differently, a set of concepts may be general in terms of the range of situations to which it is applicable; nevertheless, in order to provide rigorous and detailed tests of these concepts, it is frequently necessary to contrive special experimental routines in which the theoretical analyses generate tractable mathematical systems.

1.3 Probabilistic Reinforcement Schedules

We shall now examine a one-element model for some simple two-choice learning problems. The one-element model for this situation, as contrasted with the paired-associate model, generates some predictions of behavior that are quite unrealistic, and for this reason we defer an analysis of experimental data until we consider comparable multi-element processes. The reason for presenting the one-element model is that it represents a convenient introduction to multi-element models and permits us to develop some mathematical tools in a simple fashion. Further, when we do discuss multi-element models, we shall employ a rather restrictive set of conditioning axioms. However, for the one-element model we may present an extremely general set of conditioning assumptions without getting into too much mathematical complexity. Therefore the analysis of the one-element case will suggest lines along which the multi-element models can be generalized.

The reference experiment (see, e.g., Estes & Straughan, 1954; Suppes & Atkinson, 1960) involves a long series of discrete trials. Each trial is initiated by the onset of a signal. To the signal the subject is required to make one of two responses, which we denote $A_1$ and $A_2$. The trial is terminated with an $E_1$ or $E_2$ reinforcing event; the occurrence of $E_i$ indicates that response $A_i$ was the correct response for that trial. Thus in a human learning situation the subject is required on each trial to predict the reinforcing event he expects will occur by making the appropriate response, an $A_1$ if he expects $E_1$ and an $A_2$ if he expects $E_2$; at the end of the trial he is permitted to observe which event actually occurred. Initially the subject may have no preference between responses, but as information accrues to him over trials his pattern of choices undergoes systematic changes. The role of a model is to predict the detailed features of these changes.

The experimenter may devise various schedules for determining the sequence of reinforcing events over trials. For example, the probability of an $E_1$ may be (1) some function of the trial number, (2) dependent on previous responses of the subject, (3) dependent on the previous sequence of reinforcing events, or (4) some combination of the foregoing.
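The comparison mentioned in the footnote above rests on the fact that the two models imply the same mean error curve; a quick numerical check (illustrative parameter values only):

```python
# One-element model: Pr(error on trial n) = (1 - 1/r)(1 - c)^(n-1)  (Eq. 1).
# Single-operator linear model: p_1 = 1 - 1/r, p_{n+1} = (1 - c) p_n.
c, r, trials = 0.344, 2, 12
one_element = [(1 - 1 / r) * (1 - c) ** (n - 1) for n in range(1, trials + 1)]
linear = []
p = 1 - 1 / r
for _ in range(trials):
    linear.append(p)
    p = (1 - c) * p
print(all(abs(a - b) < 1e-12 for a, b in zip(one_element, linear)))   # True
```

The two curves coincide term by term, so only sequential and distributional statistics can separate the models.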
For simplicity we consider a noncontingent reinforcement schedule. The case is defined by the condition that the probability of E: is constant over trials and independent of previous responses and reinforcements. It is customary in the literature to call this probability TT; thus Pr(EIjn) = TT for all n. Here we are denoting by Eijn the event that reinforcement Ei occurs on trial n. Similarly, we shall represent by A^n the event that response Ai occurs on trial n. We assume that the stimulus situation comprising the signal light and the context in which it occurs can be represented theoretically by a single stimulus element that is sampled with probability 1 when the signal occurs. At the start of a trial the element is in one of three conditioning states : in state Cx the element is conditioned to the ^-response and in state C2 to the ^2"resPonse3* in state C0 the element is not conditioned to A± or to A2. The response rules are similar to those presented earlier. When the subject is in Cx or C2, the Ar or ^-response occurs with probability 1. In state C0 we assume that either response will be elicited equiprobably; that is, Pr(Alin | C0)J = J. For some subjects a response bias may exist that would require the assumption Pr (Alt7l \ C0>ri) = /?, where ft ^ i- For these subjects it would be necessary to estimate J3 when applying the model. However, for simplicity we shall pursue only the case in which responses are equiprobable when the subject is in C0. Fig. 2. Branching process, starting from state CL on trial n, for a one-element model in a two- choice, noncontingent case. ONE-ELEMENT MODELS j^g We now present a general set of rules governing changes in conditioning states. As the model is developed it will become obvious that for some experimental problems restrictions that greatly simplify the process can be imposed. If the subject is in state Cx and an E^ occurs (i.e., the subject makes an ^-response, which is correct), then he will remain in Ca. However, if the subject is in Cx and an E2 occurs, then with probability c the subject goes to C2 and with probability c' to C0. Comparable rules apply when the subject is in C2. Thus, if the subject is in Ca or C2 and his response is correct, he will remain in Cj or C2. If, however, he is in d or C2 and his response is not correct, then he may shift to one of the other conditioning states, which reduces the probability of repeating the same response on the next trial. Finally, if the subject is in C0 and an E± or E2 occurs, then with proba- bility c" the subject moves to Cx or C2, respectively.5 Thus, to summarize, for i,j = 1, 2 and i ( } where 0 < c" <, 1 and 0 < c + c' < 1. We now use the assumptions of the preceding paragraphs and the particular assumptions for the noncontingent case to derive the transition matrix in the conditioning states. In making such a derivation it is con- venient to represent the various possible occurrences on a trial by a tree. Each set of branches emanating from a point represents a mutually ex- clusive and exhaustive set of possibilities. For example, suppose that at the start of trial n the subject is in state Q; the tree in Fig. 2 represents the possible changes that can occur in the conditioning state. 6 Here we assume that the subject's response does not affect the change; that is, if the subject is in C0 and an EI occurs, then he will move to Cx with probability c", no matter whether A or A2 has occurred. This assumption is not necessary and we could readily have the actual response affect change. 
For example, we might postulate c-! for an AiEi or AzEz combination, and c/ for the Aj.Ez or A£± combination; that is, and Pr (Cltn+1 1 ElfnA2,nC0tn) = Pr (C2,n+1 1 E2,nAltnC0,n) = c* where c/ ^ c/. However, such additions make the mathematical process more complicated and should be introduced only when the data clearly require them. 144 STIMULUS SAMPLING THEORY The first set of branches is associated with the reinforcing event on trial 72. If the subject is in Q and an E^ occurs, then he will stay in state C]. on the next trial. However, if an E2 occurs, then with probability c he will go to C2, with probability c he will go to C0, and with probability 1 — c — cf he will remain in Q. Each path of a tree, from a beginning point to a terminal point, re- presents a possible outcome on a given trial. The probability of each path is obtained by multiplying the appropriate conditional probabilities. Thus for the tree in Fig. 2 the probability of the bottom path may be represented by Pr(E2}7l \ Cljn)Pr(Cljn+1 1 £a.BC1>n) = (1 - *r)(l - c - c'). Two of the four paths lead from Cx to Q; hence Pll = Pr (CM+1 | C1( J = TT + (1 - 7T)(1 - C - C'). Similarly, JPIO = (1 — TT)C' and /?12 = (1 — TT)C, where /^ denotes the probability of a one-step transition from Ci to Q . For the C0 state we have the tree given in Fig. 3. On the top branch an E! event is indicated; by Eq. 13 the probability of going to Cx is c" and of staying in C0 is 1 — c". A similar analysis holds for the bottom branches. Thus we have Poi = *c" oo 1 - A combination of these results and the comparable results for C2 yields the following transition matrix: P = C0 C2 c c c *-"l ^0 *^2 "1 - (1 - ir)(c' + c) c'(l - TT) c(l - 17) (14) c'V 1 - c" c^(l - TT) CTT C'TT 1 — TT(C' + c)_ . As in the case of the paired-associate model, a large number of pre- dictions can be derived easily for this process. However, we shall select only a few that will help to clarify the fundamental properties of the model. We begin by considering the asymptotic probability of a particular conditioning state and, in turn, the asymptotic probability of an Ar response. The following notation will prove useful: let [pi3] be the transi- tion matrix and define p$ as the probability of being in state j on trial r + n, given that at trial r the subject was in state f. The quantity is defined recursively: Pii^Pa, Pii=2.P^K' V ONE-ELEMENT MODELS *45 2,n+l 'Q.n+l Fig. 3. Branching process, starting from state C0 on trial n, for a one-element model in a two- choice, noncontingent case. Moreover, if the appropriate limit exists and is independent of z, we set The limiting quantities us exist for any finite-state Markov chain that is irreducible and aperiodic. A Markov chain is irreducible if there is no closed proper subset of states; that is, no proper subset of states such that once within this set the probability of leaving it is 0. For example, the chain whose transition matrix is 1 2 3 'I | 0 1 4 0 i * t is reducible because the set {1, 2} of states is a proper closed subset. A Markov chain is aperiodic if there is no fixed period for return to any state and periodic if a return to some initial state j is impossible except at t9 2t, 3t, . . . trials for t > 1. Thus the chain whose matrix is 1 2 3 "0 1 0" 0 0 1 .1 0 0. has period t = 3 for return to each state. 14$ STIMULUS SAMPLING THEORY If there are r states, we call the vector u = (uly u2> . . . , ur) the stationary probability vector of the chain. 
It may be shown (Feller, 1957; Kemeny & Snell, 1959) that the components of this vector are the solutions of the r linear equations r "2=5X^,2 (15) such that 2 uv = 1- Thus, to find the asymptotic probabilities u0 of the «*i states, we need find only the solution of the r equations. The intuitive basis of this system of equations seems clear. Consider a two-state chain. Then the probability /?n+1 of being in state 1 on trial n + 1 is the probability of being in state 1 on trial n and going to 1 plus the probability of being in state 2 on trial n and going to 1 ; that is But at asymptote pn+1 = pn = wx and 1 — pn = z/2, whence which is the first of the two equations of the system when r = 2. It is clear that the chain represented by the matrix P of Eq. 14 is irre- ducible and aperiodic; thus the asymptotes exist and are independent of the initial probability distribution on the states. Let [pti] (i,j = 1, 2, 3) be any 3x3 transition matrix. Then we seek the numbers tij such that ui ~ 2 uvPvi and £ MJ = 1. The general solution is given by ^ = D^D, V where _ f . = (1 -pnXl -^22) - D =£)1 + JD2 + D3. Inserting in these equations the equivalents of the p{j from the transition matrix and renumbering the states appropriately, we obtain D! = ITC"(C + cV) J^o = *r(l - w)cV + 2c) ONE-ELEMENT MODELS Since D is the sum of the D/s and since u5 = DJD, we may divide the numerator and denominator by (c")2 and obtain u _. 77(p + €77) 77(p + €77) + 77(1 - 77)€(€ + 2p) + (1 - TT)[/> + €(1 - 77)] UQ = • :: L-£ (17^ 77(p + €77) + 77(1 - 77)€(€ + 2p) + (1 - TT)|> + €<1 - 77)] ^ ' where p = c/c" and € = c'/c". By our response axioms we have for all n. Hence lim Pr (4lfft) = % + |w0 _ 77(p + €P + J62) + 772(€ - €P - 77(€2 + 2€p - 2€) + 772(2€ - €2 - An inspection of Eq. 18 indicates that the asymptotic probability of an ^-response is a function of 77, p, and €. As will become clear later, the value of Pr(.4lj00) is bounded in the open interval from \ to 772/[7r2 + (1 — 77)2] ; whether Pr 04lj00) is above or below 77 depends on the values of p and €. We now consider two special cases of our one-element model. The first case is comparable to the multi-element models to be discussed later, whereas the second case is, in some respects, the complement of the first case. Case ofcr = 0. Let us rewrite Eq. 14 with c' = 0. Then the transition matrix has the following canonical form : c r r <•"! ^2 ^0 "1 - c(l - 77) c(l - 77) 0 C77 1 - C77 0 (19) c'V c"(l - 77) 1 - c"_ . We note that once the subject has left state C0 he can never return. In fact, it is obvious that Pr (CM) = Pr (C0>1)(1 — c")n~l where Pr (C0>i) is the initial probability of being in C0. Thus, except on early trials, C0 is not part of the process, and the subject in the long run fluctuates between CL and C2, being in Cx on a proportion 77 of the trials. From Eq. 19 we have also Pr (C1>n+1) = Pr (CM)[1 - c(l - 77)] + Pr (C2>n)c77 + Pr (Co.Jc**; STIMULUS SAMPLING THEORY that is, the probability of being in Cx on trial n + 1 is equal to the prob- ability of being in Cx on trial n times the probability p^ of going from Cx to Cx plus the probability of being in C2 times p2l plus the probability of being in C0 times /?01. For simplicity let xn = Pr (CM), yn = Pr (C2jJ, and zn = Pr (C0j7l). Now we know that zn = z^l — c")n~l and also that xn + Vn + *n = X or yn = 1 - xn - ^(1 - c")"'1. Making these sub- stitutions in the foregoing recursion yields = sn(l - c) + %(1 - cT~Mc" - <0 + CTT. 
This difference equation has the following solution6: xn = TT - (TT - 3^(1 - c)»-i - TT^Kl - c"y~l - (1 - ^)n-1]- But Pr (4lf J = o:n + ^n; hence Pr (^If1l) = TT - [TT - TT Pr (C0)1) - Pr (Cljl)](i - c)-1 O11-1. (20) If Pr (C0jl) = 0, then we have a simple exponential learning function starting at Pr(CljL) and approaching TT at a rate determined by c. If Pr (C0)1) 7^ 0, then the rate of approach is a function of both c and c' '. We now consider one simple sequential prediction to illustrate another feature of the one-element model for c' = 0. Specifically, consider the probability of an ^-response on trial n + 1 given a reinforced ^-response on trial n; namely Pr G4M+i | El§nAltJ. Note first of all that Pr G4M+1 1 EljnA1}n) Pr (E^nA^n) = Pr (^i^+A,^!, J. 6 The solution of such a difference equation can readily be obtained. Consider xn+l = axn + bcn~l + d where a, b, c, and d are constants. Then (1) xz = axl + b + d. Similarly, %* = axz + be + d and substituting (1) for x2 we obtain (2) X* = aX + ab + ad + bc + d. Similarly, #4 = #e3 + 6c2 + cf and substituting (2) for #3 we obtain (3) #4 = c3^! + a*b + azd + abc + ad + be* + ^. If we continue in this fashion, it will be obvious that for n > 2 w—2 n— 2 / y sn = a"-1^ + rf 2 fl< + fln"2^ 2 (-)• Carrying out the summations yields the desired results. See Jordan (1950, pp. 583-584) for a detailed treatment. ONE-ELEMENT MODELS Further, we may write =2PrG4i,«+iQ, i,i = 2Pr04i,«+ii i,3 (£i,n , J Pr (C^+1 1 El>nA1>nCjfn) i.»Cy J Pr (41§B | C,, J Pr (C,, J. By assumption the probability of a response is determined solely by the conditioning state, hence Pr (^i,»+i | CwlEM^1>nQ>fl) = Pi(Altn+l 1 Q§w+1). Further, by assumption, the probability of an E^-event is independent of other events, and Pr (EI>n \ A^nCit^ = rr. Substituting these results in the foregoing expression, we obtain Pr Pr (C4ljW+1 1 Ci)n+1) is zero and when y = 2 the term Pr (A1>n \ C,fW) is zero. Consequently it suffices to limit i and j to 0 and 1 , and we have = TT Pr Pr (Ct- Pr (^B | ClfJ Pr (Cljn • 2 Pr (^lin 1) Pr (Q,n Pr | C0,JPr (C0,J. Since the subject cannot leave state Cx on a trial when A: is reinforced, we know that Pr = 1 and Pr = 0; further, Pr (Al>n+l \ Cljn+1) = 1. Therefore the first sum is simply TT Pr (CM). For the second sum, Pr (CM+1 1 E^fnAlfnC0tJ = c" and Pr (Co,^ | Elin^i,wC0fn) = 1 - c". Further, Pr (^[w | C0, J = *; hence for the second sum we obtain Combining these results, Pr (C0,J[c" JJO STIMULUS SAMPLING THEORY But Pr (ElinAl>n) = Pr (£1§n | Al>n) Pr (A1>n) = rr Pr (^ J, whence j A x Pr (C^) + I Pr (C0. n)[c" + (1 - Q« We know that Pr (C1>n) and Pr (AI>n) both approach TT in the limit and that Pr (C0j J approaches 0. Therefore we predict that lim Pr (41(W+1 1 EltnA1>n) = 1. n~»-oo This prediction provides a sharp test for this particular case of the model and one that is certain to fail in almost any experimental situation; that is, even after a large number of trials it is hard to conceive of an experimental procedure such that a response will be repeated with prob- ability 1 if it occurred and was reinforced on the preceding trial. Later we shall consider a multi-element model that provides an excellent description of many sets of data but is based on essentially the same conditioning rules specified by this case of c' = 0. It should be emphasized that deterministic predictions of the sort given in the foregoing equation are peculiar to one-element models; for the multi-element case such difficulties do not arise. This point is amplified later. 
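The limiting sequential prediction just derived can also be checked by brute force. The following sketch simulates the c' = 0 case (all parameter values and sample sizes are arbitrary illustrations) and estimates $\Pr(A_{1,n+1} \mid E_{1,n}A_{1,n})$ over late trials; the estimate is essentially 1, as the derivation requires.

```python
import random

def sequential_stat(pi, c, c_dd, n_trials=400, n_runs=500, seed=1):
    """Estimate Pr(A1 on trial n+1 | A1 and E1 on trial n) late in learning
    for the c' = 0 case of the one-element model (states C1, C0, C2)."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n_runs):
        state = "C0"
        prev_a1_e1 = False
        for t in range(n_trials):
            if state == "C1":
                resp = 1
            elif state == "C2":
                resp = 2
            else:
                resp = 1 if rng.random() < 0.5 else 2   # equiprobable in C0
            if prev_a1_e1 and t > n_trials // 2:        # count only late trials
                total += 1
                hits += (resp == 1)
            e = 1 if rng.random() < pi else 2           # noncontingent reinforcement
            prev_a1_e1 = (resp == 1 and e == 1)
            if state == "C0":
                if rng.random() < c_dd:                 # c'' transition out of C0
                    state = "C1" if e == 1 else "C2"
            elif state == "C1" and e == 2 and rng.random() < c:
                state = "C2"                            # counterconditioning
            elif state == "C2" and e == 1 and rng.random() < c:
                state = "C1"
    return hits / total

print(sequential_stat(0.7, 0.2, 0.3))   # approaches 1.0, as derived above
```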
Case of c = 0. We now consider the case in which direct counter- conditioning does not occur, that is, c = 0, and thus p = 0 and 0 < e < oo. With this restriction the chain is still ergodic, since it is possible to go from every state to every other state, but transitions between Cx and C2 must go by way of C0. Letting p = 0 in Eq. 18, we obtain From Eq. 21 we can draw some interesting conclusions about the relationship of the asymptotic response probabilities to the ratio e = c'\c". Differentiating with respect to e, we obtain If 77(1 — 77)(i — 77) 5^ 0, then Pr (Alf(0) has no maximum for e in the open interval (0, oo), which is the permissible range on e. In fact, since the sign of the derivative is independent of <=, we know that Pr (Al)CG) is either monotone increasing or monotone decreasing in e: strictly increasing if 77(1 — TT)(| — 77) > 0 (i.e., 77 < J) and decreasing if 77(1 — TT)(\ — 77) < 0 (i.e., 77 > J). Moreover, because of the monotonicity of Pr (Alt00) in €, ONE-ELEMENT MODELS it is easy to compute bounds from Eq. 21. First, we see immediately that the lower bound (assuming TT > £) is Hm Pr (AI>OQ) = 1. Second, when € is very small, Pr (X1>GO) approaches 7r2/[7r2 + (1 - 77)*]. Note, however, that Eq. 21 is inapplicable when e = 0; for if both c = 0 and c' = 0 the transition matrix (Eq. 14) reduces to P = '10 0 ' cV 1 - C" c"(l - TT) 0 0 1 and, if the process starts in C0, Pr (Alf(0) = TT. But for e > 0, if TT > J, Pr C^i.w) is a decreasing function of e and its values lie in the half-open interval , --— .- -„• + (!_„)»• It is readily determined that probability matching would not generally be predicted in this case. When c'\c" is greater than 2, the predicted value of Pr (Ai9(0) is less than TT, and when this ratio is less than 2 the predicted value of Pr (Alt OQ) is greater than TT. Finally, we derive Pr (Al}7L+l \ El}UAl>n) for this case. The derivation is identical to that given for c' = 0, Hence M! + JW0 Note, however, that for c = 0 the quantity w0 is never 0 (except for TT = 0, 1), and consequently Pr(Al>n+l \ Elt7LAlfn) is always less than 1. Contingent Reinforcement. As a final example we shall apply the one-element model to a situation in which the reinforcing event on trial n is contingent on the response on that trial. Simple contingent reinforce- ment is defined by two probabilities 7rn and 7r21 such that Pr (Eltn | Alfn) = TTU and Pr (E1|W | A2J = 7r21. We consider the case of the model in which c' = 0 and Pr (C0ji) = 0; that is, the subject is not in state C0 on trial 1 and (since c' = 0) he can never reach C0 from Cx or C2. Hence on all trials he is in Cx or C2, and transitions between these states are governed by the single parameter c. The trees for the Ci and C2 states are given in Fig. 4. The transition matrix is G! C2 l - (1 - TTU)C (1 - 7rn)cl 1 — C7T21 J, CU ^!.» STIMULUS SAMPLING THEORY O-i _ i -i 1,71 + 1 Fig. 4. Branching process for one-element model in two- choice, contingent case. and, in terms of this matrix, we may write Pr (CM+1) = Pr (Clf J[l - (1 - 7rn)c] + Pr ( But Pr (C2>n) = 1 - Pr (Clffl) and Pr (CM) = Pr (^lfll); hence Pr (^i,w+i) == Pr (Al>n)[l - (1 - 7ru)c - C7r21] This difference equation has the solution Pr C4M) = Pr (Altn) - [Pr (AltJ - Pr (4lfl)][l - c(l ~ 7r where JL ~~" "37*11 i" W' 21 MULTI-ELEMENT PATTERN MODELS The asymptote is independent of c, and the rate of approach is determined by the quantity c(l — TTU + 7r21). 
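A numerical illustration of the last statement: iterating the recursion for $\Pr(A_{1,n})$ given above for the simple contingent case (with c' = 0) shows that the limit is the fixed point of that recursion, $\pi_{21}/(1 - \pi_{11} + \pi_{21})$, and does not depend on c. The parameter values below are arbitrary.

```python
# Recursion from the text: p_{n+1} = p_n [1 - (1 - pi11) c - c pi21] + c pi21.
def contingent_curve(p1, c, pi11, pi21, n_trials):
    p = p1
    curve = [p]
    for _ in range(n_trials - 1):
        p = p * (1 - (1 - pi11) * c - c * pi21) + c * pi21
        curve.append(p)
    return curve

for c in (0.1, 0.3, 0.6):
    curve = contingent_curve(0.5, c, pi11=0.7, pi21=0.4, n_trials=300)
    print(round(curve[-1], 4))            # same limit for every c
print(round(0.4 / (1 - 0.7 + 0.4), 4))    # pi21 / (1 - pi11 + pi21)
```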
It is interesting to note that the learning function for $\Pr(A_{1,n})$ in this case of the one-element model is identical to that of the linear model (cf. Estes & Suppes, 1959a).

2. MULTI-ELEMENT PATTERN MODELS

2.1 General Formulation

In the literature of stimulus sampling theory a variety of proposals has been made for conceptually representing the stimulus situation. Fundamental to all of these suggestions has been the distinction between pattern elements and component elements. For the one-element case this distinction does not play a serious role, but for multi-element formulations these alternative representations of the stimulus situation specify different mathematical processes.

In component models the stimulating situation is represented as a population of elements which the learner is viewed as sampling from trial to trial. It is assumed that the conditioning of individual elements to responses occurs independently as the elements are sampled in conjunction with reinforcing events and that the response probability in the presence of a sample containing a number of elements is determined by an averaging rule. The principal consideration has been to account for response variability to an apparently constant stimulus situation by postulating random fluctuations from trial to trial in the particular sample of stimulus elements affecting the learner. These component models have provided a mechanism for effecting a reconciliation between the picture of gradual change usually exhibited by the learning curve and the all-or-none law of association.

For many experimental situations a detailed account of the quantitative properties of learning can be given by component models that assume discrete associations between responses and the independently variable elements of a stimulating situation. However, in some cases predictions from component models fail, and it appears that a simple account of the learning process requires the assumption that responses become associated, not with separate components or aspects of a stimulus situation, but with total patterns of stimulation considered as units. The model presented in this section is intended to represent such a case. In it we assume that an experimentally specified stimulating situation can be conceived as an assemblage of distinct, mutually exclusive patterns of stimulation, each of which becomes conditioned to responses on an all-or-none basis. By "mutually exclusive" we mean that exactly one of the patterns occurs (is sampled by the subject) on each trial. By "distinct" we mean that no generalization occurs from one pattern to another. Thus the clearest experimental interpretation would involve a set of patterns having no common elements (i.e., common properties or components). In practice the pattern model has also been applied with considerable success to experiments in which the alternative stimuli have some common elements but nevertheless are sufficiently discriminable so that generalization effects (e.g., "confusion errors") are small and can be neglected without serious error.

In this presentation we shall limit consideration to cases in which patterns are sampled randomly with equal likelihood, so that if there are N patterns each has probability 1/N of being sampled on a trial. This sampling assumption represents only one way of formulating the model and is presented here because it generates a fairly simple mathematical process and provides a good account of a variety of experimental results.
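To make the sampling and conditioning assumptions concrete before they are stated formally below, here is a minimal Monte Carlo sketch of the pattern model in a two-choice noncontingent situation. The function and parameter names are ours and the starting assignment of patterns to responses is arbitrary; in the long run the proportion of $A_1$ responses approaches the reinforcement probability, a result derived later in this section.

```python
import random

def simulate_pattern_model(N, c, pi, n_trials, rng=None):
    """N-pattern model sketch: each of N patterns is conditioned to response
    1 or 2; on every trial one pattern is sampled with probability 1/N, the
    response it is conditioned to is made, and with probability c the sampled
    pattern becomes conditioned to the response the reinforcing event designates."""
    rng = rng or random.Random()
    conditioning = [rng.choice((1, 2)) for _ in range(N)]   # arbitrary start
    responses = []
    for _ in range(n_trials):
        k = rng.randrange(N)                 # sample one pattern
        responses.append(conditioning[k])
        e = 1 if rng.random() < pi else 2    # noncontingent reinforcement
        if rng.random() < c:
            conditioning[k] = e              # all-or-none conditioning
    return responses

resp = simulate_pattern_model(N=4, c=0.3, pi=0.8, n_trials=5000)
print(sum(1 for x in resp[1000:] if x == 1) / len(resp[1000:]))   # near 0.8
```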
However, this particular scheme for sampling patterns has restricted applicability. For example, in certain experiments it can be demonstrated that the stimulus array to which the subject responds is in large part determined by events on previous trials ; that is, trace stimulation associated with previous responses and rewards determines the stimulus pattern to which the subject responds. When this is the case, it is necessary to pos- tulate a more general rule for sampling patterns than the random scheme proposed (e.g., see the discussion of "hypothesis models" in Suppes & Atkinson, 1960). Before stating the axioms for the pattern model to be considered in this section, we define the following notions. As before, the behaviors available to the subject are categorized into mutually exclusive and exhaustive response classes (Al9 A2, . . . 9 Ar). The possible experimenter- defined outcomes of a trial (e.g., giving or withholding reward, uncondi- tioned stimulus, knowledge of results) are classified by their effect on response probability and are represented by a mutually exclusive and exhaustive set of reinforcing events (EQ, El9 . . . , Er). The event £, (i ^ 0) indicates that response Ai is reinforced and E0 represents any trial outcome whose effect is neutral (i.e., reinforces none of the At9$). The subject's response and the experimenter-defined outcomes are observable, but the occurrence of Et is a purely hypothetical event that represents the rein- forcing effect of the trial outcome. Event Et is said to have occurred when the outcome of a trial increases the probability of response At in the presence of the given stimulus — provided, of course, that this probability is not already at its maximum value. We now present the axioms. The first group of axioms deals with the MULTI-ELEMENT PATTERN MODELS /55 conditioning of sampled patterns, the second group with the sampling of patterns, and the third group with responses. Conditioning Axioms Cl. On every trial each pattern is conditioned to exactly one response. C2. If a pattern is sampled on a trial, it becomes conditioned with prob- ability c to the response (if any) that is reinforced on the trial; if it is already conditioned to that response, it remains so. C3. If no reinforcement occurs on a trial (i.e., EQ occurs), there is no change in conditioning on that trial. C4. Patterns that are not sampled on a trial do not change their conditioning on that trial. C5. The probability c that a sampled pattern mil be conditioned to a reinforced response is independent of the trial number and the pre- ceding events. Sampling Axioms 51. Exactly one pattern is sampled on each trial. 52. Given the set of N patterns available for sampling on a trial, the probability of sampling a given pattern is l/N, independent of the trial number and the preceding events. Response Axiom Rl. On any trial that response is made to which the sampled pattern is conditioned. Later in this section we apply these axioms to a two-choice learning experiment and to a paired-comparison study. First, however, we shall prove several general theorems. Before we can begin our analysis it is necessary to define the notion of a conditioning state. For the axioms given, all patterns are sampled with equal probability, and it suffices to let the state of conditioning indicate the number of patterns conditioned to each response. Hence for r responses the conditioning states are the ordered r-tuples (kl9 fc2, . . . , kr) where kt = 0, 1, 2, . . . , N and kt + kt + ... 
+ kr = N; the integer ki denotes the number of patterns conditioned to the At response. The number of possible conditioning states is ( N + ^ ~~ J . (In a generalized model, which permitted different patterns to have different likelihoods of being sampled, it would be necessary to specify not only the number of patterns conditioned to a response but also the sampling probabilities associated with the patterns.) For simplicity we limit consideration in this section to the case of two alternatives, except for one example in which r = 3. Given only two alternatives, we denote the conditioning state on trial n of an experiment I$ STIMULUS SAMPLING THEORY as Cz>, where z = 0, 1, 2, . . . , TV; the subscript i indicates the number of patterns conditioned to A± and N — z the number conditioned to A2. TRANSITION PROBABILITIES. Only one pattern is sampled per trial; therefore the subject can go from state Ci only to one of the three states Q-i> Q5 or Ci+i on any given trial. The probabilities of these transitions depend on the value of the conditioning parameter c, the reinforcement schedule, and the value of z. We now proceed to compute these prob- abilities. If the subject is in state C, on trial n and an El occurs, then the possible outcomes are indicated by the tree in Fig. 5. On the upper main branch, which has probability z'/TV, a pattern that is conditioned to Al is sampled and, since an ^-reinforcement occurs, the pattern remains conditioned to Alf Hence the conditioning state on trial n + I is the same as on trial n (see Axiom C2). On the lower main branch, which has probability (N — i)/N, a pattern conditioned to A2 is sampled; then with probability c the pattern is conditioned to A: and the subject moves to conditioning state Q+19 whereas with probability 1 — c conditioning is not effective and the subject remains in state C,-. Putting these results together, we obtain t Pr (Ci>n+1 1 ElinCitn) = 1 - c + c - . N Similarly, if an E2 occurs on trial n, Pr (Q-if9i+i I E2tnCiin) = c — Fig. 5. Branching process for JV-element model on a trial when the subject starts in state Ct and an ^-event occurs. MULTI-ELEMENT PATTERN MODELS 757 By Axiom C3, if an EQ occurs, then MQ>+i | £0,^) = 1. (22c) Noting that a transition upward can occur only when a pattern condi- tioned to A2 is sampled on an 1^-trial and a transition downward can occur only when a pattern conditioned to A± is sampled on an £2-trial, we can combine the results from Eq. 22a~c to obtain Pr (Ci+1>n+1 1 C4>n) = c Z-J. Pr (^ | A^C^ (23a) 2._ljre+1 1 Q. J = c Pr (£2>n | A1>BQin) Pr (£1>m , A1>nQ.J ,J^,nCi>n) (23c) for the probabilities of one-step transitions between states. Equation 23a, for example, states that the probability of moving from the state with f elements conditioned to A1 to the state with i + 1 elements conditioned to A-L is the product of the probability (N — f)/N that an element not already conditioned to A1 is sampled and the probability cPr (Ei>n \ AQ nCi>n) that, under the given circumstances, conditioning occurs. As defined earlier, we have a Markov process in the conditioning states if the probability of a transition from any state to any other state depends at most on the state existing on the trial preceding the transition. By inspection of Eq. 
23 we see that the Markov condition may be satisfied by limiting ourselves to reinforcement schedules in which the probability of a reinforcing event £",- depends at most on the response of the given trial; that is, in learning-theory terminology, to noncontingent and simple contingent schedules. This restriction will be assumed throughout the present section except for a few remarks in which we explicitly consider various lines of generalization. With these restrictions in mind, we define where j = 0 to r, i = 1 to r, and 2 *w = 1 ; that is, the reinforcement on s a trial depends at most on the response of the given trial. Further, the IJ<9 STIMULUS SAMPLING THEORY reinforcement probabilities do not depend on the trial number. We may then rewrite Eq. 23 as follows : N — i 4i,t+i = c *2i (240) •i N — i i 4M = 1 - c —^- 7r21 - c — 7r12 (246) &,z-i = ^^la- (24c) Note that we use the notation ^ in place of Pr (C,>+1 1 CifJ. The reason is that the transition probabilities do not depend on n, given the restrictions on the reinforcement schedule stated above, and the simpler notation expresses this fact. RESPONSE PROBABILITIES AND MOMENTS. By Axioms SI, S2, and Rl we know that the relation between response probability and the con- ditioning state is simply Pr(^,n|Q(J = l. Hence N Pr (A,J = I Pr (Alin | CiiB) Pr (Ci>B) i=0 = f^Pr(Q,J. (25) i=oN But note that by definition of the transition probabilities qit ,n) = Pr (C0,ra_i)4o* + Pr (Clin_Jqu + . . . + Pr (CKi^)qm - (26) The latter expression, together with Eq. 25, serves as the basis for a general recursion in Pr (Alfn): i=0 3=0 Now substituting for q^ in terms of Eq. 24 and rearranging the sum we have MULTI-ELEMENT PATTERN MODELS The first sum is, by Eq. 25, Pr (A1>n_J. Let us define then the second sum is simply -CTT^^. Similarly, the third sum is -C7r21[Pr (Aljn_J - Pr (Cv^) - a2jn^ and so forth. Carrying out the summation and simplifying, we obtain the following recursion in Pr (AlL> J : Pr (Alt w) = 1 - ~ (7ria + 7T21)| Pr (Ai n_i) + — 7r21. (27) L N J ' JV This difference equation has the well-known solution (cf. Bush & Mos- teller, 1955; Estes, 1959b; Estes &Suppes, 1959) I c "Pr ( A ^\ "~~~ Pr ( A ~\ — — ["Pi* ( A *\ ^— "Pt* ( A \\ 1 / i 1,1 |^ N 12 ! (28) where At this point it will also be instructive to calculate the variance of the distribution of response probabilities Pr (AI>n \ Ciin). The second raw moment, as defined above, is Carrying out the summation, as in the case of Eq. 27, we obtain <*M = a2,w_i 1 - ~ (77ia + 7721) Subtracting the square of Pr (^ijW)? as given in Eq. 28, from a2jW yields the variance of the response probabilities. The second and higher moments of the response probabilities are of experimental interest primarily because they enter into predictions concerning various sequential statistics. We shall return to this point later. ASYMPTOTIC DISTRIBUTIONS. The pattern model has one particularly advantageous feature not shared by many other learning models that have appeared in the literature. This feature is a simple calculational procedure l6o STIMULUS SAMPLING THEORY for generating the complete asymptotic distribution of conditioning states and therefore the asymptotic distribution of responses. The derivation to be given assumes that all elements #M_i, qiti, #^+1 of the transition matrix are nonzero ; the same technique can be applied if there are zero entries, except, of course, that in forming ratios one must keep the zeros out of the denominators. As in Sec. 1.3, we let lim Pr (Q}J = wz. 
The theorem to be proved is n-+oo that all of the asymptotic conditioning state probabilities z^ can be expressed recursively in terms of w0; since the w/s must sum to unity, this recursion suffices to determine the entire distribution. By Eq. 26 we note that hence ^o _. gio ._gio «i 1 — 4oo £01 We now prove by induction that a similar relation holds for any adjacent pair of states; that is, For any state i we have by Eq. 26 «< = Wi-tfi-M Rearranging, However, under the inductive hypothesis we may replace u^ by its equivalent w#M_i/#z-_M. Hence or However, 1 - #M - ^^^ = qitM, since jri>w + ^.jf + qifi+1 =1, and therefore which concludes the proof. Thus we may write «21 MULTI-ELEMENT PATTERN MODELS l6l and so forth. Since the w/s must sum to unity, z/0 also is determined. To illustrate the application of this technique, we consider some simple cases. For the noncontingent case discussed in Sec. 1.3. 77 = 7T21 = 7TU 1 — 7T =^ 77*12 == 77*22' By Eq. 24 we have N - f 4M_i = c ~ (1 - TT). Applying the technique of the previous paragraph, 7T) (1-77) u_2 7rc[(JV - 1)/N] (N - l Ml (1 - 7r)c(2/JV) 2(1 -TT) and in general j^ ^ (N - k + !)TT wfc_i /c(l - 77) This result has two interesting features. First, we note that the asymptotic probabilities are independent of the conditioning parameter c. Second, the ratio of uk to uk_± is the same as that of neighboring terms vr)-^ and in the expansion of [TT + (1 — ir)]N. Therefore the asymptotic prob- abilities in this case are binomially distributed. For a population of subjects whose learning is described by the model, the limiting proportion of subjects having all TV patterns conditioned to A± is TT^; the proportion having all but one of the ./V patterns conditioned to A± is and so on. For the case of simple contingent reinforcement, uk (N — fc 4- lV2ic jk-rr-^c __ (N — k + IVai ~ ~~ jy / JV ~ /C7T12 Again we note that the ^ are independent of c. Further the ratio wfc to M,U_I is the same as that of l6s STIMULUS SAMPLING THEORY Therefore the asymptotic state probabilities are the terms in the expansion of • \N — + -rH- _ - ' ^12 ^21 ~T ^127 Explicit formulas for state probabilities are useful primarily as inter- mediary expressions in the derivation of other quantities. In the special case of the pattern model (unlike other types of stimulus sampling models) the strict determination of the response on any trial by the conditioning state of the trial sample permits a relatively direct empirical interpretation, for the moments of the distribution of state probabilities are identical with the moments of the response random variable. Thus in the simple con- tingent case we have immediately for the mean and variance of the response random variable A^ *» w-* Wax + 7T12/ ^21 4" 7712 and Var (A.) =lk r— r- - [£(A°°)]2 *-i AT W W21 + *ia' W21 + * 7T217r12 A bit of caution is needed in applying this last expression to data. If we select some fixed trial n (large enough so that the learning process may be assumed asymptotic), then the theoretical variance for the ^-response totals of a number of independent samples of K subjects on trial n is simply ^[77-2i7ri2/(7r2i + 7ri2)2] by the familiar theorem for the variance of a sum of independent random variables. However, this expression does not hold for the variance of ^-response totals over a block of K successive trials. The additional considerations involved in the latter case are dis- cussed in the next section. 
2.2 Treatment of the Simple Noncontingent Case

In this section we shall consider various predictions that may be derived from the pattern model for simple predictive behavior in a two-choice situation with noncontingent reinforcement. Each trial in the reference experiment begins with the presentation of a ready signal; the subject's task is to respond to the signal by operating one of a pair of response keys, $A_1$ or $A_2$, indicating his prediction as to which of two reinforcing lights will appear. The reinforcing lights are programmed by the experimenter to occur in random sequence, exactly one on each trial, with probabilities that are constant throughout the series and independent of the subject's behavior.

For illustrative purposes, we shall use data from two experiments of this sort. In one of these, henceforth designated the 0.6 series, 30 subjects were run, each for a series of 240 trials, with probabilities of 0.6 and 0.4 for the two reinforcing lights. Details of the experimental procedure, and a more complete analysis of the data than we shall undertake here, are given in Suppes & Atkinson (1960, Chapter 10). In the other experiment, henceforth designated the 0.8 series, 80 subjects were run, each for a series of 288 trials, with probabilities of 0.8 and 0.2 for the two reinforcing lights. Details of the procedure and results have been reported by Friedman et al. (1960). A possibly important difference between the conditions of the two experiments is that in the 0.6 series the subjects were new to this type of experiment, whereas in the 0.8 series the subjects were highly practiced, having had experience with a variety of noncontingent schedules in two previous experimental sessions.

For our present purposes it will suffice to consider only the simplest possible interpretation of the experimental situation in terms of the pattern model. Let $O_1$ denote the more frequently occurring reinforcing light and $O_2$ the less frequent light. We then postulate a one-to-one correspondence between the appearance of light $O_i$ and the reinforcing event $E_i$ which is associated with $A_i$ (the response of predicting $O_i$). Also we assume that the experimental conditions determine a set of $N$ distinct stimulus patterns, exactly one of which is present at the onset of any given trial. Since, in experiments of the sort under consideration, the experimenter usually presents the same ready signal at the beginning of every trial, we might assume that $N$ would necessarily equal unity. However, we shall not impose this restriction on the model. Rather, we shall let $N$ appear as a free parameter in theoretical expressions; then we shall seek to determine from the data the value of $N$ required to minimize the disparities between theoretical and observed values. If the data of a particular experiment yield an estimate of $N$ greater than unity and if, with this estimate, the model provides a satisfactory account of the empirical relationships in question, we shall conclude that the learning process proceeds as described by the model but that, regardless of the experimenter's intention, the subjects are sampling a population of stimulus patterns. The pattern effective at the onset of a given trial might comprise the experimenter's ready signal together with stimulus traces (perhaps verbally mediated) of the reinforcing events and responses of one or more preceding trials.
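Although the chapter proceeds analytically, a Monte Carlo protocol generator is sometimes convenient for checking derivations against simulated data. The following sketch (not part of the original) implements the pattern model for the noncontingent reference experiment; the parameter values are arbitrary stand-ins, not estimates from either series.

    # Sketch: Monte Carlo protocol generator for the simple noncontingent
    # two-choice experiment under the N-element pattern model.
    import random

    def simulate_subject(N=4, c=0.3, pi=0.8, trials=288, seed=0):
        rng = random.Random(seed)
        # conditioning state of each pattern: 1 = conditioned to A1, 2 = to A2
        patterns = [rng.choice((1, 2)) for _ in range(N)]
        protocol = []
        for _ in range(trials):
            k = rng.randrange(N)                      # sample one pattern
            response = patterns[k]                    # respond as it is conditioned
            event = 1 if rng.random() < pi else 2     # noncontingent reinforcement
            if event != response and rng.random() < c:
                patterns[k] = event                   # counterconditioning, prob. c
            protocol.append((response, event))
        return protocol

    protocol = simulate_subject()
    p1 = sum(r == 1 for r, _ in protocol[-100:]) / 100
    print("A1 proportion over final 100 trials:", p1)  # tends toward pi over long runs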
It will be apparent that the pattern model could scarcely be expected to l6^ STIMULUS SAMPLING THEORY provide a completely adequate account of the data of two-choice experi- ments run under the conditions sketched above. First, if the stimulus patterns to which the subject responds include cues from preceding events, then it is extremely unlikely that all of the available patterns would have equal sampling probabilities as assumed in the model. Second, the different patterns must have component cues in common, and these would be expected to yield transfer effects (at least on early trials) so that the response to a pattern first sampled on trial n would be influenced by conditioning that occurred when components of that pattern were present on earlier trials. However, the pattern model assumes that all of the patterns avail- able for sampling are distinct in the sense that reinforcement of a response to one pattern has no effect on response probabilities associated with other patterns. Despite these complications, many investigators (e.g., Suppes & Atkin- son, 1960; Estes, 1961b; Suppes & Ginsberg, 1962; Bower, 1961) have found it a useful strategy to apply the pattern model in the simple form presented in the preceding section. The goal in these applications is not the perhaps impossible one of accounting for every detail of the experi- mental results but rather the more modest, yet realizable, one of obtaining valuable information about various theoretical assumptions by comparing manageably simple models that embody different combinations of assump- tions. This procedure is illustrated in the remainder of the section. SEQUENTIAL PREDICTIONS. We begin our application of the pattern model with a discussion of sequential statistics. It should be emphasized that one of the major contributions of mathematical learning theory has been to provide a framework within which the sequential aspects of learn- ing can be scrutinized. Before the development of mathematical models little attention was paid to trial-by-trial phenomena; at the present time, for many experimental problems, such phenomena are viewed as the most interesting aspect of the data. Although we consider only the noncontingent case, the same methods may be used to obtain results for more general reinforcement schedules. We shall develop the proofs in terms of two responses, but the results hold for any number of alternatives. If there are r responses in a given experi- mental application, any one response can be denoted Al and the rest regarded as members of a single class, A2. We consider first the probability of an Al response, given that it occurred and was reinforced on the preceding trial; that is, Pr C4ijW+i | El>nAItn). It is convenient to deal first with the joint probability Pr (AI}n+IE1)UAltn) and to conditionalize later. First we note that Pr (Al>n^El)nA1)n) = £ Pr (A^C^E^A^C^), (30) MULTI-ELEMENT PATTERN MODELS and that Pr (Aiin+lCj}n+1EI}nAljnCi)7l) may be expressed in terms of con- ditional probabilities as | A,,A JPr (A^n \ Ci>n) Pr (C,,J. 
But from the sampling and response axioms the probability of a response on trial n is determined solely by the conditioning state on trial n; that is, the first factor in the expansion can be rewritten simply as Pr (^i,n+i | Q,n+i)- Further, by Axiom Rl, we have For the noncontingent case the probability of an E^ on any trial is inde- pendent of previous events and consequently we may write Next, we note that if ij&ji that is, an element conditioned to Al is sampled on trial n (since an Ar response occurs on n) and thus by Axiom C2 no change in the conditioning state can occur. Putting these results together and substituting in Eq. 30, we obtain Pr (A1)n+lE1>nA1>n) - TT L Pr (C,,n+1 1 E^A^C^ Pr (Q, J and . (316) ) In order to express this conditional probability in terms of the parameters TT, c, TV, and Pr(Al}l), we simply substitute into Eq. 3lb the expression given for Pi(Altn) in Eq. 28 and the corresponding expression for a2jM that would be given by the solution of the difference equation (Eq. 29). Unfortunately, the expression so obtained is extremely cumbersome to work with. Consequently it is usually preferable in working with data to proceed in a different way. STIMULUS SAMPLING THEORY Suppose the data to be treated consist of proportions of occurrences of the various trigrams Aktn+lEftnAitn over blocks of M trials. If, for example, M = 5, then in the protocol Trial 1 2 3 4 5 Event A1E1 A& A F •™-'2i\ A^ Ip' A& There are four opportunities for such trigrams. The combination AltH+l • E1}nA1>n occurs on two of these, A2in+1El)nAl>n on one and Alfn+lElt/nA2fn on the other; hence the proportions of occurrence of these trigrams are 0.50, 0.25, and 0.25, respectively. To deal theoretically with quantities such as these, we need only average both sides of Eq. 3 la (and the corresponding expressions for other trigrams) over the appropriate block of trials, ob- taining, for example, for the block running from trial n through trial n + M- 1 1 n+M-l ^ ra+jMT-1 Pm = T7 2 K(A1,n'+iE1,n.Alin,) = - 2 a2,n,=7ra2(n,M), (32a) M n'=n M n'=n where a2(/2, M) is the average value of the second moment of the response probabilities over the given trial block. By strictly analogous methods we can derive theoretical expressions for other trigram proportions : M ri=n — l,n'+l2,n'l,n' M n'=n = (1 - TT) foc2(n, M) - ~ GCiCn, M)l , (32c) L N J ^ n+Jf-l = — 2 Pr (XlX-fl^M'^.n') JV n'-n = (1 - ^)[«i(», M) - a2(n, M)], (32d) and so on; the quantity a1(?z> M) denoting the average ^-probability (or, equivalently, the proportion of ^-responses) over the given trial block. Now the average moments at can be treated as parameters to be esti- mated from the data in order to mediate theoretical predictions. To illustrate, let us consider a sample of data from the 0.8 series. Over the first 12 trials of the TT = 0.8 series, the observed proportion of y^-responses MULTI-ELEMENT PATTERN MODELS j£7 for the group of 80 subjects was 0.63 and the observed values for the tri- grams of Eq. 32a-d were /?U1 = 0.379, Pll2 = 0.168, pizi = 0.061, and ;?122 = 0.035. Using plu to estimate a2(l, 12), we have from Eq. 32a 0.379 = 0.8^(1, 12)], which yields as our estimate £2(1, 12) = 0.47. Now we are in a position to predict the value of p122. Substituting the appropriate parameter values into Eq. 32rf, we have pl22 = 0.2(0.63 - 0.47) = 0.032, which is not far from the observed value of 0.035. Proceeding similarly, we can use Eq. 32b to estimate c/N, namely, = 0.168 = 0.8 |7l - -Vo.63) + - - 0.47], LA Nj N J from which ~ = 0.135. 
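A short sketch (not part of the chapter) reproduces this illustrative estimation. The trigram formulas used in the comments are the block-average forms of Eqs. 32a-d as implied by the worked numbers in the text, and alpha2 is rounded to two places, as in the text.

    # Sketch: estimation from the trigram proportions of the first 12 trials
    # of the 0.8 series (observed values quoted in the text).
    pi = 0.8
    alpha1 = 0.63                                   # observed A1 proportion, trials 1-12
    p111, p112, p121_obs, p122_obs = 0.379, 0.168, 0.061, 0.035

    alpha2 = round(p111 / pi, 2)                    # Eq. 32a: p111 = pi * alpha2
    pred_p122 = (1 - pi) * (alpha1 - alpha2)        # Eq. 32d
    # Eq. 32b: p112 = pi * [(1 - c/N)*alpha1 + c/N - alpha2]; solve for c/N
    c_over_N = (p112 / pi - alpha1 + alpha2) / (1 - alpha1)
    pred_p121 = (1 - pi) * (alpha2 - c_over_N * alpha1)   # Eq. 32c

    print("alpha2:", alpha2, " c/N:", round(c_over_N, 3))          # about 0.47 and 0.135
    print("predicted p122:", round(pred_p122, 3), " observed:", p122_obs)
    print("predicted p121:", round(pred_p121, 3), " observed:", p121_obs)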
With this estimate in hand, together with those already obtained for the first and second moments, we can substitute into Eq. 32c and predict the value of $p_{121}$:

$$p_{121} = 0.2[0.47 - 0.135(0.63)] = 0.077,$$

which is somewhat high in relation to the observed value of 0.061. It should be mentioned that the simple estimation method used above for illustrative purposes would be replaced, in a serious application of the model, by a more systematic procedure. For example, one might simultaneously estimate $\bar{\alpha}_2$ and $c/N$ by least squares, employing all eight of the $p_{ijk}$; this procedure would yield a better over-all fit of the theoretical and observed values.

A limitation of the method just described is that it permits estimation of the ratio $c/N$ but not estimation of $c$ and $N$ separately. Fortunately, in the asymptotic case, the expressions for the moments $\alpha_i$ are simple enough so that expressions for the trigrams in terms of the parameters are manageable; and it turns out to be easy to evaluate the conditioning parameter and the number of elements from these expressions. The limit of $\alpha_{1,n}$ for large $n$ is, of course, $\pi$ in the simple noncontingent case. The limit, $\alpha_2$, of $\alpha_{2,n}$ may be obtained from the solution of Eq. 29; however, a simpler method of obtaining the same result is to note that, by definition,

$$\alpha_2 = \sum_{i=0}^{N}\left(\frac{i}{N}\right)^{2}u_i,$$

where $u_i$ again represents the asymptotic probability of the state in which $i$ elements are conditioned to $A_1$. Recalling that the $u_i$ are terms of the binomial distribution, we may then write

$$\alpha_2 = \frac{1}{N^{2}}\sum_{i=0}^{N} i^{2}\binom{N}{i}\pi^{i}(1-\pi)^{N-i}.$$

The summation is the second raw moment of the binomial distribution with parameter $\pi$ and sample size $N$. Therefore

$$\alpha_2 = \frac{N\pi(1-\pi) + N^{2}\pi^{2}}{N^{2}}. \qquad (33)$$

Using Eq. 33 and the fact that $\lim_{n\to\infty}\Pr(A_{1,n}) = \pi$, we have

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{1,n}A_{1,n}) = \pi\left(1 - \frac{1}{N}\right) + \frac{1}{N}. \qquad (34a)$$

By identical methods we can establish that

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{1,n}A_{2,n}) = \pi\left(1 - \frac{1}{N}\right) + \frac{c}{N}, \qquad (34b)$$

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{2,n}A_{1,n}) = \pi\left(1 - \frac{1}{N}\right) + \frac{1-c}{N}, \qquad (34c)$$

and

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{2,n}A_{2,n}) = \pi\left(1 - \frac{1}{N}\right). \qquad (34d)$$

With these formulas in hand, we need only apply elementary probability theory to obtain expressions for dependencies of responses on responses or responses on reinforcements, namely,

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid A_{1,n}) = \pi + \frac{(1-\pi)(1-c)}{N}, \qquad (35a)$$

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid A_{2,n}) = \pi - \frac{\pi(1-c)}{N}, \qquad (35b)$$

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{1,n}) = \pi + \frac{(1-\pi)c}{N}, \qquad (35c)$$

and

$$\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{2,n}) = \pi - \frac{\pi c}{N}. \qquad (35d)$$

Given a set of trigram proportions from the asymptotic data of a two-choice experiment, we are now in a position to achieve a test of the model by using part of the data to estimate the parameters $c$ and $N$, and then substituting these estimates into Eq. 34a-d and 35a-d to predict the values of all eight of these sequential statistics. We shall illustrate this procedure with the data of the 0.6 series. The observed transition frequencies $F(A_{i,n+1} \mid E_{j,n}A_{k,n})$ for the last 100 trials, aggregated over subjects, are as follows:

                    A1     A2
    E1 A1          748    298
    E2 A1          394    342
    E1 A2          462    306
    E2 A2          186    264

An estimate of the asymptotic probability of an $A_1$-response given an $A_1E_1$ event on the preceding trial can be obtained by dividing the first entry in row one by the sum of the row; that is, $\Pr(A_1 \mid E_1A_1) = 748/(748 + 298) = 0.715$. But, if we turn to Eq. 34a, we note that $\lim_{n\to\infty}\Pr(A_{1,n+1} \mid E_{1,n}A_{1,n}) = \pi(1 - 1/N) + 1/N$. Hence, letting $0.715 = 0.6(1 - 1/N) + 1/N$, we obtain an estimate⁷ of $N = 3.48$. Similarly $\Pr(A_1 \mid E_1A_2) = 462/(462 + 306) = 0.602$, which by Eq. 34b is an estimate of $\pi(1 - 1/N) + c/N$; using our values of $\pi$ and $N$ we find that $c/N = 0.174$ and $c = 0.605$.
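The following sketch (not part of the chapter) carries out this estimation scheme numerically: it recovers N and c from the first and third rows of the frequency table and then evaluates Eqs. 34a-d and 35a-d, the quantities collected in Table 3 below. The equation forms are those reconstructed above.

    # Sketch: parameter estimation and Table 3 predictions for the 0.6 series.
    pi = 0.6
    freq = {"E1A1": (748, 298), "E2A1": (394, 342),
            "E1A2": (462, 306), "E2A2": (186, 264)}
    obs = {k: a / (a + b) for k, (a, b) in freq.items()}

    # Eq. 34a:  Pr(A1 | E1A1) = pi*(1 - 1/N) + 1/N   -> solve for N
    N = (1 - pi) / (obs["E1A1"] - pi)
    # Eq. 34b:  Pr(A1 | E1A2) = pi*(1 - 1/N) + c/N   -> solve for c
    c = N * (obs["E1A2"] - pi * (1 - 1 / N))

    pred = {
        "Pr(A1|E1A1)": pi * (1 - 1/N) + 1/N,          # Eq. 34a
        "Pr(A1|E2A1)": pi * (1 - 1/N) + (1 - c)/N,    # Eq. 34c
        "Pr(A1|E1A2)": pi * (1 - 1/N) + c/N,          # Eq. 34b
        "Pr(A1|E2A2)": pi * (1 - 1/N),                # Eq. 34d
        "Pr(A1|A1)":   pi + (1 - pi) * (1 - c)/N,     # Eq. 35a
        "Pr(A1|A2)":   pi - pi * (1 - c)/N,           # Eq. 35b
        "Pr(A1|E1)":   pi + (1 - pi) * c/N,           # Eq. 35c
        "Pr(A1|E2)":   pi - pi * c/N,                 # Eq. 35d
    }
    print("N =", round(N, 2), " c =", round(c, 3))    # about 3.48 and 0.61
    for name, value in pred.items():
        print(name, round(value, 3))                  # compare with Table 3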
Having estimated $c$ and $N$, we may now generate predictions for any of our asymptotic quantities. Table 3 presents predicted and observed values for the quantities given in Eq. 34a to Eq. 35d. Considering that only two degrees of freedom have been utilized in estimating parameters, the close correspondence between theoretical and observed quantities in Table 3 may be interpreted as giving considerable support to the assumptions of the model. A similar analysis of the asymptotic data from the 0.8 series, which has been reported elsewhere (Estes, 1961b), yields comparable agreement between theoretical and observed trigram proportions. The estimate of $c/N$ for the 0.8 data is very close to that for the 0.6 data (0.172 versus 0.174), but the estimates of $c$ and $N$ (0.31 and 1.84, respectively) are both smaller for the 0.8 data. It appears that the more highly practiced subjects of the 0.8 series are, on the average, sampling from a smaller population of stimulus patterns and at the same time are less responsive to the reinforcing lights than the more naive subjects of the 0.6 series.

Table 3  Predicted (Pattern Model) and Observed Values of Sequential Statistics for Final 100 Trials of the 0.6 Series

    Asymptotic Quantity        Predicted   Observed
    Pr (A1 | E1 A1)              0.715       0.715
    Pr (A1 | E2 A1)              0.541       0.535
    Pr (A1 | E1 A2)              0.601       0.601
    Pr (A1 | E2 A2)              0.428       0.413
    Pr (A1 | A1)                 0.645       0.641
    Pr (A1 | A2)                 0.532       0.532
    Pr (A1 | E1)                 0.669       0.667
    Pr (A1 | E2)                 0.496       0.489

⁷ For any one subject, $N$ must, of course, be an integer. The fact that our estimation procedures generally yield nonintegral values for $N$ may signify that $N$ varies somewhat between subjects, or it may simply reflect some contamination of the data by sources of experimental error not represented in the model.

Since no model can be expected to give a perfect account of fallible data arising from real experiments (as distinguished from the idealized experiments to which the model should apply strictly), it is difficult to know how to evaluate the goodness-of-fit of theoretical to observed values. In practice, investigators usually proceed on a largely intuitive basis, evaluating the fit in a given instance against that which it appears reasonable to hope for in the light of what is known about the precision of experimental control and measurement. Statistical tests of goodness-of-fit are sometimes possible (discussions of some tests which may be used in conjunction with stimulus sampling models are given in Suppes & Atkinson, 1960); however, statistical tests are not entirely satisfactory, taken by themselves, for a sufficiently precise test will often indicate significant differences between theoretical and observed values even in cases in which the agreement is as close as could reasonably be hoped for. Generally, once a degree of descriptive accuracy that appears satisfactory to investigators familiar with the given area has been attained, further progress must come largely via differential tests of alternative models. In the case of the two-choice noncontingent situation the ingredients for one such test are immediately at hand; for we developed in Sec. 1.3 a one-element, guessing-state model that is comparable to the N-element model with respect to the number of free parameters and that to many might seem equally plausible on psychological grounds.
These models embody the ^11-or-none assumption concerning the formation of learned associations, but they differ in the means by which they escape the deter- ministic features of the simple one-element model It will be recalled that the one-element model cannot handle the sequential statistics considered MULTI-ELEMENT PATTERN MODELS IJI in this section because it requires, for example, a probability of unity for response Ai on any trial following a trial on which Ai occurred and was reinforced. In the TV-element model (with N > 2), there is no such con- straint, for the stimulus pattern present on the preceding reinforced trial may be replaced by another pattern, possibly conditioned to a different response, on the following trial. In the guessing-state model there is no strict determinacy, since the ^-response may occur on the reinforced trial by guessing if the subject is in state C0; and, if the reinforcement were not effective, a different response might occur, again through guessing, on the following trial. The case of the guessing-state model with c = 0 (c, it will be recalled, being the counterconditioning parameter) provides a two-parameter model which may be compared with the two-parameter, N-element model. We will require an expression for at least one of the trigram proportions studied in connection with the N-element model. Let us take Pr (Al>n+1El>nAl>n) for this purpose. In Sec. 1.3 we obtained an expression for Pr(/l1>n+1 1 EltnAl>n) for the case in which c = 0, and thus we can write at once Since we are interested only in the asymptotic case, we drop the w-sub- script from the right-hand side of Eq. 36a and have for the desired theoretical asymptotic expression />iu = T[*I + «f0(l + Oil- (36*) Substituting now into Eq. 36b the expressions for wx and UQ derived in Sec. 1.3, we obtain finally 2 - ^ To apply this model to the asymptotic data of the 0.6 series, we may first evaluate the parameter e by setting the observed proportion of Ar responses over the terminal 100 trials, 0.593, equal to the right-hand side of Eq. 21 and solving for e, namely, n w - ~ _ 0.6(0:6 + 0.2c) ~~ 0.52 + 0.24e ' and c = 2.315. STIMULUS SAMPLING THEORY IJ2 Now, by introducing this value for e into Eq. 36c and simplifying, we obtain the prediction ^ = a2782 + Since the observed value of />1U for the 0.6 data is 0.249, it is apparent that no matter what value (in the admissible range 0 < c" < 1) is chosen for the parameter c" the value predicted from the guessing state model will be too large. Further analysis, using the methods illustrated, makes it clear that for no combination of parameter estimates can the guessing- state model achieve predictive accuracy comparable to that demonstrated for the N-element model in Table 3. Although this one comparison cannot be considered decisive, we might be inclined to suspect that for inter- pretation of two-choice, probability learning the notion of a reaccessible guessing state is on the wrong track, whereas the N-element sampling model merits further investigation. MEAN AND VARIANCE OF Al RESPONSE PROPORTION. By letting 7rn = 7r21 = TT in Eq. 28, we have immediately an expression for the probability of an ^-response on trial n in the noncontingent case, namely, Pr(A,ra) = --[--Pr(^)](l-^P (37) If we define a response random variable An which equals 1 or 0 as A: or Az, respectively, occurs on trial 72, then the right side of Eq. 37 also represents the expectation of this random variable on trial n. 
The expected number of ^-responses in a series of K trials is then given by the sum- mation of Eq. 37 over trials, In experimental applications we are frequently interested in the learning curve obtained by plotting the proportion of ^-responses per J^-trial block. A theoretical expression for this learning function is readily ob- tained by an extension of the method used to derive Eq. 38. Let x be the ordinal number of a ^T-trial block running from trial K(x — 1) + 1 to Kx, where # = 1,2,..., and define P(x) as the proportion of y^-responses in block x. Then rKx K(x-l) i rKx PW = i\ 2 A. Lw=l N r / c \K~\ t c \K(X-U = ^_Jl[7r_Pr(J4 )] h_ 1-- 1-- . (39a) Kc ' L \ W J\ NJ The value of Pr (AltJ) should be in the neighborhood of 0.5 if response bias MULTI-ELEMENT PATTERN MODELS does not exist. However, to allow for sampling deviations we may elimi- nate Pr C4M) in favor of the observed value of P(l). This can be done in the following way. Note that Solving for [TT — Pr^t-^] and substituting the result in Eq. 39<2, we obtain \K(x--L) . (39ft) Applications of Eq. 39£ to data have led to results that are satisfying in some respects but perplexing in others (see, e.g., Estes, 1959a). In most instances the implication that the learning curve should have TT as an asymp- tote has been borne out (Estes, 1961b, 1962), and further, with a suitable choice of values for c/N, the curve represented by Eq. 39Z> has served to describe the course of learning. However, in experiments run with naive subjects, as has been nearly always the case, the value of c/N required to fit the mean learning curve has been substantially smaller than the value required to handle the sequential statistics discussed in Sec. 2.1. Consider, for example, the learning curve for the 0.6 series plotted by 20 trial blocks. The observed value of P(l) is 0.48 and the value of c/N estimated from the sequential statistics of the second 20-trial block is 0.12. With these param- eter values, Eq. 396 yields a prediction of 0.59 for P(3) and the theoretical curve is essentially at asymptote from block 4 on. The empirical learning curve, however, does not approach 0.59 until block 6 and is still short of asymptote at the end of 12 blocks, the mean proportion of ^-responses over the last five blocks being 0.593 (Suppes & Atkinson, 1960, p. 197). In the case of the 0.8 series there is a similar disparity between the value of c/N estimated from the sequential statistics and the value estimated from the mean learning curve. As we have already noted, an optimal account of the trigram proportions Pr (Aktn+lEj>nAitn) requires a c/JV-value of approximately 0.17. But, if this estimate is substituted into Eq. 39n) - Pr (Ai>n+k) Pr (^ J. (40) First, we can establish by induction that Pr (AItn+kAl} J = TT Pr (4lf J - [*• Pr (^ J - Pr (^^ This formula is obviously an identity for k = 1. Thus, assuming that the formula holds for trials n and n + k, we may proceed to establish it for trials n and n + k + 1. First we use our standard procedure to expand the desired quantity in terms of reinforcing events and states of condition- ing. Letting Cj>n denote the state in which exactly j of the N elements are conditioned to response A^ we may write = 2, i,3 Now we can make use of the assumptions that specify the noncontingent case to simplify the second factor to ^Pr (C3;n+kAlt n) and (1 - TT) Pr (C,>+^ljn) for z = 1,2, respectively. Also, we may apply the learning axioms to the first factor to obtain D^ I* r A ^ ;2 a. 
/i nra-c)./ , cq + i)i Pr (Al>n+M I £i,n+fcC,>+^1>n) = — 5 + ^1 - -j [ ^ + ^ J and / c\j Pr (A1>n+jc+i I E2in+7cC3-in+JCAl}n) = II — -— I— . Combining these results, we have STIMULUS SAMPLING THEORY Substitution into this expression in terms of our inductive hypothesis yields Pr (AI>n+k+1Al}n) = l ~ ^ Pr (A^ - fr Pr (Al>n) - Pr (A = TT Pr (4^ - [TT Pr (Al>n) - Pr (X^^ as required. We wish to take the limit of the right side of Eq. 40 as n -* oo in order to obtain the covariance of the response random variable on any two trials at asymptote. The limits of Pr (A1>n) and Pr (Alin+k) we know to be equal to TT, and from Eq. 35 we have the expression for the limit of Pr (Alf/n+IAlfn). Making the appropriate substitutions in Eq. 40, yields the simple result lim Cov (AB N J\ N) N \-N>- ^ Now we are ready to compute Var (A^), the variance of ^-response frequencies in a block of K trials at asymptote, by applying the standard theorem for the variance of a sum of random variables (Feller, 1957): Var (i,) = lim [x Var (AJ + 2 f | Cov (An+,An+i)l . Since * l' * lim£(AK2) = TT • 1 + (1 - TT) • 0 = 77, n-»-oo the limiting variance of An is simply lim Var (AJ = lim £(AW2) - lim £(An)2 = TT - rr\ Substituting this result and that for lim Cov (A^AJ into the general expression for Var (A^), we obtain Application of this formula can be conveniently illustrated in terms of the asymptotic data for the 0.8 series. Least-squares determinations MULTI-ELEMENT PATTERN MODELS 777 of c/N and TV from the trigram proportions (using Eq. 34a-d) yielded estimates of 0.17 and 1.84, respectively. Inserting these values into Eq. 42, we obtain for a 48-trial block at asymptote Var (A^) = 37.50; this variance corresponds to a standard deviation of 6.12. The observed standard deviation for the final 48-trial block was 6.94. Thus the theory predicts a variance of the right order of magnitude but, as anticipated, underestimates the observed value. From the many other statistics that can be derived from the TV-element model for two-choice learning data, we take one final example, selected primarily for the purpose of reviewing the technique for deriving sequential statistics. This technique is so generally useful that the major steps should be emphasized : first, expand the desired expression in terms of the con- ditioning states (as done, for example, in the case of Eq. 30); second, conditionalize responses and reinforcing events on the preceding sequence of events, introducing whatever simplifications are permitted by the boundary conditions of the case under consideration; third, apply the axioms and simplify to obtain the appropriate result. These steps are now followed in deriving an expression of considerable interest in its own right — the probability of an ^-response following a sequence of exactly reinforcing events : 77 (1 — 7T) ~~77t - \ 7TV(1 — TT) ~~77* - \ 7TV(1 — TT) i,j - Pr (EItn^,i ' ' ' E^nE^n_^ | Cin_^) Pr (C^-i 2 TT <43) STIMULUS SAMPLING THEORY The derivation has a formidable appearance, mainly because we have spelled out the steps in more than customary detail, but each step can readily be justified. The first involves simply using the definition of a conditional probability, Pr (A \ S) = Pr (AB)/Px (£), together with the fact that in the simple noncontingent case Pr (E1>n) = TT and Pr (E2}U) = 1 — TT for all n and Pr (El>n+v_-L . . . E^^E^iU^ = TTV(\ — TT). 
The second step introduces the conditioning states Ci>n+v and Cj>n_l9 denoting the states in which i elements are conditioned to A^ on trial n + v and j elements on trial n — 1, respectively. Their insertion into the right-hand expression of line 1 is permissible, since the summation of Pr (Q) over all values of z is unity and similarly for the summation of Pr (Q). The third step is based solely on repeated application of the defining equation for a conditional probability, which permits the expansion Pr (ABC ... J) = Pr (A \ BC . . . /) Pr (B \ C .../)... Pr (J). The fourth step involves assumptions of the model : the conditionalization of Al}7i+v on the preceding sequence can be reduced to Pr (Alfn+v \ Cij7l+v) = i/N, since, according to the theory, the preceding history affects response probability on a given trial only insofar as it determines the state of con- ditioning, that is, the proportion of elements conditioned to the given re- sponse. The decomposition of Pr(E1)n+v_l...EI>nEZin_lCjjn_l) into nv(l - *•) Pr (C,^) is justified by the special assumptions of the simple noncontingent case. The fifth step involves calculating, for each value of j on trial n — 1 , the expected proportion of elements conditioned to A1 on trial n + v. There are two main branches to the process, starting with state Q on trial n — 1. In one, which by the axioms has probability 1 — c(j/N), the state of conditioning is unchanged by the jE^-event on trial n — 1 ; then, applying Eq. 37 with TT = 1 (since from trial n onward we are dealing with a sequence of E^s) and Pr (Ai}1) =j/N9 we obtain the expression (1 _ [i 1 (y/^)][l - (c/AOr} for the expected proportion of elements connected to A1 on trial n + v in this branch. In the other branch, which has probability c(]/N), applica- tion of Eq. 37 with TT = 1 and Pr (^1?1) = (j — 1)/N yields the expression {1 - [1 — (j — l)/N](l — c/N)v} for the expected proportion of elements connected to A± on trial n + v. Carrying out the summation overy and using the by-now familiar property of the model that I -f- Pr (C,,^) = Pr (A^) = ?„_!, 3=0 N we finally arrive at the desired expression for probability of A^ following exactly v EI$. MULTI-ELEMENT PATTERN MODELS Application of Eq. 43 can conveniently be illustrated in terms of the 0.8 series. Using the estimate of 0.17 for c/N (obtained previously from the trigram statistics) and taking pn_^ = 0.83 (the mean proportion of A^ responses over the last 96 trials of the 0.8 series), we can compute the following values for the conditional response proportions : V 0 1 2 3 4 Theoretical Observed 0.689 0.695 0.742 0.787 0.786 0.838 0.822 0.859 0.852 0.897 It car/be seen that the trend of the theoretical values represents quite well the trend of the observed proportions over the last 96 trials. Somewhat surprisingly, the observed proportions run slightly above the predicted values. There is no indication here of the "negative recency effect" (decrease in ^-proportion with increasing length of the ^-sequence) reported in a number of published two-choice studies (e.g., Jarvik, 1951; Nicks, 1959). It may be significant that no negative recency effect is observed in the 0.8 series, which, it will be recalled, involved well-practiced subjects who had had experience with a wide range of 77- values in preceding series. However, the effect is observed in the 0.6 series, conducted with subjects new to this type of experiment (cf. Suppes & Atkinson, 1960, pp. 212-213). 
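As a numerical aside (not part of the chapter), the run-length prediction can be written in the closed form implied by the derivation just given, namely $1 - (1 - c/N)^{v} + p_{n-1}(1 - c/N)^{v+1}$, and this form reproduces the theoretical row of the table above. The sketch below evaluates it with the values quoted in the text ($c/N = 0.17$, $p_{n-1} = 0.83$).

    # Sketch: Eq. 43 in closed form, evaluated for the 0.8 series.
    c_over_N = 0.17
    p = 0.83                         # mean A1 proportion over the final 96 trials
    r = 1 - c_over_N

    for v in range(5):
        theoretical = 1 - r**v + p * r**(v + 1)
        print(v, round(theoretical, 3))
    # prints 0.689, 0.742, 0.786, 0.822, 0.852, the theoretical values tabled above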
This differential result appears to support the idea (Estes, 1962) that the negative recency phenomenon is attributable to guessing habits carried over from everyday life to the experimental situation and extinguished during a long training series conducted with noncontingent reinforcement. We shall conclude our analysis of the JV-element pattern model by proving a very general "matching theorem." The substance of this theorem is that, so long as either an E^ or an JE2 reinforcing event occurs on each trial, the proportion of ^-responses for any individual subject should tend to match the proportion of ^-events over a sufficiently long series of trials regardless of the reinforcement schedule. For purposes of this derivation, we shall identify by a subscript x the probabilities and events associated with the individual re in a popula- tion of subjects; thus pxlfU will denote probability of an ^-response by subject x on trial n, and E^ and A^^ will denote random variables which take on the values 1 or 0 according as an J£revent and an A^ response do or do not occur in this subject's protocol on trial n. With this notation, the probability of an ^-response by subject x on trial n + 1 can be expressed by the recursion ~ (Exi.n ~ A^, J. (44) STIMULUS SAMPLING THEORY The genesis of Eq. 44 should be reasonably obvious if we recall that Pxi n is eclual to tiie proportion of elements currently conditioned to the ^-response. This proportion can change only if an ^-event occurs on a trial when a stimulus pattern conditioned to A2 is sampled, in which case E^ n _ A^ n = 1 — 0 = 1, or if an ^-event occurs on a trial when a pattern conditioned to Al is sampled, in which case E^n-A^^O-^-l. In the first case the proportion of patterns conditioned to A± increases by I/ N if conditioning is effective (which has probability c) and in the second case this proportion decreases by I/TV (again with probability c). Consider now a series of, say, n* trials: we can convert Eq. 44 into an analogous recursion for response proportions over the series simply by summing both sides over n and dividing by n*9 namely, Now we subtract the first sum on the right from both sides of the equation and distribute the second sum on the right to obtain , n «=i The limit of the left side of this last equation is obviously zero as 72* -> oo ; thus taking the limit and rearranging we have8 lim \ 2A,1>B = lim ^ 2Eal>TC. (45) n*-»oo n* n=l TZ*^OO 71* n=l 8 Equation 45 holds only if the two limits exist, which will be the case if the reinforcing event on trial n depends at most on the outcomes of some finite number of preceding trials. When this restriction is not satisfied, a substantially equivalent theorem can be derived simply by dividing both sides of the equation immediately preceding by 1 * — 2 EXI * before passing to the limit; that is * C n = l n* n= «.=! n=l Except for special cases in whieh the sum in the denominators converges, the limit of the left-hand side is zero and n* 2A*i,« lim - = 1. MULTI-ELEMENT PATTERN MODELS j#r To appreciate the strength of this prediction, one should note that it holds for the data of an individual subject starting at any arbitrarily selected point in a learning series, provided only that a sufficiently long block of trials following that point is available for analysis. 
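A small simulation (not from the chapter) makes the strength of the matching prediction concrete; the schedule below is deliberately response-contingent, and all parameter values are arbitrary.

    # Sketch: Monte Carlo check of the matching theorem (Eq. 45) for one
    # simulated subject under a response-contingent schedule:
    # Pr(E1) = 0.3 after an A1 response and 0.9 after an A2 response.
    import random

    rng = random.Random(1)
    N, c, trials = 6, 0.25, 20000
    patterns = [rng.choice((1, 2)) for _ in range(N)]
    a1 = e1 = 0
    for _ in range(trials):
        k = rng.randrange(N)
        response = patterns[k]
        p_e1 = 0.3 if response == 1 else 0.9      # schedule depends on the response
        event = 1 if rng.random() < p_e1 else 2
        if event != response and rng.random() < c:
            patterns[k] = event                   # counterconditioning
        a1 += (response == 1)
        e1 += (event == 1)

    print("A1 proportion:", a1 / trials)          # the two proportions should be
    print("E1 proportion:", e1 / trials)          # nearly equal, as Eq. 45 asserts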
Further, it holds regard- less of the values of the parameters N and c (provided that c is not zero) and regardless of the way in which the schedule of reinforcement may depend on preceding events, the trial number, the subject's behavior, or even events outside the system (e.g., the behavior of another individual in a competitive or cooperative social situation). Examples of empirical applications of this theorem under a variety of reinforcement schedules are to be found in studies reported by Estes (1957a) and Friedman et al. (1960). 2.3 Analysis of a Paired-Comparison Learning Experiment In order to exhibit a somewhat different interpretation of the axioms of Sec. 2.1, we shall now analyze an experiment involving a paired-com- parison procedure. The experimental situation consists of a sequence of discrete trials. There are r objects, denoted Ai (i = 1 to r). On each trial two (or more) of these objects are presented to the subject and he is re- quired to choose between them. Once his response has been made the trial terminates with the subject winning or losing a fixed amount of money. The subject's task is to win as frequently as possible. There are many aspects of the situation that can be manipulated by the experimenter ; for example, the strategy by which the experimenter makes available certain subsets of objects from which the subjects must choose, the schedule by which the experimenter determines whether the selection of a given object leads to a win or loss, and the amount of money won or lost on each trial. The particular experiment for which we shall essay a theoretical analysis was reported by Suppes and Atkinson (I960, Chapter 11). The problem for the subjects involved repeated choices from subsets of a set of three objects, which may be denoted Alt A^ and A%. On each trial one of the following subsets of objects was presented: (A^AJ, UiA), (A2A3\ or (AiA2A^). The subject selected one of the objects in the presentation set; then the trial terminated with a win or a loss of a small sum of money. The four presentation sets (A^A^9 (AtA^9 (A2A3), and (X1^2X8) occurred with equal probabilities over the series of trials. Further, if object Ai was selected on a trial, then with probability ^ the subject lost and with probability 1 — Af he won the predesignated amount. More complex schedules of reinforcement could be used; of particular interest is a sched- ule in which the likelihood of a win following the selection of a given object depends on the other available objects in the presentation group. For STIMULUS SAMPLING THEORY example, the probability of a win following an A1 choice could differ, depending on whether the OM2), (A^AZ\ or (A^AJ presentation group occurred. The analysis of these more complex schedules does not introduce new mathematical problems and may be pursued by the same methods we shall use for the simpler case. Before the axioms of Sec. 2.1 can be applied to the present experiment, we need to provide an interpretation of the stimulus situation confronting the subject from trial to trial. The one we select is somewhat arbitrary and in Sec. 3 alternative interpretations are examined. Of course, dis- crepancies between predicted and observed quantities will indicate ways in which our particular analysis of the stimulus needs to be modified. 
We represent the stimulus display associated with the presentation of the pair of objects (A^) by a set Sis of stimulus patterns of size N; the triple of objects (A^A^ is represented by a set of stimulus patterns Sls& of size N*. Thus there are four sets of stimulus patterns, and we assume that the sets are pairwise disjoint (i.e., have no patterns in common). Since, in the model under consideration, the stimulus element sampled on any trial represents the full pattern of stimulation effective on the trial, one might wonder why a given combination of objects, say (A^A^), should have more than one element associated with it. It might be remarked in this connection that in introducing a parameter N to represent set size we do not necessarily assume TV > 1. We simply allow for the possibility that such variations in the situation or different orders of presentation of the same set of objects on different trials might give rise to different stimulus patterns. The assumption that the stimulus patterns associated with a given presentation set are pairwise disjoint does not seem appealing on common-sense grounds; nevertheless, it is of interest to see how far we can go in predicting the data of a paired-comparison learning experiment with the simplified model incorporating this highly restrictive assumption. Even though we cannot attempt to handle the positive and negative transfer effects that must occur between different members of the set of patterns associated with a given combination of objects during learning, we may hope to account for statistics of asymptotic data. When the pair of objects (AtA-) is presented, the subject must select At or Aj (i.e., make response A{ or Aj); hence all pattern elements in Sif become conditioned to A f or A$. Similarly, all elements in SISIB become conditioned to Ai9 A& or A3. When (AtAj) is presented, the subject samples a single pattern from Sy and makes the response to which the pattern is conditioned. The final step, before applying the axioms of Sec. 2.1, is to provide an interpretation of reinforcing events. Our analysis is as follows : if (A^) is presented and the ^-object is selected, then (a) the Ei reinforcing event MULTI-ELEMENT PATTERN MODELS 283 occurs if the ^-response is followed by a win and (6) the £revent occurs if the ^-response is followed by a loss. If (AtAjAk) is presented and the ^-object is selected, then (a) the £revent occurs if the ^-response is followed by a win and (6) Ej or Ek occurs, the two events having equal probabilities, if the Ar response is followed by a loss. This collection of rules represents only one way of relating the observable trial outcomes to the hypothetical reinforcing events. For example, when Ai is selected from (A.AjA^) and followed by a loss, rather than having £, or Ek occur with equal likelihoods one might postulate that they occur with probabili- ties dependent on the ratio of wins following .^-responses to wins follow- ing ^.-responses over previous trials. Many such variations in the rules of correspondence between trial outcomes and reinforcing events have been explored; these variations become particularly important when the experimenter manipulates the amount of money won or lost, the magnitude of reward in animal studies, and related variables (see Estes, 1960b; Atkinson, 1962; and Suppes & Atkinson, I960, Chapter 11, for dis- cussions of this point). 
In analyzing the model we shall use the following notation : = occurrence of an ^-response on the rath presentation of (AtAj) [note that the reference is not to the nth trial of the experiment but to the nth presentation of (A^]. = a win on the nth presentation of (AtA^ Z = a loss on the nth presentation of (AtAs). We now proceed to derive the probability of an ^-response on the 72th presentation of (A^); namely Pr (A$). First we note that the state of conditioning of a stimulus pattern can change only when it is sampled. Since all of the sets of stimulus patterns are pairwise disjoint, the sequence of trials on which (AtAj) is presented forms a learning process that may be studied independently of what happens on other trials (see Axiom C4) ; that is, the interspersing of other types of trials between the nth and (n 4- l)st presentation of (A^ has no effect on the conditioning of patterns in set S^. We now want to obtain a recursive expression for Pr (A($). This can be done by using the same methods employed in Sec. 2.2. But, to illustrate another approach, we proceed differently in this case. Let Pr U$>) = yn and Pr(^>) = 1 - yn. The possible changes in yn are given in Fig. 6. With probability 1 — c no change occurs in conditioning, regardless of trial events, hence yn+1 = yn; with probability c change can occur. If Ai occurs and is followed by a win, then the sampled element remains conditioned to Ai ; however, if a loss occurs, the sampled element (which was conditioned to Aj) becomes conditioned to Af and thus yB+1 = yn — IJN. If Aj occurs and is followed by a win, then 184 STIMULUS SAMPLING THEORY ^B+1 = yn; however, if it is followed by a loss, the sampled element (which was conditioned to A^ becomes conditioned to Ai9 hence yn+1 — yn + 1/N. Putting these results together, we have yn - which simplifies to the expression Solving this difference equation, we obtain Pr (A?S) = -^— lr N - (47) We now consider Pr(A(^); for simplicity let ccn = Pr ^ = Pr (AgW), and 1 - 4r, A2-, and ^-responses in successive 20-trial blocks. The three curves appear to be stable over the last 10 or so blocks; consequently we treat the data over trials 301 to 400 as asymptotic. By Eq. 47 and Eq. 49a-c we may generate predictions for Pr (A^) and Pr (A(^}. Given these values and the fact that the four presentation sets occur with equal probabilities, we may, as previously shown, generate 0.1 — 2 34 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Blocks of 20 trials Fig. 8. Observed proportion of ^-responses in successive 20-trial blocks for paired comparison experiment. lB8 STIMULUS SAMPLING THEORY predictions for Pr(Ai>cc). The predicted values for these quantities and the observed proportions over the last 100 trials are presented in Table 4. The correspondence between predicted and observed values is very good, particularly for Pr (^z-jCO) and Pr (A&£). The largest discrepancy is for the triple presentation set, in which we note that the observed value of Pr (A(^) is 0.041 above the predicted value of 0.507. The statistical problem of determining whether this particular difference is significant is a complex matter and we shall not undertake it here. 
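The following sketch (not from the chapter) iterates the presentation-by-presentation recursion for Pr(A_i) on the (A_i A_j) subsequence and compares it with the explicit solution of Eq. 47, both in the forms implied by the verbal derivation above; the loss probabilities lam_i and lam_j, and the values of c and N, are hypothetical choices for the illustration.

    # Sketch: Eq. 46 (recursion) against Eq. 47 (explicit solution) for the
    # probability of choosing A_i on the nth presentation of (A_i A_j).
    lam_i, lam_j = 0.3, 0.6          # Pr(loss | A_i), Pr(loss | A_j); hypothetical
    c, N = 0.4, 3
    y = 0.5                          # Pr(A_i) on the first presentation
    rate = 1 - (c / N) * (lam_i + lam_j)
    asymptote = lam_j / (lam_i + lam_j)   # the less often punished object wins out

    for n in range(1, 16):
        closed = asymptote - (asymptote - 0.5) * rate ** (n - 1)   # Eq. 47
        print(n, round(y, 4), round(closed, 4))                    # the columns agree
        y = y * rate + (c / N) * lam_j                             # Eq. 46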
However, it Table 4 Theoretical and Observed Asymptotic Choice Proportions for Paired-Comparison Learning Experiment Predicted Observed PrWi) 0.464 0.473 Pr (A 2) 0.302 0.294 Pr Wa) 0.234 0.233 Pr(412>) Pr(413>) 0.643 0.706 0.651 0.700 Pr (/^^23)) 0.571 0.561 Pr(4123>) Pr(4123>) Pr(4123)) 0.507 0.282 0.211 0.548 0.258 0.194 should be noted that similar discrepancies have been found in other studies dealing with three or more responses (see Gardner, 1957; Detam- bel, 1955), and it may be necessary, in subsequent developments of the theory, to consider some reinterpretation of reinforcing events in the multiple-response case. In order to make predictions for more complex aspects of the data, it is necessary to obtain estimates of c, N, and N*. Estimation procedures of the sort referred to in Sec. 2.2 are applicable, but the analysis becomes tedious and such details are not appropriate here. However, some com- parisons can be made between sequential statistics that do not depend on parameter values. For example, certain nonparametric comparisons can be made between statistics where each depends on c and N but where the difference is independent of these parameters. Such comparisons are particularly helpful when they permit us to discriminate among different models without introducing the complicating factor of having to estimate parameters. To indicate the types of comparisons that are possible, we may consider the subsequence of trials on which (A^A^ is presented and, in particular, the expression (12) rr V/^n-fl | W n Al,n " n-1^2,7i-lJ » MULTI-ELEMENT PATTERN MODELS l8$ that is, the probability of an ^-response on the (n + l)st presentation of (AiA2\ given that on the nth presentation of (A^A^ an AT occurred and was followed by a win and that on the (n — l)st presentation of OM2) an A2 occurred, followed by a win. To compute this probability, we note that Prf x<<12) M/U2) AtWw U2) A (12) \ ) I 137(12) j(12)TJ7(12) j(12) \ _ rr V/1] .7H-1 KK n ^±l,n yy n-l^n-l) Pr rr (12) Now our problem is to compute the two quantities on the right-hand side of this equation. We first observe that (12)rrr(12) j(12) >, 1,71 ^ 71-1^2.71-1 J *,; where C|^} denotes the conditioning state for set S12 in which z elements are conditioned to Al and N — z to ^42 on ^e wt^i presentation of (A-^- Conditionali2dng and applying the axioms, we may expand the last expres- sion into n • (1 - ^) Pr Further, the sampling and response axioms permit the simplifications = ^7 N' and N Finally, in order to carry out the summation, we make use of the relation which expresses the fact that no change in the conditioning state can occur if the pattern sampled leads to a win (see Axiom C2). Combining these results and simplifying, we have Pr (c igO STIMULUS SAMPLING THEORY Similarly, we obtain Pr (C£li), (506) and, finally, taking the quotient of the last two expressions, (12) N _ — We next consider the same sequential statistic but with the responses reversed on trials n and n — 1 ; namely, Pr (A(l® I PF(12) A(l^ M/d2) jd2) \ rr ^1,71 + 1 I "n ^2,n ^7* -1^1,7* -I/ Interestingly enough, if we compute they turn out to be expressed by the right sides of Eq. 50# and 506, re- spectively. Hence, for all n9 ) j (12) \ lAl>n-l)' Comparable predictions, of course, hold for the subsequences of trials on which (A^A^) or (A2AZ) are presented. Equation 51 provides a test of the theory which does not depend on parameter estimates. Further, it is a prediction that differentiates between this model and many other models. 
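As an illustration (not part of the chapter), the following Monte Carlo sketch checks the parameter-free equality of Eq. 51 by pooling over presentations and simulated subjects, in the same spirit as the pooled counts used in the data analysis that follows; the loss probabilities and the other parameter values are hypothetical.

    # Sketch: Eq. 51 for the (A1 A2) subsequence; given wins on the two
    # preceding presentations with the two responses in either order, the
    # probability of A1 on the next presentation is the same.
    import random

    rng = random.Random(7)
    lam = {1: 0.3, 2: 0.6}                       # Pr(loss | A1), Pr(loss | A2)
    c, N = 0.4, 3
    counts = {"recent win on A1": [0, 0], "recent win on A2": [0, 0]}

    for _ in range(2000):                        # simulated subjects
        patterns = [rng.choice((1, 2)) for _ in range(N)]
        history = []                             # (response, win) per presentation
        for n in range(60):
            k = rng.randrange(N)                 # sample one pattern from S_12
            response = patterns[k]
            if n >= 2:
                (r2, w2), (r1, w1) = history[-2], history[-1]
                if w1 and w2 and r1 != r2:       # wins on both, responses differ
                    key = "recent win on A1" if r1 == 1 else "recent win on A2"
                    counts[key][0] += (response == 1)
                    counts[key][1] += 1
            win = rng.random() >= lam[response]
            if (not win) and rng.random() < c:
                patterns[k] = 3 - response       # counterconditioning after a loss
            history.append((response, win))

    for key, (a1_next, total) in counts.items():
        print(key, round(a1_next / total, 3))    # the two proportions agree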
For example, in the next section we consider a certain class of linear models, and it can be shown that they generate the same predictions for the quantities in Table 4 as the pattern model. However, the sequential equality displayed in Eq. 51 does not hold for the linear model. To check these predictions, we shall utilize the data over all trials of the (AiA^ subsequence and not restrict the analysis to asymptotic perform- ance. Specifically, we define = T Pr — Zr rr n = pr STIMULUS COMPOUNDING AND GENERALIZATION I$I But by the results just obtained we have £121 = £112 and £21 = £12 for any given subject. Further, if we define ^ as the sum of the £WJfc*s over all subjects, then it follows that £121 = £112, independent of intersubject differences in c and JV. Similarly, £12' = £21. Thus we have a set of pre- dictions that are not only nonparametric but that require no restrictive assumptions on variability between subjects. Observed frequencies corresponding to these theoretical quantities are as follows : £wi = 140 £118 = 138 4i = 243 £12 = 244 Ip = 0.576 k^ = 0.566. S21 b!2 Similarly, for the OM3) subsequence, Cm = 67 4i3 = 64 4i = 120 Ci3 - 122 4^i = 0.558 k^ = 0.525. b31 b!3 Finally, for the (A2AB} subsequence, £232 = 45 ^223 = 49 ^32 ^ 82 423 ^ 8*7 fe = 0.549 fe-3 = 0.563. b32 b23 Further analyses will be required to determine whether the pattern model gives an entirely satisfactory interpretation of paired-comparison learning. It is already apparent, however, that it may be difficult indeed to find another theory with equally simple machinery that will take us further in this direction than the pattern model. 3. A COMPONENT MODEL FOR STIMULUS COMPOUNDING AND GENERALIZATION 3.1 Basic Concepts; Conditioning and Response Axioms In the preceding section we simplified our analysis of learning in terms of the JV-element pattern model by assuming that all of the patterns J02 STIMULUS SAMPLING THEORY involved in a given experiment are disjoint or, at any rate, that generaliza- tion effects from one stimulus pattern to another are negligible. Now we shall go to the other extreme and treat problems of simple transfer of training between different stimulus situations that have elements in com- mon, and make no reference to a learning process occurring over trials. Again the basic mathematical apparatus is that of sets and elements but with a ^interpretation that needs to be clearly distinguished from that of the pattern model In Sees. 1 and 2 we regarded the pattern of stimulation effective on any trial as a single element sampled from a larger set of such patterns; now we shall consider the trial pattern as itself constituting a set of elements, the elements representing the various components or aspects of the stimulus situation that may be sampled by the subject in differing combinations on different trials. We proceed first to give the two basic axioms that establish the dependence of response probability on the conditioning state of the stimulus sample. Then some theorems that specify relationships between response probabilities in overlapping stimulus samples are derived and are illustrated in terms of applications to experiments on simple stimulus compounding. Consideration of the process by which trial samples are drawn from a larger stimulus population is deferred to Sec. 3.3. The basic axioms of the component model are as follows : Basic Axioms Cl. The sample s of stimulation effective on any trial is partitioned into subsets Si (i — 1 , 2, . . . 
r, where r is the number of response alternatives),, the ith subset containing the elements conditioned to (or "connected to") response A^ C2. The probability of response Ai in the presence of the stimulus sample s is given by , N(s) where N(x) denotes the number of elements in the set x. In Axiom Cl we modify the usual definition of a partition to the extent of permitting some of the subsets to be empty; that is, there may be some response alternatives that are conditioned to none of the elements of s. We do mean to assume, however, that each element of s is conditioned to exactly one response. The substance of Axiom C2 is, then, to make the probability that a given response will be evoked by s equal to the propor- tion of elements of 5 that are conditioned to that response. STIMULUS COMPOUNDING AND GENERALIZATION 3.2 Stimulus Compounding An elementary transfer situation arises if two responses are reinforced, each in the presence of a different stimulus sample, and all or part of one sample is combined with all or part of the other to form a new test situa- tion. To begin with a special case, let us consider an experiment conducted in the laboratory of one of the writers (W.K.E.).9 In one stage of the experiment a number of disjoint samples of three distinct cues drawn from a large population were used as the stimulus members of paired-associate items, and by the usual method of paired presentation one response was reinforced in the presence of some of these samples and a different response in the presence of others. The constituent cues, intended to serve as the empirical counterparts of stimulus elements, were various typewriter symbols, which for present purposes we designate by small letters a, b, c, etc.; the responses were the numbers "one" and "two," spoken aloud. Instructions to the subjects indicated that the cues represented symptoms and the numbers diseases with which the symptoms were associated. Following the training trials, new combinations of "symptoms" were formed, and the subjects were instructed to make their best guesses at the correct diagnoses. Suppose now that response A^ had been reinforced in the presence of the sample (abc) and response A2 in the presence of the sample (def). If a test trial were given subsequently with the sample (abd), direct applica- tion of Axiom C2 yields the prediction that response Al should occur with probability f . Similarly, if a test were given with the sample (ade\ response A: would be predicted to occur with probability -J. Results obtained with 40 subjects, each given 24 tests of each type, were as follows: percentage overlap of training and test sets 0.667 0.333 percentage response 1 to test set 0.669 0.332 Success in bringing off a priori predictions of this sort depends not only on the basic soundness of the theory but also on one's success in realizing various simplifying assumptions in the experimental situation. As we have mentioned, it was our intention in designing this experiment to choose cues, a, b, c, etc., which would take on the role of stimulus elements. Actually, in order to justify our theoretical predictions, it was necessary only that the cues behave as equal-sized sets of elements. To bring out the 9 This experiment was conducted at Indiana University with the assistance of Miss Joan SeBreny. IQ4 STIMULUS SAMPLING THEORY importance of the equal N assumption, let us suppose that the individual cues actually correspond to sets sa, sb, etc., of elements. 
Then, given the same training (response Al reinforced to the combination abc and response AZ to def) and assuming the training effective in conditioning all elements of each subset to the reinforced response, application of Axiom C2 yields for the probability of response A± to abd Pr (A, I sasbsd) = N*+N* 9 1 a d Na + N,+ Nd' where we have used the obvious abbreviation N(st) = N^ This equation reduces to Pr (A: \ sas A) == f if Na = N, = Nd. In this experiment we depended on common-sense considerations to choose cues that could be expected to satisfy the equal-TV requirement and also counterbalanced the design of the experiment so that minor deviations might be expected to average out. Sometimes it may not be possible to depend on common-sense considerations. In that case a preliminary experiment can be utilized to check on the simplifying assumptions. Suppose, for example, we had been in doubt as to whether cues a and b would behave as equal-sized sets. To check on them, we could have run a preliminary experiment in which we reinforced, say, response Al to a and response A^ to b, then tested with the compound ab. Probability of response A± to ab is, according to the model, given by which should deviate in the appropriate direction from J- if Na and Nb are not equal. By means of calibration experiments of this sort sets of cues satisfying the equal- N assumption can be assembled for use in further research involving applications of the model. The expressions we have obtained for probabilities of response to stimu- lus compounds can readily be generalized with respect both to set sizes and to level of training. Suppose that a collection of cues a, by c, . . . corre- sponds to a collection of stimulus sets sa, sb, sc, . . . of sizes Na> N^9 Nc, . . . and that some response A^ is conditioned to a proportion pai of the elements in sa, a proportion p^ of the elements in ^, and so on. Then probability of response Ai to a compound of these cues is, by Axiom C2, expressed by the relation Pr (A, | Sa, s», se, . . .) = aa'N+SN°^ • • • - (52) Application of Eq. 52 can be illustrated in terms of a study of probabil- istic discrimination learning reported in Estes, Burke, Atkinson, & Frank- mann (1957). In this study the individual cues were lights that differed STIMULUS COMPOUNDING AND GENERALIZATION jpj from each other only in their positions on a panel. The first stage of the experiment consisted in discrimination training according to a routine that we shall not describe here except to say that on theoretical grounds it was predicted that at the end of training the proportion of elements in a sample associated with the zth light conditioned to the first of two alter- native responses would be given by pa = f/13. Following this training, the subjects were given compounding tests with various triads of lights. Considering, say, the triad of lights 1, 2, and 3, the values ofpa should be Pu = is, Pzi = A, and p31 = &9 assuming JVi = Nz = N3 = N, and substituting these values into Eq. 52, we obtain 3JV 13 as the predicted probability of response 1 to the compound 1, 2, 3. Theo- retical values similarly computed for a number of triads are compared with the empirical test proportions reported by Estes et al. in Table 5. 
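As an aside (not part of the chapter), the theoretical column of the table that follows can be generated directly from Eq. 52 under the stated assumptions, namely equal set sizes for the lights and conditioned proportions p_i1 = i/13 after training.

    # Sketch: Eq. 52 applied to the light-triad compounding tests.
    def compound_probability(components):
        """Eq. 52: components is a list of (set size, proportion conditioned) pairs."""
        return sum(n * p for n, p in components) / sum(n for n, _ in components)

    triads = [(1, 2, 3), (4, 5, 6), (1, 3, 11), (7, 8, 9), (2, 10, 12), (10, 11, 12)]
    for triad in triads:
        prediction = compound_probability([(1, i / 13) for i in triad])
        print(triad, round(prediction, 2))   # 0.15, 0.38, 0.38, 0.62, 0.62, 0.85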
An important consideration in applications of models for stimulus compounding is the question whether the experimental situation contains an appreciable amount of background stimulation in addition to the controlled stimuli manipulated by the experimenter. Suppose, for example, we are interested in the problem that a compound of two conditioned stimuli, say a light and a tone, each of which has been paired with the same unconditioned stimulus, may have a higher probability of evoking a conditioned response (CR) than either of the stimuli presented separately. To analyze this problem in terms of the present model, we may represent the light and the tone by stimulus sets S_L and S_T. Assuming that as a result of the previous reinforcement the proportions of conditioned elements in S_L and S_T (and therefore the probabilities of CR's to the stimuli taken separately) are p_L and p_T, respectively, application of Axiom C2 yields for the probability of a CR to the compound of light and tone presented together, neglecting any possible background stimulation,

Pr(CR | L, T) = (N_L p_L + N_T p_T)/(N_L + N_T).

Clearly, the probability of a CR to the compound is simply a weighted mean of p_L and p_T, and therefore its value must fall between the probabilities of a CR to the two conditioned stimuli taken separately. No "summation" effect is predicted.

Often, however, it may be unrealistic to assume that background stimulation from the apparatus and surroundings is negligible. In fact, the experimenter may have to count on an appreciable amount of background stimulation, predominantly conditioned to behaviors incompatible with the CR, to prevent "spontaneous" occurrences of the to-be-conditioned response during intervals between presentations of the experimentally controlled stimuli. Let us now expand our representation of the conditioning situation by defining a set s_b of background elements, a proportion p_b of which are conditioned to the CR. For simplicity, we shall consider only the special case of p_b = 0. Then the theoretical probabilities of evocation of the CR by the light, the tone, and the compound of light and tone (together with background stimulation in each case) are given by

Pr(CR | L) = N_L p_L/(N_L + N_b),

Pr(CR | T) = N_T p_T/(N_T + N_b),

and

Pr(CR | L, T) = (N_L p_L + N_T p_T)/(N_L + N_T + N_b),

respectively. Under these conditions it is possible to obtain a summation effect. Assume, for example, that N_T = N_L = N_b and p_T > p_L, so Pr(CR | T) > Pr(CR | L). Taking the difference between the probability of a CR to the compound and the probability of a CR to the tone alone, we have

Pr(CR | L, T) − Pr(CR | T) = (p_T + p_L)/3 − p_T/2 = (2p_L − p_T)/6,

which is positive if the inequality 2p_L > p_T holds. Thus, in this case, probability of a CR to the compound will exceed probability of a CR to either conditioned stimulus alone, provided that p_T is not more than twice p_L.
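A numerical sketch of the summation effect follows. The parameter values below are assumed for illustration only (equal set sizes, p_T = 0.8, p_L = 0.6, background elements conditioned entirely to competing behavior); the function name is ours.

    from fractions import Fraction

    def p_cr(components):
        """Probability of a CR to a compound: weighted mean of the conditioned
        proportions over all stimulus sets present, background set included."""
        num = sum(Fraction(n) * p for n, p in components)
        den = sum(Fraction(n) for n, p in components)
        return num / den

    # Assumed illustrative values: equal set sizes, p_b = 0.
    N, p_T, p_L, p_b = 1, Fraction(4, 5), Fraction(3, 5), Fraction(0)

    tone     = p_cr([(N, p_T), (N, p_b)])
    light    = p_cr([(N, p_L), (N, p_b)])
    compound = p_cr([(N, p_T), (N, p_L), (N, p_b)])

    print(float(light), float(tone), float(compound))          # 0.3  0.4  0.467
    # Summation effect matches the difference derived above, (2*p_L - p_T)/6.
    print(float(compound - tone), float((2 * p_L - p_T) / 6))  # 0.0667  0.0667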
The role of background stimuli has been particularly important in the interpretation of drive stimuli. It has been assumed (Estes, 1958, 1961a) that in simple animal learning experiments (e.g., those involving the learning of running or bar-pressing responses with food or water reward) the stimulus sample to which the animal responds at any time is compounded from several sources: the experimentally controlled conditioned stimulus (CS) or equivalent; stimuli, perhaps largely intraorganismic in origin, controlled by the level of food or water deprivation; and extraneous stimuli that are not systematically correlated with reward of the response undergoing training and therefore remain for the most part connected to competing responses. It is assumed further that the sizes of samples of elements associated with the CS and with extraneous sources, s_C and s_E, are independent of drive but that the size of the sample of drive-stimulus elements, s_D, increases as a function of deprivation. In most simple reward-learning experiments conditioning of the CS and drive cues would proceed concurrently, and it might be expected that at a given stage of learning the proportions of elements in samples from these sources conditioned to the rewarded response R would be equal, that is, p_C = p_D. If this were the case, then probability of the rewarded response would be independent of deprivation; for, letting D and D′ correspond to levels of deprivation such that N_D < N_D′, we have as the theoretical probabilities of response R at the two deprivations

Pr(R | CS, D) = (N_C p_C + N_D p_D)/(N_C + N_D)

and

Pr(R | CS, D′) = (N_C p_C + N_D′ p_D′)/(N_C + N_D′).

If the same training were given at the two drive levels, then we would have p_D = p_D′ as well as p_C = p_D; in this case the difference between the two expressions is zero.

Considering the same assumptions, but with extraneous cues taken explicitly into account, we arrive at a quite different picture. In this case the two expressions for response probability are

Pr(R | CS, D, E) = (N_C p_C + N_D p_D + N_E p_E)/(N_C + N_D + N_E)

and

Pr(R | CS, D′, E) = (N_C p_C + N_D′ p_D′ + N_E p_E)/(N_C + N_D′ + N_E).

Now, letting p_C = p_D = p_D′ = p and, for simplicity, taking p_E = 0, we obtain for the difference

Pr(R | CS, D′, E) − Pr(R | CS, D, E) = (N_C + N_D′)p/(N_C + N_D′ + N_E) − (N_C + N_D)p/(N_C + N_D + N_E),

which is obviously greater than zero, given the assumption N_D′ > N_D. Thus, in this theory, the principal reason why probability of the rewarded response tends, other things being equal, to be higher at higher deprivations is that the larger the sample of drive stimuli, the more effective it is in outweighing the effects of extraneous stimuli.
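The two cases just contrasted can be reproduced numerically. The set sizes and proportions in the sketch below are illustrative assumptions, not values from any experiment; the point is only that the deprivation effect appears exactly when the extraneous elements are included.

    from fractions import Fraction

    def p_response(components):
        """Weighted mean of conditioned proportions over the stimulus sources
        contributing to the trial sample (Axiom C2)."""
        num = sum(Fraction(n) * p for n, p in components)
        den = sum(Fraction(n) for n, p in components)
        return num / den

    # Assumed values: p_C = p_D = p_D' = 1/2, p_E = 0,
    # N_C = N_E = 10, low deprivation N_D = 5, high deprivation N_D' = 20.
    p, N_C, N_E, N_D, N_Dp = Fraction(1, 2), 10, 10, 5, 20

    # Without extraneous cues: identical probabilities at both deprivations.
    print(float(p_response([(N_C, p), (N_D, p)])),
          float(p_response([(N_C, p), (N_Dp, p)])))              # 0.5  0.5

    # With extraneous cues: higher deprivation gives the higher probability.
    print(float(p_response([(N_C, p), (N_D, p), (N_E, 0)])),
          float(p_response([(N_C, p), (N_Dp, p), (N_E, 0)])))    # 0.3  0.375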
3.3 Sampling Axioms and Major Response Theorem of Fixed Sample Size Model

In Sec. 3.2 we considered some transfer effects which can be derived within a component model by considering only relationships among stimulus samples that have had different reinforcement histories. Generally, however, it is desirable to take account of the fact that there may not always be a one-to-one correspondence between the experimental stimulus display and the stimulation actually influencing the subject's behavior. Because of a number of factors, for example, variations in receptor-orienting responses, fluctuations in the environmental situation, or variations in excitatory states or thresholds of receptors, the subject often may sample only a portion of the stimulation made available by the experimenter. One of the chief problems of statistical learning theories has been to formulate conceptual representations of the stimulus sampling process and to develop their implications for learning phenomena.

With respect to specific mathematical properties of the sampling process, component models that have appeared in the literature may be classified into two main types: (1) models assuming fixed sampling probabilities for the individual elements of a stimulus population, in which case sample size varies randomly from trial to trial; and (2) models assuming a fixed ratio between sample size and population size. The first type was first discussed by Estes and Burke (1953), the second by Estes (1950), and some detailed comparisons of the two types have been presented by Estes (1959b). In this section we shall limit consideration to models of the second type, since these are in most respects easier to work with.

In the remainder of this section we shall distinguish stimulus populations and samples by using S, with subscripts as needed, for a population and s for a sample. The sampling axioms to be utilized are as follows:

Sampling Axioms
S1. For any fixed, experimenter-defined stimulating situation, sample size and population size are constant over trials.
S2. All samples of the same size have equal probabilities.

A prerequisite to nearly all applications of the model is a theorem relating response probability to the state of conditioning of a stimulus population. We derive this theorem in terms of a stimulus situation S containing N elements from which a sample of size N(s) = σ is drawn on each trial. Assuming that some number N_i of the elements of S is conditioned to response A_i, we wish to obtain an expression for the expected proportion of elements conditioned to A_i in samples drawn from S, since this proportion will, by Axiom C2, be equal to the probability of evocation of response A_i by samples from S. We begin, as usual, with the probability in which we are interested; then, using the axioms of the model as appropriate, we proceed to expand in terms of the state of conditioning and possible stimulus samples:

Pr(A_i | S) = Σ_s Pr(A_i | s) Pr(s | S),

the summation being over all samples of size σ that can be drawn from S. Next, substituting expressions for the conditional probabilities, we obtain

Pr(A_i | S) = Σ_{N(s_i)=0}^{σ} [N(s_i)/σ] · C(N_i, N(s_i)) C(N − N_i, σ − N(s_i)) / C(N, σ).

In the expression on the right N(s_i)/σ represents the probability of A_i in the presence of a sample of size σ containing a subset s_i of elements conditioned to A_i; the product of binomial coefficients denotes the number of ways of obtaining exactly N(s_i) elements conditioned to A_i in a sample of size σ, so that the ratio of this product to the number of ways of drawing a sample of size σ is the probability of obtaining the given value of N(s_i)/σ. The resulting formula will be recognized as the familiar expression for the mean of a hypergeometric distribution (Feller, 1957, p. 218), and we have the pleasingly simple outcome that the probability of a response to the stimulating situation represented by a set S is equal to the proportion of elements of S that are conditioned to the given response:

Pr(A_i | S) = N_i/N.   (53)

This result may seem too intuitively obvious to have needed a proof, but it should be noted that the same theorem does not hold in general for component models with fixed sampling probabilities for the elements (cf. Estes & Suppes, 1959b).
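Equation 53 can be verified by brute force for small cases: averaging the sampled proportion of conditioned elements over every equiprobable sample of size σ (Axiom S2) returns exactly N_i/N for each σ. The enumeration below is a sketch written for this chapter's illustration; the population size and conditioning count are arbitrary choices.

    from fractions import Fraction
    from itertools import combinations

    def response_prob_exhaustive(N, N_i, sigma):
        """Average, over all samples of size sigma from an N-element set, of the
        proportion of sampled elements conditioned to A_i, when N_i elements of
        the population are so conditioned (Axioms C2 and S2)."""
        population = ['A_i'] * N_i + ['other'] * (N - N_i)
        total = Fraction(0)
        count = 0
        for sample in combinations(population, sigma):
            total += Fraction(sample.count('A_i'), sigma)
            count += 1
        return total / count

    # Eq. 53: the result equals N_i / N regardless of the sample size sigma.
    for sigma in (1, 2, 3, 5):
        print(sigma, response_prob_exhaustive(N=7, N_i=3, sigma=sigma))   # 3/7 each time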
3.4 Interpretation of Stimulus Generalization

Our approach to the problem of stimulus generalization is to represent the similarity between two stimuli by the amount of overlap between two sets of elements.¹⁰ In the simplest experimental paradigm for exhibiting generalization we begin with two stimulus situations, represented by sets S_a and S_b, neither of which has any of its elements conditioned to a reference response A_1. Training is given by reinforcement of A_1 in the presence of S_a only until the probability of A_1 in that situation reaches some value p_a1 > 0. Then test trials are given in the presence of S_b, and if p_b1 now proves to be greater than zero we say that stimulus generalization has occurred. If the axioms of the component model are satisfied, the value of p_b1 provides, in fact, a measure of the overlap of S_a and S_b; for, by Eq. 53, we have, immediately,

p_b1 = N(S_a ∩ S_b) p_a1 / N(S_b),

where S_a ∩ S_b denotes the set of elements common to S_a and S_b, since the numerator of this fraction is simply the number of elements in S_b that are now conditioned to response A_1. More generally, if the proportion of elements of S_b conditioned to A_1 before the experiment were equal to g_b1, not necessarily zero, the probability of response A_1 to stimulus S_b after training in S_a would be given by

p_b1 = {N(S_a ∩ S_b) p_a1 + [N(S_b) − N(S_a ∩ S_b)] g_b1} / N(S_b),

or, with the more compact notation N_ab = N(S_a ∩ S_b), etc.,

p_b1 = [N_ab p_a1 + (N_b − N_ab) g_b1] / N_b.   (54)

This relation can be put in still more convenient form by letting N_ab/N_b = w_ab, namely,

p_b1 = w_ab p_a1 + (1 − w_ab) g_b1.

This equation may be rearranged to read

p_b1 = w_ab (p_a1 − g_b1) + g_b1,   (54a)

and we see that the difference (p_a1 − g_b1) between the posttraining probability of A_1 in S_a and the pretraining probability in S_b can be regarded as the slope parameter of a linear "gradient" of generalization, in which p_b1 is the dependent variable and the proportion of overlap between S_a and S_b is the independent variable. If we hold g_b1 constant and let p_a1 vary as the parameter, we generate a family of generalization gradients which have their greatest disparities at w_ab = 1 (i.e., when the test stimulus S_b is identical with S_a) and converge as the overlap between S_b and S_a decreases, until the gradients meet at p_b1 = g_b1 when w_ab = 0. Thus the family of gradients shown in Fig. 9 illustrates the picture to be expected if a series of generalization tests is given at each of several different stages of training in S_a, or, alternatively, at several different stages of extinction following training in S_a, as was done, for example, by Guttman and Kalish (1956). The problem of "calibrating" a physical stimulus dimension to obtain a series of values that represent equal differences in the value of w_ab has been discussed by Carterette (1961).

10 A model similar in most essentials has been presented in Bush & Mosteller (1951b).

Fig. 9. Generalization from a training stimulus, S_a, to a test stimulus, S_b, at several stages of training.
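The family of linear gradients described by Eq. 54a is easily tabulated. The short sketch below, not part of the original text, uses g_b1 = 0.1 as in Fig. 9; the particular p_a1 levels and overlap values are our own illustrative choices.

    def generalization(w_ab, p_a1, g_b1):
        """Eq. 54a: predicted probability of A1 to the test stimulus S_b as a
        linear function of the overlap proportion w_ab = N_ab / N_b."""
        return w_ab * (p_a1 - g_b1) + g_b1

    # Gradients at several stages of training (p_a1 varies), g_b1 held at 0.1.
    g_b1 = 0.1
    for p_a1 in (0.25, 0.5, 0.75, 1.0):
        row = [round(generalization(w, p_a1, g_b1), 3) for w in (0.0, 0.25, 0.5, 0.75, 1.0)]
        print(p_a1, row)   # every gradient starts at g_b1 (w = 0) and ends at p_a1 (w = 1)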
The parameter w_ab might be regarded as an index of the similarity of S_a to S_b. In general, similarity is not a symmetrical relation, for w_ab is not equal to w_ba (w_ab being given by N_ab/N_b and w_ba by N_ab/N_a) except in the special case N_a = N_b. When N_a ≠ N_b, generalization from training with the larger set to a test with the smaller set will be greater than generalization from training with the smaller set to a test with the larger set (assuming that the reinforcement given the reference response A_1 in the presence of the training set S_i establishes the same value of p_i1 in each case before testing in S_j). We shall give no formal assumption relating size of a stimulus set to observable properties; however, it is reasonable to expect that larger sets will be associated with more intense (where the notion of intensity is applicable) or attention-getting stimuli. Thus, if S_a and S_b represent tones a and b of the same frequency but with tone a more intense than b, we should predict greater generalization if we train the reference response to a given level with a and test with b than if we train to the same level with b and test with a.

Although in the psychological literature the notion of stimulus generalization has nearly always been taken to refer to generalization along some physical continuum, such as wavelength of light or intensity of sound, it is worth noting that the set-theoretical model is not restricted to such cases. Predictions of generalization in the case of complex stimuli may be generated by first evaluating the overlap parameter w_ab for a given pair of situations a and b from a set of observations obtained with some particular combination of values of p_a1 and g_b1 and then computing theoretical values of p_b1 for new conditions involving different levels of p_a1 and g_b1. The problem of treating a simple "stimulus dimension" is of special interest, however, and we conclude our discussion of generalization by sketching one approach to this problem.¹¹

11 We follow, in most respects, the treatment given by W. K. Estes and D. L. LaBerge in unpublished notes prepared for the 1957 SSRC Summer Institute in Social Science for College Teachers of Mathematics. For an approach combining essentially the same set-theoretical model with somewhat different learning assumptions, the reader is referred to Restle (1961).

We shall consider the type of stimulus dimension that Stevens (1957) has termed substitutive or metathetic, that is, one which involves the notion of a simple ordering of stimuli along a dimension without variation in intensity or magnitude. Let us denote by Z a physical dimension of this sort, for example, wavelength of visible light, which we wish to represent by a sequence of stimulus sets. First we shall outline the properties that we wish this representation to have and then spell out the assumptions of the model more rigorously.

It is part of the intuitive basis of a substitutive dimension that one moves from point to point by exchanging some of the elements of one stimulus for new ones belonging to the next. Consequently, we assume that as values of Z change by constant increments each successive stimulus set should be generated by deleting a constant number of elements from the preceding set and adding the same number of new elements to form the next set; but, to ensure that the organism's behavior can reflect the ordering of stimuli along the Z-scale without ambiguity, we need also to assume that once an element is deleted as we go along the Z-scale it must not reappear in the set corresponding to any higher Z-value.
Further, in view of the abundant empirical evidence that generalization declines in an orderly fashion as the distance between two stimuli on such a dimension increases, we must assume that (at least up to the point at which sets corresponding to larger differences in Z are disjoint) the overlap between two stimulus sets is directly related to the interval between the corresponding stimuli on the Z-scale. These properties, taken together, enable us to establish an intuitively reasonable correspondence between characteristics of a sequence of stimulus sets and the empirical notion of generalization along a dimension.

These ideas are incorporated more formally in the following set of axioms. The basis for these axioms is a stimulus dimension Z, which may be either continuous or discontinuous, a collection S* of stimulus sets, and a function x(Z) with a finite number of consecutive integers in its range. The mapping of the set (x) of scaled stimulus values onto the subsets S_i of S* must satisfy the following axioms:

Generalization Axioms
G1. If an element belongs to both S_i and S_k, it belongs also to every set S_j such that j falls between i and k on the x-scale.
G2. If S_i and S_k have any elements in common, then every element of any set S_j falling between them on the x-scale belongs to S_i or to S_k (or to both).
G3. Equal differences on the x-scale correspond to equal overlaps.

Axiom G1 states that if an element belongs to any two sets it also belongs to all sets that fall between these two sets on the x-scale. Axiom G2 states that if two sets have any common elements then all of the elements of any set falling between them belong to one or the other (or both) of the given sets; this property ensures that the elements drop out of the sets in order as we move along the dimension. Axiom G3 describes the property that distinguishes a simple substitutive dimension from an additive, or intensity (in Stevens' terminology, prothetic), dimension. It should be noted that only if the number of values in the range of x(Z) is no greater than N(S*) − N + 1 can Axiom G3 be satisfied. This restriction is necessary in order to obtain a one-to-one mapping of the x-values into the subsets S_i of S*.

One advantage in having the axioms set forth explicitly is that it then becomes relatively easy to design experiments bearing on various aspects of the model. Thus, to obtain evidence concerning the empirical tenability of Axiom G1, we might choose a response A_1 and a set (x) of stimuli, including a pair i and k such that Pr(A_1 | i) = Pr(A_1 | k) = 0, then train subjects with stimulus i only until Pr(A_1 | i) = 1, and finally test with stimulus k. If Pr(A_1 | k) is found to be greater than zero, it must be concluded, in terms of the model, that S_i ∩ S_k ≠ ∅; that is, the sets corresponding to i and k have some elements in common. Given this result, it must be predicted that for every stimulus j in (x), such that i < j < k, Pr(A_1 | j) ≥ Pr(A_1 | k). Axiom G1 ensures that all of the elements of S_k which are now conditioned to A_1 by virtue of belonging also to S_i must be included in S_j, possibly augmented by other elements of S_i which are not in S_k.

To deal similarly with Axiom G2, we proceed in the same way to locate two members i and k of a set (x) such that S_i ∩ S_k ≠ ∅. Then we train subjects on both stimulus i and stimulus k until Pr(A_1 | i) = Pr(A_1 | k) = 1, response A_1 being one that before this training had probability of less than unity to all stimuli in (x). Now, by G2, if any stimulus j falls between i and k, the set S_j must be contained entirely in the union S_i ∪ S_k; consequently, we must predict that we will now find Pr(A_1 | j) = 1 for any stimulus j such that i < j < k.
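A simple concrete representation satisfying these requirements is a family of equal-sized "sliding window" sets along the scaled dimension; each step along the scale deletes one element and adds a new one, and deleted elements never reappear. The construction below is an illustrative sketch of ours, not a model fitted to any data; it shows the overlap (and hence, with g_b1 = 0, the predicted generalization) declining linearly with scale separation, as required.

    def stimulus_set(x, N):
        """Same-sized sets along a scaled dimension: the set for scale value x
        contains elements x, x+1, ..., x+N-1.  Each step exchanges exactly one
        element and a deleted element never returns, so Axioms G1-G3 hold."""
        return set(range(x, x + N))

    N = 5
    train = 1
    for test in range(1, 9):
        overlap = len(stimulus_set(train, N) & stimulus_set(test, N))
        w = overlap / N
        print(test, overlap, w)   # w = 1.0, 0.8, 0.6, 0.4, 0.2, 0.0, 0.0, 0.0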
The overlaps among the stimulus sets associated with a dimension may also be used to define a measure d_ij of the distance between stimuli i and j. Such a measure should have the properties

1. d_ii = 0,
2. d_ij ≥ 0,
3. d_ij = d_ji,
4. d_ij + d_jk ≥ d_ik,

where it is understood that i, j, and k are any members of the set (x) associated with a given dimension. The first three obviously hold, but the fourth requires a bit of analysis. To carry out a proof, we use the notation N_ij for the number of elements common to S_i and S_j, with a further subscript indicating, where needed, exclusion from a third set (thus the number of elements in both S_i and S_j but not in S_k), and so on. The difference between the two sides of the inequality we wish to establish can be expanded in terms of this notation; the last expression on the right in the expansion is a sum of such nonnegative counts, which establishes the desired inequality.

To find the restrictions under which d is additive, let us assume that stimuli i, j, and k fall in the order i < j < k on the dimension. Then, by Axiom G1, we know that the number of elements common to S_i and S_k but not contained in S_j is zero. However, it is only in the special cases in which S_i and S_k are either overlapping or adjacent that the number of elements of S_j belonging to neither S_i nor S_k is zero and, therefore, that d_ij + d_jk = d_ik. It is possible to define an additive distance measure that is not subject to this restriction, but such extensions raise new problems and we are not able to pursue them here.

In concluding this section, we should like to emphasize one difference between the model for generalization sketched here and some of those already familiar in the literature (see, e.g., Spence, 1936; Hull, 1943). We do not postulate a particular form for generalization of response strength or excitatory tendency. Rather, we introduce certain assumptions about the properties of the set of stimuli associated with a sensory dimension; then we take these together with learning assumptions and information about reinforcement schedules as a basis for deriving theoretical gradients of generalization for particular types of experiments. Under the special conditions assumed in the example we have considered, the theory predicts that a family of linear gradients with simple properties will be observed when response probability is plotted as a function of distance from the point of reinforcement. Predictions of this sort may reasonably be tested by means of experiments in which suitable measures are taken to meet the conditions assumed in the derivations (see, e.g., Carterette, 1961); but, to deal with experiments involving different training conditions or response measures other than relative frequencies, further theoretical analysis is called for, and we must be prepared to find substantial differences in the phenotypic properties of generalization gradients derived from the same basic theory for different experimental situations.

4. COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING

In this section we combine, in a sense, the theories discussed in the preceding sections. Until now it was convenient for expositional purposes to treat the problems of learning and generalization separately. We first considered a type of learning model in which the different possible samples of stimulation from trial to trial were assumed to be entirely distinct and then turned to an analysis of generalization, or transfer, effects that could be measured on an isolated test trial following a series of learning trials. Prediction of these transfer effects depended on information concerning the state of the stimulus population just before the test trial but did not depend on information about the course of learning over preceding training trials.
However, in many (perhaps most) learning situations it is not reasonable to assume that the samples, or patterns, of stimulation affecting the organism on different trials of a series are entirely disjoint; rather, they must overlap to various intermediate degrees, thus generating transfer COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 20J effects throughout the learning series. In the "component models" of stimulus sampling theory one simply takes the learning assumptions of the pattern model (Sec. 2) together with the sampling axioms and response rule of the generalization model (Sec. 3) to generate an account of learning for this more general case. 4.1 Component Models with Fixed Sample Size As indicated earlier, the analysis of a simple learning experiment in terms of a component model is based on the representation of the stimulus as a set S of N stimulus elements from which the subject draws a sample on each trial. At any time, each element in the set S is conditioned to exactly one of the r response alternatives A19 . . . ,Ar; by the response axiom of Sec. 3.1 the probability of a response is equal to the proportion of elements in the trial sample conditioned to that response. At the termination of a trial, if reinforcing event £z- (/ ^ 0) occurs, then with probability c all elements in the trial sample become conditioned to response At. If EQ occurs, the conditioned status of elements in the sample does not change. The conditioning parameter c plays the same role here as in the pattern model. It should be noted that in the early literature of stimulus sampling theory this parameter was usually assumed to be equal to unity. Two general types of component models can be distinguished. For the fixed-sample-size model we assume that the sample size is a fixed number s throughout any given experiment. For the independent-sampling model we assume that the elements of the stimulus set S are sampled independ- ently on each trial, each element having some fixed probability 6 of being drawn. In this section we discuss the fixed-sample-size model and consider the case in which all possible samples of size s are sampled with equal probability. FORMULATION FOR RTT EXPERIMENTS. To illustrate the model, we first consider an experimental procedure in which a particular stimulus item is given a single reinforced trial, followed by two consecutive non- reinforced test trials. The design may be conveniently symbolized RT^. Procedures and results for a number of experiments using an RTT design have been reported elsewhere (Estes, 1960a; Estes, Hopkins, & Crothers, 1960; Estes, 1961b; Crothers, 1961). For simplicity, suppose we select a situation in which the probability of a correct response is zero before the first reinforcement (and in which the likelihood of a subject's obtaining correct responses by guessing is negligible on all trials). In terms of the fixed-sample-size model we can readily generate predictions for the prob- abilities pis of various combinations of response i on 7\ and response j 2O8 STIMULUS SAMPLING THEORY on T2. If i9j = 0 denote correct responses and i,j = 1 denote errors, then x (55) /i s\ s , = cl 1 I — I N/N To obtain the first result, we note that the correct response can occur on either trial only if conditioning occurs on the reinforced trial, which has probability c. On occasions when conditioning occurs, the whole sample of s elements becomes conditioned to the correct response and the prob- ability of this response on each of the test trials is s/N. 
On occasions when conditioning does not occur on the reinforced trial, probability of a correct response remains at zero over both test trials. Note that when s = N = 1 this model is equivalent to the one-element model discussed in Sec. 1.1. If more than one reinforcement is given prior to 7\, the predictions are essentially unchanged. In general, for k preceding reinforcements, the expected proportion of elements conditioned to the correct response (i.e., the probability of a correct response) at the time of the first test is and the probability of correct responses on both T± and T2 is given by To obtain this last expression, we note that a subject for whom / of the k reinforcements have been effective will have probability {1 — [I — (slN)]*} of making a correct response on each test, and the probability that exactly j reinforcements will be effective is ( . J c*(l — c)k-\ Similarly, and PU = a - c + feWi - ci - N COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 20$ If s = N, these expressions reduce to PQO = i - (i - represents the probability that a subject for whom the reinforced trial was effective nevertheless draws a sample COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 211 containing no conditioned elements and makes an incorrect guess, whereas $ is the probability that such a subject will make a correct response on either test trial. The two sets of equations (56 and 57) are formally identical and thus cannot be distinguished in application to RTTd&t&. Like Eq. 55, they have the limitation of not allowing adequately for the retention loss usually observed (see, e.g., Estes, Hopkins, & Crothers, 1960); we return to this point in Sec. 4.2. If exposure time is long enough on the test trials, then we assume that the subject continues to draw successive random samples from S and makes a response only when he finally draws a sample containing at least one conditioned element. Thus in cases in which the reinforcement has been effective on a previous trial (so that S contains a subset of s con- ditioned elements) the subject will eventually draw a sample containing one or more conditioned elements and will respond on the basis of these ele- ments, thereby making a correct response with probability 1. Therefore, for the case of unlimited exposure time, ' = 1 and Eq. 57 reduces to Poo = (1 ~ c) - + c, which are identical with the corresponding equations for the one-element model of Sec. 1.2. GENERAL FORMULATION. We turn now to the problem of deriving from the fixed-sample-size model predictions concerning the course of learning over an experiment consisting of a sequence of trials run under some prescribed reinforcement schedule. We shall limit consideration to the case in which each element in S is conditioned to exactly one of the two response alternatives, Al or A& so that there are TV + 1 conditioning states. Again, we let Ct (i = 0, . . . , JV) denote the state in which i elements of the set S are conditioned to A^ and N — i to Az. As in the pattern model, the transition probabilities among conditioning states are functions of the reinforcement schedules and the set-theoretical parameters c, s, and N. Following our approach in Sec. 2.1, we restrict the analysis to cases in which the probability of reinforcement depends at most on the response on the given trial; we thereby guarantee that all elements in the transition 212 STIMULUS SAMPLING THEORY matrix for conditioning states are constant over trials. Thus the sequence of conditioning states can again be conceived as a Markov chain. 
Transition Probabilities. Let s_{i,n} denote the event of drawing a sample on trial n with i elements conditioned to A_1 and s − i conditioned to A_2. Then the probability of a one-step transition from state C_j to state C_{j+v}, for v > 0, is given by

q_{j,j+v} = c Pr(E_1 | s_{s−v} C_j) C(j, s − v) C(N − j, v) / C(N, s),   (59a)

where Pr(E_1 | s_{s−v} C_j) is the probability of an E_1 event, given conditioning state C_j and a sample with v elements conditioned to A_2. To obtain Eq. 59a, we note that an E_1 must occur and that the subject must sample exactly v elements from the N − j elements not already conditioned to A_1; the probability of the latter event is the number of ways of drawing samples with v elements conditioned to A_2 divided by the total number of ways of drawing samples of size s. Similarly,

q_{j,j−v} = c Pr(E_2 | s_v C_j) C(j, v) C(N − j, s − v) / C(N, s),   (59b)

and the probability of remaining in state C_j is the complement of the probabilities of all such changes,

q_{jj} = 1 − Σ_{v>0} (q_{j,j+v} + q_{j,j−v}).   (59c)

Although it is an obvious conclusion, it is important for the reader to realize that the pattern model discussed in Sec. 2 is identical to the fixed-sample-size model when s = 1. This correspondence between the two models is indicated by the fact that Eqs. 59a, b, c reduce to Eqs. 23a, b, c when we let s = 1.

For the simple noncontingent schedule in which only the two events E_1 and E_2 occur (with probabilities π and 1 − π, respectively) Eqs. 59a, b, c simplify to

q_{j,j+v} = cπ C(j, s − v) C(N − j, v) / C(N, s),   (60a)

q_{j,j−v} = c(1 − π) C(j, v) C(N − j, s − v) / C(N, s),   (60b)

q_{jj} = 1 − c + c[π C(j, s) + (1 − π) C(N − j, s)] / C(N, s).   (60c)

It is apparent that state C_N is an absorbing state when π = 1 and that C_0 is an absorbing state when π = 0. Otherwise, all states are ergodic.

Mean Learning Curve. Following the same techniques used in connection with Eq. 27, we obtain for the component model in the simple, noncontingent case

Pr(A_1,n) = π − [π − Pr(A_1,1)](1 − cs/N)^{n−1}.   (61)

This mean learning function traces out a smooth growth curve that can take any value between 0 and 1 on trial n if parameters are selected appropriately. However, it is important to note that for a given realization of the experiment the actual response probabilities for individual subjects (as opposed to expectations) can only take on the values 0, 1/N, 2/N, . . . , (N − 1)/N, 1; that is, the values associated with the conditioning states. This stepwise aspect of the process is particularly important when one attempts to distinguish between this model and models that assume gradual continuous increments in the strength or probability of a response over time (Hull, 1943; Bush & Mosteller, 1955; Estes & Suppes, 1959a).
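The Markov-chain character of the process can be illustrated by building the transition matrix for the noncontingent case from the expressions in Eq. 60 and iterating it; the trial-by-trial mean of j/N then reproduces the closed-form curve of Eq. 61. The parameter values in the sketch below are arbitrary illustrative choices of ours.

    from math import comb

    def transition_matrix(N, s, c, pi):
        """Transition probabilities among conditioning states C_0, ..., C_N for
        the fixed-sample-size component model, noncontingent case (Eq. 60)."""
        Q = [[0.0] * (N + 1) for _ in range(N + 1)]
        for j in range(N + 1):
            for i in range(max(0, s - (N - j)), min(j, s) + 1):
                # sample with i elements conditioned to A1, s - i conditioned to A2
                p_sample = comb(j, i) * comb(N - j, s - i) / comb(N, s)
                Q[j][j + (s - i)] += c * pi * p_sample          # E1, conditioning effective
                Q[j][j - i]       += c * (1 - pi) * p_sample    # E2, conditioning effective
                Q[j][j]           += (1 - c) * p_sample         # conditioning ineffective
        return Q

    N, s, c, pi = 6, 2, 0.3, 0.8
    Q = transition_matrix(N, s, c, pi)

    # Start in C_0; Pr(A1 on trial n) is the expected proportion j/N.
    state = [1.0] + [0.0] * N
    for n in range(1, 11):
        p_model = sum((j / N) * state[j] for j in range(N + 1))
        p_closed = pi - (pi - 0.0) * (1 - c * s / N) ** (n - 1)   # Eq. 61
        print(n, round(p_model, 4), round(p_closed, 4))           # the two columns agree
        state = [sum(state[j] * Q[j][k] for j in range(N + 1)) for k in range(N + 1)]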
To illustrate this point, we consider an experiment on avoidance learning reported by Theios (1963). Fifty rats were used as subjects. The apparatus was a modified Miller-Mowrer electric-shock box, and the animal was always placed in the black compartment. Shortly thereafter a buzzer and light came on as the door between the compartments was opened. The correct response (A_1) was to run into the other compartment within 3 seconds. If A_1 did not occur, the subject was given a high-intensity shock until it escaped into the other compartment. After 20 seconds the subject was returned to the black compartment, and another trial was given. Each rat was run until it met a criterion of 20 consecutive successful avoidance responses.

Theios analyzed the situation in terms of a component model in which N = 2 and s = 1. Further, he assumed that Pr(A_1,1) = 0, hence on trial 1 the subject is in conditioning state C_0. Employing Eq. 60 with π = 1, N = 2, and s = 1, we obtain the following transition matrix (states ordered C_2, C_1, C_0):

C_2   1        0          0
C_1   c/2      1 − c/2    0
C_0   0        c          1 − c

The expected probability of an A_1-response on trial n is readily obtained by specialization of Eq. 61,

Pr(A_1,n) = 1 − (1 − c/2)^{n−1}.

Applying this model, Theios estimated c = 0.43 and provided an impressive account of such statistics as total errors, the mean learning curve, trial number of last error, autocorrelation of errors with lags of 1, 2, 3, and 4 trials, mean number of runs, probability of no reversals, and many others. However, for our immediate purposes we are interested in only one feature of his data; namely, whether the underlying response probabilities are actually fixed at 0, 1/2, and 1, as specified by the model. First we note that it is not possible to establish the exact trial on which the subject moves from C_0 to C_1 or from C_1 to C_2. Nevertheless, if there are some trials between the first success (A_1-response) and the last error (A_2-response), we can be sure that the subject is in state C_1 on these trials, for, if the subject has made one success, at least one of the two stimulus elements is conditioned to the A_1-response; if on a later trial the subject makes an error, then, up to that trial, at least one of the elements is not conditioned to the A_1-response. Since deconditioning does not occur in the present model, the subject must be in conditioning state C_1. Thus, according to the model, the sequence of responses after the first success and before the last error should form a sequence of Bernoulli trials with constant probability p = q = 1/2 of an A_1-response. Theios has applied several statistical tests to check this hypothesis and none suggests that the assumption is incorrect. For example, the response sequences for the trials between the first success and last error were divided into blocks of four trials and the number of A_1-responses in each block was counted. The obtained frequencies for 0, 1, 2, 3, and 4 successes were 2, 12, 17, 15, and 4, respectively; the predicted binomial frequencies were 3.1, 12.5, 18.5, 12.5, and 3.1. The correspondence between predicted and observed frequencies is excellent, as indicated by a χ² goodness-of-fit test that yielded a value of 1.47 with 4 degrees of freedom.

Theios has applied the same analysis to data from an experiment by Solomon and Wynne (1953), in which dogs were required to learn an avoidance response. The findings with regard to the binomial property on trials after the first success and before the last error are in agreement with his own data but suggest that the binomial parameter is other than 1/2. From a stimulus sampling viewpoint this observation would suggest that the two elements are not sampled with equal probabilities. For a detailed discussion of this Bernoulli stepwise aspect of certain stimulus sampling models, related statistical tests, and a review of relevant experimental data the reader is referred to Suppes & Ginsberg (1963).

The mean learning curve for the fixed-sample-size model given by Eq. 60 is identical to the corresponding equation for the pattern model, with the sampling ratio cs/N taking the role of c/N. However, we need not look far to find a difference in the predictions generated by the two models. If we define α_{2,n} as in Eq. 29, then by carrying out the summation, using the same methods as in the case of Eq.
27, we obtain T 2cs *• = L1 " ~N csQ-l)"l , c[s , C7TS2 ,<0, -i+-- (62) The asymptotic variance of the response probabilities for the component model is simply Letting oc2>n = ^n^ = a?>00> noting that Pr (Alj00) = TT and carrying out the appropriate computations, we obtain - ^1-^ + ^-2)51 ~ N. L2JV-S-1 J 2l6 STIMULUS SAMPLING THEORY This asymptotic variance of the response probabilities depends in relatively simple ways on s and TV. If we hold N fixed and differentiate with respect to s, we find that c^2 increases monotonically with $; in particular, then, this variance for a fixed sample size model with s > 1 is larger than that of the pattern model with the same number of elements. If we hold the sampling ratio s/N fixed and take the partial derivative with respect to TV, we find cr^2 to be a decreasing function of N. In the limit, if N ~> co in such a way that s/N = d remains constant, then (64) _, which, we shall see later, is the variance for the linear model (Estes & Suppes, 1959a). In contrast, for the pattern model the variance of the ^-values approaches 0 as TV becomes large. We return to comparisons between the two models in Sec. 4,3. Sequential Predictions. We now examine some sequential statistics for the fixed-sample-size model which later will help to clarify relationships among the various stimulus sampling models. As in previous cases (e.g., Eq. 31<2), we give results only for the noncontingent case in which Pr (E0>n) = 0 and r = 2. Consider, first, Pr (AltH+l \ Ei>n). By taking account of the conditioning states on trial n + 1 and trial n and also the sample on trial n we may write where, as before, si>n denotes the event of drawing a sample on trial n with i elements conditioned to A± and s — i conditioned to A2. Con- ditionalizing, with our learning axioms in mind, we obtain • Pr (E1>n | s€,nCi;J Pr (s,,n | Cfcn) Pr (Cft, J. But for our reinforcement procedures Pr (£1>M) = Pr (£liB | 5i Further (" c if j = k + s-i, | E^i>nC^ = 1 - c if J = k, { 0 otherwise; COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 2IJ that is, the s — i elements in the sample originally conditioned to A2 now become conditioned to A± with probability c, hence a move from state Ck to Ck+s-i occurs. Also, as noted with regard to Eq. 59, Substitution of these results in our last expression for Pr (Ala7i+l \ Elf7 yields /fe\ /N - k\ We now need the fact that the first raw moment of the hypergeometric distribution is fk\ /N - fe\ iS1 (N\ N3 W permitting the simplification Pr (A,,n+1 1 Elin) = 2 [f + |(l - f but, by definition, whence / cs\ cs Pr (A1>n+1 1 £1n) Finally, for comparison with other models, we present the expressions for 2lS STIMULUS SAMPLING THEORY PT(Aktn+1E,tnAltJ. Derivations of these probabilities are based on the same methods used in connection with Eq. 61a. n+1ElinA2tn) = (1 - a1>n) Tcs c(s - 1)1 1 ... — --- i - -«i«}. (66c) LAT JV - 1 J 1>n/ v ^ (66e) i,n -«,,„). (66/) (1 - TT) 1 - aliB (66/1) Application of these equations to the corresponding set of trigram proportions for a preasymptotic trial block is not particularly rewarding. The difficulty is that certain combinations of parameters, for example, {1 — [c(s — 1)1 'N — I]}(a1>n — 02jW) and cs/N, behave as units; con- sequently, the basic parameters c, s, and N cannot be estimated individually and, as a result, the predictions available from the simpler JV-element pattern model via Eq. 32 cannot be improved upon by use of Eq. 66. 
For COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 21$ asymptotic data the situation is somewhat different. By substituting the limiting values for a1>n and a2j7i in Eq. 66, that is, 0^ = 77 and from Eq. 63 7T)nV + (jV-2)5l 2 - -J- 77 L 2N - s - 1 J _ TT[JV - 2s + Ns + 27r(N - s)(N - 1)] N(2N - 5 - 1) we can express the trigram probabilities Pr (Akta,EjfXAi}CC) in terms of the basic parameters of the model The resulting expressions are somewhat cumbersome, however, and we shall not pursue this line of analysis here. 4.2 Component Models with Stimulus Fluctuation In Sec. 4.1, as in most of the literature on stimulus sampling models for learning, we restricted attention to the special case in which the stimula- tion effective on successive trials of an experiment may be considered to represent independent random samples from the population of elements available under the given experimental conditions. More generally, we would expect that the independence of successive samples would depend on the interval between trials. The concept of stimulus sampling in the model corresponds to the process of stimulation in the empirical situation. Thus sampling and resampling from a stimulus population must take time; and, if the interval between trials is sufficiently short, there will not be time to draw a completely new sample. We should expect the correlation, or degree of overlap, between successive stimulus samples to vary inversely with the intertrial interval, running from perfect overlap in the limiting case (not necessarily empirically realizable) of a zero interval to independ- ence at sufficiently long intervals. These notions have been embodied in the stimulus fluctuation model (Estes, 1955a, 1955b, 1959a). In this section we shall develop the assumption of stimulus fluctuation in connection with fixed-sample-size models; consequently, the expressions derived will differ in minor respects from those of the earlier presentations (cited above) that were not restricted to the case of fixed sample size. ASSUMPTIONS AND DERIVATION OF RETENTION CURVES. Follow- ing the convention of previous articles on stimulus fluctuation models, we denote by S* the set of stimulus elements potentially available for sampling under a given set of experimental conditions, by S the subset of elements available for sampling at any given time, and by Sr the subset of elements that are temporarily unavailable (so that S* = S U S'). The trial sample s is in turn a subset of S; however, in this presentation we assume for simplicity that all of the temporarily available elements are sampled on 22O STIMULUS SAMPLING THEORY each trial (i.e., 5 = s). We denote by N, N', and TV*, respectively, the numbers of elements in s, S', and S*. The interchange between the stimulus sample and the remainder of the population, that is, between s and S', is assumed to occur at a constant rate over time. Specifically, we assume that during an interval Af, which is just long enough to permit the interchange of a single element between s and S', there is probability g that such an interchange will occur, the parameter g being constant over time. We shall limit consideration to the special case in which all stimulus elements are equally likely to participate in an interchange. With this restriction, the fluctuation process can be characterized by the difference equation f(t + i) = (i - g)f(t) + - (67) where f(t) denotes the probability that any given element of 5* is in s at time t. 
This recursion can be solved by standard methods to yield the explicit formula =A rjv ir /i +JL\T N* IN* JL \N JV'/J , (68) where / = NfN*, the proportion of all the elements in the sample, and a = 1 - g(l/N + 1/JV). Equation 68 can now serve as the basis for deriving numerous expres- sions of experimental interest. Suppose, for example, that at the end of a conditioning (or extinction) period there were /0 conditioned elements in S and &0 conditioned elements in S', the momentary probability of a conditioned response thus being p0 = JQJN. To obtain an expression for probability of a conditioned response after a rest interval of duration t, we proceed as follows. For each conditioned element in S at the beginning of the interval, we need only set /(O) = 1 in Eq. 68 to obtain the probability that the element is in S at time t. Similarly, for a conditioned element initially in S' we set/(0) = 0 in Eq. 68. Combining the two types, we obtain for the expected number of conditioned elements in S at time t MJ - (J - 1X1 + V(l - **) = C/o + fco) / - K 7o + k,)J - /0]a*. Dividing by N (and noting that J = #/#*) we have, then, for the prob- ability of a conditioned response at time t P° N* I N* = V - Oo* - PoK, (69) COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 221 where />0* and p0 denote the proportion of conditioned elements in the total population S* and the initial proportion in S, respectively. If the rest interval begins after a conditioning period, we will ordinarily have/?0 > pQ* in which case Eq. 69 describes a decreasing function (forgetting, or spontaneous regression). If the rest interval begins after an extinction period, we will have/?0 < pQ*, in which case Eq. 69 describes an increasing function (spontaneous recovery). The manner in which cases of spontane- ous regression or recovery depend on the amount and spacing of previous acquisition or extinction has been discussed in detail elsewhere (Estes, 1955a). APPLICATION TO THE RTT EXPERIMENT. We noted in the preceding section that the fixed-sample-size model could not provide a generally satisfactory account of RTT experiments because it did not allow for the retention loss usually observed between the first and second tests. It seems reasonable that this defect might be remedied by removing the restriction on independent sampling. To illustrate application of the more general model with provision for stimulus fluctuation, we again consider the case of an RTT experiment in which the probability of a correct response is negligible before the reinforced trial (and also on later trials if learning has not occurred). Letting t± and ?2 denote the intervals between R and J\ and between 7^ and T2, respectively, we may obtain the following basic expressions by setting /(O) equal to 1 or 0, as appropriate, in Eq. 68: For the probability that an element sampled on R is sampled again on 7\, for the probability that an element sampled on 7^ is sampled again on and for the probability that an element not sampled on 7\ is sampled on T* /s = /(!-<*)• Assuming now that N = 1, so that we are dealing with a generalized form of the pattern model, we can write the probabilities of the four com- binations of correct and incorrect responses on Tj and T% in terms of the conditioning parameter c and the parameters ft: POQ Pm. = , (?0) = 1 - c where, as before, the subscripts 0 and 1 denote correct responses and errors, respectively. As they stand, Eqs. 
70 are not suitable for application 222 STIMULUS SAMPLING THEORY to data because there are too many parameters to be estimated. This difficulty could be surmounted by adding a third test trial, for then the resulting eight observation equations Pwo = etc., would permit overdetermination of the four parameters. In the case of some published studies (e.g., Estes, 1961b) the data can be handled quite well on the assumption that/! is approximately unity, in which case Eqs. 70 reduce to Poo = cf* Poi = c(l -/a)» Pw = 0> Pu = 1 - c. In the general case of Eqs. 70 some predictions can be made without knowing the exact parameter values. It has been noted in published studies (Estes, Hopkins, & Crothers, 1960; Estes, 1961b) that the observed proportion /?01 is generally larger than/?10. Taking the difference between the theoretical expressions for these quantities, we have - /(I which obviously must be equal to or greater than zero. The experiments cited above have in all cases had ^ < t2 and therefore /i >/2. Since f^ which is directly estimated by the proportions of instances in which correct responses on Ta are repeated on T2, has ranged from about 0.6 to 0.9 in these experiments (and/j must be larger), it is clear that/?10, the probability of an incorrect followed by a correct response, should be relatively small. This theoretical prediction accords well with observation. Numerous predictions can be generated concerning the effects of varying the durations of /x and tz. The probability of repeating a correct response from T! to T2, for example, should depend solely on the parameter /2, decreasing as t% increases (and/2 therefore decreases). The probability of a correct response on T2 following an incorrect response on 7^ should depend most strongly on/3, increasing as *2 (and therefore /3) increases. COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 22$ The over-all proportion correct per test should, of course, decrease from T! to T2 (although the difference between proportions on 7\ and T2 tends to zero as ^ becomes large). Data relevant to these and other predictions are available in 'studies by Estes, Hopkins, and Crothers (I960), Peterson, Saltzman, Hillner, and Land (1962), and Witte (R. Witte, personal com- munication). The predictions concerning effects of variation of fa are well confirmed by these studies. Results bearing on predictions concerning variation in ra are not consistent over the set of experiments, possibly because of artifacts arising from item selection (discussed by Peterson et al., 1962). APPLICATION TO THE SIMPLE NONCONTINGENT CASE. We restrict consideration to the special case of TV = 1 ; thus we are dealing with a variant of the pattern model in which the pattern sampled on any trial is the one most likely to be sampled on the next trial. No new concepts are required beyond those introduced in connection with the RTT experiment, but it is convenient to denote by a single symbol, say g, the probability that the stimulus pattern sampled on any trial n is exchanged for another pattern on trial n + I . In terms of this notation, g = 1 -A = (1 - J)(l - cf) = l - where t is now taken to denote the intertrial interval. Also, we denote by wlm>w the probability of the state of the organism in which m stimulus patterns are conditioned to the ^-response and one of these is sampled and by u0m^n the probability that m patterns are conditioned to A± but a pattern conditioned to A2 is sampled. 
Obviously .v* Pn = 2 Wlm,n Un9 with equality obtaining only in the special cases in which both are equal to unity or both equal to zero. Therefore, the probability of a repetition is inversely related to the intertrial interval. In particular, the probability that a correct Ar or ^-response will be repeated tends to unity in the limit as the intertrial interval goes to zero. When the intertrial interval becomes large, the parameter g approaches 1 — I/TV* and Eqs. 71 reduce to those of a pattern model with N elements and independent sampling. Summing the first four of Eqs. 71, we obtain a recursion for probability of the ^-response: Pn+l = l - C - g ~ - + CgPn + C(l - g> + g(Un + Vn\ Now, although a full proof would be quite involved, it is not hard to COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 22$ show heuristically that the asymptote is independent of the intertrial interval. We note first that asymptotically we have /V mm \N* N* where um is the probability that m elements are conditioned to A^ The substitution of um(m/N*) for ulm is possible in view of the intuitively evident fact that, asymptotically, the probability that an element condi- tioned to A! will constitute the trial sample is simply equal to the pro- portion of such elements in the total population. Substituting into the recursion for pn in terms of this relation, and the analogous one for Vw N* / >> V. = - (Pn ~ a2, J, we obtain pn+i= \l-c-g- — + • = (1 - c + cg)pn + c(l - gK the simplification in the last line having been effected by means of the identity N> + * Nr / N' Setting pn+l =pn=P«> and solving for p^, we arrive at the tidy outcome whence Poo =^ The recursion inpn can be solved, but the resulting formula expressing pn as a function of n and the parameters is too cumbersome to yield much useful information by visual inspection. It seems intuitively obvious that for g < I — I /N* (i.e., for any but very long intertrial intervals) the learning curve will rise more sharply on early trials than the corresponding curve for the independent sampling case. This is so because only sampled elements can undergo conditioning, and, once sampled, an element is more likely to be resampled the shorter the intertrial interval. However, 226 STIMULUS SAMPLING THEORY the curves for longer and shorter intervals must cross ultimately, with the curve for the longer interval approaching asymptote more rapidly on later trials (Estes, 1955b). If 77 = 1, the total number of errors expected during learning must be independent of the intertrial interval because each initially unconditioned element will continue to produce an error each time it is sampled until it is finally conditioned, and the probability of any specified number of errors before conditioning depends only on the value of the conditioning parameter c. Similarly, if 77 is set equal to 0 after a condition- ing session, the total number of conditioned responses during extinction is independent of the intertrial interval. 4.3 The Linear Model as a Limiting Case For those experiments in which the available stimuli are the same on all trials the possibility arises of using a model that suppresses the concept of stimuli. In such a "pure" reinforcement model the learning assumptions specify directly how response probability changes on a reinforced trial. 
By all odds the most popular models of this sort are those which assume probability of a response on a given trial to be a linear function of the probability of that response on the previous trial.12 The so-called "linear models" received their first systematic treatment by Bush and Mosteller (1951a, 1955) and have been investigated and developed further by many others. We shall be concerned only with a certain class of linear models based on a single learning parameter 6. A more extensive analysis of this class of linear models has been given in Estes & Suppes (1959a). The linear theory is formulated for the probability of a response on trial n + 1, given the entire preceding sequence of responses and rein- forcements.13 Let xn be the sequence of responses and reinforcements of a given subject through trial n\ that is, xn is a .sequence of length 2n with entries in the odd positions indicating responses and entries in the even positions indicating reinforcements. The axioms of the linear model are as follows. Linear Axioms For every f, i' and k such that 1 < i9 if < r and 0 < k < r: LI. // PT(E^Ai,tnxn_^>Qythen Pr (Ait^ I E^A^x^ = (1 - 0) Pr (A^ \ xn_J + 6. 12 For a discussion of this general class of "incremental" models see Chapter 9 by Steinberg in this volume. 13 In the language of stochastic processes we have a chain of infinite order. COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 22J L2. If Pr (EktnA^nxn_^ > 0, k ^ i and k ^ 0, then Pr (Ai>n+I | E^nA^nxn_^ = (1-0) Pr (AlfH \ xn^\ L3. // Pr CEo.X-, A_0 > 0, By Axiom LI, if the reinforcing event £,, corresponding to response Ai9 occurs on trial n, then (regardless of the response occurring on trial n) the probability of At increases by a linear transform of the old value. By L2, if some reinforcing event other than Ei occurs on trial w, then the prob- ability of Ai decreases by a linear transform of its old value; and by L3 occurrence of the "neutral" event E0 leaves response probabilities un- changed. The axioms may be written more compactly in terms of the probability pxi^n that a subject identified with sequence x makes an A€ response on trial n: 1. If the subject receives an £revent on trial n, 2. If the subject receives an ^-event (k ^ i and k ^ 0) on trial n, Pxi,n+I = (1 - 6K',n- 3. If the subject receives an £*0-event on trial «, From a mathematical standpoint it is important to note that for the linear model the response probability associated with a particular subject is free to vary continuously over the entire interval from 0 to 1, since this probability undergoes linear transformations as a result of reinforcement. Consequently, if we wish to interpret changes in response probability as transitions among states of a Markov process, we must deal with a con- tinuous-state space. Thus the Markov interpretation is of little practical value for calculational purposes. In stimulus sampling models response probability is defined in terms of the proportion of stimuli conditioned; since the set of stimuli is finite, so also is the set of values taken on by the response probability of any individual subject. It is this finite character of stimulus sampling models that makes possible the extremely useful inter- pretation of the models as finite Markov chains. An inspection of the three axioms for the linear model indicates that they have the same general form as Eqs. 65, which describe changes in response probability for the fixed-sample-size component model; that is, if we let 6 = cs/N, then the two sets of rules are similar. 
As might be expected from this observation, many of the predictions generated by the two models are identical when θ = cs/N. For example, in the simple noncontingent situation the mean learning curve for the linear model is

Pr(A_{1,n}) = π − [π − Pr(A_{1,1})](1 − θ)^{n−1},    (72)

which is the same as that of the component model (see Estes & Suppes, 1959a, for a derivation of results for the linear model). However, the two models are not identical in all respects, as is indicated by a comparison of the asymptotic variances of the response distributions. For the linear model

σ∞² = π(1 − π)[θ/(2 − θ)],

as contrasted to Eq. 63 for the component model. However, as already noted in connection with Eq. 63, in the limit (as N → ∞) the σ∞² for the component model equals the predicted value for the linear model.

The last result suggests that the component model may converge to the linear process as N → ∞. This conjecture is substantially correct; it can be shown that in the limit both the fixed-sample-size model and the independent sampling model approach the linear model for an extremely broad class of assumptions governing the sampling of elements. The derivation of the linear model from component models holds for any reinforcement schedule, for any finite number r of responses, and for every trial n, not simply at asymptote. The proof of this convergence theorem is lengthy and it is not presented here. However, the proof depends on the fact that the variance of the sampling distribution for any statistic of the trial sample approaches 0 as N becomes large. A proof of the convergence theorem is given by Estes and Suppes (1959b). Kemeny and Snell (1957) also have considered the problem, but their proof is restricted to the two-choice noncontingent situation at asymptote.

COMPARISON OF THE LINEAR AND PATTERN MODELS. The same limiting result does not, of course, hold for the pattern model discussed in Sec. 2. For the pattern model only one element is sampled on each trial, and it is obvious that as N → ∞ the learning effect of this sampling scheme would diminish to zero. For experimental situations in which both the linear model and the pattern model appear to be applicable it is important to derive differential predictions from the two models that, on empirical grounds, will permit the researcher to choose between them. To this end we display a few predictions for the linear model applied to both the RTT situation and the simple two-response noncontingent situation; these results will be compared with the corresponding equations for the pattern model.

For simplicity let us assume that in the case of the RTT situation the likelihood of a correct response by guessing is negligible on all trials. Then, according to the linear model, the probability of a reinforced response changes in accordance with the equation

p_{n+1} = (1 − θ)p_n + θ.

In the present application the probability of a correct response on the first trial (the R trial) is zero; hence the probability of a correct response on the first test trial is simply θ. No reinforcement is given on T_1, and consequently the probability of a correct response does not change between T_1 and T_2. Therefore p_{00}, the probability of a correct response on both T_1 and T_2 (as defined in connection with Eq. 55), is θ². Similarly, we obtain p_{01} = p_{10} = θ(1 − θ) and p_{11} = (1 − θ)². Some relevant data are presented in Table 6 (from Estes, 1961b).
They represent joint response proportions for 40 subjects, each tested on 15 paired-associate items of the type described in Sec. 1.1, the RTT design applied to each item.

Table 6  Observed Joint Response Proportions for RTT Experiment and Predictions from Linear Retention-Loss Model and Sampling Model

            Observed      Retention-Loss    Sampling
            Proportion    Model             Model
  p_00      0.238         0.238             0.238
  p_01      0.147         0.238             0.152
  p_10      0.017         0.018             0
  p_11      0.598         0.506             0.610

In order to minimize the probability of correct responses occurring by guessing, these items were introduced (one per trial) into a larger list, the composition of which changed from trial to trial. A critical item introduced on trial n received one reinforcement (paired presentation of stimulus and response members), followed by a test (presentation of the stimulus alone) on trial n and trial n + 1, after which it was dropped from the list.

From an inspection of the data column of Table 6 it is obvious that the simple linear model cannot handle these proportions. It suffices to note that the model requires p_01 = p_10, whereas the difference between these two entries in the data column is quite large.

One might try to preserve the linear model by arguing that the pattern of observed results in Table 6 could have arisen as an artifact. If, for example, there are differences in difficulty among items (or, equivalently, differences in learning rate among subjects), then the instances of incorrect response on T_1 would predominantly represent smaller θ-values than instances of correct responses. On this account it might be expected that the predicted proportion of correct following incorrect responses would be smaller than that allowed for under the "equal θ" assumption and therefore that the linear model might not actually be incompatible with the data of Table 6. We can easily check the validity of such an argument. Suppose that parameter θ_i is associated with a proportion f_i of the items (or subjects). Then in each case in which θ_i is applicable the probability of a correct response on T_1 followed by an error on T_2 is θ_i(1 − θ_i). Clearly, then, p_01 estimated from a group of items described by differences in θ would be

p_01 = Σ_i f_i θ_i(1 − θ_i).

But a similar argument yields

p_10 = Σ_i f_i θ_i(1 − θ_i).

Since, again, the expressions for p_10 and p_01 are equal for all distributions of θ_i, it is clear that individual differences in learning rates alone could not account for the observed results.

A related hypothesis that might seem to merit consideration is that of individual differences in rates of forgetting. Since the proportion of correct responses on T_2 is less than that on T_1, there is evidently some retention loss, and differences among subjects, or items, in susceptibility to this retention loss might be a source of bias in the data. The hypothesis can be formulated in the linear model as follows: the probability of the correct response on T_1 is equal to θ; if, however, there is a retention loss, then the probability of a correct response on T_2 will have declined to some value ρ, such that ρ < θ. If there are individual differences in amount of retention loss, then we should again categorize the population of subjects and items into subgroups, with a proportion f_i of the subjects characterized by retention parameter ρ_i.
Theoretical expressions for p_{jk} can be derived for such a population by the same method used in the preceding case; the results are

p_00 = θ Σ_i f_i ρ_i,        p_01 = θ Σ_i f_i (1 − ρ_i),
p_10 = (1 − θ) Σ_i f_i ρ_i,   p_11 = (1 − θ) Σ_i f_i (1 − ρ_i).

This time the expressions for p_10 and p_01 are different; with a suitable choice of parameter values, they could accommodate the difference between the observed proportions p_01 and p_10. However, another difficulty remains. To obtain a near-zero value for p_10 would require either a θ near unity, which would be incompatible with the observed proportion of 0.385 correct on T_1, or a value of Σ_i f_i ρ_i near zero, which would be incompatible with the observed proportion of 0.255 correct on T_2. Thus we have no support for the hypothesis that individual differences in amount of retention loss might account for the pattern of empirical values.

We could go on in a similar fashion and examine the results of supplementing the original linear model by hypotheses involving more complex combinations or interactions of possible sources of bias (see Estes, 1961b). For example, we might assume that there are large individual differences in both learning and retention parameters. But, even with this latitude, it would not be easy to adjust the linear model to the RTT data. Suppose that we admit different learning parameters, θ_1 and θ_2, and different retention parameters, ρ_1 and ρ_2, the combination θ_1ρ_1 obtaining for half the items and the combination θ_2ρ_2 for the other half. Now the p_{jk} formulas become

p_00 = [θ_1ρ_1 + θ_2ρ_2]/2,
p_01 = [θ_1(1 − ρ_1) + θ_2(1 − ρ_2)]/2,
p_10 = [(1 − θ_1)ρ_1 + (1 − θ_2)ρ_2]/2,
p_11 = [(1 − θ_1)(1 − ρ_1) + (1 − θ_2)(1 − ρ_2)]/2.

From the data column of Table 6 the proportions of correct responses on the first and second test trials are p_{0·} = 0.385 and p_{·0} = 0.255, respectively. Adding the first and second of the foregoing equations to obtain the theoretical expression for p_{0·} and the first and third equations to get p_{·0}, we have

p_{0·} = (θ_1 + θ_2)/2   and   p_{·0} = (ρ_1 + ρ_2)/2.

Equating theoretical and observed values, we obtain the constraints

θ_1 + θ_2 = 0.770,   ρ_1 + ρ_2 = 0.510,

which should be satisfied by the parameter values. If the proportion p_00 in Table 6 is to be predicted correctly, we must have

(θ_1ρ_1 + θ_2ρ_2)/2 = 0.238,

or, substituting from the two preceding equations,

θ_1ρ_1 + (0.77 − θ_1)(0.51 − ρ_1) = 0.476,

which may be solved for θ_1:

θ_1 = (0.083 + 0.77ρ_1)/(2ρ_1 − 0.51).

Now the admissible range of parameter values can be further reduced. For the right-hand side of this last equation to have a value between 0 and 1, ρ_1 must be greater than 0.48; so we have the relatively narrow bounds on the parameters ρ_i:

0.48 < ρ_1 < 0.51,   0 < ρ_2 < 0.03.

Using these bounds on ρ_1, we find from the equation expressing θ_1 as a function of ρ_1 that θ_1 must in turn satisfy 0.93 < θ_1 < 1.0. But now the model is in trouble, for, in order to satisfy the constraint θ_1 + θ_2 = 0.77, θ_2 would have to be negative (and the correct response probabilities for half of the items on T_1 would also be negative). About the best we can do, without allowing "negative probabilities," is to use the limits we have obtained for ρ_1, ρ_2, and θ_1 and arbitrarily assign a zero or small positive value to θ_2. Choosing the combination θ_1 = 0.95, θ_2 = 0.01, ρ_1 = 0.5, and ρ_2 = 0.01, we obtain the theoretical values listed for the linear model in Table 6. By introducing additional assumptions or additional parameters, we could improve the fit of the linear model to these data, but there would seem to be little point in doing so.
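The algebra above is easy to check numerically. The short script below is illustrative only and not part of the original text; it verifies that a mixture of θ values cannot separate p_01 from p_10, traces the bound on θ_1 implied by the constraint equations, and evaluates the two-group predictions for the parameter combination quoted above. With these rounded parameter values the computed entries agree with the Retention-Loss column of Table 6 to within roughly 0.005.

```python
import numpy as np

# 1. Individual differences in learning rate alone: p01 = p10 for any mixture of thetas.
rng = np.random.default_rng(0)
f = rng.dirichlet(np.ones(5))                # arbitrary mixing proportions f_i
theta = rng.uniform(0.0, 1.0, 5)             # arbitrary theta_i values
p01 = np.sum(f * theta * (1 - theta))        # correct on T1, error on T2
p10 = np.sum(f * (1 - theta) * theta)        # error on T1, correct on T2
print(abs(p01 - p10))                        # identically zero

# 2. theta1 as a function of rho1 under the constraints theta1 + theta2 = 0.770,
#    rho1 + rho2 = 0.510, and p00 = 0.238.
def theta1_of(rho1):
    return (0.083 + 0.77 * rho1) / (2 * rho1 - 0.51)

for rho1 in (0.485, 0.49, 0.50, 0.51):
    print(rho1, round(theta1_of(rho1), 3))   # all within the bounds 0.93 < theta1 < 1.0

# 3. Predictions for the quoted combination theta1=0.95, theta2=0.01, rho1=0.5, rho2=0.01
#    (equal mixing proportions of 1/2).
th = np.array([0.95, 0.01]); rho = np.array([0.5, 0.01])
p = {"p00": np.mean(th * rho),
     "p01": np.mean(th * (1 - rho)),
     "p10": np.mean((1 - th) * rho),
     "p11": np.mean((1 - th) * (1 - rho))}
print({k: round(v, 3) for k, v in p.items()})
```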
The refractoriness of the data to description by any reasonably simple form of the model suggests that perhaps the learning process is simply not well represented by the type of growth function embodied in the linear model. By contrast, these data can be quite readily handled by the stimulus fluctuation model developed in the preceding section. Letting f_1 = 1 in Eqs. 70 and using the estimates c = 0.39 and f_2 = 0.61, we obtain the theoretical values listed under "Sampling Model" in Table 6. We would not, of course, claim that the sampling model had been rigorously tested, since two parameters had to be estimated and there are only three degrees of freedom in this set of data. However, the model does seem more promising than any of the variants of the linear model that have been investigated. More stringent tests of the sampling model can readily be obtained by running similar experiments with longer sequences of test trials, since predictions concerning joint response proportions over blocks of three or more test trials can be generated without additional assumptions.

ADDITIONAL COMPARISONS BETWEEN THE LINEAR AND PATTERN MODELS. We now turn to a few comparisons between the linear model and the multi-element pattern model for the simple noncontingent situation. First of all, we note that the mean learning curves for the two models (as given in Eq. 37 and Eq. 72) are identical if we let c/N = θ. However, the expressions for the variance of the asymptotic response distribution are different; for the linear model σ∞² = π(1 − π)[θ/(2 − θ)], whereas for the pattern model σ∞² = π(1 − π)(1/N). This difference is reflected in another prediction that provides a more direct experimental test of the two models. It concerns the asymptotic variance of the distribution of the number of A_1-responses in a block of K trials, which we denote Var(A_K). Expressions for this variance are given for the linear model by Estes and Suppes (1959a) and for the pattern model by Eq. 42. Note that, for c = θ, the variance for the pattern model is larger than for the linear model. However, for the case of θ = c/N the variance for the pattern model can be larger or smaller than for the linear model, depending on the particular values of c and N.

Finally, we present certain asymptotic sequential predictions for the linear model in the noncontingent situation; namely,

lim_{n→∞} Pr(A_{1,n+1} | E_{1,n}A_{1,n}) = (1 − θ)a + θ,
lim_{n→∞} Pr(A_{1,n+1} | E_{2,n}A_{1,n}) = (1 − θ)a,
lim_{n→∞} Pr(A_{1,n+1} | E_{1,n}A_{2,n}) = (1 − θ)b + θ,
lim_{n→∞} Pr(A_{1,n+1} | E_{2,n}A_{2,n}) = (1 − θ)b,

where

a = π + θ(1 − π)/(2 − θ)   and   b = π − θπ/(2 − θ).

These predictions are to be compared with Eq. 34 for the pattern model. In the case of the pattern model we note that Pr(A_1 | E_1A_1) and Pr(A_1 | E_2A_2) depend only on π and N, whereas Pr(A_1 | E_2A_1) and Pr(A_1 | E_1A_2) depend on π, N, and c. In contrast, all four sequential probabilities depend on π and θ in the linear model. For comparisons between the linear model and the pattern model in application to two-choice data, the reader is referred to Suppes & Atkinson (1960).

4.4 Applications to Multiperson Interactions

In this section we apply the linear model to experimental situations involving multiperson interactions in which the reinforcement for any given subject depends both on his response and on the responses of other subjects. Several recent investigations have provided evidence indicating the fruitfulness of this line of development.
For example, Bush and Mos- teller (1955) have analyzed a study of imitative behavior in terms of their linear model, and Estes (1957a), Burke (1959, 1960), and Atkinson and Suppes (1958) have derived and tested predictions from linear models for behavior in two- and three-person games. Suppes and Atkinson (1960) have also provided a comparison between pattern models and linear models for multiperson experiments and have extended the analysis to situations involving communication between subjects, monetary payoff, social pressure, economic oligopolies, and related variables. The simple two-person game has particular advantages for expository purposes, and we use this situation to illustrate the technique of extending the linear model to multiperson interactions. We consider a situation which, from the standpoint of game theory (see, e.g., Luce & Raiffa, 1957), may be characterized as a game in normal form with a finite number of strategies available to each player. Each play of the game constitutes a trial, and a player's choice of a strategy for a given trial corresponds to the selection of a response. To avoid problems having to do with the measure- ment of utility (or from the viewpoint of learning theory, problems of reward magnitude), we assume a unit reward that is assigned on an all-or- none basis. Rules of the game require the two players to exhibit their choices simultaneously on all trials (as in a game of matching pennies), and each player is informed that, given the choice of the other player on the trial, there is exactly one choice leading to the unit reward. We designate the two players as A and B and let At (i = 1, . . . , r) and Bj(j= 1, . , . , rr) denote the responses available to the two players. The set of reinforcement probabilities prescribed by the experimenter may be represented in a matrix (<%, bi3) analogous to the "payoff matrix*7 familiar in game theory. The number ais represents the probability of Player A being correct on any trial of the experiment, given the response pair AtB^; similarly, by is the probability of Player B being correct, given the response pair AtBf. For example, consider the matrix t,i 1,01 1,0 0,lJ. COMPONENT AND LINEAR MODELS FOR SIMPLE LEARNING 2$$ When both subjects make Response 1 , each has probability J of receiving reward; when both make Response 2, then only Player B receives reward; when either of the other possible response pairs occurs (i.e., A2B1 or A-^B^^ then only Player A receives reward. It should be emphasized that, although one usually thinks of one player winning and the other losing on any given play of a game, this is not a necessary restriction on the model. In theory, and in experimental tests of the theory, it is quite possible to permit both or neither of the players to be rewarded on any trial. However, to provide a relatively simple theoretical interpretation of reinforcing events, it is essential that on a nonrewarded trial the player be informed (or led to infer) that some other choice, had he made it under the same circumstances, would have been successful. We return to this point later. Let ElA) denote the event of reinforcing the A£ response for Player A and jEjB) the event of reinforcing the B3 response for Player B. 
To simplify our analysis, we consider the case in which each subject has only two response alternatives, and we define the probability of occurrence of a particular reinforcing event in terms of the payoff parameters as follows (for i 5* i 'and y ^/): a,, = Pr (E^ | AiiUBJt J fr<, = Pr (£<*> | A^nB^ 1 - ais = Pr (Ef} | AitnBj}n) 1 - bis = Pr (£<*> j AitUB^. ( ) For example, if Player A makes an ^-response and is rewarded, then an E(A} occurs; however, if an Al is made and no reward occurs, then we assume that the other response is reinforced, that is, an E^ occurs. Finally, one last definition to simplify notation. We denote Player A's response probability by a and Player #'s by /?, and we denote by y the joint probability of an A:- and ^-response. Specifically, ocn = Pr (Al9J, Pn = Pr (BltJ9 yn = Pr (AlfnBlfJ. (74) We now derive a theorem that provides recursive expressions for an and fin and points up a property of the model that greatly complicates the mathematics, namely, that both aw+1 and fin+1 depend on the joint prob- ability yn = ?r(AlinB1)n). The statement of the theorem is as follows: = [1 - 6A(2 - fl18 - a22)K + QA(a21 - where 6A and 6B are the learning parameters for players A and B. In the proof of this theorem it will suffice to derive the difference equation for | A{,nBiin) Pr (A^B,- J «»J and by Eqs. 73 Pr (££>) = an Pr (Alf wB1>n) + al2 Pr (^wB8fB) + (1 - flu) Pr (A2iBBlfB) + (1 - 022) Pr G42,,A.B). (76) Next we observe that Pr (A1)nB2,n) = Pr (*M | ^ljW) Pr (A^ |^lfW)]Pr(^lfJ (77a) Similarly, Pr (At.&J = Pr (Bltn) - Pr (A^J, (lib) and Pr (AZtnB2fn) = Pr (^2,w | B£ffl) Pr (J^J CJ = 1 - Pr (5ljn) - Pr (Alsn) + Pr (^ltnJ1>fl). Substituting into Eq. 76 from Eqs. 11 a^ lib, and 77c and simplifying by means of the definitions of a, /3, and 7, we obtain Pr (E^) = flllyB + a12(aw - yj + (1 - a21)(^ - yj + (1 - a22)(l - «B - ^ + yj = —(1 — a12 — a22)aw ai Substitution of this expression into the general recursion for 22 — *12 If = *11 + ^12 ~ *21 — £>22 A = 1 — £22. By eliminating 7 from Eqs. 78 we obtain the following linear relation in oc (-ag - ce)v. + (bg + cf)p = ch- dg. (80) Unfortunately, this relationship is one of the few quantitative results that can be directly computed for the linear model. It has, however, the advantageous feature that it is independent of the learning parameters 6A and 6B and therefore may be compared directly with experimental data. Application of this result can be illustrated in terms of the game cited earlier in which the payoff matrix takes the form Bi B2 i t 1, 01 l,0 0,lJ. From Eqs. 79 we obtain 0 = 1 i= -1 c = J rf= 1 e = l /=! ^=-t A = 0 and Eq. 80 becomes (i - t)« + (i + t)]8 = t or )S = J. From this result we predict immediately that the long-run proportion of ^-responses will tend to J. To derive a prediction for Player A, we substitute the known values of the parameters into the first part of Eq. 78 to obtain Unfortunately we cannot compute y, the asymptotic probability of the ^i^-response pair. However, we know y is positive, and, since only one half of Player 5's responses are ^'s, y cannot be greater than \. Therefore we have 0 < y < J and as a result can set definite bounds on the long-run probability of an y^-response, namely, t Nc. If TV and T>trials have equal prob- abilities, then the probability, to be denoted Vn9 that a pattern containing only cues from Sc will be conditioned to the ^-response on trial n can be obtained from Eq. 28 by setting 7r12 = 7r21 = £: -V and _ (I') 2^2 STIMULUS SAMPLING THEORY where that is, cfe)"-1. 
(83) The remaining patterns available on 7\-trials all contain at least one cue from Si and thus occur only on trials when response Al is reinforced. The probability, to be denoted £/„, that any one of these is conditioned to At on trial n may be similarly obtained by rewriting Eq. 28, this time with 7rla = 0, 7721 = 1, Pr (AltU) = Un9 and c/N = Jcfi, that is, t/n = l - (1 - 170(1 - ic6)«~i, (84) where the factor | enters because these patterns are available for sampling on only one half of the trials. Now, to obtain the probability of an ^-response if a 7\-display is presented on trial «, we need only combine Eqs. 83 and 84, weighting each by the probability of the appropriate type of pattern, namely, - eft)"-1, (850) which may be simplified, if C/i = Fx = £, to \, J = 1 - K - Kl - vvc)(l - icZO«-i. The resulting expression for probability of a correct response has a number of interesting general properties. The asymptote, as anticipated, depends in a simple way on wc, the proportion of "irrelevant patterns." When wc = 0, the asymptotic probability of a correct response is unity; when wc = 1, the whole process reduces to simple random reinforcement. Between these extremes, asymptotic performance varies inversely with wc, so that the terminal proportion of correct responses on either type of trial provides a simple estimate of this parameter from data. The slope parameter cb could then be estimated from total errors over a series of trials. As in Case 1, the rate of approach to asymptote proves to depend only on the conditioning parameters and total number of patterns avail- able for sampling; thus it is a joint function of the total number of cues N + Nc and the sample size s but does not depend on the relative pro- portions of relevant and irrelevant cues. The last result may seem im- plausible, but it should be noted that the result depends on the simplifying assumption of the pattern model that there are no transfer effects from DISCRIMINATION LEARNING 243 learning on one pattern to performance on another pattern that has component cues in common with the first. The situation in this regard is different for the "mixed model" to be discussed next. 5.2 A Mixed Model The pattern model may provide a relatively complete account of dis- crimination data in situations involving only distinct, readily discriminable patterns of stimulation, as, for example the "paired-comparison" experi- ment discussed in Sec. 2.3 or the verbal discrimination experiment treated by Bower (1962). Also, this model may account for some aspects of the data (e.g., asymptotic performance level, trials to criterion) even in dis- crimination experiments in which similarity, or communality, among stimuli is a major variable. But, to account for other aspects of the data in cases of the latter type, it is necessary to deal with transfer effects through- out the course of learning. The approach to this problem which we now wish to consider employs no new conceptual apparatus but simply a combination of ideas developed in preceding sections. In the mixed model the conceptualization of the discriminative situation and the learning assumptions is exactly the same as that of the pattern model discussed in Sec. 5.1. The only change is in the response rule and that is altered in only one respect. 
As before, we assume that once a stimulus pattern has become conditioned to a response it will evoke that response on each subsequent occurrence (unless on some later trial the pattern becomes reconditioned to a different response, as, for example, during reversal of a discrimination). The new feature concerns patterns which have not yet become conditioned to any of the response alternatives of the given experimental situation but which have component cues in common with other patterns that have been so conditioned. Our assump- tion is simply that transfer occurs from a conditioned to an unconditioned pattern in accordance with the assumptions utilized in our earlier treatment of compounding and generalization (specifically, by axiom C2, together with a modified version of Cl, of Sec, 3.1). Before the assumptions about transfer can be employed unambiguously in connection with the mixed model, the notion of conditioned status of a component cue needs to be clarified. We shall say that a cue is condi- tioned to response A± if it is a component of a stimulus pattern that has become conditioned to response A+ If a cue belongs to two patterns, one of which is conditioned to response A^ and one to response Aj (i ^j\ then the conditioning status of the cue follows that of the more recently conditioned pattern. If a cue belongs to no conditioned pattern, then it is 244 STIMULUS SAMPLING THEORY said to be in the unconditioned, or "guessing," state. Note that a pattern may be unconditioned even though all of its cues are conditioned. Suppose for example, that a pattern consisting of cues x, y, and z in a particular arrangement has never been presented during the first n trials of an experi- ment but that each of the cues has appeared in other patterns, say wxy and HTZ, which have been presented and conditioned. Then all of the cues of pattern xyz would be conditioned, but the pattern would still be in the unconditioned state. Consequently, if wxy had been conditioned to response AI and wvz to A2, the probability of A± in the presence of pattern xyz would be f ; but, if response A± were effectively reinforced in the presence of xyz, its probability of evocation by that pattern would hence- forth be unity. The only new complication arises if an unconditioned pattern includes cues that are still in the unconditioned state. Several alternative ways of formulating the response rule for this case have some plausibility, and it is by no means sure that any one choice will prove to hold for all types of situations. We shall limit consideration to the formulation suggested by a recent study of discrimination and transfer which has been analyzed in terms of the mixed model (Estes & Hopkins, 1961). The amended response rule is a direct generalization of Axiom C2 of Sec. 3.1 ; specifically, for a situation involving r response alternatives the following assumptions will apply: 1. If all cues in, a pattern are unconditioned, the probability of any response Ai is equal to 1/r. 2. If a pattern (sample) comprises m cues conditioned to response A^ mf cues conditioned to other responses, and m" unconditioned cues, then the probability that Ai will be evoked by this pattern is given by m + m + m In other words, Axiom C2 holds but with each unconditioned cue con- tributing "weight5' 1/r toward the evocation of each of the alternative responses. To illustrate these assumptions in operation, let us consider a simple classical discrimination experiment involving three cues, a, b, and c, and two responses, A^ and A2. 
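The amended response rule is easy to state computationally. The fragment below is a sketch added here for illustration and is not part of the original text; it counts, for a presented pattern, the cues conditioned to the target response, the cues conditioned to other responses, and the unconditioned cues, and applies the weighting just described. It reproduces the value of 2/3 obtained for pattern xyz in the example above.

```python
def response_probability(pattern, cue_state, target, r=2):
    """Probability that response `target` is evoked by `pattern`.
    cue_state maps each cue to the response it is conditioned to, or to None
    if the cue is unconditioned; unconditioned cues contribute weight 1/r."""
    states = [cue_state.get(cue) for cue in pattern]
    if all(s is None for s in states):
        return 1.0 / r                      # assumption 1: all cues unconditioned
    m  = sum(s == target for s in states)   # cues conditioned to the target response
    mu = sum(s is None for s in states)     # unconditioned cues
    return (m + mu / r) / len(states)       # assumption 2

# Example from the text: wxy conditioned to A1, wvz to A2, pattern xyz never presented.
# Cue w follows whichever pattern was conditioned more recently; it does not enter xyz.
cue_state = {"w": "A1", "x": "A1", "y": "A1", "v": "A2", "z": "A2"}
print(response_probability("xyz", cue_state, "A1"))   # 2/3
print(response_probability("xyz", cue_state, "A2"))   # 1/3
```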
We shall assume that the pattern ac is presented "on half of the trials, with A^ reinforced, and be on the other hah0" of the trials, with A2 reinforced, the two types of trials occurring in random sequence. We assume further that conditions are such as to ensure the subject's sampling both cues presented on each trial. In a tabulation of the possible conditioning states of each pattern a 1, 2, or 0, respectively, in a state column indicates that the pattern is conditioned to Al9 conditioned to Az, or unconditioned. For each pair of values under States, the associated DISCRIMINATION LEARNING 245 .^-probabilities, computed according to the modified response rule, are given in the corresponding positions under ^-probability. To reduce algebraic complications, we shall carry out derivations for the special case in which the subject starts the experiment with both patterns un- conditioned. Then, under the conditions of reinforcement specified, only States ^i-Probability to Each Pattern ac be ac be 1 2 1 0 1 1 1 1 2 2 0 0 2 1 0 1 0 1 1 1 0 1 2 0 J 1 0 ! 2 0 0 0 0 i i * the states represented in the first, seventh, sixth, and ninth rows of the table are available to the subject, and for brevity we number these states 3, 2, 1, and 0, in the order just listed; that is, State 3 = pattern ac conditioned to Al9 and pattern be conditioned to A* State 2 = pattern ac conditioned to Aly and pattern be unconditioned. State 1 = pattern ac unconditioned, and pattern be conditioned to Az. State 0 = both patterns ac and be are unconditioned. Now, these states can be interpreted as the states of a Markov chain, since the probability of transition from any one of them to any other on a given trial is independent of the preceding history. The matrix of prob- abilities for one-step transitions among the four states takes the following form: ^1000 £ 2 2 0 0 c 2 0 0 1-c (86) 2^6 STIMULUS SAMPLING THEORY where the states are ordered 3, 2, 1, 0 from top to bottom and left to right. Thus State 3 (in which ac is conditioned to At and be to A^ is an absorbing state, and the process must terminate in this state, with asymptotic prob- ability of a correct response to each pattern equal to unity. In State 2 pattern ac is conditioned to Ai9 but be is still unconditioned. This state can be reached only from State 0, in which-both patterns are unconditioned ; the probability of the transition is J (the probability that pattern ac will be presented) times c (the probability that the reinforcing event will produce conditioning); thus the entry in the second cell of the bottom row is c/2. From State 2 the subject can go only to State 3, and this transition again has probability c/2. The other cells are filled in similarly. Now the probability uin of being in state i on trial n can be derived quite easily for each state. The subject is assumed to start the experiment in State 0 and has probability c of leaving this state on each trial; hence «o.n = (1 - O*-1- For State 1 we can write a recursion, (\rt-2 / A77"3 c C i-f) |+(i-D a-c)f + ... + a-«r-f. which holds if n > 2. To be in State 1 on trial n the subject must have entered at the end of trial 1, which has probability c/2, and then remained for n — 2 trials, which has probability [(1 — (c/2)]n~2; have entered at the end of trial 2, which has probability (1 — c)(c/2), and then remained for n — 3 trials, which has probability [1 — (c/2)]n~3; . . . ; or have entered at the end of trial n - 1, which has probability (1 - c)n~2(c/2). 
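The four-state chain can be iterated numerically. The fragment below is an illustrative sketch, not part of the original text, and the value of c is arbitrary. The response weights attached to States 3, 2, 1, and 0 for an ac presentation are 1, 1, 1/4, and 1/2, as given by the amended response rule (in State 1 cue c is conditioned to A2 and cue a is unconditioned, so the pattern evokes A1 with probability (0 + 1/2)/2 = 1/4).

```python
import numpy as np

c = 0.3                                  # conditioning parameter (illustrative value)

# One-step transition matrix, states ordered 3, 2, 1, 0 as in Eq. 86.
P = np.array([
    [1.0,       0.0,       0.0,     0.0],    # State 3 is absorbing
    [c / 2, 1 - c / 2,     0.0,     0.0],    # State 2 -> 3 when bc appears and conditions
    [c / 2,     0.0,   1 - c / 2,   0.0],    # State 1 -> 3 when ac appears and conditions
    [0.0,     c / 2,     c / 2,   1 - c],    # State 0 -> 2 or 1, each with probability c/2
])

w = np.array([1.0, 1.0, 0.25, 0.5])      # Pr(A1 | ac) in States 3, 2, 1, 0
u = np.array([0.0, 0.0, 0.0, 1.0])       # subject starts the experiment in State 0

for n in range(1, 13):
    print(n, round(float(u @ w), 3))     # probability of a correct response to ac on trial n
    u = u @ P                            # state probabilities for the next trial
```

The text notes that the corresponding expression for bc is identical, so these values are also the unconditional probability of a correct response on trial n.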
The right- hand side of this recursion can be summed to yield I r 9 — r ~~\ "-1 i Lf - c_\ _ lL2(l-c)J - 0 - c)n"1- By an identical argument we obtain / c\n- x «M = (J - f) - respectively, when the subject is in State 3, 2, 1, or 0. Consequently the probability of a correct (A ^ response to ac is obtained simply by summing these response probabilities, each weighted by the state probability, namely, Pr (AItn | ac) = u3tn + u2t7l + - uI>n + - u0iU (/\n— 1 l_£j 71-1 Equation 87 is written for the probability of an ^-response to ac on trial n ; however, the expression for probability of an ,42-response to be is identical, and consequently Eq. 87 expresses also the probability pn of a correct response on any trial, without regard to the stimulus pattern pre- sented. A simple estimator of the conditioning parameter c is now obtain- able by summing the error probability over trials. Letting e denote the expected total errors during learning, we have c\n-l 4~1\-~2/ ~4n~i 4c An example of the sort of prediction involving a relatively direct assess- ment of transfer effects is the following. Suppose the first stimulus pattern to appear is ac; the probability of a correct response to it is, by hypothesis, \, and if there were no transfer between patterns the probability of a correct response to be when it first appeared on a later trial should be J also. Under the assumptions of the mixed model, however, the probability of a 2$8 STIMULUS SAMPLING THEORY correct response to be, if it first appeared on trial 2? should be [1 - jq - c) - c] + \ = 1 c . 2 24' if it first appeared on trial 3, it should be Kl - c)2 + j = 1 2 2 and so on, tending to J after a sufficiently long prior sequence of #c trials. Simply by inspection of the transition matrix we can develop an interest- ing prediction concerning behavior during the presolution period of the experiment. By presolution period we mean the sequence of trials before the last error for any given subject. We know that the subject cannot be in State 3 on any trial before the last error. On all trials of the presolution period the probability of a correct response should be equal either to \ (if no conditioning has occurred) or to f (if exactly one of the two stimulus patterns has been conditioned to its correct response). Thus the propor- tion, which we denote by Pps, of correct responses over the presolution trial sequence should fall in the interval t < P9, < f > and, in fact, the same bounds obtain for any subset of trials within the presolution sequence. Clearly, predictions from this model concerning presolution responding differ sharply from those derivable from any model that assumes a continuous increase in probability of correct responding during the presolution period; this model also differs, though not so sharply, from a pure "insight57 model that assumes no learning on pre- solution trials. As far as we know, no data relevant to these differential predictions are available in the literature (though similar predictions have been tested in somewhat different situations: Suppes & Ginsberg, 1963; Theios, 1963). Now that the predictions are in hand, it seems likely that pertinent analyses will be forthcoming. The development in this section was for the case in which there were only three cues, a, b, and c. For the more general case we could assume that there are Na cues associated with stimulus a, Nb with stimulus Z>, and Nc with stimulus c. 
If we assume, as we have in this section, that experimental conditions are such to ensure the subject's sampling all cues presented on each trial, then Eq. 87 may be rewritten as Pr (A, n I ac) = 1 (1 + w w 2\ Pr (A^ | be} = 1 - ~(1 + w2)(l - 01 * + i w2(l - c)^1, DISCRIMINATION LEARNING where Further, = [1 - Pr (Al>n | flc)] + [1 - Pr (4a,n | be)] 2 where w = J(H'i + ^2)- The parameter w is an index of similarity between the stimuli # c and be ; as w approaches its maximum value of 1 , the number of total errors increases. Further, the proportion of correct responses over the presolution trial sequence should fall in the interval i < P,, < \ + 1(1 - ^i) or in the interval \ < P»s < i + id - H'a), depending on whether ac or be is conditioned first. 5.3 Component Models As long as the number of stimulus patterns involved in a discrimination experiment is relatively small, an analysis in terms of an appropriate case of the mixed model can be effected along the lines indicated in Sec. 5.2. But the number of cues need become only moderately large in order to generate a number of patterns so great as to be unmanageable by these methods. However, if the number of patterns is large enough so that any particular pattern is unlikely to be sampled more than once during an experiment, the emendations of the response rule presented in Sec. 5.2 can be neglected and the process treated as a simple extension of the com- ponent model of Sec. 4.1. Suppose, for example, that a classical discrimination involved a set, SI9 of cues available only on trials when A± is reinforced, a set, S2, of cues available only on trials when A2 is reinforced, and a set, SC9 of cues available on all trials; further, assume that a constant fraction of each set presented is sampled by the subject on any trial. If the two types of trials occur with equal probabilities and if the numbers of cues in the various sets are large enough so that the number of possible trial samples is larger than the number of trials in the experiment, then we may apply Eq. 53 of Sec. 3.3 to obtain approximate expressions for response probabilities. For example, asymp- totically all of the NI elements of SI and half of the Nc elements of Sc STIMULUS SAMPLING THEORY (on the average) would be conditioned to response Ai9 and therefore probability of A± on a trial when S± was presented would be predicted by the component model to be which will, in general, have a value intermediate between \ and unity. Functions for learning curves and other aspects of the data can be derived for various types of discrimination experiments from the assumptions of the component model. Numerous results of this sort have been pub- lished (Burke & Estes, 1957; Bush & Mosteller, 1951b; Estes, 1958, 1961a; Estes, Burke, Atkinson & Frankmann, 1957; Popper, 1959; Popper & Atkinson, 1958). 5,4 Analysis of a Signal Detection Experiment Although, so far, we have developed stimulus sampling models only in connection with simple associative learning and discrimination learning, it should be noted that such models may have much broader areas of application. On occasion we may even see possibilities of using the con- cepts of stimulus sampling and association to interpret experiments that, by conventional classifications, do not fall within the area of learning. In this section we examine such a case. 
The experiment to be considered fits one of the standard paradigms associated with studies of signal detection (see, e.g., Tanner & Swets, 1954; Swets, Tanner, & BirdsaU, 1961; or Chapter 3, Vol. 1, by Luce). The subject's task in this experiment, like that of an observer monitoring a radar screen, is to detect the presence of a visual signal which may occur from time to time in one of several possible locations. Problems of interest in connection with theories of signal detection arise when the signals are faint enough so that the observer is unable to report them with complete accuracy on all occasions. One empirical relation that we would want to account for, in quantitative detail, is that between detection probabilities and the relative frequencies with which signals occur in different locations. Another is the improvement in detection rate that may occur over a series of trials even when the observer receives no knowledge of results. A possible way of accounting for the "practice effect" is suggested by some rather obvious analogies between the detection experiment and the probability learning experiment considered earlier: we expect that, when the subject actually detects a signal (in terms of stimulus sampling theory, samples the corresponding stimulus element), he will make the appropriate DISCRIMINATION LEARNING 2$I verbal report. Further, in the absence of any other information, this detection of the signal may act as a reinforcing event, leading to condition- ing of the verbal report to other cues in the situation which may have been available for sampling before the occurrence of the signal. If so, and if signals occur in some locations more often than in others, then on the basis of the theory developed in earlier sections we should predict that the subject will come to report the signal in the preferred location more frequently than in others on trials when he fails to detect a signal and is forced to respond to background cues. These notions are made more explicit in connection with the following analysis of a visual recognition experiment reported by Kinchla (1962). Kinchla employed a forced-choice, visual-detection situation involving a series of more than 900 discrete trials for each subject. Two areas were outlined on a uniformly illuminated milk-glass screen. Each trial began with an auditory signal, during which one of the following events occurred : 1. A fixed increment in radiant intensity occurred in area 1 — a T^-type trial. 2. A fixed increment in radiant intensity occurred in area 2 — a T2-type trial. 3. No change in the radiant character of either signal area occurred — a retype trial. Subjects were told that a change in illumination would occur in one of the two areas on each trial. Following the auditory signal, the subject was required to make either an A^ or an ^-response (i.e., select one of two keys placed below the signal area) to indicate the area he believed had changed in brightness. The subject was given no information at the end of the trial as to whether his response was correct. Thus, on a given trial, one of three events occurred (Tl9 T2> T0), the subject made either an Ar or an ^-response, and a short time later the next trial began. For a fixed signal intensity, the experimenter has the option of specifying a schedule for presenting the rrevents. Kinchla selected a simple prob- abilistic procedure in which Pr (TM) = ££ and ^ + f 2 + I0 = L Two groups of subjects were run. For Group /, f i + ?* = 0.4 and f0 = 0.2. 
For Group II, f x = |0 = 0-2 and fa = 0.6. The purpose of Kinchla's study was to determine how these event schedules influenced the likelihood of correct detections. The model that we shall use to analyze the experiment combines two quite distinct processes: a simple perceptual process defined with regard to the signal events and a learning process associated with background cues. The stimulus situation is conceptually represented in terms of two sensory elements, s-L and % corresponding to the two alternative signals, STIMULUS SAMPLING THEORY and a set, S, of elements associated with stimulus features common to all trials. On every trial the subject is assumed to sample a single element from the background set 5, and he may or may not sample one of the sensory elements. If the s1 element is sampled, an Al occurs; if s2 is sampled, an A% occurs. If neither sensory element is sampled, the subject makes the response to which the background element is conditioned. Conditioning of elements in S changes from trial to trial via a learning process. The sampling of sensory elements depends on the trial type (Tly T^ T0) and is described by a simple probabilistic model. The learning process associated with S is assumed to be the multi-element pattern model pre- sented in Sec. 2. Specifically, the assumptions of the model are embodied in the following statements : 1. If Ti (i = 1, 2) occurs, then sensory element st will be sampled with probability h (with probability 1 — h neither sl nor s2 will be sampled). If TQ occurs, then neither sl nor s2 will be sampled. 2. Exactly one element is sampled from S on every trial. Given the set S of N elements, the probability of sampling a particular element is l/N. 3. If St (i = 1, 2) is sampled on trial «, then with probability c' the ele- ment sampled from S on the trial becomes conditioned to AI at the end of trial n. If neither ^ nor s2 is sampled, then with probability c the element sampled from S becomes conditioned with equal likelihood to A± or A2 at the end of trial n. 4. If sensory element st is sampled, then Ai will occur. If neither sensory element is sampled, then the response to which the sampled element from 5 is conditioned will occur. If we let/?n denote the expected proportion of elements in S conditioned to A! at the start of trial «, then (in terms of statements 1 and 4) we can immediately write an expression for the likelihood of an ^-response, given a 7,-event, namely, 1 rl>n) = h + (1 - K)Pn, (88fl) | J2,n) = h + (1 - A)(l - pn\ (88Z>) The expression for pn can be obtained from Statements 2 and 3 by the same methods used throughout Sec. 2 of this chapter (for a derivation of this result, see Atkinson, 1963a): r i i"'1 Pn = Poo - (Poa - Pi) I 1 - ~ (* + 6) > DISCRIMINATION LEARNING where a = £Jicr + (i - A)(c/2) + ^(c/2), 6 = f2Ac' + (1 - h)(cj2) + fo/z(c/2), and/?^ = #/(# + Z>). Division of the numerator and denominator of /> co by c yields the expression + id - *D + gpfci where y = c'jc. Thus the asymptotic expression for pn does not depend on the absolute values of c' and c but only on their ratio. An inspection of Kinchla's data indicates that the curves for Pr (At \ Tj) are extremely stable over the last 400 or so trials of the experiment; con- sequently we shall view this portion of the data as asymptotic. 
Table 7 Table 7 Predicted and Observed Asymptotic Response Probabilities for Visual Detection Experiment Group I Group II Observed Predicted Observed Predicted Pr(AI\T1) 0.645 0.645 0.558 0.565 Pr(A2 Tj 0.643 0.645 0.730 0.724 Pr(^i TO) 0.494 0.500 0.388 0.388 presents the observed mean values of Pr (Ai \ T^) for the last 400 trials. The corresponding asymptotic expressions are specified in terms of Eqs. 88 and 89 and are simply Km Pr (Alin \ T1>M) = /z + (1 - h)p», (90a) lim Pr G42,n | r2>M) = h + (l 71-* 00 limPr(A1>n\T0in) = px. (90c) n-* In order to generate asymptotic predictions, we need values for h and y>. We first note by inspection of Eq. 89 that p^ = \ for Group I; in fact, whenever ^ = f 2» we have/?^ = J. Hence taking the observed asymptotic value for Pr (A± \ T^) in Group I (i.e., 0.645) and setting it equal to h + (1 — /z)J yields an estimate ofh — 0.289. The background illumination. and the increment in radiant intensity are the same for both experimental groups, and therefore we would require an estimate of h obtained from Group I to be applicable to Group II. In orderto estimate y>, we take the observed asymptotic value of Pr (A± \ TQ) in Group II and set it equal to the right side of Eq. 89 with h = 0.289, ^ = g 0 = 0.2, and £2 = 0.6; solving for ip, we obtain y = 2.8. Use of these estimates of h and y in Eqs. 89 and 90 yields the asymptotic predictions given in Table 7. 2^4 STIMULUS SAMPLING THEORY Over-all, the equations give an excellent account of these particular response measures. However, a more crucial test of the model is provided by an analysis of the sequential data. To indicate the nature of the sequen- tial predictions that can be obtained, consider the probability of an A^ response on a rrtrial, given the various trial types and responses that can occur on the preceding trial, that is, where z = 1, 2 and/ = 0, 1, 2. Explicit expressions for these quantities can be derived from the axioms by the same methods used throughout this chapter. To indicate their form, theoretical expressions for lim Pr(Alfn+l\Tl}n^Ai)nTjfn) n-*& are given, and, to simplify notation, they are written as Pr (A:L \ T The expressions for these quantities are as follows : Pr (Al | TlAlTl) = I* -- -• + - f (91a) Pr A N(l-X) N Pr (A, 1 TlAzTz) = . --. + - , (91c) Pr (A , Pr (A, | Tl^ID = + ~ , (91e) where y = C'h + (1 - C'), / = C' + (1 - C')*, 5 - (c/2)A + [1 - (c/2)], V = (c/2) + [1 - (c/2)]A, and It is interesting to note that the asymptotic expressions for Pr (Altn \ T^J depend only on h and y, whereas the quantities in Eq. 91 are functions of DISCRIMINATION LEARNING 2$$ all four parameters N, c, c', and h. Comparable sets of equations can be written for Pr (A2 \ T^AJ^ and Pr (Al \ TgA^). The expressions in Eq. 91 are rather formidable, but numerical predic- tions can be easily calculated once values for the parameters have been obtained. Further, independent of the parameter values, certain relations among the sequential probabilities can be specified. As an example of such Table 8 Predicted and Observed Asymptotic Sequential Response Probabilities in Visual-Detection Experiment Group I Group II Observed Predicted Observed Predicted Pr (At \ ^1^1) °-57 0.58 0.59 0.64 Pr (At \ Tr2/42J'1) 0.65 0.69 0.70 0.76 Pr (A, T^Ty 0.71 0.71 0.79 0.77 Pr (Az \ T^AiT^) 0.61 0.59 0.69 0.66 PrWil r^To) 0.54 0.59 0.68 0,66 Pi(Az TV^TQ) °-66 0.70 0.71 0.76 Pr (A \ T A T ^ 0 73 0.71 0.70 0.65 Pr G4X | TiA^T?) 
0.62 0.59 0.59 0.52 Pr (Al I jnXJV) 0.53 0.58 0.53 0.51 PrUi T^ra) 0.66 0.70 0.64 0.64 PrC^i ^i^i^o) 0-72 0.70 0.61 0.63 Pr 04! TV^O) °-61 0.59 0.48 0.52 PrG42 r^Ti) o .38 0.40 0.47 0.49 T$A<>T-d 0 .56 0.58 0.59 0.66 PrU2 r^Tg) o .64 0.60 0.67 0.68 Pr(A2 r^iTs) 0.47 0.42 0.51 0.51 TV^To) 0.47 0.42 0.50 0.51 Prw! ir^ro) o .60 0.58 0.65 0.66 a relation, it can be shown that Pr (A± \ T^T^) > Pr (A± \ T^A2T^ for any stimulus schedule and any set of parameter values. To see this, simply subtract Eq. 9 1/ from Eq. 9le and note that <5 > <5'. In Table 8 the observed values for Pr (A€ \ T^AkT^ are presented as reported by Kinchla. Estimates of these conditional probabilities were computed for individual subjects, using the data over the last 400 trials; the averages of these individual estimates are the quantities given in the table. Each entry is based on 24 subjects. In order to generate theoretical predictions for the observed entries in Table 8, values for N, c, c', and h are needed. Of course, estimates of h and y = c'\c have already been made for this set of data, and therefore it is 2j£ STIMULUS SAMPLING THEORY necessary only to estimate N and either c or cr. We obtain our estimates of N and c by a least-squares method; that is, we select a value of TV and c (where c' = cip) so that the sum of squared deviations between the 36 observed values in Table 8 and the corresponding theoretical quantities is minimized. The theoretical quantities for Pr (AI \ T^T^ are com- puted from Eq. 91; theoretical expressions for Pr (A2 \ T^AjT^ and Pr (A% | ToAtTj) have not been presented here but are of the same general form as those given in Eq. 91. With this technique, estimates of the parameters are as follows: TV =4.23 c'=1.00 (92) h = 0.289 c = 0.357. The predictions corresponding to these parameter values are presented in Table 8. When we note that only four of the possible 36 degrees of freedom represented in Table 8 have been utilized in estimating parameters, the close correspondence between theoretical and observed quantities may be interpreted as giving considerable support to the assumptions of the model. A great deal of research needs to be done to explore the consequences of this approach to signal detection. In terms of the experimental prob- lem considered in this section, much progress can be made via differential tests among alternative formulations of the model. For example, we postulated a multi-element pattern model to describe the learning process associated with background stimuli; it would be important to determine whether other formulations of the learning process such as those developed in Sec. 4 or those proposed by Bush and Mosteller (1955) would provide as good or even better theoretical fits than the ones displayed in Tables 7 and 8. Also, it would be valuable to examine variations in the scheme for sampling sensory elements along lines developed by Luce (1959, 1963) and Restle (1961). More generally, further development of the theory is required before we can attempt to deal with the wide range of empirical phenomena en- compassed in the approach to perception via decision theory proposed by Swets, Tanner, and Birdsall (1961) and others. Some theoretical work has been done by Atkinson (1963b) along the lines outlined in this section to account for the ROC (receiver-operating-characteristic) curves that are typically observed in detection studies and to specify the relation between forced-choice and yes-no experiments. 
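As a check on the asymptotic part of the detection model, the predictions in Table 7 can be recomputed from the estimates h = 0.289 and ψ = c′/c = 2.8. The sketch below is illustrative rather than part of the original analysis; it assumes the reading of a and b given in the derivation of Eq. 89 (each divided through by c, so that only ψ enters), and small discrepancies from the tabled entries reflect the rounding of the two estimates.

```python
def asymptotic_predictions(xi1, xi2, xi0, h=0.289, psi=2.8):
    """Asymptotic Pr(A1|T1), Pr(A2|T2), Pr(A1|T0) for the detection model."""
    # a and b scaled by c so that only psi = c'/c appears
    a = xi1 * h * psi + (1 - h) / 2 + xi0 * h / 2
    b = xi2 * h * psi + (1 - h) / 2 + xi0 * h / 2
    p_inf = a / (a + b)                       # Eq. 89
    return (h + (1 - h) * p_inf,              # Eq. 90a: Pr(A1 | T1)
            h + (1 - h) * (1 - p_inf),        # Eq. 90b: Pr(A2 | T2)
            p_inf)                            # Eq. 90c: Pr(A1 | T0)

print([round(x, 3) for x in asymptotic_predictions(0.4, 0.4, 0.2)])  # Group I schedule
print([round(x, 3) for x in asymptotic_predictions(0.2, 0.6, 0.2)])  # Group II schedule
```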
However, this work is still quite tentative, and an evaluation of the approach will require extensive analyses of the detailed sequential properties of psychophysical data. DISCRIMINATION LEARNING 257 5.5 Multiple-Process Models Analyses of certain behavioral situations have proved to require formulations in terms of two or more distinguishable, though possibly interdependent, learning processes that proceed simultaneously. For some situations these separate processes may be directly observable; for other situations we may find it advantageous to postulate processes that are unobservable but that determine in some well-defined fashion the sequence of observable behaviors. For example, in Restle's (1955) treat- ment of discrimination learning it is assumed that irrelevant stimuli may become "adapted" over a period of time and thus be rendered nonfunc- tional. Such an analysis entails a two-process system. One process has to do with the conditioning of stimuli to responses, whereas the other prescribes both the conditions under which cues become irrelevant and the rate at which adaptation occurs. Another application of multiple-process models arises with regard to discrimination problems in which either a covert or a directly observable orienting response is required. One process might describe how the stimuli presented to the subject become conditioned to discriminative responses. Another might specify the acquisition and extinction of various orienting responses ; these orienting responses would determine the specific subset of the environment that the subject would perceive on a given trial. For models dealing with this type of problem, see Atkinson (1958), Bush & Mosteller (1951b), Bower (1959), and Wyckoff (1952). As another example, consider a two-process scheme developed by Atkinson (1960) to account for certain types of discrimination behavior. This model makes use of the distinction, developed in Sees. 2 and 3, between component models and pattern models and suggests that the subject may (at any instant in time) perceive the stimulus situation either as a unit pattern or as a collection of individual components. Thus two perceptual states are defined: one in which the subject responds to the pattern of stimulation and one in which he responds to the separate components of the situation. Two learning processes are also defined. One process specifies how the patterns and components become conditioned to responses, and the second process describes the conditions under which the subject shifts from one perceptual state to another. The control of the second process is governed by the reinforcing schedule, the subject's sequence of responses, and by similarity of the discriminanda. In this model neither the conditioning states nor the perceptual states are observ- able; nevertheless, the behavior of the subject is rigorously defined in terms of these hypothetical states. 2$8 STIMULUS SAMPLING THEORY Models of the sort described are generally difficult to work with mathe- matically and consequently have had only limited development and analysis. It is for this reason that we select a particularly simple example to illustrate the type of formulation that is possible. The example deals with a discrimination-learning task investigated by Atkinson (1961) in which observing responses are categorized and directly measured. The experimental situation consists of a sequence of discrete trials. Each trial is specified in terms of the following classifications: Tl9 T2: Trial type. 
Each trial is either a T± or a T2. The trial type is set by the experimenter and determines in part the stimulus event occurring on the trial. Rl3 jR2: Observing responses. On each trial the subject makes either an R! or R2. The particular observing response determines in part the stimulus event for that trial. SiiSfrS^: Stimulus events. Following the observing response, one and only one of these stimulus events (discriminative cues) occurs. On a TV-trial either s± or j& can occur; on a J^-trial either s2 or sb can occur.15 Afr Az: Discriminative responses. On each trial the subject makes either an A*? or ^-response to the presentation of a stimulus event. O1? 02: Trial outcome. Each trial is terminated with the occurrence of one of these events. An O^ indicates that A± was the correct response for that trial and 02 indicates that A2 was correct. The sequence of events on a trial is as follows: (1) The ready signal occurs and the subject responds with Rl or R2. (2) Following the observing response, sl9 s2, or 5& is presented. (3) To the onset of the stimulus event the subject responds with either AI or A& (4) The trial terminates with either an Ox- or 02-event. To keep the analysis simple, we consider an experimenter-controlled reinforcement schedule. On a TV-trial either an Ox occurs with probability -TTi or an Oz with probability 1 — n^ on a Ja-trial an Ol occurs with probability 7r2 or an O2 with probability 1 — 772. The TV-trial occurs with probability ft and T2 with probability 1 — /3. Thus a ^-Ox-combina- tion occurs with probability (hrl9 a 7i-0a, with probability £(1 — 77-3), and so on. The particular stimulus event si (i = 1, 2, 4) that the experimenter 15 The subscript b has been used to denote the stimulus event that may occur on both TX- and TVtrials; the subscripts 1 and 2 denote stimulus events unique to Tx- and T2-trials, respectively. DISCRIMINATION LEARNING presents on any trial depends on the trial type (T± or J2) and the subject's observing response (RI or R»). 1. If an R1 is made, then (a) with probability y. the ,srevent occurs on a retrial and the ja- event on a r2-trial; (b) with probability 1 — a. the s6-event occurs, regardless of the trial type. 2. If an R% is made, then (a) with probability a the s&-event occurs, regardless of the trial type ; (b) with probability 1 — a the ^-event occurs on a retrial and s% on a r2-trial. To clarify this procedure, consider the case in which a = 1, ^ = 1, and 772 = 0. If the subject is to be correct on every trial, he must make an AI on a TVtrial and an A2 on a T^-trial. However, the subject can ascertain the trial type only by making the appropriate observing response ; that is, Rl must be made in order to identify the trial type, for the occurrence of R2 always leads to the presentation of sb, regardless of the trial type. Hence for perfect responding the subject must make Rt with probability 1 and then make A^ to s± or Az to % The purpose of the Atkinson study was to determine how variations in wl9 772, and a would affect both the observing responses and the discriminative responses. Our analysis of this experimental procedure is based on the axioms presented in Sees. 1 and 2. However, in order to apply the theory, we must first identify the stimulus and reinforcing events in terms of the experi- mental operations. The identification we offer seems quite natural to us and is in accord with the formulations given in Sees. 1 and 2. We assume that associated with the ready signal is a set SR of pattern elements. 
Each element in SR is conditioned to the R1- or the R2-observing response; there are N' such elements. At the start of each trial (i.e., with the onset of the ready signal) an element is sampled from SR, and the subject makes the response to which the element is conditioned.

Associated with each stimulus event si (i = 1, 2, b) is a set, Si, of pattern elements; elements in Si are conditioned to the A1- or the A2-discriminative response. There are N such elements in each set Si, and for simplicity we assume that the sets are pairwise disjoint. When the stimulus event si occurs, one element is randomly sampled from Si, and the subject makes the discriminative response to which the element is conditioned.

Thus we have two types of learning processes: one defined on the set SR and the other defined on the sets S1, Sb, and S2. Once the reinforcing events have been specified for these processes, we can apply our axioms. The interpretation of reinforcement for the discriminative-response process is identical to that given in Sec. 2. If a pattern element is sampled from set Si for i = 1, 2, b and is followed by an Oj outcome, then with probability c the element becomes conditioned to Aj and with probability 1 - c the conditioning state of the sampled element remains unchanged.

The conditioning process for the SR set is somewhat more complex in that the reinforcing events for the observing responses are assumed to be subject-controlled. Specifically, if an element conditioned to Ri is sampled from SR and followed by either an A1O1- or an A2O2-event, then the element will remain conditioned to Ri; however, if A1O2 or A2O1 occurs, then with probability c' the element will become conditioned to the other observing response. Otherwise stated, if an element from SR elicits an observing response that selects a stimulus event and, in turn, the stimulus event elicits a correct discriminative response (i.e., A1O1 or A2O2), then the sampled element will remain conditioned to that observing response. However, if the observing response selects a stimulus event that gives rise to an incorrect discriminative response (i.e., A1O2 or A2O1), then there will be a decrement in the tendency to repeat that observing response on the next trial.

Given the foregoing identification of events, we can now generate a mathematical model for the experiment. To simplify the analysis, we let N' = N = 1; namely, we assume that there is one element in each of our stimulus sets and consequently the single element is sampled with probability 1 whenever the set is available. With this restriction we may describe the conditioning state of a subject at the start of each trial by an ordered four-tuple (ijkl):

1. The first member i is 1 or 2 and indicates whether the single element of SR is conditioned to R1 or R2.
2. The second member j is 1 or 2 and indicates whether the single element of S1 is conditioned to A1 or A2.
3. The third member k is 1 or 2 and indicates whether the element of Sb is conditioned to A1 or A2.
4. The fourth member l is 1 or 2 and indicates whether the element of S2 is conditioned to A1 or A2.

Thus, if the subject is in state (ijkl), he will make the Ri observing response; then, to s1, sb, or s2 he will make discriminative response Aj, Ak, or Al, respectively. From our assumptions it follows that the sequence of random variables that take the subject states (ijkl) as values is a 16-state Markov chain.
Fig. 10. Branching process, starting in state (1122), for a single trial in the two-process discrimination-learning model.

Figure 10 displays the possible transitions that can occur when the subject is in state (1122) on trial n. To clarify this tree, let us trace out the top branch. An R1 is elicited with probability 1, and with probability βπ1 a T1-trial with an O1-outcome will occur; further, given an R1-response on a T1-trial, there is probability α that s1 will be presented. The subject then makes the A1-response to which the element of S1 is conditioned; the response is correct, no conditioning changes occur, and the subject remains in state (1122). Tracing the remaining branches in the same way yields the complete set of transition probabilities out of (1122); for example, the probability of going from (1122) to (2112) is simply βπ1(1 - α)cc' + (1 - β)π2(1 - α)cc', that is, the sum over branches 2 and 15.

An inspection of the transition matrix yields important results. For example, if α = 1, π1 = 1, and π2 = 0, then states (1112) and (1122) are absorbing; hence in the limit Pr(R1,n) = 1, Pr(A1,n | T1,n) = 1, and Pr(A2,n | T2,n) = 1.

As before, let u^(n)_ijkl denote the probability of being in state (ijkl) on trial n; when the limit exists, let u_ijkl = lim(n -> infinity) u^(n)_ijkl. Experimentally, we are interested in evaluating the following theoretical predictions:

Pr(R1,n) = u^(n)_1111 + u^(n)_1112 + u^(n)_1121 + u^(n)_1122 + u^(n)_1211 + u^(n)_1212 + u^(n)_1221 + u^(n)_1222,   (93a)

Pr(A1,n | T1,n) = u^(n)_1111 + u^(n)_1112 + u^(n)_2111 + u^(n)_2112 + α[u^(n)_1121 + u^(n)_1122 + u^(n)_2211 + u^(n)_2212] + (1 - α)[u^(n)_1211 + u^(n)_1212 + u^(n)_2121 + u^(n)_2122],   (93b)

Pr(A1,n | T2,n) = u^(n)_1111 + u^(n)_1211 + u^(n)_2111 + u^(n)_2211 + α[u^(n)_1121 + u^(n)_1221 + u^(n)_2112 + u^(n)_2212] + (1 - α)[u^(n)_1112 + u^(n)_1212 + u^(n)_2121 + u^(n)_2221],   (93c)

Pr(R1,n ∩ A1,n) = (1 - α)[u^(n)_1111 + u^(n)_1112 + u^(n)_1211 + u^(n)_1212] + αβ[u^(n)_1111 + u^(n)_1112 + u^(n)_1121 + u^(n)_1122] + α(1 - β)[u^(n)_1111 + u^(n)_1121 + u^(n)_1211 + u^(n)_1221],   (93d)

Pr(R2,n ∩ A1,n) = α[u^(n)_2111 + u^(n)_2112 + u^(n)_2211 + u^(n)_2212] + (1 - α)β[u^(n)_2111 + u^(n)_2112 + u^(n)_2121 + u^(n)_2122] + (1 - α)(1 - β)[u^(n)_2111 + u^(n)_2121 + u^(n)_2211 + u^(n)_2221].   (93e)

The first equation gives the probability of an R1-response. The second and third equations give the probability of an A1-response on T1- and T2-trials, respectively. Finally, the last two equations present the probability of the joint occurrence of each observing response with an A1-response.

In the experiment reported by Atkinson (1961) six groups with 40 subjects in each group were run. For all groups π1 = 0.9 and β = 0.5. The groups differed with respect to the values of α and π2. For Groups I to III the value of α was 1; for Groups IV to VI, α = 0.75. For Groups I and IV, π2 = 0.9; for II and V, π2 = 0.5; and for Groups III and VI, π2 = 0.1. The design can be described by the following array:

              π2 = 0.9    π2 = 0.5    π2 = 0.1
  α = 1.00        I           II          III
  α = 0.75       IV            V           VI

Given these values of π1, π2, α, and β, the 16-state Markov chain is irreducible and aperiodic. Thus lim u^(n)_ijkl = u_ijkl exists and can be obtained by solving the appropriate set of 16 linear equations (see Eq. 16).

Table 9  Predicted and Observed Asymptotic Response Probabilities in Observing Response Experiment

                        Group I                 Group II                Group III
                   Pred.  Obs.    SD       Pred.  Obs.    SD       Pred.  Obs.    SD
  Pr(A1 | T1)      0.90   0.94   0.014     0.81   0.85   0.164     0.79   0.79   0.158
  Pr(A1 | T2)      0.90   0.94   0.014     0.59   0.61   0.134     0.21   0.23   0.182
  Pr(R1)           0.50   0.45   0.279     0.55   0.59   0.279     0.73   0.70   0.285
  Pr(R1 ∩ A1)      0.45   0.43   0.266     0.39   0.42   0.226     0.37   0.36   0.164
  Pr(R2 ∩ A1)      0.45   0.47   0.293     0.31   0.31   0.232     0.13   0.16   0.161

                        Group IV                Group V                 Group VI
                   Pred.  Obs.    SD       Pred.  Obs.    SD       Pred.  Obs.    SD
  Pr(A1 | T1)      0.90   0.93   0.063     0.80   0.82   0.114     0.73   0.73   0.138
  Pr(A1 | T2)      0.90   0.95   0.014     0.60   0.68   0.114     0.27   0.25   0.138
  Pr(R1)           0.49   0.50   0.257     0.52   0.53   0.305     0.63   0.72   0.263
  Pr(R1 ∩ A1)      0.44   0.47   0.241     0.35   0.38   0.219     0.32   0.36   0.138
  Pr(R2 ∩ A1)      0.46   0.47   0.247     0.34   0.36   0.272     0.19   0.13   0.168

The values predicted by the model are given in Table 9 for the case in which c = c'. Values for the u_ijkl's were computed and then combined by Eq. 93 to predict the response probabilities. By presenting a single value for each theoretical quantity in the table we imply that these predictions are independent of c and c'. Actually, this is not always the case. However, for the schedules employed in this experiment the dependency of these asymptotic predictions on c and c' is virtually negligible.
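The limiting probabilities u_ijkl, and hence the predicted entries of Table 9, lend themselves to direct numerical computation. The sketch below (Python; the function and parameter names are ours, and the value c = c' = 0.2 is an arbitrary illustration, since the asymptotic predictions scarcely depend on it) enumerates the branching process of Fig. 10 for every starting state to build the one-trial transition matrix, solves the 16 linear equations for the limiting distribution, and then combines the u_ijkl as in Eqs. 93a to 93c.

    # Sketch of the two-process observing-response model with N' = N = 1.
    # States are four-tuples (i, j, k, l); all names are illustrative.
    import itertools
    import numpy as np

    def transition_matrix(pi1, pi2, alpha, beta, c, cp):
        """One-trial transition matrix over the 16 conditioning states (i, j, k, l)."""
        states = list(itertools.product((1, 2), repeat=4))
        index = {s: n for n, s in enumerate(states)}
        P = np.zeros((16, 16))
        for (i, j, k, l) in states:
            for T, pT in ((1, beta), (2, 1.0 - beta)):                  # trial type
                p_o1 = pi1 if T == 1 else pi2
                for O, pO in ((1, p_o1), (2, 1.0 - p_o1)):              # trial outcome
                    if i == 1:                                          # stimulus event given R_i
                        events = [("s1" if T == 1 else "s2", alpha), ("sb", 1.0 - alpha)]
                    else:
                        events = [("sb", alpha), ("s1" if T == 1 else "s2", 1.0 - alpha)]
                    for ev, pE in events:
                        A = {"s1": j, "sb": k, "s2": l}[ev]             # discriminative response made
                        for recondition, pC in ((True, c), (False, 1.0 - c)):
                            jj, kk, ll = j, k, l                        # sampled element -> A_O w.p. c
                            if recondition:
                                if ev == "s1": jj = O
                                elif ev == "sb": kk = O
                                else: ll = O
                            if A == O:                                  # correct: observing element unchanged
                                obs = [(i, 1.0)]
                            else:                                       # error: switch w.p. c'
                                obs = [(3 - i, cp), (i, 1.0 - cp)]
                            for ii, pR in obs:
                                P[index[(i, j, k, l)], index[(ii, jj, kk, ll)]] += pT * pO * pE * pC * pR
        return states, P

    def asymptotic_predictions(pi1, pi2, alpha, beta, c=0.2, cp=0.2):
        states, P = transition_matrix(pi1, pi2, alpha, beta, c, cp)
        n = len(states)
        M = np.vstack([P.T - np.eye(n), np.ones(n)])                    # u P = u, sum(u) = 1
        u_vec = np.linalg.lstsq(M, np.append(np.zeros(n), 1.0), rcond=None)[0]
        u = dict(zip(states, u_vec))
        pr_R1 = sum(p for (i, j, k, l), p in u.items() if i == 1)                   # Eq. 93a
        pr_A1_T1 = sum(p * ((alpha * (j == 1) + (1 - alpha) * (k == 1)) if i == 1   # Eq. 93b
                            else (alpha * (k == 1) + (1 - alpha) * (j == 1)))
                       for (i, j, k, l), p in u.items())
        pr_A1_T2 = sum(p * ((alpha * (l == 1) + (1 - alpha) * (k == 1)) if i == 1   # Eq. 93c
                            else (alpha * (k == 1) + (1 - alpha) * (l == 1)))
                       for (i, j, k, l), p in u.items())
        return pr_R1, pr_A1_T1, pr_A1_T2

    # Group VI of the design: pi1 = 0.9, pi2 = 0.1, alpha = 0.75, beta = 0.5
    print(asymptotic_predictions(0.9, 0.1, 0.75, 0.5))

The joint probabilities of Eqs. 93d and 93e can be obtained in the same way by weighting each state's contribution with the trial-type probabilities β and 1 - β.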
For c = c', ranging over the interval from 0.0001 to LO, the predicted values given in 264 STIMULUS SAMPLING THEORY Table 9 are affected in only the third or fourth decimal place; it is for this reason that we present theoretical values to only two decimal places. In view of these comments it should be clear that the predictions in Table 9 are based solely on the experimental parameter values. Conse- quently, differences between subjects (that may be represented by inter- subject variability in c and cr) do not substantially affect these predictions. In the Atkinson study 400 trials were run and the response proportions appear to have reached a fairly stable level over the second half of the experiment. Consequently, the proportions computed over the final block of 160 trials were used as estimates of asymptotic quantities. Table 9 presents the mean and standard deviation of the 40 observed proportions obtained under each experimental condition. Despite the fact that these gross asymptotic predictions hold up quite well, it is obvious that some of the predictions from the model will not be confirmed. The difficulty with the one-element assumption is that the fundamental theory laid down by the axioms of Sec. 2 is completely deter- ministic in many respects. For example, when N' = 1, we have namely, if an R^ occurs on trial n and is reinforced (i.e., followed by an ^Oi-event), then R± will recur with probability 1 on trial n + 1. This prediction is, of course, a consequence of the assumption that we have but one element in set SR which necessarily is sampled on every trial. If we assume more than one element, the deterministic features of the model no longer hold, and such sequential statistics become functions of c, c', N, and N'. Unfortunately, for elaborate experimental procedures of the sort described in this section the multi-element case leads to com- plicated mathematical processes for which it is extremely difficult to carry out computations. Thus the generality of the multi-element assumption may often be offset by the difficulty involved in making predictions. Naturally, it is usually preferable to choose from the available models the one that best fits the data, but in the present state of psychological knowledge no single model is clearly superior to all others in every facet of analysis. The one-element assumption, despite some of its erroneous features, may prove to be a valuable instrument for the rapid exploration of a wide variety of complex phenomena. For most of the cases we have examined the predicted mean response probabilities are usually independ- ent of (or only slightly dependent on) the number of elements assumed. Thus the one-element assumption may be viewed as a simple device for computing the grosser predictions of the general theory. For exploratory work in complex situations, then, we recommend using the one-element model because of the greater difficulty of computations REFERENCES 265 for the multi-element models. In advocating this approach, we are taking a methodological position with which some scientists do not agree. Our position is in contrast to one that asserts that a model should be discarded once it is clear that certain of its predictions are in error. We do not take it to be the principal goal (or even, in many cases, an important goal) of theory construction to provide models for particular experimental situa- tions. 
The assumptions of stimulus sampling theory are intended to describe processes or relationships that are common to a wide variety of learning situations but with no implication that behavior in these situa- tions is a function solely of the variables represented in the theory. As we have attempted to illustrate by means of numerous examples, the for- mulation of a model within this framework for a particular experiment is a matter of selecting the relevant assumptions, or axioms, of the general theory and interpreting them in terms of the conditions of the experiment. How much of the variance in a set of data can be accounted for by a model depends jointly on the adequacy of the theoretical assumptions and on the extent to which it has been possible to realize experimentally the boundary conditions envisaged in the theory, thereby minimizing the effects of variables not represented. In our view a model, in application to a given experiment, is not to be classified as "correct" or "incorrect"; rather, the degree to which it accounts for the data may provide evidence tending either to support or to cast doubt on the theory from which it was derived. References Atkinson, R. C. A stochastic model for rote serial learning. Psychometrika, 1957, 22, 87-96. Atkinson, R. C. A Markov model for discrimination learning. Psychometrika, 1958, 23, 308-322. Atkinson, R. C. A theory of stimulus discrimination learning. In K. J. Arrow, S. Karlin, & P. Suppes (Eds.), Mathematical methods in the social sciences. Stanford: Stanford Univer. Press, 1960. Pp. 221-241. Atkinson, R. C. The observing response in discrimination learning. J. exp. PsychoL, 1961,62,253-262. Atkinson, R. C. Choice behavior and monetary payoffs. In J. Criswell, H. Solomon, & P. Suppes (Eds.), Mathematical methods in small group processes. Stanford : Stanford Univer. Press, 1962. Pp. 23-34. Atkinson, R. C. Mathematical models in research on perception and learning. In M. Marx (Ed.), Psychological Theory. (2nd ed.) New York: Macmillan, 1963, in press, (a) Atkinson, R. C. A variable sensitivity theory of signal detection. PsychoL Rev., 1963, 70, 91-106. (b) Atkinson, R. C., & Suppes, P. An analysis of two-person game situations in terms of statistical learning theory. /. exp. PsychoL, 1958, 55, 369-378. 2QS STIMULUS SAMPLING THEORY Biilingsley, P. Statistical inference for Markov processes. Chicago: Univer. of Chicago Press, 1961. Bower, G. H. Choice-point behavior. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 109-124. Bower, G. H. Application of a model to paired-associate learning. Psychometrika, 1961,26,255-280. Bower, G. H. A model for response and training variables in paired-associate learning. Psychol Rev., 1962, 69, 34-53. Burke, C. J. Applications of a linear model to two-person interactions. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 180-203. Burke, C. J. Some two-person interactions. In K. J. Arrow, S. Karlin, & P. Suppes (Eds.), Mathematical methods in the social sciences. Stanford: Stanford Univer. Press, 1960. Pp. 242-253. Burke, C. J., & Estes, W. K. A component model for stimulus variables in discrim- ination learning. Psychometrika, 1957, 22, 133-145. Bush, R. R. A survey of mathematical learning theory. In R. D. Luce (Ed.), Develop- ments in mathematical psychology. Glencoe, Illinois: The Free Press, 1960. Pp.123- 165. Bush, R. R., & Estes, W. K. 
(Eds.), Studies in mathematical learning theory. Stanford : Stanford Univer. Press, 1959. Bush, R. R., & Mosteller, F. A mathematical model for simple learning. Psychol. Rev., 1951, 58, 313-323. (a) Bush, R. R., & Mosteller, F. A model for stimulus generalization and discrimination. Psychol. Rev., 1951, 58, 413-423. (b) Bush, R. R., & Mosteller, F. Stochastic models for learning. New York: Wiley, 1955. Bush, R. R., & Sternberg, S. A single-operator model. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory, Stanford: Stanford Univer. Press, 1959. Pp. 204-214. Carterette, Teresa S. An application of stimulus sampling theory to summated generalization. /. exp. Psychol., 1961, 62, 448-455. Crothers, E. J. All-or-none paired associate learning with unit and compound responses. Unpublished doctoral dissertation, Indiana University, 1961. Detambel, M. H. A test of a model for multiple-choice behavior. /. exp. Psychol., 1955, 49, 97-104. Estes, W. K. Toward a statistical theory of learning. Psychol. Rev., 1950, 57, 94-107. Estes, W. K. Statistical theory of spontaneous recovery and regression. Psychol. Rev., 1955,62,145-154. (a) Estes, W. K. Statistical theory of distributional phenomena in learning. Psychol. Rev., 1955,62,369-377. (b) Estes, W. K. Of models and men. Amer. Psychol., 1957, 12, 609-617. (a) Estes, W. K. Theory of learning with constant, variable, or contingent probabilities of reinforcement. Psychometrika, 1957, 22, 113-132. (b) Estes, W. K. Stimulus-response theory of drive. In M. R. Jones (Ed.), Nebraska symposium on motivation. Vol. 6. Lincoln, Nebraska: Univer. Nebraska Press, 1958. Estes, W. K. The statistical approach to learning theory. In S. Koch (Ed.), Psychology: a study of a science. Vol.2. New York: McGraw-Hill, 1959. Pp. 380-491. (a) Estes, W. K. Component and pattern models with Markovian interpretations. In R. R. Bush & W. K. Estes (Eds,), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 9-52. (b) REFERENCES 267 Estes, W. K. Learning theory and the new mental chemistry. Psychol, Rev., 1960, 67, 207-223. (a) Estes, W. K. A random-walk model for choice behavior. In K. J. Arrow, S. Karlin, & P. Suppes (Eds.), Mathematical methods in the social sciences, Stanford: Stanford Univer. Press, 1960. Pp. 265-276. (b) Estes, W. K. Growth and function of mathematical models for learning. In Current trends in psychological theory. Pittsburgh: Univer. of Pittsburgh Press, 1961. Pp. 134-151. (a) Estes, W. K. New developments in statistical behavior theory: differential tests of axioms for associative learning. Psychometrika, 1961, 26, 73-84. (b) Estes, W. K. Learning theory. Ann. Rev. Psychol, 1962, 13, 107-144. Estes, W. K., & Burke, C. J. A theory of stimulus variability in learning. Psychol. Rev., 1953, 60, 276-286. Estes, W. K., Burke, C. J., Atkinson, R. C., & Frankmann, Judith P. Probabilistic discrimination learning. /. exp. Psychol., 1957, 54, 233-239. Estes, W. K., & Hopkins, B. L. Acquisition and transfer in pattern -vs.- component discrimination learning. /. exp. Psychol., 1961, 61, 322-328. Estes, W. K., Hopkins, B. L., & Crothers, E. J. All-or-none and conservation effects in the learning and retention of paired associates. J. exp. Psychol., 1960, 60, 329-339. Estes, W. K., & Straughan, J. H. Analysis of a verbal conditioning situation in terms of statistical learning theory. J. exp. Psychol., 1954, 47, 225-234. Estes, W. K., & Suppes, P. Foundations of linear models. In R. R. Bush & W. K. 
Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 137-179. (a) Estes, W. K., & Suppes, P. Foundations of statistical learning theory, II. The stimulus sampling model for simple learning. Tech. Rept. No. 26, Psychology Series, Institute for Mathematical Studies in the Social Sciences, Stanford Univer., 1959. (b) Feller, W. An introduction to probability theory and its applications. (2nd ed.) New York: Wiley, 1957. Friedman, M. P., Burke, C. J., Cole, M., Estes, W. K., & Millward, R. B. Extended training in a noncontingent two-choice situation with shifting reinforcement probabilities. Paper given at the First Meetings of the Psychonomic Society, Chicago, Illinois, 1960. Gardner, R. A. Probability-learning with two and three choices. Amer. J. Psychol, 1957, 70, 174-185. Guttman, N., & Kalish, H. L Discriminability and stimulus generalization. J. exp. Psychol, 1956, 51, 79-88. Goldberg, S. Introduction to difference equations. New York: Wiley, 1958. Hull, C. L. Principles of behavior: an introduction to behavior theory. New York : Appleton-Century-Crofts, 1943. Jarvik, M. E. Probability learning and a negative recency effect in the serial antici- pation of alternating symbols. /. exp. Psychol, 1951, 41, 291-297. Jordan, C. Calculus of finite differences. New York: Chelsea, 1950. Kemeny, J. G., & Snell, J. L. Markov processes in learning theory. Psychometrika, 1957, 22, 221-230. Kemeny, J. G., & Snell, J. L. Finite Markov chains. Princeton, N. J.: Van Nostrand, 1959. Kemeny, J. G., Snell, J. L., & Thompson, G. L. Introduction to finite mathematics. New York: Prentice Hall, 1957. Kinchla, R. A. Learned factors in visual discrimination. Unpublished doctoral disserta- tion, Univer. of California, Los Angeles, 1962. 268 STIMULUS SAMPLING THEORY Lamperti, J., & Suppes, P. Chains of infinite order and their applications to learning theory. Pacific J. Math., 1959, 9, 739-754. Luce, R. D. Individual choice behavior: a theoretical analysis. New York: Wiley, 1959. Luce, R. D. A threshold theory for simple detection experiments. Psychol Rev., 1963, 70, 61-79. Luce, R. D., & Raiffa, H. Games and decisions. New York: Wiley, 1957. Nicks, D. C. Prediction of sequential two-choice decisions from event runs. /. exp. Psychol, 1959, 57, 105-114. Peterson, L. R., Saltzman, Dorothy, Hillner, K., & Land, Vera. Recency and frequency in paired-associate learning. /. exp. Psychol., 1962, 63, 396-403. Popper, Juliet. Mediated generalization. In R. R. Bush & W. K. Estes, (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univer. Press, 1959. Pp. 94-108. Popper, Juliet, & Atkinson, R. C. Discrimination learning in a verbal conditioning situation. /. exp. Psychol., 1958, 56, 21-26. Restle, F. A theory of discrimination learning. Psychol. Rev., 1955, 62, 11-19. Restle, F. Psychology of judgment and choice. New York: Wiley, 1961. Solomon, R. L., & Wynne, L. C. Traumatic avoidance learning: acquisition in normal dogs. Psychol Monogr., 1953, 67, No. 4. Spence, K. W. The nature of discrimination learning in animals. Psychol Rev., 1936, 43, 427-449. Stevens, S. S. On the psychophysical law. Psychol Rev., 1957, 64, 153-181. Suppes, P., & Atkinson, R. C. Markov learning models for multiperson interactions. Stanford: Stanford Univer. Press, 1960. Suppes, P., & Ginsberg, Rose. Application of a stimulus sampling model to children's concept formation of binary numbers with and without an overt correction response. /. exp. Psychol, 1962, 63, 330-336. 
Suppes, P., & Ginsberg, Rose. A fundamental property of all-or-none models. Psychol Rev., 1963, 70, 139-161. Swets, J. A.» Tanner, W. P., Jr., & Birdsall, T. G. Decision processes in perception. Psychol Rev., 1961, 68, 301-340. Tanner, W. P., Jr., & Swets, J. A. A decision-making theory of visual detection. Psychol Rev., 1954, 61, 401-409. Theios, J. Simple conditioning as two-stage all-or-none learning. Psychol Rev., 1963, in press. Wyckoff, L. B., Jr. The role of observing responses in discrimination behavior. Psychol Rev., 1952, 59, 431-442. 1 1 Introduction to the Formal Analysis of Natural Languages^ Noam Chomsky Massachusetts Institute of Technology George A. Miller Harvard University 1. The preparation of this Chapter was supported in part by the Army Signal Corps, the Air Force Office of Scientific Research, and the Office of Naval Research, and in part by the National Science Foundation (Grants No. NSF G-16486 and No. NSF G-l 3903}. 269 Contents 1 . Limiting the Scope of the Discussion 272 2. Some Algebraic Aspects of Coding 277 3. Some Basic Concepts of Linguistics 283 4. A Simple Class of Generative Grammars 292 5. Transformational Grammars 296 5.1. Some shortcomings of constituent-structure grammars, 297 5.2. The specification of grammatical transformations, 300 5.3. The constituent structure of transformed strings, 303 6. Sound Structure 306 6.L The role of the phonological component, 306 6.2. Phones and phonemes, 308 6.3. Invariance and linearity conditions, 310 6 A Some phonological rules, 313 References 319 2JO Introduction to the Formal Analysis of Natural Languages Language and communication play a special and important role in human affairs; they have been pondered and discussed by every variety of scholar and scientist. The psychologist's contribution has been but a small part of the total effort. In order to give a balanced picture of the larger problem that a mathematical psychologist faces when he turns his attention toward verbal behavior, therefore, this chapter and the next two must go well beyond the traditional bounds of psychology. The fundamental fact that must be faced in any investigation of language and linguistic behavior is the following : a native speaker of a language has the ability to comprehend an immense number of sentences that he has never previously heard and to produce, on the appropriate occasion, novel utterances that are similarly understandable to other native speakers. The basic questions that must be asked are the following: 1. What is the precise nature of this ability? 2. How is it put to use? 3. How does it arise in the individual? There have been several attempts to formulate questions of this sort in a precise and explicit form and to construct models that represent certain aspects of these achievements of a native speaker. When simple enough models can be constructed it becomes possible to undertake certain purely abstract studies of their intrinsic character and general properties. Studies of this kind are in their infancy; few aspects of language and communica- tion have been formalized to a point at which such investigations are even thinkable. Nevertheless, there is a growing body of suggestive results. We shall survey some of those results here and try to indicate how such studies can contribute to our understanding of the nature and function of language. The first of our three basic questions concerns the nature of language itself. 
In order to answer it, we must make explicit the underlying structure inherent in all natural languages. The principal attack on this problem has its origins in logic and linguistics; in recent years it has focused on the critically important concept of grammar. The justification for including this work in the present handbook is to make psychologists more realistically 271 2^2 FORMAL ANALYSIS OF NATURAL LANGUAGES aware of what it is a person has accomplished when he has learned to speak and understand a natural language. Associating vocal responses with visual stimuli — a feature that has attracted considerable psycho- logical attention — is but one small aspect of the total language-learning process. Our second question calls for an attempt to give a formal characteriza- tion of, or model for, the users of natural languages. Psychologists, who might have been expected to attack this question as part of their general study of behavior, have as yet provided only the most programmatic (and often implausible) kinds of answers. Some valuable ideas on this topic have originated in the field of communication engineering; their psycho- logical implications were relatively direct and were promptly recognized. However, the engineering concepts have been largely statistical and have made little contact with what is known of the inherent structure of lan- guage. By presenting (1) and (2) as two distinct questions we explicitly reject the common opinion that a language is nothing but a set of verbal re- sponses. To say that a particular rule of grammar applies in some natural language is not to say that the people who use that language are able to follow the rule consistently. To specify the language is one task. To characterize its user is another. The two problems are obviously related but are not identical. Our third question is no less important than the first two, yet far less progress has been made in formulating it in such a way as to support any abstract investigation. What goes on as a child begins to talk is still beyond the scope of our mathematical models. We can only mention the genetic issue and regret its relative neglect in the following pages. 1. LIMITING THE SCOPE OF THE DISCUSSION The mathematical study of language and communication is a large topic. We must limit it sharply for our present purposes. It may help to orient the reader if we enumerate some of the limitations we have imposed in this and in the two chapters that follow. The first limitation we imposed was to restrict our interest generally to the so-called natural languages. There are, of coure, many formal lan- guages developed by logicians and mathematicians; the study of those languages is a major concern of modern logic. In these pages, however, we have tried to limit our attention to the formal study of natural languages and largely ignore the study of formal languages. It is sometimes convenient to use miniature, artificial languages in order to illustrate a LIMITING THE SCOPE OF THE DISCUSSION 2J$ particular property in a simplified context, and the programming languages developed by computer specialists are often of special interest. Never- theless, the central focus here is on natural languages. A further limitation was to eliminate all serious consideration of con- tinuous systems. The acoustic signal produced by a speaker is a continuous function of time and is ordinarily represented as the sum of a Fourier series. 
Fourier representation is especially convenient when we study the effects of continuous linear transformations (filters). Fortunately this important topic has been frequently and thoroughly treated by both mathematicians and communication engineers; its absence here will not be critical. Communication systems can be thought of as discrete because of the existence of what communication engineers have sometimes called a fidelity criterion (Shannon, 1949). A fidelity criterion determines how the set of all signals possible during a finite time interval should be partitioned into subsets of equivalent signals — equivalent for the receiver. A com- munication system may transmit continuous signals precisely, but if the receiver cannot (or will not) pay attention to the fine distinctions that the system is capable of registering the fidelity of the channel is wasted. Thus it is the receiver who establishes a criterion of acceptability for the system. The higher his criterion, the larger the number of distinct subsets of signals the communication system must be able to distinguish and transmit. The receiver we wish to study is, of course, a human listener. The fidelity criterion is set by his capacities, training, and interests. On the basis of his perceptual distinctions, therefore, we can establish a finite set of categories to serve as the discrete symbols. Those sets may be alphabets, syllabaries, or vocabularies; the discrete elements of those sets are the indivisible atoms from which longer messages must be constructed. A listener's perception of those discrete units of course poses an important psychological problem; formal psychophysical aspects of the detection and recognition problem have already been discussed in Chapter 3, and we shall not repeat them here. However, certain considerations that may be unique for speech perception are mentioned briefly in Sec. 6, where we discuss the subject of sound structure, and again in Chapter 13, where we consider how our knowledge of grammar might serve to organize our perception of speech. As we shall see, the precise description of a human listener's fidelity criterion for speech is a complex thing, but for the moment the critical point is that people do partition speech sounds into equivalent subsets, so a discrete notation is justifiable. In the realm of discrete systems, moreover, we limit ourselves to con- catenation systems and to their further algebraic structure and their inter- relations. In particular, we think of the flow of speech as a sequence of 2^4 FORMAL ANALYSIS OF NATURAL LANGUAGES discrete atoms that are immediately juxtaposed, or concatenated, one after the other. Simple as this limitation may sound, it has some implications worth noting. Let L be the set of all finite sequences (including the sequence of zero length) that can be formed from the elements of some arbitrary finite set V. Now, if , %BL and if ^% represents the result of concatenating them in that order to form a new sequence % then ip el,; that is to say, L is closed under the binary operation of concatenation. Furthermore, con- catenation is associative, and the empty sequence plays the role of a unique identity element. A set that includes an identity and is closed under an associative law of composition is called a monoid. Because monoids satisfy three of the four postulates of a group, they are sometimes called semigroups. A group is a monoid all of whose elements have inverses. 
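The monoid structure just described is easy to exhibit concretely. The following minimal Python sketch (ours, purely illustrative) represents finite sequences over a two-element set V as tuples and checks closure, associativity, and the identity role of the empty sequence.

    # The set L of all finite sequences over a finite set V, under concatenation,
    # is closed, associative, and has the empty sequence as identity: a monoid.
    V = ("a", "b")
    e = ()                                   # the empty sequence (identity element)

    def concat(x, y):
        """Concatenate two sequences, represented as tuples of elements of V."""
        return x + y

    x, y, z = ("a",), ("b", "a"), ("a", "a", "b")

    assert all(s in V for s in concat(x, y))                      # closure
    assert concat(concat(x, y), z) == concat(x, concat(y, z))     # associativity
    assert concat(e, x) == x == concat(x, e)                      # identity

Because no nonempty sequence has an inverse under concatenation, L satisfies only three of the four group postulates, which is exactly why it is a monoid (semigroup with identity) and not a group.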
Although we must necessarily construct our spoken utterances by the associative operation of concatenation, the matter must be formulated carefully. Consider, for example, the ambiguous English sentence, They are flying planes, which is really two different sentences: (Id) (Ib) If we think only of spelling or pronunciation, then Example la equals Example Ib and simple concatenation offers no difficulties. But if we think of grammatical structure or meaning, Examples la and Ib are distinctly different in a way that ordinarily is not phonetically or graphi- cally indicated. Linguists generally deal with such problems by assuming that a natural language has several distinct levels. In the present chapters, we think of each level as a separate concatenation system with its own elements and rules. The structure at a lower level is specified by the way in which its elements are related to the next higher level. In order to preserve associa- tivity, therefore, we introduce several concatenation systems and study the relations between them. Consider two different operations that we might perform on a written text The first operation maps a sequence of written characters into a sequence of acoustic signals; let us refer to it as pronunciation. Pro- nunciations of segments of a message are (approximately) segments of the LIMITING THE SCOPE OF THE DISCUSSION 2JJ pronunciation of that message. Thus pronunciation has about it a kind of linearity (cf. Sec, 6.3): pron 0)^pron (y) = pron (x^y). (2) Although Eq. 2 is not true (e.g., it ignores intonation and articulatory transitions between successive segments), it is more nearly true than the corresponding statement for the next operation. This operation maps the sequence of symbols into some representation of its subjective meaning; let us refer to it as comprehension. The mean- ings of segments of a message, however, are seldom identifiable as segments of the meaning of the message. Even if we assume that meanings can somehow be simply concatenated, in most cases we would probably find, under any reasonable interpretation of these notions, that comp (x)^comp (y) < comp (a^), - (3) which is one way, perhaps, to interpret the Gestalt dictum that a meaning- ful whole is greater than the linear sum of its parts. Unless one is a con- firmed associationist in the tradition of James Mill, it is not obvious what the concatenation of two comprehensions would mean or how such an operation could be performed. By comparison, the operations in Eq. 2 seem well defined. To introduce the process of comprehension, however, raises many difficult issues: Recently there have been interesting proposals for studying abstractly certain denotative (Wallace, 1961) and connotative (Osgood, Suci, & Tannenbaum, 1957) aspects of natural lexicons. Important as this subject is for any general theory of psycholinguistics, we shall say little about it in these pages. Nevertheless, our hope is that by clearing away some syntactic problems we shall have helped to clarify the semantic issue if only by indicating some of the things that meaning is not. Finally, as we have already noted, these chapters include little on the process of language learning. Although it is possible to give a formal description of certain aspects of language and although several mathe- matical models of the learning process have been developed, the inter- section of these two theoretical endeavors remains disconcertingly vacant. 
How an untutored child can so quickly attain full mastery of a language poses a challenging problem for learning theorists. With diligence, of course, an intelligent adult can use a traditional grammar and a dictionary to develop some degree of mastery of a new language; but a young child gains perfect mastery with incomparably greater ease and without any explicit instruction. Careful instruction and precise programming of reinforcement contingencies do not seem necessary. Mere exposure for a 27^ FORMAL ANALYSIS OF NATURAL LANGUAGES remarkably short period is apparently all that is required for a normal child to develop the competence of a native speaker. One way to highlight the theoretical questions involved here is to imagine that we had to construct a device capable of duplicating the child's learning (Chomsky, 1962a). It would have to include a device that accepted a sample of grammatical utterances as its input (with some restrictions, perhaps, on their order of presentation) and that would pro- duce a grammar of the language (including the lexicon) as its output. A description of this device would represent a hypothesis about the innate intellectual equipment that a child brings to bear in acquiring a language. Of course, other input data may play an essential role in language learning. For example, corrections by the speech community are probably important. A correction is an indication that a certain linguistic expression is not a sentence. Thus the device may have a set of nonsentences, as well as a set of sentences, as an input. Furthermore, there may be indications that one item is to be considered a repetition of another, and perhaps other hints and helps. What other inputs are necessary is, of course, an important question for empirical investigation. Equally important, however, is to specify the properties of the grammar that our universal language-learning device is supposed to produce as its output. This grammar is intended to represent certain of the abilities of a mature speaker. First, it should indicate how he is able to determine what is a properly formed sentence, and, second, it should provide infor- mation about the arrangements of the units into larger structures. The language-learning device must, for example, come to understand the difference between Examples la and Ib. The characterization of a grammar that will provide an explicit enumera- tion of grammatical sentences, each with its own structural description, is a central concern in the pages that follow. What we seek is a formalized grammar that specifies the correct structural descriptions with a fairly small number of general principles of sentence formation and that is embedded within a theory of linguistic structure that provides a justifica- tion for the choice of this grammar over other alternatives. One task of the professional linguist is, in a sense, to make explicit the process that every normal child performs implicitly. A practical language-learning device would have to incorporate strong assumptions about the class of potential grammars that a natural language can have. Presumably the device would have available an advance specifi- cation of the general form that a grammar might assume and also some procedure to decide whether a particular grammar is better than some alternative grammar on the basis of the sample input. 
Moreover, it would have to have certain phonetic capacities for recognizing and producing sentences, and it would need to have some method, given one of the permitted grammars, to determine the structural description of any arbitrary sentence. All this would have to be built into the device in advance before it could start to learn a language. To imagine that an adequate grammar could be selected from the infinitude of conceivable alternatives by some process of pure induction on a finite corpus of utterances is to misjudge completely the magnitude of the problem.

The learning process, then, would consist in evaluating the various possible grammars in order to find the best one compatible with the input data. The device would seek a grammar that enumerated all the sentences and none of the nonsentences and assigned structural descriptions in such a way that nonrepetitions would differ at appropriate points. Of course, we would have to supply the language-learning device with some sort of heuristic principles that would enable it, given its input data and a range of possible grammars, to make a rapid selection of a few promising alternatives, which could then be submitted to a process of evaluation, or that would enable it to evaluate certain characteristics of the grammar before others. The necessary heuristic procedures could be simplified, however, by providing in advance a narrower specification of the class of potential grammars. The proper division of labor between heuristic methods and specification of form remains to be decided, of course, but too much faith should not be put in the powers of induction, even when aided by intelligent heuristics, to discover the right grammar. After all, stupid people learn to talk, but even the brightest apes do not.

2. SOME ALGEBRAIC ASPECTS OF CODING

Mapping one monoid into another is a pervasive operation in communication systems. We can refer to it rather loosely as coding, including in that term the various processes of encoding, recoding, decoding, and transmitting. In order to make this preliminary discussion definite, we can think of one monoid as consisting of all the strings that can be formed with the characters of a finite alphabet A and the other as consisting of the strings that can be formed by the words in a finite vocabulary V. In this section, therefore, we consider some abstract properties of concatenation systems in general, properties that apply equally to artificial and to natural codes.

A code C is a 1:1 mapping θ of strings in V into strings in A such that if vi, vj are strings in V then θ(vi⌢vj) = θ(vi)⌢θ(vj). θ is an isomorphism between strings in V and a subset of the strings in A; strings in A provide the spellings for strings in V. In the following, if there is no danger of confusion, we can simplify our notation by suppressing the symbol ⌢ for concatenation, thus adopting the normal convention for spelling systems.

Consider a simple example of a code C1. Let A = {0, 1} and V = {v1, . . . , v4}. Define a mapping θ as follows:

    θ(v1) = 1,
    θ(v2) = 011,
    θ(v3) = 010,
    θ(v4) = 00.

This particular mapping can be represented by a tree graph, as in Fig. 1. (For a formal discussion of tree graphs, see, for example, Berge, 1958.) The nodes represent choice points; a path down to the left from a node represents the selection of 1 from A and a path down to the right represents the selection of 0.
Each word has a unique spelling indicated by a unique branch through the coding tree. When the end of a branch is reached and a full word has been spelled, the system returns to the top, ready to spell the next word. In order to decode the message, of course, it is essential to maintain synchronization. For example, the string of words v4v1v4v1v1 is spelled 0010011, but if the first letter of this spelling is lost it will be decoded as v3v2. We use the symbol # at the beginning of a string of letters to indicate that it is known that this is the beginning of the total message; otherwise, a string of periods ... is used.

At any particular point in a string of letters that spells some acceptable message there will be a fixed set of possible continuations that terminate at the end of a word. Moreover, different initial strings may permit exactly the same continuations. In C1, for example, the two messages that begin #000... and #10... can be terminated by ...0#, ...10#, ...11#, or by one of those followed by other words. We say that the relation R holds between any two initial strings that permit the same continuation. We see immediately that R must be reflexive, symmetrical, and transitive and so is an equivalence relation; two initial strings of characters that permit exactly the same set of continuations are referred to as equivalent on the right. In terms of this relation we can define the important concept of a state: the set of all strings equivalent on the right constitutes a state of the coding system.

The state of a coding system constitutes its memory of what has already occurred. Each time a letter is added to the encoded string the system can move to a new state. In C1 there are three states: (1) S0 when a complete word has just been spelled, (2) S1 after #0 . . . , and (3) S2 after #01 . . . . These correspond to the three nonterminal nodes in the tree graph of Fig. 1.

Following Schützenberger (1956), we can summarize the state transitions by matrices, one for each string of letters. Let the rows represent the state after n letters, and let the columns represent the state after n + 1 letters. If a transition is possible, enter 1 in that cell; otherwise, 0. For C1 we require two matrices, one to represent the effect of adding the letter 0, the other to represent the effect of adding the letter 1. (In general, this corresponds to a partition of the coding tree into subgraphs, one for each letter in A.) To each string x associate the matrix M_x with elements m_ij giving the number of paths between states S_i and S_j when the string x occurs in the coded messages. For C1 the matrices associated with the elementary strings 0 and 1 are

            | 0  1  0 |                | 1  0  0 |
    M_0  =  | 1  0  0 |    and   M_1 = | 0  0  1 |
            | 1  0  0 |                | 1  0  0 |

For a longer string the matrix is the ordered product of the matrices for the letters in the string. The product matrix M_x M_y is interpreted in the following fashion: from S_i the system moves to S_j according to the transitions permitted by M_x, then moves from S_j to S_k according to the transitions permitted by M_y. The number of distinct paths from S_i to S_j to S_k is (M_x)_ij (M_y)_jk. The total number of paths from S_i to S_k, summing over all intervening states S_j, is the sum over j of (M_x)_ij (M_y)_jk, the row-by-column product that gives the elements of M_x M_y. In case a particular letter cannot occur in a given state, the row of its matrix corresponding to that state will consist entirely of zeros.
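The following short sketch (Python; ours, for illustration only) writes down the two matrices for C1, forms the product matrix for a longer string, and reads the coding tree of Fig. 1 as a three-state decoding device.

    import numpy as np

    # State-transition matrices for C1 (states S0, S1, S2 as defined above).
    M = {"0": np.array([[0, 1, 0],
                        [1, 0, 0],
                        [1, 0, 0]]),
         "1": np.array([[1, 0, 0],
                        [0, 0, 1],
                        [1, 0, 0]])}

    SPELLING = {"1": "v1", "011": "v2", "010": "v3", "00": "v4"}

    def string_matrix(letters):
        """Ordered product of the letter matrices; entry (i, j) counts paths from S_i to S_j."""
        out = np.eye(3, dtype=int)
        for a in letters:
            out = out @ M[a]
        return out

    def decode(letters):
        """Decode a letter string known to start at a word boundary (#), via the tree of Fig. 1."""
        buffer, words = "", []
        for a in letters:
            buffer += a
            if buffer in SPELLING:            # a branch of the coding tree has been completed
                words.append(SPELLING[buffer])
                buffer = ""
        return words, buffer                  # a nonempty buffer means the string stops mid-word

    print(string_matrix("010011")[0, 0])      # 1: one path from S0 back to S0, the spelling of v3 v2
    print(decode("0010011"))                  # (['v4', 'v1', 'v4', 'v1', 'v1'], '')
    print(decode("010011"))                   # (['v3', 'v2'], ''): the same string minus its first letter

Because no spelling of any word is an initial segment of the spelling of another, the decoder never has to look ahead; this is the left-tree-code property discussed below.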
Any matrix corresponding to a string in A that does not spell any part of a string in V will be a zero matrix. In general, the matrices need not possess inverses; they do not form a group, but they are adequate to provide an isomorphism with the elements of a semigroup. If the function mapping V into A is not bi-unique, entries greater than unity will occur in the matrices or their products, signifying that a single string of 72 letters must be the spelling for more than one string of words. When this occurs, the received message is ambiguous and cannot be understood even though it is received without noise or distortion. There is no way of knowing which of the alternative strings of words was intended. Because there is no phonetic boundary marker between successive words in the normal flow of speech, such ambiguities can easily arise in natural languages. Miller (1958) gives the following example in English: The good candy came anyway. The good can decay many ways. The string of phonetic elements can be pronounced in a way that is relatively neutral, so that the listener has an experience roughly comparable to the visual phenomenon of reversible perspective. Ambiguities of segmentation may be even commoner in French; for example, the follow- ing couplet is well known as an instance of complete rhyme: Gal, amant de la Reine, alia (tour magnanime), Galamment de 1'arene a la Tour Magne, a Nimes. Consideration of examples of this type indicates that there is far more to the production or perception of speech than merely the production or identification of successive phonetic qualities and that different kinds of information processing must go on at several levels of organization. These problems are defined more adequately for natural languages in Sec. 6. Difficulties of segmentation can be avoided, of course. Consideration of how to avoid them leads to a simple classification of the various types of codes (Schiitzenberger, personal communication). General codes include any codes that always give a different spelling for every different string of words. A special subset of the general codes are the tree codes whose spelling rules can be represented graphically, as in Fig. 1, with the spelling of each word terminating at the end of a separate branch. Every tree code must be of one of two types: the left tree codes are those in which no spelling of any word forms an initial segment (left segment) of the spelling of any other word; the right tree codes are, similarly, those SOME ALGEBRAIC ASPECTS OF CODING in which no word forms a terminal segment (right segment) of any other word (right tree codes can be formed by reversing the spelling of all words in a left tree code). A special case is the class of codes that are both left and right tree codes simultaneously; Schtitzenberger has called them anagrammatic codes. The simplest subset of anagrammatic codes are the uniform codes, in which every word is spelled by the same number of letters and scansion into words is achieved at the receiver simply by counting. Uniform codes are frequently used in engineering applications of coding theory, but they have little relevance for the description of natural lan- guages. Another important set of codes are the self-synchronizing codes. In a self-synchronizing code, if noise or error causes some particular word boundary to be misplaced, the error will not perpetuate itself indefinitely; within a finite time the error will be absorbed and the correct synchrony will be reestablished. 
(Cx is an example of a self-synchronizing code; uniform codes are not.) In a self-synchronizing tree code the word bound- aries are always marked by the occurrence of a particular string of letters. If the particular string is thought of as terminating the spelling of every word, we have a left-synchronizing tree code. When the particular string consists of a single letter (which cannot then be used anywhere else), we have what has been called a natural code. In written language the General codes Tree codes Left tree codes Self-synchronizing tree codes Right tree codes Anagrammatic codes Natural codes Uniform codes Fig. 2. Classification of coding systems. 282 FORMAL ANALYSIS OF NATURAL LANGUAGES space between words keeps the receiver synchronized. In spoken lan- guage the process is much more complex; discussion of how strings of words (morphemes) are mapped into strings of phonetic representation is postponed until Sec. 6. In order to be certain that a particular mapping is in fact a code, it is common engineering practice to inspect it to see if it is a left tree code — to see that no speEing of any word forms an initial segment of the spelling of any other word. It is possible, however, to have general codes that are not tree codes. Schiitzenberger (1956) offers the following code C2 as the simplest, nontrivial example: X = {0,1}, K={i?i,...,0B}» and flCoJ = 00 0(Ca) = 001 = 011 vj = 01 Note that 6(v^) is an initial segment of 6(v^), so it is not a left tree code; and 6(v^ is a terminal segment of 6(v^9 so it is not a right tree code. There is an understandable desire, when constructing artificial codes, to keep the coded words as short as possible. In this connection an interesting inequality can easily be shown to hold for tree codes (Kraft, 1949). Suppose we are given a vocabulary V = (t?l9 . . . , vn}, an alphabet A = {al9 . . . , #£>}, and a mapping 6 that constitutes a left tree code. Let c^ be the length of 6(17^, Then 2 D~Ci < 1- (4) t=i This inequality can be established as follows: let ws be the number of coded words of exactly length j. Then, since 6 is a tree code, we have Dividing by Dn gives But if n > cz- for all f, then this summation will be taken over all the coded words, which gives the relation expressed in Eq. 4. The closer Eq. 4 comes to equality for any code, the closer that code will be to minimizing the SOME BASIC CONCEPTS OF LINGUISTICS 28$ average length of its code words. (This inequality has also been shown to hold for nontree codes: see Mandelbrot, 1954; McMillan, 1956.) Each tree code (including, of course, all natural codes) must have some function w} giving the number of words of coded lengthy. The function is of some theoretical interest, since it summarizes in a particularly simple form considerable information about the structure of the coding tree. Finally, it should be remarked that a code can be thought of as a simple kind of automaton (cf. Chapter 12) that accepts symbols in one alphabet and produces symbols in another according to predetermined rules that depend only on the input symbol and the internal state of the device. Some of the subtler difficulties in constructing good codes will not become apparent until we assign different probabilities to the various words (cf. Chapter 13). 3. SOME BASIC CONCEPTS OF LINGUISTICS A central concept in these pages is that of a language, so a clear definition is essential. 
We consider a language L to be a set (finite or infinite) of sentences, each finite in length and constructed by concatenation out of a finite set of elements. This definition includes both natural languages and the artificial languages of logic and of computer-programming theory. In order to specify a language precisely, we must state some principle that separates the sequences of atomic elements that form sentences from those that do not. We cannot make this distinction by mere listing, since in any interesting system there is no bound on sentence length. There are two ways open to us, then, to specify a language. Either we can try to develop an operational test of some sort that will distinguish sentences from nonsentences or we can attempt to construct a recursive procedure for enumerating the infinite list of sentences. The first of these approaches has rarely been attempted and is not within the domain of this survey. The second provides one aspect of what could naturally be called a grammar of the language in question. We confine ourselves here to the second approach, to the investigation of grammars.

In actual investigations of natural language a proposed operational test or a proposed grammar must, of course, meet certain empirical conditions. Before the construction of such a test or grammar can begin, there must be a finite class K1 of sequences that can, with reasonable security, be assigned to the set of sentences as well as, presumably, a class K2 of sequences that can, with reasonable security, be excluded from this class. The empirical significance of a proposed operational test or a proposed grammar will, in large part, be determined by their success in drawing a distinction that separates K1 from K2. The question of empirical adequacy, however, and the problems to which it gives rise are beyond the scope of this survey.

We limit ourselves, then, to the study of grammars: by a grammar we mean a set of rules that (in particular) recursively specify the sentences of a language. In general, each of the rules we need will be of the form

    φ1, . . . , φn → φn+1,    (5)

where each of the φi is a structure of some sort and where the relation → is to be interpreted as expressing the fact that if our process of recursive specification generates the structures φ1, . . . , φn, then it also generates the structure φn+1. The precise specification of the kinds of rules that can be permitted in a grammar is one of the major concerns of mathematical linguistics, and it is to this question that we shall turn our attention.

Consider, for the moment, the following special case of rules of the form of Rule 5. Let n = 1 in Rule 5 and let φ1 and φ2 each be a sequence of symbols of a certain alphabet (or vocabulary). Thus, if we had a finite language consisting of the sentences σ1, . . . , σn and an abstract element S (representing "sentence"), which we take as an initial, given element, we could present the grammar

    S → σ1; . . . ; S → σn;    (6)

which would, in this trivial case, be nothing but a sentence dictionary. More interestingly, consider the case of a grammar containing the two rules

    S → aS;  S → a.    (7)

This pair of rules can generate recursively any of the sentences a, aa, aaa, aaaa, . . . .2 (Obviously the sentences can be put in one-to-one correspondence with the integers so that the language will be denumerably infinite.)
2. More precisely, we should say that we are now considering the case of rules of the form ψ1φ1ψ2 → ψ1φ2ψ2, in which ψ1 and ψ2 are variables ranging over arbitrary, possibly null, strings, and φ1 and φ2 are constants. The variables can, obviously, be suppressed in the actual statement of the rules as long as we restrict ourselves to rules of this form.

To generate aaa, for example, we proceed as follows:

    S      (the given, initial symbol),
    aS     (applying the first rewriting rule),
    aaS    (reapplying the first rewriting rule),          (8)
    aaa    (applying the second rewriting rule).

Below we shall study systems of rules of this and of somewhat richer forms.

To recapitulate, then, the (finite) set of rules specifying a particular language constitutes the grammar of that language. (This definition is made more precise in later sections.) An acceptable grammar must give a precise specification of the (in general, infinite) list of sentences (strings of symbols) that are sentences of this language. As a matter of principle, a grammar must be finite. If we permit ourselves grammars with an unspecifiable set of rules, the entire problem of grammar construction disappears; we can simply adopt an infinite sentence dictionary. But that would be a completely meaningless proposal. Clearly, a grammar must have the status of a theory about those recurrent regularities that we call the syntactic structure of the language. To the extent that a grammar is formalized, it constitutes a mathematical theory of the syntactic structure of a particular natural language.

It should be obvious, however, that a grammar must do more than merely enumerate the sentences of a language (though, in actual fact, even this goal has never been approached). We require as well that the grammar assign to each sentence it generates a structural description that specifies the elements of which the sentence is constructed, their order, arrangement, and interrelations and whatever other grammatical information is needed to determine how the sentence is used and understood. A theory of grammar must, therefore, provide a mechanical way of determining, given a grammar G and a sentence s generated by G, what structural description is assigned to s by G. If we regard a grammar as a finitely specified function that enumerates a language as its range, we could regard linguistic theory as specifying a functional that associates with any pair (G, s), in which G is a grammar and s a sentence, a structural description of s with respect to G; and one of the primary tasks of linguistic theory, of course, would be to give a clear account of the notion of structural description.

This conception of grammar is recent and may be unfamiliar. Some artificial examples can clarify what is intended. Consider the following three artificial languages described by Chomsky (1956):

Language L1. L1 contains the sentences ab, aabb, aaabbb, etc.; all sentences containing n occurrences of a, followed by n occurrences of b, and only these.

Language L2. L2 contains aa, bb, abba, baab, aabbaa, etc.; all mirror-image sentences consisting of a string, followed by the same string in reverse, and only these.

Language L3. L3 contains aa, bb, abab, baba, aabaab, etc.; all sentences consisting of a string followed again by that identical string, and only these.
A grammar G₁ for L₁ may take the following form:

    Given: S,
    F1: S → ab,    (9)
    F2: S → aSb,

where S is comparable to an axiom and F1, F2 are rules of formation by which admissible strings of symbols can be derived from the axiom. Derivations would proceed after the manner of Example 8. A derivation terminates whenever the grammar contains no rules for rewriting any of the symbols in the string. In the same vein a grammar G₂ for L₂ might be the following:

    Given: S,
    F1: S → aa,
    F2: S → bb,
    F3: S → aSa,    (10)
    F4: S → bSb.

An interesting and important feature of both L₁ and L₂ is that new constructions can be embedded inside of old ones. In aabbaa, for example, there is in L₂ a relation of dependency between the first and sixth elements; nested inside it is another dependency between the second and fifth elements; inside that, in turn, is a relation between the third and fourth elements. As G₁ and G₂ are stated, of course, there is no upper limit to the number of embeddings that are possible in an admissible string.

There can be little doubt that natural languages permit this kind of parenthetical embedding and that their grammars must be able to generate such sequences. For example, the English sentence (the rat (the cat (the dog chased) killed) ate the malt) is surely confusing and improbable but it is perfectly grammatical and has a clear and unambiguous meaning. To illustrate more fully the complexities that must in principle be accounted for by a real grammar of a natural language, consider this English sentence:

    Anyone₁ who feels that if₂ so-many₃ more₄ students₅ whom we₆ haven't₆
    actually admitted are₅ sitting in on the course than₄ ones we have
    that₃ the room had to be changed, then₂ probably auditors will have    (11)
    to be excluded, is₁ likely to agree that the curriculum needs revision.

There are dependencies in Example 11 between words with the same subscript; the result is a system of nested dependencies, as in L₂. Furthermore, to complicate the picture further, there are dependencies that cross those indicated, for example, between students and ones, between haven't . . . admitted and have 10 words later (with an understood occurrence of admitted deleted). Note, incidentally, that we can have nested dependencies, as in Example 11, in which a variety of constructions are involved; in the special case in which the same nested construction occurs more than once, we speak of self-embedding.

Of course, we can safely predict that Example 11 will never be produced, except as an example, just as we can, with equal security, predict that such perfectly well-formed sentences as birds eat, black crows are black, black crows are white, Tuesday follows Monday, etc., will never occur in normal adult discourse. Like other sentences that are too obviously true, too obviously false, too complex, too inelegant, or that fail in innumerable other ways to be of any use for ordinary human affairs, they are not used. Nevertheless, Example 11 is a perfectly well-formed sentence with a clear and unambiguous meaning, and a grammar of English must be able to account for it if the grammar is to have any psychological relevance.

A grammar that will generate the language L₃ must be quite a bit more complex than that for L₂, if we restrict ourselves to rules of the form φ → ψ, where φ and ψ are strings, as proposed (see Chapter 12, Sec. 3, for further discussion).
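The derivations licensed by G₁ can be carried out mechanically in the manner of Example 8; the sketch below (ours, for illustration only) applies F2 a fixed number of times and then terminates with F1. G₂ of Example 10 can be treated in exactly the same way.

    # G1 (Example 9):  Given: S,  F1: S -> ab,  F2: S -> aSb.
    def derive_G1(embeddings):
        line = "S"                            # the given, initial symbol
        derivation = [line]
        for _ in range(embeddings):           # each use of F2 adds one nesting
            line = line.replace("S", "aSb", 1)
            derivation.append(line)
        line = line.replace("S", "ab", 1)     # F1 terminates the derivation
        derivation.append(line)
        return derivation

    print(derive_G1(2))    # ['S', 'aSb', 'aaSbb', 'aaabbb']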
We can, however, construct a simple grammar for this language if we allow more powerful rules. Let us establish the convention that the symbol x stands for any string consisting of just the symbols a, b and let us add the symbol # as a boundary symbol marking the beginning and the end of a sentence. Then we can propose the following grammar for L₃:

    Given: #S#,
    F1: S → aS,
    F2: S → bS,    (12)
    F3: #xS# → #xx#.

Rules F1 and F2 permit the generation of an arbitrary string of a's and b's; F3 repeats any such string. Clearly, the language generated is exactly L₃ (with boundary symbols on each sentence). It is important to note, however, that F3 has a different character from the other rules, since it necessitates an analysis of the total string to which it applies; this analysis goes well beyond what is allowed by the restricted form of rule we considered first.

If we adopt richer and more powerful rules such as F3, then we can often effect a great simplification in the statement of a grammar; that is to say, we can make use of generalizations regarding linguistic structure that would otherwise be simply unstatable. There can be no objection to permitting such rules, and we shall give some attention to their formulation and general features in Sec. 5. Since a grammar is a theory of a language and simplicity and generality are primary goals of any theory construction, we shall naturally try to formulate the theory of linguistic structure to accommodate rules that permit the formulation of deeper generalizations. Nevertheless, the question whether it is possible in principle to generate natural languages with rules of a restricted form (such as F1 to F4 in Example 10) retains a certain interest. An answer to this question (either positive or negative) would reveal certain general structural properties of natural language systems that might be of considerable interest.

The dependency system illustrated in L₃ is quite different from that of L₂. Thus in the string baabaa of L₃ the dependencies are not nested, as they are in the string aabbaa of L₂; instead the fourth symbol depends on the first, the fifth on the second, and the sixth on the third. Dependency systems of this sort are also to be found in natural language (see Chapter 12, Sec. 4.2, for some examples) and thus must also be accommodated by an adequate theory of grammar. The artificial languages L₂ and L₃, then, illustrate real features of natural language, and we shall see later that the illustrated features are critical for determining the adequacy of various types of models for grammar.

In order to illustrate briefly how these considerations apply to a natural language, consider the following small fragment of English grammar:

    Given: #S#,
    F1: S → AB,
    F2: A → CD,
    F3: B → EA,    (13)
    F4: C → a, the, another, . . . ,
    F5: D → ball, boy, girl, . . . ,
    F6: E → hit, struck, . . . .

F4 to F6 are groups of rules, in effect, since they offer several alternative ways of rewriting C, D, and E. (Ordinarily, we refer to A as a noun phrase and to B as a verb phrase, etc., but these familiar names are not essential to the formal structure of the grammar, although they may play an important role in general linguistic theory.) In any real grammar, of course, there must also be phonological rules that code the terminal strings into phonetic representations. For simplicity, however, we shall postpone any consideration of the phonological component of grammar until Sec. 6. With this bit of grammar, we can generate such terminal strings as # the boy hit the girl#.
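To see the fragment of Example 13 at work, the following sketch (an illustrative rendering of ours, not part of the original text) expands the initial symbol by choosing freely among the alternatives grouped under F4 to F6.

    import random

    # The fragment of Example 13; F4-F6 are groups of alternatives.
    RULES = {
        "S": [["A", "B"]],
        "A": [["C", "D"]],
        "B": [["E", "A"]],
        "C": [["a"], ["the"], ["another"]],
        "D": [["ball"], ["boy"], ["girl"]],
        "E": [["hit"], ["struck"]],
    }

    def generate(symbol="S"):
        # Rewrite nonterminals until only terminal words remain.
        if symbol not in RULES:
            return [symbol]
        words = []
        for sym in random.choice(RULES[symbol]):
            words.extend(generate(sym))
        return words

    print("#", " ".join(generate()), "#")   # e.g.  # the boy hit the girl #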
In this simple case all of the terminal strings have the same phrase structure, which we can indicate by bracketing with labeled parentheses,

    #(S (A (C the)C (D boy)D )A (B (E hit)E (A (C the)C (D girl)D )A )B )S#,

or, equivalently, with a labeled tree graph of the kind shown in Fig. 3. We assume that such a tree graph must be a part of the structural description of any sentence; we refer to it as a phrase-marker (P-marker). A grammar must, for adequacy, provide a P-marker for each sentence. Each P-marker contains, as the successive labels of its final nodes, the record of the vocabulary elements (e.g., words) of which the sentence is actually composed. Two P-markers are the same only if they have exactly the same branching structure and exactly the same labels on corresponding nodes. Note that the tree graph of a P-marker, unlike the coding trees of Sec. 2, must have a specified ordering of its branches from left to right, corresponding to the order of the elements in the string. The function of P-markers is well illustrated by the sentences displayed in Examples 1a and 1b. A grammar that merely generated the given string of words would not be able to characterize the grammatical differences between those two sentences.

[Fig. 3. A graphical representation (P-marker) of the derivation of a grammatical sentence.]

Linguists generally refer to any word (morpheme) or sequence that functions as a unit in some larger construction as a constituent. In the sentence whose P-marker is shown in Fig. 3, girl, the girl, and hit the girl are all constituents, but hit the is not a constituent. Constituents can be traced back to nodes in the tree; if the node is labeled A, then the constituent is said to be of type A. The immediate constituents of any construction are the constituents of which that construction is directly formed. For example, the boy and hit the girl are immediate constituents of the sentence in Fig. 3; hit and the girl are immediate constituents of the verb phrase B; etc. We will not be satisfied with any formal characterization of grammar that does not provide at least a structural description in terms of immediate constituents.

A grammar, then, must provide a P-marker for each of an infinite class of sentences, in such a way that each P-marker can be represented graphically as a labeled tree with labeled lines (i.e., the nodes are labeled and lines branching from a single node are distinguished with respect to order). By a branch of a tree we mean a sequence of lines, each connected to the preceding one. For example, one of the branches in Fig. 3 is the sequence ((S-B), (B-E), (E-hit)); another is ((A-D), (D-girl)), etc. Since the tree graph represents proper parenthesization, its branches do not cross. (To make these informal remarks precise in the intended and obvious sense, we would have to distinguish occurrences of the same label.) The symbols that label the nodes of the tree are those that appear in the grammatical rules. Since the rules are finite and there are in any interesting case an infinite number of P-markers, there must be some symbols of the vocabulary of the grammar that can occur indefinitely often in P-markers.

[Fig. 4. Illustrating types of recursive elements: (a) self-embedding, (b) left-recursive, (c) right-recursive.]
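The P-marker of Fig. 3 can be encoded directly, and the constituents read off by tracing each node to the stretch of the terminal string it dominates; the nested-tuple encoding below is one convenient choice among many and is ours, not the chapter's.

    # The P-marker of Fig. 3 as a nested (label, daughters...) tuple; leaves are words.
    P_MARKER = ("S",
                ("A", ("C", "the"), ("D", "boy")),
                ("B", ("E", "hit"), ("A", ("C", "the"), ("D", "girl"))))

    def yield_of(node):
        # The stretch of the terminal string dominated by a node.
        if isinstance(node, str):
            return [node]
        words = []
        for daughter in node[1:]:
            words.extend(yield_of(daughter))
        return words

    def constituents(node):
        # Each labeled node contributes one constituent of its type.
        if isinstance(node, str):
            return []
        found = [(node[0], " ".join(yield_of(node)))]
        for daughter in node[1:]:
            found.extend(constituents(daughter))
        return found

    for label, phrase in constituents(P_MARKER):
        print(label, ":", phrase)
    # Prints, top-down: S : the boy hit the girl, A : the boy, C : the, D : boy,
    # B : hit the girl, E : hit, A : the girl, ...; no node yields "hit the",
    # so "hit the" is not a constituent.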
Furthermore, there will be branches that contain some of these symbols more than n times for any fixed n. Given a set of P-markers, we say that a symbol of the vocabulary is a recursive element if for every n there is a P-marker with a branch in which this symbol occurs more than n times as a label of a node. We distinguish three different kinds of recursive elements that are of particular importance in later developments. We say that a recursive element A is self-embedding if it occurs in a configuration such as that of Fig. 4a; left-recursive if it occurs in a configuration such as that of Fig. 4b ; right-recursive if it occurs in a configuration such as that of Fig. 4c. Thus, if A is a left-recursive element, it dominates (i.e., is a node from which can be derived) a tree configuration that contains A somewhere on its leftmost branch; if it is a right recursive element it dominates a tree configuration that contains A somewhere on its rightmost branch; if it is a self-embedding element, it dominates a tree configuration that contains A somewhere on an inner branch. It is not difficult to make these definitions perfectly precise, but we will do so only for particular cases of special interest. The study of recursive generative processes (such as, in particular, generative grammars) has grown out of investigations of the foundations of mathematics and the theory of proof. For a recent survey of this field, SOME BASIC CONCEPTS OF LINGUISTICS 2QI see Davis (1958); for a shorter and more informal introduction see, for example, Rogers (1959) or Trachtenbrot (1962). We return to more general considerations involving recursive generation and its properties in Chapter 12. In this section we have mentioned a few basic properties of grammars and have given some examples of generative devices that might be regarded as grammars. In the rest of this chapter we try to formulate the characteristics of grammars in more detail. In Sec. 4 we consider more carefully grammars meeting the condition that, in each rule of the form (5), n = 1 and each structure is a string, as in Examples 9, 10, and 13. (In Chapter 12 we study some of the properties of these systems.) In Sec. 5 we turn our attention to a richer class of grammars, akin to that suggested for Lz in Example 12, that do not impose the restrictive conditions that in each rule of the form of Example 5 the structures are limited to strings or that n = 1 . Finally, in Sec. 6 we indicate briefly some of the properties of the phono- logical component of a grammar that converts the output of the recursive generative procedure into a sequence of sounds. We have described a generative grammar G as a device that enumerates a certain subset L of the set 2 of strings in a fixed vocabulary V and that assigns structural descriptions to the members of the set L(G), which we call the language generated by G. From the point of view of the intended application of the models that we are considering, it would be more realistic to regard G as a device that assigns a structural description to each string of S, where the structural description of a string x indicates, in particular, the respects in which x deviates, if at all, from well-formedness, as defined by G. Instead of partitioning S into the two subsets L(G) (well-formed sentences) and L(G) (nongrammatical strings), G would now distinguish in S a class L± of perfectly well-formed sentences and would partially order all strings in S in terms of degree ofgrammaticalness. We could say, then, that L! 
is the language generated by G; but we could attempt to account for the fact that strings not in L₁ can still often be understood, even understood unambiguously by native speakers, in terms of the structural descriptions assigned to these strings. A string of Σ that is not in L₁ can be understood by imposing on it an interpretation, guided by its analogies and similarities to sentences of L₁. We could attempt to relate ease and uniformity of interpretation of a sentence to degree of grammaticalness, given a precise definition of this notion. Deviation from grammatical regularities is a common literary or quasi-literary device, and it need not produce unintelligibility; in fact, as has often been remarked, it can provide a certain richness and compression.

The problems of defining degree of grammaticalness, relating it to interpretation of deviant utterances, and constructing grammars that assign degrees of grammaticalness other than zero and one to utterances are all interesting and important. Various aspects of these questions are discussed in Chomsky (1955, 1961b), in Ziff (1960a, 1960b, 1961), and in Katz (1963). We consider this matter further in Chapter 13. Here, and in Chapter 12, however, we limit our attention to the special case of grammars that partition all strings into the two categories grammatical and ungrammatical and that do not go on to establish a hierarchy of grammaticalness.

4. A SIMPLE CLASS OF GENERATIVE GRAMMARS

We consider here a simple class of grammars that can be called constituent-structure grammars and introduce some of the more important notations that are used in this and the next two chapters. In this section we regard a grammar G,

    G = [V, ⌢, →, V_T, S, #],

as a system of concatenation meeting the following conditions:

1. V is a finite set of symbols called the vocabulary. The strings of symbols of this vocabulary are formed by concatenation; ⌢ is an associative and noncommutative binary operation on strings formed on the vocabulary V. We suppress ⌢ where no confusion can result.

2. V_T ⊂ V. V_T we call the terminal vocabulary. The relative complement of V_T with respect to V we call the nonterminal or auxiliary vocabulary and designate it by V_N.

3. → is a finite, two-place, irreflexive and asymmetric relation defined on certain strings on V and read "is rewritten as." The pairs (φ, ψ) such that φ → ψ are called the (grammatical) rules of G.

4. Where A ∈ V, A ∈ V_N if and only if there are strings φ, ψ, ω such that φAψ → φωψ. # ∈ V_T; S ∈ V_N; e ∈ V_T; where # is the boundary symbol, S is the initial symbol that can be read sentence, and e is the identity element with the property that for each string φ, φe = eφ = φ.

We define the following additional notions:

5. A sequence of strings D = (φ₁, . . . , φₙ) (n ≥ 1) is a φ-derivation of ψ if and only if (a) φ = φ₁; ψ = φₙ; (b) for each i < n there are strings ψ₁, ψ₂, ξ, ω such that ξ → ω, φᵢ = ψ₁ξψ₂, and φᵢ₊₁ = ψ₁ωψ₂. The set of derivations is thus completely determined by the finitely specified relation →, that is, by the finite set of grammatical rules. Where there is a φ-derivation of ψ, we say that φ dominates ψ and write φ ⇒ ψ; ⇒ is thus a reflexive and transitive relation.

6. A φ-derivation D of ψ is terminated if ψ is a string on V_T and D is not the proper initial subsequence of any derivation (note that these conditions are independent).
7. ψ is a terminal string of G if there is a terminated #S#-derivation of #ψ#; that is, a terminal string is the last line of a terminated derivation beginning with the initial string #S#.

8. The terminal language generated by G is the set of terminal strings of G.

9. Two grammars G and G* are (weakly) equivalent if they generate the same terminal language. A stronger equivalence is discussed later.

We want the boundary symbol # to meet the condition that if (φ₁, . . . , φₙ) is a #S#-derivation then, for each i, φᵢ = #ψᵢ#, where ψᵢ does not contain #. To guarantee that this will be the case, we impose on the rules of the grammar (on the relation →) the following additional condition:

10. If α₁ . . . αₘ → β₁ . . . βₙ is a rule of a grammar (where αᵢ, βⱼ ∈ V), then
(a) # occurs in α₁ . . . αₘ and in β₁ . . . βₙ only as the first or the last symbol;
(b) β₁ = # if and only if α₁ = #;
(c) βₙ = # if and only if αₘ = #.

The conditions so far laid down do not show how the grammar may provide a P-marker for every terminal string. This is a further requirement that we will have to keep in mind as we consider additional, more restrictive conditions. See Culik (1962) for a critique of an earlier formulation of these notions in Chomsky (1959).

Recalling now our discussion of recursive elements in Sec. 3, we can observe the following, where A ∈ V_N:

1. If there are nonnull strings φ, ψ such that A ⇒ φAψ, then A is a self-embedding element.
2. If there is a nonnull φ such that A ⇒ Aφ, then A is a left-recursive element.    (14)
3. If there is a nonnull φ such that A ⇒ φA, then A is a right-recursive element.
4. If A is a nonrecursive element, then there are no strings φ, ψ such that A ⇒ φAψ.

The converses of these statements are not necessarily true for grammars of the form we have so far described, although they will be true in the case that we now discuss. Thus suppose the grammar G contains the rules S → BAC, BA → BBAC, but no rule A → χ for any χ. Then there are no strings φ, ψ such that A ⇒ φAψ, but A is recursive (in fact, self-embedding).

Our discussion of these systems requires us to distinguish terminal from nonterminal elements and atomic elements from strings formed by concatenation. It will help to keep these distinctions clear if we adopt the following important convention:

    Type           Single atomic elements    Strings of elements
    Nonterminal    A, B, C, . . .            X, Y, Z, . . .
    Terminal       a, b, c, . . .            x, y, z, . . .
    Arbitrary      α, β, γ, . . .            φ, ψ, ω, . . .

This convention has in fact already been followed; it is followed without special comment throughout this and the next two chapters.

We would now like to add conditions to guarantee that a P-marker for a terminal string can be recovered uniquely from its derivation. There is no natural way to do this for the case of grammars of the sort so far discussed. For example, if we have a grammar with the rules

    S → AB;  AB → cde,    (15)

there is no way to determine whether in the string cde the segment cd is a phrase of the type A (dominated by A in the phrase marker) or whether de is a phrase of the type B (dominated by B in the phrase marker). We can achieve the desired result most naturally by requiring that in each rule of the grammar only a single symbol can be rewritten. Thus grammars can contain rules of either of the forms in Rule 16a or 16b:

    A → ω,    (16a)
    φAψ → φωψ  (equivalently: A → ω in the context φ — ψ).    (16b)

A grammar containing only rules of the type in Rule 16b is called a context-sensitive grammar.
A grammar containing only rules of the type in Rule I6a is called a context-free grammar, and the language it generates, a context-free language. (Note, incidentally, that it is the rules that are sensitive to or free of their context, not the elements in the terminal string.) In either case, if ^ is a line of a derivation and y is the line immediately succeeding it, then there are unique3 strings l9 & a, co such that = ^a^g and y = faoxfrzl and we say that co is a string of type a (i.e., it is 3Actually, to guarantee uniqueness certain additional conditions must be satisfied by the set of rules; in particular, we must exclude the possibility of such a sequence of lines as AB, ACB, and so on. We will assume without further comment that such conditions are met. They can, in fact, always be met without restricting further the class of languages that can be generated, although, of course, they affect in some measure the class of systems of P-markers that can be generated. In the case of context-sensitive grammars, this further condition is by no means innocuous, as we shall see in Chapter 12, Sec. 3. A SIMPLE CLASS OF GENERATIVE GRAMMARS 2Q$ dominated by a node labeled a in the tree representing the associated P- marker). In the case of context-free grammars, the converse of each assertion of Observation 14 is true, and we have precise definitions of the various kinds of recursiveness in terms of =>. In fact, we shall study the various types of recursive elements only in the case of context-free gram- mars. The principal function of rules of the type in Rule I6b is to permit the statement of selectional restrictions on the choice of elements. Thus among subject- verb-object sentences we find, for example, The fact that the case was dismissed doesn't surprise me, Congress enacted a new law, The men consider John a dictator, John felt remorse, but we do not find the se- quences formed by interchange of subject and object: / don't surprise the fact that the case was dismissed, A new law enacted Congress, John considers the men a dictator, Remorse felt John. Native speakers of English recognize that the first four are perfectly natural, but the second four, if intelligible at all, require that some interpretation be imposed on them by analogy to well- formed sentences. They are, in the sense described at the close of Sec. 3, of lower, if not zero, degree of grammaticalness. A grammar that did not make this distinction would clearly be deficient; it can be made most naturally and economically by introducing context-sensitive rules to provide specific selectional restrictions on the choice of subject, verb, and object. A theory of grammar must, ideally, contain a specification of the class of possible grammars, the class of possible sentences, and the class of possible structural descriptions; and it must provide a general method for assigning one or more structural descriptions to each sentence, given a grammar (that is, it must be sufficiently explicit to determine what each grammar states about each of the possible sentences — cf. Chomsky, 196 la, for discussion). To establish the theory of constituent-structure grammar finally, we fix the vocabularies VN and VT as given disjoint finite sets, meeting the conditions stated. A grammar, then, is simply a finite relation on strings in V = VN u VT meeting these conditions. 
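The formal difference between the two rule forms of Rule 16 can be made concrete: a context-free rule rewrites A wherever it occurs, whereas a context-sensitive rule rewrites A only between the stated left and right contexts, which is what makes it suitable for stating selectional restrictions. A minimal sketch, with arbitrary toy symbols of our own choosing:

    # Rule 16a: A -> w                 (applies to any occurrence of A)
    # Rule 16b: pAq -> pwq, i.e. A -> w in the context p -- q.

    def apply_context_free(line, A, w):
        i = line.find(A)
        return line[:i] + w + line[i + len(A):] if i >= 0 else None

    def apply_context_sensitive(line, A, w, left, right):
        # Rewrite A only where it is immediately preceded by `left`
        # and immediately followed by `right`.
        target = left + A + right
        i = line.find(target)
        if i < 0:
            return None
        return line[:i] + left + w + right + line[i + len(target):]

    print(apply_context_free("bAc", "A", "w"))                 # bwc
    print(apply_context_sensitive("bAc", "A", "w", "b", "c"))  # bwc
    print(apply_context_sensitive("dAc", "A", "w", "b", "c"))  # None: context absent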
Note that the problem of fixing V_T, in the case of natural language, is essentially the problem of constructing a universal phonetic theory (including, in particular, a universal phonetic alphabet and laws determining universal constraints on distribution of its segments). For more on this topic, see Sec. 6 and the references cited there. In addition, we must set a bound on morpheme length (so that V_T is finite) and establish a set of "grammatical morphemes" (e.g., tenses, aspects, and numbers). This latter problem, along with that of giving a substantive interpretation to the members of the set V_N, is the classical problem of "universal grammar," namely, giving a language-independent characterization of the set of "grammatical categories" that can function in some natural language, a characterization that will no doubt ultimately involve both formal considerations involving grammatical structure and considerations of an absolute semantic nature. This is a problem that has not been popular in recent decades, but there is no reason to regard it as beyond the bounds of serious study. On the contrary, it remains a significant and basic question for general linguistics.

We return to the study of the various kinds of constituent-structure grammars in Chapter 12.

5. TRANSFORMATIONAL GRAMMARS

We stipulated in Sec. 3 that rules of grammar are each of the form

    φ₁, . . . , φₙ → ψ,    (17)

where each of φ₁, . . . , φₙ, ψ is a structure of some sort, and the symbol → indicates that, if φ₁, . . . , φₙ have been generated in the course of a derivation, then ψ can also be generated as an additional step. In Sec. 4 we considered a simple case of grammars, called constituent-structure grammars, with rules representing a very special case of (17), namely, the case in which n = 1 and in which, in addition, each of the structures φ₁ and ψ is simply a string of symbols. (We also required that the grammar meet the additional condition stated in Rule 16.) In this section we consider a class of grammars containing two syntactic subcomponents: a constituent-structure component consisting of rules meeting the restrictive conditions discussed in Sec. 4, and several others, and a transformational component containing rules of the form of Rule 17, in which n may be greater than 1 and in which each of the structures φ₁, . . . , φₙ, and ψ is not a string but rather a phrase-marker (cf. p. 288). Such grammars we call transformational grammars.

The plausibility of this generalization to transformational grammars is suggested by the obvious psychological fact that some pairs of sentences seem to be grammatically closely related. Why, for example, is the sentence John threw the ball felt to be such a close relative of the sentence The ball was thrown by John? It cannot be because they mean the same thing, for a similar kind of closeness can be felt between John threw the ball and Who threw the ball? or Did John throw the ball?, which are not synonymous.
Nor can it be a matter of some simple formal relation (in linguistic par- lance, a "co-occurrence relation") between the ^-tuples of words that fill corresponding positions in the paired sentences, as can be seen, for example by the fact that The old man met the young woman and The old woman met the young man are obviously not related in the same way in which active and passive are structurally related, although the distinction between this TRANSFORMATIONAL GRAMMARS 2QJ case and the active-passive case cannot be given in any general "distribu- tional" terms (for discussion, see Chomsky, 1962b). In a constituent-struc- ture grammar all of these sentences would have to be generated more or less independently and thus would bear no obvious relation to one another. But in a transformational grammar these sentences can be related directly by simple rules of transformation. 5. 1 Some Shortcomings of Constituent-Structure Grammars The basic reasons for rejecting constituent-structure grammars in favor of transformational grammars in the theory of linguistic structure have to do with the impossibility of stating many significant generalizations and simplifications of the rules of sentence formation within the narrower framework. These considerations go beyond the bounds of the present survey. For discussion, see Chomsky (1955, Chapters 7 to 9; 1957, Chapters 6 to 7; 1962b), Lees (1957; I960), and Postal (1963); also Chapter 12, Sec. 4.2. However, some of the respects in which constituent- structure grammars are formally defective are relatively easy to describe and involve some important general considerations that are often over- looked in considering the adequacy of grammars. A grammar must generate a language regarded as an infinite set of sentences. It must also associate with each of these sentences a structural description; it must, in other words, generate an infinite set of structural descriptions, each of which uniquely determines a particular sentence (though not conversely). Hence there are two kinds of equivalence that we can consider when evaluating the generative capacity of grammars and of classes of grammars. Two grammars will be called weakly equivalent if they generate the same language; they will be called strongly equivalent if they generate the same set of structural descriptions. In this chapter, and again in Chapter 12, we consider mainly weak equivalence because it is more accessible to study and has been investigated in more detail, but strong equivalence is, ultimately, by far the more interesting notion. Similarly, we have no interest, ultimately, in grammars that generate a natural language correctly but fail to generate the correct set of structural de- scriptions. The question whether natural languages, regarded as sets of sentences, can be generated by one or another type of constituent-structure grammar is an interesting and important one. However, there is no doubt that the set of structural descriptions associated with, let us say, English cannot in principle be generated by a constituent-structure grammar, however com- plex its rules may be. The problem is that a constituent-structure grammar 2^8 FORMAL ANALYSIS OF NATURAL LANGUAGES necessarily imposes too rich an analysis on sentences because of features inherent in the way in which P-markers are defined for such grammars. The germ of the problem can be seen in the case of such sentences as Why has John always been such an easy man to please? 
(18) The whole is a sentence; the last several words constitute a noun phrase; the words can be assigned to syntactic categories. But there is no reason to assign any phrase structure beyond that. To assign just this amount of phrase structure, a constituent-structure grammar would have to contain these rules : 5 _> why has NP always been NP #/>->• John NP -* such an easy N to Ftr(ansitive) (19) Ktr — * please, or something of the sort. It is obvious that a collection of rules such as this is quite absurd and leaves unstated all sorts of structural regularities. (Furthermore, the associated P-marker is defective in that it does not indicate that man is grammatically related to please as it is, for example, in ft pleases the man ; that is to say, that the verb-object relation holds for this pair. Cf. Sec. 2.2 of Chapter 13.) In particular, much of the point of constituent-structure grammar is lost if we have to give rules that analyze certain phrases into six immediate constituents, as in Example 19. This difficulty changes from a serious complication to an inadequacy in principle when we consider the case of true coordination, as, for example, in such sentences as The man was old, tired, tall, . . . , and friendly. (20) In order to generate such strings, a constituent-structure grammar must either impose some arbitrary structure (e.g., using a right-recursive rule), in which case an incorrect structural description is generated, or it must contain an infinite number of rules. Clearly, in the case of true coordina- tion, by the very meaning of this term, no internal structure should be assigned at all within the sequence of coordinate items. We might try to meet this problem by extending the notion of constituent- structure grammar to permit such rule schemata as Predicate -> Adjn and Adj (n > 1). (21) Aside from the many difficulties involved in formulating this notion so that descriptive adequacy may be maintained, it is, of course, beside the point in the kind of difficulty that arises in Example 19. In general, for each TRANSFORMATIONAL GRAMMARS particular kind of difficulty that arises in constituent-structure grammars, it is possible to devise some ad hoc adjustment that might circumvent it. Much to be preferred, obviously, would be a conceptual revision that would succeed in avoiding the mass of these difficulties in a uniform way, while allowing the simple constituent-structure grammar to operate without essential alteration for the class of cases for which it is adequate and which initially motivated its development. As far as we know, the theory of transformational grammar is unique in holding out any hope that this end can be achieved. In a transformational grammar the set of rules meets the following con- ditions. We have, first, a constituent-structure component consisting of a sequence of rules of the form Ay -> , where A is a single symbol, a) is nonnull, and <^>, ip are possibly null strings. This constituent-structure component of the grammar will generate a finite number of C-terminal strings, to each of which we can associate, as before, a labeled tree — a P-marker — representing its constituent structure. We now add to the grammar a set of operations called grammatical transformations, each of which maps an w-tuple of P-markers (n > 1) into a new P-marker. The recursive property of the grammar can be attributed entirely to these transformations. 
Among these transformations, some are obligatory — they must be applied in every derivation (furthermore, some transforma- tions are obligatory relative to others, i.e., they must be applied if the others are applied). A string derived by the application of all obligatory and some optional transformations is called a T-terminal string. We can regard a J-terminal string as being essentially a sequence of morphemes. A T- terminal string derived by the use of only obligatory transformations can be called a kernel string. If a language contains kernel strings at all, they will represent only the simplest sentences. The idea of using grammatical transformations to overcome the inadequacies of other types of generative grammars derives from Harris* investigation of the use of such operations to "normalize" texts (Harris, 1952a, 1952b). The description we give subsequently is basically that of Chomsky (1955); the exposition follows Chomsky (196 la). A rather different development of the underlying idea was given by Harris (1957). As we have previously remarked, the reason for adding transformational rules to a grammar is simple. There are some sentences (simple declarative active sentences with no complex noun or verb phrases) that can be gen- erated quite naturally by a constituent-structure grammar — more precisely, this is true only of the terminal strings underlying them. There are others (passives, questions, and sentences with discontinuous phrases and complex phrases that embed sentence transforms, etc.) that cannot be generated in a natural and economical way by a constituent-structure 30O FORMAL ANALYSIS OF NATURAL LANGUAGES grammar but that are, nevertheless, related systematically to sentences of simpler structure. Transformations express these relations. When used to generate more complex sentences (and their structural descriptions) from already generated simpler ones, transformations can account for aspects of grammatical structure that cannot be expressed by constituent-structure grammar. The problem, therefore, is to construct a general and abstract notion of grammatical transformation, one that will incorporate and facilitate the expression of just those formal relations between sentences that have a significant function in language. 5.2 The Specification of Grammatical Transformations A transformation cannot be simply an operation defined on terminal strings, irrespective of their constituent structure. If it could, then the passive transformation could be applied equally well in Examples 22 and 23: The man saw the boy -> The boy was seen by the man (22) _, tit, (The boy was seen by the man leave The man saw the boy leave -/+ {_, - , J , ,, J (The boy leave was seen by the man (23) In order to apply a transformation to some particular string, we must know the constituent structure of that string. For example, a transforma- tion that would turn a declarative sentence into a question might prepose a certain element of the main verbal phrase of the declarative sentence. Applied to The man who was here was old, it would yield Was the man who was here old! The second was in the original sentence has been proposed to form a question. If the first was is preposed, however, we get the un- grammatical Was the man who here was old! Somehow, therefore, we must know that the question transformation can be applied to the second was because it is part of the main verbal phrase, but it cannot be applied to the first was. 
In short, we must know the constituent structure of the original sentence. It would defeat the purpose of transformational analysis to regard transformations as higher level rewriting rules that apply to undeveloped phrase designations. In the case of the passive transformation, for ex- ample, we cannot treat it merely as a rewriting rule of the form NPl + Auxiliary + Verb + NP2 -+ NP2 + Auxiliary + be+ Verb + en + by + NP19 (24) or something of the sort. Such a rule would be of the type required for a constituent-structure grammar, defined in Sec, 4, except that it would not TRANSFORMATIONAL GRAMMARS 301 meet the condition imposed on constituent-structure rules in 16 (that is, the condition that only a single symbol be rewritten), which provides for the possibility of constructing a P-marker. A sufficient argument against introducing passives by such a rule as Example 24 is that transformations, so formulated, would not provide a method to simplify the grammar when selectional restrictions on choice of elements appear, as in the ex- amples cited at the end of Sec. 4. In the passives corresponding to those examples the same selectional relations are obviously preserved, but they appear in a different arrangement. Now, the point is that, if the passive transformation were to apply as a rewriting rule at a stage of derivation preceding the application of the selectional rules for subject-verb-object, an entirely independent set of context-sensitive rules would have to be given in order to determine the corresponding agent-verb-subject selection in the passive. One of the virtues of a transformational grammar is that it provides a way to avoid this pointless duplication of selectional rules, with its consequent loss of generality, but that advantage is lost if we can apply the transformation before the selection of particular elements. It seems evident, therefore, that a transformational rule must apply to a fully developed P-marker, and, since transformational rules must reapply to transforms, it follows that the result of applying a transformation must again be a P-marker, the derived P-marker of the terminal string resulting from the transformation. A grammatical transformation, then, is a mapping of P-markers into P-markers. We can formulate this notion of grammatical transformation in the following way. Suppose that Q is a P-marker of the terminal string t and that t can be subdivided into successive segments t^ . . . , tn in such a way that each ti is traceable, in Q, to a node labeled A{. We say, in such a case, that t is analyzable as (tl9 . . . , tn\ Al9 . . . , An) with respect to Q. In the simplest case a transformation T will be specified in part by a sequence of symbols (A^ ...,AJ that defines its domain by the following rule: A string t with P-marker Q is in the domain of T if t is analyzable as fo, . . . , tn; A^...,An] with respect to Q. Then fo, ...,**) is a proper analysis oft with respect to Q,T, and (A19 ...,AJis the structure index ofT. To complete the specification of the transformation J, we must describe the effect that T has on the terms of the proper analysis of any string to which it applies. For instance, T may have the effect of deleting or per- muting certain terms, of substituting one for another, or adding a constant string in a fixed place, and so on. Suppose that we associate with a trans- formation T an underlying elementary transformation Tel such that $O2 FORMAL ANALYSIS OF NATURAL LANGUAGES Tel(i; /a,..., t „) = C7Z, where (tl9 . . . 
where (t₁, . . . , tₙ) is the proper analysis of t with respect to Q, T. Then the string resulting from application of the transformation T to the string t with P-marker Q is

    T(t, Q) = σ₁ . . . σₙ.

Obviously, we do not want any arbitrary mapping of the sort just described to qualify as a grammatical transformation. We would not want, for example, to permit in a grammar a transformation that associates such pairs as John saw the boy → I'll leave tomorrow; John saw the man → why don't you try again; John saw the girl → China is industrializing rapidly. Only rules that express genuine structural relations between sentence forms (active-passive, declarative-interrogative, declarative-nominalized sentence, and so on) should be permitted in the grammar. We can avoid an arbitrary pairing off of sentences if we impose an additional, but quite natural, requirement on the elementary transformations. The restriction can be formulated as follows:⁴ If T_el is an elementary transformation, then for all integers i and n and all strings x₁, . . . , xₙ, y₁, . . . , yₙ it must be the case that T_el(i; x₁, . . . , xₙ) is formed from T_el(i; y₁, . . . , yₙ) by replacing yⱼ in the latter by xⱼ, for each j ≤ n. In other words, the effect of an elementary transformation is independent of the particular choice of strings to which it applies.

4. A more precise formulation would have to distinguish occurrences of the same string (Chomsky, 1955).

This requirement has the effect of ruling out the possibility of applying transformations to particular strings of actually occurring words (or morphemes). Thus no single elementary transformation meeting this restriction can have both the effect of replacing John will try by will John try? and the effect of replacing John tried by did John try?, although this also is clearly the effect of the simple question-transformation. The elementary transformation that we need in this case is that which converts x₁x₂x₃ to x₂x₁x₃. That is to say, the transformation T_el is defined as follows, for arbitrary strings x₁, x₂, x₃:

    T_el(1; x₁, x₂, x₃) = x₂,  T_el(2; x₁, x₂, x₃) = x₁,  T_el(3; x₁, x₂, x₃) = x₃.

But if this is to yield did John try?, it will be necessary to apply it not to the sentence John tried but rather to a hypothetical string having the form John + past + try (a terminal string that is parallel in structure to the sentence John will try) that underlies the sentence John tried. In general, we cannot require that terminal strings be related in any simple way to actual sentences. The obligatory mappings (both transformational and phonological) that specify the physical shape may reorder elements, add or delete elements, and so on.

For empirical adequacy, the notion of transformation just described must be generalized in several directions. First, we must admit transformations that apply to pairs of P-markers.
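The permuting elementary transformation just defined can be sketched directly. In the fragment below (ours; the real definition operates on P-markers rather than on bare term lists) the proper analysis is represented simply as a list of its terms, and the independence requirement is respected automatically, since the rule mentions only the positions of the terms, never the particular strings.

    # The elementary transformation that converts x1 x2 x3 into x2 x1 x3:
    # T_el(1; x1, x2, x3) = x2,  T_el(2; ...) = x1,  T_el(3; ...) = x3.
    def T_el(i, terms):
        x1, x2, x3 = terms
        return (x2, x1, x3)[i - 1]

    def apply_T(proper_analysis):
        # T(t, Q) is the concatenation sigma_1 ... sigma_n,
        # where sigma_i = T_el(i; t1, ..., tn).
        n = len(proper_analysis)
        return " ".join(T_el(i, proper_analysis) for i in range(1, n + 1))

    # Blind to the particular strings, the same rule handles both cases;
    # a structure index of roughly the form (NP, Aux, V) would fix its domain.
    print(apply_T(["John", "will", "try"]))   # will John try
    print(apply_T(["John", "past", "try"]))   # past John try ("did John try")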
(For a more thorough and adequate discussion of this problem, see Chomsky, 1955.) We must also extend the manner in which the domain of a transforma- tion and the proper analysis of the transformed string is specified. First, there is no need to require that the terms of a structure index be single symbols. Second, we can allow the specification of a transformation to be given by a finite set of structure indices. More generally, we can specify the domain of a transformation simply by a structural condition based on the predicate Analyzable, defined above. In terms of this notion, we can define identity of terminal strings and can allow terms of the structure index to remain unspecified. With these and several other extensions, it is possible to provide an explicit and precise basis for transformational grammar. 5.3 The Constituent Structure of Transformed Strings A grammatical transformation is determined by a structural condition stated in terms of the predicate Analyzable and by an elementary trans- formation. It has been remarked, however, that a transformation must produce not merely strings but derived P-markers. We must, therefore, show how constituent structure is assigned to the terminal string formed by a transformation. The best way to assign derived P-markers appears to be by a set of rules that would form part of general linguistic theory rather than by an additional clause appended to the specification of each transformation. Precise statement of those rules would require an analysis of fundamental notions going well beyond the informal account we have 304 FORMAL ANALYSIS OF NATURAL LANGUAGES sketched. (In this connection, see Chomsky, 1955; Matthews, 1962; Postal, 1962.) Nevertheless, certain features of a general solution to this problem seem fairly clear. We can, first of all, assign each transformation to one of a small number of classes, depending on the underlying ele- mentary transformation on which it is based. For each class we can state a general rule that assigns to the transform a derived P-marker, the form of which depends, in a fixed way, on the P-markers of the underlying ter- minal strings. A few examples will illustrate the kinds of principles that seem necessary. The basic recursive devices in the grammar are the generalized trans- formations that produce a string from a pair of underlying strings. (Ap- parently there is a bound on the number of singulary transformations that can apply in sequence.) Most generalized transformations are based on elementary transformations that substitute a transformed version of the second of the pair of underlying terminal strings for some term of the proper analysis of the first of this pair. [In the terminology suggested by Lees (1960) these are the constituent string and the matrix string, respec- tively.] In such a case a single general principle seems to be sufficient to determine the derived constituent structure of the transform. Suppose that the transformation replaces the symbol a of a: (the matrix sentence) by cr2 (the constituent sentence). The P-marker of the result is simply the former P-marker of c^ with a replaced by the P-marker of a2. All other generalized transformations are attachment transformations that take a term a of the proper analysis, with the term /? of the structure index that most remotely dominates it (and all intermediate parts of the P-marker that are dominated by /? and that dominate a), and attaches it (with, perhaps, a constant string) to some other term of the proper analysis. 
In this way we form, for example, John is old and sad with the P-marker (Fig. 5) from John is old, John is sad by a transformation with the structure index (NP, is, A9##, NP, is, A). Singulary transformations are often just permutations of terms of the proper analysis. For example, one transformation converts Fig. 6a into Fig, 6b. The general principle of derived constituent structure in this case is simply that the minimal change is made in the P-marker of the under- lying string, consistent with the requirement that the resulting P-marker again be representable in tree form. The transformation that gives Turn some of the lights out is based on an elementary transformation that per- mutes the second and third terms of a three-termed proper analysis; it has the structure index (V, Prt, NP) (this is, of course, a special case of a more general rule). Figure 6 illustrates a characteristic effect of permutations, namely, that they tend to reduce the amount of structure associated with the terminal string to which they apply. Thus, although Fig. 6a represents the kind of TRANSFORMATIONAL GRAMMARS 305 old and sad Fig. 5. P-marker resulting from the inter- pretation of and by an attachment trans- formation purely binary structure regarded as paradigmatic in most linguistic theories, in Fig. 6b there is one less binary split and one new ternary division; and Prt is no longer dominated by Verb. Although binary divi- sions are, by and large, characteristic of the simple structural descriptions generated by the constituent-structure grammar, they are rarely found in jP-markers associated with actual sentences. A transformational approach to syntactic description thus allows us to express the element of truth contained in the familiar theories of immediate constituent analysis, with their emphasis on binary splitting, without at the same time committing us to an arbitrary assignment of superfluous structure. Furthermore, by continued use of attachment and permutation transformations in the VP VP Verb A NP Verb NP Prt Determ N Determ Prt N out turn out Quant Art lights A turn Quant Art lights some of the some of the (a) (b) Fig. 6. The singular/ transformation that carries (a) into (b) is a permutation; its effect is to reduce slightly the amount of structure associated with the sentence. FORMAL ANALYSIS OF NATURAL LANGUAGES manner illustrated it is possible to generate classes of P-markers that cannot in principle be generated by constituent-structure grammars (in particular, those associated with the coordinate constructions such as Example 20). Similarly, it is not difficult to see how a transformational approach can deal with the problem noted in connection with Example 18 and others like it (cf. Sec. 2.2 of Chapter 13), although the difficulty of actually working out adequate analyses should not be underestimated. Some singulary transformations simply add constant strings at a designated place in the proper analysis ; others delete certain terms of the proper analysis. The first are treated just like attachment transformations. In the case of deletion we delete nodes that dominate no terminal string and leave the P-marker otherwise unchanged. Apparently it is possible to restrict the application of deletion transformations in a rather severe way. Restrictions on the applicability of deletion transformations play a fundamental role in determining the kinds of languages that can be generated by transformational grammars. 
In summary, then, a transformational grammar consists of a finite sequence of context-sensitive rewriting rules -^-y> and a finite number of transformations of the type just described, together with a statement of the restrictions on the order in which those transformations are applied. The result of a transformation is generally available for further transfor- mation, so that an indefinite number of P-markers of quite varied kinds can be generated by repeated application of transformations. At each stage a P-marker representable as a labeled tree is associated with the terminal string so far derived. The full structural description of a sentence will consist not only of its P-marker but also of theP-markers of its under- lying C-terminal strings and its transformational history. For further discussion of the role of such a structural description in determining the way a sentence is understood, see Sec. 2.2 of Chapter 13. 6. SOUND STRUCTURE We regard a grammar as having two fundamental components, a syntactic component of the kind we have already described and a phono- logical component to which we now briefly turn our attention. 6.1 The Role of the Phonological Component The syntactic component of a grammar contains rewriting rules and transformational rules, formulated and organized in the manner described SOUND STRUCTURE in Sees. 4 and 5, and it gives as its output terminal strings with structural descriptions. The structural description of a terminal string contains, in particular, a derived P-marker that assigns to this string a labeled bracket- ing; in the present section we consider only this aspect of the structural description. We therefore limit attention to such items as the following, taken (with many details omitted) as an example of the output of the syntactic component: ULvpLv# Ted#J^VP[rp[r# see]F past# LVP[D*# the dem p\#]Det [x# booklv pl#lvp]ppfc. (where the symbol # is used to indicate the word boundaries). The terminal string, Example 25, is a representation of the sentence Ted saw those books, which we might represent on the phonetic level in the following manner: 1231 the • d + sow + 5swz + buks, (26) (where the numerals indicate stress level) again omitting many refinements, details, discussions of alternatives, and many phonetic features to which we pay no attention in these brief remarks. Representations such as Example 26 identify an utterance in a rather direct manner. We can assume that these representations are given in terms of a universal phonetic system which consists of a phonetic alphabet and a set of general phonetic laws. The symbols of the phonetic alphabet are defined in physical (i.e., acoustic and articulatory) terms; the general laws of the universal phonetic system deal with the manner in which physical items represented by these symbols may combine in a natural language. The universal phonetic system, much like the abstract definition of generative grammar suggested in Sees, 4 and 5, is a part of general linguistic theory rather than a specific part of the grammar of a particular language. Just as in the case of the other aspects of the general theory of linguistic structure, a particular formulation of the universal phonetic system represents a hypothesis about linguistic universals and can be regarded as a hypothesis concerning some of the innate data-processing and concept-forming capacities that a child brings to bear in language learning. 
The role of the phonological component of a generative grammar is to relate representations such as Examples 25 and 26; that is to say, the phonological component embodies those processes that determine the phonetic shape of an utterance, given the morphemic content and general syntactic structure of this utterance (as in Example 25). As distinct from the syntactic component, it plays no part in the formulation of new utter- ances but merely assigns to them a phonetic shape. Although Investigation of the phonological component does not, therefore, properly form a part 308 FORMAL ANALYSIS OF NATURAL LANGUAGES of the study of mathematical models for linguistic structure, the processes by which phonetic shape is assigned to utterances have a great deal of independent interest. We shall indicate briefly some of their major features. Our description of the phonological component follows closely Halle (1959a, 1959b) and Chomsky (1959, 1962a). 6.2 Phones and Phonemes The phonological component can be thought of as an input-output device that accepts a terminal string with a labeled bracketing and codes it as a phonetic representation. The phonetic representation is a sequence of symbols of the phonetic alphabet, some of which (e.g., the first three of Example 26) are directly associated with physically defined features, others (e.g., the symbol + in Example 26), with features of transition. Let us call the first kind phonetic segments and the second kind phonetic junctures. Let us consider more carefully the character of the phonetic segments. Each symbol of the universal phonetic alphabet is an abbreviation of a certain set of physical features. For example, the symbol [ph] represents a labial aspirated unvoiced stop. These symbols have no independent status in themselves; they merely serve as notational abbreviations. Consequently a representation such as Example 26, and, in general, any phonetic representation, can be most appropriately regarded as & phonetic matrix: the rows represent the physical properties that are considered primitive in the linguistic theory in question and the columns stand for successive segments of the utterance (aside from junctures). The matrix element (/,/) indicates whether (or to what degree) the yth segment has the zth property. The phonetic segments thus correspond to columns of a 3 matrix. In Example 26 the symbol [9] might be an abbreviation for the column [vocalic, nonconsonantal, grave, compact, unrounded, voiced, lax, tertiary stress, etc.], assuming a universal phonetic theory based on features that have been proposed by Jakobson as constituting a universal phonetic system. Matrices with such entries constitute the output of the phonological component of the grammar. What is the input to the phonological component? The terminal string Example 25 consists of lexical morphemes, such as Ted, book; grammatical morphemes^ such a.spast, plural; and certain /wnc/wra/ elements, such as#. The junctural elements are introduced by rules of the syntactic component in order to indicate positions in which morphological and syntactic structures have phonetic effects. They can, in fact, be regarded as gram- matical morphemes for our purposes. Each grammatical morpheme is in general, represented by a single terminal symbol, unanalyzed into SOUND STRUCTURE features. 
On the other hand, the lexical morphemes are represented rather by strings of symbols that we call phonemic segments or simply phonemes.5 Aside from the labeled brackets, then, the input to the phonological component is a string consisting of phonemes and special symbols for grammatical morphemes. The representation in Example 25 is essentially accurate, except for the fact that lexical morphemes are given in ordinary orthography instead of in phonemic notation. Thus Ted, see, the, book, should be replaced by /ted/, /si/, /5I/, /buk/, respectively. We have, of course, given so little detail in Example 26 that phonetic and phonemic segments are scarcely distinguished in this example. We shall return shortly to the question: What is the relation between phonemic and phonetic segments? Observe for now that there is no re- quirement so far that they be closely related. Before going on to consider the status of the phonemic segments more carefully, we should like to warn the reader that there is considerable divergence of usage with regard to the terms phoneme, phonetic representa- tion, etc., in the linguistic literature. Furthermore, that divergence is not merely terminological; it reflects deep-seated differences of opinion, far from resolved today, regarding the real nature of sound structure. This is obviously not the place to review these controversies or to discuss the evidence for one or another position. (For detailed discussion of these questions, see Halle, 1959b, Chomsky, 1962b, and the forthcoming Sound Pattern of English by Halle & Chomsky.) In the present discussion our underlying conceptions of sound structure are close to those of the foun- ders of modern phonology but diverge quite sharply from the position that has been more familiar during the last twenty years, particularly in the United States — a position that is often called neo-Bloomfieldian. In particular, our present usage of the term phoneme is much like that of Sapir (e.g., Sapir, 1933), and our notion of a universal phonetic system has its roots in such classical work as Sweet (1877) and de Saussure (1916 — the Appendix to the Introduction, which dates, in fact, from 1897). What we, following Sapir, call phonemic representation is generally called morphophonemic today. It is generally assumed that there is a level of representation intermediate between phonetic and morphophonemic, this new intermediate level usually being called phonemic. However, there seems to us good reason to reject the hypothesis that there exists an intermediate level of this sort and to reject, as well, many of the assump- tions concerning sound structure that are closely interwoven with this hypothesis in many contemporary formulations of linguistic theory. 5 More precisely, we should take the phonemes to be the segments that appear at the stage of a derivation at which all grammatical morphemes have been eliminated by the phonological rules. 5/O FORMAL ANALYSIS OF NATURAL LANGUAGES Clearly, we should attempt to discover general rules that apply to such large classes of elements as consonants, stops, voiced segments, etc., rather than to individual elements. We should, in short, try to replace a mass of separate observations by simple generalizations. Since the rules will apply to classes of elements, elements must be identified as members of certain classes. Thus each phoneme will belong to several overlapping categories in terms of which the phonological rules are stated. 
In fact, we can represent each phoneme simply by the set of categories to which it belongs; in other words, we can represent each lexical item by a classificatory matrix in which columns stand for phonemes and rows for categories and the entry (i, j) indicates whether or not phoneme j belongs to category i. Each phoneme is now represented as a sequence of categories, which we can call distinctive features, using one of the current senses of this term. Like the phonetic symbols, the phonemes have no independent status in themselves. It is an extremely important and by no means obvious fact that the distinctive features of the classificatory phonemic matrix define categories that correspond closely to those determined by the rows of the phonetic matrices. This point was noted by Sapir (1925) and has been elaborated in recent years by Jakobson, Fant, and Halle (1952) and by Jakobson and Halle (1956); it is an insight that has its roots in the classical linguistics that flourished in India more than two millennia ago.

6.3 Invariance and Linearity Conditions

The input to the phonological component thus consists, in part, of distinctive-feature matrices representing lexical items; and the output consists of phonetic matrices (and phonetic junctures). What is to be the relation between the categorial, distinctive-feature matrix that constitutes the input and the corresponding phonetic matrix that results from application of the phonological rules? What is to be the relation, for example, between the input matrix abbreviated as /ted/ (where each of the symbols /t/, /e/, /d/ stands for a column containing a plus in a given row if the symbol in question belongs to the category associated with that row, a minus if the symbol is specified as not belonging to this category, and a blank if the symbol is unspecified with respect to membership in this category) and the output matrix abbreviated as [tʰe·d] (where each of the symbols [tʰ], [e·], [d] stands for a column, the entries of which indicate phonetic properties)?

The strongest requirement that could be imposed would be that the input classificatory matrix must literally be a submatrix of the output phonetic matrix, differing from it only by the deletion of certain redundant entries. Thus the phonological rules would fill in the blanks of the classificatory matrix to form the corresponding phonetic matrix. This strong condition, for example, is required by Jakobson and, implicitly, by Bloch in their formulations of phonemic theory.6 If this condition is met, then phonemic representation will satisfy what we can call the invariance condition and the linearity condition. By the linearity condition we refer to the requirement that each phoneme must have associated with it a particular stretch of sound in the represented utterance and that, if phoneme A is to the left of phoneme B in the phonemic representation, the stretch associated with A precedes the stretch associated with B in the physical event. (We are limiting ourselves here to what are called segmental phonemes, since we are regarding the so-called supra-segmentals as features of them.) The invariance condition requires that to each phoneme A there be associated a certain defining set Σ(A) of physical phonetic features, such that each variant (allophone) of A has all the features of Σ(A), and no phonetic segment which is not a variant (allophone) of A has all of the features of Σ(A).
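The two kinds of matrices, and the "fill in the blanks" conception of the phonological rules under the strong submatrix requirement, can be pictured schematically as follows. This is a minimal sketch, not the formulation of the text: the feature names, the particular plus, minus, and blank entries, and the two toy redundancy rules are invented solely for illustration.

```python
# Classificatory (phonemic) matrix for /ted/: one column (dict) per phoneme;
# entries are '+', '-', or None (blank, i.e., unspecified and redundant).
phonemic = {
    't': {'vocalic': '-', 'consonantal': '+', 'voiced': '-', 'tense': None},
    'e': {'vocalic': '+', 'consonantal': '-', 'voiced': None, 'tense': '-'},
    'd': {'vocalic': '-', 'consonantal': '+', 'voiced': '+', 'tense': None},
}

def fill_redundant(column):
    """Toy 'phonological rules': vowels are redundantly voiced; anything left
    unspecified for tenseness is lax.  (Illustrative rules only.)"""
    filled = dict(column)
    if filled['voiced'] is None and filled['vocalic'] == '+':
        filled['voiced'] = '+'
    if filled['tense'] is None:
        filled['tense'] = '-'
    return filled

def is_submatrix(phonemic_col, phonetic_col):
    """The strong requirement: every specified phonemic entry recurs
    unchanged in the corresponding phonetic column."""
    return all(v is None or phonetic_col[f] == v for f, v in phonemic_col.items())

phonetic = {seg: fill_redundant(col) for seg, col in phonemic.items()}
print(all(is_submatrix(phonemic[s], phonetic[s]) for s in phonemic))   # True
```

In this toy case the rules only add entries, so the submatrix condition holds; the discussion that follows shows why real phonological rules cannot be restricted in this way.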
If both the invariance and linearity conditions were met, the task of building machines capable of recognizing the various phonemes in normal human speech would be greatly simplified. Correctness of these conditions would also suggest a model of perception based on segmentation and classification and would lend support to the view that the methods of analysis required in linguistics should be limited to segmentation and classification. However, correctness of these requirements is a question of fact, not of decision, and it seems to us that there are strong reasons to doubt that they are correct. Therefore, we shall not assume that for each phoneme there must be some set of phonetic properties that uniquely identifies all of its variants and that these sets literally occur in a temporal sequence corresponding to the linear order of phonemes.

6 They would not regard what they call phonemic representations as the input to the phonological component. However, as previously mentioned, we see no way of maintaining the view that there is an intermediate representation of the type called "phonemic" by these and other phonologists.

We cannot go into the question in detail, but a single example may illustrate the kind of difficulty that leads us to reject the linearity and invariance conditions. Clearly the English words write and ride must appear in any reasonable phonemic representation as /rayt/ and /rayd/, respectively — that is, they differ phonemically in voicing of the final consonant. They differ phonetically in the vowel also. Consider, for example, a dialect in which write is phonetically [rayt] and ride is phonetically [ra·yd], with the characteristic automatic lengthening before voiced consonants. To derive the phonetic from the phonemic representation in this case, we apply the phonetic rule,

vowels become lengthened before voiced segments,    (27)

which is quite general and can easily be incorporated into our present framework. Consider now the words writer and rider in such a dialect. Clearly, the syntactic component will indicate that writer is simply write + agent and rider is simply ride + agent, where the lexical entries write and ride are exactly as given; that is, we have the phonemic representations /rayt + r/, /rayd + r/ for writer, rider, respectively. However, there is a rather general rule that the phonemes /t/ and /d/ merge in an alveolar flap [D] in several contexts, in particular, after main stress as in writer and rider. Thus the grammar for this dialect may contain the phonetic rule,

[t, d] → D after main stress.    (28)

Applying Rules 27 and 28, in this order, to the phonemic representations /rayt + r/, /rayd + r/, we derive first [rayt + r], [ra·yd + r], by Rule 27, and eventually [rayDr], [ra·yDr], by Rule 28, as the phonetic representations of the words writer, rider. Note, however, that the phonemic representations of these words differ only in the fourth segment (voiced consonant versus unvoiced consonant), whereas the phonetic representations differ only in the second segment (longer vowel versus shorter vowel). Consequently, it seems impossible to maintain that a sequence of phonemes A1 . . . Am is associated with the sequence of phonetic segments a1 . . . am, where ai contains the set of features that uniquely identify Ai in addition to certain redundant features. This is a typical example that shows the untenability of the linearity and invariance conditions for phonemic representation.
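The derivation just given can be carried out quite mechanically; the point is only that Rule 27 must precede Rule 28. The following sketch, with a deliberately simplified segment inventory, voicing classification, and stress bookkeeping that are assumptions of the illustration rather than part of the original, applies the two rules in order to /rayt + r/ and /rayd + r/.

```python
# Ordered application of Rules 27 and 28 to the writer/rider example.
VOWELS = {'ay', 'a·y'}
VOICED = {'d', 'r', 'ay', 'a·y'}          # toy classification for this example

def next_segment(segments, i):
    """First following segment that is not the morpheme boundary '+'."""
    for seg in segments[i + 1:]:
        if seg != '+':
            return seg
    return None

def rule_27(segments):
    """Vowels become lengthened before voiced segments (only the diphthong
    of these examples is handled)."""
    out = list(segments)
    for i, seg in enumerate(out):
        if seg in VOWELS and next_segment(out, i) in VOICED:
            out[i] = 'a·y'
    return out

def rule_28(segments):
    """[t, d] -> D after main stress (here: immediately after the stressed vowel)."""
    out = list(segments)
    for i, seg in enumerate(out):
        if seg in ('t', 'd') and i > 0 and out[i - 1] in VOWELS:
            out[i] = 'D'
    return out

def derive(phonemic):
    return rule_28(rule_27(phonemic))      # the order 27-then-28 is essential

writer = ['r', 'ay', 't', '+', 'r']
rider  = ['r', 'ay', 'd', '+', 'r']
print(''.join(s for s in derive(writer) if s != '+'))   # rayDr
print(''.join(s for s in derive(rider)  if s != '+'))   # ra·yDr
```

The outputs differ only in vowel length, although the inputs differed only in the voicing of the fourth segment, which is exactly the mismatch that defeats the linearity and invariance conditions.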
It follows that phonemes cannot be derived from phonetic representations by simple procedures of segmentation and classification by criterial attributes, at least as these are ordinarily construed. Notice, incidentally, that we have nowhere specified that the phonetic features constituting the universal system must be defined in absolute terms. Thus one of the universal features might be the feature "front versus back" or "short versus long." If a phonetic segment A differs from a phonetic segment B only in that A has the feature "short" whereas B has the feature "long," this means that in any particular context X — Y the longer element is identified as B and the shorter as A. It may be that A in one context is actually as long as or longer than B in another context. Many linguists, however, have required that phonetic features must be defined in absolute terms. Instead of the feature "short versus long," they require us to identify the absolute length (to some approximation) of each SOUND STRUCTURE segment. If we add this requirement to the invariance condition, we conclude that even partial overlapping of phonemes— that is, assignment of a phone a to a phoneme B in one context and to the phoneme C in a different context, in which the choice is contextually determined— cannot be tolerated. Such, apparently, is the view of Bloch (1948, 1950). This is an extremely restrictive assumption which is invalidated not only by such examples as the one we have just given but by a much wider range of examples of partial overlapping (see Bloch, 1940, for examples). In fact, work in acoustic phonetics (Liberman, Delattre, & Cooper, 1952; Schatz, 1954) has shown that if this condition must be met, where features are defined in auditory and acoustic terms (as proposed in Bloch, 1950), then not even the analysis of the stops /p, t, k/ can be maintained, since they overlap, a consequence that is surely a reduction to absurdity. The requirements of relative or of absolute invariance both suggest models for speech perception, but the difficulty (or impossibility) of main- taining either of these requirements suggests that these models are in- correct and leads to alternative proposals of a kind to which we shall return. We return now to the main theme. 6.4 Some Phonological Rules We have described the input to the phonological component of the grammar as a terminal string consisting of lexical morphemes, grammatical morphemes, and junctures, with the constituent structure marked. This component gives as its output a phonetic matrix in which the columns stand for successive segments and the rows for phonetic features. Obviously, we want the rules of the phonological component to be as few and general as possible. In particular, we prefer rules that apply to large and to natural classes of elements and that have a simple and brief specification of relevant context. We prefer a set of rules in which the same classes of elements figure many times. These and other requirements are met if we define the com- plexity of the phonological component in terms of the number of features mentioned in the rules, where the form of rules is specified in such a way as to facilitate valid generalizations (Halle, 1961). We then choose simpler (more general) grammars over more complex ones with more feature specifications (more special cases). 
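The evaluation measure just mentioned, counting the feature specifications mentioned in the rules, can be made concrete in a small sketch. This is only an illustration of the idea, not Halle's formal proposal; the feature bundles and the two alternative rule sets are invented for the purpose.

```python
# Cost of a set of rules = number of feature specifications mentioned.
def cost(rules):
    """Each rule is (target_features, change_features, context_features)."""
    return sum(len(t) + len(c) + len(ctx) for t, c, ctx in rules)

# One general rule stated over natural classes (cf. Rule 27):
general = [
    ({'vocalic': '+'}, {'long': '+'}, {'voiced': '+'}),
]

# The same observations itemized segment class by segment class:
itemized = [
    ({'vocalic': '+', 'grave': '+', 'diffuse': '-'}, {'long': '+'},
     {'voiced': '+', 'consonantal': '+'}),
    ({'vocalic': '+', 'grave': '-', 'diffuse': '+'}, {'long': '+'},
     {'voiced': '+', 'consonantal': '+'}),
]

print(cost(general), cost(itemized))   # 3 12 -> the general statement is simpler
```

Under such a measure the grammar that states the generalization once, over a natural class, is automatically preferred to one that repeats the same information case by case.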
The problem of phonemic analysis is to assign to each utterance a phonemic representation, consisting of matrices in which the columns stand for phonemes and the rows for distinctive (classificatory) features, and to discover the simplest set of rules (where simplicity is a well-defined 314 FORMAL ANALYSIS OF NATURAL LANGUAGES formal notion) that determine the phonetic matrices corresponding to given phonemic representations. There is no general requirement that the linearity and invariance conditions will be met byphonemic representations. It is therefore an interesting and important observation that these condi- tions are, in fact, substantially met, although there is an important class of exceptions. In order to determine a phonetic representation, the phonological rules must utilize other information outside the phonemic representation; in particular, they must utilize information about its constituent structure. Consequently, it is in general impossible for a linguist (or a child learning the language) to discover the correct phonemic representation without an essential use of syntactic information. Similarly, it would be expected that in general the perceiver of speech should utilize syntactic cues in deter- mining the phonemic representation of a presented utterance — he should, in part, base his identification of the utterance on his partial understanding of it, a conclusion that is not at all paradoxical. The phonological component consists of (1) a sequence of rewriting rules, including, in particular, a subsequence of morpheme structure rules, (2) a sequence of transformational rules, and (3) a sequence of rewriting rules that we can call phonetic rules. They are applied to a terminal string in the order given. Morpheme structure rules enable us to simplify the matrices that specify the individual lexical morphemes by taking advantage of general properties of the whole set of matrices. In English, for example, if none of the three initial segments of a lexical item is a vowel, the first must be /s/, the second a stop, and the third a liquid or glide. This information need not therefore be specified in the matrices that represent such morphemes as string and square. Similarly, the glide ending an initial consonant cluster need not be further specified, since it is determined by the following vowel; except after /s/, it is /y/ if followed by /u/, and it is /w/ otherwise. Thus we have cure and queer but not /kwur/ or /kyir/. There are many other rules of this sort. They permit us to reduce the number of features mentioned in the grammar, since one morpheme structure rule may apply to many matrices, and they thus contribute to simplicity, as previously defined. (Incidentally, the morpheme structure rules enable us to make a distinction on a prin- cipled and non ad hoc basis between permissible and nonpermissible nonsense syllables.) Transformational phonemic rules determine the phonetic effects of constituent structure. (Recall that the fundamental feature of transforma- tional rules, as they have been defined, is that they apply to a string by virtue of the fact that it has a particular constituent structure.) In English there is a complex interplay of rules of stress assignment and vowel reduction SOUND STRUCTURE that leads to a phonetic output with many degrees of stress and an intricate distribution of reduced and unreduced vowels (Chomsky, Halle, & Lukoff, 1956; Halle & Chomsky, forthcoming). 
These rules involve constituent structure in an essential manner at both the morphological and the syntactic level; consequently, they must be classified as transformational rather than rewriting rules. They are ordered, and apply in a cycle, first to the smallest constituents (that is, lexical morphemes), then to the next larger ones, and so on, until the largest domain of phonetic processes is reached. It is a striking fact, in English at least, that essentially the same rules apply both inside and outside the word. Thus we have only a single cycle of transformational rules, which, by repeated application, determines the phonetic form of isolated words as well as of complex phrases. The cyclic ordering of these rules, in effect, determines the phonetic structure of a complex form, whether morphological or syntactic, in terms of the phonetic structure of its underlying elements.

The rules of stress assignment and vowel reduction are the basic elements of the transformational cycle in English. Placement of main stress is determined by constituent type and final affix. As main stress is placed in a certain position, all other stresses in the construction are automatically weakened. Continued reapplication of this rule to successively larger constituents of a string with no original stress indications can thus lead to an output with a many-leveled stress contour. A vowel is reduced to [ɨ] in certain phonemic positions if it has never received main stress at an earlier stage of the derivation or if successive cycles have weakened its original main stress to tertiary (or, in certain positions, to secondary). The rule of vowel reduction applies only once in the transformational cycle, namely, when we reach the level of the word.

A detailed discussion of these rules is not feasible within the limits of this chapter, but a few comments may indicate how they operate. Consider, in particular, the following four rules,7 which apply in the order given:

A substantive rule that assigns stress in initial position in nouns (also stems) under very general circumstances.    (29a)

A nuclear stress rule that makes the last main stress dominant, thus weakening all other stresses in the construction.    (29b)

The vowel reduction rule.    (29c)

A rule of stress adjustment that weakens all nonmain stresses in a word by one.    (29d)

7 These differ somewhat from the rules that would appear in a more detailed and general grammar. See Halle & Chomsky (forthcoming) for details.

From the verbs permit, torment, etc., we derive the nouns permit, torment in the next transformational cycle by the substantive rule, the stress on the second syllable being automatically weakened to secondary. The rule of stress adjustment gives primary-tertiary as the stress sequence in these cases. The second syllable does not reduce to [ɨ], since it is protected by secondary stress at the stage at which the rule of vowel reduction applies. Thus for permit, torment we have the following derivations:

1. [N [V per + mit]V ]N          [N [V torment]V ]N

                1                            1
2. [N [V per + mit]V ]N          [N [V torment]V ]N

                1                         1
3. [N per + mit]N                [N torment]N

       1        2                    1    2
4. [N per + mit]N                [N torment]N

       1        2                    1    2
5. per + mit                     torment

       1        3                    1    3
6. per + mit                     torment

       1        3                    1    3
7. pʰermit                       tʰorment

Line 1 is the phonemic, line 7 the phonetic representation (details omitted). Line 2 is derived by a general rule (that we have not given) for torment and by Rule 29b for permit (since the heaviest stress in this case is zero). Line 3 terminates the first transformational cycle by erasing innermost brackets.
Line 4 results from Rule 29a. Line 5 terminates the second transformational cycle, erasing innermost brackets. Line 6 results from Rule 29d (29c being inapplicable because of secondary stress on the second vowel), and line 7 results from other phonetic rules.

Consider, in contrast, the word torrent. This, like torment, has phonemic /e/ as its second vowel (cf. torrential), but it is not, like torment, derived from a verb torrent. Consequently, the second vowel does not receive main stress on the first cycle; it will therefore reduce by Rule 29c to [ɨ]. Thus we have reduced and unreduced vowels contrasting in torment-torrent as a result of a difference in syntactic analysis. Initial stress in torrent is again a result of Rule 29a.

The same rule that forms the nouns permit and torment from the verbs permit and torment changes the secondary stress of the final syllable of the verb advocate to tertiary, so that it is reduced to [ɨ] by the rule of vowel reduction 29c. Thus we have reduced and unreduced vowels contrasting in the noun advocate and the verb advocate and generally with the suffix -ate. Exactly the same rules give the contrast between reduced and unreduced vowels in the noun compliment ([. . . mɨnt]) and the verb compliment ([. . . ment]) and similar forms.

Now consider the word condensation. In an early cycle we assign main stress to the second syllable of condense. In the next cycle the rules apply to the form condensation as a whole, this being the next larger constituent. The suffix -ion always assigns main stress to the immediately preceding syllable, in this case, ate. Application of this rule weakens the syllable dens to secondary. The rule of vowel reduction does not apply to this vowel, since it is protected by secondary stress. Another rule of some generality replaces an initial stress sequence xx1 by 231, and the rule of stress adjustment gives the final contour 3414. Thus the resulting form has a nonreduced vowel in the second syllable with stress four. Consider, in contrast, the word compensation. The second vowel of this word, also phonemically /e/ (cf. compensatory), has not received stress in any cycle before the word level at which the rule of vowel reduction applies (i.e., it is not derived from compense as condensation is derived from condense). It is therefore reduced to [ɨ]. We thus have a contrast of reduced and unreduced vowels with weak stress in compensation-condensation as an automatic, though indirect, effect of difference in constituent structure.

As a final example, to illustrate the interweaving of Rules 29a and 29b as syntactic patterns grow more complex, consider the phrases John's blackboard eraser, small boys' school (meaning small school for boys), and small boys school (meaning school for small boys). These have the following derivations, after the initial cycles which assign main stress within the words:

I.
       1         1     1       1
   1. [NP John's [N [N black board]N eraser]N ]NP
       1         1     2       1
   2. [NP John's [N black board eraser]N ]NP
      (applying Rule 29a to the innermost constituent and erasing brackets)
       1         1     3       2
   3. [NP John's black board eraser]NP
      (applying Rule 29a to the innermost constituent and erasing brackets)
       2         1     4       3
   4. John's black board eraser
      (applying Rule 29b and erasing brackets)

II.
       1        1     1
   1. [NP small [N boys' school]N ]NP
       1        1     2
   2. [NP small boys' school]NP
      (applying Rule 29a to the innermost constituent and erasing brackets)
       2        1     3
   3. small boys' school
      (applying Rule 29b and erasing brackets)

III.
       1        1     1
   1. [N [NP small boys]NP school]N
       2        1     1
   2. [N small boys school]N
      (applying Rule 29b to the innermost constituent and erasing brackets)
       3        1     2
   3. small boys school
      (applying Rule 29a and erasing brackets)
       3        1     3
   4. small boys school
      (by a rule of wide applicability that we have not given).
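The cyclic character of these derivations can be made vivid with a minimal sketch. The sketch below is not the authors' formulation: it assumes, for these examples only, that constituents labeled N take Rule 29a and all other constituents take Rule 29b, and it omits vowel reduction (29c), stress adjustment (29d), and the final rule of wide applicability. Under those assumptions it reproduces the contours 2 1 4 3 and 3 1 2 derived above.

```python
# Cyclic stress assignment: apply a rule to each constituent, innermost first.
# A constituent is (label, children); a leaf is a word entering the cycle
# with main stress 1.  Stress values: 1 = primary, larger = weaker.

def weaken(stresses, keep):
    """Weaken every stressed word except the one at position `keep` by one."""
    return [s if i == keep else s + 1 for i, s in enumerate(stresses)]

def rule_29a(stresses):
    """Substantive rule: the leftmost primary stress becomes dominant."""
    return weaken(stresses, stresses.index(1))

def rule_29b(stresses):
    """Nuclear stress rule: the last (rightmost) primary stress becomes dominant."""
    return weaken(stresses, len(stresses) - 1 - stresses[::-1].index(1))

def cycle(node):
    """Return (words, stresses) after applying the cycle bottom-up."""
    if isinstance(node, str):
        return [node], [1]                         # word-level cycle already done
    label, children = node
    words, stresses = [], []
    for child in children:
        w, s = cycle(child)
        words += w
        stresses += s
    rule = rule_29a if label == 'N' else rule_29b  # assumption for these examples
    return words, rule(stresses)

eraser = ('NP', ["John's", ('N', [('N', ['black', 'board']), 'eraser'])])
school = ('N', [('NP', ['small', 'boys']), 'school'])
print(cycle(eraser))   # (['John's', 'black', 'board', 'eraser'], [2, 1, 4, 3])
print(cycle(school))   # (['small', 'boys', 'school'], [3, 1, 2])
```

The point of the illustration is simply that the many-leveled contours arise from repeated application of two very simple rules, with the order of application dictated by the bracketing.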
In short, a phonetic output that has an appearance of great complexity and disorder can be generated by systematic cyclic application of a small number of simple transformational rules, where the order of application is determined by what we know, on independent grounds, to be the syntactic structure of the utterance. It seems reasonable, therefore, to assume that rules of this kind underlie both the production and perception of actual speech. On this assumption we have a plausible explanation for the fact that native speakers uniformly and consistently produce and identify new sentences with these intricate physical characteristics (without, of course, any conscious awareness of the underlying processes or their phonetic effects).

This suggests a somewhat novel theory of speech perception — that identifying an observed acoustic event as such-and-such a particular phonetic sequence is, in part, a matter of determining its syntactic structure (to this extent, understanding it). A more usual view is that we determine the phonetic and phonemic constitution of an utterance by detecting in the sound wave a sequence of physical properties, each of which is the defining property of some particular phoneme; we have already given some indication why this view (based on the linearity and invariance conditions for phonemic representation) is untenable.

We might imagine a sentence-recognizing device (that is to say, a perceptual model) that incorporates both the generative rules of the grammar and a heuristic component that samples an input to extract from it certain cues relating to the rules used to generate it, selecting among alternative possibilities by a process of successive approximation. With this approach, there is no reason to assume that each segmental unit has a particular defining property or, for that matter, that speech segments literally occur in sequence at all. Moreover, it avoids the implausible assumption that there is one kind of grammar for the talker and another kind for the listener. Such an approach to perceptual processes has occasionally been suggested recently (MacKay, 1951; Bruner, 1958; Halle & Stevens, 1959, 1962; Stevens, 1960 — with regard to speech perception, this view was in fact proposed quite clearly by Wilhelm von Humboldt, 1836). Recognition and understanding of speech is an obvious topic to study in developing this idea. On the basis of the sampled cues, a hypothesis can be formed about the spoken input; from this hypothesis an internal representation can be generated; by comparison of the input and the internally generated representation the hypothesis can be tested; as a result of the test, the hypothesis can be accepted or revised (cf. Sec. 2 of Chapter 13). Although speech perception is extremely complex, it is natural to normal adult human beings, and it is unique among complex perceptual processes in that we have, in this case at least, the beginnings of a plausible and precise generative theory of the organizing principles underlying the input stimuli.

References

Berge, C. Théorie des graphes et ses applications. Paris: Dunod, 1958.
Bloch, B. A set of postulates for phonemic analysis. Language, 1948, 24, 3-46.
Bloch, B. Phonemic overlapping. Am. Speech, 1940, 16, 278-84. Reprinted in M.
Joos (Ed.), Readings in linguistics. Washington: Am. Counc. Learned Socs., 1957. Pp. 93-96. Bloch, B. Studies in colloquial Japanese IV: Phonemics. Language, 1950, 26, 86-125. Reprinted in M. Joos (Ed.), Readings in linguistics. Washington: Am. Counc. Learned Socs., 1957. Pp. 329-348. Bruner, J. S. Neural mechanisms in perception. In H. C. Solomon, S. Cobb, & W. Penfield (Eds.), The brain and human behavior. Baltimore: Williams and Wilkins, 1958. Pp. 118-143. Chomsky, N. Logical structure of linguistic theory. Microfilm, Mass. Inst. Tech. Library, 1955. Chomsky, N. Three models for the description of language. IRE Trans, on Inform. Theory., 1956, IT-2, 113-124. Chomsky, N. Syntactic structures. The Hague: Mouton, 1957. Chomsky, N. The transformational basis of syntax. In A. A. Hill (Ed.), IVth Univer. of Texas Symp. on English and Syntax, 1959 (unpublished). Chomsky, N. On the notion "Rule of grammar." In R. Jakobson (Ed.), Structure of language and its mathematical aspects, Proc. \2th Symp. in App. Math. Providence, R. L: American Mathematical Society, 196L Pp. 6-24. (a). Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York: Prentice-Hall, 1963. Chomsky, N. Some methodological remarks on generative grammar. Word, 1961, 17, 219-239. (b). Chomsky, N. Explanatory models in linguistics. In E. Nagel, P. Suppes, & A. Tarski, (Eds.), Logic, Methodology, & Philosophy of Science: Proceedings of the 1960 International Congress. Stanford: Stanford Univer. Press, 1962. Pp. 528-550. (a). 320 FORMAL ANALYSIS OF NATURAL LANGUAGES Chomsky, N. The logical basis of linguistic theory. Proc, Ninth Int. Cong, of Linguists, 1962. Preprints, Cambridge, Mass., 1962. Pp. 509-574. (b). Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York: Prentice-Hall, 1963. Chomsky, N., Halle, M., & Lukoff, F. On accent and juncture in English. In For Roman Jakobson. The Hague: Mouton, 1956. Culik, K. On some axiomatic systems for formal grammars and languages. Mimeo- graphed, 1962. Davis, M. Computability and unsolvability. New York: McGraw-Hill, 1958. Halle, M. Sound pattern of Russian, The Hague: Mouton. 1959. (a). Halle, M. Questions of linguistics. Nuovo Cimento, 1959, 13,494-517. (b). Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York : Prentice-Hall, 1963. Halle, M. On the role of simplicity in linguistic descriptions. In R. Jakobson (Ed.), Structure of language and its mathematical aspects, Proc. \2th Symp. in App. Math. Providence, R. L; Amer. Mathematical Society, 1961. Pp. 89-94. Halle, M., & Chomsky, N. Sound pattern of English. (In prep.) Halle, M., & Stevens, K. N. Analysis by synthesis. Proc. Seminar on Speech Com- pression and Production, AFCRC-TR-59-198, 1959. Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York: Prentice-Hall, 1963. Halle, M., & Stevens, K. N. Speech recognition: a model and a program for research. IRE Trans, on Inform. Theory, 1962, IT-8, 155-159. Harris, Z. S. Discourse analysis. Language, 1952, 28, 1-30. (a). Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York : Prentice-Hall, 1963. Harris, Z. S. Discourse analysis: a sample text. Language, 1952, 28, 474-494. (b). Harris, Z. S. Co-occurrence and transformation in linguistic structure. Language, 1957, 33, 283-340. Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York: Prentice-Hall, 1963. Humboldt, W. von. 
Vber die Verschiedenheit des menschlichen Sprachbaues. Berlin, 1836. Facsimile edition: Bonn, 1960. Jakobson, R., Fant, C. G. M., & Halle, M. Preliminaries to speech analysis. Tech. Rept. 13, Acoustics Laboratory, Mass. Inst. Tech., Cambridge, Mass., 1952. Jakobson, R., & Halle, M. Fundamentals of language. The Hague: Mouton, 1956. Katz, J. Semi-sentences. In J. Katz and J. Fodor (Eds.), Readings in philosophy of language. New York: Prentice-Hall, 1963. Kraft, L. G. A device for quantifying, grouping, and coding amplitude modulated pulses. MS thesis, Dept Elec. Eng., Mass. Inst. Tech., 1949. Lees, R. B. Review of Chomsky, Syntactic Structures. Language, 1957, 33, 375-408. Lees, R. B. A grammar of English nominalizations. Supplement to International J. Amer. Linguistics. Baltimore, 1960. Libennan, A. M., Delattre, P., & Cooper, F. S. The role of selected stimulus variables in the perception of unvoiced stop consonants. Amer. J. PsychoL, 1952, 65, 497-516. MacKay, D. M. Mindlike behavior in artefacts. Brit. J. Philos. Science, 1951, 2, 105-121. Mandelbrot, B. On recurrent noise limiting coding. In. Proc. Symposium on Information Networks, Polytechnic Institute of Brooklyn, 1954. Pp. 205-22L Matthews, G. H. Hidatsa syntax Mimeographed, M.LT., 1962. McMillan, B. Two inequalities implied by unique decipherability. IRE Trans, on Inform. Theory. December 1956, IT-2, 115-116. REFERENCES $21 Miller, G, A. Speech and communication. /. acoust, Soc. Amer., 1958, 30, 397-398. Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. The measurement of meaning. Urbana, 111.: Univer. of Illinois Press, 1957. Postal, P. Some syntactic rules in Mohawk. PhD dissertation, Dept. of Anthropology, Yale University, 1962. Postal, P. Constituent analysis. Supplement to International J. Amer. Linguistics. Baltimore. (In press.) Rogers, H. The present theory of Turing machine computability. J. soc. indust. appl. math., 1959, 7, 114-130. Sapir, E. Sound patterns in language. Language, 1925, 1, 37-51. Reprinted in D. G. Mandelbaum (Ed.), Selected writings of Edward Sapir. Berkeley: Univer. of California Press, 1949. Pp. 33-45. Sapir, E. La realite psychologique des phonemes. J. de psychologie normale et patho- logique, 1933, 247-265. Reprinted in D. G. Mandelbaum (Ed.), Selected writings of Edward Sapir. Berkeley: Univer. of California Press, 1949. Pp. 46-60. Saussure, F. de. Cours de linguistique generale. Paris: 1916. Translation by W. Baskin, Course in general linguistics, New York: Philosophical Library, 1959. Schatz, C. D. The role of context in the perception of stops. Language* 1954, 30, 47. Schiitzenberger, M. P. On an application of semi-group methods to some problems in coding. IRE Trans, on Inform. Theory, 1956, IT-2, 47-60. Shannon, C. E. Communication in the presence of noise. Proc. IRE, 1949, 37, 10-21. Stevens, K. N. Toward a model for speech recognition. /. acoust. Soc. Amer., 1960, 34, 47-55. Sweet, H. A handbook of phonetics. Oxford: Clarendon Press, 1877. Trakhtenbrot, B. A. Algorithms and automatic computing machines. Boston : Heath, 1963. Translated by J. Khristian, J. D. McCawley, & S. A. Schmitt from 2nd edition Algoritmy i mashimoe reshenie zadach, 1960. Wallace, A. F. C. On being just complicated enough. Proc. Nat. Acad. ScL, 1961, 47, 458-^64. Ziff, P. Semantic Analysis. Ithaca, New York: Cornell Univer. Press, 1960. (a). Ziff, P. On understanding "Understanding utterances." Mimeographed, 1960. (b). Reprinted in J. Katz and J. Fodor (Eds.), Readings in philosophy of language. 
New York: Prentice-Hall, 1963. Ziff, P. About ungrammaticalness. Mimeographed, University of Pennsylvania, 1961. 12 Formal Properties of Grammars' Noam Chomsky Massachusetts Institute of Technology 1 The preparation of this chapter was supported in part by the U.S. Army, the Air Force Office of Scientific Research, and the Office of Naval Research; and in part by the National Science Foundation (Grant No. NSF G- 1390 3}. 323 Contents 1. Abstract Automata 326 1.1. Representation of linguistic competence, 326 1.2. Strictly finite automata, 331 .3. Linear-bounded automata, 338 .4. Pushdown storage, 339 .5. Finite transducers, 346 .6. Transduction and pushdown storage, 348 .7. Other kinds of restricted-infinite automata, 352 .8. Turing machines, 352 1.9. Algorithms and decidability, 354 2. Unrestricted Rewriting Systems 357 3. Context-Sensitive Grammars 360 4. Context-Free Grammars 366 4.1. Special classes of context-free grammars, 368 4.2. Context-free grammars and restricted-infinite automata, 371 4.3. Closure properties, 380 4.4. Undecidable properties of context-free grammars, 382 4.5. Structural ambiguity, 387 4.6. Context-free grammars and finite automata, 390 4.7. Definability of languages by systems of equations, 401 4.8. Programming languages, 409 5. Categorial Grammars 410 References 415 3*4 Formal Properties of Grammars A proposed theory of linguistic structure, in the sense of Chapter 1 1 , must specify precisely the class of possible sentences, the class of possible grammars, and the class of possible structural descriptions and must pro- vide a fixed and uniform method for assigning one or more structural descriptions to each sentence generated by an arbitrarily selected grammar of the specified form. In Chapter 1 1 we developed two conceptions of linguistic structure — the theory of constituent-structure grammar and the theory of transformational grammar — that meet these minimal conditions. We observed that the empirical inadequacies of the theory of constituent- structure grammar are rather obvious ; for this reason there has been no sustained attempt to apply it to a wide range of linguistic data. In contrast, there is a fairly substantial and growing body of evidence that the theory of transformational grammar may provide an accurate picture of gram- matical structure (Chomsky, I962b, and references cited there). On the other hand, there are very good reasons why the formal in- vestigation of the theory of constituent-structure grammars should be intensively pursued. As we observed in Chapter 11, it does succeed in expressing certain important aspects of grammatical structure and is thus by no means without empirical motivation. Furthermore, it is the only theory of grammar with any linguistic motivation that is sufficiently simple to permit serious abstract study. It appears that a deeper understanding of generative systems of this sort, and the languages that they are capable of describing, is a necessary prerequisite to any attempt to raise serious questions concerning the formal properties of the richer and much more complex systems that do offer some hope of empirical adequacy on a broad scale. For the present this seems to be the area in which the study of mathematical models is most likely to provide significant insight into linguistic structure and the capacities of the language user. 
In accordance with the terminology of Chapter 11, we may distinguish the weak generative capacity of a theory of linguistic structure (i.e., the set of languages that can be enumerated by grammars of the form permitted by this theory) from its strong generative capacity (the set of systems of structural descriptions that can be enumerated by the permitted grammars). This survey is largely restricted to weak generative capacity of constituent- structure grammars for the simple reason that, with a few exceptions, this is the only area in which substantial results of a mathematical character 325 326 FORMAL PROPERTIES OF GRAMMARS have been achieved. Ultimately, of course, we are interested in studying strong generative capacity of empirically validated theories rather than weak generative capacity of theories which are at best suggestive. It is important not to allow the technical feasibility for mathematical study to blur the issue of linguistic significance and empirical justification. We want to narrow the gap between the models that are accessible to mathe- matical investigation and those that are validated by confrontation with empirical data, but it is crucial to be aware of the existence and character of the gap that still exists. Thus, in particular, it would be a gross error to suppose that the richness and complexity of the devices available in a par- ticular theory of generative grammar can be measured by the weak genera- tive capacity of this theory. In fact, it may well be true that the correct theory of generative grammar will permit generation of a very wide class of languages but only a very narrow class of systems of structural de- scriptions, that is to say, that it will have a broad weak generative capacity but a narrow strong generative capacity. Thus the hierarchy of theories that we establish in this chapter (in terms of weak generative capacity) must not be interpreted as providing any serious measure of the richness and com- plexity of theories of generative grammar that may be proposed. 1. ABSTRACT AUTOMATA 1.1 Representation of Linguistic Competence At the outset of Chapter 1 1 we raised the problem of constructing (a) models to represent certain aspects of the competence achieved by the mature speaker of a language and (b) models to represent certain aspects of his behavior as he puts this competence to use. The second task is concerned with the actual performance of a speaker or hearer who has mastered a language; the first involves rather his knowledge of that language. Psychologists have long realized that a description of what an organism does and a description of what it knows can be very different things (cf. Lashley 1929, p. 553; Tolman, 1932, p. 364). A generative grammar that assigns structural descriptions to an infinite class of sentences can be regarded as a partial theory of what the mature speaker of the language knows. It in no sense purports to be a description of his actual performance, either as a speaker or as a listener. However, one can scarcely hope to develop a sensible theory of the actual use of language except on the basis of a serious and far-reaching account of what a language-user knows. The generative grammar represents the information concerning sentence ABSTRACT AUTOMATA 327 structure that is available, in principle, to one who has acquired the language. 
It indicates how, ideally — leaving out any limitations of memory, distractions, etc.— he would understand a sentence (to the extent that the processes associated with "understanding" can be interpreted syntactically). In fact, such sentences as Example 1 1 in Chapter 1 1 are quite incompre- hensible on first hearing, but this has no bearing on the question whether those sentences are generated by the grammar that has been acquired, just as the inability of a person to multiply 18,674 times 26,521 in his head is no indication that he has failed to grasp the rules of multiplication. In either case an artificial increase in memory aids, time, attention, etc., will probably lead the subject to the unique correct answer. In both cases there are problems that so exceed the user's memory and attention spans that the correct answer will never be approached, and in both cases there is no reasonable alternative to the conclusion that recursive rules specifying the correct solution are represented somehow in the brain, despite the fact that (for quite extraneous reasons) this solution cannot be achieved in actual performance. In a work that inaugurated the modern era of language study Ferdinand de Saussure (1916) drew a fundamental distinction between what he called langue and parole. The first is the grammatical and semantic system represented in the brain of the speaker; the second is the actual acoustic output from his vocal organs and input to his ears. Saussure drew an analogy between langue and a symphony, between parole and a particular performance of it; and he observed that errors or idiosyncracies of a particular performance may indicate nothing about the underlying reality. Langue^ the system represented in the brain, is the basic object of psycho- logical and linguistic study, although we can determine its nature and properties only by study of parole — just as a speaker can construct this system for himself only on the basis of actual observation of specimens of parole. It is the child's innate faculte de langage that enables him to register and develop a linguistic system (langue) on the basis of scattered observations of actual linguistic behavior (parole). Other aspects of the study of language can be seriously undertaken only on the basis of an adequate account of the speaker's linguistic intuition, that is, on the basis of a description of his langue. This is the general point of view underlying the work with which we are here concerned. It has sometimes been criticized — even rejected whole- sale— as "mentalistic". However, the arguments that have been offered in support of this negative evaluation of the basic Saussurian orientation do not seem impressive. This is not the place to attempt to deal with them specifically, but it appears that the "antimentalistic" arguments that have been characteristically proposed would, were they correct, apply as well 328 FORMAL PROPERTIES OF GRAMMARS against any attempt to construct explanatory theories. They would, in other words, simply eliminate science as an intellectually significant enterprise. Particular "mentalistic" theories may be useless or uninforma- tive (as also "behavioral" or "mechanistic*' theories), but this is not because they deal with "mentalistic" concepts that are associated with no necessary and sufficient operational or "behavioral" criterion. 
Observations of behavior (e.g., specimens of parole, particular arithmetical computations) may constitute the evidence for determining the correctness of a theory of the individual's underlying intellectual capacities (e.g., his langue, his innate faculte de langage, his knowledge of arithmetic), just as observations of color changes in litmus paper may constitute the evidence that justifies an assumption about chemical structure or as meter readings may constitute the evidence that leads us to accept or reject some physical theory. In none of these cases is the subject matter of the theory (e.g., innate or mature linguistic competence, ability to learn arithmetic or knowledge of arithmetic, the nature of the physical world) to be confused with the evidence that is adduced for or against it. As a general designation for psychology, "behavioral science" is about as apt as "meter-reading science" would be for physics (cf. Kohler, 1938, pp. 152-169). Our discussion departs from a strict Saussurian conception in two ways. First, we say nothing about the semantic side of langue. The few coherent remarks that might be made concerning this subject lie outside the scope of the present survey. Second, our conception of langue differs from Saus- sure's in one fundamental respect; namely, langue must be represented as a generative process based on recursive rules. It seems that Saussure regarded langue as essentially a storehouse of signs (e.g., words, fixed phrases) and their grammatical properties, including, perhaps, certain "phrase types." Consequently, he was unable to deal with questions of sentence structure in any serious way and was forced to the conclusion that formation of sentences is basically a matter of parole rather than langue, that is, a matter of free and voluntary creation rather than of systematic rule. This bizarre consequence can be avoided only through the realization that infinite sets with certain types of internal structure (such as, in par- ticular, sentences of a natural language with their structural descriptions) can be characterized by a finite recursive generative process. This insight was not generally available at the time of Saussure's lectures. Once we reformulate the notion of langue in these terms, we can hope to incor- porate into its description a full account of syntactic structure. Further- more, even the essentially finite parts of linguistic theory — phonology, for example — must now receive a rather different formulation, as we observed briefly in Chapter 11, Sec. 6. New and basic questions of a semantic nature "can also be raised. Thus we can ask how a speaker uses the ABSTRACT AUTOMATA 329 (sentence s, structural description of s) Sentence s Structural description of s Linguistic data • Grammar Fig. 1 . Three types of psycholinguistic models sug- gested by the Saussurian conception of language. recursive devices that specify sentences and their structural descriptions, on all levels, to interpret presented sentences, to produce intended ones, to utilize deviations from normal grammatical structure for expressive and literary purposes, etc. (cf. Katz & Fodor, 1962). It is impossible to main- tain seriously the widespread view that our knowledge of the language involves familiarity with a fixed number of grammatical patterns, each with a certain meaning, and a set of meaningful items that can be inserted into them, and that the meaning of a new sentence is basically a kind of com- pound of these component elements. 
With this modification, the Saussurian conception suggests for investi- gation three kinds of models, which are represented graphically in Fig. 1. The device A is a grammar that generates sentences with structural descriptions; that is to say, A represents the speaker's linguistic intuition, his knowledge of his language, his langue. If we want to think of A as an input-output device, the inputs can be regarded as integers, and A can be regarded as a device that enumerates (in some order that is of no immediate interest) an infinite class of sentences with structural descriptions. Alter- natively, we can think of the device A as being a theory of the language. The device B in Fig. 1 represents the perceptual processes involved in determining sentence structure. Given a sensory input s, the hearer represented by B constructs an internal representation — a percept — which we call the structural description of s. The device B, then, would con- stitute a proposed account of the process of coming to understand a sentence, to the (by no means trivial) extent that this is a matter of deter- mining its grammatical structure. The device C represents thtfaculte de langage, the innate abilities that 33° FORMAL PROPERTIES OF GRAMMARS make it possible for an organism to construct for itself a device of type A on the basis of experience with a finite corpus of utterances and, no doubt, information of other kinds. The converse of B might be thought of as a model of the speaker, and, in fact, Saussure did propose a kind of account of the speaker as a device with a sequence of concepts as inputs and a physical event as output. But this doctrine cannot survive critical analysis. In the present state of our understanding, the problem of constructing an input-output model for the speaker cannot even be formulated coherently. Of the three tasks of model construction just mentioned, the first is logically prior. A device of type A is the output of C — it is, in other words, one major result of the learning process. It also seems that one of the most hopeful ways to approach the problem of characterizing C is through an investigation of linguistic universals, .the structural features common to all generative grammars. For acquisition of language to be possible at all there must be some sort of initial delimitation of the class of possible sys- tems to which observed samples may conceivably pertain; the organism must, necessarily, be preset to search for and identify certain kinds of structural regularities. Universal features of grammar offer some sugges- tions regarding the form this initial delimitation might take. Furthermore, it seems clear that any interesting realization of B that is not completely ad hoc will incorporate A as a fundamental component; that is to say, an account of perception -will naturally have to base itself on the perceiver's knowledge of the structure of the collection of items from which the pre- ceived objects are drawn. These, then, are the reasons for our primary concern with the nature of grammars — with devices of the type A — in this chapter. It should be noted again that the logical priority of langue (i.e., the device A) is a basic Saussurian point of view. The primary goal of theoretical linguistics is to determine the general features of those devices of types A to Cthat can be justified as empirically adequate— that can qualify as explanatory theories in particular cases. B and C, which represent actual performance, must necessarily be strictly finite. 
A, however, which is a model of the speaker's knowledge of his language, may generate a set so complex that no finite device could identify or produce all of its members. In other words, we cannot conclude, on the basis of the fact that the rules of the grammar represented in the brain are finite, that the set of grammatical structures generated must be of the special type that can be handled by a strictly finite device. In fact, it is clear that when A is the grammar of a natural language L there is no finite device of type B that will always give a correct structural description as output when and only when a sentence of L is given as input. There is nothing surprising or paradoxical about this; it is not a necessary consequence of the fact that L is infinite but rather a consequence of certain structural properties of the generating device A.

Viewed in this way, several rather basic aspects of linguistic theory can be regarded, in principle at least, as belonging to the general theory of (abstract) automata. This theory has been studied fairly extensively (for a recent survey, see McNaughton, 1961), but it has received little attention in the technical literature of psychology and is not readily available to most psychologists. It seems advisable, therefore, to survey some well-known concepts and results (along with some new material) as background for a more specific investigation of sentence-generating devices in Secs. 2 to 5.

1.2 Strictly Finite Automata

The simplest type of automaton is the strictly finite automaton. We can describe it as a device consisting of a control unit, a reading head, and a tape. The control unit contains a finite number of parts that can be arranged in a finite number of distinct ways. Each of these arrangements is called an internal state of the automaton. The tape is blocked off into squares; it can be regarded as extending infinitely far both to the left and to the right (i.e., as doubly infinite). The reading head can scan a single tape square at a time and can sense the symbols a0, . . . , aD of a finite alphabet A (where a0 functions as the identity element). We assume that the tape can move in only one direction — say right to left. We designate a particular state of the automaton as its initial state and label it S0. The states of the automaton are designated S0, . . . , Sn (n > 0).

We can describe the operation of the automaton in the following manner. A sequence of symbols ap1, . . . , apk (0 ≤ pi ≤ D) of the alphabet A is written on consecutive squares of the tape, one symbol to a square. We assume that the symbol #, which is not a member of A, appears in all squares to the left of ap1 and in all squares to the right of apk. The control unit is set to state S0. The reading head is set to scan the square containing the symbol ap1. This initial tape-machine configuration is illustrated in Fig. 2.

[Fig. 2. Initial tape-machine configuration.]

The control unit is constructed so that when it is in a certain state and the reading head scans a particular symbol it switches into a new state while the tape advances one square to the left. Thus in Fig. 2 the control unit will switch to a new state while the tape moves, so that the reading head is now scanning the symbol ap2. This is the second tape-machine configuration. The machine continues to compute in this way until it blocks (i.e., reaches a tape-machine configuration for which it has no instruction) or until it makes a first return to its initial state. In the latter case, if the reading head is scanning the square to the right of apk (in which case, incidentally, the machine is blocked, since this square contains #, which is not in A), we say that the automaton has accepted (equivalently, generated) the string # ap1 . . . apk #. The set of strings accepted by the automaton is the language accepted (generated) by the automaton. The behavior of the automaton is thus described by a finite set of triples (i, j, k), 0 ≤ i ≤ D and 0 ≤ j, k ≤ n.
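The operation just described is easy to simulate. The following sketch treats only the deterministic case; the example rule set is hypothetical, and acceptance is taken, as in the definition above, to be a first return to S0 exactly when the input has been exhausted.

```python
# A strictly finite automaton given as triples (i, j, k): scanning letter a_i
# in state S_j, switch to S_k while the tape advances one square.

def accepts(rules, word):
    """rules: set of triples (i, j, k); word: sequence of letter indices p1..pk."""
    table = {(i, j): k for (i, j, k) in rules}      # deterministic case only
    state = 0                                       # S_0, scanning a_p1
    for pos, letter in enumerate(word):
        if (letter, state) not in table:
            return False                            # the machine blocks
        state = table[(letter, state)]
        if state == 0:                              # first return to S_0
            return pos == len(word) - 1             # must now be scanning #
    return False                                    # never returned to S_0

# Hypothetical two-state machine accepting strings of the form a1^n a2, n >= 1.
rules = {(1, 0, 1), (1, 1, 1), (2, 1, 0)}
print(accepts(rules, [1, 1, 1, 2]))   # True
print(accepts(rules, [2]))            # False
```

A machine of this kind, with no writing and no backward motion, can consult only its current state and the scanned symbol, which is why its generative capacity is so limited.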
[Fig. 4. Illustration of the use of the representing expressions.]

Next, we want to say exactly what it is that these expressions are supposed to represent:

Definition 3. (i) A finite string in A represents itself (better: the unit class containing it). (ii) If X1 represents the set of strings Σ1 and X2 represents the set of strings Σ2, then X1X2 represents the set of all strings VW such that V ∈ Σ1 and W ∈ Σ2. (iii) If Xi, . . .

. . . of length k of elements of V. The corresponding entry will be zero or one as the automaton does or does not accept W, having just accepted the string defining the given row. Each such string defines a state of the automaton. This notion is familiar in the study of language under the following modification. Suppose that each entry of the defining matrix is a number between zero and one, representing the frequency with which the word corresponding to the given column occurs after the string of k words defining the given row in some sample of a language. Interpret this matrix as a description of a probabilistic k-limited automaton that generates strings in accordance with this set of transitional probabilities; that is, if the automaton is in the state defined by the ith row of the matrix, the row that corresponds to the sequence of symbols Wi1 . . . Wik, then the entry (i, j) gives the probability that the next word it generates will be Wj. After having generated (accepted) Wj, it switches to the state defined by the string Wi2 . . . Wik Wj. Where k ≥ 1, such a device generates what is called a (k + 1)-order approximation to the sample from which the probabilities were derived (cf. Shannon & Weaver, 1949; Miller & Selfridge, 1950). We return to this notion in Sec. 1.2 of Chapter 13.

Clearly not every finite automaton is a k-limited automaton. For example, the three-state automaton with the state diagram shown in Fig. 5 is not a k-limited automaton for any k.

[Fig. 5. A finite automaton that is not k-limited for any k.]

However, for each regular language L, there is a 1-limited language L* and a homomorphism f such that L = f(L*) (Schützenberger, 1961a). In fact, let L be accepted by a deterministic automaton M with no rule (i, j, 0) for i ≠ 0 (clearly there is such an M). Let M* have the input alphabet consisting of the symbols (ai, Sj) and the internal states [ai, Sj], where ai is in the alphabet of M and Sj is a state of M, with [a0, S0] as initial state. The transitions of M* are determined by those of M by the following principle: if (i, j, k) is a rule of M, then M* can move from state [al, Sj] (for any l) to state [ai, Sk] when reading the input symbol (ai, Sk). Let L* be the language accepted by M*. Let f be the homomorphism that maps (ai, Sj) into ai for each i, j. Then L = f(L*) and L* is 1-limited.
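A probabilistic k-limited automaton of this kind can be estimated from a sample and then run as a generator. The sketch below is illustrative only; the sample text, the value of k, the seeding of the random source, and the function names are assumptions of the illustration, not part of the original discussion.

```python
import random
from collections import defaultdict, Counter

def estimate_matrix(sample, k):
    """Rows: k-word strings; columns: next word; entries: observed counts."""
    rows = defaultdict(Counter)
    for i in range(len(sample) - k):
        rows[tuple(sample[i:i + k])][sample[i + k]] += 1
    return rows

def generate(rows, start, length, seed=0):
    """Generate a (k+1)-order approximation, starting from the state `start`."""
    rng = random.Random(seed)
    state, output = tuple(start), list(start)
    for _ in range(length):
        counts = rows.get(state)
        if not counts:
            break                                    # no continuation observed
        words, weights = zip(*counts.items())
        nxt = rng.choices(words, weights=weights)[0]
        output.append(nxt)
        state = state[1:] + (nxt,)                   # shift to the new k-word state
    return output

sample = "the boy saw the book the boy read the book".split()
rows = estimate_matrix(sample, k=1)
print(' '.join(generate(rows, start=['the'], length=6)))
```

The states here are simply the k most recently generated words, which is exactly the restriction that distinguishes a k-limited automaton from an arbitrary finite automaton.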
The behavior of the automaton is now determined by a set of quadruples (i, j, k, l), where i, j, k are, as before, indices of a letter, a state, and a state, respectively, and where l is one of +1, 0, −1. Following Rabin and Scott (1959), we interpret these quadruples in the following way:

Definition 4. Let (i, j, k, l) be one of the rules defining the automaton M. If the control unit of M is in state S_j and its reading head is scanning a square containing the symbol a_i, then the control unit may shift to state S_k while the tape shifts l squares to the left.

A device of this sort we call a two-way automaton. We regard a shift of −1 square to the left as a shift of one square to the right. We can again say that such a device accepts (generates) a string exactly as a finite automaton does. That is to say, it accepts the string x only under the following condition. Let x be written on consecutive squares of the tape, which is otherwise filled with #'s. Let the control unit be set to the initial state S_0, scanning the leftmost square not containing #. Suppose that the device now computes until its first return to S_0, at which point the control unit is scanning a square containing #. In this case it accepts x. It might be expected that by thus relaxing the conditions that a finite automaton must meet we would increase the generative capacity of the device. This is not the case, however, and we have (Rabin & Scott, 1959; Shepherdson, 1959) the following theorem:

Theorem 3. The sets that can be generated by two-way automata are again the regular languages.

The proof involves showing that the device can spare itself the necessity of returning to look a second time at any given part of the tape if, before it leaves that part, it thinks of all the questions (their number will be finite) it might later come back to ask, answers all of those questions right then, and carries the table of question-answer pairs forward along the tape with it, altering the answers when necessary as it goes along. Thus it is possible to construct an equivalent one-way automaton, although the price is to increase the number of internal states of the control unit.

1.3 Linear-Bounded Automata

Suppose that we were to allow a two-way automaton to write a symbol on the tape as it switches states. The symbols written on the tape belong to an output alphabet A_O = {a_0, . . . , a_p, . . . , a_q} (# ∉ A_O), where A_I = {a_0, . . . , a_p} is the input alphabet. We now have to specify the behavior of the device by a set of quintuples (i, j, k, l, m), in which the set of quadruples (i, j, k, l) specifies a two-way automaton and the scanned symbol a_i is replaced by a_m (which may, of course, be identical with a_i) as the device switches from state S_j to S_k. Following Myhill (1960), in essentials, we have Def. 5.

Definition 5. Let (i, j, k, l, m) be one of the rules defining M. If the control unit of M is in state S_j and its reading head is scanning a square containing a_i, then the control unit may switch to S_k while the tape shifts l squares left and the scanned symbol a_i is replaced by a_m.

We call this device a linear-bounded automaton. Acceptance of a string is defined as before. In such a device the tape is now used for storage, not just for input.
Therefore, when a linear-bounded automaton M is given the input x9 it has available to it an amount of memory determined by c %(x) + q, where q is the fixed memory of the control unit, c is a constant (determined by the size of the output alphabet), ABSTRACT AUTOMATA J^p and /.(x) is the length of x. It is thus a simple, potentially infinite auto- maton, and, as we shall see directly, it can generate languages that are not regular. It is sometimes convenient, in studying or attempting to visualize the performance of an automaton, to assign to it a somewhat more complex structure. Thus we can regard the device as having separate subparts for carrying out various aspects of its behavior. In particular, we can regard a linear-bounded automaton as having two separate infinite tapes, one solely for input and the other solely for computation, with the second tape having as many squares available for computation as are occupied by alphabetic symbols (i.e., occurring between #...#) on the input tape. We can also regard it as having several independent computation tapes of this kind. These modifications require appropriate changes in the descrip- tion of the operation of the control unit, but it is not difficult to describe them in such a way as to leave the generative capacity of the class of automata in question unmodified. 1.4 Pushdown Storage One special class of linear-bounded automata of particular interest is the following. Consider an automaton M with two tapes, one an input tape, the other a storage tape. The control unit can read from the input tape, and it can read from or write on the storage tape. The input tape can move in only one direction, let us say, right to left. The storage tape can move in either direction. Symbols of the input alphabet Ax can appear on the input tape, and symbols of the output alphabet Ao can be read from or printed on the storage tape, where AT and Ao are as pre- viously given. We assume that Ao contains a designated symbol a £ AI that will be used only to initiate or terminate computation in a way that will be described directly. In Sees. L4 to 1.6 we designate the identity element of Ao and AT as e instead of a0. The other symbols of Ao we continue to designate as al9 . . . , aq. We continue to regard Ax and Ao as "universal alph'abets," independent of the particular machine being considered. We define a situation of the device as a triple (a, S^ b), in which a is the scanned symbol of the input tape, St is the state of the control unit, and b is the scanned symbol of the storage tape. Each step of a computation depends, in general, on the total situation of the device. In the initial tape-machine configuration the input tape contains the symbols aft9 . . . , a^ (where now & ^ 0) in successive squares, flanked on both sides by#; and the control unit is in state SQ scanning the leftmost FORMAL PROPERTIES OF GRAMMARS symbol a?i of a; = ^ . . . a^ (as in Fig. 2). The scanned square of the storage tape contains" cr, and every other square contains #. Thus the device is in the situation (a^9 S0, cr) in its initial configuration. The device computes in the manner subsequently described until its first return to SQ. The input string x is accepted by the device if, at this point, # is being scanned on both the input and the storage tapes, that is, if the device is in the terminal situation (#, S09#). The special feature of these devices which distinguishes them from general linear-bounded automata is this. 
When the storage tape moves one square to the right, its previously scanned symbol is "erased." When the storage tape moves k squares to the left, exposing k new squares, k successive symbols of A_O (all distinct from e) are printed in these squares. When it does not move, nothing is printed on it or erased from it. Thus only the rightmost symbol in storage is available at each stage of the computation. The symbol most recently written in storage is the earliest to be read out of storage. Furthermore, the storage tape will necessarily be completely blank (i.e., it will contain only #) when the terminal situation (#, S_0, #) is reached. The device M which behaves in the way just described is called a pushdown storage (PDS) automaton, following Newell, Shaw, and Simon (1959). This organization of memory has found wide application in programming, and, in particular, its utility for analysis of syntactic structure by computers has been noted by many authors. The reasons for this, as well as the intrinsic limitations, will become clearer when we see that the theory of PDS automata is, in fact, essentially another version of the theory of context-free grammar (see Chapter 11, Sec. 4). Note that a PDS automaton with a possibly nondeterministic control unit is a device that carries out "predictive analysis" in the sense of Rhodes (cf. Oettinger, 1961). Hence this theory, too, is essentially a variant of context-free grammar. Let us now turn to a more explicit specification of PDS automata. We assume, selecting one of the two equivalent formulations mentioned on p. 333 above, that e cannot occupy a square of the input or storage tape. Thus we extend the definition of "situation" to include triples (e, S_i, b), (a, S_i, e), and (e, S_i, e); and we assume that when the device is in the situation (a, S_i, b) it is also, automatically, in the situations (e, S_i, b), (a, S_i, e), and (e, S_i, e); that is, any instruction that applies to the situations (e, S_i, b), (a, S_i, e), or (e, S_i, e) may apply when the device is in state S_i reading a on the input tape and b on the storage tape. The input tape will actually shift left only when an instruction involving a ≠ e on the input tape is applied. Let us define a function λ(x) (read "length of x") for certain strings x as follows: λ(σ) = −1; λ(e) = 0; λ(za_i) = λ(z) + 1, where za_i is a string in A_O − {σ} (1 ≤ i ≤ q). Each instruction for a PDS automaton can now be given in the standardized form

(a, S_i, b) → (S_j, x),          (1)

where a ∈ A_I, b ∈ A_O, x = σ or x is a string on A_O − {σ}, and j = 0 if and only if b = σ = x. The instruction (1) applies when the device is in the situation (a, S_i, b) and has the following effect: the control unit switches to state S_j; the input tape is moved λ(a) squares to the left; the symbols of x are printed successively on the squares to the right of the previously scanned square of the storage tape — in particular, if x = a_γ1 . . . a_γm, then a_γk is printed in the kth square to the right of the previously scanned square of the storage tape, replacing the contents of this square — while the storage tape is moved λ(x) squares to the left. Thus, if x ≠ σ, the device is now scanning (on the storage tape) the rightmost symbol of x; if x = e, it is still scanning b; if x = σ, it is scanning the symbol to the left of b. Furthermore, we can think of each square to the right of the scanned square of the storage tape as being automatically erased (replaced by #).
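The instruction format (1) is easy to simulate. The following Python sketch is an illustration under simplifying assumptions of our own rather than the chapter's formalism: the input tape is represented by the string of unread symbols, the storage tape by a stack whose bottom marker 'sigma' stands for σ, 'e' in a rule matches any symbol, and the token 'pop' plays the role of writing σ. The rules shown for the language {a^n b^n} are a hypothetical example, not one given in the text; a string is accepted when some computation returns to S_0 with the input read through and the storage blank.

```python
from collections import deque

def accepts(word, rules, max_steps=100_000):
    """Configurations are (unread input, state, stack); rules ((a, i, b), (j, x))
    stand for (a, S_i, b) -> (S_j, x).  'e' matches any symbol, and x == 'pop'
    plays the role of sigma: it erases the scanned storage symbol."""
    start = (tuple(word), 0, ('sigma',))
    seen, queue = {start}, deque([start])
    steps = 0
    while queue and steps < max_steps:
        inp, state, stack = queue.popleft()
        steps += 1
        if not inp and state == 0 and not stack:
            return True                       # back in S_0 with both tapes "blank"
        for (a, i, b), (j, x) in rules:
            if i != state:
                continue
            if a != 'e' and (not inp or inp[0] != a):
                continue
            if b != 'e' and (not stack or stack[-1] != b):
                continue
            new_inp = inp[1:] if a != 'e' else inp            # input shifts lambda(a) squares
            new_stack = stack[:-1] if x == 'pop' else stack + tuple(c for c in x if c != 'e')
            nxt = (new_inp, j, new_stack)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical rules for {a^n b^n} (n >= 1): count a's into storage, cancel them
# against b's, and return to S_0 only when erasing the bottom marker, mirroring
# the condition j = 0 iff b = sigma = x.
RULES = [(('a', 0, 'e'), (1, 'a')),
         (('a', 1, 'e'), (1, 'a')),
         (('b', 1, 'a'), (2, 'pop')),
         (('b', 2, 'a'), (2, 'pop')),
         (('e', 2, 'sigma'), (0, 'pop'))]

print(accepts("aabb", RULES), accepts("aab", RULES))   # True False
```

Because the sketch searches over configurations rather than following a single computation, it also handles nondeterministic rule sets of the kind discussed later in this section.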
In any event, we define the contents of the storage tape as the string to the left of and including the scanned symbol, and we say that the storage tape contains this string. More precisely, if #, a^, . . . , a?n appear in successive squares of the storage tape, where a^ occupies the scanned square, then the string a^ . . . a$ is the contents of the storage tape. If # is the scanned symbol of the storage tape, we say that this tape contains the string e (its contents is e) or that the storage tape is blank. Note that when the automaton M applies Instruction 1 the input tape will move one square to the left if a ^ e and will not move if a — e. Furthermore, if M is scanning # on the input tape, the Instruction 1 can apply only if a = e. The condition that j = 0 if and only if b = a = x implies that if M begins to compute from its initial configuration, then on its first return to S0 it will necessarily be in a situation (a, S0, #)} for some a. If a— #> then M is in the terminal situation (#, SQ, #), scanning # on both input and storage tapes, and it therefore accepts the string that occupied the input tape in the initial configuration. There may, in fact, be further vacuous computation at this point if there is an instruction in the form of (1) with a = e = b and i = 0, but this does not affect generative capacity. We can regard the device as blocked when it reaches the terminal situation. Its storage tape will be blank at this point and at no other stage of a computation. We can give a somewhat simpler characterization of the family of languages accepted by PDS automata without explicit reference to tape manipulations, etc. Given A^ Ao and or, let us define a PDS automaton M as a finite set of instructions of the form of Instruction 1. For each i let a& = e— that is, a is a general "right-inverse." A configuration 342 FORMAL PROPERTIES OF GRAMMARS of M is a triple K = (x, Sf, z/), where S* is a state, a; is a string in Al9 and y is a string in Ao. Think of x as being the still unread portion of the input tape (i.e., the string to the right of and including the scanned symbol) and y as the contents of the storage tape, where 5, is the present state. When /is the Instruction 1, we say that configuration K2 follows from configuration K± by /if ^ = (ay, St, zb) and K2 = (z/, Si9 zbx). We say that M accepts w if there is a sequence of configurations K^ . . . , Km such that KI = (H-, S0, a), Km = (e, S0, or <^>a/?. It may be instructive to compare a PDS device, which has, in this interpretation, an infinite set of potential states, with a ^-limited automaton. As it was defined above, the memory of a A>limited automaton can also be represented in terms of the set of strings on an internal alphabet (in this case identical to the input alphabet). Transition in a ^-limited automaton corresponds to the addition of a letter to the right-hand end of the string representing a state and simultaneous deletion of a letter from the left-hand end of that string. Thus the total set of potential states is finite. A device with PDS is a special type of linear-bounded automaton. It can, of course, easily perform many tasks that a finite automaton cannot, although it makes only a "single pass" through the input data (i.e., the input tape moves in only one direction). Consider, for example, the task of generating (accepting) the language L2' consisting of all strings #xcx*#, where # is a nonnull string of a's and Z?'s and x* is the mirror-image of x9 that is, x read from right to left (cf. 
language L2 in Chapter 11, Sec. 3, p. 285). This task is clearly beyond the range of a finite automaton, since the number of available states must increase exponentially as the device accepts successive symbols of the first half of the input string. Consider, however, the PDS automaton M with the input alphabet {a, b, c}, the internal states S_0, S_1, and S_2, and the following rules, where α ranges over {a, b}:

(i)   (α, S_0, e) → (S_1, α)
(ii)  (α, S_1, e) → (S_1, α)
(iii) (c, S_1, e) → (S_2, e)          (2)
(iv)  (α, S_2, α) → (S_2, σ)
(v)   (e, S_2, σ) → (S_0, σ).

The control unit has the state-diagram shown in Fig. 6, in which the triple (r, s, t) is on the arrow leading from state S_i to state S_j just in case the device has the rule (r, S_i, s) → (S_j, t). Clearly, this device will accept a string if and only if it is in L2'. For example, the successive steps in accepting #abcba# are given in Fig. 7. Evidently pushdown storage is an appropriate device for accepting (generating) languages such as L2', which have, in the obvious sense, nesting of units (phrases) within other units, that is, the kind of recursive property that in Chapter 11, Sec. 3, we called self-embedding. We shall see in Secs. 4.2 and 4.6 that the essential properties of context-free grammars (cf. Chapter 11, Sec. 4) distinguishing them from finite automata are that they permit self-embedding and symmetries in the generated strings. Consequently, we would expect that pushdown storage would be useful in dealing with languages with grammars of this type. This class obviously includes many familiar artificial languages (e.g., sentential calculus and probably many programming languages — cf. Sec. 4.8). It is, in fact, a straightforward matter to construct a PDS automaton that will recognize or generate the sentences of such systems. Oettinger (1961) has pointed out that if we equip a PDS device with an output tape and adjust its instructions to permit it to map an input string into a corresponding output (using its PDS in the computation) we can instruct it to translate between ordinary and Polish notation, for example. To some approximation, context-free constituent-structure grammars are partially adequate for natural languages; that is, nesting of phrases (self-embedding) and symmetry are basic properties of natural languages. Consequently, such devices as PDS will no doubt be useful for actual handling of natural-language texts by computers for one or another purpose. The device (2) is deterministic. In the case of finite automata we have observed (cf. Theorem 1) that, given any finite automaton, there is an equivalent deterministic one. This observation is not true, however, for the class of PDS automata. There is, for example, no deterministic PDS automaton that will accept the language L2 = {xx* | x* the mirror-image of x} (thus L2 consists of the strings formed by deleting the midpoint element c from the strings of L2'), since the device will have no way of knowing when it has reached the middle of the input string; but L2 is accepted by the nondeterministic PDS device derived from the device (2) for L2' by replacing Rule iii by

(α, S_1, e) → (S_2, α).          (3)

This amounts to dropping the arrow labeled (c, e, e) in Fig. 6 and connecting S_1 to S_2 with two arrows, one labeled (a, e, a) and the other (b, e, b). The device uses Instruction (3) when it "guesses" that it has reached the middle of the input string. Two restricted varieties of PDS automata will be useful below. M is a PDS automaton without control if each rule is of the form (a, S_i, e) → (S_j, x); M is a PDS automaton with restricted control if each rule is of one of the two forms (a, S_i, e) → (S_j, x), x ≠ σ, or (a, S_i, b) → (S_j, σ).
In other words, in the case of a PDS device with restricted control the symbol being scanned on the storage tape plays a role in determining only the computations that "erase" from storage. Thus the device of Fig. 6 has restricted control. In the case of a PDS automaton without control the storage tape is acting only as a counter. We can, without loss of generality, assume that only one symbol can be written on it. Concerning these families of automata we observe, in particular, the following:

Theorem 4. (i) The family of PDS automata without control is essentially richer in generative capacity than the family of finite automata but essentially poorer than the full family of PDS devices. (ii) For each PDS device there is an equivalent PDS device with restricted control.

As far as Part i is concerned, it is obvious that a PDS automaton without control can accept the language L1 = {a^n b^n} (cf. Chapter 11, Sec. 3) but not the language L2 or L2'. In fact, these languages are beyond the range of a device with any finite number of infinite counters that shift position independently in a fixed manner with each interstate transition (e.g., a counter could register the number of times the device has passed through a particular state or produced a particular symbol), where the decision whether to accept an input string depends on an elementary property of the counters (i.e., are they equal; do they read zero, as in the case of PDS automata; etc.). Although a device with q counters and k states has a potentially infinite memory, after p transitions at most kp^q configurations of a state and a counter reading can have been reached, and obviously 2^p different configurations must be available after p transitions for generating sentences of L2 of length 2p (the availability of identity transitions does not affect this observation). See Schützenberger (1959) for a discussion of counter systems in which this description is made precise. Part ii of Theorem 4 follows as a corollary of several results that we shall establish (cf. Theorem 6, Sec. 1.6).

1.5 Finite Transducers

Suppose that we have a PDS device M meeting the additional restriction that the storage tape never moves to the right, that is, each rule of M is of the form (a, S_i, b) → (S_j, x), where x ≠ σ. Beginning in its initial configuration with the successive symbols #, a_p1, . . . , a_pk, # on the input tape, the device will compute in accordance with its instructions, moving its storage tape left whenever it prints a string x on that tape. Suppose that the device continues to compute until it reaches the situation (#, S_i, a_j), for some i, j; that is, it does not block before reading in the entire input tape. At this point the storage tape contains some string y = wa_j, and we say that the device M maps the string a_p1 . . . a_pk into the string y. We call M a transducer, which maps input strings into output strings and, correspondingly, input languages into output languages. We designate by M(L) the set of strings y such that, for some x ∈ L, M maps x into y. Note that a transducer can never reach a configuration in which it is scanning # on the storage tape. Consequently, it can never accept its input string in the sense of "acceptance" as previously defined. In the case of a transducer we can regard the storage tape as an output tape. In the case of a transducer the restrictions on the form of instructions for PDS automata that involve return to S_0 are clearly inoperative.
In fact, we can allow an instruction 7 of a transducer to be of the form (a, Si9 b) -> (SJ9 x) [as in (1)], where aeAl9 b e A0 and re is a string in A0 — {a}, dropping the other restrictions. Where M is a transducer, it is clear that the storage tape is playing no essential role in determining the course of the computation; and, in fact, we can construct a device, T9 that effects the same mapping as M9 while meeting the additional restriction that the next state is determined only by the input symbol and the present state. We designate the states of T in the form (Si9 a), where St is a state of M and a ^ e is a symbol of its output alphabet. The initial state of T is (50, or). Where M has the rule (a, Si9 b) -> (SJ9 x\ T will have the rule [a,(Si9b\V\-+[(Si9c\x]9 (4) where either x = yc or x = e and c = b. Clearly the behavior of T is in no way different from that of M, but in the case of T the next step in the computation depends only on the input symbol and the present state. It is thus a PDS device without control. Eliminating the redundant specification of the scanned symbol of the storage tape, we can give all the rules of T in the form (*, S<) -*(?„*), (5) ABSTRACT AUTOMATA 34? indicating that when T is in state Sf and is scanning a (if a ^ e) or is scanning any symbol (if a = e) on the input tape it may switch to state S,., move the input tape /(#) squares left, and the storage tape /(#) squares left, printing x on the newly exposed squares (if any) of the storage tape. Each transducer can thus be fully represented by a state-diagram, in which nodes represent states, and an arrow labeled (a, x) leads from Sf to S3- just in case Rule 5 is an instruction of the device. Suppose that in the state-diagram representing the transducer M there is no possibility of traversing a closed path, beginning and ending with a given node, following only arrows labeled (e, x)9 for some x. More formally, there is no sequence of states (5ai, . . . , 5ajfc) of M such that aj. = a^ and, for each / < k, there is an xi such that (e, Sa.) -> (Sat+i, x£ is an instruction of M. If this condition is met, the number of outputs that can be given with a single input is bounded, and we call M a bounded transducer. The mapping effected by a transducer we call a (finite) transduction. A transduction is a mapping of strings into strings (hence languages into languages) of a kind that can be performed by a strictly finite device. Given a bounded transducer T, we can obviously eliminate as many of the instructions of the form (ey Sf) ->• (S3-, x) as we like without affecting the transduction performed by simply allowing the device to print out longer strings on interstate transitions. Alternatively, by adding a sufficient number of otherwise unused states and enough rules of the form of Rule 5 where a = e we can construct, corresponding to each transducer T, a transducer T' which performs the same transduction as Tbut has rules only of the form (a, SJ -> (Si9 b\ where b € Ao. Note that, given such a T, we can construct immediately the "inverse" transducer T* that maps the string y into the string x just in case Tmaps x into y by simply interchanging input and output symbols in the instructions of T'\ for example, in Rule 5, where x e Ao, interchanging a and x. This amounts to replacing each label (a, b) on an arrow of the state-diagram by the label (b, a). 
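As a small illustration of rules in the form of Rule 5, the following Python sketch runs a deterministic transducer that uses no e-moves; the rule table is a hypothetical example of ours, not one from the text. Reading the input one symbol at a time, the device switches state and appends the printed string x to the output, and the sketch returns nothing if the device blocks before reading through its input.

```python
def transduce(word, rules):
    """Sketch of a deterministic transducer with rules of the form of Rule (5),
    (a, S_i) -> (S_j, x): read a, switch from S_i to S_j, and append x to the
    output tape.  Returns None if the device blocks."""
    state, output = 0, []
    for a in word:
        if (a, state) not in rules:
            return None
        state, x = rules[(a, state)]
        output.append(x)
    return "".join(output)

# Hypothetical one-state transducer over {a, b}: double every a and erase every b.
RULES = {('a', 0): (0, 'aa'),
         ('b', 0): (0, '')}

print(transduce("abba", RULES))   # 'aaaa'
```

For a table in which every rule prints a single output symbol, interchanging the two components of each label, as described above, yields a table for the inverse transduction.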
Hence, given any transducer T9 we can construct the inverse transducer T* that maps y into x just in case T maps x into y and that maps the language L onto the set of all strings x such that Tmaps x into y € L. If T is bounded, its inverse T* may still be unbounded. If the inverse T* of T is bounded, then T is called information lossless. For general discussion of various kinds of transduction, see Schiitzenberger (I961a) and Schiitzenberger & Chomsky (1962). We shall study the effects of transduction on context-free languages in Sec. 4.5. Note that if T is a transducer mapping L onto L', where L is a regular language, then L' is also a regular language. Note also that for each regular language L there is a transducer TL that will map L onto FORMAL PROPERTIES OF GRAMMARS the particular language U (alternatively U onto L), where U is the set of all strings in the output alphabet (alternatively, the input alphabet— if the input alphabet contains only e, the transducer will, by definition, be unbounded; otherwise, it can always be bounded). These and many related facts are fairly obvious from inspection of state-diagrams. 1.6 Transduction and Pushdown Storage We have described a transducer as a PDS device which never moves its storage tape to the right — that is, it never erases— on any computation step. It maps an input string x into an output string y. A general PDS device, on the other hand, uses its storage tape to determine its later steps, in particular, its ultimate acceptance of the input string x. It terminates its computation with acceptance of x only if, on termination, the contents of the storage tape is simply e, that is, the storage tape is blank. We could, therefore, think of a general PDS device as defining a mapping of the strings it accepts into the empty string e, which is the contents of the storage tape when the computation terminates with acceptance of the input. (The device would essentially represent the characteristic function of a certain set of strings.) We now go on to show how we can associate with each PDS device M a transducer T constructed so that when and only when M accepts x (i.e., maps it into e) Tmaps x into a string y which, in a sense that we shall define, reduces to e? Suppose that M is a PDS device with input alphabet AI and output alphabet Ao = {e, alt . , . , aq}. We will construct a new device M' with the input alphabet Az and the output alphabet AQ with 2q + 1 symbols, where AQ = Ao u {^/, . - - aa'}. We will treat each element a! as essentially the "right inverse" of at. More formally, let us say that the string x reduces to y just in case there is a sequence of strings zl9 . . . , zm (m > 1) such that z^ = x, zm = y, and for each z < m there are strings wi9 Wf and a^ e Ao such that zi = nvfy^'wj and zi+1 = H^HV In other words, x reduces toyifx — yorify can be formed from y by successive deletions of substrings a^a/. We say that the string x is blocked if x = yazajw, where z reduces to e and either ya reduces to e or a € Ao — {e, at}. If x is blocked, then, for all v, xv is blocked and does not reduce to e. We say that the storage tape is blocked if the string that it contains is blocked. The new device M' will be a PDS automaton which never moves its 3 The results in this section and Sec. 4.2 are the product of work done jointly with M. P. Schutzenberger. For a concise summary, see Chomsky (1962a). See Schiitzenberger (1962a,b,d) for generalizations and related results. ABSTRACT AUTOMATA 34$ storage tape to the right. 
It will be constructed in such a way that if M does not accept x then, with x as input, M' will terminate its computation before reading through x or with the storage tape blocked ; and, if M does accept x, M' will be able to compute in such a way that when it has read through all of x the storage tape will not be blocked — in fact, its contents will reduce to e. The states of M' will be designated by the same symbols as those of M, and SQ will again be the initial state. Suppose that K and Kr are tape-machine configurations of M and M', respectively, meeting the following conditions. K is attainable from the initial configuration of M. The string w contained on the storage tape of M' in K' reduces to the string y contained on the storage tape of M in K. Furthermore, if y ^ e, then w = zak for some k (i.e., it has an unprimed symbol to its extreme right). M and M' are scanning the same square of identical input tapes and are in the same internal state. In this case we say that K and K' match. Note that when K and K' match then either M has terminated with the storage tape blank (in which case the contents of the storage tape of M' is zak' which reduces to e) or M and M' are in the same situation. The instructions of M' are determined by those of M by the following rule. Let (b, St, ak) -» &, x) (6) be an instruction of M. If x ^ (Si9 . Further- more, the sequence of symbols which now appears on the tape between #'s spells a string ip that we can call the output of the machine. We can, in other words, regard it as a partial function that maps into ip under the conditions just stated. The automata obtained in this way are totally different in their behavior from those considered in Sees. 1.2 to 1.7. There is, for example, no general way to determine whether, for a given input, the device will run into a block or an infinite loop. There is, furthermore, no way to determine, from systematic inspection of the instructions for the device, how long it will compute or how much tape it will use before it gives an answer, if it does accept the string. There is no uniform and systematic way to deter- mine from a study of the rules for a Turing machine whether it will ever give an output at all or whether its output or the set it accepts will be finite or infinite; nor is it in general possible to determine by some mechanical procedure whether two such devices will ever give the same output or will accept the same set. Nevertheless, it is important to observe that a Turing machine is specified by a finite number of rules and at any point in a computation it will be using only a finite amount of tape (i.e., only a finite number of squares will appear between the bounding strings of #). Furthermore, if it is going to accept a given input, this fact will be known after a finite number of operations. However, if it does not accept the input, this fact may never be known. (If, after a certain number of steps, it has still not returned to 50, we do not know whether this is because it has not computed long enough or because it never will reach this terminal state, FORMAL PROPERTIES OF GRAMMARS and we may never know.) The study of Turing machines constitutes the basis for a rapidly developing branch of mathematics (recursive function theory). For surveys of this field, see Davis (1958) and Rogers (1961). 1.9 Algorithms and Decidability It is interesting to observe that there are Turing machines that are universal in the sense that they can mimic the behavior of any arbitrary Turing machine. 
Suppose, in fact, that among the strings formed from an alphabet A, which we can assume to be the common alphabet of all Turing machines, we select an infinite number to represent the integers; for example, let us take al = 1 and regard the string 1 ... 1 consisting of n successive 1's as representing the number n. (We shall, henceforth use the notation xn for the string consisting of n successive occurrences of the string x.) Suppose now that we have an enumeration Ml9 M2, . . . of all of the infinitely many Turing machines. This enumeration can be given in a perfectly straightforward and definite way. Then there is a universal Turing machine Mu with the following property: Mu will accept the input lnax and give, with this input, the output y, just in case Mn accepts x and gives y as the corresponding output (where a is some other- wise unused symbol). We can think of the input tape to Mu as containing the stored program ln, which instructs Mu to act in the manner of the nth Turing machine when any input x is written on its tape. Each Turing machine can thus be regarded as one of the programs for a universal machine Mu. An ordinary digital computer is, in effect, a universal Turing machine such as Mu if we make the idealized assumption that memory (e.g., new tape units) can be added whenever needed, without limit, in the course of a particular computation. The program stored in the computer instructs it as to which Turing machine it should mimic in its computations. Given a set £ of strings, we are often interested in determining whether a particular string x is or is not a member of S. Furthermore, we are often interested in determining whether there is a mechanical (effective) pro- cedure by which, given an arbitrary string x9 we can tell after a finite amount of time whether or not # is a member of S. If such a procedure exists, we say that the set 2 is decidable or computable, that there is an algorithm for determining membership in 2, or that the decision problem for S is (recursively) solvable (these all being equivalent locutions). An algorithm for determining whether an arbitrary element a; is a member of the set S can be regarded as a computer program with the following prop- erty. A computer storing this program, given x as input, is guaranteed to terminate its computation with the answer yes (when x e S) or no (when ABSTRACT AUTOMATA 355 x $ 2). We must assume here that the computer memory is unbounded ; that is to say, we are dealing with an idealized digital computer — a universal Turing machine. Suppose that we now revise our characterization of Turing machines just to the following extent: we add to the control unit a designated state S* and we say that, given the input x, the device accepts x if it returns to S0, as before, and that it rejects x if it reaches S*. Call such a device a two-output Turing machine. Given Turing machines Mt and M2, which accept the disjoint sets Sj and 22, respectively, it is always possible to construct a two-output machine M3 that will accept just Sx and reject just s,. With this revision, consider again the question of decidability. A set 2 is decidable if there is a computer program that is guaranteed to determine, of an arbitrary input x, whether x is a member of 2, after a finite number t(x) of steps. We can now reformulate the notion of decidability as follows : a set is decidable if there is a two-output Turing machine that will accept all its members and reject all its nonmembers. 
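The argument sketched in the next paragraph, that a set is decidable whenever both it and its complement can be recursively enumerated, can be pictured in a few lines of Python. The sketch below is ours, not the chapter's: the two generator arguments stand in for the machines M_1 and M_2, they are assumed to enumerate disjoint and jointly exhaustive sets, and the even/odd example is purely hypothetical.

```python
from itertools import count

def decide(x, enumerate_set, enumerate_complement):
    """Run the two enumerations 'synchronously' and answer as soon as x
    appears in one of them.  Since every x eventually appears in exactly
    one of the two enumerations, the loop always terminates."""
    gen_in, gen_out = enumerate_set(), enumerate_complement()
    while True:
        if next(gen_in) == x:
            return True
        if next(gen_out) == x:
            return False

# Hypothetical example: the even numbers and their complement, the odd numbers.
def evens(): return (2 * n for n in count())
def odds():  return (2 * n + 1 for n in count())

print(decide(10, evens, odds), decide(7, evens, odds))   # True False
```

Each pass of the loop advances both enumerations one step, which is the "synchronous" computation of M_1 and M_2 described below.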
A set is called recursively enumerable just in case there is a Turing machine that accepts all strings of this set and no others. This machine is then said to recursively enumerate (generate) the set. A set is recursive if both it and its complement are recursively enumerable. It is clear that a set is recursive just in case it is decidable in the sense just defined, for, if 2 is recursive, then there is a Turing machine M^ that accepts 2 and a Turing machine M% that accepts its complement. Consequently, as previously observed, we can construct a two-output machine M% that will accept the set 2 enumerated by M1 and reject the set S (= complement of 2) enumerated by Af2. To determine whether a string x is in 2, we can write x on the input tape and set Mz to computing. This amounts to setting both M: and M2 to computing synchronously with input x. After some finite time, one or the other machine will have come to a stop and have accepted x; therefore, after some finite time, M3 will either have accepted or rejected x, and we shall know whether x is in 2 or its comple- ment. On the other hand, if 2 is decidable, then it is recursive, since the two-output device, which accepts all its members and rejects all non- members, can easily be separated into two Turing machines, one of which accepts all members and the other, all nonmembers. A classic result in Turing-machine theory is that there are recursively enumerable sets that are not recursive. In fact, some rather familiar sets have this property. Thus the set of valid schemata of elementary logic (i.e., the theory of and, or, not, some, all— called first-order predicate calculus or quantification theory) is recursive if we restrict ourselves to one-place predicates but nonrecursive (though recursively enumerable) FORMAL PROPERTIES OF GRAMMARS if we allow two-place predicates, that is, relations. Or consider a for- malized version of ordinary elementary number theory. If it meets the usual conditions of adequacy for axiomatic systems, then it is known that although the set of theorems is recursively enumerable it is not recursive. In fact, elementary number theory has the further property that there is a mechanical procedure/such that if Mi is any two-output Turing machine that accepts all of the theorems deducible from some consistent set of axioms for this theory and rejects the negations of all of these theorems, then/(/) is a formula (in fact, a true formula) of elementary number theory that is neither accepted nor rejected by Mr There does not exist ^two- output machine which accepts a set S and rejects its complement S if 2 contains all of the theorems of this system and none of their negations. There are, furthermore, perfectly reasonable sets that are not recursively enumerable; for example, the set of true statements in elementary number theory or the set of all satisfiable schemata of quantification theory. The notions of decidability and existence of algorithms can be imme- diately extended beyond the particular questions just discussed. Let us define & problem as a class of questions, each of which receives a yes-or-no answer. The problem is called recursively solvable or decidable if there is a mechanical procedure, which, applied to any question from this class, will, after a finite time, give either yes or no as its answer. Decidability of problems can, in general, be formulated in the manner indicated pre- viously. To consider a case to which we will return, suppose that we are given a set of generative grammars Gl9 G2, ... 
of a certain sort (note that to be given a set is, in this context, to be given a device that recursively enumerates it), and that we are interested in determining whether the equivalence problem is recursively solvable. This is the problem of deter- mining, of an arbitrary pair of grammars Gi9 Gjy whether or not they generate the same language. We can formulate the question as follows : is there a two-output Turing machine that accepts the string Val* if Gi and Gj generate the same language and rejects the string P'aP" if Gi and G,- generate different languages? The equivalence problem is recursively solvable for the set G19 (72, . . . just in case there is such a device. Or con- sider the problem of determining, of an arbitrary Giy whether it generates a finite set. This problem is recursively solvable just in case there is a two-output Turing machine which accepts the string 1* in case -> % where and y are strings of symbols of a finite vocabulary V. In this section we shall consider only these systems, pre- supposing the notions defined in Chapter 11, Sec. 4, and following the notational conventions established there. In particular, we assume to be given a universal terminal vocabulary VT and a universal nonterminal vocabulary VN disjoint from VT9 where V = VT u VN. When no further constraints are placed on the kinds of rewriting rules that are available, the grammars defined in Chapter 11, Sec. 4, can be called unrestricted rewriting systems. The problem of relating these un- restricted systems to the theory of automata, as outlined in Sec. 1, is quite simple. In fact, any Turing machine can be represented directly as an unrestricted rewriting system and conversely. We can state this fact as follows: 358 FORMAL PROPERTIES OF GRAMMARS 01 #2 Fig. 8. Successive tape-machine configurations of a Turing machine. Theorem 7. L is a terminal language generated by an unrestricted rewriting system if and only if L is a recursively enumerable set of strings {##!#, ##2#j • • • }> where ^ contains no occurrences of #. For a proof of this result, due to Post, see Davis (1958, Chapter 6, Sec. 2). The basic idea of the proof is that successive tape-machine configurations of a Turing machine can be determined by rewriting rules -+ip. Suppose, for example, that a Turing machine M has the rule (/,/, Ar, — 1, m), indi- cating that when it is in state Sj scanning the symbol ai it replaces at by am and switches to state Sk as the tape moves right one space. At one moment, then, the machine will be as indicated in Fig. 8*2 and at the next moment as indicated in Fig. 8&, where ax, a2, . . . , j8l5 /?2, . . . are symbols on the tape. Consider now a vocabulary V consisting of the alphabet A of the Turing machine M , the symbol #, the symbol S, and symbols designating the states of M. In the present example the configuration of Fig, 8 of symbols of its alphabet A (not containing #) with an infinite string of #'s to the left and to the right of . It begins operation in state S09 scanning the leftmost symbol of 9 and continues until it blocks or returns to 50. In the latter case we say that it accepts (generates) the string #<£# (it may, of course, continue to compute indefinitely). Suppose that we now modify the description in the following way. When a Turing machine M is about to return to state S0, it moves the tape instead until the reading head is scanning the rightmost square not containing #. 
In this square it prints # and moves the tape right, again printing # and moving the tape right, etc., until it reaches the last square not containing #, where it prints the symbol S, at which point it blocks in a designated final state. This terminal routine is easily ar- ranged. With this modification, the set of generated languages and the character of the rules is left unchanged^ The machine M accepts (generates) #<£#, just in case, given the input , it computes until it blocks with the tape containing all #5s except for a single occurrence of S. Furthermore, at each step of the computation there is on the tape a string containing no occurrences of #, flanked on both sides by infinite strings of #. In the foregoing manner we can completely describe the behavior of such a Turing machine M by a particular set S of rewriting rules that will con- vert a string #S<$# to #S# just in case M accepts #j>#. Because of the deterministic character of M, the set S is monogenic; that is to say, given a string %, there is at most one string y) that can result from application of S to x- Now consider the set of rules S' containing y -> % just in case X — >• y is in 2, and containing also a final rule #SQ -> #. This set S' of rules is an unrestricted rewriting system. A #5#-derivation of this system will terminate in #$# just in case M accepts #$#. From Theorem 7 we see that unrestricted rewriting systems are universal. If a language can be generated at all by what in the intuitive sense is a finitely statable, well-defined procedure, it can be generated by a grammar of this type. However, such systems are of little interest to us in the present context. In particular, there is no natural and uniform method to associate with each terminated derivation a P-marker of the desired kind for its terminal string. It is in this sense that an arbitrary Turing machine, or an unrestricted rewriting system, is too unstructured to serve as a grammar. By imposing further conditions on the grammatical rules, we arrive at systems that have more linguistic interest but less generative power. As remarked in Sec. 1.8, a particular Turing machine can be regarded as nothing more or less than a program of a perfectly arbitrary kind for a 360 FORMAL PROPERTIES OF GRAMMARS digital computer with potentially infinite memory. Obviously, a computer program that succeeded in generating the sentences of a language would be, in itself, of no scientific interest unless it also shed some light on the kinds of structural features that distinguish languages from arbitrary, recursively enumerable sets. If all we can say about a grammar of a natural language is that it is an unrestricted rewriting system, we have said nothing of any interest. (See Chomsky, 1961, 1962b, for further discussion). The most restrictive condition that we shall state will limit grammars to devices with the generative capacity of strictly finite automata. We shall see that these devices cannot, in principle, serve as grammars for natural languages. Consequently we are interested primarily in devices with more generative capacity than finite automata but that are more structured (and, presumably, have less generative capacity) than arbitrary Turing machines. In other words, we shall be concerned with devices that fall into the general area of restricted-infinite automata. 3. CONTEXT-SENSITIVE GRAMMARS Suppose we take a system G that meets all the requirements defining an unrestricted rewriting system and impose on it the following further condition : Condition I. 
If = al . . . am and V = *i - - - bn. In brief, Condition 1 requires that if -> ip is a rule of the grammar then y is not shorter than . A grammar meeting Condition 1 we call a type 1 grammar. Henceforth, for each Condition / that we establish we shall call the grammars meeting it type i grammars', a language generated by a type / grammar we call a type i language. An unrestricted rewriting system we call a type 0 grammar. The conditions that we shall consider are in- creasingly strong; that is to say, for each / a type / + 1 grammar will also satisfy the defining condition for a type / grammar, but some type z gram- mars will not qualify as type / + 1 grammars. Condition 1 imposes an essential limitation on generative capacity. Since, in derivations of type 1 grammars, each line must be at least as long as the preceding line, the following theorem is obvious: Theorem 8. Every type 1 languag'e is recursive. In fact, given a type 1 grammar G and a string x, only a finite number of derivations (those whose final lines are not longer than x) need be tested to determine whether G generates x. Not all recursive languages are CONTEXT-SENSITIVE GRAMMARS $6l generated by type 1 grammars, however, as can be shown by a straight- forward diagonal argument. Although type 1 grammars generate only recursive sets, there is a certain sense in which they come close to generating arbitrary recursively enumerable sets. In order to simplify the discussion of this matter, we restrict ourselves to sets of positive integers (without loss of generality, since any set of finite strings can be effectively coded into a set of integers). Recall once again the characterization of a Turing machine given in Sec. 1 .8. Here we consider Turing machines with the alphabet {1 , e}, where e is the identity element (i.e., a square containing e is regarded as blank and replacement of 1 by e constitutes erasure). We can assume that each particular machine M operates in the following manner: given the input sequence \i (i.e., / successive occurrences of 1 flanked by infinite strings of #), where / > 1 , M begins to compute in its initial state S0 while scanning the leftmost occurrence of 1 , and continues until it terminates in a desig- nated final state SF. At this point the tape will contain the string <£ flanked by #*s, where $ contains j occurrences of the symbol 1 and k occurrences of e. We can construct M so that it will not enter state SF unless y > 1 , and we can assume without loss of generality that, at termina- tion in SF, M is scanning the leftmost symbol distinct from # and that all occurrences of 1 precede all occurrences of e (that is to say, we can easily add to each M a component that will automatically convert it to this final configuration when it terminates). Thus the output of M, if M ever reaches state SF when computing with the input 1 \ will be the string IV (j > 1, k > 0). We have already observed (in Sec. 1 .2) that M may never terminate for certain (or all) inputs and that there is no algorithm for determining from the rules ofM whether it will terminate with a particular input or even whether there is some input for which it will terminate. If M does terminate with the output IV, given input 1% we say that M maps the integer / into the integer/. This description is quite general, and any Turing machine that represents a partial function (i.e., a function that may not be defined for certain elements of its domain) mapping positive integers into positive integers can be described in this way. 
It is well known that the range of a Turing machine, so described, is a recursively enumerable set and that each recursively enumerable set is the range of some such Turing machine. Observe that such a Turing machine never writes the symbol #, although it can read and erase # (thus extending the portion of the tape available for computation). Consequently, if M terminates with the output IV given the input 1*, then j + k > /. Furthermore, we see immediately that if, in the manner indicated in Sec. 2, we construct a set of rewriting rules that mirror the behavior of M exactly, then this set of rules will constitute a monogenic type 1 grammar G. Condition 1 is satisfied because FORMAL PROPERTIES OF GRAMMARS the amount of tape actually being used never decreases. If M computes the output ljek from the input 1 \ then G will produce a #S0 1 ^-derivation terminating in the string #Spljek# and conversely. Although M may enumerate an arbitrary recursively enumerable set (as the range of the function it represents), the set of outputs that it generates is recursive. Suppose now that we select a Turing machine Af, and associate with it a type 1 grammar G in the manner just described. Suppose further that we form G*, which consists of the rules of G along with the following four rules for generating appropriate "initial strings" for G: * ; These rules provide a terminated #S#-derivation for each string #50P'#, where / > 1 , and only for these strings. Consequently, the complete grammar G* will provide a terminated #5#-derivation for a string #$# just in case = SF\j&(j^ 1), and, for some /, M terminates with the output 1'e*, given the input 1*'. Thus, given an arbitrary Turing machine enumerating the recursively enumerable set 2 as its range, we can construct a type 1 grammar that will generate all and only the strings #SFy#, where e S, and y> is a string of e's (of length computable from -+ y> is a rule, then there are strings fa, %& A, w (where A is a single symbol and co is not null} such that = %iA%z and y> = XiMX2- In a type 2 grammar each rule -> y (that is, %iA%* -> Xi^Xz) can be regarded as asserting that A can be rewritten co when in the context fa-%2, where fa or £2 may, of course, be null. We refer to grammars meeting this condition as context-sensitive grammars. Rules of this kind are quite common in actual grammatical descriptions. They can be used to indicate selectional and contextual restrictions on the choice of cer- tain elements or categories, as observed in Chapter 11. In a context- sensitive grammar we can identify the class Vy of nonterminal symbols as the class containing exactly those symbols A such that the grammar contains a rule %iA%2 -+ %ico%2. We shall henceforth assume this conven- tion. Clearly, then, every type 2 grammar is a type 1 grammar and not con- versely. Nevertheless, Condition 2 does not restrict generative capacity. Theorem II. If G is a type 1 grammar, then there is a type 2 grammar G' that is weakly equivalent to G. The proof, which is perfectly straightforward, can be found in Chomsky (1959a). Since the correspondence given by Theorem 1 1 is effective, it follows at once that the undecidable problems concerning type 1 grammars remain undecidable when we restrict ourselves to context-sensitive grammars. Theorem 12. 
There is no algorithm for determining, given two context- sensitive grammars, whether these grammars are equivalent, whether either generates a null, finite, or infinite set of strings, or whether an arbitrarily selected string appears as part of a line of a #S#-derivation of either grammar or as part of a sentence of the generated language. Here again we find that little of a general nature can be discovered by systematically studying the rules of these systems. Theorem 12 has an important consequence for the theory of constituent- structure grammar. Any theory of grammar must provide a general method for assigning structural descriptions to sentences, given a grammar; and, in the case of constituent-structure grammars, this can be done in a natural way only if each such grammar meets the Condition C that there are no symbols A, B and no string co such that ABy, Aa>By> are successive lines of a derivation of a terminal string (cf. Footnote 3, Chapter 11, Sec. 4). Hence it is reasonable to require of a well-formed constituent-structure 364 FORMAL PROPERTIES OF GRAMMARS grammar, in addition to the conditions already given, that it meet Con- dition C. It then follows immediately from Theorem 12 that well-formed- ness is an undecidable property of context-sensitive grammars. This fact is sufficient to rule out the theory of context-sensitive grammar, in its present form, as a possible theory of grammar. Clearly, a general theory of gram- mar must provide a recursive class of well-formed grammars as potential candidates for specification of some natural language; that is, there must be an algorithm for determining whether a particular set of rules constitutes a well-formed grammar in accordance with this theory. The theory of context-sensitive grammar, in its present form, does not meet this require- ment (though it can be modified in such a way as to meet it. See Footnote 3, Chapter 11, Sec. 4). Note that the theory of transformational grammar is not subject to this difficulty if, as suggested in Chapter 11, Sec. 5, its constituent-structure component generates only a finite set of strings (similarly, if it generates an infinite set of a sufficiently restricted kind, a restriction that is feasible if transformational devices are available to extend generative capacity). In Chapter 11, Sec. 3, we used as illustrations three artificial languages, L!, L2, and Z,3, all with the vocabulary {a, £}, and we demonstrated that L! and L2 can have what we are now calling context-sensitive grammars. (In fact, they have grammars meeting Condition 2 where Xi an<^ £2 are null). For L3, however, we gave a simple grammar that was not an unrestricted rewriting grammar at all. We know, of course, that an unrestricted rewriting grammar for L3 must exist, since it is obviously a recursively enumerable — in fact, a recursive set. It is interesting to observe, however, that L3 can indeed be generated by a context-sensitive grammar, although a much more complicated one is required for L3 than for Lx or L2. This follows from a general property of context-sensitive grammars, to which we now turn. Consider the following set of rules : Rl : CD A -> CEAA ; CDS -> CEBB R2 : CEA-+ ACE; CEB -> BCE R3: £a/3-^/3£a R4: £a#->Z)a# (13) R5: aD->jDa R6: A-+A; B-+B. In these rules, the variables a and ft range over {A, B, F}. So, for example, Rule R5 is actually to be regarded as the set of three rules: AD-> DA; BD-+DB; FD-+DF. 
CONTEXT-SENSITIVE GRAMMARS Given a string CD$F#, where <£ is any string of >Ts and J?'s, the rules in Example 13 will apply in a unique order (except for some freedom in the case of R6) to produce, finally, (14) at which point none of these rules applies. In short, Rules 13 describe a copying machine. Given such a copying machine, it is not a difficult task to use it as the basis for a grammar that will generate L3. Moreover, since all of these rules meet Condition I, it will be a type 1 grammar. From Theorem 11, therefore, the following theorem is derived. Theorem 13. There is a context-sensitive grammar G that generates L2. (Chomsky, 1959a.) This grammar will be considerably more complex than the grammar (12) proposed in Chapter 11, Sec. 3, since, in particular, it must include a set of rules which has the effect of the copying machine of Rules 13. The grammar (12) in Chapter 11> Sec. 3, can easily be redescribed as a transformational grammar. Here, then, is an elementary, artificial example of the simplification that can often be achieved by extending the scope of grammatical theory to include transformational grammars of the kind described in Chapter 11, Sec. 5. It is important to observe that the ability of a context-sensitive grammar to generate such languages as L3 represents a defect rather than a strength. This fact becomes clear when we observe how a context-sensitive copying device actually functions. The basic point is that it is possible to achieve the effect of a permutation AB -+ BA within the limitations of a context- sensitive grammar; but, when, with a sequence of context-sensitive rules,. we succeed in rewriting AB as BA, we find that in the associated P-marker the symbol B ofBA is of type A (i.e., is traceable back to A in the associated tree) and the symbol A ofBA is of type B. For example, if we were to use such rules to convert John will arrive to mil John arrive 9 we would be forced to assign a P-marker to will John arrive that would provide the structural information that will in this sentence is a noun phrase (being traceable back to the symbol NP that dominates John) and that John is a modal auxiliary, contrary to our intention. Were we to attempt to construct a context- sensitive grammar for English, there would be no natural way to avoid this totally unacceptable consequence. (Note that if will John arrive is derived from John will arrive by a grammatical transformation, in the manner described in Chapter 11, Sec. 5, this counterintuitive consequence does not result). This observation suggests that it might be important to devise a further condition on context-sensitive grammars that would exclude permutations but would still permit the use of rules to limit the rewriting of certain FORMAL PROPERTIES OF GRAMMARS symbols to a specific context. A very natural restriction that would have this effect is the following, which has been proposed by Parikh (1961): Condition 3. G is a type 2 grammar containing no rule yu\Ay^ — > Zi(o%2> where o) is a single nonterminal symbol (i.e., co e KY). In a type 3 grammar no symbol can be rewritten as a single nonterminal symbol in any context. With this restriction, it is impossible to construct a sequence of rules with the effect of a simple permutation AB -> BA. Consequently, the copying machine described by Rules 13 cannot be constructed, and the unwanted consequences do not ensue. Presumably, L3 cannot be generated by a type-3 grammar. Certainly it cannot be by the method previously described. 
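The point about permutations can be made concrete with a short sketch. The three rules below are a standard illustration, not rules given in the chapter: each rewrites a single symbol as a single nonterminal in a one-symbol context, and together they achieve the effect of the permutation AB → BA, which is exactly the kind of rule sequence that Condition 3 excludes.

```python
def derive(string, rules):
    """Apply context-sensitive rules, given as plain substring rewrites
    chi1 A chi2 -> chi1 omega chi2, one at a time (leftmost applicable rule
    first), and return the successive lines of the derivation."""
    lines = [string]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in lines[-1]:
                lines.append(lines[-1].replace(lhs, rhs, 1))
                changed = True
                break
    return lines

RULES = [("AB", "AX"),   # rewrite B as X in the context A__
         ("AX", "BX"),   # rewrite A as B in the context __X
         ("BX", "BA")]   # rewrite X as A in the context B__

print(derive("AB", RULES))   # ['AB', 'AX', 'BX', 'BA']
```

Tracing the derivation exhibits the difficulty noted above: the final B is introduced by rewriting the original A, and the final A descends, through X, from the original B, so the associated P-marker traces each symbol of BA back to the wrong source. Since each of the three rules rewrites a symbol as a single nonterminal, none of them is permitted under Condition 3.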
Condition 3, as it stands, is too strong to be met by actual grammars of natural languages, but it can be revised, without affecting generative capacity, to be perhaps not unreasonable for the construction of grammars of languagelike systems. Suppose that we allow the grammar G to contain a rule χ1Aχ2 → χ1ωχ2 only when ω is either terminal (as in Condition 3) or when ω dominates only a finite number of strings in the full set of derivations (and P-markers) constructible from G. This essentially amounts to the requirement that if a category is divided into subcategories, these subcategories are not phrase types but word or morpheme classes. To the extent that systems of the kind we are now discussing are at all useful for grammatical description, it seems likely that the particular subclass meeting this condition will in fact suffice. Only one nontrivial property of type 3 grammars is known, namely, that stated in Theorem 14. This class of grammars merits further study, however. It seems that Condition 3 provides a reasonably adequate formalization of the set of linguistic notions involved in the richest varieties of immediate constituent analysis.

4. CONTEXT-FREE GRAMMARS

Consider next the class of grammars meeting the following condition:

Condition 4. If φ → ψ is a rule, then φ is a single (nonterminal) letter and ψ is nonnull.

Thus each rule of the grammar states that a certain nonterminal symbol can be rewritten as a string of symbols, irrespective of the context in which it occurs. A grammar meeting Condition 4 we call a context-free grammar. A language generated by a context-free grammar is called a context-free language. (Recall that although the rules of a context-free grammar are applicable irrespective of context, nevertheless there can be, and usually are, strong contextual constraints among the elements of the terminal string.)

Concerning context-free grammars quite a bit is now known. We shall sketch the major results here, referring occasionally to more detailed presentations elsewhere for proofs and further discussion. It is immediately clear that if in the statement of Condition 4 we drop the requirement that ψ be nonnull, then the generative capacity of the class of context-free grammars is unchanged (except that the "empty" language {e} can be generated). See Bar-Hillel, Perles, & Shamir (1960, Sec. 4). We can also, without affecting generative capacity, impose the requirement that there be no rule of the form A → B in the grammar (cf. Chomsky, 1959a), so that context-free languages also meet Condition 3. Although all context-free (type 4) languages are type 3 languages, the converse is not true.

Theorem 14. The language #acnf*+*n dnb# is a type 3 language that cannot be generated by a context-free grammar.

The proof is due to Parikh (1961). It is much easier to find examples of type 2 languages (languages generated by the full class of context-sensitive grammars) that cannot be generated by any context-free grammar. In particular, among the languages L1, L2, L3 of the preceding section, although L1 and L2 are context-free languages (cf. Chapter 11, Sec. 3), L3 is clearly not.

Theorem 15. The language L3 and the language {a^n b^n a^n} are type 2 languages for which there exists no context-free grammar. (Cf. Chomsky, 1959a; Scheinberg, 1960b; Bar-Hillel, Perles, & Shamir, 1960.)

We can obtain the P-marker (cf. Chapter 11, Sec.
3) of a string generated by a context-free grammar G directly by considering a new context-free grammar G′ with the vocabulary of G and the new terminals ] and [A, where A is any nonterminal of G. Where G has the rule A → φ, G′ has the rule A → [A φ ].

If φ, ψ are successive lines of a terminated #S#-derivation of G, there exist a unique symbol α ∈ VN and unique strings φ1, φ2, χ such that φ = φ1αφ2 and ψ = φ1χφ2. Extending the vocabulary V to include brackets, as above, let us define d(φ) (read "debracketization of φ"), for any string φ in this extended vocabulary, as the string obtained by deleting all occurrences of brackets (with their subscripts) from φ. We can then define a strong φ-derivation of ψ as a sequence φ1, ..., φn such that φ1 = φ, φn = ψ, and for each i < n there are strings ω1, ..., ω5 and a symbol α such that φi = ω1ω2αω3ω4, φi+1 = ω1ω2[α ω5 ]ω3ω4, d(ω5) = ω5, and d(ω2αω3) → d(ω2ω5ω3) is a rule of G. If D is a strong #S#-derivation of ψ in G and d(ψ) is a string on VT, then d(ψ) is a terminal string generated by G and ψ can be taken as the P-marker assigned to it uniquely by the derivation D′, each line of which is the debracketization of the corresponding line of D. Furthermore, for each #S#-derivation D of a string x in G there is a corresponding strong #S#-derivation that terminates in a string which can be taken as the P-marker uniquely assigned by D to x. Thus for well-formed context-sensitive grammars we have a precise definition of generation of strong P-markers.

As we have noted above, well-formedness, in the sense just defined, is not a decidable property of context-sensitive grammars. We might define a decidable property of well-formedness for such grammars in the following way: G is well formed if it contains no rule of the form φABψ → φAωBψ. In this case a strong derivation might not be uniquely determined by a weak derivation (as it is if the former condition of well-formedness is met), but it would still be uniquely determined by a weak derivation together with the sequence of rules used to form it (neither of these being dispensable, in general). In fact, as we noted in Chapter 11, Sec. 4, there is no difficulty in imposing effective conditions that eliminate all such indeterminacy, without affecting weak generative capacity (and affecting strong generation only by the imposition of some additional and otherwise irrelevant categorization). These further conditions are, however, rather ad hoc.

4.1 Special Classes of Context-Free Grammars

In this section we shall consider various subclasses of the set of context-free grammars that are defined by additional restrictions on the set of rules. Recall that each rule is of the form A → φ, where φ is a nonnull string. We shall continue to use the notational convention of Chapter 11, Sec. 4. Recall that φ ⇒ ψ just in case there is a φ-derivation of ψ. Furthermore, the nonterminal vocabulary VN consists of exactly those symbols A that appear on the left of a rule A → φ in the grammar.

We call a rule linear if it is of the form A → xBy. It is right-linear if it is of the form A → xB, left-linear if it is of the form A → Bx. A rule is terminating if it is of the form A → x. In terms of these notions, we define several kinds of grammars.

Definition 6. A grammar G is (i) linear if each nonterminating rule is linear; in particular,
if each nonterminating rule is either right-linear or left-linear; (ii) one-sided linear if either each nonterminating rule is right-linear or each nonterminating rule is left-linear; (iii) meta-linear if all nonterminating rules are linear or of the form S → φ and, furthermore, there is no rule A → φSψ for any A, φ, ψ; (iv) normal if all nonterminating rules are of the form A → BC and all terminating rules are of the form A → a; (v) sequential if its nonterminal vocabulary can be ordered as A1, ..., An in such a way that for each i, j, if Ai ⇒ φAjψ then j ≥ i.

In the case of a linear grammar, if φ is a line of an #S#-derivation, then φ contains at most one nonterminal symbol. There is, in other words, only one point at which a derivation can branch at any step. When the first terminating rule applies, the derivation terminates in a terminal string. In a meta-linear grammar there is a bound n on the number of points at which a derivation can branch. This bound is given by the longest rule in which S appears, maximal, that is to say, in the number of nonterminal symbols that appear. When n terminating rules have applied, the derivation terminates in a terminal string.

A one-sided linear grammar is nothing other than a finite automaton, in the sense of Sec. 1.2 (cf. Chomsky, 1956). This is clear in the case of a one-sided linear grammar G with only right-linear (and terminating) rules. We can assume, without loss of generality, that each linear rule of G is of the form A → aB, where B is not the initial symbol of G, and that each terminating rule is of the form A → a. Let A1, ..., An be the nonterminal symbols of G, where A1 is the initial symbol. We can associate with G the finite automaton F with the same terminal vocabulary as G and with states A1, ..., An, A1 being the initial state. We form the rules of F in the following way. If Ai → aAj is a rule of G, then the triple (a, Ai, Aj) is a rule of F (interpreted as the instruction to F to switch from state Ai to state Aj when reading the input symbol a). If Ai → a is a rule of G, then the triple (a, Ai, A1) is a rule of F. Thus F and G terminate after having generated the same terminal string. Similarly, the rules of a finite automaton immediately give a grammar with only right-linear or terminating rules. Since, if L is a regular language, then L*, consisting of the mirror images of the strings of L (i.e., containing a_n...a_1 whenever a_1...a_n ∈ L), is also a regular language, it is clear that each one-sided linear grammar represents a finite automaton and that each finite automaton can be represented as a one-sided linear grammar.

A normal grammar is the kind usually considered in discussions of immediate constituent analysis in linguistics. The terminating rules A → a constitute the lexicon of the language, which is sharply distinguished from the set of grammatical rules A → BC, each of which gives binary constituent breaks.

The notion of a sequential grammar is motivated by the ease with which the output of such a device can be mechanically computed. Once the rules developing a certain nonterminal symbol A have been applied, thus eliminating all occurrences of A in the last line of the derivation under construction, we can be sure that A will not recur in any later lines of the derivation. A restriction of this sort (but more general, involving also transformational rules) has been suggested and studied in a linguistic application by Matthews (1962).
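Returning for a moment to the correspondence between one-sided linear grammars and finite automata, the construction can be carried out mechanically. The sketch below uses a hypothetical grammar generating a b^n (n ≥ 1) and my own encoding of the triples; as in the construction above, a terminating rule sends the device back to the initial state, which is never otherwise re-entered.

```python
# A small sketch (grammar and encoding invented for illustration): converting a
# right-linear grammar into finite-automaton triples (a, A_i, A_j) and running
# the automaton.  The grammar generates a b^n, n >= 1.
RULES = [
    ("A1", "a", "A2"),    # A1 -> a A2
    ("A2", "b", "A2"),    # A2 -> b A2
    ("A2", "b", None),    # A2 -> b        (terminating rule)
]
INITIAL = "A1"

# Triples (input symbol, current state, next state); a terminating rule
# A_i -> a sends the device back to the initial state.
TRIPLES = {(sym, src, dst if dst is not None else INITIAL) for (src, sym, dst) in RULES}

def accepts(word):
    """Nondeterministic run; accept if the device is back in the initial state
    exactly when the input is exhausted.  Because no grammar rule reintroduces
    the initial symbol, a return to INITIAL can only mean termination, so it is
    allowed only on the last input symbol."""
    if not word:
        return False
    states = {INITIAL}
    for i, ch in enumerate(word):
        last = (i == len(word) - 1)
        states = {dst for (sym, src, dst) in TRIPLES
                  if sym == ch and src in states and (dst != INITIAL or last)}
    return INITIAL in states

for w in ["ab", "abbb", "a", "ba", "abab"]:
    print(w, accepts(w))     # True, True, False, False, False
```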
Definition 7. Let

λ  = {L | L is generated by a linear grammar}
λ1 = {L | L is generated by a one-sided linear grammar}
λm = {L | L is generated by a meta-linear grammar}
ν  = {L | L is generated by a normal grammar}
σ  = {L | L is generated by a sequential grammar}
γ  = {L | L is generated by a context-free grammar}.

Thus λ1, in particular, is the class of regular languages, as we have just observed. The systems defined in Def. 6 are related to one another in the following way, from the point of view of generative capacity.

Theorem 16. (i) λ1 ⊂ λ ⊂ λm ⊂ γ (Chomsky, 1956; Schützenberger & Chomsky, 1962); (ii) ν = γ (Chomsky, 1959a); (iii) λ1 ⊂ σ ⊂ γ (Ginsburg & Rice).

The languages L1 and L2 are generated by linear grammars, but, as we observed in Sec. 1.1, cannot be generated by finite automata (one-sided linear grammars). The product L1 · L2 = {y | y = xz, x ∈ L1, and z ∈ L2} of languages L1 and L2 of λ is in λm but not, in general, in λ. The set of well-formed formulas of sentential calculus in the so-called Polish notation is an example of a language that has no meta-linear grammar but can be generated by the context-free grammar

S → CSS,  S → NS,  S → V,  V → V′,  V → p,     (15)

where the sentential letters are p, p′, p″, ...; C is the sign of the conditional, N of negation.

The languages L1 (= {a^n b^n}) and L2 (= {xx* | x a string on {a, b} and x* the reflection of x}) are in σ but not in λ1. An example of a language in γ but not in σ is the language with vocabulary {a, b, c, d} and containing the sentence (16) (which is symmetrical about c) for each sequence (k, n1, ..., n2k−1) of positive integers. This language is generated by the rules

S → adAda,  S → aSa,  S → aca,  A → bAb,  A → bdSdb     (17)

(which are, in fact, linear), but it is not generated by a sequential grammar.

Since normal grammars can generate any context-free language, the common restriction to binary constituent breaks and to a separate lexicon does not limit generative capacity beyond context-free grammars (though, of course, this restriction severely limits the system of structural descriptions that can be generated, i.e., it limits strong generative capacity).

4.2 Context-Free Grammars and Restricted-Infinite Automata

We have observed that context-free and context-sensitive grammars of the various kinds we have been considering are richer in generative capacity than finite automata, though weaker than unrestricted rewriting systems (Turing machines). In particular, we have found languages that are not regular but that can be generated by linear context-free grammars (even with a single nonterminal); we have, on the other hand, noted that even context-sensitive grammars can generate only recursive sets and not all of these. Grammars of the kind we are now considering belong to the category of restricted-infinite automata (cf. Secs. 1.3 to 1.7). In the case of context-free languages we find that each can be accepted by a linear-bounded automaton of the special kind that uses only pushdown storage (PDS) (cf. Secs. 1.4 to 1.6) and, furthermore, that only context-free languages can be accepted by such devices.

To see this, we restrict attention to normal grammars, without loss of generality [cf. Theorem 16(ii)]. We can also clearly assume, with no loss of generality, that if A → BC is a rule of a normal grammar then B ≠ C. Given such a normal grammar G, we can construct a PDS automaton that accepts the language L(G) generated by G in the following way.
For each nonterminal A of G, the control unit of M will have two states, designated Al and Ar. For each rule A → a of G, M will have the instruction:

(a, Al, e) → (Ar, e).     (18)

For each rule A → BC of G, M will have the instructions:

(e, Al, e) → (Bl, A),
(e, Br, e) → (Cl, e),     (19)
(e, Cr, A) → (Ar, e),

where e is the identity element. M is thus, in general, a nondeterministic PDS automaton. Its initial state we designate Σ, where Σ does not appear in G. We assume that the device has the instructions (e, Σ, σ) → (Sl, σ) and (e, Sr, σ) → (Σ, e), where S is the initial symbol of G, allowing it to go from Σ to Sl and from Sr to Σ, erasing σ and terminating. M accepts a string x generated by G by simply tracing systematically through the tree diagram representing the derivation of x by G, from top to bottom and from left to right.

To illustrate, consider the grammar G with the rules

S → CB,  C → AS,  A → a,  B → b,  S → c,     (20)

generating the language

{a^n c b^n},     (21)

with such typical derivations as that shown in Fig. 9 for the string aacbb. [Fig. 9. A typical derivation of a sentence in language (21).] The corresponding PDS device M will have Instructions 22 corresponding to Instruction 18 and Instructions 23 corresponding to Instructions 19:

(a, Al, e) → (Ar, e),  (b, Bl, e) → (Br, e),  (c, Sl, e) → (Sr, e).     (22)

(e, Sl, e) → (Cl, S),  (e, Cr, e) → (Bl, e),  (e, Br, S) → (Sr, e),
(e, Cl, e) → (Al, C),  (e, Ar, e) → (Sl, e),  (e, Sr, C) → (Cr, e).     (23)

In accepting the string aacbb with the derivation in Fig. 9, M will compute in the following steps, in which column one represents the input tape, with the scanned symbol enclosed in brackets, column two indicates the state of the control unit, and column three represents the contents of the storage tape.

     Input        Control Unit   PDS
 1.  [a]acbb#     Σ              σ
 2.  [a]acbb#     Sl             σ
 3.  [a]acbb#     Cl             σS
 4.  [a]acbb#     Al             σSC
 5.  a[a]cbb#     Ar             σSC
 6.  a[a]cbb#     Sl             σSC
 7.  a[a]cbb#     Cl             σSCS
 8.  a[a]cbb#     Al             σSCSC
 9.  aa[c]bb#     Ar             σSCSC
10.  aa[c]bb#     Sl             σSCSC
11.  aac[b]b#     Sr             σSCSC
12.  aac[b]b#     Cr             σSCS
13.  aac[b]b#     Bl             σSCS
14.  aacb[b]#     Br             σSCS
15.  aacb[b]#     Sr             σSC
16.  aacb[b]#     Cr             σS
17.  aacb[b]#     Bl             σS
18.  aacbb[#]     Br             σS
19.  aacbb[#]     Sr             σ
20.  aacbb[#]     Σ              e     (24)

Clearly M accepts all and only the strings generated by G, using its storage tape to store as much of the derivation of a generated string x as will be needed in later steps of the computation, as it processes x on its input tape. Furthermore, the same construction can obviously be carried out quite generally for any normal grammar. Observe also that the PDS automaton given by this construction is what in Sec. 1.4 we called a PDS automaton with restricted control. Hence we conclude the following:

Theorem 17. Given a context-free language L, we can construct a PDS automaton with restricted control that accepts L.

Matthews has shown (Matthews, 1963a, b) that this result can be extended in part to context-sensitive grammars. Given a context-sensitive grammar G, we define a left-to-right derivation (a right-to-left derivation) as one meeting this condition: if (φ, ψ) are successive lines, then φ = xAω and ψ = xχω (respectively, φ = ωAx and ψ = ωχx). Note, in particular, that in the derivations produced by the "copying device" of Example 13 there are lines in which the rewritten symbol is arbitrarily far from both the rightmost and leftmost nonterminal of these lines. In considering these questions, it is important to bear in mind that the question whether a context-sensitive grammar is strictly context-sensitive is undecidable, as has been observed by Shamir (1963).
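The computation in (24) can be reproduced mechanically. The sketch below encodes Instructions 22 and 23 and searches the configurations of the nondeterministic device; to keep it short it starts directly in Sl with an empty storage tape and accepts in Sr with empty storage, rather than using the extra state Σ and the bottom symbol σ. The tuple encoding is an assumption about the conventions of Sec. 1.4, not a quotation.

```python
# A sketch of the pushdown-storage device M for the grammar of (20)
# (S -> CB, C -> AS, A -> a, B -> b, S -> c, generating a^n c b^n).
# An instruction (input, state, top-of-storage) -> (state, pushed) is written
# as a 5-tuple; "e" means that nothing is read, inspected, or pushed.
E = "e"
INSTR = [
    # Instructions 22 (from the terminating rules A -> a, B -> b, S -> c):
    ("a", "Al", E, "Ar", E), ("b", "Bl", E, "Br", E), ("c", "Sl", E, "Sr", E),
    # Instructions 23, first row (from S -> CB):
    (E, "Sl", E, "Cl", "S"), (E, "Cr", E, "Bl", E), (E, "Br", "S", "Sr", E),
    # Instructions 23, second row (from C -> AS):
    (E, "Cl", E, "Al", "C"), (E, "Ar", E, "Sl", E), (E, "Sr", "C", "Cr", E),
]

def accepts(word):
    """Nondeterministic search over configurations (input position, state, storage)."""
    start = (0, "Sl", "")
    seen, agenda = {start}, [start]
    while agenda:
        pos, state, store = agenda.pop()
        if pos == len(word) and state == "Sr" and store == "":
            return True
        for inp, st, top, st2, push in INSTR:
            if st != state:
                continue
            if inp != E and (pos == len(word) or word[pos] != inp):
                continue
            if top != E and (store == "" or store[-1] != top):
                continue
            new_store = (store if top == E else store[:-1]) + (push if push != E else "")
            cfg = (pos + (inp != E), st2, new_store)
            if cfg not in seen:
                seen.add(cfg)
                agenda.append(cfg)
    return False

for w in ["aacbb", "acb", "c", "aacb", "bca"]:
    print(w, accepts(w))     # True, True, True, False, False
```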
We have already shown in Sec. 1.6 that corresponding to each PDS automaton M there is a transducer T with the following property: T maps x into a string y that reduces to e, if and only if M accepts x. Con- sequently, we now see that, given a context-free grammar G, there is a transducer T that maps x into a string y that reduces to e, just in case G generates x. However, we can achieve a somewhat stronger result by carrying out the construction of T from G directly. Let us define a modified normal grammar as a normal grammar that contains no pair of rules A -+ BC, D^CE for any nonterminals A9 B, C, D, E; that is, in a modified normal grammar we can tell unambiguously, for each nonterminal, whether it appears on a left branch or a right branch CONTEXT-FREE GRAMMARS of a derivation; no nonterminal can appear on both a left and a right branch. Clearly, there is a modified normal grammar equivalent to each normal grammar, hence to each context-free grammar. Suppose that we now apply a construction much like that of Instructions 18 and 19 to a modified normal grammar G, giving a transducer T. Corresponding to each nonterminal A of G, T will have two states Al and Ar. In addition, T has the initial state 2 and the instructions (e, 2) ~> (Sl9 e) and (e, Sr) -> (S, a'), where S is the initial symbol of G. The input alphabet of T is the terminal vocabulary VT of G. Its output alphabet includes, in addition, a symbol a' for each a e VT, a pair of symbols A9 A' for each nonterminal A of G, and a, a1. When A -+ a (a e KT) is a rule of (/, T will have the instruction '); (25) when v4 — * BC is a rule of G, Twill have the instructions (e9Ad-+(Bl9A\ (e9BJ-+(Cl9e)9 (26) (e,Cr)-+(Ar,Af). The transducer T so constructed has the essential property of the transducer associated with a PDS device by the construction presented in Sec. 1.6, namely, G generates x if and only if T maps x into a string y that reduces to e by successive deletions of pairs ococ'. For example, when G is as in Example 20 and T has the input #aacbb# (with the derivation of Fig. 9), it will compute in essentially the manner of M in (24), terminating with the string aSCAaa'A'SCAaa'A'Scc'S'CBbb'B'S'CBbb'B'S'o', (21) which reduces to e, on its storage tape. Let us now extend this notion of "reduction" and define K as the class of strings in the output alphabet Ao of T that reduce to e by successive cancellation of substrings oca' or a' a (a, a' e Ao). We are thus essentially regarding a, a' as strict inverses in the free group & with the generators a E A0. But since the grammar G from which T was constructed was a modified normal grammar, the output of T can never, in fact, contain a substring a'ora, where a; reduces to e. Hence this extension of the notion "reduction" is harmless, and under it the transducer T will still retain the property that G generates x if and only if T maps x into a string y that reduces to e. We have assumed throughout (cf. Chapter 11, Sec. 4), as is natural, that the vocabulary Kfrom which all context-free grammars are constructed is a fixed finite set of symbols, so that K is a particular fixed language in the FORMAL PROPERTIES OF GRAMMARS vocabulary V containing V, a, a' and a symbol a' for each a e V. Let be the homomorphism (i.e., the one-state transduction) such that <£(a) -> a for a e KT and <£(a) = e for a £ FT. Let U be the set of all strings on V. 
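The notion of reducing a string to e by successive deletions of pairs αα′, on which both of these constructions rely, is just the familiar bracket-matching computation and can be stated in a few lines. The sketch below implements only this one-sided cancellation (αα′ but not α′α); the encoding of primed symbols as a trailing apostrophe is mine.

```python
# A minimal sketch (encoding mine): does a string reduce to e by successive
# deletions of adjacent pairs a a' (a symbol immediately followed by its primed
# inverse)?  A primed symbol is written as the symbol followed by "'".
def reduces_to_e(string):
    stack = []
    i = 0
    while i < len(string):
        sym = string[i]
        if i + 1 < len(string) and string[i + 1] == "'":
            i += 2
            # sym' must cancel an immediately preceding unprimed sym
            if stack and stack[-1] == sym:
                stack.pop()
            else:
                return False
        else:
            i += 1
            stack.append(sym)
    return not stack

print(reduces_to_e("aa'"))        # True
print(reduces_to_e("abb'a'"))     # True  (cancels from the inside out)
print(reduces_to_e("ab'a'"))      # False
print(reduces_to_e("aa'b"))       # False (a symbol is left over)
```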
Ob- serve now that where G and T are as have been described and L(G) is the language generated by G, we have, in particular, the result that L(G) = fiK n r(C/)]. It is a straightforward matter to construct a PDS automaton that will accept K; consequently, by Theorem 6, Sec. 1.6, K is a context-free language. As we have observed in Sec. 1.6, T(U) is a regular language. We shall see directly that the intersection of a context-free language with a regular language is a context-free language and that transduction carries context-free languages into context-free languages (Sec. 4.6). Given K and (f> as in the preceding paragraph, let us define y(L) for any language L as y(L) = fi(K n L). Summarizing the facts just stated, we have the following general observation: Theorem 1 8. For each regular language R, ip(R) is a context-free language; for each context-free language L there is a regular language R such that L = yCR). Thus a context-free language is uniquely determined by the choice of a certain regular language (i.e., finite automaton), and each such choice produces a context-free language, given K, . This provides a simple algebraic characterization of context-free languages. Theorem 18 can be extended immediately to the result that each context- free language L is given as the homomorphic image of the intersection of K with some 1-limited language D. (Recall that, as noted in Sec. 1.2, each regular language is the homomorphic image of some 1-limited language.) Furthermore, the various categories of context-free languages that we have defined are easily definable by imposition of simple conditions on D (cf. Schiitzenberger & Chomsky, 1962, for details). We know from Sec. 1.2 that regular languages consist of strings with a basically periodic structure. From the role of Km characterizing context-free languages, we see that, in a sense, symmetry of structure is the fundamental formal property of the strings of a context-free language (and the substrings of which they are constituted). We might say, rather loosely, that to the extent that the character of some aspect of serially ordered behavior is determined by conditions on contiguous parts (e.g., associative linkage), it is natural to regard the organism carrying out this behavior as essentially a limited automaton; to the extent that this behavior is periodic and rhythmic [e.g., in the case of the examples offered by Lashley (1951) in his critique of the "associative chain" theory], the organism is performing in the manner of a strictly finite automaton; to the extent that such behavior exhibits hier- archic organization and symmetries, the organism is performing in the CONTEXT-FREE GRAMMARS 377 manner of a device whose intrinsic competence is expressed by a context- free grammar. Naturally this brief (and loose) classification does not exhaust the possibilities for complex, serially ordered, and integrated acts, and it is to be hoped that, as the theory of richer generative systems (in particular, for the case of language, transformational grammars) develops, deeper and more far-reaching formal properties of such behavior will be revealed and explained. Note that String 27 is essentially the structural description of the input string #aacbb# corresponding to Fig. 9. Specifically, String 27 becomes a structural description of the form described on p. 367 under the homomorphism/defined as follows : for a 6 VT9 /(a) = a and/(a') = e; for ae Fv, /(a) = [a and /(a') = ]. 
This amounts to replacing T with Instructions 25 and 26 by the transducer T' identical with T except that it never prints a' (for a e VT) and that it prints ] instead of a' for each a e Vx. As an immediate corollary of Theorem 17, then, we have the following: Theorem 1 9. Given a context-free language L, we can construct a modified normal grammar G generating L and a transducer T with the following property: if G generates x with the structural description , then T maps x into $ ; ifT maps x into $ and reduces to e under successive cancella- tion of substrings, [a a ], where a e VT, then G generates x with the struc- tural description . The transducer T guaranteed by Theorem 19 is thus, in a sense, a "recogni- tion routine" (i.e., a perceptual model) that assigns to arbitrary sentences their structural description with respect to G. It is not, however, a strictly finite recognition routine because of the condition that the output must reduce to e. We shall return (Sec. 4.6) to the problem of constructing an optimal, strictly finite recognition routine for context-free grammars. We know, as a result of this section, that there is a mechanical procedure for constructing a recognition routine with PDS corresponding to each normal context-free grammar, hence to each context-free language in at least one of its grammatical representations (and, one would conjecture, no doubt in all). To summarize, we have the following results. There is a fixed homo- morphism/such that for any regular language R we can find a 1 -limited language L such that R =/(L). There are fixed homomorphisms g1? g& such that given any context-free grammar Gf generating L(Gr) there is a modified normal grammar G generating the language L(G) = L(G') and generating the set of structural descriptions S(G), and there is a 1 -limited language L such that L(G) = g^K n L) and S(G) = g2(K r\ L\ where K is the fixed context-free language defined above. Thus the weak generative capacity of any context-free grammar and the strong generative capacity of FORMAL PROPERTIES OF GRAMMARS any modified normal grammar is specified in this way by the choice of a particular 1 -limited language. We have now noted the following features of the three artificial languages introduced in Chapter 11, Sec. 3. All three are beyond the range of finite automata. L3 can be generated by a context-sensitive grammar but not by a context-free grammar. L2 can be generated by a context-free, in fact, linear grammar, but not by a countersystem (cf. Sec. 1.4). JLi can be generated by a countersystem. Furthermore, a language can be generated by a context-free grammar just in case it is accepted by some PDS automaton. As we observed in Chapter 11, Sec. 3, the fundamental property of L2 (namely, that it contains nested dependencies) is a common feature of natural languages. It should be noted that dependency sets of the L3 type also appear in natural languages. Postal (1962) has found a deep- seated system of this sort in Mohawk, where noun sequences of arbitrary length can be incorporated in verbs, with the order of their elements matched in the incorporated and exterior noun sequence. A language containing such a dependency set is beyond the range of a context-free grammar or a PDS automaton, irrespective of any consideration involving structural descriptions (cf. Chapter 11, Sec. 5.1) and strong generative capacity. Subsystems of this sort are also found in English, though more marginally. 
Thus Solomonoff (1959, personal communication) and Bar-Hillel and Shamir (1960) note that the word respectively gives depend- ency sets of the L$ type (e.g., John and Mary wrote to his and her parents, respectively). Similarly, alongside the elliptical sentence John saw the play and so did Bill., we can have John saw the play and so did Bill see the play, but not *John saw the play and so did Bill read the book, etc. In the same connection, it should be observed that a language is also beyond the weak generative capacity of context-free grammars or PDS automata if it has the essential formal property of the complement of £3, that is, if it contains an infinite set of phrases a?1? x%, . . . , and sentences of the form ax$xjy if and only if i is distinct from/ (whereas a language is of the £3 type when it contains such sentences if and only if i is identical with j, as in the Mohawk example just cited). But restrictions of this kind are very common (cf., e.g., Harris, 1957, Sec. 3.1). Thus in the comparative construction we can have such sentences as That one is wider than this one is DEEP (with heavy stress on deep), but not *That one is wider than this one is WIDE— the latter is replaced obligatorily by That one is wider than this one is. Thus in these constructions, characteristically, a repeated element is deleted and a nonrepeated element receives heavy stress. We find an unbounded system of this sort when noun phrases are involved, as in the case of such comparatives as John is more successful as a painter than CONTEXT-FREE GRAMMARS 379 Bill is as a SCULPTOR, but not *John is more successful as a painter than Bill is as a PAINTER, which is converted, by an obligatory deletion trans- formation, to John is more successful as a painter than Bill is. As in the case of subsystems of the L3 type, these constructions show that natural languages are beyond the range of the theory of context-free grammars or PDS automata, irrespective of any consideration involving strong genera- tive capacity. Considerations of this sort show that, in the attempt to enrich linguistic theory to overcome the deficiencies of constituent-structure grammar (cf. Chapter 11, Sec. 5 — note that there only deficiencies in strong generative capacity were considered), it is necessary to develop systems that can deal with infinite sets of strings that are beyond the weak generative capacity of the theory of context-free grammar. In these examples, as in the examples discussed in Chapter 11, it is easy to state the required rules in the form of grammatical transformations and thus to handle linguistic phenomena that are beyond the scope of the theory of constituent-structure grammar. We have now almost completed the proof of Theorem 6 of Sec. 1.6, which asserts that, with respect to weak generative capacity, context-free grammars correspond exactly to nondeterministic PDS automata. In Sec. 2 we observed that unrestricted rewriting systems correspond exactly to Turing machines, and, in Sec. 4.1, that one-sided linear grammars have exactly the weak generative capacity of finite automata. In the case of finite automata and Turing machines, nondeterminacy does not extend (weak) generative capacity. It has recently been shown that every language accepted by a deterministic linear-bounded automaton is context-sensitive (P. Landweber, 1963). S. -Y. 
Kuroda has observed that this proof extends to nondeterministic linear-bounded automata, and he has proved that, furthermore, every context-sensitive language is accepted by a nondeter- ministic linear-bounded automaton. It has also been pointed out by Bar-Hillel, Perles, and Shamir (1961) that two-tape automata, as defined by Rabin and Scott (1959), correspond to linear grammars in the following sense. Suppose that y* is the reflection of y and that a is a designated symbol of VT. Then, if T is a two-tape automaton accepting the set of pairs {(xi} yz)} (where xi9 y^ are strings on VT — {a}), there is a linear grammar G generating the language {x^y^*}; and, if G is a linear grammar generating the language {x^ax/t} (xi9 yi strings on VT — {a}), there is a two- tape automaton that accepts exactly the set of pairs {(x^ ?/£*)}. Summariz- ing, then, we see that with respect to weak generative capacity there is a close correspondence between the hierarchy of constituent-structure grammars and a certain hierarchy of automata, namely that unrestricted rewriting systems correspond to Turing machines, context-sensitive 3&O FORMAL PROPERTIES OF GRAMMARS grammars to nondeterministic linear-bounded automata, context-free grammars to nondeterministic PDS automata, linear grammars to two- tape automata, and one-sided linear grammars to finite automata. Kuroda has also shown that the complement (with respect to a fixed vocabulary) of a language accepted by a deterministic linear-bounded automaton is context-sensitive and that every context-free language is accepted by a deterministic linear-bounded automaton. It follows, then, that the complement of a context-free language is context-sensitive. We shall see (in Sec. 4.3) that the complement of a context-free language is not necessarily context-free and (in Sec. 4.4) that there is no algorithm for determining whether or not it is context-free. 4.3 Closure Properties Regular languages are closed under Boolean operations (i.e., formation of union, intersection, complement with respect to a fixed alphabet), as well as under reflection (i.e., a mapping of each string al . . . an into an . . . <2X), product (i.e., formation of the language L± • L2 = {x | x — yz9 where y e^ and z e L2}), and infinite closure (i.e., formation of \JnLn, where Ln = L - L • . . . • L, n times). (Cf. Sec. 1.2.) However, this observa- tion carries over to the case of context-free languages only in part. Theorem 20. The set of context-free languages is closed under the operations of reflection, product, infinite closure, and set union. (Bar-Hillel, Perles, & Shamir, 1960.) However, the intersection of two context-free languages is not necessarily a context-free language; consequently, the complement of a context-free language with respect to a fixed vocabulary is not necessarily a context-free language. Theorem 21. The set of context-free languages is not closed under operations of set intersection or complement (with respect to the fixed vocabulary V). (Scheinberg, 1960b; Bar-Hillel, Perles, & Shamir, 1960.) _ Scheinberg gives as a counterexample the languages Zx = {anbnam} and i2 = {ambnan}, each of which is context-free but which intersect in the set of strings {anbnan} which is not context-free (the example in Bar-Hillel, Perles, & Shamir, 1960, is essentially the same). The intersection of two sets can, of course, be represented in terms of complement and union. 
Thus it follows that the complement of a context-free grammar is not necessarily context-free, since the union of context-free grammars is context-free, __ Observe that L± and L2 of the preceding paragraph are meta-linear, CONTEXT-FREE GRAMMARS sequential languages. The union of meta-linear languages is meta-linear; the union of sequential languages is sequential Consequently, this example shows in fact that the sets /TO and a of Def. 7 are not closed under complementation and intersection, just as the full set 7 is not closed under these operations; and, furthermore, the intersection of two languages of hm or of two languages of a need not be in y. Schiitzenberger has pointed out (personal communication) that the result can be strengthened to linear grammars (grammars of the class /. of Def. 7). Consider the grammar Gx with the rules S->bSc, S-+bc, (28) and the grammar G2 with the rules S-+aScc9 S-*aSb, S-*ab. (29) The intersection of the languages generated by G± and G2 is the set of strings {a2nbna271}, which is not context-free; but G: and G2 are linear. They are, furthermore, grammars of the simplest type above the level of finite automaton, that is, linear with a single nonterminal. We see then that even for this simple case the intersection of the generated languages may not be context-free, and the class 1 is also not closed under intersection or complementation (it is closed under union). The preceding argument does not extend directly to subcategories of the set y of context-free languages; although it does establish that the com- plement of a language in A (or Xm or d) is not in A (or in hm or a, respec- tively), it does not establish that it is not in y. It is a reasonable conjecture, however, that the result will extend to these subfamilies. It is, as we shall see directly, an important open question whether the complement of a linear language with a single nonterminal (and with a single terminating rule S->c, where c appears in no other rule) is context-free, We know that the class At of regular languages is closed under com- plementation and intersection (cf. Theorem 1, Sec. 1.2). Summing up, then: all of the categories of context-free languages defined in Def. 7, Sec. 4.1, are closed under the operation of set union, but only ^ is closed under intersection and complementation. Furthermore, the intersection of languages of the categories A, Am, or c, S -+ x.Sy,* (1< f < «) (32) generating the language L[G(2)]. G(S) is thus a linear context-free grammar with a single nonterminal. Consider now the question whether G(S) generates a string zcz*. Clearly, this will be true just in case there is an index sequence (il9 . . . , z"TO) for S such that z = *V • xin = Vii . . . y.m. (33) Thus the problem of determining whether an arbitrary linear grammar G (with one nonterminal symbol) generates a string of the form zcz*, for some z, is just the Post correspondence problem (cf. Schiitzenberger, 1961c). Consequently, there is no general method for determining, given a context-free grammar G (which may even be linear, with one non- terminal symbol), whether G generates a string zcz*, hence an infinite number of such strings (as previously noted), or whether it generates no such string. But now let Gm be the grammar S^c, S-^aSa, S^bSb (34) generating the mirror-image language £((7^) = {zcz*}. Given S as in Eq. 31, consider the question, What is the cardinality of L(G J n L[G(S)] ? (35) But this is just the question, How many strings of the form zcz* does G(S) generate? 
We know that the answer is, Either none or an infinite number, and we have just discovered that there is no algorithm for determining which of these is the case. Therefore, there is no algorithm for determining, given arbitrary S, whether £(£,„) n L[G(E)] is empty or infinite. Observe further that no infinite subset of L(G J is a regular language. Consequently, the intersection L(G J C\ L[G(2)] is regular just in case it is finite, that is, empty. Since there is no algorithm for determining this, there is no algorithm for determining whether for arbitrary S, L(Gm) n L[<7(£)] is a regular language. Theorem 23. There is no algorithm for determining, given the context- free grammars Gl and G2, whether L(G^ n L(G^ is empty, infinite, or regular. (Bar-Hillel, Perles, & Shamir, 1960.) 34 FORMAL PROPERTIES OF GRAMMARS In fact, this is true even when Gl is fixed as in Grammar 34 and G2 is linear with a single nonterminal symbol (as is Gx also). Linear grammars with one nonterminal are the simplest systems beyond finite automata in our framework; and it is well known that there is an algorithm for determining whether the intersection of two regular languages is empty or infinite (it is always regular). We described Gm in Grammar 34 as the grammar generating the mirror- image language with a defined midpoint. Let Gm2 be the grammar con- sisting of the rules in Grammar 34 and, in addition, the rule S^ScS. (36) Let S-i be the initial symbol of Gm2. Thus L(Gm2) consists of all strings of the form xcx*cycy*, where x and y are strings in the alphabet {a, b}. Given 2 again, as in Example 31, define (7m(S) as the grammar con- taining Rules 32 of G(S) and, in addition, the rules S^aSja, S^-^bSJ, ^ -> cSc, (37) where Sl is the initial symbol. Gm(£) is the grammar that embeds L[G(S)], as defined by Rules 32, into the mirror-image language. That is, L[Gm(S)] consists of exactly those strings of the form xcyczcx*, where ycz is generated by (r(S); Gm(S) is basically an amalgam of Gm and (?(£). Consider now the intersection of the languages generated by Gm2 and GTO(S) (just as before we considered the intersection of L(Gm) and £[G(2)]}. This is the set L(Gm2) C\ L[Gm(S)] consisting of exactly those strings x meeting the following conditions : (i) x = x1cx2cx^cx,l (xi a string on {a, b}) (ii) Xl = z2* and £3 = z4* (since x e L(Gm2)) (38) (iii) x± = xt* and x2cxz e L(G(S)) (since x e L[Gm(E)]). In particular, then, x^ = x^ xz = x^ and x2 = #3*. Consequently, L(Gm2) n L[Gm(£)] will be empty just in case there is no index sequence satisfying S and infinite otherwise, exactly as before (since, as we have already observed, the question whether there is a string x^cx^ e L[G(2)], where #2 = #3*, is exactly the correspondence problem for S). Con- sequently, there can be no algorithm for determining whether L(Gm2) n L[Gm(£)] is empty or infinite (these being the only possibilities). Each string of L(G^) C\ L[Gm(S)] is of the form xcx*cxcx*> and it is easy to show that no infinite set of strings of this form constitutes a context- free language. Consequently, L(Gm2) n L[(7m(£)] is a context-free language just in case it is finite, that is, empty, and since the question of emptiness is, as just observed, undecidable, the question whether L(Gm2) n L[Gm(£)] is a context-free language is undecidable. CONTEXT-FREE GRAMMARS jjtfj Theorem 24. There is no algorithm for determining, given the context- free grammars Gl and G2, whether L(G^ n L(G2) is a context-free language. (Bar-Hillel, Perles, & Shamir, 1960.) 
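The reduction can be watched at work on a small, concededly solvable instance. The sketch below forms the strings generated by the rules S → x_i S y_i*, closed off by a terminating rule S → c (the minimal linear form discussed below), and searches index sequences up to a fixed bound for one yielding a string of the form zcz*. The particular system Σ and the bound are mine; Theorem 23 says precisely that no bounded search of this kind can be turned into a general decision procedure.

```python
# A sketch of the reduction to the correspondence problem.  For a system
# Sigma = {(x_i, y_i)} the grammar G(Sigma) has the linear rules
# S -> x_i S y_i* together with a terminating rule S -> c (the minimal
# linear form discussed below).  The instance here is a standard small
# solvable one; the bounded search merely illustrates the question.
from itertools import product

SIGMA = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]

def derived(indices):
    """Apply S -> x_i S y_i* in the order given, then S -> c."""
    xs = "".join(SIGMA[i][0] for i in indices)
    ys = "".join(SIGMA[i][1] for i in indices)
    return xs + "c" + ys[::-1]

def is_zcz_star(w):
    left, _, right = w.partition("c")
    return right == left[::-1]    # the part after c is the reflection of the part before

def search(max_rules):
    for m in range(1, max_rules + 1):
        for idx in product(range(len(SIGMA)), repeat=m):
            w = derived(idx)
            if is_zcz_star(w):
                return idx, w
    return None

print(search(4))
# Finds the index sequence (2, 1, 2, 0), i.e. pairs 3, 2, 3, 1, which solves the
# correspondence problem for SIGMA.  With an unsolvable SIGMA the search simply
# runs on as the bound is raised, and no algorithm can tell us which situation
# we are in.
```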
In fact, this is true even when Gx is fixed as Gm2 and <72 is meta-linear (note that Gm2 is also meta-linear). Note that Theorem 23 follows from the argument that proves Theorem 24. We have considered languages with the alphabet {a, b, c}, but by appropriate coding, we can easily extend these unsolvability results to languages with alphabets of two or more symbols (cf. Bar-Hillel, Perles, & Shamir, 1960). In Bar-Hillel, Perles, and Shamir (1960) the proof of Theorems 23 and 24 proceeds essentially as follows. Consider S, as in Example 31. LetL2 be the language consisting of all strings ab'* . . . db^cx^ . . . x^* . . . y^cb^a . . . b**a9 (39) where (/19 . . . , ik) and (jl9 . , . ,/z) are index sequences for 2. It is easy to show that L2 is a context-free language generated by a grammar where xi and yi are again strings on some alphabet, say, the alphabet (a, b}. Select n new symbols dl9...,dn and construct the two grammars Gx and Gy as follows: Gx: Sx -> c, Clearly, Gx and Gv are unambiguous, but note that there is an index sequence & i J such that a^ . . . sia> = y^ . . . yim if and only if G,. and Gy both generate the string z, where ...»*.*. (43) FORMAL PROPERTIES OF GRAMMARS Thus the correspondence problem for S has a positive solution if and only if there is a z generated by both Gx and Gy, that is, if the grammar Gxy is ambiguous, where Gxy contains the rules of Gx, the rules of Gy, and, in addition, the rules S — > Sxy S -+• Sy, S being the initial symbol of Gxy. Consequently, there is no algorithm for determining whether, for arbitrary S, the grammar Gxy constructed in the manner indicated is ambiguous. Theorem 28. There is no algorithm for determining whether a context-free grammar is ambiguous. (Schiitzenberger, personal communication.) Note that the grammar Gxy belongs to a class of grammars G meeting the following condition: G is linear with at least three nonterminals and terminating rules all of the form a -»• c, (44) where c does not appear in any nonterminating rule of G. Thus we see that the ambiguity problem is unsolvable for grammars meeting this condition. It has recently been shown that through a generalization of the corre- spondence problem, Condition 44 can be weakened to the case of one nonterminal without affecting the unsolvability of the ambiguity problem. In effect, this generalization permits S, Sx, and Sy to be identified in Gxy (Greibach, forthcoming). Thus Theorem 28 holds of the class of minimal linear grammars G meeting the condition : G is linear with a single nonterminal S and a single terminating rule S-+c9 where c does (45) not appear in any nonterminating rule of G. Suppose that G is a minimal linear grammar with the nonterminating rules S->xtSyj*, 1 < i < n, associated with the set of pairs I = {(*«, yj\Ki< n}. G is ambiguous if and only if there are two index sequences 7, 7 7 = (ilf . . . , g, j = (j\9 . . . ,JQ), 7^J, l * * , V But consider the question: Given 2, do there exist index sequences 7, / as in Example 46, that satisfy Condition 47 ? ^ ^ CONTEXT-FREE GRAMMARS From Theorem 28, extended to minimal linear grammars, it follows that this question concerning unique decipherability is undecidable. We observed in Sec. 1.2 that corresponding to every finite automaton we can construct an equivalent deterministic finite automaton (and, of course, the question whether two finite automata are equivalent is decid- able). In other words, given a one-sided linear grammar (?, we can find an equivalent unambiguous one-sided linear grammar. 
An obvious question is whether this is also true of context-free grammars in general. It has been proven by Parikh (1961) that it is not. Theorem 29. There are context-free languages that cannot be generated by any unambiguous context-free grammar. Parikh proves that the language L = {x | x = anbman'bm or x = anbmanbm'; n, n', m, m > 1} (49) cannot be generated by an unambiguous context-free grammar, although it is generated by the set of rules A -> aAa, A -*• aBa, C-+bCb, C-+bDb, (50) There is, in fact, a linear grammar equivalent to the Grammar 50. The degree of ambiguity it assigns to strings is 2. Another example of such a language is the set {anbmap \ n = m or n = p}. It is an interesting open problem to find languages of higher (perhaps unbounded) degree of inherent ambiguity. Many open and important questions can immediately be raised concerning the question of inherent ambiguity, its scale, its relation to decidability of ambiguity, and the level of richness of grammar at which it arises (e.g., are there minimal linear languages that are inherently ambiguous or is there inherent ambiguity at the level of context-sensitive grammars?), etc. Theorem 29 is a suggestive result. It is an interesting question why natural languages have as much structural ambiguity as they do. We might hope to obtain an answer to this question that would take the folio wing form: 1. Grammars of natural languages are drawn from the class T of gener- ative processes. 2. The language L is rich enough in expressive power to contain the set of grammatical devices A but not A' (e.g., the class of sentences L but not S'). 39° FORMAL PROPERTIES OF GRAMMARS 3. A grammar G e T that expresses A but not A' (that generates S but not S') must be ambiguous, that is, the language it generates is inherently ambiguous with respect to F. If an argument of this kind could be provided, it would be important not only as an explanation for the existence of structural ambiguities in L, but it would provide striking evidence of an indirect (hence quite interesting) kind for the validity of the general linguistic theory that makes the claim (1), It need hardly be emphasized that we are far from being able to provide such an argument for natural language, but Theorem 29 may represent a first step in this direction. 4.6 Context-Free Grammars and Finite Automata We have seen that a finite automaton can be represented as a one-sided linear grammar and that such a device is much more restricted in generative capacity than a context-free or even a linear grammar may be. We have also observed that such elementary formal properties of natural languages as recursive nesting of dependencies make it impossible for them to be generated by finite automata, although these properties do not exclude them from the class of context-free (even linear) languages. From these observa- tions we must conclude that the competence of the native speaker cannot be characterized by a finite automaton. The grammar stored in his brain cannot be a one-sided linear grammar, a fact that is not in the least sur- prising. Nevertheless, the performance of the speaker or hearer must be representable by a finite automaton of some sort The speaker-hearer has only a finite memory, a part of which he uses to store the rules of his grammar (a set of rules for a device with unbounded memory), and a part of which he uses for computation in actually producing a sentence or "perceiving" its structure and understanding it. 
These considerations are sufficient to show the importance of gaining a better understanding of the source and extent of the excess generative power of context-free grammars over finite automata (even though context- free grammars are demonstrably not fully adequate for the grammatical description of natural languages). We turn now to an investigation of this problem. Let us first review what we have so far found concerning the relation of context-free grammars to restricted-infinite and finite automata. In Sees. 1 .6 and 4.2 we have stated several results (which are contingent, in part, on results yet to be proved in this section) having to do with this question. In particular, we observed that context-free languages are the sets that are accepted by a class of restricted-infinite automata that we CONTEXT-FREE GRAMMARS called pushdown storage (PDS) automata. We showed that from this fact it follows that there is an extremely close relation between regular (one-sided linear) languages and context-free languages. Namely, let us extend the vocabulary V from which all context-free languages are con- structed to a vocabulary V containing V and a' for each a e K Let us define K as the set of strings on V that reduce to e by successive cancella- tion of oca' and oc'cc (i.e., by treating a and a' as inverses). Let us define <£(a) = a for a e VT9 (£(a) = e for a ^ VT. For any language JL, let us define ip(L) = ^(K C\ L). Then, for each regular language L, y(L) is a context-free language, and each context-free language is determined as y)(L) for some choice of a regular language L. Hence the family /^ of regular languages is mapped onto the family y of context-free languages by the mapping yi. Continuing with the investigation of the relation between context-free and regular languages, we note first that there are context-free languages that are, in a sense, much 'bigger' than any regular language. Theorem 30. There is a context-free grammar G generating L(G) with the following property: given a finite automaton Al generating L(A^ <=• L(G), we can construct a finite automaton A2 generating L(A^) such that (i) L(-4i) ci L(A£ c L(G) and (ii) L(A^ contains infinitely many strings not inL(AJ. (Parikh, 1961.) Parikh shows that this result holds for a context-free grammar that provides the relations S => An£mAn9 A => ce*c9 B => dfkd (m, n,k^ 1). (51) [In fact, it is also true of the simpler grammar G generating just L(G) = {anbman | m, n ^ 1} (S. Ginsburg, personal communication).] From Theorem 30 we see that we cannot, in general, approach a context-free language L as the limit of an increasing sequence of regular languages, each containing an infinite number of sentences not in the preceding one and all contained in L. Suppose now that we have a context-free grammar G generating L(G) and a finite transducer T with initial state SQ. We shall construct a new context-free grammar G' which, in fact, generates T[L(G)]. Suppose, first, that Tis bounded. We can, then, assume that it contains no instructions of the form (e, S€) -* (Sf9 x). Let us construct the new context-free grammar G' with the output vocabulary of T as its terminal vocabulary and with nonterminal symbols represented as triples (Siy a, S,), where Si9 5y are states of rand a e K.4 4 The construction that follows is due to Bar-Hilld, Perles, and Shamir (1960) who use it only to prove what we give here as Theorem 32. Our Theorem 31 is proved in essentially this way in Ginsburg and Rose (1961). 
This construction is closely related to the representation of transducers by matrices in Schutzenberger (l%la). 3$2 FORMAL PROPERTIES OF GRAMMARS The initial symbol of Gf is S'. The rules of G are determined by the following principle: (i) S' -» (So, S, St) is a rule of G'9 for each /. (ii) If A -> ocj . . . xk is a rule of G, then for each i,j, @19 . . . , /3fc_1? G' contains the rule (St, A, 5,) -> (S<, a1? S^)^, a,, S^) . . . (52) ( Vi> a*> ^>- (iii) If (a, St) -> (S,., x) is a rule of T, then G' contains the rule To preserve the condition that a nonterminal symbol is one that appears to the left of a rule oc -> 0, we can require also that G' contain the rules to,fl,S,)-*to,a,S,)a, (53) for each a, i,j not involved in step iii. The terminating rules of G' are those given by step (iii) of Construction 52. Carrying a derivation of G' as far as we can without applying any ter- minating rules, we have, as final line, a string CSV a1? S^S^ 03, S^ . . . (Sw afc, S £ (where a: = ^ in step iii of Construction 52). However, as we observed at the outset of Sec. 4, rules of this kind do not permit the genera- tion of noncontext-free languages (aside from the language {e}). Con- sequently, we have the following result, where Tis a bounded transducer. Theorem 3 1. If L is a context-free grammar and T a transducer, then T(L) is a context-free language (or T(L) = {e}). Suppose that R is a regular language accepted by the automaton F with initial state 50. Construct the transducer T with the instruction (ai9 Sj) -* (Sk, af) whenever F has the instruction (i,j, k), that is, whenever F goes from state Sf to state Sk on reading the input at. We can assume, with no loss of generality, that F is deterministic (cf. Theorem 1) and that T is bounded. Let G be a context-free grammar generating L(G). Con- struct G' by the construction previously given, but with the revision that in CONTEXT-FREE GRAMMARS ^03 place of step i of Construction 52 we have the single rule Sf -> (50, 5, 50). Now it is easy to show that Gf generates the intersection of R with L(G) by the argument that leads to Theorem 31. Theorem 32. If R is a regular language and L a context-free language, then the intersection L n R is a context-free language. To drop the requirement of boundedness of the transducer Tin Theorem 3 1 , we amend the construction as follows. Given the context-free grammar G generating L(G), first replace each rule A -> a2 . . . xk by the rule A -*• qxiqatf . . . qxkq, where q is some new symbol. Then apply the construction in (52) and (53), as before, to give G'. Now define Qtj as Qu = (z | for some f0, . . . , iw x^ . . . , xm9 z = ^ . . . xm and for each k, 1 < k < m, T has the rule (55) (e> sikj -* (sik* xk\ where z0 = i and im = /}. Clearly gw is regular. Therefore, we can add to G' rules that provide for the relations (Si9q,St)-+e and (Si9 q, S,) => x, (56) for each /,/ and each xeQl}. G'9 so extended, generates the language T[L(G)]. Hence Theorem 31 holds without restriction on T. For additional results concerning the effect of various operations on context-free languages and related systems, see Ginsburg and Rose (1961) and Schlitzenberger and Chomsky (1962). Let us now restrict our attention to rules of the forms (i) A-+BC, (ii) A -> aB (right-linear), (iii) A -> Ba (left-linear), (iv) A — > a. 
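Before turning to these rule forms, the construction behind Theorem 32 can be put in miniature. In the sketch below the nonterminals of the new grammar are triples (q, X, q′); the grammar G generates {a^n b^n}, the regular language is (aa)*(bb)*, and the automaton is given, as an expository simplification, by a transition table with a set of final states rather than by a transducer that returns to its initial state. The resulting grammar for the intersection is then tested for emptiness by the usual productivity computation.

```python
# A miniature version (names and conventions mine) of the construction that
# proves Theorem 32: nonterminals of the new grammar are triples (q, X, q').
from itertools import product

G_RULES = {"S": [("a", "S", "b"), ("a", "b")]}        # S -> aSb | ab

STATES = ["q0", "q1", "q2", "q3"]
DELTA = {("q0", "a"): "q1", ("q1", "a"): "q0",        # even-length block of a's,
         ("q0", "b"): "q3", ("q3", "b"): "q2",        # then an even-length block
         ("q2", "b"): "q3"}                           # of b's
FINAL = {"q0", "q2"}

def triple_grammar():
    rules = {}
    for A, bodies in G_RULES.items():
        for body in bodies:
            k = len(body)
            for qs in product(STATES, repeat=k + 1):
                lhs = (qs[0], A, qs[k])
                rhs = tuple((qs[i], body[i], qs[i + 1]) for i in range(k))
                rules.setdefault(lhs, []).append(rhs)
    return rules

def productive(rules):
    """Triples deriving some terminal string; a terminal triple (q, a, q') is
    productive exactly when the automaton moves from q to q' on reading a."""
    prod = {(q, a, DELTA[(q, a)]) for (q, a) in DELTA}
    changed = True
    while changed:
        changed = False
        for lhs, bodies in rules.items():
            if lhs not in prod and any(all(t in prod for t in b) for b in bodies):
                prod.add(lhs)
                changed = True
    return prod

prod = productive(triple_grammar())
print(any(("q0", "S", qf) in prod for qf in FINAL))
# True: {a^n b^n} intersected with (aa)*(bb)* contains a^2k b^2k for k >= 1,
# and (q0, S, q2) is a productive start triple.
```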
Recall that a normal grammar contains only rules of the types (i) and (iv); a linear grammar can be described, without loss of generality, as one that contains only rules of the forms (ii), (iii), and (iv); a finite automaton contains only rules of the forms (ii) and (iv), or only rules of the forms (iii) and (iv). Recall also that a normal grammar can generate any context-free language and that a linear grammar, although more limited in generative capacity, can generate languages beyond the capacity of finite automata (Theorem 16). Thus we gain generative capacity over finite automata by permitting both right- and left-linear rules and, still beyond this, by allowing nonlinear rules to appear in the grammar. Of course, some normal grammars and some linear grammars may generate only regular languages, although there is no algorithm to determine when this is true in either case (Theorem 25). The question remains, then, under what conditions is a normal or linear grammar richer than any finite automaton in generative capacity?

In order to study this question, it is useful to consider again the classification of recursive elements given in Chapter 11, Sec. 3. A finite automaton contains either all right-recursive or all left-recursive elements. A linear grammar may contain both right-recursive and left-recursive elements, as may a normal grammar. Furthermore, both linear and normal grammars may contain self-embedding elements. It turns out to be the latter property that accounts for the excess generative capacity over finite automata.

Definition 8. A grammar is self-embedding if it contains a nonterminal symbol A such that for some nonnull strings φ, ψ, A ⇒ φAψ.

A self-embedding grammar is, in other words, one that contains self-embedding elements. It can now be shown:

If G is a nonself-embedding, context-free grammar generating L(G), then there is a finite automaton that generates L(G).     (58)

From this we derive the following result immediately:

Theorem 33. The language L is not regular if and only if all of its context-free grammars are self-embedding. (Chomsky, 1959a, 1959b; Bar-Hillel, Perles, & Shamir, 1960.)

Thus L is a regular language just in case it has a nonself-embedding grammar. Clearly, there is an algorithm to determine whether a context-free grammar contains self-embedding elements (similarly, whether it contains right-recursive and left-recursive elements). If we apply this test to a grammar G and discover that it has no self-embedding symbols, then we can conclude that G has the capacity of a finite automaton (although it may have both right-recursive and left-recursive symbols). If, on the other hand, we find that G has self-embedding symbols, we do not know whether G has the capacity of a finite automaton. This depends on the answer to the question whether there is a grammar G′ that generates the same language as G with no self-embedding symbols. As we have seen, there is no mechanical procedure by which this can be determined for arbitrary G. Thus, although Theorem 33 provides an effective sufficient condition for a context-free grammar to be equivalent to a finite automaton in generative capacity (and, furthermore, a condition met by some grammar of each regular language), we know that it cannot be strengthened to an effective criterion, that is, an effective necessary and sufficient condition.

Proposition 58 follows directly from elementary properties of finite automata.
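The test for self-embedding symbols mentioned in the last paragraph can be spelled out as a short closure computation. The rendering below is mine, not the text's: for each nonterminal A it records which nonterminals occur in sentential forms derivable from A with nonnull material to their left, to their right, or on both sides, and it reports self-embedding when some A reaches itself with material on both sides. Since no rule has a null right-hand side (Condition 4), "nonnull" can be read off from positions within rules.

```python
# A sketch (decision procedure mine) of the algorithm implicit in the remark
# that self-embedding is decidable: a grammar is self-embedding iff some
# nonterminal A satisfies A =>+ phi A psi with both phi and psi nonnull.
# Rules are written A -> tuple of symbols; lowercase strings are terminals.
def self_embedding(rules):
    nonterminals = set(rules)
    reach = {A: {(A, False, False)} for A in nonterminals}
    changed = True
    while changed:
        changed = False
        for A in nonterminals:
            for (B, l, r) in list(reach[A]):
                for body in rules.get(B, []):
                    for i, X in enumerate(body):
                        if X in nonterminals:
                            item = (X, l or i > 0, r or i < len(body) - 1)
                            if item not in reach[A]:
                                reach[A].add(item)
                                changed = True
    return any((A, True, True) in reach[A] for A in nonterminals)

# S -> aSb | ab is self-embedding; a right-linear grammar is not; a grammar
# with both left- and right-recursion but no self-embedding is also not.
print(self_embedding({"S": [("a", "S", "b"), ("a", "b")]}))          # True
print(self_embedding({"S": [("a", "S"), ("a",)]}))                   # False
print(self_embedding({"S": [("A", "B")],
                      "A": [("a", "A"), ("a",)],
                      "B": [("B", "b"), ("b",)]}))                   # False
```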
Suppose that G is a nonself-embedding grammar generating L(G), where the nonterminals of G are Al9...,An. Call G connected if for each f, / there are strings <£, y such that Ai => <£/4/y>. Suppose that G is connected. If there are i,j, k, I such that Ai => faA^ and y4fc => ^Aw* where ^^ e ^ y2, then it is immediate that G is self-embedding, con- trary to assumption. Therefore there can be no such f,/, k, /, and thus each nonterminating rule of G is right-linear or each rule is left-linear. In either case, G generates a regular language. Suppose now that n = 1. Then either L(G) is finite or G is connected and generates a regular language, as just noted. Suppose that Proposition 58 is true for all grammars containing less than n nonterminals. Suppose that A^ is the initial symbol of G, which has nonterminals A19 . . . , An. We may assume that G contains no redundant symbols and that it is not connected. Thus for some particular/ there is no A^. Suppose/ ^ 1. Form G' by deleting from G each rule As -*• and replacing A5 elsewhere in the rules by a new symbol a. By the inductive hypothesis the language L' generated by G' is regular, as is the set K = {x \ Aj ^> x}. It is obvious that if L± and L2 are regular and £3 consists of all those strings formed from x e L^ by replacing each a (if any) in x by some string of L2, then £3 is regular. L(G) is formed in this way from L' and K and is therefore regular. Suppose/ =1. Let <£15 . . . , r be the strings such that A^ -> fa. For each z" let Kt = {x | t => x}. Suppose that f = ax . . . am. By the inductive hypothesis the set L^ = {x \ a, => x} is regular. By Theorem 2 it follows that Kfr hence L, is also regular. This establishes Proposition 58. This observation also follows immediately from the much stronger result to which we turn next. We have considered so far only the question of weak equivalence among grammars, that is, the question whether they generate the same language. We have also defined a relation of strong equivalence holding between two grammars that not only generate the same language, but also the same set of structural descriptions (see Chapter 1 1 , Sec. 5.1). Little is known about strong equivalence. However, it is important to observe that we can extend Proposition 58 to the effect that, given a nonself-embedding grammar G, we can construct a finite transducer T that, in a sense that we shall make precise, is strongly equivalent to G. We can also strengthen this result to any finite degree of self-embedding. It has certain con- sequences to which we return in Chapter 13. See also Chomsky (1961) for further discussion. To make this result precise, let us consider more closely the class of finite transducers. We can assume that the input alphabet of each transducer FORMAL PROPERTIES OF GRAMMARS T is a subset of VT (the fixed and universal set of terminal symbols of all context-free grammars). We assume that the output alphabet of each transducer is a subset of some fixed set A0. Given T, we say that T accepts x and assigns to it the structural description y [briefly, T generates (X y)} just in case T begins its computation in state SQ with a blank storage tape scanning the leftmost symbol of the string x# on the input tape, and terminates its computation on its first return to 50 scanning # on the input tape and with y as the contents of the storage tape.5 Thus, if T generates (#9 y\ then ^accepts x in the manner of a finite automaton with no output (cf. Sec. 1.2) and maps it into y in the manner of a transducer. 
In order to compare the generative capacity of transducers, so regarded, with that of context-free grammars, let us assume that we are given an effective one-one mapping, O, which maps the set of structural descriptions given by context-free grammars (defined, let us say, as strings with labeled bracketing — cf. p. 367) into the set of strings in Ao. We can now define strong equivalence as a relation between context-free grammars and finite transducers: Definition 9. Given the context-free grammar G and the finite transducer T9 then G and T are strongly equivalent if and only if the following condition is met : T generates (x, BC is a rule then B ^ C, and if A -* cf>By and A -> %Bu> are rules then = % and y = o>. It is not difficult to show that these additional conditions have no effect on generative capacity. Furthermore, it is merely a matter of added detail to drop these restrictions (and, in fact, many, if not all of the restrictions that give normality). The proof of Theorem 34 is too complex to be given here, but we present the procedure T and illustrate it by an example. Beforehand, however, More precisely, in view of the account of transducers given in Sec. 1.5, we should say that T begins its computation scanning a on an otherwise blank storage tape and terminates its computation with ay as the contents of the storage tape. CONTEXT-FREE GRAMMARS note how Theorem 34 differs from Theorem 19 (Sec. 4.2), which states that, given any modified normal grammar G, there is a finite transducer T with the following property: T maps x into a string z that reduces to e if and only if z is a structural description assigned to x by G. In this case O is the identity mapping; that is, the' output of T is in the exact form of a structural description assigned by G. Furthermore, there is no limitation here to nonself-embedding grammars. However, the transducer T guaran- teed by Theorem 19 is not strongly equivalent to G in the s«.nse of Def. 9. Thus T can return to SQ and terminate, with input x, with a string y on the storage tape that does not reduce to e (and is not a structural description assigned to x by G). In fact, the reason why the transducer T associated with a context-free grammar in Theorem 19 appears, superficially, more powerful than the transducer T associated with a context-free grammar by Theorem 33 is that T, but not T'9 is, in effect, using a potentially infinite memory in deciding when to accept a string with a given structural description, since this decision requires an analysis of the (unbounded) string on the storage tape to determine whether it reduces to e. We know from Theorem 33 that Theorem 34 is the strongest possible result concern- ing strong equivalence of context-free grammars and devices with strictly bounded memory. To illustrate Theorem 34, suppose that G is a normal, nonself-embedding grammar meeting the additional condition stated directly after Theorem 34. Let K be the set of sequences {(Al9 . . . , Am)} meeting the following condition: for each z,/ such that ! AMy and Ai j£ A3: We can now construct the grammar G* with nonterminal symbols represented in the form [B19 . , . , B^ (i = 1, 2), where the B/s are nonterminal symbols of G:6 Suppose that (Bl9 . . . , Bn) e K. (i) If Bn -> a, then [B^ . . . Bn\ -> a[B^ ... B »]* (ii) If Bn -> CD, where C^B^D (i < «), then (d) [Bi... Bn\ ->[#!... £nC]!, (6) [5-L . . . BnC]% -^[B! . . . BnD}± (c) [B! . . . 
J6n/>]2 — ^ [A • - - Bnlz- (iii) If Bn -+ CD, where Bi = D for some i < : 77, then (59) (6) s^'Scr^xv.^ (iv) If Bn -+ CD, where Bi = C for some i < 77, then (a) [^ . . . JBJ2 -> [^ . . . 5nJD]x, (Z>) [B± . . . BnD]z -+ [B! . . . Bn]2. 6 We can use the symbol -»• unambiguously for both (7 and G*, nonterminal symbols differ. since the forms of their 39$ FORMAL PROPERTIES OF GRAMMARS We can now prove that there is an S-derivation of z in G if and only if there is a [S]r derivation of z[S]2 in G* (Theorem 10 in Chomsky, 1959a), where 5 is the initial symbol of G. The rules of G* are all of the form A -* aB, where a = e unless the rule in question was formed by step (i) of the Construction 59. We can, therefore, regard G* as a finite automaton. Suppose that we now supply the automaton with a new state SQ and the additional rules S»-+[S]19 [S]t-+SQ. (60) Taking S0 as its initial state, the device is now weakly equivalent to G. To convert this device to a transducer, we must supply it with rules stating the output symbol it produces as it moves from state Ql to state Qj reading symbol ak. We take the output alphabet to be the set of non- terminal symbols of G* (that is, the set of symbols of the form [Bl. . . -BJt- which now designate states of the automaton). We shall say that when the device switches into the state Q it prints the output symbol Q. This completes the construction T required in Theorem 34, which associates a finite transducer T with a nonself-embedding grammar G. If G generates x, T maps the input string x into the output string d. ( ' 7 Langendoen, in fact, constructs a procedure $ that operates in real time, as it were; that is to say, the labeled bracketing of a string x generated by G is produced by $ from the output of the transducer T associated with G in the course of the computation of T on x. CONTEXT-FREE GRAMMARS 399 B B D B Fig. 10. Structural description for a sentence generated by the grammar (61). This grammar generates the language consisting of all strings a^ and assigns to them such structural descriptions as the one shown in Fig. 10 (for the case / = j = 1 , k = 2). Note that G contains both left-recursive and right-recursive elements, although it contains no self-embedding ele- ments, and that, in this case, the left-recursive element A dominates the right-recursive element B. This example serves to illustrate the point, sometimes overlooked or misunderstood, that although the finite trans- ducer T9 which interprets sentences in the manner of G, of course reads these sentences from left-to-right in one pass, it does not follow that there must be any left-right asymmetry in the structural descriptions of the sen- tences that T accepts and interprets in the manner of G. The class K of sequences constructed from the grammar G1 of (61) con- sists of the sequences (S), (S, A\ (S, B), (S, A, C\ and (5, B, D). The construction T now provides these rules: (by step ii of Construction 59) (by step i of Construction 59) (by iv of Construction 59) x^ (by i of Construction 59) (by iii of Construction 59) (by i of Construction 59). \[SA], -^[SB], ([SB]2 -> [S]2 , r c< A~\ ^.ro A~\ \bA\i — > Ct\&A\z {(S]z - [SAC],\ \[SAC]Z-*[SA]Z I r c DI ^ Arc zxi [OZJJ! — > U 10-OJ2 ([SB], ->[SBKU \[SBD]t-+[SB], I l(SAC],-+c[SAC]z\ \[SBD],- 400 FORMAL PROPERTIES OF GRAMMARS These constitute the grammar (?* provided by T. Corresponding to Fig. 
10, we have the derivation a[SA]2 adbcd[SBD]2 adbcd\SE\ (63) ad[SBD]2 adbcd[SBD]i adlSB^ adbcdd[SBD]2 adb[SB]2 adbcdd[SB\i adb[S]2 adbcddb[SB]2 adb[SAC\i adbcddb[S]2 adbc[SAC]2 It is clear that the sequence of nonterminal symbols produced in this derivation enables us to reconstruct uniquely the structural description in Fig. 10. In fact, the automaton with the rules of Example 62 generates the sentence adbcddb, essentially, by tracing systematically through the labeled tree of Fig. 10. This is a representative example; it illustrates how a device with finite memory can associate with each string x generated by a nonself-embedding normal grammar G the structural description assigned to x by (7, where this structural description may be of arbitrary complexity. Suppose now that we were to apply this construction to a self-embedding normal grammar G. Let us say that a transducer T generates (x, y) in the manner ofG if T generates (z, y) in the sense previously defined (i.e., maps x into y while accepting x in the manner of a finite automaton), where is a structural description assigned to x by G. Then the transducer constructed by Constructions 59 and 60 will, in fact, generate (x, y) in the manner of G for each pair (x, y) such that y is a structural description assigned to x by Or, where y involves no self-embedding. Furthermore, by increasing the memory of T (conceptually, the easiest way to do this is to provide it with a bounded pushdown storage), we can allow it to relabel self-embedded symbols up to any bounded degree of self-embedding and then operate as before. In this case it can generate (x, y) in the manner of G for any pair (x, y) such that y is a structural description assigned to xbyG that involves no more than some bounded degree of self-embedding. Beyond this we cannot go with a finite device, as we know from Theorem 33. It is clear from the results of Sec. 1.6 and 4.2 that if we allow the trans- ducer an unbounded pushdown storage memory then it can be made CONTEXT-FREE GRAMMARS 40! strongly equivalent to any given normal grammar G — observe, in particular that Constructions 18 and 19 in Sec. 4.2 are, in effect, the trivial special case of Construction 59 involving only steps i and ii. These observations give us a precise indication of the extent to which sentences generated by a context-free grammar can be handled (i.e., accepted and interpreted) by a device with finite memory or a person with no (or fixed) supplementary aids to computation. We return to this question again in Chapter 13. 4.7 Definability of Languages by Systems of Equations Suppose that G is a context-free grammar with nonterminals ordered as A19 . . . , An9 where A17 is the designated initial symbol. With each Ai associate the set St. of terminal strings dominated by Ai9 that is, 24. = {x | Ai => x}9 using the notations to which we have adhered throughout. We thus associate with the grammar G the sequence of sets (Sl5 . . . , Sn), each a set of terminal strings, where Sx is the terminal language generated by G. We say that this sequence of sets satisfies the grammar G, with the given ordering of nonterminals. Clearly, each term of the satisfying sequence is the terminal language generated by some context-free grammar, in fact, a grammar differing from G only in choice of initial symbol. Suppose that we now regard the nonterminal symbols of G as variables ranging over sets of strings in the terminal vocabulary. We define a polynomial expression in the variables Ai9 . . . 
, An as an expression of the form & + ... + &, (64) where each fa is a string in Fand the only nonterminal symbols appearing in Expression 64 are Al9...9An. A polynomial expression such as Expression 64 can be regarded as defining a function / which maps a sequence of sets of strings in VT onto a set of strings in VT in the following manner. Given the sequence (S15 . . . , Sn), where St is a set of strings in VT9 let /(S15 . . . , S J be the set of strings formed by replacing each occurrence ofAt in Expression 64 by the symbol £,- designating the set Sz> then interpreting + as set union and concatenation as set (Cartesian) product — that is to say, where A and B are sets, AB = {yz \ y e A and 2 e B}; xA = {xy\ye A}; Ax = {yx \ y e A}. For example, the function f(A, B) defined by the polynomial expression a + Aa + BaA (65) maps the pair of sets {x, y}9 {z9 w} onto the set {a, xa, ya, zax, zay, wax, way}. FORMAL PROPERTIES OF GRAMMARS Given the context-free grammar G with nonterminals Aly . . . , An, we associate with each AI the polynomial expression fa + . . . + k9 where AI -* <£i, . . . , Ai -> (j>k are all of the rules of G with Ai as the left-hand member. Consider now the system of equations =/lWl, • • • ,A n) (66) where/ is the function defined by the polynomial expression associated with At. It is well known that such a system of equations has a unique minimal solution; that is, there is a unique sequence of sets, S1? . . . , En, which satisfies this system of equations (as values for Al9 . . . , An9 respec- tively), such that if Si', . . . , Sn' is another solution then Sz- c S/ for each i. Furthermore, it is clear that the sequence of sets that constitutes the minimal solution for Eqs. 66 (what we shall henceforth call the solution) is the sequence of sets that satisfies G in the sense of the first paragraph of this section and St is the terminal language generated by G. Putting the same remark in different language, we can regard Eqs. 66 as defining a function/ such that = WA19 . . . , An\ . . . 9fn(Ai, . . . , An)]. (67) A fixed point of the function / is a sequence (21? . . . , Sn) such that /(E1? . . . , Sn) = (S1? . . . , SJ. Then there is a unique minimal fixed point of the function/, which is identical with the solution to Eqs. 66; that is, it is the sequence of sets that satisfies G. The solution to Eqs. 66 can be determined by the following recursive procedure. We construct a sequence , for all t/, z such that yz — x. (69) Note that addition is analogous to set union and multiplication, to formation of the set product. Thus we have sup (r + r') = SUp (r) u sup (r'), sup (rr') = sup (r) - sup (r'). (70) We can also define an operation analogous to set intersection, namely, the operation ® such that r ® r' is the power series in which the coefficient r ® r' of x is k, where A -*• <^-(l < i < k) are all the rules in G that contain A on the left. We now interpret + and concatenation not as set union and complex product but as the corresponding operations on power series, as in Definition 69. Take cr0 as the sequence (r^, . . . , rn°), where each r* is the null power series in which every coefficient is zero. Take x}* which is a measure of the degree of ambiguity assigned to x by G. We speak of r^ as the power series that satisfies G. Suppose, for example, that we have the grammar G with the rules A-+AA, A-+a, A-*b. 
(72)

Fig. 11. P-markers illustrating the ambiguity of the grammar of (72).

This corresponds to the system of equations consisting of just

A = a + b + A².  (73)

In this extremely simple case we have σₜ = (r₁ᵗ), where

r₁¹ = a + b + 0² = a + b,
r₁² = a + b + (r₁¹)² = a + b + (a + b)² = a + b + a² + ab + ba + b²,
r₁³ = a + b + (r₁²)² = Σ ⟨r₁³, x⟩x, where  (74)
  ⟨r₁³, x⟩ = 1 for each string x of length 1, 2, or 4,
  ⟨r₁³, x⟩ = 2 for each string x of length 3,
r₁⁴ = a + b + (r₁³)², etc.

The coefficient of each string of length 3 will continue to be 2 in each r₁ᵗ (t > 3), and the coefficients will increase with the length of the string. Thus in this grammar there is exactly one way to generate each string of length 2, exactly two ways to generate each string of length 3, exactly 5 ways to generate each string of length 4, etc., as can be seen from the examples in Fig. 11. The power series r₁, which is the limit of the r₁ᵗ's so defined, is the solution of Eq. 73. Its support is the terminal language L(G) generated by G. In this case L(G) is the set of all strings in the alphabet {a, b}, and we know that there is an equivalent grammar G* which will have as its solution a characteristic power series (i.e., with all coefficients = 0 or 1, in this case all = 1 except for the coefficient of e) with L(G) as its support. An example is the grammar G* represented as the equation

A = a + b + aA + bA.  (75)

In fact, we have observed more generally (Sec. 1.2, Theorem 1) that corresponding to every finite automaton there is a deterministic finite automaton. Rephrased in our present terminology, every regular language is generated by a grammar G that is satisfied by a characteristic formal power series. Clearly, the language generated by the grammar of Example 72 is a regular language, as we can see from the fact that it is generated by the one-sided linear grammar of Eq. 75. However, Theorem 29 of Sec. 4.5 asserts that there are context-free languages that cannot be represented in this way by a characteristic power series; that is, Theorem 29 asserts the existence of a language L which is the support of a power series r that satisfies a context-free grammar but which is not the support of any characteristic power series satisfying a context-free grammar.

As a second example to illustrate these notions, consider the following grammars:

A = a + bAA,  (76)
A = a + AbA.  (77)

Interpreting b as the sign for conditional and a as a variable, we see that Grammar 76 generates the set of well-formed formulas of the implicational calculus with one free variable in Polish parenthesis-free notation and correspondingly has a solution that is characteristic. Grammar 77 generates the set of strings of this calculus in ordinary notation, without parentheses, and its solution is the power series in which the coefficient of a string is the number of distinct ways in which it can be parenthesized to yield a well-formed formula of this system.

Schützenberger's notion of representing sets enumerated by a generative process in terms of formal power series is well motivated for the study of language.
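The successive approximations r₁ᵗ just described can also be computed mechanically by iterating the defining equation over power series truncated at some maximum string length. The sketch below is a minimal illustration, assuming a dictionary representation of a truncated power series; it carries this out for Eq. 73 and reproduces the coefficients cited above (one derivation for each string of length 1 and 2, two for length 3, five for length 4).

```python
from collections import Counter
from itertools import product

MAX_LEN = 4  # truncate every power series at strings of this length

def series_sum(*series):
    total = Counter()
    for s in series:
        total.update(s)
    return total

def series_product(r, s):
    prod = Counter()
    for x, cx in r.items():
        for y, cy in s.items():
            if len(x) + len(y) <= MAX_LEN:
                prod[x + y] += cx * cy
    return prod

# Iterate A = a + b + A^2 (Eq. 73) starting from the null series r^0.
terminals = Counter({"a": 1, "b": 1})
r = Counter()
for _ in range(MAX_LEN + 1):
    r = series_sum(terminals, series_product(r, r))

# The coefficient of x counts the distinct ways the grammar generates x.
for n in range(1, MAX_LEN + 1):
    coeffs = {r["".join(w)] for w in product("ab", repeat=n)}
    print(n, coeffs)   # 1 {1}, 2 {1}, 3 {2}, 4 {5}
```

Truncation is harmless here, since strings longer than the cutoff can never contribute to the coefficient of a shorter string.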
As has been mentioned several times, we are ultimately inter- ested in studying processes that generate systems of structural descriptions rather than sets of strings; that is, we are ultimately interested in strong rather than weak generative capacity. The framework just sketched provides a first step toward this goal, since it takes account of the number of structural descriptions assigned to a string (though not the structural descriptions themselves). It also provides a particularly natural way of approaching the study of nondeterministic transduction. Recall that a CONTEXT-FREE GRAMMARS 407 transducer can, in general, have two kinds of indeterminacy. When in state St reading the symbol a, it can have the option of switching to one of several states. If it switches to state Sjy it can have the further option of printing one of several strings on its output tape. Let us say that the string x = bI . . . bm (where b{ is a symbol of the input alphabet) carries the transducer T from state St to S; with output x = x1 . . . xm if T has the rules (bfr S^ -> (S^+i, xk) for some zl9 . . . , im^ fa = f; fw+1 = /) and for each k < m. Then a string x may carry T from St to S; with many different outputs, and it may carry T from Si to S, with the output x in many different ways (i.e., with different factorizations of a;). The natural way to represent the effect of the input string x in carrying T from 5t- to Sj is therefore by a polynomial TT(X, i,j) = S (TT(X, f,y), z)z, where (TT(X, i,j), z) is the number of different ways in which z can be given as output as x carries T from Si to Sj. We can then represent an w-state transducer T by a homomorphism /z mapping KT m^o the TinS of w x /z matrices with polynomials in the output alphabet of T as entries. Then JLLX will be the matrix with entries (jjtx)ti = IT(X, iyj\ which represent the behavior of T as x carries it from St to S^ Many problems involving transduction thus become problems in manipulation of matrices that can be handled by familar techniques (cf. Schiitzenberger 1961a, 1962c). Moreover, several new questions suggest themselves in this more general framework. Thus we have restricted ourselves in this discussion to the positive power series that has only nonnegative coefficients. More generally, we can consider the algebraic elements (of the ring of power series), which have positive or negative coefficients and which satisfy systems of equations that may have negative coefficients in the polynomial expressions. We can think of a power series r with positive or negative integral coefficients as being the difference of two positive power series r' and r". The coefficient of the string x in r is the difference between the number of times that x is generated by the grammar corresponding to r' and the grammar corresponding to r". The support of r is the class of strings that is not generated the same number of times by these two gram- mars. Schiitzenberger has studied the family of formal power series r = r' — r", where r' and r" satisfy one-sided linear grammars and thus have regular languages as their supports (these are the formal power series that correspond to rational functions when we identify strings that differ only by permutations) and has characterized the supports of such formal power series in terms of acceptability by a certain class of restricted-infinite automata (cf. Schiitzenberger, 1961a). 
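The matrix representation of transduction described above is easy to exhibit in a simplified form in which the entries are nonnegative integers that merely count the number of ways a string carries the device from one state to another, rather than polynomials recording the outputs as well. The two-state device, its transition counts, and all names in the sketch are illustrative assumptions; the point is the homomorphism property, that the matrix assigned to xy is the product of the matrices assigned to x and to y.

```python
# mu[a] is the matrix whose (i, j) entry counts the ways the symbol a can
# carry a toy two-state device from state i to state j.
def mat_mul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

mu = {
    "a": [[1, 1],    # on "a": stay in 0, or move 0 -> 1; from 1, stay in 1
          [0, 1]],
    "b": [[1, 0],    # on "b": stay in 0; from 1, two distinct moves to 0
          [2, 0]],
}

def mu_string(x):
    """Extend mu to strings: mu(x1 ... xm) = mu(x1) ... mu(xm)."""
    result = [[1, 0], [0, 1]]          # identity matrix for the empty string
    for symbol in x:
        result = mat_mul(result, mu[symbol])
    return result

# (mu_string(x))[i][j] counts the ways x carries the device from i to j.
print(mu_string("ab"))                                  # [[3, 0], [2, 0]]
assert mu_string("ab") == mat_mul(mu_string("a"), mu_string("b"))
```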
He has also shown that such an r may have as support a noncontext-free language and that there are some context-free languages that do not constitute the support of any such power series (Schiitzenberger, l%lc). For further discussion of these and 408 FORMAL PROPERTIES OF GRAMMARS related questions, see these papers and Schiitzenberger & Chomsky (1962). Relating to the questions of ambiguity for context-free languages raised in Sec. 4.5, we have the following general result concerning regular languages, which makes use of some of these notions. Theorem 35. Let G be a one-sided linear grammar satisfied by the formal power series r and generating the language L = sup (r). Let L% = {x \ (r, x) < k}. Then Lk is a regular language for each k. (Schiit- zenberger, personal communication.) Let N be the set of nonnegative integers, k a fixed integer, and NM the semiring of the M x M matrices with entries in N. It is proved in Schiit- zenberger (1962c) that where G and r are as in the statement of Theorem 35 and £7 is the set of all strings on VT (the free semigroup with generators a E VT) then there is an M < oo and a homomorphism p _-, of U into NM such that (r, x} = Let K = (i | 0 < / < k} and 0: N-+ K be defined by j8(/i) = «, for n k. (79) Define an addition and a multiplication for K by setting (80) K is a semiring, and it is easily shown that /? is a homomorphism mapping TV onto K. Let KM be the set (in fact, semiring) of the M x M matrices with entries in K. Then j3 extends in a natural fashion to a homomorphism 0: NM->KM. Define <£(z) = /?[//(#)], where p is as in Proposition 78. Thus <£(#) is an element of KM for x e £7. Then, clearly, = *>> if *> < ^' = *> if > fc- But ^M is a multiplicative semigroup of finite cardinality, and, for Jfc'< #} (by definition) = {a: | (px\M <,k'} (by Proposition 78) = {x \ $x) e g}, where 2 is the subset of <£(£/) contain- ing (f>(x) just in case (x)l>M < fc'. It is well known that, where \p is a homomorphism mapping a subset of U into a finite semigroup H9 then ^-i(#) = {a; | y(ar) 6 H} is a regular language. Consequently, Lk, is a regular language. From the fact that Lk is regular for each k, it follows by elementary properties of regular sets (cf. Theorem 1) that for each k the set of strings x such that (r, x) = k and the set of strings x such that (r, x) ^ k are each regular. In particular, the set of strings x such that {r, x) > 2 is a regular CONTEXT-FREE GRAMMARS language. This is the set of strings that is ambiguous with respect to G in the sense of Sec. 4.5. Suppose, in particular, that X = {zl5 . . . , xn} and 7 = [yl9 . . . , y n] are two sets of strings. Define Lx as the set of all strings z = xki . . . xkf (ks < n), that is, as the set of all strings factorizable into strings of X—m the notation of Sec. 1.2, Lx would be represented (xl9 . . . , «„)*. Define LY similarly. Let us call X a code (cf. Chapter 11, Sec. 2) if each string of Lx is uniquely factorizable (decipherable) into members of X (similarly, 7). Now consider the grammar Gx with the rules S -> xtS and GY with the rules S -* ytS (and the rule S -> e). By Theorem 35 the set of strings which are ambiguous with respect to Gx (similarly, with respect to GT) is regular, and consequently there is a procedure for determining whether it is empty. Hence there is an algorithm for determining whether a set of strings constitutes a code (cf. Sardinas & Patterson, 1953). However, we cannot decide whether X and 7 fail to be codes in the same way (cf. Sec. 4.5). 
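The decipherability test of Sardinas and Patterson cited above is itself a short iteration on sets of dangling suffixes and can be sketched directly. The set names and the two sample vocabularies below are our own illustrations; the procedure returns true just in case every concatenation of codewords has a unique factorization.

```python
def is_code(words):
    """Sardinas-Patterson test for unique decipherability."""
    words = set(words)

    def dangling(u, v):
        # the suffix left over when v properly extends u
        return {v[len(u):]} if v.startswith(u) and len(v) > len(u) else set()

    current = set()
    for u in words:
        for v in words:
            if u != v:
                current |= dangling(u, v)

    seen = set()
    while current:
        if current & words:          # a dangling suffix is itself a codeword
            return False
        seen |= current
        nxt = set()
        for s in current:
            for w in words:
                nxt |= dangling(s, w) | dangling(w, s)
        current = nxt - seen         # stop once no new suffixes appear
    return True

print(is_code({"0", "01", "11"}))    # True: uniquely decipherable
print(is_code({"0", "01", "10"}))    # False: "010" deciphers in two ways
```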
Suppose now that G and r are as in Theorem 35 and that G' is a second one-sided linear grammar satisfied by r'. It follows from a theorem of Markov and from Proposition 78 that there is no algorithm for determining in an arbitrary case of this sort whether there is an x such that ⟨r, x⟩ = ⟨r', x⟩. Theorem 35 implies that, given k, there is an algorithm for determining whether Lₖ ∩ Lₖ' is nonempty, where Lₖ = {x | ⟨r, x⟩ = k} and Lₖ' = {x | ⟨r', x⟩ = k}. We see, then, that there is no algorithm for determining whether there is a k such that Lₖ ∩ Lₖ' is nonempty (Schützenberger, personal communication).

4.8 Programming Languages

A program for a digital computer can be regarded as a string of symbols in some fixed alphabet. A programming language can be regarded as an infinite set of strings, each of which is a program. A programming language has a grammar that specifies precisely the alphabet and the set of techniques for constructing programs. Ginsburg and Rice have pointed out that the language ALGOL has a context-free grammar, though not a one-sided linear or even a sequential grammar. This observation suggests that it might be of some interest to interpret the results obtained in the general study of context-free languages, taking them as constituting a class of potential "problem oriented" programming languages.

Note, in particular, that a programming language must have an unambiguous grammar in the sense defined in Sec. 4.5, above. If the set of techniques that is available for the construction of programs constitutes an ambiguous grammar, then the programmer may construct a program that he intends the machine to interpret in a certain way, but the machine may interpret it in quite a different way. We have seen, however, that there are certain context-free languages that are inherently ambiguous with respect to the class of context-free grammars (Theorem 29, Sec. 4.5). Hence there is at least an abstract possibility that a certain infinite class of "programs" may not be characterizable by an unambiguous grammar when the techniques for constructing programs are limited to those expressible within the framework of context-free grammar. Furthermore, we have observed that there is no algorithm for determining whether a context-free grammar is ambiguous (Theorem 28, Sec. 4.5). Thus in particular cases the problem of determining whether a given grammar is ambiguous (whether a proposed programming language is minimally adequate) may be quite a difficult one.

Consider now the problem of translating from a programming language L₁ into another language L₂ (e.g., machine code or another higher order programming language). We can regard this as the problem of constructing a finite transducer (a "compiler") T such that T(L₁) = L₂. There is no reason to assume, in general, that such a transducer exists. Furthermore, we have seen that the general problem of determining for given context-free languages L₁ and L₂ whether there exists a transducer T such that T(L₁) = L₂ is recursively unsolvable (Theorem 27, Sec. 4.4). Hence the problem of translating between arbitrary systems of this sort seems to raise potentially quite difficult questions (Ginsburg, personal communication).

5. CATEGORIAL GRAMMARS

Traditional grammatical analysis is concerned with the division of sentences into phrases and subphrases, down to word categories, where these phrases belong to a finite number of types (noun phrases, predicates, nouns, etc.).
In the last twenty years there have been various attempts in descriptive linguistics to codify and clarify the traditional approach. One might mention here in particular Harris (1946), elaborated further in Harris (1951, Chapter 16), Wells (1947), and the recent work of Pike and his colleagues in Tagmemics (cf., e.g., Elson & Pickett, 1960). The generative systems studied in Secs. 3 and 4 represent one attempt to give precise expression to some of these ideas. There have been several other, more or less related, approaches which we mention here only briefly.

Several attempts have been made to develop systematic procedures that might lead from a set of sentences to a categorization of substrings into phrase types (e.g., Harris, 1946; Harris, 1951; Chomsky, 1953; Kulagina, 1958). These approaches are conceptually related to one another and to the systems we have discussed, but the exact nature of this relation has not been explored. For a somewhat different approach to systematic categorization of phrase types see Hiz (1961).

A second approach arose from the theory of semantical categories of Lesniewski, which was developed for the study of formalized languages. A modification, based on the formulation in Ajdukiewicz (1935), was suggested by Bar-Hillel (1953) as a precise explication of the immediate constituent analysis of recent linguistics. Lambek (1958, 1959, 1961) has also developed several systems of this general type, with certain additional modifications, and he has examined their applicability to linguistic material. Similar approaches are also discussed in Wundheiler and Wundheiler (1955), Suszko (1958), Curry and Feys (1958), and Curry (1961). We follow here the exposition in Bar-Hillel, Gaifman, and Shamir (1960).

We can establish a system of categories in the following way. Select a finite number of primitive categories (e.g., the category s of sentences and the category n of nouns and noun phrases, which were the only primitive categories envisaged in the system of Ajdukiewicz). All primitive categories are categories. When α and β are categories, then [α/β] and [α\β] are also categories; call them derived categories. Thus we can have such categories as [n/s], [s/[n/s]], and [[n/n]\[s\n]]. These are the only categories. Each member of V_T is assigned to one or more categories. The set of categories to which elements of V_T are assigned, with a list of their members, constitutes the grammar G. We have the following two rules of resolution:

(i) Resolve a sequence of two category symbols of the form [α/β], β to α.  (82)
(ii) Resolve a sequence of two category symbols of the form α, [α\β] to β.

These rules suggest cancellation in arithmetic, which was, in fact, the motivation for the notation. Given a string x of elements of V_T, replace each symbol of V_T in x by its category symbol, thus giving a sequence of category symbols. There may be several such associated sequences of category symbols, since a member of V_T may belong to several categories. Denote these sequences as C₁(x), . . . , Cₙ(x). By successive application of the rules of resolution to Cᵢ(x), we may find either that Cᵢ(x) resolves ultimately to s or that it resolves ultimately to some sequence of (one or more) category symbols distinct from s. If, for some i, Cᵢ(x) resolves to s, we say that the grammar G generates x; if there is no such i, G does not generate x. The set of strings generated by G is the language generated by G.
As in the case of the other generative grammars we have discussed, it is merely a notational question whether we think of G as a grammar that generates sentences or that accepts strings as inputs and determines whether they are sentences (i.e., a recognition device). Generally, the latter phraseology has been used for categorial grammars of the type just described.

The functioning of such a grammar can be clarified by an example. Suppose that our grammar contains the primitive categories n and s, the words John, Mary, loves, died, is, old, very, and the following category assignment: John, Mary to n; died to [n\s]; loves to [n\s]/n; old to [n/n]; very to [n/n]/[n/n]; is to [n\s]/[n/n]. Thus intransitive verbs (such as died) are regarded as "operators" that "convert" nouns appearing to their left to sentences; transitive verbs (loves) are regarded as operators that convert nouns appearing to their right to intransitive verbs; adjectives are regarded as operators that convert nouns appearing to their right to nouns; very is regarded as an operator that converts an adjective appearing to its right to an adjective; is is regarded as an operator that converts an adjective appearing to its right to an intransitive verb. Such strings as the following resolve to s as indicated:

(i) John died
    n, [n\s]

(ii) John loves Mary
    n, [n\s]/n, n                                        (83)

(iii) John is very old
    n, [n\s]/[n/n], [n/n]/[n/n], [n/n]
    n, [n\s]/[n/n], [n/n]
    n, [n\s]
    s

A grammar of the type just described we call a bidirectional categorial grammar. If all of the derived categories of the grammar are of the type [α\β] or if all are of the type [α/β], we call the system a unidirectional categorial grammar. Ajdukiewicz considered only the second form, since he was primarily concerned with systems using Polish parenthesis-free notation, in which functors precede arguments.

It is possible, of course, to regard both unidirectional and bidirectional categorial systems as generative grammars, and we can ask how they are related to one another and to the systems we have discussed. Bar-Hillel, Gaifman, and Shamir (1960) have shown the following:

Theorem 36. The families of unidirectional categorial grammars, bidirectional categorial grammars, and context-free grammars are weakly equivalent.

If G is a bidirectional categorial grammar, there is a context-free grammar that generates the language generated by G; and if G is a context-free grammar there is a unidirectional categorial grammar that generates the language generated by G. From this follows the somewhat surprising corollary that the class of unidirectional categorial grammars is equal in generative capacity to the full class of bidirectional categorial grammars. Shamir has recently observed (personal communication) that Theorem 36 can be established by a proof very much like that of the proof of equivalence of context-free grammars and PDS automata.

It should be emphasized that the relation studied in Theorem 36 is weak equivalence. It does not follow that, given a grammar of one of these kinds, a grammar of one of the other kinds can be found that will involve a category assignment of comparable complexity or naturalness or that will assign the same bracketing (constituent structure) to substrings.
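The resolution procedure of (82) lends itself to direct mechanization: replace each word by a category assigned to it, try every admissible cancellation of adjacent category symbols, and ask whether some order of cancellations leaves exactly s. In the sketch below the encoding of categories as nested tuples, the exhaustive search, and the final ungrammatical test string are illustrative choices of ours rather than part of the Bar-Hillel, Gaifman, and Shamir formulation; the lexicon is that of the example above.

```python
def resolves_to_s(seq):
    """Apply the two rules of resolution (82) in all possible orders and
    report whether some order reduces the sequence to the single category s.
    A category is a primitive string or a triple (op, alpha, beta) standing
    for the derived category [alpha op beta] with op one of '/' and '\\'."""
    if seq == ("s",):
        return True
    for i in range(len(seq) - 1):
        left, right = seq[i], seq[i + 1]
        # rule (i): [alpha/beta], beta resolves to alpha
        if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
            if resolves_to_s(seq[:i] + (left[1],) + seq[i + 2:]):
                return True
        # rule (ii): alpha, [alpha\beta] resolves to beta
        if isinstance(right, tuple) and right[0] == "\\" and right[1] == left:
            if resolves_to_s(seq[:i] + (right[2],) + seq[i + 2:]):
                return True
    return False

iv = ("\\", "n", "s")                  # the intransitive-verb category [n\s]
lexicon = {
    "John": "n", "Mary": "n",
    "died": iv,
    "loves": ("/", iv, "n"),
    "old": ("/", "n", "n"),
    "very": ("/", ("/", "n", "n"), ("/", "n", "n")),
    "is": ("/", iv, ("/", "n", "n")),
}

for sentence in ("John died", "John loves Mary", "John is very old",
                 "very John died"):
    seq = tuple(lexicon[w] for w in sentence.split())
    print(sentence, "->", resolves_to_s(seq))   # first three True, last False
```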
It seems, in fact, that for those subparts of actual languages that can be described in a fairly natural way by context-free grammars, a corresponding description in terms of bidirectional categorial systems becomes complex fairly rapidly (and, of course, a natural description with a unidirectional categorial grammar is generally quite out of the question). The systems that Lambek has developed differ in several respects from the one just described — in particular, they allow a greater degree of flexibility in category assignment. Thus his rules of resolution assert that a category a is also at the same time a category of the form /?/[a\/3], so that in this and other ways it is possible to increase the complexity and length of the sequence of category symbols associated with a string by application of rules of resolution. Consequently, it is not immediately obvious, as it is in the case of the system just sketched, that the language generated is recursive. Lambek has shown, however, that the systems he has studied are, in fact, decidable. It is not known how Lambek's system is related to bidirectional categorial systems or context-free grammars, although one would expect to find that the relation is quite close, perhaps as close as weak equivalence. The interest of the various kinds of categorial grammars is that they contain no grammatical rules beyond the lexicon; that is to say, where G is an assignment of the words of a finite vocabulary VT to a finite number of categories, primitive and derived, it is possible to determine for each 4^4 FORMAL PROPERTIES OF GRAMMARS string x on the vocabulary VT whether G generates xby a computational procedure that uses the rules of resolution, which are uniform for all grammars of the given type, hence need not be stated as part of the gram- mar G. There is, in fact, a traditional view that identifies grammar with the set of grammatical properties of words or morphemes (cf. de Saussure, 1916, p. 149), and it might reasonably be maintained that the approach just outlined gives one precise expression to this notion. Matthews has recently investigated a generalization of the theory of constituent-structure grammar in which certain types of discontinuity are permitted (Matthews, 1963b). Continuing to follow the notational con- vention of Chapter 11, Sec. 4, let us consider rules of the form A — * to where n > 0. We interpret such a rule as applying to a string ipAv.i . . . ^[O]^. He has also generalized this in a natural way to the case of discontinuous context-sensitive rules. For any grammar, we now define a left-to-right derivation in the manner presented explicitly in Sec. 4.2, p. 373, and a left-to-right discontinuous grammar as a grammar with rules of the form just given-(or of the more general context-sensitive discontinuous type) and with rule applications so restricted as to allow only left-to-right derivations, and all of these ( cf. Sec. 4.2). Matthews has shown that a left-to-right discontinuous grammar can generate only context-free languages, so that these generalizations do not increase generative capacity. Obviously, the same is therefore true of right-to-left discontinuous grammars which provide derivations in the manner also described on p. 
373 (i.e., only the right-most nonterminal symbol is rewritten at each stage) and in which a rule of the form A -> y^n]^ is interpreted as placing ^ n symbols to the left (or to the extreme left) as A is rewritten, instead of n symbols to the right (or to the extreme right) as in the case of a left-to-right discontinuous grammar (similarly for context- sensitive discontinuous rules). Matthews has also extended this result to rules which permit multiple discontinuities and has observed that allowing even two-way discontinuous rules does not extend the capacity of context- sensitive grammars. Various other models of linguistic structure have been proposed, but insofar as they can be interpreted as specifying a form of generative grammar (i.e., insofar as they specify grammars that provide information about sentence structure in an explicit manner) they seem to fall largely within the scope of the theory of constituent-structure grammar, or even, quite generally, the theory of context-free grammar. For discussion, see Gross (1962) and Postal (forthcoming). CATEGORIAL GRAMMARS 4/5 This concludes our survey of formal properties of grammars. It is hardly necessary to stress the preliminary character of most of these investigations. As is apparent from the appended bibliography, the whole subject is, properly speaking, only five or six years old, and much of this survey has in fact dealt with work in progress. It is important to reiterate that the systems that have so far proved amenable to serious abstract study are undoubtedly inadequate to represent the full complexity and richness of the syntactic devices available in natural language, in particular, because of the restriction to rewriting systems that do not incorporate grammatical transformations of the kind discussed in Chapter 11, Sec. 5. Nevertheless, they do appear to have the scope of the theories of grammatical structure that have been proposed in traditional and modern linguistics or in recent work on computable sentence analysis or that are implicit in traditional and modern descriptive studies, with the exception of the theoretical and descriptive studies involving transformations. Certain basic properties of natural languages (e.g., bracketing into continuous phrases, categoriza- tion into lexical and phrase types, nesting of dependencies) appear in systems of the kind that we have surveyed. Hence the study of these systems has some direct bearing on the character of natural language. Furthermore, it is clear that profitable abstract study of systems as rich and intricate as natural language, or of organisms sufficiently complex to master and use such systems, will require sharper tools and deeper insights into formal systems than we now possess, and these can be acquired only through study of language-like systems that are simpler than the given natural languages. Whether these richer systems will yield to serious abstract study is, of course, a question about which we can at present only speculate. References Ajdukiewicz, K. Die syntaktische Konnexitat. Studia Philosophica, 1935, 1, 1-27. Bar-Hillel, Y. A quasi-arithmetical notation for syntactic description. Language, 1953, 29, 47-58. Bar-Hillel, Y., Gaifman, C, & Shamir, E. On categorical and phrase structure grammars. Bull Res. Council of Israel, 9F, 1960, 1-16. Bar-Hillel, Y., Perles, M, & Shamir, E. On formal properties of simple phrase structure grammars. Tech. Rept. No. 4, Office of Naval Research, Information Systems Branch, 1960. 
(Also published in Zeitschrift fur Phonetik, Sprachwissen- schaft und Kommunikationsforschung, 1961, 14, 143-172.) Bar-Hillel, Y., & Shamir, E. Finite state languages: formal representation and adequacy problems. Bull. Res. Council of Israel, 1960, 8F, 155-166. Chomsky, N. Systems of syntactic analysis. /. Symbolic Logic, 1953, 18, 242-256. Chomsky, N. Three models for the description of language. IRE Trans, on Inform. Theory, 1956, IT-2, 113-124. 416 FORMAL PROPERTIES OF GRAMMARS Chomsky, N. On certain formal properties of grammars. Information and Control, 1959,2, 137-167. (a) Chomsky, N. A note on phrase structure grammars. Information and Control, 1959, 2, 393-395. (b) Chomsky, N. On the notion "Rule of grammar." In R. Jakobson (Ed.), Structure of language and its mathematical aspects, Proc. 12th Sympos. in Appl. Math. Providence, R.I.: American Mathematical Society, 1961. Pp. 6-24. Reprinted in J. Katz & J. Fodor (Eds.), Readings in the Philosophy of Language. New York: Prentice-Hall, 1963. Chomsky, N. Context-free grammars and pushdown storage. RLE Quart. Prog. Rept. No. 65. Cambridge, Mass.: M.I.T. March 1962. (a) Chomsky, N. The logical basis for linguistic theory. Proc. IXth Int. Cong, of Linguists, 1962. (b). Reprinted in J. Katz & J. Fodor (Eds.), Readings in the Philosophy of Language. New York: Prentice-Hall, 1963. Chomsky, N., & Miller, G. A. Finite state languages. Information and Control, 1958, 1,91-112. Culik, K. Some notes on finite state languages and events represented by finite auto- mata using labelled graphs. Casopis pro pestovdni matematiky, 1961, 86, 43-55 (Prague). Curry, H. Some logical aspects of grammatical structure. In R. Jakobson (Ed.), Structure of language and its-mathematical aspects, Proc. 12th Sympos. in Appl. Math. Providence, R.I.: American Mathematical Society, 1961. Pp. 56-68. Curry, H., & Feys, R. Combinatory logic. Amsterdam: North-Holland, 1958. Davis, M. Computability and unsolvability. New York: McGraw-Hill, 1958. Elson, B., & Pickett, V. B. Beginning morphology-syntax. Summer Institute of Linguistics, Santa Ana, Calif., 1960. Floyd, R. W. Mathematical induction on phrase structure grammars. Information and Control, 1961, 4, 353-358. Ginsburg, S., & Rice, H. G. Two families of languages related to ALGOL. /. Assoc. Computing Machinery, 1962, 10, 350-371. Ginsburg, S., & Rose, G. F. Some recursively unsolvable problems in ALGOL-like languages. /. Assoc. Computing Mach., 1963, 10, 29-47. (a) Ginsburg, S., & Rose, G. F. Operations which preserve definability in languages. /. Assoc. Computing Mach., 1963, 10, 175-195. (b) Greibach, S. Undecidability of the ambiguity problem for minimal linear grammars. Information and Control (in press). Gross, M. On the equivalence of models of languages used in the fields of mechanical translation and information retrieval. Mimeographed. Cambridge: Mass. Inst. Tech., 1962. Harris, Z. S. From morpheme to utterance. Language, 1946, 22, 161-183. Harris, Z. S. Methods in structural linguistics. Chicago: Univer. of Chicago Press, 1951. Harris, Z. S. Co-occurrence and transformation in linguistic structure. Language, 1957, 33, 283-340. Hiz, H. Congrammaticality. In R. Jakobson (Ed.), Structure of language and its mathematical aspects, Proc. 12th Sympos. in Appl. Math. Providence, R.I.: American Mathematical Society, 1961. Pp. 43-50. Katz, J., & Fodor, J. The structure of a semantic theory. To appear in Language. Reprinted in J. Katz & J. Fodor (Eds.), Readings in the Philosophy of Language. 
New York: Prentice-Hall, 1963. Kleene, S. C. Representation of events in nerve nets and finite automata. In C. E. REFERENCES 41? Shannon & J. McCarthy (Eds.), Automata Studies. Princeton: Princeton Univer. Press, 1956. Pp. 3-41, Kohler, W. The place of value in a world of fact. New York: Liveright, 1938. Kulagina, O. S. Ob odnom sposobe opredelenija grammaticeskix panjatij na baze teoril mnozestv. (On one method of defining grammatical categories on the basis of set theory.) Problemy kibernetiki, 1, Moscow, 1958. Lambek, J. The mathematics of sentence structure. Amer. Math. Monthly, 1958, 65, 154-170. Lambek, J. Contributions to a mathematical analysis of the English verb phrase. /. Canadian Linguistic Assoc., 1959, 5, 83-89. Lambek, J. On the calculus of syntactic types. In R. Jakobson (Ed.), Structure of language and its mathematical aspects, Proc. \2th Symp. in Appl. Math. Providence, R.I.: American Mathematical Society, 1961. Pp. 166-178. Landweber, P. S. Three theorems on phrase structure grammars of type 1 . Information and Control (in press). Langendoen, T. Structural descriptions for sentences generated by non-self-embedding constituent grammars. Undergraduate Honors Thesis, Mass. Inst. of Tech., 1961. Lashley, K. S. Learning: I. Nervous mechanisms of learning. In C. Murchison (Ed.), The foundations of experimental psychology. Worcester, Mass.: Clark Univer. Press, 1929. Pp. 524-563. Lashley, K. S. The problem of serial order in behavior. In L. A. Jeffress (Ed.)9 Cerebral mechanisms in behavior. New York: Wiley, 1951. Pp. 112-136. Matthews, G. H. Hidatsa syntax. Mimeographed. Cambridge; Mass. Inst. Tech., 1962a. Matthews, G. H. Discontinuity and asymmetry in phrase structure grammars. Informa- tion and Control (in press). Matthews, G. H. A note on asymmetry of phrase structure grammars. Information and Control (in press). Miller, G. A., & Selfridge, J. A. Verbal context and the recall of meaningful material. Amer. J. PsychoL, 1950, 63, 176-185. Myhill, J. Linear bounded automata. WADD Technical note 60-165. Wright Air Development Division, Wright-Patterson Air Force Base, Ohio, 1960. McNaughton, R. The theory of automata: a survey. In F. L. Alt (Ed.), Advances in computers, Vol. 2. New York: Academic Press, 1961. McNaughton, R., & Yamada, H. Regular expressions and state graphs for automata. IRE Trans, on Electronic Computers, 1960, EC-9, 39-47. Newell, A., Shaw, J. C., & Simon, H. A. Report on a general problem-solving program. In Information Processing. Proc. International Conference on Information Processing, UNESCO, Paris, June 1959. Pp. 256-264. Oettinger, A. Automatic syntactic analysis and the pushdown store. In R. Jakobson (Ed.), Structure of language and its mathematical aspects, Proc. 12th Sympos. in Appl Math. Providence, R.I.: American Mathematical Society, 1961. Pp. 104-129. Parikh, R. Language-generating devices. RLE Quart. Prog. Kept., No. 60, Cambridge, Mass.: M.I.T. January 1961, 199-212. Post, E. A variant of a recursively unsolvable problem. Bull. Amer. Math. Soc., 1946, 52, 264-268. Postal, P. On the limitations of context-free phrase structure description. RLE Quart. Prog. Rept., No. 64, Cambridge, Mass.: M.I.T. January 1962, 231-238. Postal, P. Constituent analysis. Int. J. Amer. Linguistics, Supplement (to appear). Rabin, M., & Scott, D. Finite automata and their decision problems. IBM J. Res. Develop., 1959,3,114-125. 418 FORMAL PROPERTIES OF GRAMMARS Ritchie, R. W. Classes of recursive functions of predictable complexity. Doctoral dissertation, Dept. 
Math., Princeton Univer., 1960. Rogers, H. Recursive functions and effective computability . Mimeographed, Dept. Math., Mass. Inst. Tech., 1961. Sardinas, A. A., & Patterson, G. W. A necessary and sufficient condition for unique decipherability of coded messages. IRE Convention Record, 1953, 8, 104-108. Saussure, F. de. Cours de linguistique generate, Paris: 1916. (Translation by W. Baskin, Course in general linguistics, New York: Philosophical Library, 1959). Scheinberg, S. Some properties of constituent structure grammars. Unpublished paper, 1960. (a) Scheinberg, S. Note on the Boolean properties of context-free languages. Information and Control, 1960, 3, 372-375. (b) Schiitzenberger, M. P. Un probleme de la theorie des automates. Seminaire Dubreil- Pisot, Paris, December 1959. Schiitzenberger, M. P. A remark on finite transducers. Information and Control, 1961, 4, 185-196. (a) Schiitzenberger, M. P. On the definition of a family of automata. Information and Control, 1961, 4, 245-270. (b) Schiitzenberger, M. P. Some remarks on Chomsky's context-free languages. RLE Quart. Prog. Kept. No. 63, Cambridge, Mass. : M.I.T. October 1961, 155-170. (c) Schiitzenberger, M. P. On a family of formal power series. Mimeographed, 1962. (a) Schiitzenberger, M. P. Certain families of elementary automata and their decision problems. To appear in Proc. Sympos. on Math. Theory Automata, Vol. XII, MRI Symposia Series, 1962. (b) Schiitzenberger, M. P. On a theorem of Jungen. Proc. Amer. Math. Soc., 1962, 13, 885-890. (c) Schiitzenberger, M. P. On context-free languages and push-down automata. Research paper RC-793 of IBM Res. Lab., Yorktown Heights, New York, 1962. (d) Schiitzenberger, M. P. Finite counting automata. Information and Control, 1962, 5, 91-107. (e) Schiitzenberger, M. P., & Chomsky, N. The algebraic theory of context-free languages. Computer programming and formal systems. Amsterdam: North-Holland, 1963. Pp. 118-161. Shamir, E. On sequential languages. Tech. Rept. No. 7, Office of Naval Research, Information Systems Branch, 1961. Shamir, E. A remark on discovery algorithms for grammars. Information and Control 1962,5,246-251. Shannon, C. E., & Weaver, W. The mathematical theory of communication. Urbana: University of Illinois Press, 1949. Shepherdson, J. C. The reduction of two-way automata to one-way automata. IBM J. Res. Develop., 1959, 3, 198-200. Suszko, R. Syntactic structure and semantical reference I. Studia logica 1958 8 213-244. Tolman, E. C. Purposive behavior in animals and men. New York: Appleton-Century- Crofts, 1932. Wells, R. Immediate constituents. Language, 1947, 23, 81-117. Wundheiler, L., & Wundheiler, A. Some logical concepts for syntax. In W. N. Locke & A. D. Booth (Eds.), Machine translation of languages. Cambridge: Technology Press and Wiley, 1955. Pp. 194-207. Yamada, H. Counting by a class of growing automata. Doctoral dissertation, Univer. of Pennsylvania, Philadelphia, 1960. Finitary Models of Language Users1 George A. Miller Harvard University Noam Chomsky Massachusetts Institute of Technology 1. The preparation of this Chapter was supported in part by the U.S. Army, the Air Force Office of Scientific Research, and the Office of Naval Research ; and in part by the National Science Foundation (Grants No. NSF G-16486 and No. NSF G- 13903}. Contents 1. Stochastic Models 421 1.1. Markov sources, 422 1.2. A>Limited stochastic sources, 427 1.3. A measure of selective information, 431 1 .4. Redundancy, 439 1.5. Some connections with grammaticalness, 443 1 .6. 
Minimum-redundancy codes, 450 1.7. Word frequencies, 456 2. Algebraic Models 464 2.1. Models incorporating rewriting systems, 468 2.2. Models incorporating transformational grammars, 476 3. Toward a Theory of Complicated Behavior 483 References 488 420 Finitary Models of Language Users In this chapter we consider some of the models and measures that have been proposed to describe talkers and listeners — to describe the users of language rather than the language itself. As m was pointed out at the beginning of Chapter 12, our language is not merely the collection of our linguistic responses, habits, or dispositions, just as our knowledge of arithmetic is not merely the collection of our arithmetic responses, habits, or dispositions. We must respect this distinction between the person's knowledge and his actual or even potential behavior; a formal characteri- zation of some language is not simultaneously a model of the users of that language. When we turn to the description of a user, a severe constraint is placed on our formulations. We have seen that natural languages are not ade- quately characterized by one-sided linear grammars (finite automata), yet we know that they must be spoken and heard by devices with bounded memory. How might this be accomplished ? No automaton with bounded memory can produce all and only the grammatical sentences of a natural language; every such device, man presumably included, will exhibit certain limitations. In considering models for the actual performance of human talkers and listeners an important criterion of adequacy and validity must be the extent to which the model's limitations correspond to our human limita- tions. We shall consider various finite systems — both stochastic and algebraic — with the idea of comparing their shortcomings with those of human talkers and listeners. For example', the fact that people are able to produce and comprehend an unlimited variety of novel sentences indicates immediately that their capacities are quite different from those of an automaton that compiles a simple list of all the grammatical sentences it hears. This example is trivial, yet it illustrates the kind of argument we must be prepared to make. 1. STOCHASTIC MODELS It is often assumed, usually by workers interested in only one aspect of communication, that our perceptual models for a listener will be rather different from any behavioral models we might need for a speaker. 421 422 FINITARY MODELS OF LANGUAGE USERS That assumption was not adopted in our discussion of formal aspects of linguistic competence, and it will not be adopted here in discussing empirical aspects of linguistic performance. In proposing models for a user of language — a user who is simultaneously talker and listener — we have assumed instead that the theoretically significant aspects of verbal behavior must be common to both the productive and receptive functions. Once a formal theory of communication or language has been con- structed, it generally turns out to be equally useful for describing both sources and receivers; in order to describe one or the other we simply rename various components of the formal theory in an appropriate fashion. This is illustrated by the stochastic theories considered in this section. 
Stochastic theories of communication generally assume that the array of message elements can be represented by a probability distribution and that various communication processes (coding, transmitting, and receiving) have the effect of operating on that a priori distribution to transform it according to known transitional probabilities into an a posteriori distribution. The basic mathematical idea, therefore, is simply the multiplication of a vector by a matrix. But the interpretation we give to this underlying mathematical structure differs, depending on whether we interpret it as a model of a source, a channel, or a receiver. Thus the distinction between talkers and listeners is in no way critical for the development of the basic stochastic theory of communication. The same neutrality also characterizes the algebraic models of the user that are discussed in Sec. 2 of this chapter.

Purely for expository purposes, however, it is often convenient to present the mathematical argument in a definite context. For that reason we have arbitrarily chosen here to interpret the mathematics as a model of the source. This choice should not be taken to mean that a stochastic theory of communication must be concerned solely, or even principally, with speakers rather than with transmitters or hearers. The parallel development of these models for a receiver would be simply redundant, since little more than a substitution of terms would be involved.

1.1 Markov Sources

An important function of much communication is to reduce the uncertainty of a receiver about the state of affairs existing at the source. In such task-oriented communications, if there were no uncertainty about what a talker would say, there would be no need for him to speak. From a receiver's point of view the source is unpredictable; it would seem to be a natural strategy, therefore, to describe the source in terms of probabilities. Moreover, the process of transmission is often exposed to random and unpredictable perturbations that can best be described probabilistically. The receiver himself is not above making errors; his mistakes can be a further source of randomness. Thus there are several valid motives for the development of stochastic theories of communication.

A stochastic theory of communication readily accommodates an infinitude of alternative sentences. Indeed, there would seem to be far more stochastic sequences than we actually need. Since no grammatical sentence is infinitely long, there can be at most only a countable infinitude of them. In probability theory we deal with a random sequence that extends infinitely in both directions, past and future, and we consider the uncountable infinitude of all such sequences that might occur.2 The events with which probability theory deals are subsets of this set of all sequences. A finite stochastic sentence, therefore, must correspond to a finite segment of the infinite random sequence. A probability measure is assigned to the space of all possible sequences in such a way that (in theory, at least) the probability of any finite segment can be computed.

2. We assume that the stochastic processes we are studying are stationary.

If the process of manufacturing messages were completely random, the product would bear little resemblance to actual utterances in a natural language. An important feature of a stochastic model for verbal behavior is that successive symbols can be correlated — that the history of the message will support some prediction about its future.
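To make the vector-by-matrix idea concrete, here is a minimal sketch in Python of an a priori distribution over two message elements being transformed, by a matrix of transitional probabilities, into an a posteriori distribution; the two-element alphabet and the particular numbers are invented for illustration and are not taken from the chapter.

```python
# A priori distribution over two message elements, and a matrix of
# transitional probabilities p(output j | input i).  All numbers are
# illustrative assumptions.
prior = [0.7, 0.3]
transition = [
    [0.9, 0.1],   # row i gives the probabilities of each output given input i
    [0.2, 0.8],
]

# The a posteriori distribution is the vector-matrix product.
posterior = [
    sum(prior[i] * transition[i][j] for i in range(len(prior)))
    for j in range(len(transition[0]))
]
print(posterior)   # [0.69, 0.31]; it still sums to one
```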
In 1948 Shannon revived and elaborated an early suggestion by Markov that the source of messages in a discrete communication system could be represented by a stationary stochastic process that selected successive elements of the message from a finite vocabulary according to fixed probabilities. For example, Markov (1913) classified 20,000 successive letters in Pushkin's Eugene Onegin as vowels v or consonants c, then tabulated the frequency N of occurrences of overlapping sequences of length three. His results are summarized in Table 1 in the form of a tree.

There are several constraints on the frequencies that can appear in such a tabulation of binary sequences. For example, N(vc) = N(cv) ± 1, since the sequence cannot shift from vowels to consonants more often, ±1, than it returns from consonants to vowels. In this particular example the number of degrees of freedom is 2^(n−1), where n is the length of the string that is analyzed and 2 is the size of the alphabet.

The tabulated frequencies enable us to estimate probabilities. For instance, the estimated probability of a vowel is p(v) = N(v)/N = 0.432. If successive letters were independent, we would expect the probability of a vowel following a consonant p(v | c) to be the same as the probability of a vowel following another vowel p(v | v), and both would equal p(v). The tabulation, however, yields p(v | c) = 0.663, which is much larger than p(v), and p(v | v) = 0.128, which is much smaller. Clearly, Russian vowels are more likely to occur after consonants than after vowels. Newman (1951) has reported further data on the written form of several languages and has confirmed this general tendency for vowels and consonants to alternate. (It is unlikely that this result would be seriously affected if the analyses had been made with phonemes rather than with written characters.)

Table 1. Markov's Data on Consonant-Vowel Sequences in Pushkin's Eugene Onegin

  N(vvv) =   115   N(vv) = 1,104
  N(vvc) =   989                   N(v) =  8,638
  N(vcv) = 4,212   N(vc) = 7,534
  N(vcc) = 3,322                                   N = 20,000
  N(cvv) =   989   N(cv) = 7,534
  N(cvc) = 6,545                   N(c) = 11,362
  N(ccv) = 3,322   N(cc) = 3,827
  N(ccc) =   505

Inspection of the message statistics in Table 1 reveals that the probability of a vowel depends on more than the one immediately preceding letter. Strictly speaking, therefore, the chain is not Markovian, since a Markov process has been defined in such a way (cf. Feller, 1957) that all of the relevant information about the history of the sequence is given when the single, immediately preceding outcome is known. However, the Markovian representation is readily projected to handle more complicated cases. We shall consider how this can be done. But first we must clarify what is meant by a Markov source.

Given a discrete Markov process with a finite number of states v_0, ..., v_D and a probability measure μ, a Markov source is constructed by defining V = {v_0, ..., v_D} to be the vocabulary; messages are formed by concatenating the names of the successive states through which the system passes. In the terms used in Sec. 1.2 of Chapter 12 a Markov source is a special type of finite state automaton in which each defining triple carries as its symbol the name of the state being entered and in which the control unit has access to the conditional probabilities of all state transitions.
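A sketch of the kind of tabulation Markov carried out: the Python snippet below counts overlapping pairs of vowel/consonant symbols in a text and estimates p(v), p(v | c), and p(v | v) from the counts. The input string is an arbitrary stand-in, not Markov's data; a real tabulation would use a long sample such as the 20,000 letters from Eugene Onegin.

```python
from collections import Counter

# Stand-in text; substitute any long sample of a real language.
text = "onegin is a novel in verse written by pushkin"
symbols = ["v" if ch in "aeiou" else "c" for ch in text if ch.isalpha()]

pairs = Counter(zip(symbols, symbols[1:]))      # overlapping pairs
singles = Counter(symbols)

N = len(symbols)
p_v = singles["v"] / N                          # p(v) = N(v)/N
p_v_after_c = pairs[("c", "v")] / singles["c"]  # estimate of p(v | c)
p_v_after_v = pairs[("v", "v")] / singles["v"]  # estimate of p(v | v)

print(round(p_v, 3), round(p_v_after_c, 3), round(p_v_after_v, 3))
```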
In Sec. 2 of Chapter 11, a state was defined as the set of all initial strings that were equivalent on the right. This definition must be extended for stochastic systems, however. We say that all the strings that allow the same continuations with the same probabilities are stochastically equivalent on the right; then a state of a stochastic source is the set of all strings that are stochastically equivalent on the right.

If we are given a long but arbitrary sequence of symbols and wish to test whether it comprises a Markov chain, we must proceed to tabulate the frequencies of the possible pairs, triplets, etc. Our initial (Markovian) hypothesis in this analysis is that the symbol occurring at any given time can be regarded as the name of the state that the source is in at that time. Inspection of the actual sequence, however, may reveal that some of the hypothesized states are stochastically equivalent on the right (all possible continuations are assigned the same probabilities in both cases) and so can be parsimoniously combined into a single state. This reduction in the number of states implies that the state names must be distinguished from the terminal vocabulary. We can easily broaden our definition of a Markov source to include these simplified versions by distinguishing the set of possible states {S_0, S_1, ..., S_m} from the vocabulary {v_0, v_1, ..., v_D}.

Since human messages have dependencies extending over long strings of symbols, we know that any pure Markov source must be too simple for our purposes. In order to generalize the Markov concept still further, therefore, we can introduce the following construction (McMillan, 1953): given a Markov source with a vocabulary V, select some different vocabulary W and define a homomorphic mapping of V into W. This mapping will define a new probability measure. The new system is a projection of a Markov source, but it may not itself be Markovian in the strict sense.

Definition 1. Given a Markov source with vocabulary V = {v_0, ..., v_D}, with internal states S_0, ..., S_m, and with probability measure μ, a new source can be constructed with the same states but with vocabulary W and derived probability measure μ′, where w_j ∈ W if and only if there is a v_i ∈ V and a mapping θ such that θ(v_i) = w_j. Any source formed from a Markov source by this construction is a projected Markov source.

The effect of this construction is best displayed by an example. Consider the Markov source whose graph is shown in Fig. 1 and assume that appropriate probabilities are assigned to the indicated transitions, all other conceivable transitions having probability zero. The vocabulary is V = {1, 2, 3, 4}, and each symbol names the state that the system is in after that symbol occurs.

[Fig. 1. Graph of a Markov source.]

We shall consider three different ways to map V into an alternative vocabulary according to the construction in Definition 1:

1. Let θ(1) = θ(4) = v and θ(2) = θ(3) = c. Then the projected system is a higher order Markov source of the type required to represent the probabilities of consonant-vowel triplets in Table 1. Under this construction we would probably identify state 1 as [vv], state 2 as [vc], state 3 as [cc], and state 4 as [cv], thus retaining the convention of naming states after the sequences that lead into them, but now with the non-Markovian stipulation that more than one preceding symbol is implicated.
In the terminology of Chapter 12, we are dealing here with a k-limited automaton, where k = 2.

2. Let θ(1) = θ(2) = a and θ(3) = θ(4) = b. Then the projected system is ambiguous: an occurrence of a may leave the system in either state 1 or state 2; an occurrence of b may leave it in either state 3 or state 4. The states cannot be distinctively named after the sequences that lead into them.

3. Let θ(1) = −1, θ(2) = θ(4) = 0, and θ(3) = +1. With this projection we have a non-Markovian example mentioned by Feller (1957, p. 379). If we are given a sequence of independent random variables that can assume the values ±1 with probability ½, we can define the moving average of successive pairs, X_n = (Y_n + Y_{n+1})/2. The sequence of values of X_n is non-Markovian for an instructive reason; given a consecutive run of X_n = 0, how it will end depends on whether it contains an odd or an even number of 0's. After a run of an even number of occurrences of X_n = 0 the run must terminate as it began; after an odd number the run must terminate with the opposite symbol from the one with which it started. Thus it is necessary to remember how the system got into each run of 0's and how long the run has been going on. But, since there is no limit to how long a run of 0's may be, this system is not k-limited for any k. Thus it is impossible to produce the moving average by a simple Markov source or even by a higher order (k-limited) Markov process (which still must have finite memory), but it is quite simple to produce it with the projected Markov source constructed here.

By this construction, therefore, we can generalize the notion of a Markov source to cover any kind of finite state system (regular event) for which a suitable probability measure has been defined.

Theorem 1. Any finite state automaton over which an appropriate probability measure is defined can serve as a projected Markov source.

Given any finite state automaton with an associated probability measure, assign a separate integer to each transition. The set of integers so assigned must form the vocabulary of a Markov source, and the rule of assignment defines a homomorphic mapping into a projected Markov source. This formulation makes precise the sense in which regular languages can be said to have Markovian properties.

All of our projected Markov sources will be assumed to operate in real time, from past to future, which we conventionally denote as left to right. Considered as rewriting systems, therefore, they contain only right-branching rules of the general form A → aB, where A and B correspond to states of the stochastic system. The variety of projected Markov sources is, of course, extremely large, and only a few of the many possible types have been studied in any detail. We shall sample some of them in the following sections.

These same ideas could have been developed equally well to describe a receiver rather than a source. A projected Markov receiver is one that will accept as input only those strings of symbols that correspond to possible sequences of state transitions and that, through explicit agreement with the source or through long experience, has built up for each state an estimate of the probabilities of all possible continuations. As we have already noted, once the mathematical theory is fixed its specialization as a model for either the speaker or the hearer is quite simple.
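A minimal sketch of the third projection: the Python fragment below simulates the underlying pair-tracking Markov chain and emits the projected symbol, the moving average of the last two ±1 outcomes. Representing the state simply as the last two outcomes is my own bookkeeping for the sketch; it is not the state numbering of Fig. 1.

```python
import random

def moving_average_source(n, seed=0):
    """Emit X_n = (Y_n + Y_{n+1}) / 2 for independent Y_i = +/-1.

    The 'state' is the pair of the two most recent Y values, so the
    underlying process is Markovian even though the projected output
    (the moving average alone) is not.
    """
    rng = random.Random(seed)
    y_prev = rng.choice([-1, 1])
    out = []
    for _ in range(n):
        y_next = rng.choice([-1, 1])
        state = (y_prev, y_next)        # Markovian state
        out.append(sum(state) / 2)      # projected symbol theta(state)
        y_prev = y_next
    return out

print(moving_average_source(20))
# A run of 0's of even length ends with the same symbol that preceded it,
# a run of odd length with the opposite one, so the projected output is not
# k-limited for any k, although the pair-valued chain that produced it is.
```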
We are really concerned with ways to characterize the user of natural languages; the fact that we have here pictured him as a source is quite arbitrary.

1.2 k-Limited Stochastic Sources

One well-studied type of projected Markov source is known generally as a higher order, or k-limited, Markov source, which generates a (k + 1)-order approximation to the sample of text from which it is derived. The states of the k-limited automaton are identified with the sequences of k successive symbols leading into them, and associated with each state is a probability distribution defined over the D different symbols of the alphabet. If there are D symbols in the alphabet, then a k-limited stochastic source will have (potentially) D^k different states. A 0-limited stochastic source has but one state and generates the symbols independently.

If k is small and if we consider an alphabet of only 27 characters (26 letters and a space), it is possible to estimate the transitional probabilities for a k-limited stochastic source by actually counting the number of (k + 1)-tuplets of each type in a long sample of text. If we use these tabulations, it is then possible to produce (k + 1)-order approximations to the original text by drawing successive characters according to the probability distribution associated with the state determined by the string of k preceding characters (a minimal sketch of this sampling procedure is given after the examples below). It is convenient to define a zero-order approximation as one that uses the characters independently and equiprobably; a first-order approximation uses the characters independently; a second-order approximation uses the characters with the probabilities appropriate in the context of the immediately preceding letter; etc.

An impression of the kind of approximations to English that these sources produce can be obtained from the following examples, taken from Shannon (1948). In each case the (k + 1)th symbol was selected with probability appropriate to the context provided by the preceding k symbols.

1. Zero-order letter approximation (26 letters and a space, independent and equiprobable): XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYDQPAAMKBZAACIBZLHJQD.

2. First-order letter approximation (characters independent but with frequencies representative of English): OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.

3. Second-order letter approximation (successive pairs of characters have frequencies representative of English text): ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.

4. Third-order letter approximation (triplets have frequencies representative of English text): IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.

A k-limited stochastic source can also be defined for the words in the vocabulary V in a manner completely analogous to that for letters of the alphabet A. When states are defined in terms of the k preceding words, the following kinds of approximations are obtained:

5. First-order word approximation (words independent, but with frequencies representative of English): REPRESENTING AND SPEEDILY IS AN GOOD APT OR CAME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation (word pairs with frequencies representative of English): THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

The following two illustrations are taken from Miller & Selfridge (1950).

7. Third-order word approximation (word triplets with frequencies representative of English): FAMILY WAS LARGE DARK ANIMAL CAME ROARING DOWN THE MIDDLE OF MY FRIENDS LOVE BOOKS PASSIONATELY EVERY KISS IS FINE.

8. Fifth-order word approximation (word quintuplets with frequencies representative of English): ROAD IN THE COUNTRY WAS INSANE ESPECIALLY IN DREARY ROOMS WHERE THEY HAVE SOME BOOKS TO BUY FOR STUDYING GREEK.

Higher-order approximations to the statistical structure of English have been used to manipulate the apparent meaningfulness of letter and word sequences as a variable in psychological experiments. As k increases, the sequences of symbols take on a more familiar look and — although they remain nonsensical — the fact seems to be empirically established that they become easier to perceive and to remember correctly.
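As promised above, here is a minimal sketch of the sampling procedure behind approximations of this kind. The Python fragment tabulates the (k + 1)-tuplets in whatever training text it is handed and then draws successive characters from the distribution attached to the k preceding characters; the tiny training string is a placeholder, not one of Shannon's corpora, and the sketch assumes k ≥ 1.

```python
import random
from collections import Counter, defaultdict

def approximate(text, k, length, seed=1):
    """Generate a (k+1)-order approximation to `text` (character units, k >= 1)."""
    rng = random.Random(seed)
    table = defaultdict(Counter)           # state (k chars) -> next-char counts
    for i in range(len(text) - k):
        table[text[i:i + k]][text[i + k]] += 1

    state = rng.choice(list(table))        # start in some observed state
    out = state
    for _ in range(length):
        counts = table.get(state)
        if not counts:                     # dead end: restart in a random state
            state = rng.choice(list(table))
            counts = table[state]
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights)[0]
        state = out[-k:]                   # the k preceding characters
    return out

sample = "the head and in frontal attack on an english writer that the character "
print(approximate(sample, k=2, length=60))
```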
We know that the sequences produced by k-limited Markov sources cannot converge on the set of grammatical utterances as k increases because there are many grammatical sentences that are never uttered and so could not be represented in any estimation of transitional probabilities. A k-limited Markov source cannot serve as a natural grammar of English no matter how large k may be. Increasing k does not isolate the set of grammatical sentences, for, even though the number of high-probability grammatical sequences included is thereby increased, the number of low-probability grammatical sequences excluded is also increased correspondingly. Moreover, for any finite k there would be ungrammatical sequences longer than k symbols that a stochastic user could not reject.

Even though a k-limited source is not a grammar, it might still be proposed as a model of the user. Granted that the model cannot isolate the set of all grammatical sentences, neither can we; inasmuch as our human limitations often lead us into ungrammatical paths, the real test of this model of the user is whether it exhibits the same limitations that we do. However, when we examine this model, not as a convenient way to summarize certain statistical parameters of message ensembles, but as a serious proposal for the way people create and interpret their communicative utterances, it is all too easy to find objections. We shall mention only one, but one that seems particularly serious: the k-limited Markov source has far too many parameters (cf. Miller, Galanter, & Pribram, 1960, pp. 145-148).

As we have noted, there can be as many as D^k probabilities to be estimated. By the time k grows large enough to give a reasonable fit to ordinary usage the number of parameters that must be estimated will have exploded; a staggering amount of text would have to be scanned and tabulated in order to make reliable estimates. Just how large must k and D be in order to give a satisfactory model? Consider a perfectly ordinary sentence: The people who called and wanted to rent your house when you go away next year are from California. In this sentence there is a grammatical dependency extending from the second word (the plural subject people) to the seventeenth word (the plural verb are). In order to reflect this particular dependency, therefore, k must be at least 15 words. We have not attempted to explore how far k can be pushed and still appear to stay within the bounds of common usage, but the limit is surely greater than 15 words; and the vocabulary must have at least 1000 words. Taking these conservative values of k and D, therefore, we have D^k = 10^45 parameters to cope with, far more than we could estimate even with the fastest digital computers.

Of course, we can argue that many of these 10^45 strings of 15 words whose probabilities must be estimated are redundant or that most of them have zero probability. A more realistic estimate, therefore, might assume that what we learn are not the admissible strings of words but rather the "sentence frames" — the admissible strings of syntactic categories. Moreover, we might recognize that not all sequences of categories are equally likely to occur; as a conservative estimate (cf. Somers, 1961), we might assume that on the average there would be about four alternative categories that might follow in any given context. By such arguments, therefore, we can reduce D to as little as four, so that D^k becomes 4^15 ≈ 10^9. That value is, of course, a notable improvement over 10^45 parameters, yet, when we recall that several occurrences of each string are required before we can obtain reliable estimates of the probabilities involved, it becomes apparent that we still have not avoided the central difficulty — an enormous amount of text would have to be scanned and tabulated in order to provide a satisfactory empirical basis for a model of this type.

The trouble is not merely that the statistician is inconvenienced by an estimation problem. A learner would face an equally difficult task. If we assume that a k-limited automaton must somehow arise during childhood, the amount of raw induction that would be required is almost inconceivable. We cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds.
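The arithmetic behind these two parameter counts is easy to reproduce; the small snippet below simply evaluates D^k for the two sets of values quoted above and adds nothing beyond that bookkeeping.

```python
# Parameter counts for a k-limited word source, using the values in the text.
k = 15

naive = 1000 ** k        # D = 1000 word types: 10**45 conditional histories
frames = 4 ** k          # D = 4 category continuations: about 1.07e9

seconds_of_childhood = 10 ** 8
print(f"{naive:.1e} strings of 15 words")
print(f"{frames:.1e} category frames, versus {seconds_of_childhood:.1e} seconds of childhood")
```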
1.3 A Measure of Selective Information

Although the direct estimation of all the probabilities involved in a k-limited Markov model of the language user is impractical, other statistics of a more general and summary nature are available to represent certain average characteristics of such a source. Two of these with particular interest for communication research are amount of information and redundancy. We introduce them briefly and heuristically at this point.

The problem of measuring amounts of information in a communication situation seems to have been posed first by Hartley (1928). If some particular piece of equipment — a switch, say, or a relay — has D possible positions, or physical states, then two of the devices working together can have D^2 states, three can have D^3 states altogether, etc. The number of possible states of the total system increases exponentially as the number of devices increases linearly. In order to have a measure of information that will make the capacity of 2n devices just double the capacity of n of them, Hartley defined what we now call the information capacity of a device as log D, where D is the number of different states the total system can get into. Hartley's proposal was later generalized and considerably extended by Shannon (1948) and Wiener (1948).

When applied to a communication channel, Hartley's notion of capacity refers to the number of different signals that might be transmitted in a unit interval of time. For example, let N(T) denote the total number of different strings exactly T symbols long that the channel can transmit. Let D be the number of different states the channel has available and assume that there are no constraints on the possible transitions from one state to another. Then N(T) = D^T, or

log N(T) / T = log D,

which is Hartley's measure of capacity. In case there are some constraints on the possible transitions, N(T) will still (in the limit) increase exponentially but less rapidly. In the general case, therefore, we are led to define channel capacity in terms of the limit:

channel capacity = lim_{T→∞} [log N(T)] / T.   (1)

This is the best the channel can do. If a source produces more information per symbol on the average, the channel will not be able to transmit it all — not, at least, in the same number of symbols. The practical problem, therefore, is to estimate N(T) from what we know about the properties of the channel.

Our present goal, however, is to see how Hartley's original insight has been extended to provide a measure of the amount of information per symbol contained in messages generated by stochastic devices of the sort described in the preceding sections of this chapter. We shall confine our discussion here to those situations in which the purpose of communication is to reduce a receiver's uncertainty. The amount of information he receives, therefore, must be some function of what he learns about the state of the source. And what he learns will depend on how ignorant he was to begin with. Let us assume that the source selects its message by any procedure, random or deterministic, but that all the receiver knows in advance is that the source will choose among a finite set of mutually exclusive messages M_1, M_2, ..., M_D with probabilities p(M_1), p(M_2), ..., p(M_D), where these probabilities sum to unity.

What Shannon and Wiener did was to develop a measure H(M) of the receiver's uncertainty, where the argument M designates the choice situation:

M = (M_1, M_2, ..., M_D; p(M_1), p(M_2), ..., p(M_D)).

When the particular message is correctly received, a listener's uncertainty about it will be reduced from H(M) to zero; therefore, the message conveyed H(M) units of information. Thus H(M) is a measure of the amount of information required (on the average) to select M_i when faced with the choice situation M.

We list as assumptions a number of properties that intuition says a reasonable measure of uncertainty ought to have for discrete devices. Then, following a heuristic presentation by Khinchin (1957), we shall use those assumptions to develop the particular H of Shannon and Wiener.

Our first intuitive proposition is that uncertainty depends only on what might happen. Impossible events will not affect our uncertainty. If a particular message M_i is known in advance to have p(M_i) = 0, it should not affect the measure H in any way if M_i is omitted from consideration.

Assumption 1. Adding any number of impossible messages to M does not change H(M):

H(M_1, ..., M_D, M_{D+1}; p(M_1), ..., p(M_D), 0) = H(M_1, ..., M_D; p(M_1), ..., p(M_D)).

Our second intuition is that people are most uncertain when the alternative messages are all equally probable. Any bias that makes one message more probable than another conveys information in the sense that it reduces the receiver's total amount of uncertainty. With only two alternative messages, for example, a 50 : 50 split presents the least predictable situation imaginable.
Since there are D different messages in M, when they are equiprobable p(M_i) = 1/D for all i.

Assumption 2. H(M) is a maximum when all the messages in M are equiprobable:

H(M_1, ..., M_D; p(M_1), ..., p(M_D)) ≤ H(M_1, ..., M_D; 1/D, ..., 1/D).

Now let L(D) represent the amount of uncertainty involved when all the messages are equiprobable. Then we have, by virtue of our two assumptions,

L(D) = H(M_1, ..., M_D; 1/D, ..., 1/D) = H(M_1, ..., M_D, M_{D+1}; 1/D, ..., 1/D, 0)
     ≤ H(M_1, ..., M_{D+1}; 1/(D+1), ..., 1/(D+1)) = L(D + 1).

Therefore, we have established the following lemma:

Lemma 1. L(D) is a monotonic increasing function of D.

That is to say, when all D of the alternative messages in M are equiprobable H(M) is a nondecreasing function of D. Intuitively, the more different things that can happen, the more uncertain we are.

It is also reasonable to insist that the uncertainty associated with a choice should not be affected by making the choice in two or more steps, but should be the weighted sum of the uncertainties involved in each step. This critically important assumption can be stated:

Assumption 3. H(M) is additive.

Let any two events of M be combined to form a single, compound event, which we designate as M_1 ∪ M_2 and which has probability p(M_1 ∪ M_2) = p(M_1) + p(M_2). Thus we can decompose M into two parts:

M′ = (M_1 ∪ M_2, M_3, ..., M_D; p(M_1) + p(M_2), p(M_3), ..., p(M_D))

and

M″ = (M_1, M_2; p(M_1)/[p(M_1) + p(M_2)], p(M_2)/[p(M_1) + p(M_2)]).

A choice from M is equivalent to a choice from M′ followed (if M_1 ∪ M_2 is chosen) by a choice from M″. Assumption 3 means that H(M) depends on the sum of H(M′) and H(M″). In calculating H(M), however, H(M″) should be weighted by p(M_1) + p(M_2) because that represents the probability that a second choice will be required. Assumption 3 implies that

H(M) = H(M′) + [p(M_1) + p(M_2)] H(M″).

If this equation holds whenever two messages of M are lumped together, then it can easily be generalized to any subset whatsoever, and it can be extended to more than one subset of messages in M. In order to discuss this more general situation, we represent the messages in M by M_{ij}, where i is the first selection and j is the second. The first selection is made from A:

A = (A_1, ..., A_r; p(A_1), ..., p(A_r)),

where the p(A_i) are the probabilities of the first selection, and the second choice depends (as before) on the outcome of the first; that is to say, the second choice is made from

B_{A_i} = (B_1, ..., B_s; p(B_1 | A_i), ..., p(B_s | A_i)).

The B_j have probabilities p(B_j | A_i) that depend on A_i, the preceding choice from A. The two choices together are equivalent to — are a decomposition of — a single choice from M, where A_iB_j = M_{ij} and p(A_iB_j) = p(A_i) p(B_j | A_i).

Now, by Assumption 3, H(M) should be the sum of the two components. But that is a bit complicated, since H(B | A_i) is a random variable depending on i. On the average, however, it will be

E{H(B | A_i)} = Σ_i p(A_i) H(B | A_i) = H(B | A).   (2)

In this situation, therefore, the assumption of additivity means that

H(M) = H(AB) = H(A) + H(B | A).   (3)

Of course, if A and B are independent, Eq. 3 becomes

H(AB) = H(A) + H(B),   (4)

and, if the messages are independent and equally probable, a sequence of s successive choices among D alternatives will give

L(D^s) = s L(D).   (5)

We shall now establish the following lemma:

Lemma 2. L(D) = k log D, where k > 0.

Consider repeated independent choices from the same number D of equiprobable messages. Select m such that for any positive integers D, s, t

D^m ≤ s^t ≤ D^{m+1},   (6)

so that

m log D ≤ t log s ≤ (m + 1) log D,

or

m/t ≤ (log s)/(log D) ≤ (m + 1)/t.   (7)

From Eq. 6, and the fact that L(D) is monotonic increasing, it follows that

L(D^m) ≤ L(s^t) ≤ L(D^{m+1}),

and from Eq. 5 we know that

m L(D) ≤ t L(s) ≤ (m + 1) L(D),

so that

m/t ≤ L(s)/L(D) ≤ (m + 1)/t.   (8)
Combining Eqs. 7 and 8, therefore,

| (log s)/(log D) − L(s)/L(D) | ≤ 1/t.

Since m is not involved, t can be chosen arbitrarily large, and

L(s)/L(D) = (log s)/(log D).

Moreover, since D and s are quite arbitrary, these ratios must be constant independent of D; that is to say, L(D)/log D = k, so L(D) = k log D. Of course, log D is nonnegative and therefore [since L(D) is monotonic increasing] k > 0. This completes the proof of Lemma 2.

Ordinarily k is chosen to be unity when logarithms are taken to the base 2,

L(D) = log_2 D,   (9)

that is to say, the unit of measurement is taken to be the amount of uncertainty involved in a choice between two equally possible alternatives. This unit is called a bit.

Next consider the general case with unequal, but rational, probabilities. Let

p(A_i) = g_i / g,

where the g_i are all positive integers and Σ_i g_i = g. The problem is to determine H(A). In order to do this, we shall construct a second choice situation (B | A_i) in a special way so that the Cartesian product M = A × B will consist entirely of equiprobable alternatives. Let (B | A_i) consist of g_i messages each with probability 1/g_i. Therefore,

H(B | A_i) = H(B_1, ..., B_{g_i}; 1/g_i, ..., 1/g_i) = L(g_i) = c log g_i.   (10)

From Eqs. 2 and 10 it follows that

H(B | A) = c Σ_i p(A_i) log g_i = c log g + c Σ_i p(A_i) log p(A_i).   (11)

Consider next the compound choice M = A × B. Since

p(A_iB_j) = p(A_i) p(B_j | A_i) = (g_i/g)(1/g_i) = 1/g,

it must follow that for this specially contrived situation there are g equally probable events and

H(A × B) = H(AB) = L(g) = c log g.   (12)

When we substitute Eqs. 11 and 12 into Eq. 3 we obtain

c log g = H(A) + c log g + c Σ_i p(A_i) log p(A_i).

We have now established the theorem:

Theorem 2. For rational probabilities,

H(A) = −c Σ_i p(A_i) log p(A_i).   (13)
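A small numerical illustration of Eq. 13, and of the additivity required by Eq. 3, written in Python with c = 1 and base-2 logarithms; the particular distribution is an arbitrary example, not data from the text.

```python
from math import log2

def H(probs):
    """Shannon-Wiener uncertainty, Eq. 13, in bits (c = 1, base 2)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# An arbitrary four-message choice situation.
p = [0.5, 0.25, 0.125, 0.125]
print(H(p))                       # 1.75 bits
print(H([0.25] * 4))              # 2.0 bits: the equiprobable maximum L(4)

# Additivity (Eq. 3): choose a subset first, then choose within it.
p_group = [0.75, 0.25]            # p(M1 u M2), p(M3 u M4)
within_1 = [0.5 / 0.75, 0.25 / 0.75]
within_2 = [0.5, 0.5]
print(H(p_group) + 0.75 * H(within_1) + 0.25 * H(within_2))   # again 1.75
```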
(14) i 3 Now we can regard H(B \ A) as a measure of the average amount of infor- mation obtained when the source moves one step ahead by choosing a letter from the set {Bt}. [In the special case in which successive events in the chain are independent, of course, H(B \ A) reduces to #(#).] A string of N successive choices, therefore, will yield NH(B \ A) units of information on the average. In general, H(AB) < H(A) + H(B)\ equality obtains only when A and B are independent. This fact can be demonstrated as follows : the familiar expansion f = l + x + £ + £+...9 (x > -1), 43$ FINITARY MODELS OF LANGUAGE USERS can be used to establish that e* > 1 + x. If we set t = 1 + x9 this inequality can be written as f - 1 > loge t, (t > 0). Now put / = p(A:)p(Bj)/p(AiB^: P(A?AP(»? - l > lo%* P^ + ^ge p(B,) - log, p(AiBi\ P(Ai&j) and take expected values over the distribution p(AiB3) : ^) log. ii i i + i i -IIp(AiBi)logeP(AiBi), 3 i SO 1 - 1 > -H(A) - H(B) + H(AB), which is the result we wished to establish: H(A) + H(B)^H(AB). (15) If we compare Eq. 15 with the assumption of additivity expressed in Eq. 3, we see that we have also established the following theorem: Theorem 3. H(S) > H(B \ A). (16) This important inequality can be interpreted to mean that knowledge of the choice from A cannot increase our average uncertainty about the choice from B. In particular, if A represents the past history of some message and B represents the choice of the next message unit, then the average amount of information conveyed by B can never increase when we know the context in which it is selected. It is important to remember that H is an average measure of selective information, based on the assumption that the improbable event is always the most informative, and is not a simple measure of semantic information (of. Carnap & Bar-Hillel, 1952). An illustration may suggest the kind of problems that can arise: in ordinary usage It is a man will generally be judged to convey more information than It is a vertebrate, because the fact that something is a man implies that it is a vertebrate, but not vice versa. In the framework of selective information theory, however, the situation is reversed. According to the tabulations of the frequencies of English words, vertebrate is a less probable word than man, and its selection in English discourse must therefore be considered to convey more informa- tion. STOCHASTIC MODELS 43$ Because many psychological processes involve selective processes of one kind or another, a measure of selective information has proved to be of some value as a way to characterize this aspect of behavior. Surveys of various applications of information measures to psychology have been prepared by Attneave (1959), Cherry (1957), Garner (1962), Luce (1960), Miller (1953), Quastler (1955), and others. Not all applications of the mean logarithmic probability have been carefully considered and well motivated, however. As Cronbach (1955) has emphasized, in many situations it may be advisable to develop alternative measures of informa- tion based on intuitive postulates that are more closely related to the particular applications we intend to make. 1.4 Redundancy Since H(B) > H(B \ A), where equality holds only for sequentially independent messages, any sequential dependencies that the source introduces will act to reduce the amount of selective information the message contains. The extent to which the information is reduced is a general and interesting property of the source. 
It is important to remember that H is an average measure of selective information, based on the assumption that the improbable event is always the most informative, and is not a simple measure of semantic information (cf. Carnap & Bar-Hillel, 1952). An illustration may suggest the kind of problems that can arise: in ordinary usage It is a man will generally be judged to convey more information than It is a vertebrate, because the fact that something is a man implies that it is a vertebrate, but not vice versa. In the framework of selective information theory, however, the situation is reversed. According to the tabulations of the frequencies of English words, vertebrate is a less probable word than man, and its selection in English discourse must therefore be considered to convey more information.

Because many psychological processes involve selective processes of one kind or another, a measure of selective information has proved to be of some value as a way to characterize this aspect of behavior. Surveys of various applications of information measures to psychology have been prepared by Attneave (1959), Cherry (1957), Garner (1962), Luce (1960), Miller (1953), Quastler (1955), and others. Not all applications of the mean logarithmic probability have been carefully considered and well motivated, however. As Cronbach (1955) has emphasized, in many situations it may be advisable to develop alternative measures of information based on intuitive postulates that are more closely related to the particular applications we intend to make.

1.4 Redundancy

Since H(B) ≥ H(B | A), where equality holds only for sequentially independent messages, any sequential dependencies that the source introduces will act to reduce the amount of selective information the message contains. The extent to which the information is reduced is a general and interesting property of the source. Shannon has termed it the redundancy and has defined it in the following way.

First, consider the amount of information that could be encoded in the given alphabet (or vocabulary) if every atomic symbol were used independently and equiprobably. If there are D atomic symbols, then the informational capacity of the alphabet will be L(D) = log_2 D bits per symbol. Moreover, this value will be the maximum possible with that alphabet. Now, if we determine that the source is producing an amount H(M) that is actually less than its theoretical maximum per symbol, H(M)/L(D) will be some fraction less than unity that will represent the relative amount of information from the source. One minus the relative information is the redundancy:

redundancy = 1 − H(M)/log D.   (17)

The relative amount of information per symbol is a measure of how efficiently the coding alphabet is being used. For example, if the relative information per symbol is only half what it might be, then on the average the messages are twice as long as necessary. Shannon (1948), on the basis of his observation that a highly skilled subject could reconstruct passages from which 50% of the letters had been removed, estimated the efficiency of normal English prose as something less than 50%. But, when Chapanis (1954) tried to repeat this observation with other subjects and other passages, he found that if letters are randomly deleted and the text is shortened so that no indication is given of the location of the deletion few people are able to restore more than 25% of the missing letters in a short period of time. However, these are difficult conditions to impose on subjects. In order to estimate the coding efficiency of English writing, we should first make every effort to optimize the conditions for the person who is trying to reconstruct the text. For example, we might tell him in advance that all spaces between words and all vowels have been deleted. This form of abbreviation shortens the text by almost 50%, yet Miller and Friedman (1957) found that the most highly skilled subjects were able to restore the missing characters if they were given sufficient time and incentive to work at the task. We can conclude, therefore, that English is at least 50% redundant and perhaps more.

Why do we bother with such crude bounds? Why not compute redundancy directly from the message statistics for printed English? As we noted at the end of Sec. 1.2, the direct approach is quite impractical, for there are too many parameters to be estimated. However, we can put certain rough bounds on the value of H by limiting operations that use the available message statistics directly for short sequences of letters in English (Shannon, 1948). Let p(x_i) denote the probability of a string x_i of k symbols from the source and define

G_k = −(1/k) Σ_i p(x_i) log_2 p(x_i),

where the sum is taken over all strings x_i containing exactly k symbols. Then G_k will be a monotonic decreasing function of k and will approach H in the limit.

An even better estimate can be obtained with conditional probabilities. Consider a matrix P whose rows represent the D^k possible strings x_i of k symbols and whose columns represent the D different symbols a_j. The elements of the matrix are p(a_j | x_i), the conditional probabilities that a_j will occur as the (k + 1)st symbol given that the string x_i of k symbols just preceded it. For each row of this matrix the quantity −Σ_j p(a_j | x_i) log_2 p(a_j | x_i) measures our uncertainty regarding what will follow the particular string x_i.
The expected value of this uncertainty defines a new function,

F_{k+1} = −Σ_i Σ_j p(x_i) p(a_j | x_i) log_2 p(a_j | x_i),   (19)

where p(x_i) is the probability of string x_i. Since p(x_i)p(a_j | x_i) = p(x_i a_j), we can show that

F_{k+1} = (k + 1)G_{k+1} − kG_k = (k + 1)(G_{k+1} − G_k) + G_k.

Therefore, as G_k approaches H, F_k must also approach H. Moreover,

G_k = (1/k) Σ_{m=1}^{k} F_m,

so we know that G_k ≥ F_k. Thus F_k converges on H more rapidly than G_k as k increases.

Even F_k (and similar functions using the message statistics) converges quite slowly for natural languages, so Shannon (1951) proposed an estimation procedure using data obtained with a guessing procedure. We consider here only his procedure for determining an upper bound for H (and thus a lower bound for the redundancy).

Imagine that we have two identical k-limited automata that incorporate the true probabilities of English strings. Given a finite string of k symbols, these devices assign the correct probabilities for the (k + 1)st symbol. The first device is located at the source. As each symbol of the message is produced, the device guesses what the next symbol will be. It guesses first the most probable symbol, second the next most probable, and so on, continuing in this way until it guesses correctly. Instead of transmitting the symbol produced by the source, we transmit the number of guesses that the device required. The second device is located at the receiver. When the number j is received, this second device interprets it to mean that the jth guess (given the preceding context) is correct. The two devices are identical and the order of their guesses in any context will be identical; the second machine decodes the received signal and recovers the original message. In that way the original message can be perfectly recovered, so the sequence of numbers must contain the same information — therefore no less an amount of information — as the original text. If we can determine the amount of information per symbol for the reduced text, we shall also have determined an upper bound for the original text.

What will the reduced text look like? We do not possess two such k-limited automata, but we can try to use native speakers of the language as a substitute. Native speakers do not know all the probabilities we need, but they do know the syntactic and semantic rules which lead to those probabilities. We can let a person know all of the text up to a given point, then on the basis of that and his knowledge of the language ask him to guess the next letter. Shannon (1951) gives the following as typical of the results obtained:

T  H  E  R  E  #  I  S  #  N  O  #  R  E  V  E  R  S  E  #  O  N  #  A  #  ...
1  1  1  5  1  1  2  1  1  2  1  1  15 1  17 1  1  1  2  1  3  2  1  2  2  ...

The top line is the original message; below it is the number of guesses required for each successive letter. Note that most letters are guessed correctly on the first trial — approximately 80% when a large amount of antecedent context is provided. Note also that in the reduced text the sequential constraints are far less important; how many guesses the nth letter took tells little about how many will be needed for the (n + 1)st. It is as if the sequential redundancy of the original text were transformed into a nonsequential favoritism for small numbers in the reduced text. Thus we are led to consider the quantity

E_{k+1} = −Σ_{j=1}^{D} q_k(j) log_2 q_k(j),   (20)

where q_k(j) is the probability of guessing the (k + 1)st letter of a string correctly on exactly the jth guess.
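A minimal sketch of the guessing procedure and of Eq. 20, in Python. The "identical automata" at source and receiver are stood in for by a shared letter-frequency ranking, which ignores context; that simplification, like the tiny training string, is an assumption made only for illustration, and a real estimate would use a predictor conditioned on the k preceding symbols.

```python
from collections import Counter
from math import log2

training = "there is no reverse on a motorcycle "      # stand-in corpus
ranking = [ch for ch, _ in Counter(training).most_common()]

def encode(message):
    """Transmit, for each symbol, the rank at which the shared guesser finds it."""
    return [ranking.index(ch) + 1 for ch in message]

def decode(numbers):
    """The receiver runs the same guesser, so rank j recovers the jth guess."""
    return "".join(ranking[j - 1] for j in numbers)

guesses = encode("no reverse")
assert decode(guesses) == "no reverse"                  # perfectly recoverable

# Eq. 20: uncertainty of the reduced text, from the distribution of guess numbers.
q = Counter(guesses)
n = len(guesses)
E = -sum((c / n) * log2(c / n) for c in q.values())
print(guesses, round(E, 3))
```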
If k is large, and if our human subject is a satisfactory substitute for the k-limited automaton we lack, then E_k should be fairly close to H.

Can we make this idea more precise? Suppose we reconsider the D^k × D matrix P whose elements p(a_j | x_i) are the conditional probabilities of symbol a_j given the string x_i. What the k-limited automaton will do when it guesses is to map a_j into the digit θ(a_j) for each row, where the character with the largest probability in the row would be coded as 1, the next largest as 2, and so on. Consider, therefore, a new D^k × D matrix Q whose rows are the same but whose columns represent the first D digits in order. Then in every row of this new matrix the conditional probabilities q[θ(a_j) | x_i] would be arranged in a monotonically decreasing order of magnitude from left to right. Note that we have lost nothing in shifting from P to Q; θ has an inverse, so F_k can be computed from Q just as well as from P.

Now suppose we ignore the context x_i; that is to say, suppose we simply average all the rows of Q together, weighting them according to their probability of occurrence. This procedure will yield q_k(j), the average probability of being correct on the jth guess. From Theorem 3 we know that F_k ≤ E_k. Therefore, E_k must also be an upper bound on the amount of information per symbol. Moreover, this bound holds even when we use a human substitute for our hypothetical automaton, since people can err only in the direction of greater uncertainty (greater E_k) than would an ideal device.

We can formulate this fact rigorously: suppose the true probabilities of the predicted symbols are p_i but that our subject is guessing on the basis of some (not necessarily accurate; cf. Toda, 1956) estimates p̂_i, derived somehow from his knowledge of the language and his previous experience with the source. Let Σ p_i = Σ p̂_i = 1, and consider the mean value of the
Burton and Licklider (1955) confirmed this result and added that Ek has effectively reached its asymp- tote by k = 32; that is to say, measurable effects of context on a person's guesses do not seem to extend more than 32 characters (about six words) back into the history of the message. The lower bound on redundancy depends on the particular passage used. In some situations — air-traffic-control messages to a pilot landing at a familiar airport — redundancy may rise as high as 96 % (Frick & Sumby, 1952; Fritz &Grier, 1955). 1.5 Some Connections with Grammaticalness In Sec. 3 of Chapter 11 we mentioned the difficult problem of assigning degrees of grammaticalness to strings in a way that would reflect the 444 FINITARY MODELS OF LANGUAGE USERS manner and extent of their deviation from well-formedness in a given language. Some of the concepts introduced in the present chapter suggest a possible approach to this problem.3 Suppose we have a grammar G that generates a fairly narrow (though, of course, infinite) set L(G) of well-formed sentences. How could we assign to each string not generated by the grammar a measure of its deviation in at least one of the many dimensions in which deviation can occur? We might proceed in the following way: select some unit — for concreteness, let us choose word units and, for convenience, let us not bother to distinguish in general between different inflectional forms (e.g., between find, found, finds). Next, set up a hierarchy # of classes of these units, where ^ = ^15 . . . , ^v, and for each i < N %i = (CV, . . . , C^\ where: a± > a2 > . . . > ax = 1, Cf is nonnull, for each word w, there is a j such that w E C/, and C/ s Cy if and only if j = k. (22) #! is the most highly differentiated class of categories; %N contains but a single category. Other conditions might be imposed (e.g., that c^i be a refinement of ^i+j), but Condition 22 suffices for the present discussion. ^ is called the categorization of order i; its members are called cate- gories of order L A sequence Cbi\ . . . , Cbj of categories of order i is called a sentence-form of order /; it is said to generate the string of words Wj. . . . WQ if, for each j < q, \^j e Cb \ Thus the set of all word strings generated by a sentence-form is the complex (set) product of the sequence of categories. We have described ^ and G independently; let us now relate them. We say that a set S of sentence-forms of order i covers G if each string of L(G) is generated by some member of S. We say that a sentence-form is grammatical with respect to G if one of the strings that the sentence-form generates is in L(G)— fully grammatical, with respect to (7, if each of the strings that it generates is in L(G). We say that # is compatible with G if for each sentence H> of L(G) there is a sentence-form of order one that generates w and that is fully grammatical with respect to G. Thus, if ^ is compatible with G, there is a set of fully grammatical sentence-forms of order one that covers G. We might also require, for compatibility, that #! be the smallest set of word classes to meet this condition. Note in this case that the categories of ^ need not be pairwise disjoint. For example, 3 The idea of using information measures to determine an optimal set of syntactic categories, as outlined here, was suggested by Peter Elias. This approach is developed in more detail, with some supporting empirical evidence, in Chomsky (1955, Chapter 4). 
For example, know will be in C_i^1 and no in C_j^1, where i ≠ j, although they are phonetically the same. If two words are mutually substitutable throughout L(G), they will be in the same category of 𝒞_1, if it is compatible with G, but the converse is not necessarily true.

We say that a string w is i-grammatical (has degree of grammaticalness i) with respect to G, 𝒞 if i is the least number such that w is generated by a grammatical sentence-form of order i. Thus the strings of the highest degree of grammaticalness are those of order 1, the order with the largest number of categories. All strings are grammatical of order N or less, since 𝒞_N contains only one category.

These ideas can be clarified by an example. Suppose that G is a grammar of English and that 𝒞 is a system of categories compatible with it and having a structure something like this:

𝒞_1:  N_hum = {boy, man, ...}
      N_ab = {virtue, sincerity, ...}
      N_comp = {idea, belief, ...}
      N_mass = {bread, beef, ...}
      N_count = {book, chair, ...}
      V_1 = {admire, dislike, ...}
      V_2 = {annoy, frighten, ...}
      V_3 = {hit, find, ...}
      V_4 = {sleep, reminisce, ...}
      etc.
𝒞_2:  Noun = N_hum ∪ N_ab ∪ ...
      Verb = V_1 ∪ V_2 ∪ ...
      etc.
𝒞_3:  Word.   (23)

This extremely primitive hierarchy 𝒞 of categories would enable us to express some of the grammatical diversity of possible strings of words. Let us assume that G would generate the boy cut the beef, the boy reminisced, sincerity frightens me, the boy admires sincerity, the idea that sincerity might frighten you astonishes me, the boy found a piece of bread, the boy found the chair, the boy who annoyed me slept here, etc. It would not, however, generate such strings as the beef cut sincerity, sincerity reminisced, the boy frightens sincerity, sincerity admires the boy, the sincerity that the idea might frighten you astonishes me, the boy found a piece of book, the boy annoyed the chair, the chair who annoyed me found here, etc. Strings of the first type would be one-grammatical (as are all strings generated by G); strings of the second type would be two-grammatical; all strings would be three-grammatical, with respect to this primitive categorization.

Many of the two-grammatical strings might find a natural use in actual communication, of course. Some of them, in fact (e.g., misery loves company, etc.), might be more common than many one-grammatical strings (an infinite number of which have zero probability and consist of parts which have zero probability, effectively). A speaker of English can impose an interpretation on many of these strings by considering their analogies and resemblances to those generated by the grammar he has mastered, much as he can impose an interpretation on an abstract drawing. One-grammatical strings, in general, like representational drawings, need have no interpretation imposed on them to be understood. With a hierarchy such as 𝒞 we could account for the fact that speakers of English know, for example, that colorless green ideas sleep furiously is surely to be distinguished, with respect to well-formedness, from revolutionary new ideas appear infrequently on the one hand and from furiously sleep ideas green colorless or harmless seem dogs young friendly (which has the same pattern of grammatical affixes) on the other; and so on, in an indefinite number of similar cases.
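A toy illustration of the degree-of-grammaticalness idea: the Python sketch below uses a drastically reduced hierarchy modeled on (23) and a handful of invented order-1 sentence-forms standing in for the grammar G. The word lists, the sentence-forms, and the three-level scale are assumptions made only to show the mechanics of finding the least order at which some grammatical sentence-form generates a string.

```python
# Orders of the toy hierarchy: order 1 is most differentiated, order 3 is "Word".
HIERARCHY = {
    1: {"Nhum": {"boy", "man"}, "Nab": {"sincerity", "virtue"},
        "V2": {"frightens", "annoys"}, "V4": {"reminisced", "slept"}},
    2: {"Noun": {"boy", "man", "sincerity", "virtue"},
        "Verb": {"frightens", "annoys", "reminisced", "slept"}},
    3: {"Word": {"boy", "man", "sincerity", "virtue",
                 "frightens", "annoys", "reminisced", "slept"}},
}

# Invented stand-ins for the fully grammatical sentence-forms of order 1.
ORDER1_FORMS = [("Nhum", "V4"), ("Nab", "V2", "Nhum")]

def forms_of_order(i):
    """Project each order-1 sentence-form into categories of order i."""
    projected = []
    for form in ORDER1_FORMS:
        new_form = []
        for cat in form:
            words = HIERARCHY[1][cat]
            # the order-i category containing those words
            new_form.append(next(c for c, ws in HIERARCHY[i].items() if words <= ws))
        projected.append(tuple(new_form))
    return projected

def degree(sentence):
    words = sentence.split()
    for i in (1, 2, 3):
        for form in forms_of_order(i):
            if len(form) == len(words) and all(
                    w in HIERARCHY[i][c] for w, c in zip(words, form)):
                return i
    return None

print(degree("boy reminisced"))          # 1
print(degree("sincerity reminisced"))    # 2: only an order-2 form generates it
print(degree("reminisced boy"))          # 3: only the Word Word form generates it
```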
Such considerations show that a generative grammar could more completely fulfil its function as an explanatory theory if we had some way to project, from the grammar, a certain compatible hierarchy 𝒞 in terms of which degree of grammaticalness could be defined. Let us consider now how this might be done.

In order to simplify exposition, we first restrict the problem in two ways. We shall consider only sentences of some fixed length, say length l. Second, let us consider the problem of determining the system of categories 𝒞_i = {C_1^i, ..., C_{a_i}^i}, where a_i is fixed. The best choice of a_i categories is the one that in the appropriate sense maximizes substitutability relations among the categorized elements. The question, then, is how we can select the fixed number of categories which best mirror substitutability relations. Note that we are interested in substitutability not with respect to L(G) but to contexts stated in terms of the categories of 𝒞 itself. To take an example, boy and sincerity are much more freely substitutable in contexts defined by the categories of 𝒞_2 of (23) than in actual contexts of L(G); thus we may find both words in the context Noun Verb Determiner ——, but not in the context you frightened the ——. Some words may not be substitutable at all in L(G), although they are mutually substitutable in terms of higher order categories. This fact suggests that systematic procedures of substitution applied to successive words in some sequence of grammatical sentences will probably always fail — as, indeed, they always
We must finally consider the assumption that we are given the integers al9 . . . , aN which determine the number of categories in Condition 22. Suppose, in fact, that we determine for each n the optimal categorization Kn into n categories, in the way previously sketched. To select from the set {Kn} the hierarchy &9 we must determine for which integers ai we will actually set up the optimal categorization Ka. as an order ^ of V. We would like to select at in such a way that Ka. will be clearly preferred to Ka_^ but will not be much worse than Ka+l; that is to say, we would like to select Ka. in such a way that there will be a considerable loss in forming a system of categories with fewer than ai categories but not much of a gain in adding a further category. We might, for example, take ^ = K^ as an order of # just in case the function f(ri) = wVal (Kn) has a relative minimum at n = at. (We might also be interested in the absolute minimum of/, defined in this or some more appropriate way — we might take this as defining an absolute order of grammaticalness and an overriding bifurcation of strings into grammatical and ungrammatical, with the grammatical including as a proper subclass those generated by the grammar.) In the way just sketched we might prescribe a general procedure T such that, given a grammar G, Y(G) is a hierarchy # of categories compatible with (?, by which degree of grammaticalness is defined for each string in the terminal vocabulary of G. It would then be correct to say that a grammar not only generates sentences with structural descriptions but also assigns to each string, whether generated or not, a degree of grammatical- ness that measures its deviation from the set of perfectly well-formed sentences as well as a partial structural description that indicates how this string deviates from well-formedness. It is hardly necessary to emphasize that this proposal is, in its details, highly tentative. Undoubtedly there are many other ways to approach this complex question. 450 FINITARY MODELS OF LANGUAGE USERS 1.6 Minimum-Redundancy Codes Before a message can be transmitted, it must be coded in a form appro- priate to the medium through which it will pass. This coding can be accomplished in many ways; the procedure becomes of some theoretical interest, however, when we ask about its efficiency. For a given alphabet, what codes will, on the average, give the shortest encoded messages? Such codes are called minimum-redundancy codes. Natural languages are generally quite redundant; how to encode them to eliminate that re- dundancy poses a challenging problem. The question of coding efficiency becomes especially interesting when we recognize that every channel is noisy, so that an efficient code must not only be short but at the same time must enable us to keep erroneous transmissions below some specified probability. The solutions that have been found for this problem constitute the real core of information theory as it is applied to many practical problems in communication engineering. Inasmuch as psychologists and linguists have not yet exploited these fundamental results for noisy channels, we shall limit our attention here to the simpler problem of finding minimum-redundancy codes for noiseless channels. The problem of optimal coding can be posed as follows : we know from Sec. 1.3 that an alphabet is used most efficiently when each character occurs independently and equiprobably, that is, when all strings of equal length are equiprobable. 
So we must find a function θ that maps our natural messages into coded forms in which all sequences of the same length are equiprobable. For the sake of simplicity, let us assume that the messages can be divided into independent units that can be separately encoded. In order to be definite, let us imagine that we are dealing with printed English and that we are willing to assume that successive words are independent. Each time a space occurs in the text, the text accumulated since the preceding space is encoded as a unit. For each word, therefore, we shall want to assign a sequence of code symbols in such a way that, on the average, all the code symbols will be used independently and equally often, and in such a way that we shall be able to segment the coded messages and recover the original word units when the time comes to decode them.

First, we observe that in any minimum-redundancy code the length of a given coded word can never be less than the length of a more probable coded word. If the more probable word were longer, a saving in the average length could be achieved by simply reversing the codes assigned to the two words. We begin, therefore, by ranking the words in order of decreasing probability of occurrence. Let p_r represent the probability of the word ranked r, and let c_r represent the length of its encoded representation; that is to say, we rank the words

p_1 ≥ p_2 ≥ ... ≥ p_{N−1} ≥ p_N,

where N is the number of different words in the vocabulary. For a minimum-redundancy code we must then have

c_1 ≤ c_2 ≤ ... ≤ c_{N−1} ≤ c_N.

Note, moreover, that the mean length C of an encoded word will be

C = Σ_{r=1}^{N} p_r c_r.   (25)

Obviously, the mean length would be a minimum if we could use only one-letter words, but this would entail too large a number D of different code characters. Ordinarily, our choice of D is limited by the nature of the channel. Of course, it is not length per se that we want to minimize but length per unit of information transmitted. The problem is to minimize C/H, the length per bit (or to maximize H/C, the amount of information per unit length), subject to the subsidiary conditions that Σ p_r = 1 and that the coded message be uniquely decodable. By virtue of Assumption 2 in Sec. 1.3 it would seem that H/C, the information per letter in the encoded words, cannot be greater than log D, the capacity of the coding alphabet. From that fact we might try to move directly to a lower bound,

C ≥ H / log D.   (26)

Although this inequality is correct, it cannot be derived as a simple consequence of Assumption 2. Consider the following counterexample (Feinstein, 1958): we have a vocabulary of three words with probabilities p_1 = p_2 = 2p_3 = 0.4, and we code them into the binary alphabet {0, 1} so that θ(1) = 0, θ(2) = 1, and θ(3) = 01. Now we can easily compute that C = 1.2, H = 1.52, and log_2 D = 1, so that the average length is less than the bound stated in Eq. 26. The trouble, of course, is that θ does not yield a true code, in the sense defined in Chapter 11; the coded messages are not uniquely decodable. If, however, we add to Assumption 2 the further condition of unique decodability, the lower bound stated in Eq. 26 can be established. The further condition is most easily phrased in terms of a left tree code, in which no coded word is an initial segment of any other coded word. By using Eq. 21 we can write

H ≤ Σ_{r=1}^{N} p_r c_r log D + log Σ_{r=1}^{N} D^{−c_r}.

For a left tree code, Σ_{r=1}^{N} D^{−c_r} ≤ 1; therefore, log Σ_{r=1}^{N} D^{−c_r} ≤ log 1 = 0, so we can write

H ≤ Σ_{r=1}^{N} p_r c_r log D = C log D,

from which the desired inequality of Eq. 26 follows by rearranging terms.
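The arithmetic of Feinstein's counterexample, and the way the bound is restored once unique decodability is imposed, can be verified directly. The prefix code at the end is our own choice for the same three-word vocabulary; it is not taken from the text.

```python
# Check the counterexample to Eq. 26 and the bound itself.
from math import log2

p = [0.4, 0.4, 0.2]                         # p1 = p2 = 2*p3 = 0.4
H = -sum(q * log2(q) for q in p)
print(round(H, 2))                          # 1.52 bits per word

theta = ["0", "1", "01"]                    # Feinstein's mapping
C = sum(q * len(w) for q, w in zip(p, theta))
print(C, C >= H / log2(2))                  # 1.2 False: below the bound, but the
                                            # mapping is not uniquely decodable
                                            # ("01" = word 3, or word 1 then word 2)

prefix = ["0", "10", "11"]                  # a left tree code: no code word is an
C = sum(q * len(w) for q, w in zip(p, prefix))   # initial segment of another
print(C, C >= H / log2(2))                  # 1.6 True: Eq. 26 is respected
```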
If Eq. 26 sets a lower bound on the mean length C, how closely can we approach it? The following theorem, due to Shannon (1948), provides the answer:

Theorem 4. Given a vocabulary V of N words with information H and a coding alphabet A of D code symbols, it is possible to code the words by finite strings of code symbols from A in such a way that C, the average number of code symbols per word, satisfies the inequality

H / log D ≤ C < (H / log D) + 1.   (27)

Suppose now that we have a large sample consisting of s items, each of which bears some label, and let n(f, s) be the number of different labels that are borne by exactly f of the s items. In a remarkable variety of situations it is found that, at least for the larger values of f,

n(f, s) = G(s) f^{−(ρ+1)},   (28)

where ρ > 0 and G(s) is a constant depending on the size of the sample. If Eq. 28 is expressed as a probability density, then it is readily seen that the variance of f is finite if and only if ρ > 2 and that the mean of f is finite if and only if ρ > 1. In the cases of interest in this section it is often true that ρ < 1, so we are faced with a law that often seems anomalous (or even pathological) to those prejudiced in favor of the finite means and variances of normal distributions. In the derivation of the normal distribution function, however, it is necessary to assume that we are dealing with the sum of a large number of variables, each of which makes a small contribution relative to the total. When equal contributions are not assumed, it is still possible to have stable limit distributions, but either the second moment (and all higher moments) will be infinite, or all moments will be infinite (cf. Gnedenko & Kolmogorov, 1954, Chapter 7). Such is the distribution underlying Eq. 28.

Nonnormal limit distributions might be dismissed as mathematical curiosities of little relevance were it not for the fact that they have been observed in a wide variety of situations. As Mandelbrot (1958) has pointed out, these situations seem especially common in the social sciences. For example, if the items are quantities of money and the labels are the names of people who earn each item, then n(f, s) will be the number of people earning exactly f units of money out of a total income equal to s. In this form the law was first stated (with ρ > 1) by Pareto (1897). Alternatively, if the items are taxonomic species and the labels are the names of genera to which they belong, then n(f, s) will be the number of genera each with exactly f species. In this form the law was first stated by Willis (1922), then rationalized by Yule (1924), with ρ < 1 (and usually close to 0.5). In the present instance, if the items are the consecutive words in a continuous discourse by a single author and the labels are sequences of letters used to encode words, then n(f, s) will be the number of letter sequences (word types) that occur exactly f times in a text of s consecutive words (word tokens). In this form the law was first stated by Estoup (1916), rediscovered by Condon (1928), and intensively studied by Zipf (1935). Zipf believed that ρ = 1, but further analysis has indicated that usually ρ < 1. Considerable data indicating the ubiquity of Eq. 28 were provided by Zipf (1949), and empirical distributions of this general type have come to be widely associated with his name.

When working with word frequencies, it is common practice to rank them in order (as we did for the coding problem in the preceding section) from the most frequent to the least frequent.
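The ranking itself is a mechanical operation, and the straight-line behavior derived in the next paragraphs (Eq. 29) can then be checked by fitting log frequency against log rank. The sketch below uses an invented toy text; a real sample would of course be substituted.

```python
# Count and rank the word frequencies of a text, then fit the log-log slope
# (an estimate of -B in the rank-frequency relation discussed next).
from collections import Counter
from math import log

text = ("the boy found the chair and the man found the book and "
        "the boy who annoyed the man slept here").split()
freqs = sorted(Counter(text).values(), reverse=True)      # f_1 >= f_2 >= ...

xs = [log(r + 1) for r in range(len(freqs))]               # log rank
ys = [log(f) for f in freqs]                               # log frequency
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print("fitted slope (an estimate of -B):", round(slope, 2))
```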
The rank r is then defined as the number of items that occur f times or more:

r = Σ_{g ≥ f} n(g, s).

If we combine this definition with Eq. 28 and approximate the sum by an integral, then, for large f,

r ≈ ∫_f^∞ G(s) g^{−(ρ+1)} dg = [G(s)/ρ] f^{−ρ},

which states a reciprocal relation between the ranks r and the frequencies f. We can rewrite this relation as

p_r = K r^{−B},   (29)

where B = 1/ρ. Therefore

log p_r = log K − B log r,

which means that on log-log coordinates the rank-frequency relation should give a straight line with a slope of −B. It was with such a graph that the law was discovered, and it is still a popular way to display the results of a word count. An illustration is given in Fig. 3.

Fig. 3. The rank-frequency relation plotted on log-log coordinates; the abscissa is word order (rank).

The persistent recurrence of stable laws of this nonnormal type has stimulated several attempts at explanation, and there has been considerable discussion of their relative merits. We shall not review that discussion here; the present treatment follows Mandelbrot (1953, 1957) but does little more than introduce the topic in terms of its simplest cases.

Imagine that the coded message is, in fact, a table of random (decimal) digits. Let the digits 0 and 1 play the role of word-boundary markers; each time 0 or 1 occurs it marks the beginning of a new word. (In this code there are words of zero length; a minor modification can eliminate them if they are considered anomalous.) The probability of getting a particular word of exactly length i is (probability of a particular symbol)^i × (probability of a boundary marker) = (0.1)^i (0.2), and the number of different words of length i is 8^i. The critical point to note in this example is that when we order these coded words with respect to increasing length we have simultaneously ordered them with respect to decreasing probability. Thus it is possible to construct Table 3.

Table 3. The Rank-Frequency Relation for a Random Code

Length i    Probability    Number D^i    Ranks      Average Rank
   0          0.2               1        1               1
   1          0.02              8        2-9             5.5
   2          0.002            64        10-73          41.5
   3          0.0002          512        74-585        329.5

The one word of zero length has a probability of 0.2 and, since it is the most probable word, it receives rank 1. The eight words one digit long all have a probability of 0.02 and share ranks 2 through 9; we assign them all the average rank 5.5; and so the table continues. When we plot these values on log-log coordinates, we obtain the function shown in Fig. 4. Visual inspection indicates that the slope is slightly steeper than −1, which is also characteristic of many natural-language texts.

Fig. 4. The rank-frequency relation for strings of random digits occurring between successive occurrences of 0 or 1. The solid line represents the expected function and the dashed line represents the average ranks.

It is not difficult to obtain the general equation relating probability to average rank for this simple random case (Miller, 1957). Let p(#) be the probability of a word-boundary marker, and let 1 − p(#) = p(L) be the probability of a letter. If the alphabet (excluding #) has D letters, then p(L)/D is the probability of any particular letter, and p(w_i) = p(#)[p(L)/D]^i is the probability of any particular word of length i (= 0, 1, ...). This quantity will prove to be more useful when written

p(w_i) = p(#)[D/p(L)]^{−i}.   (30)
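Before completing the derivation, it may be worth noting that the entries of Table 3, and the empirical points summarized in Fig. 4, can be regenerated directly: the first loop reproduces the table from the formulas just given, and the short simulation segments a random digit stream at every 0 or 1 and tallies word lengths. The sample size and the random seed are arbitrary choices.

```python
# Regenerate Table 3 and check it against a simulation in which the digits
# 0 and 1 of a random decimal stream serve as word boundaries.
import random
from collections import Counter

D, p_letter, p_bound = 8, 0.1, 0.2        # eight "letters", each of probability 0.1
rank = 1
for i in range(4):
    number = D ** i                       # distinct words of length i
    prob = (p_letter ** i) * p_bound      # probability of each one
    first, last = rank, rank + number - 1
    print(i, round(prob, 4), number, f"{first}-{last}", (first + last) / 2)
    rank = last + 1

random.seed(1)
digits = "".join(random.choice("0123456789") for _ in range(100000))
counts, word = Counter(), ""
for d in digits:
    if d in "01":                         # boundary: the accumulated word ends
        counts[word] += 1
        word = ""
    else:
        word += d
tokens = sum(counts.values())
for i in range(4):
    observed = sum(f for w, f in counts.items() if len(w) == i) / tokens
    print(i, round(observed, 3), round(0.2 * 0.8 ** i, 3))   # all words of length i
```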
Since there are D^j different words of exactly length j, there must be Σ_{j=0}^{i} D^j of them equal to or shorter than i, so that when we rank them in order of increasing length the D^i words of length i will receive the ranks from (Σ_{j=0}^{i−1} D^j) + 1 to Σ_{j=0}^{i} D^j. The average rank is therefore

r(w_i) = [(D + 1) / 2(D − 1)] D^i + (D − 3) / 2(D − 1),

which will prove more useful if we write

D^i = [2(D − 1) / (D + 1)] [r(w_i) − c],  where  c = (D − 3) / 2(D − 1),   (31)

for now Eqs. 30 and 31 combine to give

p(w_i) = K′ [r(w_i) − c]^{−B},   (32)

which can be recognized as a variant form of Eq. 29, where

B = log [D/p(L)] / log D   and   K′ = p(#) [2(D − 1)/(D + 1)]^{−B}.

Thus a table of random numbers can be seen to follow the general type of law that has been found for word frequencies. If we take D = 26 and p(#) = 0.18 to represent written English, then

B = log (26/0.82) / log 26 = 1.06,   c = 23/50 = 0.46,   K′ = 0.18 (50/27)^{−1.06} = 0.09,

so we have

p(w_i) = 0.09 [r(w_i) − 0.46]^{−1.06}.

Since c = 0.46 will quickly become negligible as r(w_i) increases, we can write

p_r ≈ 0.09 r^{−1.06},

which is, in fact, close to the function that has been observed to hold for many normal English texts (Zipf, for example, liked to put K′ = 0.1 and B = 1). The hypothesis that word boundaries occur more or less at random in English text, therefore, has some reasonable consequences. It helps us to understand why the probability of a word decreases so rapidly as a function of its length, which is certainly true, on the average, for English. The critical step in the derivation of Eq. 32, however, occurs when we note that for the random message the rank with respect to increasing length and the rank with respect to decreasing probability are the same. In English, of course, this precise equivalence of rankings does not hold (otherwise we would never let our most frequent word, the, require three letters), but it holds approximately. Miller and Newman (1958) have verified the prediction that the average frequency of words of length i is a reciprocal function of their average rank with respect to increasing length, where the slope constant for the length-frequency relation on log-log coordinates is close to, but perhaps somewhat smaller than, B.

In Sec. 1.6 we noted that for a minimum-redundancy code the length of any given word can never be less than the length of a more probable word. Suppose, therefore, that we consider the rank-frequency relation for optimal codes, that is, for codes in which the lower bound on the average length C is actually realized, so that C = H/log D. This optimal condition will hold when the length i of any given word is directly proportional to the amount of information associated with it:

−log p(w_i) ∼ (i/ρ) log D,

where ρ depends on the choice of scale units. This equation can be rewritten as

p(w_i) ∝ [D^{1/ρ}]^{−i},

which is Eq. 30 again, with B = 1/ρ. From here on the argument can proceed exactly as before. We see, therefore, that the rank-frequency relation holds quite generally for minimum-redundancy codes because such codes (like tables of random numbers) use all equally long sequences of symbols equally probably. The fact that both minimum-redundancy codes and natural languages (which are certainly far from minimum-redundancy) share the rank-frequency relation in Eq. 29 is interesting, of course, but it provides no basis whatsoever for any speculation that there is something optimal about the coding used in natural languages. The choice of the digits 0 and 1 as boundary markers to form words in a table of random numbers was completely arbitrary; any other digits would have served equally well.
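The constants of Eq. 32 for these illustrative English parameters are easily evaluated; the following sketch simply reproduces the arithmetic of the preceding paragraph.

```python
# Evaluate the constants of Eq. 32 for the parameters used in the text for
# written English: D = 26 letters and p(#) = 0.18 for the space.
from math import log

D, p_bound = 26, 0.18
p_L = 1 - p_bound

B = log(D / p_L) / log(D)
c = (D - 3) / (2 * (D - 1))
K = p_bound * (2 * (D - 1) / (D + 1)) ** (-B)
print(round(B, 2), round(c, 2), round(K, 2))        # 1.06 0.46 0.09

def p_word(rank):
    """Word probability predicted by Eq. 32 at a given average rank."""
    return K * (rank - c) ** (-B)

for r in (1, 10, 100, 1000):
    print(r, round(p_word(r), 5))
```

The same arithmetic would go through for any choice of boundary marker, which is the point taken up next.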
If we generalize this observation to English texts, it implies that we might choose some character other than the space as a boundary marker. Miller and Newman (1958) have studied the rank-frequency relation for a (relatively small) sample of pseudo-words formed by using the letter E as the word boundary (and treating the space as just another letter). The null word EE was most frequent, followed closely by ERE, E#E, and so on. As predicted, a function of the general type of Eq. 29 was also obtained for these pseudo-words (but with a slope constant B slightly less than unity, perhaps attributable to inadequate sampling). There is an enormous psychological difference between the familiar words formed by segmenting on spaces and the apparently haphazard strings that result when we segment on E. Segmenting on spaces respects the highly overlearned strings — Miller (1956) has referred to them as chunks of information in order to distinguish sharply from the bits of information defined in Sec. 1.3 — that normally function as unitary, psychological elements of language. It seems almost certain, therefore, STOCHASTIC MODELS 463 that an evolutionary process of selection must have been working in favor of short words — some psychological process that would not operate on the strings of characters between successive Es. Thus we find many more very long, very improbable pseudo-words. In one form or another the hypothesis that we favor short words has been advanced by several students of language statistics. Zipf (1935) has referred to it as the law of abbreviation: whenever a long word or phrase suddenly becomes common, we tend to shorten it. Mandelbrot (1961) has proposed that historical changes in word lengths might be described as a kind of random walk. He reasons that the probability of lengthening a word and the probability of shortening it should be in equilibrium, so that a steady state distribution of word lengths could be maintained. If the probability of abbreviation were much greater than the probability of expansion, the vocabulary would eventually collapse into a single word of minimum length. If expansion were more likely than abbreviation, on the other hand, the language would evolve toward a distribution with B < 1, and, presumably, some upper bound would have to be imposed on word lengths in order for the series/?(>vt.) to converge, so that %p(w?) = 1 . It should be noted, however, that the existence of a relation in the form of Eq. 29 does not depend in any essential way on some prior psychological law of abbreviation. The central import of Mandelbrot's earlier argument is that Eq. 29 can result from purely random proccesses. Indeed, if there is some law of abbreviation at work, it should manifest itself as a deviation from Eq. 29 — presumably in a shortage of very long, very improbable words, a shortage that would not become apparent until extremely large samples of text had been tabulated. The occurrence of the rank-frequency relation of Eq. 29 does not con- stitute evidence of some powerful and universal psychological force that shapes all human communication in a single mold. In particular, its occurrence does not constitute evidence that the signal analyzed must have come from some intelligent or purposeful source. The rank-frequency relation, Eq. 29 has something of the status of a null hypothesis, and, like many null hypotheses, it is often more interesting to reject than to accept. 
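The pseudo-word tabulation mentioned above, with the letter e as boundary and the space treated as an ordinary letter, is equally easy to reproduce in outline. The one-sentence sample below is only a stand-in for the text samples Miller and Newman used.

```python
# Pseudo-words obtained by treating the letter e as the word boundary and the
# space as an ordinary letter (after the tabulation of Miller & Newman, 1958).
from collections import Counter

text = "three men were seen near the tree between the fences here".lower()
pseudo = text.split("e")              # the material between successive e's
counts = Counter(pseudo)
for w, f in counts.most_common(6):
    print(repr(w), f)                 # repr makes null and space-bearing strings visible
```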
These brief paragraphs should serve to introduce some of the theoretical problems in the statistical analysis of language. There is much more that might be said about the analysis of style, cryptography, estimations of vocabulary size, spelling systems, content analysis, etc., but to survey all that would lead us even further away from matters of central concern in Chapters 11, 12, and 13. If one were to hazard a general criticism of the models that have been constructed to account for word frequencies, it would be that they are still far too simple. Unlike the Markovian models that envision Dk parameters, 464 FINITARY MODELS OF LANGUAGE USERS explanations for the rank frequency relation use only two or three param- eters. The most they can hope to accomplish, therefore, is to provide a null hypothesis and to indicate in a qualitative way (perhaps) the kind of systems we are dealing with. They can tell us, for example, that any grammatical rule regulating word lengths must be regarded with con- siderable suspicion — in an English grammar, at least. The complexity of the underlying linguistic process cannot be suppressed very far, however, and examples of nonrandom aspects are in good supply. For example, if we partition a random population on the basis of some independent criterion, the same probability distribution should apply to the partitions as to the parent population. If, for example, we partitioned according to whether the words were an odd or an even number of running words away from the beginning of the text or according to whether their initial letters were in the first or the last half of the alphabet, etc., we would expect the same rank-frequency relation to apply to the partitions as to the original population. There are, however, several ways to partition the parent population that look as though they ought to be independent but turn out in fact not to be. Thus, for example, Yule (1944) established that the same distribution does not apply when different categories (nouns, verbs, and adjectives) are taken separately; Miller, Newman, and Fried- man (1958) showed a drastic difference between the distributions of content words (nouns, verbs, adjectives, adverbs) and of function words (every- thing else), and Miller (1951, p. 93) demonstrated that the distribution can be quite different if we consider only the words that occur immediately following a given word, such as the or of. There is nothing in our present parsimonious theories of the rank-frequency relation that could help us to explain these apparent deviations from randomness. In an effort to achieve a more appropriate level of complexity in our descriptions of the user, therefore, we turn next to models that take account of the underlying structure of natural languages — models that, for lack of a better name, we shall refer to here as algebraic. 2. ALGEBRAIC MODELS If the study of actual linguistic behavior is to proceed very far, it must clearly pay more than passing notice to the competence and knowledge of the performing organism. We have suggested that a generative grammar can give a useful and informative characterization of the competence of the speaker-hearer, one that captures many significant and deep-seated aspects of his knowledge of his own language. The question is, therefore, how does he put his knowledge to use in producing a desired sentence or ALGEBRAIC MODELS 45 in perceiving and interpreting the structure of presented utterances ? 
How can we construct a model for the language user that incorporates a generative grammar as a fundamental component ? This topic has received almost no study, so we can do little more than introduce a few speculations. As we observed in the introduction to this chapter, models of linguistic performance can generally be interpreted interchangeably as depicting the behavior of either a speaker or a hearer. For concreteness, in the present sections we shall concentrate on the listener's task and frame our discussion largely in perceptual terms. This decision is, however, a matter of convenience, not of principle. Unfortunately, the bulk of the experimental research on speech percep- tion has involved the recognition of individual words spoken in isolation as part of a list (cf. Fletcher, 1953) and so is of little value to us in under- standing the effects of grammatical structure on speech perception. That such effects exist is clear from the fact that the same words are easier to hear in sentences than in isolation (Miller, Heise, & Lichten, 1951; Miller, 1962a). How these effects are caused, however, is not at all clear. Let us take as our starting point the sentence-recognizing device introduced briefly in Chapter 11, Sec. 6.4. Instead of a relatively passive process of acoustic analysis followed by identification and symbolic representation, we imagined (following Halle & Stevens, 1959, 1962) an active device that recognizes its input by discovering what must be done in order to generate a signal (in some possibly derived form) to match it. At the heart of this active device, of course, is a component M that contains rules for generating a matching signal. Associated with M would be components to analyze and (temporarily) to store the input, components that reflect various semantic and situational constraints suggested by the context of the sentence, a heuristic component that could make a good first guess, a component to make the comparison of the input and the internally generated signals, and perhaps others. On the basis of an initial guess, the device generates an internal signal according to the rules stored in M and tests its guess against the input signal. If the match is un- satisfactory, the discrepancy is used to make a better guess. In this manner the device proceeds to modify its own internal signal until the match is judged satisfactory or the input is dismissed as unintelligible. The program for generating the matching signal can be taken as the symbolic representation of the input. If it is granted that such a sentence-recognizer can provide a plausible model for human speech perception, we can take it as our starting point and can proceed to try to specify it more precisely. In particular, the two parts of it that seem to perform the most important functions are the contextual component, which helps to generate a first guess, and the 4 FINITARY MODELS OF LANGUAGE USERS grammatical component M, which imposes the rules for generating the internal signal. We should begin by studying those two components. Even if it were feasible, a study of the ways contextual information can be stored and brought to bear would lead us far beyond the limits we have placed on this discussion. With respect to M, however, the task seems easier. The way the rules for synthesizing sentences might operate is, of course, very much in our present line of sight. We are concerned with a finite device M in which are stored the rules of a generative grammar G. 
This device takes as its input a string x of symbols and attempts to understand it; that is to say, M tries to assign to # a certain structural description F(x) — or a set (F^x), . . . , Fm(x)} of syntactic descriptions in the case of a sentence x that is structurally ambiguous in m different ways. We shall not try to consider all of those real but obscure aspects of understanding that go beyond the assignment of syntactic structural descriptions to sentences, nor shall we consider the situational or contextual features that may determine which of a set of alternative structural descriptions is actually selected in a particular case. There is no point of principle underlying this limitation to syntax rather than to semantics and to single sentences rather than their linguistic and extra-linguistic contexts — it is simply an unfortunate consequence of limitations in our current knowledge and understanding. At present there is little that can be said, with much precision, about those further questions. [See Ziff (1960) and Katz & Fodor (1962) for discussion of the problems involved in the development of an adequate semantic theory and some of the ways in which they can be investigated]. The device M must contain, in addition to the rules of G9 a certain amount of computing space, which may be utilized in various different ways, and it must be equipped to perform logical operations of various sorts. We require, in particular, that M assign a structural description FJx) to x only if the generative grammar G stored in the memory of M assigns Ft(x) to # as a possible structural description. We say that the device M (partially) understands the sentence x in the manner ofG if the set (jFifc), . . . , Fm(x)} of structural descriptions provided by M with input x is (included in) the set assigned to x by the generative grammar G. In particular, M does not accept as a sentence any string that is not generated by G. (This restriction can, of course, be softened by introducing degrees of grammaticalness, after the manner of Sec. 1.5, but we shall not burden the present discussion with that additional complication.) M is thus a finite transducer in the sense of Chapter 12, Sec. 1.5. It uses its information concerning the set of all strings in order to determine which of them are sentences of the language it understands and to understand sentences belonging to this language. This information, we assume, is represented in ALGEBRAIC MODELS the form of rules of the generative grammar G stored in the memory of M. Before continuing, we should like to say once more that it is perfectly possible that M will not contain enough computing space to allow it to understand all sentences in.the manner of the device G whose instructions it stores. This is no more surprising than the fact that a person who knows the rules of arithmetic perfectly may not be able to perform many computa- tions correctly in his head. One must be careful not to obscure the fundamental difference between, on the one hand, a device M storing the rules G but having enough computing space to understand in the manner of G only a certain proper subset Lf of the set L of sentences generated by G and, on the other hand, a device M* designed specifically to understand only the sentences of Z/ in the manner of G. 
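The difference between M and M* can be made concrete with a deliberately trivial sketch. The grammar (S → a S b, S → c), the use of recursion depth as the analogue of computing space, and the particular space limits are all our own illustrative choices, not the text's.

```python
# A toy contrast between a device M that stores a grammar but has limited
# computing space, and a device M* built only for the sentences M can handle.

def understands(sentence, space):
    """M: interpret the stored rule S -> a S b | c, using at most `space`
    levels of recursion (the analogue of available computing space)."""
    def s(tokens, depth):
        if depth > space:
            return None                       # out of computing space
        if tokens == ["c"]:
            return "c"
        if len(tokens) >= 3 and tokens[0] == "a" and tokens[-1] == "b":
            inner = s(tokens[1:-1], depth + 1)
            return None if inner is None else f"[a {inner} b]"
        return None
    return s(sentence.split(), 0)

def understands_star(sentence):
    """M*: a device designed only for the sentences M handles with space = 2."""
    table = {"c": "c", "a c b": "[a c b]", "a a c b b": "[a [a c b] b]"}
    return table.get(sentence)

print(understands("a a a c b b b", space=2))   # None: beyond M's present space
print(understands("a a a c b b b", space=5))   # recovered once memory is added
print(understands_star("a a a c b b b"))       # None: M* must be redesigned
```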
The distinction is perfectly analogous to the distinction between a device F that contains the rules of arithmetic but has enough computing space to handle only a proper subset £' of the set S of arithmetical computations and a device F* that is designed to compute only S'. Thus, although identical in their behavior to F* and M *, F and M can improve their behavior without additional instruction if given additional memory aids, but F* and M* must be redesigned to extend the class of cases that they can handle. It is clear that F and M, the devices that incorporate competence whether or not it is realized in performance, provide the only models of any psychological relevance, since only they can explain the transfer of learning that we know occurs when memory aids are in fact made available. In particular, if the grammar G incorporated in M exceeds any finite automaton in generative capacity, then we know that M will not be able to understand all sentences in the manner of G. There would be little reason to expect, a priori, that the natural languages learned by humans should belong to the special family of sets that can be generated by one- sided linear grammars (cf. Defs. 6 and 7, Chapter 12, Sec. 4.1) or by nonself-embedding context-free grammars (cf. Proposition 58 and Theorem 33, Chapter 12, Sec. 4.6). In fact, they do not, as we have observed several times. Consequently, we know that a realistic model M for the perceiver will incorporate a grammar G that generates sentences that M cannot understand in the manner of G (without additional aids). This conclusion should occasion no surprise; it leads to none of the paradoxical conse- quences that have occasionally been suggested. There has been much confusion about this matter and we should like to reemphasize the fact that the conclusion we have reached is just what should have been expected. We can construct a model for the listener who understands a presented sentence by specifying the stored grammar G, the organization of memory, and the operations performable by M. We determine a class of perceptual models by stating conditions that these specifications must meet. In FINITARY MODELS OF LANGUAGE USERS Sec. 2.1 we consider perceptual models that store rewriting systems. Then in Sec. 2.2 we discuss possible features of perceptual models that incorpo- rate transformational grammars. 2. 1 Models Incorporating Rewriting Systems Let us suppose that we have a language L generated by a context- sensitive grammar G that assigns to each sentence of L a P-marker — a labeled tree or labeled bracketing— in the manner we have already considered. What can we say about the understanding of sentences by the speaker of L? For example, what can we say about the class of sentences of his language that this speaker will be able to understand at all? If we construct a finite perceptual device M that incorporates the rules of G in its memory, to what extent will M be able to understand sentences in the manner of G? In part, we answered this question in Sec. 4.6 of Chapter 12. Roughly, the answer was the following. Suppose that we say that the degree of self-embedding of the P-marker Q is m if m is the largest integer meeting the following condition: there is, in the labeled tree that represents Q, a continuous path passing through m + 1 nodes JV0, . . . 
, N_m, each with the same label, where each N_i (i ≥ 1) is fully self-embedded (with something to the left and something to the right) in the subtree dominated by N_{i−1}; that is to say, the terminal string of Q can be written in the form

x y_0 y_1 ... y_{m−1} z v_{m−1} ... v_1 v_0 w,   (33)

where N_m dominates z, and for each i < m, N_i dominates

y_i ... y_{m−1} z v_{m−1} ... v_i,   (34)

and none of the strings y_0, ..., y_{m−1}, v_0, ..., v_{m−1} is null. Thus, for example, in Fig. 5 the degree of self-embedding is two.

In Sec. 4.6 of Chapter 12 we presented a mechanical procedure Γ that can be regarded as having the following effect: given a grammar G and an integer m, Γ(G, m) is a finite transducer M that takes a sentence x as input and gives as output a structural description F(x) (which is, furthermore, a structural description assigned to x by G) wherever F(x) has a degree of self-embedding of no more than m; that is to say, where m is a measure of the computing space available to a perceptual model M, which incorporates the grammar G, M will partially understand sentences in the manner of G just to the extent that the degree of self-embedding of their structural descriptions is not too great. As the amount of computing space available to the device M increases, M will understand more deeply embedded structures in the manner of G. For any given sentence x there is an m sufficiently large so that the device M with computing space determined by m [i.e., the device Γ(G, m)] will be capable of understanding x in the manner of G; M does not have to be redesigned to extend its capacities in this way. Furthermore, this is the best result that can be achieved, since self-embedding is, as was proved in Chapter 12, precisely the property that distinguishes context-free languages from the regular languages that can be generated (accepted) by finite automata.

In Chapter 12 this result was stated only for a certain class K of context-free grammars. We pointed out that the class K contains a grammar for every context-free language and that it is a straightforward matter to drop many, if not all, of the restrictions that define K. Extension to context-sensitive grammars is another matter, however, and the problem of finding an optimal finite transducer that understands the sentences of G as well as possible, for any context-sensitive G, has not been investigated at all. Certain approaches to this question are suggested by the results of Matthews, discussed in Chapter 12, Sec. 4.2, on asymmetrical context-sensitive grammars and PDS automata, but these have not yet been pursued.

These restrictions aside, the procedure Γ of Sec. 4.6, Chapter 12, provides an optimal perceptual model (i.e., an optimal finite recognition routine) that incorporates a context-free grammar G. Given G, we can immediately construct such a device in a mechanical way, and we know that it will do as well as can be done by any device with bounded memory in understanding sentences in the manner of G. As the amount of memory increases, its capacity to understand sentences of G increases without limit. Only self-embedding beyond a certain degree causes it to fail when memory is fixed. We can, in fact, rephrase the construction so that the procedure Γ determines a transducer Γ(G) which understands all sentences in the manner of G, where Γ(G) is a "single-pass" device with only push-down storage, as shown in Sec. 4.2, Chapter 12.
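Given a P-marker in some explicit form, the degree of self-embedding defined by Eqs. 33 and 34 can be computed mechanically. The tuple encoding of trees below is ad hoc, and the example tree (for the string aacbb, of the kind shown in Fig. 5 and generated by the grammar discussed in the next paragraph) is our own construction.

```python
# Degree of self-embedding (Eqs. 33-34) of a labeled tree, written as
# (label, [children]) with plain strings as terminal leaves.

def terminal_yield(node):
    return node if isinstance(node, str) else "".join(terminal_yield(c) for c in node[1])

def embedded_descendants(node):
    """All non-terminal descendants of `node`, each with the terminal
    material that lies to its left and to its right inside `node`."""
    found, total = [], terminal_yield(node)
    def walk(n, left):
        if isinstance(n, str):
            return left + n
        if n is not node:
            found.append((n, left))
        for child in n[1]:
            left = walk(child, left)
        return left
    walk(node, "")
    return [(d, l, total[len(l) + len(terminal_yield(d)):]) for d, l in found]

def chain_length(node):
    """Longest chain of same-labeled nodes starting at `node`, each fully
    self-embedded (nonempty material on both sides) in its predecessor."""
    best = 0
    for d, left, right in embedded_descendants(node):
        if d[0] == node[0] and left and right:
            best = max(best, 1 + chain_length(d))
    return best

def degree_of_self_embedding(tree):
    nodes = []
    def collect(n):
        if not isinstance(n, str):
            nodes.append(n)
            for c in n[1]:
                collect(c)
    collect(tree)
    return max(chain_length(n) for n in nodes)

# S => a S => a S b => a a S b => a a S b b => a a c b b
aacbb = ("S", ["a", ("S", [("S", ["a", ("S", [("S", ["c"]), "b"])]), "b"])])
print(degree_of_self_embedding(aacbb))        # 2, as stated for Fig. 5
```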
Observe that the optimal perceptual model M = T(G, m\ where m is fixed, may fail to understand sentences in the manner of G even when Fig. 5. Phrase marker with a degree of self-embedding equal to two. 47° FINITARY MODELS OF LANGUAGE USERS the language L generated by G might have been generated by a one-sided linear grammar (finite automaton). For example, the context-free gram- mar G that gives the structural description in Fig. 5 might be the following: S-+aS, S->SZ>, S-+c. (35) (It is a straightforward matter to extend T to deal with rules of the kind in Example 35.) The generated language is the set of all strings a*cb* and is clearly a regular language. Nevertheless, with m = 1, Y(G, m) will not be capable of understanding the sentence aacbb generated in Fig. 5 in the manner of G, since this derivation has a degree of self-embedding equal to two. The point is that although a finite automaton can be found to accept the sentences of this language it is not possible to find a finite device that understands all of its sentences in the manner of the particular generative process G represented in Example 35. Observe also that the perceptual device ^F(G, m) is nondeterministic. As a perceptual model it has the following defect. Suppose that G assigns to x a structural description D with degree of self-embedding not exceeding m. Then, as we have indicated, the device Y(G, m) will be capable of computing in such a way that it will map x into D, thus interpreting x in the manner of G. Being nondeterministic, however, it may also, given x, compute in such a way that it will fail to map x into a structural description at all. If Y(G, m) fails to interpret x in the manner of G on a particular computation, we can conclude nothing about the status of x with respect to the grammar G, although if T(G, m) does map x into a structural description D we can conclude that G assigns D to x. We might investigate the problem of constructing a deterministic perceptual model that parti- ally understands the output of a context-free grammar, or a model with nondeterminacy matching the ambiguity of the underlying grammar — that is, a model that may block on a computation with a particular string only if this string is either not generated by the grammar from which the model is constructed or is generated only by a derivation that is too deeply self-embedded for the device in question — but this matter has not yet been carefully investigated. It is clear, however, that such devices unlike T(G, m), would involve a restriction on the right-recursive elements in the structural descriptions (i.e., on right branchings). See, in this connection, the example on p. 473. Self-embedding is the fundamental property that takes a system outside of the generative capacity of a finite device, and self-embedding will ultimately result from nesting of dependencies, since the nonterminal vocabulary is finite. However, the nesting of dependencies, even short of self-embedding, causes the number of states needed in the device Y(G, m) to ALGEBRAIC MODELS 4J I increase quite rapidly with the length of the input string that it is to under- stand. Consequently, we would expect that nested constructions should become difficult to understand even when they are, in principle, within the capacity of a finite device, since available memory (i.e., number of states) is clearly quite limited for real-time analytic operations, a fact to which we return in Sec. 2.2. Indeed, as we observed in Chapter 11 (cf. Example 11 in Sec. 
3), nested structures even without self-embedding quickly become difficult or impossible to understand. From these observations we are led to conclude that sentences of natural languages containing nested dependencies or self-embedding beyond a certain point should be impossible for (unaided) native speakers to under- stand. This is indeed the case, as we have already pointed out. There are many syntactic devices available in English — and in every other language that has been studied from this point of view — for the construction of sentences with nested dependencies. These devices, if permitted to operate freely, will quickly generate sentences that exceed the perceptual capacities (i.e., in this case, the short-term memory) of the native speakers of the language. This possibility causes no difficulties for communication, however. These sentences, being equally difficult for speaker and hearer, simply are not used, just as many other proliferations of syntactic devices that produce well-formed sentences will never actually be found. There would be no reason to expect that these devices (which are, of course, continually used when nesting is kept within the bounds of memory restriction) should disappear as the language evolves; and, in fact, they do not disappear, as we have observed. It would be reasonable to expect, however, that a natural language might develop techniques to paraphrase complex nested sentences as sentences with either left-recursive or right- recursive elements, so that sentences of the same content could be produced with less strain on memory. That expectation, formulated by Yngve (1960, 1961) in a rather different way, to which we return, is well confirmed. Alongside such self-embedding English sentences as if, whenever X then Y, then Z, we can have the basically right-branching structure Z if whenever X, then Y, and so on in many other cases. In particular, many singulary grammatical transformations in English seem to be primarily stylistic; they convert one sentence into another with much the same content but with less self-embedding. Alongside the sentence that the fact that he left was unfortunate is obvious, which doubly embeds S, we have the more intelligible and primarily right-recursive structure it is obvious that it was unfortunate that he left. Similarly, we have a transformation that converts the cover that the book that John has has to John's book's cover, which is left-branching rather than self-embedding. (It should also be noted, however, that some of these so-called stylistic transformations can increase 472 FINITARY MODELS OF LANGUAGE USERS structural complexity, e.g., those that give "cleft-sentences" — from 7 read the book that you told me about we can form // was the book that you told me about that I read, etc.) Now to recapitulate : from the fact that human memory is finite we can conclude only that some self-embedded structures should not be under- standable; from the further assumption that memory is small, we can predict difficulties even with nested constructions. Although sentences are accepted (heard and spoken) in a single pass from left to right, we cannot conclude that there should be any left-right asymmetry in the under- standable structures. Nor is there any evidence presently available for such asymmetry. 
We have little difficulty in understanding such right-branching constructions as he watched the boy catch the ball that dropped from the tower near the lake or such left-branching constructions as all of the men whom I told you about who were exposed to radiation who worked half-time, are still healthy, but the ones who worked full time are not or many more than half of the rather obviously much too easily solved problems were dropped last year. Similarly, no conclusion can be drawn from our present knowl- edge of the distribution of left-recursive and right-recursive elements in language. Thus, in English, right-branching constructions predominate; in other languages — Japanese, Turkish — the opposite is the case. In fact, in every known language we find right-recursive, left-recursive, and self- embedding elements (and, furthermore, we find coordinate constructions that exceed the capacity of rewriting systems entirely, a fact to which we return directly). We have so far made only the following assumptions about the model M for the user: 1. M is finite; 2. M accepts (or produces) sentences from left-to-right in a single pass ; 3. M incorporates a context-free grammar as a representation of its competence in and knowledge of the language. Of these, (3) is surely false, but the conclusions concerning recursive ele- ments that we have drawn from it would undoubtedly remain true under a wide class of more general assumptions. Obviously, (1) is beyond ques- tion; (2) is an extremely weak assumption that also cannot be questioned, either for the speaker or hearer — note that many different kinds of internal organization of M are compatible with (2), for example, the assumption that M stores a finite string before deciding on the analysis of its first element or that M stores a finite number of alternative assumptions about the first element which are resolved only at an indefinitely later time. ALGEBRAIC MODELS 4J$ If we add further assumptions beyond these three, we can derive addi- tional conclusions about the ability of the device to produce or understand sentences in the manner of the incorporated grammar. Consider the two extreme assumptions : 4. M produces P-markers strictly "from the top down," or from trunk to branch, in the tree graph of the P-marker. 5. M produces P-markers strictly "from the bottom up," or from branch to trunk, in the tree graph of the P-marker. In accordance with (4), the device M will interpret a rule A -> of the incorporated grammar as the instruction "rewrite A as " — that is to say, as the instruction that, in constructing a derivation, a line of the form yiAyz can be followed by the line y-^y^ Assumption 5 requires the device M to interpret each rule A — > (/> of the grammar as the instruction "replace by A" — that is to say, in constructing an inverted derivation with S as its last line and a terminal string as its first line, a line of the form Y>iy>2 can be followed by the line ^v4yv From Assumption 4 we can conclude that only a bounded number of successive left-branchings can, in general, be tolerated by M. Thus suppose that M is based on a grammar containing the rule S -> SA. After n applications of this left-branching rule the memory of a device meeting Assumptions 2 and 4 (under the natural interpretation) would have to store n occurrences of A for later rewriting and would thus eventually have to violate Assumption 1 . 
On the other hand, from Assump- tion 5 we can conclude that only a bounded number of successive right- branchings can in general be tolerated. For example, suppose the under- lying grammar contains right-branching rules: A -> cA, B-+cB, A->a, and B-+b. In this case the device will be presented with strings cna or cnb. Now, although Assumption 2 still calls for resolution from left to right, Assumption 5 implies that no node in the P-marker can be replaced until all that it dominates is known, so that resolution must be postponed until the final symbol in the string is received. Thus the device would have to store n occurrences of c for later rewriting and, again, Assumption 1 must eventually be violated. Left-branching causes no difficulty under Assumption 5, of course, just as right-branching causes no difficulty in the case of Assumption 4. Thus Assumptions 4 and 5 impose left-right asym- metries (in opposite ways) on the set of structures that can be accepted or produced by M. Observe that the devices T(G, m), given by the proced- ure T of Chapter 12, Sec. 4.6, need not meet either of the restrictions in Assumption 4 or 5; in constructing a particular P-marker, they may move up or down or both ways indefinitely often, just as long as self- embedding is restricted. FINITARY MODELS OF LANGUAGE USERS Assumption 4 might be interpreted as a condition to be met by the speaker; Assumption 5, as a condition to be met by the hearer. (Of course, if we design a model of the speaker to meet Assumption 4 and a model of the hearer to meet Assumption 5 simultaneously, we will severely restrict the possibility of communication between them.) If Assumption 4 described the speaker, we would expect him to have difficulty with left- branching constructions ; if Assumption 5 described the listener, we would expect him to have difficulty with right-branching constructions. Neither assumption seems particularly plausible. There is no reason to think that a speaker must always select his major phrase types before the minor subphrases or his word categories before his words (Assumption 4). Similarly, although a listener obviously receives terminal symbols and constructs phrase types, there is no reason to assume that decisions con- cerning minor phrase types must uniformly precede those concerning major structural features of the sentence. Assumptions 4 and 5 are but two of a large set of possible assumptions that might be considered in specifying models of the user more fully. Thus we might introduce an assumption that there is a bound on the length of the string that must be received before a construction can be uniquely identified by a left-to-right perceptual model— and so on, in many other ways. There has been some discussion of hypotheses such as Assumptions 4 and 5. For example, Skinner's (1957) proposal that "verbal operant responses" to situations (e.g., the primary nouns, verbs, adjectives) form the raw materials of which sentences are constructed by higher level "autoclitic" responses (grammatical devices, ordering, selecting, etc.) might be loosely interpreted as a variant of Assumption 5, regarded as an assumption about the speaker. Yngve (1960, 1961) has proposed a variant of (4) as an assumption about the speaker; his proposal is explicitly directed toward our present topic and so demands a somewhat fuller discussion. 
Yngve describes a process by which a device that contains a grammar rather similar to a context-free grammar produces derivations of utter- ances, always rewriting the leftmost nonterminal symbol in the last line of the already constructed derivation and postponing any nonterminal symbols to the right of it. Each postponed symbol, therefore, is a promise that must be remembered until the time comes to develop it ; as the number of these promises grows, the load on memory also grows. Thus Yngve defines a measure of depth in terms of the number of postponed symbols, so that left-branching, self-embedding, and multiple-branching all con- tribute to depth, whereas right-branching does not. (Note that the depth of postponed symbols and the degree of embedding are quite distinct measures.) Yngve observes that a model so constructed for the speaker ALGEBRAIC MODELS 475 will be able with a limited memory to produce structures that do not exceed a certain depth. He offers the hypothesis that Assumption 4, so interpreted, is a correct characterization of the speaker and that natural languages have developed in such a way as to ease the speaker's task by limiting the necessity for left-branching. The arguments in support of this hypothesis, however, seem incon- clusive. It is difficult to see why any language should be designed for the ease of the speaker rather than the hearer, and Assumption 4 in any form seems totally unmotivated as a requirement for the hearer; on the contrary, the opposite assumption, as we have noted, seems the better motivated of the two. Nor does (4) seem to be a particularly plausible assumption concerning the speaker, for reasons we have already stated. It is possible, of course, to construct sentences that have a great depth and that are quite unintelligible, but they characteristically involve nesting or self-embedding and thus serve merely to show that the speaker and hearer have finite memories — that is to say, they support only the obvious and unquestionable Assumptions 1 and 2, not the additional Assumption 4. In order to support Yngve's hypothesis, we would have to find unintelligible sentences whose difficulty was attributable entirely to left-branching and multiple-branching. Such examples are not readily produced. In order to explain why multiple-branching, which contributes to the measure of depth, does not cause more difficulty, Yngve treats coordinate construc- tions (e.g., conjunctions) as right-branching, which does not contribute to the number of postponed symbols. But this is perfectly arbitrary; they could just as well be treated as left-branching. The only correct interpretation for such constructions is in terms of multiple-branching from a single node — this is exactly the formal feature that distinguishes true coordinate constructions, with no internal structure, from others. As we have observed in Chapter 11, Sec. 5, such constructions are beyond the limits of systems of rewriting rules altogether. Hence the relative ease with which such sentences as Examples 18 and 20 of Chapter 11 can be understood contradicts not only Assumption 4 but even the underlying Assumption 3, of which 4 is an elaboration. In short, there seems to be little that we can say about the speaker and the hearer beyond the obvious fact that they are limited finite devices that relate sentences and structural descriptions and that they are subject to the constraint that time is linear. 
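Yngve's depth measure itself is mechanical and easy to state as a computation over a P-marker: count the symbols still postponed when each terminal is produced in a top-down, left-to-right expansion. The tree encoding and the two schematic examples below (a right-branching and a left-branching structure suggested by the sentences cited earlier) are our own; they are meant only to show how right-branching keeps the count low while left-branching drives it up.

```python
# Yngve's depth: the number of postponed symbols at the moment each terminal
# is produced in a top-down, left-to-right expansion of the P-marker.

def yngve_depths(tree):
    depths = []
    def walk(node, postponed):
        if isinstance(node, str):             # a terminal word
            depths.append((node, postponed))
            return
        label, children = node
        for i, child in enumerate(children):
            # everything to the right of `child` is postponed while it is expanded
            walk(child, postponed + (len(children) - 1 - i))
    walk(tree, 0)
    return depths

right_branching = ("S", ["he", ("VP", ["watched", ("NP", ["the", ("Nom", ["boy",
                      ("Srel", ["catch", ("NP", ["the", "ball"])])])])])])
left_branching = ("S", [("NP", [("NP", [("NP", ["John", "s"]), "book", "s"]),
                      "cover"]), "fell"])

for tree in (right_branching, left_branching):
    ds = yngve_depths(tree)
    print(max(d for _, d in ds), ds)
```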
From this, all that we can conclude is that self-embedding (and, more generally, nesting of dependencies) should cause difficulty, as indeed it does. It is also not without interest that self-embed- ding seems to impose a greater burden than an equivalent amount of nesting without self-embedding. Further speculations are, at the present time, quite unsupported. FINITARY MODELS OF LANGUAGE USERS 2.2 Models Incorporating Transformational Grammars There are surprising limitations on the amount of short-term memory available for human data processing, although the amount of long-term memory is clearly great (cf. Miller, 1956). This fact suggests that it might be useful to look into the properties of a perceptual model M with two basic components, M^ and M2, operating as follows : M± contains a small, short-term memory. It performs computations on an input string x as it is received symbol by symbol and transmits the result of these com- putations to M2. M2 contains a large long-term memory in which is stored a generative grammar G; the task of M2 is to determine the deeper structure of the input string x, using as its information the output trans- mitted to it by Mj. (Sentence-analyzing procedures of this sort have been investigated by Matthews, 1961.) The details of the operation of M2 would be complicated, of course; probably the best way to get an appreciation of the functions it would have to perform is to consider an example in some detail. Suppose, therefore, that a device M, so constructed, attempts to analyze such sentences as John is easy to please. (36) John is eager to please. (37) To these, MI might assign preliminary analyses, as in Fig. 6, in which inessentials are omitted. Clearly, however, this is not the whole story. In order to account for the way in which we understand these sentences, it is necessary for the component M2, accenting the analysis shown in Fig. 6 as input, to give as output structural descriptions that indicate that in John is [elfger] to please Fig. 6. Preliminary analysis of Sentences 36 and 37. ALGEBRAIC MODELS NP /VPv NP VP John 's Adj. John V ATP eaSer complement pleases someone (*) (b) S S NP VP NP Vp it is Adj. complement someone V NP I I I easy pleases John (c) (d) Fig. 7. Some P-markers that would be generated by the rewriting rules of the grammar and to which the transformation rules would apply. Example 36 John is the direct object of please, whereas in Example 37 it is the logical subject of please. Before we can attempt to provide a description of the device M2 we must ask how structural information of this deeper kind can be represented. Clearly, it cannot be conveyed in the labeled tree (P-marker) associated with the sentence as it stands. No elaboration of the analysis shown in Fig. 6, with more elaborate subcategorization, etc., will remedy the fundamental inability of this form of representation to mirror grammatical relations properly. We are, of course, facing now precisely the kind of difficulty that was discussed in Chapter 11, Sec. 5, and that led to the development of a theory of transformational generative grammar. In a transformational grammar for English the rewriting rules would not be required to provide Examples 36 and 37 directly; the rewriting rules would be limited to the generation of such P-markers as those shown in Fig. 7 (where inessentials are again omitted). 
In addition, the grammar will contain such transformations as

    T1: replaces complement by "for x to y," where x is an NP and y is a VP in the already generated sentence xy;
    T2: deletes the second occurrence of two identical NP's (with whatever is affixed to them);
    T3: deletes direct objects of certain verbs;
    T4: deletes "for someone" in certain contexts;
    T5: converts a string analyzable as NP1 - is - Adj - (for - NP) - to - V - NP2 to the corresponding string of the form NP2 - is - Adj - (for - NP) - to - V.

Each of these can be generalized and put in the form specified in Chapter 11. When appropriately generalized, they are each independently motivated by examples of many other kinds. Note, for example, the range of sentences that are similar in their underlying structural features to Examples 36 and 37; we have such sentences as John is an easy person to please, John is a person who (it) is easy to please, this room is not easy to work in (to do decent work in), he is easy to do business with, he is not easy to get information from, such claims are very easy to be fooled by, and many others, all of which are generated in essentially the same way.

Applying T1 to the pair of structures in Figs. 7c and 7d, we derive the sentence It is easy for someone to please John, with its derived P-marker. Applying T4 to this, we derive It is easy to please John, which is converted to Example 36 by T5. Had we applied T5 without T4, we could have derived, for example, John is easy for us to please (with we chosen in place of someone in Fig. 7d — we leave unstated obvious obligatory rules). Applying T1 to the pair of structures in Figs. 7a and 7b, we derive John is eager for John to please someone, which is converted by T2 to John is eager to please someone. Had we applied T3 to Fig. 7b before applying T1, we would, in the same way, have derived Example 37.

At this point we should comment briefly on several features of such an analysis. Notice that I am eager for you to please, you are eager for me to please, etc., are all well-formed sentences; but I am eager for me to please, you are eager for you to please, etc., are impossible and are reduced to I am eager to please, you are eager to please obligatorily by T2. This same transformation gives I expected to come, you expected to come, etc., from I expected me to come, you expected you to come, which are formed in the same way as you expected me to come, I expected you to come. Thus this grammar does actually regard John in Example 37 as identical with the deleted subject of please. Note, in fact, that in the sentence John expected John to please, in which T2 has not applied, the two occurrences of John must have different reference. In Example 36, on the other hand, John is actually the direct object of please, assuming grammatical relations to be preserved under transformation (assuming, in other words, that the P-marker represented in Fig. 7d is part of the structural description of Example 36). Note, incidentally, that T5 does not produce such non-sentences as John is easy to come, since there is no NP comes John, though we have John is eager to come by T1, T2.
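To give the flavor of such a rule in operational terms, here is a deliberately crude sketch that treats T5 as a pattern rewrite over a flat string of pre-tagged words; the tagging scheme and the example are invented for illustration, and the sketch ignores the structural (P-marker) conditions on which the real rule depends.

    # Toy rendering of T5 as a rewrite on a flat, pre-tagged word string.
    # Each item is (word, tag); the tags are an invented convenience.

    def t5(tagged):
        """NP1 - is - Adj - (for - NP) - to - V - NP2
           -> NP2 - is - Adj - (for - NP) - to - V."""
        tags = [tag for _, tag in tagged]
        if tags[:3] != ["NP", "is", "Adj"]:
            return None
        i, optional = 3, []
        if tags[3:5] == ["for", "NP"]:            # the optional "for NP" piece
            optional, i = tagged[3:5], 5
        if tags[i:i + 3] != ["to", "V", "NP"]:
            return None
        np2 = tagged[i + 2]
        return [np2] + tagged[1:3] + optional + tagged[i:i + 2]

    sentence = [("it", "NP"), ("is", "is"), ("easy", "Adj"),
                ("for", "for"), ("someone", "NP"),
                ("to", "to"), ("please", "V"), ("John", "NP")]
    print(" ".join(word for word, _ in t5(sentence)))
    # -> John is easy for someone to please

The structural conditions on which the real rule depends, discussed next, are just what such a string-level sketch cannot express.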
T5 would not apply to any sentence of the form NP - is - eager - (for - NP) - to - V - NP2 to give NP2 - is - eager - (for - NP) - to - V (for example, Bill is eager for us to meet from John is eager for us to meet Bill; these crooks are eager for us to vote out from John is eager for us to vote out these crooks), since eager complement, but not eager, is an Adj (whereas easy, but not easy complement, is an Adj). Supporting this analysis is the fact that the general rule that nominalizes sentences of the form NP — is — Adj (giving, for example, John's cleverness from John is clever) converts John is eager (for us) to come (which comes from Fig. 7a and we come by T1) to John's eagerness for us to come; but it does not convert Example 36 to John's easiness to please. Furthermore, the general transformational process that converts phrases of the form the — Noun — who (which) — is — Adj to the — Adj — Noun (for example, the man who is old to the old man) does convert a fellow who is easy to please to an easy fellow to please (since easy is an Adj) but does not convert a fellow who is eager to please to an eager fellow to please (since eager is not, in this case, an Adj). In brief, when these rules are stated carefully, we find that a large variety of structures is generated by quite general, simple, and independently motivated rules, whereas other superficially similar structures are correctly excluded. It would not be possible to achieve the same degree of generalization and descriptive adequacy with a grammar that operates in the manner of a rewriting system, assigning just a single P-marker to a sentence as its structural description.

Returning now to our main theme, we see that the grammatical relations of John to please in Examples 36 and 37 are represented in the intuitively correct way in the structural descriptions provided by a transformational grammar. The structural description of Example 36 consists of the two underlying P-markers in Figs. 7c and 7d and the derived P-marker in Fig. 6 (as well as a record of the transformational history, i.e., T1, T4, T5). The structural description of Example 37 consists of the underlying P-markers in Figs. 7a and 7b and the derived P-marker in Fig. 6 (along with the transformational history T1, T2, T3). Thus the structural description of Example 36 contains the information that John in Example 36 is the object of please in the underlying P-marker of Fig. 7d; and the structural description of Example 37 contains the information that John in Example 37 is the subject of please in the underlying P-marker in Fig. 7b. Note that, when the appropriately generalized form of T5 applies to it is easy to do business with John to yield John is easy to do business with, we again have in the underlying P-markers a correct account of the grammatical relations in the transform, although in this case the grammatical subject John is no longer the object of the verb of the complement, as it is in Example 36. Notice also that it is the underlying P-markers, rather than the derived P-marker, that represent the semantically relevant information in this case. In this respect, these examples are quite typical of what is found in more extensive grammars.

These observations suggest that the transformational grammar be stored and utilized only by the component M2 of the perceptual model. M1 will take a sentence as input and give as output a relatively superficial analysis of it (perhaps a derived P-marker such as that in Fig. 6).
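Since the claim is that the underlying P-markers, not the surface string, carry these relations, a minimal sketch may help; the nested-tuple structures below are hand-built stand-ins, only loosely modeled on Figs. 7b and 7d, and the function simply reads the subject and object of the clause off such a structure, as a component like M2 would have to do.

    # Hand-built stand-ins for two underlying P-markers (shapes invented).
    p_7b = ("S", ("NP", "John"),
                 ("VP", ("V", "pleases"), ("NP", "someone")))   # underlies Example 37
    p_7d = ("S", ("NP", "someone"),
                 ("VP", ("V", "pleases"), ("NP", "John")))      # underlies Example 36

    def subject_and_object(pmarker):
        """Read subject and object off a P-marker of the simple form
        (S, (NP, subj), (VP, (V, verb), (NP, obj)))."""
        _, (_, subj), (_, (_, verb), (_, obj)) = pmarker
        return verb, subj, obj

    print(subject_and_object(p_7d))   # ('pleases', 'someone', 'John'): John is object
    print(subject_and_object(p_7b))   # ('pleases', 'John', 'someone'): John is subject

No comparable information can be read off the shared derived P-marker of Fig. 6, which is the point of the preceding paragraphs.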
M2 will utilize the full resources of the transformational grammar to provide a structural description, consisting of a set of P-markers and a transformational history, in which deeper grammatical relations and other structural information are represented. The output of M = (M1, M2) will be the complete structural description assigned to the input sentence by the grammar that it stores; but the analysis that is provided by the initial, short-term memory component M1 may be extremely limited.

If the memory limitations on M1 are severe, we can expect to find that structurally complex sentences are beyond its analytic power even when they lack the property (i.e., repeated self-embedding) that takes them completely beyond the range of any finite device. It might be useful, therefore, to develop measures of various sorts to be correlated with understandability. One rough measure of structural complexity that we might use, along with degree of nesting and self-embedding, is the node-to-terminal-node ratio N(Q) in the P-marker Q of the terminal string t(Q). This number measures roughly the amount of computation per input symbol that must be performed by the listener. Hence an increase in N(Q) should cause a correlated difficulty in interpreting t(Q) for a real-time device with a small memory. Clearly N(Q) grows as the amount of branching per node decreases. Thus N(Q) is higher for a binary P-marker such as that shown in Fig. 8a than for the P-marker in Fig. 8b that represents a coordinate construction with the same number of terminals.

[Fig. 8. Illustrating a measure of structural complexity: N(Q) for the P-marker in (a) is 7/4; for (b), N(Q) = 5/4.]

Combined with our earlier speculations concerning the perceptual model M, this observation would lead us to suspect that N(Q) should in general be higher for the derived P-marker that must be provided by the limited component M1 than it would be for underlying P-markers. In other words, the general effect of transformations should be to decrease the total amount of structure in the associated P-marker. This expectation is fully borne out. The underlying P-markers have limited, generally binary branching. But, as we have already observed in Chapter 11 (particularly p. 305), binary branching is not a general characteristic of the derived P-markers associated with actual sentences; in fact, the actual set of derived P-markers is beyond the generative capacity of rewriting systems altogether, since there is no bound on the amount of branching from a single node (that is to say, on the length of a coordinate construction).

The psychological plausibility of a transformational model of the language user would be strengthened, of course, if it could be shown that our performance on tasks requiring an appreciation of the structure of transformed sentences is some function of the nature, number, and complexity of the grammatical transformations involved.

One source of psychological evidence concerns the grammatical transformation that negates an affirmative sentence. It is a well-established fact that people in concept-attainment experiments find it difficult to use negative instances (Smoke, 1933). Hovland and Weiss (1953) established that this difficulty persists even when the amount of information conveyed by the negative instances is carefully equated to the amount conveyed by positive instances.
Moreover, Wason (1959, 1961) has shown that the grammatical difference between affirmative and negative English sentences causes more difficulty for subjects than the logical difference between true and false; that is to say, if people are asked to verify or to construct simple sentences (about whether digits in the range 2 to 9 are even or odd), they will take longer and make more errors on the true negative and false negative sentences than on the true affirmative and false affirmative sen- tences. Thus there is some reason to think that there may be a grammatical explanation for some of the difficulty we have in using negative infor- mation; moreover, this speculation has received some support from 482 FINITARY MODELS OF LANGUAGE USERS Eifermann (1961), who found that negation in Hebrew has a somewhat different effect on thinking than it has in English. A different approach can be illustrated by sentence-matching tests (Miller, 19626). One study used a set of 18 elementary strings (for example, those formed by taking Jane, Joe, or John as the first constituent, liked or warned as the second, and the old woman, the small boy, or the young man as the last), along with the corresponding sets of sentences that could be formed from those by passive, negative, or passive-and- negative transformations. These sets were taken two at a time, and sub- jects were asked to match the sentences in one set with the corresponding sentences in the other. The rate at which they worked was recorded and from that it was possible to obtain an estimate of the time required to perform the necessary transformations. If we assume that these four types of sentence are coordinate and independently learned, then there is little reason to believe that finding correspondences between any two of them will necessarily be more difficult than between any other two. On the other hand, if we assume that the four types of sentence are related to one another by two grammatical transformations (and their inverses), then we would expect some of the tests to be much easier than others. The data supported a transformational position : the negative transforma- tion was performed most rapidly, the more complicated passive transfor- mation took slightly longer, and tests requiring both transformations (kernel to passive-negative or negative to passive) took as much time as the two single transformations did added together. For example, in order to perform the transformations necessary to match such pairs as Jane didn't warn the small boy and The small boy was warned by Jane, subjects required on the average more than three seconds, under the conditions of the test. Still another way to explore these matters is to require subjects to memorize a set of sentences having various syntactic structures (J. Mehler, personal communication). Suppose, for example, that a person reads at a rapid but steady rate the following string of eight sentences formed by applying passive, negative, and interrogative transformations: Has the train hit the car? The passenger hasn't been carried by the airplane. The photograph has been made by the boy. Hasn't the girl worn the jewel? The student hasn't written the essay. The typist has copied the paper. Hasn't the house been bought by the man ? Has the discovery been made by the biologist ? When he finishes, he attempts to write down as many as he can recall Then the list (in scrambled order) is read again, and again he tries to recall, and so on through a series of trials. 
Under those conditions many syntactic confusions occur, but most of them involve only a single transformational step. It is as if the person receded the original sentences TOWARD A THEORY OF COMPLICATED BEHAVIOR 43 into something resembling a kernel string plus some correction terms for the transformations that indicate how to reconstruct the correct sentence when he is called on to recite. During recall he may remember the kernel, but become confused about which transformations to apply. Preliminary evidence from these and similar studies seems to support the notion that kernel sentences play a central role, not only linguistically, but psychologically as well. It also seems likely that evidence bearing on the psychological reality of transformational grammar will come from careful studies of the genesis of language in infants, but we shall not attempt to survey that possibility here. It should be obvious that the topics considered in this section have barely been opened for discussion. The problem can clearly profit from abstract study of various kinds of perceptual models that incorporate generative processes as a fundamental component. It would be instructive to study more carefully the kinds of structures that are actually found in natural languages and the formal features of those structures that make understanding and production of speech difficult. In this area the empirical study of language and the formal study of mathematical models may bear directly on questions of immediate psychological interest in what could turn out to be a highly fruitful and stimulating way. 3. TOWARD A THEORY OF COMPLICATED BEHAVIOR It should by now be apparent that only a complicated organism can exploit the advantages of symbolic organization. Subjectively, we seem to grasp meanings as integrated wholes, yet it is not often that we can express a whole thought by a single sound or a single word. Before they can be communicated, ideas must be analyzed and represented by se- quences of symbols. To map the simultaneous complexities of thought into a sequential flow of language requires an organism with considerable power and subtlety to symbolize and process information. These com- plexities make linguistic theory a difficult subject. But there is an extra reward to be gained from working it through. If we are able to understand something about the nature of human language, the same concepts and methods should help us to understand other kinds of complicated behavior as well. Let us accept as an instance of complicated behavior any performance in which the behavioral sequence must be internally organized and guided by some hierarchical structure that plays the same role, more or less, as a P-marker plays in the organization of a grammatical sentence. It is not 484 FINITARY MODELS OF LANGUAGE USERS immediately obvious, of course, how we are to decide whether some particular nonlinguistic performance is complicated or simple ; one natural criterion might be the ability to interrupt one part of the performance until some other part had been completed. The necessity for analyzing a complex idea into its component parts has long been obvious. Less obvious, however, is the implication that any complicated activity obliges us to analyze and to postpone some parts while others are being performed. A task, X, say, is analyzed into the parts Yi, 72> 73, which should, let us assume, be performed in that order. So y,_ is singled out for attention while 72 and Y3 are postponed. 
In order to accomplish 715 however, we find that we must analyze it into Z± and Z2, and those in turn must be analyzed into still more detailed parts. This general situation can be expressed in various ways— by an outline or by a list structure (Newell, Shaw, & Simon, 1959) or by a tree graph similar to those used to summarize the structural description of individual sentences. While one part of a total enterprise is being accomplished, other parts may remain implicit and still largely unformulated. The ability to remember the postponed parts and to return to them in an appropriate order is necessarily reserved for organisms capable of com- plicated information processing. Thus the kind of theorizing we have been doing for sentences can easily be generalized to even larger units of behavior. Restricted-infinite automata in general, and PDS systems in particular, seem especially appropriate for the characterization of many different forms of complicated behavior. The spectrum of complicated behavior extends from the simplest responses at one extreme to our most intricate symbolic processes at the other. In gross terms it is apparent that there is some scale of possibilities between these extremes, but exactly how we should measure it is a difficult problem. If we are willing to borrow from our linguistic analysis, there are several measures already available. We can list them briefly: INFORMATION AND REDUNDANCY. The variety and stereotypy of the behavior sequences available to an organism are an obvious parameter to estimate in considering the complexity of its behavior (cf. Miller & Frick, 1949; Frick & Miller, 1951). DEGREE OF SELF-EMBEDDING. This measure assumes a degree of complication that may seldom occur outside the realm of language and language-mediated behaviors. Self-embedding is of such great theoretical significance, however, that we should certainly look for occurrences of it in nonlinguistic contexts. DEPTH OF POSTPONEMENT. This measure of memory load, proposed by Yngve, may be of particular importance in estimating a person's TOWARD A THEORY OF COMPLICATED BEHAVIOR 485 capacity to carry out complicated instructions or consciously to devise complicated plans for himself. STRUCTURAL COMPLEXITY. The ratio of the total number of nodes in the hierarchy to the number of terminal nodes provides an estimate of complexity that, unlike the depth measure, is not asymmetrical toward the future. TRANSFORMATIONAL COMPLEXITY. A hierarchical organization of behavior to meet some new situation may be constructed by transforming an organization previously developed in some more familiar situation. The number of transformations involved would provide an obvious measure of the complexity of the transfer from the old to the new situation. These are some of the measures that we can adapt in analogy to the linguistic studies; no doubt many others of a similar nature could be developed. Clearly, no one can look at a single instance of some performance and immediately assign values to it for any of those measures. As in the case of probability measures, repeated observations under many different conditions are required before a meaningful estimate is available. Many psychologists, of course, prefer to avoid complicated behavior in their experimental studies; as long as there was no adequate way to cope with it, the experimentalist had little other alternative. Since about 1945, however, this situation has been changing rapidly. 
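Before turning to those newer tools, it may be worth noting that two of the measures listed above, information and redundancy and the node-per-terminal ratio (structural complexity), reduce to short computations once a behavior record is written down. The sketch below uses an invented toy hierarchy and an invented response sequence, and its entropy figure is only a zero-order estimate that ignores the sequential constraints emphasized by Miller and Frick.

    import math
    from collections import Counter

    # Toy behavior hierarchy: inner nodes are tuples (label, children...),
    # terminal acts are strings.  Both examples here are invented.
    hierarchy = ("task",
                 ("part1", "a", "b"),
                 ("part2", ("sub", "c", "d"), "e"))

    def nodes_and_terminals(tree):
        if isinstance(tree, str):
            return 1, 1                    # a terminal is one node, one terminal
        nodes, terminals = 1, 0            # count this nonterminal node
        for child in tree[1:]:
            n, t = nodes_and_terminals(child)
            nodes, terminals = nodes + n, terminals + t
        return nodes, terminals

    n, t = nodes_and_terminals(hierarchy)
    print("structural complexity (nodes per terminal):", n / t)   # 9/5 here

    # Zero-order information and redundancy of a response sequence.
    sequence = "ababacabab"
    counts = Counter(sequence)
    total = len(sequence)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    redundancy = 1 - entropy / math.log2(len(counts))
    print("entropy (bits/response):", round(entropy, 3),
          "redundancy:", round(redundancy, 3))

Such arithmetic is, of course, the easy part; obtaining the behavior record and its hierarchical organization is where the real difficulty lies.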
From mathe- matics and logic have come theoretical studies that are increasingly suggestive, and the development of high-speed digital computers has supplied a tool for exploring hypotheses that would have seemed fantastic only a generation ago. Today, for example, it is becoming increasingly common for experimental psychologists to phrase their theories in terms of a computer program for simulating behavior (cf. Chapter 7). Once a theory is expressed in that form, of course, it is perfectly reasonable to try to apply to it some of the indices of complexity. Miller, Galanter, and Pribram (1960) have discussed the organization of complicated behavior in terms of a hierarchy of tote units. A tote unit consists of two parts: a test to see if some situation matches an internally generated criterion and an operation that is intended to reduce any dif- ferences between the external situation and some internal criterion. The criterion may derive from a model or hypothesis about what will be per- ceived or what would constitute a satisfactory state of affairs. The opera- tions can either revise the criterion in the light of new evidence received or they can lead to actions that change the organism's internal and/or external environment. The test and its associated operations are actively linked in a feedback loop to permit iterated adjustments until the criterion 486 FINITARY MODELS OF LANGUAGE USERS is reached. A tote (test-operate-test-exit) unit is shown in the form of a flow-chart in Fig. 9. A hierarchy of tote units can be created by analyzing the operational phase into a sequence of tote units ; then the operational phase of each is analyzed in turn. There should be no implication, how- ever, that the hierarchy must be constructed exclusively from strategy to tactics or exclusively from tactics to strategy — both undoubtedly occur. An example of the kind of structures produced in this way is shown in the flowchart in Fig. 10. These serial flowcharts are simply the finite automata we considered in Chapter 12, and it is convenient to replace them by oriented graphs (cf. Karp, 1960). Wherever an initial or terminal element or operation occurs in the flowchart, replace it by a node with one labeled arrow exiting from the node; wherever a test occurs, replace it by a node with two labeled exits. Next, replace every nonbranching sequence of arrows by a single arrow bearing a Fig. 9. A simple tote unit. compound label. The graph corre- sponding to the flow-chart of Fig. 10 is shown in Fig. 11. From such oriented graphs as these it is a simple matter to read off the set of triples that define a finite automaton. A tote hierarchy is just a general form of finite automaton in the sense of Chapter 12. We know from Theorem 2 of Chapter 12 that for any finite automaton there is an equivalent automaton that can be represented by a finite number of finite notations of the form A±(A^ . . . , Am)*Am+:L9 where the elements A^ . . . , Am can themselves be notations of the same form, and so on, until the full hierarchy is represented. For any finite state model that may be proposed, therefore, there is an equivalent model in terms of a (generalized) tote hierarchy. Since a tote hierarchy is analogous to a program of instructions for a serial computer, it has been referred to as a plan that the system is trying to execute. Any postponed parts of the plan constitute the system's intentions at any given moment. 
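The tote scheme itself is easy to render as a program. The following is a toy sketch, not a transcription of Miller, Galanter, and Pribram: the function names, the iteration guard, and the nail-driving example are all invented, but the loop does show the test-operate-test-exit cycle and how an operational phase may itself consist of subordinate tote units.

    # A toy test-operate-test-exit (tote) unit.  The operational phase is
    # either an elementary action or a list of subordinate tote units,
    # which yields a tote hierarchy when nested.

    def run_tote(test, operate, state, limit=100):
        """Operate on `state` until `test` is satisfied, then exit."""
        for _ in range(limit):                      # guard for the sketch only
            if test(state):
                return state                        # criterion met: exit
            if callable(operate):
                state = operate(state)              # elementary operation
            else:
                for sub_test, sub_operate in operate:
                    state = run_tote(sub_test, sub_operate, state, limit)
        return state

    # Invented example: strike a nail until it is flush with the surface.
    nail = {"height": 5}
    flush = lambda s: s["height"] <= 0
    strike = lambda s: {"height": s["height"] - 2}
    print(run_tote(flush, strike, nail))            # {'height': -1}

Nesting such units inside one another gives precisely the sort of hierarchy diagrammed in Fig. 10.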
Viewed in this way, therefore, the finite devices discussed in these chapters are clearly applicable to an even broader range of behavioral processes than language and communication. Some implications of this line of argument for nonlinguistic phenomena have been discussed informally by Miller, Galanter, and Pribram. A central concern for this type of theory is to understand where new plans come from. Presumably, our richest source of new plans is our old plans, transformed to meet new situations. Although we know little about it, we must have ways to treat plans as objects that can be formed and transformed according to definite rules. The consideration of transfor- mational grammars gives some indication of how we might combine TOWARD A THEORY OF COMPLICATED BEHAVIOR Fig. 10. A hierarchical system of tote units E T' Oz z, z. Fig. 11. Graph of flowchart in Fig. 10. FINITARY MODELS OF LANGUAGE USERS and rearrange plans, which are, of course, so closely analogous to P-markers. As in the case of grammatical transformations, the truly productive behavioral transformations are undoubtedly those that com- bine two or more simpler plans into one. These three chapters make it perfectly plain, however, how difficult it is to formulate a transformational system to achieve the twin goals of empirical adequacy and feasibility of abstract study. When we ask about the source of our plans, however, we also raise the closely related question of what it might be that stands in the same relation to a plan as a grammar stands to a P-marker or as a programming language stands to a particular program. In what form are the rules stored whereby we construct, evaluate, and transform new plans? Probably there are many diverse sets of rules that govern our planning in different enterprises, and only patient observation and analysis of each behavioral system will enable us to describe the rules that govern them. It is probably no accident that a theory of grammatical structure can be so readily and naturally generalized as a schema for theories of other kinds of complicated human behavior. An organism that is intricate and highly structured enough to perform the operations that we have seen to be involved in linguistic communication does not suddenly lose its intricacy and structure when it turns to nonlinguistic activities. In particular, such an organism can form verbal plans to guide many of its nonverbal acts. The verbal machinery turns out sentences— and, for civilized men, sen- tences have a compelling power to control both thought and action. Thus the present chapters, even though they have gone well beyond the usual bounds of psychology, raise issues that must be resolved eventually by any satisfactory psychological theory of complicated human behavior. References Attneave, F. Applications of information theory to psychology. New York: Holt- Dryden, 1959. Burton, N. G., & Licklider, J. C. R. Long-range constraints in the statistical structure of printed English. Amer. J. PsychoL, 1955, 68, 650-653. Carnap, R., &Bar-Hillel, Y. An outline of a theory of semantic information. Res. Lab. Electronics, Cambridge: Mass. Inst. Tech. Tech. Rept. 247, 1952. Chapanis, A. The reconstruction of abbreviated printed messages. /. exp. PsychoL, 1954, 48, 496-510. Cherry, C. On human communication. New York: Technology Press and Wiley, 1957. Chomsky, N. Logical structure of linguistic theory. Microfilm. Mass. Inst. Tech. Libraries, 1955. Condon, E. V. Statistics of vocabulary. Science, 1928, 67, 300. 
REFERENCES 48$ Cronbach, L. J. On the non-rational application of information measures in psychology. In H. Quastler (Ed.), Information theory in psychology. Glencoe, 111. : Free Press, 1955. Pp. 14-26. Eifermann, R. R. Negation: a linguistic variable. Ada PsychoL, 1961, 18, 258-273. Estoup, J. B. Gamines stenographique. (4th ed.) Paris : 1916. Fano, R. M. The transmission of information. Res. Lab. Electronics, Cambridge: Mass. Inst. Tech. Tech. Rept. 65, 1949. Fano, R. M. The transmission of information. New York: Wiley, 1961. Feinstein, A. Foundations of information theory . New York: McGraw-Hill, 1958. Feller, W. An introduction to probability theory and its applications. (2nd ed.) New York: Wiley, 1957. Fletcher, H. Speech and hearing in communication. (2nd ed.). New York: Van Nostrand, 1953. Frick, F. C, & Miller, G. A. A statistical description of operant conditioning. Amer. J. PsychoL, 1951, 64, 20-36. Frick, F. C., & Sumby, W, H. Control tower language. J. acoust. Soc. Amer., 1952, 24, 595-597. Fritz, E. L., & Grier, G. W., Jr. Pragmatic communications: A study of information flow in air traffic control. In H. Quastler (Ed.), Information theory in psychology. Glencoe, 111.: Free Press, 1955. Pp. 232-243. Garner, W. R. Uncertainty and structure as psychological concepts. New York: Wiley, 1962. Gnedenko, B. V., & Kolmogorov, A. N. Limit distributions for sums of independent random variables. Translated by K. L. Chung. Cambridge, Mass.: Addison- Wesley, 1954. Halle, M., & Stevens, K. N. Analysis by synthesis. In Proc. Seminar on Speech Compression and Production, AFCRC-TR-59-198, 1959. Halle, M., & Stevens, K. N. Speech recognition: A model and a program for research. IRE Trans, on Inform. Theory, 1962, IT-8, 155-159. Hardy, G. H., Littlewood, J. E., & Polya, G. Inequalities. (2nd ed.). Cambridge: Cambridge Univer. Press, 1952. Hartley, R. V. The transmission of information. Bell System Tech. /., 1928, 17, 535-550. Hovland, C. L, & Weiss, W. Transmission of information concerning concepts through positive and negative instances. J. exp. Psycho!., 1953, 45, 175-182. Huffman, D. A. A method for the construction of minimum-redundancy codes. Proc. IRE, 1952, 40, 1098-1101. Karp, R. M. A note on the application of graph theory to digital computer pro- gramming. Information and Control, 1960, 3, 179-190. Katz, J., & Fodor, J. The structure of a semantic theory. To appear in Language. Reprinted in J. Katz & J. Fodor. Readings in the philosophy of language. New York: Prentice-Hall, 1963. Khinchin, A. I. Mathematical foundations of information theory. Translated by R. A. Silverman and M. D. Friedman. New York: Dover, 1957. Luce, R. D. Individual choice behavior. New York: Wiley, 1959. Luce, R. D. (Ed,) Developments in mathematical psychology. Glencoe, III: Free Press, 1960. Mandelbrot, B. An informational theory of the structure of language based upon the theory of the statistical matching of messages and coding. In W. Jackson, (Ed.), Proc. symp. on applications of communication theory. London: Butterworth, 1953. 400 FINITARY MODELS OF LANGUAGE USERS Mandelbrot, B. Linguistique statistique macroscopique. In L. Apostel, B. Mandelbrot, & A, Morf. Logique, langage and theorie de r information. Paris: Universitaires de France, 1957. Pp. 1-78. Mandelbrot, B. Les lois statistique macroscopiques du comportment. Psychol. Francaise, 1958, 3, 237-249. Mandelbrot, B. A note on a class of skew distribution functions: Analysis and critique of a paper by H. A. Simon. Information and Control, 1959, 2, 90-99. 
Mandelbrot, B. On the theory of word frequencies and on related Markovian models of discourse. In R. Jakobson (Ed.), Structure of language in its mathematical aspect. Proc. 12th Symp. in App. Math. Providence, R. I. : American Mathematical Society, 1961. Pp. 190-219. Markov, A. A. Essai d'une recherche statistique sur le texte du roman "Eugene Onegin," Bull acad. imper. Set., St. Petersburg, 1913, 7. Marschak, J. Remarks on the economics of information. In Contributions to Scientific Research in Management. Berkeley, Calif.: Univer. of California Press, 1960. Pp. 79-98. Matthews, G. H. Analysis by synthesis of sentences of natural languages. In Proc. 1st Int. Cong, on Machine Translation of Languages and Applied Language Analysis, 1961. Teddington, England: National Physical Laboratory, (in press). McMillan, B. The basic theorems of information theory. Ann. math. Stat., 1953, 24, 196-219. Miller, G. A. Language and communication. New York: McGraw-Hill, 1951. Miller, G. A. What is information measurement? Amer. Psychologist, 1953, 8, 3-11. Miller, G. A. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev., 1956, 63, 81-97. Miller, G. A. Some effects of intermittent silence. Amer. J. Psychol. , 1957,70,311-313. Miller, G. A. Decision units in the perception of speech. IRE Trans. Inform. Theory, 1962, IT-8, No. 2, 81-83. (a) Miller, G. A. Some psychological studies of grammar. Amer. Psychologist, 1962 17, 748-762. (b) Miller, G. A., & Frick, F. C. Statistical behavioristics and sequences of responses. Psychol. Rev., 1949, 56, 311-324. Miller, G. A., & Friedman, E. A. The reconstruction of mutilated English texts. Information and Control, 1957, 1, 38-55. Miller, G. A., Galanter, E., & Pribram, K. Plans and the structure of behavior. New York: Holt, 1960. Miller, G. A., Heise, G. A., & Lichten, W. The intelligibility of speech as a function of the context of the test materials. /. exp. Psychol., 1951, 41, 329-335. Miller, G. A., & Newman, E. B. Tests of a statistical explanation of the rank-frequency relation for words in written English. Amer. J. Psychol., 1958, 71, 209-258. Miller, G. A., Newman, E. B., & Friedman, E. A. Length-frequency statistics for written English. Information and Control, 1958, 1, 370-398. Miller, G. A., & Selfridge, J. A. Verbal context and the recall of meaningful material. Amer. J. Psychol., 1950, 63, 176-185. Newell, A., Shaw, J. C., & Simon, H. A. Report on a general problem-solving program. In Information Processing. Proc. International Conference on Information Processing, UNESCO, Paris, June 1959. Pp. 256-264. Newman, E. B. The pattern of vowels and consonants in various languages. Amer. /. Psychol., 1951, 64, 369-379. Pareto, V. Cours d'economie politique. Paris: 1897. REFERENCES 491 Quastler, H. (Ed.). Information theory in psychology. Glencoe, 111.: Free Press, 1955. Shannon, C. E. A mathematical theory of communication. Bell System Tech. /., 1948, 27, 379-423. Shannon, C. E. Prediction and entropy of printed English. Bell Syst. tech. /., 1951, 30, 50-64. Skinner, B. F. Verbal behavior. New York: Appleton-Century-Crofts, 1957. Smoke, K. L. Negative instances in concept learning. /. exp. PsychoL, 1933, 16, 583-588. Somers, H. H. The measurement of grammatical constraints. Language and Speech, 1961, 4, 150-156. Thorndike, E. L., & Lorge, I. The teachers word book of 30,000 words. New York: Bureau of Publications, Teachers College, Columbia University, 1944. Toda, M. 
Information-receiving behavior in man. Psychol. Rev., 1956, 63, 204-212. Wason, P. C. The processing of positive and negative information. Quart. J. exp. Psychol., 1959, 11, 92-107. Wason, P. C. Response to affirmative and negative binary statements. Brit. J. Psychol, 1961, 52, 133-142. Wiener, N. Cybernetics. New York: Wiley, 1948. Willis, J. C. Age and area. Cambridge: Cambridge Univer. Press, 1922. Yngve, V. H. A model and an hypothesis for language structure. Proc. Am. Phil. Soc., 1960, 104, 444-466. Yngve, V. H. The depth hypothesis. In R. Jakobson (Ed.), Structure of language and its mathematical aspect. Proc. 12th Symp. in App. Math. Providence, R. I. : American Mathematical Society, 1961. Pp. 130-138. Yule, G. U. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, FRS. Phil. Trans. Roy. Soc. (London), 1924, B 213, 21-87. Yule, G. U. The statistical study of literary vocabulary. London: Cambridge Univer. Press, 1944. Ziff, P. Semantic analysis. Ithaca: Cornell Univ. Press, 1960. Zipf, G. K. The psychobiology of language. Boston: Houghton-Mifflin, 1935. Zipf, G. K, Human behavior and the principle of least effort. Cambridge, Mass.: Addison-Wesley, 1949. 14 Mathematical Models of Social Interaction Anatol Rapoport University of A£ichigan 493 Contents 1. Interaction in Large Well-Mixed Populations 497 LI. The general equation, 498 1.2. The linear model, 499 1.3. The logistic model, 504 1.4. Time-dependent contagion, 505 1.5. Contagion with diffusion, 508 1.6. Nonconservative, nonlinear interaction models, 509 2. Statistical Aspects of Net Structures 512 2. 1 . The random net, 5 1 3 2.2. Biased nets, 515 2.3. Application of the overlapping clique model to an informa- tion-spread process, 519 2.4. Application of the biased net model to a large sociogram, 522 3. Structure of Small Groups 529 3.L Descriptive theory of small group structures, 531 3.2. The detection of cliques and other structural characteristics of small groups, 536 3.3. The theory of structural balance, 539 3.4. Dominance structures, 541 4. Psychoeconomics 546 4.1. A mathematical model of parasitism and symbiosis, 546 4.2. Bargaining, 549 4.3. Bilateral monopoly, 551 4.4. Formal experimental games, 556 5. Group Dynamics 562 5.1. A "classical" model of group dynamics, 563 • 5.2. A semiquantitative model, 565 5.3. Markov chain models, 567 References 576 494 Mathematical Models of Social Interaction In order to apply mathematical methods to the study of social interac- tions, it is obviously necessary to single out entities among which specifiable functional relations are assumed to exist. Attempts to do so have pro- ceeded along two distinct paths. One path parallels the methodological road of mathematical physics, where attention is focused on numerically measurable quantities and their rates of change. The mathematical tools appropriate to this method are those of classical analysis. The area to which such methods seem pertinent is that of large-scale social phenomena. When dealing with large masses of entities, it is natural to disregard the determinants governing the behavior of the individual elements in favor of gross statistically determined effects. Such gross effects are represented by functional relations among a few essentially continuous variables and usually their time derivatives, typically as differential equations. 
The other path, departing from the methods of classical analysis, is directed toward the consideration of sets of discrete entities and structural relations among them. This is the path followed by those interested in interactions among a small number of individuals, especially those based on relations in which the individuals stand to each other, or in interactions based on decisions governed by complex logical considerations. In such situations rather detailed structural descriptions are required. Here some of the "modern" developments of mathematics play an important part. Among examples of the tools used in this approach are the theory of linear graphs, in which binary relations instead of numerical variables are funda- mental; the associated matrix representations of the relational structures ; set theory, in which the fundamental entities are subsets of a given set, instead of the individual elements composing it; stochastic processes, in which probability distributions instead of just probabilistically deter- mined expected values are the objects of attention; and the mathematical theory of games, which studies in complete detail the logical structure of conflict situations. In this chapter I have selected a number of developments in mathe- matical theories of social interaction of both types, which seem to me either (1) typical, (2) interesting, or (3) representative of an attempt to link a mathematical theory to observation or experiment. 495 49$ MATHEMATICAL MODELS OF SOCIAL INTERACTION Typical methods are worth studying because they indicate some unifying theoretical principle. Therefore, regardless of whether the corresponding mathematical models have been found to be applicable in some specific instance, there is reason to suppose that similar models will be applicable in some situations sooner or later. In these models a great many idealized situations appear logically isomorphic. "Interesting" models have been included for their inspirational value. The inclusion of models in which some link has already been made between theory and observation corrob- orates, at least partly, the feasibility of using mathematical methods in constructing a science or social interaction. Our models can also be classified in a triple dichotomy, namely (1) static versus dynamic, (2) pertaining to large-scale versus small-scale social events, and (3) deterministic versus stochastic. We have confined ourselves to those theoretical developments that can be treated in the lan- guage of undergraduate mathematics, not going beyond the elements of probability theory and elementary differential equations. The mathe- matics of game theory, being a vast field of research in its own right, has not been included, although a mathematical treatment of some instructive gamelike situations has been presented where the underlying principles seemed central to our topic. Finally, attempts have been made, wherever possible, to point out the links that exist between the various approaches. For example, the classical mathematical approach to social interaction via systems of differential equations leads to the consideration of the phase space (cf. Sec. 1.6) which, in turn, leads to questions of stability of certain steady states. These questions are seen to have a relation to certain game-theoretical questions. Thus a transition between the distinctly "classical" and the characteristi- cally "modern" approaches can be discerned. 
Nor are the distinctions between the "dichotomies" as sharp as they appear on being named. The abstract mathematical model does not distinguish between the replication of an event in "space" or in "time," and so the same framework may at times fit a large population of persons or a large population of responses in a small group (cf. Sees. 1.2 and 5.1). The distinction between the large- scale static sociometric model and the large-scale dynamic contagion process (cf. Sees. 2.3 and 2.4) is likewise obscured by the similarity of the recursive formulas used in their respective treatments. These fusions are not surprising in view of the ubiquitous presence of logical isomorphisms among the conceptual models used in our day. Whether this frequent occurrence of similar formulations bespeaks an un derlying logical similarity of events or a comparative paucity of ideas remains for future generations to decide. INTERACTION IN LARGE WELL-MIXED POPULATIONS 497 1. INTERACTION IN LARGE WELL-MIXED POPULATIONS Intuitive observations and more systematic but hitherto, for the most part, purely empirical studies of mass behavior indicate that information and behavioral patterns often spread through populations by a contagion- like process. The occasional explosive spreads of rumors, fads, and panics attest to the underlying similarity between social diffusion and other diffusion and chain-reaction processes, such as epidemics, the spread of solvents through solutes, crystallization, dissemination of genes through an interbreeding population, etc. Accordingly, the mathematical behav- ioral scientist is motivated to seek mathematical models that can be supposed to underlie whole classes of processes of diverse content but of similar mathematical type. In all cases the model must postulate a population and a set of states in which each member of the population may find himself. If the number of individuals is large, the fraction or density of individuals in a given state can be taken as a continuous variable. In some situations passing from one state to another means suffering an increment or a decrement of some quantity; for example, biomass or displacement in space or assets. If this quantity is also continuous, the states themselves form a continuum. Otherwise there is only a finite (perhaps denumerably infinite) number of states — in the simplest case, just two. The number of individuals in the population may be constant or not. If it is not, we account for "sources" and "sinks" in the population, which allow increments (or decrements) of individuals in a certain state without compensating decrements (or increments) of individuals in other states. The states may be reversible or irreversible. For example, if the inter- action is contagion, in which a disease (or a piece of information) passes from one individual to another, the passage of individuals from the non- infected (or not-knowing) to the infected (or knowing) state may be as- sumed irreversible if the individuals are not expected to recover (or forget) during the time under consideration. If recovery without immunity does occur or if the contagion results in the spread of an attitude or a form of behavior which can also be abandoned, we are dealing with a reversible process. The analogy with chemical reactions is obvious. 
If there are more than two states, the concept of reversibility has a more general analogue in the concept of two-way connectedness among sets of states ; that is, the counterpart of reversibility in a multistate situation is the possibility of passing from any state to any other state (in general via other 4$8 MATHEMATICAL MODELS OF SOCIAL INTERACTION states). The analogue of irreversibility is the existence of a group of states into which it is possible to pass but from which it is impossible to escape, although it may be possible to pass from state to state among them. If there are subsets of states that are not connected to each other in either direction, we have essentially several independent processes, which can be treated separately. In this case there is no need to consider the entire process within the framework of a single system. As an example of a multivariate contagion system, consider a disease with time-limited infectiousness, in which the following states are dis- tinguished; uninfected, infected and contagious, infected noncontagious, recovered without immunity, recovered with immunity, dead. Some of these may be "absorbing states," in which the individuals once having entered will persist for the duration of the process, for example, "dead" or possibly "recovered with immunity." Therefore this process contains some irreversible "reactions." To construct a general model of a contagion process, it is necessary to list all the relevant states in which the members of the population may be and also to indicate the transition probabilities from one state to another. The event contributing to the probability of such a transition, typical for a contagion process, is contact between two individuals as a result of which one or both individuals pass into another state. However, it is possible to imagine also "spontaneous" changes of state, for example, from one stage of a disease to the next. Also when two individuals come into contact this may contribute to an increment of a state to which neither of the individuals belongs. Thus Richardson (1948), in his discussion of war moods, differentiates among several "psychological states" associated with people in peace and war times, such as "friendly," "hostile," "war-weary," "dead," and combinations of some of these. On the battlefield, contacts between two "hostile" individuals contribute to an increment of dead individuals, an example of contact between individuals in one state contributing to an increment of individuals in another. Likewise, we can imagine various degrees of two conflicting political opinions as the states. It is conceivable that contacts between two individuals of mild but opposite opinions contribute to increments of individuals with stronger opposite opinions because of mutual irritation or, on the contrary, to increments of individuals with intermediate opinions because of mutual influence. 1.1 The General Equation To account for a social interaction process of n states, in which time rates of increments to each state depend on (1) independent sources or INTERACTION IN LARGE WELL-MIXED POPULATIONS 49$ sinks, (2) spontaneous changes from state to state, and (3) changes of state occasioned by contacts, our model would have to be a system of nonhomo- geneous, first-order, second-degree differential equations of the following type: Ax n ^7 = Zot& +21 fc(&A + c,, (i = 1, 2, . . . «). (1) at 5=1 fr=i 3=1 Here xi represents the number (or fraction or density) of individuals in the zth state. 
The a_ij (i ≠ j) represent the rates of net absolute flow into or from the ith state from or to other states (due to concentrations of other states); a_ii represents the net reproduction (or dissipation) rate of the individuals in the ith state; b_ijk represents the rates of conversion to or from the ith state due to contacts between pairs of individuals; and c_i represents the sources or sinks.

Note that the increments due to contacts, as given by Eq. 1, depend only on the total numbers (or concentrations) of individuals in the different states; that is to say, the probability of contact between any two individuals from a pair of specified states is assumed to be the same for any pair of individuals in those states. This is the assumption of well-mixedness. Obviously this assumption cannot be made if the mobility of the population is restricted. In a real contagion, for example, the focus of contagion is at least temporarily geographically circumscribed, so that only those uninfected who are near the focus can be expected to become infected at that time. Hence the probability of new infections near the focus will depend on the concentration of individuals near the focus of infection and not on the over-all concentration. These complications are deliberately bypassed when the assumption of well-mixedness is made. In our discussion of contagion models we shall for the most part assume well-mixedness. Later we shall drop this assumption, and this will carry us to the consideration of some structural properties of social space (cf. Sec. 2.2). For the present we shall examine some important special cases of Eq. 1, which underlie various proposed models of social interaction.

1.2 The Linear Model

If all the b_ijk in the system described by Eq. 1 are zero, the system reduces to a linear one:

    dx_i/dt = Σ_{j=1}^{n} a_ij x_j + c_i,     (i = 1, 2, ..., n).     (2)

The general solution of such systems is known. The special cases, which result when certain restraints are imposed on the coefficients, can be described in qualitative terms. For example, under certain conditions, some of the x_i will be periodic (oscillatory) functions of time. These conditions are usually too special to be of interest in sociological applications. If the system is nonhomogeneous and nonsingular (i.e., if not all the c_i are zero and the determinant of the coefficients a_ij does not vanish), setting dx_i/dt equal to zero and solving the resulting system of linear algebraic equations yields an equilibrium point in the n-dimensional space, x_i = x_i* (i = 1, 2, ..., n). An important question then arises concerning the stability of the equilibrium. If it is stable, then an accidental fluctuation from it in the values of the x_i tends to be "corrected," that is, the system returns to the equilibrium state. If the equilibrium is unstable, an accidental increment in some of the x_i will tend to be magnified, carrying some of the variables to infinity of either sign (or to zero if only positive values are meaningful).

If there are only two variables, Eq. 2 reduces to

    dx/dt = a_11 x + a_12 y + c_1,
    dy/dt = a_21 x + a_22 y + c_2.

This model has been used by Richardson (1939) to represent an idealized arms race between two states or alliances and by Rashevsky (1939) to represent the simplest case of mass behavior based on mutual imitation. In Richardson's model x and y represent, respectively, the armament expenditures of two rival states.
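Before turning to Richardson's particular sign assumptions, it may help to see how the equilibrium and its stability can be checked numerically for a system of the form of Eq. 2. The sketch below uses invented coefficients; it rests only on the standard facts that the equilibrium solves the linear system A x* = -c and that the equilibrium is stable when every eigenvalue of the coefficient matrix has a negative real part.

    import numpy as np

    # dx/dt = A x + c (Eq. 2 in matrix form); the coefficient values here are
    # invented for illustration.
    A = np.array([[-1.0,  0.5],
                  [ 0.7, -1.2]])
    c = np.array([0.3, 0.4])

    x_star = np.linalg.solve(A, -c)               # equilibrium: A x* + c = 0
    eigenvalues = np.linalg.eigvals(A)
    stable = bool(np.all(eigenvalues.real < 0))   # stable iff all real parts < 0
    print("equilibrium:", x_star)
    print("eigenvalues:", eigenvalues, "stable:", stable)

    # For two variables with a11 = -a, a22 = -b, a12 = m, a21 = n (a, b > 0),
    # the eigenvalue condition reduces to ab - mn > 0, the criterion obtained
    # just below for Richardson's arms-race model.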
He assumes that increases in the armament expenditures of each state are stimulated by the armament expenditures of the rival. Hence a_12 > 0, a_21 > 0. He further assumes that the rate of increase of armament expenditures is inhibited by the level of one's own expenditures (as the burden increases). Hence a_11 < 0, a_22 < 0. Finally, the constant terms represent positive or negative stimulation to armament expenditures independent of the expenditure levels. These are the "grievances," if positive, or reservoirs of "good will" if negative. We can therefore write

    dx/dt = -ax + my + g,
    dy/dt = nx - by + h.

Here a, b, m, and n are positive. The equilibrium point (assuming ab - mn ≠ 0) is obtained by setting dx/dt = dy/dt = 0 and solving the resulting algebraic equations. The position of the equilibrium therefore depends on all the coefficients. The stability of the equilibrium, on the other hand, depends only on the sign of ab - mn. It is easy to show that the equilibrium is stable if and only if ab - mn > 0, that is, if the product of the self-restraint coefficients is greater than that of the mutual-stimulation coefficients. If the equilibrium is unstable, the armament expenditures will either increase without bound (a runaway arms race or a war in Richardson's interpretation) or, if the expenditures are sufficiently low initially, tend to be reduced still further until complete disarmament is achieved.

Obviously the model is much too simple-minded to be of use in the analysis of actual international behavior. Still Richardson ventured to apply it to the description of some arms races, notably to that preceding World War I (1909-1914). Giving the coefficients certain values and solving the system of differential equations, he obtained the time course of the combined armament expenditures of the two rival camps, which fitted the data very well. The coefficients were such that the system was inherently unstable. Hence its fate was determined by the initial conditions. In Richardson's interpretation these initial conditions had to do with the difference between armament expenditures and the volume of trade between the two blocks of states. It appears that, had the volume of trade been greater by £5 million or the level of armament expenditures correspondingly lower, the "reaction" would have gone in the opposite direction, that is, toward disarmament and increasing cooperation (trade). These conclusions are not easy to take seriously. Nor is the agreement between theoretically derived and observed armament expenditures impressive, in view of the small number of points fitted by the equations. Still the approach is noteworthy as an early example of a method for dealing with large-scale social interactions, which may, under certain conditions, find application. Moreover, the method has unquestionable heuristic value in that it serves as a framework in which more sophisticated approaches can be developed.

Rashevsky's model of mass behavior (1939) is based on influences mutually exerted by two classes of individuals, X and Y, representing, respectively, two different patterns of behavior or attitudes R1 and R2, for example, allegiance to two different political parties. In each class there is a fixed number of "actives" (x_0, y_0) who are immune to change. The remaining individuals (x, y) are "passives," subject to influence by both the actives and the passives.
Accordingly, if TV is the total population, x + y = 0, a constant, and z0 + 2/o = N — 0, Rashevsky's linear model becomes ax — = ax — my + g, *a = _wa. + by + h, dt 502 MATHEMATICAL MODELS OF SOCIAL INTERACTION analogous to Richardson's except that the signs of a, b, m, and n are reversed. The constants g and /z, as in Richardson's model, can have either sign. They represent the net (constant) influence of the "actives." Substituting 6 — x for y, we get dx — = (a + m)x-mO + g, (6) at whose solution in terms of the initial condition x(G) is a + m where r = m6 — g. The fate of #, therefore, depends crucially on x(0). If x(Q) > r\(a + m), x(t) will increase until the whole population will turn to R^ If the in- equality is reversed, the opposite will happen. The equilibrium at x = r/(a + rn)\ y = 6 — x = [(a + m)6 — r]j(a + m) is unstable. In a somewhat more involved model Rashevsky (1951) assumed that each individual possesses some inherent tendency to behave one way or the other. The magnitude of this tendency is denoted by a quantity , which is positive if the individual prefers RI and negative, otherwise. We have, then a distribution of , N() in the population, so that N((f)) d between and + d. Rashevsky assumed that N() is Laplacian, that is, N(fi - itfoCHfl*'. (8) He assumed also that <£ fluctuates randomly in an individual and that the time distribution of its value is also Laplacian but with a different disper- sion constant k instead of cr. The probability that an individual will perform R± at a given time de- pends on the magnitude of at that time and also on a magnitude of another "propensity," ^, contributed by the tendency to imitate others. Specifically, assuming y > 0 (i.e., the net imitation influence is toward Rj) the probabilty that R^ will be performed is given by if <£ > -y, The expressions for ^ < 0 are analogous. In the remaining discussion y> is assumed to be positive. The total numbers of individuals X and Y performing R± and R2, respectively, at a given moment are r (10) " [i- J-oo INTERACTION IN LARGE WELL-MIXED POPULATIONS JOj? It remains to postulate the equation that determines y as a function of X and Y. Rashevsky takes this to be ^ = A(X - 7) - a% (11) at where A and a are constants. In other words, the increase of y is enhanced by the excess of individuals performing RI9 and yj also "decays" pro- portionately to its own magnitude.1 Combining Eqs. 10 and 11, we obtain Although the equation is nonlinear, its variables can be separated and so an explicit solution can be obtained for t as a quadrature in terms of a function of yj. From this solution X and Y can be obtained as functions of time.2 In view of the practical impossibility of making the sort of observations required to check this theory, the explicit dynamic solution is of little interest. However, certain general qualitative conclusions are suggestive. For example, Eq. 12 implies an equilibrium at y = 0. This equilibrium is stable if and only if . cr + k Now the expression ka](a + k) is the reciprocal of I/or + I/A:. Hence kal(k + a) is greater, the greater the sum of the reciprocals of the dis- persions, which refer, respectively, to the nonhomogeneity of the popula- tion with respect to the inherent preference and to the nonstereotypy of individuals characterized by a certain preference intensity in performing the preferred act. Combining these interpretations, we have the following qualitative result. 
The more homogeneous the population and the more stereotyped the behavior of its members, the more likely the instability at ψ = 0 (equal frequency of acts of both kinds), hence the easier it is to swing the population to the predominant performance of one or the other act.

¹ This form of equation has been used extensively by Rashevsky and his co-workers to describe the rate of increase of excitation produced by an external stimulus. A positive contribution is assumed to be proportional to the magnitude of the stimulus (in this case the size of the majority performing R₁), whereas a negative contribution results in the dissipation of the excitation at a rate proportional to its own magnitude.

² Although Eq. 12, being nonlinear, formally excludes the model just described from the class of linear models, we have included it as a variant in view of the linear dependence of dψ/dt on X and Y (cf. Eq. 11). Under nonlinear models we have understood those in which increments to subpopulations in the various states depend on products of the numbers of individuals in pairs of states, that is, presumably on the frequency of contacts. These models are treated in Secs. 1.3 to 1.6.

Interpreting the quantity a/(AN₀), we find that it is directly proportional to the decay constant a and inversely proportional to the imitation propensity A and to the absolute size of the population N₀. The corresponding result with reference to a and A is obvious. The interesting result is with reference to N₀, namely, the larger the population, the more likely it is to be swung to the predominance of acts of the one or the other kind. This is the mob effect.

For further elaborations and generalizations of the model, in particular involving asymmetrical distributions of preferences, the interested reader is referred to Rashevsky (1951, 1957).

1.3 The Logistic Model

The previous model departs from linearity but not in the fundamental sense of the general contagion equation (1). The essential feature of this equation is that the variables representing fractions of the population in the several states appear in the second degree. This means that frequencies of contacts among the members of the subpopulations contribute to their rates of change. To use a chemical analogy, the foregoing models are tantamount to assumptions that the various "substances" (subpopulations) are produced from all-pervading substrates. The contagion assumptions, on the other hand, imply that substances are produced or destroyed only in interactions with each other. The general Eq. 1, of course, combines both assumptions.

The simplest special case of Eq. 1 involving second-degree terms is the equation of simple contagion, the so-called logistic equation:

\frac{dx}{dt} = Bx(y - \theta'); \qquad x + y = 1.  (14)

Here x is the fraction of the infected, y the fraction of the uninfected, and θ' the fraction of the permanently immune. The rate of increment is proportional to the frequency of contacts between the infected and the susceptible (nonimmune) uninfected. The solution of Eq. 14 is given by

x(t) = \frac{\theta C e^{B\theta t}}{1 + C e^{B\theta t}}, \qquad C = \frac{x_0}{\theta - x_0},  (15)

where θ = 1 - θ'. If x₀, the initial fraction of infected, is small but finite, the time course is a sigmoid curve tending to θ; that is, eventually all except the immune will be infected.

1.4 Time-Dependent Contagion

The assumptions underlying the logistic model are quite strong. The population is assumed to be well mixed, and all the infected are assumed to remain infected and infectious.
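Before these assumptions are relaxed, the closed form of Eq. 15 can be checked against a direct numerical integration of Eq. 14. The following sketch is illustrative only; the values of B, θ', and x₀ are hypothetical.

```python
import numpy as np

# Logistic contagion (Eq. 14): dx/dt = B*x*(theta - x), with theta = 1 - theta'
# the fraction that is not permanently immune.  Values are illustrative.
B, theta, x0 = 0.5, 0.8, 0.01

def x_closed_form(t):
    """Closed-form solution of Eq. 15."""
    C = x0 / (theta - x0)
    return theta * C * np.exp(B * theta * t) / (1.0 + C * np.exp(B * theta * t))

# Crude Euler integration of the differential form as a cross-check.
t_grid = np.linspace(0.0, 40.0, 4001)
x = x0
for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
    x += (t1 - t0) * B * x * (theta - x)

print("closed form  x(40) =", x_closed_form(40.0))
print("Euler check  x(40) =", x)   # both approach theta = 0.8
```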
For the time being we shall keep the assumption of well-mixedness but relax the other. The infectiousness of the infected will now be assumed to be a function of time, namely, both of the over-all duration of the process and of the time elapsed since the particular individual became infected. The first dependency reflects the changing "potency" of the process in time (e.g., the virility of the infecting organism); the second dependency reflects the well-known variation of infectiousness during the course of a disease and possibly the removal of infectious individuals by recovery, isolation, or death. The same considerations apply to the spread of information. For example, the newsworthiness of an item of information may be a function of its age, and the tendency to transmit it may be a function of how long ago it was received.

Let t represent time measured from the start of the process and τ the time measured from the moment of infection of each individual. We shall call p(t, τ) the probability that on contact at time t an individual infected at time t - τ will transmit the infection to an uninfected individual. If no one is immune, the contagion equation now assumes the following more general form:

\frac{dx}{dt} = A(1 - x)\left[x_0\, p(t, t) + \int_0^t \frac{dx}{d\lambda}\, p(t, t - \lambda)\, d\lambda\right],  (16)

where x is the fraction of the infected. If p(t, τ) is a constant, Eq. 16 reduces to a simple logistic. Two other special cases are of interest, namely, (1) when p is a function of t alone and (2) when p is a function of τ alone.

The first case can be solved in general. The solution is given by

x(t) = \frac{C \exp\left[A\int_0^t p(\xi)\, d\xi\right]}{1 + C \exp\left[A\int_0^t p(\xi)\, d\xi\right]},  (17)

where x₀ = C/(1 + C). Equation 17 has the same form as Eq. 15, except that the exponent Bθt of Eq. 15 now appears as an integral. If

\int_0^\infty p(\xi)\, d\xi = b,

that is, finite, the ultimate fraction of the infected will be

x_\infty = \frac{Ce^{Ab}}{1 + Ce^{Ab}}.

If, on the other hand, the integral diverges, the ultimate fraction will be unity, that is, everyone will succumb.

If p(t, τ) depends on τ alone, the general solution is obtainable in closed form in some special cases. The case in which p(τ) = e^{-kτ} is of interest because it is formally identical to the case in which the infected individuals are removed from the population at random at a constant rate per infected individual. In that case Eq. 16 reduces to

\frac{dx}{dt} = A(1 - x)\left[x_0 e^{-kt} + \int_0^t \frac{dx}{d\lambda}\, e^{-k(t - \lambda)}\, d\lambda\right].  (18)

This leads, after appropriate manipulations, to

\frac{dx}{dt} = (1 - x)\left\{Ax + k\log\frac{1 - x}{1 - x_0}\right\},  (19)

or

\frac{dx}{(1 - x)\{Ax + k\log[(1 - x)/(1 - x_0)]\}} = dt.  (20)

The solution gives t as a quadrature in x:

t = \int_{x_0}^{x} \frac{d\xi}{(1 - \xi)\{A\xi + k\log[(1 - \xi)/(1 - x_0)]\}}.  (21)

Again we are interested in the value of x(∞) as a function of the parameters x₀, A, and k. Clearly, x(∞) is the smaller root of the denominator of the integrand in Eq. 21. By the nature of the problem this is not greater than unity. It is therefore the root of the transcendental equation

k\log\frac{1 - x}{1 - x_0} + Ax = 0.  (22)

Taking exponentials of both sides of Eq. 22, we find that the asymptotic value x* = x(∞) must satisfy the equation

x^* = 1 - (1 - x_0)e^{-Ax^*/k}.  (23)

If the initial number of the infected forms an infinitesimal fraction of the population, we may set x₀ = 0 and obtain the equation derived by Solomonoff and Rapoport (1951) and independently by Landau (1952) for the "connectivity" of a random net with axon density a. The meanings of this parameter, of the random net model, and of its generalizations are discussed in Sec. 2.1.
The importance of Eq. 23 is that it holds no matter what the form of p(τ) is, as long as the function governing the probability of transmission depends on τ alone and not on t, that is, it depends only on how long an infected individual has been infected and not on how long the epidemic has been going on. In that case the total fraction of the population that will have succumbed depends only on the average number of individuals ever infected by each infected individual and not on the times when he has transmitted his infection.

Fig. 1. Ultimate fraction of infected individuals (x*) as a function of the average number of individuals ever infected by each individual (a) for different values of the original fraction of the infected (x₀).

The dependence of x* on a = A/k and on x₀ is shown in Fig. 1. The curve corresponding to x₀ = 0 must, of course, be interpreted as the limiting curve of a sequence in which x₀ tends to zero. (Obviously, there will be no epidemic if no one has ever been infected.) It is interesting to observe that if a = 2, that is, if on the average each infected individual can infect two others, then for an arbitrarily small x₀ eventually about 80 per cent of the population will succumb.

Equation 23 is also closely related to the result obtained by D. G. Kendall for an epidemic spreading over a geographical area with constant relative rate of removal of the infected (Bailey, 1957). The ultimate fraction removed, y, satisfies the equation

y = 1 - e^{-(\sigma/\rho)y}.  (24)

Here σ is an infection rate constant dependent on the population density, and ρ is the ratio of the removal rate (per infected individual) to the infection rate (per contact per individual). Equation 24 is formally identical to Eq. 23 if x₀ in (23) is neglected. The equation has a root at zero and one other positive root. It implies the following conclusions pertinent to Kendall's theory of pandemics:

1. A pandemic will occur if and only if ρ < σ.
2. If a pandemic does occur, a fraction at least as great as y (which depends on σ/ρ) will be ultimately infected arbitrarily far from the focal point of the epidemic.

1.5 Contagion with Diffusion

In all the interaction models so far (except Kendall's pandemic, whose derivation we have not discussed) a "well-mixed" population was always assumed. If this well-mixedness assumption is dropped, the interaction problem becomes much more difficult. For example, in contagion models we must take into consideration, in addition to the spread of state due to contacts between individuals, the diffusion of the infected individuals through the population. The assumption of well-mixedness means that there is no restraint on mobility, hence that the diffusion is instantaneous. It is as if the infected individuals became so rapidly mixed throughout the population that their density was always constant everywhere. This is the justification for assuming the transmission to be equally probable between any pair of individuals. In the foregoing generalization (cf. Sec. 1.4) we introduced an additional probability, namely, that the state will be transmitted if contact does occur, this probability being dependent on time but not on space variables. Abandoning well-mixedness, we introduce a dependence on space variables. We can write in general, assuming a three-dimensional diffusion space,
\frac{\partial c}{\partial t} = D\nabla^2 c + Q(c, w).  (25)

Here c is the concentration of infected individuals, the first term on the right governs the diffusion of these individuals throughout space, and the second term governs the contagion process. In particular, if the contagion probability depends only on the concentration of the infected explicitly and in the elementary way of the logistic process, Eq. 25 becomes

\frac{\partial c}{\partial t} = D\nabla^2 c + \alpha c(\beta - c),  (26)

where c is, of course, a function of all four independent variables, and α and β are constants.

Landahl (1957) attacked this equation by approximation methods developed previously in problems involving nonconservative diffusion. The interested reader is referred to his paper for further development of this topic.

1.6 Nonconservative, Nonlinear Interaction Models

All of the interaction models so far considered except Richardson's were conservative in the sense that either the total population was constant or, in the case of an infinite population spread over an infinite area, the density of the population was constant in time. These assumptions imply that increases in numbers or in densities of individuals in some states are always compensated for by corresponding decreases in other states. These assumptions cannot be made, therefore, where sources or sinks, for example, birth and death rates, are an integral part of the interaction process.

As the simplest example of a nonlinear, nonconservative interaction process between two subpopulations X and Y, we shall consider the following pair of equations:

\frac{dx}{dt} = A_1 x + B_1 x^2 + C_1 xy, \qquad \frac{dy}{dt} = A_2 y + B_2 y^2 + C_2 xy.  (27)

The coefficients are to be interpreted as follows. The A's are net "birth" or "death" rates; the B's are (positive or negative) contributions due to contacts between individuals in the same state; the C's are contributions (positive or negative) due to contacts between individuals in opposite states. By giving different signs to these coefficients, we can describe various models qualitatively different from one another. For example, if A₁ < 0, B₁ > 0, C₂ > 0, C₁ = -C₂, we see that members of the X-population tend to disappear if left to themselves, to generate more of their own kind after contacts with their own kind, and to be "converted" to the Y-population by contact with it. Any combination of the six coefficients can be interpreted in a similar manner.

Of particular interest are certain special cases which have been given a biological interpretation. For example, let the X-population consist of predators that feed on the Y-population, which, in turn, feeds on an unlimited food supply. In that case we must have A₁ < 0 (the predators cannot survive in the absence of their prey); C₁ > 0 (the biomass of the predators increases with the number of contacts with the prey); A₂ > 0 (the prey multiplies in the absence of the predators); C₂ < 0 (the biomass of the prey decreases on contact with predators, as the prey is eaten by them). The signs of B₁ and B₂ depend on whether contacts between members of the same species enhance or inhibit the growth of the respective biomasses or populations. Such equations have been treated by Volterra (1931), Kostitzin (1937), and other biomathematicians.

Another interesting case represents a competition between two species, each of which could exist in a given environment without the other. Here the A's are positive, and all the other coefficients are negative.
The B's represent inhibition to population growth due to intraspecies crowding, whereas the C's represent inhibition due to interspecies crowding.³

³ "Crowding" is measured by frequency of contacts, which is proportional to the products (or squares) of the densities.

Treatments of such nonlinear systems are often confined to the examination of their statics, that is, the equilibrium conditions. Both derivatives of Eq. 27 are set equal to zero, and the expressions on the right are factored. Assigning signs to the coefficients appropriate to the competition model, we now have

x(A_1 - B_1 x - C_1 y) = 0, \qquad y(A_2 - B_2 y - C_2 x) = 0.  (28)

We see that x = y = 0 is a trivial equilibrium point. A nontrivial equilibrium is found at the intersection of the two straight lines whose equations are the expressions in parentheses of Eq. 28 set equal to zero. The equilibrium is biologically meaningful if the intersection is in the first quadrant. It is stable if certain additional conditions are satisfied by the coefficients. Carrying the analysis through, we have the following results on the statics of the system representing competition between two species.

1. The equilibrium point will be in the first quadrant (i.e., biologically meaningful) if and only if either B₂/C₁ > A₂/A₁ > C₂/B₁ or C₂/B₁ > A₂/A₁ > B₂/C₁.
2. If a biologically meaningful equilibrium exists, it will be stable if and only if B₁B₂ > C₁C₂, that is, if the product of the self-restraint coefficients is greater than the product of the other's-restraint coefficients.⁴

⁴ Note the similarity of this condition to that in Richardson's arms-race model (Sec. 1.2).

If the two straight lines determined by Eq. 28 do not intersect in the first quadrant, there will be no meaningful equilibrium. One of the lines will lie entirely above the other in the first quadrant, and the species whose line is farthest from the origin will be the sole survivor. The conditions for the survival of X are A₁/C₁ > A₂/B₂ and A₁/B₁ > A₂/C₂, and, of course, the conditions for the survival of Y are just the reverse. Figure 2 represents the entire situation graphically.

Fig. 2. Cases of interspecies competition between populations N₁ and N₂ (after Gause): (a) stable equilibrium, coexistence; (b) unstable equilibrium, only N₁ or N₂ will survive, depending on initial conditions; (c) only N₁ will survive; (d) only N₂ will survive.

Whatever biological or social interpretation is made of similar models, the investigation of consequences always follows the same method.⁵

⁵ For further discussion of the statics of such systems see Gause (1934, 1935) and Slobodkin (1958). For extensions to stochastic models see Neyman, Park, & Scott (1955).

In the absence of explicit dynamic solutions (time courses of the dependent variables), which are difficult or impossible to obtain in the general (nonnumerical) case, the investigation is confined largely to the examination of the "phase space," that is, the space determined by the dependent variables. At each point of this phase space we can calculate dx/dt and dy/dt. These derivatives determine the direction of motion of the point (x, y) in the phase space. Moreover, the quantity [(dx/dt)² + (dy/dt)²]^{1/2} gives the "speed of motion" of this point. We can therefore draw a vector at each point in the phase space.
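The statics just described can be reproduced in a few lines of code: locate the intersection of the two isoclines of Eq. 28, test whether it lies in the first quadrant, apply the stability condition B₁B₂ > C₁C₂, and sample the vector field (dx/dt, dy/dt) at a few points of the phase space. All coefficient values below are hypothetical.

```python
import numpy as np

# Competition model (Eq. 27 with the competition sign pattern):
#   dx/dt = x*(A1 - B1*x - C1*y),   dy/dt = y*(A2 - B2*y - C2*x).
A1, B1, C1 = 1.0, 1.0, 0.5
A2, B2, C2 = 1.0, 1.0, 0.5

# Nontrivial equilibrium: intersection of the two isoclines (Eq. 28).
M = np.array([[B1, C1],
              [C2, B2]])
eq = np.linalg.solve(M, np.array([A1, A2]))
meaningful = bool(np.all(eq > 0))             # in the first quadrant
stable = meaningful and (B1 * B2 > C1 * C2)   # self-restraint outweighs mutual restraint

# Direction of motion at a few points of the phase space (the vector field).
for x, y in [(0.2, 0.2), (1.0, 0.2), (0.2, 1.0)]:
    dx = x * (A1 - B1 * x - C1 * y)
    dy = y * (A2 - B2 * y - C2 * x)
    print(f"at ({x}, {y}): dx/dt = {dx:+.3f}, dy/dt = {dy:+.3f}")

print("equilibrium:", eq, "meaningful:", meaningful, "stable:", stable)
```

For the values chosen the equilibrium lies at (2/3, 2/3) and is stable, the case of coexistence (panel a of Fig. 2); reversing the relative sizes of the B's and C's produces the unstable case.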
Important properties of the system can be determined by investigating the nature of the resulting vector field. In Sec. 4.1 we shall examine another such system in a different context, which will serve as a conceptual link between the mathematical models related to population dynamics and those related to mathematical economics and decision theory.

2. STATISTICAL ASPECTS OF NET STRUCTURES

In our treatment of mathematical theories of contagion, which can also serve as mathematical models of many other kinds of social interactions, we have kept the assumption of well-mixedness throughout, with only a single reference to work in which a diffusion term has been included (Landahl, 1957). We may now ask what happens if there is no diffusion at all, that is, when everyone stays put and interacts only with his "neighbors." Evidently, in this case, the social interaction process will depend on how many neighbors each individual has. We have seen an analogue to the limited number of neighbors in the constant A/k of our contagion models described on p. 507, this constant being the number of individuals ever infected by an infected individual. However, aside from the finite size of this number, there was no further limitation. It was given how many "neighbors" (contacts) each individual could have but not who they were. The question "Who are the neighbors?" does not refer here to their intrinsic properties but to their relations to their neighbors. In a well-mixed population I have always assumed that the individuals who were my contacts' contacts would with equal probability be any individuals in the population. With the introduction of the "neighbor" concept, this assumption is no longer tenable. For one thing, it is natural to endow the relation "neighbor" with a symmetrical property: I am one of my neighbors' neighbors. But, if so, then I am certain to be found among the set of individuals who are my neighbors' neighbors, in contrast to the equiprobability assumed for my contacts' contacts in previous models.

These considerations lead us to the study of net structure. The branch of mathematics which deals with such questions rigorously is the theory of linear graphs. We shall examine some graph-theoretical treatments of social structure in Sec. 3.1. Since we are for the moment dealing with large populations, for which it is out of the question to list all the relations among the entities, graph theory will not be of much help. We shall resort instead to examining some gross statistical properties of nets.

2.1 The Random Net

Our point of departure will be to determine certain statistical properties of a so-called random net, defined as follows. Let a certain fixed number a of directed line segments issue from each node of the net. Let each of these line segments, which we shall call "axones" to bring to mind the analogy with neural nets, connect at random to any of the other nodes. To define a "connection at random," we imagine a chance device with N equiprobable states, where N is the number of nodes in the net. We take each of the aN axones in turn and determine its "target," that is, the node on which it terminates, by the chance device. The resulting net we call a random net.⁶

We now seek a mathematical description of some of the properties of such a net. Our first concern is with the results of a certain "tracing procedure." Start with an arbitrary number of randomly selected nodes, this number being small compared with N.
Call the set of nodes that are targets of the axones of this initial set, excluding the initial set itself, the targets of the first remove. Call all the nodes that are targets of the axones from the targets of the first remove, excluding the targets of the first remove and the initial set, the targets of the second remove, etc. Let p₀, p₁, p₂, etc., be the corresponding population fractions. We seek a recursion formula for the successive p_t (t = 1, 2, . . .).

Select an arbitrary node and consider the probability that it belongs to the targets of the (t + 1)th remove. This is a product of two probabilities, namely, (1) that the node in question does not belong to any of the targets of previous removes and (2) that one of the axones from the targets of the tth remove does connect with the node in question. The product of the probabilities is justified by the fact that the two probabilities are independent, since all of the connections are determined by the chance device without any reference to the state of the tracing procedure. Moreover, the probability that the node in question does not belong to the targets of all the removes before the (t + 1)th is the sum of the component probabilities, since the sets defining the targets of the successive removes are mutually exclusive. We can therefore write

p_{t+1} = \left(1 - \sum_{j=0}^{t} p_j\right)\left[1 - \left(1 - \frac{1}{N}\right)^{aNp_t}\right].  (29)

⁶ In a more general model the number of axones per node, a, can itself be a random variable. This generalization does not modify the principal results of our model, and we shall not make it.

The factor in the brackets is well approximated, for large N, by 1 - e^{-ap_t}, and Eq. 29 can be simplified to

p_{t+1} = \left(1 - \sum_{j=0}^{t} p_j\right)\left(1 - e^{-ap_t}\right),  (30)

which is the recursion formula required. If we denote by x_t the fraction of nodes contacted by the tth remove, that is, x_t = Σ_{j=0}^{t} p_j, we can write Eq. 30, after appropriate rearrangements, as

(1 - x_{t+1})e^{ax_t} = (1 - x_t)e^{ax_{t-1}}.  (31)

Since the equality holds for all values of t, the equated expressions must be independent of t, that is, equal to a constant. The constant can be evaluated by setting x₀ = p₀, and we obtain the equation which determines γ = x(∞) as a function of p₀ and a, namely,

\gamma = 1 - (1 - p_0)e^{-a\gamma},  (32)

which is formally identical with Eq. 23.

Let us solve for a in terms of the p_t in Eq. 30. Since the p_t (the expected values of the fractions representing the magnitudes of the sets of targets of the successive removes) are completely determined by the "average" tracing procedure, it follows that the expression representing a will be a function of t alone. Formally, a is a function of t, although it is a constant by definition, and so must be independent of t. To indicate this (fictional) dependence on t, let us designate a by α(t):

\alpha(t) = -\frac{1}{p_t}\log\left(1 - \frac{p_{t+1}}{1 - x_t}\right).  (33)

2.2 Biased Nets

Let us see how we would expect α(t) to behave if biases of a certain kind were operating in the structure of a net. In a real net we may expect some sort of a distance bias to determine the actual connections. If the net is a net of social relations, say the acquaintance relation, we can imagine that the nodes are immersed in some sort of "social space." The topology of this space is by no means easy to discern, but we feel intuitively that neighborhoods can be defined in it. In particular, suppose the social space resembles geographical space, since it is certainly partially determined by it. We can then expect our neighborhoods to resemble geographical neighborhoods to some extent.
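Before the specific biases are considered, it is worth noting that Eqs. 30 and 32 lend themselves to a direct numerical check: iterate the tracing recursion from a small initial fraction p₀ and compare the accumulated fraction with the root of Eq. 32. The sketch below is illustrative; the values of a and p₀ are assumed, with a = 2 chosen to echo the remark of Sec. 1.4 that roughly 80 per cent of the population is eventually reached.

```python
import numpy as np

a, p0 = 2.0, 0.001   # assumed axon density and initial fraction

# Iterate the tracing recursion (Eq. 30):
#   p_{t+1} = (1 - sum_j p_j) * (1 - exp(-a * p_t)).
p, total = p0, p0
for _ in range(200):
    p = (1.0 - total) * (1.0 - np.exp(-a * p))
    total += p

# Fixed point of Eq. 32 (formally identical to Eq. 23 of the contagion model),
# obtained here by simple fixed-point iteration.
gamma = 0.5
for _ in range(200):
    gamma = 1.0 - (1.0 - p0) * np.exp(-a * gamma)

print("fraction reached by the tracing:", total)
print("root of Eq. 32:                 ", gamma)
# Both values are close to 0.80 for a = 2 and small p0.
```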
Apart from geography, we expect certain symmetry biases and certain transitivity biases to operate. A symmetry bias makes itself felt in the fact that if an axone from node A terminates on node B (say A knows B), the probability that an axone from B will terminate on A (B knows A) is greater than the a priori probability. A transitivity bias operates if, whenever an axone from A terminates on B and an axone from B terminates on C, the probability that another axone from A will terminate on C is greater than the a priori probability. (If A knows B and B knows C, it is likely that A knows C.) Combining these two biases, we have the circularity bias: whenever an axone from A terminates on B and an axone from B on C, it is likely that an axone from C will terminate on A.

All these biases ensure that the number of targets in the second remove will be smaller than expected in a random net. The number of targets in the first remove is not affected by the biases because the initial nodes have presumably been chosen randomly; hence the targets will be determined randomly regardless of what biases operate in the population. The targets of the second remove are the "friends of friends" (assuming "friendship" as the net relationship) of the initial nodes. Thus axones from the first remove are more likely to "converge" on certain targets (common friends) in the second remove. There will be more "hits" per target hit, and consequently fewer targets will be hit than expected on a random basis. It follows that if a axones are traced at each remove we shall have α(0) = a (since the biases do not disturb the random selection of the targets of the first remove), but α(1) will be smaller than a by our previous considerations. The drop in the value of α from α(0) to α(1) provides us with a rough index of "cliquishness" or "tightness" or "compactness" of the social space in which the net under consideration is submerged. Note that the vaguely descriptive terms "cliquishness," etc., have not been exactly defined here except as the corresponding property manifests itself in the reduction of α. This is in accord with our investigation of only the gross properties of biased nets.

In particular, we may investigate the expected behavior of α(t) for some special kinds of social spaces. Let us define an individual's acquaintance circle as the set of individuals from whom his sociometric choices are to be made or who are likely to choose him. We shall take the number in this set to be constant for all individuals and shall denote it by q (q > a). The question how these q acquaintances were chosen from the population now arises. If they were chosen randomly from the population, no essential modification would be introduced, even if q < N, so long as q ≫ 1, as we shall now show. We have, in fact, instead of Eq. 29,

p_{t+1} = \left(1 - \sum_{j=0}^{t} p_j\right)\left[1 - \left(1 - \frac{1}{q}\right)^{aqp_t}\right].  (34)

If q ≫ 1, the last parenthesis can still be approximated by e^{-ap_t}, and so Eq. 34 reduces to Eq. 30. If q, although large compared to a, is small compared to N, we can reason along a different line, which will bring us to essentially the same result but will also provide a point of departure for introducing a bias.

Let us fix our attention on an arbitrary individual X at the tracing of the (t + 1)st remove. We seek the probability that on that remove X was not chosen by an arbitrary but definite individual A from among his q acquaintances.
This can happen in either of two mutually exclusive ways: either A himself was not chosen on the tth remove, or he was chosen on the tth remove but his own choices did not include X. The probability we seek is the sum of the probabilities of these two events, that is, 1 - p_t + p_t(1 - 1/q)^a. Now if all the states of the acquaintances of X are independent of one another and if q is small compared to the total population, so that sampling with replacement can be assumed for any sample of individuals not greater than q, then the probability that X was not chosen on the (t + 1)st remove, that is, the probability that he was not chosen by any of his q acquaintances (the only individuals who could choose him), will be given by

\left[1 - p_t + p_t\left(1 - \frac{1}{q}\right)^a\right]^q = (1 - p_t m)^q.  (35)

Therefore, the probability that X was chosen on the (t + 1)st remove is

1 - (1 - p_t m)^q,  (36)

where we have written m for 1 - (1 - 1/q)^a. Expression 36 corresponds to 1 - e^{-ap_t} in Eq. 30. We therefore write for our modified expression, representing the recursion formula of the tracing,

p_{t+1} = (1 - x_t)\left[1 - (1 - p_t m)^q\right].  (37)

Solving for α(t), as defined by Eq. 33, we have

\alpha(t) = -\frac{q}{p_t}\log(1 - p_t m).  (38)

If q is large and a is small, m ≪ 1 and a fortiori p_t m ≪ 1, so that we can approximate the logarithm by -p_t m and α(t) by qm. For large q this is approximately a. Thus the introduction of a finite but sufficiently large acquaintance circle, which limits the set of individuals who can choose a given individual or be chosen by him, does not lead to an appreciable modification of the result. However, this approach lends itself to the imposition of a sociostructural bias, which we now discuss.

Until now the assumption underlying the whole argument has been that the probabilities 1 - p_t m, namely, the probabilities that each of the q individuals in X's acquaintance circle did not choose X on the (t + 1)st remove, were all equal and independent. Another way of saying this is that our knowledge that the first acquaintance did not choose X did not affect our assumption regarding the state of the second acquaintance, etc. If we drop this assumption of independence, the compound probability that none of X's q acquaintances chose X can no longer be represented by Expression 36. Instead of the qth power, we must write a q-fold product

\prod_{k=0}^{q-1}\left[1 - p_k(t)m\right],  (39)

where the p_k(t) are conditional probabilities to be determined. Using Expression 39 instead of Expression 35 in Eq. 37 and solving for α(t), defined by Eq. 33, we now have

\alpha(t) = -\frac{1}{p_t}\sum_{k=0}^{q-1}\log\left[1 - p_k(t)m\right].  (40)

As before, assuming large q and small a, the logarithms in Eq. 40 can be well approximated by -p_k(t)m, and we have the simplified form of α(t), namely,

\alpha(t) = \frac{m}{p_t}\sum_{k=0}^{q-1} p_k(t).  (41)

In the special case of the completely mixed population, all the p_k(t) are equal for a given t, and α(t) = qm ≃ a, as, of course, should be the case.

If a bias operates, the p_k(t) cannot be assumed to be equal. How they will vary with k depends on the nature of the bias. We assume in what follows that the acquaintance circles are "strongly overlapping." The exact meaning of this assumption will, I hope, become clear in the discussion.

Consider the state of affairs we have described in which, we recall, choices are made among fairly large but finite acquaintance circles, the latter having been "randomly recruited."
Because of random recruitment, the acquaintance circles of two individuals, A and B, who are themselves acquainted have the same expected intersection as the acquaintance circles of any two arbitrarily selected individuals. Suppose now we are given the "density" of individuals newly chosen on the tth remove in A's acquaintance circle (i.e., the probability that an arbitrary individual in that set is among the p_t individuals of the tth remove). Call this density p_t(A). According to our assumption of arbitrarily recruited acquaintance circles, this knowledge in no way modifies our knowledge of the density of the individuals in B's acquaintance circle, p_t(B), because the two acquaintance circles are in no way related. Suppose now that both A and B are in the acquaintance circle of X. The knowledge that A did not happen to choose X on the (t + 1)st remove somewhat modifies our estimate of the p_t(A) because of the way this probability can be calculated as a conditional probability, given that A did not choose X. It does not modify our estimate of p_t(B) in the random case. Therefore, the probability that B did not happen to choose X, given that A did not, remains the same as the a priori probability given by 1 - p_t m. This is the justification for Expression 35 as the probability that none of X's acquaintances chose X on the (t + 1)st remove.

The bias of strongly overlapping acquaintance circles is reflected in the assumption that knowledge of the p_t(A) does modify our estimate of p_t(B), because both A and B are in X's acquaintance circle and so probably in each other's.⁷

⁷ Note that we cannot make the assumption of complete transitivity of the acquaintance relation, namely, that two individuals who are in the acquaintance circle of a third are also in each other's. This would imply that the entire population was partitioned into mutually exclusive cliques of q members, each with random choices within the cliques. We would then have several random nets instead of a single biased one. The word "probably" in the sentence to which this footnote refers points up the approximate and nonprecise assumption of the strongly overlapping acquaintance circles. Roughly speaking, there are leaks in the cliques.

Hence knowing that A did not choose X introduces a new contingent probability that B is among the p_t individuals of the tth remove. Calculating this probability by Bayes' rule, we obtain

p_1(t) = \frac{p_t(1 - m)}{1 - p_t m},  (42)

which reduces to p_t for m = 0, that is, when the acquaintance circles are infinitely large. If this is not the case, we must take p₁(t) as the density of tth-remove individuals in B's acquaintance circle. If B also did not happen to choose X, we get a further modification of our estimate of the density of tth-remove individuals in the vicinity of A. By iteration, we get

p_k(t) = \frac{p_{k-1}(t)(1 - m)}{1 - p_{k-1}(t)m},  (43)

and by induction

p_k(t) = \frac{p_t s^k}{1 - p_t + p_t s^k}, \qquad (k = 1, 2, \ldots, q),  (44)

where s = 1 - m. All the p_k(t) being determined, we calculate α(t) by Eq. 41, which for large q is approximated by

\alpha(t) = -\frac{1}{p_t}\log\left[1 - p_t(1 - e^{-a})\right].  (45)

For small p_t, the right side of Eq. 45 is well approximated by 1 - e^{-a}.

We now relax the assumption of "strongly overlapping acquaintance circles" by introducing an additional parameter θ, which represents the average fraction of individuals in the acquaintance circles who are included in the overlap bias. Alternatively, we can say that in selecting acquaintances each individual selects a fraction θ in accordance with the overlap bias and a fraction 1 - θ randomly.
This modification leads to the following approximate expression for α(t) (Rapoport, 1953):

\alpha(0) = a, \qquad \alpha(t) = \theta(1 - e^{-a}) + (1 - \theta)a, \quad (t \ge 1).  (46)

For θ = 0, α(t) reduces to a, and for θ = 1 to 1 - e^{-a}. The value α = 1.1, or θ = 0.8, gives a reasonably good fit, as shown in Fig. 5. Obviously, the fit could be still better if we had another free parameter at our disposal. However, if the theory is to remain a rational one, such a parameter cannot be chosen ad hoc. It must be "rationalized." Rationalizations of additional parameters suggest themselves in biases that may be operating in the sociogram in addition to the bias of overlapping acquaintance circles, already taken into account.

Fig. 4. Comparison of tracing through first and second friends (points) with the curve predicted by a tracing through a random net with the same actual axone density (1.78).

Fig. 5. Comparison of tracing through first and second friends (points) with the curve predicted by a tracing through a net with overlap bias θ = 0.8.

Note that the overlapping acquaintance bias implicitly imposes a "metric" on the social space, although the precise nature of this metric or of the underlying topology is not specified. Thus any other distance bias we would impose, for example, a "reciprocity bias," in which choices tend to be reciprocated, or "transitivity biases" of higher order, would either already have been implied by the overlap bias or might be implicitly inconsistent with it. If we wish to add an independent free parameter, we should seek a bias that is at least not obviously related to the overlap bias. Such may be the "popularity bias," which results if some individuals "attract" axones and others "repel" them. This distinction defines an inherent property of the individuals in the population, not a relation among pairs of individuals, and therefore can be supposed to be unrelated to the overlap distance bias. To introduce a popularity bias, we could assume or determine empirically a distribution of attractiveness in the population and modify the net model accordingly. A much simpler way is to take the grossest feature of this bias as a single free parameter. Note that a popularity bias tends to reduce the "effective" population through which the tracing is made. This can be seen in the extreme case in which only a fraction of the population can attract the axones at all, since this simply leaves the others out as members of the population.⁹ In any case the predominance of choices . . .

⁹ Note the analogy with the notion of the susceptible population in the theory of contagion.

Fig. 6. Comparison of tracing through first and second friends (points) with the curve predicted through a net with N = 601 and overlap bias θ = 0.77.

The n marriage types recognized by the society can be represented as the components of a vector (t₁, t₂, . . . , tₙ). A permutation transforming this vector into a vector of marriage types assigned to the sons or to the daughters can be represented by an n × n permutation matrix. There are thus two permutation matrices, S for the sons and D for the daughters. Now a family tree can be represented as a directed linear graph in which there are just two types of bonds, namely, a descendent bond (from parent to child) and a marriage bond. Moreover, the nodes of the graph will be of two kinds, male and female.
Marriage rules imply that in a family tree marriage bonds are found only between individuals of opposite sex and of the same type. The permutation matrices, S and D, now enable us to label each individual by marriage type, given the type of an ancestor. Thus the sons of a man of type tᵢ will have the type designated by the ith component of the once-permuted vector St; the sons of his sons will have the corresponding type in the twice-permuted vector S²t; the sons of his daughters will be read off from SDt; the daughters of his sons from DSt, etc. We can infer the marriage type of an ancestor by applying the inverses of the permutation matrices: S⁻¹ for a man's parent, D⁻¹ for a woman's parent, D⁻¹S⁻¹ for a man's maternal grandparent, etc. Products of permutation matrices are also permutation matrices. The matrix that determines the permutation of marriage types associated with a given relationship is called the relation matrix M. Thus M = DSD⁻¹S⁻¹ is the relation matrix for a man's "cross cousin" (mother's brother's daughter).

The mathematical method just described allows us to do more than determine whether marriage is permitted in any specific case. It allows us also to decide whether a given rule for the assignment of types is compatible with a given set of rules governing marriage and also to choose methods of type assignment to generate given marriage rules. Examples of such general findings have been stated in the form of theorems, such as the following.¹⁰

¹⁰ For definitions of terms used in matrix theory and group theory, e.g., "effective set" and "generated group," see Kemeny, Snell & Thompson (1957).

A man is allowed to marry a female relative of a certain kind if and only if his marriage type does not belong to the effective set of M [that is to say, the component of the vector (t₁, t₂, . . . , tₙ) corresponding to his marriage type remains invariant after transformation by the M associated with the relationship].

If it is to be permissible for some of the descendants of any two individuals to marry (i.e., if marriage prohibition does not extend to all blood relations), then for every i and j there should be a permutation (in the group generated by S and D) which carries i into j.

If whether a man is allowed to marry a female relative of a given kind depends only on the kind of relationship (i.e., if the same rule is to apply to individuals of all types), then in the group generated by S and D every element except I (the identity permutation) is a complete permutation.

It is in a way remarkable that the intricate and explicit marriage rules in societies which have them are, as far as is known, consistent. One suspects, of course, that the experience of many generations would have weeded out inconsistencies and that the rules have become second nature to the practitioners, who need no formal deductive system to arrive at conclusions. To the outsider, however, a formalized logical system is very useful not only in that it provides algorithms of reasoning but also because it suggests generalizations and applications in a variety of contexts.¹¹

¹¹ Hoffman (1959) has applied symbolic logic methods to similar problems and has derived a marriage rule in Pawnee society which, he observes, is not found in the ethnographic record: a man's marriage partner must be the granddaughter of the sister of the man's father's father. (N.B. In Pawnee society cohabitation and marriage are not synonymous.)

Katz (1953) has used the method of matrix algebra to redefine the concept of "status" based on sociometric choice in a group. The conventional "popularity" definition of status is related to the number of times a group member is chosen by other members in situations in which sociometric choices are recorded. Katz redefines status by having it depend not only on how many times one has been chosen but also by whom.
The idea is to make the choices by high-status members count more. If C represents the matrix of sociometric choices, with c_{ij} = 1 if i chooses j and c_{ij} = 0 otherwise, let

T = aC + a^2C^2 + \cdots + a^kC^k + \cdots = (I - aC)^{-1} - I,  (49)

where I is the identity matrix and a (0 < a < 1) is an "attenuation factor" which weights indirect choices of various orders of removes (choices of choices, etc.) by its corresponding successive powers. The columns of T contain components that represent the choices accorded to the corresponding member, weighted by the statuses of the choosers, and also indirect choices (of higher removes) attenuated by the powers of a. The column sums of T, then, divided by an appropriate integer (analogous to the total possible number of choices in the "popularity" definition of status), give the modified status index, in which the status of the choosers plays a part in determining the status of the chosen. Katz shows how in an artificially constructed group the proposed status index reflects these properties (cf. also Forsyth & Katz, 1946).

3.2 The Detection of Cliques and Other Structural Characteristics of Small Groups

Festinger (1949), Luce and Perry (1949), Harary (1959), and others have used matrix algebra as a tool for detecting certain structural features of small groups. To fix ideas, consider the matrix

A = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}.

Evidently, the corresponding directed graph is the one shown in Fig. 10. Let us now examine A²:

A^2 = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 2 \\ 0 & 2 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}.

Fig. 10. Directed graph corresponding to matrix A.

The entries of A², a_{ij}^{(2)}, are Σ_k a_{ik}a_{kj}, where the a_{ij}, the entries of A, are either 0 or 1. The nonzero entries of A², therefore, come from all the paths of length 2 from i to j (the 2-chains in Luce and Perry's terminology) in the structure represented by A. Therefore the entry a_{ij}^{(2)} in A² represents the number of distinct 2-chains from i to j. This result is immediately generalizable: the entry a_{ij}^{(k)} represents the number of distinct k-chains from i to j. We see that the structure of this group is represented in the entries of the successive powers of the associated matrix. Examining now the diagonal elements of A², we see that the same result implies that the ith diagonal entry represents the number of elements in the group with which the ith member has symmetric relations (two-way bonds).

Pursuing the interpretation of structures in a social-psychological context, Luce and Perry (1949) define a clique as a maximal completely connected subset of the original structure containing at least three persons, that is, a completely connected subset which is not properly contained in a larger completely connected subset. In other words, all members of a clique have symmetric connections with one another, but no other group member stands in a symmetric relation to all the clique members (or he too would have to be counted as a clique member). This definition seems natural enough, but on second thought it may seem somewhat restrictive, as Luce and Perry themselves point out.
For example, let a subgroup of a given social group be "very tightly knit" according to any intuitive standard of judgment, except for a few bonds missing. This situation violates the clique definition. Therefore, the definition fails to distinguish cliques in a sense useful to the social psychologist or the sociologist. This objection is not a formal one, of course. It merely points out that many subgroups which the sociologist may consider well qualified to be called cliques may not satisfy the mathematical qualification. In a later paper Luce (1950) relaxes his definition of clique to include more general subgroups.

To a certain degree the clique structure of a group can be determined by examining the cube of a matrix S derived from A by eliminating all unreciprocated connections; that is, S is the element-wise product S = A ⊗ A', where A' is the transpose of A.¹²

¹² In the element-wise product each entry of the product matrix is simply the product of the corresponding entries of the factor matrices.

The ith entry of the main diagonal of S³ gives us information about whether i is a member of a clique: he is if the entry is different from zero. Should there be only one clique of t members in the group (which does not mean, of course, that everyone is in it), the diagonal elements of S³ will show it: the corresponding t elements will have entries (t - 1)(t - 2), and the remaining entries will be zero, a result obtained earlier by Festinger (1949).

Harary (1959), working primarily with symmetrical structures, describes a procedure for detecting all the cliques by way of determining those with "unicliqual" members. A unicliqual member is a group member who belongs to only one clique. A noncliqual member is one who belongs to no clique. Harary's clique-detecting procedure begins with the deletion of all the noncliqual members. This is easily done by examining the element-wise product S² ⊗ S, where S is the symmetric part of the original matrix. A member is noncliqual (as we can readily verify) if and only if the corresponding row of S² ⊗ S consists entirely of zeros.

Now, let G be the group from which all the noncliqual members have been removed. Then, obviously, if G has only one clique, all its members are unicliqual. It is also obvious that if G has exactly two cliques, then both cliques must have unicliqual members (otherwise every member would belong to both cliques and the two cliques would be identical). Somewhat less obvious are the corresponding statements for the cases in which G has three cliques or more. In the first case at least two of the three cliques have unicliqual members; in the second case there is no restriction on the number of unicliqual members there may be in the group. All the possibilities for the four-clique case are illustrated in Fig. 11.

Fig. 11. Groups with four cliques. The black circles are unicliqual members.
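The matrix devices used in this argument, the diagonal of S³ as a test of clique membership and the element-wise product S² ⊗ S as a test for noncliqual members, are straightforward to compute. The following sketch uses a small hypothetical choice matrix (not one of the groups of Fig. 11) purely for illustration.

```python
import numpy as np

# Hypothetical sociometric choice matrix (here already symmetric, so all
# bonds are reciprocated); members 0, 1, 2 are mutually connected.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])

S = A * A.T                              # element-wise product keeps symmetric bonds only
S3 = np.linalg.matrix_power(S, 3)

# A member belongs to some clique (three or more mutually connected members)
# exactly when his diagonal entry of S^3 is nonzero.
print("diagonal of S^3:", np.diag(S3))

# A member is noncliqual when his whole row of S^2 (x) S (element-wise) is zero.
noncliqual = np.all((np.linalg.matrix_power(S, 2) * S) == 0, axis=1)
print("noncliqual members:", np.where(noncliqual)[0])
```

For the matrix shown, members 0, 1, and 2 form the single clique of t = 3 members; their diagonal entries of S³ equal (t - 1)(t - 2) = 2 and the remaining entries are zero, in agreement with Festinger's result, while members 3 and 4 are flagged as noncliqual.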
We can thus proceed to eliminate unicliqual members and be sure that we have counted all the cliques in the process. The foregoing are samples of investigations of abstract structures. The investigations have been carried out in the spirit of pure mathematics; that is, the results obtained were sought because of the logical interconnec- tions among the questions asked about the configurations studied and not necessarily because of direct "usefulness" of these results for understanding aspects of analogous structures in the real world, such as small groups of particular interest to the psychologist or the sociologist. Moreover, the investigations were descriptive in the sense that the results were a display of structural features (to be sure, mathematically deduced) and not behavioral predictions. To pass to the behavioral predictions, hypotheses are required that would relate structures to actual events or to some underlying tendencies for events to occur. In the next sections we shall be concerned with investigations of this sort. 3.3 The Theory of Structural Balance Consider a sociogram determined by a set of individuals and a two- valued symmetric relation. Between the members of each pair there is either a "positive" or a "negative" bond or no bond. The relation can be psychologically interpreted as "liking," "disliking," or "indifference." A hypothesis has been advanced in social psychology (Heider, 1946; Newcomb, 1953, 1956) to the effect that two persons' attitudes toward each other are influenced by their attitudes toward some third object. For example, two persons who both like the same things tend to like each other; two persons who dislike the same things also tend to like each other. But if two persons have opposite attitudes toward the same thing, they tend to dislike each other.13 The three situations are pictured in Fig. 13 Dislike generated by conflict over coveted objects provides an obvious important exception. 540 MATHEMATICAL MODELS OF SOCIAL INTERACTION (C) Situations (a), (6), and (c) in Fig. 12 are "balanced" in the sense that the hypothesis previously stated is satisfied. In (a), A and B both like X and each other; in (£), A and B have opposite attitudes toward JTand dislike each other; in (c) they both dislike X and like each other. In (d), however, the hypothesis is violated : A and B like each other, in spite of having opposite attitudes toward X. In (e) the hypothesis is also violated because A and B dislike each other in spite of having the same (negative) attitude toward X. Suppose, now, we assign a positive sign to solid lines and a negative sign to dotted lines. We define the "sign" of the cycle A — * B -> X-+ A as the product of the three signs of its three lines, following the algebraic convention that the product of like signs is positive and of unlike signs, negative. We see, then, that if and only if the hypothesis of balance is satisfied the sign of the associated cycle is positive. The object X can, of course, be a person as well as a thing (or an insti- tution or an idea). We may then examine the 3-cycles of any social group viewed as a signed symmetric graph to determine whether they satisfy the balance hypothesis. Moreover, we can generalize this procedure by examining the signs of cycles larger than 3-cycles, using the same rule of multiplication of signs. A signed graph is called balanced if and only if all of its cycles are positive. 
It now becomes of interest to examine the evidence for the following hypothesis, which is an extension of the preceding one: a signed graph representing a sociogram of a social group tends to become balanced. STRUCTURE OF SMALL GROUPS 5^/ The hypothesis implies roughly that attitudes of the group members will tend to change in such a way that one's friends' friends will tend to become one's friends and one's enemies' enemies also one's friends, one's friends' enemies and one's enemies' friends will tend to become one's enemies, and moreover, that these changes tend to operate even across several removes (one's friends' friends' enemies' enemies tend to become one's friends by an iterative process). Another way of saying the same thing is that a social group tends to split into two subgroups (one of which may be empty) such that members within each subgroup like each other, whereas members from the two different subgroups (if there are two) dislike each other. The formal equivalence of the two statements has been proved by Harary (1954). If a sociogram of a group is given, we can determine whether the associated graph is balanced by examining the sign of each cycle. Since, in general, there are many cycles in a moderately large and densely connected group (say a fraternity house), it is too much to expect the hypothesis of balance to be satisfied completely. However, the trend toward balance may still be verifiable, provided that we accept a quantitative instead of an all-or-none definition of balance, that is, a definition of the degree of balance of a graph and of its associated social structure. Such quantitative definitions were offered by Cartwright and Harary (1956) and by Harary (1959). Evidence for the existence of secular trends toward structural balance, as defined by mathematicians, is meager. One longitudinal study comes close to establishing a result that may be related to this hypothesis. New- comb (1956) conducted a set of consecutive observations on the 17 resi- dents of a student house at the University of Michigan. In the course of time (the study lasted one semester and was replicated), the attitudes of those who were attracted to each other tended to come closer together, including views the subjects held of their own selves and their "ideal" selves. Many social-psychological studies deal with related hypotheses in the realm of attitudes, interactions of attitudes, and resulting tensions or tension resolutions, but a rigorous testing of the mathematical theories of structural balance is still lacking. A review of the literature on this topic has been given by Zajonc (1960). 3.4 Dominance Structures Let the relation of interest between any two individuals in a small group be one of dominance. For example, A > B may mean that A tends to influence B's decisions or that A tends to win in chess from B or that A MATHEMATICAL MODELS OF SOCIAL INTERACTION tends to be preferred to B in sociometric choices by others of the group when the choice involves A and B alone (a paired comparison). Such a relation is by its very nature antisymmetric, that is, if A > B is true, then B > A is not true. Further, we would ordinarily expect such a relation to be transitive, that is, A > B and B > C might be expected to imply A > C. If this is the case, a well-ordering of the members of the group is deter- mined by all the relations between pairs. Interesting questions arise if the relation is not transitive, that is, when we may have A>B,B>C, and C>A. 
Such cycles in dominance relations are actually observed, for example, in the behavior of hens which manifests the so-called "peck right." Although a complete hierarchy (well-ordering) according to peck right is often established in a flock of barnyard hens, cycles are also commonly observed. The violations of transitivity observed in social-dominance relations have given rise to various theoretical developments of social interaction, some of which we shall now consider. The usual interpretation of the nontransitivity of the dominance relation rests on the assumption that this relation is established to a certain extent by chance events. The resulting models are analogous to certain stochastic models designed to explain intransitivities of preferences resulting from sequential paired comparisons. However, in the context of a stochastic theory of social structure, certain aspects of these models receive emphasis that they do not receive in the context of individual preference theories. We shall pursue the developments of some such models accordingly in the present context. We take for our chance event the result of combat encounter between two individuals. We shall assume that the result is "victory" for one of them and that peck right is accorded to the victor. We assume that en- counters have occurred among all pairs. Thus a peck-right structure representable by a directed completely connected graph is established. Next, we seek to classify the structures. A natural classification would identify a structure with a class of linear graphs isomorphic to it. Naturally a renaming of 'individuals should not affect the type of structure. As the number of individuals increases, the number of possible nonisomorphic linear graphs becomes rapidly very large.14 It follows that the classification 14 The number of distinct antisymmetric matrices of order N is 2N(N~1}/*. Some of these represent isomorphic graphs obtained by renaming the individuals (interchanging some rows and corresponding columns). Each structure is therefore represented by at most Nl matrices, and so 2N(N~1}lZ(N\)~l is a lower bound on the number of nonisomorphic dominance structures of TV-person groups. For TV = 8 this lower bound is already more than 6000; for N = 12 it exceeds 1011. Studies on the number of graphs of various types appear in mathematical literature (e.g., Katz & Powell, 1954; Davis, 1953, 1954). STRUCTURE OF SMALL GROUPS proposed is too fine to be practical, since the number of distinct structures becomes so large even for moderate N that it is hopeless to observe the "frequency of occurrence" of each structure, which is the usual test of a postulated stochastic process supposed to underlie the establishment of the structure. We shall therefore introduce a rougher measure, namely, a "score structure," defined as follows. In every group of N individuals, each will have N — 1 relations with the others, of which d will be dominant and TV — 1 — d will be submissive The group can then be described by a set of N numbers (rls r2, . . . , rN) such that S ri = \N(N — 1). This set of numbers arranged conventionally so that rx > r2 >,...,> rN will be called the score structure of the group. For N > 4 there are fewer nonequivalent score structures than social structures, and the difference increases rapidly as N increases. Therefore the score structure gives us a rougher classification. A still rougher index was introduced by Landau (1951). Note that the score structure (7V — 1, N — 2, . . . 
A still rougher index was introduced by Landau (1951). Note that the score structure (N − 1, N − 2, . . . , 1, 0) corresponds to the completely hierarchical structure of a group, in which the individual with the highest score dominates all the others; the one with the next highest score dominates all but one, etc. Landau's hierarchy index measures the departure of a given score structure from that of the complete hierarchy. He defines

h = [12/(N³ − N)] Σ_j [r_j − (N − 1)/2]².   (50)

This definition ensures that in an "egalitarian" group, in which the score structure has all equal components and therefore r_j = (N − 1)/2 (j = 1, 2, . . . , N), h = 0; also h is maximal in a hierarchy. The factor outside the summation in Eq. 50 is a normalization factor, which makes h = 1 for a hierarchy.

Following Landau (1951), suppose each individual j is characterized by an "ability vector" x_j = (x_{j1}, x_{j2}, . . . , x_{jm}). The components of the vector are the various factors that presumably have a bearing on the probability that, in an encounter between two individuals, one will emerge the victor. Among these factors in hens are size, concentration of male hormones (making for the appearance of secondary male sex characteristics), etc. We may assume that these characteristics are distributed according to a multivariate distribution in the population from which the individuals have been selected. By our definition of the factors x_j, it appears that the probability that individual j will dominate k is a function of the corresponding vectors:

Pr (j > k) = p(x_j, x_k) = p_{jk}   (j, k = 1, 2, . . . , N).

The problem of determining the actual multivariate distribution of the components and the probabilities p_{jk} is not of the essence in the theoretical investigation to follow. We seek instead some estimate of the expected value of the hierarchy index h:

E(h) = [12/(N³ − N)] Σ_j E[(r_j − r̄)²],   (52)

where r̄ = (N − 1)/2. Assuming the same multivariate distribution F(x) for all of the individuals in the group, Landau proves the following general result:

E(h) = [3/(N + 1)] [1 + (N − 2) ∫ Δ²(x) dF(x)],   (53)

where Δ(x) = ∫ g(x, y) dF(y) and g(x_j, x_k) = p_{jk} − p_{kj}. Introducing various special assumptions concerning F(x) and p_{jk}, Landau then obtains corresponding values of E(h). In particular, in the completely unbiased case, in which the two possible outcomes of each encounter are equiprobable (independent of the ability vectors), he gets

E(h) = 3/(N + 1),   (54)

a result obtained previously by Rapoport (1949) directly in the special cases N = 3, 4, and 5.

To calculate E(h) specifically in a biased case, assumptions must be made about the multivariate distribution F(x) and about the way the probabilities p_{jk} depend on the ability vectors. Assuming the ability components to be normally distributed with variances σ_a (a = 1, 2, . . . , m) and the probability of dominance to be a weighted sum of normal probability integrals with variances s_a (a = 1, 2, . . . , m), Landau obtains a corresponding closed expression for E(h) (his Eq. 55), which depends on N, on the ratios of the σ_a to the s_a, and on the relative "weights" w_a of the ability factors. This expression reduces to the unbiased case if the s_a become infinitely large (implying p_{jk} = ½ for all j, k). If s_a = 0, dominance is determined for every pair by the ability vectors alone, and E(h) reduces to unity, as it should.

The main point of these calculations is to show that for any moderately large N the expected value of the hierarchy index h should be small, that is, considerably less than unity. The greater the number of (uncorrelated) ability components, the smaller this expected value.
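As an illustration of these results (not part of Landau's paper), the following sketch computes h of Eq. 50 for a given score structure and estimates E(h) by Monte Carlo for the completely unbiased case, where it should approach 3/(N + 1).

```python
import random

def hierarchy_index(scores):
    """Landau's h (Eq. 50) computed from the scores r_1, ..., r_N."""
    n = len(scores)
    mean = (n - 1) / 2
    return 12.0 / (n ** 3 - n) * sum((r - mean) ** 2 for r in scores)

def random_tournament_scores(n, rng):
    """Scores of a group in which every encounter is decided by a fair coin (unbiased case)."""
    scores = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            winner = i if rng.random() < 0.5 else j
            scores[winner] += 1
    return scores

rng = random.Random(1963)
n = 10
print(hierarchy_index(list(range(n - 1, -1, -1))))   # 1.0 for the complete hierarchy
est = sum(hierarchy_index(random_tournament_scores(n, rng)) for _ in range(20000)) / 20000
print(est, 3 / (n + 1))                              # estimate close to 3/(N+1), about 0.273
```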
But even for one component, Landau shows that E(h) can be expected to be close to unity only if very small differences in ability are decisive in determining the direction of dominance established in an encounter.

Experimental evidence indicates that the dominance-determining power of known factors is actually quite small. Collias (1943) staged 200 combats between hens, taking each hen of a pair from a different flock, and correlated the outcomes with measured degrees of moult, comb size, weight, and rank in its own flock. The correlation coefficients were respectively .580, .593, .474, and .262. These correlations lead to 0.34 as the value of E(h) for large N. On the other hand, the observed values of h in flocks of 10 hens (10 is a large number in this context) are in the nineties, that is, above .90 (Schjelderup-Ebbe, 1922). Landau's conclusion is that the observed near-hierarchy established by "almost transitive" peck right cannot be accounted for by "inherent abilities" of the hens alone. It is natural to look to "social factors," that is, the role of experience within the flock, imitation, learning, etc., for reasonable explanations of the high values of the hierarchy index observed in nature. The conclusion is not surprising in view of the fact that workers in animal sociology have long suspected the operation of these factors. However, deriving this conclusion from mathematical and statistical considerations has for the mathematical sociologist a methodological interest because it puts the problem into theoretical perspective.

Leeman (1952) used a similar approach to construct a mathematical model of sociometric choice patterns in a small group. His model differs from that studied by Rapoport and Landau in that only one directed line issues from each group member; hence only one sociometric choice is made by each member at each specified time. Moreover, in the sociometric choice model the binary relation is not necessarily antisymmetric, as it is in the dominance structure model.

Specifically, Leeman assumes that sociometric choices are established by encounters between pairs. In each encounter either the person encountered is chosen or any of the other members of the group is chosen with equal probability. Thus at each moment of time a pattern of choices is established. The stochastic process leads to the probability distribution of all possible choice patterns. Leeman computes these distributions for a three-person group (in which there are two possible nonisomorphic choice patterns) and for a four-person group, in which six nonisomorphic patterns are possible. The theory is extended to a biased probability case, and some experiments are cited, the results of which lead essentially to the rejection of the model based on equiprobable outcomes of encounter.

Luce, Macy, and Tagiuri (1955) treat a similar problem in which, however, the relation between any pair of individuals can have 45 different values. This relation, called a "diad," is defined as follows. Assuming that an individual in a group can choose, reject, or ignore another individual and, in turn, can feel himself chosen, rejected, or ignored, it follows that individual A can relate himself to another individual B in the nine different ways in which his own attitude and his perception of the other's attitude can be combined. A's nine relations can be combined with B's nine in 45 different ways (disregarding order). Hence a diad can have 45 different values.
In a random model the choices and guesses of each individual are assumed to be governed by independent chance events. Thus no psychological factors are supposed to operate except those governing the relative frequencies of choices and perceptions. In a biased model a dependency is introduced between an individual's attitude toward another individual and his guess about the other's attitude, to bias the events toward greater congruence of attitude and perception. Comparison with data obtained from a group-therapy session involving a 10-person group shows that the biased model accounts for a large part of the observed variation in diad frequency.

4. PSYCHOECONOMICS

Theories of population dynamics and those of interaction in small groups are linked by a common mathematical apparatus, first utilized extensively by Cournot (1927 translation of 1838 volume), an early mathematical economist. Recently his and related ideas have been cast into the framework of social-psychological experiments and have borne results of considerable interest and, perhaps, of sufficient importance to be considered as foundations of experimental psychoeconomics.

4.1 A Mathematical Model of Parasitism and Symbiosis

Before we examine these experiments let us first discuss a fictitious psychoeconomic model, closely related to the population dynamics models discussed in Sec. 1.6. In this way we shall introduce a conceptual link between population dynamics and psychoeconomics.

Consider two individuals, X and Y, each of whom produces a different commodity in the respective amounts x and y. Each, being in need of both commodities, agrees to exchange a fraction q of his own output for the same fraction q of the other's output. Thus each keeps the fraction p = 1 − q of his own output. Assume that there is a positive, logarithmic contribution to each individual's utility from what he receives in commodities and a negative contribution, proportional to his output (presumably because of the labor involved). Specifically, designating the utilities by S_x and S_y, we have

S_x = log (1 + px + qy) − βx,
S_y = log (1 + qx + py) − βy.   (56)

The ones in the arguments of the logarithms were introduced to make the positive part of the utility vanish when x = y = 0.

The situation can be considered as a two-person, nonzero-sum game, in which each player has a continuum of strategies in the x- and y-space, respectively, so that the strategy space is the product space (x, y). In choosing his strategy each player naturally wishes to maximize his utility. But since he controls only one of the variables, all he can do is make the "best response" to each strategy chosen by his opponent, that is, choose that value of his variable which maximizes his own utility, given the choice of output by the other.

Setting ∂S_x/∂x and ∂S_y/∂y equal to zero, we obtain two "optimal lines" (so-called Cournot lines) in the (x, y) space. Each individual will regulate his output to try to bring the point (x, y) onto his own optimal line. The equations of these optimal lines are

L_x: px + qy = p/β − 1,
L_y: qx + py = p/β − 1.   (57)

The intersection will be in the first quadrant if p > β, and the equilibrium will be stable if p > q. If the equilibrium is not stable, either X or Y will stop producing altogether and so become "parasitic" on the other, who must keep on producing to maximize his own utility in the absence of output by his partner. Which individual will become the parasite in this case depends on the initial conditions. Thus the situation bears a formal resemblance to the two-population competition discussed in Sec. 1.6.
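The behavior of this system is easy to explore numerically. The sketch below is an illustration under the utility functions of Eq. 56, with invented parameter values: it iterates the best responses defined by the Cournot lines and exhibits both the stable interior equilibrium and the unstable ("parasitic") case.

```python
def best_response(other_output, p, q, beta):
    """Output maximizing log(1 + p*own + q*other) - beta*own, truncated at zero."""
    return max(0.0, (p / beta - 1.0 - q * other_output) / p)

def iterate(p, beta, steps=200, x0=0.1, y0=2.0):
    """Alternate best responses of X and Y, starting from the outputs (x0, y0)."""
    q = 1.0 - p
    x, y = x0, y0
    for _ in range(steps):
        x = best_response(y, p, q, beta)
        y = best_response(x, p, q, beta)
    return x, y

print(iterate(0.7, 0.4))   # p > q and p > beta: both outputs converge to p/beta - 1 = 0.75
print(iterate(0.3, 0.2))   # p < q: unstable; one individual produces 0 (the "parasite"), the other 5/3
```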
The unstable case in the present model, which leads to parasitism, is analogous to the unstable case in the competition of populations, which leads to the extinction of one population. However, the present example has another feature, which is not present in the population dynamics example, namely, the associated utilities. The specification of utilities enables us to determine how well each individual does at the various possible outcomes: for example, at the stable equilibrium, if it exists, or as a parasite or as a "host," if he becomes one or the other.

It is interesting to note that at the stable equilibrium neither of the two individuals does as well as he could at the "Pareto point," at which the joint payoff is maximized. But the Pareto point cannot be reached if each individual "tries" to maximize his own utility. It can be reached only if the two coordinate their outputs (for example, by a contract, in which each obligates himself to produce as much as the other) or if each attempts to maximize the joint payoff instead of his own. Even the parasite, who emerges in the unstable case and gets a considerably higher payoff than his host, can sometimes do better by not becoming a parasite but instead by maximizing the joint payoff (provided the other does the same). Whether this is so depends on the parameters p and β. It turns out that X will be better off as a parasite than at the Pareto point if

log (pβ + qp − qβ) − log p + 1 − β > 0.

Taking into account that in the unstable case q > p, the inequality will hold if β is sufficiently small, the critical value depending on p.

Note that β measures the "reluctance to work." The qualitative conclusion, then, is the following: "It pays to be a parasite if the host is not too lazy." This sounds like a common-sense conclusion. But so does the statement, "It pays to be a parasite if you are sufficiently lazy." Since β was assumed equal for both individuals, the two common-sense conclusions are incompatible. The mathematical model decides between them.

Analogous situations appear in finite nonzero-sum games: for example, in the class of games called the Prisoner's Dilemma. Choices of strategy based on calculation of the players' own advantages lead them to an outcome disadvantageous for both; choices of strategy based on calculation of joint interest lead to results advantageous to both.15

Experiments with two-person games of this sort indicate that if communication between players is not allowed the Nash point (analogous to a stable equilibrium in the continuous game described above) rather than the Pareto point is predominantly chosen (e.g., Scodel, Minas, Ratoosh, & Lipetz, 1959). Experiments simulating competing firms have yielded essentially similar results.16 A conclusion seems warranted that, at least

15 Homo economicus is assumed always to tend to maximize his own utility. However, utility is usually defined tautologically as the quantity that each individual attempts to maximize. In situations involving material payoffs it may be useful, in the context of a psychological theory, to separate the utility accruing from one's own payoffs and the vicarious utilities accruing from payoffs to others. Various weightings of these utilities, supposed to be summed, determine the "altruism vector" of an individual, and the set of such vectors determines the "altruism matrix" of a group.
Use of this matrix in theoretical psychoeconomics was made by Rashevsky (1951) and by Rapoport (1956). 16 In many formalized situations of economic competition the intersection of Cournot lines (cf. p. 555) is analogous to the Nash point of a noncooperative nonzero-sum game. PSYCHOECONOMICS 549 in the cultures of the subjects in these experiments, choices guided by tacit mutual trust (which leads to the choice of the Pareto point) are, as a rule, not made. On the other hand, if communication and collusion are allowed, Pareto points are chosen with considerably greater frequency (Deutsch, 1958). Now if there is a unique Pareto-optimal set of strategies and if the parties are inclined to effect an agreement by negotiation, we must expect the Pareto-optimal solution. (If they can agree at all, they can be expected to agree to do the best they can.) However, an interesting situation results if there are several Pareto-optimal solutions (or a continuum of such solu- tions) and, moreover, the interests of the players are directly opposed along this set: what one player wins, the other loses. Games of this sort lead to bargaining situations. 4.2 Bargaining A typical bargaining situation is one in which two or more participants can each gain from an agreement entered into but in which there is a conflict of interest regarding the terms of agreement. Some would have it that bargaining is the prototype of all social interactions. The title of J. J. Rousseau's major work Le Contract Social attests to the influence on social philosophy of ideas stemming from economics (later explicitly formulated by Adam Smith and Ricardo). Formal theories of bargaining are a modern development, an outgrowth of the theory of the so-called "cooperative" nonzero-sum game. It seems to me that the term "cooperative" is a misnomer in this context because it suggests that the interests of the players coincide. They do, as a matter of fact, partially coincide by virtue of the game being nonzero-sum, since this implies that in the set of outcomes there is a subset associated with a maximum joint payoff. If payoffs can be added and transferred con- servatively from player to player (e.g., like money), it follows that it is in the players' joint interest to have an outcome with the maximum joint payoff. The term "cooperative," as it is used to designate such games, applies to the rules of the game which allow the players to communicate, that is, they presumably give them the opportunity to agree on such an optimum outcome. That this opportunity is not always utilized is well known, even if the outcome with the maximum joint payoff is unique. If there are several such outcomes, the difficulty of coming to an agreement is even more serious because in the choice of one of these outcomes the interests of the players often conflict. Indeed, this choice has the features of a constant-sum game, in which one player's gain is necessarily the 55° MATHEMATICAL MODELS OF SOCIAL INTERACTION other's loss. I therefore prefer to call nonzero-sum games, in which com- munication (i.e., bargaining) is permitted, negotiable games. A theory of such games constitutes a formal theory of bargaining. Like the theory of games, of which it is an extension, formal theories of bargain- ing are normative rather than descriptive. 
Typically, they are deduced from sets of axioms which reflect the features of "bargaining power" (e.g., the ability to make enforceable threats and promises) as well as certain equity considerations (usually conditions of symmetry or invariance of the outcomes with respect to the renaming of the players). The interested reader is referred to the researches of Nash (1950), Raiffa (1951), and Braithwaite (1955). Experimental studies purporting to deal with bargaining situations are rapidly accumulating. However, many of them, although interesting in their own right, are not directly relevant to the formal theories of bargain- ing for two reasons. First, although the situations studied are clearly of the "mixed-motive" type, that is, involving partly coincident and partly conflicting interests of the participants, they do not aim at tests of explicit mathematical models of bargaining, such as are contained in the theoretical investigations we have discussed. They aim, rather, at tests of qualitatively stated hypotheses of interest to psychologists: for example, at comparison of outcomes under different imposed conditions. Second, many of these studies exclude direct communication between the participants and thus lack the principal feature of the bargaining situation. I suspect that this is done in the interest of simplicity of analysis : it is easier to record and to analyze formalized acts of the participants than a protocol of offers, counteroffers, threats, and promises. To be sure it can be argued that implicit bargaining does occur if the same situation is repeated many times in succession because each partici- pant can indicate to the other by his acts what he can be expected to do in response to the other's acts. Thus implicit offers, threats, and promises can be made and carried out. An example of an implicit bargaining situation involving the distribution of priorities in the use of a one-way road by two "trucking companies" is given in Deutsch & Krauss (1960). The aim of this study was to test hypotheses concerning the effects of unilateral and bilateral "threats" on the outcomes, measured by the profits shown by the two firms who contend for the use of the one-way road, which is shorter than an alternate unimpeded road open to each company. Simultaneous attempts to use the short road compels one or the other to back up, thus losing time and profits. In some of the experimental conditions one or both of the firms can punish the other by blocking this -road at one end, and the ability to do so constitutes the "threat." PSYCHOECONOMICS JJ7 The results show that when neither firm has the threat potential or when only one firm has it, thus being in control of the situation, both firms do better than if both can avail themselves of the threat. (The firms do not compete; each is instructed to maximize its own profits without regard for the profits of the other.) In view of the existence of mathematical theories of economic behavior, simulations of classical economic situations (oligopolies, duopolies, etc.) offer greater opportunities for designing bargaining experiments along lines suggested by mathematical models. Experiments with oligopoly have been reported, among others, by Sauermann and Selten (1959). We shall examine in some detail the mathematical theory of the bilat- eral monopoly and a corresponding experiment reported by Siegel and Fouraker (1960). 4.3 Bilateral Monopoly Consider a market in which there is only one seller and only one buyer. 
Assume that the buyer buys some manufactured product wholesale from the (only) seller and that he can sell this product in the retail market at a price that is determined by the demand for the product. The buyer's total profit, therefore, will be the difference between the market (demand) price and the price he pays to the seller, multiplied by the quantity of the product that the seller will sell him. The seller's profit, of course, will be the difference between the price he will receive from the buyer and the production cost, multiplied by this quantity.

The question before us is whether, under these conditions, given the demand and the production schedules, the quantity sold by the seller to the buyer and the price paid by the buyer are determined by the economic situation alone, if it is assumed that each acts to maximize his total profit.

To fix ideas, assume that the retail demand price r decreases linearly with the quantity Q offered:

r = A − BQ,   (58)

whereas the production cost (per unit) c increases linearly with the quantity produced, assumed the same as the quantity offered:

c = A' + B'Q.   (59)

The straight lines represented by Eqs. 58 and 59 converge as Q increases. Therefore the margin between production cost and demand price becomes smaller with larger Q. However, the total profit to seller and buyer combined is the distance between these two lines multiplied by the quantity produced, sold, and resold. At some value of Q the combined profit accruing to both seller and buyer will be maximal. Let us compute this optimal quantity from the combined standpoint of the two individuals of this bilateral monopoly. The total profit is obtained by maximizing the difference between total retail sales and total production cost, namely, letting R = rQ and C = cQ,

R − C = AQ − BQ² − A'Q − B'Q².   (60)

Setting the derivative of Eq. 60 equal to zero, we get

A − 2BQ − A' − 2B'Q = 0.   (61)

Solving for Q, we get

Q = (A − A') / (2B + 2B').   (62)

It appears, therefore, that if the buyer and seller are to maximize their joint profit they should agree that the quantity to be produced and put on the market should be given by the value of Q in Eq. 62. However, having agreed on the quantity, how are they to determine the price that the buyer will pay the seller? Obviously, the higher the price, the greater the share of the joint profit that will accrue to the seller but also the smaller the share of the buyer. There are no constraints on the system to fix the price at some point between the production cost and the demand price. Therefore the model does not lead to a determinate solution with regard to the price to be paid to the seller, although it does lead to a determinate solution with regard to the quantity to be sold.

Suppose, now, the buyer announces that he will pay the price p per unit. The seller, being also the producer, can then decide what quantity he will be willing to produce and sell at that price. The seller will wish to maximize P − C, where P = pQ. Since he controls Q, he will do so by differentiating (P − C) with respect to Q and setting the derivative equal to zero. From Eq. 59 we get

C = A'Q + B'Q²,   (63)

whence dC/dQ = A' + 2B'Q, and therefore

Q = (p − A') / (2B').   (64)
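A quick numerical check of Eqs. 58 to 64 (with invented demand and cost parameters, purely for illustration) confirms that the joint profit is maximized at the quantity of Eq. 62 and shows the seller's supply response of Eq. 64 to an announced price.

```python
A, B = 10.0, 1.0      # retail demand: r = A - B*Q (invented values)
A_, B_ = 2.0, 0.5     # unit production cost: c = A_ + B_*Q

def joint_profit(Q):
    return (A - B * Q) * Q - (A_ + B_ * Q) * Q   # R - C of Eq. 60

def seller_supply(p):
    return (p - A_) / (2 * B_)                   # Eq. 64

Q_joint = (A - A_) / (2 * B + 2 * B_)            # Eq. 62: here 8/3
grid = [i / 1000 for i in range(8001)]
best = max(grid, key=joint_profit)
print(Q_joint, best)        # about 2.667 in both cases: the grid search agrees with Eq. 62
print(seller_supply(6.0))   # at an announced price of 6 the seller offers Q = 4
```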
Now the profit accruing to the buyer is

π_b = (r − p)Q = (A − BQ − p)Q,   (65)

with Q given by Eq. 64, and the buyer will wish to maximize this quantity with respect to p, the quantity he presumably controls. Setting the derivative of Eq. 65 equal to zero, we obtain, after simplifying,

p* = (AB' + A'B + A'B') / (B + 2B'),   (66)

where p* is the price that should be quoted by the buyer to ensure the maximum profit for himself under the constraint of the seller's control of the quantity to be produced. Substituting Eq. 66 into Eq. 64, we obtain the value of Q*, the quantity determined by p*, namely,

Q* = (A − A') / [2(B + 2B')].   (67)

We see that Q* does not correspond to the Q found by maximizing the joint profit. Let us now see what profits accrue to the seller and to the buyer, respectively, if p = p* and Q = Q*. We have for the seller's profit

π_s = (p* − c)Q* = [(AB' + A'B + A'B')/(B + 2B') − A' − B'(A − A')/(2(B + 2B'))] (A − A')/(2(B + 2B')) = B'(A − A')² / [4(B + 2B')²].   (68)

Similarly, we have for the buyer's profit

π_b = (r − p*)Q* = (B + 2B') [(A − A')/(2(B + 2B'))]² = (A − A')² / [4(B + 2B')],   (69)

and accordingly, for the joint profit,

π_s + π_b = (A − A')²(B + 3B') / [4(B + 2B')²].   (70)

Under these circumstances, we see that the buyer's share of the total profit (the buyer being the price leader) will amount to

π_b / (π_s + π_b) = (B + 2B') / (B + 3B').   (71)

His share, therefore, will be at least 2/3 (if B' is very large compared to B) and will approach unity if B is very large compared to B'. The advantage is obviously with the price leader.

We should not conclude, however, that the price leader will do best for himself under all circumstances by acting as price leader. Under certain circumstances he would do better by negotiating a settlement with the seller, even to the extent of offering him the greater share of the joint profit. This is because negotiation makes possible the optimization of Q. Assume, in fact, that Q has been optimized to give the greatest joint profit. Then Q is given by Eq. 62 and the joint profit by

(r − c)Q = (A − A')² / [4(B + B')].   (72)

As price leader, the buyer can command the amount given by Eq. 69 as his share of the total profit. Therefore he can afford to negotiate if he can get a fraction of the joint (maximized) profit which amounts to at least

(B + B') / (B + 2B').   (73)

This fraction is less than 1/√2 if B/B' < √2. In other words, the buyer as price leader is in a position to offer the seller a greater share in the (maximized) joint profit if the slope of the demand curve is sufficiently smaller (by a factor of √2) than that of the supply curve.

Recall the analogous situation with the two producers who share their product (Sec. 4.1). As in the present case, each can command a certain return at the intersection of the optimal lines. Therefore each is in a position to offer the other a somewhat greater share of the total utility (assuming the utility is transferable) in negotiating an agreement to maximize joint utility. If the equilibrium is unstable, the parasite can actually be better off (provided the parameters are within a certain range) receiving one half of the joint (maximum) payoff than getting his payoff as parasite.

These findings concern an aspect of "bargaining power" not often emphasized. The usual emphasis is on bargaining power derived from being in a position to threaten the opponent with dire consequences if the terms are not accepted. But there is also bargaining power derived from being in a position to promise the opponent certain advantages if he goes along with a proposal.

Let us turn to some experimental data that have a bearing on the foregoing theoretical discussion of bilateral monopoly. In the experiments described by Siegel and Fouraker (1960), pairs of subjects took the respective parts of a buyer and a seller in a simulated bilateral monopoly situation.
The object of the experiment was to deter- mine which, if any, of the theoretical positions with regard to the outcomes of a bilateral monopoly would be corroborated in the simulated situation. The question is an interesting one, inasmuch as the theoretical positions of various economists have been quite distinct. The "solution" offered by Cournot on the basis of assuming a "price leader" was derived on p. 553 (Eqs. 66 and 67). In symmetric bargaining neither has the privilege of being a price leader. Thus the Cournot solution, which determines both price and quantity, does not apply. The opinions of economists regarding the expected outcome have been divided. They fall into three categories: 1 . Neither the quantity nor the price (to the buyer) are determined by the economic factors. Other determinants must be known to predict the outcome of a specific situation (Bowley, 1928). 2. The quantity is determined a la Pareto, that is, by the maximization of joint profit. But the price is not determinate and will depend on factors not included in the model (e.g., psychological factors, reflecting the bargaining abilities of the participants) (Stigler, 1952). 3. Both quantity and price are determinate, quantity by maximization of joint profit and price by some bargaining principle, such as perceived symmetry of the situation (Schelling, 1960) and the intersection of the marginal revenue and marginal cost lines (Fouraker, 1957), or by some other bargaining principle derived from some plausible set of axioms (Nash, 1950; Raiffa, 1951). The experiments were conducted under conditions of symmetric bar- gaining. The first bidder was chosen by lot from each pair, and the bidding was in terms of offers and counteroffers of prices paired with quantities. In different experiments certain supplementary conditions were varied in order to note their effects on outcomes. For example, the members of bargaining pairs could be informed either only of their own payoff sched- ules, that is, the supply-cost (or demand-price) curves, or of both. The maxima of joint profit could be relatively sharp or relatively flat. Finally, "levels of aspiration" could be induced in the bargainers by offers of incentives if they won for themselves an indicated minimum profit. The results definitely corroborate the hypothesis that in symmetric bargaining the quantity agreed on is determinate and is chosen to maximize joint profit but that the division ratio of the profit depends on factors outside the economic model: for example, on the information possessed by the bargainers and on the induced levels of aspiration. In every experiment, in which several pairs of subjects were involved, the quantities agreed on clustered closely around the profit-maximizing MATHEMATICAL MODELS OF SOCIAL INTERACTION quantity, whereas the prices agreed on were spread out along the "negotia- tion set." Increasing the information available to the bargainers and increasing the rate of decline of joint profit as one moves away from the maximum (i.e., "sharpening" the maximum) had the effect of decreasing (sometimes to zero) the variance of the quantity agreed on. Inducing different levels of aspiration in the bargainers (offering additional incentives if certain minimum profits were secured) produced unmistakable biases in the price agreed on — the bargainer with a higher aspiration came off with the greater share of the profit. 
In this way the roles of both economic and psychological factors were separately and quantitatively demonstrated in a controlled bargaining experiment, in which bargaining was restricted to strictly formalized successive offers. 4.4 Formal Experimental Games The relevance of psychological factors in bargaining situations suggests that psychoeconomics may well become another of the "border regions" (like social psychology, psychophysics, psycholinguistics) of behavioral science in which methods of more than one discipline are fused in forging the exploratory tools. Aside from the psychology of bargaining (of obvious importance in the study of social interactions), psychological considerations can be expected to be relevant wherever individuals are put into situations in which they perceive their interests to be divergent. Bargaining situations are special instances in which negotiations can take place. Of equal interest from the psychological point of view are situations in which explicit negotiations are impossible. In the literature of game theory such situations are called noncooperative games. Here they are called nonnegotiable games. Although some psychological factors relevant to bargaining are ob- viously irrelevant in nonnegotiable games, other factors play a role perhaps equally essential. It is important to keep in mind that game theory, at least in its original formulation, completely bypassed psychological factors. All information available to the players by the rules of the game was assumed to be utilized; utility values of the outcomes were assumed to be given; the best available strategies were assumed always to be chosen, etc. It goes without saying that the application of game theory to behavioral science requires the introduction of psychological parameters, since human memory is not perfect, human decisions are not always "rational," etc. In principle, such parameters could be introduced to extend and to generalize game theory, and some work along these lines has been done. PSYCHO ECO NO MICS 557 Another, more purely empirical approach is taken by some experi- menters who set up situations suggested by game theory with a view of recording any regularities that may be found relating the observed behavior to normatively prescribed "solutions" of game theory: for example, in experiments with zero-sum games or with rc-person games in characteristic function form. Sometimes this cannot be done simply because game theory fails to prescribe even a normative solution, or a class of solutions, most notably in nonzero-sum, nonnegotiable games. Nevertheless, the behavior of people in situations isomorphic to such games is of great interest for what it may reveal about the underlying psychology. The program of the empirical approach is an old-fashioned one. A good model of "rational behavior" for nonzero-sum, nonnegotiable games does not exist, so it is proposed simply to gather great quantities of data on actual behavior in such situations in the hope that regularities discovered in the data can suggest models to be tested by further experi- ments, for example, by varying experimentally controlled parameters. The Prisoner's Dilemma is especially intriguing in investigations of this kind because it puts the players into a situation in which "collective ration- ality" (and its underlying assumption that the partner is also motivated by it) comes in conflict with "self-interest rationality" (and its underlying assumption that the partner is also motivated by it). 
The essential feature of the Prisoner's Dilemma game is the choice open to each player, namely, to conform, that is, to play the "cooperative strategy" which, if played by both, rewards both; or to defect, that is, to play the "noncooperative strategy" which, if chosen by both, punishes both. If the two players choose different strategies, the conformist is punished more severely and the defector is rewarded more generously than if either had chosen the same strategy as his partner. In all experiments with nonnegotiable Prisoner's Dilemma-type games, both the cooperative and the noncooperative strategies are chosen, to be sure with different frequencies in different games and under different conditions. The question naturally arises regarding the nature of the con- ditions that influence the propensity to conform or to defect. Some recent studies indicate several such dependencies. Deutsch (1958) has investigated the propensity to make the cooperative ("trusting") choice that leads to the Pareto-optimal outcome in a Prisoner's Dilemma game, as it is influenced by the instructions given to the players, namely, a "cooperative orientation" (having joint payoffs in mind), "individualistic orientation" (having only one's own payoff in mind), and a "competitive orientation" (having the difference of the payoffs in mind). The propensity to choose cooperatively has been found to change in the expected direction with the instructions. The same author also investigated the effect of 55$ MATHEMATICAL MODELS OF SOCIAL INTERACTION the opportunity to negotiate and found that this also significantly increases cooperation. Scodel, Minas, Ratoosh, and Lipetz (1959) have found that cooperation tends to decrease (or competition to increase) in the course of a run of 50 plays of the same nonzero-sum game. They also found evidence that the competitive motive plays an important part in the subjects' choices (even with neutral instructions), since even in the games in which no advantage accrued to the single defector, as many as 50 per cent noncooperative choices were made. Lutzker (1960) found a higher propensity to cooperate among subjects rated high on an "internationalist" attitude scale, compared with subjects high on an "isolationist" attitude scale. A control group of unselected subjects was not significantly different from the "internationalists," but their cooperative choices tended to decrease in the course of the run, as did those of the "isolationists," whereas the frequency of cooperative choices of the "internationalists" showed no such trend. Deutsch (1960) found similar differences between subjects with extreme opposite ratings on the F ("authoritarian") scale. Experiments with three-person, nonzero-sum, nonnegotiable games were undertaken at the University of Michigan. The main purpose was to observe the dependence of the over-all frequency of cooperative choices,/, on the payoff matrices. Accordingly, all games were played under pre- sumably the same conditions. The instructions were approximately the same as the individualistic instructions in Deutsch's experiments. Com- munication was not allowed. Subjects played for one mill per point, the winnings or losses being added or subtracted from their subjects' fees. One further feature was added to equalize the conditions in which each game was played, in particular, to avoid progressive learning from one game to another. A trio of players played eight games "at once" in each session; that is, the games were presented in randomized order. 
In this way no game was in any special position in the order of presentation, even though the same trio of subjects played eight different games.

There were two such sets of experimental runs. The first involved 16 three-person groups and the seven games shown in Table 3, the total number of plays ranging from 300 to 500. The second involved 12 three-person groups and the eight games shown in Table 4, with 800 plays in each session. The frequency of cooperative choice f (averaged over all individuals in all plays of each game) was the dependent variable. It was recorded for each game so that the games could be arranged in a sequence in which f decreased monotonically. The problem was to choose an independent variable, that is, an index computed from the payoff matrix of each game.

For the linear system formalizing Homans' propositions (Eqs. 74 to 76), the equilibrium is stable provided that

c₁γ + c₂ + b(β − a₁) > 0,   (77)
(β − a₁)(c₁γ + c₂) − a₂c₁ > 0.   (78)

We see that β > a₁ is necessary to satisfy Eq. 78 and sufficient to satisfy Eq. 77. Translated into words, if the system is to be stabilized, the coefficient β, which regulates the rate of change of F as F departs from the level "appropriate" to a given intensity of interaction, should be greater than the coefficient a₁, the proportionality factor that relates friendliness to intensity (in the absence of activity).

The mathematical model thus provides some leverage for a theory of interactions in a group. Questions of interpretation inevitably arise. If the system is unstable and F becomes negatively infinite, a reasonable interpretation is a dissolution of the group (or a brawl?). It is admittedly difficult to interpret an unlimited growth of F. However, we can always limit the meaningfulness of a model to a certain limited range of values of its variables.

Denote by I*, F*, A*, and E* the values of the corresponding variables at equilibrium. If equilibrium does obtain, the condition dF*/dE* > 0 can be deduced from the model. Furthermore, I*, F*, and A* vanish with E*. This is in accord with Homans' explanation of social disintegration on community and family levels resulting from the disappearance of externally imposed activities (e.g., with unemployment and atrophy of the economic functions of the family). The relations A* > E* and A* < E* can be interpreted as positive or negative "morale." The conditions for positive or negative morale (while equilibrium is preserved) can also be obtained in terms of the relations between the coefficients and appropriately interpreted in social-psychological terms.

If the restriction of linearity is dropped, general solutions of the differential equations are no longer available. In this case we proceed with the investigation of the phase space, exactly as was done with population dynamics (cf. Sec. 1.6) and analogous problems.
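Equations 74 to 76 themselves are not reproduced above; the sketch below therefore assumes the linear form used in Simon's formalization of Homans' propositions (I = a₁F + a₂A, dF/dt = b(I − βF), dA/dt = c₁(F − γA) + c₂(E − A)), which is an assumption of this illustration, and checks conditions 77 and 78 against the eigenvalues of the resulting system for invented coefficient values.

```python
import numpy as np

def stable_by_conditions(a1, a2, b, beta, c1, c2, gamma):
    """Conditions 77 and 78 for the assumed linear Homans-Simon system."""
    cond77 = c1 * gamma + c2 + b * (beta - a1) > 0
    cond78 = (beta - a1) * (c1 * gamma + c2) - a2 * c1 > 0
    return cond77 and cond78

def stable_by_eigenvalues(a1, a2, b, beta, c1, c2, gamma):
    """Direct test: both eigenvalues of the (F, A) system matrix must have negative real parts."""
    m = np.array([[b * (a1 - beta), b * a2],
                  [c1, -(c1 * gamma + c2)]])
    return all(ev.real < 0 for ev in np.linalg.eigvals(m))

cases = [dict(a1=0.5, a2=0.3, b=1.0, beta=1.2, c1=0.4, c2=0.6, gamma=0.8),   # beta > a1: stable
         dict(a1=1.5, a2=0.3, b=1.0, beta=1.2, c1=0.4, c2=0.6, gamma=0.8)]   # beta < a1: unstable
for kw in cases:
    print(stable_by_conditions(**kw), stable_by_eigenvalues(**kw))   # True True, then False False
```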
5.2 A Semiquantitative Model

The variables examined in Sec. 5.1 related to the activity and the socio-emotional atmosphere in the group as a whole. If we inquire on what these variables, in turn, depend, we come upon concepts relating to the interactions among the group members. Examples of such concepts are the degree of unanimity or discord, receptiveness of members to each other's communications, interpersonal attractiveness, etc. These terms appear in group dynamics theory in contexts of more or less rigorous discussion. In particular, Festinger (1950) defines the following:

D: The perceived discrepancy of opinion among the members on an issue.
P: Pressure to communicate with each other.
C: Cohesiveness, that is, average (or total) strength of attractiveness among the members.
U: Pressure to achieve uniformity of opinion.
R: Relevance of the issue to the group.

We note that these terms, like those considered in Sec. 5.1, are aggregative; that is, they pertain to the whole group rather than to the individual members. However, they seem to be derived from a more detailed analysis. For example, we may well consider all of them as determinants of F, the over-all "friendliness," or of I, the intensity of interaction, discussed in Sec. 5.1.

Festinger's hypotheses are statements about the interdependence of the variables denoted by the terms. Simon (1957) translates these hypotheses, in the same way as those of Homans (cf. Sec. 5.1), into mathematical statements, namely,

dD/dt = f(P, L, D),   (79)
P(t) = P(D, U),   (80)
L(t) = L(U),   (81)
dC/dt = g(D, U, C),   (82)
U(t) = U(C, R),   (83)
dR/dt = 0.   (84)

Equations 79 to 84 are weaker than Eqs. 74 to 76. The specific form of the functions on the right is not given. Naturally, the consequences to be derived will also be much weaker. Next, we note that some of the equations involve derivatives and others do not. The latter imply an "instantaneous" adjustment of the dependent variable values to those of the independent variables, whereas the former imply rates of adjustment. A direct dependence of the variable on the left on those on the right obtains in the latter case only at equilibrium (if it is ever attained). Equation 84 says simply that R is a constant in a given situation, determined, say, by the topic under discussion by a group.

In Festinger's treatment the specific dependencies indicated in Eqs. 79 to 84 are stated in the typical semiquantitative ("directional") form prevalent in investigations in which attempts at quantification but no attempts at mathematization have been begun.

1. The pressure on group members to communicate increases with increasing perceived discrepancy of opinion, with the degree of relevance of the issue in question, and with the pressure toward uniformity.
2. The amount of change in opinion resulting from a received communication increases with the pressure toward uniformity and with the cohesiveness related to the recipient.
3. The rate of change of cohesiveness suffers decrements (i.e., becomes smaller if positive and larger in absolute value if negative) as either perceived discrepancy or pressure toward uniformity increases. This rate depends also on the level of cohesiveness.

The last hypothesis relates to the changes in the mutual attractiveness among the members as they depend on discrepancies of opinion and on the importance to the group of preserving uniformity. Simon (1957) introduces this hypothesis in addition to those formally stated by Festinger in order "to make the dynamic system complete." He argues that in the interpretation of some empirical studies this hypothesis is actually implicit.

These "directional" hypotheses can be formalized by statements about the signs of partial derivatives of the functions on the right side of Eqs. 79 to 84. Thus P_D, L_U, U_C, P_U, and U_R are implied to be positive by the hypotheses, whereas f_P, g_D, g_U, and f_L are implied to be negative. (The subscripts indicate the variables with respect to which the partial derivatives are to be taken.) Thus the signs of 9 of the 11 partial derivatives are implied by the verbally stated hypotheses. Two others remain, namely, f_D and g_C.
To determine their signs, Simon examined the situation in equilibrium for a given value of R, that is, when dD/dt and dC/dt are equal to zero. Using the chain rule for differentiation, we obtain

f_P dP + f_L dL + f_D dD = 0,   (85)
g_D dD + g_U dU + g_C dC = 0.   (86)

If the system moves from one equilibrium to another (say, as the independent, experimentally imposed parameter R changes in value), Eqs. 85 and 86 must hold throughout, quite analogously to the situation in thermodynamics in which a system goes through a sequence of reversible states. Also, if P and L are large, Eq. 79 implies that D will be pushed to a lower equilibrium level. Hence ∂D/∂P = −(f_P/f_D) < 0 and ∂D/∂L = −(f_L/f_D) < 0, from which we deduce f_D < 0. By a similar argument, Simon showed g_C > 0.

This completes the mathematical analysis of Festinger's model. Experimental studies can now be examined with a view to deciding whether the results are relevant to the deduced predictions and, if so, whether the predictions are corroborated or refuted. It goes without saying that experimental data are relevant to the model only to the extent that a method of measuring the quantities involved is indicated. The "weakness" of the present mathematical model, however, necessitates only weak measurement scales. In fact, since only relative magnitudes are compared, only an ordinal scale of measurement is required. Indices for such scales have been offered by many researchers. Simon analyzed the data obtained in an experiment by Back (1951) and in the field by Festinger, Schachter, and Back (1950) in the light of his mathematical interpretation of the proposed theory.

5.3 Markov Chain Models

The usefulness of the "classical" models, discussed in Secs. 5.1 and 5.2, is severely limited by difficulties of measuring the quantities represented in them. In recent years mathematical theories of group dynamics have been developing along quite different lines. Since most of these developments are being pursued by workers whose principal orientation is mathematical, the central variables of the models are characteristically not indices of "psychological states" of special interest to group dynamicists, for example, friendliness, cohesion, and rejection, but rather whatever happens to be easily and obviously quantifiable, such as easily identifiable acts, which can be quantified in terms of temporal or relative frequency. Already in learning theory, this focusing of interest on countable acts has resulted in a rapid and fruitful development of mathematical models of the learning process, which, in some instances, have received strong corroboration. Mathematical group dynamics is basically an extension of similar methods to situations in which interactions among individuals are the central acts, and this provides justification for calling this area of research "group dynamics." It is dynamics because it involves the study of the time courses of processes; it is group dynamics because the events are defined in terms of interactions among the several components of a "many-headed subject." It is hoped that the concepts of central interest to the "psychologically oriented" group dynamicists will not remain without attention. Indeed, it may turn out that the holistic concepts will "emerge" from the atomistic studies.

A mathematical tool which is receiving increasing attention in the mathematical approach to group dynamics is the so-called Markov chain.
Suppose that a system can be in any one of n states and that the probability of being in state i is p_i. If these probabilities change with time, the p_i can be considered as functions of t, that is, p_i(t). In what follows t will represent a discrete rather than a continuous variable. In successive moments, t = 0, 1, 2, . . . , m, the system will pass through a succession of states, being, however, always in one of the n possible states. If the system finds itself at time t in state i, we can conceive of a probability that at time t + 1 the system will be in state j. In general, this probability can be expected to depend on i, j, t, possibly on p_i, and even on the entire history of the process. If it depends only on i and j, we have the condition defining a Markov chain. In other words, a Markov chain describes a process in which, for each pair of states (i, j), there is a constant transition probability p_ij.

Consider, for example, an experiment in which a subject must predict on each trial which of two events will occur, the event corresponding to response A_1 occurring with probability π (π > ½) and the other with probability 1 − π, regardless of the subject's choices. If this situation is viewed as a trivial game (against nature), game theory prescribes, on the basis of maximizing expected gain (assuming only correct guesses rewarded), the choice of A_1 100 per cent of the time. A stochastic learning model, based on noncontingent reinforcements, on the other hand, predicts an asymptotic frequency π for A_1. Some experiments, particularly with nonhuman subjects but some with human subjects, corroborate the result predicted by stochastic learning theory.

Similar departures from game-theoretic results can be deduced for stochastic learning models applied to game situations. For example, in the experiments of Suppes and Atkinson (1960) subjects play a 2 × 2 game in which the payoffs are reinforcement probabilities. Thus, if A_i and B_i (i = 1, 2) represent the responses of the two subjects, respectively, we have a game defined by a matrix whose entries are the probabilities of reinforcement of the corresponding pairs of responses. Two-person, zero-sum games of this kind may have saddle points; the most general two-person, zero-sum games are those without saddle points, which require mixed strategies for minimax solutions. In the experiments reported by Suppes and Atkinson the entire range of complexity (except m, n > 2) was used. However, aside from the fact that in the simplest games the game matrix was not known to the players, the payoffs were probabilistic reinforcement schedules, as in the one-person game example.

The reason for introducing probabilistic payoffs in experiments with human subjects designed to test a learning theory is obvious. If responses were reinforced with certainty in 2 × 2 games with saddle points (especially where both players had sure-thing strategies), "insights" would have occurred very quickly, and, as a consequence, the situations would not lend themselves to treatment by stochastic models. We have already seen that even in games against nature with probabilistic reinforcement schedules human subjects sometimes fail to maximize expected gain: they do not choose the more frequently reinforced response exclusively. In two-person games this is even more likely to be true, as it is, in fact, in the experiments reported by Suppes and Atkinson.

It would seem that the introduction of determinate numerical payoffs in games other than 2 × 2 zero-sum games with saddle points would not impair the usefulness of the Markov model. This is particularly true in Prisoner's Dilemma-type games, in which, in the absence of communication, a solution is not unequivocally prescribed.
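For any chain with constant transition probabilities, the asymptotic (stationary) state probabilities used in the following paragraphs can be obtained numerically. The sketch below is a generic illustration (the transition matrix is invented), not a model taken from the studies cited.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_steps=100000):
    """Asymptotic state probabilities of a Markov chain with transition matrix P (rows sum to 1).

    Uses power iteration, which converges for regular (irreducible, aperiodic) chains.
    """
    P = np.asarray(P, dtype=float)
    pi = np.full(P.shape[0], 1.0 / P.shape[0])   # arbitrary starting distribution
    for _ in range(max_steps):
        nxt = pi @ P                             # one step of the chain
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi

# Invented two-state example: the system tends to stay where it is.
P = [[0.9, 0.1],
     [0.2, 0.8]]
print(stationary_distribution(P))   # [0.6667 0.3333], proportional to (0.2, 0.1)
```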
We have already seen (cf. Sec. 4.4) that in such games subjects typically oscillate between the cooperative and the noncooperative choice, even if the payoffs are determinate and known. One might therefore postulate the following four states, in which each subject playing such a game might find himself:

1. He has played cooperatively and was rewarded (i.e., the other has played cooperatively also).
2. He has played cooperatively and was punished.
3. He has played noncooperatively and was rewarded.
4. He has played noncooperatively and was punished.

The four states of the individual subject correspond in this case also to the four possible states of the system, namely, (CC), (CD), (DC), and (DD), where C stands for cooperation and D for defection of each player.

We see now that if we assign probability 1 to the event that a subject will stay with any rewarded state and probability 0 to the event that he will stay with a punished state, then (CC) will be an absorbing state into which the system will pass after at most two plays. Since the data show nothing of the kind, we might try the next simplest model, namely, assign probability 1 to the event that a subject will stay in a rewarded noncooperative state (where the payoff is largest) and the probability θ (0 < θ < 1) to the event that he will stay in the rewarded cooperative state, the probabilities of staying in either of the punished states being zero. These transition probabilities of individual states then induce the following matrix of transition probabilities among the system states:

        (CC)        (CD)        (DC)        (DD)
(CC)    θ²          θ(1 − θ)    (1 − θ)θ    (1 − θ)²
(CD)    0           0           0           1
(DC)    0           0           0           1
(DD)    1           0           0           0

The asymptotic probabilities of the states are

p(CC) = 1 / (2 + 2θ − 3θ²),   (90)
p(CD) = p(DC) = θ(1 − θ) / (2 + 2θ − 3θ²),   (91)
p(DD) = (1 − θ²) / (2 + 2θ − 3θ²).   (92)

From these equations it follows that the over-all frequency of cooperative choice, as determined by the parameter θ, is

f = p(CC) + ½[p(CD) + p(DC)] = (1 + θ − θ²) / (2 + 2θ − 3θ²).

If the choices were independent, we would have p(CC) = f². The Markov model, however, predicts p(CC) > f², which can be directly verified. Thus, even if we confine ourselves to examining the static (equilibrium) aspects of the process, the Markov model suggests relations not implied in the purely static descriptive theory outlined in Sec. 4.4.

Elaboration of this approach through the assignment of finite "next choice" probabilities to the individual states and the deduction of the corresponding transition probabilities of the system states is straightforward. Details of this method and of its experimental applications to three-person games are given in Rapoport et al. (1962).

References

*Allanson, J. T. Some properties of a randomly connected neural network. In C. Cherry (Ed.), Information theory. New York: Academic Press; London: Butterworth Scientific Publications, 1956. Pp. 303-313.
Asch, S. E. Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychol. Monogr., 1956, 9.
Back, K. W. Influence through social communication. J. abnorm. soc. Psychol., 1951, 46, 9-23.
Bailey, N. T. J. The mathematical theory of epidemics. New York: Hafner, 1957.
Bales, R. F. Interaction process analysis; a method for the study of small groups. Cambridge, Mass.: Addison-Wesley Press, 1950.
Bavelas, A. A mathematical model for group structure. Applied Anthropology, 1948, 7, 16-30.
Bowley, A. C. On bilateral monopoly. Econ. J., 1928, 38, 651-659.
Braithwaite, R. B. Theory of games as a tool for the moral philosopher. Cambridge, England: Cambridge Univer. Press, 1955.
Burke, C. J. Applications of a linear model to two-person interactions. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford : Stanford Univer. Press, 1959. Pp. 180-203. Cartwright, D., & Harary, F. Structural balance: a generalization of Heider's theory. Psychol. Rev., 1956, 63, 277-293. Cohen, B. P. A probability model for conformity. Sociometry, 1958, 21, 69-81. Coleman, J. S. The mathematical study of small groups. In H. Solomon (Ed.), Mathematical thinking in the measurement of behavior. Glencoe, 111.: The Free Press, 1960. Pp. 1-149. Collias, N. E. Statistical factors which make for success in initial encounters between hens. Amer. Naturalist, 1943, 77, 519-538. Cournot, A. A. Recherches sur les principes maihematiques de la theorie des richesse. English translation : Researches into the mathematical principles of the theory of wealth. New York: Macmillan, 1927. Davis, R. L. The numbers of structures of finite relations. Proc. Amer. Math. Soc., 1953, 4, 486-495. Davis, R. L. Structure of dominance relations. Bull. Math. Biophysics, 1954, 16, 131-140. Deutsch, M. Trust and suspicion. J. Conflict Resolution, 1958, 2, 267-279. * The starred items are relevant to, although not specifically mentioned in, the text. REFERENCES J77 Deutsch, M. Trust, trustworthiness, and the F scale. /. abnorm. sec. Psychol. 9 1960, 61, 138-140. Deutsch, M., & Krauss, R. M. The effect of threat upon interpersonal bargaining. /. abnorm. soc. Psychol., 1960, 61, 181-189. Dodd, S. C., Rainboth, E. D., & Nehnevajsa, J., Revere Studies on Interaction. U.S. Air Force report, unpublished. Washington Public Opinion Lab., 1952. Festinger, L. The analysis of sociograms using matrix algebra. Human ReL, 1949, 2, 153-158. Festinger, L. Informal social communication. Psychol. Rev., 1950, 57, 271-282. Festinger, L., Schachter, S., & Back, K. W. Social pressures in informal groups. New York: Harper, 1950. Flood, M. M. A stochastic model for social interaction. Trans. N. Y. Acad. Sci., 1954,- 16, 202-205. (a) Flood, M. M. Game learning theory and some decision-making experiments. In R. M. Thrall, C. H. Coombs, & R. L. Davis (Eds.), Decision Processes. New York: Wiley, 1954. Pp. 139-158. (b) Forsyth, Elaine, & Katz, L. A matrix approach to the analysis of sociometric data. Sociometry, 1946, 9, 340-347. *Foster, C. C., & Rapoport, A. The case of the forgetful burglar. Math. Monthly, 1958, 65, 71-76. Fouraker, L. E. Professor Fellner's bilateral monopoly theory. Southern Econ. Journal, 1957, 24, 182-189. Gause, G. F. The struggle for existence, Baltimore: William & Wilkins, 1934. Gause, G. F. Verifications experimentales de la th6orie mathematique de la lutte pour la vie. Actualites scientifiques et industrielles . Paris: Hermann et Cie., 1935. Harary, F. On the notion of balance of a signed graph. Mich. Math. Journ., 1954, 2, 143-146. Harary, F. Graph theoretic methods in the management sciences. Management Sci., 1959, 5, 387-403. Harary, F., & Norman, R. Z. Graph theory as a mathematical model in social science. Ann Arbor: Institute for Social Research, 1953. Hays, D. G., & Bush, R. R. A study of group action. Amer. Sociol. Rev., 1954, 19, 694-701. Heider, F. Attitudes and cognitive organization. /. Psychol. , 1946, 21, 107-112. Hoffman, H. Symbolic logic and the analysis of social organization. Behav. Sci., 1959, 4, 288-298. Homans, G. C. The human group. New York: Harper, 1950. Katz, L. A new status index derived from sociometric analysis. Psychometrika, 1953, 18, 39-43. 
Katz, L., & Powell, J. H. The number of locally restricted directed graphs. Proc. Amer. Math. Soc., 1954, 5, 621-626. Kemeny, J. G., Snell, J. L., & Thompson, G. L. Introduction to finite mathematics. Englewood Cliffs, N.J.: Prentice Hall, 1957. Konig, D. Theorie der endlichen und unendlichen Graphen. Leipzig: Akademische Verlagsgesellschaft, 1936. Kostitzin, V. A. Biologic mathematique. Paris: Librarie Armand Colin, 1937. "Landahl, H. D. Outline of a matrix calculus for neural nets. Bull. Math. Biophysics, 1947, 9, 99-108. Landahl, H. p. Population growth under the influence of random dispersal. Bull. Math. Biophysics, 1957, 19, 171-186. MATHEMATICAL MODELS OF SOCIAL INTERACTION Landahl, H. D., & Runge, R. Outline of a matrix algebra for neutral nets. Bull. Math. Biophysics, 1946, 8, 75-81. Landau, H. G. On dominance relations and the structure of animal societies: I. Effect of inherent characteristics. Bull. Math. Biophysics, 1951, 13, 1-19. Landau, H. G. On some problems of random nets. Bull. Math. Biophysics, 1952, 14, 203-212. Landau, H. G. On dominance relations and the structure of animal societies: III. The condition for a score structure. Bull. Math. Biophysics, 1953, 15, 143-148. Landau, H. G., & Rapoport, A. Contributions to the mathematical theory of contagion and spread of information. Bull. Math. Biophysics, 1953, 15, 173-183. Leeman, C. P. Patterns of sociometric choice in small groups : a mathematical model and related experimentation. Sociometry, 1952, 15, 220-243. Luce, R. D. Connectivity and generalized cliques in sociometric group structures. Psychometrika, 1950, 15, 169-190. Luce, R. D., & Perry, A. D. A method of matrix analysis of group structure. Psycho- metrika, 1949, 14,95-116. Luce, R. D., Macy, J., Jr., & Tagiuri, R. A statistical model for relational analysis. • Psychometrika, 1955, 20, 319-327. Lutzker, D. R. Internationalism as a predictor of cooperative behavior. J. Conflict Resolution, 1960, 4, 426-430. Nash, J. F. Equilibrium points in w-person games. Proc. Nat. Acad. Sci., U.S.A., 1950, 36, 48-49. Newcomb, T. M. An approach to the study of communicative acts. Psychol. Rev., 1953, 60, 393^04. Newcomb, T. M. The prediction of interpersonal attraction. Amer. Psychologist, 1956, 11, 575-586. Neyman, J., Park, T., & Scott, Elizabeth L. Struggle for existence. The Tribolium model. Biological and statistical aspects. In J. Neyman (Ed.), Proc. Third Berkeley Symposium on Math. Stat. and Probability. Berkeley: Univer. of California Press, 1955. Pp. 41-79. *Polya, G. Sur la nombre des isomere de certains composes chimiques. Comptes Rendus Acad. Sci., Paris, 1936, 202, 1554-1556. Raiffa, H. Arbitration schemes for generalized two-person games. Rept. M 720-1, R 30, Engrg. Res. Inst., University of Michigan, Ann Arbor, 1951. *Rapoport, A. Cycle distribution in random nets. Bull. Math. Biophysics, 1948, 10, 145-157. Rapoport, A. A probabilistic approach to animal sociology: I, II. Bull. Math. Biophysics, 1949, 11, 183-196; 273-282. *Rapoport, A. Nets with distance bias. Bull. Math. Biophysics, 1951, 13, 85-91. *Rapoport, A. The probability distribution of distinct hits on closely packed targets. Bull. Math. Biophysics, 1951, 13, 133-137. Rapoport, A. Spread of information through a population with sociostructural bias : I. Assumption of transitivity. II. Various models with partial transitivity. Bull. Math. Biophysics, 1953, 15, 523-533, 535-543. Rapoport, A. Some game-theoretical aspects of parasitism and symbiosis. Bull. Math. Biophysics, 1956, 18, 15-30. 
Rapoport, A., Gyr, J., Chammah, A., & Dwyer, J. Studies of three-person non-zero-sum, non-negotiable games. Behav. Sci., 1962, 7, 38-58.
Rashevsky, N. Studies in mathematical theory of human relations. Psychometrika, 1939, 4, 221-239.
Rashevsky, N. Mathematical biology of social relations. Chicago: Univer. of Chicago Press, 1951.
*Rashevsky, N. Topology and life: In search of several mathematical principles in biology and sociology. Bull. Math. Biophysics, 1954, 16, 317-348.
*Rashevsky, N. Some theorems in topology and a possible biological application. Bull. Math. Biophysics, 1955, 17, 111-129.
*Rashevsky, N. Some remarks on topological biology. Bull. Math. Biophysics, 1955, 17, 207-218.
*Rashevsky, N. The geometrization of biology. Bull. Math. Biophysics, 1956, 18, 31-56.
Rashevsky, N. Contributions to the theory of imitative behavior. Bull. Math. Biophysics, 1957, 19, 91-119.
Richardson, L. F. Generalized foreign policy. Brit. J. Psychol. Monograph Supplements, No. 23, 1939.
Richardson, L. F. War moods: I, II. Psychometrika, 1948, 13, 147-174; 197-232.
Sauermann, H., & Selten, R. Ein Oligopolexperiment. Z. ges. Staatswissenschaft, 1959, 115, 427-471.
Schelling, T. C. The strategy of conflict. Cambridge, Mass.: Harvard Univer. Press, 1960.
Schjelderup-Ebbe, T. Beiträge zur Sozialpsychologie des Haushuhns. Z. Psychologie, 1922, 88, 225-252.
Scodel, A., Minas, J. S., Ratoosh, P., & Lipetz, M. Some descriptive aspects of two-person non-zero-sum games. J. Conflict Resolution, 1959, 3, 114-119.
*Shimbel, A. An analysis of theoretical systems of differentiating nervous tissue. Bull. Math. Biophysics, 1948, 10, 131-143.
*Shimbel, A. Applications of matrix algebra to communication nets. Bull. Math. Biophysics, 1951, 13, 165-178.
Siegel, S., & Fouraker, L. E. Bargaining and group decision making. New York: McGraw-Hill, 1960.
Simon, H. A. Models of man. New York: Wiley, 1957.
Slobodkin, L. B. Formal properties of animal communities. General Systems, 1958, 3, 93-100.
Solomonoff, R., & Rapoport, A. Connectivity of random nets. Bull. Math. Biophysics, 1951, 13, 107-117.
Stigler, G. J. The theory of price. New York: Macmillan, 1952.
Suppes, P., & Atkinson, R. C. Markov learning models for multiperson interactions. Stanford: Stanford Univer. Press, 1960.
Trucco, E. On the information content of graphs: Compound symbols; different states for each point. Bull. Math. Biophysics, 1956, 18, 237-253.
Volterra, V. Leçons sur la théorie mathématique de la lutte pour la vie. Paris: Gauthier-Villars, 1931.
Von Neumann, J., & Morgenstern, O. Theory of games and economic behavior (1st ed.). Princeton: Princeton Univer. Press, 1944.
*Wright, S. The roles of mutation, inbreeding, crossbreeding, and selection in evolution. Proc. Sixth Int. Congress on Genetics, 1932, 1, 356-366.
Zajonc, R. B. The concepts of balance, congruity, and dissonance. Public Opinion Quart., 1960, 24, 280-296.

Author Index

Page numbers in boldface indicate bibliography references.

Ajdukiewicz, K., 411, 412, 415 Allanson, J. T., 576 Alt, F. L., 417 Anderson, N. H., 80, 99, 100, 108, 117 Anscombe, F. J., 96, 117 Apostel, L., 490 Arrow, K. J., 118, 265, 266, 267 Asch, S. E., 570, 576 Atkinson, R. C., 119, 125, 133, 134, 140, 141, 154, 163, 164, 170, 173, 179, 181, 183, 187, 194, 195, 233, 234, 238, 250, 252, 256, 257, 258, 259, 262, 264, 265, 267, 268, 571, 572, 573, 574, 579 Attneave, F., 439, 488 Audley, R. J., 17, 31, 67, 100, 117 Back, K. W., 567, 576, 577 Bailey, N. T. J., 39, 117, 507, 576 Bales, R.
F., 562, 576 Bar-Hillel, Y., 333, 367, 378, 379, 380, 382, 383, 385, 386, 387, 391, 394, 411,413,415,438,488 Barucha-Reid, A. T., 76, 117 Baskin, W., 321, 418 Bavelas, A., 533, 576 Behrend, E. R., 62, 117 Bellman, R., 83 Berge, C., 278, 319 Berkson,J., 96, 117 Billingsley, P., 266 Birdsall, T. G., 250, 256, 268 Bitterman, M. E., 62, 117 Bloch, B., 311, 313, 319 Bloomfield, L., 309 Booth, A. D., 418 Bower, G'. H., 130, 131, 133, 134, 136, 137, 139, 140, 164, 209, 243, 257, 266 Bowley, A. C., 555, 576 Braithwaite, R. B., 550, 576 Bruner, J.S., 319,319 Burke, C. J., 163, 181, 194, 195, 198, 234, 238, 250, 266, 267, 569, 576 Burton, N. G., 443, 488 Bush, R. R., 4, 6, 9, 10, 11, 13, 14, 19, 20, 22, 31, 32, 34, 37, 42, 45, 48, 50, 53, 61, 62, 66, 74, 75, 76, 78, 79, 82, 83, 85, 86, 87, 89, 90, 92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 107, 110, 112, 113, 117, 118, 119, 120, 125, 140, 159, 200, 213, 226, 234, 238, 250, 256, 257, 266, 267, 268, 569, 576, 577 Cane, V. R., 26, 116, 118 Carnap, R., 438, 488 Carterette, T. S., 201, 206, 266 Cartwright, D., 541, 576 Chammah, A., 576, 578 Chapanis, A., 439, 489 Cherry, C., 439, 489, 576 Chomsky, N., 276, 285, 292, 293, 295, 297, 299, 302, 303, 304, 308, 309, 315, 319, 320, 325, 334, 336, 347, 348, 360, 363, 365, 367, 369, 370, 376, 386, 393, 394, 395, 396, 398, 403, 408, 411, 415, 416, 418, 444, 448, 489 Chung, K. L., 489 Cobb, S., 319 Cohen, B. P., 570, 576 Cole, M., 163, 181,267 Coleman, J. S., 576 Collias, N. E., 545, 576 Condon, E. V., 457, 489 Coombs, C. H., 118, 577 Cooper, F. S., 313,320 Cournot, A. A., 546, 547, 555, 576 AUTHOR INDEX Cox, D. R., 35, 96, 116,118 Criswell, J., 265 Cronbach,L.J.,439, 489 Crothers, E. J., 128, 207, 211, 222, 223, 266, 267 Culik, K., 293, 320,333, 416 Curry, H., 4 11, 416 Davis, M., 291, 320, 354, 358, 416 Davis, R. L., 118, 542, 576, 577 Delattre,P., 313, 320 Detambel, M. H., 188, 266 Deutsch, M., 549, 550, 557, 558, 576, 577 Dodd, S. C., 520, 577 Dwyer,J., 576,578 Edwards, W., 62, 115, 118 Eifermann, R. R., 482, 489 Elias, P., 444 Elson, B., 410, 416 Estes, W. K., 4, 6, 7, 13, 32, 48, 61, 62, 75, 85, 89, 98, 117, 118, 119, 120, 124, 125, 126, 128, 141, 153, 159, 163, 164, 169, 170, 173, 179, 181, 183, 193, 194, 195, 197, 198, 200, 202, 207, 209, 211, 213, 216, 219, 221, 222, 223, 226, 228, 229, 231, 233, 234, 244, 250, 266, 267, 268, 576 Estoup,J.B., 457, 489 Fano, R. M., 452, 489 Fant, C. G.M., 310,320 Feinstein, A., 451, 452, 489 Feldman, J., 62, 119 Feller, W., 31, 63, 68, 119, 123, 131, 146, 176, 199, 267, 424, 426, 489 Festinger, L., 536, 537, 565, 566, 567, 577 Feys,R.,411,416 Fletcher, H., 465, 489 Flood, M.M., 571, 577 Floyd, R. W., 416 Fodor, J., 319, 320, 321, 329, 416, 466, 489 Forsyth, E, 535, 577 Foster, C. C., 577 Fouraker, L. E., 551, 554, 555, 577, 579 Frankmann, J. P., 194, 195, 250, 267 Frick, F. C., 443, 484, 489, 490 Friedman, E. A., 440, 464, 490 Friedman, M. D., 489 Friedman, M. P., 163, 181, 267 Fritz, E. L., 443, 489 Gardner, R. A., 188,267 Gaifman, C., 411, 413, 415 Galanter, E., 9, 14, 50, 53, 74, 75, 90, 92, 94, 95, 96, 97, 100, 102, 103, 112, 113, 118, 119, 238, 430, 485, 486, 490 Garner, W. R., 439, 489 Gause,G.F., 511,577 Ginsberg, R., 164, 215, 248, 268 Ginsburg, S., 370, 391, 393, 402, 409, 410, 416 Gnedenko, B. V., 457, 489 Goldberg, S., 84, 119, 186, 267 Goodnow, J. J., 50, 70, 71, 72, 73, 74, 75, 97, 98, 101, 108 Grant, D. A., 108, 117 Greibach, S., 388,416 Grier, G. 
W., Jr., 443, 489 Gross, M., 414, 416 Gulliksen, H., 31, 37, 119 Guttman, N., 201, 267 Gyr, J., 576, 578 Halle, M., 308, 309, 310, 313, 315, 319, 320, 465, 489 Hanania, M. I., 13, 17, 106, 110, 119 Hannan, E.J., 116, 119 Harary, F., 530, 531, 536, 537, 539, 541, 576, 577 Hardy, G. H., 443, 489 Harris, Z. S., 299, 320, 378, 410, 411, 416 Hartley, R. V., 43 1, 432, 489 Hayes, K. J., 109, 119 Hays, D. G., 569, 577 Heider, F., 539, 577 Heise, G. A., 465, 490 Hill, A. A., 319 Hillner, K., 223, 268 Hiz,H., 411, 416 Hodges, J. L., Jr., 97, 119 Hoffman, H., 535, 577 Homans, G. C., 563, 564, 565, 577 AUTHOR INDEX 5*3 Hopkins, B. L., 128, 207, 211, 222, 223, 244, 267 Hovland, C. I., 48 1,489 Huffman, D. A., 452, 454, 489 Hull, C. L., 11, 25, 30, 35, 39, 54, 55, 119,206,213,267 Humboldt, W. von, 3 19, 320 Irwin, F. W., 5, 119 Jackson, W., 490 Jakobson, R., 308, 310, 311, 319, 320, 416, 417, 490, 491 Jarvik, M. E., 115, 119, 179, 267 Jeffress, L. A.s 417 Jonckheere, A. R., 17, 31, 67, 95, 100, 117 Jones, M. R., 266 Joos, M., 319 Jordan, C., 148, 186, 267 Kalish, H.I.,201,267 Kanal, L., 75, 83, 85, 86, 89, 119 Karlin, S., 75, 118, 119, 265, 266, 267 Karp, R. M., 486, 489 Katz, J., 292, 319, 320, 321, 329, 416, 466, 489 Katz, L., 535, 542, 577 Kemeny, J. G.} 123, 132, 146, 228, 267, 534, 577 Kendall, D. G., 39, 119, 507, 508 Khinchin, A. L, 432, 489 Khristian, J., 321 Kinchla, R. A., 251, 253, 255, 267 Kleene, S. C., 333, 334, 336, 417 Koch, S., 118, 266 Kohler, W., 328, 417 Kolmogorov, A. N., 457, 489 Konig, D., 532, 533,577 Kostitzin,V. A., 510,577 Kraft, L. G., 282, 320 Krauss, R. M., 550, 577 Kulagina, O. S., 411,417 Kuroda, S.-Y., 379, 380 Laberge, D. L,, 202 Lambek, J,,411, 413,417 Lamperti, J., 62, 64, 75, 89, 119, 236, 268 Land, V., 223, 268 Landabl, H. D., 508, 512,577 Landau, H. G., 506, 543, 544, 545, 570, 578 Landweber, P. S., 379, 381, 417 Langendoen, T., 398, 417 Lashley, K. S., 240, 326, 376, 417 Leeman, C. P., 545, 578 Lees,R.B., 297,304, 320 Lesniewski, S., 411 Liberman, A. M., 313, 320 Lichten, W., 465, 490 Licklider,J.CR., 443,488 Lipetz, M., 548, 558, 579 Littlewood, J. E., 443, 489 Locke, W. N., 418 Logan, F. A., 9, 20, 119 Lorge, I, 456, 491 Luce, R. D., 10, 18, 19, 25, 26, 27, 36, 37, 50, 53, 62, 63, 64, 74, 89, 94, 95, 96, 100, 101, 102, 113, 118, 119, 234, 238, 250, 256, 266, 268, 437, 439, 490, 536, 537, 545, 546, 578 Lukoff, F., 315,320 Lutzker, D. R., 558, 578 McCawley, J. D., 321 McGill, W. J., 107, 119 McMillan, B., 283, 321, 425, 490 McNaughton, R., 331, 334, 352, 417 MacKay, D. M., 319, 320 Macy, J., Jr., 545, 578 Mandelbaum, D. G., 321 Mandelbrot, B., 283, 320, 456, 457, 458, 463, 490 Markov, A. A., 409, 423, 424, 490 Marschak, J., 456, 490 Marx, M., 265 Matthews, G. H., 304, 320, 370, 373, 374,414,417,469,476,490 Mehler, J., 482 Mill, J., 275 Miller, G. A., 107, 119, 280, 321, 334, 336, 416, 417, 429, 430, 439, 440, 459, 461, 462, 464, 465, 476, 482, 484, 485, 486, 489, 490 Miller, N. E., 213 Millward, R. B., 163, 181, 267 Minas, J. S., 548, 558, 579 Morf, A., 490 5*4 AUTHOR INDEX Morgenstern, O., 572, 579 Mosteller, F., 4, 6, 9, 10, 11, 13, 14, 19, 20, 22, 31, 32, 37, 38, 42, 45, 48, 50, 51, 52, 53, 61, 62, 66, 75, 76, 82, 83, 85, 86, 89, 90, 92, 94, 95, 96, 98, 99, 101, 103, 104, 107, 110, 111, 118, 119, 120, 159, 200, 213, 226, 234, 250, 256, 257, 266 Mourer, O. H., 213 Murchison, C., 417 Myhill, J., 338, 417 Nagel, E., 319 Nash, L F., 548, 550, 555, 578 Nehnevajsa, J., 520, 577 Newcomb, T. M., 539, 541, 578 Newell, A., 62, 119, 340, 417, 484, 491 Newman, E. 
B., 424, 461, 462, 464, 490, 491 Neyman, J., 511,578 Nicks, D. C., 13, 115, 116, 119, 179, 268 Norman, R. Z., 530, 53 1, 577 Oettinger, A., 340, 343, 417 Osgood, C. E., 275, 321 Pareto, V., 457, 491, 548, 549, 555 Parikh, R., 366, 367, 389, 391, 417 Park, T., 511,578 Patterson, G. W., 409, 418 Penfield, W., 319 Pereboom, A. C., 109, 119 Perles, M., 367, 379, 380, 382, 383, 385, 386,387,391,394,415 Perry, A. D., 53 6, 53 7, 578 Peterson, L. R., 223, 268 Pickett, V. B., 410,416 Pike, K. L., 410 Polya, G., 443,489,578 Popper, J., 250, 268 Post, E., 358, 382, 383, 417 Postal, P., 297, 304, 321, 378, 414, 417 PoweJU, J. H., 542, 577 Pribram, K., 430, 485, 486, 490 Pushkin, A. S., 423, 424 Quastler, H., 439, 489, 490 Rabin, M., 333, 334, 337, 338, 379, 417 Raiffa, H., 234, 268, 550, 555, 578 Rainboth, E. D., 520, 577 Rappaport, A., 506, 519, 544, 545, 548, 570, 576, 577, 578, 579 Rashevsky, N., 500, 501, 502, 503, 504, 548, 578, 579 Ratoosh, P., 548, 558, 579 Restle, F., 62, 104, 119, 202, 256, 257, 268 Rhodes, I., 340 Ricardo, D., 549 Rice, H. G., 370, 402, 409, 416 Richardson, L. F., 498, 500, 501, 502, 509,510,563,579 Ritchie, R. W., 352, 417 Rogers, H., 291, 321, 354, 418 Rose, G. F., 391, 393, 402, 416 Rousseau,!. J., 549 Runge, R., 577 Saltzman,D., 223,268 Sapir, E., 309, 310, 321 Sardinas, A. A., 409, 418 Sauermann, H., 551, 579 Saussure, F. de, 309, 321, 327, 328, 329, 330, 414, 418 Schachter, S., 567, 577 Schatz,C.D., 313, 321 Scheinberg, S., 362, 367, 380, 418 Schelling, T. C., 555, 579 Schjelderup-Ebbe, T., 545, 579 Schmitt, S. A., 321 Schutzenberger, M. P., 279, 280, 281, 282, 321, 337, 345, 347, 348, 352, 370, 376, 381, 383, 386, 388, 391, 393, 403, 406, 407, 408, 409, 418 Scodel, A., 548, 558, 579 Scott, D., 333, 334, 337, 338, 379, 417 Scott, E. L., 511,578 SeBreny, J., 193 Self ridge, J. A., 336, 417, 429, 490 Selten, R., 551, 579 Shamir, E., 333, 367, 374, 378, 379, 380, 382, 383, 385, 386, 387, 391, 394, 411,413,415,418 Shannon, C. E., 273, 321, 336, 418, 423, 428, 431, 432, 439, 440, 441, 443, 452, A^t Shapiro, H. N., 83 AUTHOR INDEX 5*5 Shaw, J.C., 340,417,484,491 Sheffield, F. D., 113,120 Shepherdson, J. C., 338, 418 Shimbel, A., 579 Siegel, S., 554, 579 Silverman, R. A., 489 Simon, H. A., 340, 417, 484, 491, 563, 565, 566, 567, 579 Skinner, B. F., 474, 491 Slobodkin, L.B.,511,579 Smith, A., 549 Smoke, K. L., 48 1,491 Snell, J. L., 123, 132, 146, 228, 267, 534, 577 Solomon, H., 265, 576 Solomon, H. C., 319 Solomon, R. L., 50, 53, 75, 92, 95, 96, 97, 101, 103, 104, 110, 111, 215, 268 Solomonoff, R., 378, 506, 579 Somers, H. H., 430, 491 Spence, K. W., 206, 268 Sternberg, S. H., 14, 33, 34, 50, 65, 66, 70, 75, 85, 89, 90, 94, 95, 99, 100, 102, 108, 118, 120, 140, 226, 266 Stevens, K. N., 319, 320, 321, 465, 489 Stevens, S. S., 202, 204, 268 Stigler,G.J.,555,579 Straughan, J. H., 7, 13, 61, 98, 118, 141, 267 Suci, G. J., 275, 321 Sumby, W. H., 443, 489 Suppes, P., 63, 64, 75, 85, 89, 118, 119, 125, 133, 134, 141, 153, 154, 159, 163, 164, 170, 173, 179, 181, 183, 187, 200, 213, 215, 216, 226, 228, 233, 234, 236, 238, 248, 265, 266, 267, 268, 319, 571, 572, 573, 574, 579 Suszko,R., 41 1,418 Sweet, H., 309, 321 Swets,LA.,250,256,268 Tagiuri, R., 545, 578 Tannenbaum, P. H., 275, 321 Tanner, W. P., Jr., 250, 256, 268 Tarski, A., 319 Tatsuoka, M., 66, 82, 83, 86, 119, 120 Theios,J.,213,214,215,248,268 Thompson, G. L., 20, 118, 123, 132, 267, 534, 577 Thorndike, E. L, 456, 491 Thorpe, W. H., 118 Thrall, R. M., 118, 577 Thurstone, L. 
L., 11, 31, 39, 40, 120 Toda, M., 442, 491 Tolman, E. C., 326, 418 Trakhtenbrot, B. A., 291, 321 Trucco, E., 579 Underwood, B. J, 109, 120 Volterra,V.,510,579 Von Neumann, J., 572, 579 Wald,A.,;96, 120 Wallace, A. F. C., 275, 321 Wason,P.C.,481, 491 Weaver, W., 336, 418 Weiner,N, 43 1,432, 491 Weinstock, S., 114,120 Weiss, W., 48 1,489 Wells, R, 4 10, 418 Wilks, S. S., 106, 120 Willis, J. C., 457, 491 Wilson, T. R., 62, 101, 104,118 Witte, R., 223 Wright, S., 579 Wunderheiler,A.,411,418 Wunderheiler, L., 41 1,418 Wyckoff, L. B., Jr., 257, 268 Wynne, L. C., 50, 53, 75, 92, 95, 96, 97, 101, 103, 104, 110, 111, 215, 268 Yamada, H, 334, 352, 417, 418 Yngve, V. H., 471, 474, 475, 484, 491 Yule, G. U., 457, 464, 491 Zajonc,R.B., 54 1,579 Zangwell, O. L., 118 Ziff, P., 292, 321, 466, 491 Zipf,G.K., 457, 46 1,463, 491 Subject Index Abbreviation, law of, 463 Absorbing state, 82, 498, 571, 575 Absorption, probability of, 82-83, 88 Acquaintance circle, 516-520 Activity, amount of, 562—563 Additive increment model, 38, 51-52 Algebraic model (for language user), 422, 464-483 assumptions about, 472-475 ALGOL, 403, 409 Algorithms, 354-357 Allophone, 311 All-or-none learning assumption, 126, 153, 170 Alphabet(s), 273 coding, 450, 452-453 information capacity of, 273, 439 input, 338 output, 338 universal, 339 Alternation tendency, 34 Altruism, 548 Ambiguity, of grammar, 405, 470 of language, 274, 280, 466 of segmentation, 280 structural, 387-390 Amount of information, 431-432, 437, 439, 481 see also Information Analyzable string, 301, 303 Animal sociology, 545 Antisymmetric relation, 542 Approximation, continuous, 42, 45 deterministic, 39-41, 47-49 differential equation, 85, 87 to English, 428-429 expected operator, 42-43, 45-47 A:-order, 336 model as an, 102 Arms race, 500-501, 563 Artificial language, see Language Association, law of, 153 Associative chain theory, 376 Asymmetry, 292 in grammar, 373, 414 left-right, 473 of responses, 10 of sentences, 399, 472 Asymptotic, see specific topics Attitudes within social groups, 539-541 Authoritarianism, 558 Autoclitic responses, 474 Autocorrelation, of errors, 71, 73, 79, 136, 214 of responses, 33, 69 Automata, abstract, 326-357 behavior of, 332 codes as, 283 with counters, 345, 352 definite, 376 deterministic, 334, 343, 379-380, 389, 406 equivalent, 334 finite, 331-338, 343, 345, 369, 376, 378-379, 382, 389, 390-401, 406, 421, 424, 426-427, 467, 469-470, 486 Jt-limited, 336-337, 426-427, 430, 441_443 linear-bounded, 338-339, 342, 353, 371, 379-380 nondeterministic, 379 one-way, 338 PDS (pushdown-storage), 339-345, 351-352, 371-374, 376, 378-379, 391, 413, 469, 484 real time, 352 restricted-infinite, 352, 360, 371-380, 407, 484 two-tape, 379 two-way, 337-338 587 588 SUBJECT INDEX Avoidance learning experiment, 111, 213,215 component model for, 213-215 Axiom(s), bargaining, 555 of component model, 192, 199 conditioning, for component model, 192 for linear model, 226-227 for mixed model, 244 for pattern model, 155 of linear model, 226-227 Luce's, 26-27, 36 of pattern model, 154-155 response, 155, 191-192, 226-227, 244 sampling, 155, 192, 199 Axone, 513 Axone density, 514-515, 520, 523-524 Background, 195-197, 251-252 Balance (of graph), 541 Bargaining, 549-551, 554-555 Barrier, absorbing, 81-82 Baseline studies with models, 49-50, 106-109 Behaviorism, 328 Bernoulli sequence, 214-215 Beta model, 19, 25-30, 36-37, 50-54, 58,60,66,83,89,96, 106, 111 asymptote for, 62-64 commutivity of, 64 and damping, 67 and experimenter-controlled events, 58 explicit formula for 
response proba- bility in, 29 and linear model, 35, 50-51, 57, 58, 96, 113 and logistic function, 36 nomogram for, 51, 54-55 parameter estimates for, 53-54, 97 rate of learning in, 52 recursive formula for response prob- ability in, 29-30 response-strength for, 25 responsiveness of, 60 and shuttlebox data, 28-29, 53-54 sufficient statistics for parameters of, 96 and urn scheme, 50 validation of, 111 Bias, circularity, 515 distance, 515, 525 interaction of sources of, 23 1, 528 overlay, 518-519,524, 528 popularity, 525, 528 reciprocity, 525, 528 response, 142 sociometric, 523 sociostructural, 517-519, 521 symmetry of, 515 transitivity of, 515, 525, 528 see also Net Bilateral monopoly, 551-556 Biomass (of prey and predator), 509 Birth rate, 509 Bit, 435, 462 Block-moment method of estimation, 101 Boundary condition, 128-129 Boundary marker (symbol), 280, 287, 292-293, 334, 338, 452, 459-460 see also P-marker Branch (of tree), 290 Branching process, for N-element mod- el, 156 left, 474 multiple, 474-475 for one-element model in two-choice contingent case, 152 for one-element model in two-choice noncontingent case, 142, 145 for paired-comparison learning, 184- 185 right, 473-474 for two-process discrimination-learn- ing model, 261 Bridge (of graph), 532 Calculus, first-order predicate, 355 sequential, 370, 406 Categorical grammar, 410-414 Categories, primitive, 411 system of, 444, 447-449 Categorization of order z, 444 Centrality (in social groups), 533 Chain of infinite order, 226 Channel, critical, 532 noiseless, 450 Channel capacity, 431, 448 SUBJECT INDEX Choice (s), 4 cooperative, 558-563, 575 distribution of number of, 528 Greenwood- Yule distribution of, 528-529 independence of, 517 matrix of sociometric, 535 pattern of, 141 Poisson distribution of, 528-529 of strategy, 557-561, 574 sociometric, 516, 535, 542, 545 see also Response Choice point, 278 Classificatory matrix, 310-311 Clique, 516, 532, 537-539 Closure, 380-381 Code(s), 277-278, 409 anagrammatic, 281 artificial, 277 as automata, 283 binary, 453-454 classification of, 280-281 error-correcting, 455 general, 280-281 left tree, 280-281,452 memory of, 279 minimum redundancy, 450-456, 462 natural, 277, 281-282, 452 nontree, 283 right tree, 280-281 self-synchronizing, 28 1 tree, 280-281,452 uniform, 281 word boundaries in self-synchroniz- ing, 281 Code symbol, 452 Coding, 277-283 efficiency of, 455 optimal, 450 Coding alphabet, 450-453 Coding theorem, 452 Coding tree, graph of, 278, 289, 484 for minimum redundancy, 455 for P-markers, 289 Cohesiveness, 565-566 Combining-classes condition, 19-24, 26 Communality, degree of, 205 Commutativity, see Event Competence (of language user), 326, 330,352,390,464,467,472 Compiler (for computer), 410 Complementarity, assumption of, 7, 9— 12 Complete family of operators, 10-11 Completely connected graph, 532 Complicated behavior, 483-488 Component model, 123-125, 153, 191- 238 asymptotic response probability in, 249-250 asymptotic response variance in, 215- 216 autocorrelation of errors in, 214 for avoidance learning experiment, 213-215 axioms of, 192, 199 for discrimination learning, 249-250 with fixed-sample-size, 198, 207-219 with fixed sampling probabilities, 198 and linear model, 206-238 mean learning curve for, 214, 228, 250 for multiperson interactions, 234-238 probability of reversal in, 214 for simple learning, 206-219 with stimulus fluctuation, 219-226 and total number of errors, 214 Comprehension (of language), 275 Computer, 356-357 handling of natural language by, 343 memory in, 
468-469 program of instructions for serial, 486 Computer program (s), simulation of behavior, 485 theory of, 283 Concatenation, 273-274, 277-278, 283, 292-295 Concept-attainment experiment, 48 1- 482 Conditional expectations, 16, 78-83 Conditional probability, 16, 131 Conditioning, operant, 47-49 Conditioning assumptions, 131, 141 Conditioning axioms, see Axioms Conditioning experiment, 239 Conditioning parameter, 127, 131, 133 Conditioning state, 125, 130-131, 155, 192 changes in, 143 59° SUBJECT INDEX Configuration, 341-342, 350-351 initial tape-machine, 339-340 Conflict, logical structure of, 495 Conformity, 560 Confusion errors, 1 54 Connected graph, 532, 542 Connotation, 275 Constituent-structure grammar, see Grammar Contact, channels for, 523 frequency of, 504 randomization of, 520 Contagion, 496-499, 504-508 theory of, 525 Context-free grammar, see Grammar Context-free language, see Language Context-sensitive grammar, see Gram- mar Context-sensitive language, see Langu- age Contingent-noncontingent distinction, 15 Contingent reinforcement, 157-158 and one-element model, 151-153 Continuum (of states), 497 Control unit, 331 Convergence (of response probability), 19 Co-occurrence relation, 296-297 Cooperation (related to attitudes), 558 Copying machine, 365-366 Correction (role in language learning), 276 Correlation, see specific topic Correspondence problem, 382-384, 387-388 Counter, automata with, 345, 352 Countersystem, 345, 378 Cournot lines, 547 Criterion-reference learning curve, 109 Critical channel, 532 C-terminal string, 299, 306 Cues, background, 251-252 discarded, 174 irrelevant, 240-242, 257 relevant, 240-242 verbal, 174,431-432 see also Stimulus and specific topics Curve, see specific topics Cycle (of sociogram), sign of, 540-541 Damping of response effects, 67, 72 Death rate, 509 Decidability of sets, 354-356 Decipherability, unique, 389 Decision problem, recursively solvable, 354,356 Decision theory, 256, 512 Delayed effect, 17 Denotation, 275 Dependency system, 288 Depth of postponed symbol, 474, 484 Derivation(s),286, 292 completely determined set of, 292 left-to-right, 373-374, 414 ^-embedded, 374 right-to-left, 373-374,414 Derived categories, 411 Derived utterances, 474 Detection experiments, 251-255 Deviant utterances, 444-446 Diad, 546 Difference equation, 42-43, 82-85 partial, 84-85 power-series expansion, 85 solution of, 85, 139, 148, 152, 159 Differential equation, 82, 85, 87, 199, 495-496,501,563 Diffusion, 497, 508, 512 Discourse, initial, 448 Discriminating statistics, 49 of sequential properties, 70-73 of variance of total errors, 73-75 Discrimination learning, 238-265 component model for, 249-250 defined, 238-239 mixed model for, 243-249 multiple-process model for, 257-264 observing responses in, 258 orienting response in, 257 pattern model for, 239-243 probabilistic experiment on, 194-198 relevant cues (patterns) in, 242, 257 stimulus sampling model for, 250-256 Dissipation rate, 499 Distance bias, 515, 525 Distinctive features (of phoneme se- quence), 310 Distribution of asymptotic response probability, 45, 144-146, 150, 167-168, 237, 242, 249-250, 253-254 SUBJECT INDEX 591 Distribution, cumulative normal, 25-30 equilibrium, 569-570 of errors, 71,94, 135-139 Greenwood- Yule, 528-529 length-frequency (of words), 461 nonnormal, 458 Poisson, 528-529 rank-frequency (of words), 457-459, 460, 462 of response probabilities, 13, 186 se2 also specific topics Dominance relation, 541-546, 570 Domination (between string deriva- tions), 293 Drive stimuli, 197-198 
Element(s), junctural, 308 left-recursive, 290, 293, 394, 399, 471-472 neutral, 210 nonrecursive, 293 nonterminal, 294 recursive, 290, 293, 295, 394 right-recursive, 290, 293, 394, 399, 471-472 self-embedding, 290, 293, 394, 399, 472 stimulus, 123 terminal, 294 Embedding, degree of, 474 in natural language, 286 see also Self-embedding English, coding efficiency of, 439-440 double-negative in, 481-482 letter approximation to, 428 passive construction in, 482 probabilities of strings in, 440-441 redundancy in, 440, 443 rewriting rules for, 447 self -embedding in, 47 1 speed of transformation in, 482 transformational grammar for, 477- 478 word approximation to, 428-429 word frequency in, 456 English grammar, 288 context-sensitive, 365 Entropy, 436 Environment, competition for, 510 Epidemic, 39, 497, 507-508 Equations, definability of language by system of, 401-409, 501 see also specific topics Equilibrium, dynamic, 511, 569 in interspecies competition, 510-511 Nash, 548, 560 in noncooperative nonzero-sum game, 548 of probability distribution, 570 stability of, 500-503, 510-511, 547- 548, 554 static, 554, 556-567 Equivalence class, 8, 12, 14 Equivalent events, 7 Equivalent grammars, see Grammar Error (s), autocorrelation of, 71, 73, 130, 136 autoco variance of, 75 confusion, 154 distribution of number of, 137-138 /-tuples of, 78 last, 76, 135, 214 number of (expected), 70, 77, 81, 86, 90,94,130, 133-134,226 and component model, 214 frequency distribution of, 135—139 predicted and observed values of, 136 variance of, 73-75, 135-136 types of, 141 variance of, 130 Error runs, 70, 77, 90-91 as discriminator among models, 103 distribution of lengths of, 71, 94 lengths of, 70-72 mean number of, 70, 214 observed and predicted number of, 71 Error statistics from models, 130, 134— 138 Estimation of parameters, block-mo- ment, 101 errors in, 94 maximum-likelihood, 89, 93-98 method of, 52, 94 minimum-chi-square, 93, 96-97 for single-operator model, 94 use of statistics in, 76 Eugene Onegin, 423-424 SUBJECT INDEX Event(s), commutative, 7, 17-19, 22, 24, 28, 32, 38, 41, 56, 58, 61, 64,67 complementary, 7, 9-12 contingent, 14-15 equivalent, 7, 9 experimental, 6, 8, 12 experimenter-controlled, 13-14, 23- 24, 29-30, 44, 52, 58, 65, 115 experimenter-subject controlled, 13- 14, 45-41, 57, 63 model, 6, 12-15 neutral, 227 path-independent, 7, 16-18 regular, 333 reinforcing, 15, 123, 142, 154, 177, 182-183,227,235 repeated occurrence of, 18-19 subject-controlled, 13-15, 23, 28, 31, 65,69 trial, 5 trial-independent, 17 see also Operator and specific topics Event effects, invariance of, 36-38, 61 Excitatory tendency, 206 Expectation, conditional, 16, 78-83 Expected gain, see Payoff Expected operator, 44-45 Expected-operator approximation, 42- 43, 45-47 Experiment, see specific topics Experimental event, see Event Explicit formula for response probabili- ty, 5, 15-18, 110 analysis of, 65 for beta model, 29-30, 50, 57 for commutative events, 18, 23 for experimenter-controlled event model, 62 for linear model, 50, 57 for logistic model, 35 for perseveration model, 34 for prediction experiment, 24, 29 for shuttlebox experiment, 23, 29, 32 for subject-controlled events, 23, 28 transformation of, 50-55 trial-to-trial change in, 5 for two-event experiment, 51 and urn scheme, 30-32, 50-51 Explicit formula for state probabilities, 162 Exposure time, 211 Feedback process, learning model as, 69-70 Fidelity criterion, 273 Finite automata, see Automata Finite transducer, see Transducer First success, trials before, 90-91 Fixed 
point of operator, 21 Fixed-sample-size component model, see Component model Forgetting (in learning), 18, 24, 56, 65- 66, 127, 221, 230 Formal power series, see Power series Free-recall verbal learning experiment, data from, 107 Functional equations, 81-89 differential equation approximation to, 85, 87-89 power-series solution, 86-87 Game-learning theory, 571-576 Games, against nature, 572 cooperative nonzero-sum, 549 experimental, 556-561 mixed-motive, 550 noncooperative, 548, 556 nonnegotiable, 556-560 nonzero-sum, 547-549, 558 Prisoner's dilemma, 548, 557, 574 theory of, 234, 556-557, 571 zero-sum, 572-574 Generalization, stimulus, 200-206 Generative capacity (of grammar), strong, 325-326, 357, 371, 377- 378,406 weak, 325-326, 357, 377-378, 379 Generative grammar, 290, 292, 296, 326,411-412,465-467 Goodness-of-fit, 76, 96, 103, 133-134, 179,215 Grammar (s), 271, 276, 284 adequacy of, 283-284, 291-292, 297- 300 ambiguity of, 405, 470 asymmetry in, 373, 414 categorical, 410-414 SUBJECT INDEX 593 Grammar(s), constituent-structure, 295 component of, 296-298 context-free, 343 deficiencies of, 297-298, 378-379 generalization of, 414 grammatical transformations in, 300-306 context-free, 294, 352, 366-410, 413, 469-470, 472, 474 ambiguity of, 387-388 constituent-structure, 343 linear, 383-390 nonself-embedding, 396, 467 power-series satisfying, 406 and restricted-infinite automata, 371 special classes of, 368-371 sufficient condition of, 394 theory of, 294-295, 340 undecidable properties of, 382-388 context-sensitive, 294, 360-368, 373- 374, 378, 468-469 asymmetrical, 469 general property of, 364 strictly, 373-374 undecidable properties of, 363 definition of, 284-285 discontinuous, 414 English, 288, 365 equivalent, 293, 297, 356, 362, 395- 396,400,413 strongly, 297, 395-396, 400 weakly, 293, 395, 397,413 generative, 290, 292-296, 326, 356, 411-412,465-467 strong, 325-326, 357, 371, 377- 378, 406 weak, 325-326, 357, 377-379 linear, 369-370, 379-390, 393, 399 meta-, 369-370, 380, 385 minimal, 386, 388 one-sided, 369-370, 379, 389-390, 409,421,467,470 of natural language, 366 normal, 369-371, 393, 396 modified, 374-375, 377 nonself-embedding, 400 phonological component of, 288, 306-313 phonological rule of, 288, 313-319 Grammar(s), of programming language, 409 properties of, 363-364, 382-387 recursive rules of, 284, 328-329 self-embedding, 394 sequential, 369-371, 389, 409 syntactic component of, 306 theory of, 285, 295 transformational, 296-306, 357, 364- 365,476-483 type/, 360-367 undecidable properties of, 363 universal, 295-296 well-formed, 291, 364, 367-368 Grammatical, see specific topic Grammaticalness, categorization by, 445, 447 degree of, 291, 295, 443-449, 466 deviation from, 291, 444 hierarchy of, 292 and well-formedness, 449 Graph, articulation point of, 532 balanced, 541 bridge of, 532 component of, 532 connected, 532, 542 directed, 530, 536, 542 linear, 495, 512, 530-531, 542 signed, 530, 540-541 symmetric, 540 tree, 278, 289, 455, 484, 532 Greenwood- Yule distribution, 528-529 Group, egalitarian, 543 single-clique, 532 small, 529-535 social, 531-532 symmetric relations among members of, 537 Group dynamics, 562-576 classical model of, 563-565 game-learning theory of, 571-576 Markov chain model for, 567-576 qualitative hypothesis for, 564-567 semiquantitative, 565-567 Group members, attitudes of, 539-541 Guessing, in one-element model, 129, 137 in RTT experiment, 209 Guessing procedure, 441-443 594 Guessing-state model, 134-140, 170- 172 Hebrew, double-negative in, 482 Heuristics, 
277,318 Hierarchy, 292, 543 of grammaticalness, 292 in groups, 543 index of, 543 of tote units, 486-487 Homo economicus, 548 Homogeneity assumption, 99-101, 104 Homonyms, 447-448 Hullian model, 54-55 Hypothesis model, 154 Identification learning experiment, see Discrimination learning Imitation, experiment on, 10 linear model for, 224 in mass behavior, 500-504 in peck hierarchy, 545 Immediate constituent analysis, 370, 41 1 Immediate constituents, 289 Independence from irrelevant alterna- tives, assumption of, 26-27, 437 Independence of responses, 13 Independence of unit, assumption of, 26 Independent sampling restriction (in RTT experiment), 221 Index, of cliquishness, 516 hierarchy, 543, 545 of similarity, 201 structure, 301 Indicator random variables, 77-78 Individual differences, 13, 99-101, 111, 230 in amount of retention loss, 230 in learning rate, 133, 174, 229-230 in parameter values, 111 in rate of forgetting, 230 Infection rate, 498, 505, 507 Information, 484 amount of, 431-432, 437, 439, 481 bits of, 435, 462 chunks of, 462 levels of processing of, 280 measure of, 431-439 model of, 105 spread of, 505-506, 519-522 SUBJECT INDEX Information capacity, 43 1-455 of alphabets, 273,439 Initial symbol, 292 Innate faculte de langage, 327, 329 Input alphabet, 338 Input tape, 339 Insight model, 103,248 Intention, 486 Interaction, intensity of, 562-563 model for, 501-512 Internal state, 331 International behavior, 501 Intertrial interval, 141, 219, 224-226 Invariance condition, 310-313, 318 Irrelevant alternatives, independence of, 26-27, 437 Irreversibility, 498 Item difficulty, differences in, 229-230 Joint profit, maximization of, 552-555 Kernel string, 299 Kinship relation, 533 ^-limited automaton, 336-337, 426-427, 430, 441-443 Liaison person, 532 Language(s),271,283 accepts a, 332,342-343 ambiguity of, 278, 280, 387-390, 408-409, 466 artificial, 272, 283, 285-286, 343, 364 complement of, 380-381, 386 comprehension of, 275 computer, 273, 343, 402-403, 409- 410 context-free, 294, 351-352, 366-367, 373-374, 376-377, 380, 386, 392-393, 402-403, 408-409 context-sensitive, 379-381 definable, 402 definition of, 283 formal, 272, 411 generates a, 322, 342-343 intersection of , 380-381 ^-limited, 336 knowledge of, 326, 352, 441-443, 464 learning of, 272, 275-279, 307, 314, 330,430 meta-linear sequential, 380 SUBJECT INDEX Language(s), mirror-image, 342, 383 natural, 271-272, 274, 280-281, 283, 286, 288, 295, 343, 366, 378, 389-390, 421, 450, 475, 483 programming, 273, 343, 409-410 regular, 333-335, 338, 347-348, 376- 378, 380, 383, 386-387, 393- 394, 407-409, 470 terminal, 293 theory of, 329 type/, 360-367 union of, 380-381 user of, 325-326, 330, 352, 390, 421- 422, 441-443, 464, 467, 472- 475,483,487 Langue, 327-329 Last error, 76, 135-136, 214 Late Thurstone model, 1 1 Learning, all-or-none, 126 avoidance, 111, 213-215 criterion of, 130 discrimination, 194-198, 238-265 see also Discrimination learning language, 272, 275-277, 307, 314, 330, 430 paired-associate, 123, 126-141,239 paired-comparison, 181-189,243 in peck hierarchy, 545 probability, 141-163, 167, 169, 173- 174, 179, 193-194 rate of, 52 rote serial, 141 see also specific topics Learning assumptions, see specific mod- els Learning curve, 6, 46-47, 134, 140, 172, 225 asymptote of, 173 for avoidance learning, 213-215 for component model, 153, 213-214, 228, 250 criterion-reference, 109 as discriminator among models, 72, 102 form of, 37, 153 individual, 72 effect of individual differences on, 174 for linear model, 140, 153, 228, 233 
mean, 40, 70, 72, 74-75, 173-174, 213-215,228,233,250 595 Learning curve, for one-element model, 140, 153 for paired-associates learning, 134, 140 for pattern model, 215, 233 in probability learning, 173 of stat-organisms, 46-47 Learning model, see specific models Learning-rate parameter, 22, 61, 90, 101,572 individual differences in, 133, 174, 229-230 Learning-to-criterion experiment, one- element model for, 130 Left-branching, 474 Left-recursive element, see Recursive element Left tree code, 280-281, 452 Length-frequency distribution, 461 Lexical morphemes, 308 Lexicon, 370 Liaison person, 532 Likelihood-ratio test, 96 and comparison of models, 106 Limit point of operator(s), 21, 28 Linear, see specific topic Linear (interaction) model, 499-504 Linear (operator) model, 19-24, 37, 50- 51, 58-59, 67, 82-83, 226-227 asymptotic properties, 62, 82, 226- 228, 233 axioms of, 226-227 and beta model, 37, 50-51, 57-58, 96, 113 commutativity of, 67-68, 79 and component model, 206-238 conditioning axiom of, 226-227 and damping, 67 explicit formula for response prob- ability, 56 and fixed-sample-size component model, 216, 227-228 for imitative behavior, 234 learning curve for, 140, 153, 228, 233 as limiting case of stimulus sampling model, 226-234 and multiperson interaction, 234-238 nomogram for, 51, 54-55 and one-element model, 140 596 Linear (operator) model, and pattern model, 228-233 and perseveration, 34, 108 probability matching in, 62 recursive formula for response prob- ability in, 23-24, 56 for RRT experiment, 228-232 for simple learning, 206-234 variance for, 216 and urn scheme, 50-5 1 see also Single-operator linear model Linguistics, 271, 274, 283-293, 325-331 List structure, 484 Logic, 271 artificial language of, 283 methods of symbolic, 535 Logistic, 30, 35-36, 96-97, 504-508 sufficient statistics for parameters of, 96 Luce's axiom, 26-27, 36 Many-trial perseveration model, 66 Markov chain, 103, 123, 131, 145-147, 186, 227, 245, 260-261, 424- 425,463,561-576 aperiodic, 145 conditioning states as a, 157, 212 discrete-time, 17 erdogic, 150 higher order, 426-427 irreducible, 145 ^-limited, 426-427 limit vector of, 145, 186 Markov source, 424-430, 437 Marriage types, 533—534 Mass behavior, model for, 501-504 Matching theorem, see Probability matching Maximization of joint profit, 552-555 Maximum-likelihood method of estima- tion, 89, 93-98 Maze experiment, 7, 9, 10-12, 14, 20, 65,73-75,92,103,113-114 with correction procedure, 7 effect of reward in, 73-75, 1 14 experimenter-subject controlled events in, 13 overlearning in, 96 reversal of reward in, 75, 92, 112-113 and single-event model, 14 SUBJECT INDEX Meaning, 275, 329, 456 Meaningfulness, 429 Mean learning curve, see Learning curve and specific topics Memory, 279, 471 computer, 468-469 human, 16, 471-472, 475-476, 556 long-term, 476 short-term, 471,476, 480 Mentalism, 327-328 Messages, 432-435 Metathetic stimulus dimension, 202 Minimax strategy, 560, 569, 574 Minimum chi-square method of estima- tion, 93, 96-97 Minimum redundancy code, 450-456, 462 Mirror-image language, 342, 383 Mixed model, 243-249 Mob effect, 504 Mobility, 508 Model, see specific topic Model-free test, 107 Model type, 6 testing of, 90, 102-104 Mohawk (language), 378 Monogenic system of rules, 359 Monogenic type 1 grammar, 361 Monoid, 274, 277 Monomolecular autocataletic reaction, analogy to, 37 Monopoly, bilateral, 551-556 Monte Carlo method, 76-77, 89, 94 Morale, 564 Morpheme, 282, 289, 295-296, 299, 302, 308, 414 Morpheme structure rules, 314 Morphophonemics, 309 
Multi-element pattern model, see N~ element pattern model Multiperson games, 234 Multiperson interaction, 234—238 Multiple-alternatives, 10, 19-21 Multiple-branching, 474-475 Multiplicative learning model, 27 Multiprocess model, 125, 257-264 asymptotic predictions from, 263-264 branching process in, 261 Mutual attractiveness, 566 SUBJECT INDEX Nash equilibrium point, 548, 560 Natural code, see Code Natural language, see Language Negative recency effect, 115, 179 Negative response effect, 74, 113 Negotiation set, 556 Neighborhoods, 515 Neighbors, 512 JV-element pattern model, 153-191 asymptotic variance of, 233 branching process in, 156, 184-185 mean learning curve for, 173, 233 and one-element guessing state model, 170 for paired comparisons, 187-188 for probability learning, 173-174 sequential statistics for, 173-174 for two-choice noncontingent rein- forcement experiment, 162-181 Nesting, degree of, 480 of dependencies, 470-471, 475 of phrases, 343 Net(s), acquaintance relation, 515 biases in, 515-519 information-spreading, 520, 522 neural, 513 of social relations, 515 statistical aspects of, 512-519 tightness of, 519 tracing of contracts in, 514-515, 523- 528 Neurophysiological correlates, 35 Neutral element, 210 Neutral event, 227 Node, 278,289,513 -to-terminal-node ratio, 480, 485 Nomogram, for beta model, 51, 54-55 of clique structure, 536, 538 of Huilian model, 55 for linear-operator model, 51, 54-55 for urn model, 5 1 Noncontingent choice experiment, 163- 166 Noncontingent-contingent distinction, 15 Noncontingent reinforcement schedules, 142-151, 157 Nondeterministic automaton, 379 Nondeterministic transducer, 406-407 Nonlinear interaction models, definition of, 503 597 Nonlinear operator(s), 38 Nonrecursive element, 293 Nonreward, effect of, 19 see also Reward, effect of Nonsense syllables, 314 Nonstationary time series, 116 Nonterminal element, 294 Nonterminal vocabulary (universal), 357 Nontree codes, 283 Normal grammar, see Grammar Number theory (formalized), 356 Observing responses, 258-260, 263 Oligopoly, 234,551 One-element guessing-state model, 170- 172 One-element model, 125-153 all-or-none property of , 126 autocorrelation of errors, 136 branching process for, 142, 145, 152 conditioning assumptions of, 131, 141 conditioning parameter for, 127 errors, distribution of, 135-137 following kth success, 140 last, 135 and fixed-sample-size component model, 208, 211 learning curve for, 131, 134, 140, 153 for learning-to-criterion experiment, 130 for paired-associates experiment, 126, 128-141 reference experiment for, 141 reinforcement schedules in, 142, 145- 153 sequence of responses in, 127-130 special cases of, 125, 147-151 for stimulus-response association learning, 126 for two-choice learning problems, 141-153 One-element pattern model, 126-128 One-person game, 572 One-sided linear grammar, see Gram- mar One-trial perseveration model, linear, 33, 66, 108 logistic, 36, 98, 108 SUBJECT INDEX One-trial perseveration model, recursive formula for response probability in, 34, 56 Operant conditioning, 47-48 model for, 47-59 Operationalism, 328 Operator(s),5, 17,56 average, 42 classification of, 56 commutative, 7, 17-19, 22, 24, 28, 32, 38,56,58,61,64,67 complete family of , 10-11 fixed point of, 2 1 identity, 22 ineffectiveness of, 22 limit point of , 21 linear, 9, 21 nonlinear, 17, 27 trial-dependent, 110 trial-independent, 17 unidirectional, 28 see also Event Opinion, amount of change in, 566 Ordinal scale, 567 Orienting response, 257 Outcome(s), 4, 7-14, 154 contingency of, 14 
definition of, 14 differential effect of, 33 equivalent classes of, 12 Pareto-optimal, 549, 557 response-controlled, 12 response-correlated, 33 symmetry of, 10-12, 33, 45 Outcome probability, 20 Outcome sequence (in two-choice pre- diction experiment), 24 Output alphabet, 338 Output tape, 346 Overlap, degree of, 219 Overlap bias, 518-519, 524, 528 Overlearning, 75, 92, 96, 103, 112-113 Paired-associates experiment, 123, 126- 127, 239 data from, 128, 134, 140 interpreted in terms of one-element model, 128-141 Paired-associates learning-to-criterion experiment, 130 Paired-associates model, 128-141 Paired-comparison experiment, 181- 191,243 and pattern model, 243 reference experiment for, 181-182 response probability in, 187 Pandemic, 508 Panic, 497 Paradise fish, 104 Parameter (s), 5 in avoidance learning, 53-54 choice of statistic to estimate, 95 comparison of methods (of estima- tion), 97 conditioning, 127, 131, 133 as descriptive statistics of the data, 103-104 estimates of, 53-54, 76, 89-99, 127, 129, 133, 167, 169, 171, 214, 222, 229-233, 253, 256 see also Estimation of parameters free, 106 learning-rate, 22, 61, 90, 101, 572 Parameter-free properties, 76, 93, 190 Parameter invariance, 36, 104 Parameter space, 89 Pareto-optimal strategy, 548-549, 557 Parasitism, 547-548 Parole, 327-328 Passive transformation, 300, 482 Path dependence, 34, 56 see also Path independence Path independence, 7, 16-21, 26, 38, 52, 56 and combining-classes condition, 19- 21 and commutativity, 19 and event invariance, 19 quasi-, 32 Path length, 17 Pattern model, 123-124, 125-153, 153- 191, 222-223 asymptotic properties, 161-162, 213, 233 axioms for, 154-155 comparison with linear model, 233 for discrimination learning, 239-243 and fixed-sample-size component model, 212,215 SUBJECT INDEX 5.9.9 Pattern model, mean learning curve for, 215,233 W-element, 153-191 one-element, 126-128 transition probabilities for, 156 and verbal discrimination experiment, 243 see also N-element pattern model Pattern of stimulation, 123 Pawnee marriage rules, 535 Payoff (s), expected, 560 joint, 557 matrix of, 234, 237, 57 1-573 maximization of in bilateral monop- oly, 552-553 maximization of by homo economi- cus, 548 maximization of joint, 548-549, 552 probabilistic, 574 PDS (pushdown storage), 339-352, 371, 400-401,469 generation of a string with, 344 PDS automaton (pushdown storage au- tomaton), 339-345, 351-352, 371-380, 391,413,469,484 Peck right (of hens), 542-545 Percept, 329 Perceptual capacity, 47 1 Perceptual model, 318, 377, 401 decision theory, 256 incorporating generative processes, 483 left-to-right, 472 optimal, 469-470 single-pass, 472 see also Speech perception Perceptual process, 329-330 Performance, 6, 123 of language user, 326-330, 390, 464, 467 Permutation, 304-305, 534 Perseveration model, linear, 34, 108 logistic, 36,98, 108 many- trial, 66 one-trial, 33, 66, 108 Phase space, 496, 511-512 Phoneme, 308-310 Phones, 308 Phonetic alphabet, 295, 307-308 Phonetic representation and related topics, 288, 308-314 Phonological component and related topics, 288, 306-319 Phrase-marker, see P-marker Phrases, nesting of, 343 Phrase structure, 288 Phrase types, categorization into, 410- 411 Plan, 486-487 Player, rational, 571 P-marker(s), 293-294, 296, 298-299, 304,306,359,363,365,405, 468,473-474,477-481 and ambiguity of grammar, 405 and attachment transformation, 305 contruction of, 301 derived, 301, 303-304, 307, 478-481 generated by rewriting rules, 477 graph of, 289 and singularity transformation, 305 strong derivation 
of, 368 and structural complexity, 48 1 Poisson distribution, 528-529 Polish notation, 370, 406 Polynomial expression, 401-402 Popularity, 535 Popularity bias, 525, 528 Population, assumption of well-mixed- ness, 505 dissemination of genes in, 497 homogeneity of, 503 nonhomogeneity of, 503 predator, 509 size of, 504 statistical study of, 522-529 Positive response effect, 72, 74, 108 Postponed symbol, 474-475 Postponement, depth of, 484-485 Power series, 406 algebraic elements of, 407 characteristic, 406 closed, 403 formal, 403-407 ordinary, 403 solution, to function equations, 86-87 to difference equations, 85 Practice effect in signal detection experi- ment, 250-251 6oo SUBJECT INDEX Predator population, 509 Prediction experiment, 7-9, 11-14, 44, 65, 84 asymptotic behavior in, 61-65 and beta model, 29-30, 56, 63-64 experimental event in, 8 experimenter-controlled event in, 13 explicit formula for response prob- ability for, 24, 29 and linear model, 56 and single-event model, 14 and stimulus-fluctuation model, 223- 226 Preference, intransitivity of, 542 Pretraining, 36 Price, 456 Price leader, 553-555 Primitive categories, 411 Prisoner's dilemma game, 548, 557, 574 Probabilistic reinforcement schedule, 141-153 Probability, of absorption, 82-83, 88 of choice, 517 conditional, 16? 131 of contact, 499 of reversal in component model, 214 transition, 156-158, 212-213, 498, 575 see also Response probability Probability learning, 141-162 asymptote in, 144-146, 150, 153, 167, 169 and contingent reinforcement, 151- 153 N-element model for, 173-174 and noncontingent reinforcement, 142-144, 162-163 one-element model for, 147-151 pattern model for, 153-162 and probability matching, 151, 179 reference experiment for, 141-142 and stimulus compounding, 193-194 Probability matching, 61-64, 151, 179 by paradise fish, 105 and urn scheme, 65 Probability vector, 5, 146 Profit, see Payoff Programming language, see Language Pronunciation, 247-275 Proper analysis, 301 Property-space, 89-90 Prothetic stimulus dimension, 204 Psychoeconomics, 546-561 Psycholinguistic model, 327, 329 Psychophysical experiments, 33, 256 Pure reinforcement model, 226 Pushdown storage, see PDS Quantification theory, 355 Random net, 506, 513-517, 519, 528 connectivity of, 506 rejection of (hypothesis of ), 514, 524 Random walk, 463 Rank-frequency relation (of words), 457_464 Rational function, 407 Rationality, collective, 556-557, 571 Ratio scale, 26, 562 Reaction potential, 25, 35, 54 Reaction probability, 54 Reading head, 331 Real-time automaton, 352 Receiver operating characteristic (ROC) curve, 256 Receiver's uncertainty, measure of, 432, 435 Reciprocal bias, 525, 528 Recognition routine, 377, 469 Recognition of words, 465 Recovery, 497 Recursive element, 290, 293, 295, 394 left, 290, 293, 394, 399, 471, 472 right, 290, 293, 394, 399, 471, 472 self-embedding, 290, 293, 394, 399, 472 types of, 290 Recursive formula for response prob- ability, 16, 18, 23-24, 28, 30, 34, 44,56 approximate, 44 for beta model, 29-30 classification of, 56 for commutative events, 18 general, 30 for linear operator model, 23, 56 for one- trial perseveration model, 34, 56 for prediction experiment, 24, 29, 44 for shuttlebox experiment, 23, 28 SUBJECT INDEX 601 Recursive formula for response proba- bility, for single-operator model, 56 for subject-controlled events, 23, 28 for two-event experiment, 51 for urn scheme, 30-32, 56 Recursive generative process, 290 Recursively enumerable set, 355-356, 361-362 Recursive rules, 284, 328-329 Recursive set, 355 
Redundancy, 431, 439-443, 449, 455, 484 in English, 440, 443 estimation of, 440-442 maximization of, 449 minimum, 450-456, 462 sequential, 442 Reflexivity, 279, 293 Regression analysis of binary sequence, 35 Regular event, 333 Regular language, 334-335, 347-348, 376-378, 380, 383, 386-387, 393-394, 470 defined, 333 and formal power series, 407-409 structural characterization theorem for, 334 and two-way automata, 338 Reinforcement, 123 conditions of, 20 contingent, 151-153, 157-158 noncontingent, 142-151 probability of, 572 schedule of, 142, 158 Reinforcing event (s), see Event Relation, acquaintance, 515, 518 antisymmetric, 542 asymmetric, 292 binary, 495, 530 co-occurrence, 296-297 dependency, 286 dominance, 542-546, 570 equivalence, 7, 9, 14, 279 grammatical, 477-478, 480 kinship, 533 rank-frequency, 462-464 reflexive, 279, 293 submissive, 543 symmetrical, 25, 33, 45, 279, 530 Relation, transitive, 279, 293, 518, 542 Removal rate, 507 Repetition tendency, 33 Representing expression, 334-336 Reproduction, rate of, 499 Resolution, rules of, 411, 413 Response (s), 4 asymmetric, 10-11 autoclitic, 474 autocorrelation of , 33, 69 dependence of, 33, 56 discriminative, 258-260 independence of, 1 3 observing, 258-260, 263 orienting, 257 problem of definition of, 20 repetition tendency of, 108 set of, 5, 123 symmetry of, 9-12, 45 variance of, 174-177, 233 see also specific topics Response axioms, 155, 192, 226-227, 244 Response bias, 142 Response effect, accumulation of, 67 damping of, 67, 72 direct, 66-68 erasing of, 67, 72 indirect, 68-69 magnitude of, 67 negative, 74, 113 positive, 72, 74, 108 undamped, 67 Response-outcome event, 7, 12, 33-34, 78 Response probability, 5 asymptotic, 45, 61-65, 102-103, 105, 144-146, 150, 159-162, 167-169, 176, 187-188, 224-227, 233, 237, 242, 249-250, 253-254 convergence of, 19 distribution of , 13,186 explicit formula for, 5, 15-16, 18, 20, 22-24, 28-32, 34-35, 50-57, 62, 65,110 in independent sampling model, 224 mean of distribution of, 172-173 and moments in pattern model, 158- 162 6os SUBJECT INDEX Response probability, nonlinear trans- formation on, 27 recursive formula for, 16, 18, 23-24, 28-32,34,44,51,56 for stimulus fluctuation model, 224 variance of distribution of, 159, 172- 173 Response sequences, 75, 127-133 Response strength, 25, 32 see also Beta model Responsiveness of model, 56-61 Restricted-infinite automaton, 352, 360, 371-380,407,484 Retention curve, 219-220 Retention loss, 209, 211 Reversal, 112-113 of dominance, 570 in learning, 214 Reversibility, 497-498 Reward, effect of, 19, 33, 45, 48, 57, 64, 73,107,110 in prediction experiment, 44 on response probability, 113 in shuttlebox experiment, 74 in T-maze, 73 in urn scheme, 48 on variance of total errors, 73-74 Reward and nonreward parameters, esti- mates of , 107 Rewriting rules, 468-475, 477, 481 unrestricted, 357-360, 379 Right-branching, 473-474 Right recursive element, see Recursive element Right tree code, 280-281 ROC (receiver operating characteristic) curves, 256 Rote serial learning, model for, 141 RTT experiment, 207 fixed-sample-size component model for, 207-211,221-232 linear models for, 228-230 neutral element model for, 210 retention loss in, 209, 229 stimulus fluctuation model for, 207, 221-223, 232 stimulus-sampling model for, 229-232 Rules (grammatical), left-linear, 369 linear, 369-370 meta-linear, 369 Rules (grammatical), monogenic system of, 359 recursive, 284, 328-329 of resolution, 411, 413 rewriting, 357-360, 379, 468-475, 477,481 right-linear, 369 selection, 301 for 
synthesizing sentences, 466 terminating, 369 Rumor, 497 Runs of errors, see Error; Error runs Runway experiment, 8-10 Saddle point, 574 Sampling axiom, 155, 199, 252 Sampling model, see Stimulus fluctua- tion, Stimulus sampling, Com- ponent, A/-element, One element, and Pattern models Satisfies, 404 Saussurian view of linguistics, 327-330 Scale, ordinal, 567 ratio, 26, 562 Score structure, 543 Secondary reinforcement, 19 model for, 105 Segmentation, 280 Selective information, measure of, 431, 438-439 Selective sampling, effects of, 108 Self-embedding, 286, 343, 470, 473-475, 480 degree of, 396, 400, 468, 470, 474, 480, 484 in English, 471 Self-embedding elements, 290, 293, 394, 399, 472 Self -embedding grammar, 394 Self-synchronizing code, 281 Semantic information, measure of, 438 Semantics, 328, 466 Semigroup, 274, 280 Sentence(s),283, 292 asymmetry of, 399, 472 definition of, 332-333 recognizing device for, 318, 465 rules for synthesizing, 466 structure of, 228, 297-298, 326-327, 399 SUBJECT INDEX 605 Sentence(s), structural complexity, meas- ure of, 480-481 structural description of, 289, 297- 298, 399 Sentence-matching test, 482 Sequence, of conditioning states, 132 index, 382-383 of responses, 75, 127-133 of trials, fixed-sample-size model, pre- dictions of, 211 Sequential calculus, 37, 406 Sequential grammar, 369-371, 389, 409 Sequential statistics, 5, 70-73, 188 for error runs, 70-71 estimation procedures for, 166-169 for fixed-sample-size component model, 211-213,216-219 for linear model and pattern model, 190 and mean learning curve, 173-174 for ^element model, 173, 188-191 observed and estimated values for, 167, 169-170, 254-256 for one-element model, 148 for paired-comparison learning ex- periment, 188-191 for pattern model, 164, 169-170 technique for deriving, 177-178 for visual detection experiment, 254- 256 Serial autocorrelation of errors, 71 Serial computer, program of instructions for, 486 Set(s), computable, 354 decidable, 354-355 recursively enumerable, 355-356, 361-362 stimulus, 123-124, 182, 192 of strings, 356-357, 362-363 theory of, 495 Shuttlebox experiment, 8-10, 13, 40, 65, 74-75, 93, 95-96, 110 beta model for, 28-29, 53-54 data from, 50, 103,111 explicit formula for response proba- bility for, 23, 29, 32 linear-operator model for, 22, 53-54, 105, 109 model-free analysis of, 53 recursive formula for response proba- bility of, 23, 28 Shuttlebox experiment, Restle model for, 104 urn model for, 32, 53, 54 Signal detection experiment, 250-256 Signed graph, 530, 540-541 Similarity, index of, 201 Simplicity, 38 Simulation of behavior, computer pro- gram for, 485 Single-event model, 14, 66 Single-operator linear model, 34, 66, 89, 90, 92, 140 estimation of parameters for, 94 expected number of errors in, 90 mean learning curve for, 140 in prediction experiment, 84 recursive formula for, 56, 84 Single-pass device, 469, 472 Singularity transformation, 303-305 Sink, 497, 499, 509 Small group, see Group Social group, see Group Social disintegration, 564 Social dominance, 541-546, 570 Social interactions, measurability of, 495 Social rank, 570 Social space, distance in, 528-529 index of cliquishness in, 5 1 6 metrical properties of, 528 topology of, 515,528 Social structure, 543 Sociogram, 522-529, 539-541 Sociometric choice, 496, 516, 523-529, 542, 545 Sociostructural bias, 517-519, 521 Sound structure, 306-319 Source, 497-498, 509 Speech perception, 273, 311, 314, 318 Spontaneous recovery, 221 Spontaneous regression, 220-221 Stability of equilibrium, 500-503, 510- 
State(s), absorbing, 498, 571, 575
  asymptotic probabilities of, 573, 575
  change in, 498-499, 504, 509, 520
  conditioning, 125, 130-131, 143, 155, 192
  continuum of, 497
  diagram, 332
  final, 334
  internal, 331
  irreversible, 497-498
  steady, 463, 496
  of subjects, 575
State probability, 162
Stat-organism, data from, 46-47
Status, 535
Stereotypy, 484
Stimulus, 123
  background, 195-197
  communality of, 124
  intensity of, 124
  overlapping of samples of, 219
  in paired-comparison experiment, 182
  scaling of, 203-205
  transfer, 206
  variability of, 124
  see also Cues
Stimulus compounding, 193-198
Stimulus elements, 123
  background, 195-196
Stimulus fluctuation model, 219
  applied to RTT experiment, 221-223, 232
  linear model as limiting case of, 226-228
  for noncontingent case, 223-226, 228
  for prediction experiment, 223-226
Stimulus generalization, 200-206
Stimulus pattern model, 153-191
  asymptotic distribution, 159-162, 169
  axioms for, 155
  matching theorem, 179-181
  for paired-comparison experiment, 181-191
Stimulus-sampling model, 31, 61, 123-125
  for discrimination learning, 250-256
  limiting case, 226-234
  exposure time in, 211
  urn scheme, 31
Stimulus similarity, 124
Stochastic learning model, 4, 569
  see also specific models
Stochastic source, 427-430, 432
Stochastic theory of communication, 422-423
Storage, pushdown, see PDS
Storage tape, 339, 346
Strategy, choice of, 557-558, 574
  cooperative, 557-563, 575
  in cooperative nonzero-sum games, 549
  dominating, 574
  in finite nonzero-sum game, 548
  minimax, 560-569, 574
  mixed, 574
  in negotiable games, 550
  noncooperative, 557, 575
  Pareto-optimal, 549, 557
  sure-thing principle in choice of, 574
  in two-person nonzero-sum game, 547
  in two-person zero-sum game, 574
Strictly finite automata, see Automata
String(s), accept a, 332, 337-338, 340, 342, 353, 359
  analyzable, 301, 303
  binary operation on, 292
  blocked, 348
  constituent, 304
  C-terminal, 299, 306
  derivation of, 286, 292-293, 373-374, 414
  domination of, 293
  finite, 362-363
  generate a, 332, 337, 342, 353, 356-363
  generated with PDS, 344
  infinite, 362-363
  kernel, 299
  length of, 340
  null, 362-363
  probability of (in English), 440-441
  recursively enumerable, 356
  reduce a, 348
  redundancy in, 440, 443
  terminal, 288, 293
  termination of a, 293
  transformed, 303-306
  T-terminal, 299
  unique, 294
Structural ambiguity, 387-390, 406
Structural balance, theory of, 539-541
Structural complexity, 480-481, 485
Structural description, 285, 289, 297-298, 307, 479
  left-right asymmetry in, 399
  possible, 295
Structural regularities, 330
Structure, of clique, 526, 537-539
  dominance, 541-546, 569-570
  grammatical, 488
  group, 543
  index of, 301
  peck-right, 542-543
Stylistic transformations, 471-472
Subgraph, 532
Subgroup, 532
Subject-controlled event, see Event
Subject-controlled event model, 34, 65
Submissive relations, 543
Subspace, model type, 89-93
Substitution transformation, 304
Substitutive stimulus dimension, 202-203
Success, trials before first, 90-91
Successive samples, independence of, 219
Successive symbols, correlated, 423
Sufficient statistics, 38, 96
Summation effect, 196
Sure-thing principle, 574
Survival, conditions of, 510-511
Syllabaries, 273
Symbol(s), amount of information per, 432
  boundary, 287, 292-293, 334, 338
  code, 452
  correlation of successive, 423
  initial, 292
  postponed, 474, 478
  string of, see String
  terminal, 359, 474
Symmetry, 25, 33, 45, 279, 530
Symmetry bias, 515
Synchronization, 278
Syntactic category, 430
Syntactic component (of sound structure), 306
Syntactic structure, 285, 328
Tagmemics, 410
Tape, blocked, 331, 348
  contents of, 341
  as counter, 345
  input, 339
  output, 346
  storage, 339, 346
Terminal element, 294
Terminal language, 293
Terminal situation of scanning device, 340
Terminal string, 288, 293
Terminal symbols, 359, 474
Terminal vocabulary, universal, 357
Testing models, insensitivity of methods of, 100
  see also specific models; Goodness-of-fit
Theory, see specific topics
Threats, 550-551, 554
Threshold, 11, 54
Threshold model, 11, 25
Thurstone model, late, 11
Tightness of net, measure of, 519
Time series, nonstationary, 116
T-maze experiment, see Maze experiment
Total errors, see Errors
Tote unit, 485-486
Tracing (of information flow), 513-514, 524-528
Transducer, 351, 374-375, 377, 387, 391-392, 397, 399-400
  bounded, 347, 392-393
  finite, 346-348, 395, 410, 466, 468-469
  generation of a structural description by a, 396
  information lossless, 347
  nondeterministic, 406-407
  strong equivalence of, 396, 400
  understanding of all sentences by, 469
  with PDS, 348
Transfer, effect, 206, 247
Transformation(s), additive, 11
  adjunction, 306
  attachment, 304-305
  behavioral, 487
  classes of, 304
  complexity of, 485
  deletion, 306
  elementary, 301-302
  generalization, 303-304
  grammatical, 296-306, 357, 365, 377, 379, 387, 477, 481-482
  interrogative, 482
  negative, 482
  obligatory, 299
  optional, 299
  passive, 300, 482
  permutation, 304, 534
  phonological, 314
  psychological correlates of, 482
  singularity, 303-305
  speed of, 482
  stylistic, 471-472
  substitution, 304
Transformational cycle, 315
Transformational grammar, 296-306, 357, 365, 476-483
Transformed string, constituent structure of, 303-306
Transition probabilities, 156-158, 498
  for fixed-sample-size component model, 212-213
  of individual state, 575
  of system state, 575
Transition rule, see Operator
Transitivity, 279, 293, 518, 542
Transitivity bias, 515, 525, 528
Tree, 278, 289-290, 455, 528, 532
  see also Graph; Branching process; Tree code
Tree code(s), 280-282, 452
Trial-dependent operators, 110
Trial event, 5
Trial-independent event, 17
Trials, before first success, 90-91
  to last error, 76, 135, 214
T-terminal string, 299
Turing machine, 352-362, 371, 379
Two-armed bandit experiment, 9, 15, 65-66, 73, 75, 101
  data from, 34, 50, 70-73, 105
  and linear one-trial perseveration model, 108
Two-choice prediction experiment, see Prediction experiment
Two-event experiment, 51
Type i (grammars and languages), 360-367
Uncertainty, measure of, 440
  reduction of, 432
Undecidability (of grammar), 363
Understandability, measure of, 480
Uniform codes, 281
Uniformity of opinion, pressure to achieve, 565-566
Union of languages, 380-381
Uniqueness, proof of, 86, 294
Unit, independence of, 26
Universal, see specific topic
Unrestricted rewriting system, 357, 359-360, 379
Urn scheme, 31-36, 48, 50-51, 66
  approximations for, 40-43
  error probability in, 52
  explicit formula for response probability in, 30-32, 50-51
  nomogram for, 51
  quasi-independence of path, 32
  recursive formula for response probability in, 56
Utility, 547-548
Utterance, derived, 474
  deviant, 444, 446
  grammatical, 429
Variance, asymptotic, 219, 228, 233
  of errors, 72-75, 130, 135-136
  for linear model, 216-228
Verbal behavior, 271
Verbal cues, 174, 431-432
Verbal experiments, 107, 243
Verbal operant response, 474
Visual detection experiment, 253-254
Vocabulary, 273, 292, 357
Vowel reduction, 314-315
War moods, 498
Well-formed grammar, 291, 364, 367-368
Well-mixedness, 499, 512
Word(s), 315
  distribution of length-frequency of, 457-464
  distribution of rank-frequency of, 457-464
  distribution of sequences of, 464
  frequencies of, 456-464