While ontogeny (for individuals) and phylogeny (for species) seem to agree on the evolutionary development of languages from communication, linguists disagree about the parts played by nature and culture. From a cognitive perspective, debates have long questioned children's ability to learn and master the complexity of grammars from an experience very limited in scope as well as in time, in sharp contrast with the massive corpus of texts and the computing power needed by large language models (LLMs) to generate meanings. Like the James Webb telescope shedding new light on planets, stars, and galaxies, LLMs could provide some clues about the origin and arrangement of meanings.
Languages & Grammars
From a functional as well as evolutionary perspective languages operate at two levels: conversational (spoken) and mediated (written) communication, the former with meanings attached to actual contexts, the latter with meanings rooted in symbolic representations.
Symbolic representations are by nature built on categories, some specific to communication, some to contents, and some pertaining to both. On the whole, grammars can thus be (summarily) organised into four sets, possibly with overlaps:
- Layered grammars, epitomised by Noam Chomsky's pioneering work, with distinctions between syntactic (forms), lexical (terms), semantic (sentences), and pragmatic (discourses) tiers
- Semantic grammars, epitomised by Role and Reference Grammar (RRG), which defines linguistic constructs as a combination of syntax and lexicons
- Functional grammars, which align linguistic constructs with templates meant to epitomise states of affairs in possible worlds, e.g. situation, event, action, process
- Cognitive grammars, typified for instance by George Lakoff's work, which grounds linguistic constructs in metaphors ingrained in human experience
Compared to these explicit grammars, generative transformers suggest the presence of implicit grammars that could be mined from actual discourses. The question is whether generative paths to meanings (solid, left) can be related to mediated ones (dotted, right).
Grammars & Meanings
The sudden emergence of generative chatbots and the flurry of business announcements may be misleading: more than a technological breakthrough, it's a two-pronged revelation of business openings and engineering maturity. Discovery (as opposed to invention) may also explain the mystery surrounding the pitched chatbots' nuts and bolts: there could be little to reveal simply because these chatbots rely on established deep-learning algorithms, with competitive edges mainly built on tuning and integration with other technologies. Based on the grammar taxonomy introduced above, three typical perspectives can be considered:
- Grammatical, which relies on lexicons and syntactic structures to map terms into well-formed phrases (a)
- Cognitive (or mental), which defines meanings as the realization of metaphors rooted in human consciousness of emotions (e.g. anger, hunger, …) or percepts (e.g. over, before, …) (b)
- Functional (or communicative), which derives meanings from conversations' contexts and purposes (c)
These perspectives can help to clarify the benefits of generative solutions for typical undertakings, e.g.: programming (grammatical), writing (cognitive), or searching (functional).
The Mining of Meanings
Linguistic tenets can be summarily associated with typical language models meant to translate sequences of terms into meaningful sentences.
Grammatical (a) and cognitive (b) approaches build meanings by reference to pre-defined conceptual and/or syntactic categories.
By comparison, LLM approaches (c) rely on likelihood to establish the meaning of terms: the probability of a meaning in a sequence is conditioned by its forerunners, based on semantic networks encoding an encyclopaedic corpus of terms weighted by billions if not trillions of parameters. As explained in Mark Riedl's introduction, such networks are built through a two-step scheme:
Sequences of terms are first mapped to subsets of smaller orders of magnitude meant to provide compact semantic footprints (encoding); compact footprints are then crossed back to initial sequences (decoding), with bi-directional pairings serving as yardsticks of semantic distances.
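The two-step scheme can be illustrated with a toy encoder/decoder; the feature values and the top-k projection below are illustrative assumptions, not the actual transformer machinery:

```python
# Toy illustration of the encode/decode scheme: hand-made feature
# vectors (hypothetical values) stand in for learned representations.
# "Encoding" keeps only the largest components as a compact footprint;
# "decoding" crosses back to the nearest original term, and the round
# trip serves as a crude yardstick of semantic distance.

from math import dist  # Euclidean distance (Python 3.8+)

features = {
    "car":    (0.9, 0.1, 0.8, 0.2),
    "wheel":  (0.8, 0.2, 0.7, 0.1),
    "drive":  (0.7, 0.3, 0.9, 0.3),
    "circle": (0.1, 0.9, 0.1, 0.8),
}

def encode(vec, keep=2):
    """Compact footprint: keep only the `keep` largest components."""
    ranked = sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)
    kept = set(ranked[:keep])
    return tuple(v if i in kept else 0.0 for i, v in enumerate(vec))

def decode(footprint):
    """Cross back: nearest original term to the compact footprint."""
    return min(features, key=lambda t: dist(features[t], footprint))

# Terms whose footprints decode back to themselves are confirmed;
# the others would be left out of the reference set.
confirmed = [t for t in features if decode(encode(features[t])) == t]
```

In this miniature setting the bi-directional pairing discards terms whose compact footprints drift too close to a neighbour, mirroring the "yardstick of semantic distances" described above.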
In actuality, LLMs recursively apply trillions of coding parameters to sequences (aka windows) of ten thousand terms (aka tokens), with neural networks trained on multiple years of public internet data as well as on specific files. Nonetheless, a sandbox example and symbolic connectors can serve to illustrate the basic modus operandi:
Assuming terms in lexicons and placeholders (roles) for names:
- The miniature sequence is encoded by a reference set (of the same order of magnitude) of terms with semantic (analogies) or functional (metonymies) proximity
- Reference terms are mapped back to initial ones, and the ones without decoding confirmation (e.g. circle and parade) are left out
- Syntactic categories from lexicons (noun, verb, etc.) and metonymies (car/wheels; drive/car) are used to establish or consolidate meanings (wheels/car; flat/puncture; drive/accompany)
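The consolidation step can be sketched with a hand-made lexicon and metonymy set (all entries are assumptions for illustration, not a real knowledge base):

```python
# Toy version of the consolidation step: syntactic tags and metonymic
# links are used to keep or discard the candidate reference terms
# produced by the encoding step.

lexicon = {"car": "noun", "wheel": "noun", "drive": "verb",
           "flat": "adjective", "puncture": "noun"}

# Metonymic proximity: part/whole or means/end associations
metonymies = {("car", "wheel"), ("drive", "car"), ("flat", "puncture")}

def related(a, b):
    return (a, b) in metonymies or (b, a) in metonymies

def consolidate(candidates):
    """Keep candidates that are both in the lexicon (syntactic check)
    and metonymically linked to another candidate; isolated or unknown
    terms (e.g. 'circle') are left out."""
    return [t for t in candidates if t in lexicon
            and any(related(t, o) for o in candidates if o != t)]

kept = consolidate(["car", "wheel", "drive", "circle"])
```

Here 'circle', having neither a lexicon entry nor a metonymic link to the other candidates, is left out, as in the bullet above.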
Ontological prisms can be used to align LLMs processes with EA symbolic resources:
And represented with OWL/Protégé nodes and symbolic connectors:
The tsunami-like bursting of LLMs has taken both business and technology communities by surprise, but the exponential increase in size (from billions to trillions of parameters in a few months) can be misleading, not only because of the parallel increase in computing costs, but more importantly because quantitative breakthroughs induce qualitative boundaries: like the sonic barrier for jets, LLMs' performances raise barriers to trust. Hence the focus on transparency issues and the potential benefits of symbolic AI in both conversational and mediated communication. That's when semiotic connectors come into play.
Adding symbolic capabilities to generative AI is already a major development axis on the conversational last mile of applications like programming, authoring, or search. But improving conversations with end-users can do little for the fairness and transparency of the meanings (and more generally knowledge) obtained by pre-training; to that effect one needs hybrid LLMs fine-tuned through reinforcement learning from all-inclusive as well as domain-specific corpora.
On the machine side (LLMs), self-supervised learning on all-inclusive contents relies on the assumption that implicit categories (conceptual or grammatical) will emerge from unbiased and massive enough training material; on the human side (chatbots), conversations open the door to supervised learning that can put meanings on guide-rails. In between, reinforcement learning can be focused on explicit grammatical or conceptual categories. In such hybrid models the origination of meanings, and consequently their traceability, is determined by the precedence ascribed to the internet primal soup or to domain-specific corpora; probing the formation of meanings can be compared to tracking the origin of galaxies, stars, and planets.
Were meanings celestial bodies, their life cycle would be revealed by Newton's laws of motion, and generative processes would perform like the James Webb telescope, discovering the formation of stars and planets (meanings) organized into systems (grammars), bound together by semantic forces (gravity), taxonomies (properties of stars and planets), and logic (laws of motion).
Taking English as an example, generative processes would time-travel across French and Old Norse galaxies, recording precedence along semantic pathways. Likewise, LLMs' pre-training processes could draw cartographies of meaning spaces that could support transparency and consequently regulations.
In the actual (aka digital) world of LLMs, semantic gravity forces are expressed through the encoding of parameters; that's where transparency issues should be dealt with, except that they can't be, due to the exponential complexity of lineages induced by billions of parameters.
LLMs must thus rely on a dual assumption: an implicit functional model combining conversational purposes with grammatical constraints, free of social or cognitive biases; or the wisdom of crowds. In fact business and engineering factors are already pushing for hybrid models that could meet transparency expectations as well as regulatory constraints.
Regarding transparency hybrid models combining self-supervised and reinforcement learning could improve the traceability (in terms of precedence) of exchanges between sequences, sentences, and phrases:
- Sequences/Sentences: temporal capabilities could be added to thesauruses in order to chronicle the formation of semantic pathways
- Sequences/Phrases: taxonomies could be fleshed out so as to detail the arrangement of features into conceptual or grammatical categories
- Phrases/Sentences: ontologies should include logical clauses and predicates in order to ensure the reliability and traceability of transformations
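The first bullet can be sketched as a thesaurus extended with timestamps; the class names, fields, and etymological dates below are illustrative assumptions:

```python
# Minimal sketch of a thesaurus with temporal capabilities: each
# association between a source term and a meaning is chronicled with a
# date and a provenance, so precedence can be traced.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class PathwayRecord:
    source: str      # originating term or sequence
    target: str      # associated meaning
    recorded: date   # when the association entered the corpus
    provenance: str  # e.g. "pre-training", "reinforcement", "conversation"

@dataclass
class TemporalThesaurus:
    records: list = field(default_factory=list)

    def chronicle(self, source, target, recorded, provenance):
        self.records.append(PathwayRecord(source, target, recorded, provenance))

    def lineage(self, target):
        """Trace precedence: records for a meaning, oldest first."""
        return sorted((r for r in self.records if r.target == target),
                      key=lambda r: r.recorded)

# Usage: record two origins of the same meaning and ask which came first
th = TemporalThesaurus()
th.chronicle("skirt (Old Norse)", "garment", date(1300, 1, 1), "pre-training")
th.chronicle("shirt (Old English)", "garment", date(1000, 1, 1), "pre-training")
oldest = th.lineage("garment")[0].source
```

Such records would let a hybrid model answer not only what a sequence means, but through which pathway, and when, that meaning was formed.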
Regarding regulations, and taking into account the intrinsic opacity of generative processes, hybrid models should prioritise the alignment of boundaries with existing regulatory frameworks aimed at privacy, intellectual property, and risks.
Concerning privacy, LLMs regulations should ensure that any linking of features and taxonomies (encoded through parameters) with identified individuals can only be achieved through legitimate business applications, typically by preventing direct or inferred association of features (data) with identified individuals (information).
Regarding intellectual property, the importance and ubiquity of intangible assets in enterprise value chains calls for a distinction between commons (e.g. public data), marketed resources (e.g. data factories), and proprietary ones (protected). On that basis, LLMs regulations should be contingent on a reset of the copyright and patent regulatory frameworks.
Copyright deals with the ownership of creative works, which can be summarily defined as singular and nominal:
- Singular: creative works are identified as a whole, not in reference to types or categories
- Nominal: creative works are identified at the conceptual level, independently of structures or roles
For LLMs regulations, the objective would be to use thesauruses and built-in ontological modalities to manage the association between copyrights (concepts) and the corresponding actual footprints (documents).
Once limited to physical apparatuses, the remit of patents has been extended to the design and modus operandi of the whole range of artifacts: physical or symbolic, actual or virtual. From an LLM perspective it ensues that:
- Compared to copyrights and privacy rules which pertain to actual resources, patents solely pertain to structural or functional categories.
- Compared to copyrights, patents do not protect their target as such but exclude or restrain the ways it can be used.
Taking a clue from the quagmire of forestalling and patent trolls, harnessing LLMs regulations to patents would entail, as a prerequisite, a charting of the legitimate lay of the land in terms of facts (existing patents), concepts (patent purposes), and categories (patent designs). That could be achieved by using generative capabilities and modal logic to sort out the categories not subsumed within patent footprints.
Last but not least, LLMs regulations must be considered in the broader perspective of risk management when generative technologies are used to support decision-making. On that account, the European Union's proposed approach is doubly flawed:
- Risks are defined according to administrative distinctions made redundant by the digital integration of organizations and processes and by the ubiquity of AI technologies.
- Obligations are defined in equivocal terms like “adequate”, “high” quality or level, “detailed”, “necessary”, “clear”, or “appropriate”.
Instead, risk profiles should be framed by the kind of cognitive capacity involved (observation, reasoning, judgment, experience) and the reliability of corresponding resources (data, information, knowledge):
- Routine: experience and observation (facts/data) support reasoning (categories/information) and judgement (concepts/knowledge) (a)
- Operational: more experience and observation are needed to support judgment (b)
- Critical: more experience and observation are needed to support reasoning and judgment (c)
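The three profiles can be summarised as a mapping from profile to the cognitive capacities needing reinforced support; the encoding below is an illustrative assumption, not part of any regulatory text:

```python
# Illustrative encoding of the three risk profiles: each profile lists
# the cognitive capacities that require additional experience and
# observation beyond the routine baseline.

profiles = {
    "routine":     set(),                      # (a) standard support suffices
    "operational": {"judgment"},               # (b)
    "critical":    {"reasoning", "judgment"},  # (c)
}

def classify(needs: set) -> str:
    """Least demanding profile covering the given reinforcement needs."""
    for name in ("routine", "operational", "critical"):
        if needs <= profiles[name]:
            return name
    raise ValueError(f"no profile covers {needs}")
```

Framing profiles this way makes transparency obligations checkable: a deployment declares which capacities its decisions lean on, and the profile, hence the obligations, follows.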
LLMs and, more generally, AI regulations could thus put the focus on transparency obligations that would provide for (1) informed risk assessment by business analysts and knowledge managers, and (2) sound designs by architects and engineers.
Within a few months, LLMs have simultaneously turned into a game changer and a commodity, now churning out waves of plugins; on that account the LangChain framework gives some cues about future developments.
LangChain is an open source toolbox that allows AI developers to combine LLMs with external sources of computation and data. Besides the integration of mundane functions like emails or customary applications like bookings, LangChain tools are organized into three sets:
- Chains: a scripting language to organize tasks, identify components, and assign agents
- Components: LLM wrappers, prompt templates, indexes for information retrieval
- Agents: any kind of component with agency
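The chains/components/agents organisation can be sketched in plain Python; all names are illustrative stand-ins, not the actual LangChain API:

```python
# Plain-Python sketch of the chains / components / agents pattern:
# components are callables, a chain composes them in order, and an
# agent is a component that decides which chain to run.

from typing import Callable

# Components: a prompt template, an LLM wrapper, an index for retrieval
def prompt_template(question: str) -> str:
    return f"Answer concisely: {question}"

def llm_wrapper(prompt: str) -> str:
    return f"[model answer to: {prompt}]"   # stand-in for a real model call

def index_lookup(text: str) -> str:
    return text + " [+retrieved context]"   # stand-in for retrieval

# Chain: an ordered composition of components
def chain(*steps: Callable[[str], str]) -> Callable[[str], str]:
    def run(payload: str) -> str:
        for step in steps:
            payload = step(payload)
        return payload
    return run

# Agent: decides whether the task warrants a retrieval step
def agent(task: str) -> str:
    pipeline = (chain(prompt_template, index_lookup, llm_wrapper)
                if "search" in task
                else chain(prompt_template, llm_wrapper))
    return pipeline(task)

result = agent("search: origin of the word 'skirt'")
```

The point of the pattern is that each step stays inspectable: the payload can be logged between components, which is what makes the guide-rails mentioned below conceivable.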
These capabilities could be used to build guide-rails and pathways across the galaxies of meanings, and thus significantly improve LLMs’ transparency:
- Local resources (documents), assets (relevant information), and experience (prompt templates and searches with established track record)
- Wrapped models providing the maps supporting the travels
- Agents tasked with indexing information contents (facts/categories), processing searches (concepts/categories), managing conversations (facts/concepts) and journeys (facts/concepts/categories).
Besides ensuring reliable planning and traceable execution of searches, such a framework would provide decision-makers with relevant indicators of hazards and therefore of appropriate qualifications.
LlamaIndex offers an efficient way to connect LLMs to new data sources without costly retraining. It could be used to add an LLM copilot to the Caminao corpus, providing:
- A semantic backbone built from primitives that could ensure open-ended thesauruses devoid of circular definitions (Facts/Concepts)
- Standard grammar and domain-specific categories (Facts/Categories)
- Common sense reasoning capabilities (Categories/Concepts)
Such configurations could for instance be applied to documents in support of requirements analysis.
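A toy sketch of such a retrieval copilot over a local corpus, using a plain inverted index and word-overlap scoring; this stands in for what LlamaIndex automates and is not its actual API, and the document names are made up:

```python
# Toy retrieval copilot: documents are indexed by their terms and
# queries answered by counting overlapping terms.

corpus = {
    "requirements-doc": "actors roles use cases requirements analysis",
    "ontology-doc":     "thesaurus taxonomy ontology concepts categories",
    "decision-doc":     "risk judgment observation reasoning decision",
}

def build_index(docs):
    """Inverted index: term -> set of documents containing it."""
    index = {}
    for name, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(name)
    return index

def search(index, query):
    """Rank documents by the number of query terms they contain."""
    scores = {}
    for term in query.split():
        for doc in index.get(term, ()):
            scores[doc] = scores.get(doc, 0) + 1
    return max(scores, key=scores.get) if scores else None

idx = build_index(corpus)
best = search(idx, "requirements analysis use cases")
```

In a real configuration the index would be built from the corpus documents and the retrieved passages fed to the model as context, keeping the semantic backbone (Facts/Concepts) outside the opaque parameters.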
- Signs & Symbols
- Generative & General Artificial Intelligence
- Thesauruses, Taxonomies, Ontologies
- EA Engineering interfaces
- Ontologies Use cases
- Cognitive Capabilities
Other internal references
- Things Speaking in Tongues
- What Did You Learn Last Year ?
- Brands, Bots, & Storytelling
- Transcription & Deep Learning
- Out of Mind Content Discovery
- Caminao Framework Overview
- A Knowledge Engineering Framework
- Knowledge interoperability
- Edges of Knowledge
- The Pagoda Playbook
- ABC of EA: Agile, Brainy, Competitive
- Knowledge-driven Decision-making (1)
- Knowledge-driven Decision-making (2)
- Ontological Text Analysis: Example
- A very gentle introduction to large language models without the hype
- Role and Reference Grammar and Functional Discourse Grammar
- Possible Worlds
- Federated Learning
- Sparks of Artificial General Intelligence: Early experiments with GPT-4
- LangChain explained in 13 Minutes
- Large Language Models and the Reverse Turing Test