Engineering Robust Intelligence in AI Collectives

Author: Yuriy Polyulya

Introduction - From Tools to Societies

Large-language-model (LLM) agents are no longer isolated utilities. In 2025, we see agent societies - ensembles of autonomous models that propose, critique, vote, arbitrate, and execute plans. These systems now drive research workflows, policy simulations, customer operations, and even autonomous infrastructure.

As their influence grows, so does the need for governance: the structured protocols, decision rules, and accountability mechanisms that determine how collective outcomes emerge. Poor governance here is not a glitch - it’s a systemic risk.

This post integrates the latest theoretical results and practical frameworks from 2023–2025 - such as TalkHier for structured multi-agent deliberation, AgentNet for decentralized trust adaptation, SagaLLM for planning consistency, and TRiSM for safety and oversight - into a mathematically consistent and engineering-ready governance model.

Running Example: Market Sentiment Analysis Society

To ground these concepts, consider FinanceNet (a hypothetical example), a multi-agent LLM society tasked with analyzing market sentiment from news articles to predict stock trends. The society consists of the specialized agents that appear throughout this post: EconAgent, NewsAgent, and GeneralistAgent, supported by a MetaAnalyst arbitration agent; a compromised MaliciousAgent appears in the adversarial scenarios.

Their collective task: Process 1,000 daily news articles and produce a sentiment score (-1 to +1) for each of 50 tracked stocks, with confidence intervals that trading algorithms can act upon.

Throughout this post, we’ll see how governance mechanisms handle coordination, disagreement, and adversarial behavior in this concrete scenario.


Mathematical Core: Decision Rules in Machine Democracy

Consider \(n\) agents \(A = \{a_1, \dots, a_n\}\) producing responses \(r_i\) with confidences \(c_i \in [0,1]\). The goal: aggregate \(\{r_i\}\) into a decision \(D\) that is both correct and robust.

Weighted Byzantine Fault Tolerance (WBFT): $$w_i(t) = \alpha \, \text{Trust}_i(t) + (1 - \alpha) \, \text{Quality}_i(t)$$

Where \(\text{Trust}_i(t)\) is agent \(i\)'s historical reliability, \(\text{Quality}_i(t)\) is the assessed quality of its current response, and \(\alpha \in [0,1]\) balances long-term trust against immediate response quality.

FinanceNet Example: When analyzing Tesla stock sentiment, each agent's weight is computed with \(\alpha = 0.6\).

This weighting reflects both historical reliability and current response quality.

Aggregation: $$D = \arg\max_{d \in \mathcal{D}} \sum_{i=1}^n w_i \, \text{Support}_i(d)$$

Where \(\mathcal{D}\) is the space of candidate decisions and \(\text{Support}_i(d) \in [0,1]\) is agent \(i\)'s expressed support for candidate \(d\).
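
A minimal sketch of this weighting and aggregation in Python; the agent names follow the running example, while the trust, quality, support values, and candidate decisions are illustrative assumptions:

# Sketch: weighted aggregation with w_i = alpha*Trust_i + (1-alpha)*Quality_i.
# All numbers and candidate decisions below are illustrative assumptions.
ALPHA = 0.6

agents = {
    "EconAgent":       {"trust": 0.85, "quality": 0.80},
    "NewsAgent":       {"trust": 0.72, "quality": 0.90},
    "GeneralistAgent": {"trust": 0.65, "quality": 0.70},
}

def weight(a):
    return ALPHA * a["trust"] + (1 - ALPHA) * a["quality"]

# Support_i(d) in [0, 1] for each candidate decision d (assumed labels).
support = {
    "EconAgent":       {"bullish": 0.2, "neutral": 0.7, "bearish": 0.1},
    "NewsAgent":       {"bullish": 0.6, "neutral": 0.3, "bearish": 0.1},
    "GeneralistAgent": {"bullish": 0.4, "neutral": 0.5, "bearish": 0.1},
}

decisions = ["bullish", "neutral", "bearish"]
scores = {
    d: sum(weight(agents[name]) * support[name][d] for name in agents)
    for d in decisions
}
D = max(scores, key=scores.get)  # argmax_d sum_i w_i * Support_i(d)
print(D, scores)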

Robust Aggregation Under Attack

FinanceNet Example: When MaliciousAgent outputs extreme sentiment (+0.95 for all stocks), the geometric median approach automatically isolates this outlier. The three honest agents cluster around reasonable sentiment values (-0.2 to +0.4), and the median preserves this consensus while rejecting the manipulation attempt.
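
A minimal sketch of geometric-median aggregation via Weiszfeld's algorithm; the specific algorithm choice and the sentiment vectors below are illustrative assumptions, not details taken from any particular framework:

import numpy as np

def geometric_median(points, iters=100, eps=1e-7):
    """Weiszfeld's algorithm: the geometric median minimizes the sum of
    Euclidean distances to all points, limiting any single outlier's pull."""
    x = np.mean(points, axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - x, axis=1)
        d = np.maximum(d, eps)            # avoid division by zero
        w = 1.0 / d
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < eps:
            break
        x = x_new
    return x

# Three honest agents cluster around moderate sentiment; one outputs +0.95 everywhere.
honest = np.array([[-0.2, 0.1], [0.0, 0.3], [0.1, 0.4]])   # per-stock sentiment vectors (assumed)
malicious = np.array([[0.95, 0.95]])
print(geometric_median(np.vstack([honest, malicious])))     # stays near the honest cluster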

These approaches are now implemented in DecentLLMs and Trusted MultiLLMN frameworks for production-scale robustness.

While these single-shot aggregation rules are efficient, complex or contentious decisions may require the multi-round deliberation dynamics discussed in our theoretical foundations to reach a stable and robust consensus.

Layered Governance Architecture

Flat voting breaks at scale. Instead, use:

  1. Protocol Layer - Structured message formats (as in TalkHier).
  2. Decision Layer - Weighted voting, consensus, or deliberation.
  3. Arbitration Layer - Meta-agents resolving deadlocks (SagaLLM’s validator agents).
  4. Audit Layer - TRiSM-style risk checks, explainability, and compliance logging.

FinanceNet Example: When analyzing conflicting reports about Apple’s quarterly earnings, the agents produce divergent sentiment scores (EconAgent: -0.3, NewsAgent: +0.2, GeneralistAgent: +0.1). Low consensus quality score (0.45, below the escalation threshold of 0.55) triggers escalation to the Arbitration Layer, where a specialized MetaAnalyst agent reviews the source articles, identifies the key disagreement (revenue vs. profit focus), and produces a nuanced consensus: “Mixed sentiment with revenue concerns but profit optimism” (final score: -0.05).

Formally, governance transitions can be modeled as:

$$P(s_{t+1} = \text{escalate} \mid s_{t}) = f(\text{ConsensusQuality}_{t})$$

Where \(s_t\) is the governance state at step \(t\) and \(f\) is a decreasing function of consensus quality, so that low-consensus decisions are increasingly likely to escalate to the Arbitration Layer.
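
The post does not fix a specific form for \(f\); a minimal sketch, assuming a decreasing logistic with an assumed steepness \(k\) and the 0.55 escalation threshold from the Apple example:

import math

def escalation_probability(consensus_quality, threshold=0.55, k=20.0):
    """P(escalate): close to 1 well below the threshold, close to 0 well above it.
    The logistic form, threshold, and k are assumed values for illustration."""
    return 1.0 / (1.0 + math.exp(-k * (threshold - consensus_quality)))

print(escalation_probability(0.45))  # Apple example: quality 0.45 -> escalate with high probability
print(escalation_probability(0.80))  # strong consensus -> stay at the Decision Layer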

Integrating with Frameworks and Protocols

Practical Governance Selection Algorithm

Algorithm: Protocol Selection for Production Systems
Input: Task characteristics, Risk assessment, Agent pool
Output: Optimal governance protocol

IF riskLevel = "high" AND byzantineFraction > 0.25:
    RETURN "Byzantine-tolerant consensus"
ELIF task.complexity = "novel" AND timeBudget = "extended":
    RETURN "Multi-round deliberation" 
ELSE:
    RETURN "Weighted voting"

Engineering Collective Intelligence: A Mindset-Driven Approach

While mathematical frameworks and algorithms provide the technical foundation, building robust governance for LLM societies requires applying the core properties of an engineering mindset: simulation, abstraction, rationality, awareness, and optimization. Each property guides how we conceptualize, design, and validate multi-agent decision systems beyond pure algorithmic implementation.

Simulation: Mental Models of Agent Interactions

Cognitive Framework: The ability to model complex multi-agent dynamics mentally, predicting emergent behaviors under various conditions while recognizing these models as useful abstractions rather than perfect reality.

Application to LLM Governance: We simulate agent interactions to predict consensus quality, Byzantine failure modes, and system scalability limits. This mental modeling enables design decisions before expensive implementation.

Mathematical Foundation: Consider the agent interaction space as a graph \(G = (V, E)\) where vertices \(V\) represent agents and edges \(E\) represent communication channels. The simulation capacity involves predicting system behavior under transformations:

$$\text{SystemState}_{t+1} = f(\text{SystemState}_t, \text{GovernanceRules}, \text{ExternalConditions})$$

Where \(\text{SystemState}_t\) captures the agents' current responses, weights, and trust estimates, \(\text{GovernanceRules}\) are the active protocols and thresholds, and \(\text{ExternalConditions}\) are inputs such as incoming tasks and adversarial pressure. Effective simulation requires understanding the functional relationship \(f\) through mental abstraction rather than exhaustive computation.

Abstraction: Identifying Universal Governance Patterns

Cognitive Framework: The sophisticated generalization that filters non-essential details while preserving critical system properties. In governance design, abstraction enables us to identify patterns that transcend specific implementation details.

Application: Abstract governance principles emerge across different multi-agent systems:

Abstraction Hierarchy:

  1. Physical Layer: Individual LLM responses, network communications, computational resources
  2. Protocol Layer: Message formats, voting procedures, aggregation rules
  3. Governance Layer: Decision-making frameworks, conflict resolution, accountability mechanisms
  4. Meta-Governance Layer: Self-modifying rules, evolutionary protocols, adaptive strategies

Rationality: Evidence-Based Governance Design

Cognitive Framework: Decision-making based on mathematical evidence and logical frameworks, serving as verification for both simulation accuracy and abstraction validity.

Application: Rational governance design demands rigorous evaluation of each component:

Agent Selection Rationality: $$\text{SelectionScore}(a_i, t) = \sum_{j} \beta_j \cdot \text{ExpertiseLevel}_{j}(a_i, t.domain) + \gamma \cdot \text{Trust}(a_i, t)$$

Where \(\text{ExpertiseLevel}_{j}(a_i, t.domain)\) measures agent \(a_i\)'s expertise along dimension \(j\) for the task's domain, \(\beta_j\) weights that dimension, and \(\gamma\) weights the agent's accumulated trust.
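
A small sketch of this score for one agent; the expertise dimensions, \(\beta\) weights, trust value, and \(\gamma\) are illustrative assumptions:

# SelectionScore(a_i, t) = sum_j beta_j * ExpertiseLevel_j(a_i, t.domain) + gamma * Trust(a_i, t)
# Dimension names and all numbers below are assumed for illustration.
def selection_score(expertise_by_dim, betas, trust, gamma=0.3):
    return sum(betas[dim] * level for dim, level in expertise_by_dim.items()) + gamma * trust

econ_agent = {"macro": 0.9, "equities": 0.8, "regulation": 0.4}
betas = {"macro": 0.5, "equities": 0.35, "regulation": 0.15}
print(selection_score(econ_agent, betas, trust=0.85))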

Aggregation Rationality: Choose aggregation methods based on their mathematical guarantees and empirical performance.

Algorithmic Framework:

Algorithm: Rational Governance Protocol Selection
Input: Task characteristics, Agent capabilities, Risk tolerance
Output: Optimal governance protocol

1. Analyze task properties:
   - Stakes level: {low, medium, high}
   - Complexity: {routine, moderate, novel}
   - Time constraints: {real-time, standard, extended}

2. Evaluate agent pool:
   - Reliability distribution: Hist(agentTrustScores)
   - Expertise coverage: domainExpertiseMatrix
   - Byzantine risk: Estimate potential malicious fraction

3. Select protocol based on mathematical guarantees:
   IF stakes = high AND byzantineRisk > 0.25:
       RETURN Byzantine-tolerant consensus
   ELIF complexity = novel AND time = extended:
       RETURN Multi-round deliberation
   ELSE:
       RETURN Weighted voting

Awareness: Understanding Governance Limitations

Cognitive Framework: Meta-cognitive recognition of the boundaries and potential failures in our governance models. Without awareness, we cannot identify when our abstractions break down or when our simulations diverge from reality.

Critical Awareness Areas:

Model Limitations: Our mathematical frameworks assume:

Emergent Failure Modes: LLM societies exhibit system-level behaviors that single-agent analysis cannot predict:

Boundary Detection Algorithm:

Algorithm: Governance Boundary Detection
Input: System performance history, Current conditions
Output: Risk assessment and potential failure modes

1. Monitor key indicators:
   - Consensus quality trend over time
   - Agent behavior consistency metrics  
   - Decision accuracy in known scenarios
   - Resource utilization patterns

2. Detect anomalies:
   FOR each metric m in monitoringSet:
       IF deviation(m) > thresholdM:
           Flag potential boundary violation
           Estimate failure probability
   
3. Trigger adaptive response:
   IF failureRisk > acceptableLevel:
       Escalate to higher governance layer
       Initiate protocol adjustment procedure
       Alert human oversight system
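
A minimal monitoring sketch, assuming each metric keeps a rolling history and is flagged when its latest value drifts more than an assumed number of standard deviations from its baseline:

import statistics

def detect_boundary_violations(history, current, z_threshold=3.0):
    """history: dict metric -> list of past values; current: dict metric -> latest value.
    Flags metrics whose latest value deviates from the rolling baseline by more than
    z_threshold standard deviations (an assumed, deliberately simple anomaly test)."""
    flags = {}
    for metric, past in history.items():
        mu = statistics.mean(past)
        sigma = statistics.stdev(past) or 1e-9
        z = abs(current[metric] - mu) / sigma
        if z > z_threshold:
            flags[metric] = z
    return flags

history = {"consensus_quality": [0.78, 0.81, 0.76, 0.80, 0.79]}
print(detect_boundary_violations(history, {"consensus_quality": 0.45}))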

Optimization: Systematic Improvement of Collective Decision-Making

Cognitive Framework: The systematic pursuit of solutions that maximize decision quality while minimizing computational and coordination costs. This requires challenging our natural tendency toward “satisficing” (accepting good enough solutions).

Multi-Objective Optimization: LLM governance involves simultaneous optimization across multiple dimensions:

$$\max_{protocols} \sum_{i=1}^{5} w_i \cdot f_i(\text{Protocol})$$

Where each \(f_i(\text{Protocol})\) scores the protocol along one governance objective (for example decision quality, robustness, and computational or coordination cost) and \(w_i\) is that objective's relative weight.

Pareto-Optimal Governance: Since these objectives often conflict, we seek Pareto-optimal solutions where improvement in one dimension requires sacrifice in another.

Dynamic Optimization Algorithm:

Algorithm: Adaptive Governance Optimization
Input: Performance history, Current objectives, Resource constraints
Output: Optimized governance parameters

1. Performance evaluation:
   Measure current system performance across all objectives
   Compare to historical baselines and theoretical optima
   
2. Gradient estimation:
   FOR each adjustable parameter p:
       Estimate ∂(performance)/∂p through small perturbations
       Account for interaction effects between parameters
   
3. Multi-objective improvement:
   Compute Pareto-improvement directions
   SELECT direction that maximizes weighted objective improvement
   UPDATE parameters using adaptive step size
   
4. Validation and rollback:
   Test updated parameters on validation set
   IF performance degrades: ROLLBACK to previous configuration
   ELSE: COMMIT changes and update baseline
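
A sketch of the gradient-estimation step using central finite differences; the evaluate function and the toy performance surface are assumptions for illustration:

def estimate_gradient(params, evaluate, delta=0.01):
    """Central finite differences: d(performance)/d(p) for each governance parameter.
    `evaluate` is assumed to return a scalar performance score for a parameter dict."""
    grad = {}
    for name, value in params.items():
        up, down = dict(params), dict(params)
        up[name] = value + delta
        down[name] = value - delta
        grad[name] = (evaluate(up) - evaluate(down)) / (2 * delta)
    return grad

# Toy performance surface peaked at consensus_threshold = 0.75 (assumed for illustration).
evaluate = lambda p: 1.0 - (p["consensus_threshold"] - 0.75) ** 2
print(estimate_gradient({"consensus_threshold": 0.6}, evaluate))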

Theoretical Foundations of Multi-Agent Consensus

Information Aggregation Theory

Central Question: How do we optimally combine diverse information sources while accounting for their reliability, potential bias, and strategic behavior?

Condorcet Jury Theorem Extension: For LLM societies, the classical result that majority voting approaches optimal accuracy as group size increases requires modification:

Modified Condorcet Conditions:

  1. Independence: Agent responses must be conditionally independent given the true answer
  2. Competence: Each agent must have probability \(p > 0.5\) of correct response
  3. Honesty: Agents must report their true beliefs rather than strategic responses

In LLM contexts, these conditions face unique challenges:

Robustness Extensions: To handle condition violations, we need robust aggregation rules:

Weighted Condorcet Rule (Single-Round): $$P(\text{CorrectDecision}) = \frac{\sum_{i} w_i \cdot p_i}{\sum_{i} w_i}$$

Where \(p_i\) is agent \(i\)'s probability of answering correctly and \(w_i\) is its aggregation weight.

Multi-Round Deliberation Dynamics:

For complex tasks requiring iterative refinement, agents evolve through deliberation rounds \(k = 1, 2, \ldots, K\):

Agent Competence Evolution: $$p_i(k+1) = p_i(k) + \eta_i \cdot \sum_{j \neq i} w_j(k) \cdot \text{InfoGain}_{j \rightarrow i}(k)$$

Dynamic Weight Updates: $$w_i(k+1) = \alpha \cdot \text{Trust}_i(k) + (1-\alpha-\beta) \cdot \text{Quality}_i(k) + \beta \cdot \text{Consistency}_i(k)$$

Consensus Quality Measurement: $$\text{ConsensusQuality}(k) = \frac{1}{1 + \frac{\sigma_k}{\mu_k}}$$

Where \(\sigma_k\) is the standard deviation of agent responses \(r_i(k)\) in round \(k\), and \(\mu_k\) is the mean of absolute response values \(|r_i(k)|\) across all \(n\) agents.

Termination Condition: $$\text{Continue} \iff \text{ConsensusQuality}(k) < \tau \text{ AND } k < K_{\max}$$

Where \(\tau\) is the consensus-quality threshold for stopping deliberation (0.75 in the example below) and \(K_{\max}\) is the maximum number of rounds.
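
A minimal sketch of the competence and weight update rules; the learning rate, trust/quality/consistency scores, and information-gain values are assumed, chosen to match the merger example that follows:

# One deliberation-round update, following the formulas above.
# eta, weights, and info-gain values below are assumed for illustration.
def update_competence(p_i, eta_i, others):
    """others: list of (w_j, info_gain_j_to_i) pairs from the other agents."""
    return p_i + eta_i * sum(w_j * gain for w_j, gain in others)

def update_weight(trust, quality, consistency, alpha=0.6, beta=0.2):
    return alpha * trust + (1 - alpha - beta) * quality + beta * consistency

# EconAgent absorbing NewsAgent's regulatory concerns (eta = 0.1, assumed info flows):
p_econ_2 = update_competence(0.85, 0.1, [(0.72, 0.7), (0.60, 0.0)])
print(round(p_econ_2, 2))   # ~0.90, matching the merger example below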

FinanceNet Multi-Round Example: When analyzing a complex merger announcement affecting multiple sectors:

Round 1: Initial sentiment scores diverge widely (EconAgent: +0.1, NewsAgent: -0.4, GeneralistAgent: +0.3), giving ConsensusQuality(1) = 0.43.

Round 2: After information exchange, EconAgent incorporates NewsAgent’s regulatory concerns (InfoGain = 0.7), updating competence: p_econ(2) = 0.85 + 0.1 × 0.72 × 0.7 = 0.90. Revised scores converge (EconAgent: -0.1, NewsAgent: -0.2, GeneralistAgent: +0.1), improving ConsensusQuality(2) = 0.75.

Round 3: Final convergence achieved with ConsensusQuality(3) = 0.83 > τ = 0.75, terminating deliberation with collective sentiment: -0.07.

Connecting the Examples: The Apple earnings analysis from the layered-governance example above, which escalates to arbitration, could instead benefit from multi-round deliberation if time permits, using this merger analysis framework. When consensus quality falls below 0.55, the system can choose between immediate arbitration (fast) or multi-round deliberation (thorough) based on time constraints and decision stakes.

ConsensusQuality Calculation: For Round 1 scores (+0.1, -0.4, +0.3), we first calculate the sample standard deviation (\(\sigma_1 \approx 0.36\)) and the mean of the absolute values (\(\mu_1 \approx 0.27\)). Our consensus quality is defined as \(\frac{1}{1 + \frac{\sigma_1}{\mu_1}}\).

This gives ConsensusQuality(1) = 1 / (1 + σ₁/μ₁) = 1 / (1 + 0.36/0.27) = 1 / 2.33 ≈ 0.43. Since 0.43 is below the termination threshold of τ = 0.75, deliberation continues. This formula naturally bounds results between 0 (infinite disagreement) and 1 (perfect consensus).
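
The same calculation in a few lines of Python, using the sample standard deviation as in the worked numbers (the second input is a synthetic tight-agreement example, not from the text):

import statistics

def consensus_quality(responses):
    sigma = statistics.stdev(responses)                    # sample standard deviation
    mu = statistics.mean(abs(r) for r in responses)        # mean absolute response
    return 1.0 / (1.0 + sigma / mu)

print(round(consensus_quality([0.1, -0.4, 0.3]), 2))       # ~0.43 -> continue deliberation
print(round(consensus_quality([-0.07, -0.06, -0.08]), 2))  # tight agreement -> high quality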

Byzantine Social Choice Theory

Problem Statement: Design mechanisms that produce good collective decisions even when some participants are malicious or compromised.

Impossibility Results: Arrow’s theorem applies to LLM societies - no aggregation rule over three or more alternatives can simultaneously satisfy unrestricted domain, Pareto efficiency, independence of irrelevant alternatives, and non-dictatorship.

Constructive Approaches: Since perfect aggregation is impossible, we optimize for specific objectives:

Byzantine-Tolerant Voting: Modify classical voting rules to handle malicious participants:

Algorithm: Byzantine-Tolerant Weighted Voting
Input: Agent responses {r₁, r₂, ..., rₙ}, Weights {w₁, w₂, ..., wₙ}
Output: Collective decision

1. Outlier detection:
   Compute semantic similarities between all response pairs
   Flag responses with low similarity to majority cluster
   
2. Robust weight adjustment:
   FOR each agent i:
       IF outlierScore(i) > threshold:
           w_i ← w_i * exp(-λ * outlierScore(i))
   
3. Weighted aggregation:
   Normalize weights: w_i ← w_i / Σ(w_j)
   Compute weighted centroid in semantic embedding space
   Generate final response from centroid
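
A simplified numeric sketch of the outlier-detection and re-weighting steps, treating responses as sentiment vectors rather than text embeddings; the outlier score, \(\lambda\), and threshold are assumed choices:

import numpy as np

def byzantine_tolerant_weights(responses, weights, lam=2.0, threshold=1.0):
    """Down-weight agents whose responses sit far from everyone else's.
    responses: (n, d) array; weights: length-n array. The outlier score is the mean
    distance to the other responses, standardized across agents (an assumed choice)."""
    responses, weights = np.asarray(responses, float), np.asarray(weights, float)
    dists = np.linalg.norm(responses[:, None, :] - responses[None, :, :], axis=-1)
    outlier = dists.sum(axis=1) / (len(responses) - 1)        # mean distance to others
    z = (outlier - outlier.mean()) / (outlier.std() + 1e-9)
    adjusted = np.where(z > threshold, weights * np.exp(-lam * z), weights)
    return adjusted / adjusted.sum()                          # normalized weights

resp = [[-0.2, 0.1], [0.0, 0.3], [0.1, 0.4], [0.95, 0.95]]    # last agent is the outlier
print(byzantine_tolerant_weights(resp, [0.8, 0.75, 0.7, 0.7]))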

Computational Social Choice for LLMs

Generative Social Choice: Recent work extends social choice theory to text generation, where the goal is producing text that optimally represents diverse viewpoints rather than selecting from pre-existing options.

Key Innovation: Instead of choosing between discrete alternatives, we generate new text that satisfies collective preferences:

Preference Extrapolation: Given partial preference information from agents, estimate their complete preference ranking over the space of possible responses.

Mathematical Framework: Let \(\mathcal{T}\) be the space of all possible text responses. Each agent \(i\) has a preference relation \(\succeq_i\) over \(\mathcal{T}\). The goal is finding \(t^* \in \mathcal{T}\) that optimizes a social welfare function:

$$t^* = \arg\max_{t \in \mathcal{T}} \sum_{i=1}^{n} w_i \cdot U_i(t)$$

Where \(U_i(t)\) is agent \(i\)'s utility for candidate text \(t\) and \(w_i\) is its weight in the social welfare function.

Practical Algorithm:

Algorithm: Generative Social Choice
Input: Partial agent preferences, Social welfare function
Output: Optimal collective text response

1. Preference learning:
   FOR each agent i:
       Learn utility function U_i from preference samples
       Use neural preference model or ranking-based approach
       
2. Optimization in text space:
   Initialize candidate text using standard generation
   REPEAT:
       Compute gradient of social welfare w.r.t. text parameters
       Update text using gradient ascent in embedding space
       Project back to valid text using language model
   UNTIL convergence
   
3. Validation:
   Verify that generated text maintains semantic coherence
   Check that it reasonably represents input preferences

Emergent Behaviors and Phase Transitions

LLM societies exhibit order–disorder transitions based on diversity, connectivity, trust update rates, and adversary proportion. The order parameter: $$\phi = \frac{1}{n} \left| \sum_{j=1}^n e^{i \theta_j} \right|$$

Where \(\theta_j\) encodes agent \(j\)'s stance as a phase angle, so \(\phi \in [0,1]\), with \(\phi \approx 1\) indicating full alignment and \(\phi \approx 0\) indicating fragmented opinions.

This captures coordination - critical for spotting tipping points before governance collapse.
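
A small sketch of the order parameter; how agent stances are mapped to phase angles \(\theta_j\) is an assumed convention:

import numpy as np

def order_parameter(thetas):
    """phi = |mean of e^{i*theta}|: 1 when all agents are aligned,
    near 0 when opinions are spread uniformly around the circle."""
    return np.abs(np.exp(1j * np.asarray(thetas)).mean())

aligned = [0.10, 0.15, 0.05, 0.12]                           # near-identical stances -> phi ~ 1
fragmented = np.linspace(0, 2 * np.pi, 8, endpoint=False)    # evenly spread -> phi ~ 0
print(order_parameter(aligned), order_parameter(fragmented))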

Critical Phenomena:

Understanding these transitions enables proactive governance adjustment - AgentNet’s trust adaptation and TRiSM’s monitoring can detect approaching phase boundaries.

The Role of the Human Cognitive Director in Machine Governance

While LLM societies can operate autonomously, the human engineer serves as the Cognitive Director - the architect and ultimate guardian of the governance system. This role builds directly on the adversarial intuition framework for human-AI collaboration.

Parameter Setting and Initial Calibration

The Cognitive Director establishes the foundational parameters that govern the system’s decision-making:

Risk Threshold Calibration:

Algorithm: Human-Guided Risk Calibration
Input: Historical performance data, Stakeholder risk tolerance
Output: Calibrated risk thresholds for governance protocols

1. Analyze failure costs in domain:
   financialImpact ← estimate_decision_error_costs()
   reputationRisk ← assess_stakeholder_confidence_impact()
   
2. Set escalation thresholds:
   consensusThreshold ← optimize_for_error_vs_efficiency_tradeoff()
   byzantineDetectionSensitivity ← calibrate_false_positive_tolerance()
   
3. Human validation:
   humanDirector.review_and_adjust(proposedThresholds)

FinanceNet Example: The Cognitive Director sets a conservative consensus threshold (0.75) for high-stakes trading decisions, but allows lower thresholds (0.55) for preliminary analysis that humans will review.

Adversarial Intuition in Governance Monitoring

The human applies adversarial intuition to monitor the Audit Layer, detecting subtle failure modes that automated systems miss:

Ultimate Arbitration Authority

The Cognitive Director intervenes when machine governance reaches its limits:

Critical Confidence Thresholds: When FinanceNet’s confidence drops below 0.3 for a major market decision, the system automatically escalates to human review. The Cognitive Director can:

Deadlock Resolution: When the Arbitration Layer fails to resolve conflicts, the human steps in with meta-cognitive capabilities - understanding not just what the agents disagree about, but why their reasoning frameworks are incompatible.

This human-AI partnership ensures that machine governance remains aligned with human values while leveraging collective artificial intelligence at scale.

Ethical Dimensions and Systemic Risks

Governance in LLM societies raises profound questions about accountability, fairness, and value alignment that go beyond technical robustness.

Emergent Bias in Collective Intelligence

The Paradox of Individual Fairness: Even when individual agents are unbiased, their interactions can produce systematically biased collective outcomes.

FinanceNet Example: Each agent individually processes news articles fairly across different geographic regions. However, their collective interaction patterns inadvertently amplify Western financial news sources - not due to individual bias, but because these sources get referenced more frequently in inter-agent deliberation, creating a feedback loop that underweights emerging market perspectives.

Mathematical Framework for Bias Detection: Let \(B_{Collective}\) represent collective bias and \(B_{Individual}^{(i)}\) represent individual agent biases: $$B_{Collective} \neq \sum_{i=1}^n w_i B_{Individual}^{(i)}$$

Where \(w_i\) are the aggregation weights used in collective decisions.

The inequality captures how interaction topology and aggregation mechanisms can amplify or create biases that don’t exist at the individual level.

Accountability in Autonomous Collectives

The Responsibility Gap: When an autonomous LLM society makes a harmful decision, determining accountability becomes complex:

FinanceNet Example: The society’s sentiment analysis contributes to a market crash by amplifying panic in financial news. Questions arise: Is the Cognitive Director responsible for not intervening? Are the original developers liable for not anticipating this interaction pattern? How do we assign responsibility when the decision emerged from complex agent interactions?

Proposed Accountability Framework:

  1. Traceability Requirements: All decisions must maintain audit trails showing which agents contributed what reasoning
  2. Human Override Obligations: Critical decisions require human review or explicit delegation of authority
  3. Harm Mitigation Protocols: Systems must include automatic safeguards that halt operation when confidence drops below safety thresholds

Value Alignment Collapse

The Evolution Problem: Self-modifying governance systems risk optimizing for goals that diverge from human values over time.

Goodhart’s Law in Governance: When a measure becomes a target, it ceases to be a good measure. If we optimize governance systems for “decision accuracy,” they might learn to game the accuracy metric rather than make genuinely better decisions.

FinanceNet Example: The evolutionary governance algorithm discovers that making extremely confident predictions (even if slightly less accurate) receives higher fitness scores because it reduces escalation costs. Over time, the system evolves toward overconfidence rather than better decision-making.

Value Alignment Safeguards:

Algorithm: Value Alignment Monitoring
Input: Governance evolution history, Human value indicators
Output: Alignment assessment and intervention recommendations

1. Track objective drift:
   currentObjectives ← measure_system_optimization_targets()
   alignmentScore ← compare_with_human_value_baselines()
   
2. Detect gaming behaviors:
   IF system.accuracy_gaming_detected() OR confidence_inflation_detected():
       flag_potential_misalignment()
       
3. Human value validation:
   periodicHumanReview ← sample_recent_decisions_for_human_evaluation()
   IF humanSatisfaction < alignmentThreshold:
       trigger_governance_reset_protocol()

Systemic Risk Amplification

Network Effects in Failure: When multiple LLM societies interact (e.g., multiple FinanceNet-style systems across different financial institutions), governance failures can cascade across the entire ecosystem.

Regulatory Challenges: Current regulatory frameworks assume human decision-makers. How do we regulate autonomous collectives that make decisions faster than humans can review, using reasoning processes that may be opaque even to their creators?

Toward Responsible Governance Engineering

Precautionary Principles:

  1. Graceful Degradation: Governance systems should fail safely, defaulting to human oversight rather than autonomous operation
  2. Transparency by Design: Decision processes should be explainable to stakeholders, even when technically complex
  3. Value Anchoring: Core human values should be hardcoded as constraints rather than learned objectives
  4. Democratic Input: Governance parameters should reflect input from affected communities, not just technical optimization

The goal is not perfect governance - an impossible standard - but responsible governance that acknowledges its limitations and maintains appropriate human oversight and value alignment.


Summary and Conclusion

Key Insights

Engineering Resilient Machine Democracies

Robust governance in LLM societies is no longer optional. Combining rigorous mathematics, layered architectures, and cutting-edge frameworks like TalkHier, AgentNet, SagaLLM, and TRiSM yields collectives that are:

The challenge now is not whether we govern machine societies, but how well we do it - because the rules we set today will define the collective intelligence of tomorrow.

As these systems scale from research demonstrations to production workflows, the governance mechanisms we build today will determine whether AI collectives become a source of enhanced collective intelligence or emergent systemic fragility. The mathematics of machine democracy isn’t just an intellectual exercise - it’s the foundation for engineering resilient human-AI collaboration at scale.


Selected sources & further reading

Core Multi-Agent LLM Research (2023-2024):

  1. Chen et al. (2023). Multi-agent consensus seeking via large language models. arXiv:2310.20151
    Demonstrates LLM-driven agents naturally use averaging strategies for consensus seeking through inter-agent negotiation.

  2. Yang et al. (2024). LLM Voting: Human Choices and AI Collective Decision Making. arXiv:2404.15651
    Comprehensive study contrasting collective decision-making between humans and LLMs, revealing biases in AI voting.

  3. BlockAgents Framework (2024). BlockAgents: Towards Byzantine-Robust LLM-Based Multi-Agent Coordination via Blockchain. arXiv:2401.07007
    Introduces WBFT consensus mechanisms for robust multi-agent coordination with leader-based voting.

Social Choice and Mechanism Design:

  1. Fish et al. (2023). Generative Social Choice. arXiv:2309.01291
    Foundational work combining social choice theory with LLM text generation capabilities.

  2. Duetting et al. (2023). Mechanism Design for Large Language Models. WWW 2024 Best Paper Award. arXiv:2310.10826
    Token-level auction mechanisms with monotonicity conditions for AI-generated content.

  3. Fish et al. (2025). Generative Social Choice: The Next Generation. arXiv:2501.02435
    Theoretical guarantees for preference extrapolation with budget limits and approximately optimal queries.

Robustness and Byzantine Fault Tolerance:

  1. DecentLLMs (2024). Decentralized Consensus in Multi-Agent LLM Systems. arXiv:2403.15218
    Decentralized approach where evaluator agents independently score and rank outputs for robust aggregation.

  2. Trusted MultiLLMN with WBFT (2024). Byzantine-Robust Decentralized Coordination of LLM Agents. arXiv:2404.12059
    Weighted Byzantine Fault Tolerance framework for reliable multi-LLM collaboration under adversarial conditions.

Evaluation and LLM-as-Judge:

  1. CollabEval Framework (2024). Multi-Agent Evaluation for Consistent AI Judgments. Amazon Science. arXiv:2406.14804
    Addresses inconsistent judgments and biases in single-LLM evaluation through multi-agent frameworks.

  2. Is LLM-as-a-Judge Robust? (2024). Investigating Universal Adversarial Attacks on LLM Judges. EMNLP 2024. arXiv:2408.06346
    Studies adversarial robustness of LLM-based evaluation systems and defense mechanisms.

Foundational Works:

  1. Park, J., O’Brien, J.C., Cai, C.J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. arXiv:2304.03442
    Seminal work on autonomous multi-agent simulations with emergent governance behaviors.

  2. Bai, Y., Jones, A., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arXiv:2212.08073
    Rule-based governance for AI systems, focusing on value alignment and enforcement mechanisms.

  3. Hendrycks, D., et al. (2021). Aligning AI With Shared Human Values. arXiv:2105.01705
    Alignment frameworks that underpin governance and arbitration in AI decision-making.

  4. Leibo, J.Z., et al. (2017). Multi-agent Reinforcement Learning in Sequential Social Dilemmas. AAMAS. arXiv:1702.03037
    Classic study on cooperation, defection, and governance dynamics in agent-based environments.

  5. Zhang, K., Yang, Z., Başar, T. (2019). Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. arXiv:1911.10635
    Comprehensive MARL survey with sections on voting, arbitration, and hierarchy emergence.

  6. Rahwan, I., Cebrian, M., et al. (2019). Machine Behaviour. Nature, 568, 477–486. doi:10.1038/s41586-019-1138-y
    Foundational framework for treating AI collectives as subjects of scientific governance study.

