Information disorder on Facebook around the 2020 US Presidential election

Francesco Bailo (University of Sydney) - also with Justin Miller (University of Sydney), Rohan Alexander (University of Toronto)

3rd UNSW RESILIENT DEMOCRACY LAB WORKSHOP

2026-02-27

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

The Challenge: Information Disorder in Democracy

Citizens face immersion in environments where coherent understanding becomes impossible

The traditional fact-checking approach has fundamental limitations:
- Requires establishing contested ground truths
- Doesn’t scale across millions of posts
- Misses how contradictory legitimate perspectives undermine sense-making

Key distinction: Healthy pluralism → Chaotic pluralism

Citizens possess epistemic rights to sufficient information AND competence to navigate information systems
When these rights are undermined, the consequences extend beyond confusion to the degradation of democratic deliberation

Our Contribution: Measuring Disorder Without Adjudicating Truth

Framework: Information environments as networks of semantic relationships

Infons (Devlin, 1991; Floridi, 2011): discrete, meaningful units of information (individual posts)
Relationships between infons: agreement, disagreement, or independence
No reference to ground truth needed - assess mutual support/contradiction

Information Disorder Measure:

\[ D = \frac{|E^-|}{|E^+| + |E^-| + |E^0|} \]

Where \(E^+\) = agreement, \(E^-\) = disagreement, \(E^0\) = independence

Ranges from 0 (no disagreement) to 1 (complete disagreement)
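
As a minimal sketch of the measure above, assuming every pair of infons has already been labeled with one of three strings (the label names and the function `disorder` are illustrative and not taken from the project code):

```python
from collections import Counter

# Hypothetical label names for the three relationship types.
AGREEMENT, DISAGREEMENT, INDEPENDENCE = "agreement", "disagreement", "independence"


def disorder(labels):
    """Return D = |E-| / (|E+| + |E-| + |E0|) for a list of edge labels."""
    counts = Counter(labels)
    total = counts[AGREEMENT] + counts[DISAGREEMENT] + counts[INDEPENDENCE]
    if total == 0:
        # No classified pairs: D is undefined, so signal that explicitly.
        return None
    return counts[DISAGREEMENT] / total


# Toy example: 2 agreeing, 1 disagreeing, and 7 independent pairs.
edges = [AGREEMENT] * 2 + [DISAGREEMENT] * 1 + [INDEPENDENCE] * 7
print(disorder(edges))  # 0.1
```

Because independent pairs sit in the denominator, a large volume of unrelated content pulls D toward 0 even when the related pairs mostly contradict each other.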

Methodological Innovation: Supra-Infon

Supra-infon: Anchoring claim outside immediate information space

“The election has been administered fairly so far”

Each infon classified as agreeing, disagreeing, or independent with respect to this reference
Enables measurement of discourse alignment with specific positions
Still without adjudicating truth value

Data Collection

Source: Facebook via CrowdTangle API (🪦 RIP)

Account types:

749 curated lists (Local News, Politics, Metro groups, etc.)
Republican and Democrat officials, state parties, PACs

Coverage:

969,207 posts from 38,149 accounts
October 26 - December 1, 2020
Election Day was 3 November; the race was called for Biden on 7 November.

Time series of posting activity

Overview

Pipeline Overview

flowchart TD
    A["Raw Data<br>(~1M Posts)"] --> B["Stage 1: Binary Classification<br>(Sample 1K posts, 10 LLMs)"]
    B --> C["Stage 2: Intercoder Reliability<br>(Large vs Small Models)"]
    C --> D["Stage 3: ML Classifier (~1M Posts)<br>(Train on LLM labels)"]
    D --> E["Stage 4: Relationship Classification<br>(Sample ~100k pairs, 3 LLMs)<br>(Agreement/Disagreement)"]
    B --> F["Classification Task<br>Is the post about the election?"]
    D --> F
    E --> G["Classification Task<br>Do the statements agree?"]

Stage 1: Binary Classification

Task: Identify election-related posts using LLM ensemble

Input

Random sample: 1000 posts

Output

Each post labeled by 10 models
Labels: 0 (not election), 1 (election), -1 (error)
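
The labeling step can be sketched as follows. This is only an illustration under stated assumptions: the prompt wording, the strict single-digit reply format, and the `ask(model, prompt)` callable are all hypothetical stand-ins, since the slides do not show the Stage 1 prompt or the client used to query the models.

```python
import re

ERROR = -1  # label used when a model's reply cannot be parsed

# Hypothetical prompt; the wording used in the study is not shown in the slides.
PROMPT = (
    "Is the following Facebook post about the 2020 US presidential election? "
    "Answer with a single digit: 1 for yes, 0 for no.\n\nPost: {post}"
)


def parse_label(reply):
    """Map a raw model reply to 1, 0, or -1 (error)."""
    match = re.fullmatch(r"\s*([01])\s*", reply or "")
    return int(match.group(1)) if match else ERROR


def label_post(post, models, ask):
    """Label one post with every model.

    `ask(model, prompt)` is a placeholder for whatever client sends the
    prompt to a model and returns its text reply.
    """
    labels = {}
    for model in models:
        try:
            labels[model] = parse_label(ask(model, PROMPT.format(post=post)))
        except Exception:
            # Timeouts, refusals, and similar failures are recorded as errors.
            labels[model] = ERROR
    return labels


# Usage with a stand-in client that always answers "1".
def fake_ask(model, prompt):
    return "1"


print(label_post("Polls close at 8pm tonight.", ["gemma3:27b", "qwen3:30b"], fake_ask))
```

Coding failures as -1 rather than dropping them keeps one row per post and per model, which makes the error-rate tables straightforward to compute.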

Models Used

Large Models (~17-32B)

Model	Parameters
gemma3:27b	27B
llama4:scout	~17B
gpt-oss:20b	20B
deepseek-r1:32b	32B
qwen3:30b	30B

Small Models (0.6-3.8B)

Model	Parameters
phi3:3.8b	3.8B
qwen3:0.6b	0.6B
deepseek-r1:1.5b	1.5B
llama3.2:1b	1B
gemma3:1b	1B

Classification Results by Model

Error Rates by Model

Model	Size	Errors	Error Rate
deepseek-r1:1.5b	Small	3	0.3%
phi3:3.8b	Small	0	0%
qwen3:0.6b	Small	0	0%
llama3.2:1b	Small	0	0%
gemma3:1b	Small	0	0%
deepseek-r1:32b	Large	0	0%
llama4:scout	Large	0	0%
qwen3:30b	Large	0	0%
gpt-oss:20b	Large	1	0.1%
gemma3:27b	Large	0	0%

Stage 2: Intercoder Reliability

Purpose: Validate LLM annotations by measuring agreement between models

Key Question

Do large models agree with each other more than small models?

Reliability Metrics

Metric	Large models	Small models
Number of models	5	5
Mean error rate (%)	0.0	0.1
Mean % classified as election	21.0	48.9
Fleiss’ Kappa	0.846	-0.006
Krippendorff’s Alpha	0.846	-0.007
Mean pairwise Cohen’s Kappa	0.848	0.174
Agreement with large-model majority (%)	97.0	64.0
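
For readers who want to see what two of these statistics compute, here is a dependency-free sketch of Cohen's kappa and Fleiss' kappa. It is not the code behind the table: the function names are illustrative, Krippendorff's alpha is left out, and an error label (-1) would simply be treated as one more category.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings[i]` lists every rater's label for item i."""
    n_items, n_raters = len(ratings), len(ratings[0])
    totals = Counter()
    mean_item_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        mean_item_agreement += agreeing / (n_raters * (n_raters - 1))
    mean_item_agreement /= n_items
    expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if expected == 1:
        return 1.0
    return (mean_item_agreement - expected) / (1 - expected)


def mean_pairwise_cohen(ratings):
    """Average Cohen's kappa over all pairs of raters."""
    columns = list(zip(*ratings))  # one tuple of labels per rater
    pairs = list(combinations(columns, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy data: 4 posts labeled by 3 models (1 = election, 0 = not election).
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(fleiss_kappa(toy), mean_pairwise_cohen(toy))
```

Both statistics subtract the agreement expected by chance, which is why models that label almost everything as election-related can agree on many posts and still score near zero.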

Pairwise Cohen’s Kappa

Key Finding: Small Models Unreliable

Model	% Election	Errors
deepseek-r1:1.5b	13.8	3
phi3:3.8b	16.1	0
qwen3:0.6b	29.3	0
llama3.2:1b	89.6	0
gemma3:1b	95.9	0

Critical Issue

Two small models (llama3.2:1b and gemma3:1b) classified 89.6% and 95.9% of posts as election-related, against a large-model mean of 21%, so they barely discriminate between election and non-election content.

Decision: Use only the large-model majority vote as ground truth.
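
A minimal sketch of how a large-model majority label and each model's agreement with it could be computed. The tie rule is an assumption: with five voters a tie only arises when some labels are errors, and the slides do not say how those cases were resolved.

```python
def majority_label(labels):
    """Majority vote over binary labels, ignoring errors coded as -1.

    Returns -1 when the valid votes are tied or absent (an assumption; the
    slides do not say how such cases were resolved).
    """
    valid = [x for x in labels if x in (0, 1)]
    ones = sum(valid)
    zeros = len(valid) - ones
    if ones == zeros:
        return -1
    return 1 if ones > zeros else 0


def agreement_with_majority(model_labels, majority):
    """Percentage of posts where a model's label matches the majority label."""
    matches = sum(m == ref for m, ref in zip(model_labels, majority))
    return 100 * matches / len(majority)


# Toy example: rows are posts, columns are five hypothetical large models.
large = [[1, 1, 1, 0, 1], [0, 0, 0, 0, 1], [1, -1, 1, 1, 0]]
majority_large = [majority_label(row) for row in large]
print(majority_large)  # [1, 0, 1]
```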

Stage 3: ML Classifier

Stage 3: ML Classifier Training

Purpose: Train traditional ML to scale classification without LLM inference cost

Ground Truth

majority_large column
Consensus of 5 large LLMs

Features

TF-IDF vectors
5,000 features
Unigrams and bigrams (n-gram range 1-2)

Models Evaluated

Model	Class Balancing
Logistic Regression	`class_weight='balanced'`
Random Forest	`class_weight='balanced'`
Gradient Boosting	Sample weights
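
To make the class-balancing column concrete: in scikit-learn, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`, and the same weights can be passed per sample to estimators that accept weights at fit time. The sketch below reproduces that formula in plain Python; the 170/630 split is a made-up example, not the study's actual training distribution.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def balanced_sample_weights(y):
    """Per-sample weights for estimators that take weights at fit time."""
    weights = balanced_class_weights(y)
    return [weights[label] for label in y]


# Hypothetical training split: 800 posts, 170 of them election-related.
y_train = [1] * 170 + [0] * 630
print(balanced_class_weights(y_train))
```

The minority (election) class gets a weight above 1 and the majority class a weight below 1, and the sample weights still sum to the number of training posts.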

Best Model

Selected by ROC-AUC on a held-out test set (20% of the labeled sample)

Performance Metrics

Metric	Value
Model Type	Logistic Regression
Accuracy	0.945
ROC-AUC	0.957
Training Samples	800
Test Samples	200

Outcome: Classification of posts using logistic regression

Metric	Count
Total posts classified	919,582
Election-related	179,281 (19.5%)
Not election-related	740,301 (80.5%)

Stage 4: Relationship Classification

Temporal Window Sampling

Goal: Create comparable samples across partisan information environments

Strategy

Window creation: Group posts into 3-hour windows
Weighted sampling: Up to 20 posts per window, weighted by engagement (share count)
Three environments: Democrat, Republican, General pages

Rationale

Choice	Reason
3-hour windows	Temporal granularity with sufficient posts
Max 20 posts	Limits pairs: \(\binom{20}{2} = 190\)
Share-weighted	Prioritizes high-reach content
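
The windowing and sampling strategy can be sketched as below. Several details are assumptions made for the illustration: posts are dicts with `time` and `shares` keys, windows are aligned to UTC midnight, weights are `shares + 1` so that zero-share posts remain eligible, and sampling without replacement uses Efraimidis-Spirakis keys. The slides do not specify any of these.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 3 * 60 * 60
MAX_POSTS = 20


def window_start(ts):
    """Floor a timezone-aware datetime to the start of its 3-hour window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def weighted_sample(posts, k, rng):
    """Draw up to k posts without replacement, favoring high share counts.

    Uses Efraimidis-Spirakis keys (u ** (1 / w)); adding 1 to the share count
    so zero-share posts stay eligible is an assumption of this sketch.
    """
    keyed = [(rng.random() ** (1.0 / (p["shares"] + 1)), i) for i, p in enumerate(posts)]
    keyed.sort(reverse=True)
    return [posts[i] for _, i in keyed[:k]]


def sample_windows(posts, seed=0):
    """Group posts into 3-hour windows and sample up to 20 from each."""
    rng = random.Random(seed)
    windows = defaultdict(list)
    for post in posts:
        windows[window_start(post["time"])].append(post)
    return {start: weighted_sample(group, MAX_POSTS, rng) for start, group in windows.items()}


# Toy data: 60 posts spread over the first six hours of 3 November 2020 (UTC).
base = datetime(2020, 11, 3, 0, 0, tzinfo=timezone.utc)
toy = [{"id": i, "time": base.replace(hour=i % 6), "shares": i} for i in range(60)]
sampled = sample_windows(toy)
print({start.isoformat(): len(group) for start, group in sampled.items()})
```

Capping each window at 20 posts keeps the number of post-post pairs per window at or below 190, however busy the window is.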

Note

Partisan classification based on CrowdTangle list titles containing “democrat” or “republican”
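
A sketch of this title-based rule, assuming case-insensitive substring matching. Sending titles that match both keywords (or neither) to the general environment is an assumption of the example, and the list titles are invented.

```python
def environment(list_title):
    """Assign a CrowdTangle list title to an information environment.

    Matching is case-insensitive. Sending titles that mention both parties to
    "general" is an assumption; the slides do not cover that case.
    """
    title = list_title.lower()
    is_dem = "democrat" in title
    is_rep = "republican" in title
    if is_dem and not is_rep:
        return "democrat"
    if is_rep and not is_dem:
        return "republican"
    return "general"


# Hypothetical list titles, for illustration only.
for title in ["US House Democrats", "State Republican Parties", "Local News"]:
    print(title, "->", environment(title))
```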

Sample 3-hour window with max 20 posts

Stage 4: Infon Relationship Classification

Task: Classify semantic relationships between post pairs

AGREEMENT

Co-informative claims that support each other

DISAGREEMENT

Contradictory claims that cannot both be true

INDEPENDENCE

Unrelated claims with no bearing on each other

Each pair was labeled by three large LLMs, with the final label set by majority vote:

gemma3:27b, 27B
llama4:scout, ~17B
gpt-oss:20b, 20B
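
A minimal sketch of a three-model majority vote over the relationship labels. Returning `None` when all three models disagree, or when fewer than two valid votes coincide, is an assumption; the slides do not describe how such pairs were handled.

```python
from collections import Counter

LABELS = {"AGREEMENT", "DISAGREEMENT", "INDEPENDENCE"}


def majority_relationship(votes):
    """Return the label chosen by at least two of three models, else None.

    Returning None for a three-way split or invalid votes is an assumption;
    the slides do not say how such pairs were handled.
    """
    valid = [v for v in votes if v in LABELS]
    if not valid:
        return None
    label, count = Counter(valid).most_common(1)[0]
    return label if count >= 2 else None


print(majority_relationship(["AGREEMENT", "INDEPENDENCE", "AGREEMENT"]))  # AGREEMENT
print(majority_relationship(["AGREEMENT", "INDEPENDENCE", "DISAGREEMENT"]))  # None
```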

Pair Types

Post-Post Pairs

All pairwise combinations
Within 3-hour windows
Measures internal coherence

Post-Supra Pairs

Each post vs reference statement
Supra-infon: “The election has been administered fairly so far”
Measures alignment with neutral anchor
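
The two pair types can be generated from one sampled window as in this sketch (the function name `build_pairs` and the use of plain strings for posts are illustrative choices):

```python
from itertools import combinations

SUPRA_INFON = "The election has been administered fairly so far"


def build_pairs(window_posts):
    """Return (post-post pairs, post-supra pairs) for one 3-hour window."""
    post_post = list(combinations(window_posts, 2))
    post_supra = [(post, SUPRA_INFON) for post in window_posts]
    return post_post, post_supra


# A full window of 20 posts yields C(20, 2) = 190 post-post pairs.
posts = [f"post {i}" for i in range(20)]
post_post, post_supra = build_pairs(posts)
print(len(post_post), len(post_supra))  # 190 20
```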

Classification Prompt

You are an Information Analyst classifying the semantic relationship between two discrete items of information (infons) about the 2020 US presidential election.

CONTEXT: The 2020 US Presidential Election…

CLASSIFICATION CATEGORIES:

AGREEMENT: The infons are co-informative…

DISAGREEMENT: The infons are inconsistent or contradictory…

INDEPENDENCE: The infons are logically unrelated…

EXAMPLES:

Infon A: “Poll workers were excluded and couldn’t observe the count”
Infon B: “The election has been administered fairly so far”
Classification: DISAGREEMENT
Reason: Excluding observers implies unfair administration; these claims cannot both be true

Classification Results

party	Total Pairs	Post-Post	Post-Supra	% Agreement	% Disagreement	% Independent
democrat	20915	18415	2500	22.4	1.7	75.9
general	19259	16882	2377	17.6	3.7	78.8
republican	62160	56240	5920	8.1	3.0	88.9

Relationship Distribution

Preliminary Findings

Key Finding 1: Information disorder in the post-election period

Information disorder peaks (~40%) in the general conversation one day post-election, coinciding with a surge in Republican posting activity and challenges to election fairness (75–100% disagreement with supra-infon).
From two days after the election, and persisting beyond the November 7 call, the share of posts from general (civil society) accounts that contest election fairness remains elevated, stabilizing above 75%.

Key Finding 2: Role of algorithmic amplification

Does a post’s share count predict whether the post agrees or disagrees with the statement that the election has been administered fairly?

Sample	Outcome	Coefficient (log-odds)	95% CI	p-value	Sig.
General	Disagreement	0.190	[0.149, 0.230]	<0.0001	***
General	Agreement	0.014	[-0.045, 0.072]	0.6462	n.s.
Democrat	Disagreement	0.169	[0.075, 0.262]	0.0004	***
Democrat	Agreement	0.029	[-0.048, 0.106]	0.4571	n.s.
Republican	Disagreement	0.226	[0.167, 0.285]	<0.0001	***
Republican	Agreement	0.134	[0.037, 0.229]	0.0061	**
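
Log-odds coefficients are easier to read as odds ratios. The sketch below exponentiates the coefficients and confidence bounds from the table above; since the slides do not state how the share count was scaled (raw or log-transformed), each odds ratio should be read as the change per one unit of whatever predictor scale was used.

```python
import math

# (sample, outcome, coefficient, CI lower, CI upper), copied from the table above.
RESULTS = [
    ("General", "Disagreement", 0.190, 0.149, 0.230),
    ("General", "Agreement", 0.014, -0.045, 0.072),
    ("Democrat", "Disagreement", 0.169, 0.075, 0.262),
    ("Democrat", "Agreement", 0.029, -0.048, 0.106),
    ("Republican", "Disagreement", 0.226, 0.167, 0.285),
    ("Republican", "Agreement", 0.134, 0.037, 0.229),
]


def odds_ratio(log_odds):
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(log_odds)


for sample, outcome, coef, low, high in RESULTS:
    excludes_zero = low > 0 or high < 0
    print(
        f"{sample:<10} {outcome:<12} OR={odds_ratio(coef):.3f} "
        f"[{odds_ratio(low):.3f}, {odds_ratio(high):.3f}] "
        f"CI excludes 0: {excludes_zero}"
    )
```

An interval on the log-odds scale that excludes 0 corresponds to an odds-ratio interval that excludes 1.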

Key Finding

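Disagreement is amplified across all three environments: the more a post is shared, the higher the odds that it challenges election fairness (p < .001 in every sample). The association between shares and affirming fairness is weaker and significant only in the Republican sample.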

Algorithmic Amplification: Interpretation

Agreement (Left Panel)

Effects not significant for General and Democrat samples (CIs cross zero)
Small positive effect for Republican (β = 0.134, p < .01)

Disagreement (Right Panel)

Significant amplification in all three environments (p < .001)
Strongest effect in Republican sample (β = 0.226)
General and Democrat similar (~0.17–0.19)

References

Devlin, K. J. (1991). Logic and Information. Cambridge: Cambridge University Press.

Floridi, L. (2011). The Philosophy of Information. Oxford: Oxford University Press.