Mapping Violence Perceptions through YouTube Comments

A New Approach to Real-Time Violence Monitoring

Francesco Bailo
francesco.bailo@sydney.edu.au

3rd Social Conflict and Political Economy (SCoPE) Workshop
University of Sydney

April 1, 2026

Based on a forthcoming publication in EPJ Data Science

with

Ashani Amarasinghe\(^1\), Sascha Nanlohy\(^2\), Thomas Morgan\(^2\), David Hammond\(^2\), Yashdeep Dahiya\(^{1,3}\) and Francesco Bailo\(^1\)

DOI: 10.1140/epjds/s13688-026-00649-y

\(^1\)University of Sydney | \(^2\)Institute for Economics and Peace | \(^3\)Monash University

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Introduction

The Challenge of Measuring Violence

Traditional violence datasets face critical limitations:

Event-based datasets (ACLED, UCDP): Document actual violence through fatality counts
Cannot capture perceptions, fear, rumors, and community discourse
Geographic bias: Urban centers receive extensive coverage
Marginalized areas remain systematically underreported
Yet perceptions matter: Fear shapes behavior, economic activity, and social stability

What We Can’t Measure, We Can’t Manage

What Traditional Datasets Capture:

Documented fatalities
Verified violent events
Urban violence
Media-reported incidents

What They Miss:

Perceived threats
Rumors and speculation
Remote area violence
Community fear
Unverified claims

The Gap

Violence in marginalized and remote areas remains invisible to traditional monitoring systems

Research Question

Can we systematically measure violence perceptions at scale using social media discourse?

Develop a Violence Perception Index (VPI) from geolocated YouTube comments
Validate against established violence indicators
Test whether VPI captures dynamics in underrepresented areas

The Violence Perception Index (VPI)

What Does the VPI Measure?

VPI quantifies intensity of violence-related discourse in geolocated comments:

Direct experience: Eyewitness accounts
Perceived threat: Fear and concern
News circulation: Discussion of reported violence
Rumors: Unverified claims
Historical reference: Past violence patterns

Key Insight

All types matter for understanding community behavior—whether threats are immediate or diffuse, local or national

Why YouTube Comments?

Platform Advantages:

Geolocated content
Vernacular discourse
“Third spaces” for organic discussion
Complementary to news data

Mexico Context:

78% social media usage (2023)
79% Internet as primary news source
Robust digital engagement
High violence levels

Why Mexico?

Geographic and temporal distribution of homicides in Mexico

Methodology

Overview of Data Collection

Overview of data collection pipeline and methods

Data Collection Pipeline

Data Scale and Coverage

Distribution of videos and comments over time

Building the Violence Dictionary

Semantic network expansion from seed words:

10 seed terms: violencia, asesinato, homicidio, tiroteo, ataque…
WordNet expansion: 2 levels of semantic relations
118 total terms with distance-based weights:
- Seed words: weight = 1.0
- Distance 1: weight = 0.5
- Distance 2: weight = 0.25
Weighted scoring: Sum of term frequencies × weights
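
Illustration (a minimal sketch, not the authors' code): the weighted scoring rule can be written in a few lines of Python. The seed terms and the three weights come from the slide above; the distance-1 and distance-2 terms in `WEIGHTS` and the helper name `score_comment` are assumptions made only for this example.

```python
from collections import Counter

# Term -> weight by semantic distance from a seed word.
# Seed terms are taken from the slide; the expanded terms are
# hypothetical placeholders, not the actual 118-term dictionary.
WEIGHTS = {
    "violencia": 1.0,   # seed
    "asesinato": 1.0,   # seed
    "homicidio": 1.0,   # seed
    "matanza": 0.5,     # assumed distance-1 term
    "crimen": 0.25,     # assumed distance-2 term
}


def score_comment(lemmas):
    """Sum of (term frequency x weight) over dictionary terms."""
    counts = Counter(lemmas)
    return sum(counts[term] * weight for term, weight in WEIGHTS.items())


# A lemmatized toy comment: "asesinato" twice, "crimen" once.
example = ["el", "asesinato", "y", "el", "crimen", "asesinato"]
print(score_comment(example))  # 2 * 1.0 + 1 * 0.25 = 2.25
```

Terms outside the dictionary contribute nothing, so a comment with no violence-related lemmas scores 0.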

Scalability

WordNet resources exist for dozens of languages → cross-linguistic application

From Comments to Geographic Index

Multi-stage transformation:

Text processing: Lemmatization with SpaCy Spanish model
Scoring: Weighted term frequency for each comment
Geographic aggregation: Inverse Distance Weighting (IDW)
- Comments influence nearby grid cells
- Weight ∝ 1/distance²
Temporal aggregation: Monthly averages at ~50km resolution (PRIO-GRID)
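
Illustration (a sketch under stated assumptions): inverse distance weighting for a single grid cell. The 1/distance² weight comes from the slide. The planar kilometre coordinates, the minimum-distance floor, the normalisation into a weighted mean, and the name `idw_cell_value` are assumptions for this example; the published pipeline may differ, for instance by using geodesic distances or a search radius.

```python
import math


def idw_cell_value(cell_xy, comments, power=2.0, min_dist_km=1.0):
    """Inverse-distance-weighted mean of comment scores for one grid cell.

    cell_xy  -- (x_km, y_km) of the cell centroid (assumed planar coordinates)
    comments -- iterable of (x_km, y_km, score) tuples
    power    -- distance exponent; 2.0 gives weight proportional to 1/distance^2
    min_dist_km -- assumed floor that avoids division by zero at the centroid
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, score in comments:
        dist = max(math.hypot(x - cell_xy[0], y - cell_xy[1]), min_dist_km)
        weight = 1.0 / dist ** power
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two toy comments, 10 km and 20 km from the cell centroid.
comments = [(10.0, 0.0, 4.0), (20.0, 0.0, 0.0)]
value = idw_cell_value((0.0, 0.0), comments)
# Weights are 1/100 and 1/400, so the nearer comment counts four times as much.
```

Monthly values would then be obtained by averaging the cell values of all comments posted in the same month.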

Validation

Validating the Dictionary Approach

Comparison with 4 Large Language Models (700 stratified comments):

Agreement Metrics:

Cohen’s κ: 0.52-0.62
Raw agreement: 75-81%
Fleiss’ κ: 0.80 (p < 0.0001)

Correlation (continuous):

Spearman ρ: 0.61-0.68
ICC: 0.61-0.68

Important

Dictionary approach shows moderate-to-substantial agreement with semantic LLM analysis while maintaining computational efficiency
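
Illustration (a sketch with made-up labels): Cohen's κ corrects raw agreement for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e). The binary violent / non-violent labels below are invented for the example and are not drawn from the 700 validation comments; the function name `cohens_kappa` is likewise an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 1 = violence-related, 0 = not (illustrative only).
dictionary_labels = [1, 1, 0, 0, 1, 0, 1, 0]
llm_labels = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(dictionary_labels, llm_labels)
# Raw agreement is 6/8 = 0.75, chance agreement is 0.5, so kappa = 0.5.
```

The example shows why κ sits below raw agreement: 75% raw agreement corresponds to κ = 0.5 once chance agreement is removed.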

Benchmark Datasets

We validate VPI against three established indicators:

ACLED fatalities: News-based conflict event data
UCDP fatalities: Armed conflict data
Official homicide statistics: Mexican government records

Plus contextual data:

Population density (WorldPop 2020)
Marginalization index (Census 2020)

Geographic Distribution Comparison

VPI, ACLED, and Homicides geographic distributions

Results

Strong Correlation with Realized Violence

Panel regression with comprehensive fixed effects:

Predictor	Coefficient	R²
ACLED Fatalities	0.0257***	0.974
Homicides	0.0142***	0.974

Grid fixed effects: Time-invariant characteristics
Year fixed effects: Temporal shocks
Month fixed effects: Seasonal patterns
37-67% increase in VPI per 1-SD increase in violence
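
Illustration (a sketch of the general form, not the paper's exact specification): a panel model with the three sets of fixed effects listed above can be written as follows. The symbols are assumptions introduced here: \(g\) indexes grid cells, \(t\) indexes months, \(y(t)\) and \(m(t)\) are the year and calendar month of \(t\), and any transformation of the outcome or predictor is not stated on the slide.

```latex
\[
\mathrm{VPI}_{g,t} = \beta \,\mathrm{Violence}_{g,t}
  + \alpha_g + \lambda_{y(t)} + \mu_{m(t)} + \varepsilon_{g,t}
\]
```

Here \(\alpha_g\) absorbs time-invariant grid characteristics, \(\lambda_{y(t)}\) absorbs year-level shocks, \(\mu_{m(t)}\) absorbs seasonal patterns, and \(\beta\) is the coefficient reported in the table.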

The Key Finding: Geographic Heterogeneity

Split sample analysis (High vs. Low population grids):

ACLED (News-based):

High pop: Significant (0.0004***)
Low pop: Not significant

Official Homicides:

High pop: Not significant
Low pop: Significant (0.0004**)

Critical Insight

VPI correlates with ACLED in urban areas BUT with official records in marginalized areas where news coverage fails

Why This Matters

Residual analysis by marginalization level

Spatial Dynamics

Spillover analysis at different distances:

Own-grid violence remains significant across specifications
Low-population grids: Regional spillovers dominate
- Communities form functional economic regions
- Violence in neighbors affects travel, markets, security
High-population grids: Own-grid effects dominate
- Hyperlocal discourse saturates information environment

Variance Decomposition

Where does VPI variation come from?

Component	% of Total Variance
Between-grid (spatial)	97.6%
Within-grid (temporal)	2.7%

Interpretation

VPI primarily captures localized/regional dynamics rather than uniform national discourse about historical events
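
Illustration (a sketch, assuming a sum-of-squares decomposition): the between-grid and within-grid shares can be computed by splitting total variation into variation of grid means around the grand mean and variation of monthly values around their own grid mean. The toy panel and the name `variance_shares` are assumptions for this example; the paper's exact procedure may differ.

```python
from statistics import fmean


def variance_shares(panel):
    """Return (between_share, within_share) of total sum of squares.

    panel -- dict mapping grid id -> list of monthly values
    """
    all_values = [v for values in panel.values() for v in values]
    grand_mean = fmean(all_values)
    # Between: grid means around the grand mean, weighted by observations.
    between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in panel.values()
    )
    # Within: each observation around its own grid mean.
    within = sum(
        (v - fmean(values)) ** 2
        for values in panel.values()
        for v in values
    )
    total = between + within
    return between / total, within / total


# Toy panel: two grids with very different levels and little change over time.
toy_panel = {"grid_a": [1.0, 3.0], "grid_b": [9.0, 11.0]}
between_share, within_share = variance_shares(toy_panel)
# between = 64, within = 4, so the shares are 64/68 and 4/68.
```

A high between-grid share, as in the toy panel, means that differences across locations dominate changes over time within a location.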

Discussion

VPI as Complementary Intelligence

Traditional Datasets:

Document verified events
Strong in urban areas
Retrospective
Complete event details

Violence Perception Index:

Captures discourse & fear
Strong in marginalized areas
Near real-time
Community perspective

Use Case

Early warning and monitoring in precisely those underrepresented areas where traditional systems provide incomplete coverage

Methodological Contributions

Feasibility: Large-scale systematic measurement of violence perception
Scalability: Dictionary-based approach works across languages
Granularity: ~50km spatial, monthly temporal resolution
Validation: Moderate-to-substantial agreement with LLMs and realized violence
Innovation: Captures dynamics in areas with systematic reporting bias

Limitations and Future Directions

Current Limitations:

Platform bias (digital literacy, age, urban)
Aggregates all discourse types
Cannot distinguish local from national discourse
Vulnerable to manipulation
3.45% geolocation rate

Future Extensions:

LLM-based discourse classification
Multi-platform integration
Real-time monitoring systems
Cross-linguistic implementation
Rumor vs. fact decomposition

Temporal Alignment with Events

Conclusion

Key Takeaways

Violence perception is measurable at scale through social media discourse
VPI correlates strongly with established violence indicators (panel-model R² = 0.97, including grid, year, and month fixed effects)
Geographic heterogeneity is a feature: VPI captures discourse where traditional datasets fail
Marginalized areas gain visibility through community-generated content
Immediately scalable across languages and geographies

Policy Implications

When communities fear violence, that fear shapes behavior—regardless of whether threats are documented

Early warning systems can integrate perception data
Monitoring gaps in remote/marginalized areas can be filled
Community perceptions provide actionable intelligence
Near real-time detection of emerging hotspots

Final Thoughts

What we’ve demonstrated:

Systematic measurement works
Dictionary approach is valid
Complements existing data
Addresses known biases

What comes next:

Multi-platform integration
Real-time deployment
Cross-national application
Discourse type classification

The Promise

Capturing violence dynamics in precisely those underrepresented areas where traditional monitoring systems provide incomplete coverage

Thank You

Questions?

Paper & Data:

Paper (EPJ Data Science, open access): 10.1140/epjds/s13688-026-00649-y
VPI Dataset for Mexico (OSF): 10.17605/OSF.IO/FA493
Replication materials (Harvard Dataverse): 10.7910/DVN/C6TJ9K

Contact:

Francesco Bailo

University of Sydney

francesco.bailo@sydney.edu.au