Mapping Violence Perceptions through YouTube Comments

A New Approach to Real-Time Violence Monitoring

3rd Social Conflict and Political Economy (SCoPE) Workshop
University of Sydney

April 1, 2026

Based on a forthcoming publication in EPJ Data Science

with

Ashani Amarasinghe\(^1\), Sascha Nanlohy\(^2\), Thomas Morgan\(^2\), David Hammond\(^2\), Yashdeep Dahiya\(^{1,3}\) and Francesco Bailo\(^1\)

DOI: 10.1140/epjds/s13688-026-00649-y

\(^1\)University of Sydney | \(^2\)Institute for Economics and Peace | \(^3\)Monash University

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Introduction

The Challenge of Measuring Violence

Traditional violence datasets face critical limitations:

  • Event-based datasets (ACLED, UCDP): Document actual violence through fatality counts
  • Cannot capture perceptions, fear, rumors, and community discourse
  • Geographic bias: Urban centers receive extensive coverage
  • Marginalized areas remain systematically underreported
  • Yet perceptions matter: Fear shapes behavior, economic activity, and social stability

What We Can’t Measure, We Can’t Manage

What Traditional Datasets Capture:

  • Documented fatalities
  • Verified violent events
  • Urban violence
  • Media-reported incidents

What They Miss:

  • Perceived threats
  • Rumors and speculation
  • Remote area violence
  • Community fear
  • Unverified claims

The Gap

Violence in marginalized and remote areas remains invisible to traditional monitoring systems

Research Question

Can we systematically measure violence perceptions at scale using social media discourse?

  • Develop a Violence Perception Index (VPI) from geolocated YouTube comments
  • Validate against established violence indicators
  • Test whether VPI captures dynamics in underrepresented areas

The Violence Perception Index (VPI)

What Does the VPI Measure?

VPI quantifies intensity of violence-related discourse in geolocated comments:

  1. Direct experience: Eyewitness accounts
  2. Perceived threat: Fear and concern
  3. News circulation: Discussion of reported violence
  4. Rumors: Unverified claims
  5. Historical reference: Past violence patterns

Key Insight

All types matter for understanding community behavior—whether threats are immediate or diffuse, local or national

Why YouTube Comments?

Platform Advantages:

  • Geolocated content
  • Vernacular discourse
  • “Third spaces” for organic discussion
  • Complementary to news data

Mexico Context:

  • 78% social media usage (2023)
  • 79% Internet as primary news source
  • Robust digital engagement
  • High violence levels

Why Mexico?

Geographic and temporal distribution of homicides in Mexico

Methodology

Overview of Data Collection

Overview of data collection pipeline and methods

Data Collection Pipeline

Data collection and processing workflow

Data Scale and Coverage

Distribution of videos and comments over time

Building the Violence Dictionary

Semantic network expansion from seed words:

  1. 10 seed terms: violencia, asesinato, homicidio, tiroteo, ataque…
  2. WordNet expansion: 2 levels of semantic relations
  3. 118 total terms with distance-based weights:
    • Seed words: weight = 1.0
    • Distance 1: weight = 0.5
    • Distance 2: weight = 0.25
  4. Weighted scoring: Sum of term frequencies × weights
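
The expansion and scoring steps above can be sketched as follows. This is a minimal illustration, not the study's code: `TOY_RELATIONS` is a hypothetical hand-made graph standing in for WordNet relations, and the non-seed terms in it are invented for the example. Only the seed words and the 1.0 / 0.5 / 0.25 weights come from the slide.

```python
from collections import deque

# Hypothetical toy relation graph standing in for WordNet links (assumption).
TOY_RELATIONS = {
    "violencia": ["agresion"],
    "agresion": ["golpe"],
    "golpe": ["herida"],
    "asesinato": ["crimen"],
}

# Distance-based weights as listed on the slide.
WEIGHT_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def expand_dictionary(seeds, relations, max_distance=2):
    """Breadth-first expansion; each term keeps its shortest distance to a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        term = queue.popleft()
        if distance[term] == max_distance:
            continue  # do not expand beyond the second level
        for neighbour in relations.get(term, []):
            if neighbour not in distance:
                distance[neighbour] = distance[term] + 1
                queue.append(neighbour)
    return {term: WEIGHT_BY_DISTANCE[d] for term, d in distance.items()}


def score_comment(lemmas, weights):
    """Sum the dictionary weight of every lemma occurrence (frequency x weight)."""
    return sum(weights.get(lemma, 0.0) for lemma in lemmas)


weights = expand_dictionary(["violencia", "asesinato"], TOY_RELATIONS)
example_score = score_comment(
    ["hubo", "violencia", "y", "un", "golpe", "golpe"], weights
)
```

In this toy graph "herida" sits three steps from a seed, so it is left out of the dictionary. A term repeated in a comment contributes its weight once per occurrence.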

Scalability

WordNet resources exist for dozens of languages → cross-linguistic application

From Comments to Geographic Index

Multi-stage transformation:

  1. Text processing: Lemmatization with SpaCy Spanish model
  2. Scoring: Weighted term frequency for each comment
  3. Geographic aggregation: Inverse Distance Weighting (IDW)
    • Comments influence nearby grid cells
    • Weight ∝ 1/distance²
  4. Temporal aggregation: Monthly averages at ~50km resolution (PRIO-GRID)
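
A minimal sketch of the IDW step for a single grid cell is given below. Several details are assumptions made for illustration and are not stated on the slide: the great-circle distance, the search radius `radius_km`, the floor `min_km` that avoids division by zero, and the choice to return a weighted mean rather than a weighted sum.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def idw_cell_value(cell_lat, cell_lon, comments, radius_km=100.0, min_km=1.0):
    """Inverse-distance-weighted mean of comment scores around one grid cell.

    comments: iterable of (lat, lon, score) tuples.
    radius_km and min_km are illustrative assumptions, not study parameters.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lat, lon, score in comments:
        d = haversine_km(cell_lat, cell_lon, lat, lon)
        if d > radius_km:
            continue  # comment is too far away to influence this cell
        w = 1.0 / max(d, min_km) ** 2  # weight proportional to 1/distance^2
        weighted_sum += w * score
        weight_total += w
    return weighted_sum / weight_total if weight_total else None
```

Because weights fall with the square of distance, a comment 10 km from the cell centroid counts roughly 25 times as much as one 50 km away. Monthly averaging would then be applied to these cell values.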

Validation

Validating the Dictionary Approach

Comparison with 4 Large Language Models (700 stratified comments):

Agreement Metrics:

  • Cohen’s κ: 0.52-0.62
  • Raw agreement: 75-81%
  • Fleiss’ Kappa: 0.80 (p<0.0001)

Correlation (continuous):

  • Spearman ρ: 0.61-0.68
  • ICC: 0.61-0.68
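
For reference, Cohen's κ compares observed agreement with the agreement expected by chance. The sketch below computes it for two raters. The binary flags are made-up toy data, and binarising dictionary scores into violent / non-violent labels is an assumption for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 1 = violence-related, 0 = not.
dictionary_flags = [1, 1, 0, 0, 1, 0, 1, 0]
llm_flags = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(dictionary_flags, llm_flags)
```

In the toy data raw agreement is 75% and chance agreement is 50%, which gives κ = 0.5. This shows why raw agreement and κ are reported separately.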

Important

The dictionary approach shows moderate-to-substantial agreement with semantic LLM analysis while remaining computationally efficient

Benchmark Datasets

We validate VPI against three established indicators:

  1. ACLED fatalities: News-based conflict event data
  2. UCDP fatalities: Armed conflict data
  3. Official homicide statistics: Mexican government records

Plus contextual data:

  • Population density (WorldPop 2020)
  • Marginalization index (Census 2020)

Geographic Distribution Comparison

VPI, ACLED, and Homicides geographic distributions

Results

Strong Correlation with Realized Violence

Panel regression with comprehensive fixed effects:

Predictor           Coefficient   R²
ACLED Fatalities    0.0257***     0.974
Homicides           0.0142***     0.974

  • Grid fixed effects: Time-invariant characteristics
  • Year fixed effects: Temporal shocks
  • Month fixed effects: Seasonal patterns
  • 37-67% increase in VPI per 1-SD increase in violence
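
The role of grid fixed effects can be illustrated with a within estimator: subtract each grid cell's mean from both variables, then run OLS on what remains. The sketch below is a simplification under stated assumptions. It uses one regressor, includes grid fixed effects only (year and month effects are omitted), and runs on made-up toy numbers.

```python
from collections import defaultdict


def demean_by_group(values, groups):
    """Subtract each group's mean from its members."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def within_slope(y, x, groups):
    """OLS slope of y on x after removing group (grid-cell) means from both."""
    y_dm = demean_by_group(y, groups)
    x_dm = demean_by_group(x, groups)
    sxx = sum(v * v for v in x_dm)
    if sxx == 0:
        raise ValueError("x has no within-group variation")
    return sum(a * b for a, b in zip(x_dm, y_dm)) / sxx


# Toy panel (hypothetical): two grid cells with different baseline levels
# and the same within-cell slope of 0.5.
grid = ["A", "A", "A", "B", "B", "B"]
violence = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
vpi = ([5.0 + 0.5 * v for v in violence[:3]]
       + [-3.0 + 0.5 * v for v in violence[3:]])
slope = within_slope(vpi, violence, grid)
```

Demeaning removes anything constant within a cell, so the slope is identified only from month-to-month changes inside each grid cell.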

The Key Finding: Geographic Heterogeneity

Split sample analysis (High vs. Low population grids):

ACLED (News-based):

  • High pop: Significant (0.0004***)
  • Low pop: Not significant

Official Homicides:

  • High pop: Not significant
  • Low pop: Significant (0.0004**)

Critical Insight

VPI correlates with ACLED in high-population (urban) grids, but with official homicide records in low-population, marginalized grids where news coverage falls short

Why This Matters

Residual analysis by marginalization level

Spatial distribution of residuals

Spatial Dynamics

Spillover analysis at different distances:

  • Own-grid violence remains significant across specifications
  • Low-population grids: Regional spillovers dominate
    • Communities form functional economic regions
    • Violence in neighbors affects travel, markets, security
  • High-population grids: Own-grid effects dominate
    • Hyperlocal discourse saturates information environment

Variance Decomposition

Where does VPI variation come from?

Component                 % of Total Variance
Between-grid (spatial)    97.6%
Within-grid (temporal)    2.7%
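
A between/within split of this kind can be computed from sums of squares, as sketched below. This is one common decomposition and is assumed here for illustration. The slide does not say which estimator was used, and the numbers are toy data.

```python
from collections import defaultdict


def variance_shares(values, groups):
    """Return (between_share, within_share) of the total sum of squares."""
    n = len(values)
    grand_mean = sum(values) / n
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    between = 0.0
    within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        # Spread of group means around the grand mean (spatial component).
        between += len(members) * (group_mean - grand_mean) ** 2
        # Spread of observations around their own group mean (temporal component).
        within += sum((v - group_mean) ** 2 for v in members)
    total = between + within
    if total == 0:
        raise ValueError("values have no variation")
    return between / total, within / total


# Toy data (hypothetical): two grid cells observed in two months each.
between_share, within_share = variance_shares(
    [0.0, 2.0, 10.0, 12.0], ["A", "A", "B", "B"]
)
```

In this sketch the two shares sum to one by construction. The toy cells differ far more from each other than they vary over time, so the between-grid share dominates.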

Interpretation

VPI primarily captures localized/regional dynamics rather than uniform national discourse about historical events

Discussion

VPI as Complementary Intelligence

Traditional Datasets:

  • Document verified events
  • Strong in urban areas
  • Retrospective
  • Complete event details

Violence Perception Index:

  • Captures discourse & fear
  • Strong in marginalized areas
  • Near real-time
  • Community perspective

Use Case

Early warning and monitoring in precisely those underrepresented areas where traditional systems provide incomplete coverage

Methodological Contributions

  1. Feasibility: Large-scale systematic measurement of violence perception
  2. Scalability: Dictionary-based approach can be extended to other languages with WordNet resources
  3. Granularity: ~50km spatial, monthly temporal resolution
  4. Validation: Moderate-substantial agreement with LLMs and realized violence
  5. Innovation: Captures dynamics in areas with systematic reporting bias

Limitations and Future Directions

Current Limitations:

  • Platform bias (digital literacy, age, urban)
  • Aggregates all discourse types
  • Can’t distinguish local vs. national
  • Vulnerable to manipulation
  • 3.45% geolocation rate

Future Extensions:

  • LLM-based discourse classification
  • Multi-platform integration
  • Real-time monitoring systems
  • Cross-linguistic implementation
  • Rumor vs. fact decomposition

Temporal Alignment with Events

VPI and ACLED temporal comparison

Conclusion

Key Takeaways

  1. Violence perception is measurable at scale through social media discourse
  2. VPI tracks established violence indicators closely (model R² = 0.97, including fixed effects)
  3. Geographic heterogeneity is a feature: VPI captures discourse where traditional datasets fail
  4. Marginalized areas gain visibility through community-generated content
  5. Immediately scalable across languages and geographies

Policy Implications

When communities fear violence, that fear shapes behavior—regardless of whether threats are documented

  • Early warning systems can integrate perception data
  • Monitoring gaps in remote/marginalized areas can be filled
  • Community perceptions provide actionable intelligence
  • Near real-time detection of emerging hotspots

Final Thoughts

What we’ve demonstrated:

  • Systematic measurement works
  • Dictionary approach is valid
  • Complements existing data
  • Addresses known biases

What comes next:

  • Multi-platform integration
  • Real-time deployment
  • Cross-national application
  • Discourse type classification

The Promise

Capturing violence dynamics in precisely those underrepresented areas where traditional monitoring systems provide incomplete coverage

Thank You

Questions?

Paper & Data:

Contact:

Francesco Bailo

University of Sydney

francesco.bailo@sydney.edu.au