class: center, middle, inverse, title-slide .title[ # Quantifying Perceptions of Violence on YouTube for Near Real-Time Sensing of Vulnerable Communities and Mitigating Misinformation ] .author[ ### Francesco Bailo (University of Sydney) ] .institute[ ### CS2Italy - University of Trento ] .date[ ### 16 January 2025 ] --- layout: true <div style="position: absolute;left:60px;bottom:11px;color:gray;"><small><small><small><a href = 'https://fraba.github.io/presentation/2025-CS2ITALY/youtube'>fraba.github.io/presentation/2025-CS2ITALY/youtube <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"></path></svg></a></small></small></small></div> --- ## Access slides here <svg viewBox="0 0 512 512" style="height:1em;display:inline-block;position:fixed;top:10;right:10;" xmlns="http://www.w3.org/2000/svg"> <path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"></path></svg> </br></br></br></br></br></br> 
.center[.large[[fraba.github.io/presentation/2025-CS2ITALY/youtube](https://fraba.github.io/presentation/2025-CS2ITALY/youtube)]] </br></br></br></br> <p style = "font-size: 80px"> ↙</p>

---
class: segue-red

# Research goal and approach

---

## Research goal

.content-box-yellow[

### a. Developing a highly granular, worldwide set of measures of perceived violence with precise spatial and temporal resolution.

1. Measures will differentiate the **vector** of perceived violence and the **actors** involved.
2. Measures will anchor perceived violence to public **events**.

### b. Fine-tuning the measures against alternative measures to account for the systematic biases of each approach.

### c. Understanding how to integrate alternative measures to improve real-time sensing and forecasting.

]

---

## Research approach

1. Set a population grid for the area of interest (e.g., a country);

--

2. Set a population density threshold to exclude from data collection areas that are too sparsely populated;

--

3. Identify YouTube videos a. associated with each cell of the grid within the geographic area of interest, using the YouTube API; and b. published within the timeframe of interest.

--

4. Collect all comments posted to the YouTube videos identified in 3.

--

5. Retrieve information from the texts of the comments to estimate perceived violence. 🚧

--

6. Aggregate this information to compute a set of measures. 🚧

---
class: segue-red

# YouTube: Data source justification

---

## Why use YouTube

.content-box-yellow[

### Data collection is practical

1. API queries for videos with geographic parameters (`lon+lat` + `radius`) are available.
2. A research program to increase API quotas is also available.

]

.content-box-purple[

### Data is useful

3. YouTube is one of the most widely used **social media applications** in the world. Users not only watch videos, they also comment on them.
4. YouTube is probably the social media application with the **highest average penetration** in the world. This makes it an excellent data source for developing worldwide measures.

]

---
class: segue-red

# Research approach

---

## 1. Set a population grid for the area of interest (Mexico)

<img src="youtube_files/figure-html/unnamed-chunk-5-1.svg" width="90%" style="display: block; margin: auto;" />

---

## 2. Set a population density threshold

<img src="youtube_files/figure-html/unnamed-chunk-6-1.svg" width="90%" style="display: block; margin: auto;" />

---

## 3. Identify YouTube videos

a. associated with each cell of the grid within the geographic area of interest, using the YouTube API; and b. published within the timeframe of interest.

<img src="youtube_files/figure-html/unnamed-chunk-7-1.svg" width="80%" style="display: block; margin: auto;" />

---

## 4. Collect all comments posted to the YouTube videos identified in 3.

.content-box-green[

### Three steps in terms of API calls

1. `https://www.googleapis.com/youtube/v3/search`
  * Video search results filtered by geographic coordinates, radius, and date (`location`, `locationRadius`, `publishedAfter`, and `publishedBefore` parameters).
2. `https://www.googleapis.com/youtube/v3/videos`
  * Relevant metadata about all videos returned by the video search.
3. `https://www.googleapis.com/youtube/v3/commentThreads`
  * Text and relevant metadata of the comments posted to the videos returned by the video search.

A minimal sketch of these three calls follows on the next slide.

]
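---

### A minimal sketch of the three API calls

A minimal sketch of the three calls on the previous slide, not the project's actual collection code. It assumes the `google-api-python-client` library and a placeholder `API_KEY`; the coordinates, radius, and dates are illustrative values only.

```python
# Sketch only: assumes google-api-python-client and a valid API key.
from googleapiclient.discovery import build

API_KEY = "YOUR-API-KEY"  # placeholder
youtube = build("youtube", "v3", developerKey=API_KEY)

# 1. search: videos near a grid-cell centroid, within the timeframe of interest.
search = youtube.search().list(
    part="snippet", type="video",            # location filters require type="video"
    location="(19.4326,-99.1332)",           # "(lat,lng)" of a cell centroid (illustrative)
    locationRadius="25km",
    publishedAfter="2020-01-01T00:00:00Z",
    publishedBefore="2024-06-18T00:00:00Z",
    maxResults=50,
).execute()
video_ids = [item["id"]["videoId"] for item in search["items"]]

# 2. videos: metadata for the videos returned by the search.
videos = youtube.videos().list(
    part="snippet,statistics", id=",".join(video_ids)
).execute()

# 3. commentThreads: top-level comments, paginated with list_next().
comments = []
for vid in video_ids:
    request = youtube.commentThreads().list(
        part="snippet", videoId=vid, maxResults=100, textFormat="plainText"
    )
    while request is not None:               # real code should catch HttpError,
        response = request.execute()         # e.g. when comments are disabled
        comments.extend(response["items"])
        request = youtube.commentThreads().list_next(request, response)
```

In practice, the search call is also paginated and repeated for every grid cell and sub-period, within the available API quota.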
---

### Data collection for Mexico 2020-01-01 - 2024-06-18

- **Videos**: 1.2M
- **Video - geolocation** pairs: 3.4M
- **Comments**: 14.8M
- **Comment - geolocation** pairs: 41.9M

#### The 2024 Mexican general election was held on 2 June 2024

.center[<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Elecciones_federales_de_M%C3%A9xico_de_2024_10.jpg/640px-Elecciones_federales_de_M%C3%A9xico_de_2024_10.jpg" width = "35%">]

---

### Data for Mexico 2020-01-01 - 2024-06-18

#### Comments by day

<img src="youtube_files/figure-html/unnamed-chunk-8-1.svg" width="90%" style="display: block; margin: auto;" />

<img src="youtube_files/figure-html/unnamed-chunk-9-1.svg" width="90%" style="display: block; margin: auto;" />

---

### Data for Mexico 2020-01-01 - 2024-06-18

#### Comments by hour

<img src="youtube_files/figure-html/unnamed-chunk-10-1.svg" width="60%" style="display: block; margin: auto;" />

---

### Data for Mexico 2020-01-01 - 2024-06-18

#### Comments by location

<img src="youtube_files/figure-html/unnamed-chunk-11-1.svg" width="90%" style="display: block; margin: auto;" />

---

## 5. Retrieve information from the texts of the comments to estimate perceived violence. 🚧

### Dictionary-based approach

1. We started with a list of seed terms related to violence.
2. Using WordNet, we expanded the list to include all words within 2 steps of the seed words.
3. We weight each word to reflect its distance from the seed words: 1 for the original seed words, 0.5 for words at distance 1 and 0.25 for words at distance 2.

`$$L = \{ (e_1, w_1), (e_2, w_2), \ldots, (e_n, w_n) \}$$`

---

.center[<img src = 'assets/wordnet.png' width = '100%'></img>]

---

4\. We assign to each comment a score based on the presence of words from `\(L\)`:

`$$\text{Comment Score} = \sum_{i=1}^{n} (c_i \cdot w_i)$$`

- `\(c_i\)` is the count of the `\(i\)`-th word in the comment.
- `\(w_i\)` is the weight associated with the `\(i\)`-th word in the list `\(L\)`.
- `\(n\)` is the total number of words in the list `\(L\)`.

A minimal code sketch of the lexicon expansion and scoring is included at the end of the deck.

---

### Average score per day (7-day moving average)

<img src="youtube_files/figure-html/unnamed-chunk-12-1.svg" width="90%" style="display: block; margin: auto;" />

---

## 6. Aggregate this information to compute a set of measures. 🚧

1. We average comment scores by **commenter** and by day.
2. We jitter **commenters** within the geographic boundary of the cell associated with the commented video.
3. We use the inverse distance weighted (IDW) technique to map the entire grid (this can be done for grids of different granularity); a minimal IDW sketch is included at the end of the deck.

---

#### Preliminary results

<img src="youtube_files/figure-html/unnamed-chunk-13-1.svg" width="90%" style="display: block; margin: auto;" />
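---

### Step 5 sketch: lexicon expansion and comment scoring

A minimal sketch of the dictionary-based approach in step 5, not the project's code. It assumes NLTK's English WordNet data and treats one hop along hypernym/hyponym links as one "step"; the seed terms and the relations actually used in the project may differ.

```python
# Sketch only: assumes nltk with the WordNet corpus downloaded (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

SEEDS = ["violence", "murder", "kidnapping"]   # illustrative seed terms
WEIGHTS = {0: 1.0, 1: 0.5, 2: 0.25}            # weight by distance from the seeds

def neighbours(synset):
    """Synsets one step away, here via hypernym/hyponym links (an assumption)."""
    return synset.hypernyms() + synset.hyponyms()

lexicon = {}                                   # word -> weight (keep the highest weight)
frontier = {s for seed in SEEDS for s in wn.synsets(seed)}
for dist in range(3):                          # distance 0, 1, 2 from the seeds
    for synset in frontier:
        for lemma in synset.lemma_names():
            word = lemma.replace("_", " ").lower()
            lexicon[word] = max(lexicon.get(word, 0.0), WEIGHTS[dist])
    frontier = {n for s in frontier for n in neighbours(s)}

def comment_score(text):
    """Comment Score = sum_i c_i * w_i, summing the weight of every lexicon token found."""
    tokens = text.lower().split()              # naive tokeniser; multi-word entries are missed
    return sum(lexicon.get(tok, 0.0) for tok in tokens)
```

---

### Step 6 sketch: inverse distance weighting

A minimal IDW sketch for step 6, assuming NumPy; the power parameter `p = 2` and the use of cell centroids as target points are assumptions.

```python
# Sketch only: plain-NumPy inverse distance weighting.
import numpy as np

def idw(points, values, grid, p=2.0, eps=1e-12):
    """Estimate `values`, observed at `points` (n x 2 coordinates, e.g. jittered
    commenter locations), on every row of `grid` (m x 2, e.g. cell centroids)."""
    d = np.linalg.norm(grid[:, None, :] - points[None, :, :], axis=2)  # (m, n) distances
    w = 1.0 / np.maximum(d, eps) ** p          # inverse-distance weights
    return (w @ values) / w.sum(axis=1)        # weighted average per grid point

# e.g. surface = idw(commenter_xy, daily_mean_scores, cell_centroids)
```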