Big Data Techniques to Study the Impact of Gender-Based Violence in the Spanish News Media

Despite being an underreported topic in the news media, gender-based violence (GBV) undermines the health, dignity, security and autonomy of its victims. Research has studied many of the factors that generate or maintain this kind of violence. However, the influence of the media is still uncertain. This paper used Big Data techniques to explore how GBV is depicted and reported in digital news media. By feeding neural networks with news, the topic information associated with each article can be recovered. Our findings show a relationship between GBV news and public awareness, the effect of well-known GBV cases, and the intrinsic thematic relationship of GBV news with justice themes.


INTRODUCTION
Gender-based violence (GBV) is directed at an individual based on their biological sex or gender identity (Heise et al., 2002; Violenciagenero, 2019) and includes verbal, physical, sexual, and psychological abuse, occurring in either public or private life. People of all genders experience GBV worldwide, but women are disproportionately harmed (Russo & Pirlott, 2006; UN Women, 2021; WHO, 2013).
News media are key to understanding how society and the general population react to a topic (Cullen et al., 2019; Dijk, 1995). News can influence public perceptions and, consequently, social policies (Carlyle et al., 2008; Cullen et al., 2019; Gillespie et al., 2013; Maydell, 2018). This is the case with GBV, for which news media are one of the main sources of information. Media may also include available help resources in a news item (Comas-d'Argermir, 2015). In Spain, where more than a thousand women have been killed in the last two decades (Menéndez, 2014; Teruelo, 2011), many notorious cases have attracted unprecedented media attention. One example is the La Manada [wolf pack] case (see News Desk, 2019), a group sexual assault in 2016 that led to a series of protests across the country. This event (and the corresponding social, judicial and public response) motivated us to study the reporting of GBV news in Spain (LSE, 2020) using Big Data analysis techniques.
Big Data is a novel tool for examining the reporting of GBV in digital media. Moreover, its applications might be used in the future to help prevent this kind of violence through the news media. The same method could also be applied to the study of other social problems.

THEORETICAL BACKGROUND
Gender-based violence is a pervasive global issue that affects the health, dignity, security, and autonomy of its victims, particularly women and girls. GBV is "violence that is directed against a person on the basis of their gender or sex and includes acts that inflict physical, mental or sexual harm or suffering, threats of such acts, coercion, and other deprivations of liberty" (UN Women, 2012). While GBV can result in physical injuries, psychological trauma, and long-term health consequences for individuals (WHO, 2013), it also has social and economic costs, including reduced productivity, increased healthcare expenses, and decreased educational attainment (Heise et al., 2019). Research has explored numerous factors to understand the causes and drivers of GBV, including cultural and social norms, economic inequality, and political instability (Heise et al., 2019; Jewkes et al., 2017). However, the role of the media in perpetuating or challenging GBV is still uncertain. Research suggests that media representations of GBV can have significant effects on public attitudes and understanding of the issue (Baker & Ascione, 1995; DeFleur & Ball-Rokeach, 1989). However, the nature and extent of the influence of media on GBV is not well understood.
Research on the media coverage of violence against women has shown that it is often characterized by various patterns of problematic reporting (Boyle, 2005; Cuklanz, 2014). One common issue is the use of victim-blaming language and framing (Moorti, 2002; Wong & Lee, 2018). News media may focus on the victim's behavior or clothing as a cause of the violence, rather than the perpetrator's actions (Howe, 1997). This approach shifts the responsibility for the violence onto the victim and reinforces harmful stereotypes about women's behavior and responsibility for their own safety. Another issue is the sensationalizing of violence against women, which can result in graphic and voyeuristic depictions of violence that can further traumatize victims and reinforce harmful stereotypes about women's vulnerability. The use of sensationalist headlines and images can also trivialize the seriousness of the issue and lead to a lack of public engagement (Tranchese & Zollo, 2013). Furthermore, the media often fail to provide context and information about the social and cultural factors that contribute to violence against women. For example, the role of patriarchal beliefs and gender inequality in perpetuating violence against women may be downplayed or ignored (Rollè et al., 2020; Sutherland et al., 2019). In addition, the media may present a narrow and limited view of the aspects that constitute violence against women, with a focus on physical violence rather than other forms of abuse, such as emotional or economic abuse (Easteal et al., 2018).
Recent advances in technology and the availability of Big Data have provided an opportunity to explore the manner in which the media depict and report GBV. By analyzing large datasets of digital news articles, it is possible to gain insights into the ways in which GBV is represented and reported, as well as the factors that shape these representations. This paper aims to use Big Data techniques to explore the role of the media in perpetuating or challenging GBV. By feeding neural networks with the contents of digital news articles, we recover the topic information associated with each article and analyze the representations of GBV.

DATASET EXTRACTION PROCESS
We used a Big Data technique called 'web scraping' to extract a large volume of news from the main Spanish online newspapers. This technique requires a group of servers querying massive quantities of data from public pages on the Internet.
We deployed a network of cloud servers and a local server to distribute the workload. One of the servers stored data and the others searched the Internet for news items from the seven selected Spanish online newspapers: La Vanguardia, El País, El Mundo, ABC, 20minutos, Público and diario.es. The local server first trained the neural networks to classify the subjects in the news and then analyzed the full database, using the neural network models to calculate the probability that each news item covered the subject of GBV.
This process resulted in a dataset of 784 259 news items from January 2005 to March 2020. Online news media classify every piece of news with tags, which are keywords that help describe the text and allow search functions to find it. Tags not only describe topics such as justice, economy, politics, international, technology and health, but are also combined with several other data fields, such as media outlet, title, content, and date.

DATA MINING USING NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is a subfield of Artificial Intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Its goal is to build algorithms capable of understanding the contents of documents, including the contextual nuances of the language within them. The technique can then accurately extract information and insights contained in the documents as well as categorize and organize them.
After the data extraction process using web scraping, we used NLP (Chapman et al., 2011) to analyze local news articles from the selected Spanish online media. Natural language processing is a type of data analysis focused on transforming free text (i.e. unstructured data) in documents and databases into normalized, structured data suitable for analysis (Coppersmith et al., 2018; Farzindar & Inkpen, 2017; Mori & Haruno, 2021; Goldberg, 2017). This process enabled us to classify the news. To obtain the main subjects of written media content, we used neural networks (Mikolov, 2017; Goldberg, 2017). Neural networks, with an accuracy of 98%, were also used to obtain news topics related to GBV. This enabled us to formulate three research questions (RQ):
• RQ1: Is there a relationship between GBV news topics and public awareness?
• RQ2: Do high-profile GBV cases in the media affect public awareness?
• RQ3: Are there any intrinsic themes in media coverage of GBV?

TOPOLOGICAL DATA ANALYSIS METHODS (MAPPER ALGORITHM)
Topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. Extraction of information from datasets that are high-dimensional, incomplete and noisy is generally challenging. The application of TDA provides a general framework to analyze such data in a manner that is insensitive to the particular metric chosen and provides dimensionality reduction and robustness against noise. Beyond this, TDA inherits functoriality, a fundamental concept of modern mathematics, from its topological nature, which allows it to adapt to new mathematical tools.
One key part of our TDA analysis is the use of the 'Mapper' algorithm (Singh et al., 2007), which focuses on the way that the parts of a system are connected rather than the distance between them. This is useful for analyzing and visualizing data. We used Kepler Mapper, a flexible Python implementation of the Mapper algorithm, to study the connectivity of GBV news and other news within a specific period (Veen et al., 2019).
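To make the idea of connectivity concrete, the following is a minimal, standard-library sketch of the Mapper construction. It is an illustration under stated assumptions, not the Kepler Mapper API and not the configuration used in this study: the function names, the one-dimensional filter, the equal-length overlapping cover and the single-linkage clustering with a fixed radius `eps` are all choices made here for brevity.

```python
from itertools import combinations
from math import dist


def cover_intervals(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal-length intervals overlapping by a given fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(points, ids, eps):
    """Group ids into connected components, linking points at distance <= eps."""
    clusters, unseen = [], set(ids)
    while unseen:
        seed = unseen.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unseen if dist(points[i], points[j]) <= eps}
            unseen -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, filter_values, n_intervals=4, overlap=0.5, eps=1.0):
    """Return (nodes, edges): nodes are clusters, edges join clusters sharing points."""
    lo, hi = min(filter_values), max(filter_values)
    nodes = []
    for a, b in cover_intervals(lo, hi, n_intervals, overlap):
        # Points whose filter value falls in this interval (small tolerance for rounding).
        ids = [i for i, f in enumerate(filter_values) if a - 1e-9 <= f <= b + 1e-9]
        nodes.extend(single_linkage(points, ids, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Toy data (hypothetical): five points on a line, filtered by their coordinate.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
nodes, edges = mapper_graph(points, [p[0] for p in points],
                            n_intervals=2, overlap=0.5, eps=1.0)
```

In this sketch the size of a node is `len(nodes[k])`, and an edge appears exactly when two clusters share at least one item, which is what lets the graph describe how groups of items connect rather than how far apart they are.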

DATA EXTRACTION AND TOPIC CLASSIFICATION
The study used NLP techniques to transform unstructured text into structured data. This also allowed topic classification of news based on the list of tags each piece of news contained (such as justice, economy, politics, international, technology and feminism). However, Spanish media, both analogue and digital, do not commonly use the phrase 'la violencia de género' (GBV) as a tag or topic. To shed light on this issue and, in addition, enhance the description tags, we extracted the subject of each news article and the probability that it was related to GBV. To achieve this outcome, we applied a stacked protocol that consisted of two consecutive neural network models for NLP. First, by analyzing the content of each text, we extracted the general subject classification. Second, we applied a binary GBV classification, which produced a probability, where zero means a lack of GBV content in the individual piece of news, considering tags such as feminine, domestic violence, female sexual abuse and sexual crimes against women.

ANOMALY DETECTION
In data analysis, anomaly detection (also referred to as outlier detection) is the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior. Such observations may arouse suspicions of being generated by another mechanism, or appear inconsistent with the remainder of that set of data.
We applied the Prophet forecasting model (open-source software released by Facebook) (Shen et al., 2020) to our average GBV probability. Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. This model works best with time series that have strong seasonal effects and several seasons of historical data.
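The principle of flagging observations that fall outside an expected range can be sketched without Prophet. The example below deliberately swaps in a much simpler technique, a trailing-window mean with a band of `k` standard deviations, in place of Prophet's additive trend and seasonality model. The function name, the window length, the value of `k` and the toy series are assumptions made here for illustration and do not reproduce the study's configuration.

```python
from statistics import mean, stdev


def flag_anomalies(series, window=7, k=3.0):
    """Return indices whose value lies outside mean +/- k*std of the preceding window.

    This is a simplified stand-in for a forecast uncertainty band, not Prophet.
    """
    anomalies = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mu, sigma = mean(history), stdev(history)
        if abs(series[t] - mu) > k * sigma:
            anomalies.append(t)
    return anomalies


# Hypothetical daily average GBV probabilities with one sharp spike.
daily = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.45, 0.11, 0.10]
spikes = flag_anomalies(daily)
```

A band built from a trailing window ignores seasonality, which is the main reason a model such as Prophet is preferable for real news data with weekly and yearly cycles.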

GENDER-BASED VIOLENCE IN NEWS MEDIA AND PUBLIC AWARENESS (RQ1)
To understand the GBV press coverage in more detail, we analyzed the GBV probability over time, i.e. the temporal changes (Tsay, 1989; Mills, 2015). Considering the whole corpus of news, we measured the monthly average of the GBV probability. The larger this average, the greater the chance that news in a given month is related to GBV. Figure 1, Chart (a) shows how the GBV probability rises over time, by a factor of 1.8 in the last three years. Remarkably, 20% of the news in 2020 had GBV connotations. We compared this increase with public awareness concerning gender movements. To do so, we examined the social perception of GBV using the monthly survey of the CIS (Centre for Sociological Research, a Spanish public research institute) (Violenciagenero, 2019). We applied the same seasonal-trend decomposition model; see Figure 1, Chart (b). Comparing Charts (a) and (b), the monthly GBV probability in news and gender violence awareness show the same upward trend. Furthermore, quantitative information about this consistent trend is obtained from the cross-correlation between the average GBV probability and the opinion survey on domestic violence as a function of the displacement of one relative to the other; see Chart (c). Cross-correlation measures the relationship between two time series; in particular, it measures whether lags in the first time series can be used to predict future values of the second.
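The lagged comparison described above can be sketched as a Pearson correlation computed at each displacement. This is a minimal standard-library illustration; the function names and the toy series are hypothetical, both series are assumed to be equally long and sampled monthly, and the study's own computation may differ in normalization or detrending.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(news, survey, max_lag):
    """Correlate news[t] with survey[t + lag] for lag = 0..max_lag (in months)."""
    result = {}
    for lag in range(max_lag + 1):
        x = news[:len(news) - lag]   # news values that have a survey value lag months later
        y = survey[lag:]             # survey values shifted back by lag months
        result[lag] = pearson(x, y)
    return result


# Toy series (hypothetical): the survey repeats the news signal two months later.
news = [1, 3, 2, 5, 4, 6, 8, 7]
survey = [0, 0, 1, 3, 2, 5, 4, 6]
by_lag = lagged_correlation(news, survey, max_lag=3)
best_lag = max(by_lag, key=by_lag.get)
```

A peak at a positive lag, as in this toy case, is the pattern one would expect if changes in news coverage precede changes in the survey response.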

THE IMPACT OF FAMOUS GENDER VIOLENCE CASES (RQ2)
Our results show that the monthly GBV probability has increased over the years. A similar trend can be observed in the daily grouped data. Figure 2 shows the temporal trends (green lines) of the GBV data (black symbols). Although the trend appears to fit the data well, certain days fall outside the expected range: these are anomalies. Anomalies are understood as values that are sufficiently different from, or do not fit the trend of, the preceding values; they are depicted as red points. Table 1 lists the anomalies detected during 2018 and 2019 by this analysis (marked as red points in Figure 2), the corresponding average GBV probabilities, and the event in the news. Note that several of these anomalies correspond to the court's sentence at the end of the 'La Manada [Wolf-pack] rape case' (Aurrekoetxea, 2020; Egea, 2019; News Desk, 2019).

Source: Authors
To visualize the anomalies and the dates of high GBV probability in more detail, we used a heatmap plot (see Figure 3), where all the dates of 2018, 2019 and 2020 are shown. Lighter colors relate to higher average GBV probabilities. Surprisingly, high probabilities of GBV not only correspond to the day of the anomaly but also extend over several days. In addition, the heatmap shows how the years get lighter (in terms of probability of GBV news), which confirms the upward trend observed in the time series (see Figure 1).

SUBJECT ANALYSIS OF GENDER-BASED VIOLENCE NEWS
Next, we focused on the intrinsic connection of subjects (or tags) in the news (RQ3). We used the extracted GBV probability to discern which news items were about this issue. We considered a piece of news to mainly cover GBV if the probability returned by the neural network was greater than 0.9999, i.e., if the neural network confirmed it almost without doubt. We thus obtained a set of 5375 news items that covered GBV. Then, we applied a subject classification neural network to tag these news items and to find out their main subjects.
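The selection step above amounts to a strict threshold on the classifier output followed by a tally of subjects. The sketch below illustrates it with hypothetical field names (`gbv_probability`, `tags`) and toy records. It also assumes that a subject's frequency is the share of selected items carrying that tag, which is one possible reading of a percentage by subject and may not match the exact definition used for the figures.

```python
from collections import Counter

GBV_THRESHOLD = 0.9999  # cutoff stated in the text


def select_gbv(items, threshold=GBV_THRESHOLD):
    """Keep only items whose GBV probability is strictly above the threshold."""
    return [item for item in items if item["gbv_probability"] > threshold]


def tag_percentages(items):
    """Percentage of items carrying each tag, most frequent first."""
    if not items:
        return {}
    counts = Counter(tag for item in items for tag in set(item["tags"]))
    return {tag: 100 * n / len(items) for tag, n in counts.most_common()}


# Hypothetical records standing in for classified news items.
corpus = [
    {"gbv_probability": 0.99995, "tags": ["justice", "crime"]},
    {"gbv_probability": 0.50000, "tags": ["economy"]},
    {"gbv_probability": 1.00000, "tags": ["justice"]},
]
gbv_items = select_gbv(corpus)
shares = tag_percentages(gbv_items)
```

Such a strict cutoff trades recall for precision: items the classifier is only fairly sure about are excluded, so the resulting subject profile describes the clearest GBV coverage rather than all of it.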

Source: Authors
Figure 4 shows the resulting classification. The most common tags were justice, gender violence and crime. Using TDA, we applied the Mapper algorithm to obtain a graph (Figure 5) that summarizes the closeness of the news tags with a geometric interpretation. The resulting graph draws a node for each cluster of news items that are close in terms of their themes. The size of a node reflects the number of news items it contains. Note that if two nodes overlap, i.e., if there are news items belonging to more than one cluster, the nodes are joined by a segment. This allowed us to see not only how subjects change but also how news items organize themselves in terms of closeness. In Figure 5, the colour scheme runs from dark blue (cool, low probability) through green (warm) to yellow (warmer, high probability). The yellow nodes represent news items (or groups of news items) with higher GBV probability. Each branch represents clusters of news whose subjects evolve similarly. As Figure 5 (2019a) shows, the GBV branch forms at a junction between feminism and politics. These findings are consistent with the central position of the GBV node with regard to legislation and court cases. We repeated the analysis for news in the following year, see Figure 5 (2020b), where the GBV branch merges with that of justice.

DISCUSSION
Gender-based violence (GBV) is a worldwide issue that, despite its prevalence, has been grossly under-covered by public institutions and also by the news media (Cullen et al., 2019). Fortunately, awareness of the problem has been growing for years, thanks to social movements and increased media coverage (Luengo, 2018). The media have helped to spread social awareness about GBV (Luengo, 2018) and to transform a public issue into a social problem (Alexander, 2006). However, GBV continues to be misrepresented in the news media (Boyle, 2005; Cullen et al., 2019).
In this paper, we presented an analysis of news coverage of GBV using novel Big Data techniques. Our study was limited to Spanish media, but we believe that our methodology has general validity. Our results show that GBV news items do not have a specific topic or tag, which contributes to the invisibility of the problem and forces the news to be classified under numerous related tags. However, justice subjects are highly related to GBV news, which can be explained by the famous cases that end up in court or by the fact that public attention is fixed on the judicial process.
Additionally, when covering GBV, the media usually present legal communications or discuss protecting the laws.
We found that the media have of late begun to spotlight GBV, the recent feminist movement and the voices of victims of gender violence. Our analysis shows an increase in the public perception of GBV in recent years resulting from the amount of news about the topic, which echoes well-known events. Moreover, we found that public opinion lags behind the news media by several months, which means that people react to a relevant GBV event after it occurs. Indeed, the match between the anomalies found by the proposed detection method and well-known GBV cases confirms this impact. Our results demonstrated that the impact of a GBV event lasts for several days and that the media have an influential role in raising both social awareness and consciousness of the topic.
Our study's strength is the use of a GBV probability for each news item and the analysis of how it evolves over a certain period. We mapped how the GBV probability appears daily and pinpointed related events that drive both the topic's increase in the news and public awareness (UN Women, 2017). Moreover, we used TDA techniques to extract how certain topics (or tags) are related in the news. Our approach shows that GBV tends to relate to the subject of justice and rarely to that of feminism.
Violence against women is a serious social issue and appears in the news media rather frequently. The way that news media portray GBV has significant implications for public perception and policy responses. Research on the media coverage of violence against women has shown that GBV coverage is often characterized by various patterns of problematic reporting. Women are usually depicted in the news media as victims (Bleiker & Hutchison, 2019; Busso et al., 2020) and this influences the way GBV is covered. Still, information regarding GBV is biased and the problem it poses is usually silenced by explicit details (Buiten & Salo, 2007; Cuklanz, 2014; ElSherief et al., 2017; Wong & Lee, 2018). Overall, the portrayal of violence against women in news media is a complex issue that requires critical examination and attention. Media outlets should avoid victim-blaming language, provide context, and accurately reflect the experiences of survivors to raise awareness and promote effective policy responses (Easteal et al., 2015; Sutherland et al., 2019; Wolf, 2018).
Previous research on GBV has utilized Big Data techniques to show some interesting outcomes (Subramani et al., 2018, 2019; Xue et al., 2019). Those works tend to focus on social media, which is an important way of conveying social concern, but none of them approaches this issue in the way this study did.
Our findings provided a new point of view and representation of the news about GBV that should lead to better reporting. Along these lines, the creation of new guidelines or the improvement of existing ones could be an interesting objective (GenderIT, 2012; UNESCO, 2019). News media would thereby help raise awareness of the problem that GBV poses and thus promote resources for its prevention. Furthermore, as the media reflect the concerns of our societies, they should help to neutralize GBV and combat it.
Figure 3. Gender-based violence news heatmap of average probabilities for each day from 2018 to 2020.

Figure 4. Top topic classification of the Gender Based Violence news by subject frequency (percent).
Figure 5. Interdependence between topics in the Spanish news for 2019a and 2020b.