1 Algorithmic Fairness Beyond Individuals and Groups

Machine learning (ML) systems are hailed, among other things, for their ability to sort and scrutinize vast sets of information, for efficiency gains in complex distributional tasks, and for automating repetitive tasks. Considering these potentials, ML systems are increasingly used in automated decision-making (ADM) processes in many areas of life - from automated services and allocations in public administration to workforce management or medical diagnoses (AlgorithmWatch, 2020; Jarrahi et al., 2021; Krzywdzinski et al., 2022; Tomašev et al., 2019). They become implemented, in other words, as problem-solving agents for dealing with and managing individuals - with partly far-reaching impacts, as demonstrated in numerous cases of algorithmic discrimination (Ensign et al., 2018; Zuiderveen Borgesius, 2018).

Such examples of algorithmic discrimination have been accompanied by an extensive scientific discussion on algorithmic fairness (Barocas et al., 2019; Chouldechova & Roth, 2018; Binns, 2017). Unfair outcomes in ADM leading to discrimination have been established as one of the ethical challenges of using algorithms in ADM systems, next to others such as lack of transparency or accountability (Mittelstadt et al., 2016). Within ML research, engagement with questions of algorithmic fairness has led to discussions both of how to assess and evaluate fairness in ML and of how to mitigate bias as one of the causes of unfair outcomes - resulting in a broad field of fairness definitions as well as fairness metrics (the mathematical or statistical translations of fairness definitions) (Barocas et al., 2019). These have mostly focused on individual or group-level fairness (Castelnovo et al., 2022; Hertweck & Heitz, 2021).

Similarly, anti-discrimination law also treats non-discrimination as a personal right (European Union, 2000; Council of Europe, 1953; United Nations, 1948). In this sense, the question of algorithmic fairness has been considered a question of individual or group concern - in other words, a question of whether individuals have been treated differently compared with others due to certain characteristics or their belonging to a particular group of people (e.g. based on gender, ethnicity, education etc.) or to otherwise algorithmically curated groups (Wachter, 2022).

In this work, we argue that a perspective on algorithmic fairness as an individual or group-level concern is insufficient. Technologies such as artificial intelligence (AI) and ML are increasingly considered a “strategic technology” (Durant et al., 1998) with supposed problem-solving capabilities far beyond the individual. ML applications are, for instance, referred to and strategically implemented as incremental tools within complex global transformation tasks or crisis situations, thereby implementing forms of algorithmic governance (Katzenbach & Ulbricht, 2019). Fostered by supranational political bodies such as the United Nations (UN) or the European Union (EU), but also by global corporations (European Commission et al., 2022; UN DESA, 2021), ML applications are increasingly expected to solve complex societal problems on a global scale (Katzenbach, 2021). Such problems include algorithmically managing online hate speech through automated platform governance (Gorwa et al., 2020), migration flows through automated border control and border construction (Amoore, 2021; Dijstelbloem et al., 2011; Pfeifer, 2021), the reduction of CO2 emissions through automated energy distribution (Klobasa et al., 2019; Nishant et al., 2020), or the coordination of disaster management after crisis events through automated damage classifications (Depardey et al., 2019; Linardos et al., 2022).

Considering the increasing implementation of ML applications to automate decision-making processes targeting entire populations across nation-states and broader geographical regions, we propose a transnational perspective on algorithmic fairness as a complementary and necessary addition to individual and group-level fairness. The application scenario we consider is ML systems in Disaster Response Management (DRM). We develop novel concepts for evaluating established fairness definitions in a transnational setting and demonstrate their efficacy in comprehensive empirical evaluations. ML methods to support DRM have been applied in recent years on a global scale by categorizing disaster events and their consequences based on image and text classification of social media content. From an ethical perspective, it needs to be critically assessed whether such systems perform equally well across groups of countries or whether these applications systematically disadvantage disaster response support for specific types of countries (for instance based on their socio-economic status). If that were the case, efforts would need to be undertaken to enhance performance so as to ensure a globally fair distribution of disaster response based on ML methods.

We will provide a concept of transnational algorithmic fairness that grounds its fairness assessment in country-based development indicators as representations of sensitive attributes based on which nation-states might be discriminated against. We will then apply this concept of transnational fairness to a specific disaster response ML application and dataset to test whether certain groups of countries are structurally disadvantaged in the classification output.

Eventually, we will reflect on social media platforms, wider telecommunications infrastructures and different patterns of using and appropriating social media platforms in the context of discussions on digital divides, in order to consider how the data such systems process is being produced. Considering these data-producing infrastructures then also raises concerns about origins of algorithmic (un)fairness beyond the (technical) ML system. In doing so, we introduce an interdisciplinary perspective informed by research on data analytics, ML, digital media studies and media sociology to arrive at an encompassing assessment of algorithmic fairness that goes beyond the technical system and includes an embedded perspective on people’s everyday media use and on social media platforms as the producers of sociality and of the data to be processed - with relevance beyond the case of algorithmic fairness in disaster scenarios.

2 Transnational Algorithmic Fairness and Global Justice

The increasing use of ML as a “strategic technology” (Durant et al., 1998) that has the potential to change societies and whose role for society is being negotiated in political and public discourses (Katzenbach, 2021) requires a perspective on algorithmic fairness beyond individuals and groups within a national setting. On a global level, the Sustainable Development Goals (SDGs) and the discussion of how AI will contribute to their achievement (UN DESA, 2021; Vinuesa et al., 2020) exemplify the problem-solving capabilities being attributed to ML systems on a transnational level. The examples above have demonstrated that ML applications increasingly affect individuals and populations on a global scale. They exemplify potentially unsettling consequences when ML applications produce very different outcomes for different populations. What has not been sufficiently addressed so far is how to ensure fairness beyond individuals and groups within nation-states when ML applications are applied in critical areas on a global scale. In other words, what is missing is a link between matters of global justice and algorithmic fairness.

The automated moderation tools of large online platforms, for example, differ fundamentally in their performance depending on the language that needs to be moderated. Automated tools to detect hate speech on social media platforms are usually optimized first for English and only subsequently for other majority languages. For minority languages, these tools might not work optimally - as demonstrated by Facebook’s inability to moderate the wave of hatred and calls to violence in Burmese targeted at the Rohingya minority in Myanmar since 2017. A UN fact-finding mission has since stated that Facebook, as the platform through which incitements of hatred and calls to violence were able to spread in Myanmar, played a role in the resulting genocide (UNHRC, 2018). The later revelations by whistleblower Frances Haugen confirmed that Facebook did not invest sufficiently in safeguards to prevent extreme forms of hatred in Myanmar (or, similarly, in Ethiopia) from spreading, given that its automated moderation tools did not perform sufficiently well in the local language Burmese (Akinwotu, 2021). Such examples demonstrate that differences in performance, as the basis of unfair algorithmic outcomes and forms of algorithmic discrimination, matter from a global justice perspective.

However, not all differential treatment can be considered as leading to unfair outcomes or discrimination. While the discussion on algorithmic fairness focuses on disparity, it is essential “to ask (...) whether the disparities are justified and whether they are harmful” (Barocas et al., 2019, p. 3). While Mittelstadt et al. (2016, p. 8) define discrimination in the context of ADM as the “adverse disproportionate impact resulting from algorithmic decision-making”, Barocas et al. (2019, p. 76) specify that discrimination “is not different treatment in and of itself, but rather treatment that systematically imposes a disadvantage on one social group relative to others”. Different reasons exist why some differential treatment might be considered adverse, as imposing disadvantages and thus morally objectionable (Mittelstadt et al., 2016), among them relevance, generalizations, prejudice, disrespect, immutability and compounding injustice (Barocas et al., 2019).

For the case of transnational algorithmic fairness, we follow Wachter’s argument that algorithmic groups which do not align with traditionally protected attributes in discrimination law should also be brought into the focus of discussions on algorithmic fairness (Wachter, 2022). Understanding nation-states and groups of nation-states as algorithmic groups makes it possible to acknowledge morally objectionable forms of discrimination from a global justice perspective with potentially far-reaching adverse impacts on local populations. Groups of nation-states then do not form algorithmic groups based on extensive online profiling, as is the case for individuals who might be grouped based on characteristics such as being “dog owners” or “sad teens”, as Wachter describes. Instead, they can be considered algorithmic groups since, as entities, they might become objects of algorithmic decision-making with potential adverse impacts affecting their populations, and since they constitute contexts for data aggregation and data processing for ML applications in transnational settings.

A transnational perspective on algorithmic fairness considers moral obligations in relation to ADM targeting populations of certain geographical regions. Algorithmic discrimination, as the outcome of unfair ADM, in a transnational setting thus denotes adverse impacts on certain populations based on their location and their belonging to certain geographically distinguishable entities. Due to matters of data availability, we focus in the case study presented here on national contexts, even though sub-national contexts could also be considered. Further, we specifically conceptualize a transnational perspective (Nye & Keohane, 1971) on algorithmic fairness, allowing us to consider relations between populations across state boundaries rather than focusing primarily on relations between nation-states. A transnational perspective on algorithmic fairness is in this sense well suited to grouping nation-states and populations within the assessment of algorithmic fairness according to similarities regarding certain characteristics, rather than assessing solely based on national belonging. In other words, a transnational perspective allows for seeing similarities across nation-states and accordingly allocating them into algorithmic groups.

Further, a transnational perspective on algorithmic fairness is grounded in a cosmopolitan understanding of global justice, which stresses on moral grounds questions of a just distribution among all living human beings, acknowledging universality as a key element of a global justice perspective (Pogge, 1992). Stressing distributive justice, Mathias Risse summarizes a cosmopolitan perspective as asking: “If shares of material goods are among the rights and protections everyone deserves, we must ask if this depends on where people live” (Risse, 2011, p. 3). Questions of global distributional injustices have lately been taken up in ML research, especially in relation to colonialism, the dominance of western values and cultural hegemony in AI applications, and regulatory and infrastructural monopolies in the AI industry (Birhane et al., 2022; Mohamed et al., 2020; Png, 2022), as well as in discussions on extractivist and exploitative AI production (Crawford, 2021; Gray & Suri, 2019; Bender et al., 2021). A transnational approach to ML fairness tries to connect these observations on distributional injustices with algorithmic fairness assessments of specific ML applications.

3 Quantifying Transnational Fairness

In Sect. 2 we considered the need for a transnational fairness assessment from a global justice perspective, with its moral obligation to object to forms of discrimination with potential adverse impacts on local populations. While our concept of transnational fairness allows for grouping nation-states and populations beyond nationality based on certain characteristics or similarities, the attributes on which to group nation-states are an open question. Yet it is only through the selection of certain attributes that transnational fairness assessments can be made. Attributes or group characteristics determine on which basis disadvantages of groups, and which forms of discrimination, can be defined and discussed. In the following, we propose a method to inform a transnational fairness assessment with a data-centric approach that can be applied on a case-by-case basis. We demonstrate this method in a disaster response use-case and report the fairness infringements we are able to assess with our method.

3.1 Development Indicators as Sensitive Attributes for Nation-States

ML components can be biased and can discriminate against groups of citizens, for instance in policing (Angwin et al., 2016) or recruiting (Lahoti et al., 2019), to name just two examples. Commercially available ML-based face recognition software has been reported to achieve much lower accuracy on females with darker skin color (Buolamwini & Gebru, 2018). As the reasons for ML predictions are often not easily interpretable in general, the reasons for unfair ML decisions can also be difficult to trace back. Most origins of unfair ML predictions, however, are related to the quality of and bias in the training data (Barocas & Selbst, 2016).

In application contexts as described above, attributes like income, gender, ethnicity, marital status and disabilities, among others, are commonly declared as sensitive variables or used to define protected or unprivileged groups in order to measure fairness. The literature refers to many different measures for quantifying fairness or the disparity between protected and unprotected groups or individuals. Each of these metrics emphasizes different aspects of fairness, and single fairness notions often lack a consistent definition, are difficult to combine or even contradict each other (Caton & Haas, 2020). As fairness is ultimately an ethical or normative problem, there is no single definition or metric of fairness that accounts for all relevant factors in all situations. Hence, fairness metrics should be selected depending on the given context and requirements, while consulting relevant stakeholders (affected people, domain experts, policy makers and ethicists).

The same principles as described for individual notions of fairness can be applied to a transnational concept of fairness in ML. While sensitive attributes and protected variables on an individual level are, for instance, reflected in anti-discrimination law, there is no established framework based on which to assess unfairness in a comparison of nation-states. To assess transnational algorithmic fairness, sensitive attributes could be aggregated on a national level. This aggregation of attributes could in turn be used to group countries. Unfortunately, the required sensitive attributes are difficult to obtain globally with sufficiently high quality and volume. Aggregating sparse and low-quality data on sensitive attributes could lead analyses of transnational algorithmic fairness astray.

Few studies have attempted to show fairness infringements in ML on a transnational scale. Fairness is often merely defined based on coarse geographic categories like the (Global) North and the (Global) South or based on continents. But analyzing fairness based on cardinal directions and continents (or even only on nation-states) often lacks socio-economic characteristics and thus cannot give substance to a discussion of what fairness means from a global justice perspective. So far there are only modest and rudimentary approaches that attempt to consolidate socio-economic data as a basis for a fairness discussion in ML on a transnational scale (Shankar et al., 2017; DeVries et al., 2019; Goyal et al., 2022). Such approaches thus cannot adequately contribute to a debate on which fairness notion is appropriate in the context of DRM.

The problem with plain geo-diversity results from the lack of sensitive attributes on a transnational level (analogous to what gender, ethnicity or income provide for individuals). While acknowledging their limitations (see Sect. 3.3), we propose development indicators (such as the Human Development Index) as a basis for investigating sensitive attributes in a transnational context. Development indicators are typically derived from socio-economic data, are common instruments in the social sciences and in global development and, generally speaking, serve the purpose of empirically comparing development across countries (Baster, 1972). By addressing development (sometimes called progress in the literature), they imply an underlying definition or theory of what development (or progress) is and how to measure it (McGranahan, 1972)Footnote 1. Development indicators were in particular designed and used by the United Nations to monitor economic, social, demographic, environmental and other goals set by the institution and its organs (Vries, 2001). Inequality and disparities between nation-states are often the reason why development indicators were created, and so they are often also used as tools and models in policy making (Vries, 2001).

3.2 Case Study: Disaster Response Management

As in many other fields, ML methods have been applied and researched in disaster informatics and disaster risk management. Applications in DRM include evaluating the exposure and vulnerability of geographical regions to disasters, assisting in building resilience to disasters, forecasting disaster events and assessing post-disaster impacts (Depardey et al., 2019; Linardos et al., 2022) to assist humanitarian aid and rebuilding efforts. Questions of fairness are relatively under-explored in disaster informatics (Yang et al., 2020; Gevaert, 2021) and in ML in disaster-related contexts, although the disparate impact of disaster events on different socio-economic groups is a well-studied topic in DRM (Hallegatte et al., 2018), both on the individual and the global level.

The application scenario for ML we research in this study concerns post-disaster impact, particularly immediate disaster response, which aims to locate and identify damages caused by disaster events in order to assist humanitarian aid by rapidly allocating resources and evaluating the urgency of emergencies. While traditionally satellite imagery and images from (unmanned) aerial vehicles (AVs and UAVs) are used to assess damages, social media has become an important means both for actors in humanitarian aid and for those affected by disaster events (Imran et al., 2015; Said et al., 2019). Informing about incidents and calling for help via social media (due to collapsed or overloaded emergency call systems) enables information to be distributed faster and with more coverage than traditional methods.

The volume of information requires scalable methods, often based on ML, to detect these reports on disaster events. The speed of the response to disaster events is one of the most important factors for successful aid. ML algorithms can help to improve and scale the automatic detection of content related to disaster events in online social media. One example is the AI for Disaster Response (AIDR) system (Imran et al., 2014). AIDR can classify online social media posts into a set of user-defined categories. The system has been successfully used on data from Twitter during the 2013 Pakistan earthquake to distinguish informative from non-informative tweets, leveraging a combination of ML components and human participation (through crowd-sourcing) in real time.

In this study, we work with the MEDIC dataset (Alam et al., 2023), a collection of multiple datasets comprising 71,198 images from social media feeds worldwide, mainly from Twitter and partly from Google, Bing, Flickr and Instagram. MEDIC is a recent computer vision dataset that serves as a benchmark for disaster event classification and features four tasks: disaster type, humanitarian, informative and damage severity (see also Fig. 1). These tasks are novel for disaster response datasets and originate from consultations with disaster response experts. They are designed to assist humanitarian aid with information about the disaster for coordinating an immediate and appropriate disaster response.
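To give a concrete sense of how the dataset can be inspected, the following is a minimal sketch assuming the MEDIC metadata ships as a tab-separated file; the file name and column names used here are illustrative assumptions, not the published schema.

```python
# Minimal sketch: inspecting the label distributions of the four MEDIC tasks.
# The file name and column names below are illustrative assumptions; the
# released metadata may use a different layout.
import pandas as pd

TASKS = ["disaster_types", "informative", "humanitarian", "damage_severity"]

meta = pd.read_csv("MEDIC_train.tsv", sep="\t")  # hypothetical file name

for task in TASKS:
    if task not in meta.columns:
        continue  # column name differs from our assumption
    counts = meta[task].value_counts()
    print(f"\nTask: {task} ({counts.sum()} labelled images)")
    print(counts.to_string())
```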

Fig. 1 Image samples from the MEDIC dataset, sourced from the MEDIC publication (Alam et al., 2023). Example images depict all classification tasks of the dataset: T1 disaster type, T2 informativeness, T3 humanitarian (efforts), T4 damage severity

As in other applications of ML or ADM systems in the context of disaster response, inequalities between participants or groups can also be defined by socio-economic aspects. These can include economic (fiscal resilience of state and citizens, rebuilding capabilities), technological (robustness, availability of and access to digital infrastructure), political (governmental policy on internet and social media, censorship, laws), cultural (habits and practices of using social media and photography to inform about disasters) and health-related aspects (health care, availability of medical supplies, infrastructure and staff, welfare). The disposition of these attributes or factors for every nation-state can determine inequalities between states and outcomes in disaster response. To find attributes corresponding to these factors, we use the following development indicators as the basis of our fairness assessment: the Human Development Index (HDI) (UNDP, 1990), the Democracy Index (DI) (EIU, 2021) and the ICT Development Index (IDI, or Information and Communication Technologies Development Index) (ITU, 2017):Footnote 2

The HDI is a popular index published annually by the United Nations and used across many disciplines. It is a composite index of life expectancy, education and per capita income. These components overlap with measures of social inequality often used in the social sciences: income, education and health.

The DI (EIU, 2021) is an index based on 60 questions answered by experts, capturing political pluralism, consensus about government, political participation, democratic political culture and civil liberties.

The IDI is also a composite index (of 11 indicators), first published in 2008, which measures and compares developments in information and communication technologies across nation-states. One of its main objectives is to measure the (global) Digital Divide, which describes disparities in access to the internet and to information technology. The three main categories the IDI measures are access to information and communication technologies (ICT), use of ICT and ICT skills. The access category describes, with five indicators, infrastructure and access to telephone, mobile-cellular networks and the internet, but also provides information about the number of subscriptions and the households making use of computers and the internet. The use category comprises three indicators on the intensity of internet usage by individuals, while the skills category contains three indicators on the literacy and education of communication technology users (ITU, 2017). The IDI thus appears to be a particularly suitable index in the context of a disaster response system based on social media.

As Caton and Haas (2020) note, sensitive attributes might be represented or encoded in other (less obvious) variables relevant to the fairness context. We therefore assume that the different development indicators are either directly sensitive variables or at least proxies for less directly observable protected variables. We hypothesize that the sensitive variables described here are indeed relevant latent factors encoded in the dataset. In other words, the availability and usage of ICT infrastructure, as measured by the IDI, is assumed to affect the availability and quality of images in online social networks and thereby also the predictive performance of ML models trained on this data.

For defining sensitive groups and investigating fairness of ML algorithms trained on the MEDIC dataset with regard to the development indicators, the nation-states involved in the disaster events contained in the dataset needed to be identified and assigned to each image sample. The MEDIC dataset itself does not contain any information about the involved nation-states. The metadata only consists of information about the four classification tasks and the source dataset from which the images were obtained. The disaster events covered by the dataset are mentioned in its publication (Alam et al., 2023), but are not part of the actual metadata. However, we could infer the disaster event of an image from the naming of the files and the folders containing them. Via the identified disaster events, we were able to locate and assign the nation-states in which a specific disaster event occurred to the majority of the images in the datasetFootnote 3 Footnote 4, resulting in 10,675 located test images.
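A minimal sketch of this assignment step is shown below; the event names, the event-to-country mapping and the path matching are hypothetical placeholders, since the actual mapping was compiled from the events listed in the MEDIC publication.

```python
# Sketch of the country-assignment step: infer the disaster event from the
# file and folder naming and map it to the nation-state in which it occurred.
# The event keys and the mapping are hypothetical placeholders.
from typing import Optional

EVENT_TO_COUNTRY = {
    "ecuador_earthquake": "Ecuador",
    "hurricane_harvey": "United States",
    "nepal_earthquake": "Nepal",
    # ... remaining events omitted
}

def locate_image(image_path: str) -> Optional[str]:
    """Return the nation-state for an image, or None if it is not locatable."""
    haystack = image_path.lower()
    for event, country in EVENT_TO_COUNTRY.items():
        if event in haystack:
            return country
    return None
```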

For the actual grouping of the nation-states based on the described development indicators (HDI, DI and IDI), we use dimensionality reduction, specifically Principal Component Analysis (PCA), to find the directions of maximal variance in the dataFootnote 5. The PCA revealed that 97% of the variance in the data is explained by the first two principal components (PCs)Footnote 6. Based on the first two PCs, we divided the nation-states into three groups as indicated in Fig. 3.
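The following sketch illustrates this grouping step under stated assumptions: the indicator values are placeholders, and the k-means step is only one possible way of turning the two-dimensional PCA projection into three groups (the grouping shown in Fig. 3 was derived from the first two PCs directly).

```python
# Sketch: standardize the three development indicators, project them onto the
# first two principal components and derive three country groups. Indicator
# values are placeholders; k-means is one possible way to form the groups.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

indicators = pd.DataFrame(
    {"HDI": [0.92, 0.64, 0.45], "DI": [8.6, 6.7, 3.1], "IDI": [8.2, 3.0, 2.9]},
    index=["Canada", "India", "Yemen"],  # placeholder values, not the real data
)

X = StandardScaler().fit_transform(indicators)
pca = PCA(n_components=2)
pcs = pca.fit_transform(X)
print("variance explained by PC1+PC2:", pca.explained_variance_ratio_.sum())

groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
print(dict(zip(indicators.index, groups)))
```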

As also shown in Fig. 2, group A is characterized by nation-states with very high HDI, IDI and DI. Except for Chile, it includes states of northern and western Europe and North America and is characterized by higher economic and living standards and advanced development. Group B is described by nation-states with still high HDI and medium DI. This group, however, has a lower IDI and is not distinguishable from group C with regard to this index. Group B states are economically and in terms of levels of living well developed, but show some democracy deficits and suffer significantly from the Digital Divide. The countries involved are spread over Eastern Europe, Africa, South and Southeast Asia and Central and South America. Finally, low HDI and DI are the main explanations for group C. These countries belong to the Middle East and South Asia and are developing countries. Low levels of living, political and economic instability, civil wars and autocracies characterize this group.

Fig. 2 Distributions of the development indexes used, for every sensitive group. HDI and DI show clear separations of the groups. While group A is distinct from the other groups in the IDI, groups B and C exhibit the same range of values but different medians and interquartile ranges. Group A describes advanced, developed nation-states. Group B nation-states are well advanced economically and in terms of levels of living, but have particular information technology deficits. Group C nation-states are characterized by low levels of living, economic instability, political conflicts and underdeveloped information technology

With our aim to investigate fairness in a specific application scenario, demonstrate the relevance of transnational fairness on the MEDIC dataset and validate the usefulness of development indicators, we focus on comparability with prior work on the same dataset. To that end, we align our selection of fairness metrics with the metrics used in the MEDIC publication. The authors of the MEDIC dataset report accuracy, precision, recall and hamming loss, but emphasize F1-scores for their benchmarkFootnote 7. While the F1-score can be considered a useful metric combining several aspects, such as precision and recall, in one value, it has not been adopted by the ML community for fairness metrics, as it does not offer the level of detail needed to evaluate fairness-relevant aspects. Instead of the F1-score, our fairness analysis focuses directly on precision and recall (Barocas et al., 2019). Among the most popular statistical fairness metrics (Mehrabi et al., 2022; Verma & Rubin, 2018; Caton & Haas, 2020), precision is used to measure Predictive Parity (also called the outcome test) and recall is used to measure Equal Opportunity (also called false negative error rate balance) (Caton & Haas, 2020):

Predictive Parity is met when sensitive groups have equal precision, i.e. equal probability that samples with a positive predicted label carry a positive class label. For two sensitive groups \(\mathcal {A}\) and \(\mathcal {B}\), a target variable Y, a predicted target variable \(\hat{Y}\) and a data point X, Predictive Parity is defined as:

$$\begin{aligned} P(Y = 1 \ | \ \hat{Y} = 1, \ X \in \mathcal {A})&= P(Y = 1 \ | \ \hat{Y} = 1, \ X \in \mathcal {B}) \end{aligned}$$

Equal Opportunity requires sensitive groups to have equal recall, i.e. equal probability that samples with a positive class label also receive a positive predicted label. For two sensitive groups \(\mathcal {A}\) and \(\mathcal {B}\), Equal Opportunity is defined as:

$$\begin{aligned} P(\hat{Y} = 1 \ | \ Y = 1, \ X \in \mathcal {A})&= P(\hat{Y} = 1 \ | \ Y = 1, \ X \in \mathcal {B}) \end{aligned}$$

In the context of disaster response with the MEDIC dataset, Predictive Parity means, for instance, that samples from all sensitive groups have an equal probability that, if they are predicted to be earthquakes, they are actually labeled earthquakes. Predictive Parity also means that samples predicted to represent non-disaster or rescue volunteering are equally likely to be labeled correctly, regardless of their sensitive groupFootnote 8. In other words, Predictive Parity means that the predictions for all sensitive groups have equal chances of triggering a false alarm. Equal Opportunity in this context implies, for example, that samples from all sensitive groups have an equal probability of being predicted as earthquake, not-a-disaster or rescue volunteering if they actually carry that label. In other words, Equal Opportunity is satisfied if all sensitive groups have the same probability that an actual class goes undetected (misses). We consider missing an actual disaster event and confusing different disaster characteristics (false alarms) as the most severe scenarios in an automated disaster response system.
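To illustrate how these two criteria translate into the per-group measurements reported below, the following is a minimal sketch assuming binary labels and predictions for one subclass together with a group label per test image; all variable names are illustrative.

```python
# Minimal sketch: per-group precision (Predictive Parity) and recall
# (Equal Opportunity) for one binary subclass, plus the largest gap between
# groups. Variable names are illustrative.
import numpy as np

def group_gaps(y_true, y_pred, groups):
    """Per-group precision/recall and the largest pairwise gaps."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    precision, recall = {}, {}
    for g in np.unique(groups):
        m = groups == g
        tp = np.sum((y_true[m] == 1) & (y_pred[m] == 1))
        fp = np.sum((y_true[m] == 0) & (y_pred[m] == 1))
        fn = np.sum((y_true[m] == 1) & (y_pred[m] == 0))
        precision[g] = tp / (tp + fp) if (tp + fp) > 0 else np.nan
        recall[g] = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    pp_gap = np.nanmax(list(precision.values())) - np.nanmin(list(precision.values()))
    eo_gap = np.nanmax(list(recall.values())) - np.nanmin(list(recall.values()))
    return precision, recall, pp_gap, eo_gap

# Example: precision, recall, pp, eo = group_gaps(y_true, y_pred, country_group)
```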

3.3 Transnational Fairness Infringements in Disaster Response Management based on Development Indicators

For assessing the fairness of automated disaster classification, we reproduced the results of the MEDIC paper using the ResNet-18 architecture (He et al., 2015) with the setup and hyperparameters the authors provided in their publicationFootnote 9. Note that we did not aim at improving the classification performance of the published model. It is likely that higher overall predictive performance could be reached with more advanced computer vision models. The goal of this study was to investigate a case study using a published model. We were thus interested in the relative differences in predictive performance rather than in optimizing the published model.

In order to conduct our fairness investigations, we grouped the data into three sensitive groups based on PCA of the selected development indicators of all involved nation-states, as described in Sect. 3.2 and shown in Table 1 and Fig. 3. Besides the group results, we also report images which are not locatable, as well as the ungrouped results, which contain all grouped and not locatable images (see Table 3).

Table 1 Nation-states of the fairness assessment in the disaster response use-case and their development indicators

Fig. 3 Nation-states in the MEDIC dataset grouped into three sensitive groups (A, B, C) with Principal Component Analysis (PCA) on the Human Development Index (HDI) (UNDP, 1990), the Information and Communication Technologies Development Index (IDI) (ITU, 2017) and the Democracy Index (DI) (EIU, 2021). The first (PC1) and second principal component (PC2) explain ca. 97% of the variance of the three development indicators used. A (blue): Canada, United States (U.S.), United Kingdom (U.K.), Italy, Greece, Ireland, Chile, France, Finland. B (red): India, Ecuador, Ukraine, Sri Lanka, Philippines, Ghana, Mexico. C (green): Iraq, Syria, Iran, Nepal, Yemen. Development indicators are dated 2017, in accordance with the latest disaster events in the MEDIC dataset

For evaluating the grouped classification performance, we use precision and recall as metrics, as described in Sect. 3.2. For the evaluation we used the same test dataset as in the publication, containing 15,688 images. The overall results are quite diverse, and fairness issues are heterogeneously spread across tasks, subtasks and sensitive groups. In line with the selected performance metrics precision and recall, we report Predictive Parity and Equal Opportunity infringements, respectively.

Our results demonstrate that the image classifier trained on the MEDIC dataset achieves different precision and recall values across nation-state groups. While the predictive performance on the ungrouped data at subtask level and even on the grouped data at task level (see Tables 2 and 3) appears sound, with classification metrics varying by only 2 to 5 percentage points across the sensitive groups for the four tasks, inspecting the predictive performance of the subcategories of each task reveals significant contrasts between the sensitive groups (see Table 3). Figure 4 shows the fairness infringements for each subtask. Referring to Predictive Parity, bias is most prominent for the subtasks earthquake, fire, flood, hurricane and other disaster in the task disaster types, with deviations of up to 58, 46, 43, 40 and 46 percentage points in precision between sensitive groups. Bias is also prominent in the task humanitarian in the subtasks affected injured and dead people and rescue volunteering and donation effort, with up to 42 and 39 percentage points. Further fairness issues are evident in the damage severity task in the subtasks mild and severe, with differences of up to 26 and 15 percentage points, and in the informative task, with a deviation of 9 percentage points in the informative subtask. Equal Opportunity infringements show a similar pattern throughout the subtasks and sensitive groups. While significant fairness infringements are also observable here, they are less pronounced than the bias measured by Predictive Parity. The disaster type subtasks earthquake, fire, flood, hurricane and other disaster deviate by 26, 38, 27, 37 and 8 percentage points in recall. Affected injured and dead people and rescue volunteering and donation effort in the humanitarian task diverge by 37 and 17 percentage points between the groups, while in the damage severity task mild and severe show variations of 5% and 9%.Footnote 10

Viewing the fairness infringements from a group perspective (see Fig. 4), the results show that group A is especially disadvantaged in the earthquake, other disaster and injured or dead subtasks. Less drastic but still significant differences are observable in the flood, informative, rescue volunteering and donation effort, mild and severe subtasks. Equal Opportunity infringements are less pronounced for the earthquake subtask and barely evident or even irrelevant for the flood, other disaster, informative, rescue volunteering and donation effort and mild subtasks. For group B, the results show strong disadvantages regarding fire, other disaster and injured or dead detection, but apparent shortcomings are also observable in the hurricane and severe subtasks. Equal Opportunity infringements in this group deviate only slightly from the Predictive Parity infringements, showing a very similar pattern. Strong bias for group C is evident in classifying fire, flood, hurricane, rescue or donation efforts and mild damage severity, while this group does not show any less pronounced biases. As in group A, we also observe that Equal Opportunity infringements are less distinct than Predictive Parity infringements.

Fig. 4 Predictive Parity and Equal Opportunity infringements for each subcategory of all four classification tasks. For each subcategory (row), the differences in precision (Predictive Parity) or recall (Equal Opportunity) relative to the best performing group are calculated. While the best performing nation-state group per subcategory (row) has a value of 0 (dark blue), the infringements for this subcategory are indicated by higher numbers (up to 1) or brighter colors (yellow). There is no systematic discrimination apparent for one specific sensitive group; the development indicators rather reflect the different ways disasters are documented and reported on social media by different nation-state groups at the subtask level. The most severe disadvantages for group A concern earthquake, other disaster and injured or dead people, for group B fire, other disaster and injured or dead people, and for group C fire, flood, hurricane, rescue and donation effort and mild damage severity. While both fairness metrics show significant infringements, Predictive Parity is impacted more severely. Further details and implications are discussed in Sects. 3.3, 3.4 and 3.5
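A sketch of how such an infringement matrix can be computed from per-group scores is given below; `scores` is an assumed DataFrame with subclasses as rows and the sensitive groups A, B, C as columns, and the numbers in the example are placeholders, not our results.

```python
# Sketch of the Fig. 4 computation: for each subclass (row), the difference of
# each group's precision or recall to the best performing group, so that the
# best group scores 0 and larger values indicate stronger infringements.
import pandas as pd

def infringement_matrix(scores: pd.DataFrame) -> pd.DataFrame:
    """Row-wise difference to the best performing group (best group = 0)."""
    return scores.rsub(scores.max(axis=1), axis=0)

# Placeholder example (not the reported results):
prec = pd.DataFrame(
    {"A": [0.35, 0.80], "B": [0.90, 0.62], "C": [0.75, 0.41]},
    index=["earthquake", "fire"],
)
print(infringement_matrix(prec).round(2))
```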

Table 2 Classification metrics (class frequency weighted precision and recall) of the reproduced classifier of all disaster detection tasks (T1–T4) across different sensitive groups (A, B, C), the ungrouped data (ungr.) and not locatable samples (n.l.) retrieved from the test dataset
Table 3 Results of classification metrics and number of samples in each country group for each disaster classification task and subtask, computed on the test dataset with 15,688 images

3.4 Data Set Imbalances Correlate with Fairness Infringements

In order to investigate potential causes for these differences in predictive performance across nation-state groups, we examined whether the number of training examples is correlated with predictive performance. Indeed, we found that the sample sizes of individual subtasks and sensitive groups are often strongly correlated with predictive performance (see Table 3). But in some cases a sensitive group with fewer samples performs equally well or better than another sensitive group with a higher number of images, as in the case of earthquakes for groups B and C. For Predictive Parity infringements, the confusion matrices also reveal explanations to some extent: deviations in precision are mostly explained by the large number of negative samples in each task. There is a considerable class imbalance, with 49% to 71% negative samples as opposed to each single positive subclass. This circumstance contributes to the major proportion of false positives for each subclass. The precision value, which is the basis for Predictive Parity, is highly influenced by this fact. Especially in cases where sensitive groups contain only a small number of samples for a subclass compared to other groups, they easily suffer from lower precision and hence from Predictive Parity issues. In consequence, differently balanced subtasks among the sensitive groups appear to be the main cause of Predictive Parity issues. This aspect points towards biases in the sampling process of the MEDIC dataset with respect to the different sensitive groups.

In contrast to precision, the true positive rate or recall is invariant to false positives caused by class imbalance. Consequently, Equal Opportunity issues, which are based on recall, cannot be explained by class imbalance across different subtasks (as is the case for Predictive Parity). Recall, the basis of Equal Opportunity, depends only on the frequency and composition of false negatives, regardless of the sample sizes of any other subclasses. Differences in recall between two sensitive groups can result from two, possibly intertwined, reasons: first, the images of one sensitive group may be of different quality (or have different features) and thus be easier or more difficult to learn. If this is paired with imbalances of samples between different sensitive groups (group imbalance), the difficulty of learning a subclass becomes twofold. Consequently, Equal Opportunity issues point both towards qualitative differences in subtask samples and towards biases in the sampling process of the MEDIC dataset with respect to the different sensitive groups.

When interpreting differences in predictive performance, and consequently differences in fairness metrics across groups, it is important to consider their different causes. Ideally, the predictive performance and fairness measures of any ADM system should be invariant to the frequency of a disaster, just as we expect an ADM system for disease detection to perform equally well across socio-demographic groups regardless of the varying frequency of diseases in those groups. This non-utilitarian notion of fairness is particularly favored in ML applications concerning health and human lives (Hertweck et al., 2021), aligning often with civil or human rights principles (Friedler et al., 2021) and, in our case, also with the global justice perspective outlined in Sect. 2.

In practice, however, there are many factors that lead to sampling biases and subsequently to differences in fairness metrics - and not all of those factors can be altered or should be considered as discrimination. Due to the different geographic locations of the countries in the respective groups, some disaster types are more likely for some groups than for others. For instance, earthquakes are less likely to occur in group A countries (although this group contains Chile, Greece and Italy, where earthquakes are not unlikely) than in other country groups, while hurricanes (or tropical cyclones) are less likely to occur in group C countries.

Other causes of class imbalance between country groups could include less coverage and fewer social media posts for some country groups, which we discuss in Sect. 3.5. These effects could be seen as discrimination and could indeed be addressed with appropriate countermeasures, as they highlight structural biases and systemic issues between socio-demographic groups. Interpretations of differences in fairness should differentiate between these causes.

Next to these effects, there are also statistical aspects related to class imbalance that should be considered when interpreting fairness metrics. Most importantly, we emphasize that sampling biases, whether originating from structural discrimination or from geographical factors, can lead to apparent differences in predictive performance and fairness metrics across groups even if a classifier actually has the same predictive performance across groups. One example is that of Predictive Parity, or differences in precision across groups: changing the ratio of actually positive to actually negative examples, in other words changing the class imbalance, will lead to a change in precision even if the performance of the classifier is constant (Williams, 2021). Simply put, the lower the ratio of positive to negative instances in a category, the lower precision and F1 will be for that category, if the false and true positive rates, which characterize the actual predictive performance, are constant. Assuming that the ratio of positive instances within each country group is the same in the training and test data, this does not affect the interpretation of Predictive Parity within a country group for a given classification category. But we emphasize that interpretations that evaluate relative differences in Predictive Parity across country groups are difficult if the number of positive instances in a classification category varies.
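To make this dependence explicit, precision can be written as a function of the positive-class prevalence \(\pi = P(Y = 1)\) within a group - a standard identity, not specific to the MEDIC setup:

$$\begin{aligned} \text {Precision} = \frac{\text {TPR} \cdot \pi }{\text {TPR} \cdot \pi + \text {FPR} \cdot (1 - \pi )} \end{aligned}$$

For example, with a fixed TPR of 0.8 and FPR of 0.1, a prevalence of \(\pi = 0.3\) yields a precision of about 0.77, whereas \(\pi = 0.05\) yields only about 0.30, even though the classifier behaves identically in both cases.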

3.5 Discussion: Discrimination of Nation-States Depends on Disaster Classification Task and Subtask

Interestingly, there appears to be no single underprivileged group of countries in terms of disaster classification metrics. In other words, we do not find evidence for systematic unfairness against one group of nation-states. The results suggest that the predictive performance of the classifier is biased towards different nation-state groups depending on the disaster classification task and its subtasks. Low sample sizes are often correlated with low prediction scores (as shown in Table 3 and Sect. 3.4), but this representation bias does not explain all shortcomings. Qualitative differences in disaster depictions between the sensitive groups also seem relevant, as described in Sect. 3.4. With regard to the development indicators used, this means that lower index values do not translate into being disadvantaged by the analyzed classifier. Groupings by development indicators rather reflect the different ways in which disasters are documented and reported on social media by different groups: advanced, developed nation-states according to the selected development indicators (group A) are disadvantaged within the system regarding the detection of the disaster types earthquake and other disaster, but also regarding injuries or casualties. Well-developed nation-states (by the used development indicators) with democracy and information technology deficits (group B) are disadvantaged regarding the fire and other disaster disaster types and also regarding injuries and casualties. Developing countries and nation-states with economic and political instabilities or even civil wars and autocracies (group C) suffer from shortcomings in detecting the fire, flood and hurricane disaster types and in recognizing humanitarian efforts and milder infrastructural damages.

While our approach to assessing transnational fairness in ML focuses on the technical system, it needs to be acknowledged that the basis for unfair results of an ML system also lies beyond the technical system. Our results suggest that it is important to consider how data-producing infrastructures and the availability and quality of datasets influence differences in the predictive performance of ML solutions deployed worldwide. Using the ICT Development Index as a basis for the fairness assessments already reflects possibly discriminatory differences in the ways telecommunication and digital media infrastructures differ across groups of countries. But we need to acknowledge that possible bias in transnationally applied ML systems might be rooted in more subtle differences across countries that cannot be accounted for by development indicators, but which might still matter as potentially discriminatory attributes.

The DRM system our study is based on relies on social media data (images) posted by users after a disaster event. Such images are then processed by the system to classify essential information about the kinds of disaster events that have happened, in order to coordinate and support humanitarian aid. It is plausible to assume that differences in telecommunication infrastructures (Zorn & Shamseldin, 2015; ITU, 2021) and different ways and cultural habits of using social media (Anduiza et al., 2012; Hasebrink et al., 2015; Kleis Nielsen & Schrøder, 2014; Plantin & Punathambekar, 2019) across the globe, as well as the different technical set-ups and affordances (Hutchby, 2001) of social media platforms, will lead to differences in the kinds of data that are produced across the globe after a disaster event.

These differences might matter for the processing of the ML application and consequently for its outputs. It is thus misleading to think of an algorithm as deciding impartially as long as the technical problems of unfairness are fixed. Instead, it is essential to acknowledge and account for the fact that bias might lie outside of the technical system, for instance in the way different platforms lead to different content, the culturally distinct ways in which people appropriate social media platforms, and the culturally specific ways in which certain types of social media content are customarily produced in some countries compared to others. It is essential to further push an interdisciplinary perspective combining perspectives on the technical questions (ML, data analytics) with approaches that consider how social media and social media production work (digital media studies, media sociology).

4 Outlook: Transnational Algorithmic Fairness Beyond the Technical System

Automated decision-making has become ubiquitous, and the use of ML technology is already impacting societies on a transnational scale. These systems carry the risk of amplifying existing biases, further marginalizing less privileged groups and aggravating global injustices. In order to analyze and alleviate these effects, we argue that existing notions of fairness should be complemented by transnational concepts of fairness that are able to capture the global justice implications of ADM. Transnational algorithmic fairness assessments can then build the basis for critical interdisciplinary investigations into algorithmic justice implications.

We proposed a grouping of nation-states based on development indicators and demonstrated that state-of-the-art computer vision methods for DRM exhibit substantial differences in predictive performance across groups of nation-states. Such differences could ultimately impact the availability and speed of disaster response - or, considering the broad range of algorithmic applications on a transnational scale, lead to many other adverse impacts on certain populations.

Examining the potential causes of these algorithmic biases, we find that data availability often influences predictive performance. This effect can be associated with some of the factors captured in the development indicators for access to information technology infrastructure and highlights the importance of trustworthy and reliable data for measuring existing biases in globally deployed ML solutions.

While we have presented a case study in DRM to demonstrate the relevance of a transnational approach to algorithmic fairness, further reflections on the normative or ethical grounds for moral assessments of discrimination on a transnational scale, on solutions to the observed fairness issues, and on accountability and transparency for transnationally applied algorithmic systems are out of the scope of this article and are left to future research. More conceptual work and empirical analyses of transnational fairness in ML are thus urgently needed - and should align with emerging discussions on matters of global justice in the context of ML applications (Birhane, 2020; Birhane et al., 2022; Mohamed et al., 2020; Png, 2022).

From a global justice perspective, it will be essential to question the availability of data and the underlying causes of global justice implications already within the data-producing infrastructures. Here, investigations into transnational algorithmic fairness can benefit from discussions on data justice, critical data studies or data colonialism (Taylor, 2017; Dencik et al., 2022; Couldry & Mejias, 2019; Iliadis & Russo, 2016). Building on investigations into transnational algorithmic fairness as presented in this case study, further critical investigations into the processes, practices and capitalist market structures of data production can provide profound insights into how data-extracting media technologies contribute to manifestations of injustice through algorithmic (un)fairness. Such research could contribute towards reflecting on algorithmic fairness beyond the technical system.

The presented case study relies on development indicators, which come with their own limitations. While economic indicators (such as the gross domestic product) can be defined and measured relatively easily, others, such as social indicators, are more difficult to define and could be considered as proxies of underlying social phenomena (McGranahan, 1972). For instance, life expectancy reflects not only medical services but also other variables like literacy, housing conditions, diet, income, etc. (McGranahan, 1972). Other essential limitations include that the negotiation and constitution of such indicators reflect power asymmetries between nation-states and global injustices in global governance frameworks and international bodies (Moellendorf, 2009; Caney, 2006; Jongen & Scholte, 2022; Reus-Smit & Zarakol, 2023). Our reliance on such indicators, despite their limitations, is grounded in matters of data availability. Future investigations into transnational algorithmic fairness should consider the limitations of available datasets and test new ways of grouping populations for assessments of transnational fairness. But even though acknowledging such limitations is essential, it should not prevent conceptual and empirical work on transnational algorithmic fairness. Instead, more conceptual work is required here, beyond indicators stemming from a global governance framework that has to deal with its own justice implications.

Further research should focus especially on three aspects of transnational forms of algorithmic (un)fairness: fairness implications in relation to the technical system, in the organizational embedding of such transnationally applied systems, and in the production of data and their underlying infrastructures (including people's media and data-producing practices) for transnationally applied systems to process. An interdisciplinary perspective on transnational algorithmic (un)fairness is thus indispensable, reflecting not only ML research on fairness but equally science and technology studies-inspired approaches to the social shaping of technologies (Bijker et al., 1987; Bijker & Law, 1994), as well as research into media use and appropriation as central ways in which data becomes produced and for what purposes - and into how this, together with the relevance of global digital divides (Hargittai, 2003; Fuchs & Horak, 2008), might equally manifest global injustices in algorithmic fairness from a global justice perspective.

In summary, more work on transnational fairness needs to evolve in order to develop mitigation strategies and to reflect in conceptually and empirically sound ways on global injustices relating to transnational forms of algorithmic (un)fairness. Such research could then undergird critical reflections on ADM at a time when ML applications are increasingly applied as assistive and strategic technologies.