
Modeling influence in online social networks


With this post, we are starting a new series in which our long-standing alumni present their projects. In this first post, we welcome our former swapshop and project leader, as well as former organizer – Matija Piskorec. He did his PhD in Computer Science at the Faculty of Electrical Engineering and Computing at the University of Zagreb, jointly with the Ruđer Bošković Institute.

Let’s welcome Matija and learn something about his project!

The Editorial Team

Instead of an introduction

Matija Piskorec
PhD Computer Science
Ruđer Bošković Institute

2012 S3++ project leader
2011 S3 organizer
2010 S3 organizer
2009 S3 swapshop leader
2008 S3 project leader

The topic of my PhD research, which I started in 2014, was the development of a statistical method for estimating endogenous (internal/peer/social) and exogenous (external) influences in online social networks. Rather than giving you a whole exposé of my PhD research, I will outline its context, motivation and some final results. There will be no mathematical details, only outlines of some of the technical concepts. Interested readers can find more information in the original journal paper published in IEEE Access in 2019 [1], a conference paper presented at the Complex Networks conference in 2017 [2], and my PhD thesis [3].

The rise of online social networks

Although online social networks are not a new phenomenon – the first ones started appearing in the mid-2000s – we can rightly claim that the 2010s were the decade of online social networks, as the number of users tripled in that period, from one billion to more than three billion. For context, out of the five most valuable global companies in 2009 (PetroChina, Exxon Mobil, Microsoft, ICBC, Wal-Mart) only one (Microsoft) was an information technology company, and the most valuable by market capitalization (PetroChina) was worth around 350 billion USD. In comparison, the five most valuable global companies in 2019 (Microsoft, Amazon, Apple, Alphabet, Facebook) are all information technology companies, with the most valuable by market capitalization (Microsoft) worth more than a trillion USD.

How do these companies make money, considering that many of their services are free for the end user – think Google’s search engine (Alphabet is Google’s parent company), or Facebook’s social network? If you aren’t sure about the answer, you’re not alone – even policymakers responsible for government regulation are clueless sometimes [4].

Mark Zuckerberg (CEO of Facebook) testifies in front of the US Senate committee. Senator Orrin Hatch: “So, how do you sustain a business model in which users don’t pay for your services?” Mark Zuckerberg: “Senator, we run ads.”

In addition to information technology services for which they charge their customers – Microsoft, Amazon and Google all run successful cloud infrastructures where people can rent computing power on their servers – a significant source of revenue for these companies is the monetization of data on their users in one way or another. The users themselves become their most valuable product, and data on user behavior becomes their most valuable commodity, the 21st-century equivalent of oil, which fuels the information economy [5].

The question

The central question of my PhD research was: “How much can we know about users using only their digital traces?” In general, the answer is: probably much more than we suspect, considering that user activity on online social networks is used for quite effective targeted advertising, which brings in the majority of revenue for major online social network companies. Among the characteristics that can be easily inferred from just the text that users post publicly is the user’s psychological profile [6].

However, my PhD research (as all PhDs do, after all!) deals with only one very specific question. Imagine that we are given a social network represented as a mathematical graph where the nodes are the users of an online social network service and the edges are some kind of “friendship” connections between them.

A simple representation of a social network where nodes are users and the edges are social connections between them. Nodes are annotated with the activation times which represent some kind of process within the social network, like information propagation, and together they make an activation cascade.

I’m putting “friendship” in quotes here because its exact form will vary from service to service – Facebook friendships are typically undirected (both users have to approve the friendship), while Twitter’s are typically directed because of the follower relation (in most cases any user can follow any other without explicit approval). In my work I assume undirected social connections.

Furthermore, let’s assume we have an activation time for each user. Again, “activation” here is a pretty general concept; it can mean being exposed to a particular piece of information, like a text status, a link to external content, images or videos. A more specific question we can ask now is “Can we infer for what reason individual users activated?” Again, this question is too general: it gives us too much freedom in hypothesizing various possible causes and mechanisms of activation. So let’s narrow it down by hypothesizing that the activation of users can be due to two influences:

  • Endogenous influence – a kind of social influence which happens due to social interaction between users
  • Exogenous influence – a kind of external influence which originates from outside the social network through other channels of communication

How do we differentiate between these two influences, given that we only have a social network between users and activation times of each user (an activation cascade)? If we observe that a particular user is activated at a given time, how would we differentiate which of these two influences is a likely cause for their activation? A simple line of reasoning would be to expect that users who have a lot of already active friends are more likely to be activated due to their friends’ social influence, while those that have few (or none) are more likely to have been activated by an external influence.

A simple example of a social network where two users (blue nodes labeled as 1 and 2) are activated at a certain time. The activation of user 1 (on the left) is more easily explained by the social influence hypothesis because three of their friends are already active (red nodes), unlike the activation of user 2 (on the right), who has only one active friend.
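
The simple line of reasoning above can be sketched in a few lines of Python. This is only an illustration of the intuition, not the inference method from the paper; the toy network, the user names and the activation times are all made up for the example.

```python
# Hypothetical toy network: undirected friendships stored as an adjacency dict.
friends = {
    "u1": {"a", "b", "c"},
    "u2": {"c", "d"},
    "a": {"u1"},
    "b": {"u1"},
    "c": {"u1", "u2"},
    "d": {"u2"},
}

# Hypothetical activation cascade: user -> activation time.
# User "d" never activates, so it has no entry.
activation_time = {"a": 1, "b": 2, "c": 3, "u1": 5, "u2": 6}


def active_friends_at_activation(user):
    """Count the friends who were already active when `user` activated."""
    t = activation_time[user]
    return sum(
        1
        for friend in friends[user]
        if friend in activation_time and activation_time[friend] < t
    )


counts = {user: active_friends_at_activation(user) for user in activation_time}
print(counts["u1"], counts["u2"])
```

Here user `u1` activates with three already active friends and user `u2` with only one, so under this heuristic social influence is the more plausible explanation for `u1`. The count alone cannot settle the question, which is why a statistical model is needed.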

This is not an entirely new problem and many approaches already exist in the literature, but each has some kind of limitation. Some require information on the external influence beforehand (for example, the number of potential external sources, whatever they might be) [7]. Others require many activation cascades, a requirement which is sometimes impossible to satisfy [8], or are theoretically sound but lack an actual inference method which would allow us to infer these two influences from data [9].

So I set off to develop a method of inferring these two influences from empirical data on user behavior, which would ideally work on just a single activation cascade and where no particular information on external sources is needed. But first, I needed data to which I would eventually apply my new method!

The data

There are many challenges in collecting online social network data, ranging from methodological (can we collect a representative sample of the population of interest?), to technical (whether to conduct an online survey or to use application programming interfaces to collect data automatically), to ethical (how to preserve users’ privacy) [10]. For my research, I used data collected from online political survey applications which used Facebook’s API to access information on Facebook users. Facebook users could register on the survey, cast their votes for the upcoming elections, and see voting statistics for their Facebook friends who completed the same survey. The anonymity of individual users was preserved, so you couldn’t know how each of your friends voted on the survey.

Facebook friendship networks used in the estimation of exogenous and endogenous influence. Users are colored based on their vote on the survey application. It is not coincidental that users with similar political preferences are grouped together – the network layout algorithm tries to cluster users who have more friendship connections between themselves. In sociology this phenomenon is called homophily – people are more likely to associate (in this context through a Facebook friendship) with other people who are somehow similar to them.

Through these applications, I also had access to the exact times when users registered on the online survey applications – the registration times for each user thus created an activation cascade. I also collected the times when major online news sources (Croatian news portals such as jutarnji.hr and vecernji.hr for example) reported on the survey. As you can imagine, the news cycle around the elections is hungry for any hint of the potential outcome, and this helped generate interest and attract new users to the survey. This news coverage was crucial for spreading the word on the surveys, and we can observe that each news announcement is followed by a sharp peak in user registrations.

How many users registered (in half-hour intervals) on each of the three online survey applications. Times when major online news portals reported on the survey applications are annotated with vertical lines. We can observe that they are usually followed by a sharp increase in user registrations – not surprising considering that they usually contain a direct link to the survey web page!
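
As a rough sketch of how such a plot could be prepared, the snippet below counts registrations in half-hour intervals and sums the registrations in the hour following a news report. The timestamps are invented for illustration and the helper names (`bin_start`, `registrations_per_bin`, `count_after`) are hypothetical; this is not the code used to produce the original figures.

```python
from collections import Counter
from datetime import datetime, timedelta

BIN = timedelta(minutes=30)


def bin_start(ts):
    """Round a timestamp down to the start of its half-hour interval."""
    return ts - timedelta(
        minutes=ts.minute % 30, seconds=ts.second, microseconds=ts.microsecond
    )


def registrations_per_bin(times):
    """Map each half-hour interval to the number of registrations in it."""
    return Counter(bin_start(t) for t in times)


def count_after(counts, event, n_bins=2):
    """Registrations in the n_bins intervals starting with the event's own."""
    start = bin_start(event)
    return sum(counts.get(start + i * BIN, 0) for i in range(n_bins))


# Made-up registration times and one made-up news report time.
registrations = [
    datetime(2020, 1, 1, 9, 5),
    datetime(2020, 1, 1, 9, 50),
    datetime(2020, 1, 1, 10, 2),
    datetime(2020, 1, 1, 10, 10),
    datetime(2020, 1, 1, 10, 25),
    datetime(2020, 1, 1, 10, 40),
    datetime(2020, 1, 1, 12, 15),
]
news_report = datetime(2020, 1, 1, 10, 0)

counts = registrations_per_bin(registrations)
print(count_after(counts, news_report))
```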

Having collected data on the social network and the activation times of users, I could finally use my inference method, which I’d developed in the meantime, on actual social network data.

The inference

The inference of the two influences from data – endogenous (social) and exogenous (external) – requires us to hypothesize a specific model for social influence. There are several options here, some of which are inspired by epidemiological models, where social influence spreads literally like a virus from user to user – a classic model is called Susceptible-Infected (SI). There are more elaborate options as well. For example, you can incorporate a decay of influence, meaning that users’ eagerness to propagate the information reduces over time. The external influence is modelled with very few assumptions – basically everything that cannot be explained as a social influence will be declared as external. This includes actual external news sources, but also social influence from other forms of social interaction (for example, old-fashioned word-of-mouth communication).
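
To make the SI idea more concrete, here is a minimal simulation sketch. It assumes a simplified discrete-time variant in which every already active friend independently activates a user with probability `alpha` per time step, and an external source does so with probability `beta` regardless of the user’s friends. The parameter names and this exact functional form are assumptions made for the example, not the formulation from the paper.

```python
import random


def activation_probability(k_active, alpha, beta):
    """Chance that an inactive user activates during one time step.

    Assumed form: each of the k_active friends succeeds independently with
    probability alpha (endogenous influence), and an external source succeeds
    with probability beta (exogenous influence).
    """
    return 1.0 - (1.0 - beta) * (1.0 - alpha) ** k_active


def simulate_cascade(friends, alpha, beta, steps, seed=0):
    """Return a dict user -> activation time for one simulated cascade."""
    rng = random.Random(seed)
    activation_time = {}
    for t in range(1, steps + 1):
        # Snapshot: users activated at step t influence others from t + 1 on.
        active = set(activation_time)
        for user in friends:
            if user in active:
                continue
            k = sum(1 for friend in friends[user] if friend in active)
            if rng.random() < activation_probability(k, alpha, beta):
                activation_time[user] = t
    return activation_time


# Hypothetical toy network: ten users arranged in a ring.
n = 10
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
cascade = simulate_cascade(ring, alpha=0.3, beta=0.05, steps=20)
print(sorted(cascade.items(), key=lambda item: item[1]))
```

Note that with `alpha` set to zero the network plays no role at all, and with `beta` set to zero nobody can ever become the first active user, which is one way to see why both influences are needed to describe a real cascade.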

Maximum likelihood inference of endogenous and exogenous influence in a simple simulated activation cascade (left figure) where social influence follows the Susceptible-Infected (SI) model. The likelihood function for two distinct moments is shown in the center and right figures – its maximum designates the most likely magnitudes for endogenous (horizontal axis) and exogenous (vertical axis) influence. The magnitude of exogenous influence is larger at time 21 than at time 50, which is indicated by the position of the maximum. The shape of the likelihood function at time 21 and time 50 is different as well – in the second case, there is much less data (in terms of the number of activated nodes) available for inference, so the function is more spread out, indicating there is more uncertainty in the estimates.

The inference method itself – the process of inferring the relative strength of the two influences from data – is achieved through a likelihood function which identifies the most likely relative strength of the two influences given observed data (the social network and the activation times of users). Crucial assumptions of the inference method are that social influence is dependent (in a very quantifiable way, governed by the social influence models mentioned above) on the local social neighborhood of the user, while external influence is independent of it. Because of these assumptions, the inference itself is very efficient, requiring only a single activation cascade.
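
A likelihood-based estimate of this kind can be sketched under strong simplifying assumptions. The snippet below assumes a discrete-time model in which an inactive user with `k` already active friends activates in a given step with probability `1 - (1 - beta) * (1 - alpha) ** k`, and it finds the most likely pair (`alpha`, `beta`) by a plain grid search over a single cascade. Both the model and the grid search are stand-ins chosen for brevity, and the toy data is made up; the actual method is described in [1].

```python
import math


def log_likelihood(friends, activation_time, horizon, alpha, beta):
    """Log-likelihood of one cascade under the assumed discrete-time model."""
    total = 0.0
    for user in friends:
        t_user = activation_time.get(user)  # None if the user never activated
        last = t_user if t_user is not None else horizon
        for t in range(1, last + 1):
            # Friends who were active strictly before step t.
            k = sum(
                1
                for friend in friends[user]
                if activation_time.get(friend, horizon + 1) < t
            )
            p = 1.0 - (1.0 - beta) * (1.0 - alpha) ** k
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
            total += math.log(p) if t == t_user else math.log(1.0 - p)
    return total


def fit(friends, activation_time, horizon, grid_size=50):
    """Grid search for the (alpha, beta) pair with the highest likelihood."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(
        ((a, b) for a in grid for b in grid),
        key=lambda ab: log_likelihood(friends, activation_time, horizon, *ab),
    )


# Hypothetical cascade on a chain a - b - c - d; user "d" never activates.
friends = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
activation_time = {"a": 1, "b": 2, "c": 3}
alpha_hat, beta_hat = fit(friends, activation_time, horizon=5)
print(alpha_hat, beta_hat)
```

The social term depends on the user’s neighborhood through `k`, while the external term does not, which mirrors the crucial assumption stated above. On a cascade this small the estimates are very uncertain, so the printed values should not be read as meaningful.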

Inferring relative magnitudes of endogenous (blue line) and exogenous (red line) influences which (hypothetically!) drive the registration of users on the three online survey applications. The inference method actually gives an estimate for each user individually, and these estimates are aggregated in these figures to show a general trend. We observe that exogenous influence dominates, which is not surprising given that many users visited the survey application by following a link on external news sources.

The fully probabilistic approach to the inference gives us an opportunity to infer many other parameters of interest – including the influence of individual users or groups of users and their contribution to information propagation, as well as the susceptibility of individual users to the external influence.

Conclusion

I hope this research will lead to a better understanding of the extent to which user activity can be manipulated by third parties. I believe that most of the present problems related to trust in online information systems, including proliferation of fake news in online reports and within social networks, as well as increased polarization in online opinion formation, can be at least partially alleviated by independent research efforts into methods and design principles that mitigate the aforementioned negative effects.

How did you like his post? Would you like to ask Matija a few questions? Feel free to leave some comments!

References

[1] Disentangling Sources of Influence in Online Social Networks

[2] Modeling Peer and External Influence in Online Social Networks: Case of 2013 Referendum in Croatia

[3] Statistical inference of exogenous and endogenous information propagation in social networks

[4] Lawmakers seem confused about what Facebook does — and how to fix it

[5] The world’s most valuable resource is no longer oil, but data

[6] Private traits and attributes are predictable from digital records of human behavior

[7] Peer and Authority Pressure in Information-Propagation Models

[8] Information Diffusion and External Influence in Networks

[9] The unified model of social influence and its application in influence maximization

[10] Bit by Bit: Social research in the digital age
