With this post, we are starting a new series in which our long-standing alumni present their projects. In the first post of this kind, we are welcoming our ex-swapshop and project leader, as well as ex-organizer – Matija Piskorec. He did his PhD in Computer Science at the Faculty of Electrical Engineering and Computing at the University of Zagreb in combination with the Ruđer Bošković Institute.
Let’s welcome Matija and learn something about his project!
The Editorial Team
Instead of an introduction
The topic of my PhD research, which I started in 2014, was the development of a statistical method for estimating endogenous (internal/peer/social) and exogenous (external) influences in online social networks. Rather than giving you a whole exposé of my PhD research, I will outline its context, motivation and some final results. There will be no mathematical details, only outlines of some of the technical concepts. Interested readers can find more information in the original journal paper published in IEEE Access in 2019 and a conference paper presented at the Complex Networks conference in 2017, as well as in my PhD thesis.
The rise of online social networks
Although online social networks are not a new phenomenon – the first ones started appearing in the mid-2000s – we can rightly claim that the 2010s were the decade of online social networks, as the number of users tripled in that period, from one billion to more than three billion. For context, out of the five most valuable global companies in 2009 (PetroChina, Exxon Mobil, Microsoft, ICBC, Wal-Mart) only one (Microsoft) was an information technology company, and the most valuable by market capitalization (PetroChina) was worth around 350 billion USD. In comparison, out of the five most valuable global companies in 2019 (Microsoft, Amazon, Apple, Alphabet, Facebook) all are information technology companies, with the most valuable by market capitalization (Microsoft) worth more than a trillion USD.
How do these companies make money, considering that many of their services are free for the end user – think Google’s search engine (Alphabet is Google’s parent company), or Facebook’s social network? If you aren’t sure about the answer, you’re not alone – even policymakers responsible for government regulation are clueless sometimes.
In addition to information technology services for which they charge their customers – Microsoft, Amazon and Google all run successful cloud infrastructures where people can rent computing power on their servers – a significant source of revenue for these companies is the monetization of data on their users in one way or another. Customers themselves become the most valuable product, and data on user behavior becomes the most valuable commodity, the 21st-century equivalent of oil, which fuels the information economy.
The central question of my PhD research was: “How much can we know about users using only their digital traces?” The general answer is: probably much more than we suspect, considering that user activity on online social networks is used for quite effective targeted advertising, which brings in the majority of revenue for major online social network companies. One of the characteristics that can be easily inferred from just the text that users post publicly is the user’s psychological profile.
However, my PhD research (as do all PhDs, after all!) deals with only one very specific question. Imagine that we are given a social network represented as a mathematical graph where the nodes are the users of an online social network service and the edges are some kind of “friendship” connections between them.
I’m putting “friendship” in quotes here because its exact form will vary from service to service – Facebook friendships are typically undirected (both users have to approve the friendship), while Twitter’s are typically directed because of the follower relation (in most cases any user can follow any other without explicit approval). In my work I assume undirected social connections.
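As a small illustration of this representation, an undirected friendship network can be stored as a map from each user to the set of their friends. This is only a sketch: the user names and the helper `build_undirected_graph` are made up for the example and are not taken from the original research code.

```python
from collections import defaultdict


def build_undirected_graph(edges):
    """Return an adjacency map {user: set of friends} for an undirected network."""
    graph = defaultdict(set)
    for u, v in edges:
        # Undirected: each friendship is recorded in both directions.
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


# Hypothetical toy network of four users.
friendships = [
    ("ana", "bruno"),
    ("bruno", "cvita"),
    ("ana", "cvita"),
    ("cvita", "dario"),
]
network = build_undirected_graph(friendships)
```

A directed, Twitter-style follower network would record only one of the two directions per edge.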
Furthermore, let’s assume we have an activation time for each user. Again, “activation” here is a pretty general concept; it can mean being exposed to a particular piece of information, like a text status, a link to external content, images or videos. A more specific question we can ask now is “Can we infer for what reason individual users activated?” Again, this question is too general: it gives us too much freedom in hypothesizing various possible causes and mechanisms of activation. So let’s narrow it down by hypothesizing that the activation of users can be due to two influences:
- Endogenous influence – a kind of social influence which happens due to social interaction between users
- Exogenous influence – a kind of external influence which originates from outside the social network through other channels of communication
How do we differentiate between these two influences, given that we only have a social network between users and activation times of each user (an activation cascade)? If we observe that a particular user is activated at a given time, how would we differentiate which of these two influences is a likely cause for their activation? A simple line of reasoning would be to expect that users who have a lot of already active friends are more likely to be activated due to their friends’ social influence, while those that have few (or none) are more likely to have been activated by an external influence.
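The simple line of reasoning above can be sketched in a few lines of code: for every user, count how many friends were already active at the moment the user activated. The toy network, the activation times and the function name are assumptions made for this example. This heuristic is only the intuition, not the inference method itself.

```python
def active_friends_at_activation(graph, activation_time):
    """For each activated user, count friends who activated strictly earlier."""
    counts = {}
    for user, t in activation_time.items():
        counts[user] = sum(
            1
            for friend in graph.get(user, ())
            if friend in activation_time and activation_time[friend] < t
        )
    return counts


# Hypothetical toy network (undirected) and activation cascade.
toy_graph = {
    "ana": {"bruno", "cvita"},
    "bruno": {"ana", "cvita"},
    "cvita": {"ana", "bruno", "dario"},
    "dario": {"cvita"},
}
toy_times = {"dario": 0.5, "ana": 1.0, "bruno": 2.0, "cvita": 3.0}

earlier_friends = active_friends_at_activation(toy_graph, toy_times)
```

Here `ana` and `dario` had no active friends when they activated, so an external cause looks more plausible for them, while `cvita` had three active friends and is a better candidate for social influence.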
This is not an entirely new problem and many approaches already exist in the literature, but each has some kind of limitation. Some require information on the external influence beforehand (for example, the number of potential external sources, whatever they might be). Others require many activation cascades, which is sometimes impossible to satisfy, or are theoretically sound but lack an actual inference method that would allow us to infer these two influences from data.
So I set off to develop a method of inferring these two influences from empirical data on user behavior, which would ideally work on just a single activation cascade and where no particular information on external sources is needed. But first, I needed data to which I would eventually apply my new method!
There are many challenges in collecting online social network data, ranging from methodological (can we collect a representative sample of the population of interest?), to technical (whether to conduct an online survey, or to use application programming interfaces to collect data automatically), to ethical (how to preserve users’ privacy). For my research, I used data collected from online political survey applications which used Facebook’s API to access information on Facebook users. Facebook users could register on the survey and cast their votes for the upcoming elections, and see voting statistics for their Facebook friends who had completed the same survey. Individual votes stayed anonymous, so you couldn’t know how each of your friends voted on the survey.
Through these applications, I also had access to the exact times when users registered on the online survey applications – the registration times for each user thus created an activation cascade. I also collected the times when major online news sources (Croatian news portals such as jutarnji.hr and vecernji.hr for example) reported on the survey. As you can imagine, the news cycle around the elections is hungry for any hint of the potential outcome, and this helped generate interest and attract new users to the survey. This news coverage was crucial for spreading the word on the surveys, and we can observe that each news announcement is followed by a sharp peak in user registrations.
Having collected data on the social network and the activation times of users, I could finally use my inference method, which I’d developed in the meantime, on actual social network data.
The inference of the two influences from data – endogenous (social) and exogenous (external) – requires us to hypothesize a specific model for social influence. There are several options here, some of which are inspired by epidemiological models, where social influence spreads literally like a virus from user to user – a classic model is called Susceptible-Infected (SI). There are more elaborate options as well. For example, you can incorporate a decay of influence, meaning that users’ eagerness to propagate the information reduces over time. The external influence is modelled with very few assumptions – basically everything that cannot be explained as a social influence will be declared as external. This includes actual external news sources, but also social influence from other forms of social interaction (for example, old-fashioned word-of-mouth communication).
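To make the idea concrete, here is a minimal discrete-time sketch of an SI-style process with an added external influence. It is a simplified stand-in, not the model from the paper: the parameters `alpha` (per-step probability of external activation) and `beta` (per-step probability that one active friend activates you), the synchronous update and the toy network are all assumptions of this example, and decay of influence is not included.

```python
import random


def simulate_cascade(graph, alpha, beta, steps, seed=0):
    """Simulate activation times under SI-style social plus external influence.

    At each step an inactive user with k already-active friends activates
    with probability 1 - (1 - alpha) * (1 - beta) ** k.
    """
    rng = random.Random(seed)
    activation_time = {}
    for t in range(1, steps + 1):
        newly_active = []
        for user in graph:
            if user in activation_time:
                continue  # SI model: once active, always active
            k = sum(1 for friend in graph[user] if friend in activation_time)
            p = 1.0 - (1.0 - alpha) * (1.0 - beta) ** k
            if rng.random() < p:
                newly_active.append(user)
        # Synchronous update: users activated at step t influence others from t + 1.
        for user in newly_active:
            activation_time[user] = t
    return activation_time


# Hypothetical path-shaped network a - b - c.
path_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

only_external = simulate_cascade(path_graph, alpha=1.0, beta=0.0, steps=3)
only_social = simulate_cascade(path_graph, alpha=0.0, beta=1.0, steps=3)
mixed = simulate_cascade(path_graph, alpha=0.2, beta=0.5, steps=10, seed=42)
```

With `alpha=1.0` everyone activates in the first step, while with `alpha=0.0` nobody ever activates, no matter how strong `beta` is, because purely social spreading needs someone to start it.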
The inference method itself – the process of inferring the relative strength of the two influences from data – is achieved through a likelihood function which identifies the most likely relative strength of the two influences given observed data (the social network and the activation times of users). Crucial assumptions of the inference method are that social influence is dependent (in a very quantifiable way, governed by the social influence models mentioned above) on the local social neighborhood of the user, while external influence is independent of it. Because of these assumptions, the inference itself is very efficient, requiring only a single activation cascade.
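The following sketch shows what likelihood-based inference from a single cascade can look like under a simple discrete-time model in which an inactive user with k active friends activates with probability 1 - (1 - alpha)(1 - beta)^k per step. The likelihood form, the parameter names `alpha` and `beta`, the grid search and the toy data are assumptions of this example and should not be read as the method from the paper.

```python
import math


def log_likelihood(graph, activation_time, alpha, beta, steps):
    """Log-likelihood of one observed cascade under the toy model."""
    eps = 1e-12  # keeps log() finite at the edges of the parameter range
    total = 0.0
    for user in graph:
        t_user = activation_time.get(user, steps + 1)  # steps + 1 = never activated
        for t in range(1, min(t_user, steps) + 1):
            # Friends who were already active before step t.
            k = sum(
                1 for friend in graph[user]
                if activation_time.get(friend, steps + 1) < t
            )
            p = 1.0 - (1.0 - alpha) * (1.0 - beta) ** k
            p = min(max(p, eps), 1.0 - eps)
            # The user either activated at step t or stayed inactive.
            total += math.log(p) if t == t_user else math.log(1.0 - p)
    return total


def fit_by_grid_search(graph, activation_time, steps, grid):
    """Return the (alpha, beta) pair on the grid with the highest likelihood."""
    candidates = ((a, b) for a in grid for b in grid)
    return max(
        candidates,
        key=lambda ab: log_likelihood(graph, activation_time, ab[0], ab[1], steps),
    )


# Hypothetical path network a - b - c with a cascade that follows the path.
path_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
chain_times = {"a": 1, "b": 2, "c": 3}
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

ll_strong_social = log_likelihood(path_graph, chain_times, 0.2, 0.8, steps=3)
ll_weak_social = log_likelihood(path_graph, chain_times, 0.2, 0.1, steps=3)
best_alpha, best_beta = fit_by_grid_search(path_graph, chain_times, steps=3, grid=grid)
```

In this toy cascade each user activates right after a friend does, so a larger `beta` explains the data better than a smaller one for the same `alpha`. A real implementation would replace the grid search with a proper optimizer and a richer influence model.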
The fully probabilistic approach to the inference gives us an opportunity to infer many other parameters of interest – including the influence of individual users or groups of users and their contribution to information propagation, as well as the susceptibility of individual users to the external influence.
I hope this research will lead to a better understanding of the extent to which user activity can be manipulated by third parties. I believe that most of the present problems related to trust in online information systems, including proliferation of fake news in online reports and within social networks, as well as increased polarization in online opinion formation, can be at least partially alleviated by independent research efforts into methods and design principles that mitigate the aforementioned negative effects.
How did you like his post? Would you like to ask Matija a few questions? Feel free to leave some comments!