PageRank high-level explanation

Alisher Bazarbay
5 min read · Aug 4, 2022

In this article, we will dive into everything you need to know about Google PageRank: its importance, how to calculate it using different methods, and a Python implementation from scratch.

Outline

  • What is PageRank
  • Intuition behind it
  • Power Iteration Method
  • Random Walk — Monte Carlo Simulation
  • Implementation & error analysis

What is PageRank?

PageRank is a ranking algorithm that evaluates the number and quality of links pointing to web pages. More generally, it evaluates the importance/significance of a node in a given graph.

Larry Page and Sergey Brin, the founders of Google, developed this algorithm and introduced it in 1998. Back then it was a breakthrough for search: for the first time, a search engine evaluated the significance of web pages. In short, PageRank aims to bring order to the web.

Intuition behind it

  • PageRank aims to objectively evaluate web pages in accordance with the subjective behaviour of users;
  • It is intuitive that the more often a web page is linked to, the more important it is likely to be for a user looking for information;
  • The algorithm models a hypothetical random user who simply goes from one page to another by following links;
  • It takes into account the significance of the referring web page: the higher the PageRank score of a page, the more significance/weight it transfers to the pages it links to.

Calculations

Power Iteration Method

Below is the original formula for page rank calculation:
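(The formula figure is not reproduced here. A standard statement of PageRank, with damping factor d — which corresponds to 1 - a in the random-jump notation used later in this article — is:)

```latex
PR(u) = \frac{1-d}{n} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

where B_u is the set of pages linking to u, L(v) is the number of out-links of v, and n is the total number of pages.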

As you can see, the formula is recursive, since a page's PR depends on the PR of other pages. The important point is that we must initialise the PageRank vector (the vector containing the PR of each node in the graph): we set every entry to the same value 1/n (where n is the number of nodes in the graph). This gives us an iterative update rule, which is why the method is iterative.

To implement this, note that at each step we can jump to a random neighbour node. We can represent this with a transition matrix P, whose entry in row i and column j is the probability of moving from node i to node j. Let's take a look at a matrix representation example.

For a given simple directed graph,

we have the following transition matrix:

Intuitively, given the initial distribution x(0) (the initial PageRank vector), the distribution after one step is simply the product of the PageRank vector x(0) and the transition matrix P:

And after t steps we have:
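(The graph and matrix figures are not reproduced here. In symbols, x(1) = x(0) P and x(t) = x(0) Pᵗ. A quick numpy check on a hypothetical 3-node directed graph — the node set and links below are an assumed example, not the article's figure:)

```python
import numpy as np

# Hypothetical 3-node directed graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
# Row i of P holds the transition probabilities out of node i
# (uniform over its out-links), so every row sums to 1.
P = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

n = P.shape[0]
x0 = np.full(n, 1.0 / n)                  # initial PageRank vector x(0)

x1 = x0 @ P                               # x(1) = x(0) P
x10 = x0 @ np.linalg.matrix_power(P, 10)  # x(t) = x(0) P^t with t = 10
```

Note that both x1 and x10 remain probability distributions (their entries sum to 1), because P is row-stochastic.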

So, the Power Iteration method:

  1. Initialise PageRank vector x(0) = {1/n, …, 1/n}
  2. Update the vector with the following rule:

where 1^T is a row vector with all entries equal to 1 and P is the transition matrix;

3. Repeat until convergence:
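(The update-rule figure is not reproduced here. Written out, with a denoting the random-jump probability used in the Random Walk section below, a common form of the damped update and its convergence criterion is:)

```latex
x^{(t+1)} = \frac{a}{n}\,\mathbf{1}^{T} + (1-a)\, x^{(t)} P,
\qquad \text{stop when } \lVert x^{(t+1)} - x^{(t)} \rVert_{1} < \varepsilon
```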

Random Walk — Monte Carlo simulations

In most cases we do not have full access to large graphs (e.g. a social network with billions of users, or the web, which is extremely large to crawl). And sampling nodes by index is obviously impossible without that full access.

We can perform a random walk on the graph with a certain number of steps to approximate statistics about the graph. Here is how:

  • We start at random node in a graph
  • At each step we:
    - with probability a, jump to a random node
    - with probability 1-a, jump to a node referenced by the current node
    - if the current node has no outgoing references, jump to a random node

Simply, taking into account the procedure from the previous method, the matrix form is:

and if we solve for x, we get:

where I is the identity matrix, this simplifies to:
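(The three equation figures around here are not reproduced. Written out, the stationary equation, its closed-form solution, and the Neumann-series expansion are:)

```latex
x = \frac{a}{n}\,\mathbf{1}^{T} + (1-a)\, x P
\;\Rightarrow\;
x = \frac{a}{n}\,\mathbf{1}^{T}\bigl(I - (1-a)P\bigr)^{-1}
  = \frac{a}{n}\,\mathbf{1}^{T} \sum_{k=0}^{\infty} (1-a)^{k} P^{k}
```

The series converges because 1-a < 1 and P is stochastic, and each term is exactly what the random-walk interpretation below describes: start uniform, survive k steps, take k transitions.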

So, now we can simulate the random walk:
- start from a random node (the uniform distribution 1^T/n)
- stop with probability a at each step
- otherwise keep jumping to neighbour nodes (surviving k steps has probability (1-a)^k and corresponds to the P^k term)
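The steps above map directly onto code. A minimal sketch of one such walk, assuming the graph is a plain dict mapping each node to its list of neighbours:

```python
import random

def random_walk(adj, a=0.15):
    """One Monte Carlo walk: start at a uniformly random node, stop with
    probability a at each step, otherwise follow a random out-link
    (teleporting to a random node if the current node is a dead end)."""
    nodes = list(adj)
    current = random.choice(nodes)       # start from a random node
    while random.random() >= a:          # continue with probability 1 - a
        neighbours = adj[current]
        current = (random.choice(neighbours) if neighbours
                   else random.choice(nodes))
    return current                       # the stopping node
```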

Implementation

There is a file describing an undirected graph (each line contains two numbers representing neighbouring nodes in the graph). First of all, let's implement the graph and a node:

We have a Node class with 2 attributes: neighbours and pageRank. The structure of the graph is simple: a dict(Node Index: Node), stored in the structure attribute.
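(The code figure is not reproduced here. A minimal sketch matching that description — class and attribute names as described in the text, the file format being the edge list mentioned above:)

```python
class Node:
    """A graph node: its neighbour indices and its current PageRank score."""
    def __init__(self):
        self.neighbours = []
        self.pageRank = 0.0


class Graph:
    """Graph read from an edge-list file: each line holds two node indices."""
    def __init__(self, path):
        self.structure = {}              # dict: node index -> Node
        with open(path) as f:
            for line in f:
                u, v = map(int, line.split())
                self.add_edge(u, v)

    def add_edge(self, u, v):
        # The file describes an undirected graph, so link both directions.
        for a, b in ((u, v), (v, u)):
            node = self.structure.setdefault(a, Node())
            if b not in node.neighbours:
                node.neighbours.append(b)
```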

Now let's implement the Power Iteration method, with a static method that calculates the L1 norm (used for the convergence check):
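(The implementation figure is not reproduced here. A sketch of the idea, written over a plain adjacency dict rather than the Graph class so it stays self-contained; a is the random-jump probability, assumed to be 0.15:)

```python
class PowerIteration:
    """Power Iteration sketch: x(t+1) = a/n + (1-a) * incoming rank shares."""

    @staticmethod
    def l1_norm(x, y):
        # L1 (Manhattan) distance between two PageRank dicts,
        # used to decide convergence.
        return sum(abs(x[v] - y[v]) for v in x)

    def run(self, adj, a=0.15, tol=1e-10, max_iter=1000):
        n = len(adj)
        x = {v: 1.0 / n for v in adj}          # x(0) = {1/n, ..., 1/n}
        for _ in range(max_iter):
            x_new = {v: a / n for v in adj}    # random-jump (teleport) term
            for v, neighbours in adj.items():
                # Each node passes (1-a) of its rank, split evenly over
                # its neighbours (assumes no isolated nodes).
                share = (1 - a) * x[v] / len(neighbours)
                for u in neighbours:
                    x_new[u] += share
            if self.l1_norm(x, x_new) < tol:   # converged
                return x_new
            x = x_new
        return x
```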

Implementation of RandomWalk:

There is a parameter ‘sub’. When it is set to ‘stopping_node_only’, we use only the stopping node of each walk to approximate PageRank, which is wasteful, as all the non-stopping nodes in the random walks are ignored.
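(The implementation figure is not reproduced. A sketch of both modes, again over a plain adjacency dict; treating any value of ‘sub’ other than ‘stopping_node_only’ as the count-every-visit mode is an assumption:)

```python
import random

class RandomWalk:
    """Monte Carlo PageRank estimate from M random walks."""

    def run(self, adj, a=0.15, M=10000, sub='stopping_node_only'):
        nodes = list(adj)
        counts = {v: 0 for v in adj}
        total = 0
        for _ in range(M):
            current = random.choice(nodes)      # start at a random node
            while True:
                if sub != 'stopping_node_only':
                    counts[current] += 1        # count every visited node
                    total += 1
                if random.random() < a:         # stop with probability a
                    break
                neighbours = adj[current]
                current = (random.choice(neighbours) if neighbours
                           else random.choice(nodes))
            if sub == 'stopping_node_only':
                counts[current] += 1            # count the stopping node only
                total += 1
        return {v: c / total for v, c in counts.items()}
```

In the count-every-visit mode the normaliser is the total number of visited nodes rather than M, so the estimate still sums to 1.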

Error Analysis

We can simulate M random walks. Let f_v be the number of walks terminating at node v. Then we can use f_v/M to estimate the PageRank of v. Let's run the experiment and approximate PageRank with random walks:

  • Denote by x_v the PageRank value computed by the Power Iteration method
  • Find the difference between the Monte Carlo approximation f_v/M and the ground-truth vector x_v (Manhattan distance):

We can run the experiment with the following code:
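(The experiment code figure is not reproduced. A compact, self-contained sketch of the comparison on a small assumed toy graph; since n here is tiny, the absolute numbers will differ from the article's results:)

```python
import random

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # assumed toy graph
a, n = 0.15, len(adj)

# Ground truth via Power Iteration.
x = {v: 1 / n for v in adj}
for _ in range(200):
    x_new = {v: a / n for v in adj}
    for v, neighbours in adj.items():
        for u in neighbours:
            x_new[u] += (1 - a) * x[v] / len(neighbours)
    x = x_new

# Monte Carlo estimates for a growing number of walks M.
random.seed(0)
nodes = list(adj)
for M in (2 * n, 20 * n, 2000 * n):
    f = {v: 0 for v in adj}
    for _ in range(M):
        current = random.choice(nodes)
        while random.random() >= a:              # walk until stopping
            current = random.choice(adj[current])
        f[current] += 1                          # f_v: walks ending at v
    distance = sum(abs(f[v] / M - x[v]) for v in adj)  # Manhattan distance
    print(M, round(distance, 4))
```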

It took 80 iterations to converge when computing the ground-truth PageRank vector. Also, as the number of random walks increases (M = 2n, 4n, 6n, 8n, 10n, where n is the number of nodes in the graph), the Manhattan distance decreases, as expected:

Please, do not hesitate to leave any comments/feedback or share my work. Thank you!
