Investigation of the role of similarity measure and ranking algorithm in mining social networks
Article
Investigation of the role of similarity measure and ranking algorithm in mining social networks
Rasim Alguliev, Ramiz Aliguliyev and Fadai Ganjaliyev
Institute of Information Technology of Azerbaijan National Academy of Sciences
Abstract
Social networks have attracted much attention recently. Different studies have been conducted to automatically extract social networks among various kinds of entities from the web. Social network analysis finds its application in many current business areas. In this paper we demonstrate how the choice of the similarity measure affects ranking results of entities in a social network extracted from the web. We use different similarity measures in order to build different social networks. By applying formulas described below for each of the networks we derive a new network which is different from the srcinal one by edge weights. Subsequently, in the derived networks we rank entities again. Finally we compare the results.
Keywords
social network on the web; similarity measure; PageRank; resistance distance; entity ranking
1. Introduction
Social networks have gained a high level of popularity recently. Social networks on the web have been extracted by retrieving relationships between entities automatically derived from multilingual news [1] from a user’s email inbox [2]. An algorithm is proposed for automatically extracting social hierarchy data from electronic communication behaviour which is based on data mining users’ behaviours to automatically analyse and catalogue patterns of communications between entities in an email connection to extract social standing [3]. Social networks have also been extracted from log files of online shared workspaces [4]. A method for extracting social networks from the web using similarity between collective contexts is proposed in [5].One of the most important concepts in social network analysis is the concept of
centrality
. It shows the degree of importance of an actor in the network. Freeman defined three basic measures of centrality which are
degree
,
closeness
and
betweenness
. Degree centrality identifies the number of direct links to a node. Closeness centrality shows how few links an actor must go through in order to reach other actors in the network. Nodes having the highest value for betweenness centrality are those through which other nodes most often must go through so that they can reach each other. Figure 1 demonstrates that actor
A
possesses the highest number of direct ties to all other actors in the network, which tells us that it has the highest degree centrality. In comparison to other actors, actor
A
also has to make minimal effort to reach other actors in the network, i.e. the rest of the actors are one step away from actor
A
. It can also be seen that actors
B
,
C
and
D
must go through actor
A
to reach each other, which assigns the highest value for betweenness centrality to it.Centrality tells us which are key actors in the network, i.e. which actors have the highest ranks. Ranking is one of the most important functions which need to be computed when conducting social network analysis. Ranking shows which actors in the network are most important, which are the most influential, which take decisions and so on. For example, assume we have extracted a social network of a terrorist group from the web. Ranking of entities of this network may help us to identify who is the group leader, who is the middleman between subgroups of the group and so on.
Journal of Information Science37(3) 229–234© The Author(s) 2011Reprints and permission: sagepub.co.uk/journalsPermissions.navDOI: 10.1177/0165551511400946 jis.sagepub.com
Corresponding author:
Ramiz Aliguliyev, Institute of Information Technology of Azerbaijan National Academy of Sciences Email: a.ramiz@science.az
Alguliev et al.
230
Journal of Information Science, 37 (3) 2011, pp. 229–234 © The Author(s), DOI: 10.1177/0165551511400946
Some studies primarily focus on ranking entities in a social network rather than constructing it. In [6], several approaches are proposed to choose the best ranking method for entities in a social network constructed from the web. Another algorithm is proposed for ranking networked entities in a social network based on Markov walks with parameterized conductance values associated with the network edges in [7]. One of the primary things which need to be taken into an account when performing ranking is the similarity coefficient which is used to build the network. Application of one and the same ranking algorithm to a different similarity coefficient may yield different results. This is what we shall attempt to show in the current paper.In this paper we construct three social networks on the same set of named entities. We use Jaccard, overlap and Normalized Google Distance (NGD) [8] coefficients to retrieve degree of closeness between entities. The obtained value of the coefficient we consider to be the weight of an edge connecting two nodes. Afterwards, for each of the networks we build a new one using formulas described below. The only difference between this new network and the srcinal are the edge weights. Finally, for each pair of networks we rank entities using one and the same formula and compare the results.
2. Constructing social networks using different similarity measures
We consider undirected networks only. We assume that the entities are given beforehand. To build a social network given a named set of entities we use a general search engine. An edge is constructed from one person to another if the corresponding similarity measure coefficient is higher than a predefined value. The obtained value of the coefficient is considered to be the weight of an edge connecting the two entities. For example, to derive the value for overlap coefficient for two entities A and B first we put the query ‘A’ AND ‘B’ to the engine and receive the number of documents, say
x
, where both entities A and B appear simultaneously. Then we put to the engine separately A and afterwards B and receive the number of documents where A appears independently, say
y
, and B, say
z
. The sought value of the overlap coefficient would be
x
/min(
y
,
z
).We randomly picked up five participants from the WWW2006 conference. Using the method described above we constructed three social networks using three similarity coefficients: Jaccard, overlap and NGD. We connected two people with an edge if the value of the similarity coefficient was greater than zero. Since all coefficient values for all three networks give a value greater than zero then in each of the cases we received the complete graph shown in Figure 2.We shall denote participants
e
1
,
e
2
,
e
3
,
e
4
,
e
5
starting from Stuart Williams and continuing in a clockwise direction on the graph. Tables 1–3 contain values for each of the three similarity coefficients.
ACDB
Figure 1.
Graphical demonstration of centrality measures.
John Miller Mark Baker John DarlingtonGarole GobleStuart Williams
Figure 2.
Social network of five participants of WWW2006 conference.
Alguliev et al.
231
Journal of Information Science, 37 (3) 2011, pp. 229–234 © The Author(s), DOI: 10.1177/0165551511400946
Table 1.
Search engine results for Jaccard similarity coefficientEntity
e
1
e
2
e
3
e
4
e
5
e
1
0.000.250.130.030.13
e
2
0.250.000.040.130.21
e
3
0.130.040.000.050.14
e
4
0.030.130.050.000.14
e
5
0.130.210.140.140.00
Table 2.
Search engine results for overlap similarity coefficientEntity
e
1
e
2
e
3
e
4
e
5
e
1
0.000.500.750.520.50
e
2
0.500.000.870.130.21
e
3
0.750.870.000.050.14
e
4
0.520.130.050.000.06
e
5
0.500.210.140.060.00
Table 3.
Search engine results for NGD similarity coefficientEntity
e
1
e
2
e
3
e
4
e
5
e
1
0.000.500.750.520.50
e
2
0.500.000.870.800.67
e
3
0.750.870.000.890.82
e
4
0.520.800.890.000.42
e
5
0.500.670.820.420.00
Afterwards, for each of the networks we built a new network. This we accomplished by assigning new values to the edges connecting each pair of entities by means of the following formula:
sim e er e e
i ji j
( , )( , )
=
1
, (1) where
r
(
e
i
,
e
j
) is the
resistance distance
between entities
e
i
and
e
j
[9].In order to compute the resistance distance, firstly the Laplace matrix must be constructed. For a nonoriented graph, a Laplace matrix is built as follows [10]:
L = D

A,
where
A
is adjacency matrix with elements defined in the following way:
aw e e
ijij i j
=
,if and areconnectedentities0, otherwise
,wherein
w
ij
≥ 0 is a weight of edge connecting entities
e
i
and
e
j
, and
D
is a diagonal matrix with elements
d a
ii ii ij jn
= =
=
∑
[ ] .
D
1
Alguliev et al.
232
Journal of Information Science, 37 (3) 2011, pp. 229–234 © The Author(s), DOI: 10.1177/0165551511400946
Calculation of pseudotransformation for matrix
L
, which is a standard method for computing resistance distance, is a tedious task. For this reason a simplified method is suggested:
r i ji
ij
=
det ( , )det ( ),
LL
where
L
(
i
,
j
) is a matrix derived from the Laplace matrix by removing the
i
th and
j
th rows and columns and
L
(
i
) is a matrix retrieved from the Laplace matrix by removing the
i
th row and
i
th column. Resistance distance can be considered as a resultant which consolidates all possible routes between two entities [10]. From here we can calculate resistance distance for any pair of entities for each of the networks. At this step with the aid of Equation (1) we compute new weights between entities in the three networks. We received the values shown in Tables 4–6.To summarize, initially for a given set of named entities we used a search engine in order to build three social networks. In this way we received three networks with weight values equal to corresponding similarity measure values. Hence, for each of the networks we built a Laplace matrix to calculate the resistance distance between network nodes. Afterwards, we used Equation (1) to generate new values for edge weights. In this way using four similarity measures we received six networks of the same set of entities.
3. Entity ranking
After constructing the social networks, we performed ranking of the entities in each of the networks with the aid of the TopicCentric algorithm which is the following iterative formula [11]:
PR e d nd sim e e sim e e PR
TC ji ji k e F ee B eT
k ii j
( ) ( ) ( , )( , )
( )( )
= −+
∈∈
∑∑
1
C C i
e
( )
, (2)
−−−−
−−−−
−−−−
−−−−
−−−−
=
20.107.023.030.060.0
07.053.003.013.030.0
23.003.073.007.040.0
30.013.007.053.003.0
60.030.040.003.033.1
J
A
Figure 3.
Laplace matrix for Jaccard network.
−−−−
−−−−
−−−−
−−−−
−−−−
=
54.006.014.021.013.0
06.027.005.013.003.0
14.005.036.004.013.0
21.013.004.063.025.0
13.003.013.025.054.0
Ov
A
Figure 4.
Laplace matrix for Overlap network.
−−−−
−−−−
−−−−
−−−−
−−−−
=
41.242.082.067.050.0
42.063.289.080.052.0
82.089.033.387.075.0
67.080.087.084.250.0
50.052.075.050.027.2
NGD
A
Figure 5.
Laplace matrix for NGD network.
The following tables contain Laplace matrix values for each of the networks.
Alguliev et al.
233
Journal of Information Science, 37 (3) 2011, pp. 229–234 © The Author(s), DOI: 10.1177/0165551511400946
Table 4.
Resistance distance weights for Jaccard similarity coefficientEntity
e
1
e
2
e
3
e
4
e
5
e
1
0.000.400.650.470.89
e
2
0.400.000.310.320.47
e
3
0.650.310.000.330.57
e
4
0.470.320.330.000.40
e
5
0.890.470.570.400.00
Table 5.
Resistance distance weights for overlap similarity coefficientEntity
e
1
e
2
e
3
e
4
e
5
e
1
0.000.410.280.250.34
e
2
0.410.000.250.240.39
e
3
0.280.250.000.180.29
e
4
0.250.240.180.000.21
e
5
0.340.390.290.210.00
Table 6.
Resistance distance weights for NGD similarity coefficientEntity
e
1
e
2
e
3
e
4
e
5
e
1
0.001.751.691.471.41
e
2
1.750.001.961.751.64
e
3
1.691.960.001.891.79
e
4
1.471.751.890.001.47
e
5
1.411.641.791.470.00
where
PR
TC
(
e
j
) is the rank of the node
e
j
,
sim
(
e
i
,
e
j
) is the similarity coefficient value of the edge connecting nodes
e
i
and
e
j
,
B
(
e
i
) is the set of nodes referencing node
e
j
,
F
(
e
i
) is the set of nodes to which the node
e
i
refers,
d
is the damping coefficient which belongs to the interval [0.8;1], and
n
is the number of nodes in the network. Using Equation (2) we ranked the entities of the first three srcinal networks. Subsequently, we applied the same formula for the three new networks. Table 7 demonstrates the obtained values. Analysis of the table shows that, if we rank entities in a network and apply the same ranking function to a network derived from the first one by means of resistance distance, results may vary. From the table it can be seen that in the case of an overlap network all conference participants received the same ranks, although in the first case we ranked entities using ordinary overlap coefficients and in the second we applied resistance distance to this value. In comparison to an overlap network for Jaccard and NGD network rankings do not provide the same results. For Jaccard network TopicCentric algorithm for weights obtained through a search engine puts John Miller and Garole Goble at 4th and 5th places respectively, but resistance distance ranking swaps them. Similarly, for the NGD network, Stuart Williams is 4th and Garole Goble 3rd respectively for ranking with search
Table 7.
Ranking results for srcinal and derived networks of conference participantsEntityJaccard networkOverlap networkNGD network Rank
TC
Rank
RD
Rank
TC
Rank
RD
Rank
TC
Rank
RD
e
1
112243
e
2
451122
e
3
334411
e
4
545534
e
5
223355