Isolating and analyzing fraud activities in a large cellular network via voice call graph analysis

Isolating and Analyzing Fraud Activities in a Large Cellular Network via Voice Call Graph Analysis

Nan Jiang∗, Yu Jin, Ann Skudlark, Wen-Ling Hsu, Guy Jacobson, Siva Prakasam, Zhi-Li Zhang∗
∗Computer Science Dept., University of Minnesota, MN 55414
AT&T Labs-Research, NJ 07940
∗{njiang,zhzhang} {yjin,aes,hsu,guy,asp}

ABSTRACT

With widespread adoption and growing sophistication of mobile devices, fraudsters have turned their attention from landlines and wired networks to cellular networks. While security threats to wireless data channels and applications have attracted the most attention, voice-related fraud activities also represent a serious threat to mobile users. In particular, we have seen increasing numbers of incidents where fraudsters deploy malicious apps, e.g., disguised as gaming apps to entice users to download; when invoked, these apps automatically, and without users’ knowledge, dial certain (international) phone numbers which charge exorbitantly high fees. Fraudsters also frequently utilize social engineering (e.g., SMS or email spam, Facebook postings) to trick users into dialing these exorbitant fee-charging numbers.

In this paper, we develop a novel methodology for detecting voice-related fraud activities using only call records. More specifically, we advance the notion of voice call graphs to represent voice calls from domestic callers to foreign recipients and propose a Markov Clustering based method for isolating dominant fraud activities from these international calls. Using data collected over a two year period from one of the largest cellular networks in the US, we evaluate the efficacy of the proposed fraud detection algorithm and conduct systematic analysis of the identified fraud activities. Our work sheds light on the unique characteristics and trends of fraud activities in cellular networks, and provides guidance on improving and securing hardware/software architecture to prevent these fraud activities.
Categories and Subject Descriptors
C.2.0 [Computer Communication Networks]: Security and Protection

General Terms
Measurement, Management, Security

Keywords
Cellular Network, Fraud, Malware, Mobile Apps

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MobiSys’12, June 25–29, 2012, Low Wood Bay, Lake District, UK.
Copyright 2012 ACM 978-1-4503-1301-8/12/06 ...$10.00.

1. INTRODUCTION

The past decade has witnessed the rapid deployment and evolution of mobile cellular networks, which now support billions of users and a vast and diverse array of mobile devices, from smartphones and tablets to e-readers and smart meters. It was reported [1] that in 2010 there were over 5 billion mobile phones in operation, in comparison to the total world population of 6.8 billion. Mobile phones and tablets are gradually replacing traditional wire-lines as well as personal computers, and are becoming an indispensable component in our daily life [2,3]. With breath-taking advances in smart mobile devices and the growing sophistication in the mobile applications (apps) and services (e.g., location services and cloud services) they spur, we are now entering a new era of mobile computing.

With their wide adoption, smartphones, while providing valuable utility and convenience to mobile users, also bring with them new security threats.
As smartphones function both as phones and as mobile computers, smartphone users not only face the usual Internet security threats (e.g., malware and botnets [4]) through their web browsing and other data activities (although such data-related security threats are perhaps mitigated by hardware and software platforms that are markedly improved over conventional desktop platforms); they may also encounter a variety of voice-related security threats. These range from conventional voice scams similar to those on landlines, e.g., stealing customers’ private information or defrauding users of money through various social engineering techniques, to new forms of voice fraud that utilize the data functionality of smartphones for voice-related trickery. For instance, we have seen increasing numbers of incidents where fraudsters deploy malicious apps, disguised as interesting games and other applications to entice users to download them; when invoked, these apps automatically, and without users’ knowledge, dial certain (international) phone numbers which charge exorbitantly high fees. Fraudsters also frequently use other social engineering tricks to deceive users, e.g., SMS or email spam, Twitter tweets, or fake online postings that lure users into clicking on malicious URL links, resulting in automatic dialing of exorbitant fee-charging international numbers.

So far most research efforts have focused on applying and developing anomaly detection and prevention techniques to secure wireless data channels, mobile devices, and applications/services [5–9]. To the best of our knowledge, little work has been devoted to detecting and understanding various voice-related fraud activities targeting mobile users, especially when international phone numbers are involved. Voice-related fraud activities can have a much wider impact in cellular networks, as potentially all mobile users can be victims of such activities.
Furthermore, unlike data traffic, mobile international voice calls often follow the pay-per-call compensation model and are far more expensive; hence voice-related fraud activities involving international phone numbers can bring direct and significant financial losses to both mobile users and cellular service providers. Detecting and rooting out such voice-related fraud activities, especially voice fraud that is triggered through the data plane, is not an easy task, due to the large user population, the vast phone number space and limited data.

Using voice call records collected over a two year period in one of the largest cellular networks in the US, the goal of our study is two-fold: 1) to develop an effective approach to proactively isolate dominant fraud calls from a myriad of legitimate calls; and 2) to conduct a systematic analysis of the unique characteristics and trends of fraud activities in cellular networks, e.g., techniques for soliciting fraud calls and social engineering. Achieving these goals can provide a means of alerting customers and cellular providers of potential fraud threats to avoid financial loss for both parties, and ultimately improve customers’ satisfaction. Moreover, understanding different fraud activities can help gain useful insights in developing better hardware/software architecture for preventing future fraud activities.

Since voice call records only contain limited information, such as call time, originating/terminating numbers, country codes and call durations, we explore the relationship among parties participating in the calls (i.e., “who calls whom”) for the fraud detection task. In particular, we advance the notion of voice call graphs for representing the call records.
A voice call graph is a bi-partite graph, where two independent sets of nodes represent the groups of domestic originating numbers and foreign terminating numbers, respectively, and the edges stand for phone calls between these originating numbers and terminating numbers. By visualizing small scale voice graphs and characterizing large scale voice graphs with classic graph statistics, we find that fraud numbers and victims often exhibit very strong correlation, which results in community structures in the voice call graph. Therefore, the task of isolating fraud calls can be formulated as the problem of extracting dominant community structures from voice graphs. This serves as our basic heuristic for detecting voice-related fraud. Based on this heuristic, we propose a Markov Clustering (MCL) based algorithm to decompose voice call graphs in an iterative manner, which produces millions of disconnected subgraphs on a month-long voice graph. We further rely on the strength of community structures (measured by the number of cliques) and their popularity (measured by the number of callers) as a gauge to isolate fraud activities from these subgraphs.

We validate the proposed detection algorithm using two sources of ground truth: 1) a list of phone numbers that are reported by mobile users to the cellular service provider, which are then manually verified by fraud agents to be involved in international revenue sharing fraud (IRSF); 2) online reports from mobile users that are posted on forums, blogs or social media sites. By matching our detection results against the ground truth, we find that the proposed algorithm is able to isolate from millions of terminating numbers the most dominant IRSF fraud numbers. In particular, these IRSF numbers together have attracted more than 85% of the victims and resulted in 78% of the fraud calls. More importantly, in 60% of the cases, our method is able to detect fraud numbers at least 1 month prior to the earliest online user reports.
Such an advantage in early fraud detection allows us to effectively reduce exposure to significant financial loss for both mobile users and cellular network providers. In addition to IRSF activities, our method also identifies a wide variety of other types of fraud, ranging from traditional voice scams to emerging fraud cases committed through mobile devices, smartphone apps and online social media sites. This enables us to gain a comprehensive view of voice-related fraud in today’s large cellular networks.

Based on our detection results, we conduct extensive analysis of fraud activities in cellular networks. Our analysis unveils two major types of fraud: 1) IRSF fraud, which brings direct revenue to the fraudsters through victims placing calls to premium rate international numbers; and 2) scams that rely on social engineering to defraud victims. For both types of fraud, we observe interesting characteristics that are unique to cellular networks. For example, we find that malware apps, unlocked devices and online media sites can serve as new channels for carrying IRSF fraud, and that smartphone users are more susceptible to many of these fraud activities. Also, personal information such as email contact lists and online transaction details are becoming popular components of social engineering techniques. In addition, we identify the heteronym property of fraud numbers, which take advantage of the fact that most mobile devices lack the ability to distinguish foreign numbers from domestic ones in order to solicit calls to fraud numbers. Moreover, we find that the vetting process used by online app marketplaces and online media sites plays an important role in effectively preventing fraud activities. All these observations provide us with useful insights in designing better and more secure hardware/software platforms to prevent future fraud activities.

We note that analyzing voice call graphs alone is not sufficient to detect fraud activities.
Nonetheless, by analyzing voice graphs, we can winnow down and zero in on likely suspicious activities. Additional (expensive) approaches, e.g., those that incorporate billing information and manual investigation, can then be applied to analyze the detection results to further confirm the fraud activities. In other words, our method serves as a “first-line” defense in alerting users and cellular providers of emerging fraud activities and isolating likely fraud activities proactively and quickly from a large number of call records. In addition to cellular networks, the proposed method can also be readily applied for combating fraud activities on landlines.

The remainder of this paper is organized as follows. The background of the UMTS network and the datasets studied are introduced in Section 2. In Section 3, we formally define voice call graphs and motivate the heuristics for fraud detection. We then propose an MCL based algorithm in Section 4 to decompose voice call graphs and isolate fraud related subgraphs. Using ground truth from two sources, we evaluate the detection results in Section 5 and conduct systematic analysis of detected fraud activities in Section 6. Section 7 discusses related work and Section 8 concludes the paper.

2. BACKGROUND AND DATASETS

In this section, we provide a quick overview of the cellular network under study, and briefly introduce the composition of telephone numbers. We then discuss the datasets and ground truth used in our fraud analysis.

2.1 UMTS Network Overview

The cellular network under study utilizes primarily UMTS (Universal Mobile Telecommunication System), a popular 3G mobile communication technology supporting both voice and data services. The key components of a typical UMTS network are illustrated in Fig. 1. When making a voice call or accessing a data service, a mobile device directly communicates with a cell tower (or node-B), which forwards the voice/data traffic to a Radio Network Controller (RNC).
In the case of mobile voice (also including Short Message Service or SMS), the RNC delivers the voice traffic to the PSTN (Public Switched Telephone Network) or ISDN (Integrated Services Digital Network) telephone network, through a Mobile Switching Center (MSC) server. All voice call records, domestic or international, can be observed at MSCs.

Figure 1: UMTS network architecture.

2.2 Primer on Phone Numbers

A telephone number consists of a sequence of digits for reaching a particular phone line in a public switched phone network. The phone line that initiates the call is associated with an originating number and the number of the targeted phone line is called the terminating number. In a UMTS network, the phone number is also referred to as an MSISDN (Mobile Subscriber Integrated Services Digital Network Number). An MSISDN comprises three components: a country code followed by an area code (also called a national destination code), then by the subscriber number. A majority of phone numbers within the same geographical area share the same country code and area code. Depending on the specific country, phone numbers vary in length. Under certain circumstances, phone numbers from two different countries can be exactly the same (see Section 6.4). An international dialing prefix (also referred to as an exit code) is attached in front of the country code to distinguish these numbers. The specific exit code is determined by both the originating country and the terminating country and is provided directly by the cellular service provider. Users need to explicitly dial the exit code in order to initiate an international phone call. However, when receiving a phone call from a foreign party, the exit code is already contained in the incoming foreign number. Therefore, when returning such a call, the exit code is often attached automatically by the mobile device without the user’s knowledge.
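The composition of a phone number described above can be illustrated with a small sketch. The `Msisdn` class, its field names and the `dial_string` helper are hypothetical names introduced here for illustration only. The exit code is passed in as a parameter because, as noted above, it depends on the countries involved, and the sample digits are made-up values, not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msisdn:
    """A phone number split into the three components described above."""
    country_code: str  # e.g. "44"
    area_code: str     # also called the national destination code
    subscriber: str    # subscriber number

    def digits(self) -> str:
        """Full number without any international dialing prefix."""
        return self.country_code + self.area_code + self.subscriber


def dial_string(number: Msisdn, exit_code: str) -> str:
    """Digits dialed for an international call: exit code, then the full number."""
    return exit_code + number.digits()


# Illustrative values only: exit code "011" and an invented foreign number.
example = Msisdn(country_code="44", area_code="20", subscriber="79460000")
dialed = dial_string(example, exit_code="011")
```

When a device returns a call to an incoming foreign number, it would effectively apply `dial_string` on the user's behalf, which is why the user may not notice that the call is international.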
2.3 Datasets

We use datasets consisting of a complete set of international voice calls collected at the MSCs of the UMTS network under study. These phone calls are initiated by mobile users in the cellular network (i.e., domestic users) to international terminating numbers. We emphasize here that no customer private information is used in our analysis and we have anonymized all customer identities (domestic originating numbers). In particular, the anonymization process keeps the area code intact and only anonymizes the remaining 7 digits in the originating numbers. More importantly, the distance between two originating numbers¹ is also preserved after anonymization. In addition to protecting users’ privacy, this type of anonymization enables us to study the relationship among phone numbers that participate in the same fraud activities (see Section 6.2). Similarly, to adhere to the confidentiality under which we have access to the data, in places, we only present normalized views of our results while retaining the scientifically relevant magnitudes.

2.4 Obtaining Ground Truth

To evaluate the efficacy of the proposed fraud detection algorithm and to understand different fraud activities, we utilize two sources of user reports as our ground truth.

IRSF list: This list contains phone numbers associated with international revenue share fraud activities. In the scenarios of IRSF, an international revenue share provider designates a set of numbers as premium rate service (PRS) numbers, which are often priced much higher than normal calls terminating to the same foreign country². The profit generated from calls to these PRS numbers is shared by the revenue share provider and the content provider. By serving as the content provider and attracting victims to call these PRS numbers, the attackers gain direct revenue from these IRSF calls.
Cellular service providers are directly impacted by IRSF fraud because they suffer monetary loss when their customers refuse to pay the charges incurred by IRSF calls. The IRSF list in our study is created from customer reports to the care center of a large cellular service provider regarding suspicious IRSF activities observed from their monthly bill statements. Manual validation then follows, in which agents place phone calls to the reported IRSF candidates before adding these numbers to the list. If such numbers are of premium rates, provider-specific strategies will be applied to prevent future customers’ calls to these numbers. Some agents may even check the numbers that are adjacent to the reported numbers to see if they are also premium-rate. Even though these numbers may not be involved in any reported fraud activities so far, they can be acquired by fraudsters in the future. We refer to the phone numbers identified in this way as premium rate number ranges in the rest of this paper.

¹ We treat two phone numbers $on_i$, $on_j$ as two integers, and the distance between them is defined as $|on_i - on_j|$.
² In a few countries, revenue sharing numbers can have similar rates as regular calls.

Online feedback: This list contains phone numbers about which we have found customer complaints posted on forums, blogs and other online media sites, such as [10,11], through popular search engines. This data source covers a wide variety of fraud activities, such as voice calls related to scams and malware. We note that a small fraction of the IRSF numbers identified from the online feedback overlap with the IRSF list. However, there are also many IRSF numbers which are not covered by the IRSF list, possibly because no one has complained about them yet.
In addition, to understand different fraud activities and their associated social engineering techniques, we assign labels to the fraud numbers by distilling and summarizing keywords from user comments describing these numbers (see Section 6.3).

We note that since there is no guarantee that users will report all fraud activities they encounter, not to mention that many users lack the knowledge to identify fraud activities, these two data sources only cover a subset of all fraud activities in the network. Moreover, there is often a lag between fraud activities and user feedback. For example, users may only start to notice the IRSF activity when they observe unexpected charges on their monthly bills. As we shall see in Section 5.2, such a lag can last weeks to even months, rendering much less effective the widely used reactive fraud detection method based on user feedback.

3. VOICE CALL GRAPHS

In this section, we advance the notion of voice call graphs as a means to represent the communication patterns exhibited in the mobile voice channel. After characterizing voice graphs constructed from different time spans, we propose our key heuristic in identifying fraud activities.

3.1 Definition of Voice Call Graphs

In this paper, we only study a single direction of phone calls, i.e., the outbound phone calls placed by domestic phone numbers to international numbers. This choice not only helps reduce the volume of the data, but also enables us to observe both fraud calls initiated by domestic numbers and returned calls solicited by incoming fraud calls (e.g., random scanning, see Section 6).

In our dataset, each voice record contains the domestic originating number and the targeted international terminating number (note that we drop the terms “domestic” and “international” in the rest of the paper for simplicity). By depicting each record as an edge, all voice records can be readily captured by a bi-partite graph, which we refer to as a voice call graph, or a voice graph in short.
Formally, we define a voice call graph $G := \{\{\mathcal{ON}, \mathcal{TN}\}, E\}$, where $\mathcal{ON}$ and $\mathcal{TN}$ stand for the set of originating numbers and the set of terminating numbers appearing within the observation time window $T$, respectively. An edge $e_{ij}$ is drawn between $on_i \in \mathcal{ON}$ and $tn_j \in \mathcal{TN}$ if at least one voice call is made from $on_i$ to $tn_j$ within $T$. Note that since we only look at phone calls in one direction (i.e., from domestic numbers to international numbers), we treat edges as undirected³.

³ The definition of voice call graphs can be easily extended to weighted graphs or directed graphs. For instance, the weight on an edge can represent the number of calls associated with that edge.

3.2 Voice Call Graph Properties

Fig. 2 shows a call graph plotted using the Graphviz tool [12], which represents voice calls from 1,000 randomly sampled originating numbers in one single day, where the blue/red nodes represent originating/terminating numbers, respectively. At a glance, the voice graph in Fig. 2 is extremely sparse and contains a large number of disconnected components (subgraphs), with a majority of the subgraphs containing only one single edge. For those subgraphs with more than one edge, most of them exhibit a star structure centered on originating numbers, representing one originating number placing phone calls to a few terminating numbers. In comparison, most terminating numbers only have degree 1.

As we extend the observation time period and the originating number population, the voice graph grows significantly and renders direct visualization inapplicable. Instead, we characterize larger voice graphs using popular graph statistics. Fig. 3[a] shows the log-log plot of the node degree distribution in a one-day voice graph. We observe that the degrees of both the originating numbers and terminating numbers display a power-law shape. The power-law shape of the originating numbers implies that a majority of domestic customers rarely call foreign numbers from their cell phones.
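As a minimal sketch of the definition above and of the degree statistic plotted in Fig. 3[a], the following Python builds the sets ON, TN and E from a list of call records and computes per-node degrees and a 1 − CDF curve. This is not the authors' implementation; the function names and the `(originating, terminating)` tuple layout of a record are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_voice_graph(records):
    """Build (ON, TN, E) from (originating, terminating) call records.

    Repeated calls between the same pair collapse into a single edge,
    matching the "at least one call within T" rule in the definition.
    """
    edges = set(records)
    on_nodes = {on for on, _ in edges}
    tn_nodes = {tn for _, tn in edges}
    return on_nodes, tn_nodes, edges


def degrees(edges):
    """Return (originating-number degrees, terminating-number degrees)."""
    on_deg, tn_deg = defaultdict(int), defaultdict(int)
    for on, tn in edges:
        on_deg[on] += 1
        tn_deg[tn] += 1
    return dict(on_deg), dict(tn_deg)


def ccdf(values):
    """Return [(d, fraction of values strictly greater than d)], i.e. 1 - CDF."""
    values = list(values)
    counts = Counter(values)
    remaining = len(values)
    curve = []
    for d in sorted(counts):
        remaining -= counts[d]
        curve.append((d, remaining / len(values)))
    return curve


# Toy records (hypothetical identifiers, not real phone numbers).
records = [("on1", "tn1"), ("on1", "tn1"), ("on1", "tn2"),
           ("on2", "tn2"), ("on3", "tn3")]
ON, TN, E = build_voice_graph(records)
on_deg, tn_deg = degrees(E)
```

Plotting `ccdf(on_deg.values())` and `ccdf(tn_deg.values())` on log-log axes would give curves of the kind described for Fig. 3[a]. Keeping a call count per edge instead of a set would yield the weighted variant mentioned in footnote 3.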
In addition, the power-law shape of the terminating numbers indicates that, except for a few very popular terminating numbers (on the low end of the curve) associated with hotlines of popular hotels and resorts or foreign agencies like embassies, most terminating numbers receive calls from a very small number of originating numbers.

Figure 2: A voice graph from 1,000 randomly sampled originating numbers in one day. Blue/red nodes represent originating/terminating numbers.

Similarly to what we observed in Fig. 2, originating numbers tend to have a higher degree than terminating numbers. The low popularity of terminating numbers also reflects the lack of correlation among originating numbers. This is not surprising: given the much larger space of foreign terminating numbers, two mobile customers are in general unlikely to call the same international number(s). Therefore, voice graphs often consist of a large number of small disconnected subgraphs. Fig. 3[b] shows the increase in the number of subgraphs over time, which ranges from 0.3 million in a one-day graph to more than 3 million in a one-month graph. The growth is sublinear, plausibly due to the expansion of a giant connected component, which we will discuss in Section 3.4. In addition, Fig. 3[c] demonstrates the distribution of the sizes of subgraphs from voice graphs spanning different time periods (in terms of the number of edges within each subgraph). The subgraph sizes display again a power-law shape, indicating the dominance of small subgraphs in voice graphs. The same observation holds for the voice graphs from one day, one week, and one month data, though we do observe that the number of subgraphs grows during a longer observation period. The extremely low connectivity in a voice graph distinguishes it from other types of widely studied graphs, such as network traffic activity graphs [13] and online social network graphs [14], which often exhibit much stronger connectivity and correlations among nodes.

Figure 3: Voice call graph properties. (a) Node degrees in a one-day graph (x-axis: node degree, log scale; y-axis: 1 − CDF; curves: originating numbers, terminating numbers). (b) Number of subgraphs over time (x-axis: time in days; y-axis: number of disconnected subgraphs). (c) Subgraph size distribution (x-axis: subgraph size in number of edges, log scale; y-axis: 1 − CDF; curves: 1 day, 1 week, 1 month).

We note that, since the objective of this paper is to identify and analyze fraud activities in cellular networks, we want to identify as many fraud activities as possible. As we shall see in Section 6.4, fraud activities can take days to become noticeable. To cover such cases, we need to extend our observation period. Therefore, in the rest of this paper, by default we choose one month as the observation window for constructing voice graphs.

3.3 Heuristic for Detecting Voice-related Fraud

Based on our analysis of voice graphs, we propose our heuristic for detecting fraud activities from these graphs. We utilize two properties of voice graphs. First, in comparison to most legitimate terminating numbers that have low degrees, fraudsters aim to attract more victims to call fraud numbers and hence fraud numbers often appear to be much more popular (also referred to as heavy hitters). We note that detecting heavy hitters is a common approach used for identifying anomalous activities [15,16]. However, the popularity of terminating numbers alone is not enough in this scenario. We find that the most popular terminating numbers are associated with hotel hotlines, travel agencies, embassies, and so on.
In comparison, as we shall see in Section 5, the most popular fraud activities only attract hundreds of victims during a month-long period, so the corresponding fraud numbers are not among the highest-degree terminating numbers. We need additional features to help identify fraud numbers.

The second property we utilize is the low connectivity of voice graphs. Based on our experience, fraudsters often employ several foreign numbers to increase the chance of reaching victims and to avoid detection and regulation. As we shall see in Section 6, these numbers can even come from different countries. Therefore, we are alerted to a potential fraud activity when we observe many domestic users starting to place phone calls to the same set of foreign numbers. The communication patterns exhibiting the above two properties are often referred to as community structures in voice call graphs, where a set of originating numbers place phone calls to a set of terminating numbers. This serves as our key heuristic for detecting voice fraud in cellular networks.

Based on this heuristic, we formulate the fraud detection problem as finding the community structures from voice graphs⁴. We note that a community structure can also originate from legitimate activities, for example, due to tourists calling hotlines in a popular resort or companies communicating with foreign branches. However, as we shall see in Section 5, this heuristic can help successfully isolate a large number of popular fraud activities from millions of phone calls. In addition, we shall discuss in Section 6 that certain types of fraud activities can be detected in a more accurate way by investigating properties of the community structures.

[Figure 4 plot: x-axis in days; y-axis: fraction covered by GCC; curves: orig., term., edge]
Figure 4: Evolution of GCCs in voice call graphs spanning different time intervals.

3.4 Challenges

Identifying all community structures is still a challenging task.
This is mainly due to the appearance of random edges or weak connections which connect different communities, thereby forming large subgraphs mixed with different fraud activities. For example, Fig. 4 shows the evolution of the largest subgraph (a.k.a. giant connected component or GCC) in the voice call graph over a month-long time period. In Fig. 4, three curves display the coverage of the GCC at a particular time in terms of originating numbers, terminating numbers and edges, respectively. We observe that the GCC becomes significant from 10 days onwards and covers up to 25% of the edges in a 30-day voice graph. Since the voice graphs used in this paper are constructed using month-long

⁴ We have also explored other call features such as call duration and call time. However, none of them exhibit significant difference between fraud numbers and legitimate numbers. In future work, we will investigate other features like user calling history to improve detection accuracy.
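The GCC coverage curves described for Fig. 4 can be computed along the following lines. This is a hedged sketch using a plain union-find over the edge set, not the authors' code; the function names are hypothetical, and taking the "largest" subgraph to mean the one with the most edges is an assumption (the text measures subgraph size in edges in Fig. 3[c]).

```python
from collections import defaultdict


def connected_components(edges):
    """Split a bipartite voice graph into connected components (lists of edges)."""
    edges = set(edges)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for on, tn in edges:
        # Tag each node with its side so equal labels on both sides stay distinct.
        a, b = find(("on", on)), find(("tn", tn))
        if a != b:
            parent[a] = b

    comps = defaultdict(list)
    for on, tn in edges:
        comps[find(("on", on))].append((on, tn))
    return list(comps.values())


def gcc_coverage(edges):
    """Fractions of originating numbers, terminating numbers and edges in the GCC."""
    edges = set(edges)
    if not edges:
        return 0.0, 0.0, 0.0
    gcc = max(connected_components(edges), key=len)  # assumption: largest by edge count
    all_on = {on for on, _ in edges}
    all_tn = {tn for _, tn in edges}
    gcc_on = {on for on, _ in gcc}
    gcc_tn = {tn for _, tn in gcc}
    return (len(gcc_on) / len(all_on),
            len(gcc_tn) / len(all_tn),
            len(gcc) / len(edges))


# Toy graph (hypothetical labels): one 3-edge component and one single-edge component.
toy_edges = [("a", "x"), ("b", "x"), ("b", "y"), ("c", "z")]
components = connected_components(toy_edges)
coverage = gcc_coverage(toy_edges)
```

Calling `gcc_coverage` on the edges accumulated up to each day would produce the three per-day values that a plot like Fig. 4 tracks.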