Security Challenges In An Increasingly Tangled Web


Deepak Kumar†  Zane Ma†  Zakir Durumeric‡  Ariana Mirian‡  Joshua Mason†  J. Alex Halderman‡  Michael Bailey†
† University of Illinois, Urbana-Champaign   ‡ University of Michigan
{dkumar11, zanema2, joshm, mdbailey}@illinois.edu   {zakir, amirian, jhalderm}@umich.edu

ABSTRACT

Over the past 20 years, websites have grown increasingly complex and interconnected. In 2016, only a negligible number of sites are dependency free, and over 90% of sites rely on external content. In this paper, we investigate the current state of web dependencies and explore two security challenges associated with the increasing reliance on external services: (1) the expanded attack surface associated with serving unknown, implicitly trusted third-party content, and (2) how the increased set of external dependencies impacts HTTPS adoption. We hope that by shedding light on these issues, we can encourage developers to consider the security risks associated with serving third-party content and prompt service providers to more widely deploy HTTPS.

Keywords

website complexity, HTTPS adoption, privacy/tracking

1. INTRODUCTION

Since its inception in 1989, the Internet community has had to cope with the decision to allow HTTP pages to load third-party content on the World Wide Web. From the need to limit the access of malicious websites through the same-origin policy, to coping with long page load times through fine-grained resource scheduling, the dependency graph that underlies websites has been of critical importance in understanding the security and performance of the web. Existing measurements of this tangle of dependencies predate an explosion of advertising and tracking technologies and an increased reliance on shared platforms.

In order to untangle the interdependent web, we extend the headless version of Google Chromium to determine the services and networks that the Alexa Top Million sites load resources from.
We find that sites load a median of 73 resources, 23 of which are loaded from external domains — each twice the corresponding figure from five years prior [5]. We show that a small number of third parties serve resources for a large fraction of sites, with Google, Facebook, and Twitter appearing on 82%, 34%, and 11% of the top million sites respectively. Investigating the top networks that provide these remote resources, we find that shared platforms deliver content to many sites — content distribution networks (CDNs) serve content to 60% of the top million and cloud providers serve resources to 37% of the top million.

The increase in the number of externally loaded resources and their distribution yield an attractive attack vector for compromising large numbers of clients through shared dependencies and platforms in use across popular sites. Perhaps more troubling, we find that 33% of the top million sites load unknown content indirectly through at least one third party, exposing users to resources that site operators have no relationship with. Coupled with the observation that 87% of sites execute active content (e.g., JavaScript) from an external domain, a tangled attack landscape emerges. We note that modern resource integrity techniques, such as subresource integrity (SRI), are not applicable to unknown content, and it remains an open problem to mitigate these kinds of attacks.

The complex interdependencies on the web have consequences beyond security. We consider, as one example, the widespread deployment of HTTPS. We find that 28% of HTTP sites are blocked from upgrading to HTTPS by active content dependencies that are not currently available over HTTPS. This only accounts for active content — 55% of sites loaded over HTTP rely on an active or passive HTTP resource that is not currently available over HTTPS.

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. WWW 2017, April 3–7, 2017, Perth, Australia. ACM DOI: 10.1145/3038912.3052686.
Encouragingly, we find that 45% of HTTP sites could migrate to full HTTPS by changing the protocol in the resources they load. However, the community still has much work to do to encourage the widespread adoption of HTTPS.

We hope that by bringing to light these security and privacy challenges introduced by an increasingly tangled web, we remind the community of the many challenges involved with securing this distributed ecosystem. To facilitate further research on the topic, we are releasing our headless browser as an open source project that is documented and packaged for measuring the composition of sites.

2. DATA COLLECTION

Our analysis focuses on the resources (e.g., images, scripts, and style sheets) that popular websites load when rendered in a desktop browser. To track these dependencies, we extended the headless version of Google Chromium [12] to record websites' network requests, which we use to reconstruct the resource tree for each site. Unlike prior web resource studies, which have used a bipartite graph model of web pages [5, 13, 25], we build a tree that preserves the relationships between nested dependencies.

We visited all of the sites in the Alexa Top Million domains [1] on October 5–7, 2016 from the University of Michigan using our headless browser. We specifically attempted to load the root page of each site over HTTP (e.g., http://google.com). If we could not resolve the root domain, we instead attempted to fetch www.domain.com. We stopped loading additional content 10 seconds after the last resource, or after 30 seconds elapsed. Our crawl took approximately 48 hours

on 12 entry-level servers. We were able to successfully load the root page for 944,000 sites. 15K domains did not resolve, 13K timed out, 24K returned an HTTP error (i.e., a non-200 response), and 5K could not be rendered by Chromium.

After completing the crawl, we annotated each resource with additional metadata. We first made an additional HTTP and HTTPS request for each resource using ZGrab [7, 8] to determine whether the resource was available over HTTPS. We also captured the SHA-1 hash of each resource, the AS it was loaded from, and the MaxMind [21] geolocation of the web server. We are releasing our headless browser at https://github.com/zmap/zbrowse and our dataset at https://scans.io/study/tangled.

Figure 1: CDF of Resources — We show the CDF of the number of resources (by URL) loaded by the top million websites: (a) all resources; (b) external resources.

Figure 2: External Dependencies — Websites load a median 23 external resources from 9 external domains, 3 external ASes, and 1 foreign country.

Ethical Considerations. As with any active scanning methodology, there are many ethical considerations at play. In our crawl, we only loaded sites in the Alexa Top Million, for which our traffic should be negligible — at most equivalent to a user loading a page three times. The only exceptions were the sites for which we measured temporal changes, which we loaded hourly over a five-day period. We followed the best practices defined by Durumeric et al. [8] and refer to their work for a more detailed discussion of the ethics of active scanning.

Determining Content Origin. Throughout our analysis, we differentiate between local and external resources. We make this distinction using the Public Suffix List [23].
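The local-versus-external distinction can be sketched in Python. This is a toy illustration only: it hard-codes a handful of public suffixes, whereas a real implementation would consult the full Public Suffix List (e.g., via a library such as publicsuffix2 or tldextract); the helper names are ours, not the paper's.

```python
# Sketch of the local-vs-external distinction using a registered-domain
# ("public suffix plus one label") comparison. PUBLIC_SUFFIXES is a toy
# stand-in for the real Public Suffix List.
from urllib.parse import urlsplit

PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk", "com.ar"}  # toy subset

def registered_domain(host: str) -> str:
    """Return the public suffix plus one label, e.g. staticxx.facebook.com -> facebook.com."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        # Find the longest matching public suffix, then keep one extra label.
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host

def is_external(page_url: str, resource_url: str) -> bool:
    """A resource is external if its registered domain differs from the page's."""
    page_host = urlsplit(page_url).hostname
    res_host = urlsplit(resource_url).hostname
    return registered_domain(page_host) != registered_domain(res_host)

# staticxx.facebook.com is local to www.facebook.com ...
assert not is_external("http://www.facebook.com/", "https://staticxx.facebook.com/x.js")
# ... but gstatic.com is external to google.com.
assert is_external("http://google.com/", "https://www.gstatic.com/a.png")
```

Comparing registered domains rather than full hostnames is what keeps sibling subdomains of one site from being counted as third parties.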
For example, we would not consider staticxx.facebook.com to be hosted on a different domain than www.facebook.com, but we do consider gstatic.com to be external from google.com. In order to group resources by entity rather than domain, we additionally label each resource with the AS it was loaded from.

Root vs. Internal Pages. Our study approximates the resources that websites load by analyzing the dependencies loaded by each website's root page. To determine whether this is representative of the entire website, we analyzed five internal pages for a random 10K sites in our dataset.¹ On average, the root page of each site loads content from 87% of the union of external domains that all pages depend on and 86% of the external ASes. This is nearly double that of just the internal pages, which contribute 40% of the total domains and 44% of the total networks. In other words, while the root page of a site does not typically load content from all of a site's dependencies, it is representative of most of the dependencies of the site as a whole.

¹ We identified five pages for each site using Bing.

Temporal Changes. We also measured how dependencies change over time by loading a random 1,000 domains every hour for five days and tracking the new content introduced in each crawl. We find that 57% of external domains are captured in the first crawl, 5.7% in the second, and less than 1% in the third. A single snapshot does not capture all of the dependencies for a site, but the additional data collected from subsequent runs quickly diminishes.

Figure 3: LA Times — The LA Times homepage loads 540 resources from 267 unique IP addresses in 58 ASes and 8 countries.

3. WEB COMPLEXITY

There has been a stark increase in website complexity over the past five years. In 2011, Butkiewicz et al. found that the 20,000 most popular websites loaded a median 40 resources and 6 JavaScript resources [5]. Today, the top 20K websites load twice the number of scripts and total resources.
Only a negligible number of sites with user content are altogether dependency free, and the top million sites now load a median 73 resources. Sites are not only including a larger number of resources, but they are also increasingly relying on external websites and third-party networks. In 2011, 30% of resources were loaded from an external domain. In the last five years, this has nearly doubled, and, today, the majority (64%) of resources are external. More than 90% of sites have an external dependency and 89% load resources from an external network (Figure 2).

3.1 Site Composition

As shown in Figure 1a, there is a large variation in the number of resources that sites load. The top million sites load a median 73 resources, but popular, media-heavy sites tend to include many more. News and sports sites have the most dependencies and load a median 247 and 207 resources, respectively. News and sports also

load the most content from external services: 80% of their resources are external. In one of the more complex cases, the LA Times homepage includes 540 resources from nearly 270 IP addresses, 58 ASes, and 8 countries (Figure 3). CNN — the most popular mainstream news site by Alexa rank — loads 361 resources.

Figure 4: Types of Resource Loaded by the Top Million Sites — We show the breakdown of the types of resources (images, JavaScript, CSS/fonts, and data such as HTML, XML, and JSON) loaded by the top million sites. External content on sites skews more towards scripts and data when compared to all resources.

Figure 5: Domains Loaded by 10% of Sites — The domains that are commonly included on at least 10% of sites (among them google-analytics.com, gstatic.com, fonts.googleapis.com, doubleclick.net, facebook.com, facebook.net, twitter.com, fbcdn.net, and adnxs.com) are owned by just four companies: Google, Facebook, Twitter, and AppNexus. Google controls 8 of the top 10 most loaded domains.

Approximately 6% of sites do not have any dependencies. When we manually investigated these sites, we found that the vast majority do not use their root pages to serve user content. Instead, the domains are used to host media content at specific paths. The root page of mostazaweb.com.ar (an Argentinian fast food chain) loaded the most resources — nearly 20K. In general, sites with an extreme number of dependencies were either broken — loading resources an infinite number of times until timing out — or were less popular sites that load a large number of images.

Most resources are images (59%) and scripts (22%).² Of the 73 median files that sites load, 30 are images, 12 are scripts, and 5 are style sheets (Figure 4). Nearly all of the most commonly included resources (by file hash) support advertising/tracking programs and are served from external sites. The single most common file is a 1×1 white gif, which is used by an array of analytics and advertisement providers.
More than 70% of sites include the pixel from an external domain, and 91% of those load it from Google (e.g., as part of Google Analytics). The most common file unrelated to tracking or advertising is jquery.migrate.min.js, a WordPress dependency. It is the 11th most common file and is loaded on 13% of sites.

² We categorized resource types by analyzing the Content-Type HTTP header (e.g., image/png) and file extension, which allowed us to classify 99.8% of dependencies.

Figure 6: External Dependency Purposes — We categorized common dependencies that appear on more than 1% of sites. Analytics and tracking resources are the most commonly included category for top million websites, but advertising and social media dependencies account for more total resource loads.

Figure 7: ASes that Serve Content for 10% of Sites — Google resources are loaded on over 4 out of 5 sites in the top million. Other social, cloud, and CDN providers are significantly less prominent and serve content for 11–34% of sites.

3.2 External Dependencies

Just over 90% of the top million sites have external dependencies, and more than two-thirds of all resources are loaded from external sites. Despite the large number of external dependencies, there are only four companies — Google, Facebook, Twitter, and AppNexus — that serve content on more than 10% of the top million sites (Figure 5). The most frequently included external domain is google-analytics.com, which is present on 68% of websites. Sites also include an array of other Google services beyond their analytics program: 47% use AdWords/DoubleClick, 44% serve Google Fonts, and 43% call other Google APIs.
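The resource-type classification from footnote 2 (Content-Type header first, file extension as a fallback) might look like the following sketch. The category names and mapping tables here are illustrative assumptions, not the authors' exact rules.

```python
# Sketch of footnote 2's classification: prefer the Content-Type HTTP
# header; fall back to the URL's file extension when the header is missing.
from typing import Optional
from urllib.parse import urlsplit
import posixpath

MIME_PREFIXES = {              # MIME prefix -> category (illustrative)
    "image/": "Image",
    "text/css": "CSS/Font",
    "font/": "CSS/Font",
    "text/html": "Data",
    "application/json": "Data",
    "application/javascript": "JavaScript",
    "text/javascript": "JavaScript",
}
EXTENSIONS = {".png": "Image", ".gif": "Image", ".js": "JavaScript",
              ".css": "CSS/Font", ".json": "Data", ".html": "Data"}

def classify(url: str, content_type: Optional[str]) -> str:
    """Return a coarse resource category for one crawled dependency."""
    if content_type:
        # Strip parameters such as "; charset=utf-8" before matching.
        mime = content_type.split(";")[0].strip().lower()
        for prefix, category in MIME_PREFIXES.items():
            if mime.startswith(prefix):
                return category
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    return EXTENSIONS.get(ext, "Other")

assert classify("http://a.com/pixel.gif", "image/gif") == "Image"
assert classify("http://a.com/app.js", None) == "JavaScript"
assert classify("http://a.com/feed", "application/json; charset=utf-8") == "Data"
```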
Facebook and Twitter both provide social media plugins; AppNexus is a popular ad provider.

While there are only a handful of providers that serve content on more than 10% of sites, there is a long tail of dependencies that appear on a smaller number of domains: 22 domains are loaded by 5–10% of sites and 185 domains are loaded by 1–5%. We manually categorized the domains that more than 1% of sites depend on and find that the most common dependencies are part of analytics/tracking (29.4%) or advertising (29.0%) programs. We provide a detailed breakdown in Figure 6. We note that while analytics and tracking services are used by more websites than any other type of service, advertising accounts for the largest number of external dependencies.

3.3 Network Providers

While aggregating external resources by domain provides one perspective, this analysis fails to identify many of the lower-layer services that are shared between websites. Cloud providers, CDNs, and network services have visibility into much of the same data as the websites themselves do, and their compromise could cascade onto the upstream services that rely on them. To measure the service providers that popular sites rely on, we aggregated resources by the AS they are served from.

At least 20% of sites depend on content loaded from Google, Facebook, Amazon, Cloudflare, and Akamai ASes. Five of the ten most relied-on networks are CDNs, two are cloud providers, and one (Google) serves several roles (Figure 7). The five largest CDNs serve 12% of all resources and content for 60% of the top million. Cloudflare and Akamai both have a large set of customers and serve a variety of files to many domains. Several lesser-known CDNs are used by a surprisingly large number of sites. 10% of sites depend on MaxCDN to provide bootstrap.min.js and font-awesome.min.css. Fastly serves content for three sites that provide popular embedded content: imgur.com, shopify.com, and vimeo.com.
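The per-AS aggregation described above can be sketched as follows. The observations and AS names are invented for illustration; the paper's actual pipeline maps each resource's serving IP to an AS first.

```python
# Sketch: given (site, AS-of-a-loaded-resource) observations, compute the
# share of sites that rely on each network at least once.
from collections import defaultdict

observations = [                       # toy crawl output
    ("example-news.com", "GOOGLE"),
    ("example-news.com", "AKAMAI"),
    ("example-news.com", "GOOGLE"),    # duplicates within a site count once
    ("example-shop.com", "GOOGLE"),
    ("example-blog.com", "CLOUDFLARE"),
]

def sites_per_as(obs):
    """Map each AS to the set of distinct sites that load content from it."""
    by_as = defaultdict(set)
    for site, asn in obs:
        by_as[asn].add(site)
    return by_as

total_sites = len({site for site, _ in observations})
shares = {asn: len(sites) / total_sites
          for asn, sites in sites_per_as(observations).items()}
print(shares)  # GOOGLE is relied on by 2 of the 3 sites in this toy data
```

Using sets of sites (rather than raw resource counts) is what distinguishes "X% of sites depend on this network" from "this network serves X% of resources"; the paper reports both kinds of figures.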

4. IMPLICIT TRUST

Relying on a larger number of resources does not inherently pose a security risk. However, the expanding trust in external sites represents an increase in the attack surface for both websites and end users. In many cases, we find that website operators no longer know whom they are trusting, because external services load implicitly trusted content from third parties that are unknown to the main site operator.

To understand whom sites are implicitly trusting, we analyzed the depth at which different resources are loaded. We denote resources that the main site directly includes as explicitly trusted, and objects loaded from third parties by external services as implicitly trusted. For example, if the New York Times loads advertising JavaScript from DoubleClick and DoubleClick loads additional content from a third-party ad provider such as SmartAdServer, we would say that the New York Times explicitly trusts DoubleClick and implicitly trusts SmartAdServer. We note that the distinction between implicit and explicit trust depends not only on depth, but also on owner. If a DoubleClick script loaded additional content from its own servers or the original site, we would not mark this as implicitly trusted.

Nearly 33% of the top million sites include at least one implicit resource, and 20% of all external resources are loaded implicitly for the top 100K websites. Websites most commonly implicitly trust images (48%). In this situation, there is a modest security risk because the ad provider could serve a deceiving ad (e.g., phishing content). More worryingly, 32% of implicitly trusted resources are scripts and 3% are style sheets. An astounding 9% of the top million and 13% of the top 100K sites load implicitly trusted scripts from third parties.

One of the primary reasons this occurs is that real-time ad bidding has become a standard practice and ad exchanges commonly serve content directly from bidders [37].
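The depth-and-owner rule above can be sketched as a walk over the resource tree. The NYTimes/DoubleClick/SmartAdServer tree mirrors the paper's example; the owner labels and the exact label names are a simplified assumption, not the authors' implementation.

```python
# Sketch of explicit/implicit trust labeling over a resource tree. Each
# node records the entity that owns the domain the resource was served from.
from dataclasses import dataclass, field

@dataclass
class Resource:
    owner: str                      # entity controlling the serving domain
    children: list = field(default_factory=list)

def label_trust(root: Resource):
    """Yield (resource, label) pairs for every resource below the root."""
    def walk(node, parent):
        for child in node.children:
            if child.owner == root.owner:
                label = "first-party"   # content loaded back from the original site
            elif parent is None or node.owner == root.owner:
                label = "explicit"      # directly included by the main site
            elif child.owner == node.owner:
                label = "explicit"      # loader fetched from its own servers
            else:
                label = "implicit"      # third party pulled in by a third party
            yield child, label
            yield from walk(child, node)
    yield from walk(root, None)

smartad = Resource("SmartAdServer")
doubleclick = Resource("DoubleClick", [smartad])
page = Resource("NYTimes", [doubleclick])

labels = {r.owner: lbl for r, lbl in label_trust(page)}
print(labels)  # {'DoubleClick': 'explicit', 'SmartAdServer': 'implicit'}
```

The owner check in the two `explicit` branches captures the paper's caveat: depth alone is not enough, because a nested load from the loader's own servers (or the original site) is not implicit trust.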
DoubleClick loads the most implicit content and is responsible for implicitly trusted resources on 9.6% of the top million sites (Figure 8). We note that while Facebook, Google, and YouTube also appear near the top of the list, they are primarily loading additional content from

Figure 8: Sources of Implicitly Trusted Content — We show the domains that load the most implicitly trusted content and the domains that are most implicitly trusted.

The two cloud providers that serve content to the largest number of sites in the top million, Amazon EC2 and SoftLayer, serve 7.2% of all resources and are relied on by 37.3% of the top million sites. Though both cloud providers serve a varied set of customers, most domains load resources from these providers because several popular advertising services use them for hosting. Nine of the top ten domains served out of EC2 and SoftLayer belong to advertising campaigns; the only non-advertising-related domain is s3.amazonaws.com — Amazon's storage platform.

While loading passive content (e.g., an image) from a CDN may not be a serious security concern for many w

