
Hacking AES-128

Timothy Chong (Stanford University, ctimothy@stanford.edu)
Kostis Kaffes (Stanford University, kkaffes@stanford.edu)

Abstract—Advanced Encryption Standard, commonly known as AES, is one of the most well-known encryption protocols. It is used in a large variety of applications, ranging from encrypting classified documents in government agencies to securing data transfer in everyday protocols such as HTTPS and FTPS. In this paper, we use a template attack and a Support Vector Machine to perform a side-channel attack on AES-128 and recover the cipher key from power traces of a hardware encryption device. Using SVM, we achieve an overall prediction accuracy of 35%, which is more than sufficient to retrieve the entire cipher key. In addition, we use a balanced SVM to counteract the skew of our prediction classes.

I. INTRODUCTION

The ubiquitous need for cryptography in a large number of tasks today has led to the development and use of commodity cryptographic hardware modules. These modules are preferred to software implementations of cryptosystems due to their higher speed and lower power consumption. Most encryption algorithms in use today have been well studied and shown to be impervious to attacks on the information flow. However, the hardware that implements the encryption/decryption algorithms deserves more in-depth study, as it can be vulnerable to side-channel attacks. In cryptography, a side-channel attack is any attack based on information gained from the physical implementation of a cryptosystem, rather than on brute force or theoretical weaknesses in the algorithms. For example, such an attack can exploit the power consumption of a cryptographic device, which depends on the manipulated data and the executed instructions. The power consumption can be easily monitored on an oscilloscope by inserting a resistor in series with the ground; the obtained measurement is called a power trace. In this work, we perform a side-channel attack against the AES-128 encryption standard [1] using power traces obtained from a large vendor's commercial hardware security module. We use statistical and machine learning techniques to extract information from the power trace that can be used to recover the key. More particularly, our goal is to employ multivariate Gaussian distribution modeling and Support Vector Machine methods to obtain information about the output of a specific stage of the AES-128 encryption algorithm.

The rest of this paper is structured as follows. First, we present relevant work on the topic. In Section III we describe the AES-128 encryption algorithm. In Section IV we discuss our attack methodology, the machine learning methods we use, and several adjustments that improve them further. In Sections V and VI, we present the data set we use and the results of our attack. We conclude with some remarks and future directions.

II. RELATED WORK

Template attacks, introduced by Chari et al. [2], deal with many of the shortcomings of the Simple and Differential Power attacks, although there are countermeasures that can prevent them. Machine learning has recently been used extensively in side-channel attacks, since a better understanding of the physical behavior of a cryptosystem makes machine learning algorithms an important component of an attack strategy. Hospodar et al. [3] use the Least-Squares Support Vector Machine and compare different feature selection methods and kernel parameters. However, they only consider a single S-Box look-up and do not perform actual key recovery. Lerman et al. [4] empirically evaluate the effectiveness of several feature selection methods and of different supervised learning approaches such as Self-Organizing Maps, Support Vector Machines, and Random Forests. Finally, He, Jaffe, and Zou [5], after analyzing DES, conclude that the type of feature selection method used is not very important for key extraction.
III. AES-128

AES-128 uses a 128-bit cipher key to encrypt 128-bit input blocks. The encryption flow of the algorithm is shown in Figure 1.

Fig. 1. Block diagram of AES encryption.

In the beginning, the round key is added to the initial input text using bit-wise XOR. This operation is known as Key Addition and is also used in subsequent phases. The encryption then enters a chain of 10 rounds of operations. The cipher key generates a different round key for each round, as well as for the initial key addition, using the Rijndael key schedule. Each of the first 9 rounds contains four phases: Byte Substitution, Shift Row, Mix Columns, and Key Addition. In the Byte Substitution operation, each byte of the state matrix is substituted using an 8-bit substitution box, the Rijndael S-box. In the Shift Row operation, the rows are rotated left by ascending offsets, with the first row not shifted at all and the fourth row shifted by 3 positions. In Mix Columns, each column is transformed using the Rijndael mix-columns function. The final round is identical to the previous rounds except that the Mix Columns operation is skipped. The output of each round is used as the input of the next round, and the output of the final round is the encrypted cipher text.
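To make the encryption flow concrete, here is a minimal structural sketch in Python (the implementation language used later in this paper). It is not a working cipher: the S-box, the key schedule, and the Shift Row / Mix Columns steps are toy stand-ins, clearly marked as such, so that the skeleton runs; only the round structure, including the skipped Mix Columns in the final round, follows the description above.

```python
# Structural sketch of the AES-128 encryption flow. The transforms below
# are TOY stand-ins, not the real Rijndael operations; only the round
# structure is faithful to the description in the text.

TOY_SBOX = [(x * 7 + 5) % 256 for x in range(256)]  # toy stand-in for the Rijndael S-box

def sub_bytes(state):                  # Byte Substitution (toy table lookup)
    return [TOY_SBOX[b] for b in state]

def shift_rows(state):                 # Shift Row (placeholder: identity)
    return state

def mix_columns(state):                # Mix Columns (placeholder: identity)
    return state

def add_round_key(state, round_key):   # Key Addition: bit-wise XOR (real)
    return [b ^ k for b, k in zip(state, round_key)]

def expand_key(key):                   # toy stand-in for the Rijndael key schedule
    return [key] * 11                  # 1 initial key addition + 10 rounds

def aes128_encrypt_flow(block, key):
    round_keys = expand_key(key)
    state = add_round_key(block, round_keys[0])       # initial Key Addition
    for rnd in range(1, 10):                          # rounds 1-9: all four phases
        state = sub_bytes(state)
        state = shift_rows(state)
        state = mix_columns(state)
        state = add_round_key(state, round_keys[rnd])
    state = sub_bytes(state)                          # final round: Mix Columns skipped
    state = shift_rows(state)
    return add_round_key(state, round_keys[10])
```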

IV. ATTACK METHODOLOGY

In this work, our goal is to use machine learning techniques to predict the key used by an encryption device that implements AES-128 in hardware. We assume that the power trace P depends on the cipher key C and the input text I, i.e., $P = f(C, I)$. However, instead of trying to predict the key directly from the power trace, we target the hamming weight of each byte of the output of the Byte Substitution operation of the first round (OBFS) of the encryption algorithm (Figure 1). The term hamming weight refers to the number of 1's in the binary representation of a number. For example, the decimal number 9 is 1001 in binary, which has a hamming weight of 2 because it contains two 1's. The hamming weight of the output of this particular phase of the encryption algorithm is a good target for AES side-channel power analysis for two main reasons. First, the power trace is highly correlated with the hamming weight of a value stored in the internal registers of a hardware device, and in the AES case the OBFS is the first time that the registers store a value related to the key, so it is a good target for our attack. Second, the input key is expanded and key additions are performed at the end of each round; if we wanted to target values in later rounds of the algorithm, the complexity of our side-channel analysis would grow, because operations such as Shift Row and Key Addition would have to be taken into account. All operations in the AES algorithm run at 8-bit granularity, and each 8-bit sub-word stays in place until the first Shift Row operation. Therefore, given the hamming weight of each 8-bit sub-word in the output of the first Byte Substitution phase, one can retrieve the secret key, since this hamming weight is a function of the input secret key and the plain text. In the following sections, we describe the procedure by which we predict the hamming weight of the first byte of the OBFS given the power trace, the first byte of the cipher key, and the first byte of the plain text. It is important to note that the same analysis can be applied to all subsequent bytes to retrieve the entire 128-bit key.
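As a concrete illustration of the quantity we target, the sketch below computes the first-round Byte Substitution output for one byte and its hamming weight, given a plaintext byte and a key byte. The names are ours, not the paper's; the S-box is generated from its textbook definition (multiplicative inverse in GF(2^8) followed by the affine transform) so that the snippet is self-contained rather than relying on a hard-coded table.

```python
def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1.
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def sbox(x):
    # Rijndael S-box: multiplicative inverse (0 maps to 0), followed by
    # the affine transform over GF(2).
    inv = next(y for y in range(1, 256) if gf_mul(x, y) == 1) if x else 0
    rotl = lambda b, n: ((b << n) | (b >> (8 - n))) & 0xFF
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63

def obfs_byte_hw(plaintext_byte, key_byte):
    # Hamming weight of the first-round Byte Substitution output for one
    # byte position -- the class label our classifiers try to predict.
    return bin(sbox(plaintext_byte ^ key_byte)).count("1")
```

For example, `obfs_byte_hw(0x00, 0x00)` evaluates to the hamming weight of S-box(0x00) = 0x63, which is 4. The label depends on both the plaintext byte and the key byte, which is exactly what makes key recovery from predicted labels possible.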
A. Feature Selection

Machine learning techniques usually require some form of data preprocessing before the training phase. In our case, each power trace consists of thousands of time-series values, most of which are not correlated with the cipher key. Thus, if every point of a power trace were treated as a feature of the training set, the training phase would take too long and would lead to overfitting and increased variance, no matter which machine learning algorithm is used. To address this, we use point-of-interest selection and principal component analysis (PCA) to reduce the dimensionality of our input data.

1) Point of interest selection: We use two methods to identify d points of interest in our long power traces, which are then used for the template attack described in Section IV-B; a sketch of both methods appears at the end of this subsection.

In the first method, we calculate the average of each point among all traces associated with a specific OBFS hamming weight. Since an 8-bit output has only 9 possible hamming weights, we obtain 9 mean power traces. We then calculate the sum of the absolute pairwise differences among all combinations of these 9 traces to obtain a single "difference trace". Finally, we select the d highest peaks in this trace as our points of interest.

In the second method, the Pearson correlation between each point in the power trace and the OBFS hamming weight is calculated. This metric measures the strength of the linear relationship between two data sets. We choose the d points with the highest Pearson correlation as our points of interest.

2) Principal component analysis: PCA tries to identify the subspace in which the input data lie and to discard redundant dimensions. First, we preprocess the data by zeroing out the mean and rescaling each coordinate to have unit variance. If we want to project the data onto a single dimension, we need to find a unit vector u such that, when the data are projected onto u's direction, the variance of the projected data is maximized, i.e., we need to maximize

$$u^T \left( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right) u .$$

Given the constraint $\|u\|_2 = 1$, the maximizer is the principal eigenvector of $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T}$. To obtain the d-dimensional subspace of the data that maximizes variance, we choose $u_1, \dots, u_d$ to be the top d eigenvectors of $\Sigma$ and set

$$y^{(i)} = \left( u_1^T x^{(i)}, \; \dots, \; u_d^T x^{(i)} \right)^T .$$
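The sketch below implements both point-of-interest methods with NumPy, under the assumption that `traces` is an (n_traces, n_samples) array and `hw` is the per-trace OBFS hamming weight label vector; these names, and the use of a plain top-d selection in place of true peak picking, are our simplifications.

```python
import numpy as np

def poi_by_class_differences(traces, hw, d):
    # Method 1: average the traces of each of the 9 hamming weight classes,
    # sum the absolute pairwise differences of the 9 mean traces, and keep
    # the d largest values (the paper selects the d highest peaks).
    means = np.stack([traces[hw == w].mean(axis=0) for w in range(9)])
    diff = np.zeros(traces.shape[1])
    for i in range(9):
        for j in range(i + 1, 9):
            diff += np.abs(means[i] - means[j])
    return np.argsort(diff)[-d:]

def poi_by_pearson(traces, hw, d):
    # Method 2: Pearson correlation of every time sample with the label.
    t = traces - traces.mean(axis=0)
    h = hw - hw.mean()
    corr = (t * h[:, None]).sum(axis=0) / (
        np.sqrt((t ** 2).sum(axis=0) * (h ** 2).sum()) + 1e-12)
    return np.argsort(np.abs(corr))[-d:]
```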

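A corresponding PCA reduction with scikit-learn, standardizing exactly as described above (again a sketch with names of our choosing; d = 200 matches the largest feature count we evaluate in Section V):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_features(train_traces, test_traces, d=200):
    # Zero the mean and rescale each coordinate to unit variance, then
    # project onto the top-d principal directions fitted on training data.
    scaler = StandardScaler().fit(train_traces)
    pca = PCA(n_components=d).fit(scaler.transform(train_traces))
    return (pca.transform(scaler.transform(train_traces)),
            pca.transform(scaler.transform(test_traces)))
```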
B. Template Attack

Template attacks [2] are a very successful type of profiling side-channel attack. The method uses a multivariate Gaussian distribution as its underlying model. It assumes that, given the OBFS hamming weight of a power trace, the power values x at the d points of interest (Section IV-A1) vary according to a d-dimensional Gaussian distribution with density

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),$$

where $\mu \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{S}^d$ is the covariance matrix.

To generate the model for a particular hamming weight class, we collect the power values of the d points of interest pertaining to that hamming weight. We can then compute a mean vector and a covariance matrix for the power values of that hamming weight class at these points of interest. With 9 possible hamming weights, we generate 9 such Gaussian distributions, each of dimension d. To make a prediction for an input power trace, we simply calculate the probability of its d points of interest under each of the 9 models. The model with the highest probability gives our hamming weight prediction.
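A compact sketch of the profiling and matching steps using SciPy; `poi` is an index array as produced in Section IV-A1, and the variable names are our own:

```python
import numpy as np
from scipy.stats import multivariate_normal

def build_templates(traces, hw, poi):
    # Estimate one (mean vector, covariance matrix) pair per hamming
    # weight class from the profiling traces, restricted to the POIs.
    templates = {}
    for w in range(9):
        x = traces[hw == w][:, poi]
        templates[w] = (x.mean(axis=0), np.cov(x, rowvar=False))
    return templates

def predict_hw(trace, poi, templates):
    # Choose the class whose Gaussian assigns the highest (log-)likelihood
    # to the trace's POI values.
    x = trace[poi]
    scores = {w: multivariate_normal.logpdf(x, mean=mu, cov=cov,
                                            allow_singular=True)
              for w, (mu, cov) in templates.items()}
    return max(scores, key=scores.get)
```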
C. Support Vector Machine

The Support Vector Machine (SVM) family of machine learning algorithms performs classification by finding a decision boundary that maximizes the geometric margin. In other words, given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, m$, in two classes, and a label vector $y \in \{-1, 1\}^m$, we train the SVM by solving the following primal problem:

$$\min_{w, b, \zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \zeta_i$$

subject to $y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i$ and $\zeta_i \geq 0$, $i = 1, \dots, m$.

One problem with any SVM method is that it is designed to perform binary classification. There are two basic approaches to performing multi-class classification with SVMs, "one-against-one" and "one-against-all" [6]. In the "one-against-one" approach, a classifier is constructed for each pair of classes, and the class that receives the most votes is selected; in the event of a tie, the class with the highest aggregate classification score, i.e., the sum of the pairwise classification confidence levels over all the binary classifiers, is selected. For n classes, this method requires $n(n-1)/2$ classifiers. In the "one-against-all" approach, a classifier is constructed for each class, i.e., n classifiers, each making a binary decision between its class and all the rest; the class whose decision function attains the highest value is selected. We use the "one-against-one" method because we have a relatively small number of classes (9) and the $O(n^2)$ number of classifiers is manageable. Furthermore, Hsu and Lin [7] have shown that it provides higher accuracy for a wide range of problems and data sets.

Another problem we have to deal with is that the distribution of hamming weights for numbers of a fixed length is inherently skewed. For example, among 8-bit binary numbers there is only one number with hamming weight 0 (the value 0) and only one with hamming weight 8 (the value 255), whereas the other hamming weight classes occur much more frequently. This observation drives us to adjust our algorithm to ensure balanced precision among hamming weight classes by adding weights to the samples: for each binary classifier that we train, we assign weights inversely proportional to the class frequencies in the input data.

Our first step is a design-space exploration over kernel functions and parameters. An initial attempt to use a simple linear or polynomial kernel is quickly abandoned, since the data set is clearly not separable by such functions. After fine-tuning the parameters of the SVM, the radial basis function (RBF) kernel, $K(x_1, x_2) = \exp\{ -\|x_1 - x_2\|_2^2 / \sigma^2 \}$, proves to be successful. The parameters that we tune are C and $\gamma = 1/\sigma^2$. The C parameter trades off misclassification of training examples against simplicity of the decision surface: the lower the value of C, the smoother the decision surface, while a high value of C means that the labels of all training points, even outliers, are taken strictly into account. The γ parameter defines how far the influence of each training example extends, with low values meaning "far" and high values meaning "close". In our experiments we choose $C = 100$ and $\gamma = 10^{-6}$, the values that seem to provide the highest prediction accuracy. Intuitively, the high value of C and low value of γ mean that every data point (trace) is important and decisive for the outcome of the prediction.

To avoid the issue of the skewed hamming weight distribution altogether, we also try to predict the value of the OBFS directly. More specifically, we target the first bit of the OBFS and use both balanced and unbalanced SVMs with linear and RBF kernels. A sketch of the classifier configuration follows.
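In scikit-learn, the configuration described above amounts to a few lines. This is a sketch under the assumption that `X_train`/`y_train` hold the PCA features and hamming weight labels; SVC trains one-against-one binary classifiers internally for multi-class problems, and `class_weight='balanced'` applies weights inversely proportional to class frequencies.

```python
from sklearn.svm import SVC

def train_svm(X_train, y_train, balanced=True):
    # RBF kernel with the reported hyper-parameters C=100, gamma=1e-6.
    # class_weight=None gives the unbalanced variant for comparison.
    clf = SVC(kernel="rbf", C=100, gamma=1e-6,
              class_weight="balanced" if balanced else None)
    return clf.fit(X_train, y_train)
```

Calling `train_svm(X_train, y_train).score(X_test, y_test)` then yields the kind of overall-accuracy measurement we report in Section V.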

D. Data Set and Evaluation

Our data set consists of 20k power traces collected from a hardware security module¹ while it is actively performing AES encryption. Each trace contains 80k power values along the time axis. We divide the data set into a 17.5k-trace training set and a 2.5k-trace testing set. All traces share the same key and use random plain texts, but our algorithm should work with random keys as well, because the OBFS is a function of both the plain text and the key.

We implement both the template attack and the Support Vector Machine in Python using packages such as numpy and sklearn. Point-of-interest selection is used for the template attack, while principal component analysis is used for the Support Vector Machine.

¹The exact model and specifications of the module are unknown due to a non-disclosure agreement with Google Inc., where the data set was collected.

V. RESULTS

Fig. 2. Example of 25 selected points of interest.

Figure 2 shows an example plot of the points of interest selected by maximizing the differences across OBFS hamming weight classes. This particular plot shows the normalized power trace of the class with hamming weight 4 and 25 points of interest; for clarity, the plot is zoomed into the region where the points of interest are selected. As one might expect, the points of interest are located around the peaks and valleys of the power values, because these points can correspond to power draws at moments when values are being stored to registers, which deviate the most across hamming weight classes.

Fig. 3. Overall accuracy of our algorithms for different numbers of points of interest / feature dimensions. a) Template attack with point-of-interest selection method 1. b) Template attack with point-of-interest selection method 2. c) SVM with PCA. d) Balanced SVM with PCA.

The overall accuracy of our algorithms is shown in Figure 3. The left two sub-bars represent the template attack, with the number of points of interest along the horizontal axis, while the right two sub-bars represent the SVM results, with the number of PCA features along the horizontal axis. Looking at the template attack alone, the two point-of-interest selection methods show similar overall performance, with accuracy around 20% to 25% at 25 points of interest. SVM in general gives better, although not dramatically higher, accuracy. Perhaps due to over-fitting, the accuracy of the template attack becomes abysmal as the number of points of interest grows beyond 50, whereas the performance of the SVM continues to improve as we increase the number of PCA features to 200, even though the accuracy seems to plateau at higher numbers of features.

It is important to note that even though the accuracy is below 40%, actual key retrieval involves predicting the hamming weight over a number of input traces with an identical key and random plain texts. As long as our classifier performs better than a random guess (1/9), we should still be able to recover the correct key given a sufficient number of power traces; a sketch of this aggregation step appears at the end of this section.

Fig. 4. Precision of OBFS hamming weight prediction. a) Template attack with 25 points of interest selected by method 1. b) Template attack with 25 points of interest selected by method 2. c) SVM with 200 PCA features. d) Balanced SVM with 200 PCA features.

The precision (true positives divided by the sum of true positives and false positives) across the 9 classes of OBFS hamming weights is shown in Figure 4. The two point-of-interest selection methods show similar precision across the different hamming weights. On the other hand, the precision of the unbalanced SVM is much higher for hamming weight classes 1 and 4, while the other classes do very poorly. In the balanced SVM case, however, the precision is more even across the different classes: the balanced case sacrifices overall accuracy but retains higher overall precision across hamming weight classes, which is helpful if we want to sustain reasonable accuracy for all classes.

Fig. 5. One-bit prediction accuracy. a) Unbalanced SVM with RBF kernel. b) Balanced SVM with RBF kernel. c) Unbalanced SVM with linear kernel. d) Balanced SVM with linear kernel.

The accuracy of the one-bit prediction with the Support Vector Machine is shown in Figure 5. Since we use the SVM as a binary classifier here, we expect no significant difference between the balanced and unbalanced implementations. The SVM with a linear kernel performs no better than a random classifier, with an accuracy of 50%, possibly because our data set is not linearly separable. The RBF kernel gives an accuracy of 70%, probably owing to the infinite-dimensional feature space it implicitly uses. Such an accuracy allows the prediction of the key if enough attack traces are available.
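For concreteness, one plausible way to perform the aggregation mentioned above is to score every key-byte candidate against the per-trace hamming weight predictions. This sketch is ours, not the paper's, and reuses the `sbox` helper from Section IV:

```python
import numpy as np

def recover_key_byte(plaintext_bytes, predicted_hws):
    # Each candidate key byte k implies a hamming weight HW(sbox(pt ^ k))
    # for every trace; count how often that matches the classifier output.
    # With better-than-random predictions, the true key byte accumulates
    # the highest score as the number of traces grows.
    scores = np.zeros(256)
    for pt, hw in zip(plaintext_bytes, predicted_hws):
        for k in range(256):
            if bin(sbox(pt ^ k)).count("1") == hw:
                scores[k] += 1
    return int(np.argmax(scores))
```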

VI. CONCLUSION AND FUTURE WORK

In our work we use multivariate Gaussian distribution modeling and Support Vector Machines to predict either the hamming weight or the value of the OBFS directly. To our knowledge, this is the first time that a direct prediction of the hamming weight has been attempted. We have shown that it is not a trivial task to predict the hamming weight from a single attack trace. It is worth noting that the inherent skew of the hamming weight distribution makes it harder to sustain high accuracy while taking low-frequency hamming weights into account. To alleviate this problem, we implemented the one-bit predictor, which does not have to deal with hamming weights at all. The performance of this predictor has proved to be on par with similar attacks against the DES encryption standard [5]. This work can be expanded in two different directions: improving the accuracy of the prediction of the hamming weight
