A Toolkit For Detecting Technical Surprise


SANDIA REPORT
SAND2010-7392
Unlimited Release
Printed October 2010

A Toolkit for Detecting Technical Surprise

Michael W. Trahan, Mark C. Foehse

Prepared by
Sandia National Laboratories
Albuquerque, New Mexico 87185 and Livermore, California 94550

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Approved for public release; further dissemination unlimited.

Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation.

NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government, nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof, or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors.

Printed in the United States of America. This report has been reproduced directly from the best available copy.

Available to DOE and DOE contractors from
U.S. Department of Energy
Office of Scientific and Technical Information
P.O. Box 62
Oak Ridge, TN 37831
Telephone: (865) 576-8401

Available to the public from
U.S. Department of Commerce
National Technical Information Service
5285 Port Royal Rd.
Springfield, VA 22161
Telephone: (800) 553-6847

SAND2010-7392
Unlimited Release
Printed October 2010

A Toolkit for Detecting Technical Surprise

Michael W. Trahan
Emergent Threats Department

Mark C. Foehse
Proliferation Sciences Department

Sandia National Laboratories
P.O. Box 5800
Albuquerque, New Mexico 87185-MS1207

Abstract

The detection of a scientific or technological surprise within a secretive country or institute is very difficult. The ability to detect such surprises would allow analysts to identify the capabilities that could be a military or economic threat to national security. Sandia’s current approach utilizing ThreatView has been successful in revealing potential technological surprises. However, as data sets become larger, it becomes critical to use algorithms as filters along with the visualization environments.

Our two-year LDRD had two primary goals. First, we developed a tool, a Self-Organizing Map (SOM), to extend ThreatView and improve our understanding of the issues involved in working with textual data sets. Second, we developed a toolkit for detecting indicators of technical surprise in textual data sets. Our toolkit has been successfully used to perform technology assessments for the Science & Technology Intelligence (S&TI) program.

ACKNOWLEDGMENTS

This work was supported by the Laboratory Directed Research and Development program at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under Contract DE-AC04-94AL85000.

CONTENTS

1. Introduction
2. Building a Tool: Self-Organizing Maps
2.1. Data Pre-Processing
2.2. Training
2.3. Metrics
2.4. Visualization
3. Building a Toolkit
3.1. Sandia-Developed Tools
3.1.1 Stanley-Based Tools
3.1.2 Titan-Based Tools
3.2. Oak Ridge-Developed Tools
3.3. COTS (Commercial Off The Shelf) Tools
3.3.1 COTS Analysis and/or Visualization Tools
3.3.2 COTS Support Tools
3.4. Open Source Tools
3.4.1 Gephi
3.4.2 KNIME
3.4.3 ORA
4. Future Work
5. Conclusions
6. References
Distribution

EQUATIONS

Equation 1. Calculate the Distance Between an Input Vector and a Node’s Weight Vector.
Equation 2. Calculate the BMU’s Neighborhood Size.
Equation 3. Adjust the Weights of the BMU and Its Neighbors.
Equation 4. Update the Learning Rate.
Equation 5. Calculate the SOM’s Average Quantization Error.
Equation 6. Calculate the SOM’s Average Topology Preservation Error.
Equation 7. Log-Entropy.
Equation 8. Cosine Similarity.
Equation 9. Term Frequency.
Equation 10. Term Frequency-Inverse Document Frequency.
Equation 11. Term Frequency-Inverse Corpus Frequency.

FIGURES

Figure 1. CSV2SOM – main window.
Figure 2. CSV2SOM – raw data window.
Figure 3. CSV2SOM – pre-processed data window.
Figure 4. CSV2SOM – define data set window.
Figure 5. SOM PAK – typical commands for training a basic SOM.
Figure 6. SOM PAK – typical commands for training an optimized SOM.
Figure 7. SOM PAK – Umat plot.
Figure 8. SOM PAK – typical commands for visualizing a SOM.
Figure 9. Data Trace Tool – main window.
Figure 10. LDRDView – main window.
Figure 11. P2 – the main window.
Figure 12. P2 – the Document Text view.
Figure 13. P2 – the Document Clusters view.
Figure 14. P2 – the Corpus Map window (tree-ring layout).
Figure 15. P2 – the Corpus Map window (“force-directed” graph layout).
Figure 16. P2 – the Entities view.
Figure 17. P2 – the Hotlist view.
Figure 18. P2 – the Hotlist Map view of entity-to-document relations.
Figure 19. ThreatView – main window.
Figure 20. Piranha – plot of clustered documents.
Figure 21. dtSearch – start-up window.
Figure 22. dtSearch – creating an index.
Figure 23. dtSearch – a simple search.
Figure 24. dtSearch – search terms highlighted in context.
Figure 25. dtSearch – a complex search.
Figure 26. dtSearch – results of a complex query.
Figure 27. Analyst's Notebook – a graph showing relationships between Osama Bin Laden and the 9/11 attackers.
Figure 28. Analyst's Notebook – a theme line showing events ordered by time.
Figure 29. TextChart – text document window.
Figure 30. Google Trends – “metamaterials.”
Figure 31. Google Insights for Search – "metamaterials."
Figure 32. Beyond Compare – home view.
Figure 33. Beyond Compare – comparing folder contents.
Figure 34. Beyond Compare – text file comparison.
Figure 35. Beyond Compare – synchronizing folders.
Figure 36. Beyond Compare – comparing binary files (the data is displayed in hexadecimal format).
Figure 37. Beyond Compare – comparing data files.
Figure 38. Beyond Compare – comparing image files.
Figure 39. Camtasia Studio – edit window.
Figure 40. MindManager – main window.
Figure 41. MindView – main window.
Figure 42. SnagIt – main window.
Figure 43. SnagIt – editor window.
Figure 44. ORA – main window.
Figure 45. ORA – a network visualization.
Figure 46. ORA – results of Newman's community finding algorithm.

API        Application Programming Interface
BMU        Best Matching Unit
COTS       Commercial Off The Shelf
CSV        Comma Separated Value
DOE        Department of Energy
DTT        Data Trace Tool
GUI        Graphical User Interface
HPC        High Performance Computing
LDRD       Laboratory Directed Research and Development
LSA        Latent Semantic Analysis
NER        Named Entity Recognition
NLTK       Natural Language Toolkit
S&TI       Science & Technology Intelligence
SNA        Social Network Analysis
SNL        Sandia National Laboratories
SOM        Self-Organizing Map
STANLEY    Sandia Text AnaLysis Extensible LibrarY
TF-ICF     Term Frequency-Inverse Corpus Frequency
TF-IDF     Term Frequency-Inverse Document Frequency
VTK        Visualization ToolKit
WWW        World Wide Web


1. INTRODUCTION

The detection of a scientific or technological surprise within a secretive country or institute is very difficult. The ability to detect such surprises would allow analysts to identify the capabilities that could be a military or economic threat to our national security. Sandia’s current approach utilizing ThreatView has been successful in revealing potential technological surprises. However, ThreatView has limitations.

ThreatView presents data visually, which allows analysts to identify trends, patterns, and relationships that otherwise are very difficult to detect. However, this detection is dependent upon the analyst: some analysts see the patterns; some analysts miss the patterns (false negatives); and still other analysts see patterns that are not real (false positives). In addition, ThreatView uses a single algorithm (LSA) to cluster the data set. There is no way to compare its results to an alternative clustering or to measure the quality of the clustering. We have addressed these limitations by developing a data mining toolkit, which can be used independently or as an extension to ThreatView.

As data sets become larger, it becomes critical to use algorithms as filters along with the visualization environments. Our toolkit provides a suite of algorithms to filter the data so that analysts are presented with less, but more relevant, data, increasing the chance of detecting a scientific or technological surprise.


2. BUILDING A TOOL: SELF-ORGANIZING MAPS

Our first effort was to build a tool to extend ThreatView and improve our understanding of the issues involved in working with textual data sets. We chose to implement a Self-Organizing Map (SOM).

The self-organizing map (SOM) is a type of artificial neural network first described by Professor Teuvo Kohonen of the Helsinki University of Technology, Laboratory of Computer and Information Science, Neural Networks Research Centre, in the early 1980s. The SOM provides a way of representing multidimensional data in a two-dimensional space, while maintaining the data's topological relationships. SOMs are frequently used as visualization aids. They can make it easy for us to see relationships between vast amounts of multidimensional data. SOMs have been successfully used in many applications, including: speech recognition (Kohonen’s original area of research); bibliographic classification; image browsing systems; medical diagnosis; seismic data interpretation; data compression; and environmental modeling.

SOMs have many advantages. They are easy to understand (especially compared to most other neural network architectures). They work very well on a large number of problem classes, and they are adaptive: they cannot be over-trained.

There are, however, some disadvantages to SOMs. It can be hard to get the “right” data: you must have a value for every dimension of every input vector. Every SOM is different and finds different similarities in the data. In the final map, every vector is surrounded by similar vectors; however, similar vectors are not always near each other. And, especially during training, SOMs are computationally expensive.

2.1. Data Pre-Processing

The data for this application consists of records of scientific and technical articles. The data is provided as a Microsoft Excel CSV-format file. Most of the fields consist of natural language text. This text must be pre-processed into a numeric form that is usable by the self-organizing map (SOM).

The pre-processor, called CSV2SOM, was written in the Python scripting language. It reads the records from the CSV-format data file and allows the user to generate a set of training and testing data for the SOM. The graphical user interface (GUI) is built with the wxPython toolkit (wxPython is a wrapper for the wxWidgets cross-platform GUI API, which is written in C++). The natural language text is processed using the Natural Language Toolkit (NLTK). See Figure 1, Figure 2, Figure 3, and Figure 4 for screen shots of the pre-processor.

NLTK provides the user with information about the data set. For each field, the NLTK parses the data to determine the number of empty records, the number of tokens, the number of unique words, the diversity score, the number of common words, and the number of unusual words.

Figure 1. CSV2SOM – main window.

Figure 2. CSV2SOM – raw data window.

Figure 3. CSV2SOM – pre-processed data window.

Figure 4. CSV2SOM – define data set window.

To convert the data to a form usable by the SOM, the NLTK allows the user to: reduce the words to their head words; reduce the words to their stems (using the Lancaster or Porter stemmers); remove stop words (of, the, etc.); remove common words; and/or remove unusual words. In addition, the user can choose to remove high frequency words and/or low frequency words.

Finally, the user can specify the percentage of the data set to be used for training the SOM (empty records are ignored); the remainder of the data set is automatically assigned to testing the SOM. In the training and testing data sets, each record is represented as a vector of dimension n, where each component represents the Term Frequency-Inverse Document Frequency (TF-IDF) of the associated word.

2.2. Training

For this application, we used the public-domain SOM PAK software package. SOM PAK is written in C and is provided by the Helsinki University of Technology, Laboratory of Computer and Information Science, N
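The record-to-vector conversion described at the end of Section 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the CSV2SOM implementation: the stop-word list is a tiny hypothetical subset, stemming is left out, the weighting uses the common textbook form tf × ln(N/df) (the report's own Equation 10 is not reproduced in this excerpt), and the names `tokenize`, `tfidf_vectors`, and `split_train_test` are invented for the example.

```python
import math
import re
from collections import Counter

# Illustrative stop-word subset; a real list would be much longer.
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(records):
    """Return (vocabulary, vectors); empty records are ignored."""
    docs = [tokenize(r) for r in records if r.strip()]
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of records that contain each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        if total == 0:
            # Record held only stop words: all-zero vector.
            vectors.append([0.0] * len(vocab))
            continue
        # Assumed weighting: relative term frequency times ln(N / df).
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def split_train_test(vectors, train_fraction):
    """Use the first train_fraction of the vectors for training, the rest for testing."""
    cut = int(round(len(vectors) * train_fraction))
    return vectors[:cut], vectors[cut:]


records = ["The laser of the future", "", "Laser and radar", "Radar in the fog"]
vocab, vectors = tfidf_vectors(records)
train, test = split_train_test(vectors, 0.67)
```

Every vector has the same dimension n (the vocabulary size), which satisfies the SOM's requirement of a value for every dimension of every input vector; words absent from a record simply receive a weight of zero.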
