UvA-DARE (Digital Academic Repository) SMTP: Stedelijk .


UvA-DARE (Digital Academic Repository)

SMTP: Stedelijk Museum Text Mining Project
Smeets, J.; Scholtes, J.C.; Rasterhoff, C.; Schavemaker, M.

Publication date: 2016
Document Version: Final published version
Published in: Digital Humanities 2016

Citation for published version (APA):
Smeets, J., Scholtes, J. C., Rasterhoff, C., & Schavemaker, M. (2016). SMTP: Stedelijk Museum Text Mining Project. In W. Eder, & J. Rybicki (Eds.), Digital Humanities 2016: Conference abstracts: Jagiellonian University & Pedagogical University, Kraków, 11-16 July 2016 (pp. 683-685). European Association for Digital Humanities [etc.]. http://dh2016.adho.org/abstracts/270

General rights
It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)
Download date: 15 Jun 2021

The European Association for Digital Humanities (EADH)
Association for Computers and the Humanities (ACH)
Canadian Society for Digital Humanities / Société canadienne des humanités numériques (CSDH/SCHN)
centerNet
Australasian Association for Digital Humanities (aaDH)
Japanese Association for Digital Humanities (JADH)

Digital Humanities 2016
Conference Abstracts
Jagiellonian University & Pedagogical University
Kraków, 11–16 July 2016

Kraków 2016

SMTP: Stedelijk Museum Text Mining Project

Jeroen Smeets
smeetsjeroen@hotmail.com
Maastricht University, Netherlands, The

Johannes C. Scholtes
Maastricht University, Netherlands, The

Claartje Rasterhoff
C.Rasterhoff@uva.nl
CREATE, University of Amsterdam, Netherlands, The

Margriet Schavemaker
M.Schavemaker@stedelijk.nl
Stedelijk Museum Amsterdam, Netherlands, The

Introduction

This paper addresses how text-mining, machine learning and information retrieval algorithms from the field of artificial intelligence can be used to analyze Art Research archives and conduct (art-)historical research. To gain quick insight into the archive, two aspects are focused on: relations between groups of people using community detection, and global content changes over time using topic modeling. For such archives pre-tagged ground-truth collections are generally not available, and the archives are often too large, geographically distributed, and not always available in digital formats to build such a ground-truth at reasonable costs. To develop and test the validity and relevance of existing tools, close collaboration was established between the AI researchers, museum staff, and researchers in CREATE, a digital humanities project that investigates the development of cultural industries in Amsterdam over the course of the last five centuries.

Data

The research draws on two datasets. The principal dataset is the digitized archive of the Stedelijk Museum Amsterdam, a renowned international museum dedicated to modern and contemporary art and design. The archive of the Stedelijk Museum Amsterdam contains documents from the period 1930-1980. The corpus is a static collection of approximately 160,000 text documents that were digitized using OCR. The second dataset is drawn from Delpher, developed by the Koninklijke Bibliotheek Nederland (2015). Delpher provides a collection of digitized newspapers, books and magazines that is available for research. A selection of newspapers was made that is used as an additional dataset for this project. Only articles from 1930-1980 that resulted from the query "Stedelijk Museum" AND "Amsterdam" were used, forming a set of 18,290 articles.

Methodology

The following methodology uses two approaches to obtain a quick and detailed overview of the content of a digitized archive that contains unstructured information. The first one focuses on the relations between named entities and aims at finding communities in the relation network. The second approach uses time based topic modeling to get an overview of content changes over time. Finally, a name extraction method is presented that is able to handle multiple causes of name variations.

Relation networks and community detection

In its most basic form, a relation between two named entities can be said to exist when they occur together in the same document. The strength of a relation can be characterized by the number of documents in which both named entities occur. When all the co-occurrences are found, a relation network can be constructed.

In addition, sentiment analysis can be done to further characterize a relation. A sentiment score is assigned to each document, indicating the sentiment content of the document. No distinction is made between positive and negative sentiment polarity. The hypothesis is that relations between individuals with a high sentiment are more interesting than relations with a low sentiment. This is because sentiments around trigger-events are often higher than around common-day events. A lexicon based approach is used with lists of language specific sentiment words. The sentiment score of a document is then given by the sigmoid of the count of the sentiment words in the document, normalized by the number of words in the document.

Finally, community detection algorithms can be applied to the relation network. These types of algorithms aim at finding clusters of entities that have dense connections between members of the clusters and sparse connections with members of other clusters (Fortunato, 2010).
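
As a rough illustration of the two measures described above, the following Python sketch counts co-occurrences per entity pair and averages a lexicon based sentiment score over the documents a pair shares. It is a sketch under assumptions, not the project's implementation: the toy lexicon, the function names and the entity labels are hypothetical, and the formula is read as the sigmoid applied to the sentiment-word count divided by the document length, which is only one possible reading of the text.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy lexicon; the paper uses language specific sentiment word lists.
SENTIMENT_WORDS = {"scandal", "protest", "triumph", "outrage"}


def sentiment_score(tokens):
    """Sigmoid of (sentiment-word count / document length); polarity is ignored."""
    if not tokens:
        return 0.5  # sigmoid(0): an empty document carries no sentiment evidence
    ratio = sum(t.lower() in SENTIMENT_WORDS for t in tokens) / len(tokens)
    return 1.0 / (1.0 + math.exp(-ratio))


def build_relations(docs):
    """docs: iterable of (entities, tokens) pairs, one per document.

    Returns {(entity_a, entity_b): (strength, mean_sentiment)}, where strength
    is the number of documents in which both entities occur.
    """
    scores = defaultdict(list)
    for entities, tokens in docs:
        score = sentiment_score(tokens)
        # Each unordered pair is counted once per document.
        for pair in combinations(sorted(set(entities)), 2):
            scores[pair].append(score)
    return {pair: (len(s), sum(s) / len(s)) for pair, s in scores.items()}


docs = [
    (["artist_a", "director_x"], "the protest was a scandal".split()),
    (["director_x", "artist_a", "artist_b"], "a quiet opening".split()),
]
relations = build_relations(docs)
print(relations[("artist_a", "director_x")])
```

Under this reading the sigmoid input lies between 0 and 1, so document scores fall between 0.5 and roughly 0.73; a different normalization would change the range but not the structure of the computation.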
The relation weight measure that is used to calculate the communities is taken as the product of the strength of the relation, i.e. the number of documents in which both entities occur, and the average sentiment score of the documents of a relation. It was found that combining these two measures resulted in more meaningful communities.

Time based Topic Modeling

In the next approach, topic modeling algorithms are applied to analyze the information content and its evolution over time. Topic modeling tries to discover the underlying thematic structure in a collection of documents. Non-Negative Matrix Factorization (NMF) is used as a tool for topic modeling (Arora et al., 2012). NMF is an unsupervised method where a matrix is approximated by the product of two low rank non-negative matrices. The extracted semantic feature vectors have only non-negative values and are sparse, so they are easily interpretable. Furthermore, NMF is shown to generate more consistent results over multiple runs (Choo et al., 2013), compared to other tools used for topic modeling such as LDA (Blei et al., 2003).

The approach suggested in (Vaca et al., 2014) uses a time-based collective matrix factorization based on NMF and is used in this project. It extends NMF by introducing a topic transition matrix that makes it possible to track topics as they emerge, evolve and fade over time.

Name Extraction

The following method was used to extract named entities from a collection of documents in order to build the relation network. It handles different causes of name variations such as OCR induced errors commonly found in digitized document collections, spelling mistakes, name abbreviations and first and last name combinations.

The method makes use of lists of name variations. Starting from a set of names extracted from a name database, such as RKDartists& (RKD, 2015), the document collection is searched for possible name variations. These variations are found by searching for the last name using a fuzzy search. The similarity between the group of tokens around the found last name and the original name is then calculated as a similarity score. The similarity score calculation is based on the idea described in (Song and Chen, 2007), which uses an n-gram set matching technique. The lists of name variations can then be evaluated manually, or a threshold on the similarity score can be used to identify name variations that correspond to the original name. The method using a threshold of 0.9 on the similarity score was tested on 50 randomly chosen names. The average precision was found to be 81 percent.

Results

A relation network was constructed for the document collection of the archive of the Stedelijk Museum Amsterdam. Only artists with the graphic artist qualification in the RKDartists& database were used. The methods were implemented using available open source software libraries such as the Apache Lucene text search engine library (The Apache Software Foundation, 2015) and the Gephi platform (Bastian et al., 2009). The standard community detection feature in Gephi was used, which is based on the Louvain method (Blondel et al., 2008). The result is shown in Figure 1. The color of the relation between the nodes indicates the average sentiment score of the relation, ranging from blue (neutral) to red (high sentiment content). Communities such as group exhibitions, art movements or a group of artists closely related to the museum director could be identified with the help of a museum expert.

Figure 1: Found communities for graphic artists in the archive of the Stedelijk Museum

The time based topic modeling algorithm suggested in (Vaca et al., 2014) was implemented in MATLAB and Java. The algorithm was applied to both the archive of the Stedelijk Museum Amsterdam and newspaper articles from the Delpher database. The results are visualized over time in the form of stacked topic rivers (Wei et al., 2010), shown in Figure 2 and Figure 3. Several exhibitions and events could be identified and are annotated on the chart.

Figure 2: Time based topic modeling for the archive of the Stedelijk Museum Amsterdam

Figure 3: Time based topic modeling for Delpher newspaper articles

Conclusion

This paper discusses two approaches to gain insight into a digitized archive. Relation networks of persons with community detection are considered, relying on a robust name extraction method. Furthermore, the evolution of content over time can be explored using time based topic modeling.

For the humanities researchers in this project, the main aim was to assess the research potential of computational analysis of digitized art archives in general, and the Stedelijk Museum in particular. Two types of preliminary research questions were developed to do so. The first type had to do with identifying patterns of change and continuity, across time and place. These include for instance tracing the position of the Stedelijk Museum as an intermediary in Dutch design industries, or the development of the Stedelijk Museum as an increasingly international player. The second type of question is less concerned with general historical patterns, and more with specific art-historical research questions, regarding for instance (networks of) particular artists, artworks or exhibitions.

But before we could start asking such questions of digitized art-historical archives, the quality and accessibility of the texts needed to be established. Secondly, specific methods needed to be explored and adapted in order to clean, identify, retrieve, extract, and structure the texts. The first results presented in this paper demonstrate that even though the results may not be clean at the first try or capture all historical nuance, they do help archives to open up and show unexpected relationships and patterns, to answer specific questions, and to get connected with other relevant sources, such as RKDartists and Delpher.
The community detection in relation with sentiment mining, the topic modeling and name extraction method developed in this project therefore provide a solid basis for the next step in assessing the research potential of art-historical archives: developing in-depth case studies, again in close collaboration with art-historians and historians, allowing the archive to speak up in unprecedented ways, offering access to hidden story lines that subvert and augment prevailing historical narratives.

Bibliography

Arora, S., Ge, R. and Moitra, A. (2012). Learning topic models - going beyond SVD. Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on. IEEE, pp. 1–10.

Bastian, M., Heymann, S. and Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. ICWSM, 8: 361–62.

Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3: 993–1022.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R. and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10): P10008.

Choo, J., Lee, C., Reddy, C. K. and Park, H. (2013). Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. Visualization and Computer Graphics, IEEE Transactions on, 19(12): 1992–2001.

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3): 75–174.

Koninklijke Bibliotheek Nederland (2015). Delpher - Boeken Kranten Tijdschriften. http://www.delpher.nl/ (accessed 1 November 2015).

RKD (2015). Netherlands Institute for Art History. https://rkd.nl/en/ (accessed 1 November 2015).

Song, S. and Chen, L. (2007). Similarity joins of text with incomplete information formats. Advances in Databases: Concepts, Systems and Applications. Springer, pp. 313–24.

The Apache Software Foundation (2015). Apache Lucene - Welcome to Apache Lucene. http://lucene.apache.org/ (accessed 1 November 2015).

Vaca, C. K., Mantrach, A., Jaimes, A. and Saerens, M. (2014). A time-based collective factorization for topic discovery and monitoring in news. Proceedings of the 23rd International Conference on World Wide Web. ACM, pp. 527–38.

Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M. X., Qian, W., Shi, L., Tan, L. and Zhang, Q. (2010). Tiara: a visual exploratory text analytic system. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 153–62.
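
The threshold step of the name extraction method can be illustrated with a small Python sketch. It is a stand-in under stated assumptions and not the similarity measure of Song and Chen (2007): it compares character bigram sets with the Jaccard coefficient, and the function names and example names are hypothetical.

```python
def char_ngrams(text, n=2):
    """Set of character n-grams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets (assumed here for illustration)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def match_variations(name, candidates, threshold=0.9):
    """Keep the candidate strings whose similarity to `name` reaches the threshold."""
    return [c for c in candidates if ngram_similarity(name, c) >= threshold]


# Candidate token groups as they might be found around a fuzzy last-name hit.
candidates = ["Jan Janssen", "jan jansen", "Piet Klaassen"]
print(match_variations("Jan Jansen", candidates))
print(match_variations("Jan Jansen", candidates, threshold=0.8))
```

With this particular measure a single inserted character already drops a short name below 0.9, which shows how sensitive the result is to both the threshold and the choice of similarity function.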
