I-analyzer

Welcome to I-analyzer.

I-analyzer was developed by the Digital Humanities Lab, Utrecht University (UU), at the request of I-Lab, UU.

With I-analyzer you can search various corpora by querying their content and metadata. You can also visualize the results.

Corpora now available:

- The Dutch Newspaper corpus from the Royal Library (free access for issues published before 1878)
- The Times Newspaper archive 1785-2010 (access within UU)
- The Guardian (access within UU)
- Dutch Banking annual reports of financial institutions (access for research group)
- Thesaurus Musicarum Latinarum (Online Archive of Music Theory in Latin, 3rd to 17th century) (free access)

More corpora will follow.

Want to add your own corpus? Contact the Digital Humanities Lab coordinator: J.dekruif@uu.nl

  • © Copyright and Access:

    I-analyzer contains open source materials as well as materials that are under copyright protection. Some materials are therefore freely available to all. Others are accessible only to UU staff and students, or only to individual research groups. The access level is indicated next to the corpus name in the list. UU staff and students can log in to corpora with their Solis-ID and Solis password. Individual research groups will be given individual accounts. It is forbidden to take copyright-protected material out of I-analyzer, in any form, without the express written permission of the rightsholder.

    The Times will serve as an example for this manual. The steps for searching, visualizing, etc. are basically the same for each corpus. Filtering depends on the available metadata.

  • Privacy
    The DHLab processes personal data belonging to I-analyzer users. We do this in order to offer the best possible service.
    We process the following personal data:
    1. Name and email address
    2. The queries you submit
    We don’t store your data for longer than is necessary for the purposes for which the details are processed. If you want your personal data removed, please contact us.
    We have taken appropriate technological and organizational measures to protect your data. We keep up to date with the current situation regarding data protection. We will do everything possible to prevent the loss or unlawful use of personal data.

    By using I-analyzer you accept these terms.

    If you have questions about our processing of your personal data, please contact:

    dr. José de Kruif, Utrecht University | Digital Humanities Lab

    Temporary address: Kromme Nieuwegracht 80, 3512 HM Utrecht
    Tel. +31 (0)30 253 7867

  • Step 1

    Type your query in the search box. For information on the search syntax, click the question mark to the left of the search box or read the search syntax summary in this manual.

  • The database used in I-analyzer is organised using Elasticsearch. Queries must therefore use terms and operators that Elasticsearch understands. They are explained in more detail in the Simple Query String manual of Elasticsearch itself.

    For your convenience, a summary of the search operators is shown below.

    Simple Query String Syntax

    The search method supports the following operators:

    Operator  Description
    +         means AND (bank AND assets)
    |         means OR (bank OR assets)
    -         means NOT (NOT assets)
    "         allows the search for an entire phrase: "the assets of the bank"
    *         wildcard for any number of characters; only allowed after other characters (asset* is allowed, *asset is not)
    ~N        after a term: the fuzziness, i.e. how many characters are allowed to differ. So bank~1 also searches for bang, sank, dank, etc.
    ~N        after a phrase: how many words may differ
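
    To make the term fuzziness in the table concrete, the following is a minimal sketch of counting how many characters differ between two words. It uses plain Levenshtein edit distance (insertions, deletions and substitutions); Elasticsearch's own matching may treat some cases differently, for example swapped adjacent letters. The function names and the word list are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def fuzzy_matches(term: str, candidates: list[str], max_edits: int) -> list[str]:
    """Return the candidates within max_edits of term, like term~N."""
    return [w for w in candidates if edit_distance(term, w) <= max_edits]


# Hypothetical vocabulary, for illustration only.
words = ["bank", "bang", "sank", "dank", "banks", "bond", "tram"]
matches = fuzzy_matches("bank", words, max_edits=1)
print(matches)
```

    In this sketch "bond" is not matched because it differs from "bank" in two characters, so it would only be found with bank~2.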

    Symbols such as | and + are reserved characters. If you want to search for text containing these characters then they should be escaped by prefixing them with \.
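
    If you build queries in a script before pasting them into the search box, the escaping described above can be sketched as follows. The set of reserved characters here is an assumption: it contains the operators listed in this manual plus parentheses and the backslash itself. The helper name escape_query_text is hypothetical.

```python
# Assumed reserved characters: the operators from the table above,
# plus parentheses and the backslash used for escaping.
RESERVED = set('+|-"*()~\\')


def escape_query_text(text: str) -> str:
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


escaped = escape_query_text("A+B|C")
print(escaped)
```

    Only escape text that should be matched literally; operators that you do want to act as operators must stay unescaped.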

    By default the search combines all terms using OR. This means that when you type Tram Bike, the search returns documents containing Tram and/or Bike. This also has implications for the - operator: Tram Bike -Car returns documents containing Tram or Bike, or any document not containing Car. A more expected result can be obtained with (Tram Bike) +-Car, which returns all hits containing Tram or Bike and leaves out all those containing Car.

    Be Careful with Spaces

    Adding or removing a space can change the results of your query. For example, searching for +- term is different from searching for +-term. It might be necessary to escape a space (also by placing a \ in front of it).
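
    The effect of the default OR on the - operator can be illustrated with ordinary set operations. This is a toy model, not how Elasticsearch is implemented: the five documents and their contents are invented, and each document is reduced to a set of words.

```python
# Invented mini-corpus: document id -> words it contains.
docs = {
    1: {"tram"},
    2: {"bike", "car"},
    3: {"car"},
    4: {"train"},
    5: {"tram", "car"},
}
all_ids = set(docs)


def containing(word: str) -> set[int]:
    """Ids of the documents that contain the word."""
    return {doc_id for doc_id, words in docs.items() if word in words}


# Tram Bike -Car with default OR: Tram OR Bike OR (NOT Car)
loose = containing("tram") | containing("bike") | (all_ids - containing("car"))

# (Tram Bike) +-Car: (Tram OR Bike) AND (NOT Car)
strict = (containing("tram") | containing("bike")) & (all_ids - containing("car"))

print(sorted(loose))   # includes document 4, which mentions neither Tram nor Bike
print(sorted(strict))  # only documents with Tram or Bike and without Car
```

    In this toy corpus the first query also returns the document about trains, simply because it does not mention Car, while the second query returns only the document that has Tram or Bike and no Car.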

  • Step 2

    Querying The Times for “Winston Churchill” results in a list of 15204 hits. The first results are shown. By default results are sorted by Relevance.
    You can change this and sort by:
    * Publication date
    * Issue number
    * Image count
    * OCR confidence
    * etc.

  • At the bottom of the results screen, a button can be used to expand the results list.

  • Step 3

    You can filter on the metadata that are available for your corpus. In the case of The Times:

    * Publication dates (after ... before ...)
    * Page type: Standard or Supplement
    * Whether your query appears on the cover
    * Newspaper section (Arts and Entertainment, Business and Finance, etc.)
    * Illustrations (Cartoons, Photographs, etc.)

  • Step 4

    Do you need to investigate your results further offline? Then download a CSV file that contains your subset. If needed, you can specify which fields should be included in the file.
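
    Once downloaded, the CSV file can be processed with any tool. As a minimal sketch, the following counts hits per year using only the Python standard library. The column names (date, title, category) and the sample rows are placeholders; the actual columns depend on the corpus and on the fields you selected for download.

```python
import csv
import io
from collections import Counter

# Placeholder data standing in for a downloaded file; real column
# names and values depend on the corpus and the fields you selected.
SAMPLE_CSV = """date,title,category
1940-05-11,Example article A,News
1940-06-05,Example article B,News
1945-07-27,Example article C,Politics
"""


def hits_per_year(csv_file, date_field: str = "date") -> dict[str, int]:
    """Count rows per year, assuming dates start with a four-digit year."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[date_field][:4] for row in reader)
    return dict(sorted(counts.items()))


result = hits_per_year(io.StringIO(SAMPLE_CSV))
print(result)
```

    For a real download, replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="", encoding="utf-8"), where the file name and encoding are again assumptions.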

  • Step 5

    Visualize results.
    I-analyzer allows you to visualize your query results in:
    * A timeline by publication date
    * A word cloud of article titles
    * A word cloud of article contents
    * A histogram of article categories
    * A histogram of illustrations
    All graphs can be displayed in absolute numbers or percentages.
    All graphs can also be displayed as a table.

  • Related Words Query

    Besides word clouds and timelines, it is also possible to request a related words query. The resulting graph shows the words that appear in the same contexts as the query term over the whole dataset, and how similar they are to the query term within each time window.

    The similarity score is based on singular value decomposition of a word-document matrix, for which all word counts have been transformed with positive pointwise mutual information (Levy et al., 2015). As in latent semantic analysis, the vectors for each word in this matrix can be compared. If words appear in the same context (i.e., in the same "topics"), their vectors are more alike, which is reflected in a higher cosine similarity.

    To show this graph, one matrix has been computed for the whole corpus, from which the five words most similar to the query term have been selected (if one of these words is the query term itself, it is excluded). Separate matrices have been computed for consecutive time frames of one decade (the mean of which is shown on the x-axis), and for each decade, the similarity of the query term to the overall most similar words is computed and shown. This means that the graph does not necessarily show the words that are most similar within one specific time frame.

    Please also note that this visualization always represents the texts from the whole corpus (and its subsets per time frame), which means it is not affected by selections you may have applied in filters.
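
    The two central ingredients of this score, positive pointwise mutual information (PPMI) and cosine similarity, can be sketched in a few lines. This is a simplified illustration, not the code I-analyzer runs: it skips the singular value decomposition step and compares the PPMI-weighted rows directly, and the word counts are invented.

```python
import math

# Invented word-document counts: word -> count in each of four documents.
counts = {
    "bank":   [4, 3, 0, 0],
    "assets": [3, 4, 0, 1],
    "tram":   [0, 0, 5, 2],
    "bike":   [0, 1, 3, 4],
}


def ppmi_vectors(counts):
    """Replace each count by max(0, log(P(w,d) / (P(w) * P(d))))."""
    total = sum(sum(row) for row in counts.values())
    word_totals = {w: sum(row) for w, row in counts.items()}
    n_docs = len(next(iter(counts.values())))
    doc_totals = [sum(row[d] for row in counts.values()) for d in range(n_docs)]
    vectors = {}
    for word, row in counts.items():
        vec = []
        for d, c in enumerate(row):
            if c == 0:
                vec.append(0.0)  # log(0) is undefined; PPMI is taken as 0
                continue
            pmi = math.log(c * total / (word_totals[word] * doc_totals[d]))
            vec.append(max(0.0, pmi))
        vectors[word] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(term, vectors, top_n=5):
    """Other words ranked by cosine similarity to term (term itself excluded)."""
    scores = [(w, cosine(vectors[term], vec))
              for w, vec in vectors.items() if w != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


vectors = ppmi_vectors(counts)
neighbours = most_similar("bank", vectors)
print(neighbours)
```

    In this toy example "assets" ranks highest for "bank" because the two words occur in the same documents. Repeating the cosine computation on a separate matrix per decade, for the same fixed list of neighbours, would give one line per neighbour as in the graph described above.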

  • Step 6

    Need more information?
    Contact: Digital Humanities Lab, dr. J. de Kruif, Drift 10, room 3.08, 3512 BS Utrecht

Last modified: 13/05/2019
