Manuals
IAnalyzer
Welcome to I-analyzer.
I-analyzer was developed by @UU at the request of I-Lab @ UU.
With I-analyzer you can search various corpora with queries on content and metadata. You can also visualize the results.
Corpora currently available:
- The Dutch Newspaper corpus from the Royal Library (free access for issues published before 1878)
- The Times Newspaper archive 1785-2010 (access within UU)
- The Guardian (access within UU)
- Dutch Banking annual reports of financial institutions (access for research group)
- Thesaurus Musicarum Latinarum (Online Archive of Music Theory in Latin, 3rd to 17th century) (free access)
More to follow.
Want to add your own corpus? Contact the Digital Humanities Lab coordinator: J.dekruif@uu.nl
© Copyright and Access:
I-analyzer contains open-source materials as well as materials that are under copyright protection. Some materials are therefore freely available to all, while others are accessible only to UU staff and students, or even to individual research groups. Access is indicated next to the corpus name in the list. UU staff and students can log in to corpora with their Solis-ID and Solis password. Individual research groups will be given individual accounts. It is forbidden to take copyright-protected material out of I-analyzer, in any form, without the express written permission of the rightsholder.
The Times will serve as the example in this manual. The steps for searching, visualizing etc. are essentially the same for each corpus. Filtering depends on the available metadata.
Privacy
The DHLab processes personal data belonging to I-analyzer users. We do this in order to offer the best possible service.
We process the following personal data:
1. Name and email address
2. The queries you run
We don’t store your data for longer than is necessary for the purposes for which the data are processed. If you want your personal data removed, please contact us.
We have taken appropriate technological and organizational measures to protect your data. We keep up to date with the current situation regarding data protection. We will do everything possible to prevent the loss or unlawful use of personal data. By using I-analyzer you accept these terms.
If you have questions about how we process your personal data, contact:
dr José de Kruif, Utrecht University | Digital Humanities Lab
Temporary address: Kromme Nieuwegracht 80, 3512 HM Utrecht
Tel +31(0)30 253 7867
Step 1
Type your query in the search box. Do you want information on the search syntax? Click the question mark to the left of the search box, or read the tip in the Information section of this manual.
The database used in I-analyzer is organised using Elasticsearch. To search it, you need to use terms and operators that Elasticsearch understands. They are explained in more detail in the Simple Query String manual of Elasticsearch itself.
For your convenience, a summary of the search operators is shown below.
Simple Query String Syntax
The search method supports the following operators:
Operator    Description
+           means AND (bank AND assets)
|           means OR (bank OR assets)
-           means NOT (NOT assets)
"           allows the search for an entire phrase ("the assets of the bank")
*           only allowed after other characters; a wildcard for any number of characters (asset* is allowed, *asset is not)
~N          the fuzziness: when placed after a term, this signifies how many characters are allowed to differ. So bank~1 also searches for bang, sank, dank etc.
~N          when placed after a phrase, this signifies how many words may differ
Symbols such as | and + are reserved characters. If you want to search for text containing these characters, escape them by prefixing them with \.
By default, the search combines all terms using OR. This means that when you type Tram Bike, the search returns documents containing Tram and/or Bike. This also has implications for the - operator: Tram Bike -Car returns documents containing Tram, documents containing Bike, and any document not containing Car. A more expected result can be obtained with (Tram Bike) +-Car, which returns all hits containing Tram or Bike and withholds all those containing Car.
Be Careful with Spaces
Adding or removing a space can change the results of your query. For example, searching for +- term is different from searching for +-term. It might be necessary to escape a space (also by placing a \ in front of it).
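The effect of the default OR can be checked on a toy example. The sketch below is not I-analyzer or Elasticsearch code: it models five made-up documents as Python sets to show why Tram Bike -Car and (Tram Bike) +-Car give different results. It also adds a hypothetical helper, escape_reserved, that prefixes reserved characters with \. The document texts and all function names are assumptions for illustration only.

```python
import re

# Five made-up documents (illustration only, not corpus data).
DOCS = {
    1: "tram bike",
    2: "tram car",
    3: "car",
    4: "bus",
    5: "bike car",
}


def containing(word):
    """Return the ids of all documents that contain the given word."""
    return {doc_id for doc_id, text in DOCS.items() if word in text.split()}


ALL_IDS = set(DOCS)
NOT_CAR = ALL_IDS - containing("car")

# Tram Bike -Car  ->  tram OR bike OR (NOT car)
default_or = containing("tram") | containing("bike") | NOT_CAR

# (Tram Bike) +-Car  ->  (tram OR bike) AND (NOT car)
grouped = (containing("tram") | containing("bike")) & NOT_CAR


def escape_reserved(text):
    """Prefix each reserved character (and the backslash itself) with a backslash."""
    return re.sub(r'([+|\-"*()~\\])', r"\\\1", text)


print(sorted(default_or))  # document 4 is included only because it lacks "car"
print(sorted(grouped))     # only documents with tram or bike and without car
print(escape_reserved("profit+loss"))
```

The helper does not escape spaces; whether a space needs escaping depends on the query, as described above.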
Step 2
Querying The Times for “Winston Churchill” results in a list of 15204 hits. The first results are shown. By default, results are sorted by relevance.
You can change this and sort by:
* Publication date
* Issue number
* Image count
* OCR confidence
etc.
Use the button at the bottom of the results screen to expand the results list.
Step 3
You can filter on the metadata that are available for your corpus. In the case of The Times:
* Publication dates (after ... before ...)
* Page type: Standard or Supplement
* Whether your query appears on the cover
* Newspaper section (Arts and Entertainment, Business and Finance etc.)
* Illustrations (Cartoons, Photographs etc.)
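Because the corpora are stored in Elasticsearch, a search with a metadata filter and a sort order can be pictured as a request body. The sketch below builds such a body as a plain Python dictionary. It is an assumption about what a comparable request could look like, not I-analyzer's actual request; the field names content and date and the function build_request are hypothetical.

```python
import json


def build_request(query, date_from, date_to, sort_field="date"):
    """Build an Elasticsearch search body: a query, a date filter and a sort order.

    The field names "content" and "date" are hypothetical.
    """
    return {
        "query": {
            "bool": {
                # Full-text part: the simple query string typed in the search box.
                "must": {
                    "simple_query_string": {
                        "query": query,
                        "fields": ["content"],
                        "default_operator": "or",
                    }
                },
                # Metadata part: keep only documents inside the date range.
                "filter": [
                    {"range": {"date": {"gte": date_from, "lte": date_to}}}
                ],
            }
        },
        # Sort by a metadata field instead of by relevance.
        "sort": [{sort_field: {"order": "asc"}}],
    }


body = build_request('"Winston Churchill"', "1940-01-01", "1945-12-31")
print(json.dumps(body, indent=2))
```

Keeping the filter separate from the full-text query mirrors the interface: the filter only narrows the set of documents, while the query text determines relevance.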
Step 4
Do you need to investigate your results further offline? Download a CSV file that contains your subset. If needed, you can specify which fields to include in the file.
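Once downloaded, the file can be processed with any tool that reads CSV. A minimal sketch using Python's standard csv module follows. It assumes a comma-separated file; the column names date and title and the sample rows are made up for illustration, since the actual columns depend on the corpus and on the fields you selected.

```python
import csv
import io

# Stand-in for a downloaded file. For a real file, use for example:
#   open("your_download.csv", newline="", encoding="utf-8")
sample = io.StringIO(
    "date,title\n"
    "1940-05-11,Example article A\n"
    "1945-07-27,Example article B\n"
    "1951-10-27,Example article C\n"
)

rows = list(csv.DictReader(sample))

# ISO-formatted dates (YYYY-MM-DD) compare correctly as plain strings.
from_1945 = [row for row in rows if row["date"] >= "1945-01-01"]

for row in from_1945:
    print(row["date"], row["title"])
```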
Step 5
Visualize results.
I-analyzer allows you to visualize your query results in:
* A timeline by publication date
* A word cloud of article titles
* A word cloud of article contents
* A histogram of article categories
* A histogram of illustrations
All graphs can be displayed in absolute numbers or percentages.
All graphs can also be displayed as a table.
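The difference between absolute numbers and percentages can be illustrated with a small table. The sketch below assumes that a percentage is the share of all hits that falls in a category; the manual does not state how I-analyzer normalizes its graphs, and the category labels are made up.

```python
from collections import Counter

# Made-up category labels, one per hit (illustration only).
hits = ["News", "News", "Business and Finance", "Arts and Entertainment", "News"]

counts = Counter(hits)
total = sum(counts.values())

# One table row per category: absolute number and share of all hits.
table = [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]

for cat, n, pct in table:
    print(f"{cat:<25}{n:>5}{pct:>8}%")
```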
Related Words Query
Besides word clouds and timelines, you can also request a related words query. The resulting graph shows the words that appear in the same contexts as the query term over the whole dataset, and how similar they are to the query term within each time window.
The similarity score is based on a singular value decomposition of a word-document matrix in which all word counts have been transformed with positive pointwise mutual information (Levy et al., 2015). As in latent semantic analysis, the vectors for each word in this matrix can be compared. If words appear in the same contexts (i.e., in the same “topics”), their vectors are more alike, which is reflected in a higher cosine similarity.
To produce the graph, one matrix has been computed for the whole corpus, from which the five words most similar to the query term have been selected (the query term itself is excluded if it is among them). Separate matrices have been computed for consecutive time frames of one decade (the mean of which is shown on the x-axis). For each decade, the similarity of the query term to the overall most similar words is computed and shown. This means that the graph does not necessarily show the words that are most similar within one specific time frame.
Please also note that this visualization always represents the texts of the whole corpus (and its subsets per time frame), so it is not affected by any filters you have applied.
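The chain of steps (counts, positive pointwise mutual information, singular value decomposition, cosine similarity) can be sketched for a single matrix as follows. This is an illustration under stated assumptions, not I-analyzer's implementation: the counts are made up, the number of retained dimensions k is chosen arbitrarily, the function names are hypothetical, and the third-party NumPy library is required. I-analyzer computes one such matrix for the whole corpus and one per time frame; the sketch covers only one.

```python
import numpy as np

# Made-up word-document counts (rows: words, columns: documents).
words = ["bank", "assets", "loan", "tram", "bike"]
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
], dtype=float)


def ppmi(matrix):
    """Positive pointwise mutual information of a count matrix."""
    joint = matrix / matrix.sum()
    p_word = joint.sum(axis=1, keepdims=True)
    p_doc = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(joint / (p_word * p_doc))  # zero counts give -inf
    return np.maximum(pmi, 0.0)  # negative values and -inf become 0


def word_vectors(matrix, k):
    """Decompose the PPMI matrix and keep the k largest singular dimensions."""
    u, s, _ = np.linalg.svd(ppmi(matrix), full_matrices=False)
    return u[:, :k] * s[:k]


def most_similar(word, vectors, n):
    """Return the n words with the highest cosine similarity, excluding the word itself."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[words.index(word)]
    ranked = [(words[i], float(sims[i])) for i in np.argsort(-sims) if words[i] != word]
    return ranked[:n]


vectors = word_vectors(counts, k=3)
top = most_similar("bank", vectors, n=2)
print(top)  # the two finance terms rank above the transport terms
```

In this toy data, bank, assets and loan occur only in the first two documents, so their vectors point in similar directions, while tram and bike share no documents with them and score close to zero.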
Step 6
Need more information?
Contact: Digital Humanities Lab, dr. J. de Kruif, Drift 10 rm 3.08, 3512 BS Utrecht
Last modified: 13/05/2019