Overview of WORDij

  • WORDij is a suite of data science programs that automates many aspects of natural language processing. Unstructured text from sources such as social media, news, speeches, focus groups, interviews, email, and web sites can be readily processed.

  • The software runs on Windows (32- and 64-bit) and Mac (32- and 64-bit). (See the FAQ about running macOS Catalina.)

  • WORDij runs very fast because the modules are written in C. Java is used only for the Graphical User Interface.

  • The suite can process data files as large as 550 megabytes with 8 gigabytes of RAM on a 64-bit machine. Small files with only tens or hundreds of documents run in seconds.

  • Files analyzed are in UTF-8 format, so the programs handle non-Latin scripts such as Chinese, Japanese, Arabic, or Russian.

  • WordLink is the initial text processor. It moves a sliding window through the text, centering on each word and tabulating directed word-pair bigrams appearing up to three words before and three words after it. By relating only proximate words, WORDij is more precise than "bag of words" programs that treat all words in a document as related. WORDij preserves the order of words within a bigram, which embeds syntactic effects. Directed bigrams are useful for the Opticomm software in WORDij, whose strings linking a seed word and a target word then lead to near-grammatical statements.
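The sliding-window idea can be sketched in a few lines of Python (an illustrative analogue, not WordLink's actual implementation; counting only forward pairs within the window still captures both directions of every cooccurrence, since each pair's earlier word sees the later one ahead of it):

```python
from collections import Counter

def directed_bigrams(tokens, window=3):
    """Count directed word pairs within `window` words of each focal word.

    For each position i, pair tokens[i] with each token up to `window`
    positions after it, preserving word order."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[(w, tokens[j])] += 1
    return counts

pairs = directed_bigrams("the quick brown fox jumps".split(), window=3)
```

Here `pairs[("the", "fox")]` is 1 (three words apart, inside the window), while `pairs[("the", "jumps")]` is 0 (four words apart, outside it).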

  • String conversions are possible. One can convert a multi-word phrase to a single unigram, such as "New York City" rendered as "new_york_city." One can likewise convert synonyms to a single unigram. Also enabled is the creation and analysis of ontologies containing categories of words.
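A minimal sketch of phrase and synonym conversion (the phrase list and synonym map here are hypothetical inputs for illustration, not WORDij's own file formats):

```python
def apply_conversions(text, phrases, synonyms):
    """Collapse multi-word phrases into single underscore-joined tokens,
    then map synonyms onto a canonical token."""
    text = text.lower()
    # Replace longer phrases first so they are not broken by shorter ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase.lower(), phrase.lower().replace(" ", "_"))
    return [synonyms.get(t, t) for t in text.split()]

toks = apply_conversions("I love New York City and NYC",
                         phrases=["New York City"],
                         synonyms={"nyc": "new_york_city"})
```

Both the phrase and its abbreviation end up as the same unigram, `new_york_city`, so their frequencies pool in later analysis.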

  • By using an "include list," the opposite of a "drop list" or "stopword list," one can analyze networks among only the included words. For example, an analyst can include a list of persons' names to find the network among them based on their cooccurrence in texts such as news stories. For another use of an include list, see the article "Scaling constructs with semantic networks" in the Publications tab. The publication demonstrates how to build an index for a construct from natural language, based on principal components analysis, taking the first eigenvector and its highly loading words to create the index.
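The include-list idea can be illustrated with a small sketch that keeps only listed names and counts their cooccurrence within each document (document-level cooccurrence is used here for simplicity; WORDij itself works with proximity windows):

```python
from collections import Counter
from itertools import combinations

def include_network(documents, include):
    """Build an undirected cooccurrence network restricted to an include
    list -- the opposite of dropping stopwords."""
    include = {w.lower() for w in include}
    edges = Counter()
    for doc in documents:
        present = sorted({t for t in doc.lower().split() if t in include})
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

docs = ["Smith met Jones yesterday",
        "Jones criticized Brown",
        "Smith and Brown agreed"]
net = include_network(docs, include=["Smith", "Jones", "Brown"])
```

Every word outside the include list is ignored, so the resulting network contains only ties among the named persons.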

  • A proper noun extractor lists all proper nouns occurring in the text. This can aid in building "include lists."

  • The program computes the similarity of pairs of networks from different sets of texts using the Quadratic Assignment Procedure (QAP), which produces a correlation coefficient for comparing whole networks.
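A simplified QAP sketch: correlate the off-diagonal cells of two same-sized adjacency matrices, then build a reference distribution by permuting the rows and columns of one matrix together (a pure-Python illustration of the procedure, not WORDij's implementation):

```python
import random
import statistics

def qap_correlation(A, B, permutations=200, seed=0):
    """Pearson correlation between two adjacency matrices, with a QAP
    permutation p-value obtained by relabeling B's nodes at random."""
    n = len(A)

    def offdiag(M, perm):
        return [M[perm[i]][perm[j]] for i in range(n) for j in range(n) if i != j]

    def corr(x, y):
        mx, my = statistics.fmean(x), statistics.fmean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den

    ident = list(range(n))
    observed = corr(offdiag(A, ident), offdiag(B, ident))
    rng = random.Random(seed)
    ge = 0
    for _ in range(permutations):
        perm = ident[:]
        rng.shuffle(perm)  # permute rows and columns of B together
        if corr(offdiag(A, ident), offdiag(B, perm)) >= observed:
            ge += 1
    return observed, ge / permutations

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
obs, p = qap_correlation(A, A)
```

Comparing a network with itself yields a correlation of 1.0; the permutation p-value estimates how often a random node relabeling matches or beats the observed correlation.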

  • The Timeslice program in the suite's Utilities tab enables time-series analysis of networks changing over time. One can create daily, weekly, monthly, or yearly intervals of varying widths.
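Interval bucketing of the kind Timeslice performs can be sketched as grouping dated documents by, for example, calendar month (illustrative only; Timeslice's own interval options are set in its interface):

```python
from collections import defaultdict
from datetime import date

def slice_by_month(dated_docs):
    """Group (date, text) pairs into monthly slices, each of which could
    then be processed into its own network for time-series comparison."""
    slices = defaultdict(list)
    for d, text in dated_docs:
        slices[(d.year, d.month)].append(text)
    return dict(slices)

docs = [(date(2020, 1, 5), "doc a"),
        (date(2020, 1, 20), "doc b"),
        (date(2020, 2, 1), "doc c")]
monthly = slice_by_month(docs)
```

Swapping the grouping key (e.g. ISO week or plain year) gives the other interval widths mentioned above.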

  • The VISij module can produce movies showing change over time across multiple .net files.

  • WORDij output files, 8 per run, can be imported into a number of other network analysis programs, such as UCINET, NodeXL, Pajek, and Negopy.

  • Relative frequencies of word unigrams or bigrams can be compared in the suite's Z-Utilities tab to see differences and similarities. The statistical test used is a Z-test for proportions. Zero values are converted to a very small decimal value to avoid division by zero.
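The two-proportion Z-test with zero-count smoothing can be sketched as follows (the epsilon value here is an assumption for illustration, not WORDij's actual constant):

```python
import math

def z_test_proportions(k1, n1, k2, n2, eps=1e-9):
    """Z-test for the difference between two proportions, e.g. a word's
    relative frequency in two corpora of sizes n1 and n2. Zero counts
    are replaced by a tiny value so the pooled standard error never
    divides by zero."""
    p1 = max(k1, eps) / n1
    p2 = max(k2, eps) / n2
    pooled = (max(k1, eps) + max(k2, eps)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# A word appearing 30 times per 1000 tokens vs. 10 times per 1000:
z = z_test_proportions(30, 1000, 10, 1000)
# A word absent from the first corpus still yields a finite statistic:
z_zero = z_test_proportions(0, 1000, 10, 1000)
```

With the counts above, z is roughly 3.19, well past the conventional 1.96 cutoff for a two-tailed test at p < .05.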

  • Another useful function extracts word bigrams up to three steps away from any word unigram you select. This helps reduce the "hairball" or "bowl of spaghetti" problem by extracting a node-centric network.
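Extracting a node-centric subnetwork within three steps of a seed word amounts to a breadth-first search over the bigram graph, sketched here on an undirected view of the pairs (for illustration only):

```python
from collections import deque

def ego_network(bigrams, seed, max_steps=3):
    """Keep only the bigrams whose both words lie within `max_steps`
    hops of the seed word, pulling a node-centric subnetwork out of
    the full 'hairball'."""
    adj = {}
    for a, b in bigrams:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        w = queue.popleft()
        if dist[w] == max_steps:
            continue
        for nb in adj.get(w, ()):
            if nb not in dist:
                dist[nb] = dist[w] + 1
                queue.append(nb)
    return {(a, b) for a, b in bigrams if a in dist and b in dist}

pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "y")]
sub = ego_network(pairs, "a", max_steps=3)
```

Words more than three steps from the seed ("e") and disconnected components ("x", "y") drop out of the extracted subnetwork.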

  • WORDij bigram output (the .pr file) can be imported to compute word embeddings, enabling word algebra in Python packages such as Google's Word2Vec or Stanford's GloVe.

  • There are many other functionalities in WORDij that you can see in the 15 tutorial files in the suite's Documentation folder.

  • WORDij is free for non-commercial academic research. Commercial licensing is available. Contact James Danowski at jdanowski@gmail.com.