Skip to main content

Stanford CS448B 18 TextVis

· 6 min read

TLDR

This article contains my notes from Stanford's CS448B (Data Visualization) course, specifically focusing on the eighteenth lecture about text visualization. I'll discuss the importance of documents, collections of documents, and the different types of visualizations for text data.

Original

Download PDF.

Notes

Text as data

  • Documents

    • Articles, books and novels
    • Computer programs
    • E-mails, web pages, blogs
    • Tags, comments
  • Collection of documents

    • Messages (e-mail, blogs, tags, comments)
    • Social networks (personal profiles)
    • Academic collaborations (publications)

Why visualize text?

  • Understanding: get the “gist”of a document
  • Grouping: cluster for overview or classification
  • Compare: compare document collections, or inspect evolution of collection over time
  • Correlate: compare patterns in text to those in other data, e.g., correlate with social network

Example: Health Care Reform

Background:

  • Initiatives by President Clinton
  • Overhaul by President Obama Text data:
  • News articles
  • Speech transcriptions
  • Legal documents

What questions might you want to answer?
What visualizations might help?

A Concrete Example

Word/Tag Clouds: Word Count

President Obama’s Health Care Speech to Congress

WordTree: Word Sequences


Gulf of Evaluation

  • Many (most?) text visualizations do not represent text directly. They represent the output of a language model (word counts, word sequences, etc.)
  • Can you interpret the visualization?
    • How well does it convey the properties of the model?
  • Do you trust the model?
    • How does the model enable us to reason about the text?

Text as Data

Words as nominal data?

High dimensional (10,000+)

More than equality tests

  • Correlations: Hong Kong, San Francisco, Bay Area
  • Order: April, February, January, June, March, May
  • Membership: Tennis, Running, Swimming, Hiking, Piano
  • Hierarchy, antonyms & synonyms, entities, ...

Words have meanings and relations

Text Processing Pipeline

Tokenization

  • Segment text into terms.
  • Remove stop words? a, an, the, of, to, be
  • Numbers and symbols? #cardinal, @Stanford, OMG!!!!!!!!
  • Entities? Palo Alto, O’Connor, U.S.A.

Stemming

  • Group together different forms of a word.
  • Porter stemmer? visualization(s), visualize(s), visually -> visual
  • Lemmatization? oes, went, gone -> go

Ordered list of terms

The Bag of Words Model

Ignore ordering relationships within the text

A document \approx vector of term weights

  • Each term corresponds to a dimension (10,000+)
  • Each value represents the relevance
    • For example, simple term counts

Aggregate into a document x term matrix

  • Document vector space model

Document * Term matrix

Each document is a vector of term weights
Simplest weighting is to just count occurrences

https://books.google.com/ngrams/

Strengths

  • Can help with gisting and initial query formation

Weaknesses

  • Sub-optimal visual encoding (size not pos. encodes freq.)
  • Inaccurate size encoding (long words are bigger)
  • May not facilitate comparison (unstable layout)
  • Term frequency may not be meaningful
  • Does not show the structure of the text

Keyword Weighting

Given a text, what are the best descriptive words?

Term Frequency:

tftd=count(t) in dtf_{td} = count(t)\ in\ d

Can take log frequency: log(1 + tftd)log(1\ +\ tf_{td})

Can normalize to show proportion: tftd / ttftdtf_{td}\ /\ \sum_ttf_{td}

Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict

TF.IDF: Term Freq by Inverse Document Freq

tf.idftd = log(1 + tftd)  log(N/dft)tf.idf_{td}\ =\ log(1\ +\ tf_{td})\ *\ log(N/df_t)
dtt = # docs containing tdt_t\ =\ \#\ docs\ containing\ t
N = # of docsN\ =\ \#\ of\ docs

Limitations of Frequency Statistics

  • Typically focus on unigrams (single terms)
  • Often favors frequent (TF) or rare (IDF) terms
    • Not clear that these provide best description
  • “Bag of words” ignores additional info
    • Grammar / part-of-speech
    • Position within document
    • Recognizable entities

How do people describe text?

Asked 69 graduate students to read and describe dissertation abstracts
Each given 3 documents in sequence; summarized each using keyphrases, then summarized the 3 together as a whole using keyphrases
Were matched to both familiar and unfamiliar topics; topical diversity within a collection was varied systematically

Term Commonness

log(tfw)/log(tfthe)log(tf_w)/log(tf_{the})

The normalized term frequency relative to the most frequent n-gram, e.g., the word “the”

Yelp:Review Spotlight

Download PDF《Review Spotlight: A User Interface for Summarizing User-generated Reviews Using Adjective-Noun Word Pairs》

Tips: Descriptive Keyphrases

Understand the limitations of your language model

  • Bag of words:
    • Easy to compute
    • Single words
    • Loss of word ordering

Select appropriate model and visualization

  • Generate longer, more meaningful phrases
  • Adjective-noun word pairs for reviews
  • Show keyphrases within source text

Visualizing Document Content

Information Retrieval

  • Search for documents
  • Match query string with documents
  • Visualization to contextualize results

Concordance

What is the common local context of a term?

wordTree

\

  • Filter infrequent runs

  • Recurrent themes in speech

Glimpses of structure

Concordances show local, repeated structure
But what about other types of patterns?

Example

Lexical: <A> at <B>
Syntactic: <Noun> <Verb> <Object>

Phrase Nets [van Ham 2009]

  • Look for specific linking patterns in the text:
    • ‘A and B’, ‘A at B’, ‘A of B’, etc
    • Could be output of regexp or parser
  • Visualize extracted patterns in a node-link view
    • Occurrences -> Node size
    • Pattern position -> Edge direction
    • Darker color -> higher ratio of out-edges to in-edges

Visualizing Conversation

  • Many dimensions to consider:
    • Who (senders, receivers)
    • What (the content of communication)
    • When (temporal patterns)
  • Interesting cross-products:
    • What x When -> Topic “Zeitgeist”
    • Who x Who -> Social network
    • Who x Who x What x When -> Information flow

Usenet Visualization [Viégas]

Show correspondence patterns in text forums
Initiate vs. reply; size and duration of discussion

Themail (Viégas)

One person over time, TF.IDF weighted terms

Document Collections

Topic modeling Approach

  • Assume documents are a mixture of topics
  • Topics are (roughly) a set of co-occurring terms
  • Latent Semantic Analysis (LSA): reduce term matrix
  • Latent Dirichlet Allocation (LDA): statistical model

ThemeRiver (Havre et al 99)

Termite: Visualizing Topic Models [Chuang ’12]

Show salient (vs. frequent) terms. Seriate rows & columns.

Stanford Dissertation Browser

Summary

  • High Dimensionality
    • Where possible use text to represent text...
    • ... which terms are the most descriptive?
  • Context & Semantics
    • Provide relevant context to aid understanding.
    • Show (or provide access to) the source text.