How do personal assistants typically generate sentences?

algorithms, data-structures, natural-language-processing

This is sort of a follow-up to this question about NLG research directions in the linguistics field.

How do personal assistant tools such as Siri, Google Now, or Cortana perform Natural Language Generation (NLG)? Specifically, the sentence text generation part; I am not interested in the text-to-speech part, just the text generation.

I'm not looking for exactly how each one does it, as that information is probably not available.

I am wondering what setup is required to implement sentence generation of that quality.

  • What kind of data would you need in a database (at a high level)?
    • Does it require a dictionary of every possible word and its meaning, along with many annotated, statistically analyzed books/corpora added to it?
    • Does it require recording people talking in a natural way (such as on TV shows or podcasts), transcribing the speech to text, and then somehow adding that to the "system"? (to get really "human"-like sentences)
    • Or are they just using simple syntax-based sentence patterns, with no gigantic semantic "meaning" database, where someone essentially wrote a bunch of regular-expression-style rules?
  • What are the algorithms that are used for such naturally written human-like sentences?
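To make the last bullet concrete, here is a minimal sketch of what such a "regular expressions type thing" might look like. The patterns, intent names, and slot names below are all hypothetical, purely to illustrate the idea of pattern-based keyword spotting:

```python
import re

# Hypothetical utterance patterns mapping a spoken request to an
# intent label plus captured slot values. Real systems would have
# many more patterns and far more robust matching.
PATTERNS = [
    (re.compile(r"find (?:me )?(?P<cuisine>\w+) restaurants", re.I),
     "restaurant_search"),
    (re.compile(r"play (?P<song>.+) from my music", re.I),
     "play_music"),
]

def match_intent(utterance):
    """Return (intent, slots) for the first matching pattern, else (None, {})."""
    for pattern, intent in PATTERNS:
        m = pattern.search(utterance)
        if m:
            return intent, m.groupdict()
    return None, {}

print(match_intent("Find me Thai restaurants nearby"))
# ('restaurant_search', {'cuisine': 'Thai'})
```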

One reason for asking is that the NLG field seems very far from being able to do what Siri, Google Now, and the others are accomplishing. So what kind of stuff are they actually doing? (Just for the sentence text generation part.)

Best Answer

Siri typically doesn't "generate" sentences. She parses what you say and 'recognizes' certain keywords, sure, and for common responses, she will use a template, such as I found [N] restaurants fairly close to you or I couldn't find [X] in your music, [Username].
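A template of that kind is essentially string interpolation over named slots. Here is a minimal sketch, assuming hypothetical intent names and slot values (the templates mirror the bracketed examples above):

```python
# Hypothetical response templates keyed by recognized intent; the named
# slots ({n}, {query}, {username}) correspond to the bracketed [N], [X],
# and [Username] placeholders mentioned above.
TEMPLATES = {
    "restaurant_search": "I found {n} restaurants fairly close to you.",
    "music_not_found": "I couldn't find {query} in your music, {username}.",
}

def render(intent, **slots):
    """Fill the template for a recognized intent with the given slot values."""
    return TEMPLATES[intent].format(**slots)

print(render("restaurant_search", n=7))
# prints: I found 7 restaurants fairly close to you.
```

No generation in the linguistic sense happens here; all the "writing" was done ahead of time by whoever authored the templates.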

But most of her responses are canned, depending on her interpretation of your speech, in addition to a random number generator to choose a creative answer to a flippant question. Simply asking Siri "How much wood can a woodchuck chuck?" or "What is the meaning of life?" will generate any of a variety of answers. There are numerous cultural references and jokes built-in (and repeated verbatim) that prove with relative certainty that Siri is not just spontaneously generating most of her text, but pulling it from a database of some sort. It's likely that incoming questions are saved to a central server, where new responses to those questions can be created by Apple employees, allowing Siri to "learn".
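That canned-plus-random scheme can be sketched in a few lines. The table below is entirely invented for illustration; these are not Siri's actual responses:

```python
import random

# Hypothetical canned-answer table: each flippant question maps to a
# list of pre-written replies, and one is picked at random each time.
CANNED = {
    "how much wood can a woodchuck chuck": [
        "It depends on whether you mean African or European woodchucks.",
        "42 cords, give or take.",
    ],
    "what is the meaning of life": [
        "42.",
        "I can't answer that now, but give me some time to write a very long play.",
    ],
}

def canned_reply(question):
    """Look up a normalized question and return one of its canned answers."""
    answers = CANNED.get(question.lower().rstrip("?"))
    if answers is None:
        return "I don't understand."
    return random.choice(answers)
```

The random choice is what makes repeated askings feel varied even though every reply was written in advance.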

Her text-to-speech, however, is good enough that it sometimes makes it seem as though the answers are being generated on the fly...
