Holocron: Natural Language Processing with Python

Python offers advantages for handling strings, so it is no surprise that it was chosen for natural language processing work. This post refers to the book Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by Steven Bird, Ewan Klein, and Edward Loper.

Software requirements
Python
NLTK
NLTK-Data
NumPy -- A scientific computing library with support for multidimensional arrays and linear algebra; NLTK needs it for certain probability and tagging tasks, among others.
Matplotlib -- A library for 2D plotting.
NetworkX -- A library for storing and manipulating network structures made up of nodes and edges. To visualize semantic networks, it also requires the Graphviz library.
Prover9 -- An automated theorem prover for first-order and equational logic, used to support inference in language processing.

>>> import nltk
>>> nltk.download()

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>

>>> text4.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])
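
A dispersion plot marks, for each word, the positions in the text where that word occurs. As a rough sketch of the underlying computation (not NLTK's own implementation), the offsets can be collected with plain Python. The toy token list and the helper name word_offsets are made up for illustration.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs.

    Matching is case-sensitive in this sketch, so 'America' and 'america'
    are treated as different words.
    """
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Hypothetical toy text, standing in for a corpus such as text4.
tokens = "the citizens of America value freedom and freedom needs citizens".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "america"])
print(offsets)
```

In this toy run the lowercase target 'america' finds nothing, because the text only contains 'America'.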

BROWN CORPUS

>>> from nltk.corpus import brown
>>> news_text=brown.words(categories='news')
>>> fdist=nltk.FreqDist([w.lower() for w in news_text])
>>> modals=['can','could','may','might','must','will']
>>> for m in modals:
...     print('%s: %d' % (m, fdist[m]))
...
can: 94
could: 87
may: 93
might: 38
must: 53
will: 389

>>> cfd=nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))
>>>
>>> genres=['news','religion','hobbies','science_fiction','romance','humor']
>>> modals=['can','could','may','might','must','will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13
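
Conceptually, a conditional frequency distribution keeps one frequency count per condition (here, per genre). The idea can be sketched with the standard library alone. This is an illustration with made-up toy pairs and a hypothetical helper name, conditional_counts, not a description of how NLTK implements it.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    counts = defaultdict(Counter)
    for condition, sample in pairs:
        counts[condition][sample] += 1
    return counts


# Hypothetical toy data: (genre, word) pairs like those built from the corpus.
pairs = [
    ("news", "can"), ("news", "will"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "can"),
]
counts = conditional_counts(pairs)

# Print one row per genre, in the spirit of a tabulated view.
modals = ["can", "could", "will"]
for genre in ["news", "romance"]:
    row = " ".join(str(counts[genre][m]) for m in modals)
    print(genre, row)
```

The news row in the table differs slightly from the lowercased counts shown earlier (for example 93 versus 94 for can), most likely because this loop does not lowercase the words.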

INAUGURAL ADDRESS CORPUS

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt',
'1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt',
'1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt',
'1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt',
'1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt',
'1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt',
'1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt',
'1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt',
'1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt',
'1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt',
'1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt',
'1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt',
'1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt',
'1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt',
'1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt',
'2005-Bush.txt', '2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825',
'1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865',
'1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905',
'1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945',
'1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985',
'1989', '1993', '1997', '2001', '2005', '2009']

>>> cfd=nltk.ConditionalFreqDist(
... (target, fileid[:4])
... for fileid in inaugural.fileids()
... for w in inaugural.words(fileid)
... for target in ['america','citizen']
... if w.lower().startswith(target))
>>> cfd.plot()
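
The generator above matches by prefix, so words such as Americans or citizenship are counted under america and citizen as well. A small sketch with invented data shows which (target, year) pairs it produces; the speeches list is hypothetical and only stands in for the corpus.

```python
from collections import Counter

# Hypothetical toy data: (year, words) pairs standing in for the speeches.
speeches = [
    ("1789", ["Citizens", "of", "America"]),
    ("1793", ["fellow", "citizen", "and", "Americans", "citizenship"]),
]
targets = ["america", "citizen"]

# Same shape as the generator above: one (target, year) pair per matching word.
pairs = [
    (target, year)
    for year, words in speeches
    for word in words
    for target in targets
    if word.lower().startswith(target)
]

per_target_year = Counter(pairs)
print(per_target_year)
```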

Matplotlib library

Using the Matplotlib library, we try out code that shows the frequency of some modal verbs in the Brown Corpus, broken down by genre.
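
The counts in the table above can be drawn as a grouped bar chart. What follows is a minimal sketch rather than the listing originally tested here. It assumes Matplotlib is installed, copies the numbers from the tabulated output, and uses made-up helper names (bar_positions, plot_modals).

```python
# Counts copied from the tabulated Brown Corpus output above.
MODALS = ["can", "could", "may", "might", "must", "will"]
COUNTS = {
    "news": [93, 86, 66, 38, 50, 389],
    "religion": [82, 59, 78, 12, 54, 71],
    "hobbies": [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance": [74, 193, 11, 51, 45, 43],
    "humor": [16, 30, 8, 8, 9, 13],
}


def bar_positions(n_groups, n_series, width):
    """Return, for each series, the x position of its bar in every group."""
    return [
        [group + series * width for group in range(n_groups)]
        for series in range(n_series)
    ]


def plot_modals(counts=COUNTS, modals=MODALS, width=0.13):
    """Draw one group of bars per modal verb and one bar per genre."""
    # Imported here so the data helpers above work without Matplotlib.
    import matplotlib.pyplot as plt

    positions = bar_positions(len(modals), len(counts), width)
    for xs, (genre, values) in zip(positions, counts.items()):
        plt.bar(xs, values, width=width, label=genre)
    # Center the tick labels under each group of bars.
    centers = [g + width * (len(counts) - 1) / 2 for g in range(len(modals))]
    plt.xticks(centers, modals)
    plt.ylabel("Frequency")
    plt.title("Modal verbs in the Brown Corpus by genre")
    plt.legend()
    plt.show()
```

Calling plot_modals() opens the chart window. The bar width of 0.13 is an arbitrary choice that keeps six bars inside each group.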

The output is:

NetworkX library

With this library we can visualize networks such as WordNet (a semantic network).
The following program initializes a Graph object and then traverses the WordNet hierarchy, adding edges to the graph.
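
A minimal sketch of that kind of traversal is given here, under some assumptions: NetworkX and NLTK with the WordNet data are installed, the starting synset dog.n.01 is an arbitrary choice, and the names add_hyponym_edges, wordnet_graph and EdgeList are invented for this example. The traversal itself only needs an object with an add_edge method, so it can be tried on a toy hierarchy without either library.

```python
def add_hyponym_edges(graph, node, children, label):
    """Recursively add an edge from each node to every one of its children.

    graph    -- any object with an add_edge(u, v) method, e.g. networkx.Graph
    children -- function returning the child nodes of a node
    label    -- function returning the name used for a node in the graph
    """
    for child in children(node):
        graph.add_edge(label(node), label(child))
        add_hyponym_edges(graph, child, children, label)


def wordnet_graph(synset_name="dog.n.01"):
    """Build a NetworkX graph of the hyponyms below one WordNet synset."""
    # Imported lazily: needs networkx, nltk and the WordNet data.
    import networkx as nx
    from nltk.corpus import wordnet as wn

    graph = nx.Graph()
    add_hyponym_edges(
        graph,
        wn.synset(synset_name),
        children=lambda synset: synset.hyponyms(),
        label=lambda synset: synset.name(),  # name is an attribute in NLTK 2
    )
    return graph


class EdgeList:
    """Tiny stand-in for a graph, used to try the traversal without NetworkX."""

    def __init__(self):
        self.edges = []

    def add_edge(self, u, v):
        self.edges.append((u, v))


# Hypothetical toy hierarchy in place of WordNet.
toy = {"animal": ["dog", "cat"], "dog": ["puppy"], "cat": [], "puppy": []}
demo = EdgeList()
add_hyponym_edges(demo, "animal", children=toy.get, label=str)
print(demo.edges)
```

With the libraries installed, the graph returned by wordnet_graph() could then be drawn, for instance with nx.draw.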

The visualization produced by the code is the following:

Appendix:

Installation log for the NetworkX library

Installation log for networkx-1.9

pygraphviz-1.2
In C:/Python27/pygraphviz-1.2

The installer is run from the Python source, as the installation file INSTALL.TXT
instructs, with the command python setup.py install, and it returns:
library_path=
include_path=
running install
running build
running build_py
running build_ext
building 'pygraphviz._graphviz' extension
error: Unable to find vcvarsall.bat

The same file says:
   3) You are using Windows
      There are no PyGraphviz binary packages for Windows but you might be
      able to build it from this source. See
      http://networkx.lanl.gov/pygraphviz/reference/faq.html

   If you think your installation is correct you will need to manually
   change the include_path and library_path variables in setup.py to
   point to the correct locations of your graphviz installation.

The page says that no self-installing package is supported for Windows. The code in
setup.py configures a setup function with parameters such as package names and paths
and then calls it; that call is what raises the error Unable to find vcvarsall.bat.
This error generally means that distutils could not find a Microsoft Visual C++ compiler
with which to build the C extension.

The PATH is backed up before it is modified:
SET PATHRESP=%PATH%
SET GRAPHVIZ-2.38=%ProgramFiles(x86)%\graphviz-2.38
The Graphviz directory is then added to the PATH:
SET PATH=%PATH%;%GRAPHVIZ-2.38%

In the file pygraphviz-1.2\setup.py, the lib and include paths are assigned:
library_path='\Program Files (x86)\Graphviz2.38\lib'
include_path='\Program Files (x86)\Graphviz2.38\include'
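
Before rerunning the build, it may help to confirm that these two directories actually exist. The sketch below only checks the paths; it does not address the compiler error. The helper name missing_dirs is hypothetical.

```python
import os


def missing_dirs(paths):
    """Return the entries of paths that are not existing directories."""
    return [path for path in paths if not os.path.isdir(path)]


# Values as assigned in setup.py above; adjust them to the local installation.
library_path = r'\Program Files (x86)\Graphviz2.38\lib'
include_path = r'\Program Files (x86)\Graphviz2.38\include'

for path in missing_dirs([library_path, include_path]):
    print('Not found:', path)
```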

graphviz-2.38
In Beto/Descargar/PythonDownloads

swigwin-3.0.2
In Beto/Descargar/PythonDownloads

Thursday, August 14, 2014
