
Processing Twitter logs with nltk.

  1. Copy TwitterLog20100805 into your own directory
    	cp -r /home/maeda/TwitterLog/20100805/ .
  2. Reference site: http://d.hatena.ne.jp/nokuno/20100123/1264239192
  3. Sample program that reads Japanese text
    	#! /usr/bin/env python
    	#encoding: utf-8
    	import nltk
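    	# Read the whole file and split it into words on whitespace
    	raw = open('./sample.mecab').read()
    	words = raw.split()
    	print len(words)
    	# Wrap the word list in an nltk.Text and generate 300 words from it
    	text = nltk.Text(words)
    	gen = text.generate(300)
    	print gen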
  4. bigrams
    	#!/usr/bin/python
    	#encoding: utf-8
    	import nltk
    	
    	raw = open('sample.dat').read()
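    	words = raw.split()
    	bigrams = nltk.bigrams(words)
    	fd = nltk.FreqDist(bigrams)
    	# Assumes FreqDist iterates from most to least frequent (older NLTK);
    	# stop once the count has dropped to 100 or below
    	for w in fd:
    	        if fd[w] <= 100:
    	                break
    	        print w[0],w[1],fd[w]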
    	
    	# The output below comes out garbled (mojibake)
    	cfd = nltk.ConditionalFreqDist(bigrams)
    	print cfd['Ȋ']
  5. Word-segment (wakachi-gaki) all of the 24 hours x 60 minutes = 1440 files
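
For step 4, the bigram counts can also be sketched with only the Python 3 standard library, which may help if the installed NLTK behaves differently from the version the listing was written for. The function names and the sample sentence are illustrative assumptions and are not part of the original scripts.

```python
from collections import Counter, defaultdict


def bigram_counts(words):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(words, words[1:]))


def conditional_counts(words):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        following[first][second] += 1
    return following


if __name__ == "__main__":
    # Hypothetical, already segmented sample (not taken from the log data).
    words = "watashi wa neko ga suki . watashi wa inu mo suki .".split()
    # most_common() sorts by frequency explicitly, so no iteration order is assumed.
    for (first, second), count in bigram_counts(words).most_common(3):
        print(first, second, count)
    print(conditional_counts(words)["watashi"])
```

In Python 3 strings are Unicode, so counters keyed by Japanese words print as readable text.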

Read the *.dat files under the directory and extract the tweets from them. (Written by Ashihara)

  • As preprocessing, the control characters in the .dat files must be removed.
    • Command
      	find . -name "*.dat" | xargs sed -i "s/^M//g"
    • Enter ^M by pressing [Ctrl]+[v] and then [Ctrl]+[M]. Do not type the two characters ^ and M directly.
      	[Ctrl]+[v] + [Ctrl] + [M]
    • The files may contain other control characters as well. Remove them as needed.
  • Extracting tweets
    • Read the raw data and turn each line into a list with split().
    • In the split list, every tweet is guaranteed to contain an element '2010', and everything after it is the tweet text.
    • Using this, get the position of '2010' with the index function and join all the elements after it to form the tweet.
    • Write the tweets to a file, one per line. This produces a file named tweetList.txt.
    • Run mecab on this tweetList.txt. This time it was run directly from the command line.
    • This processing fails unless the control characters have been removed. The source code follows.
    • The output may not be complete.
         # vim: set fileencoding=utf-8 :
         import codecs
         import glob
         # Read every line of every *.dat file in the current directory
         # (glob replaces the original call to 'ls *.dat')
         datList = []
         for path in sorted(glob.glob('*.dat')):
                 for dat in codecs.open(path, 'r', 'utf-8'):
                         datList.append(dat.encode('utf-8'))
         tweetList = []
         for dats in datList:
                 swaplist = dats.split()
                 if '2010' not in swaplist:
                         continue  # no '2010' element on this line, so skip it
                 index = swaplist.index('2010')  # everything after '2010' is the tweet
                 # Join the remaining elements with no separator, as before
                 tweetList.append(''.join(swaplist[index + 1:]))
         # Write the tweets out, one per line
         f = open('tweetList.txt', 'w')
         for tweet in tweetList:
                 f.write(tweet + '\n')
         f.close()
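
The extraction rule described above (find '2010', keep everything after it) can be sketched as a small Python 3 function that also strips control characters first, so the sed step is not strictly required. The set of characters treated as unwanted, the function names and the sample line are assumptions. The page itself only names ^M and does not show the log format.

```python
import re

# Control characters other than tab and newline. Which characters count as
# unwanted is an assumption; the page only names ^M (carriage return).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def strip_control_chars(line):
    """Remove control characters but keep tabs and newlines."""
    return CONTROL_CHARS.sub("", line)


def extract_tweet(line, marker="2010"):
    """Return the text after the first `marker` element, or None if absent."""
    fields = strip_control_chars(line).split()
    if marker not in fields:
        return None
    index = fields.index(marker)
    # Join with no separator, mirroring the script above.
    return "".join(fields[index + 1:])


if __name__ == "__main__":
    # Hypothetical log line; the real field layout may differ.
    print(extract_tweet("123\tuser\t2010\thello world\r\n"))
```

Joining with no separator drops the spaces inside a tweet. That is harmless for Japanese text that mecab will segment anyway, but `" ".join(...)` may be preferable for tweets containing English.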

Review exercises

  • Write a program that reads the file "TwitterLog20100805-1600.dat" and writes only the comments to the file "Comment.txt"
    • Remove ^M
  • Required knowledge
    • Reading a file
      	#!/usr/bin/env python
      	for line in open('TwitterLog20100805-1600.dat', 'r'):
      	    print line
    • Writing a file
      	strs = "abc"
      	f = open('Comment.txt', 'w')
      	f.writelines(strs)
      	f.close()
    • Read "TwitterLog20100805-1600.dat" and write it to "Comment.txt" unchanged
      	#!/usr/bin/env python
      	
      	f = open('Comment.txt', 'w')
      	for line in open('TwitterLog20100805-1600.dat', 'r'):
      	        print line
      	        f.writelines(line)
      	f.close()
    • Use the split function to split each line on tabs and write the 4th element (the tweet text) to "Comment.txt"
      	#!/usr/bin/env python
      	
      	f = open('Comment.txt', 'w')
      	for line in open('TwitterLog20100805-1600.dat', 'r'):
      	        print line
      	        items = line.split("\t")
      	        print items[3]
      	        f.writelines(items[3])
      	f.close()
    • Replacement processing
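
The pieces above can be combined into one possible solution to the exercise. This is a Python 3 sketch, not the intended answer: it assumes the file is UTF-8, that the 4th tab-separated field is the tweet text as stated above, and that lines with fewer than four fields should be skipped. The function names are hypothetical.

```python
def extract_comment(line):
    """Return the 4th tab-separated field with ^M and the newline removed."""
    items = line.replace("\r", "").rstrip("\n").split("\t")
    if len(items) < 4:
        return None  # malformed line: fewer than four fields
    return items[3]


def write_comments(src_path, dst_path, encoding="utf-8"):
    """Copy only the comment field of each line from src_path to dst_path."""
    # newline="\n" keeps any ^M in the data so it is removed explicitly above.
    with open(src_path, encoding=encoding, newline="\n") as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            comment = extract_comment(line)
            if comment is not None:
                dst.write(comment + "\n")


if __name__ == "__main__":
    import os

    # File names as given in the exercise; skipped when the log is absent.
    if os.path.exists("TwitterLog20100805-1600.dat"):
        write_comments("TwitterLog20100805-1600.dat", "Comment.txt")
```

Here the replacement step is `line.replace("\r", "")`, which removes the ^M characters without needing sed.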
