Python study session, 2010-08-17 - Kimura seminar in Otaru University of Commerce

[[TopPage]]

***Processing Twitter logs with nltk [#d3e1dcf3]
+Copy TwitterLog20100805 into your own directory
	cp -r /home/maeda/TwitterLog/20100805/ .
+Reference site: http://d.hatena.ne.jp/nokuno/20100123/1264239192
+Sample program that reads Japanese text
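	#! /usr/bin/env python
	#encoding: utf-8
	import nltk
	# Read text whose words are separated by whitespace
	raw = open('./sample.mecab').read()
	words = raw.split()
	print len(words)
	text = nltk.Text(words)
	# In NLTK 2.x, generate() prints the random text itself and returns None,
	# so its result is not printed again here
	text.generate(300)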
+bigrams
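	#!/usr/bin/python
	#encoding: utf-8
	import nltk
	
	raw = open('sample.dat').read()
	words = raw.split()
	bigrams = nltk.bigrams(words)
	fd = nltk.FreqDist(bigrams)
	# NLTK 2.x iterates a FreqDist from the most frequent item down,
	# so stop as soon as the count falls to 100 or below
	for w in fd:
	        if fd[w] <= 100:
	                break
	        print w[0], w[1], fd[w]
	
	# The output of the following comes out garbled
	cfd = nltk.ConditionalFreqDist(bigrams)
	print cfd['私']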
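To make clear what the bigram counts above contain, here is a minimal sketch that computes the same kind of tables with the standard library only. It assumes Python 3 rather than the Python 2 used on this page; `count_bigrams` and the sample sentence are made up for illustration and are not part of nltk. The garbled output noted above is probably Python 2 showing byte strings as escape sequences, which Python 3 strings avoid.

```python
# Sketch only: a standard-library stand-in for nltk.bigrams, FreqDist and
# ConditionalFreqDist (assumes Python 3; all names here are illustrative).
from collections import Counter, defaultdict


def count_bigrams(words):
    """Return (bigram counts, counts of the following word keyed by first word)."""
    pairs = list(zip(words, words[1:]))   # adjacent word pairs
    freq = Counter(pairs)
    following = defaultdict(Counter)
    for first, second in pairs:
        following[first][second] += 1
    return freq, following


# Tiny hand-made sample standing in for word-segmented text.
words = "私 は 猫 です 私 は 犬 です 私 も 猫 です".split()
freq, following = count_bigrams(words)

# Most frequent bigrams first.
for (first, second), n in freq.most_common(3):
    print(first, second, n)

# Words that follow "私", with their counts.
print(dict(following["私"]))
```

`following["私"]` plays the role of `cfd['私']`: it maps each word seen after 私 to how often it occurred.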
+Word-segment (wakati-gaki) all 24 hours x 60 minutes = 1440 files
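One way to loop over the 1440 files is sketched below, assuming Python 3. The file-name pattern is only inferred from the single name "TwitterLog20100805-1600.dat" that appears on this page, and the ".wakati" output suffix and all function names are invented for the example. The sketch only builds the `mecab -O wakati` commands; `segment_all` actually runs them and needs the mecab command to be installed.

```python
# Sketch only (assumes Python 3). The name pattern is inferred from
# "TwitterLog20100805-1600.dat"; the ".wakati" suffix is made up here.
import subprocess


def minute_log_names(day="20100805"):
    """Return the 24 x 60 = 1440 per-minute log names for one day."""
    return ["TwitterLog%s-%02d%02d.dat" % (day, hour, minute)
            for hour in range(24) for minute in range(60)]


def wakati_command(src):
    """Build the MeCab command that writes a space-separated copy of src."""
    return ["mecab", "-O", "wakati", "-o", src + ".wakati", src]


def segment_all(names):
    """Run MeCab on every file; requires the mecab command on the PATH."""
    for name in names:
        subprocess.check_call(wakati_command(name))


names = minute_log_names()
print(len(names), names[0], names[-1])
```

Calling `segment_all(names)` inside the copied log directory would then produce one segmented file per minute, if the files really follow this naming scheme.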


***Reading the *.dat files in a directory and extracting the tweets from them (written by Ashihara) [#s4de06db]
--As preprocessing, the control characters inside the .dat files must be removed.
---Command
	find . -name "*.dat" | xargs sed -i "s/^M//g"
---Enter ^M by pressing [Ctrl]+[v] followed by [Ctrl]+[M]. Do not type it literally as the two characters ^ and M.
	[Ctrl]+[v] + [Ctrl] + [M]
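---Other control characters may also be present and should be removed as needed.
--Extracting the tweets
---Read the raw data and turn each line into a list with split().
---Every tweet has exactly one '2010' element in its split list, and everything after that element is the tweet text.
---Using this, get the position of '2010' with the index function and join all later elements to form the tweet.
---Write the tweets to a file, one per line. This produces a file named tweetList.txt.
---Run mecab on tweetList.txt. This time it was run directly from the command line.
---This process fails unless the control characters have been removed. The source code follows.
---The output may not be complete.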
    # vim: fileencoding=utf-8
    import codecs
    import glob
    # Collect every line of every .dat file in the current directory
    datList = []
    for path in sorted(glob.glob('*.dat')):
        for dat in codecs.open(path, 'r', 'utf-8'):
            datList.append(dat.encode('utf-8'))
    tweetList = []
    for dats in datList:
        swaplist = dats.split()
        if '2010' not in swaplist:
            continue  # skip lines without the year token (e.g. blank lines)
        index = swaplist.index('2010')  # everything after '2010' is the tweet
        tweetList.append(''.join(swaplist[index + 1:]))
    # Write one tweet per line
    f = open('tweetList.txt', 'w')
    for tweet in tweetList:
        f.write(tweet + '\n')
    f.close()
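The control-character cleanup could also be done in Python instead of sed. The following is a minimal sketch assuming Python 3; the function names and the sample line are invented for illustration, and the real log layout may differ. It removes ^M and the other C0 control characters while keeping tab and newline, then applies the '2010' rule described above.

```python
# Sketch only (assumes Python 3; function names and sample are illustrative).
import re

# C0 control characters and DEL, keeping tab (\x09) and newline (\x0a).
# \x0d is the ^M (carriage return) that sed removes above.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def strip_control_chars(text):
    """Remove control characters other than tab and newline."""
    return CONTROL_CHARS.sub("", text)


def extract_tweet(line, marker="2010"):
    """Join the tokens after the first marker token; None if it is absent."""
    tokens = strip_control_chars(line).split()
    if marker not in tokens:
        return None
    return "".join(tokens[tokens.index(marker) + 1:])


# Invented line: user, timestamp ending in the year, then the tweet text.
sample = "someuser\tThu Aug 05 16:00:00 +0000 2010\thello\x07 world\r\n"
print(extract_tweet(sample))
```

As in the program above, the tokens are joined without spaces, which suits Japanese text but also glues English words together.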
***Review exercise [#qecaa592]
-Write a program that reads the file "TwitterLog20100805-1600.dat" and writes only the comments to the file "Comment.txt"
--Remove the ^M characters
-Required knowledge
--Reading a file
	#!/usr/bin/env python
	for line in open('TwitterLog20100805-1600.dat', 'r'):
	    print line
--Writing a file
	strs = "abc"
	f = open('Comment.txt', 'w')
	f.write(strs)
	f.close()
--Read "TwitterLog20100805-1600.dat" and write it to "Comment.txt" unchanged
	#!/usr/bin/env python
	
	f = open('Comment.txt', 'w')
	for line in open('TwitterLog20100805-1600.dat', 'r'):
	        print line
	        f.write(line)
	f.close()
--Use the split function to split on tabs and write the 4th element (the tweet text) to "Comment.txt"
	#!/usr/bin/env python
	
	f = open('Comment.txt', 'w')
	for line in open('TwitterLog20100805-1600.dat', 'r'):
	        print line
	        items = line.split("\t")
	        print items[3]
	        f.write(items[3])
	f.close()
--Replacement (substitution)
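The replacement step can be sketched as follows, assuming Python 3 and UTF-8 input. `comment_from_line` and `write_comments` are hypothetical helper names and the sample line is invented. `replace("\r", "")` removes the ^M characters, and the 4th tab-separated field is taken as the tweet text, as in the split example above.

```python
# Sketch only (assumes Python 3; helper names and sample line are invented).
def comment_from_line(line, field=3):
    """Return the tab-separated field holding the tweet, without ^M or newline."""
    cleaned = line.replace("\r", "").rstrip("\n")   # "\r" is the ^M character
    items = cleaned.split("\t")
    return items[field] if len(items) > field else None


def write_comments(src, dst):
    """Copy only the comment field of each line in src into dst."""
    # newline="\n" makes Python split lines at "\n" only, so a stray ^M
    # inside a line does not start a new line before it is removed.
    with open(src, encoding="utf-8", newline="\n") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            comment = comment_from_line(line)
            if comment is not None:
                fout.write(comment + "\n")


print(comment_from_line("1\tuser\t2010-08-05\thello\r\n"))
```

Calling `write_comments('TwitterLog20100805-1600.dat', 'Comment.txt')` would then be one possible answer to the exercise, provided the file really has the tweet in its 4th field.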