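Word frequency statistics for English text

English text:

Counting word frequencies in English text takes two steps:

Denoise and normalize the text

Use a dictionary to store the word counts

Code: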

#CalHamletV1.py
def getText():
    # Read the whole file and close it when done
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

hamletTxt = getText()
words  = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)   # sort by count, descending
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
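
The same two steps can also be written with the standard library's `str.translate` and `collections.Counter`. This is a minimal sketch, not the original script: the helper names `normalize` and `top_words` and the sample sentence are illustrative assumptions. Note that `string.punctuation` also strips the apostrophe, which the character list in the script above does not.

```python
import string
from collections import Counter

# Map every ASCII punctuation character to a space in one pass.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    return text.lower().translate(_PUNCT_TO_SPACE)

def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(normalize(text).split()).most_common(n)

if __name__ == "__main__":
    sample = "To be, or not to be: that is the question."  # illustrative input
    for word, count in top_words(sample, 3):
        print("{0:<10}{1:>5}".format(word, count))
```

`Counter` replaces the manual `counts.get(word, 0) + 1` pattern, and `most_common(n)` replaces the explicit sort followed by slicing.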

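Word frequency statistics for Chinese text

Chinese text:

Counting word frequencies in Chinese text takes two steps:

Segment the Chinese text into words

Use a dictionary to store the word counts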

#CalThreeKingdomsV1.py
import jieba

with open("threekingdoms.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words  = jieba.lcut(txt)   # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:     # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358

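The output clearly contains irrelevant words (such as 却说 and 二人) and duplicate references to the same person (such as 孔明 and 孔明曰).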

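Improved version

Counting word frequencies in Chinese text now takes three steps:

Segment the Chinese text into words

Use a dictionary to store the word counts

Extend the program to handle the problems above

We put the irrelevant words in the excludes set and remove them from the counts; different names for the same person are merged under a single key.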

#CalThreeKingdomsV2.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
with open("threekingdoms.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge different names for the same person under one key
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)   # no KeyError if an excluded word never occurred
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
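
As the list of name variants grows, the `elif` chain above gets long. One possible alternative, sketched here as an assumption rather than as part of the original script, is a lookup table from each variant to its canonical name. The names `ALIASES` and `count_words` are hypothetical; the mappings themselves are the ones used in the script above. The function takes an already segmented word list, so it runs without jieba.

```python
# Hypothetical alias table: each variant maps to one canonical name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

def count_words(words, aliases=ALIASES, excludes=frozenset()):
    """Count multi-character tokens, merging aliases and skipping excludes."""
    counts = {}
    for word in words:
        if len(word) == 1:          # skip single-character tokens
            continue
        canonical = aliases.get(word, word)
        if canonical in excludes:   # drop irrelevant words up front
            continue
        counts[canonical] = counts.get(canonical, 0) + 1
    return counts

if __name__ == "__main__":
    tokens = ["孔明", "孔明曰", "曰", "将军", "玄德", "刘备"]  # illustrative input
    print(count_words(tokens, excludes={"将军"}))
```

Adding a new variant then means adding one dictionary entry instead of another branch.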

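Word frequency statistics for postgraduate entrance exam English (考研英语)

Applying word frequency counting to the English section of the postgraduate entrance exam, we can find the key words that appear most often.

Text link: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA Password: fw3r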

# Adapted from CalHamletV1.py
def getText():
    with open("86_17_1_2.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

pyTxt = getText()       # text with the punctuation above removed
words  = pyTxt.split()  # split into words
counts = {}             # dictionary of word -> count
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",
            "was", "are", "have", "were", "had", "that", "for", "it",
            "on", "be", "as", "with", "by", "not", "their", "they",
            "from", "more", "but", "or", "you", "at", "has", "we", "an",
            "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",
            "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into",
            "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even",
            "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too",
            "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come",
            "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little",
            "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though",
            "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents",
            "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。", "63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68", "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp",
            "yourself", "gone", "150"}
for word in words:
    if word not in excludes:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
x = len(counts)   # number of distinct words left after filtering
print(x)
r = 0
# Compare the raw input instead of calling eval() on it, and avoid
# shadowing the built-in next()
choice = input("Enter 1 to continue: ").strip()
while choice == "1":
    r += 100
    # Stop at the end of the list so the last page cannot raise IndexError
    for i in range(r, min(r + 100, len(items))):
        word, count = items[i]
        print("\"{}\"".format(word), end = ", ")
    choice = input("Enter 1 to continue: ").strip()

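The paging loop at the end of the script prints the sorted words in blocks of 100, already quoted so they can be pasted into `excludes`. A slice-based version is sketched below under stated assumptions: the helper names `pages` and `format_page` are hypothetical and the sample data is made up. Because slicing past the end of a list returns a shorter list, the last page needs no bounds check.

```python
def pages(items, size=100, start=0):
    """Yield successive slices of items, each at most `size` long."""
    for offset in range(start, len(items), size):
        yield items[offset:offset + size]

def format_page(page):
    """Render (word, count) pairs as a quoted, comma-separated list."""
    return ", ".join('"{}"'.format(word) for word, _ in page)

if __name__ == "__main__":
    # Illustrative data standing in for the sorted (word, count) list.
    items = [("word{}".format(i), 250 - i) for i in range(250)]
    for page in pages(items, size=100):
        print(format_page(page))
        print()
```

In the interactive script, the `for` loop over `pages(items)` would be driven by the same "Enter 1 to continue" prompt.
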
Reposted from: http://ddtkz.baihongyu.com/
