2022-07-10 13:09:36

Python collections.Counter用法

什么是collections
Counter
Counter操作
例子

什么是collections

collections在python官方文档中的解释是High-performance container datatypes，直接的中文翻译解释高性能容量数据类型。
它总共包含五种数据类型：
在这里插入图片描述
其中Counter中文意思是计数器，也就是我们常用于统计的一种数据类型，在使用Counter之后可以让我们的代码更加简单易读。

Counter

我们先看一个简单的例子：

#统计词频
colors=['red','blue','red','green','blue','blue']
result={}for colorin colors:if result.get(color)==None:
        result[color]=1else:
        result[color]+=1print(result)
#{'red':2,'blue':3,'green':1}

下面我们看用Counter怎么实现：

from collectionsimport Counter
colors=['red','blue','red','green','blue','blue']
c=Counter(colors)print(dict(c))

显然代码更加简单了，也更容易读和维护了。

Counter操作

可以创建一个空的Counter：

cnt=Counter()

之后在空的Counter上进行一些操作。
也可以创建的时候传进去一个迭代器（数组，字符串，字典等）：

c=Counter('gallahad')                 # 传进字符串
c=Counter({'red':4,'blue':2})      # 传进字典
c=Counter(cats=4, dogs=8)             # 传进元组

判断是否包含某元素，可以转化为dict然后通过dict判断，Counter也带有函数可以判断：

c=Counter(['eggs','ham'])
c['bacon']                              # 不存在就返回0
#0

删除元素：

c['sausage']=0                        # counter entrywith a zero count
del c['sausage']

获得所有元素：

c=Counter(a=4, b=2, c=0, d=-2)list(c.elements())
#['a','a','a','a','b','b']

查看最常见出现的k个元素：

Counter('abracadabra').most_common(3)
#[('a',5),('r',2),('b',2)]

Counter更新：

c=Counter(a=3, b=1)
d=Counter(a=1, b=2)
c+ d                       # 相加
#Counter({'a':4,'b':3})
c- d                       # 相减，如果小于等于0，删去
#Counter({'a':2})
c& d                       # 求最小
#Counter({'a':1,'b':1})
c| d                       # 求最大
#Counter({'a':3,'b':2})

例子

例子：读文件统计词频并按照出现次数排序，文件是以空格隔开的单词的诸多句子：

from collectionsimport Counter
lines=open("./data/input.txt","r").read().splitlines()
lines=[lines[i].split(" ")for iinrange(len(lines))]
words=[]for linein lines:
    words.extend(line)
result=Counter(words)print(result.most_common(10))

当需要统计的文件比较大，使用read()一次读不完的情况：

from collectionsimport Counter
result=Counter()withopen("./data/input.txt","r")as f:while True:
        lines= f.read(1024).splitlines()if lines==[]:break
        lines=[lines[i].split(" ")for iinrange(len(lines))]
        words=[]for linein lines:
            words.extend(line)
        tmp=Counter(words)
        result+=tmpprint(result.most_common(10))

具体可以参考https://docs.python.org/2/library/collections.html#collections.Counter.most_common