
Counting how often each unique pair occurs in a double list (Python 3)


Suppose I have a double list [[],[]] in Python:

doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], 
              ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]

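I want to count how many times the pair doublelist[0][0] & doublelist[1][0] = all, the occurs in the double list. The second [] is the index.

For example, one occurrence is at doublelist[0][0] doublelist[1][0] and another is at doublelist[0][6] doublelist[1][6].

What code would I use in Python 3 to iterate through doublelist[i][i], grabbing each value set, e.g. [["all"],["the"]], together with an integer giving the number of times that value set occurs in the list?

Ideally I would like to output it to a triple list, triplelist[[i],[i],[i]], that holds the [i][i] values plus the integer count in the third [i].

Example code: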

for i in range(len(triplelist[0])):    # iterate over positions, not over the words themselves
    print(triplelist[0][i])
    print(triplelist[1][i])
    print(triplelist[2][i])

Output:

>"all"
>"the"
>2
>"the"
>"big"
>1
>"big"
>"dogs"
>1

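And so on...

Also, it should preferably skip duplicates, so the list would not contain two entries where [i][i][i] = [[all],[the],[2]] just because the original list has 2 instances ([0][0] [1][0] and [0][6] [1][6]). I only want all the unique word pairs and the number of times they occur in the original text.

The purpose of the code is to see how often one word follows another in a given text. It is for building a smart Markov chain generator that can weight word values. I already have code that splits the text into a double list containing the words in the first list and the following words in the second list.

Here is my current code for reference (the problem is that after initializing wordlisttriple I don't know how to make it do what I described above):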

#import
import re #for regex expression below

#main
with open("text.txt") as rawdata:    #open text file and create a datastream
    rawtext = rawdata.read()    #read through the stream and create a string containing the text
rawdata.close()    #close the datastream
rawtext = rawtext.replace('\n', ' ')    #remove newline characters from text
rawtext = rawtext.replace('\r', ' ')    #remove newline characters from text
rawtext = rawtext.replace('--', ' -- ')    #break up blah--blah words so it can read 2 separate words blah -- blah
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)    #regex pattern for grabbing everything before a sentence ending punctuation
sentencelist = []    #initialize list for sentences in text
sentencelist = pat.findall(rawtext)    #apply regex pattern to string to create a list of all the sentences in the text
firstwordlist = []    #initialize the list for the first word in each sentence
for index, firstword in enumerate(sentencelist):    #enumerate through the sentence list
    sentenceindex = int(index)    #get the index for below operation
    firstword = sentencelist[sentenceindex].split(' ')[0]    #use split to only grab the first word in each sentence
    firstwordlist.append(firstword)    #append each sentence starting word to first word list
rawtext = rawtext.replace(', ', ' , ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('. ', ' . ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('"', ' " ')    #break up punctuation so they are not considered part of words
sentencelistforwords = []    #initialize sentence list for parsing words
sentencelistforwords = pat.findall(rawtext)    #run the regex pattern again this time with the punctuation broken up by spaces
wordsinsentencelist = []    #initialize list for all of the words that appear in each sentence
for index, words in enumerate(sentencelistforwords):    #enumerate through the sentence list with punctuation split off
    sentenceindex = int(index)    #grab the index for below operation
    words = sentencelistforwords[sentenceindex].split(' ')    #split up the words in each sentence so we have nested lists that contain each word in each sentence
    wordsinsentencelist.append(words)    #append above described to the list
wordlist = []    #initialize list of all words
wordlist = rawtext.split(' ')    #create list of all words by splitting the entire text by spaces
wordlist = list(filter(None, wordlist))    #use filter to get rid of empty strings in the list
wordlistdouble = [[], []]    #initialize the word list double to contain words and the words that follow them in sentences
for index, word in enumerate(wordlist):    #enumerate through word list
    if(int(index) < int(len(wordlist))-1):    #only go to 1 before the end of list so we don't get an index out of bounds error
        wordlistindex1 = int(index)    #grab index for first word
        wordlistindex2 = int(index)+1    #grab index for following word
        wordlistdouble[0].append(wordlist[wordlistindex1])    #append first word to first list of word list double
        wordlistdouble[1].append(wordlist[wordlistindex2])    #append following word to second list of word list double
wordlisttriple = [[], [], []]    #initialize word list triple
for index, unit in enumerate(wordlistdouble[0]):    #enumerate through word list double
    word1 = wordlistdouble[0][index]    #grab word at first list of word list double at the current index
    word2 = wordlistdouble[1][index]    #grab word at second list of word list double at the current index
    count = 0    #initialize word double data set counter
    wordlisttriple[0].append(word1)    #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[1].append(word2)    #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[2].append(count)    #these need to be encapsulated in some kind of loop/if/for idk
    #for index, unit1 in enumerate(wordlistdouble[0]):
        #if(wordlistdouble[0][int(index)] == word1 && wordlistdouble[1][int(index)+1] == word2):
            #count++

#sentencelist = list of all sentences
#firstwordlist = list of words that start sentencelist
#sentencelistforwords = list of all sentences mutated for ease of extracting words
#wordsinsentencelist = list of lists containing all of the words in each sentence
#wordlist = list of all words
#wordlistdouble = dual list of all words plus the words that follow them

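Any advice would be greatly appreciated. If I am going about this the wrong way and there is an easier way to accomplish the same thing, that would be amazing too. Thanks!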



Answer from niemmi:

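Assuming you have already parsed the text into a list of words, you can create an iterator that starts from the second word, zip it with the words, and run the result through Counter: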

from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)

print(*Counter(zip(words, nxt)).items(), sep='\n')

Output:

(('big', 'dogs'), 1)
(('kids', 'eat'), 1)
(('small', 'kids'), 1)
(('the', 'big'), 1)
(('dogs', 'eat'), 1)
(('eat', 'paste'), 1)
(('all', 'the'), 2)
(('chicken', 'all'), 1)
(('paste', 'lumps'), 1)
(('eat', 'chicken'), 1)
(('the', 'small'), 1)

Above, nxt is an iterator that walks the word list. Since we want it to start from the second word, we pull one word out with next before using it:

>>> nxt = iter(words)
>>> next(nxt)
'all'
>>> list(nxt)
['the', 'big', 'dogs', 'eat', 'chicken', 'all', 'the', 'small', 'kids', 'eat', 'paste', 'lumps']
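
As a small sketch of why the main snippet calls next(nxt, None) rather than next(nxt): the two-argument form returns the default instead of raising StopIteration when the word list is empty. The helper name skip_first is made up here for illustration and is not part of the original answer.

```python
def skip_first(items):
    """Return an iterator over items that starts at the second element.

    skip_first is a hypothetical helper name used only for this sketch.
    """
    it = iter(items)
    next(it, None)  # the default suppresses StopIteration on an empty input
    return it


print(list(skip_first(["all", "the", "big"])))  # ['the', 'big']
print(list(skip_first([])))                     # []
```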

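Then we pass the original list and the iterator to zip, which returns an iterable of tuples with two items each. (nxt here has to be a fresh iterator advanced by one word, because the list(nxt) call above has already exhausted the previous one.)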

>>> list(zip(words, nxt))
[('all', 'the'), ('the', 'big'), ('big', 'dogs'), ('dogs', 'eat'), ('eat', 'chicken'), ('chicken', 'all'), ('all', 'the'), ('the', 'small'), ('small', 'kids'), ('kids', 'eat'), ('eat', 'paste'), ('paste', 'lumps')]
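
If keeping track of iterator state feels error-prone, one possible alternative (not the approach used in this answer) is to zip the list with a slice of itself. The sketch below assumes the word list fits comfortably in memory, since words[1:] makes a shifted copy.

```python
words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the",
         "small", "kids", "eat", "paste", "lumps"]

# words[1:] is a shifted copy of the list, so there is no iterator to exhaust
# and the expression can be evaluated as many times as needed.
pairs = list(zip(words, words[1:]))

print(pairs[:3])   # [('all', 'the'), ('the', 'big'), ('big', 'dogs')]
print(len(pairs))  # 12
```

zip stops at the shorter input, so the last word is never paired with a missing successor.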

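Finally, the output from zip (again built from a fresh nxt) is passed to Counter, which counts each pair and returns a dict-like object where the keys are the pairs and the values are the counts: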

>>> Counter(zip(words, nxt))
Counter({('all', 'the'): 2, ('eat', 'chicken'): 1, ('big', 'dogs'): 1, ('small', 'kids'): 1, ('kids', 'eat'): 1, ('paste', 'lumps'): 1, ('chicken', 'all'): 1, ('dogs', 'eat'): 1, ('the', 'big'): 1, ('the', 'small'): 1, ('eat', 'paste'): 1})
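
The question asked for the result as three parallel lists. A minimal sketch of how the Counter could be unpacked into that shape follows; the name triplelist comes from the question, pair_counts is a name chosen here, and the pairs are built with a slice instead of the iterator purely to keep the block self-contained.

```python
from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the",
         "small", "kids", "eat", "paste", "lumps"]

# Count every (word, following word) pair; each unique pair appears once as a key.
pair_counts = Counter(zip(words, words[1:]))

# Three parallel lists: first word, following word, number of occurrences.
triplelist = [[], [], []]
for (first, second), count in pair_counts.items():
    triplelist[0].append(first)
    triplelist[1].append(second)
    triplelist[2].append(count)

# Print one "first second count" line per unique pair.
for i in range(len(triplelist[0])):
    print(triplelist[0][i], triplelist[1][i], triplelist[2][i])
```

For the weighted generator mentioned in the question, it may be simpler to keep the Counter itself: the counts of all pairs starting with a given word could be passed as the weights argument of random.choices (Python 3.6+) to pick the next word.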
