Suppose I have a double list [[],[]] in Python:
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]
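I want to count the number of times the pair doublelist[0][0] & doublelist[1][0] = all, the occurs in the double list. The second [] is the index.
For example, one occurrence is at doublelist[0][0] doublelist[1][0] and another is at doublelist[0][6] doublelist[1][6].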
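What code would I use in Python 3 to iterate through doublelist[i][i], grab each value set (e.g. [["all"],["the"]]), and also get an integer for how many times that value set occurs in the list?
Ideally, I'd like to output it to a triple list, triplelist[[i],[i],[i]], that holds the [i][i] values plus the integer in the third [i].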
Example code:

for i in range(len(triplelist[0])):
    print(triplelist[0][i])
    print(triplelist[1][i])
    print(triplelist[2][i])
Output:

>"all"
>"the"
>2
>"the"
>"big"
>1
>"big"
>"dogs"
>1

and so on...
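Also, it should preferably skip duplicates, so the list wouldn't contain two entries where [i][i][i] = [[all],[the],[2]] just because there are two instances in the original list ([0][0] [1][0] and [0][6] [1][6]). I only want each unique word pair and the number of times it occurs in the original text.
The purpose of the code is to see how often one word follows another in a given text. It's for building a smart Markov chain generator that can weight word values. I already have code that splits the text into a double list, with the words in the first list and the words that follow them in the second list.
Here is my current code for reference (the problem is that after I initialize wordlisttriple, I don't know how to make it do what I described above):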
#import
import re #for regex expression below

#main
with open("text.txt") as rawdata: #open text file and create a datastream
    rawtext = rawdata.read() #read through the stream and create a string containing the text
    rawdata.close() #close the datastream
rawtext = rawtext.replace('\n', ' ') #remove newline characters from text
rawtext = rawtext.replace('\r', ' ') #remove newline characters from text
rawtext = rawtext.replace('--', ' -- ') #break up blah--blah words so it can read 2 separate words blah -- blah
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M) #regex pattern for grabbing everything before a sentence ending punctuation
sentencelist = [] #initialize list for sentences in text
sentencelist = pat.findall(rawtext) #apply regex pattern to string to create a list of all the sentences in the text
firstwordlist = [] #initialize the list for the first word in each sentence
for index, firstword in enumerate(sentencelist): #enumerate through the sentence list
    sentenceindex = int(index) #get the index for below operation
    firstword = sentencelist[sentenceindex].split(' ')[0] #use split to only grab the first word in each sentence
    firstwordlist.append(firstword) #append each sentence starting word to first word list
rawtext = rawtext.replace(', ', ' , ') #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('. ', ' . ') #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('"', ' " ') #break up punctuation so they are not considered part of words
sentencelistforwords = [] #initialize sentence list for parsing words
sentencelistforwords = pat.findall(rawtext) #run the regex pattern again this time with the punctuation broken up by spaces
wordsinsentencelist = [] #initialize list for all of the words that appear in each sentence
for index, words in enumerate(sentencelist): #enumerate through sentence list
    sentenceindex = int(index) #grab the index for below operation
    words = sentencelist[sentenceindex].split(' ') #split up the words in each sentence so we have a nested lists that contain each word in each sentence
    wordsinsentencelist.append(words) #append above described to the list
wordlist = [] #initialize list of all words
wordlist = rawtext.split(' ') #create list of all words by splitting the entire text by spaces
wordlist = list(filter(None, wordlist)) #use filter to get rid of empty strings in the list
wordlistdouble = [[], []] #initialize the word list double to contain words and the words that follow them in sentences
for index, word in enumerate(wordlist): #enumerate through word list
    if(int(index) < int(len(wordlist))-1): #only go to 1 before the end of list so we don't get an index out of bounds error
        wordlistindex1 = int(index) #grab index for first word
        wordlistindex2 = int(index)+1 #grab index for following word
        wordlistdouble[0].append(wordlist[wordlistindex1]) #append first word to first list of word list double
        wordlistdouble[1].append(wordlist[wordlistindex2]) #append following word to second list of word list double
wordlisttriple = [[], [], []] #initialize word list triple
for index, unit in enumerate(wordlistdouble[0]): #enumerate through word list double
    word1 = wordlistdouble[0][index] #grab word at first list of word list double at the current index
    word2 = wordlistdouble[1][index] #grab word at second list of word list double at the current index
    count = 0 #initialize word double data set counter
    wordlisttriple[0].append(word1) #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[1].append(word2) #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[2].append(count) #these need to be encapsulated in some kind of loop/if/for idk
    #for index, unit1 in enumerate(wordlistdouble[0]):
        #if(wordlistdouble[0][int(index)] == word1 && wordlistdouble[1][int(index)+1] == word2):
            #count++

#sentencelist = list of all sentences
#firstwordlist = list of words that start sentencelist
#sentencelistforwords = list of all sentences mutated for ease of extracting words
#wordsinsentencelist = list of lists containing all of the words in each sentence
#wordlist = list of all words
#wordlistdouble = dual list of all words plus the words that follow them
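Any suggestions would be greatly appreciated. If I'm going about this the wrong way and there is an easier way to accomplish the same thing, that would be amazing too. Thanks!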
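Assuming you have already parsed the text into a list of words, you can create an iterator that starts from the second word, zip it with the words, and run the result through Counter: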
from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)

print(*Counter(zip(words, nxt)).items(), sep='\n')
Output:

(('big', 'dogs'), 1)
(('kids', 'eat'), 1)
(('small', 'kids'), 1)
(('the', 'big'), 1)
(('dogs', 'eat'), 1)
(('eat', 'paste'), 1)
(('all', 'the'), 2)
(('chicken', 'all'), 1)
(('paste', 'lumps'), 1)
(('eat', 'chicken'), 1)
(('the', 'small'), 1)
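In the code above, nxt is an iterator that walks over the list of words. Since we want it to start from the second word, we pull one word out with next before using it: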
>>> nxt = iter(words)
>>> next(nxt)
'all'
>>> list(nxt)
['the', 'big', 'dogs', 'eat', 'chicken', 'all', 'the', 'small', 'kids', 'eat', 'paste', 'lumps']
Then we pass the original list and the iterator to zip, which returns an iterable of tuples, each containing two items:
>>> list(zip(words, nxt))
[('all', 'the'), ('the', 'big'), ('big', 'dogs'), ('dogs', 'eat'), ('eat', 'chicken'), ('chicken', 'all'), ('all', 'the'), ('the', 'small'), ('small', 'kids'), ('kids', 'eat'), ('eat', 'paste'), ('paste', 'lumps')]
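Because zip simply pairs items by position, the same idea should also work on the two parallel lists from the question, without rebuilding a flat word list first. A minimal sketch, assuming doublelist has the shape shown in the question (the name pair_counts is only illustrative):

```python
from collections import Counter

# The two parallel lists from the question:
# doublelist[0][i] is a word and doublelist[1][i] is the word that follows it.
doublelist = [
    ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"],
    ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"],
]

# zip pairs the lists position by position, so each tuple is (word, following word).
# pair_counts is an illustrative name, not something from the original post.
pair_counts = Counter(zip(doublelist[0], doublelist[1]))

print(pair_counts[("all", "the")])  # occurrences of "all" followed by "the"
print(len(pair_counts))             # number of unique pairs
```

Duplicates are collapsed automatically here, since each unique pair is a single key in the Counter.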
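Finally, the output of zip is passed to Counter, which counts each pair and returns a dict-like object where the keys are the pairs and the values are the counts: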
>>> Counter(zip(words, nxt))
Counter({('all', 'the'): 2, ('eat', 'chicken'): 1, ('big', 'dogs'): 1, ('small', 'kids'): 1, ('kids', 'eat'): 1, ('paste', 'lumps'): 1, ('chicken', 'all'): 1, ('dogs', 'eat'): 1, ('the', 'big'): 1, ('the', 'small'): 1, ('eat', 'paste'): 1})
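If the three-parallel-list layout from the question is still wanted, the Counter can be unpacked into it. The sketch below also shows one possible way to use the counts as weights when picking a following word for the Markov chain. Two things here are assumptions rather than part of the original answer: words[1:] is used in place of the iterator (it yields the same pairs but copies the list), and pick_next is a hypothetical helper built on random.choices, which needs Python 3.6 or later.

```python
from collections import Counter
import random

words = ["all", "the", "big", "dogs", "eat", "chicken", "all",
         "the", "small", "kids", "eat", "paste", "lumps"]

# Same pairs as zip(words, nxt); slicing is used here only for brevity.
pair_counts = Counter(zip(words, words[1:]))

# Unpack into three parallel lists: first word, following word, count.
triplelist = [[], [], []]
for (first, second), count in pair_counts.items():
    triplelist[0].append(first)
    triplelist[1].append(second)
    triplelist[2].append(count)

for i in range(len(triplelist[0])):
    print(triplelist[0][i], triplelist[1][i], triplelist[2][i])


def pick_next(word, rng=random):
    """Hypothetical helper: choose a follower of `word`, weighted by pair count."""
    followers = [(second, count)
                 for (first, second), count in pair_counts.items()
                 if first == word]
    if not followers:
        return None  # the word never appears with a follower
    candidates, weights = zip(*followers)
    return rng.choices(candidates, weights=weights, k=1)[0]


print(pick_next("the"))  # either "big" or "small"
```

For a larger text, it would probably be more efficient to group the followers by first word once (for example in a dict of Counters) instead of scanning every pair on each call.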