Multiprocessing in OOP

Asked by Tony Stark, 2020-12-23 20:46:16 +0000 UTC

Please help: I ran the code below, but the CPU is not loaded and nothing moves on the usage graph. There is no result. Where did I go wrong?

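import multiprocessing as mp
import os
import re
import string

import pandas as pd
import requests
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin

# MyStem (used in _stemmer) is not imported in the original post; it appears to be the asker's own wrapper.

stop = stopwords.words('russian')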
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=-1):
        """
        Text preprocessing transformer:
        name - name of dataframe, saves files according to name
        n_jobs - parallel jobs to run
        """
        self.n_jobs = n_jobs


    def fit(self, X, name, y=None):
        self.name=name
        return self

    def transform(self, X, *_):
        # main transformer
        data=self._text_indexing(X)
        data=self._multi(self._proc_target, data)
        return data

    def _proc_target(self, task):
        #task=task.reset_index(drop=True, inplace=True)
        data=self._preprocess_text(task)
        data=self._stemmer(data)
        data = self._punc(data)
        data = self._stopwords_remover(data)
        return data

    def _multi(self, target, tasks, workers=None):
        if workers is None: workers = max(2, mp.cpu_count() - 1)
        pool = mp.Pool(processes=workers)
        res = pool.map(target, tasks)
        pool.close()
        pool.join()
        return res



    def _preprocess_text(self, text):
        low_cased_text = self._low_case(text['text'])
        eng_cleaned = self._english(low_cased_text)
        stopwords_cleaned = self._stopwords(eng_cleaned)
        return self._numbers(stopwords_cleaned)

    def _stemmer(self, text):
        mst=MyStem(mystem_path='/home/azubochenko/work/plagiat/baseline/pipeline/mystem')
        data=[]
        for i in text:
            data.append(mst._make_mystem_lemma(i))
        return data

    def _text_indexing(self, data):
        for i in data.iloc[0]:
            data.rename(columns={0: 'text'}, inplace=True)
        return data

    def _low_case(self, text):
        #text to lower case
        return text.str.lower()

    def _english(self, text):
        #delete english words
        return text.apply(lambda x : re.sub(r'[a-z]+', '', x))

    def _stopwords(self, text):
        #delete stopwords from russian nltk vocabulary
        return text.apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))

    def _numbers(self, text):
        #delete digits
        return text.apply(lambda x : re.sub(r'\d+', '', x))

    def _punc(self, text):
        #delete punctuation
        return [[i.translate(str.maketrans(dict.fromkeys(string.punctuation))) for i in j] for j in text]

    def _sentence_tok(self, data):
        # split texts into paragraphs and record paragraph and text indexes
        corp=[]
        text_indexes=[]
        indexes=[]
        text_idx=0
        for i in data.iloc[:,0]:
            j=sent_tokenize(i)
            sentences=[]
            idx=1
            text_idx+=1
            for k in j:
                if len(sentences)<3:
                    sentences.append(k)
                else:
                    corp.append(str(sentences).strip('[]'))
                    sentences=[]
                    indexes.append(idx)
                    text_indexes.append(text_idx)
                    idx+=1
        return pd.DataFrame({'text': corp, 'paragraph_index':indexes, 'text_index':text_indexes})

    def _tokenize(self, data):
        #text tokenizing
        data.dropna(inplace=True)
        return data.apply(word_tokenize)

    def _get_text(self, url, encoding='utf-8', to_lower=True):
        # fetch text (e.g. the stopwords-iso list on GitHub) from a URL or a local file
        url = str(url)
        if url.startswith('http'):
            r = requests.get(url)
            if not r.ok:
                r.raise_for_status()
            return r.text.lower() if to_lower else r.text
        elif os.path.exists(url):
            with open(url, encoding=encoding) as f:
                return f.read().lower() if to_lower else f.read()
        else:
            raise Exception('parameter [url] can be either URL or a filename')

    def _remove_stopwords(self, tokens, stopwords=None, min_length=4):
        #stopwords remover using stopwords-iso
        if not stopwords:
            return tokens
        stopwords = set(stopwords)
        tokens = [tok
                  for tok in tokens
                  if tok not in stopwords and len(tok) >= min_length]
        return tokens

    def _stopwords_remover(self, tokens):
        url_stopwords_ru = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ru/master/stopwords-ru.txt"
        stopwords_ru = self._get_text(url_stopwords_ru).splitlines()
        output=[]
        [output.append(self._remove_stopwords(x, stopwords=stopwords_ru)) for x in tokens]
        return output

text_preprocessing=TextPreprocessor()
test_df=text_preprocessing.fit_transform(test, name='test')

I suspect the error is in the _multi function itself, but I can't see where.

Tags: python

1 answer

Accepted answer by Tony Stark, 2020-12-24T22:43:54Z

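I figured it out myself: pandas conflicts with multiprocessing. To make this code work, you have to pass the data in as a list, or use a different multiprocessing implementation such as pandarallel. The question is closed.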
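One likely reading of the conflict: iterating over a DataFrame yields its column labels, not its rows, so `pool.map(target, data)` hands each worker a string such as `'text'` instead of a piece of the data. Below is a minimal sketch of the list-based approach, not the asker's actual pipeline. The names `clean_text` and `clean_in_parallel` are hypothetical, and the cleaning steps are a simplified stand-in for the lower-casing, Latin-word, digit and punctuation removal in the question.

```python
import multiprocessing as mp
import re
import string

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Hypothetical per-text step: lower-case, drop Latin words, digits, punctuation."""
    text = text.lower()
    text = re.sub(r"[a-z]+", "", text)
    text = re.sub(r"\d+", "", text)
    text = text.translate(_PUNCT_TABLE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def clean_in_parallel(texts, workers=None):
    """Map clean_text over a plain list of strings using a process pool."""
    if workers is None:
        workers = max(2, mp.cpu_count() - 1)
    with mp.Pool(processes=workers) as pool:
        return pool.map(clean_text, texts)


if __name__ == "__main__":
    # With a DataFrame, the list would come from something like df["text"].tolist().
    texts = ["Привет, World 123!", "Ещё один текст."]
    print(clean_in_parallel(texts))
```

The worker function is defined at module level so it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard keeps the pool from being re-created when a child process imports the module.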
