ParaCrawl @ParaCrawl

Joined December 2018

Tweets: 51
Followers: 244
Following: 7
Likes: 36

HPLT @hplt_eu

5 months ago

HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇

Nikolay Bogoychev @XapaJIaMnu

5 months ago

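[1/6] After about 14 months of hard work, together with multiple people we present you with OpusTrainer and OpusCleaner! OpusCleaner is your one stop data fetching/preprocessing/cleaning pipeline, complete with GUI and designed to implicitly visualise your data before ...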

2 12 45 4K 8

0 3 6 436 0

Interested in Open and Community-Driven MT initiatives? CrowdMT is for you! 🎙️Invited speakers from Wikimedia Foundation and Apertium announced. 📜Accepted papers and abstracts announced. Time to register at events.tuni.fi/eamt23/registr… Details: hplt-project.org/events

0 2 2 265 0

Prompsit @Prompsit

a year ago

#MT people: the submission deadline for the CrowdMT workshop, for presenting work on Open Source and Community-Driven MT, has been extended to 21st April 2023! Abstracts and papers wanted! You are wanted in Tampere too, for the whole #EAMT23 conference or at least for this workshop on the 15th of June!

0 4 7 2K 0

ParaCrawl @ParaCrawl

2 years ago

A new ParaCrawl parallel corpus is available! 🌍 languages: Polish-Czech 🎒 size: 24 million sentences 🗒️ license: CC0 🎯 location: paracrawl.eu bonus section 🧐 more info: paracrawl.eu/moredata

1 3 4 0 0

Prompsit @Prompsit

2 years ago

Indeed, this is the first data release of the #Macocu effort. You will find both monolingual and bilingual (with English) corpora on the ELRC-Share and CLARIN repositories and on the website. Insights coming soon! Most of the code is also ready for you to try out!

Clarin.si @ClarinSlovenia

2 years ago

0 7 22 0 1

0 3 7 0 0

ParaCrawl @ParaCrawl

2 years ago

Check out MultiParaCrawl 9, including 36 parallel corpora for Ukrainian and a total of 705 bitexts. Thanks to OPUS and @TiedemannJoerg for sharing this great resource! paracrawl.eu/news/item/18-m…

0 4 3 0 0

Barry Haddow @bazril

2 years ago

@anas_ant If you have an MT system, try bleualign (github.com/bitextor/bleua…) from @ParaCrawl . Scales to ParaCrawl-sized data.

2 1 5 0 2

ParaCrawl @ParaCrawl

2 years ago

We're back with more language resources: an English-Ukrainian parallel corpus with approx. 13M sentence pairs has been released. More info and downloads: paracrawl.eu/news/item/17-e… Please spread the word and use it!

0 17 24 0 0

ParaCrawl @ParaCrawl

3 years ago

Done! All #ParaCrawl v9 corpora are now available at paracrawl.eu, and some are also on Corset (corset.paracrawl.eu) for further inspection or filtering. A new Bitextor is also out: github.com/bitextor! Thanks to #CEF and the EU for co-funding this great project!

1 0 13 0 1

ParaCrawl @ParaCrawl

3 years ago

Summer was for work! Now #ParaCrawl v9 corpora are done and again bigger than the previous ones!🤩 Extrinsic evaluation through MT almost finished and, according to old BLEU and new COMET, the quality of the MT output improves! 🥳 We will share corpora and more results soon!🕑

1 4 21 0 0

ParaCrawl @ParaCrawl

3 years ago

Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, especially for closely related languages and for under-resourced ones. Such a basic thing! Trying here to improve current results by mixing fastText and Hunspell, take a look👇

marta / motagirl2 @motagirl2

3 years ago

0 1 4 0 0

0 1 6 0 0

ParaCrawl @ParaCrawl

3 years ago

A new version of ParaCrawl is being cooked! We are aiming at not only more bilingual but also monolingual data. And we are applying neural cleaning this time with bicleaner-ai (github.com/bitextor/bicle…). Stay tuned!🧐

0 2 11 0 3

ParaCrawl @ParaCrawl

3 years ago

Milestone reached! We just published Corset, a data selection portal to get relevant data from massive amounts of parallel data such as ParaCrawl corpora. Thanks #CEFTelecom! Users welcome! Test it here: corset.paracrawl.eu Code & docs here: github.com/paracrawl/cors…

0 2 12 0 1

ParaCrawl @ParaCrawl

3 years ago

We almost forgot to tell you: ParaCrawl 8 is out! First highlight: wow, the size of it! Check for yourself at paracrawl.eu/releases #ParaCrawl #crawling #parallelcorpus #CEFTelecom #MT

1 9 20 0 6

Leo P @pacofonix

3 years ago

Bitextor 8 is out! Many improvements and features that will make it into the next @ParaCrawl data release, including ones from Snakemake 6 by @johanneskoester. Check all the changes: github.com/bitextor/bitex…

0 1 3 0 1

ParaCrawl @ParaCrawl

3 years ago

Our corpora were evaluated as part of the great effort at "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" (🧐 arxiv.org/pdf/2103.12028…). We will keep working to deliver high-quality corpora from web-crawled content. ParaCrawl v8 is about to come!

Stella Biderman @BlancheMinerva

3 years ago

0 1 15 0 0

3 3 16 0 1

ParaCrawl @ParaCrawl

3 years ago

We are exploring neural document alignment and neural parallel data filtering in ParaCrawl. The massive amount of data to be processed might be an issue for neural approaches in our setting. Scale or fail...

0 2 15 0 1

ParaCrawl @ParaCrawl

3 years ago

Working on ParaCrawl V8, due 31st March 2021, which will include: ➕ parallel data, mainly from Internet Archive --> we are processing 1.1 petabytes of .warc files, about 41% of 77T of compressed text ➖ MACHINE TRANSLATION --> we are focusing on this issue. Stay tuned!

0 0 5 0 0

Marcin Junczys-Dowmun.. @marian_nmt

2K Followers 396 Following NLP. NMT. Main author of Marian NMT. Research Scientist at Microsoft Translator. Non-NLP silliness and stuff on @emjotde

Mikel Artetxe @artetxem

6K Followers 221 Following Co-founder @RekaAILabs and Honorary Researcher @IxaGroup (University of the Basque Country) | Past: Research Scientist @AIatMeta (FAIR)

Huda Khayrallah @HudaKhay

1K Followers 911 Following Machine Translation/#NLProc/ML Researcher at Microsoft. Past: @UCBerkeley CS ugrad; @LiltHQ research intern; @jhuCLSP/@jhuCompSci PhD

Rudy Loock 🌻 @RudyLoock

2K Followers 890 Following University Professor, Linguistics/Translation Studies @univ_lille @STL_ULille @CNRS. @Master_TSM, @affumt_fr vice-president, EMTNet Also @ https://t.co/r1HtIZjexO

🌨🌞 @w_uf9

93 Followers 553 Following

Luca Moroni @AndrewWyn1

19 Followers 88 Following PhD student @SapienzaNLP

Tianqi Zhang @tzhangzh12

416 Followers 905 Following English, Spanish, Catalan to Mandarin Chinese Conference Interpreter | Ph.D in Translation and Terminology

Adam @ka2

10 Followers 502 Following

Ak Esved @mathaehuman

2 Followers 28 Following

Tobias Domhan @tdomhan

564 Followers 873 Following Machine Learning Scientist at Amazon Berlin. let's try this mastodon thing: @tdomhan@sigmoid.social

kekeho @k3k3h0

3K Followers 4K Following Hiroki TAKEMURA / @KeioSFC B4 / {@WIDE_Project, RG} Delight / @IPAjp MITOU 2022 Super Creator / ex-NIT,NC / Icon: @wonderernmn / Interests: Decentralized System / Not involved in Web3

Waldemar Boszko @WaldemarBoszko

23 Followers 398 Following

J. C. Segovia @jcsegoviacastel

256 Followers 1K Following Integrated ICT Technology Manager, Alianza APTA; ICT consultant, father, husband and believer.

Micho @mihranmihran

124 Followers 747 Following My grandfather once told me that his dad spoke 11 languages - Thanks for starting a beef, grandpa! 🔥

Aiham @aihkas

243 Followers 5K Following I fix bugs for breakfast; Software Engineer. Chess player. I have zero respect for societies with a homelessness problem. Private account.

burcu ilkay karaman @aladut

401 Followers 4K Following Professor, Linguist, Translator, Professional Tour Guide, Forensic Expert, Profesör, Dilbilimci, Çeviribilimci, Profesyonel Tur Rehberi, Adli Bilirkişi

Ada Wan @adawan919

157 Followers 1K Following Transdisciplinarian (stats, datasci, ml, lang/socSci, tech, art, science, philosophy). (Use-inspired) fundamental research.Opinions my own. Accidental activist.

manzZzari @manzZzari

166 Followers 2K Following game + web dev, musician, artist. 99% retweet account: #gamedev / #illustration / #animation / #sounddesign / #languagelearninggames

Yeb Havinga @YebHavinga

101 Followers 505 Following Trying to keep up with capabilities of large language models. For Dutch GPT and T5, check out https://t.co/ZOBPojaoZI

Bingo @fubinfri

2 Followers 24 Following

stvhuang @_stvhuang

0 Followers 610 Following

Amulya Ratna Dash @amulya_r

31 Followers 671 Following Senior AI Engineer at @IQVIA_global, PhD scholar at @bitspilaniindia

Gözde Büklüm @meyyalhanim

174 Followers 995 Following PhD candidate. Proud @UniBogazici alumni. Research areas: historical TS, Turkish language and literature, machine translation, translation technologies.

Maurits van Wijland @mvwijland

129 Followers 317 Following proud dad of 2 extraordinary boys | pondering and constant learning as a techno-humanizer | into Tai Jitsu

Pinzhen "Patrick" Che.. @pinzhen_chen

77 Followers 224 Following Working on LLMs and MT @EdinburghNLP @InfAtEd

Antonio Valerio Micel.. @AVMiceliBarone

962 Followers 2K Following ML / NLP School of Informatics, The University of Edinburgh

Yiyi Hu @Elaineable11

122 Followers 645 Following PhD candidate in Translation Studies at Fudan University

Clarin.si @ClarinSlovenia

256 Followers 276 Following The Slovene national consortium of the European research infrastructure @CLARINERIC, providing language resources and technologies, expertise and knowledge.

BaohaoLiao @baohao_liao

145 Followers 262 Following PhD for NLP @UvA_Amsterdam. Previously study @RWTH @sjtu1896

Vincent Nguyen @vince62s

77 Followers 156 Following OpenNMT-py

渋谷系NLPer @enullper

1K Followers 2K Following Natural Language Processing / parallel corpus compilation / Speaks: 🇯🇵 🇧🇷 🇺🇸 🇪🇸

Matthew Kenney @baykenney

3K Followers 3K Following Co-Founder @aspect_labs | Previously: Senior ML Engineer at @Apple, Asst. Research Prof at @DukeU Duke Data Science | @penn_state '15 | @Cornell '11

Ian Bjelovar @IanBjelovar

41 Followers 698 Following

Raheel @raheel_qader

50 Followers 866 Following NLP Researcher, NLG, MT

Lilian Bordeau @BordeauLilian

114 Followers 5K Following

Diego Bartolome @diegobartolome

3K Followers 975 Following Generative Artificial Intelligence

José G. C. de Souza @accezz

502 Followers 1K Following PhD, Computer Science, NLP, Machine Translation and Machine Learning.

chitreddy sairam @chittiman

324 Followers 4K Following NLP Engineer working on Indic languages | Earlier taught Physics for IIT-JEE | Alumni @iitdelhi | Self-taught programmer | Fascinated by the field of Languages

Deyi Xiong @DeyiXiong

16 Followers 264 Following

Sushant Daga @sushant_daga

19 Followers 1K Following "Data always win in ML" - a hill I'll die on

Victor Sanh @SanhEstPasMoi

9K Followers 2K Following Dog sitter by day, Scientist at @huggingface 🤗 by night

Lils @langsupade

41 Followers 5K Following

Oliver Blake @O_b1ake

258 Followers 764 Following Project Officer @LIBEREurope | Opinions are my own.

Sneha Kudugunta @snehaark

2K Followers 747 Following addicted to tpus @GoogleDeepMind @uwcse | varying proportions of AI and mediocre jokes (not mutually exclusive) | she/her/hers

Jon Olds @JOFT_trans

180 Followers 874 Following Tofu-eating wokerato

LPX ☄️ @LPX_404

203 Followers 476 Following Currently: Governance @EvmosDAO --- Previously: @ShapeShift DAO, Matic (pre-Polygon), @ENSdomains, @HackerNoon, D3 Consortium, DAO Transparency Index

Giuseppe Deriard @gderiard

8 Followers 86 Following

Sunit Bhattacharya @official_sunit

153 Followers 969 Following PhD researcher at @ufal_cuni of @CharlesUniPRG. I dabble with multimodal translation in AI systems and a bit of neuro-cognitive-linguistics. He/Him.

B1ff Jones @b1ffjones

7 Followers 89 Following

Frank Facundo @telescientia

27 Followers 253 Following Data scientist, computer and electronics engineer. Télécom Paris - Sciences Po Paris

Mariano Rico @marianorico

581 Followers 300 Following Teacher & Researcher @oeg_upm, focused on Linked Data, Semantic Web and ML+NLP techs. Responsible for https://t.co/u3tXdIQnjr and creator of DylanQ & KeyQ.

Nitika Mathur @probablyNitika

177 Followers 172 Following

carmen 🌍🤖😷 @stedomedo

111 Followers 472 Following Dreamer. From Silicon Saxony to Silicon Valley to Isar Valley. Enjoys transitioning ML R&D concepts into production. ❤️ Languages, food, sports, 80's. She/Her

Yonas @yonasg_

121 Followers 953 Following @CarnegieMellon alum ML | NLP | Speech

Marcin Junczys-Dowmun.. @marian_nmt

2K Followers 396 Following NLP. NMT. Main author of Marian NMT. Research Scientist at Microsoft Translator. Non-NLP silliness and stuff on @emjotde

CINEA🇪🇺 @cinea_eu

36K Followers 2K Following @EU_Commission European Climate, Infrastructure & Environment Executive Agency #CINEA_EU #EUGreenDeal Data protection https://t.co/O3DcSWlGqL

Omniscien @omniscientech

1K Followers 4K Following Omniscien Technologies is a leading supplier globally of high-performance, high-quality Machine Translation (MT) and Language Processing technologies.

Barry Haddow @bazril

1K Followers 698 Following Researcher in Informatics at University of Edinburgh. Mainly working on machine translation.

Prompsit @Prompsit

625 Followers 433 Following We speak Natural Language Processing, Data Analysis and Artificial Intelligence, among many other languages!

Jed Sundwall @jedsundwall

3K Followers 5 Following Basic dad. @OurRadiantEarth @TechsOnTexts. “Normally, I’m against big things. I think the world’s going to be solved by millions of small things.” – Pete Seeger

TAUS @T21Century

7K Followers 576 Following We generate, collect & annotate language training #data for #AI and #MachineLearning and offer #NLP services and resources. Follow us for the latest data news!

Barry Haddow @bazril

2 months ago

Happy to announce our first HPLT model release!

HPLT @hplt_eu

2 months ago

First datasets, then models! Initial HPLT models (LLMs and MT) are out: hplt-project.org/models, some still running 🏃 We explain what we are doing in the deliverables section: hplt-project.org/deliverables Meanwhile, we keep cooking IA peta-data-bytes 🥘, enriching, dashboarding 📊

1 15 34 4K 7

2 0 24 1K 6

LTG Oslo @ltgoslo

3 months ago

It's snowing large language models this week in Norway! 1st, the 5th NLPL and @hplt_eu Winter School on LLMs is ongoing now in Skeikampen And 2nd, the LTG has released three fully open generative language models for Norwegian, based on Mistral and BLOOM architectures #NLProc

3 5 23 606 1

HPLT @hplt_eu

5 months ago

We just published version 1.2 of HPLT datasets. What's new? - we fixed a bug in monolingual dedup, please redownload! 🛠️ - we filtered out very ugly monolingual documents🤮 - we anonymised the bilingual datasets🕵️‍♀️ hplt-project.org/datasets/v1.2

0 4 12 2K 4

HPLT @hplt_eu

5 months ago

HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇

Nikolay Bogoychev @XapaJIaMnu

5 months ago

[1/6] After about 14 months of hard work, together with multiple people we present you with OpusTrainer and OpusCleaner! OpusCleaner is your one stop data fetching/preprocessing/cleaning pipeline, complete with GUI and designed to implicitly visualise your data before ...

2 12 45 4K 8

0 3 6 436 0

Clarin.si @ClarinSlovenia

11 months ago

We are excited to share with you that we now provide 4 more massive monolingual corpora for under-resourced languages: you can access Icelandic, Ukrainian, Catalan and Greek #MaCoCu web corpora for free from the CLARIN.SI repository 😃

1 21 41 5K 2

Taja Kuzman @TajaKuzman

12 months ago

We really enjoyed the @vardialworkshop at the EACL 2023. We presented two papers on #MaCoCu language variety tools: British-American English variety classifier, models for automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian, and more. More in🧵⬇️

0 2 12 2K 0

Taja Kuzman @TajaKuzman

12 months ago

I presented our new British-American English classifier and a genre classifier, and showed that they reveal big differences between the massive #MaCoCu parallel web corpora. All the datasets and classifiers are freely available! Find them in our paper: aclanthology.org/2023.vardial-1…

Taja Kuzman @TajaKuzman

12 months ago

0 2 12 2K 0

0 4 12 1K 0

Taja Kuzman @TajaKuzman

12 months ago

Visiting the @eaclmeeting #eacl2023 conference in Dubrovnik these days 🙂☀️ We will be presenting our work on South Slavic and English language variety identification from the #MaCoCu project at the #VarDial workshop. Hope to see you there on Friday!

0 2 23 729 0

HPLT @hplt_eu

a year ago

After a great 2nd physical meeting of the #hplt project in Oslo❄️ we have some highlights to share: 🤩 OPUS now includes ELRC and mtdata 🧐 1PB of new web data is ready to be processed 🦾 Ready to work on LUMI (AMD) 😻 LLM/MT models completed for fi/no and cs-uk @DataEcoEU

0 4 10 436 0

MultiTraiNMT @MultiNmt

2 years ago

Download for free this great book on Machine translation for everyone: lnkd.in/eP94jD4u

1 15 35 0 7

Prompsit @Prompsit

a year ago

HPLT (aka Hippolyta) is a space that combines petabytes of natural language data with large-scale model training. This Horizon Europe project gathers top MT and LM researchers, HPC centres and ourselves as partners. 3 years of ambitious goals ahead! hplt-project.org

0 1 2 0 0

Prompsit @Prompsit

2 years ago

What are we crawling about at macocu.eu? Very interesting work by @IjsTk presented in our bi-weekly meeting. Lots of promotional multilingual websites in the .tr, .bg and .sl domains. Lots of legal in the .mt one. Stay tuned for more results! #macocucorpora

0 2 4 0 0

Clarin.si @ClarinSlovenia

2 years ago

Massive AND high-quality corpora for Bulgarian, Croatian, Slovene, Macedonian, Icelandic, Maltese and Turkish, collected by the #MaCoCu project, are now available in our repository! Check them out and share the word: ➡️macocu.eu ➡️clarin.si/repository/xml…

0 7 22 0 1

Prompsit @Prompsit

2 years ago

Looking for parallel English-Ukrainian data? Then you need to check this 👇

ParaCrawl @ParaCrawl

2 years ago

0 17 24 0 0

0 3 5 0 1

Barry Haddow @bazril

2 years ago

@anas_ant If you have an MT system, try bleualign (github.com/bitextor/bleua…) from @ParaCrawl . Scales to ParaCrawl-sized data.

2 1 5 0 2

CINEA🇪🇺 @cinea_eu

5 years ago

Welcome to the #CEFTelecom #EuroPat project with a focus on #eTranslation in the patent domain. ➡️Know more: europa.eu/!hB97MK #ConnectingEurope

Prompsit @Prompsit

5 years ago

We will be "Unleashing European Patent Translations" for a couple of years from now as part of the #EuroPat project, an EU co-funded #CEFTelecom project by @inea_eu . Take a look! ec.europa.eu/inea/en/connec…

0 3 8 0 0

0 0 2 0 0

Tom Kocmi @KocmiTom

3 years ago

I'm excited to share our MT metric evaluation study on a whopping 4380 human-judged MT systems! It isn't a surprise that BLEU ranked suboptimal and pretrained methods rule. arxiv.org/abs/2107.10821

6 31 121 0 24

MultiTraiNMT @MultiNmt

3 years ago

Thank you @Prompsit for presenting the neural machine translation platform MutNMT, developed under the @MultiNmt project, to @tradumatica master's students and lecturers! [BTW, any master's programme interested in presentations, contact @Prompsit !]